AI workloads in cloud environments pose unique performance challenges, particularly in managing data pipelines and processing I/O operations. These workloads demand storage that delivers both low latency and high throughput with minimal overhead, especially in virtualized environments, which means virtual machines must be given high-performance block devices. At Xinnor, we’ve developed xiRAID Opus, a solution tailored to overcome these challenges. This blog post outlines how xiRAID Opus addresses these obstacles by delivering superior block device performance, seamless virtualization pass-through, and efficient integration with parallel file systems. This study was first presented at the SNIA SDC event in September 2024.
AI Workloads and Performance Challenges in the Cloud
One of the main performance issues in cloud environments, particularly for Software-Defined Storage (SDS), is handling the diverse workload profiles of AI applications. Different stages in an AI data pipeline can have widely varying demands. For example, some workloads require low latency and random small I/O operations, while others need high throughput for large sequential file transfers.
Traditional SDS systems are typically optimized for a single workload type, making it difficult for them to perform well across different tasks. This limitation is compounded by the performance losses introduced by virtualization, which often cause significant bottlenecks. Additionally, many SDS solutions lack high-performance shared volume support, further restricting their scalability and flexibility in distributed cloud environments.
At Xinnor, we believe that parallel file systems are needed to address the shared volume gap in cloud environments. These systems enable scalability and flexibility, allowing workloads to flow efficiently across AI data pipelines while minimizing performance degradation.
To tackle these performance challenges and provide a comprehensive solution, we focus on three key components:
- Block device (xiRAID Opus)
- Virtualization pass-through method
- Parallel file systems (e.g. Lustre and pNFS)
In this blog post, we explore the first two components in detail, covering how each is configured and tested to optimize performance as part of a comprehensive solution for high-performance cloud environments. The next blog post is dedicated to parallel file systems (read here).
Block Device: xiRAID Opus
At the core of our high-performance storage solution is xiRAID Opus, a block device engine optimized for cloud environments. Delivering high-performance block storage is crucial for virtualized AI workloads, and xiRAID Opus is specifically designed to add minimal latency and overhead when presenting block devices to VMs, which is key for handling the demanding I/O operations of AI data pipelines.
Whether dealing with small random I/O or large sequential data transfers, xiRAID Opus consistently maintains high performance across virtual environments, ensuring seamless data access for AI applications.
Key features of xiRAID Opus include:
- Creation of RAID-protected volumes
- Provisioning of volumes to VMs
- Two major optimizations for performance enhancement (a quick way to observe the first is sketched after this list):
  - Polling: significantly reduces latency by actively checking for I/O completions
  - Zero-copy: eliminates unnecessary data copying within the storage pipeline, boosting throughput
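The polling implementation inside xiRAID Opus is internal, but the latency effect of completion polling is easy to observe on any NVMe device. Below is a minimal sketch using fio's io_uring engine; the device path, runtimes, and the suggestion to enable NVMe poll queues are illustrative assumptions, not part of our test setup.

```bash
# Not xiRAID internals; a generic comparison of interrupt-driven vs
# polled I/O completions on an NVMe device (/dev/nvme0n1 is a placeholder).

# Interrupt-driven completions:
fio --name=irq --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=io_uring --direct=1 --iodepth=1 --runtime=30 --time_based

# Polled completions (requires poll queues, e.g. nvme.poll_queues=4):
fio --name=polled --filename=/dev/nvme0n1 --rw=randread --bs=4k \
    --ioengine=io_uring --direct=1 --iodepth=1 --hipri \
    --runtime=30 --time_based
```

Comparing the reported completion latencies of the two jobs typically shows the polled run shaving off the interrupt-handling overhead, which is the effect xiRAID Opus exploits at scale.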
xiRAID Opus also offers significant deployment flexibility. Whether deployed in a bare-metal setup, as a virtual appliance, or on a DPU, xiRAID Opus maintains its performance characteristics across different infrastructure configurations.
Virtualization Pass-Through Method
To ensure optimal performance in virtualized environments, it is critical to provide virtual machines with high-performance block devices that add minimal latency and overhead. Virtualization pass-through is the most efficient way to deliver such devices, ensuring that the performance of AI workloads remains uncompromised in cloud environments. Multiple pass-through methods exist, each offering different benefits and performance trade-offs.
In our solution, we focus on three primary methods for delivering block devices in virtual environments:
- VIRTIO: This widely-used interface supports both single I/O threads and multiple I/O threads, allowing efficient block device delivery.
- vhost-user-blk: A local interface that passes block devices directly to virtual machines, operating entirely in user space. It ensures high performance by using a zero-copy approach, which reduces unnecessary data movement. At Xinnor, we've developed multithreading support for vhost-user-blk, a feature unique to our implementation that significantly boosts performance, especially when handling multiple concurrent workloads (a configuration sketch follows this list).
- VDUSE: A technology that allows the creation of Virtio devices in user space, presenting them to virtual machines through the vDPA mechanism. This method enables block devices to operate with high performance and low latency, leveraging data path acceleration benefits. VDUSE also simplifies the development process by eliminating the need to modify or load modules into the Linux kernel.
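Since xiRAID Opus reuses SPDK best practices (see below), a generic SPDK vhost target illustrates the mechanics of vhost-user-blk well. This is a sketch of the general technique, not the xiRAID Opus management interface; the controller name, socket path, and Malloc0 backing device are assumptions.

```bash
# On the host: expose a block device as a vhost-user-blk controller.
# Generic SPDK RPC shown for illustration; Malloc0 stands in for a
# RAID-protected volume, and vhost.0 is an assumed socket name.
scripts/rpc.py vhost_create_blk_controller vhost.0 Malloc0

# Launch the VM with file-backed shared memory (required so the
# user-space vhost target can access guest RAM for zero-copy I/O)
# and attach the controller as a vhost-user-blk-pci device.
qemu-system-x86_64 \
  -m 32G \
  -object memory-backend-file,id=mem0,size=32G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=vhost0,path=/var/tmp/vhost.0 \
  -device vhost-user-blk-pci,chardev=vhost0,num-queues=4 \
  ...   # remaining machine options omitted
```

The share=on memory backend is what makes the zero-copy path possible: the vhost target maps guest memory directly instead of bouncing data through the kernel.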
While other methods like ublk exist, their limited support in virtual environments means that we do not focus on them in our solution. Instead, we prioritize the methods that provide the highest performance and scalability for cloud-based AI workloads.
Comparing the methods
To better understand the performance benefits of xiRAID Opus, we compared it to MDRAID using a set of RAID configurations in a controlled testing environment. The test environment included:
- xiRAID Opus RAID 5 (23+1 drives) vs MDRAID RAID 0 (24 drives)
- Single Virtual Machine: 32 VCPUs, 32GB of RAM
- Operating System: Rocky Linux 9 with kernel-lt (6.10)
- Test Tools: FIO v3.36
- Workloads: 4k random reads, AIO (asynchronous I/O), direct I/O, full stripe writes
This configuration allowed us to evaluate the efficiency and performance of xiRAID Opus in random read and sequential write operations, comparing it directly with the MDRAID setup. The fio invocation below illustrates the shape of the random read workload.
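For reference, here is a hedged reconstruction of the 4k random read job. The device path is a placeholder for the passed-through volume, and the job and queue-depth values shown are one point on the scaling curve rather than the full test matrix.

```bash
# Approximation of the 4k random read workload described above.
# /dev/vda is a placeholder device; numjobs and iodepth were swept
# during testing, so these values are a single sample point.
fio --name=randread-4k \
    --filename=/dev/vda \
    --rw=randread --bs=4k \
    --ioengine=libaio --direct=1 \
    --numjobs=32 --iodepth=32 \
    --runtime=60 --time_based \
    --group_reporting
```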
Passing shared block volume to 1 VM - Random read
At a lower number of jobs, xiRAID Opus performs slightly better than MDRAID, but the real advantage shows as the workload scales. At 32 jobs, xiRAID Opus outperforms VIRTIO by almost 4x and VDUSE by about 15x. The difference between vhost-user-blk and NVMe/RDMA is minimal. As jobs and I/O depths increase, MDRAID shows 3x higher latency, while xiRAID Opus maintains consistently low latency.
Passing shared block volume to 1 VM - Sequential write
In a workload of 1 job / 1 I/O depth, xiRAID Opus (vhost-user-blk) achieves an impressive write throughput of around 8 GB/s, outperforming other solutions by almost 2x.
As the workload intensifies to 8 jobs / 32 I/O depth, xiRAID Opus maintains strong performance, reaching approximately 70 GB/s. While other solutions begin to close the gap under higher concurrency, xiRAID Opus still scales further and has the capacity to manage more complex storage tasks efficiently.
Passing shared block volume: Methods Comparison
Network performance plays a crucial role in maximizing the efficiency of xiRAID Opus. Our tests show that vhost-user-blk performs almost on par with NVMe/RDMA, which is capable of saturating a 200 Gbps network (roughly 25 GB/s of payload). TCP, however, tends to hit bandwidth constraints before reaching that ceiling.
For example, in GPU clusters, random read workloads are commonly encountered when loading data, while sequential write workloads are prevalent when receiving data. By optimizing for both, xiRAID Opus delivers consistent high performance across these scenarios in virtualized AI environments.
Why We Chose vhost-user-blk
Based on these test results, we chose vhost-user-blk over the other pass-through methods. In our testing, xiRAID Opus with our implementation of vhost-user-blk was the only option that delivered a high-performance block device to VMs. Key reasons:
- Multithreading Interface for vhost-user-blk: We developed a unique multithreading interface for vhost-user-blk, allowing it to efficiently manage multiple I/O threads. This is critical for maintaining high performance in environments where multiple workloads are processed concurrently, such as AI data pipelines (a quick guest-side check follows this list).
- Reuse of SPDK Best Practices: We leveraged best practices from SPDK, a toolkit designed to optimize storage performance.
- Optimized CPU Utilization: xiRAID Opus is optimized to make the most of advanced CPU features available on both AMD and Intel processors.
- Performance Optimization for Single VMs: Our solution is fine-tuned to deliver optimal performance even when operating on a single virtual machine and a single block device. This makes xiRAID Opus highly efficient in smaller-scale environments, without compromising on performance.
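One simple way to confirm that a guest actually sees a multiqueue block interface is to inspect the blk-mq layout of the virtio-blk device from inside the VM. The device name vda is an assumption:

```bash
# Inside the guest: list the hardware queues of the virtio-blk device.
# One directory per queue; with num-queues=4 you would expect: 0 1 2 3
ls /sys/block/vda/mq/
```

Multiple queues allow the hypervisor-side I/O threads to service the device in parallel, which is where the multithreading interface pays off under concurrent workloads.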
Conclusion
In summary, for AI workloads in cloud environments, minimizing latency and maximizing throughput are paramount. xiRAID Opus, alongside an optimized pass-through method, effectively addresses these needs, allowing virtualized environments to manage demanding I/O operations without compromising performance. Among the evaluated pass-through options, vhost-user-blk emerged as the ideal choice due to its multithreading support, SPDK optimizations, and robust performance under high-concurrency workloads. By focusing on the needs of virtualized AI workloads, xiRAID Opus combined with vhost-user-blk provides a scalable, high-performance solution for cloud environments, ensuring low-latency, high-throughput access across various stages of the AI data pipeline.

Beyond the pass-through comparison, a well-configured parallel file system forms the third and final component of our comprehensive high-performance storage for AI workloads. We will delve into this in an upcoming blog post, detailing how it complements the chosen vhost-user-blk approach and further enhances storage performance in demanding AI environments.
You can find the original blog here