Introduction
In the fast-paced world of hyperconverged infrastructure (HCI), performance and efficiency aren’t just buzzwords – they’re essential. As organizations push the boundaries of what their IT infrastructure can deliver, selecting the most effective solution becomes a critical decision. In this context, StarWind Virtual SAN (VSAN) and Microsoft Storage Spaces Direct (S2D) are two software-defined storage products that offer distinct approaches to leveraging NVMe and RDMA for high-performance HCI storage.
This article is the second in a series exploring the performance of StarWind VSAN and Microsoft S2D in a 2-node Hyper-V cluster setup. In the first article, we compared these two solutions using NVMe-oF over TCP, exploring their performance, capacity efficiency, and practical application. If you missed it, you can catch up here. Now, we’re turning our attention to RDMA-based configurations to give you an even clearer picture of which solution might be your ideal fit.
In this article, we’ll evaluate how these solutions perform in a 2-node Hyper-V cluster across two key scenarios:
- StarWind VSAN NVMe over RDMA
  - Host Mirroring + MDRAID-5.
- Microsoft Storage Spaces Direct over RDMA
  - Mirror-accelerated parity, workload placed in the mirror tier.
  - Mirror-accelerated parity, workload placed in both tiers – mirror and parity.
By examining these configurations, we aim to provide insights into how each solution performs under varying workloads and how these performance characteristics translate into real-world benefits. In the sections that follow, we’ll walk you through our testbed setup, benchmarking methodology, and the results of our performance tests.
Solution diagram:
StarWind Virtual SAN NVMe over RDMA scenario:
The StarWind Virtual SAN (VSAN) setup was designed to leverage the full potential of NVMe drives and RDMA for high-performance storage. Here’s how it was configured:
- NVMe drives: Each Hyper-V node was equipped with 5x NVMe drives, which were directly passed through to the StarWind VSAN Controller Virtual Machine (CVM). This direct pass-through ensures that the drives can fully leverage the speed and performance benefits of NVMe technology.
- RDMA: To enable RDMA (Remote Direct Memory Access) and achieve ultra-low latency communication between the nodes, Mellanox NICs were used. These NICs were configured with SR-IOV (Single Root I/O Virtualization), allowing their Virtual Functions to be passed through to the StarWind VSAN CVM. This setup provides the necessary RDMA compatibility for high-speed data transfer.
- MDRAID-5 array creation: Inside the StarWind VSAN CVM, the 5x NVMe drives were assembled into an MDRAID-5 array. This RAID level balances performance, capacity, and redundancy: one drive’s worth of capacity goes to parity, and the array tolerates a single drive failure.
- High Availability (HA): On top of the MDRAID-5 array, we created two StarWind High Availability (HA) devices. These HA devices replicate data between the two nodes, ensuring continuous availability even in the event of a node failure.
- NVMe-oF connectivity: The StarWind HA devices were connected to the nodes using StarWind NVMe-oF Initiator. The NVMe initiator plays a key role in establishing the high-speed NVMe-oF connection across the RDMA network, which is critical for maintaining low-latency and high-throughput operations.
- Cluster Shared Volumes: Finally, Cluster Shared Volumes (CSVs) were created on top of the connected HA devices. These CSVs allow both nodes to access the same storage simultaneously, enabling efficient load balancing and resource utilization.
It’s worth noting that we used StarWind NVMe-oF Initiator because, currently, Microsoft does not offer a native NVMe-oF initiator. Microsoft has announced plans to release an NVMe initiator for Windows Server 2025, but it will support NVMe over TCP only, with no confirmation yet regarding RDMA support.
Microsoft Storage Spaces Direct over RDMA scenario – Mirror-accelerated parity:
For the S2D setup, we implemented a mirror-accelerated parity configuration, which offers an optimal balance between performance and capacity efficiency. This setup allows us to evaluate how well S2D handles different workloads, particularly in scenarios where the workload is either fully placed in the high-performance mirror tier or spread across both the mirror and parity tiers.
Here’s how we structured the solution:
- Storage tiers: We created two distinct storage tiers, each configured to optimize specific aspects of data handling:
  - NestedPerformance tier: Configured with the mirror resiliency setting, this tier uses SSDs and ensures high data redundancy by storing four copies of each piece of data. The command used to create this tier was:
New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedPerformance -ResiliencySettingName Mirror -MediaType SSD -NumberOfDataCopies 4
  - NestedCapacity tier: This tier focuses on capacity efficiency, using a parity resiliency setting. It stores two copies of each piece of data with one parity stripe, configured using the following command:
New-StorageTier -StoragePoolFriendlyName s2d-pool -FriendlyName NestedCapacity -ResiliencySettingName Parity -MediaType SSD -NumberOfDataCopies 2 -PhysicalDiskRedundancy 1 -NumberOfGroups 1 -FaultDomainAwareness StorageScaleUnit -ColumnIsolation PhysicalDisk -NumberOfColumns 4
- Volumes setup: Following Microsoft’s recommendations, two volumes were created across these tiers:
  - Volume01 and Volume02: Both volumes were configured with 20% of their data in the high-performance mirror tier and the remaining 80% in the capacity-focused parity tier. This setup allows us to observe how the system handles data as it moves between tiers, particularly when the mirror tier reaches its capacity limits. The commands used to create these volumes were:
New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume01 -StorageTierFriendlyNames NestedPerformance, NestedCapacity -StorageTierSizes 820GB, 3276GB
New-Volume -StoragePoolFriendlyName s2d-pool -FriendlyName Volume02 -StorageTierFriendlyNames NestedPerformance, NestedCapacity -StorageTierSizes 820GB, 3276GB
- ReFS data movement: The Resilient File System (ReFS) is configured to automatically move data between the tiers when the mirror tier reaches 85% capacity. This threshold was left at its default setting to simulate a typical production environment.
- Testing Scenarios:
  - Scenario 1: Workload in the mirror tier: Here, the entire workload was placed within the mirror tier, leveraging its high performance and redundancy.
  - Scenario 2: Workload spilling into the parity tier: In the second scenario, we explored the performance impact when the workload exceeds the mirror tier’s capacity, forcing ReFS to start moving data to the slower parity tier. We also simulated conditions where writes were directed straight to the parity tier, representing a worst-case scenario in terms of performance.
In real-world applications, performance would likely fall somewhere between these two scenarios, depending on the specific workload and how much data resides in each tier. This dual-tier approach provides valuable insights into how S2D manages different types of data and how it balances performance with capacity efficiency.
Capacity efficiency:
In evaluating the capacity efficiency of these configurations, it’s essential to understand how each solution optimizes storage use while balancing performance and resiliency.
- StarWind Virtual SAN
  Achieves a capacity efficiency of 40%: host mirroring keeps two copies of the data (50%), and the five-drive MDRAID-5 array on each node gives up one drive’s worth of capacity to parity (80%), so 0.5 × 0.8 = 40%.
- Microsoft S2D mirror-accelerated parity
  Delivers a capacity efficiency of 35.7% (20% mirror, 80% parity), though this can vary depending on the percentage of the volume allocated to the mirror tier. For more details on how to calculate capacity efficiency for mirror-accelerated parity, please refer to the provided link.
Microsoft also recommends keeping some storage capacity unallocated, about 20% of the total pool size, to enable “in-place” repairs if drives fail. In our case, this reserve amounts to 5.82 TB. It allows immediate, parallel repairs that start automatically, so data stays protected and the system remains resilient after a drive failure.
This reserve reduces usable capacity further, so factor it in when planning your storage solution.
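The capacity figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 25% mirror-tier efficiency follows from the four data copies described earlier, and the 40% parity-tier efficiency is an assumption (two copies of a single-parity layout over five drives per node) chosen because it reproduces the 35.7% quoted here. The reserve calculation assumes the pool consists of all ten 3.2 TB drives and that the 5.82 figure is expressed in binary units.

```python
# Capacity-efficiency sketch. All helper names are hypothetical, and the
# per-tier efficiencies are assumptions stated in the comments below.

def starwind_efficiency(drives_per_node: int = 5, mirror_copies: int = 2) -> float:
    """RAID-5 inside each node (one drive of parity), mirrored across nodes."""
    raid5 = (drives_per_node - 1) / drives_per_node
    return raid5 / mirror_copies


def mirror_accelerated_parity_efficiency(
    mirror_fraction: float,
    mirror_eff: float = 0.25,   # four data copies -> 1/4 usable
    parity_eff: float = 0.40,   # assumption: (4/5 single parity) / 2 copies
) -> float:
    """Blend the two tiers by the raw space each needs per usable unit."""
    parity_fraction = 1.0 - mirror_fraction
    raw_per_usable = mirror_fraction / mirror_eff + parity_fraction / parity_eff
    return 1.0 / raw_per_usable


def repair_reserve_tib(
    drive_count: int = 10,
    drive_tb: float = 3.2,
    reserve_fraction: float = 0.20,
) -> float:
    """Reserve size in TiB, assuming the pool spans every drive in the cluster."""
    raw_tib = drive_count * drive_tb * 1e12 / 2**40
    return raw_tib * reserve_fraction


if __name__ == "__main__":
    print(f"StarWind VSAN: {starwind_efficiency():.1%}")
    print(f"S2D 20/80 split: {mirror_accelerated_parity_efficiency(0.20):.1%}")
    print(f"Repair reserve: {repair_reserve_tib():.2f} TiB")
```

Changing `mirror_fraction` shows how quickly efficiency falls as more of the volume is assigned to the mirror tier.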
Testbed overview:
Our testbed setup is designed to push the limits of both StarWind VSAN and Microsoft S2D in a high-performance environment.
Hardware:
Server model | Supermicro SYS-220U-TNR |
---|---|
CPU | Intel(R) Xeon(R) Platinum 8352Y @2.2GHz |
Sockets | 2 |
Cores/Threads | 64/128 |
RAM | 256GB |
NICs | 2x Mellanox ConnectX®-6 EN 200GbE (MCX613106A-VDA) |
Storage | 5x NVMe Micron 7450 MAX: U.3 3.2TB |
Software:
Windows Server | Windows Server 2022 Datacenter 21H2 OS build 20348.2527 |
---|---|
StarWind VSAN | Version V8 (build 15469, CVM 20240530) (kernel – 5.15.0-113-generic) |
StarWind NVMe-oF Initiator | StarWind NVMe-oF Initiator.2.0.0.672(rev 674).Setup.486 |
StarWind VSAN CVM parameters:
CPU | 24 vCPU |
---|---|
RAM | 32GB |
NICs | 1x network adapter for management; 4x Mellanox ConnectX-6 Virtual Function network adapters (SR-IOV) |
Storage | MDRAID5 (5x NVMe Micron 7450 MAX: U.3 3.2TB) |
Testing methodology:
To accurately assess the performance of both StarWind VSAN and Microsoft S2D, we conducted a series of benchmarks using the FIO utility in client/server mode. Here’s a breakdown of the testing setup and methodology:
Virtual Machine Configuration:
- Total VMs: 20 (10 per host)
- VM Specs:
  - vCPUs: 4 per VM
  - RAM: 8GB per VM
  - Disks: 3x RAW virtual disks per VM, each connected to a separate SCSI controller
Virtual Disk Sizes:
- For Microsoft S2D (Mirror-accelerated parity):
  - Mirror-only: 10GB per virtual disk
  - Both tiers: 100GB per virtual disk
- For StarWind VSAN NVMe-oF: 100GB per virtual disk
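The two S2D disk sizes were chosen so that the data set either fits inside the mirror tier or overflows it. A rough check, assuming the 20 VMs are split evenly across the two volumes (an assumption, not something stated explicitly) and using the 820GB/3276GB tier sizes and the 85% ReFS threshold from the S2D setup:

```python
# Does the test data fit in the mirror tier? Helper names are hypothetical.
# Assumption: VMs (and their virtual disks) are split evenly across volumes.

MIRROR_TIER_GB = 820
PARITY_TIER_GB = 3276
REFS_ROTATION_THRESHOLD = 0.85  # ReFS starts moving data at 85% mirror fill
VOLUME_COUNT = 2
VM_COUNT = 20
DISKS_PER_VM = 3


def data_per_volume_gb(disk_size_gb: float) -> float:
    """Total virtual-disk data landing on each volume."""
    return VM_COUNT * DISKS_PER_VM * disk_size_gb / VOLUME_COUNT


def fits_in_mirror_tier(disk_size_gb: float) -> bool:
    """True if the data stays below the point where ReFS begins rotation."""
    limit_gb = MIRROR_TIER_GB * REFS_ROTATION_THRESHOLD
    return data_per_volume_gb(disk_size_gb) <= limit_gb


if __name__ == "__main__":
    mirror_share = MIRROR_TIER_GB / (MIRROR_TIER_GB + PARITY_TIER_GB)
    print(f"Mirror share of each volume: {mirror_share:.1%}")
    for size_gb in (10, 100):
        print(size_gb, "GB disks:", data_per_volume_gb(size_gb), "GB per volume,",
              "fits in mirror" if fits_in_mirror_tier(size_gb) else "spills to parity")
```

Under these assumptions, 10GB disks put about 300GB on each volume, well below the roughly 697GB rotation point, while 100GB disks put about 3,000GB on each volume and necessarily reach the parity tier.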
Preparation:
- Virtual disks were pre-filled with random data to simulate real-world usage conditions before running the tests.
Test Patterns: We evaluated the performance using the following I/O patterns:
- 4k random read
- 4k random read/write (70/30)
- 4k random write
- 64k random read
- 64k random write
- 1M read
- 1M write
Warm-Up Procedures:
- 4k random read/write (70/30) and 4k random write patterns: VM disks were warmed up using the 4k random write pattern for 4 hours.
- 64k random write pattern: VM disks were warmed up using the 64k random write pattern for 2 hours.
Test Execution:
- Each test was conducted three times, and the average result was used as the final performance metric.
- Duration:
  - Read tests: 600 seconds
  - Write tests: 1800 seconds
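The pattern list and durations above map naturally onto fio command-line options. The following sketch only builds argument lists; it does not reproduce the exact job files used in these tests. The target path, the use of `--direct=1`, and treating the mixed 70/30 pattern as a "write" test for duration purposes are all assumptions, and the client/server invocation is not shown.

```python
# Build fio argument lists for the test matrix. Illustrative only: the target
# path, --direct=1, and the runtime chosen for the mixed pattern are assumptions.

READ_RUNTIME_S = 600
WRITE_RUNTIME_S = 1800

# (job name, fio --rw mode, block size, extra options)
PATTERNS = [
    ("4k-randread", "randread", "4k", []),
    ("4k-randrw-70-30", "randrw", "4k", ["--rwmixread=70"]),
    ("4k-randwrite", "randwrite", "4k", []),
    ("64k-randread", "randread", "64k", []),
    ("64k-randwrite", "randwrite", "64k", []),
    ("1m-read", "read", "1M", []),
    ("1m-write", "write", "1M", []),
]


def runtime_for(rw: str) -> int:
    """Pure read patterns run 600 s; anything that writes runs 1800 s."""
    return READ_RUNTIME_S if rw in ("randread", "read") else WRITE_RUNTIME_S


def fio_args(name, rw, bs, extra, numjobs, iodepth, target):
    """Return the fio command as a list suitable for subprocess.run()."""
    return [
        "fio",
        f"--name={name}",
        f"--filename={target}",   # hypothetical test device or file
        f"--rw={rw}",
        f"--bs={bs}",
        f"--numjobs={numjobs}",
        f"--iodepth={iodepth}",
        "--direct=1",             # assumption: bypass the guest page cache
        "--time_based",
        f"--runtime={runtime_for(rw)}",
        "--group_reporting",
        *extra,
    ]


if __name__ == "__main__":
    for name, rw, bs, extra in PATTERNS:
        print(" ".join(fio_args(name, rw, bs, extra, 3, 16, "/path/to/test-disk")))
```

Each queue depth in the result tables would correspond to one such invocation with a different `iodepth` value.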
Microsoft S2D Specifics:
- Following Microsoft’s recommendations, the testing VMs were placed on the node that owns the volume. This setup minimizes network utilization by ensuring local data reads without using the network stack, thus reducing latency during write operations.
- Each VHDX file was placed in different subdirectories, which helps optimize ReFS performance by minimizing metadata operation size and allowing parallel execution, reducing overall application latency.
StarWind VSAN Specifics:
- VMs were evenly distributed across both hosts without being pinned to the node that owns the volume, which ensures a balanced load.
- Similar to the S2D setup, each VHDX file was placed in different subdirectories to optimize performance.
Benchmarking local NVMe performance:
Before diving into our performance verification, we took a moment to set the stage with vendor-claimed performance figures for the NVMe drives. Here is the image with vendor-claimed performance:
Using the FIO utility in client/server mode, we conducted a series of tests on a single Micron 7450 MAX U.3 3.2TB NVMe drive. The following results were observed:
1x NVMe Micron 7450 MAX: U.3 3.2TB:
Pattern | Numjobs | IOdepth | IOPs | MiB/s | Latency (ms) |
---|---|---|---|---|---|
4k random read | 6 | 32 | 997,000 | 3,894 | 0.192 |
4k random read/write 70/30 | 6 | 16 | 531,000 | 2,073 | 0.142 |
4k random write | 4 | 4 | 385,000 | 1,505 | 0.041 |
64k random read | 8 | 8 | 92,900 | 5,807 | 0.688 |
64k random write | 2 | 1 | 27,600 | 1,724 | 0.072 |
1M read | 1 | 8 | 6,663 | 6,663 | 1.200 |
1M write | 1 | 2 | 5,134 | 5,134 | 0.389 |
Our tests confirmed that the NVMe drive’s performance is fully in line with the vendor’s claims. This validation step is crucial for ensuring that our subsequent benchmarks are based on accurate and trustworthy hardware performance.
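The IOPS and MiB/s columns in the table above are two views of the same measurement, linked by the block size. A quick consistency check (the helper name is hypothetical, and small differences are expected because the published IOPS values are rounded):

```python
# Cross-check: throughput (MiB/s) = IOPS x block size (KiB) / 1024.

def mib_per_s(iops: float, block_kib: int) -> float:
    """Convert an IOPS figure to throughput for a given block size."""
    return iops * block_kib / 1024


# (pattern, block size in KiB, reported IOPS, reported MiB/s) from the table
SINGLE_DRIVE_ROWS = [
    ("4k random read", 4, 997_000, 3_894),
    ("4k random read/write 70/30", 4, 531_000, 2_073),
    ("4k random write", 4, 385_000, 1_505),
    ("64k random read", 64, 92_900, 5_807),
    ("64k random write", 64, 27_600, 1_724),
    ("1M read", 1024, 6_663, 6_663),
    ("1M write", 1024, 5_134, 5_134),
]

if __name__ == "__main__":
    for pattern, block_kib, iops, reported in SINGLE_DRIVE_ROWS:
        computed = mib_per_s(iops, block_kib)
        print(f"{pattern}: computed {computed:,.0f} MiB/s, reported {reported:,} MiB/s")
```

The same conversion applies to the cluster-level tables later in the article.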
Benchmark results in a table:
The benchmarking results are presented in tables that list IOPS, throughput (MiB/s), latency (ms), and CPU usage. An additional metric, “IOPS per 1% CPU usage,” shows how many IOPS each percentage point of CPU delivers; it is reported for the 4k random read, read/write, and write patterns. This parameter is calculated using the following formula:
IOPS per 1% CPU usage = IOPS / Node count / Node CPU usage
Where:
- IOPS represents the number of I/O operations per second for each pattern.
- Node count is 2 nodes in our case.
- Node CPU usage denotes the CPU usage of one node during the test.
By incorporating this additional metric, we aimed to provide deeper insights into how CPU usage correlates with IOPS, offering a more nuanced understanding of performance characteristics.
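As a worked example of the formula, the sketch below recomputes the metric for a few rows of the result tables that follow. The function name is hypothetical; the inputs are the published IOPS and node CPU usage values.

```python
# IOPS per 1% CPU usage = IOPS / node count / node CPU usage (in percent).

def iops_per_cpu_percent(iops: float, node_cpu_percent: float, node_count: int = 2) -> float:
    """Cluster-wide IOPS attributed to each node, per percentage point of CPU."""
    return iops / node_count / node_cpu_percent


if __name__ == "__main__":
    # StarWind VSAN, 4k random read, IOdepth 4: 893,000 IOPS at 44% CPU
    print(round(iops_per_cpu_percent(893_000, 44)))
    # S2D mirror tier, 4k random read, IOdepth 64: 2,306,000 IOPS at 54% CPU
    print(round(iops_per_cpu_percent(2_306_000, 54)))
```

Because the metric divides by CPU usage, a configuration can score well either by delivering more IOPS or by simply being idle at low queue depths, so it is best read alongside the absolute IOPS column.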
Now let’s delve into the detailed benchmark results for each storage configuration.
StarWind VSAN NVMe over RDMA scenario
The table provides a detailed breakdown of StarWind VSAN’s performance under the Hyper-V NVMe over RDMA scenario, focusing on various workload patterns and configurations.
For 4k random reads, the IOPS ranges from 893,000 at lower queue depths to 1,624,000 at higher depths.
In mixed 4k random read/write (70%/30%) scenarios, the solution delivers up to 856,000 IOPS, maintaining strong performance even under mixed workloads.
For larger workloads, such as the 64k random read pattern, StarWind VSAN achieves up to 19,062 MiB/s while maintaining consistent latency and CPU utilization. In write-heavy scenarios like the 1024k write pattern, the throughput peaks at 4,479 MiB/s, with latency increasing as queue depth rises, yet the CPU usage remains stable between 16% and 19%.
VM count | Pattern | Numjobs | IOdepth | IOPs | MiB/s | Latency (ms) | Node CPU usage % | IOPs per 1% CPU usage |
---|---|---|---|---|---|---|---|---|
20 | 4k random read | 3 | 4 | 893,000 | 3,488 | 0.267 | 44.00% | 10,148 |
20 | 4k random read | 3 | 8 | 1,092,000 | 4,266 | 0.438 | 45.00% | 12,133 |
20 | 4k random read | 3 | 16 | 1,399,000 | 5,465 | 0.683 | 50.00% | 13,990 |
20 | 4k random read | 3 | 32 | 1,624,000 | 6,344 | 1.172 | 53.00% | 15,321 |
20 | 4k random read | 3 | 64 | 1,558,000 | 6,086 | 2.461 | 53.00% | 14,698 |
20 | 4k random read | 3 | 128 | 1,551,000 | 6,059 | 4.967 | 52.00% | 14,913 |
20 | 4k random read/write (70%/30%) | 3 | 2 | 396,000 | 1,547 | 0.355 | 32.00% | 6,188 |
20 | 4k random read/write (70%/30%) | 3 | 4 | 596,000 | 2,328 | 0.487 | 41.00% | 7,268 |
20 | 4k random read/write (70%/30%) | 3 | 8 | 756,000 | 2,953 | 0.785 | 47.00% | 8,043 |
20 | 4k random read/write (70%/30%) | 3 | 16 | 856,000 | 3,344 | 1.346 | 48.00% | 8,917 |
20 | 4k random read/write (70%/30%) | 3 | 32 | 854,000 | 3,336 | 2.656 | 47.00% | 9,085 |
20 | 4k random read/write (70%/30%) | 3 | 64 | 736,000 | 2,875 | 6.001 | 41.00% | 8,976 |
20 | 4k random write | 3 | 2 | 201,000 | 785 | 0.595 | 25.00% | 4,020 |
20 | 4k random write | 3 | 4 | 288,000 | 1,126 | 0.826 | 31.00% | 4,645 |
20 | 4k random write | 3 | 8 | 341,000 | 1,332 | 1.406 | 34.00% | 5,015 |
20 | 4k random write | 3 | 16 | 330,000 | 1,290 | 2.906 | 32.00% | 5,156 |
20 | 4k random write | 3 | 32 | 196,000 | 766 | 9.818 | 21.00% | 4,667 |
20 | 64k random read | 3 | 2 | 243,000 | 15,187 | 0.493 | 25.00% | |
20 | 64k random read | 3 | 4 | 280,000 | 17,500 | 0.856 | 26.00% | |
20 | 64k random read | 3 | 8 | 297,000 | 18,562 | 1.613 | 27.00% | |
20 | 64k random read | 3 | 16 | 302,000 | 18,875 | 3.182 | 28.00% | |
20 | 64k random read | 3 | 32 | 305,000 | 19,062 | 6.292 | 28.00% | |
20 | 64k random write | 3 | 1 | 42,200 | 2,638 | 1.420 | 17.00% | |
20 | 64k random write | 3 | 2 | 48,800 | 3,050 | 2.459 | 18.00% | |
20 | 64k random write | 3 | 4 | 52,900 | 3,306 | 4.532 | 18.00% | |
20 | 64k random write | 3 | 8 | 57,800 | 3,613 | 8.312 | 19.00% | |
20 | 64k random write | 3 | 16 | 62,300 | 3,894 | 15.389 | 19.00% | |
20 | 64k random write | 3 | 32 | 67,100 | 4,194 | 28.611 | 21.00% | |
20 | 1024k read | 1 | 1 | 13,800 | 13,800 | 1.451 | 15.00% | |
20 | 1024k read | 1 | 2 | 16,200 | 16,200 | 2.433 | 16.00% | |
20 | 1024k read | 1 | 4 | 17,600 | 17,600 | 4.551 | 17.00% | |
20 | 1024k read | 1 | 8 | 18,300 | 18,300 | 8.759 | 18.00% | |
20 | 1024k read | 1 | 16 | 18,900 | 18,900 | 16.976 | 18.00% | |
20 | 1024k write | 1 | 1 | 3,703 | 3,703 | 5.399 | 16.00% | |
20 | 1024k write | 1 | 2 | 3,744 | 3,744 | 10.636 | 17.00% | |
20 | 1024k write | 1 | 4 | 3,853 | 3,853 | 20.747 | 18.00% | |
20 | 1024k write | 1 | 8 | 4,479 | 4,479 | 35.707 | 19.00% | |
Overall, StarWind VSAN shows great performance at 4k random read/write patterns, consistent read and write performance regardless of VM location, and good capacity efficiency at 40%.
Microsoft Storage Spaces Direct over RDMA scenario (Mirror tier only)
The next table presents S2D’s performance with a mirror-accelerated parity configuration, focusing on workloads in the mirror tier.
For 4k random read patterns, IOPS ranges from a low of 782,000 at queue depth 8 to 2,615,000 at queue depth 128, with corresponding latencies between 0.278 ms and 2.921 ms.
In the 4k random read/write (70%/30%) scenarios, IOPS ranges from 58,200 to 941,000, with latency fluctuating from 0.305 ms to 8.247 ms as queue depth increases. The node CPU usage varies from 3% to 52%, reflecting how the system manages mixed workloads.
For larger data patterns like the 64k random read and 1024k write, S2D demonstrates robust throughput, reaching up to 53,187 MiB/s in the 64k random read pattern and 10,500 MiB/s in the 1024k write pattern. Latency remains relatively low at the lower queue depths but increases significantly as the queue depth rises. CPU utilization across the 64k and 1024k patterns ranges from 5% to 38%, showing the system’s ability to handle high-throughput tasks efficiently.
VM count | Pattern | Numjobs | IOdepth | IOPs | MiB/s | Latency (ms) | Node CPU usage % | IOPs per 1% CPU usage |
---|---|---|---|---|---|---|---|---|
20 | 4k random read | 3 | 4 | 858,000 | 3,352 | 0.278 | 28.00% | 15,321 |
20 | 4k random read | 3 | 8 | 782,000 | 3,055 | 0.620 | 21.00% | 18,619 |
20 | 4k random read | 3 | 16 | 1,079,000 | 4,216 | 0.888 | 29.00% | 18,603 |
20 | 4k random read | 3 | 32 | 1,615,000 | 6,308 | 1.189 | 41.00% | 19,695 |
20 | 4k random read | 3 | 64 | 2,306,000 | 9,008 | 1.663 | 54.00% | 21,352 |
20 | 4k random read | 3 | 128 | 2,615,000 | 10,215 | 2.921 | 67.00% | 19,515 |
20 | 4k random read/write (70%/30%) | 3 | 2 | 410,000 | 1,602 | 0.305 | 29.00% | 7,069 |
20 | 4k random read/write (70%/30%) | 3 | 4 | 113,400 | 443 | 2.112 | 7.00% | 8,100 |
20 | 4k random read/write (70%/30%) | 3 | 8 | 58,200 | 227 | 8.247 | 3.00% | 9,700 |
20 | 4k random read/write (70%/30%) | 3 | 16 | 667,000 | 2,605 | 1.607 | 38.00% | 8,776 |
20 | 4k random read/write (70%/30%) | 3 | 32 | 908,000 | 3,547 | 2.791 | 48.00% | 9,458 |
20 | 4k random read/write (70%/30%) | 3 | 64 | 941,000 | 3,676 | 6.017 | 52.00% | 9,048 |
20 | 4k random write | 3 | 2 | 102,000 | 398 | 1.171 | 13.00% | 3,923 |
20 | 4k random write | 3 | 4 | 50,100 | 196 | 4.794 | 7.00% | 3,579 |
20 | 4k random write | 3 | 8 | 34,300 | 134 | 13.994 | 5.00% | 3,430 |
20 | 4k random write | 3 | 16 | 66,100 | 258 | 14.504 | 8.00% | 4,131 |
20 | 4k random write | 3 | 32 | 294,000 | 1,149 | 6.527 | 34.00% | 4,324 |
20 | 64k random read | 3 | 2 | 319,000 | 19,938 | 0.374 | 17.00% | |
20 | 64k random read | 3 | 4 | 504,000 | 31,500 | 0.475 | 26.00% | |
20 | 64k random read | 3 | 8 | 439,000 | 27,438 | 1.081 | 22.00% | |
20 | 64k random read | 3 | 16 | 611,000 | 38,187 | 1.572 | 27.00% | |
20 | 64k random read | 3 | 32 | 851,000 | 53,187 | 2.252 | 38.00% | |
20 | 64k random write | 3 | 1 | 120,000 | 7,475 | 0.500 | 19.00% | |
20 | 64k random write | 3 | 2 | 130,000 | 8,153 | 0.919 | 20.00% | |
20 | 64k random write | 3 | 4 | 51,150 | 3,197 | 4.696 | 7.00% | |
20 | 64k random write | 3 | 8 | 38,700 | 2,419 | 12.334 | 6.00% | |
20 | 64k random write | 3 | 16 | 46,500 | 2,906 | 20.895 | 6.00% | |
20 | 64k random write | 3 | 32 | 161,000 | 10,063 | 11.905 | 26.00% | |
20 | 1024k read | 1 | 1 | 19,900 | 19,900 | 1.004 | 5.00% | |
20 | 1024k read | 1 | 2 | 31,800 | 31,800 | 1.257 | 7.00% | |
20 | 1024k read | 1 | 4 | 44,000 | 44,000 | 1.815 | 11.00% | |
20 | 1024k read | 1 | 8 | 50,300 | 50,300 | 3.176 | 14.00% | |
20 | 1024k read | 1 | 16 | 52,300 | 52,300 | 6.114 | 16.00% | |
20 | 1024k write | 1 | 1 | 9,887 | 9,887 | 2.022 | 8.00% | |
20 | 1024k write | 1 | 2 | 10,150 | 10,150 | 3.912 | 8.00% | |
20 | 1024k write | 1 | 4 | 10,200 | 10,200 | 7.841 | 9.00% | |
20 | 1024k write | 1 | 8 | 10,500 | 10,500 | 15.250 | 10.00% | |
Microsoft Storage Spaces Direct over RDMA scenario (Mirror + Parity tiers)
The performance metrics for the dual-tier configuration in S2D highlight workload management across both mirror and parity tiers.
In 4k random read patterns, IOPS ranges from a low of 774,000 at queue depth 8 to 2,450,000 at queue depth 128, with latencies increasing from 0.297 ms to 3.133 ms as queue depth rises. Node CPU usage scales from 26% to 68%, with IOPS per 1% CPU usage showing efficient resource utilization, peaking at 19,773.
For the 4k random read/write (70%/30%) pattern, IOPS spans from 102,600 to 298,700, and latency escalates from 1.035 ms to 20.281 ms as queue depths increase. Node CPU usage varies between 20% and 50%, highlighting the system’s ability to manage mixed workloads, although the efficiency, measured by IOPS per 1% CPU usage, peaks at a more modest 3,075.
For the larger block sizes, read throughput is substantial, reaching up to 48,500 MiB/s in the 64k random read pattern and 49,600 MiB/s in the 1024k read pattern. Write performance declines significantly: the 1024k write pattern peaks at 2,341 MiB/s, with latency climbing to 68.424 ms at queue depth 8, and 64k random writes stay below 1,000 MiB/s. Despite the high node CPU efficiency in read scenarios, write performance shows noticeable degradation once the workload spans both tiers.
VM count | Pattern | Numjobs | IOdepth | IOPs | MiB/s | Latency (ms) | Node CPU usage % | IOPs per 1% CPU usage |
---|---|---|---|---|---|---|---|---|
20 | 4k random read | 3 | 4 | 803,000 | 3,137 | 0.297 | 27.00% | 14,870 |
20 | 4k random read | 3 | 8 | 774,000 | 3,023 | 0.620 | 26.00% | 14,885 |
20 | 4k random read | 3 | 16 | 977,000 | 3,816 | 0.982 | 29.00% | 16,845 |
20 | 4k random read | 3 | 32 | 1,531,000 | 5,980 | 1.252 | 42.00% | 18,226 |
20 | 4k random read | 3 | 64 | 2,175,000 | 8,496 | 1.764 | 55.00% | 19,773 |
20 | 4k random read | 3 | 128 | 2,450,000 | 9,570 | 3.133 | 68.00% | 18,015 |
20 | 4k random read/write (70%/30%) | 3 | 2 | 152,700 | 598 | 1.035 | 32.00% | 2,386 |
20 | 4k random read/write (70%/30%) | 3 | 4 | 157,200 | 614 | 1.924 | 32.00% | 2,456 |
20 | 4k random read/write (70%/30%) | 3 | 8 | 102,600 | 400 | 4.926 | 20.00% | 2,565 |
20 | 4k random read/write (70%/30%) | 3 | 16 | 260,200 | 1,016 | 4.759 | 45.00% | 2,891 |
20 | 4k random read/write (70%/30%) | 3 | 32 | 298,700 | 1,167 | 9.019 | 50.00% | 2,987 |
20 | 4k random read/write (70%/30%) | 3 | 64 | 282,900 | 1,105 | 20.281 | 46.00% | 3,075 |
20 | 4k random write | 3 | 2 | 57,500 | 225 | 2.085 | 29.00% | 991 |
20 | 4k random write | 3 | 4 | 70,600 | 276 | 3.398 | 33.00% | 1,070 |
20 | 4k random write | 3 | 8 | 83,300 | 326 | 5.761 | 37.00% | 1,126 |
20 | 4k random write | 3 | 16 | 89,000 | 348 | 10.774 | 41.00% | 1,085 |
20 | 4k random write | 3 | 32 | 86,800 | 339 | 22.360 | 39.00% | 1,113 |
20 | 64k random read | 3 | 2 | 312,000 | 19,500 | 0.383 | 18.00% | |
20 | 64k random read | 3 | 4 | 470,000 | 29,375 | 0.510 | 26.00% | |
20 | 64k random read | 3 | 8 | 386,000 | 24,125 | 1.259 | 22.00% | |
20 | 64k random read | 3 | 16 | 555,600 | 34,725 | 1.728 | 27.00% | |
20 | 64k random read | 3 | 32 | 776,000 | 48,500 | 2.474 | 38.00% | |
20 | 64k random write | 3 | 1 | 14,100 | 881 | 4.258 | 13.00% | |
20 | 64k random write | 3 | 2 | 13,700 | 856 | 8.771 | 14.00% | |
20 | 64k random write | 3 | 4 | 14,300 | 894 | 16.719 | 14.00% | |
20 | 64k random write | 3 | 8 | 15,400 | 962 | 31.095 | 16.00% | |
20 | 64k random write | 3 | 16 | 14,800 | 925 | 64.890 | 19.00% | |
20 | 64k random write | 3 | 32 | 14,800 | 925 | 129.896 | 18.00% | |
20 | 1024k read | 1 | 1 | 19,700 | 19,700 | 1.015 | 5.00% | |
20 | 1024k read | 1 | 2 | 31,000 | 31,000 | 1.256 | 8.00% | |
20 | 1024k read | 1 | 4 | 41,800 | 41,800 | 1.914 | 11.00% | |
20 | 1024k read | 1 | 8 | 47,600 | 47,600 | 3.358 | 13.00% | |
20 | 1024k read | 1 | 16 | 49,600 | 49,600 | 6.452 | 16.00% | |
20 | 1024k write | 1 | 1 | 1,904 | 1,904 | 10.707 | 4.00% | |
20 | 1024k write | 1 | 2 | 1,810 | 1,810 | 22.290 | 5.00% | |
20 | 1024k write | 1 | 4 | 1,981 | 1,981 | 40.353 | 5.00% | |
20 | 1024k write | 1 | 8 | 2,341 | 2,341 | 68.424 | 5.00% | |
Overall, S2D shows exceptional read performance in both test cases. However, write performance drops sharply once the workload spills into the parity tier, and the storage capacity efficiency is about 35.7%, which could be even lower if additional space is reserved for in-place repairs.
Benchmarking results in graphs:
With all benchmarks completed and data collected, we can now compare the results using graphical charts for a clearer understanding.
4k random read:
Let’s start with the 4K random read test, where Figure 1 demonstrates the performance in IOPS.
StarWind VSAN NVMe over RDMA starts off strong, delivering 893,000 IOPS at a 4-depth queue and climbing to an impressive 1,624,000 IOPS at a 32-depth queue, and then slightly declining.
Microsoft Storage Spaces Direct (S2D) in both configurations (“mirror-only” and “mirror + parity”) showed significant variability. The “mirror-only” setup achieved a peak of 2,615,000 IOPS at a 128-depth queue, while “mirror + parity” peaked slightly lower at 2,450,000 IOPS. StarWind’s peak performance at 32-depth was about 62% of S2D “mirror-only” and 66% of S2D both tiers at their respective peaks.
This significant variability in S2D’s performance can be traced back to its sophisticated use of Cluster Shared Volumes (CSV). The CSV architecture enables multiple hosts to share access to the same disk, effectively coordinating read and write operations through the SMB 3.0 multichannel protocol. This approach is what gives S2D its impressive peak performance, especially in scenarios where the VM runs on the node that owns the volume. In this case, it can read data directly from the local disk, bypassing the network stack. This local read path minimizes latency and maximizes performance, leading to impressive IOPS numbers (if you want to explore this topic in more detail, please read here or check this article).
However, the very nature of CSV that boosts performance also introduces complexity. S2D’s architecture demands careful monitoring to ensure that VMs are optimally placed, as any deviation can lead to performance dips.
Latency is a critical factor, and in Figure 2 we analyze latency metrics for the 4K random read test.
We can see that latency increased with queue depth across all configurations. StarWind began with a low latency of 0.267 ms, rising to 4.967 ms at maximum queue depth.
The S2D “mirror-only” configuration had a low starting latency at 0.278 ms but escalated to 2.921 ms at 128 depth. The both-tiers setup followed a similar trend, starting at 0.297 ms and ending at 3.133 ms. At maximum queue depth, StarWind’s latency was approximately 70% higher than S2D “mirror-only” and about 59% higher than “mirror + parity”.
The latency advantage of S2D is again attributed to local reads. While S2D enjoys lower latency, StarWind VSAN’s performance remains unaffected by VM location, offering simplicity at the cost of slightly higher latency.
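The link between queue depth, IOPS, and latency in these charts follows Little's law: average latency is roughly the number of outstanding I/Os divided by IOPS. The sketch below assumes the total number of outstanding I/Os is VM count × numjobs × IOdepth, which is an interpretation of the test setup rather than something stated in the article; it reproduces the published latencies closely.

```python
# Little's law sketch: latency = outstanding I/Os / IOPS.
# Assumption: outstanding I/Os = VM count x numjobs x iodepth.

def estimated_latency_ms(iops: float, iodepth: int, numjobs: int = 3, vm_count: int = 20) -> float:
    """Average latency implied by the IOPS achieved at a given queue depth."""
    outstanding = vm_count * numjobs * iodepth
    return outstanding / iops * 1000


if __name__ == "__main__":
    # (label, reported IOPS, iodepth, numjobs, reported latency in ms)
    samples = [
        ("StarWind 4k random read, QD4", 893_000, 4, 3, 0.267),
        ("StarWind 4k random read, QD128", 1_551_000, 128, 3, 4.967),
        ("S2D mirror 4k random read, QD128", 2_615_000, 128, 3, 2.921),
        ("StarWind 1024k read, QD1", 13_800, 1, 1, 1.451),
    ]
    for label, iops, depth, jobs, reported in samples:
        estimate = estimated_latency_ms(iops, depth, jobs)
        print(f"{label}: estimated {estimate:.3f} ms, reported {reported} ms")
```

This is why latency keeps growing once IOPS plateaus: past the saturation point, extra queue depth only adds waiting time.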
Figure 3 showcases the results of the 4K random read test with numjobs=3, measuring IOPS per 1% CPU usage.
StarWind demonstrated a steady increase in IOPS per 1% CPU usage, peaking at 15,321 IOPS at 32-depth before a slight drop.
S2D “mirror-only” showed the highest efficiency, reaching 21,352 IOPS per 1% CPU at 64-depth. Both tiers configuration had a similar peak efficiency of 19,773 IOPS per 1% CPU at the same depth. StarWind’s efficiency was around 72% of S2D “mirror-only” and 77% of “mirror + parity” at their most efficient points.
4k random read/write 70/30:
In virtualized environments, the mixed 4K random read/write workload serves as the backbone of daily operations. The ability to maintain high performance with mixed I/O across varied queue depths is critical. Figure 4 shows IOPS for the 4K random read/write (70%/30%) pattern.
Interestingly, with Storage Spaces Direct, there’s a noticeable drop in performance at queue depths 4 and 8. This performance drop is not observed in StarWind VSAN tests. StarWind maintains consistent performance, hitting 596,000 IOPS at queue depth 4 and 756,000 IOPS at queue depth 8.
StarWind holds its ground well and demonstrates impressive stability, achieving 856,000 IOPS at a 16-depth queue before experiencing a slight dip. In comparison, the S2D “mirror-only” configuration reached a higher peak of 941,000 IOPS at a 64-depth queue, while “mirror + parity” setup lagged behind with a peak of 298,700 IOPS.
StarWind’s peak performance was about 91% of the S2D “mirror-only” configuration, but it significantly outshined “mirror + parity” setup, delivering nearly three times the IOPS.
The main reason for the lower performance in the S2D “mirror + parity” scenario is the overhead of ReFS, which has to move new data from the mirror to the parity tier, leading to performance degradation. As a result, S2D records 152,700 IOPS at queue depth 2, drops to a low of 102,600 IOPS at QD=8, and then peaks at 298,700 IOPS at queue depth 32. In contrast, StarWind’s more consistent performance makes it a strong contender, especially in virtualization environments where mixed workloads are common.
Figure 5 reveals the latency associated with the 4K random read/write (70%/30%) workload.
Here, the picture is the same: S2D’s mirror-accelerated parity setup struggles, especially when the workload spans both mirror and parity tiers, causing data movement delays.
StarWind’s consistent latency, starting with 0.355 ms at queue depth 2 and rising to 6.001 ms at queue depth 64, ensures smoother operations without the need for complex configurations.
StarWind’s latency at maximum depth was almost identical to S2D “mirror-only” but 70% lower compared to the S2D “mirror + parity” configuration.
In Figure 6, the IOPS per 1% CPU usage for the 4K random read/write (70%/30%) pattern is depicted.
StarWind VSAN shows strong efficiency with 9,085 IOPS per 1% CPU usage at 32 IO depth, nearing the performance of S2D’s “mirror-only” setup, while far surpassing the “mirror + parity” configuration. StarWind’s efficiency was approximately 96% of S2D “mirror-only” at the same IO depth (9,458) and about three times that of S2D in the “dual-tier” scenario (3,075 at its peak).
For IOPS per 1% CPU usage, S2D’s performance is uneven, fluctuating with workload intensity, whereas StarWind provides steady and reliable results.
4k random write:
The 4K random write performance pattern, as shown in Figure 7, further highlights the disparities between Microsoft Storage Spaces Direct and StarWind VSAN.
S2D’s performance varies greatly depending on the workload’s placement within the mirror or parity tier. In the “mirror-only” case it also drops sharply at queue depths 4 through 16 before recovering at queue depth 32. StarWind’s results do not depend on workload placement: IOPS rise steadily up to queue depth 8 and fall off only at queue depth 32.
In pure 4K random write operations, StarWind stands out, achieving 341,000 IOPS at an 8-depth queue, which is 16% higher than S2D mirror-only’s peak of 294,000 IOPS at QD=32.
The S2D “mirror + parity” configuration struggles even more, peaking at only 89,000 IOPS at QD=16. Here, StarWind’s peak is a remarkable 283% higher than S2D’s in the “dual-tier” scenario, making it an obvious choice for environments where write speed is critical.
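The comparisons in this article use two different framings, "X% of" and "X% higher than", which are easy to confuse. A minimal sketch of both, applied to the 4K random write peaks quoted above (helper names are hypothetical):

```python
# Two ways of comparing peaks: "percent of" versus "percent higher than".

def percent_of(value: float, baseline: float) -> float:
    """How large value is relative to baseline, e.g. 91 means 91% of it."""
    return value / baseline * 100


def percent_higher(value: float, baseline: float) -> float:
    """How much value exceeds baseline, e.g. 16 means 16% higher."""
    return (value / baseline - 1) * 100


if __name__ == "__main__":
    starwind_peak = 341_000      # 4k random write, QD8
    s2d_mirror_peak = 294_000    # 4k random write, QD32
    s2d_both_peak = 89_000       # 4k random write, QD16
    print(f"vs mirror-only: {percent_higher(starwind_peak, s2d_mirror_peak):.0f}% higher")
    print(f"vs mirror + parity: {percent_higher(starwind_peak, s2d_both_peak):.0f}% higher")
    print(f"vs mirror + parity: {percent_of(starwind_peak, s2d_both_peak):.0f}% of its peak")
```

A result that is 283% higher is the same as one that is 383% of the baseline, or roughly 3.8 times as large.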
Latency during 4K random write operations, depicted in Figure 8, confirms StarWind VSAN’s domination in this test pattern. Starting at 0.595 ms, write latency increases to 9.818 ms, which is still considerably lower than Storage Spaces Direct with workload in mirror tier, which begins at 1.171 ms and peaks at 14.504 ms at a 16 IO depth.
When comparing StarWind VSAN to Microsoft S2D with workload within “mirror + parity”, the performance gap is even more pronounced, with its latency climbing to 22.360 ms at a 32 IO depth. StarWind’s maximum latency was about 68% of the S2D’s “mirror-only” latency and 44% of “mirror + parity” setup.
In 4K RW pattern, we see that latency under S2D can spike, particularly when ReFS is forced to shuffle data between tiers, while StarWind VSAN’s latency remains consistently lower.
Efficiency in 4K random write workloads is measured in IOPS per 1% CPU usage, as shown in Figure 9.
StarWind’s efficiency in write operations is impressive, with 5,156 IOPS per 1% CPU usage at 32 IO depth, outpacing Storage Spaces Direct with the workload in the mirror tier by about 19%. The “mirror + parity” configuration, once again, falls short, peaking at 1,126 IOPS per 1% CPU at 8 IO depth, 77.6% lower than StarWind.
64k random read:
As we shift to larger block sizes, Figure 10 presents throughput for the 64K random read test.
StarWind started with a throughput of 15,187 MiB/s, increasing to 19,062 MiB/s at 32 IO depth.
Storage Spaces Direct with the workload in the mirror tier reached a peak throughput of 53,187 MiB/s at 32 IO depth, and “mirror + parity” setup had a slightly lower peak of 48,500 MiB/s at the same IO depth. StarWind’s maximum throughput was approximately 36% of “mirror-only” and 39% of “mirror + parity”.
With 64K random reads, Microsoft S2D shines again by leveraging local data access to push throughput to impressive levels.
Figure 11 delves into latency during 64K random reads. The results align with the throughput data discussed earlier.
StarWind’s latency started at 0.493 ms and increased to 6.292 ms at 32 IO depth.
S2D with workload in the mirror tier began with a lower latency of 0.374 ms, peaking at 2.252 ms, while “mirror + parity” configuration began at 0.383 ms, increasing to 2.474 ms at 32 IO depth.
In Figure 12, we explore CPU usage during 64K random reads.
StarWind shows stable CPU usage, ranging from 25% to 28%, across various queue depths.
The S2D “mirror-only” scenario started at 17% and increased to 38% at IO depth=32. The “mirror + parity” setup followed a similar trend, starting at 18% and reaching 38%.
64k random write:
Figure 13 vividly illustrates the stark differences in 64K random write throughput between StarWind VSAN and Microsoft Storage Spaces Direct (S2D).
The performance of Storage Spaces Direct with the workload in the mirror tier fluctuates considerably. It starts high, at 7,475 MiB/s at IO depth=1 and 8,153 MiB/s at IO depth=2. At a 4 IO depth, S2D achieves 3,197 MiB/s, which drops to 2,419 MiB/s at an 8 IO depth before slightly increasing to 2,906 MiB/s at a 16 IO depth and rebounding to its highest score of 10,063 MiB/s at IO depth=32, outrunning StarWind by 139.9%. This erratic pattern mirrors the behavior observed in other tests, such as the 4K random write operations.
StarWind starts off with lower initial throughput of 2,638 and 3,050 MiB/s at IO depths 1 and 2, but delivers much more consistent performance as the test progresses. At IO depth=4, StarWind clocks in at 3,306 MiB/s, outpacing S2D by 3.4%. The gap widens as we move to an 8 IO depth, where StarWind reaches 3,613 MiB/s – a 49.4% lead over S2D. At IO depth=16, StarWind is still leading with 3,894 MiB/s, outperforming S2D by 33.9%.
A different story unfolds when we examine S2D performance in “mirror + parity” tests. It struggles at IO depth 1, with a low of 881 MiB/s, peaks at 962 MiB/s at IO depth 8, and drops to 925 MiB/s at IO depth 32.
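The contrast between an erratic and a steady curve can be put into numbers. The sketch below is only an illustration, not part of the original methodology: it takes the 64K random write throughput values quoted above for IO depths 1 to 16 (depth 32 is left out because StarWind's value at that depth is not stated in the text) and computes the max/min ratio of each curve plus StarWind's lead at a given depth. The function names are made up for this example.

```python
# 64K random write throughput in MiB/s at IO depths 1, 2, 4, 8, 16,
# copied from the text above (S2D with the workload in the mirror tier).
DEPTHS = [1, 2, 4, 8, 16]
S2D_MIRROR = [7475, 8153, 3197, 2419, 2906]
STARWIND = [2638, 3050, 3306, 3613, 3894]


def spread(values):
    """Ratio of the best to the worst result; 1.0 means a perfectly flat curve."""
    return max(values) / min(values)


def lead_pct(a, b):
    """How much higher a is than b, in percent of b."""
    return (a / b - 1.0) * 100.0


print("S2D mirror-only spread:", round(spread(S2D_MIRROR), 2))
print("StarWind spread:", round(spread(STARWIND), 2))
for depth, sw, s2d in zip(DEPTHS, STARWIND, S2D_MIRROR):
    print("IO depth", depth, "StarWind lead:", round(lead_pct(sw, s2d), 1), "%")
```

A spread close to 1 indicates throughput that barely depends on queue depth; a negative lead means S2D was ahead at that depth.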
Latency for 64K random writes is detailed in Figure 14, where StarWind’s performance remains more consistent, avoiding the severe latency spikes observed with S2D at IO depths 4, 8, and 16.
Let’s move on to Figure 15, which compares CPU usage during 64K random writes.
Here, CPU usage follows a similar trend as in the previous 64K random writes figures: Microsoft S2D’s efficiency is better only under specific conditions, while StarWind delivers more reliable usage metrics, ranging from 17% to 21%.
1M read:
In Figure 16, we see the throughput results for 1024K read operations.
Microsoft S2D again benefits from local data access, achieving high throughput that significantly outpaces StarWind VSAN. StarWind’s 1024K read throughput ranged from 13,800 MiB/s to 18,900 MiB/s at a 16 IO depth. The S2D “mirror-only” setup peaked at 52,300 MiB/s, while “mirror + parity” reached a slightly lower peak of 49,600 MiB/s. StarWind’s peak throughput was approximately 36% of “mirror-only” and 38% of “mirror + parity”.
Figure 17 shows the latency results during the 1024K read test.
The resulting latency is predictably lower in Microsoft Storage Spaces Direct. StarWind’s latency increased from 1.451 ms to 16.976 ms at a 16 IO depth, whereas “mirror-only” S2D showed lower latency, starting at 1.004 ms and peaking at 6.114 ms at a 16 IO depth.
“Mirror + Parity” followed a similar pattern, starting at 1.015 ms and peaking at 6.452 ms. StarWind’s maximum latency was about 178% higher (roughly 2.8 times) than “mirror-only” and about 163% higher (roughly 2.6 times) than S2D with workloads in both mirror and parity tiers.
Figure 18 highlights CPU usage during 1024K reads.
StarWind’s CPU usage ranged from 15% to 18%, while the S2D “mirror-only” setup started at 5% and increased to 16% at 16 IO depth. Even when workloads span both tiers, S2D maintains almost the same CPU usage levels as in the “mirror-only” benchmarks.
As IO depth increases, the CPU usage gap between StarWind VSAN and S2D narrows. S2D consistently uses less CPU across all IO depths, with the difference being most pronounced at lower IO depths (66.67% less at 1 IO depth than StarWind VSAN) and gradually decreasing to 11.11% less at 16 IO depth.
1M write:
When we shift our focus to 1024K sequential write throughput, Figure 19 underlines some clear distinctions in performance between StarWind VSAN and Storage Spaces Direct (S2D).
At IO depth=1, S2D in mirror-accelerated parity mode with workload in the mirror tier, reaches a throughput of 9,887 MiB/s, while StarWind VSAN manages 3,703 MiB/s. This represents an impressive 167% higher throughput for S2D.
As the IO depth increases to 8, S2D maintains its lead, achieving 10,500 MiB/s compared to StarWind’s 4,479 MiB/s. This results in a 134% higher throughput for S2D at this IO depth.
However, this performance advantage for S2D is primarily evident when the workload does not spill out of the mirror tier.
If the workload hits both tiers – mirror and parity – the results change significantly. Under these conditions, StarWind VSAN exhibits a more stable performance curve, delivering 94% higher throughput than S2D in “mirror + parity” at 1 IO depth and 91% higher throughput at an 8 IO depth.
Latency during 1024K writes, as shown in Figure 20, displays exactly the same picture.
StarWind’s latency increases from 5.399 ms at 1 IO depth to 35.707 ms at an 8 IO depth, while the S2D “mirror-only” configuration has a lower latency peak at 15.250 ms. The “mirror + parity” setup, however, suffers from extremely high latency, peaking at 68.188 ms. StarWind VSAN demonstrates significantly lower latency than the Storage Spaces Direct (S2D) “mirror + parity” configuration: its peak is approximately 48% lower, or, put the other way, the S2D peak is about 91% higher.
Lastly, Figure 21 compares CPU usage during 1024K writes, with StarWind being significantly outmatched by both S2D setups.
StarWind VSAN’s CPU utilization increases from 16% at 1 IO depth to 19% as the queue depth rises. The S2D “mirror-only” configuration demonstrates a much lower CPU usage, capping at 10% at its highest throughput at IO depth=8. This efficiency gives S2D “mirror-only” an edge in terms of IOPS per CPU usage.
Interestingly, when the workload spans both tiers of S2D, CPU usage is even lower, starting at just 4% at 1 IO depth and rising only to 5% at 2, 4, and 8 IO depths.
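The efficiency metric used earlier for the 4K patterns (IOPS per 1% CPU usage) is a plain ratio, and the same ratio can be applied to large-block throughput. The sketch below is a hedged illustration using the 1024K write values quoted above at IO depth=8; it assumes that StarWind's 19% CPU reading corresponds to that depth, which the text implies but does not state explicitly. The function name is invented for this example.

```python
def per_cpu_percent(throughput, cpu_percent):
    """Throughput (IOPS or MiB/s) delivered per 1% of CPU usage."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return throughput / cpu_percent


# 1024K write, IO depth=8, values from the text above.
s2d_mirror = per_cpu_percent(10500, 10)  # 10,500 MiB/s at 10% CPU
starwind = per_cpu_percent(4479, 19)     # 4,479 MiB/s at 19% CPU (assumed depth 8)

print("S2D mirror-only:", round(s2d_mirror, 1), "MiB/s per 1% CPU")
print("StarWind VSAN:", round(starwind, 1), "MiB/s per 1% CPU")
```

Under these assumptions the ratio comes out at roughly 1,050 MiB/s per 1% CPU for S2D “mirror-only” versus about 236 MiB/s per 1% CPU for StarWind in this particular pattern.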
Additional benchmarking: 1 VM, 1 numjobs, 1 iodepth.
To gain a deeper understanding of how StarWind Virtual SAN and Storage Spaces Direct (S2D) perform under specific synthetic conditions, we conducted additional benchmarks focusing on a single-thread scenario, with 1 thread and 1 queue. Typically, this is the most effective way to measure storage access latency in an ideal scenario. The benchmarks focus on 4k random read and write patterns, including synchronous write operations.
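For readers who want to reproduce a comparable single-thread profile, the sketch below builds command lines for the three patterns. It assumes fio as the load generator (the numjobs and iodepth terms suggest it, but this section does not name the tool), and it assumes that "synchronous" maps to fio's `sync=1` option. The target path, runtime, and I/O engine are placeholders, not the values used in these tests.

```python
def fio_args(pattern, target, sync=False, runtime_s=60):
    """Build an fio command line for a 4k, numjobs=1, iodepth=1 run.

    pattern   -- "randread" or "randwrite"
    target    -- hypothetical device or file path under test
    sync      -- True requests synchronous (O_SYNC) writes
    runtime_s -- assumed duration; the article does not state one here
    """
    args = [
        "fio",
        "--name=qd1-" + pattern + ("-sync" if sync else ""),
        "--filename=" + target,
        "--rw=" + pattern,
        "--bs=4k",
        "--numjobs=1",
        "--iodepth=1",
        "--direct=1",         # bypass the page cache
        "--ioengine=libaio",  # assumption: Linux guest; windowsaio on Windows
        "--time_based",
        "--runtime=" + str(runtime_s),
    ]
    if sync:
        args.append("--sync=1")
    return args


# "/dev/sdX" is a placeholder, not a real device name from the testbed.
for pattern, sync in [("randread", False), ("randwrite", False), ("randwrite", True)]:
    print(" ".join(fio_args(pattern, "/dev/sdX", sync)))
```

Keeping both numjobs and iodepth at 1 means only one I/O is in flight at any time, which is why this profile isolates per-request latency rather than aggregate throughput.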
Benchmark results in a table:
StarWind VSAN NVMe-oF HA (RDMA) – Host mirroring + MDRAID5 (1 VM) | |||||
---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 1 | 1 | 2,974 | 12 | 0.335 |
4k random write | 1 | 1 | 2,379 | 10 | 0.419 |
4k random write (synchronous) | 1 | 1 | 967 | 4 | 1.032 |
Storage Spaces Direct (RDMA) – Nested mirror accelerated parity – Data in mirror tier (1 VM) | |||||
---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 1 | 1 | 7,231 | 28 | 0.137 |
4k random write | 1 | 1 | 5,660 | 22 | 0.175 |
4k random write (synchronous) | 1 | 1 | 2,816 | 11 | 0.353 |
Storage Spaces Direct (RDMA) – Nested mirror accelerated parity – Data in mirror and parity tiers (1 VM) | |||||
---|---|---|---|---|---|
Pattern | Numjobs | IOdepth | IOPS | MiB/s | Latency (ms) |
4k random read | 1 | 1 | 5,922 | 23 | 0.167 |
4k random write | 1 | 1 | 2,575 | 10 | 0.387 |
4k random write (synchronous) | 1 | 1 | 1,754 | 7 | 0.568 |
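With one job and one outstanding I/O, IOPS and mean latency are tied together: each request has to complete before the next one starts, so IOPS is roughly 1000 divided by the latency in milliseconds. The sketch below checks the three tables above against that relation. It is a sanity check added for illustration, not part of the original methodology, and the helper names are invented.

```python
# (label, IOPS, latency in ms) copied from the three tables above.
ROWS = [
    ("StarWind read", 2974, 0.335),
    ("StarWind write", 2379, 0.419),
    ("StarWind sync write", 967, 1.032),
    ("S2D mirror read", 7231, 0.137),
    ("S2D mirror write", 5660, 0.175),
    ("S2D mirror sync write", 2816, 0.353),
    ("S2D mirror+parity read", 5922, 0.167),
    ("S2D mirror+parity write", 2575, 0.387),
    ("S2D mirror+parity sync write", 1754, 0.568),
]


def expected_iops(latency_ms, outstanding=1):
    """IOPS implied by mean latency with a fixed number of I/Os in flight."""
    return outstanding * 1000.0 / latency_ms


def relative_gap(measured, expected):
    """Absolute difference as a fraction of the expected value."""
    return abs(measured - expected) / expected


for label, iops, latency_ms in ROWS:
    implied = expected_iops(latency_ms)
    gap = relative_gap(iops, implied) * 100.0
    print(label, "measured", iops, "implied", round(implied), "gap %.1f%%" % gap)
```

Every row lands within about 1% of the implied value, which suggests that the reported IOPS and latency columns describe the same measurement.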
Benchmark results in graphs:
This section presents visual comparisons of the performance and latency metrics across storage configurations under research.
4k random read:
Figure 1 shows IOPS for the 4K random read test at IO depth=1 and numjobs=1.
Here, Storage Spaces Direct (S2D) with data in the mirror tier outshines the other configurations. It achieves 7,231 IOPS, which is 143% higher than StarWind VSAN’s 2,974 IOPS.
This superior performance is again due to S2D’s ability to perform local reads at the host level, whereas StarWind VSAN operates within a VM, leading to a longer IO datapath.
Even when data spans both the mirror and parity tiers, S2D still leads with 5,922 IOPS, outperforming StarWind by 99%.
Latency metrics for the 4K random read test at 1 IO depth, as shown in Figure 2, similarly favor Storage Spaces Direct with the workload in the mirror tier, which records a swift 0.137 ms. S2D’s latency is 59% lower than StarWind’s 0.335 ms.
Even when data spans both tiers, S2D maintains a respectable 0.167 ms, which is still 50% lower than StarWind’s.
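The comparisons in this section mix "X% higher" (for IOPS) and "X% lower" (for latency). These use different baselines, so the two figures are not interchangeable. A minimal sketch of both calculations, using the 4K random read values above (the function names are chosen for this example):

```python
def pct_higher(value, baseline):
    """How much larger value is than baseline, in percent of baseline."""
    return (value / baseline - 1.0) * 100.0


def pct_lower(value, baseline):
    """How much smaller value is than baseline, in percent of baseline."""
    return (1.0 - value / baseline) * 100.0


# 4K random read, 1 job, 1 IO depth: S2D mirror tier vs StarWind VSAN.
print(round(pct_higher(7231, 2974)))   # IOPS: 143 (% higher)
print(round(pct_lower(0.137, 0.335)))  # latency: 59 (% lower)
print(round(pct_lower(0.167, 0.335)))  # latency, both tiers: 50 (% lower)
```

Note that a "lower" percentage can never exceed 100%, whereas a "higher" percentage has no upper bound.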
4k random write:
Figure 3 shows the results of the 4K random write test at IO depth=1 and numjobs=1.
For 4k random writes, S2D with data in the mirror tier proves its prowess, achieving 5,660 IOPS, which is 138% higher than StarWind’s 2,379 IOPS. The superior performance of S2D is due to the direct writing to the mirror tier, bypassing the need to calculate parity, which is resource-intensive. For a more detailed explanation of how reading and writing occur in a mirror-accelerated parity scenario, please refer to the following link.
In scenarios where data spans both the mirror and parity tiers, S2D’s performance drops to 2,575 IOPS, but it still edges out StarWind by 8%. The additional step of invalidating data in the parity tier in S2D slightly reduces performance compared to when the workload is fully contained within the mirror tier. In contrast, StarWind VSAN writes directly to the MDRAID5 array, resulting in read-modify-write (RMW) operations, which further reduce performance.
Moving on to Figure 4, we examine the latency metrics for 4K random writes.
No surprises here. S2D in the mirror tier shows a clear advantage with a latency of 0.175 ms, which is 58% lower than StarWind’s 0.419 ms (StarWind’s latency is about 139% higher). This advantage stems from S2D’s direct writing to the mirror tier, bypassing the parity calculations that slow down write operations.
When S2D data is spread across both tiers, the latency increases to 0.387 ms but remains 8% lower than StarWind’s latency.
These results suggest that S2D can more effectively manage latency in 4K write operations at IO depth=1 and numjobs=1, ensuring quicker data processing, while StarWind’s longer IO datapath from inside a VM increases latency.
4k random write (synchronous):
In our synchronous 4K random write single-threaded IO tests, as shown in Figure 5, S2D in the mirror tier reaches 2,816 IOPS, again outperforming StarWind’s 967 IOPS by a significant 191%. This difference is again due to S2D’s ability to write directly to the mirror tier, avoiding the overhead of parity calculations.
When S2D data is distributed across both tiers, the performance drops to 1,754 IOPS but still surpasses StarWind by 81%.
The latency figures for synchronous 4K random write single-threaded IO, depicted in Figure 6, tell a similar story, with S2D’s mirror tier configuration offering a quick 0.353 ms, which is 66% lower than StarWind’s 1.032 ms (StarWind’s latency is about 192% higher). Even with data in both tiers, S2D’s latency is 0.568 ms – 45% lower than StarWind’s, whose latency is about 82% higher.
This consistency in performance highlights S2D’s capability in managing synchronous write operations efficiently, while StarWind’s VM-based operation leads to a longer IO datapath and higher latency.
Conclusion
In conclusion, both Storage Spaces Direct and StarWind VSAN bring distinct strengths and weaknesses to the table, each catering to different needs within your IT infrastructure.
Storage Spaces Direct, being a native Microsoft solution, excels in read performance, especially when virtual machines are aligned with their corresponding volume-owning nodes. However, this advantage hinges on careful workload management. If the VMs aren’t perfectly aligned, or if the workload spills over from the mirror tier to the parity tier, you might see a significant dip in write performance as we observed during 4K and 64K random-write tests. Additionally, S2D’s capacity efficiency is somewhat compromised, especially when you factor in the need to reserve extra space for fault tolerance.
On the flip side, StarWind VSAN shines in environments that demand consistent write and mixed IO performance. Its stable read and write performance, regardless of VM placement, and superior capacity efficiency make it a compelling option. However, the absence of local read optimization and the need to deploy an additional VM (StarWind VSAN CVM) are considerations that might tip the scale depending on your specific needs.
Ultimately, if your priority is top-notch read performance and you’re prepared to closely monitor your workloads, Storage Spaces Direct could be your go-to. But if you’re looking for reliable write performance and better capacity efficiency, StarWind VSAN might be the better fit.
from StarWind Blog https://ift.tt/74sDaup