Tuesday, May 26, 2015

When Virtualised Storage is Faster than Bare Metal

An analysis of block size, inflight requests and outstanding data


Back in August 2014 I went to the Xen Project Developer Summit in Chicago (IL) and presented a graph that caused a few faces to go "ahn?". The graph was meant to show how well XenServer 6.5 storage throughput could scale over several guests. For that, I compared 10 fio threads running in dom0 (mimicking 10 virtual disks) with 10 guests running 1 fio thread each. The result: the aggregate throughput of the virtual machines was actually higher.

In XenServer 6.5 (used for those measurements), the storage traffic of 10 VMs corresponds to 10 tapdisk3 processes doing I/O via libaio in dom0. My measurements used the same disk areas (raw block-based virtual disks) for each fio thread or tapdisk3. So how can 10 tapdisk3 processes possibly be faster than 10 fio threads also using libaio and also running in dom0?

At the time, I hypothesised that the lack of indirect I/O support in tapdisk3 was causing requests larger than 44 KiB (the maximum supported request size in Xen's traditional blkif protocol) to be split into smaller requests, and that the storage infrastructure (a Micron P320h) was responding better to a higher number of smaller requests. In case you are wondering, I also think that people thought I was crazy.
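
To make the splitting concrete, here is a minimal sketch of how one large request could be cut into pieces no bigger than the 44 KiB limit mentioned above. The function name split_request and its "offset length" output format are made up for illustration; this is not how tapdisk3 or blkif actually implement it.

```bash
# Split one request of `total` bytes into chunks of at most `max` bytes.
# 45056 bytes = 44 KiB, the blkif limit quoted in the text; using it as the
# default is an assumption made for this illustration.
split_request() {
    local total="$1"
    local max="${2:-45056}"
    local offset=0
    local len
    while [ "${offset}" -lt "${total}" ]; do
        len=$(( total - offset ))
        if [ "${len}" -gt "${max}" ]; then
            len="${max}"
        fi
        # One line per resulting request: byte offset, then length
        echo "${offset} ${len}"
        offset=$(( offset + len ))
    done
}

# Example: a 128 KiB request becomes three smaller requests
split_request 131072
```

With these numbers, a single 128 KiB request turns into two 44 KiB requests and one 40 KiB request, so the device sees three inflight requests instead of one.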

You can check out my year-old hypothesis between 5:10 and 5:30 in the XPDS'14 recording of my talk: https://youtu.be/bbdWFB1mBxA?t=5m10s



For several years operating systems have been optimising storage I/O patterns (in software) before issuing them to the corresponding disk drivers. In Linux, this has been achieved via elevator schedulers and the block layer. Requests can be reordered, delayed, prioritised and even merged into a smaller number of larger requests.

Merging requests has been around for as long as I can remember. Everyone understands that fewer requests mean less overhead and that storage infrastructures respond better to larger requests. As a matter of fact, the graph above, which shows throughput as a function of request size, is proof of that: bigger requests mean higher throughput.

It wasn't until 2010 that a proper means to fully disable request merging came into play in the Linux kernel. Alan Brunelle showed a 0.56% throughput improvement (and less CPU utilisation) by not trying to merge requests at all. I wonder whether he considered that splitting requests could be even more beneficial.
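
For reference, current kernels expose this control per block device through the queue/nomerges sysfs attribute. The helper below is a small sketch for setting it; the name set_nomerges and the optional third argument (an alternative sysfs root, handy for trying the function on a scratch directory) are my own additions.

```bash
# Write a merge policy to a block device's queue/nomerges attribute.
#   0 = all merging enabled (the default)
#   1 = only simple one-hit merges are tried
#   2 = no merging is attempted at all
# The third argument is a hypothetical convenience: it defaults to /sys and
# exists only so the function can be exercised without touching a real device.
set_nomerges() {
    local dev="$1"
    local policy="$2"
    local sysfs_root="${3:-/sys}"
    local attr="${sysfs_root}/block/${dev}/queue/nomerges"
    case "${policy}" in
        0|1|2) ;;
        *) echo "policy must be 0, 1 or 2" >&2; return 1 ;;
    esac
    if [ ! -w "${attr}" ]; then
        echo "cannot write ${attr}" >&2
        return 1
    fi
    echo "${policy}" > "${attr}"
}
```

Running "set_nomerges nvme0n1 2" as root would disable merging on the first NVMe namespace until the next reboot.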


Given the results I have seen in my 2014 measurements, I would like to take this concept a step further. On top of not merging requests, let's forcibly split them.

The rationale behind this idea is that some drives today will respond better to a higher number of outstanding requests. The Micron P320h performance testing guide says that it "has been designed to operate at peak performance at a queue depth of 256" (page 11). Similar documentation from Intel uses a queue depth of 128 to indicate peak performance of its NVMe family of products.

But it is one thing to say that a drive requires a large number of outstanding requests to perform at its peak. It is a different thing to say that a batch of 8 requests of 4 KiB each will complete quicker than one 32 KiB request.


So let's put that to the test. I wrote a little script to measure the random read throughput of two modern NVMe drives when facing workloads with varying block sizes and I/O depth. For block sizes from 512 B to 4 MiB, I am particularly interested in analysing how these disks respond to larger "single" requests in comparison to smaller "multiple" requests. In other words, what is faster: 1 outstanding request of X bytes or Y outstanding requests of X/Y bytes?
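
As a quick illustration of the question, the sketch below lists every way of keeping a fixed amount of data outstanding by doubling the number of requests and halving their size. The function name list_splits is made up; the 512 B floor matches the smallest block size used in the measurements.

```bash
# List every (block size, queue depth) pair that keeps `size` bytes
# outstanding, doubling the queue depth until the block size would drop
# below 512 bytes.
list_splits() {
    local size="$1"
    local qd=1
    while [ $(( size / qd )) -ge 512 ]; do
        # Block size in bytes, then the number of outstanding requests
        echo "$(( size / qd )) ${qd}"
        qd=$(( qd * 2 ))
    done
}

# Example: the 32 KiB case, from one 32 KiB request down to 64 requests of 512 B
list_splits 32768
```

For 32 KiB this yields seven combinations, one of which is the "8 requests of 4 KiB" case mentioned earlier.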

My test environment consists of a Dell PowerEdge R720 (Intel E5-2643v2 @ 3.5GHz, 2 sockets, 6 cores/socket, HT enabled), with 64 GB of RAM running Debian Jessie 64-bit and the Linux 4.0.4 kernel. My two disks are an Intel P3700 (400GB) and a Micron P320h (175GB). Fans were set to full speed and the power profiles were configured for OS Control, with a performance governor in place.

#!/bin/bash
# Total outstanding data per test, in bytes (512 B to 4 MiB)
sizes="512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 \
       1048576 2097152 4194304"
drives="nvme0n1 rssda"

for drive in ${drives}; do
    for size in ${sizes}; do
        # Double the queue depth while each request stays at least 512 B
        for ((qd=1; ${size}/${qd} >= 512; qd*=2)); do
            bs=$(( ${size} / ${qd} ))
            # Field 7 of fio's terse (v3) output is the read bandwidth
            tp=$(fio --terse-version=3 --minimal --rw=randread --numjobs=1  \
                     --direct=1 --ioengine=libaio --runtime=30 --time_based \
                     --name=job --filename=/dev/${drive} --bs=${bs}         \
                     --iodepth=${qd} | awk -F';' '{print $7}')
            echo "${size} ${bs} ${qd} ${tp}" | tee -a ${drive}.dat
        done
    done
done
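
The .dat files written by the script can also be summarised without plotting. The helper below is a rough sketch (the name best_per_size is my own) that keeps, for each amount of outstanding data, the row with the highest throughput. It assumes the four-column layout the script emits: size, block size, queue depth, throughput.

```bash
# Print, for each amount of outstanding data, the row with the highest
# throughput. Input columns: size, block size, queue depth, throughput.
best_per_size() {
    awk '
        # Keep the row if this size is new or its throughput beats the best so far
        !($1 in best) || $4 + 0 > best[$1] { best[$1] = $4 + 0; row[$1] = $0 }
        END { for (s in row) print row[s] }
    ' "$1" | sort -n
}
```

Calling "best_per_size nvme0n1.dat" prints one line per size, sorted by size, showing which block size and queue depth combination performed best.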

There are several ways of looking at the results. I believe it is always worth starting with a broad overview including everything that makes sense. The graphs below contain all the data points for each drive. Keep in mind that the "x" axis represents Block Size (in KiB) over the Queue Depth.



While the Intel P3700 is faster overall, both drives share a common trait: for a certain amount of outstanding data, throughput can be significantly higher if such data is split over several inflight requests (instead of a single large request). Because this workload consists of random reads, this is a characteristic that is not evident in spinning disks (where the seek time would negatively affect the total throughput of the workload).

To make this point clearer, I have isolated the workloads involving 512 KiB of outstanding data on the P3700 drive. The graph below shows that if a workload randomly reads 512 KiB of data one request at a time (queue depth=1), the throughput will be just under 1 GB/s. If, instead, the workload reads 8 KiB per request with 64 requests outstanding at a time (8 KiB x 64 = 512 KiB), the throughput is about double (just under 2 GB/s).
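
To put those two data points in perspective, the sketch below converts throughput into requests per second and, using Little's law (latency = queue depth / IOPS), into the mean time each request spends in flight. The inputs are the rounded upper bounds quoted above (1 GB/s and 2 GB/s, taking 1 GB as 10^9 bytes, which is an assumption), so treat the output as a rough estimate rather than a measurement. The function name iops_and_latency is made up.

```bash
# Convert a throughput figure into requests per second and, via Little's law
# (latency = queue depth / IOPS), the mean time each request is in flight.
# Arguments: throughput in bytes/s, block size in bytes, queue depth.
iops_and_latency() {
    awk -v tp="$1" -v bs="$2" -v qd="$3" 'BEGIN {
        iops = tp / bs
        printf "%.0f IOPS, %.0f us per request\n", iops, qd / iops * 1e6
    }'
}

iops_and_latency 1000000000 524288 1   # 512 KiB requests, queue depth 1
iops_and_latency 2000000000 8192 64    # 8 KiB requests, queue depth 64
```

Under these assumptions the first call prints "1907 IOPS, 524 us per request" and the second "244141 IOPS, 262 us per request": the split workload completes about 128 times as many requests per second, and each small request is in flight for roughly half as long as the single large one.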



Storage technologies are constantly evolving. At this point in time, it appears that hardware is evolving much faster than software. In this post I have discussed a paradigm of workload optimisation (request merging) that perhaps no longer applies to modern solid state drives. As a matter of fact, I am proposing that the exact opposite (request splitting) should be done in certain cases.

Traditional spinning disks have always responded better to large requests. Such workloads reduced the overhead of seek times where the head of a disk must roam around to fetch random bits of data. In contrast, solid state drives respond better to parallel requests, with virtually no overhead for random access patterns.

Virtualisation platforms and software-defined storage solutions are perfectly placed to take advantage of such paradigm shifts. By understanding the hardware infrastructure they are placed on top of, as well as the workload patterns of their users (e.g. Virtual Desktops), requests can be easily manipulated to better exploit system resources.
