Cloudy Journey: Average Queue Size and Storage IO Metrics


Introduction

There seems to be some confusion around the metric "average queue size", which iostat reports as "avgqu-sz". The confusion arises when iostat reports a different avgqu-sz in dom0 and in domU for a single Virtual Block Device (VBD), while other metrics such as Input/Output Operations Per Second (IOPS) and throughput (often expressed in MB/s) are the same. This page describes what all of this actually means and how it should be interpreted.

Background

On any modern Operating System (OS), it is possible to concurrently submit several requests to a single storage device. This practice normally helps several layers of the data path perform better, allowing systems to achieve higher numbers in metrics such as IOPS and throughput. However, measuring the average number of outstanding (or "inflight") requests for a given block device over a period of time can be a bit tricky. This is because the number of outstanding requests is an "instant metric". That is, when you look, there might be zero requests pending for that device. When you look again, there might be 28. Without a lot of accounting and some intrusiveness, it is not really possible to tell what happened in-between.

Most users, however, are not interested in everything that happened in-between. People are much more interested in the average number of outstanding requests. This average gives a good understanding of the workload that is taking place (i.e. how applications are using storage) and helps with tuning the environment for better performance.

Calculating the Average Queue Size

To understand how the average queue size is calculated, consider the following diagram which presents a Linux system running 'fio' as a benchmarking user application issuing requests to a SCSI disk.

Figure 1. Benchmark issuing requests to a disk

The application issues requests to the kernel through libraries such as libc or libaio. In the simple case where the benchmark is configured with an IO Depth of 1, 'fio' will attempt to keep one request "flying" at all times. As soon as one request completes, 'fio' will send another. This can be achieved with the following configuration file (which runs for 10 seconds and uses /dev/xvdb as the benchmarking disk):

[global]
bs=4k
rw=read
iodepth=1
direct=1
ioengine=libaio
runtime=10
time_based

[job-xvdb]
filename=/dev/xvdb

Table 1. fio configuration file for a test workload

NOTE: In this experiment, /dev/xvdb was configured as a RAW VDI. Make sure to fully populate VHD VDIs before running experiments (especially if they are read-based).

One of the metrics made available by the block layer for a device is the number of read and write "ticks" (see stat.txt in the Linux Kernel documentation). This exposes the amount of time, summed over all requests, that the device has been occupied. The block layer starts this accounting immediately before shipping the request to the driver and stops it immediately after the request completes. The figure below represents this time as the red and blue horizontal bars.
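
As a rough illustration of where these counters live, the sketch below parses a line in the layout that stat.txt documents for /sys/block/<dev>/stat (eleven whitespace-separated counters, with read ticks and write ticks in the fourth and eighth positions, both in milliseconds). The names BlockStat, parse_block_stat and read_block_stat are made up for this example, and the sample line is invented rather than taken from the experiment.

```python
from typing import NamedTuple


class BlockStat(NamedTuple):
    """Subset of the counters in /sys/block/<dev>/stat (hypothetical helper type)."""
    read_ios: int
    read_ticks_ms: int
    write_ios: int
    write_ticks_ms: int
    in_flight: int


def parse_block_stat(line: str) -> BlockStat:
    """Parse one line in the /sys/block/<dev>/stat layout described in stat.txt."""
    fields = [int(f) for f in line.split()]
    if len(fields) < 11:
        raise ValueError("expected at least 11 counters, got %d" % len(fields))
    # Positions: 0 read I/Os, 3 read ticks, 4 write I/Os, 7 write ticks, 8 in_flight.
    return BlockStat(fields[0], fields[3], fields[4], fields[7], fields[8])


def read_block_stat(device: str) -> BlockStat:
    """Read the live counters for a device name such as 'xvdb' (Linux only)."""
    with open("/sys/block/%s/stat" % device) as f:
        return parse_block_stat(f.read())


# Invented sample line, for illustration only.
sample = "54720 0 437760 9850 0 0 0 0 1 9500 9850"
stat = parse_block_stat(sample)
total_ticks_ms = stat.read_ticks_ms + stat.write_ticks_ms
print(stat, total_ticks_ms)
```

Calling read_block_stat("xvdb") twice, some time apart, would give the two samples that the rest of this section works with.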

Figure 2. Diagram representing request accounting

It is important to understand that this metric can grow faster than wall-clock time. This will happen if more than one request has been submitted concurrently. In the example below, a new (green) request was submitted before the first (red) request completed. It completed after the red request finished and after the blue request was issued. During the moments where requests overlapped, the ticks metric increased at a rate greater than time.
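
The effect can be reproduced with a small model. The sketch below is a simplification, not the kernel's accounting code: it treats "ticks" as the sum of the time each request has spent in flight, and the request timings (in milliseconds) are invented to mimic the red, green and blue requests described above.

```python
def ticks_at(requests, now):
    """Sum, over all requests, of the time each has spent in flight up to `now`."""
    total = 0.0
    for start, end in requests:
        # A request contributes nothing before it starts and stops
        # contributing once it has completed.
        total += max(0.0, min(end, now) - start)
    return total


# Hypothetical (start, end) times in ms: red, green, blue.
requests = [(1.0, 4.0), (2.0, 7.0), (6.0, 9.0)]

total_ticks = ticks_at(requests, 10.0)
# Between t=2 and t=4 the red and green requests overlap.
overlap_growth = ticks_at(requests, 4.0) - ticks_at(requests, 2.0)

print(total_ticks, overlap_growth)
```

In this model the counter gains 4 ms of ticks during the 2 ms in which two requests overlap, and ends at 11 ms of ticks after only 10 ms of elapsed time.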

Figure 3. Diagram representing concurrent request accounting

Looking at this last figure, it is clear that there were moments where no request was present in the device driver. There were also moments where one or two requests were present in the driver. To calculate the average number of inflight requests (or average queue size) between two moments in time, tools like iostat will sample "ticks" at moment one, sample "ticks" again at moment two, and divide the difference between these ticks by the time interval between these moments.

Figure 4. Formula to calculate the average queue size
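
Written out from the description above, with t_1 and t_2 as the two sampling moments and ticks(t) as the counter value read at moment t (both expressed in the same time unit), the calculation is:

```latex
\text{avgqu-sz} = \frac{\mathrm{ticks}(t_2) - \mathrm{ticks}(t_1)}{t_2 - t_1}
```

As a made-up example, if the counter advances by 960 ms between two samples taken 1000 ms apart, the average queue size for that interval is 960 / 1000 = 0.96.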

The Average Queue Size in a Virtualised Environment

In a virtualised environment, the datapath between the benchmarking application (fio) running inside a virtual machine and the actual storage is different. Considering XenServer 6.5 as an example, the figure below shows a simplification of this datapath. As in the examples of the previous section, requests start in a virtual machine's user space application. When moving through the kernel, however, they are directed to paravirtualised (PV) storage drivers (e.g. blkfront) instead of an actual SCSI driver. These requests are picked up by the storage backend (tapdisk3) in dom0's user space. They are submitted to dom0's kernel via libaio, pass through the block layer and reach the disk drivers for the corresponding storage infrastructure (in this example, a SCSI disk).

Figure 5. Benchmark issuing requests on a virtualised environment

The technique described above to calculate the average queue size will produce different values depending on where it is applied. Considering the diagram above, it could be used in the virtual machine's block layer, in tapdisk3 or in the dom0's block layer. Each of these would show a different queue size and actually mean something different. The diagram below extends the examples used in this article to include these layers.

Figure 6. Diagram representing request accounting in a virtualised environment

The figure above contains (almost) vertical arrows between the layers representing requests departing from and arriving at different system components. These arrows are slightly angled, suggesting that time passes as a request moves from one layer to another. There is also some elapsed time between an arrow arriving at a layer and a new arrow leaving from that layer.

Another detail of the figure is the horizontal (red and blue) bars. They indicate where requests are accounted at a particular layer. Note that this accounting starts some time after a request arrives at a layer (and some time before the request passes to another layer). These offsets, however, are merely illustrative. A thorough look at the output of specific performance tools is necessary to understand what the "Average Queue Size" is for certain workloads.

Investigating a Real Deployment

In order to place real numbers in this article, the following environment was configured:

Hardware: Dell PowerEdge R310

Intel Xeon X3450 2.67GHz (1 Socket, 4 Cores/socket, HT Enabled)
BIOS Power Management set to OS DBPM
Xen P-State Governor set to "Performance", Max Idle State set to "1"
8 GB RAM
2 x Western Digital WD2502ABYS

/dev/sda: XenServer Installation + guest's root disk
/dev/sdb: LVM SR with one 10 GiB RAW VDI attached to the guest

dom0: XenServer Creedence (Build Number 88873)

4 vCPUs
752 MB RAM

domU: Debian Wheezy x86_64

2 vCPUs
512 MB RAM

When issuing the fio workload as indicated in Table 1 (sequentially reading 4 KiB requests using libaio with iodepth set to 1 for 10 seconds), iostat within the guest reports the following:

root@wheezy64:~# iostat -xm | grep Device ; iostat -xm 1 | grep xvdb
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
xvdb              0.00     0.00  251.05    0.00     0.98     0.00     8.00     0.04    0.18    0.18    0.00   0.18   4.47
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00 4095.00    0.00    16.00     0.00     8.00     0.72    0.18    0.18    0.00   0.18  72.00
xvdb              0.00     0.00 5461.00    0.00    21.33     0.00     8.00     0.94    0.17    0.17    0.00   0.17  94.40
xvdb              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.40
xvdb              0.00     0.00 5472.00    0.00    21.38     0.00     8.00     0.95    0.17    0.17    0.00   0.17  95.20
xvdb              0.00     0.00 5472.00    0.00    21.38     0.00     8.00     0.97    0.18    0.18    0.00   0.18  97.20
xvdb              0.00     0.00 5443.00    0.00    21.27     0.00     8.00     0.96    0.18    0.18    0.00   0.18  95.60
xvdb              0.00     0.00 5465.00    0.00    21.34     0.00     8.00     0.96    0.17    0.17    0.00   0.17  95.60
xvdb              0.00     0.00 5467.00    0.00    21.36     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.00
xvdb              0.00     0.00 5475.00    0.00    21.39     0.00     8.00     0.96    0.18    0.18    0.00   0.18  96.40
xvdb              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.97    0.18    0.18    0.00   0.18  96.80
xvdb              0.00     0.00 1155.00    0.00     4.51     0.00     8.00     0.20    0.17    0.17    0.00   0.17  20.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
xvdb              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

The value of interest is reported in the column "avgqu-sz". It is about 0.96 on average while the benchmark was running. This means that the guest's block layer (referring to Figure 6) is handling requests almost the entire time.
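
The reported figure can be cross-checked against the other iostat columns. The sketch below is not how iostat computes avgqu-sz (that is the ticks calculation described earlier); it applies Little's law, which the article itself does not use, to estimate the queue size as request rate multiplied by average time per request. The input values are copied from one steady-state row of the guest output, and the function names are invented for this example.

```python
def little_queue_size(iops, await_ms):
    """Estimate the average queue size as arrival rate times time in the system."""
    return iops * await_ms / 1000.0


def throughput_mib_s(iops, request_kib):
    """Throughput implied by a request rate and a fixed request size."""
    return iops * request_kib / 1024.0


# Values from one steady-state row of the guest's iostat output.
guest_iops = 5472.0      # r/s column
guest_await_ms = 0.18    # await column (rounded to two decimals by iostat)

estimated_queue = little_queue_size(guest_iops, guest_await_ms)
estimated_mib_s = throughput_mib_s(guest_iops, 4.0)  # bs=4k in the fio job

print(round(estimated_queue, 2), round(estimated_mib_s, 2))
```

The estimate comes out slightly above the reported 0.95 to 0.97, which is expected given that await is rounded to two decimal places, and the implied throughput agrees with the rMB/s column.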

The next layer of the storage subsystem that accounts for utilisation is tapdisk3. This value can be obtained by running /opt/xensource/debug/xsiostat in dom0. For the same experiment, it reports the following:

[root@dom0 ~]# /opt/xensource/debug/xsiostat | head -2 ; /opt/xensource/debug/xsiostat | grep 51728
--------------------------------------------------------------------
   DOM   VBD         r/s        w/s    rMB/s    wMB/s rAvgQs wAvgQs
     1,51728:       0.00       0.00     0.00     0.00   0.00   0.00
     1,51728:    1213.04       0.00     4.97     0.00   0.22   0.00
     1,51728:    5189.03       0.00    21.25     0.00   0.71   0.00
     1,51728:    5196.95       0.00    21.29     0.00   0.71   0.00
     1,51728:    5208.94       0.00    21.34     0.00   0.71   0.00
     1,51728:    5208.10       0.00    21.33     0.00   0.71   0.00
     1,51728:    5194.92       0.00    21.28     0.00   0.71   0.00
     1,51728:    5203.08       0.00    21.31     0.00   0.71   0.00
     1,51728:    5245.00       0.00    21.48     0.00   0.72   0.00
     1,51728:    5482.02       0.00    22.45     0.00   0.74   0.00
     1,51728:    5474.02       0.00    22.42     0.00   0.74   0.00
     1,51728:    3936.92       0.00    16.13     0.00   0.53   0.00
     1,51728:       0.00       0.00     0.00     0.00   0.00   0.00
     1,51728:       0.00       0.00     0.00     0.00   0.00   0.00

Analogously to what was observed within the guest, xsiostat reports on the amount of time that it had outstanding requests. At this layer, the figure is about 0.71 while the benchmark was running. This gives an idea of the time that passed between a request being accounted for in the guest's block layer and in dom0's backend system. Going further, it is possible to run iostat in dom0 and find out what the perceived utilisation is at the last layer before the request is issued to the device driver.

[root@dom0 ~]# iostat -xm | grep Device ; iostat -xm 1 | grep dm-3
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
dm-3              0.00     0.00  102.10    0.00     0.40     0.00     8.00     0.01    0.11   0.11   1.16
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00  281.00    0.00     1.10     0.00     8.00     0.06    0.20   0.20   5.60
dm-3              0.00     0.00 5399.00    0.00    21.09     0.00     8.00     0.58    0.11   0.11  58.40
dm-3              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.58    0.11   0.11  57.60
dm-3              0.00     0.00 5261.00    0.00    20.55     0.00     8.00     0.61    0.12   0.12  61.20
dm-3              0.00     0.00 5258.00    0.00    20.54     0.00     8.00     0.61    0.12   0.12  61.20
dm-3              0.00     0.00 5206.00    0.00    20.34     0.00     8.00     0.57    0.11   0.11  56.80
dm-3              0.00     0.00 5293.00    0.00    20.68     0.00     8.00     0.60    0.11   0.11  60.00
dm-3              0.00     0.00 5476.00    0.00    21.39     0.00     8.00     0.64    0.12   0.12  64.40
dm-3              0.00     0.00 5480.00    0.00    21.41     0.00     8.00     0.61    0.11   0.11  60.80
dm-3              0.00     0.00 5479.00    0.00    21.40     0.00     8.00     0.66    0.12   0.12  66.40
dm-3              0.00     0.00 5047.00    0.00    19.71     0.00     8.00     0.56    0.11   0.11  56.40
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

At this layer, the block layer reports about 0.61 for the average queue size.
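
To compare the three layers side by side, the sketch below averages the steady-state rows from each of the outputs above (rows where the benchmark was ramping up or down are left out, which is a judgment call made for this example). It also derives a rough time per request at each layer by dividing the mean queue size by the mean request rate. That last step is an estimate based on Little's law rather than something the tools report.

```python
from statistics import mean

# (r/s, average queue size) pairs copied from the steady-state rows above.
samples = {
    "guest": [(5461.00, 0.94), (5479.00, 0.96), (5472.00, 0.95),
              (5472.00, 0.97), (5443.00, 0.96), (5465.00, 0.96),
              (5467.00, 0.96), (5475.00, 0.96), (5479.00, 0.97)],
    "tapdisk3": [(5189.03, 0.71), (5196.95, 0.71), (5208.94, 0.71),
                 (5208.10, 0.71), (5194.92, 0.71), (5203.08, 0.71),
                 (5245.00, 0.72), (5482.02, 0.74), (5474.02, 0.74)],
    "dom0": [(5399.00, 0.58), (5479.00, 0.58), (5261.00, 0.61),
             (5258.00, 0.61), (5206.00, 0.57), (5293.00, 0.60),
             (5476.00, 0.64), (5480.00, 0.61), (5479.00, 0.66),
             (5047.00, 0.56)],
}


def summarise(rows):
    """Return (mean queue size, estimated microseconds per request)."""
    iops = mean([r for r, _ in rows])
    queue = mean([q for _, q in rows])
    return queue, queue / iops * 1e6


summary = {layer: summarise(rows) for layer, rows in samples.items()}
for layer, (queue, usec) in summary.items():
    print("%-9s avg queue %.2f, ~%.0f us per request" % (layer, queue, usec))
```

With these rows the mean queue sizes come out at roughly 0.96, 0.72 and 0.60, and the estimated time per request shrinks from about 175 microseconds in the guest to about 113 microseconds in dom0's block layer. The differences between adjacent layers give a rough idea of the time a request spends travelling between them.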

Varying the IO Depth

The sections above clarified why users might see a lower queue utilisation in dom0 when comparing the output of performance tools in different layers of the storage subsystem. The examples shown so far, however, covered mostly the case where IO Depth is set to "1". This means that the benchmark tool run within the guest (e.g. fio) will attempt to keep one request inflight at all times. This tool's perception, however, might be incorrect given that it takes time for the request to actually reach the storage infrastructure.

Using the same environment described in the previous section and gradually increasing the IO Depth in the benchmark configuration, the following data can be gathered:

Figure 7. Average queue size vs. io depth as configured in fio
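
A sweep like this could be scripted by generating one fio job file per IO Depth. The sketch below reuses the settings from Table 1 and varies only iodepth. The list of depths is a guess made for illustration, since the article does not state which values were used, and the fio_job helper is invented for this example.

```python
FIO_TEMPLATE = """[global]
bs=4k
rw=read
iodepth={iodepth}
direct=1
ioengine=libaio
runtime=10
time_based

[job-xvdb]
filename=/dev/xvdb
"""


def fio_job(iodepth):
    """Return the Table 1 job file with a different IO Depth."""
    if iodepth < 1:
        raise ValueError("iodepth must be at least 1")
    return FIO_TEMPLATE.format(iodepth=iodepth)


# Hypothetical sweep values; the article does not list the depths it used.
depths = [1, 2, 4, 8, 16, 32]
jobs = {depth: fio_job(depth) for depth in depths}

print(jobs[8])
```

Each generated job can be written to a file and passed to fio inside the guest while iostat and xsiostat are sampled in the same way as in the previous section.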

Conclusion

This article explained what the average queue size is and how it is calculated. As examples, it included real data from specific server and disk types. This should clarify why certain workloads cause different queue utilisations to be perceived from the guest and from dom0.

Wednesday, December 3, 2014
