Sunday, July 27, 2014

In-memory read caching for XenServer




Overview

In this blog post, I introduce in-memory read caching, a new feature of XenServer Creedence alpha.4. I cover the technical details, the benefits it can provide, and how best to use it.

Technical Details

A common way of using XenServer is to have an OS image, which I will call the golden image, and many clones of this image, which I will call leaf images. XenServer implements cheap clones by linking images together in the form of a tree. When a VM reads a sector of its disk, the data is retrieved from the leaf image if that sector has been written there; otherwise, the tree is traversed and the data is retrieved from a parent image (in this case, the golden image). All writes go into the leaf image. Astute readers will notice that no writes ever hit the golden image. This has an important implication and is what allows read caching to be implemented.
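The copy-on-write lookup described above can be sketched as follows. This is a simplified model for illustration only: the `Image` class and per-sector dictionary are inventions of this sketch, not tapdisk's actual VHD code.

```python
class Image:
    """A node in the image tree: a leaf image or the golden image."""
    def __init__(self, parent=None):
        self.parent = parent   # the golden image has no parent
        self.written = {}      # sector number -> data written into THIS image

    def write(self, sector, data):
        # All writes land in this image, never in a parent.
        self.written[sector] = data

    def read(self, sector):
        # Walk from this image up the tree until some ancestor has
        # data for the sector; sectors never written read as zeroes.
        node = self
        while node is not None:
            if sector in node.written:
                return node.written[sector]
            node = node.parent
        return b"\x00" * 512

golden = Image()
golden.write(0, b"boot code")
leaf_a = Image(parent=golden)
leaf_b = Image(parent=golden)
leaf_a.write(0, b"patched")   # copy-on-write: only leaf_a's view changes
```

Reads of sector 0 from `leaf_b` still come from the golden image, which never changes after cloning; that immutability is why caching it is safe.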

[Figure: tree.png, a golden image with several leaf images cloned from it, linked together in a tree.]

tapdisk is the storage component in dom0 which handles requests from VMs (see here for many more details). For safety reasons, tapdisk opens the underlying VHD files with the O_DIRECT flag. The O_DIRECT flag ensures that dom0's page cache is never used; i.e. all reads come directly from disk and all writes wait until the data has hit the disk (at least as far as the operating system can tell, the data may still be in a hardware buffer). This allows XenServer to be robust in the face of power failures or crashes. Picture a situation where a user saves a photo and the VM flushes the data to its virtual disk which tapdisk handles and writes to the physical disk. If this write goes into the page cache as a dirty page and then a power failure occurs, the contract between tapdisk and the VM is broken since data has been lost. Using the O_DIRECT flag allows this situation to be avoided and means that once tapdisk has handled a write for a VM, the data is actually on disk.

Because no data is ever written to the golden image, the safety property mentioned previously does not need to be maintained for it. For this reason, tapdisk can elide the O_DIRECT flag when opening a read-only image. This allows the operating system's page cache to be used, which can improve performance in a number of ways:

  • The number of physical disk I/O operations is reduced (as a direct consequence of using a cache).
  • Latency is improved since the data path is shorter if data does not need to be read from disk.
  • Throughput is improved since the disk bottleneck is removed.
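The flag choice this implies can be sketched in a few lines. This is a hedged illustration of the idea, not tapdisk's actual source; the function name is invented here.

```python
import os

def vhd_open_flags(read_only: bool) -> int:
    """Illustrative sketch of the open-flag choice described above."""
    if read_only:
        # Read-only (e.g. golden) image: open buffered so the dom0
        # page cache can serve repeated reads from memory.
        return os.O_RDONLY
    # Writable leaf image: O_DIRECT bypasses the page cache, so once
    # a write completes the data is actually on disk.
    return os.O_RDWR | os.O_DIRECT
```

Note that on Linux, O_DIRECT also imposes alignment requirements on buffers and offsets, which is part of why it is reserved for the component that needs the durability guarantee.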

One of our goals for this feature was that it should have no drawbacks when enabled. An effect we noticed initially was that data appeared to be read twice from disk, which increases the number of I/O operations in the case where data is only read once by the VM. After a little debugging, we found that disabling O_DIRECT causes the kernel to automatically turn on readahead. Because the data access pattern of a VM's disk tends to be quite random, this had a detrimental effect on the overall number of read operations. To fix this, we made use of a POSIX feature, posix_fadvise, which allows an application to inform the kernel how it plans to use a file. In this case, tapdisk tells the kernel that access will be random using the POSIX_FADV_RANDOM flag. The kernel responds to this by disabling readahead, and the number of read operations drops to the expected value (the same as when O_DIRECT is enabled).
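The hint itself is a single system call. Here is a minimal, self-contained sketch using Python's binding to posix_fadvise rather than tapdisk's C code; the temporary file stands in for a read-only image.

```python
import os
import tempfile

# Create a small file to stand in for a read-only VHD image.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096)
    path = f.name

# Buffered open (no O_DIRECT), so reads go through the page cache...
fd = os.open(path, os.O_RDONLY)

# ...but tell the kernel the access pattern is random, which
# suppresses readahead (the behaviour tapdisk relies on).
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)

data = os.read(fd, 512)
os.close(fd)
os.unlink(path)
```

The `(0, 0)` arguments apply the advice to the whole file; a length of zero means "to the end of the file".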

Administration

Because of difficulties maintaining cache consistency across multiple hosts in a pool for storage operations, read caching can only be used with file-based SRs; i.e. EXT and NFS SRs. For these SRs, it is enabled by default. There shouldn't be any performance problems associated with this; however, if necessary, it is possible to disable read caching for an SR:

xe sr-param-set uuid=<UUID> other-config:o_direct=true
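To re-enable read caching later, the override can be removed again. This sketch assumes the standard `xe` parameter interface and that the change takes effect when the VDIs are next attached (e.g. on VM restart); substitute your SR's UUID.

```shell
# Remove the o_direct override so the SR returns to the default
# behaviour (read caching enabled for file-based SRs):
xe sr-param-remove uuid=<UUID> param-name=other-config param-key=o_direct
```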

You may wonder how read caching differs from IntelliCache. The major difference is that IntelliCache works by caching reads from the network onto a local disk, while in-memory read caching caches reads from either the network or a local disk into memory. The advantage of in-memory read caching is that memory is still an order of magnitude faster than an SSD, so performance in bootstorms and other heavy I/O situations should be improved. It is possible for both to be enabled simultaneously; in this case, reads from the network are cached by IntelliCache to a local disk, and reads from that local disk are cached in memory by read caching. It is still advantageous to have IntelliCache turned on in this situation because the amount of available memory in dom0 may not be enough to cache the entire working set, and reading the remainder from local storage is quicker than reading over the network. IntelliCache further reduces the load on shared storage when using VMs with disks that are not persistent across reboots, by writing only to the local disk rather than to shared storage.

Talking of available memory, XenServer admins should note that to make best use of read caching, the amount of dom0 memory may need to be increased. Ideally the amount of dom0 memory would be increased to the size of the golden image so that once cached, no more reads hit the disk. In case this is not possible, an approach to take would be to temporarily increase the amount of dom0 memory to the size of the golden image, boot up a VM and open the various applications typically used, determine how much dom0 memory is still free, and then reduce dom0's memory by this amount.
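As a rough recipe, the sizing procedure above might look like the following. This is a hedged sketch: the `xen-cmdline` path and syntax are from XenServer 6.x-era installs and may differ on your version, and the memory values are purely illustrative.

```shell
# 1. Temporarily grow dom0 memory to at least the golden image size,
#    then reboot the host (values illustrative):
/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=4096M,max:4096M

# 2. Boot a VM, open the applications typically used, then check how
#    much dom0 memory remains unused:
free -m

# 3. Shrink dom0 memory by roughly the amount still free, and reboot:
/opt/xensource/libexec/xen-cmdline --set-xen dom0_mem=2048M,max:2048M
```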

Performance Evaluation

Enough talk, let's see some graphs!

[Figure: reads.png, bytes read over the network vs. number of VMs booted in parallel, with and without read caching.]

In this first graph, we look at the number of bytes read over the network when booting a number of VMs in parallel on an NFS SR. Notice how, without read caching, the number of bytes read scales proportionally with the number of VMs booted, which checks out since each VM's reads go directly to the disk. When O_DIRECT is removed, the number of bytes read remains constant regardless of the number of VMs booted in parallel. Clearly the in-memory caching is working!

[Figure: time.png, boot time per VM vs. number of VMs booted in parallel, with and without read caching.]

How does this translate to improvements in boot time? The short answer: see the graph! The longer answer is that it depends on many factors. In the graph, we can see that there is little difference in boot time when booting fewer than 4 VMs in parallel, because the NFS server is able to handle that much traffic concurrently. As the number of VMs increases, the NFS server becomes saturated and the difference in boot time becomes dramatic. It is clear that, for this setup, booting many VMs is I/O-limited, so read caching makes a big difference. Finally, you may wonder why the boot time per VM increases slowly with the number of VMs even when read caching is enabled. Since the disk is no longer a bottleneck, some other bottleneck has evidently been revealed, probably CPU contention. In other words, we have transformed an I/O-limited bootstorm into a CPU-limited one! This improvement in boot times would be particularly useful for VDI deployments, where booting many instances of the same VM is a frequent occurrence.

Conclusions

In this blog post, we've seen that in-memory read caching can improve performance in read I/O-limited situations substantially without requiring new hardware, compromising reliability, or requiring much in the way of administration.

As future work to improve in-memory read caching further, we'd like to remove the limitation that it can only use dom0's memory. Instead, we'd like to be able to use the host's entire free memory. This is far more flexible than the current implementation and would remove any need to tweak dom0's memory.

Credits

Thanks to Felipe Franciosi, Damir Derd, Thanos Makatos and Jonathan Davies for feedback and reviews.

