Saturday, July 2, 2016

XenServer 7.0 performance improvements part 2: Parallelised networking datapath [feedly]



----
XenServer 7.0 performance improvements part 2: Parallelised networking datapath
// Latest blog entries

The XenServer team has made a number of significant performance and scalability improvements in the XenServer 7.0 release. This is the second in a series of articles that will describe the principal improvements. For the first, see http://xenserver.org/blog/entry/dundee-tapdisk3-polling.html.

The topic of this post is network I/O performance. XenServer 7.0 achieves significant performance improvements through the support for multi-queue paravirtualised network interfaces. Measurements of one particular use-case show an improvement from 17 Gb/s to 41 Gb/s.

A bit of background about the PV network datapath

In order to perform network-based communications, a VM employs a paravirtualised network driver (netfront in Linux or xennet in Windows) in conjunction with netback in the control domain, dom0.

a1sx2_Original2_single-queue.png

To the guest OS, the netfront driver feels just like a physical network device. When a guest wants to transmit data:

  • Netfront puts references to the page(s) containing that data into a "Transmit" ring buffer it shares with dom0.
  • Netback in dom0 picks up these references and maps the actual data from the guest's memory so it appears in dom0's address space.
  • Netback then hands the packet to the dom0 kernel, which uses normal routing rules to determine that it should go to an Open vSwitch device and then on to either a physical interface or the netback device for another guest on the same host.

When dom0 has a network packet it needs to send to the guest, the reverse procedure applies, using a separate "Receive" ring.

Amongst the factors that can limit network throughput are:

  1. the ring becoming full, causing netfront to have to wait before more data can be sent, and
  2. the netback process fully consuming an entire dom0 vCPU, meaning it cannot go any faster.

Multi-queue alleviates both of these potential bottlenecks.

What is multi-queue?

Rather than having a single Transmit and Receive ring per virtual interface (VIF), multi-queue means having multiple Transmit and Receive rings per VIF, and one netback thread for each:

a1sx2_Original1_multi-queue.png

Now, each TCP stream has the opportunity to be driven through a different Transmit or Receive ring. The particular ring chosen for each stream is determined by a hash of the TCP header (MAC, IP and port number of both the source and destination).

Crucially, this means that separate netback threads can work on each TCP stream in parallel. So where we were previously limited by the capacity of a single dom0 vCPU to process packets, now we can exploit several dom0 vCPUs. And where the capacity of a single Transmit ring limited the total amount of data in-flight, the system can now support a larger amount.

Which use-cases can take advantage of multi-queue?

Anything involving multiple TCP streams. For example, any kind of server VM that handles connections from more than one client at the same time.

Which guests can use multi-queue?

Since frontend changes are needed, the version of the guest's netfront driver matters. Although dom0 is geared up to support multi-queue, guests with old versions of netfront that lack multi-queue support are limited to single Transmit and Receive rings.

  • For Windows, the XenServer 7.0 xennet PV driver supports multi-queue.
  • For Linux, multi-queue support was added in Linux 3.16. This means that Debian Jessie 8.0 and Ubuntu 14.10 (or later) support multi-queue with their stock kernels. Over time, more and more distributions will pick up the relevant netfront changes.

How does the throughput scale with an increasing number of rings?

The following graph shows some measurements I made using iperf 2.0.5 between a pair of Debian 8.0 VMs both on a Dell R730xd host. The VMs each had 8 vCPUs, and iperf employed 8 threads each generating a separate TCP stream. The graph reports the sum of the 8 threads' throughputs, varying the number of queues configured on the guests' VIFs.

5104.png

We can make several observations from this graph:

  • The throughput scales well up to four queues, with four queues achieving more than double the throughput possible with a single queue.
  • The blip at five queues probably arose when the hashing algorithm failed to spread the eight TCP streams evenly across the queues, and is thus a measurement artefact. With different TCP port numbers, this may not have happened.
  • While the throughput generally increases with an increasing number of queues, the throughput is not proportional to the number of rings. Ideally, the throughput would double when you double the number of rings. This doesn't happen in practice because the processing is not perfectly parallelisable: netfront needs to demultiplex the streams onto the rings, and there are some overheads due to locking and synchronisation between queues.

This graph also highlights the substantial improvement over XenServer 6.5, in which only one queue per VIF was supported. In this use-case of eight TCP streams, XenServer 7.0 achieves 41 Gb/s out-of-the-box where XenServer 6.5 could manage only 17 Gb/s – an improvement of 140%.

How many rings do I get by default?

By default the number of queues is limited by (a) the number of vCPUs the guest has and (b) the number of vCPUs dom0 has. A guest with four vCPUs will get four queues per VIF.

This is a sensible default, but if you want to manually override it, you can do so in the guest. In a Linux guest, add the parameter xen_netfront.max_queues=n, for some n, to the kernel command-line.


Read More
----

Shared via my feedly newsfeed


Sent from my iPhone