CPU Feature Levelling in Dundee
As anyone with more than one type of server is no doubt aware, CPU feature levelling is the method by which we try to make it safe for VMs to migrate. If each host in a pool is identical, this is easy. However, if the pool has non-identical hardware, we must hide the differences so that a migrated VM continues running without crashing.
I don't think I am going to surprise anyone by admitting that the old way XenServer did feature levelling was clunky and awkward to use. Because of a step change introduced in Intel Ivy Bridge CPUs, feature levelling also ceased to work correctly. As a result, we took the time to redesign it, and the results are available for use in Dundee beta 2.
When a VM boots, the kernel looks at the available featureset and, in general, turns on as much as it knows how to use. Linux will go so far as to binary patch some of the hotpaths for performance reasons. Userspace libraries frequently have the same algorithm compiled multiple times, and will select the best one to use at startup, based on which CPU instructions are available.
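To make the userspace case concrete, here is a minimal sketch of startup-time selection. The feature names and the two checksum functions are hypothetical stand-ins, not taken from any real library; the point is only that the choice is made once and then reused.

```python
# Hypothetical implementations, for illustration only.
def checksum_generic(data: bytes) -> int:
    """Portable fallback that needs no special CPU features."""
    return sum(data) & 0xFFFFFFFF


def checksum_wide(data: bytes) -> int:
    """Stand-in for a variant that would rely on wider vector instructions."""
    return sum(data) & 0xFFFFFFFF


# Ordered from most to least demanding; each entry lists the features it needs.
IMPLEMENTATIONS = [
    ({"sse4_2", "avx2"}, checksum_wide),
    (set(), checksum_generic),
]


def select_implementation(cpu_features):
    """Return the first implementation whose required features are all present."""
    available = set(cpu_features)
    for required, impl in IMPLEMENTATIONS:
        if required <= available:
            return impl
    raise RuntimeError("no usable implementation")


# The selection happens once at startup; later calls reuse the chosen function.
checksum = select_implementation({"sse2", "sse4_2", "avx2"})
```

Because nothing re-checks the features after startup, a feature that vanished underneath a running program would leave it calling code the CPU can no longer execute.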
On a bare metal system, the featureset will not change while the OS is running, and the same expectations exist in the virtual world. Migration introduces a problem with this expectation; it is literally like unplugging the hard drive and RAM from one computer, plugging them into another, and letting the OS continue from where it was. For the VM not to crash, all the features it is using must continue to work at the destination, or in other words, features which are in use must not disappear at any point.
Therefore, the principle of feature levelling is to calculate the common subset of features available across the pool, and restrict the VM to this featureset. That way the VM's featureset will always work, no matter which pool member it is currently running on.
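As a rough illustration of the principle, the common subset is just a set intersection over the hosts. The host names and feature names below are made up, and real featuresets are bitmaps rather than sets of strings (where the same operation would be a bitwise AND), so treat this as a sketch rather than how the toolstack does it.

```python
from functools import reduce


def pool_level(host_featuresets):
    """Return the features common to every host in the pool."""
    featuresets = [frozenset(fs) for fs in host_featuresets]
    if not featuresets:
        raise ValueError("a pool needs at least one host")
    return reduce(frozenset.intersection, featuresets)


# Hypothetical hosts with progressively fewer features.
hosts = {
    "host-a": {"sse2", "sse4_2", "avx", "avx2"},
    "host-b": {"sse2", "sse4_2", "avx"},
    "host-c": {"sse2", "sse4_2"},
}

level = pool_level(hosts.values())
```

Here `level` ends up containing only `sse2` and `sse4_2`, the features every host can offer.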
There are many factors affecting the available featureset for VMs to use:
- The CPU itself
- The BIOS/firmware settings
- The hypervisor command line parameters
- The restrictions which the toolstack chooses to apply
Hiding features is also tricky; x86 provides no architectural means to do so. Feature levelling is therefore implemented using vendor-specific extensions, which are documented as unstable interfaces and liable to change at any point in the future (as happened with Ivy Bridge).
XenServer Pre Dundee
Older versions of XenServer would require a new pool member to be identical before it would be permitted to join. Making this happen involved consulting the xe host-cpu-info features, divining some command line parameters for Xen and rebooting.
If everything went to plan, the new slave would be permitted to join the pool. Once a pool had been created, it was assumed to be homogeneous from that point on. The command line parameters had an effect on the entire host, including Xen itself. In a slightly heterogeneous case, the features that differed tended to be ones which only Xen would care to use, so Xen was needlessly penalised along with dom0.
In Dundee, we have made some changes:
There are two featuresets rather than one. PV and HVM guests are fundamentally different types of virtualisation, and come with different restrictions and abilities. HVM guests will necessarily have a larger potential featureset than PV, and having a single featureset which is safe for PV guests would apply unnecessary restrictions to HVM guests.
The featuresets are recalculated and updated every boot. Assuming that servers stay the same after initial configuration is unsafe, and incorrect.
So long as the CPU Vendor is the same (i.e. all Intel or all AMD), a pool join will be permitted to happen, irrespective of the available features. The pool featuresets are dynamically recalculated every time a slave joins or leaves a pool, and every time a pool member reconnects after reboot.
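The join rule and the recalculation can be sketched as follows. The `Host` and `Pool` classes are hypothetical and exist only for this example; they are not the toolstack's data model. The vendor strings are the usual CPUID vendor identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    vendor: str       # e.g. "GenuineIntel" or "AuthenticAMD"
    pv: frozenset     # features the host can offer PV guests
    hvm: frozenset    # features the host can offer HVM guests


class Pool:
    def __init__(self, master: Host):
        self.members = [master]

    def levels(self):
        """Recalculate the pool PV and HVM featuresets from current members."""
        pv = frozenset.intersection(*(h.pv for h in self.members))
        hvm = frozenset.intersection(*(h.hvm for h in self.members))
        return pv, hvm

    def join(self, host: Host):
        """Permit the join if the vendor matches, whatever the features are."""
        if host.vendor != self.members[0].vendor:
            raise ValueError("CPU vendor mismatch")
        self.members.append(host)
        return self.levels()


pool = Pool(Host("a", "GenuineIntel",
                 frozenset({"sse2"}), frozenset({"sse2", "avx"})))
pv_level, hvm_level = pool.join(
    Host("b", "GenuineIntel", frozenset({"sse2"}), frozenset({"sse2"})))
```

Note that the join itself never inspects features; a less capable host simply lowers the levels handed to VMs started afterwards.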
When a VM is started, it will be given the current pool featureset. This permits it to move anywhere in the pool, as the pool existed when it started. Changes in pool level have no effect on running VMs. Their featureset is fixed at boot (and stays fixed across migrate, suspend/resume, etc.), which matches the expectation of the OS. (One release note is that to update the VM featureset, the VM must be shut down fully and restarted. This is contrary to what would be expected with a plain reboot, and exists because of some metadata caching in the toolstack which is proving hard to untangle.)
Migration safety checks are performed between the VM's fixed featureset and the destination host's featureset. This way, even if the pool level drops because a less capable slave joined, an already-running VM will still be able to move anywhere except the new, less capable slave.
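In set terms, the check asks whether the VM's featureset is a subset of the destination's. A minimal sketch, with hypothetical function names and made-up feature names:

```python
def missing_features(vm_featureset, dest_featureset):
    """Features the VM was started with that the destination cannot offer."""
    return sorted(set(vm_featureset) - set(dest_featureset))


def can_migrate(vm_featureset, dest_featureset):
    """Migration is safe only if nothing the VM may be using would disappear."""
    return not missing_features(vm_featureset, dest_featureset)


# The VM started while the pool level still included avx.
vm = frozenset({"sse2", "sse4_2", "avx"})
old_host = {"sse2", "sse4_2", "avx", "avx2"}
new_slave = {"sse2", "sse4_2"}  # less capable host that joined later

ok_old = can_migrate(vm, old_host)
ok_new = can_migrate(vm, new_slave)
```

The VM can still move to `old_host`, but not to `new_slave`, because `avx` would vanish underneath it.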
The method of hiding features from VMs now has no effect on Xen and dom0. They never migrate, and will have access to all the available features (albeit with dom0 subject to being a VM in the first place).
We hope that the changes listed above will make Dundee far easier to use in a heterogeneous setup.
All of this information applies equally to inter-pool migration (Storage XenMotion) as to intra-pool migration. In the case of upgrade from older versions of XenServer (Rolling Pool Upgrade, Storage XenMotion again), there is a complication because of having to fill in some gaps in the older toolstack's idea of the VM's featureset. In such a case, an incoming VM is assumed to have the featureset of the host it lands on, rather than the pool level. This matches the older logic (and is the only safe course of action), but does result in the VM possibly having a higher featureset than the pool level, and being less mobile as a result. Once the VM has been shut down and started back up again, it will be given the regular pool level and behave normally.
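The upgrade fallback can be pictured like this. The function name and the use of `None` for "the older toolstack recorded nothing" are assumptions made for the sketch, not the toolstack's actual representation.

```python
def featureset_on_arrival(recorded, landing_host):
    """Use the VM's recorded featureset if there is one; otherwise assume
    the featureset of the host the VM lands on."""
    if recorded is None:
        return frozenset(landing_host)
    return frozenset(recorded)


pool_level = frozenset({"sse2", "sse4_2"})
landing_host = frozenset({"sse2", "sse4_2", "avx"})

# A VM arriving from an older release carries no recorded featureset.
vm = featureset_on_arrival(None, landing_host)

# Anything above the pool level restricts where the VM can move.
above_pool_level = vm - pool_level

# After a full shutdown and start, the VM gets the regular pool level.
after_restart = pool_level
```

In this example the incoming VM is assumed to use `avx`, so it stays pinned to hosts offering `avx` until it is shut down and started again.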
To summarise the two key points:
- We expect feature levelling to work between any combination of CPUs from the same vendor (with the exception of PV guests on pre-Nehalem CPUs, which lacked any masking capabilities whatsoever).
- We expect there to be no need for any manual feature configuration to join a pool together.
That being said, this is a beta release and I highly encourage you to try it out and comment/report back.