Cloudy Journey: From the Field: XenServer 7.X Pool Upgrades, Part 2: Updates and Aftermath

The XenServer upgrade process with XenServer 7 can be a bit more involved than with past upgrades, particularly if upgrading from a version prior to 7.0. In this series I'll discuss various approaches to upgrading XS pools to 7.X and some of the issues you might encounter during the process based on my experiences in the field.

Part 1 covered the preliminary considerations and the planning process, as well as the case for a clean installation. You can go back and review it here. Part 2 deals with the alternative, namely a true upgrade procedure, and the issues that may arise during and after it.

In-Place Upgrades to XS 7

An in-place upgrade to XS 7.X is a whole different beast compared to an installation where the OS is overwritten from scratch. Mostly, it depends on whether you wish to retain the original partition scheme or go with the newer, larger, and more flexible disk partition layout. The bottom line is that, despite the added work and potential issues, you may as well go through with the switch to the new layout, as it will make life easier down the road.

Your choices here are to retain the current disk partitioning layout or to switch to the newer partition layout. If you want to know why the new XS disk layout changed, review the "Why was the XenServer 7 Disk Layout Changed?" section in Part 1 of this blog series.

In my experience, issues found with in-place upgrades cover five areas – I'm going to cover them as follows:

Pre-XS 7 Upgrade, Staying the Course with the Original Partition Layout
The 72% Solution
The (In)famous Dell Utility Partition
XS 7.X to XS 7.X Maintaining the Current Partition Layout
XS 6.X or 7.X to XS 7.X With a Change to the New Partition Layout (plus possible issues on Dell Servers with iDRAC modules)

Pre-XS 7 Upgrade, Staying the Course with the Original Partition Layout

If you choose to stick with the original partition layout, the rest of your installation experience – whether you go with the rolling pool upgrade, or conventional upgrade – will be pretty much the same as before.

As with any such upgrade, make sure to check ahead of time that the pool is ready for this undertaking:

XenCenter has been upgraded.
Proper backups of VMs, metadata, etc. have been performed.
All VMs are agile or, in the case of a rolling pool upgrade, any VMs on local SRs are at least shut down.
All hosts are not only running the same version of XenServer, but also have the identical list of hotfixes applied.

Depending on the various ages of the hosts and how hotfixes were applied, you could run into a discrepancy. This can be particularly frustrating if an older hotfix has since been superseded and the older version is no longer available, yet still shows up in the applied hotfix list. Though I'd only recommend this in dire circumstances (such as this, where the XenServer version is going to be replaced by a new one anyway), there are ways to make XenServers believe they have had the same specific patches applied by manipulating the contents of the /var/update/applied directory (see, for example, this forum thread).

One thing you might run into is the following issue, covered in the next section, which incidentally appears to be independent of whichever partition layout you use.

The 72% Solution

This is not so much a solution as a warning about what you may encounter, what to do beforehand to reduce the chance of hitting it, and what to do afterwards should you run into it.

This issue apparently crops up frequently during the latter part of the installation process and is characterized by a long wait with a screen showing 72% completion of the installation that can last anywhere from about five minutes to over an hour in extreme cases.

One apparent cause of this is an exceedingly large number of message files on your current version of XS. If you have the chance, it is worth taking the time to examine the area where these are stored, under /var/lib/xcp/blobs/, and manually clean up any ancient entries. Most of these reside under the "messages" and "refs" subdirectories; pay attention also to the symbolic links.
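As a rough aid for that inspection, here is a minimal sketch that only lists candidates and deletes nothing. The helper names and the 180-day threshold are my own assumptions, not anything prescribed by XenServer, and the sketch relies on GNU find being available in dom0.

```shell
#!/bin/bash
# Hypothetical helpers: they only list candidates and never delete anything.

# List regular files under a directory that are older than N days.
# The default of 180 days is an arbitrary assumption; pick your own cutoff.
list_old_blobs() {
    local dir="$1"
    local days="${2:-180}"
    find "$dir" -type f -mtime +"$days" -print
}

# List symbolic links whose targets no longer exist (GNU find's -xtype l).
list_dangling_links() {
    local dir="$1"
    find "$dir" -xtype l -print
}

# Example usage (commented out so nothing runs by accident):
# list_old_blobs /var/lib/xcp/blobs/messages 180 | wc -l
# list_dangling_links /var/lib/xcp/blobs/refs | wc -l
```

Reviewing the counts first gives an idea of whether the message store is large enough to be worth cleaning before the upgrade.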

If you do get stuck in the midst of a long-lasting installation you can escape out by pressing ALT+F2 to get into the shell and check the installer logs under /tmp/install-log to see if it has thrown any errors (thanks, Nekta -- @nektasingh -- for that tip!). If all looks OK, continue to wait until the process completes.
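Once in that shell, something along these lines can save scrolling through the whole log. It is a sketch only: the function name and the search terms are my own guesses at what a failure might look like, and the installer environment may offer only a reduced set of tools.

```shell
#!/bin/sh
# Hypothetical helper: show the most recent suspicious lines in the installer log.
# The keyword list is an assumption, not an official list of installer messages.
scan_install_log() {
    log="${1:-/tmp/install-log}"
    # -i: ignore case, -n: prefix each match with its line number
    grep -inE 'error|fail|traceback|exception' "$log" | tail -n 20
}

# Example usage (commented out):
# scan_install_log /tmp/install-log
```

No output suggests the installer is simply busy, in which case waiting remains the right call.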

The (In)famous Dell Utility Partition

Using the rolling pool upgrade for the first time, I ran into a terrible situation in which the upgrade proceeded all the way to the very end and, just as the final reboot was about to happen, this error popped up on the console:

An unrecoverable error has occurred. The error was:

Failed to install bootloader: installing for i386-pc platform.

/usr/sbin/grub-install: warning: your embedding area is unusually small. core.img won't fit in it..

/usr/sbin/grub-install: warning: Embedding is not possible. GRUB can only be installed in this setup by using blocklists. However blocklists are UNRELIABLE and their use is discouraged..

/usr/sbin/grub-install: error: will not proceed with blocklists.

You can imagine the reaction.

What caused this and what can you do about it? After all, this was something I'd never encountered before in any previous XS upgrades.

It turns out that the reason for this is the Dell utility partition (see, for example, this link). The Dell utility partition is a five-block partition that Dell puts on for its own purposes at the beginning of the disk on many of its servers as they ship. This did not interfere with any installs up to and including XS 6.5 SP1, hence this came to me as a total surprise when I first encountered it during an XS 7 upgrade.

While this wasn't an issue initially, nor in one particular upgrade that for whatever reason managed to squeak by without any errors being reported whatsoever, under most circumstances it is too small to hold the initial configuration needed to do the installation when installing XS 7.

What's bad is that the XenServer installation apparently doesn't perform any pre-checks to see if that first partition on the disk is big enough. This is the case for both UEFI and Legacy/BIOS boots.

The solution was simply to delete that sda1 partition altogether using fdisk and re-install. Deleting the partition can be performed live on the host prior to the installation process.

You can then successfully bypass this issue. I have performed this surgical procedure on a number of hosts with success and have not experienced any adverse effects; you do not even require a vorpal sword to accomplish this task.
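Before deleting anything, it is worth confirming that the first partition really is the Dell utility partition. The sketch below assumes the classic MBR-style listing from fdisk -l, in which that partition shows up with the system name "Dell Utility", and a plain /dev/sdX device name. The function name is hypothetical, and other controllers or GPT disks may print something different.

```shell
#!/bin/sh
# Hypothetical check: reads `fdisk -l` style text on stdin and succeeds only if
# partition 1 of a /dev/sdX-style disk is reported as "Dell Utility".
has_dell_utility_first_partition() {
    awk '$1 ~ /^\/dev\/[a-z]+1$/ && /Dell Utility/ { found = 1 }
         END { exit found ? 0 : 1 }'
}

# Example usage (commented out; /dev/sda is an assumption about your system disk):
# fdisk -l /dev/sda | has_dell_utility_first_partition \
#     && echo "sda1 looks like a Dell utility partition"
#
# The deletion itself is interactive in fdisk: d (delete), the partition
# number when prompted, then w (write). Double-check the device name first.
```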

Possibly, future upgrade processes will be more accommodating and perform a pre-check for this, along with checking for other potential inconsistencies such as the lack of VM agility or non-uniformity of applied hotfixes.

XS 6.X or 7.X to XS 7.X With a Change to the New Partition Layout

This is the trickiest area where most, including me, seem to encounter issues.

One point that should be clarified right away is:

Any local partition on or partly contained on the system disk to be upgraded is going to get destroyed in the process.

There are no two ways about this. Hence, plan ahead: either

Storage XenMotion any local VMs to pooled storage, or
Export them.

If you want or need to preserve your existing local storage SRs, you'll have to stay with the original partition scheme.

Should you decide to update to the new partition scheme, you will need to perform the following action before the upgrade:

# touch /var/preserve/safe2upgrade

The upgrade process (to preserve the pool) will need to be performed using the rolling pool upgrade. The caveat is that if anything goes wrong during any part of the upgrade process, you will have to exit and try to start over. The end point can vary from it being simple to resume from where you left off, to having things in a broken state, depending on the circumstances. The pre-checks performed are supposed to catch issues beforehand and allow you to take care of them before launching into the upgrade process, but these do not always trap everything! Above all, make sure that:

All VMs are agile, i.e. that there are no VMs running or even resident on any local storage
High Availability (HA) and Workload Balancing (WLB) are disabled
You have created metadata backups to at least one pooled storage device or exported it to an external location
You have plenty of space to hold VMs on whatever hosts remain in your pool, figuring you'll have one out at any given point and that initially, only the master will be potentially able to take on VMs from whichever host is in the process of being upgraded
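Some of the checks above can be scripted from the pool master. The following is a minimal sketch rather than a replacement for the built-in pre-checks: the function names are mine, it assumes local SRs are of type lvm or ext, and it only looks at whether local SRs still hold virtual disks.

```shell
#!/bin/bash
# Hypothetical pre-flight helpers built on the xe CLI.

# Print the value of a pool parameter such as ha-enabled or wlb-enabled.
pool_flag() {
    xe pool-list params="$1" --minimal
}

# Print the UUIDs of non-shared SRs, one per line.
# Assumption: local storage is of type lvm or ext.
local_sr_uuids() {
    local t
    for t in lvm ext; do
        xe sr-list shared=false type="$t" --minimal
    done | tr ',' '\n' | sed '/^$/d'
}

# Report anything that still needs attention; succeeds only if nothing was found.
preflight() {
    local problems=0 sr
    [ "$(pool_flag ha-enabled)" = "false" ]  || { echo "HA is still enabled"; problems=$((problems + 1)); }
    [ "$(pool_flag wlb-enabled)" = "false" ] || { echo "WLB is still enabled"; problems=$((problems + 1)); }
    for sr in $(local_sr_uuids); do
        if [ -n "$(xe vdi-list sr-uuid="$sr" --minimal)" ]; then
            echo "Local SR $sr still holds virtual disks"
            problems=$((problems + 1))
        fi
    done
    [ "$problems" -eq 0 ]
}

# Example usage (commented out):
# preflight && echo "No obvious blockers found"
```

A clean result here does not prove the pool is ready; it just removes a few of the more common surprises.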

Preferably also have recent backups of all your VMs. It's generally also a good idea to keep good notes about your various host configurations, including all physical and virtual networks, VLANs, iSCSI and NFS connections, and the like. I'd recommend doing an "export resource data" for the pool, which has been available as a utility since XS 6.5 and can be run from XenCenter.

To export resource data:

In the XenCenter Navigation pane, click Infrastructure and then click on the pool.
From the XenCenter menu, click Pool and then select Export Resource Data.
Browse to a location where you would like to save the report and then click Save.

You can choose between XLS or CSV output. Note that this feature is only available in XenCenter with a paid-for licensed version of XenServer. However, it can also be run via the CLI for any (including free) version of XS 6.5 or newer using:

# xe pool-dump-database file-name=target-output-file-name
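To avoid overwriting an earlier dump, the command above can be wrapped so that each file gets a timestamped name. This is a small sketch: the function name, the file naming scheme, and the destination directory are my own choices.

```shell
#!/bin/bash
# Hypothetical wrapper around xe pool-dump-database with a timestamped file name.
backup_pool_db() {
    local dest_dir="$1"
    local target="$dest_dir/pool-db-$(date +%Y%m%d-%H%M%S).dump"
    # Print the path only if the dump itself succeeded.
    xe pool-dump-database file-name="$target" && echo "$target"
}

# Example usage (commented out; /mnt/backup is an assumed mount point):
# backup_pool_db /mnt/backup
```

Writing the dump to storage outside the pool keeps it reachable even if the pool itself ends up in a broken state.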

Being prepared now for the upgrade, you may still run into issues along the way, in particular if you run out of space to move VMs to a different server or if the upgrade process hangs. If the master server manages to make it through the upgrade and the upgrade process fails at some point later, one of the first consequences will be that you will no longer be able to add any external hosts to the pool because the pool will be in a mixed state of hosts running different XS versions. This is not a good situation and makes it very difficult under some circumstances to recover from.

Recommended Process for Changing XenServer Partition Layout

Through various discussions on the XenServer discussion forum as well as some of my own experimentation, this is what I recommend as the most solid and reliable way of doing an upgrade to XS 7.X with the new partition layout coming from a version that still has the old layout. It will take quite a bit longer, but is more reliable.

In addition to all the preparatory steps listed above:

You will need to eject all hosts from the pool at one point or another, so make sure you have carefully recorded the various network and other settings, in particular those that are not retained as such in the metadata exports, such as the individual iSCSI network connections or NFS connections.
Copy your /etc/fstab file and also be sure to check for any other customizations you may have done, such as cron jobs or additions to /etc/rc.local (and also note that rc.local does not run on its own under XS 7.X, so you will need to manually enable it to do so – see the section "Enabling rc.local to Run On Its Own" below).
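One way to capture most of this in a single pass is sketched below. It is a minimal example under assumptions: the function name, output file names, and the list of files to copy are mine, and anything else you have customized would need adding by hand.

```shell
#!/bin/bash
# Hypothetical helper: save host-specific settings into a directory for later reference.
save_host_config() {
    local out="$1" f
    mkdir -p "$out" || return 1

    # Copy local customization files if they exist (extend this list as needed).
    for f in /etc/fstab /etc/rc.d/rc.local /etc/crontab; do
        if [ -e "$f" ]; then
            cp -p "$f" "$out/" || echo "could not copy $f" >&2
        fi
    done

    # Root's crontab, if one exists.
    crontab -l > "$out/root-crontab.txt" 2>/dev/null || true

    # Network and storage details as seen by xapi.
    xe pif-list params=all     > "$out/pifs.txt"
    xe network-list params=all > "$out/networks.txt"
    xe sr-list params=all      > "$out/srs.txt"
    xe pbd-list params=all     > "$out/pbds.txt"
}

# Example usage (commented out; copy the result off the host afterwards):
# save_host_config "/root/config-$(hostname)"
```

The PBD listing is the part most easily forgotten, since it holds the per-host device configuration for iSCSI and NFS connections.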

Once your enhanced preparation is complete:

Start with your pool master and do a rolling pool upgrade to it. Do not attempt to upgrade to the new partition layout! After it reboots, it should still have retained the role of pool master. Note that the rolling pool upgrade will not allow you to pick which host it will upgrade next beyond the master so be certain that any of these hosts can have all its VMs fit on the pool master. If necessary move some or all the VMs on the pool master onto other hosts within the pool before you commence the upgrade procedure.
Pick another host and migrate all its VMs to the pool master.
Follow this procedure for all hosts until the pool is completely upgraded to the identical version of XS 7.X on all pool members. You can either continue to use the rolling pool upgrade or if desired, switch to manual upgrade mode.
Making sure you've carefully recorded all important storage and network information on that host, migrate all VMs off a non-master host and eject it from the pool.
Touch the file /var/preserve/safe2upgrade and plan on the local SR being wiped and re-created. Then shut down the host and perform a standalone installation to that just-ejected host. It will have the new partition layout.
Reduce the network settings on this host to just a single primary management interface NIC, one that matches the same one it had initially. Rejoin this just-upgraded host to the pool. Many of the pool metadata settings should automatically get recreated, in particular any of the pooled resources. Update any host-specific network settings as well as other customizations.
Continue this process for the remainder of the non-master hosts.
When these have all been completed, designate a new pool master using the command "xe pool-designate-new-master host-uuid=new-master-uuid". Make sure the transition to the new pool master is properly completed.
Eject what was the original pool master and perform the standalone upgrade and rejoin the host to the pool as before.
Celebrate your hard work!
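Because several of these steps are destructive, I find it safer to print the commands for a given host as a checklist and run them one at a time rather than script them end to end. The sketch below does only that: the function names are hypothetical, the password is left as a placeholder, and the ordering simply mirrors the steps above.

```shell
#!/bin/bash
# Hypothetical dry-run helpers: they print commands and execute nothing.

# Print the per-host sequence for a non-master host.
print_rebuild_plan() {
    local host_uuid="$1" master_addr="$2"
    cat <<EOF
# On the pool master: move VMs off the host and eject it
xe host-disable uuid=$host_uuid
xe host-evacuate uuid=$host_uuid
xe pool-eject host-uuid=$host_uuid
# On the ejected host: allow the new partition layout, then reinstall
touch /var/preserve/safe2upgrade
# On the freshly installed host: rejoin the pool
xe pool-join master-address=$master_addr master-username=root master-password=<password>
EOF
}

# Print the command that hands the master role to another host.
print_master_handover() {
    echo "xe pool-designate-new-master host-uuid=$1"
}

# Example usage (commented out; the address is a placeholder):
# print_rebuild_plan "$(xe host-list name-label=myhost --minimal)" 192.0.2.10
```

Note that xe pool-eject may ask for confirmation and reinitializes the host's local storage, so treat that line as the point of no return.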

While this process will easily take twice as long as a standard upgrade, at least you know that things should not break so badly that you may have to spend hours of additional time making things right. As a side benefit (if you want to consider it as such), it will also force you to take stock of how your pool is configured and require you to take careful inventory. In the event of a disaster, you will be grateful to have gone through this process, as it may be very close to what you would have to go through under less controlled circumstances! I have done this myself several times and had it work correctly each time.

An Additional Item: If the Console Shows No Network Configured on Dell Servers with iDRAC modules

This condition can show up unexpectedly, and while something like this can normally be handled with an "xe pool-recover-slaves" command if the host is found to be in emergency mode, that's not always the case. Even more oddly, if you instead ssh in and run xsconsole from the CLI, all looks perfectly normal, including all the network settings, which appear present and correct and also match the settings visible in XenCenter. This condition, as far as I know, seems unique to Dell servers and was seen with iDRAC 7, 8 and 9 modules.

The issue here appears to have been a change in behavior that kicked in starting with XenServer 7.1 and hence may not even be evident in XS 7.0.

The fix turned out to be to upgrade all the BIOS/firmware and iDRAC configurations. In this particular case, I made use of the method I described in this blog entry and that took care of it. Note that this still does not seem to consistently address this issue, in particular with some older hardware (e.g., R715 with an iDRAC 6, even after updating to DSU 1.4.2, BIOS 3.2.1, but being stuck at iDRAC version 1.97, which is apparently not upgradeable).

Enabling rc.local To Run On Its Own

The rc.local file is not automatically executed by default under RHEL/CentOS 7 installations, and since XS 7 is based on CentOS 7, it is no exception. If you wish to customize your environment and ensure this file is executed at boot time, you will have to manually enable /etc/rc.local to run automatically on reboot on XenServer 7.X by running these two commands:

# chmod u+x /etc/rc.d/rc.local
# systemctl start rc-local

You can verify that rc.local is now running with the following command, which in turn should produce output similar to what is shown below:

# systemctl status rc-local

   rc-local.service - /etc/rc.d/rc.local Compatibility
   Loaded: loaded (/usr/lib/systemd/system/rc-local.service; static; vendor preset: disabled)
   Active: active (exited) since Sun 2017-06-18 09:09:50 MST; 1 weeks 6 days ago
   Process: 4649 ExecStart=/etc/rc.d/rc.local start (code=exited, status=0/SUCCESS)
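If you rebuild hosts often, the same verification can be reduced to a quick check that is easy to drop into a post-install script. This is a minimal sketch with a function name of my own choosing; it only tests the execute bit, which is what the chmod above sets.

```shell
#!/bin/sh
# Hypothetical check: succeeds only if the rc.local script exists and is executable.
rc_local_is_executable() {
    f="${1:-/etc/rc.d/rc.local}"
    [ -x "$f" ]
}

# Example usage (commented out):
# rc_local_is_executable || echo "rc.local is not executable; run chmod u+x first"
# systemctl is-active rc-local
```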

A Final Word

Once again, I will reiterate that feedback is always appreciated and the XenServer forum is one good option. Errors, on the other hand, should be reported to the XenServer bugs site with as much information as possible to help the engineers understand and reproduce the issues.

I sincerely hope some of the information in this series will have been useful to some degree.

I would like to most gratefully acknowledge Andrew Wood, fellow Citrix Technology Professional, for the review of and constructive additions to this blog.

Wednesday, August 2, 2017
