Monday, April 20, 2015

XenServer 6.5 and Asymmetric Logical Unit Access (ALUA) for iSCSI Devices



There are a number of ways to connect storage devices to XenServer hosts and pools, including local storage, HBA SAS and Fibre Channel, NFS and iSCSI. With iSCSI, there are a number of implementation variations, including support for multipathing in both active/active and active/passive configurations, plus support for so-called "jumbo frames", where the MTU is increased from 1500 to typically 9000 to optimize frame transmissions. One of the lesser-known and somewhat esoteric options available on many modern iSCSI-based storage devices is Asymmetric Logical Unit Access (ALUA), a protocol that has been around for a decade and that can be used not only with iSCSI but also with Fibre Channel storage. The purpose of this article is to clarify and outline how ALUA can now be used more flexibly with iSCSI on XenServer 6.5.


ALUA support on XenServer goes back to XenServer 5.6, initially only with Fibre Channel devices. Support for iSCSI ALUA connectivity started with XenServer 6.0 and was initially limited to specific ALUA-capable devices, which included the EMC Clariion and NetApp FAS as well as the EMC VMAX and VNX series. Each device required specific multipath.conf file configurations to integrate properly with the server used to access it, XenServer being no exception. The upstream XenServer code also required customizations. The Citrix XenServer 6.5 Release Notes (March 2014, updated March 2015) currently only discuss ALUA support through XenServer 6.2 and only for specific devices, stating: "Most significant is the usability enhancement for ALUA; for EMC™ VNX™ and NetApp™ FAS™, XenServer will automatically configure for ALUA if an ALUA-capable LUN is attached" (CTX132976).

It was announced in the XenServer 6.5 release notes that XenServer will automatically connect to one of the documented devices mentioned above and that it now runs the updated device mapper multipath (DMMP) version 0.4.9-72. This rekindled my interest in ALUA connectivity, and after some research and discussions with Citrix and Dell about support, it appeared this might now be possible specifically for the Dell MD3600i units we have used on XenServer pools for some time. What is not stated in the release notes is that XenServer 6.5 can now connect generically to a large number of ALUA-capable storage arrays. This is covered in more detail later. It is also of note that MPP-RDAC support is no longer available in XenServer 6.5 and that DMMP is the only multipath mechanism supported. This was in part because of support and vendor-specific issues (see, for example, the XenServer 6.5 Release Notes or this document from Dell, Inc.).

But first, how are ALUA connections even established? And perhaps of greater interest, what are the benefits of ALUA in the first place?


As the name suggests, ALUA is intended to optimize storage traffic by making use of optimized paths. With multipathing and multiple controllers, there are a number of paths a packet can take to reach its destination. With two controllers on a storage array and two NICs dedicated to iSCSI traffic on a host, there are four possible paths to a storage Logical Unit Number (LUN). On the XenServer side, LUNs are then associated with storage repositories (SRs). ALUA recognizes that once an initial path is established to a LUN, any multipathing activity destined for that same LUN is better served if routed through the same storage array controller. It attempts to do so as much as possible, unless of course a failure forces the connection to take an alternative path. ALUA connections fall into five self-explanatory categories (listed along with their associated hex codes):

  • Active/Optimized : 0x0
  • Active/Non-Optimized : 0x1
  • Standby : 0x2
  • Unavailable : 0x3
  • Transitioning : 0xf
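
The mapping between the codes and the categories listed above can be written down as a small lookup table. This is only an illustrative sketch: the function name is hypothetical, and masking to the low four bits assumes the state is carried in the lower nibble of the reported byte.

```python
# Asymmetric access state codes as listed above.
ALUA_STATES = {
    0x0: "active/optimized",
    0x1: "active/non-optimized",
    0x2: "standby",
    0x3: "unavailable",
    0xF: "transitioning",
}


def describe_alua_state(code: int) -> str:
    """Return a readable name for an asymmetric access state code.

    Assumption: only the low four bits carry the state, so the value
    is masked before the lookup. Unlisted codes are reported as unknown.
    """
    return ALUA_STATES.get(code & 0x0F, "reserved/unknown")


if __name__ == "__main__":
    for code in (0x0, 0x1, 0x2, 0x3, 0xF):
        print(hex(code), describe_alua_state(code))
```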

For ALUA to work, an active/active storage path is required and, furthermore, an asymmetrical active/active mechanism must be involved. The advantage of ALUA comes from less fragmentation of packet traffic: wherever possible, both paths of the multipath connection are routed via the same storage array controller, since the extra path through a different controller is less efficient. It is very difficult to locate specific metrics on the overall gains, but hints of up to 20% can be found in on-line articles (e.g., this openBench Labs report on Nexsan). This is not an insignificant amount, and it is potentially more significant than the gains reached by implementing jumbo frames. It should be noted that the debate continues to this day regarding the benefits of jumbo frames and to what degree, if any, they are beneficial. Among numerous articles to be found are: The Great Jumbo Frames Debate from Michael Webster, Jumbo Frames or Not - Purdue University Research, Jumbo Frames Comparison Testing, and MTU Issues from ESNet. Each installation environment will have its idiosyncrasies and it is best to conduct tests within one's unique configuration to evaluate such options.

Within the SCSI architecture, the SCSI Primary Commands specification (SPC-3) defines the commands used to determine paths. The mechanism by which this is accomplished is target port group support (TPGS). The characteristics of a path can be read via an RTPG (REPORT TARGET PORT GROUPS) command or set with an STPG (SET TARGET PORT GROUPS) command. With ALUA, non-preferred controller paths are used only for fail-over purposes. This is illustrated in Figure 1, where an optimized network connection is shown in red, taking advantage of routing all the storage network traffic via Node A (e.g., storage controller module 0) to LUN A (e.g., 2).



Figure 1.  ALUA connections, with the active/optimized paths to Node A shown as red lines and the active/non-optimized paths shown as dotted black lines.


Various SPC commands are provided as utilities within the sg3_utils (SCSI generic) Linux package.

There are other ways to make such queries. For example, VMware has an "esxcli nmp device list" command, and NetApp appliances support "igroup" commands that provide direct information about ALUA-related connections.

Let us first examine a generic Linux server with ALUA support connected to an ALUA-capable device. In general, note that this will entail specific configuration of the /etc/multipath.conf file, and typical entries, especially for some older arrays or XenServer versions, will use one or more explicit configuration parameters such as:

  • hardware_handler "1 alua"
  • prio "alua"
  • path_checker "alua"

Consulting the Citrix knowledge base article CTX132976, we see for example the EMC Corporation DGC Clariion device makes use of an entry configured as:

                vendor "DGC"
                product "*"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/sbin/mpath_prio_emc /dev/%n"
                hardware_handler "1 alua"
                no_path_retry 300
                path_checker emc_clariion
                failback immediate

To investigate the multipath configuration in more detail, we can make use of the TPGS setting. The TPGS setting can be read using the sg_rtpg command. By using multiple "v" flags to increase verbosity and "d" to specify the decoding of the status code descriptor returned for the asymmetric access state, we might see something like the following for one of the paths:

# sg_rtpg -vvd /dev/sde
open /dev/sde with flags=0x802
    report target port groups cdb: a3 0a 00 00 00 00 00 00 04 00 00 00
    report target port group: requested 1024 bytes but got 116 bytes
Report list length = 116
Report target port groups:
  target port group id : 0x1 , Pref=0
    target port group asymmetric access state : 0x01 (active/non optimized)
    T_SUP : 0, O_SUP : 0, U_SUP : 1, S_SUP : 0, AN_SUP : 1, AO_SUP : 1
    status code : 0x01 (target port asym. state changed by SET TARGET PORT GROUPS command)
    vendor unique status : 0x00
    target port count : 02
    Relative target port ids:

In the output above, we see that target port group 1 is in the active/non-optimized ALUA state, as reported on the "target port group asymmetric access state" line, while the "status code" line shows that this state was set by a SET TARGET PORT GROUPS command. We also see there are two paths identified, with target port IDs 1,1 and 1,2.
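
When several devices need checking, the relevant fields can be pulled out of saved sg_rtpg output with a short script. The following is a minimal sketch, not part of sg3_utils: the function name and regular expressions are my own, and they assume the line layout shown in the sample above, which may differ between sg3_utils versions.

```python
import re

# A few lines copied from the sg_rtpg output shown above.
SAMPLE = """\
  target port group id : 0x1 , Pref=0
    target port group asymmetric access state : 0x01 (active/non optimized)
    status code : 0x01 (target port asym. state changed by SET TARGET PORT GROUPS command)
    target port count : 02
"""

GROUP_RE = re.compile(r"target port group id\s*:\s*(0x[0-9a-fA-F]+)\s*,\s*Pref=(\d)")
STATE_RE = re.compile(r"asymmetric access state\s*:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_rtpg(text):
    """Collect the id, Pref bit and access state of each target port group."""
    groups = []
    for line in text.splitlines():
        match = GROUP_RE.search(line)
        if match:
            groups.append({
                "id": int(match.group(1), 16),
                "pref": int(match.group(2)),
                "state": None,
                "state_name": None,
            })
            continue
        match = STATE_RE.search(line)
        if match and groups:
            # The state line belongs to the most recently seen group.
            groups[-1]["state"] = int(match.group(1), 16)
            groups[-1]["state_name"] = match.group(2)
    return groups


if __name__ == "__main__":
    for group in parse_rtpg(SAMPLE):
        print(group)
```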

There are a slew of additional "sg" commands, such as sg_rdac and sg_inq, the latter often used with the flag "-p 0x83" to get the VPD (vital product data) page of interest. The sg_inq command will in general return TPGS > 0 for devices that support ALUA. More on that will be discussed later in this article. One additional command is of particular interest, because not all storage arrays in fact support target port group queries (more on this important point later as well): sg_vpd (the sg vital product data fetcher), which does not require TPG access. The base syntax of interest here is:

sg_vpd -p 0xc9 --hex /dev/…

Here "/dev/…" should be the full path to the device in question. Looking at an example of the output from such a device, we get:

# sg_vpd -p 0xc9 --hex /dev/mapper/mpathb1
Volume access control (RDAC) VPD Page:
00     00 c9 00 2c 76 61 63 31  f1 01 00 01 01 01 00 00    ...,vac1........
10     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................
20     00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00    ................

If one reads the source code for the various device handlers (see the multipath tools hardware table for an extensive list of hardware profiles, as well as the Linux SCSI device handler for how the data are interpreted), one can determine that the value of interest here is avte_cvp, part of the RDAC c9_inquiry structure. It follows the four-byte page header and the four-byte page identifier ("vac1" in the output above), which makes it the ninth byte of the page. This byte indicates whether the connected device is using ALUA, known in the RDAC world as IOSHIP mode (shift right five bits, then logical AND with 0x1), Automatic Volume Transfer (AVT) mode (shift right seven bits, then logical AND with 0x1), or otherwise basic RDAC (legacy) mode. In the case above the byte is "f1", so (0xf1 >> 5) & 0x1 is equal to 1, and hence the above connection is indeed an ALUA RDAC-based connection.
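
The bit tests just described can be sketched in a few lines. This is an illustration only: the function name is hypothetical, and the order of the checks (IOSHIP first, then AVT, then legacy RDAC) is an assumption based on my reading of the Linux rdac device handler.

```python
def rdac_mode(avte_cvp: int) -> str:
    """Classify an RDAC connection from the avte_cvp byte of VPD page 0xc9.

    Assumption: the IOSHIP bit is tested before the AVT bit, so a byte
    with both bits set is reported as IOSHIP.
    """
    if (avte_cvp >> 5) & 0x1:
        return "IOSHIP (ALUA)"
    if (avte_cvp >> 7) & 0x1:
        return "AVT"
    return "RDAC (legacy)"


if __name__ == "__main__":
    for value in (0x00, 0x20, 0x80, 0xA0):
        print(hex(value), rdac_mode(value))
```

Any value with bit 5 set, such as 0x61 or 0xf1, therefore maps to IOSHIP, regardless of the state of the AVT bit.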

I will revisit sg commands later on. Do note that the sg3_utils package is not installed on stock XenServer distributions and that, as with any external package, installing it may void official Citrix support.


In addition to all the information that various sg commands provide, there is also an abundance of information available from the standard multipath command. We saw a sample multipath.conf file earlier, and at least with many standard Linux OS versions and ALUA-capable arrays, information on the multipath status can be more readily obtained using stock multipath commands.

For example, on an ALUA-enabled connection we might see output similar to the following from a "multipath -ll" command (there will be a number of variations in output, depending on the version, verbosity and implementation of the multipath utility):

mpath2 (3600601602df02d00abe0159e5c21e111) dm-4 DGC,VRAID
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
_ round-robin 0 [prio=50][active]
 _ 1:0:3:20  sds   70:724   [active][ready]
 _ 0:0:1:20  sdk   67:262   [active][ready]
_ round-robin 0 [prio=10][enabled]
 _ 0:0:2:20  sde   8:592    [active][ready]
 _ 1:0:2:20  sdx   128:592  [active][ready]

Recalling the device sde from the section above, note that it falls under a path group with the lower priority of 10, indicating it is part of an active, non-optimized network connection, whereas 50 indicates an active, optimized group; a priority of "1" would indicate the device is in the standby group. Be aware that these priority values vary considerably depending on the mechanism used to generate them; the most important point is that whichever path group has the higher "prio" value will be the optimized one. In some newer versions of the multipath utility, the string "hwhandler=1 alua" shows clearly that the controller is configured to allow the hardware handler to help establish the multipathing policy, as well as that ALUA is established for this device. I have read that the path priority will typically be elevated to a value between 50 and 80 for optimized ALUA-based connections (cf. mpath_prio_alua in this SUSE article), but have not seen this consistently.
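
Since the absolute priority values vary, a script that inspects this output should compare the groups with one another rather than look for a fixed number. Below is a minimal sketch under that assumption; the function names and regular expressions are hypothetical and match only the older bracketed output format shown above.

```python
import re

# The multipath listing shown above, reproduced as sample input.
SAMPLE = """\
mpath2 (3600601602df02d00abe0159e5c21e111) dm-4 DGC,VRAID
[size=100G][features=1 queue_if_no_path][hwhandler=1 alua][rw]
_ round-robin 0 [prio=50][active]
 _ 1:0:3:20  sds   70:724   [active][ready]
 _ 0:0:1:20  sdk   67:262   [active][ready]
_ round-robin 0 [prio=10][enabled]
 _ 0:0:2:20  sde   8:592    [active][ready]
 _ 1:0:2:20  sdx   128:592  [active][ready]
"""

GROUP_RE = re.compile(r"\[prio=(\d+)\]")
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(\w+)\s+\d+:\d+")


def path_groups(text):
    """Return a list of path groups, each with its prio and member devices."""
    groups = []
    for line in text.splitlines():
        group = GROUP_RE.search(line)
        if group:
            groups.append({"prio": int(group.group(1)), "devices": []})
            continue
        path = PATH_RE.search(line)
        if path and groups:
            groups[-1]["devices"].append(path.group(2))
    return groups


def optimized_group(text):
    """Pick the group with the highest prio, whatever its absolute value."""
    return max(path_groups(text), key=lambda group: group["prio"])


if __name__ == "__main__":
    print(optimized_group(SAMPLE))
```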

The multipath.conf file itself has traditionally needed tailoring to each specific device. It is particularly convenient, however, that a generic configuration is now possible for a device that makes use of the internal hardware handler, is rdac-based and can auto-negotiate an ALUA connection. The vendor and product entries below identify the specific device itself, but other devices should now work using this generic sort of configuration:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
                detect_prio             yes
                retain_attached_hw_handler yes
}

Note how this differs from the "stock" version (in XenServer 6.5) of the MD36xx multipath configuration, which lacks the final two entries above, detect_prio and retain_attached_hw_handler:

device {
                vendor                  "DELL"
                product                 "MD36xx(i|f)"
                features                "2 pg_init_retries 50"
                hardware_handler        "1 rdac"
                path_selector           "round-robin 0"
                path_grouping_policy    group_by_prio
                failback                immediate
                rr_min_io               100
                path_checker            rdac
                prio                    rdac
                no_path_retry           30
}
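
To make the difference between the two stanzas explicit, they can be compared programmatically. The sketch below uses a deliberately simple, hypothetical parser that handles only flat "keyword value" lines like those shown here; it is not a general multipath.conf parser.

```python
STOCK = """
device {
    vendor "DELL"
    product "MD36xx(i|f)"
    features "2 pg_init_retries 50"
    hardware_handler "1 rdac"
    path_selector "round-robin 0"
    path_grouping_policy group_by_prio
    failback immediate
    rr_min_io 100
    path_checker rdac
    prio rdac
    no_path_retry 30
}
"""

# The generic variant adds two entries to the stock stanza.
GENERIC = STOCK.replace(
    "}", "    detect_prio yes\n    retain_attached_hw_handler yes\n}"
)


def parse_device(stanza):
    """Turn a flat device stanza into a dict of keyword -> value."""
    settings = {}
    for raw in stanza.splitlines():
        line = raw.strip()
        if not line or line.endswith("{") or line == "}":
            continue  # skip blank lines and the braces
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1]
    return settings


def added_settings(old, new):
    """Return entries present in new that are missing or different in old."""
    return {key: value for key, value in new.items() if old.get(key) != value}


if __name__ == "__main__":
    print(added_settings(parse_device(STOCK), parse_device(GENERIC)))
```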


The LSI controllers incorporated into Dell's MD32xx and MD36xx series of iSCSI storage arrays represent an unusual and interesting case. As promised earlier, we will get back to looking at the sg_inq command, which queries a storage device for several pieces of information, including TPGS. Typically, an array that supports ALUA will return a value of TPGS > 0, for example:

# sg_inq /dev/sda
standard INQUIRY:
PQual=0 Device_type=0 RMB=0 version=0x04 [SPC-2]
[AERC=0] [TrmTsk=0] NormACA=1 HiSUP=1 Resp_data_format=2
SCCS=0 ACC=0 TPGS=1 3PC=1 Protect=0 BQue=0
EncServ=0 MultiP=1 (VS=0) [MChngr=0] [ACKREQQ=0] Addr16=0
[RelAdr=0] WBus16=0 Sync=0 Linked=0 [TranDis=0] CmdQue=1
[SPI: Clocking=0x0 QAS=0 IUS=0]
length=117 (0x75) Peripheral device type: disk
Vendor identification: NETAPP
Product identification: LUN
Product revision level: 811a

In the output above, TPGS is reported to have a value of 1. The MD36xx has supported ALUA since RAID controller firmware and NVSRAM N26X0-784890-904. However, even with that (or a newer) revision level, sg_inq returns the following for this particular storage array:

# sg_inq /dev/mapper/36782bcb0002c039d00005f7851dd65de
standard INQUIRY:
  PQual=0  Device_type=0  RMB=0  version=0x05  [SPC-3]
  [AERC=0]  [TrmTsk=0]  NormACA=1  HiSUP=1  Resp_data_format=2
  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0
  EncServ=1  MultiP=1 (VS=0)  [MChngr=0]  [ACKREQQ=0]  Addr16=0
  [RelAdr=0]  WBus16=1  Sync=1  Linked=0  [TranDis=0]  CmdQue=1
  [SPI: Clocking=0x0  QAS=0  IUS=0]
    length=74 (0x4a)   Peripheral device type: disk
 Vendor identification: DELL
 Product identification: MD36xxi
 Product revision level: 0784
 Unit serial number: 142002I
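
Extracting the TPGS field from saved sg_inq output is straightforward. The sketch below is illustrative only: the function name is hypothetical, and the meanings attached to the four possible values follow my understanding of the two-bit TPGS field in SPC-3 (implicit and/or explicit support).

```python
import re

# Meaning of the two-bit TPGS field (assumption: per SPC-3).
TPGS_MEANING = {
    0: "no target port group support reported",
    1: "implicit ALUA only",
    2: "explicit ALUA only",
    3: "implicit and explicit ALUA",
}


def tpgs_from_inquiry(text):
    """Return the TPGS value found in sg_inq output, or None if absent."""
    match = re.search(r"\bTPGS=(\d)", text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    netapp = "SCCS=0 ACC=0 TPGS=1 3PC=1 Protect=0 BQue=0"
    dell = "  SCCS=0  ACC=0  TPGS=0  3PC=1  Protect=0  BQue=0"
    for line in (netapp, dell):
        value = tpgs_from_inquiry(line)
        print(value, TPGS_MEANING.get(value, "unknown"))
```

As the MD36xxi output shows, a value of 0 is best read as "not reported" rather than as proof that ALUA is unavailable.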

Various attempts to modify the multipath.conf file to try to force TPGS to appear with any value greater than zero all failed. Above all, it seemed that without access to target port group queries, there was no way to query the device for ALUA-related information. Furthermore, the command mpath_prio_alua and similar commands appear to have been deprecated in newer versions of the device-mapper-multipath package, and so offer no help.

This proved to be a major roadblock in making any progress. Ultimately it turned out that the key to looking for ALUA connectivity in this particular case comes oddly from ignoring what TPGS reports, and rather focusing on what the MD36xx controller is doing. What is going on here is that the hardware handler is taking over control and the clue comes from the sg_vpd output shown above. To see how a LUN is mapped for these particular devices, one needs to hunt back through the /var/log/messages file for entries that appear when the LUN was first attached. To investigate this for the MD36xx array, we know it uses the internal "rdac" connection mechanism for the hardware handler, so a Linux grep command for "rdac" in the /var/log/messages file around the time the connection was established to a LUN should reveal how it was established.

Sure enough, if one looks at a case where the connection is known not to be making use of ALUA, one might see entries such as these:

[   98.790309] rdac: device handler registered
[   98.796762] sd 4:0:0:0: rdac: AVT mode detected
[   98.796981] sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.797672] sd 5:0:0:0: rdac: AVT mode detected
[   98.797883] sd 5:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.798590] sd 6:0:0:0: rdac: AVT mode detected
[   98.798811] sd 6:0:0:0: rdac: LUN 0 (owned (AVT mode))
[   98.799475] sd 7:0:0:0: rdac: AVT mode detected
[   98.799691] sd 7:0:0:0: rdac: LUN 0 (owned (AVT mode))

In contrast, consider an ALUA-based connection to LUNs on an MD3600i that has new enough firmware to support ALUA, made from a client that also supports ALUA and has a properly configured entry in the /etc/multipath.conf file. Such a connection will instead show the IOSHIP connection mechanism (see p. 124 of this IBM System Storage manual for more on I/O Shipping):

Mar 11 09:45:45 xs65test kernel: [   70.823257] scsi 8:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.385835] scsi 9:0:0:0: rdac: LUN 0 (IOSHIP) (unowned)
Mar 11 09:45:46 xs65test kernel: [   71.389345] scsi 9:0:0:1: rdac: LUN 1 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.957649] scsi 10:0:0:0: rdac: LUN 0 (IOSHIP) (owned)
Mar 11 09:45:46 xs65test kernel: [   71.961788] scsi 10:0:0:1: rdac: LUN 1 (IOSHIP) (unowned)
Mar 11 09:45:47 xs65test kernel: [   72.531325] scsi 11:0:0:0: rdac: LUN 0 (IOSHIP) (owned)

Hence, we happily recognize that indeed, ALUA is working.
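
The same check can be scripted against the kernel messages. This is a rough sketch: the function name and regular expression are hypothetical and assume the message wording shown in the two log excerpts above, which may differ between kernel versions.

```python
import re

LUN_RE = re.compile(r"(\d+:\d+:\d+:\d+): rdac: LUN (\d+) (.*)$")


def classify(line):
    """Return (h:c:t:l, lun, mode, owned) for an rdac LUN line, else None."""
    match = LUN_RE.search(line)
    if not match:
        return None
    hctl, lun, rest = match.group(1), int(match.group(2)), match.group(3)
    if "IOSHIP" in rest:
        mode = "IOSHIP"  # ALUA in RDAC terminology
    elif "AVT" in rest:
        mode = "AVT"
    else:
        mode = "RDAC"  # assumption: anything else is legacy RDAC mode
    owned = "unowned" not in rest
    return hctl, lun, mode, owned


if __name__ == "__main__":
    lines = [
        "[   98.796981] sd 4:0:0:0: rdac: LUN 0 (owned (AVT mode))",
        "Mar 11 09:45:46 xs65test kernel: [   71.385835] scsi 9:0:0:0: rdac: LUN 0 (IOSHIP) (unowned)",
    ]
    for line in lines:
        print(classify(line))
```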

The even better news is that ALUA is not only functional in XenServer 6.5 but should, in fact, now work with a large number of ALUA-capable storage arrays, both those that need custom configuration and potentially many that may work generically. Another surprising find was that, for the MD3600i arrays tested, even the "stock" version of the MD36xxi multipath configuration entry provided with XenServer 6.5 creates ALUA connections. The reason for this is that the hardware handler is being used consistently, provided no specific profile overrides intervene, and so it is primarily the storage device that does the negotiation itself instead of being driven by the file-based configuration. This is what made the determination of ALUA connectivity more difficult, namely that the TPGS setting was never changed from zero and could consequently not be used to query for the group settings.


First off, it is really nice to know that many modern storage devices support ALUA and that XenServer 6.5 now provides an easier means to leverage this protocol. It is also a lesson that documentation can be hard to find and, in some cases, is in need of being updated to reflect the current state. Individual vendors will generally provide specific instructions regarding iSCSI connectivity, and these should of course be followed. Experimentation is best carried out on non-production servers where a major faux pas will not have catastrophic consequences.

To me, this was also a lesson in persistence as well as an opportunity to share the curiosity and knowledge among a number of individuals who were helpful throughout this process. Above all, among many who deserve thanks, I would like to thank in particular Justin Bovee from Dell and Robert Breker of Citrix for numerous valuable conversations and information exchanges.
