Thursday, April 7, 2016

XenServer + Docker + CEPH/RBDSR for shared local storage




No need to dance around it - the title says it all :)

So straight to it. What I've got here is an example of how to set up a XenServer pool to use local storage as CEPH OSDs, running on each host as Docker containers, and present RBD objects from those CEPH Docker containers back to the pool as shared storage - wow, this is a mouthful :)

Please keep in mind that this post is full of unsupported, potentially dangerous, fairly unstable commands and technologies - just the way I like it :) But don't go and set it up in production - this is just a demo, mostly for fun.

What's the point of it all? Well, you don't get more space, that's for sure. The idea is that you would run a copy of the same data on each host, which might make it easier to migrate things around and manage backups (since we have a copy of everything everywhere). Besides - local storage is cheap and fun :)

So let's get cooking!

Ingredients:

Cookery:

Prepare XenServer Hosts

Install XenServer with defaults, except don't use any local storage (i.e. untick sda when selecting the disk for local storage; see step 4-6 for details).

This should give you XenServer hosts with Removable Storage and CDROM SRs only.

Once you've installed all three hosts, join them into a pool and patch to the latest hotfix level (it's always a good idea anyway ;) )

Install the supplemental pack for Docker and my RBDSR.py stuff on each one.

Prepare SRs and physical partitions

Now we are going to create partitions on the remaining space of the local disk and map those partitions as VDIs:

In dom0 of each host create a folder, let's call it partitions:

# mkdir /partitions

Then create SR type=file:

# xe sr-create type=file device-config:location=/partitions name-label=partitions host-uuid=<host you created folder on>

Here you can place file VDIs (VHD or RAW) and link physical disks and partitions.

Let's do just that. Create third and fourth partitions on the local disk:

# gdisk /dev/sda
  • n -> enter(3 partition default) -> enter(starting sector, default first available)  -> +20G -> enter
  • n -> enter(4 partition default) -> enter(starting sector, default next after P3) -> enter(end sector, default remainder of the disk)
  • w -> Y

To recap, we've used the GPT disk utility (gdisk) to create two new partitions, numbered 3 and 4, starting at the end of sda2 and spanning the remaining space. The first partition we've created (sda3, 20GB) will be used as a VDI for the VM that will run docker, and the second partition (sda4, around 448GB) will be used as the OSD disk for CEPH. Then we've written the changes to the disk and said "Y"es to the warning about a potentially dangerous task.

Now that we have our raw disk partitioned, we will introduce the partitions as VDIs in the "partitions" SR. But at the moment those partitions are not available, since we've modified the partition table of the root disk and Dom0 is not too keen on letting go of it to re-read it. We could do some remount magic, but since nothing is running yet, we'll just reboot before mapping the VDI…

OK, so the hosts are now rebooted, let's map the VDI:

# ln -s /dev/sda3 /partitions/$(uuidgen).raw

This will create a symlink with a random (but unique) uuid and the extension .raw inside the partitions folder, pointing to sda3. If you rescan that SR now, a new disk will pop up with size 0 and no name. Give it a name and don't worry about the size. The SM stuff doesn't get sg info from symlinks to partitions - which is a good thing I guess :) Besides, if you want to know how big it is, just run gdisk -l /dev/sda in dom0 and look at the entry for partition 3. Our guest that will use this disk will be fully aware of its geometry and properties, thanks to tapdisk magic.

Repeat the same with sda4:

# ln -s /dev/sda4 /partitions/$(uuidgen).raw

Rescan and add a sensible name.
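
The two ln commands above can be wrapped in a small helper so the generated file name is printed back to you. This is only a sketch: link_partition is a hypothetical name of mine, and the fallback to /proc/sys/kernel/random/uuid is an assumption for systems where uuidgen is missing.

```shell
# Hypothetical helper: link a block device into a file-type SR directory
# under a freshly generated UUID and print the path of the new symlink.
link_partition() {
    dev=$1
    sr_dir=$2
    # uuidgen as used in the post; fall back to the kernel's generator.
    uuid=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid) || return 1
    ln -s "$dev" "$sr_dir/$uuid.raw" && printf '%s\n' "$sr_dir/$uuid.raw"
}

# Example (run in dom0, then rescan and name the VDI):
#   link_partition /dev/sda4 /partitions
#   xe sr-scan uuid=<uuid of the partitions SR>
#   xe vdi-param-set uuid=<uuid of the new VDI> name-label="ceph-osd"
```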

*note: you may want to link path /dev/disk/by-id/ or /dev/disk/by-uuid/ instead of /dev/sda<num> to avoid problems in future if you decide to add more disks to hosts and mess up sda/sdb order. But for the sake of this example, I think /dev/sda<num> is good enough.
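
If you do want the stable path, something like the following sketch could look it up for you. stable_link is a hypothetical helper name; it simply scans a directory of symlinks (by default /dev/disk/by-id) for the first one that resolves to the given device.

```shell
# Hypothetical helper: print the first symlink in a directory that resolves
# to the same device node as the first argument.
stable_link() {
    target=$(readlink -f "$1")
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink -f "$link")" = "$target" ]; then
            printf '%s\n' "$link"
            return 0
        fi
    done
    return 1
}

# Example (in dom0):
#   ln -s "$(stable_link /dev/sda3)" "/partitions/$(uuidgen).raw"
```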

*Accidentally useful stuff: with the "partitions" SR added to the XS host, you can now check the space of Dom0 in XenCenter. Because the SM stuff reads the size of that folder, and inadvertently the size and utilisation of the rootfs (sda1), the free and utilised space of that SR will represent the Dom0 filesystem.

Prepare Docker Guest

At this stage you could have installed CoreOS, RancherOS or Boot2Docker to deploy the CEPH container. While that would work with the RBDSR plugin demo that I've made (i.e. by using an ssh public key in Dom0 to obtain information over ssh instead of a password), we would need to set up automated container start-up, and I'm not familiar enough with those systems to write about it.

Instead, here I'll use a Linux guest (openSUSE Leap) to run the CEPH docker container inside of it. There are two additional reasons for doing that:

  1. I love SUSE - it's a Swiss knife with the precision of a samurai sword.
  2. It will help to demonstrate how the supplemental pack, and xscontainer monitoring in particular, works.

Anyway, let's create the guest using the latest x64 SLES template.

If you haven't already, download openSUSE DVD or net install here

Add ISO SR to the pool. 

Since we don't have any local storage, but need to build the guest on each host locally, we are going to use sda3 (20GB) as the system VDI for that guest.

However, this presents a problem: the default disk for the SLES guest is 8GB, but we don't have any local storage to place it on (the only SR is /partitions, and that is just 4GB of sda1). But do not worry - everything will fit just fine. To work around the SLES template's recommendation of an 8GB local disk, there are two possible solutions:

Solution 1

Change the recommendations of the template's disk size like so: 

  • xe template-list | grep SUSE -A1
  • xe template-list uuid=<uuid of SUSE 12 x64 from command above> params
  • xe template-param-set uuid=<uuid from the command above> other-config:disks='<copy of the disks section from the original template with size value changed to 1073741824>'
  • create the VM from the template, then delete the attached disk and replace it with sda3 from the "partitions" SR.
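
To avoid hand-editing the disks value in Solution 1, a small sed filter can swap the size for you. This is a sketch under an assumption: the sample XML below is what I'd expect the disks key to look like, so check it against what your own template actually returns. set_disk_size is a hypothetical helper name.

```shell
# Hypothetical helper: rewrite every size="..." attribute in the disks XML
# read from stdin to the byte count given as the first argument.
set_disk_size() {
    sed -E "s/size=\"[0-9]+\"/size=\"$1\"/g"
}

# Example with an assumed disks value (yours may differ):
echo '<provision><disk device="0" size="8589934592" sr="" bootable="true" type="system"/></provision>' \
    | set_disk_size 1073741824

# Against a real template (in dom0) it could be used like this:
#   old=$(xe template-param-get uuid=<template uuid> param-name=other-config param-key=disks)
#   xe template-param-set uuid=<template uuid> other-config:disks="$(echo "$old" | set_disk_size 1073741824)"
```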

Solution 2

Or, you can use command line to install the template like so:

  • xe vm-install template="SUSE Linux Enterprise Server 12 (64-bit)" new-name-label=docker1 sr-uuid=<uuid of the partition sr on the master> 
  • Once the VM is created, you need to delete the 8GB disk and replace it with the sda3 VDI from the "partitions" SR. In the storage tab, also click to create a CDROM and insert the openSUSE ISO.
  • set install-repository to cdrom: xe vm-param-set uuid=<docker1 uuid> other-config:install-repository=cdrom
  • add vif from xencenter or: xe vif-create vm-uuid=<docker1 uuid> network-uuid=<uuid of network you want to use> mac=random device=0
  • set cdrom as bootable device: 
    • xe vbd-list vm-name-label=docker1 type=CD
    • xe vbd-param-set uuid=<uuid from command above> bootable=true

Boot the VM and install it onto the sda3 VDI. A few changes have to be made to the default installation of openSUSE:

  1. set boot partition to ext4 instead of BTRFS since pygrub doesn't read btrfs(yet :))
    • Remove proposed partitions. 
    • create first boot partition of around 100MB
    • then create swap
    • then create btrfs for the remaining space
  2. disable snapshotting on BTRFS (because we only have 20GB to play with and a lot of btrfs operations related to CEPH)
    • This option is in "subvolume handling" of the partition.
  3. enable ssh(in firewall section of installation summary) and go.

Once the VM is up and running, login and patch it:

# zypper -n up

Then install xs-tools for dynamic devices: 

  • eject openSUSE DVD and insert xs-tools.iso(can be done in one click of the general tab of VM)
  • mount cdrom inside the VM: mount /dev/xvdd /mnt
  • install sles type tools: cd /mnt/Linux && ./install -d sles -m 12
  • reboot

We need to remove DVD as a repository with:

# zypper rr openSUSE-42.1-0

then install docker and ncat:

# zypper -n in docker ncat

Start and enable docker service:

# systemctl enable docker.service && systemctl start docker.service

In Dom0, enable monitoring of docker for that VM with:

# xscontainer-prepare-vm -v <uuid of the docker1> -u root

Note: XenServer doesn't support openSUSE as a docker host. I'm not sure what exactly the problem is: either something is wrong with the docker socket in openSUSE or with ncat, so that the command takes too long to return when you try to do a GET, or there are not enough \n in the script. As a workaround you'd need to do the following on each host:

Edit /usr/lib/python2.4/site-packages/xscontainer/docker_monitor/__init__.py and comment out docker.wipe_docker_other_config, adding 'pass' above it:

016-01-13 12:35:32.000000000 +1100
@@ -101,7 +101,8 @@
                     try:
                         self.__monitor_vm_events()
                     finally:
-                        docker.wipe_docker_other_config(self)
+                        pass
+                    #    docker.wipe_docker_other_config(self)
                 except (XenAPI.Failure, util.XSContainerException):
                     log.exception("__monitor_vm_events threw an exception, "
                                   "will retry")

With this in place, docker status will be shown correctly in the XenCenter.

Prepare CEPH container

Pull the ceph/daemon image:

# sudo docker pull ceph/daemon

Now that we have a VM prepared with the CEPH docker image inside, we can replicate it across the remaining 2 hosts.

  • Shutdown the VM
  • detach the disk and make two additional copies of the VM.
  • Reassign home server for the two copies to remaining hosts and attach sda3 and sda4 as respective disks.

At the moment we have the docker guest installed only on the master, so let's copy its system disk (sda3) to the other hosts:

# dd if=/dev/sda3 bs=4M | ssh root@<ip of slave 1> dd of=/dev/sda3

Then repeat the same for the second slave. Once that has been completed, you can boot all three VMs up. This is a good time to configure static IPs and add a second NIC if you wish to use a dedicated network for CEPH data syncing.

You'd also need to run xscontainer-prepare-vm for the two other VMs (alternatively you can just set the other-config flags xscontainer-monitor=true and xscontainer-username=root, since we already have the XenServer pool rsa key added to the authorised keys inside of the guest).

Once you've done all that and restarted xscontainer-monitor (/etc/init.d/xscontainer-monitor restart) to apply the changes we've made in __init__.py, you should see Docker information in the General tab of each docker guest. As you can see, the xscontainer supplemental pack uses ssh to log in to the instance, then uses ncat on the docker unix socket and, with a GET, obtains information about the docker version and the containers it has.
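
If the General tab stays empty, you can poke the socket by hand from inside the guest. The sketch below only builds a minimal HTTP request; the exact request xscontainer sends isn't shown here, so treat the /version path and the /var/run/docker.sock location as my assumptions. docker_get is a hypothetical helper name.

```shell
# Hypothetical helper: print a minimal HTTP/1.0 GET request for a Docker
# API path, ready to be piped into the docker unix socket.
docker_get() {
    printf 'GET %s HTTP/1.0\r\n\r\n' "$1"
}

# Example (inside the docker guest, socket path assumed):
#   docker_get /version | ncat -U /var/run/docker.sock
```

If that hangs or takes a long time to return, you are probably looking at the same problem the monitor runs into.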

Prepare CEPH cluster

So we are ready to deploy CEPH. 

Note, it's a good idea to configure NTP at this stage, because monitors are very sensitive to time skew between hosts (use yast or /etc/ntp.conf).

In addition, it's a good idea to make the guest's wallclock independent of Dom0, as otherwise it will never stay accurately in sync and CEPH will report problems on the cluster:

# sysctl xen.independent_wallclock=1

So all the information on how to deploy monitor and osd is available on the github page of the ceph/daemon docker here: https://github.com/ceph/ceph-docker/tree/master/daemon

Another interesting page is the sample ceph.conf: https://github.com/ceph/ceph/blob/master/src/sample.ceph.conf

One part of that config that is of interest to me is the osd setting to create a btrfs partition instead of the default xfs.

So, to bring up a monitor container on the docker host running on the master, I run:

# sudo docker run -d --net=host --name=mon -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -e MON_IP=192.168.0.20 -e CEPH_PUBLIC_NETWORK=192.168.0.0/24 ceph/daemon mon

Note: this container requires the folders /etc/ceph and /var/lib/ceph to work; you can create them manually or just install ceph on your base system (in this case the openSUSE host) so it populates all the default paths.

If you have a dedicated network that you want to use for data sync between OSDs, you would need to specify the CEPH_CLUSTER_NETWORK parameter as well.

Once the monitor is started, you'd need to copy both the /etc/ceph and /var/lib/ceph folders to the two other docker hosts. Keep in mind that you need to preserve permissions, as by default it configures things as the "ceph" user, or uid 64045 and gid 64045.

# cd /etc
# tar -cvzf - ./ceph | ssh root@<second docker host> 'cd /etc && tar -xvzf -'
# cd /var/lib
# tar -cvzf - ./ceph | ssh root@<second docker host> 'cd /var/lib && tar -xvzf -'

Then repeat for the third docker host.
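
Since the same copy has to be done twice per host, a tiny wrapper saves some typing. This is just a sketch: copy_dir is a hypothetical helper of mine, and -p is added on the assumption that you want tar to keep the permission bits on both ends.

```shell
# Hypothetical helper: stream directory <name> found under <parent> as a
# gzipped tar archive into the command given by the remaining arguments.
copy_dir() {
    parent=$1
    name=$2
    shift 2
    tar -C "$parent" -cpzf - "./$name" | "$@"
}

# Example (host names are placeholders):
#   for host in docker2 docker3; do
#       copy_dir /etc ceph ssh "root@$host" 'tar -C /etc -xpzf -'
#       copy_dir /var/lib ceph ssh "root@$host" 'tar -C /var/lib -xpzf -'
#   done
```

Running the extraction as root on the receiving side is what keeps the 64045 ownership intact.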

Once all hosts have those two folders you can start monitors on the other two hosts.

At this stage you should be able to use ceph -s to query status. It should display the cluster status with a health error, which is fine until we add OSDs.
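
If you want to script around the cluster state, the first word of the ceph health output is the part to look at. A minimal sketch, with health_word being a hypothetical helper name:

```shell
# Hypothetical helper: read `ceph health` output on stdin and print only
# the leading status keyword of the first line.
health_word() {
    awk 'NR == 1 { print $1 }'
}

# Example (on a docker host where the ceph client can reach the monitors):
#   until [ "$(ceph health | health_word)" = "HEALTH_OK" ]; do sleep 10; done
```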

With the OSD deployment, you should make sure that the disk you use has proper permissions. One way to do it is via a udev rule. Thanks to Adrian Gillard on the ceph mailing list, in this case the trick is to set the owner of /dev/xvdb to 64045:

# cat > /etc/udev/rules.d/89-ceph-journal.rules << EOF
KERNEL=="xvdb?", SUBSYSTEM=="block", OWNER="64045", GROUP="disk", MODE="0660"
EOF
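
To check whether the rule actually took effect, you can look at the ownership of the device node. A small sketch, assuming GNU stat inside the guest; owner_mode is a hypothetical helper name.

```shell
# Hypothetical helper: print "uid:gid mode" for a path (GNU stat syntax).
owner_mode() {
    stat -c '%u:%g %a' "$1"
}

# Example (inside the docker guest, after the OSD partition exists):
#   udevadm control --reload-rules && udevadm trigger --subsystem-match=block
#   owner_mode /dev/xvdb1
```

With the rule applied you'd expect the uid part to read 64045 and the mode to read 660.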

At this stage it's probably a good idea to reboot the guest to make sure that the new udev rule applies, then start the monitor and deploy the OSD like so:

# sudo docker start mon
# sudo docker run -d --net=host --privileged=true --pid=host -v /etc/ceph:/etc/ceph -v /var/lib/ceph/:/var/lib/ceph/ -v /dev/:/dev/ -e OSD_DEVICE=/dev/xvdb -e OSD_TYPE=disk ceph/daemon osd

If there was some filesystem on that disk before, it's a good idea to add -e OSD_FORCE_ZAP=1 as well. If you wish to use btrfs instead of xfs, change /etc/ceph/ceph.conf by adding this line:

osd mkfs type = btrfs

If things get hairy and creating the OSD didn't work out, make sure to clear the xvdb1 partition with

# wipefs -o 0x10040 /dev/xvdb1

and then zap it with gdisk:

# gdisk /dev/xvdb -> x -> z -> yes

Create RBDoLVM SR

Once you have started all OSDs and you have a pool of data, you can create a new pool and add an rbd object that will be used as an LVM SR. Since the rbd object is thin provisioned and the LVM based SR in XenServer is not, you can create it with a fairly large size, like 2 or 4 TB for example.
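
As a rough sketch of that step: rbd create takes its size in megabytes by default, so a little arithmetic helps. The pool name, image name and placement group count in the comments are placeholders I made up, and tb_to_mb is a hypothetical helper.

```shell
# Hypothetical helper: convert whole terabytes to the megabyte count that
# `rbd create --size` expects by default.
tb_to_mb() {
    echo $(( $1 * 1024 * 1024 ))
}

# Example (names and pg count are assumptions, pick your own):
#   ceph osd pool create xenserver 128
#   rbd create xenserver/lvm-sr --size "$(tb_to_mb 2)"
```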

Then use XenCenter to create a new SR on that RBD object (example is available here)

Automate startup

With SR attached to the pool, it would be a good thing to make sure that containers start automatically:

To configure start of containers on boot you can use instructions here: https://docs.docker.com/engine/articles/host_integration/

In the case of openSUSE, the systemd file would look something like this:

[Unit]
Description=CEPH MON container
After=network.target docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker start -a mon
ExecStop=/usr/bin/docker stop -t 2 mon
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

There are two changes from the docker documentation in the systemd example above:

  1. The restart needs a timeout. The reason is that the docker service starts up but doesn't accept connections on the unix socket right away, so the first attempt to start the container might fail, but the one after that should bring the container up.
  2. Depending on your linux distro, the target might be different from the one in docker's example (i.e. local.target). In my case I've set it to multi-user.target, that's where my docker boxes aim for.

The same systemd script will be required for the OSDs as well.
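
Rather than copying the file by hand, you could generate both units from one template. A sketch with a hypothetical make_unit helper; it assumes the OSD container was created with --name=osd, which the docker run command earlier does not set, and that the units live in /etc/systemd/system.

```shell
# Hypothetical helper: print a systemd unit that starts and stops an
# existing docker container with the given name.
make_unit() {
    name=$1
    cat <<EOF
[Unit]
Description=CEPH $name container
After=network.target docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/docker start -a $name
ExecStop=/usr/bin/docker stop -t 2 $name
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF
}

# Example (inside each docker guest; paths and container names assumed):
#   make_unit mon > /etc/systemd/system/docker-mon.service
#   make_unit osd > /etc/systemd/system/docker-osd.service
#   systemctl daemon-reload
```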

Once the scripts are prepared, you can put them to the test with:

# systemctl enable docker-osd
# systemctl enable docker-mon

And restart docker host to check if both containers come up.

Now that we have the containers starting automatically, we need to make sure that the docker VMs come up automatically on XenServer boot as well. You can use the article here: http://support.citrix.com/article/CTX133910

Master host PBD fixup

One last setting that you might need to adjust manually: due to a limitation in RBDSR, the only monitor provided in the pbd is the first in the list (i.e. the first initial monitor in the ceph configuration). As a result, if for example your first initial monitor is the one running on the master, then when you reboot the master it will be waiting for docker and the monitor container to come up, and will most likely time out and fail to plug the pbd. To work around that, you can recreate the pbd on the master to point to a monitor on one of the slaves - this is a nasty solution, but it will work for as long as this slave is part of the pool and running an active monitor.

# xe pbd-list host-name-label=[name of the master] sr-name-label=[name of the CEPH SR] params
# xe pbd-unplug uuid=[uuid from command above]
# xe pbd-destroy uuid=[the same]
# xe secret-create value=[ceph user password provided in CHAP configuration of the SR]

Now you can create the PBD by copying the details from the "pbd-list params" output and replacing the target with the IP of the monitor on the slave:

# xe pbd-create host-uuid=[master uuid] sr-uuid=[CEPH sr uuid] device-config:SCSIid=[name of the rbd object] \
        device-config:chapuser=[docker admin that can run ceph commands] device-config:port=6789 \
        device-config:targetIQN=[rbd pool name] device-config:target=[docker instance on slave] \
        device-config:chappassword_secret=[secret uuid from command above]

That should be it - you now have local storage presented as a single SR to the pool with one copy of data on each host. Have fun!

