Thursday, August 20, 2015


Using Virtual Machines to Improve Container Security with rkt v0.8.0
// CoreOS Blog

Today we are releasing rkt v0.8.0. rkt is an application container runtime built to be efficient, secure and composable for production environments.

This release includes new security features, including initial support for user namespaces and enhanced container isolation using hardware virtualization. We have also introduced a number of improvements such as host journal integration, container socket activation, improved image caching, and speed enhancements.

Intel Contributes rkt stage1 with Virtualization

Intel and rkt

The modular design of rkt enables different execution engines and containerization systems to be built and plugged in. This is achieved using a staged architecture, where the second stage ("stage1") is responsible for creating and launching the container. When we launched rkt, it featured a single, default stage1 which leverages Linux cgroups and namespaces (a combination commonly referred to as "Linux containers").
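The idea behind swappable stage1s can be sketched as an interface with multiple implementations. This is an illustrative sketch only, not rkt's actual internal API; the type and method names here are invented for the example:

```go
package main

import "fmt"

// Stage1 abstracts the component responsible for creating and
// launching a pod. Hypothetical interface, for illustration only.
type Stage1 interface {
	Name() string
	Run(podUUID string) error
}

// nspawnStage1 stands in for the default stage1, which isolates the
// pod with Linux cgroups and namespaces ("Linux containers").
type nspawnStage1 struct{}

func (nspawnStage1) Name() string { return "nspawn" }
func (nspawnStage1) Run(pod string) error {
	fmt.Printf("launching pod %s under systemd-nspawn\n", pod)
	return nil
}

// lkvmStage1 stands in for the new stage1, which isolates the pod
// inside a KVM virtual machine.
type lkvmStage1 struct{}

func (lkvmStage1) Name() string { return "lkvm" }
func (lkvmStage1) Run(pod string) error {
	fmt.Printf("launching pod %s inside a KVM virtual machine\n", pod)
	return nil
}

func main() {
	// The runtime selects a stage1 at launch time (cf. the
	// --stage1-image flag later in the post); the application image
	// itself is unchanged either way.
	for _, s := range []Stage1{nspawnStage1{}, lkvmStage1{}} {
		s.Run("bccc16ea")
	}
}
```

The point of the design is that the application never needs to know which isolation mechanism it runs under; only the launch-time stage1 selection changes.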

With the help of engineers at Intel, we have added a new rkt stage1 runtime that utilizes virtualization technology. This means an application running under rkt using this new stage1 can be isolated from the host kernel using the same hardware features that are used in hypervisors like Linux KVM.

In May, Intel announced a proof-of-concept of this feature built on top of rkt, as part of their Intel® Clear Containers effort to utilize hardware-embedded virtualization technology features to better secure container runtimes and isolate applications. We were excited to see this work taking place and being prototyped on top of rkt as it validated some of the early design choices we made, such as the concepts of runtime stages and pods. Here is what Arjan van de Ven from Intel's Open Source Technology Center had to say:

"Thanks to rkt's stage-based architecture, the Intel® Clear Containers team was able to rapidly integrate our work to bring the enhanced security of Intel® Virtualization Technology (Intel® VT-x) to the container ecosystem. We are excited to continue working with the rkt community to realize our vision of how we can enhance container security with hardware-embedded technology, while delivering the deployment benefits of containerized apps."

Since the prototype announcement in May we have worked closely with the team from Intel to ensure that features such as one IP-per-pod networking and volumes work in a similar way when using virtualization. Today's release of rkt sees this functionality fully integrated to make the lkvm backend a first-class stage1 experience. So, let's try it out!

In this example, we will first run a pod using the default cgroups/namespace-based stage1. Let's launch the container with systemd-run, which will construct a unit file on the fly and start it. Checking the status of this unit will show us what's going on under the hood.

$ sudo systemd-run --uid=0 \
    ./rkt run \
    --private-net --port=client:2379 \
    --volume data-dir,kind=host,source=/tmp/etcd \
    ,version=v2.2.0-alpha.0 \
    -- --advertise-client-urls="" \
       --listen-client-urls=""
Running as unit run-1377.service.

$ systemctl status run-1377.service
● run-1377.service
   CGroup: /system.slice/run-1377.service
           ├─1378 stage1/rootfs/usr/bin/systemd-nspawn
           ├─1425 /usr/lib/systemd/systemd
           └─system.slice
             ├─etcd.service
             │ └─1430 /etcd
             └─systemd-journald.service
               └─1426 /usr/lib/systemd/systemd-journald

Notice that we can see the complete process hierarchy inside the pod, including a systemd instance and the etcd process.

Next, let's launch the same container under the new KVM-based stage1 by adding the --stage1-image flag:

$ sudo systemd-run -t --uid=0 \
    ./rkt run --stage1-image=sha512-c5b3b60ed4493fd77222afcb860543b9 \
    --private-net --port=client:2379 \
    --volume data-dir,kind=host,source=/tmp/etcd2 \
    ,version=v2.2.0-alpha.0 \
    -- --advertise-client-urls="" \
    --listen-client-urls=""
...

$ systemctl status run-1505.service
● run-1505.service
   CGroup: /system.slice/run-1505.service
           └─1506 ./stage1/rootfs/lkvm

Notice that the process hierarchy now ends at lkvm. This is because the entire pod, including the systemd instance and the etcd process, is executing inside a KVM process: to the host system, it simply looks like a single virtual machine process. By adding a single flag to our container invocation, we have taken advantage of the same KVM technology that public clouds use to isolate tenants, isolating our application container from the host kernel and adding another layer of security.

Thank you to Piotr Skamruk, Paweł Pałucki, Dimitri John Ledkov, and Arjan van de Ven from Intel for their support and contributions. For more details on this feature, see the lkvm stage1 guide.

Seamless Integration With Host-Level Logging

On systemd hosts, the journal is the default log aggregation system. With the v0.8.0 release, rkt now automatically integrates with the host journal, if one is detected, to provide a systemd-native log management experience. To explore the logs of a rkt pod, all you need to do is add a machine specifier like -M rkt-$UUID to a journalctl command on the host.

As a simple example, let's explore the logs of the etcd container we launched earlier. First we use machinectl to list the pods that rkt has registered with systemd:

$ machinectl list
MACHINE                                  CLASS     SERVICE
rkt-bccc16ea-3e63-4a1f-80aa-4358777ce473 container nspawn
rkt-c3a7fabc-9eb8-4e06-be1d-21d57cdaf682 container nspawn

2 machines listed.

We can see our etcd pod listed as the second machine known by systemd. Now we use the journal to directly access the logs of the pod:

$ sudo journalctl -M rkt-c3a7fabc-9eb8-4e06-be1d-21d57cdaf682
etcd[4]: 2015-08-18 07:04:24.362297 N | etcdserver: set the initial cluster version to 2.2.0

User Namespace Support

This release includes initial support for user namespaces to improve container isolation. By leveraging user namespaces, an application may run as the root user inside of the container but will be mapped to a non-root user outside of the container. This adds an extra layer of security by isolating containers from the real root user on the host. This early preview of the feature is experimental and uses privileged user namespaces, but future versions of rkt will improve on the foundation found in this release and offer more granular control.
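Conceptually, this mapping works like a line of the kernel's /proc/&lt;pid&gt;/uid_map: a contiguous range of IDs inside the namespace is shifted to a range of unprivileged IDs on the host. The following Go sketch is illustrative only (the type is invented, and the base offset 1037893632 is taken from the directory listing later in the post):

```go
package main

import "fmt"

// uidMapping mirrors one line of /proc/<pid>/uid_map: IDs in
// [insideBase, insideBase+length) inside the namespace correspond to
// [outsideBase, outsideBase+length) on the host.
type uidMapping struct {
	insideBase  uint32
	outsideBase uint32
	length      uint32
}

// toHost translates a uid as seen inside the container to the host
// uid, returning false when the uid falls outside the mapped range.
func (m uidMapping) toHost(inside uint32) (uint32, bool) {
	if inside < m.insideBase || inside >= m.insideBase+m.length {
		return 0, false
	}
	return m.outsideBase + (inside - m.insideBase), true
}

func main() {
	// Hypothetical mapping: container root (uid 0) appears on the
	// host as the unprivileged uid 1037893632.
	m := uidMapping{insideBase: 0, outsideBase: 1037893632, length: 65536}

	host, ok := m.toHost(0)
	fmt.Println(host, ok) // prints 1037893632 true
}
```

The practical consequence: even if a process escapes the container as "root", the host kernel treats it as a high-numbered, unprivileged user.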

To turn user namespaces on, two flags need to be added to our original example: --private-users and --no-overlay. The first turns on the user namespace feature and the second disables rkt's overlayfs subsystem, as it is not currently compatible with user namespaces:

$ ./rkt run --no-overlay --private-users \
    --private-net --port=client:2379 \
    --volume data-dir,kind=host,source=/tmp/etcd \
    ,version=v2.2.0-alpha.0 \
    -- --advertise-client-urls="" \
       --listen-client-urls=""

We can confirm this is working by using curl to verify that etcd is serving requests, and then checking the permissions on the etcd data directory. Note that from the host's perspective, the etcd member directory is owned by a very high user id:

$ curl
{"etcdserver":"2.2.0-alpha.0","etcdcluster":"2.2.0"}

$ ls -la /tmp/etcd
total 0
drwxrwxrwx  3 core       core        60 Aug 18 07:31 .
drwxrwxrwt 10 root       root       200 Aug 18 07:31 ..
drwx------  4 1037893632 1037893632  80 Aug 18 07:31 member

Adding user namespace support is an important step towards our goal of making rkt the most secure container runtime, and we will be working hard to improve this feature in coming releases; you can see the roadmap in this issue.

Open Containers Initiative Progress

With rkt v0.8.0 we are furthering our efforts with security hardening and moving closer to a 1.0 stable and production-ready release. We are also dedicated to ensuring that the container ecosystem continues down a path that enables people publishing containers to "build once, sign once, and run anywhere." Today rkt is an implementation of the App Container spec (appc), and in the future we hope to make rkt an implementation of the Open Container Initiative (OCI) specification. However, the OCI effort is still in its infancy and there is a lot of work left to do. To check on the progress of the effort to harmonize OCI and appc, you can read more about it on the OCI dev mailing list.

Contribute to rkt

One of the goals of rkt is to make it the most secure container runtime, and there is a lot of exciting work to be done as we move closer to 1.0. Join us on our mission: we welcome your involvement in the development of rkt, via discussion on the rkt-dev mailing list, filing GitHub issues, or contributing directly to the project.
