Tuesday, February 9, 2016

rkt Network Modes and Default CNI Configurations



// CoreOS Blog

In the past few months we've been working on rkt, an implementation of the App Container (appc) spec and a pod runtime designed for security and composability. In the specification and in rkt itself, common application "containers" are grouped into a pod that can contain one or more applications. A pod is the unit of execution in rkt, and we use "pod" in this sense throughout this post.

Alongside rkt, we developed the Container Network Interface (CNI), a proposed standard for configuring network interfaces for Linux containers. In the context of CNI, container refers specifically to a Linux network namespace. rkt uses CNI plugins that conform to the CNI specification. Kubernetes, a cluster orchestration system, is also in the process of integrating native support for CNI. Other projects that have integrated support for CNI include Project Calico and Weaveworks. Stay tuned for updates on the integration with Kubernetes by joining the Kubernetes networking special interest group mailing list.

rkt and CNI are a good match for a wide variety of pod network configurations, ranging from simple use cases to advanced and complex environments. Today we'll explain how to use existing CNI plugins with rkt, showing how to set up networking as you get started.

This document was written with rkt version 1.0. Be sure to check the latest rkt documentation.

rkt's network modes

The rkt networking document describes how to configure networking and briefly touches on some of the possibilities. In short, networks can be configured by passing the flag --net one or more times when invoking rkt run.

The --net option is set to a named network that should be loaded by rkt when the pod is run. By repeating the option, a list of networks can be constructed. Before we look at the handling of associated network configuration files, there are some basic networking modes to review. These modes are chosen by providing special network names to the --net option that are understood by rkt as mode selectors.

So far there are three of these special network names. The first two are none and host. The third case is the bare flag: --net given with no name at all is a synonym for --net=default, and rkt will try to load the network named default.
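To make the mode selection concrete, here is a minimal sketch of how a single --net value maps onto the modes just described. The function name net_mode and the network name mynet are hypothetical, and this only mirrors the prose; it is not rkt's actual parsing code.

```shell
# net_mode NAME: describe how a single --net value would be treated.
# Illustration of the prose only, not rkt's own logic.
net_mode() {
    case "$1" in
        none)       echo "no networking (loopback only)" ;;
        host)       echo "host network namespace" ;;
        ""|default) echo "network named default" ;;
        *)          echo "named network: $1" ;;
    esac
}

net_mode none
net_mode host
net_mode ""        # --net given with no name at all
net_mode mynet     # "mynet" is a hypothetical network name
```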

Now, let's explore the meaning of these network names and some of the more common network environments with concrete examples:

No network connectivity: --net=none

If you want to isolate a pod from the network completely, you can do this by using --net=none. This is a security best practice that limits the vectors by which a pod can be accessed when an application has no network requirements. In this case, the pod will end up with only the loopback network interface.

    $ sudo rkt run --interactive --net=none busybox-1.23.2-linux-amd64.aci
    (...)
    / # ip address
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
        inet6 ::1/128 scope host
           valid_lft forever preferred_lft forever
    / # ip route
    / # ping localhost
    PING localhost (127.0.0.1): 56 data bytes
    64 bytes from 127.0.0.1: seq=0 ttl=64 time=0.022 ms
    ^C

The situation here is very straightforward: no routes, and only the interface lo with the local address. The resolution of localhost is enabled in rkt by default, as it will generate a minimal /etc/hosts inside the pod if the image does not provide one.

[Figure: No pod network connectivity. The pod is isolated with no outside network connectivity.]

Host network - no isolation

rkt doesn't require you to isolate the pod's network stack. Complete access to the host network can be easily granted to a pod by setting the option --net=host on the rkt command line. This causes all the pod's processes to share the network namespace of the host. For a quick demonstration, let's compare the network namespaces of a host process and a pod process directly. We can check the process's network namespace by reading a link in the procfs.

    $ readlink /proc/self/ns/net
    net:[4026531969]
    $ sudo rkt run --interactive --net=host busybox-1.23.2-linux-amd64.aci --exec /bin/busybox -- readlink /proc/self/ns/net
    (...)
    net:[4026531969]

We just confirmed that the pod has not left the host's network namespace. The pod can employ the host's network connectivity and can also listen on all of the host's interfaces and ports.
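The same comparison can be scripted. The sketch below assumes a Linux host with procfs mounted; the helper name same_netns is our own and not part of rkt. It compares the namespace links of two processes, which is exactly what we did by eye above.

```shell
# same_netns PID_A PID_B: succeed if both processes are in the same
# network namespace. Returns 2 if either link cannot be read.
same_netns() {
    ns_a=$(readlink "/proc/$1/ns/net" 2>/dev/null) || return 2
    ns_b=$(readlink "/proc/$2/ns/net" 2>/dev/null) || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Compare a child process ("self") with the current shell ($$).
if same_netns self "$$"; then
    echo "shared network namespace"
else
    echo "separate network namespaces (or links not readable)"
fi
```

Reading the namespace link of a process owned by another user usually requires root, so expect the second branch when you point this at an arbitrary PID without sudo.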

[Figure: Shared host network namespace. The pod shares the host network namespace.]

Making DNS work

For most users, "connectivity" implies not only the network namespace, but also DNS configuration for name resolution.

To achieve a working DNS setup with just a busybox ACI and the previously shown rkt invocation, there's a little more work to do. This can be seen here, where a ping fails because the domain name coreos.com cannot be resolved:

    $ sudo rkt run --net=host busybox-1.23.2-linux-amd64.aci --exec /bin/busybox -- ping -c1 coreos.com
    (...)
    [57764.811928] busybox[4]: ping: bad address 'coreos.com'

We can improve things and provide full network connectivity by using the --dns option, which allows specification of a nameserver:

    $ sudo rkt run --dns 8.8.8.8 --net=host busybox-1.23.2-linux-amd64.aci --exec /bin/busybox -- ping -c1 coreos.com
    ...
    [57724.185917] busybox[4]: PING coreos.com (141.101.112.174): 56 data bytes
    [57724.186222] busybox[4]: 64 bytes from 141.101.112.174: seq=0 ttl=55 time=22.394 ms

Safety and compatibility notes

Leaving the pod in the host's network namespace has implications for isolation and security, some of which are explained in this section.

Restricted Network Administration Capabilities

Using --net=host with rkt doesn't automatically mean that the pod can reconfigure and change the whole network stack. For example, this attempt to deactivate the lo interface will fail:

    / # ip link set down dev lo
    ip: SIOCSIFFLAGS: Operation not permitted

Despite being UID 0 and inheriting the host's network namespace, the pod is not capable of altering the network configuration. This is a consequence of dropping the CAP_NET_ADMIN capability in rkt's stage1.
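If you'd like to check for the capability directly rather than provoke an error, the effective capability mask of a process is exposed in procfs. The following is a sketch under the assumption of a Linux procfs; has_cap_net_admin is a hypothetical helper name, and CAP_NET_ADMIN is capability number 12 on Linux.

```shell
# has_cap_net_admin HEXMASK: succeed if bit 12 (CAP_NET_ADMIN) is set in a
# capability mask written in hex, as shown in /proc/<pid>/status.
has_cap_net_admin() {
    [ $(( (0x$1 >> 12) & 1 )) -eq 1 ]
}

# Effective capabilities of the process reading the file (inherited from
# this shell). Empty if procfs is not available.
cap_eff=$(sed -n 's/^CapEff:[[:space:]]*//p' /proc/self/status 2>/dev/null)

if [ -n "$cap_eff" ] && has_cap_net_admin "$cap_eff"; then
    echo "CAP_NET_ADMIN present"
else
    echo "CAP_NET_ADMIN absent"
fi
```

Run inside a pod started with --net=host, we would expect this to report the capability as absent, matching the failed ip link command above.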

Network namespace resources exposed

Applications running in a pod that shares the host network namespace are able to access everything associated with the host's network interfaces: IP addresses, routes, iptables rules, and sockets, including abstract UNIX sockets. Depending on the host's setup, abstract UNIX sockets, used by applications like X11 and D-Bus, might expose critical endpoints to the pod's applications. This risk can be avoided by configuring a separate namespace for pods, which we cover in the next section.

Contained networking

So far, we've explored the two extremes of network isolation: full and none. CNI covers the ground in between, letting you configure pod networks with finer-grained containment.

The following sections show how rkt makes use of CNI by default, and how you can customize this behavior.

CNI - Plugins and configuration

In rkt, the upstream CNI plugin binaries are built into and shipped with the stage1 flavor named coreos, which is also the upstream default choice. First, we'll locate these on the disk to get a feeling for what's happening when a pod starts. To ensure they can be found, the stage1 aci needs to be extracted. Since this is done automatically when rkt starts its first pod, if you've followed all of this guide's steps so far, you should be all set.

The plugins that are used by default in rkt are ptp with host-local, and they can be found in the rkt tree store as plain binary files:

    $ sudo find /var/lib/rkt/cas/tree -regex ".*/\(host-local\|ptp\)" -exec dirname {} \; | uniq
    /var/lib/rkt/cas/tree/deps-sha512-b0d71ba92df0764c60df72b43bc15b3da391733e55f918de17105ac6da4277eb/rootfs/usr/lib64/rkt/plugins/net

Despite the lengthy, hash-derived directory name, you can see the extracted rootfs is nevertheless a standard file hierarchy. Both CNI plugins are in the net/ directory, along with the other plugins shipped in rkt:

    $ sudo ls -1 /var/lib/rkt/cas/tree/deps-sha512-b0d71ba92df0764c60df72b43bc15b3da391733e55f918de17105ac6da4277eb/rootfs/usr/lib64/rkt/plugins/net
    bridge
    dhcp
    flannel
    host-local
    ipvlan
    macvlan
    ptp

These binaries are executed by rkt after it loads the network configuration files and determines which plugins to run and which parameters to pass. The next section will bring you one step closer to understanding the connection between configuration files and the pod's network stack.
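To get a feel for what "executing a plugin" means, here is a rough sketch of the CNI calling convention: the runtime sets a handful of CNI_* environment variables and feeds the network configuration to the plugin on stdin. The stand-in plugin fake-ptp, the helper invoke_plugin, the container ID and the namespace path are all hypothetical; a real plugin would configure interfaces instead of echoing its inputs.

```shell
# Create a stand-in plugin that only reports what it was asked to do.
workdir=$(mktemp -d)
cat > "$workdir/fake-ptp" <<'EOF'
#!/bin/sh
conf=$(cat)   # the network configuration arrives on stdin
echo "command=$CNI_COMMAND ifname=$CNI_IFNAME"
EOF
chmod +x "$workdir/fake-ptp"

# invoke_plugin PLUGIN_PATH COMMAND CONFIG_JSON
# Runs the plugin the way a CNI runtime would: parameters in the
# environment, configuration on stdin. All values below are examples.
invoke_plugin() {
    printf '%s' "$3" | \
        CNI_COMMAND="$2" \
        CNI_CONTAINERID="example-pod" \
        CNI_NETNS="/var/run/netns/example" \
        CNI_IFNAME="eth0" \
        CNI_PATH="$(dirname "$1")" \
        "$1"
}

invoke_plugin "$workdir/fake-ptp" ADD '{"name":"default","type":"ptp"}'
```

The type field of the configuration is what selects the binary, which is why the plugin names in the directory listing match the type values you'll see in the configuration files.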

Default networking

A default network configuration ships with the coreos stage1 flavor of rkt, which makes it possible to pass its network name, default, as the argument to --net.

    {
        "name": "default",
        "type": "ptp",
        "ipMasq": true,
        "ipam": {
            "type": "host-local",
            "subnet": "172.16.28.0/24",
            "routes": [
                { "dst": "0.0.0.0/0" }
            ]
        }
    }

To understand what the default network does we'll analyze its configuration. Specified by the type attribute in the network's configuration, the default network uses the CNI PTP Plugin, and for ipam (IP Address Management) the network uses the host-local plugin with a private address range. Two important factors for connectivity to other networks are ptp's ipMasq and host-local's routes settings. IP masquerading is enabled to provide Network Address Translation, which together with the default route ("0.0.0.0/0" for IPv4), allows the pods to reach any other network the host is able to connect to.
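As a small exercise, the same building blocks can be recombined into a network of your own. The sketch below writes a variation of the default configuration to a temporary directory. The name example-net, the file name and the subnet 10.1.0.0/24 are hypothetical values chosen for illustration; rkt's networking documentation describes /etc/rkt/net.d as the place for local configuration files, so check the documentation for your version before copying anything there.

```shell
# Write a ptp + host-local network under a different name and subnet.
# "example-net" and 10.1.0.0/24 are made-up values for illustration.
conf_dir=$(mktemp -d)   # stand-in for rkt's local configuration directory
cat > "$conf_dir/10-example-net.conf" <<'EOF'
{
	"name": "example-net",
	"type": "ptp",
	"ipMasq": true,
	"ipam": {
		"type": "host-local",
		"subnet": "10.1.0.0/24",
		"routes": [
			{ "dst": "0.0.0.0/0" }
		]
	}
}
EOF

cat "$conf_dir/10-example-net.conf"
```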

Connect Pod to the Internet

The default network is the quickest way to allow a pod to connect to the Internet. Here's a quick inspection from inside the pod using the default network. Again, we make DNS lookups possible by providing a DNS server.

    $ sudo rkt run --dns 8.8.8.8 --interactive --debug busybox-1.23.2-linux-amd64.aci
    (...)
    2015/12/14 11:34:24 Loading network default with type ptp
    (...)
    / # ip -4 address
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    3: eth0@if17: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
        inet 172.16.28.2/24 scope global eth0
           valid_lft forever preferred_lft forever
    / # ip -4 route
    default via 172.16.28.1 dev eth0
    172.16.28.0/24 via 172.16.28.1 dev eth0  src 172.16.28.2
    172.16.28.1 dev eth0  src 172.16.28.2

The host is accessible via 172.16.28.1, which is also used as the gateway for the default route.

[Figure: CNI default network plus DNS. The pod has isolated network connectivity.]

Host-local address management

Using host-local for IPAM, every pod will receive an IP address from the subnet provided in the configuration file. Here's how it looks when three pods are run consecutively:

    $ for i in $(seq 0 2); do sudo rkt run busybox-1.23.2-linux-amd64.aci --exec /bin/busybox -- ip -4 address --- 2>&1 | grep "scope global"; done
    [77573.033304] busybox[4]: inet 172.16.28.2/24 scope global eth0
    [77578.738758] busybox[4]: inet 172.16.28.3/24 scope global eth0
    [77584.597918] busybox[4]: inet 172.16.28.4/24 scope global eth0
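The sequence .2, .3, .4 can be reproduced with a simplified model in which each reservation is recorded as a file named after the address. This is a sketch only: allocate_ip is a hypothetical helper, the store is a temporary directory, and the real host-local plugin may choose the next address differently.

```shell
# allocate_ip DIR PREFIX: reserve the first free address in
# PREFIX.2 .. PREFIX.254 by creating a file named after it in DIR,
# then print the address. Returns 1 if the range is exhausted.
allocate_ip() {
    host=2   # .1 is assumed to be taken by the host side (the gateway)
    while [ "$host" -le 254 ]; do
        if [ ! -e "$1/$2.$host" ]; then
            : > "$1/$2.$host"
            echo "$2.$host"
            return 0
        fi
        host=$((host + 1))
    done
    return 1
}

store=$(mktemp -d)   # temporary stand-in for the on-disk reservation store
allocate_ip "$store" 172.16.28
allocate_ip "$store" 172.16.28
allocate_ip "$store" 172.16.28
```

Deleting a file from the store frees its address again in this model, which is the role garbage collection plays for pods that have exited.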

CNI serializes the addresses for each pod to disk under /var/lib/cni/networks/default/172.16.28.$x, where $x is the host part of the address. Every pod that runs and loads the default network uses one IP address out of the /24 subnet. Therefore, keep in mind that:

  • you can only have 253 pods (256 addresses - 1 network - 1 broadcast - 1 gateway) allocated on the default network at the same time, and
  • you should always use rkt gc to free old allocations. On CoreOS, this happens periodically in the background via a systemd timer unit.
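The limit follows directly from the prefix length. As a quick sketch (usable_pod_addresses is a hypothetical helper name), a subnet with an n-bit prefix holds 2^(32-n) addresses, from which we set aside the network address, the broadcast address and the gateway:

```shell
# usable_pod_addresses PREFIX_LEN: addresses left for pods in an IPv4
# subnet after removing the network, broadcast, and gateway addresses.
usable_pod_addresses() {
    echo $(( (1 << (32 - $1)) - 3 ))
}

usable_pod_addresses 24   # the default network's /24
```

If you expect more pods than that on one host, a configuration with a shorter prefix gives you more room.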

Port forwarding

rkt can be instructed to set up iptables rules in order to forward packets on specific ports on the host to designated pods. The rules match the given ports on the host's addresses and forward packets received there to the pod's veth address.

If the image manifest lists such ports, pass the --port option when invoking rkt to configure these rules. The next example uses a modified busybox image that contains a port in the manifest as seen here:

    "ports": [
        {
            "name": "nc",
            "protocol": "tcp",
            "port": 1024,
            "count": 1,
            "socketActivated": false
        }
    ]

The next command will run the pod and redirect TCP port 1024, named nc in the manifest, from all host addresses to the pod's veth address. The pod will stop as soon as netcat exits after the connection from the remote end has been closed. We'll need two terminals for this one!

The first terminal runs rkt:

    $ sudo rkt run --dns 8.8.8.8 --interactive --debug --port nc:1024 busybox-1.23.2-pfwd-linux-amd64.aci --exec /bin/busybox -- nc -p 1024 -l
    (...)
             Starting Application=busybox Image=busybox...
    [  OK  ] Reached target rkt apps target.

The second terminal runs nc on the host:

    $ echo the host says HI | nc 172.16.28.1 1024

This will cause the rkt terminal to output the echoed message and terminate:

    the host says HI
    Sending SIGTERM to remaining processes...
    (...)

While any of the host's addresses could've been used to reach the pod, this example uses the host's IP address on the default network, which will be matched by the iptables rules as well. This address is the beginning of the subnet range 172.16.28.0/24 and is therefore 172.16.28.1.
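Deriving that address is simple arithmetic on the subnet's base address. A minimal sketch, assuming the input is written as NETWORK/PREFIX_LEN with the subnet's base address as NETWORK (first_host_address is a hypothetical helper):

```shell
# first_host_address CIDR: print the first usable address of an IPv4
# subnet, i.e. the base address with its last octet incremented by one.
first_host_address() {
    network=${1%/*}             # strip the "/24" suffix
    last_octet=${network##*.}   # keep only the final octet
    echo "${network%.*}.$((last_octet + 1))"
}

first_host_address 172.16.28.0/24
```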

You might have noticed that before the release of rkt 1.0, it was not possible to use the localhost address (127.0.0.1) to connect to forwarded ports. Don't give up hope! We've just merged NAT loopback, which makes it possible to reach forwarded ports via the loopback interface.
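To picture the kind of rule involved in port forwarding, the sketch below prints (but does not run) a generic iptables DNAT rule for the nc example. It is an assumption about the general shape of such a rule, not a dump of what rkt installs: the chain, the pod address 172.16.28.2 and the helper name dnat_rule are ours, and rkt's actual chains and matches may differ.

```shell
# dnat_rule PROTO HOST_PORT POD_IP POD_PORT: print an iptables command
# that would redirect traffic arriving on HOST_PORT to the pod.
# The command is only printed, never executed.
dnat_rule() {
    echo "iptables -t nat -A PREROUTING -p $1 --dport $2 -j DNAT --to-destination $3:$4"
}

# TCP port 1024 on the host, forwarded to port 1024 of a pod assumed to
# have received 172.16.28.2 on the default network.
dnat_rule tcp 1024 172.16.28.2 1024
```

To see what rkt really set up on your machine, list the nat table with sudo iptables -t nat -S while a pod with forwarded ports is running.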

Default restricted

As opposed to its sibling default, the default-restricted network doesn't provide a connection to the outside world. The differences between the two configurations make this immediately obvious:

    --- ./stage1/net/conf/99-default.conf 2015-11-25 09:47:23.255816427 +0100
    +++ ./stage1/net/conf/99-default-restricted.conf 2015-11-25 09:47:23.255816427 +0100
    @@ -1,12 +1,9 @@
     {
    -    "name": "default",
    +    "name": "default-restricted",
         "type": "ptp",
    -    "ipMasq": true,
    +    "ipMasq": false,
         "ipam": {
             "type": "host-local",
    -        "subnet": "172.16.28.0/24",
    -        "routes": [
    -            { "dst": "0.0.0.0/0" }
    -        ]
    +        "subnet": "172.16.28.0/24"
         }
     }

Having no IP masquerading and no default route, the pod cannot reach beyond its own subnet, which includes only the host's veth end and the other pods in the default-restricted network. The default-restricted network is automatically loaded if you pass a real network name to --net; that is, any name other than none, host, default or (for the sake of completeness) default-restricted.

Next steps

You've learned about getting started with rkt and CNI plugins, and we welcome your feedback. Keep up to date on rkt via the rkt-dev mailing list or by joining our appc and rkt community sync announced on the mailing list.

We will dive into more details about bridge, macvlan, dhcp and flannel plugins in the next installment of this blog series.

