Tuesday, February 23, 2016

Improving Kubernetes Scheduler Performance [feedly]

Improving Kubernetes Scheduler Performance
// CoreOS Blog

Scalability is one of the important factors that make a successful distributed system. At CoreOS, we love Kubernetes and help drive the project forward through upstream contributions. Last November, we formed a team focused on Kubernetes scalability. We set out with a goal to test and understand Kubernetes cluster performance to gain insight into cluster behavior under large workloads and learn where there are performance issues.

Very soon after starting this work, we found a number of bottlenecks. By resolving them we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds in the Kubernetes scheduler benchmark. In this post, we share the story of how we accomplished this more than tenfold performance improvement, and offer ideas for how the community can push Kubernetes scaling even further.

Kubernetes Architecture Overview

To understand and improve Kubernetes performance, we first need to establish a baseline of current capacity. A Kubernetes cluster consists of a set of API nodes that control and manage cluster operations on a set of worker nodes where application pods are run. We initially focused on the control plane components because their involvement in every scheduling operation, cluster-wide, offered the largest potential for improvement. The Kubernetes architecture is outlined in the schematic below.

Kubernetes cluster architecture
Kubernetes cluster architecture

Experiments with Kubemark

The Kubernetes community recently introduced Kubemark for testing the performance of control plane components. Kubemark simulates the worker nodes and can simulate 18 virtual nodes on a single CPU core, allowing us to simulate large clusters with minimized complexity and expense.

Our first use of Kubemark was a density test that measures how long it takes to schedule 30 pods onto every node. For a 100-node cluster there are 3,000 pods to schedule and run; for a 1,000-node cluster there would be 30,000 pods. These test results show us the number of pods at different phases. For all our tests, the cluster was connected to an etcd cluster running on a separate machine.

Pods:  229 out of 3000 created,  211 running, 18 pending, 0 waiting  Pods:  429 out of 3000 created,  412 running, 17 pending, 0 waiting  Pods:  622 out of 3000 created,  604 running, 17 pending, 1 waiting  ...  Pods: 2578 out of 3000 created, 2561 running, 17 pending, 0 waiting  Pods: 2779 out of 3000 created, 2761 running, 18 pending, 0 waiting  Pods: 2979 out of 3000 created, 2962 running, 16 pending, 1 waiting  Pods: 3000 out of 3000 created, 3000 running, 0 pending,  0 waiting  

We started our investigation with a 100-node test cluster and it completed scheduling 3,000 pods in about 150 seconds. The pod throughput was ~20 pods/sec. To understand the log better, we developed a plot tool to draw the graph.

The graph below shows the number of pods created but unscheduled by a replication controller ("created") and shows them as "running" once they are scheduled to a machine in the cluster.

Baseline Kubernetes pod scheduling throughput
Baseline Kubernetes pod scheduling throughput

We quickly noticed that the graph showed a linear rate of pod creation (the "created" state) at 20 pods per second, which seemed low, indicating a potential bottleneck and target for improvement. We started our performance journey here.

Journey to Better Throughput

The first thing that came to mind is perhaps a hardware resource was starved. If it was, we could improve the performance of the scheduling by simply adding more resources. We ran the test again and monitored CPU/memory/IO usage with the top utility. The results looked like this:

Resource usage during the test
Resource usage during the test

As far as we observed, none of the physical resources were fully utilized. This implied the bottleneck was a software issue.

Finding bottlenecks in a large codebase, like Kubernetes, is seldom trivial. We needed to narrow the scope step-by-step. Fortunately, Kubernetes components provide metrics (collected by Prometheus) for most end-to-end API calls. Our first step was to look into that. We ran the test again and monitored the scheduler's metrics. The results looked like this:

# HELP scheduler_e2e_scheduling_latency_microseconds E2e scheduling latency (scheduling algorithm + binding)  # TYPE scheduler_e2e_scheduling_latency_microseconds summary  scheduler_e2e_scheduling_latency_microseconds{quantile="0.5"} 23207  scheduler_e2e_scheduling_latency_microseconds{quantile="0.9"} 35112  scheduler_e2e_scheduling_latency_microseconds{quantile="0.99"} 40268  scheduler_e2e_scheduling_latency_microseconds_sum 7.1321295e+07  

We found the scheduler end-to-end latency was 7ms, equivalent to a throughput of 140 pods/sec, much higher than the 20 pods/sec observed in our tests. This implied that the bottleneck was outside of the scheduler itself.

We continued narrowing down the problem by observing metrics and logs. We found the bottleneck was in the replication controller. Looking at the code:

wait.Add(diff)  for i := 0; i < diff; i++ {  go func() {  defer wait.Done()  if err := rm.podControl.CreatePods(...); err != nil {  ...  }  }()  }  wait.Wait()  

We printed logs and found that waiting for 500 CreatePods() took 25 seconds. That's exactly 20 pods/sec. But the latency of a CreatePod() API call is only 1ms.

Before long, we found a rate limiter inside the client. The rate limiter was used to protect the API server from being overwhelmed. In our test, we didn't need it. We increased the limits and wanted to see the boundary of scheduling rate. It didn't solve the problem completely.

In the end, we found one more rate limiter and also fixed a minor inefficient code path (#17885). After those changes, we were able to see our final improvement. In the 100-node setup, the average creation rate went up from 20 pods/sec to 300+ pods/sec, and the work finished in 50 seconds with average pod throughput being 60 pods/sec.

Increased pod scheduling throughput from our changes
Increased pod scheduling throughput from our changes

A Change in Focus to the Scheduler

We then tried to test our improvement on a 1,000-node cluster. However, we weren't lucky this time. See the plot:

Test running against a 1,000 node Kubernetes cluster
Test running against a 1,000 node Kubernetes cluster

It took 5,243 seconds to run 30,000 pods on 1,000-node cluster with average pod throughput of 5.72 pods/sec. The creation rate kept increasing but the running rate remained low.

We then ran the test again and looked into scheduler's metrics. This time, scheduling latency became 60ms at the beginning and increased to 200ms at the end (30,000 pods). Combining it with what we saw in the log and plot, we then realized two points:

  • 60ms latency is very high. The scheduler likely became the bottleneck. If we could reduce it and thus increase the scheduling rate, then the pod running rate is very likely to be higher.
  • Scheduling latency increased as the total number of scheduled pods increased, which led to decreased cluster pod running rate. This was a scalability issue in the scheduler.

The scheduler codebase is complex and we needed fine-grained profiling to understand where the scheduler was spending time. But, repeating the same process on Kubemark was time-intensive, as our test took longer than two hours to complete. We wanted a more lightweight method to do testing of the scheduler component to focus our effort and time. Thus, we wrote a scheduler benchmark tool as a Go unit test. The tool tested the scheduler as a whole without starting unnecessary components. More details can be found in our slides about the scheduler performance test and in Kubernetes Pull Request #18458.

By using the benchmarking tool, we were able to work efficiently and break through the bottleneck of scheduler for 1,000 nodes in a 30,000-pod setup.

For example, in Kubernetes issue #18126, we had the following pprof result:

pprof output of slow rounding
pprof output of slow rounding

We saw the Round() method took 18 seconds out of 79 seconds of total time. We thought that was inefficient for rounding a number. We fixed it with more efficient implementation (PR #18170). In the end, we reduced average scheduling latency from 53 seconds to 23 seconds for scheduling 1,000 pods on 1,000 nodes.

This way, we were able to dig out more and more inefficient code that had become a performance bottleneck. We reported and improved them in the following upstream issues:

By applying those changes, we gained an incredible performance improvement – scheduling throughput of 51 pods/sec. We ran the scheduler benchmark of scheduling 30,000 pods on 1,000 nodes again and compared it with previous result. See the following plots:

Re-running the scheduler benchmark
Re-running the scheduler benchmark

Note that this process took longer to run than Kubemark, which was probably caused by garbage collection. We put everything in the same program, while Kubemark ran the scheduler, API server, and controller manager in separate processes. But, this difference doesn't affect the comparison result, given this an apples-to-apples comparison.

Current State and Next Steps

With our optimization applied, we re-ran the Kubemark "1,000 nodes/30,000 pods" test suite again. The plot from the result is below:

Re-running the 1,000 node test
Re-running the 1,000 node test

Now, the average pod throughput is 16.30 pods/sec, which was 5.72 pods/sec before. First, we can see the improvement on throughput. Second, this was still lower than the result from our scheduling benchmark. We are sure that the scheduler was unlikely the bottleneck at this point. There are many other factors like remote call latencies, garbage collection in the API server, etc. This could be a good next step to explore future performance improvements.

Scaling Kubernetes

In this blog, we talked about how we analyzed experiment results like metrics and CPU profiling to identify performance bottlenecks and improve the scheduler.

To better understand the scheduler, we contributed a benchmarking tool, which we've used to validate our performance improvements. With the work so far, we reduced time to schedule 30,000 pods onto 1,000 nodes from 8,780 seconds to 587 seconds with a total of 4 pull requests to Kubernetes.

Although the techniques described here were very simple, sharing our thought process will help aid others in debugging complex distributed systems by breaking it down into manageable chunks of investigation.

The key takeaways are:

  • Metrics enable a convenient and much-needed view into the system.
  • Benchmarking is a good way to illustrate performance and reveal any potential problems.
  • Plotting graphs is an even better way to get the introspection into the system over time.

Open-source is at the center of everything we do at CoreOS, and we're happy to help increase Kubernetes performance and scale alongside the rest of the community.

Thanks to the Kubernetes Community

During this work, we received a lot of help from the Kubernetes community. We want to thank the SIG-Scale group for discussing the scalability related issues with us. We also want to thank David Oppenheimer, Daniel Smith, Marek Grabowski, Timothy St. Clair, Wojciech Tyczynski, and Xiang Li for their help on this work.

Join the CoreOS Team

If you are an engineer looking to work to improve the scale, security, and resilience of internet systems by working on containers, operating systems, and distributed systems, consider joining CoreOS. Learn more about our available positions in New York, San Francisco, and Berlin on our careers page.


Shared via my feedly reader

Sent from my iPhone