Tuesday, November 10, 2015

How about 130x faster stats?
http://xen-orchestra.com/blog/how-about-130x-faster-stats/

Great news for those using our live stats and even our data visualizations! We managed to hugely boost performance when fetching XenServer statistics (RRDs).

Until now

This isn't the first time we've made a tremendous leap in performance. This time, the culprit was also something very annoying: fetching the metrics from a host or a VM, every 5 seconds.

Basically, this is what we did before:

  1. on a host/VM view, refresh the latest stats every 5 seconds
  2. each request on a client (xo-web) will trigger a request on xo-server
  3. and it will fetch the last 120 data points (i.e. the last 10 minutes)
  4. XenServer returns a huge XML that we have to parse (remember, every 5 seconds) 
  5. we send a JSON with all the values to xo-web

And yes, this was very CPU intensive. Guess why? Hint: XML. Again.

Solutions

We worked on two aspects: removing CPU-intensive operations and providing a cache.

XML: usual suspect

Exactly as with our previous performance issue, we solved the problem by taking XML out of the equation.

Some figures maybe? On a small host (meaning only 4 CPUs), we spent ~1300ms waiting for XML parsing. Every 5 seconds. Imagine a host with a LOT of CPUs: you have to parse a bigger XML document, because more CPUs means more entries to parse.

We managed to find a way to avoid XML, which we discovered by reading the source code of XenCenter. There are some issues related to that (some metrics are not accessible the way we wanted), but in the end we found workarounds to get everything.

In short: it wasn't trivial but we made it.

What about the same request now? How long do we have to block the event loop to get stats? Answer: 0ms.

Okay, great, but how about the total execution time for the same function? It's now under 10ms (compared to ~1300ms before, hence the 130x), and again, without blocking the event loop.

Cache

While rewriting the whole statistics code, we also worked on an intelligent cache system.

Mainly, it does two things:

  • If you have multiple clients fetching stats on the same VM (or host), we'll use the result of the first client's request to give data to the others. Thus, you can't have more than 1 request every 5 seconds (the minimal granularity).
  • You'll only request the data you need, e.g. only the last 5 seconds and not the 119 previous points if you already have them.
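
To make the two cache rules above more concrete, here is a minimal sketch in TypeScript. It is not the actual xo-server code: StatsCache, Fetcher, Point and the injected clock are hypothetical names made up for illustration, and the fetcher is synchronous only to keep the example short (a real one would be asynchronous, and the cache would then share the pending promise between clients).

```typescript
// Hypothetical sketch, NOT the real xo-server implementation.
interface Point {
  timestamp: number; // seconds
  value: number;
}

// Assumed helper: returns every point strictly newer than `since`.
type Fetcher = (objectId: string, since: number) => Point[];

const STEP_SECONDS = 5; // minimal granularity of the stats
const MAX_POINTS = 120; // 120 points x 5 s = the last 10 minutes

interface Entry {
  points: Point[];
  lastFetch: number; // when we last asked upstream
  lastTimestamp: number; // newest point we already hold
}

class StatsCache {
  private entries = new Map<string, Entry>();
  private fetcher: Fetcher;
  private now: () => number;
  upstreamRequests = 0; // only here to make the effect visible

  constructor(fetcher: Fetcher, now: () => number) {
    this.fetcher = fetcher;
    this.now = now;
  }

  getStats(objectId: string): Point[] {
    const current = this.now();
    let entry = this.entries.get(objectId);

    // Rule 1: another client asked less than one step ago, reuse its result.
    if (entry !== undefined && current - entry.lastFetch < STEP_SECONDS) {
      return entry.points;
    }

    if (entry === undefined) {
      // First request for this object: ask for the whole 10 minute window.
      entry = {
        points: [],
        lastFetch: current,
        lastTimestamp: current - MAX_POINTS * STEP_SECONDS,
      };
      this.entries.set(objectId, entry);
    }

    // Rule 2: only request what is newer than the points we already hold.
    const fresh = this.fetcher(objectId, entry.lastTimestamp);
    this.upstreamRequests++;
    for (const point of fresh) {
      entry.lastTimestamp = Math.max(entry.lastTimestamp, point.timestamp);
    }
    entry.points = entry.points.concat(fresh).slice(-MAX_POINTS);
    entry.lastFetch = current;
    return entry.points;
  }
}
```

With this shape, any number of clients watching the same VM cost at most one upstream request per 5 second step, and each of those requests only carries the points that are actually missing.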

Yeah, stats are going faster now.

Conclusion

  • almost no CPU usage on xo-server (thus on XOA) during stats fetching
  • graphs are loaded almost instantly (when you visit a VM or a host view)
  • dataviz on RRD metrics is also 10 to 100 times faster to display
  • far better scalability for XOA clients connected at the same time

This will be available in 4.8, which will be out very soon!