Introducing Kanaloa: make your reactive service more resilient

We are announcing a new iHeartRadio open source project: Kanaloa. Kanaloa is a set of work dispatchers implemented using Akka actors. These dispatchers sit in front of your service and dispatch received work to it. They make your service more resilient through the following means:

  1. Auto scaling - it dynamically figures out the optimal number of concurrent requests your service can handle and makes sure that at any given time your service handles no more than that number of concurrent requests. This algorithm was also ported and contributed to Akka as the Optimal Size Exploring Resizer (although with some caveats).

  2. Back pressure control - this control is inspired by Little’s law. It rejects requests whose estimated wait time exceeds a certain threshold.

  3. Circuit breaker - when the error rate from your service goes above a certain threshold, the kanaloa dispatcher stops all requests for a short period of time to give your service a chance to “cool down”.

  4. Real-time monitoring - a built-in statsD reporter allows you to monitor a set of critical metrics (throughput, failure rate, queue length, expected wait time, service process time, number of concurrent requests, etc.) in real time. It also provides real-time insights into how kanaloa dispatchers are working. An example on Grafana: [image: Dashboard]
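To make the back pressure rule in item 2 more concrete, here is a minimal sketch of a Little’s law estimate. It is not kanaloa's actual implementation: the object name, the method names and the choice to accept everything before any throughput has been measured are assumptions made for illustration.

```scala
import scala.concurrent.duration._

object BackPressureSketch {
  // Little's law: L = lambda * W, so the expected wait is W = L / lambda,
  // where L is the queue length and lambda is the measured throughput.
  def estimatedWait(queueLength: Int, throughputPerSecond: Double): Option[FiniteDuration] =
    if (throughputPerSecond <= 0) None // assumption: no estimate until throughput is known
    else Some((queueLength / throughputPerSecond * 1000).toLong.millis)

  // Reject only when an estimate exists and it exceeds the allowed wait.
  def shouldReject(queueLength: Int, throughputPerSecond: Double, maxWait: FiniteDuration): Boolean =
    estimatedWait(queueLength, throughputPerSecond).exists(_ > maxWait)
}
```

With 50 queued requests and a throughput of 10 requests per second, the estimated wait is 5 seconds, so a 2 second threshold would reject new work while a 10 second threshold would still accept it.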

Disclosure: parts of the following description were adapted from the documentation of Akka’s OptimalSizeExploringResizer (also written by the author of this post).

Behind the scenes, kanaloa dispatchers create a set of workers that work with your service. These workers wait for results coming back from the service before they accept more work from the dispatcher. This way the dispatcher controls the number of concurrent requests it sends to the service. It auto-scales the worker pool to an optimal size that provides the highest throughput.
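The effect of workers waiting for results can be sketched without Akka at all. The following is a simplified, single-threaded model, not kanaloa's code: `Dispatcher`, `submit` and `workerFinished` are hypothetical names, and a real worker would forward the work to the service and wait for its reply rather than just being counted as busy.

```scala
import scala.collection.mutable

object PullDispatchSketch {
  final class Dispatcher[A](poolSize: Int) {
    private val queue = mutable.Queue.empty[A]
    private var busyWorkers = 0

    def inFlight: Int = busyWorkers
    def queued: Int = queue.size

    // New work is queued, then handed out only if a worker is free.
    def submit(work: A): Unit = {
      queue.enqueue(work)
      handOut()
    }

    // Called when a worker has received its result from the service.
    def workerFinished(): Unit = {
      require(busyWorkers > 0, "no worker is busy")
      busyWorkers -= 1
      handOut()
    }

    private def handOut(): Unit =
      while (busyWorkers < poolSize && queue.nonEmpty) {
        queue.dequeue() // a real worker would send this item to the service
        busyWorkers += 1
      }
  }
}
```

Because work is only handed out when a worker is free, the number of in-flight requests never exceeds the pool size, which is what makes the pool size a useful knob for the auto-scaling.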

This auto-scaling works best when you expect the function from pool size to performance to be convex, so that you can find the global optimum by walking towards a better size. For example, a CPU-bound service may have an optimal worker pool size tied to the number of CPU cores available. When your service is IO-bound, the optimal size is bound to the optimal number of concurrent connections to that IO service - e.g. a 4-node Elasticsearch cluster may handle 4-8 concurrent requests at optimal speed.

The dispatchers keep track of throughput at each pool size and perform the following three resizing operations (one at a time) periodically:

  1. Downsize if it hasn’t seen all workers fully utilized for a period of time.

  2. Explore to a random nearby pool size to try and collect throughput metrics.

  3. Optimize to a nearby pool size with better throughput metrics than any other nearby size.

When the pool is fully utilized (i.e. all workers are busy), the dispatcher randomly chooses between exploring and optimizing. When the pool has not been fully utilized for a period of time, the dispatcher will downsize the pool to the last seen max utilization multiplied by a configurable ratio.

By constantly exploring and optimizing, the dispatcher will eventually walk to the optimal size and remain nearby. When the optimal size changes it will start walking towards the new one.
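The three resizing operations described above can be sketched as pure functions. This is an illustration under stated assumptions rather than the kanaloa or Akka implementation: all names and default values in `Settings` are hypothetical, and this sketch jumps straight to the best nearby size when optimizing, whereas a real implementation may move more gradually.

```scala
import scala.util.Random

object ResizerSketch {
  // All names and defaults here are hypothetical, chosen for illustration.
  final case class Settings(
    lowerBound: Int = 1,
    upperBound: Int = 30,
    exploreStep: Int = 2,              // max distance of one exploration move
    chanceOfExploration: Double = 0.4, // explore vs. optimize when fully utilized
    downsizeRatio: Double = 0.8,       // applied to the last seen max utilization
    optimizationRange: Int = 4)        // what counts as a "nearby" size

  private def clamp(size: Int, s: Settings): Int =
    math.max(s.lowerBound, math.min(s.upperBound, size))

  def downsize(maxUtilization: Int, s: Settings): Int =
    clamp(math.ceil(maxUtilization * s.downsizeRatio).toInt, s)

  def explore(current: Int, s: Settings, rnd: Random): Int = {
    val step = 1 + rnd.nextInt(s.exploreStep)
    clamp(if (rnd.nextBoolean()) current + step else current - step, s)
  }

  def optimize(current: Int, throughputBySize: Map[Int, Double], s: Settings): Int = {
    val nearby = throughputBySize.filter {
      case (size, _) => math.abs(size - current) <= s.optimizationRange
    }
    if (nearby.isEmpty) current else clamp(nearby.maxBy(_._2)._1, s)
  }

  // One periodic resizing decision; at most one operation is applied.
  def nextSize(
      current: Int,
      fullyUtilized: Boolean,
      underutilizedTooLong: Boolean,
      maxUtilization: Int,
      throughputBySize: Map[Int, Double],
      s: Settings,
      rnd: Random): Int =
    if (underutilizedTooLong) downsize(maxUtilization, s)
    else if (!fullyUtilized) current
    else if (rnd.nextDouble() < s.chanceOfExploration) explore(current, s, rnd)
    else optimize(current, throughputBySize, s)
}
```

Keeping the decision a pure function of the recorded metrics makes each operation easy to reason about in isolation: repeated calls with fresh throughput data are what move the pool towards the best-performing size over time.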

If auto-scaling is all you need, you should consider using OptimalSizeExploringResizer in an Akka router. (The caveat is that you need to make sure that your routees block while handling messages, i.e. they can’t simply pass work asynchronously to the backend service.)
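As a rough illustration of the Akka-only route, a pool router can be given this resizer through deployment configuration. The fragment below follows the shape shown in the Akka routing documentation; the actor path `/parent/workerRouter` is a hypothetical placeholder, and the values are examples that should be checked against the `reference.conf` of your Akka version.

```
akka.actor.deployment {
  # hypothetical router path; use the path of your own pool router
  /parent/workerRouter {
    router = round-robin-pool
    optimal-size-exploring-resizer {
      enabled = on
      # how often a resizing operation is attempted
      action-interval = 5s
      # how long the pool may stay underutilized before it is downsized
      downsize-after-underutilized-for = 72h
    }
  }
}
```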

For instructions on installation and usage, please check the GitHub page.

For any other questions, feel free to ask in the Gitter channel or submit a GitHub issue. Any contributions and/or feedback are more than welcome.

Special Thanks

The auto scaling algorithm was first suggested by @richdougherty. @ktoso kindly reviewed this library and provided valuable feedback.