Using Scala in AWS Lambda
Here at iHeartRadio we have made a significant investment in Scala and Akka for our microservice backend. We have also recently moved much of our infrastructure to AWS, which gives us far more freedom and flexibility in how we manage our infrastructure and deployments.
One of the really exciting technologies coming out of AWS is Lambda. Lambda allows you to listen for various “events” in AWS, such as file creation in S3, stream events from Kinesis, or messages from SQS, and then invoke your custom code to react to those events. Additionally, the applications you deploy to react to these events require no infrastructure of your own and are completely auto-scaled by Amazon (on what appears to be a cluster of containers). Lambda currently supports writing these applications in Python, Node.js, and Java.
For our CDN, we leverage Fastly for a lot of our web and API properties, which has been really powerful for us. However, sometimes we need detail about what’s happening with our traffic at a finer-grained level than the Fastly dashboards provide. For example, we may want to know the cache hit rate for a specific URL, or who our top referrers are broken down by browser. Fastly gives you the ability to ship logs in real time (to a remote syslog server, S3, etc.), but poring over gigabytes of logs with grep seemed less than ideal. Additionally, rolling out an entire log processing framework, along with a front end that could give us visualizations, groupings, time series, and facets, was going to be a fair lift as well.
NewRelic released an interesting product called “Insights,” which is, quite simply, a way to send arbitrary data events to their storage backend and perform relatively complex visual operations on that data: time series, facets, WHERE-clause filters, percentages, and so on. We quickly realized that if we could build a simple bridge from the real-time Fastly logs to the NewRelic backend, we would have a quick and powerful solution.
Lambda was a perfect fit for this, since we could easily ship logs in real time from Fastly to S3. Once we did that, we could write a Lambda function that gets invoked every time a new file is uploaded to our S3 bucket, parses the log file, and posts the events to NewRelic.
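The bridge itself boils down to turning each log line into an Insights event. A minimal sketch of that step (the tab-separated layout and field names here are hypothetical, not our actual log format):

```scala
// Hypothetical sketch: parse one tab-separated Fastly log line into an
// Insights-style event map. "eventType" is the one attribute Insights
// requires on every event; the other field names are our own invention.
object LogBridge {
  def parseLine(line: String): Map[String, Any] = {
    val Array(time, uri, status, hitMiss) = line.split('\t')
    Map(
      "eventType"  -> "Fastly",   // the event type queried in NRQL
      "timestamp"  -> time,
      "uri"        -> uri,
      "statusCode" -> status.toInt,
      "hitMiss"    -> hitMiss     // cache HIT or MISS, as logged
    )
  }
}
```

The resulting maps would then be batched and POSTed to the Insights collector endpoint for your account.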
Since Lambda supports Java, we spent some time experimenting with getting it to work with Scala. We eventually got it to work and open sourced our solution. We learned a few things along the way, however, which are outlined below.
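For reference, a handler on the Java runtime can be as plain as a class with a public method, which Lambda invokes via a handler string such as "com.example.Handler::handle". The class, method, and input key below are illustrative, not part of our actual project:

```scala
// Illustrative Scala handler for Lambda's Java runtime. Lambda
// instantiates the class and calls the configured method; note the
// Java type on the signature, since the runtime's serializer does
// not understand Scala collections.
class Handler {
  def handle(input: java.util.Map[String, String]): String = {
    // Assumed input shape: a JSON object with a "bucket" key.
    Option(input.get("bucket")).getOrElse("unknown")
  }
}
```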
First, only Java 8 is supported, so your build.sbt needs some configuration that targets 1.8 only:
javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")
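In context, a minimal build.sbt sketch might look like the following (the scalacOptions line is our assumption for keeping the Scala compiler on 1.8 bytecode as well; adjust for your project):

```scala
// build.sbt (sketch) -- compile both Java and Scala sources to 1.8 bytecode
javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")
scalacOptions += "-target:jvm-1.8"
```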
Additionally, the Lambda platform doesn’t understand native Scala types (primitives are fine), so you need to map types like Scala’s List to java.util.List. Importing scala.collection.JavaConverters._ into your source files handles this for you by providing the .asScala and .asJava helpers to convert in either direction.
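For example, a handler returning a Scala List would convert it at the boundary. A self-contained sketch using only the standard library (the object and method names are ours, for illustration):

```scala
import scala.collection.JavaConverters._

// Convert a Scala List to the java.util.List that Lambda's serializer
// expects, and back with .asScala for use inside your own code.
object Conversions {
  def toJava(uris: List[String]): java.util.List[String] =
    uris.asJava
  def toScala(uris: java.util.List[String]): List[String] =
    uris.asScala.toList
}
```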
Another issue that came up was dealing with threading and Futures. When using an async library (like Ning’s async HTTP client), you need to be careful about how you handle multi-threading. Lambda re-uses the handler instance between invocations, so you need to be careful about spawning work on another thread; Amazon recommends blocking the main thread until that work has completed. In our case, when firing an async webservice call, the proper behavior was to invoke the Future and then wrap the result in an Await.result().
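In other words, the end of the handler looks something like this sketch (the 30-second timeout is an arbitrary choice for illustration, not a Lambda requirement, and postEvents stands in for the real async HTTP call):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AsyncPost {
  // Stand-in for the async webservice call; in practice this would be
  // an HTTP POST to NewRelic returning a Future of the response.
  def postEvents(count: Int): Future[Int] = Future { count }

  def handleBatch(count: Int): Int =
    // Block the handler thread until the async work finishes, so Lambda
    // doesn't consider the invocation done while work is still in flight.
    Await.result(postEvents(count), 30.seconds)
}
```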
For Java, there is absolutely some JVM warmup time, which we noticed when there was low activity on the Lambda function. To avoid this, we increased the frequency at which Fastly pushed logs to S3, and we saw a 2x decrease in run time (from about 6 seconds to 3 seconds). Also note that reading from S3 was not particularly fast, so be sure to tweak the timeout for your function, especially for first runs.
Testing your Lambda function is not particularly simple either. We had to rely on a few unit tests to enable more rapid testing of the workflow. Getting your new jar published on Lambda is without a doubt a multi-step process, especially if your jar exceeds the 10MB limit: 1) upload your jar to S3, 2) publish a new function to Lambda using the S3 URL, and 3) review the CloudWatch logs to ensure everything worked.
Once everything was published and working, the results were pretty great. With data in Insights, we can run queries like:
SELECT count(uri) FROM Fastly SINCE 1 HOUR AGO WHERE uri LIKE '%/someuri%' FACET hitMiss TIMESERIES
This queries all URLs that match /someuri, facets on hitMiss (an arbitrary attribute we named to represent a Fastly cache hit or miss), and presents the result as a time series. This quickly shows the cache hit ratio for a pattern of URLs, which can be incredibly powerful.
Lastly, note that Fastly is pretty flexible in terms of what data you can send. The expected defaults are included, like time, URI, and status code, but you can also include almost any Varnish variable in the log format, which can be quite powerful.