Neville Li

Learn about Scio, a Scala API for Google Cloud Dataflow (incubated as Apache Beam). Apache Beam offers a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level APIs many data engineers are already familiar with, e.g. Spark and Scalding. Neville will cover the design and implementation of the framework, including features like type-safe BigQuery macros, a REPL, and serialization. There will also be a live coding demo.

Neville is a software engineer at Spotify who works mainly on data infrastructure and tools for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and old school distributed systems like MPI.

This talk was given at the NYC Data Engineering meetup in June 2016.

Adam Denenberg

Here at iHeartRadio we have made a significant investment in Scala and Akka for our microservice backend. We have also recently invested in moving a lot of our infrastructure over to AWS, which gives us much more freedom and flexibility in how we manage our infrastructure and deployments.

One of the really exciting technologies coming out of AWS is Lambda. Lambda allows you to listen to various “events” in AWS, such as file creation in S3, stream events from Kinesis, or messages from SQS, and then invoke your custom code to react to those events. Additionally, the applications you deploy to react to these events require no infrastructure and are completely auto-scaled by Amazon (on what seems like a cluster of containers). Currently, Lambda supports writing these applications in Python, Node, and Java.

For our CDN, we leverage Fastly for a lot of our web and API properties, which has been really powerful for us. However, sometimes we need detail on what’s happening with our traffic at a finer-grained level than the Fastly dashboards provide. For example, we may want to know the cache hit rate for a specific URL, or who our top referrers are, broken down by browser. Fastly gives you the ability to ship logs in real time (to a remote syslog server, S3, etc.), but poring over gigabytes of logs with grep seemed less than ideal. Additionally, rolling out an entire log processing framework, along with a front-end that could give us visualizations, groupings, time series, and facets, was going to be a fair lift as well.

NewRelic released an interesting product called “Insights,” which is, quite simply, a way to send arbitrary data events to their storage backend and run relatively complex visual operations on that data, such as time series, facets, WHERE-clause filters, percentages, etc. We quickly realized that if we could build a simple bridge from the real-time Fastly logs to the NewRelic backend, we would have a quick and powerful solution.

Lambda was a perfect fit for this, since we could easily ship logs in real-time to S3 from Fastly. Once we did that, we could write a Lambda function that would get invoked every time a new file was uploaded to our S3 bucket, parse the logfile and post the events to NewRelic.

Since Lambda supports Java, we spent some time experimenting with getting it to work in Scala. We eventually got it working and open sourced our solution. We learned a few things along the way, which I’ve outlined below.

Firstly, only Java 8 is supported, so your build.sbt needs configuration that targets 1.8 only:

javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")
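Lambda also expects a single deployable jar. A minimal build.sbt sketch, assuming the sbt-assembly plugin is enabled in project/plugins.sbt (the jar name here is a hypothetical, not our actual project's):

```scala
// build.sbt (sketch) -- target Java 8 and produce one fat jar for Lambda
javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")

// sbt-assembly bundles all dependencies into a single jar for upload
assemblyJarName in assembly := "fastly-to-insights.jar"
```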

Additionally, the Lambda platform doesn’t understand native Scala types (it’s fine for primitives), so you need to map things like a Scala List to a java.util.List. Importing scala.collection.JavaConverters._ into your source files handles this for you, giving you the .asScala and .asJava helpers to convert in either direction.
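For example, a sketch of converting at the Java/Scala boundary (the values here are illustrative):

```scala
import scala.collection.JavaConverters._

// Lambda's Java API hands you (and expects back) java.util collections,
// so convert at the boundary and work with Scala collections internally.
val scalaList = List("a", "b")
val javaList: java.util.List[String] = scalaList.asJava   // Scala -> Java
val backToScala: List[String] = javaList.asScala.toList   // Java -> Scala
```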

Another issue that came up was dealing with threading and Futures. When using an async library (like Ning’s), you need to be careful about how you handle multi-threading. Lambda re-uses the handler instance between invocations, so you need to be careful about spawning work on another thread; Amazon recommends blocking the main thread until that work is complete. In our case, when firing an async web service call, the proper behavior was to invoke the future and then wrap the result in an Await.result().
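That pattern looks roughly like the following sketch; postEvent is a hypothetical stand-in for the async HTTP call, not our actual client code:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Stand-in for an async web service call (e.g. via Ning's async HTTP client);
// it would return the HTTP status code of the POST to NewRelic.
def postEvent(payload: String): Future[Int] = Future {
  // ... fire the request ...
  200
}

// Block the handler thread until the background work completes, so Lambda
// doesn't consider the invocation finished while requests are in flight.
val status = Await.result(postEvent("""{"uri": "/someuri"}"""), 10.seconds)
```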

For Java, there is definitely some JVM warm-up time, which we noticed when there was low activity on the Lambda function. To mitigate this, we increased the frequency with which Fastly pushed logs to S3 and saw a 2x decrease in run time (from about 6 seconds to 3 seconds). Also note that reading from S3 was not particularly fast, so be sure to tweak the timeout for your function, especially for first runs.

Testing your Lambda function is not particularly simple either. We had to rely on a few unit tests to enable more rapid testing of the workflow. Getting your new jar published on Lambda is without a doubt a multi-step process, especially if your jar exceeds the 10 MB limit. You need to 1) upload your jar to S3, 2) publish a new function to Lambda using the S3 URL, and 3) review the CloudWatch logs to ensure everything worked.

Once everything was published and working, the results were pretty great. With data in Insights, we can run queries like:

SELECT count(uri) FROM Fastly SINCE 1 HOUR AGO WHERE uri LIKE '%/someuri%' FACET hitMiss TIMESERIES

This queries all URLs that match /someuri, facets on hitMiss (an arbitrary attribute we named that represents a Fastly cache hit or miss), and presents the result as a time series. This quickly shows the cache hit ratio for a pattern of URLs, which can be incredibly powerful.

Lastly, note that Fastly is pretty flexible in terms of what data you can send. The expected defaults like Time, URI, StatusCode, etc. are included, but you can also include almost any Varnish variable in the log format, which can be quite powerful.

Owein Reese

Owein Reese, Senior Engineer at MediaMath, gives an overview of Autolifts, an open source dependently typed library for auto lifting and auto mapping of functions, built in Scala.

Autolifts takes advantage of Scala’s advanced type system to yield a set of abstractions for working with complex objects. We’ll introduce the concept of lifting and why you might want to incorporate this pattern in your code. Then we’ll show how the library takes that concept and mixes it with dependent types and implicit extensions to automatically lift in a type-safe manner. Finally, we’ll show how using these extensions simplifies code, reducing boilerplate while making code more easily understood and maintained.
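To make the lifting idea concrete, here is a minimal sketch of the underlying pattern; this is a generic Functor-based lift, not Autolifts' actual API:

```scala
import scala.language.higherKinds

// A minimal Functor type class; Autolifts generalizes this idea much further,
// lifting functions automatically through arbitrarily nested structures.
trait Functor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

implicit val optionFunctor: Functor[Option] = new Functor[Option] {
  def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
}

// "Lifting" turns a plain function A => B into F[A] => F[B].
def lift[F[_], A, B](f: A => B)(implicit F: Functor[F]): F[A] => F[B] =
  fa => F.map(fa)(f)

val liftedInc: Option[Int] => Option[Int] = lift[Option, Int, Int](_ + 1)
```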

Owein Reese has been a full-time Scala developer for over five years and has spoken at several Scala meetups and conferences. He has several open source projects and leads several engineering teams at MediaMath.



Kailuo Wang

Announcing a new iHeartRadio open source project: Kanaloa. Kanaloa is a set of work dispatchers implemented using Akka actors. These dispatchers sit in front of your service and dispatch received work to it. They make your service more resilient in the following ways:

  1. Auto scaling - it dynamically figures out the optimal number of concurrent requests your service can handle, and makes sure that at any given time your service handles no more than that number of concurrent requests. This algorithm was also ported and contributed to Akka as the Optimal Size Exploring Resizer (though with some caveats).

  2. Back pressure control - inspired by Little’s law, it rejects requests whose estimated wait time exceeds a certain threshold.

  3. Circuit breaker - when the error rate from your service goes above a certain threshold, the Kanaloa dispatcher stops all requests for a short period of time to give your service a chance to “cool down”.

  4. Real-time monitoring - a built-in statsD reporter allows you to monitor a set of critical metrics (throughput, failure rate, queue length, expected wait time, service process time, number of concurrent requests, etc.) in real time, for example on a Grafana dashboard. It also provides real-time insights into how Kanaloa dispatchers are working.
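The back pressure idea in item 2 can be sketched in a few lines. Little’s law says L = λW (queue length equals arrival rate times wait time), so expected wait is roughly queue length divided by observed throughput. The names and numbers below are illustrative, not Kanaloa’s actual implementation:

```scala
// Little's law: L = lambda * W, so estimated wait W = L / lambda.
final case class QueueStats(queueLength: Int, throughputPerSecond: Double)

def estimatedWaitSeconds(s: QueueStats): Double =
  s.queueLength / s.throughputPerSecond

// Reject new work when the estimated wait exceeds the configured threshold.
def shouldReject(s: QueueStats, maxWaitSeconds: Double): Boolean =
  estimatedWaitSeconds(s) > maxWaitSeconds
```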

Unknown author

In this talk, Adam Gibson presents the ND4J framework with an iScala notebook.
Combined with Spark's DataFrames, this makes real data science viable in Scala. ND4J is "Numpy for Java." It works with multiple architectures (or backends), allowing for runtime-neutral scientific computing as well as chip-specific optimizations, all while writing the same code. Algorithm developers and scientific engineers can write code for a Spark, Hadoop, or Flink cluster while keeping the underlying computations platform-agnostic. A modern JVM runtime with the capability to work with GPUs lets engineers leverage the best parts of the production ecosystem without having to pick which scientific library to use.

Adam Warski

Spray, once a stand-alone project and now part of Akka, is a toolkit for building and consuming REST services. SoftwareMill CTO and co-founder Adam Warski demos how to build a simple REST service with Spray, and then consume it with a Spray-based client. He shows how new routes can be added very quickly, how to use type-safe query and path parameters, and how to create custom directives, reusing existing code.


This talk was given at the Scala Bay meetup hosted at SumoLogic in SF.

Chris Sachs

Type members and path-dependent types breathe new life into objects. This talk focuses on type abstraction and how to wield it effectively to create simple, robust, performant designs. Chris Sachs (Senior Software Engineer, Sensity Systems) designed the experimental Scala Basis library, which implements simple collection interfaces with macro-optimized extensions providing all auxiliary functions, efficient container implementations, and more.

Paul Phillips

In this talk, Paul Phillips, co-founder of Typesafe, will talk about collections for Scala. Based on Paul's extensive experience with Scala collections, he decided to write his own. According to Paul, "The focus is much tighter: immutable, performant, predictable, correct." His talk "will alternate between why the scala collections manage none of those things, and how I hope to do better." This talk was recorded at the Scala Bay meetup at LinkedIn.

Bill Venners

In this talk, Bill Venners talks about ScalaTest 2.0 and how it is a vast enough forest that it can be hard to see all the useful trees. Bill will guide you through ScalaTest 2.0 by pointing out lesser-known features of the testing toolkit that can help you get stuff done. You'll gain insight into what ScalaTest offers, why ScalaTest is designed the way it is, and how you can get more out of it. This talk was recorded at the Scala Bay meetup at Box.