Batch Data Processing at Spotify by Erik Bernhardsson

Speaker: Erik Bernhardsson, Tech Lead, Spotify

Erik is the Tech Lead of the discovery team at Spotify.

Previously he was head of the Business Intelligence team in the Stockholm office, where he was responsible for collecting, aggregating, and making sense of all of Spotify's data.

About the talk:

Recorded at our NYC Data Engineering meetup, this talk covers Luigi, a recently open-sourced Python framework that simplifies batch data processing, helping you build complex pipelines of batch jobs, handle dependency resolution, and create visualizations to help manage multiple workflows.

It also comes with Hadoop support built in (and that's really where its strength becomes clear). Luigi provides an infrastructure that powers several Spotify features including recommendations, top lists, A/B test analysis, external reports, internal dashboards and many more.

Read more about it on Github.


Photo from the event
About batch data processing

Batch data processing, at its most basic, is the execution of a series of scripts one after another without the need for human intervention. The concept is simple and has been around for a long time. What makes it interesting for us today is Luigi, Spotify's open-source batch processing framework: it is written in Python, can be used by Python developers working in high-scale environments, and is reliable and fault-tolerant.

Additionally, one of Luigi's key advances is that it is built to operate at high scale. This is noteworthy because batch processing, despite being automated, has traditionally adapted poorly to high-scale environments.

Background on Hadoop

Hadoop is an open-source software framework for the distributed processing of large data sets across clusters of computers. It was created by Doug Cutting and Mike Cafarella in 2005 and is licensed under the Apache License 2.0. Hadoop is commonly used for processing large volumes of data in batch, and it provides many of the necessary building blocks for data processing.
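The MapReduce model at the heart of Hadoop can be illustrated with a word count, its canonical example. The sketch below is a plain-Python simulation of the map, shuffle, and reduce phases, not actual Hadoop code; the function names and sample lines are made up for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_phase(pairs):
    # Shuffle groups pairs by key; the reducer sums the counts per word
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


lines = ["the quick brown fox", "the lazy dog"]
word_counts = reduce_phase(map_phase(lines))
# word_counts["the"] == 2
```

On a real cluster, Hadoop runs many mapper and reducer processes in parallel across machines and handles the shuffle between them; the logic per phase stays this simple.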

For more background on Hadoop, take a look at our Hadoop ETL article or our article and video about the Stinger Initiative at Hortonworks. For examples of how Hadoop is used in practice with different languages, see our Scala and Hadoop article.