
Nick Elprin, founder of Domino Data Lab, talks about how to deploy predictive models into production, specifically in the context of a corporate enterprise use case. Nick demonstrates an easy way to “operationalize” your predictive models by exposing them as low-latency web services that can be consumed by production applications. In the context of a real-world use case this translates into more subtle requirements for hosting predictive models, including zero-downtime upgrades and retraining/redeploying against new data. Nick also focuses on the best practices for writing code that will make your predictive models easier to deploy.
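A minimal sketch of the pattern Nick describes, exposing a model behind a low-latency HTTP endpoint, might look like the following. The linear model, coefficients, feature names, and port are all invented for illustration; this is not Domino's actual implementation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pre-trained model coefficients (illustrative only).
COEFFS = {"intercept": 0.5, "age": 0.02, "income": 0.00001}

def score(features):
    """Apply a toy linear model to a dict of numeric features."""
    z = COEFFS["intercept"]
    z += COEFFS["age"] * features.get("age", 0)
    z += COEFFS["income"] * features.get("income", 0)
    return z

class PredictHandler(BaseHTTPRequestHandler):
    """POST a JSON feature dict, get a JSON prediction back."""
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": score(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve for real (this call blocks forever):
# HTTPServer(("", 8000), PredictHandler).serve_forever()
```

Keeping `score()` separate from the transport layer is part of what makes zero-downtime upgrades tractable: a retrained model can be loaded and swapped in behind the same endpoint.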


This video was recorded at the SF Data Mining meetup at Runway.io in SF.

Anna Smith

Anna Smith from Rent the Runway talks about how they've evolved their data pipeline over time to deal with infrastructure constraints, disparate data sources, and changing data sources/quality all while still serving reports and data back to the website with minimal downtime. Anna also covers how they leveraged Luigi to ensure robust reporting without forcing non-technical analysts to learn Python.


This video was recorded at the NYC Data Engineering meetup at Spotify in NYC.


Interested in learning more about data engineering and data science? Don't miss our two-day DataEngConf with top engineers in San Francisco, April 2016.

In this article we put together 12 of the top Kafka talks on Hakka Labs.

Introduction to Apache Kafka

Apache Kafka is a commit log for your entire data center and infrastructure. In this lightning talk, Joe Stein, founder of Big Data Open Source Security LLC, gives a brief introduction to Kafka and talks about the producers, consumers, and client libraries it has to offer. This talk was given at the Apache Kafka NYC meetup at Tapad.
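To make the "commit log" framing concrete, here is a toy, in-memory model of the abstraction (not Kafka's actual API): producers append to an ordered log, and each consumer tracks its own read offset independently, so readers never interfere with each other.

```python
class ToyLog:
    """An append-only log with per-consumer offsets, mimicking the
    idea (not the API) behind a Kafka topic partition."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)

    def consume(self, consumer, max_messages=10):
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch

log = ToyLog()
log.produce("pageview:/home")
log.produce("pageview:/about")

# Two consumers read the same log at their own pace.
assert log.consume("reporting") == ["pageview:/home", "pageview:/about"]
assert log.consume("archiver", max_messages=1) == ["pageview:/home"]
```

Because messages are retained rather than deleted on read, many independent systems (reporting, archival, stream processing) can consume the same data, which is what makes the log a natural backbone for a whole data center.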


Kafka and Hadoop

Getting data from Kafka to Hadoop should be simple, which is why the community has so many options to choose from. Gwen Shapira, an engineer at Cloudera, reviews some popular solutions: Storm, Spark, Flume, and Camus. She goes over the pros and cons of each, recommends use cases for them, and touches on future development plans. This talk was given at the Apache Kafka NYC meetup at Tapad.



Rafe Colburn

Three years ago, Etsy's analytics data pipeline was built around a pixel hosted on Akamai, FTP uploads, and Amazon EMR. Rafe Colburn, manager of the data engineering team at Etsy, talks about their migration to a data ingestion pipeline based on Kafka. He gives an overview on how they rebuilt their data pipeline without disrupting ongoing analytics work, as well as the tradeoffs made in building these systems.



This talk was given at the NYC Data Engineering meetup at Spotify.

This talk is included in our collection of the Top 12 Apache Kafka talks. Check out the others here.

Kinshuk Mishra

Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines has Apache Storm wired to different systems such as Kafka, Cassandra, and Zookeeper, along with other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability to ensure high availability and good system performance.



Andrew Otto

Andrew Otto, Systems Engineer at the Wikimedia Foundation, talks about the analytics cluster (Kafka, Hadoop, Hive, etc.) that allows Wikimedia to support roughly 20 billion page views a month. Andrew shares how and why they chose Kafka for scalable log transport, and how they've deployed it with four brokers, a custom-built producer, and kafkatee and Camus as their consumers.


This talk was given at the Apache Kafka NYC meetup at Yodle.

We have many more articles on Apache Kafka. Check out our collection of top 12 tech talks on Apache Kafka.

Dean Chen

Apache Spark is a next-generation engine for large-scale data processing, built with Scala. Dean Chen, software engineer at eBay, discusses how Spark takes advantage of Scala's functional idioms to produce an expressive and intuitive API for big data analysis. Dean covers the design of Spark RDDs and how that abstraction enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib, and GraphX.
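The core idea Dean covers, lazy and composable transformations over a collection, can be sketched in miniature. The class below is a toy, single-machine stand-in written to illustrate the concept; it is not Spark's API.

```python
class ToyRDD:
    """A minimal stand-in for an RDD: transformations only build a
    plan, and nothing executes until an action like collect()."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []

    def map(self, fn):
        return ToyRDD(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    def collect(self):
        out = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

squares_of_evens = (ToyRDD(range(10))
                    .filter(lambda x: x % 2 == 0)
                    .map(lambda x: x * x))
# No work has happened yet; collect() triggers evaluation.
assert squares_of_evens.collect() == [0, 4, 16, 36, 64]
```

Deferring execution like this is what lets a real engine inspect the whole plan, optimize it, and distribute it across a cluster before running anything.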


This was recorded at the Scala Bay meetup at PayPal.

Gandalf Hernandez


Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!

During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.

As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.

My problem is that I have data for 35 million tracks sitting in 35 million individual files, totaling 15-20TB, in an S3 bucket that we need to get into our Hadoop cluster, which is the starting point for most of our pipelines.
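One common way to attack this small-files problem is to pack many small objects into a bounded number of large, Hadoop-friendly chunks before ingesting them. A hedged sketch of just the batching step follows; the 1 GB target, file names, and sizes are illustrative, not Spotify's actual job.

```python
def pack_into_chunks(files, target_bytes=1 << 30):
    """Greedily group (key, size) pairs into chunks whose total size
    stays near target_bytes, so tens of millions of tiny objects
    become a few thousand large files."""
    chunks, current, current_size = [], [], 0
    for key, size in files:
        if current and current_size + size > target_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

files = [("track_%d.json" % i, 500_000) for i in range(5)]
# With a 1 MB target, five 500 KB files pack into three chunks.
assert pack_into_chunks(files, target_bytes=1_000_000) == [
    ["track_0.json", "track_1.json"],
    ["track_2.json", "track_3.json"],
    ["track_4.json"],
]
```

Each chunk's keys can then be fetched and concatenated by a single worker, turning 35 million downloads into a manageable number of parallel tasks.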


While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Eric Sammer, CTO at ScalingData and author of Hadoop Operations, talks about how ScalingData uses Kafka together with other open source systems such as Hadoop, Solr, and Impala/Hive to collect, transform, and aggregate event data, and then build applications on top of this platform.
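At its simplest, the collect-transform-aggregate step is a grouped count over a stream of events. A toy sketch of that shape, with an event schema invented for illustration:

```python
import json
from collections import Counter

def aggregate_events(lines):
    """Parse newline-delimited JSON events and count them by type."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[event["type"]] += 1
    return dict(counts)

raw = [
    '{"type": "login", "host": "web-1"}',
    '{"type": "error", "host": "web-2"}',
    '{"type": "login", "host": "web-3"}',
]
assert aggregate_events(raw) == {"login": 2, "error": 1}
```

In a real deployment the `lines` iterable would be a Kafka consumer rather than a list, and the aggregates would land in a queryable store like Impala or Solr instead of a dict.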

If you are interested in learning more about Apache Kafka, check out our top 12 tech talks on how top tech companies use Apache Kafka.

This talk was given at the Apache Kafka NYC meetup at Tapad.


Michael Hwang

Here at SumAll, we provide analytics from over 40 different sources that range from social media data to health and fitness tracking. We collect and connect metrics from a wide range of 3rd party API data sources for our clients to help them make better business decisions. One of engineering's biggest challenges has been building and maintaining a data pipeline that is both performant and reliable.

We currently have a few different implementations that are running in production. One set uses Esper at its core while another set uses an in-house aggregation library written in Java. The use of Esper makes the aggregation piece of the processing asynchronous, unlike the pipeline that uses the in-house aggregation library. This has been one of the debate points in discussions amongst our engineering team over which technology to ultimately use for our aggregator services. As the number of additional platforms we integrate with continues to increase, we have found that having different approaches to solving the same problem fragments the engineering team and leads to code overhead.
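The synchronous/asynchronous distinction at the heart of that debate can be illustrated with a toy asynchronous aggregator: producers enqueue metrics and a background worker folds them in, so ingestion is decoupled from aggregation. The class, metric names, and values below are invented for illustration; they are not SumAll's Esper or Java implementations.

```python
import queue
import threading
from collections import defaultdict

class AsyncAggregator:
    """Sums metric values on a background thread; callers never
    block on aggregation, only on the (fast) enqueue."""

    def __init__(self):
        self.totals = defaultdict(int)
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                self._queue.task_done()
                break
            metric, value = item
            self.totals[metric] += value
            self._queue.task_done()

    def record(self, metric, value):
        self._queue.put((metric, value))

    def drain(self):
        """Block until all queued metrics have been folded in."""
        self._queue.join()

agg = AsyncAggregator()
agg.record("tweets", 3)
agg.record("tweets", 2)
agg.record("likes", 7)
agg.drain()
assert agg.totals == {"tweets": 5, "likes": 7}
```

The trade-off mirrored here is the one the team debates: the asynchronous path keeps producers fast under load, at the cost of needing explicit synchronization (`drain()`) anywhere a consistent read of the totals is required.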

