Silviu Calinoiu Silviu Calinoiu on

This talk shows how to build an ETL pipeline using Google Cloud Dataflow/Apache Beam that ingests textual data into a BigQuery table. Google engineer Silviu Calinoiu gives a live coding demo and discusses concepts as he codes. You don't need any previous background with big data frameworks, although people familiar with Spark or Flink will see some similar concepts. Because of the way the framework operates the same code can be used to scale from GB files to TB files easily.

Asim Jalis Asim Jalis on

Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.

Andy Dirnberger Andy Dirnberger on

iHeartRadio ingests hundreds of thousands of products each month. Historically, as a new product delivery was received, a user would manually initiate the ingestion process by entering its file path into a form on a web page, triggering the ingestion application to parse the delivery and update the database. Downstream systems would constantly poll the database, run at regularly scheduled intervals, or be triggered manually. This process, roughly visualized below, was reasonable, with new content arriving in the catalog within a few days of its receipt.

Linear ingestion flow

Ingestion v2

As new content providers were added, new distribution formats needed to be accommodated. More and more code was added to the application. Eventually we developed a new version of the application, this time introducing an XSLT stylesheet for each provider. These stylesheets transformed the providers’ formats into a single, canonical format. This simplified the application as the ingestion application only then needed to know how to parse one XML format.

Over time, though, provider-specific logic found its way into the application to handle cases that couldn’t be handled by XSLT. Also, one provider delivered their content in a format that couldn’t easily be handled by XSLT at all. This meant that both versions of the application were used to ingest new content. Changes targeted at all providers needed to be made to two different applications. This also meant two applications needed to be tested against changes.

Ingestion v3

We made another pass at creating a provider-agnostic version of the ingestion application. This time, however, the goal was to include other types of content. The first two iterations focused solely on music. Data models and their business logic were pushed out of the application and into configuration files. This would allow for new instances of the application to be spun up with configuration files that described different content types.

To help with the additional workload, the application was designed to distribute its work. As this version was written in Python (the previous two were both written in Java), Celery, backed by RabbitMQ, was used to distribute tasks across multiple workers.

Unfortunately the simplicity of the application came at the cost of code that was very difficult to debug. It was also difficult to efficiently debug a data model when the model was defined in configuration rather than in code. Problems were hard to diagnose, and it quickly became clear that adding additional content types would only make this worse.

In addition to debugging problems, there was one issue this version failed to address, something that had been plaguing us from the first version on: when a delivery failed, it needed to be placed through the entire process again.

Ingestion Pipeline

We set out to build the fourth version of the ingestion application. This time, however, we decided to split it up into smaller, single-purpose applications. Each application would be connected to the next through a message queue.

With our new approach, applications can be run in succession or in parallel. Applications can be added or removed at any time without affecting the entire system. The flow of products through the ingestion system takes on a very different shape.

Distributed ingestion flow

By logging each outgoing message we gain visibility into the state of the system and can better monitor its health and performance. We send all of our logs — both the messages between systems and all application-level logs — to Logstash. This enables us to easily get messages into Elasticsearch, either in their original format or with keys remapped. This, coupled withKibana for dashboards, allows us to gain insight into how the system is performing and the stakeholders to gain insight into products being ingested.

We can also recover from errors much easier. For errors from which we know we can recover, an application can place its incoming message back into the incoming queue — this could also be done by not acknowledging the message and allowing it to return to the queue, but we want to know about the error and be able to fail if it’s been retried too many times — to allow it to be processed again later.

Brought to you by the letter H


To accomplish this, we built a framework known as Henson. Henson allows us to hook up a consumer (usually an instance of our Henson-AMQP plugin) to a callback function. Any message received from the consumer will be passed to the callback function. To help simplify the callback function associated with each application, Henson also supports processing the message before giving it to the callback through a series of callbacks (e.g., message schema validation, setting timestamps) and processing the results received from the callback (e.g., more timestamps, sending the message through a producer).

Henson allows us to reduce each application to just the amount of code required to implement its core functionality. The rest can be handled through code contained in shared libraries, registered as the appropriate processor.

The boilerplate for one of our services can be as simple as:

from henson import Application

from henson_amqp import AMQP
from ingestion import send_message, validate_schema

from .callback import callback

app = Application('iheartradio', callback=callback, consumer=AMQP(app))


All we need to do when creating a new service is to implement callback.

async def callback(application, message):

"""Return a list of results after printing the message.


application (henson.base.Application): The application
instance that received the message.
message (dict): The incoming message.


List[dict]: The results of processing the message.
print('Message received:', message)
return [message]

Once this is done, we can then run the application with

$ henson run service_name

We decided to use asyncio’s coroutines and event loop to allow multiple messages to be processed while awaiting for traditionally blocking actions to complete. This is especially important in our content enrichment jobs, many of which poll APIs from third parties.

In addition to Henson itself, we’re also developing several plugins covering message queues, databases, and logging.

Adam Denenberg Adam Denenberg on

Here at iHeartRadio we have made a significant investment in choosing Scala and Akka for our MicroService backend. We have also recently made an investment in moving a lot of our infrastructure over to AWS to give us a lot more freedom and flexibility into how we manage our infrastructure and deployments.

One of the really exciting technologies coming out of AWS is Lambda. Lambda allows you to listen to various “events” in AWS, such as file creation in S3, stream events from Kinesis, messages from SQS and then invoke your custom code to react to those events. Additionally the applications you deploy that react to these events, require no infrastructure and are completely auto-scaled by Amazon (on what seems like a cluster of containers). Currently Lambda supports writing these applications in Python, Node and Java.

For our CDN, we leverage Fastly for a lot of our web and API properties, which has been really powerful for us. However, sometimes we need to get some detail around whats happening with our traffic at a fine grain level that the Fastly dashboards do not provide. For example, we may want to know the cache hitrate for a specific URL, or know who our top referrers are broken down by browser. Fastly gives you the ability to ship logs in realtime(to a remote syslog server, S3, etc), but pouring over Gigabytes of logs with grep, seemed less than ideal. Additionally , rolling out an entire log processing framework along with a front-end that could give us visualizations, groupings, timeseries and facets was going to be a fair lift as well.

NewRelic released an interesting product called “Insights” which is quite simply, a way to send arbitrary data events to their storage backend and provide relatively complex visual operations on that data such as TimeSeries, Facets, Where clause filters, Percentages, etc. We quickly realized if we can build a simple bridge from the real-time Fastly logs to the NewRelic backend, we would have a quick and powerful solution.

Lambda was a perfect fit for this, since we could easily ship logs in real-time to S3 from Fastly. Once we did that, we could write a Lambda function that would get invoked every time a new file was uploaded to our S3 bucket, parse the logfile and post the events to NewRelic.

Since Lambda supports Java, we spent some time experimenting with getting it to work in Scala. We eventually got it to work and open sourced our solution. We learned a few things along the way, however, that I outlined below.

Firstly, only Java 8 is supported, so your build.sbt needs to have some configuration in it that enables 1.8 only support

javacOptions ++= Seq(“-source”, “1.8”, “-target”, “1.8”, “-Xlint”)

Additionally, the Lambda platform doesn’t understand native Scala types (it’s fine for primitives) so you need to map things like Scala List to a java.util.List. Importing scala.collection.JavaConverters._ into your source files should help you handle this for you by giving you helpers .asScalaand .asJava to convert in either direction.

Another issue that came up was dealing with threading and Futures. When using an async library (like Ning), you need to be careful about how you handle multi-threading. Lambda re-uses the handler instance between invocations so you need to be careful about spawning work on another thread. Amazon recommends to block the main thread until that work was completed. In our case, when firing an async webservice call, the proper behavior was to invoke the future, and then wrap the result in anAwait.result().

For java, there is absolutely some “jvm warmup” time that was noticed when there was low activity on the lambda function. To avoid this, we increased the frequency which Fastly was pushing S3 logs and we saw a 2x decrease in run time (from about 6 seconds to 3 seconds). Also note, that reading from S3 was not particularly fast, so be sure to tweak the timeout for your function, especially for first run times.

Testing your lambda function is not particularly simple either. We had to rely on a few unit tests to enable more rapid testing of the workflow. Getting your new jar published on Lambda is without a doubt a multi-step process, especially if your jar exceeds the 10Mb limit. You need to 1) upload your jar to S3, 2) Publish a new function to Lambda using the S3 URL 3) go review the cloudWatch logs to ensure everything worked.

Once everything was published and working the results were pretty great. With data in Insights, we can run queries like

SELECT count(uri) FROM Fastly SINCE 1 HOUR AGO where uri like ‘%/someuri%’ FACET hitMiss TIMESERIES

This queries all urls that match /someuri, facets on hitMiss (an arbitrary attribute we named that represents Fastly cache Hit or Miss) and presents it as a TimeSeries. This quickly shows a cache hit ratio for a regex of URLs which can be incredibly powerful.

Lastly, note that Fastly is pretty flexible in terms of what data you can send. The expected defaults are included like Time, URI, StatusCode, etc. But you can also include almost any Varnish variable in the log format which can be quite powerful.

Unknown author on

The data science team at iHeartRadio has been developing collaborative filtering models to drive a wide variety of features, including recommendations and radio personalization. Collaborative filtering is a popular approach in recommendation systems that makes predictive suggestions to users based on the behavior of other users in a service. For example, it’s often used to recommend new artists to users based on what they and similar users are listening to. In this blog post we discuss our initial experiences applying these models to increase user engagement through targeted marketing campaigns.

One of the most successful approaches to collaborative filtering in recent years has been matrix factorization, which decomposes all the activity on a service into compact representations of its users and what they interact with, e.g. artists. At iHeartRadio we use matrix factorization models to represent our users and artists with ~100 numbers each (called their latent vectors) instead of processing the trillions of possible interactions between them. This simplifies our user activity by many orders of magnitude, and allows us to quickly match artists to users by applying a simple scoring function to the user’s and artist’s latent vectors.

Erik Bernhardsson Erik Bernhardsson on

Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and other things. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increase, finding similar items gets challenging. Erik Bernhardsson developed a library called "Annoy" that uses a forest of random tree to do fast approximate nearest neighbor queries in high dimensional spaces. We will cover some specific applications of vector models with and how Annoy works.

Unknown author on

Word2Vec is an interesting unsupervised way to construct vector representations of words to act as features for downstream algorithms or as a basis for similarity searches. We look at using the Spark implementation of Word2Vec shipped in MLLib to help us organize and make sense of some non-textual data by treating discrete clinical events (I.e. Diagnoses, drugs prescribed, etc.) in a medical dataset as non-textual "words”.

Unknown author on

Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.

Unknown author on

Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, Xiangrui shares his experience in developing MLlib. The talk covers both higher-level APIs, ML pipelines, that make MLlib easy to use, as well as lower-level optimizations that make MLlib scale to massive datasets.

Simon Chan Simon Chan on

In this talk, Simon Chan (co-founder of PredictionIO) introduces the latest developments and shows how to use PredictionIO to build and deploy predictive engines in real production environments. PredictionIO is an open source machine learning server built on Apache Spark and MLlib. It is designed for data scientists and developers to build predictive engines for real-world applications in a fraction of the time normally required.

Using PredictionIO’s DASE design pattern, Simon illustrates how developers can develop machine learning applications with the separation of concerns (SoC) in mind.
“D" stands for Data Source and the Data Preparator, which take care of the preparation of data for model training.
“A" stands for Algorithm, which is where the code of one or more algorithms are implemented. MLlib, the machine learning library of Apache Spark, is natively supported here.
“S” stands for Serving, which handles the application logic during the retrieval of predicted results.
Finally, “E” stands for Evaluation.

Simon also covers upcoming development work, including new Engine Templates for various business scenarios.


This video was recorded at the SF Data Mining meetup at in SF.

Unknown author on

Nick Elprin, founder of Domino Data Lab, talks about how to deploy predictive models into production, specifically in the context of a corporate enterprise use case. Nick demonstrates an easy way to “operationalize” your predictive models by exposing them as low-latency web services that can be consumed by production applications. In the context of a real-world use case this translates into more subtle requirements for hosting predictive models, including zero-downtime upgrades and retraining/redeploying against new data. Nick also focuses on the best practices for writing code that will make your predictive models easier to deploy.

anna smith anna smith on

Anna Smith from Rent the Runway talks about how they've evolved their data pipeline over time to deal with infrastructure constraints, disparate data sources, and changing data sources/quality all while still serving reports and data back to the website with minimal downtime. Anna also covers how they leveraged Luigi to ensure robust reporting without forcing non-technical analysts to learn Python.


This video was recorded at the NYC Data Engineering meetup at Spotify in NYC.

Matthew Zeiler Matthew Zeiler on

Matthew Zeiler, PhD, Founder and CEO of Clarifai Inc, speaks about large convolutional neural networks. These networks have recently demonstrated impressive object recognition performance making real world applications possible. However, there was no clear understanding of why they perform so well, or how they might be improved. In this talk, Matt covers a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the overall classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that perform exceedingly well.


This talk was presented at the NYC Machine Learning Meetup at Pivotal Labs.

Unknown author on

Machine learning researcher, Edouard Grave, gives a presentation on the field of information extraction (pulling structured data from unstructured documents). Edouard talks about current challenges in the field and introduces distant supervision for relation extraction.

Distant supervision is a recent paradigm for learning to extract information by using an existing knowledge base instead of label data as a form of supervision. The corresponding problem is an instance of multiple label, multiple instance learning. Edouard shows how to obtain a convex formulation of this problem, inspired by the discriminative clustering framework.

He also presents a method to learn to extract named entities from a seed list of such entities. This problem can be formulated as PU learning (learning from positive and unlabeled examples only) and Edouard describe a convex formulation for this problem.


This talk was presented at the NYC Machine Learning Meetup at Pivotal Labs.

Rafe Coburn Rafe Coburn on

Three years ago, Etsy's analytics data pipeline was built around a pixel hosted on Akamai, FTP uploads, and Amazon EMR. Rafe Colburn, manager of the data engineering team at Etsy, talks about their migration to a data ingestion pipeline based on Kafka. He gives an overview on how they rebuilt their data pipeline without disrupting ongoing analytics work, as well as the tradeoffs made in building these systems.

Unknown author on

Raghavendra Prabhu, engineering manager for the infrastructure team at Pinterest, walks through their new storage product, Zen. Built at Pinterest, Zen was originally conceived in summer 2013 and since then has grown to be one of the thier most widely used storage solutions, powering the home feed, interest graph, messages, news and other key features.

In this talk, RVP goes over the design motivation for Zen and describes its internals including the API, type system and HBase backend. He also discusses their learnings from running the system in production over the last year and a half, the features added and performance improvements made to accommodate the fast adoption we have seen since launch.


Slides here:

This talk was given at SF Data Engineering meetup at Galvanize in San Francisco.

Kinshuk Mishra Kinshuk Mishra on

Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines have Apache Storm wired to different systems like Kafka, Cassandra, Zookeeper, and other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability to ensure high availability and good system performance.

Noel Cody Noel Cody on

Sometimes the answer to a sluggish data pipeline isn’t more power in the Hadoop cluster, but a shift in technique. We hit one of these moments recently at Spotify.

One of our critical ad analysis pipelines had issues. First it was slow. Then a few days later it was dead, unrunnable at < 20GB memory/reducer.

We traced the problem back to a single bottleneck: One expensive join and a handful of overloaded reducers. We solved things by switching up our join strategy, cutting memory usage by over 75%. Here’s how.

Reducer Overload

At one stage in our pipeline, we performed a large join between raw ad impression logs and a smaller set of ads metadata. Some of our ads have far more impressions than others; the nature of this join was such that several reducers were processing an unreasonable number of rows while others in the cluster sat idle.

This kind of bottleneck tends to happen with “skewed” data, where a handful of keys make up a disproportionate number of logs. A standard join maps all logs with the same key to the same reducer. If too many logs hit a single reducer, the reducer chokes.

So the problem was really the data, not the cluster.

We needed a solution that would distribute the join across reducers to lighten the load. And it had to work in Apache Crunch, our platform for writing, testing, and running MapReduce pipelines on Hadoop.

Enter the Sharded Join

As we shifted focus from the initial memory problem to the underlying data problem, we settled on a simple solution: To "shard" the join. A sharded join aims to distribute the load of a single expensive join across the cluster so no one reducer gets overloaded.

In abstract, the sharded join pattern looks like this:

Consider a join between a larger dataset Logs (with multiple values, or multiple logs, per key) and a smaller metadata set Metadata (which has just one value, or one log, per key). A sharded join splits up the key space by:

  1. In Logs: Dividing key/value pairs into groups by combining each log's key with some random integer. The output of this looks like ([A1, V1], [A2, V2], [A3, V3]), where “A1” represents a combination of key “A” and a random number and “V1” represents an unchanged log value. All values are unique here; we're dividing into groups.

  2. In Metadata: Replicating key/value pairs by creating multiple copies of each unique key, one for each possible value of the random integer. Output looks like ([A1, M1], [A2, M1], [A3, M1]).



While this setup increases the number of total mapped logs, the average reducer load is decreased. In the example above, the normal map would have sent three logs and a metadata row to each of two reduce tasks; the sharded map breaks up the job by sending just one log and one metadata row to each of six separate tasks.

This brings the necessary memory per reducer down and removes the bottleneck.

How to do this in Apache Crunch?

If you’re fancy and use Crunch, you’ve got a nice built-in sharded join strategy waiting for you. Use of that strategy would look like this:

PTable<K, V1> one = ;
    PTable<K, V2> two = ;
    JoinStrategy<K, V1, V2> strategy = new ShardedJoinStrategy<K, V1, V2>(num_shards);
    PTable<K, Pair<V1, V2>> joined = strategy.join(one, two, JoinType.INNER_JOIN);

...and our first step was to implement this sharded join from the standard Crunch library. However, we soon hit an underlying bug in the Crunch library affecting our specific implementation of the sharded join. Unable to use the built-in implementation, we fell back to constructing the join ourselves.

This turned out to be a good opportunity to review basics. When all else fails...

Do it From Scratch

If you’re stuck using vanilla MapReduce or are unable to use the built-in sharded join, here’s the DIY approach. Code below is in Crunch, but the pattern holds regardless of what you’re using.

First we split the key space from the large dataset into shards using a random integer. In Crunch, we can do this with a standard MapFn, which is a specialized DoFn for the case of emitting one value per input record (while a DoFn is the base abstract class for all data processing functions).

We store the output in a built-in Crunch collection called a PTable, which consists of a <key, value> map. Here our key is a Pair<String, Integer>.

PTable<Pair<String, Integer>, Value> shardedValues = largeDataset.parallelDo(
    new MapFn<Pair<String, Value>, Pair<Pair<String, Integer>, Value>>() {

  final private Random rand = new Random();

  public Pair<Pair<String, Integer>, Value> map(
      Pair<String, Value> largeDatasetPair) {
    int randnum = rand.nextInt(NUM_SHARDS) + 1;

    return Pair.of(Pair.of(largeDatasetPair.first(), randnum),
} , tableOf(pairs(strings(), ints()), specifics(Value.class)));

When determining the number of shards to use, think about your current and ideal reducer memory usage. Two shards will approximately halve the number of logs joined per reducer and affect memory accordingly. The cost of sharding is an increase in total reduce tasks and an increase in mapped metadata logs.

Then we replicate metadata values using a DoFn and an Emitter, allowing us to output multiple values per log:

PTable<Pair<String, Integer>, MetadataValue> replicatedValues = metadataSet.parallelDo(
    new DoFn<Pair<String, MetadataValue>, Pair<Pair<String, Integer>, MetadataValue>>() {

      public void process(Pair<String, MetadataValue> metadataSetPair,
                          Emitter<Pair<Pair<String, Integer>, MetadataValue>> pairEmitter) {
        for (int i = 1; i <= NUM_SHARDS; ++i) {
          pairEmitter.emit(Pair.of(Pair.of(metadataSetPair.first(), i),
    }, tableOf(pairs(strings(), ints()), specifics(MetadataValue.class)));

Finally, join and filter:
JoinStrategy<Key, MetadataValue, Value> strategy = new DefaultJoinStrategy<Key, MetadataValue, Value>();
PTable<Key, Pair<Value, MetadataValue>> joined = strategy.join(one, two, JoinType.LEFT_JOIN)
.whateverElseYouNeedToDo(); // Remove the sharded int values from keys here

This splits up the ad logs, joins them in small batches, then restores them to their initial state.

The sharded join can work wonders on reducer load when combining a large, skewed dataset with a smaller dataset. Our shift in focus from the initial memory overload to the data bottleneck allowed us to settle on this simple solution and get the job back in line. Pipeline revived.

Andrew Otto Andrew Otto on

Andrew Otto, Systems Engineer at Wikimedia Foundation, talks about the analytics cluster at Wikimedia that allows them to support ~20 billion page views a month (Kafka, Hadoop, Hive, etc). Andrew shares how and why they chose to go with Kafka (scalable log transport) and how they've implemented Kafka with four brokers, a custom-built producer and kafkatee and Camus as their consumers.

Dean Chen Dean Chen on

Apache Spark is a next generation engine for large scale data processing built with Scala. Dean Chen, software engineer at ebay, discusses how Spark takes advantage of Scala's function idioms to produce an expressive and intuitive API for big data analysis. Dean covers the design of Spark RDDs and the abstraction enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLib and GraphX.


This was recorded at the Scala Bay meetup at PayPal.

Gandalf Hernandez Gandalf Hernandez on


Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!

During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.

As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.

My problem is that I have data for 35 million tracks sitting in 35 million individual files, totaling 15-20TB, in an S3 bucket that we need to get into our Hadoop cluster, which is the starting point for most of our pipelines.

Naive solution

One option is to just set up a script to copy each of the 35 million tracks into Hadoop as individual files.

No, that is too many files. There is no reason to burden the Hadoop namenodes with that many new items. It is also unlikely that we will want to only process one file at a time, so there is again no reason for them to sit in individual files. There is a high likelihood of eye rolling if we proceed down this path.


Probably also not an acceptable solution (source)


Most of our data in Hadoop is stored in Avro format with one entity per line, so that is a good starting point. We are lacking a list of what Spotify songs are in the S3 bucket, and the bucket has other data too, so we cannot blindly download everything. But what we can do is:

  •   Take the latest internal list of all tracks we have at Spotify (which we call metadata dumps) and

  •   Check if the song exists in the S3 bucket, since we have the mapping from Spotify ID to S3 key and then

  •   Download the Echo Nest analysis of the song and

  •   Write out a line with the Spotify track ID followed by the analysis data

This simple algorithm looks very much like a map-only job using a metadata dump as the input. But there are a few considerations before we kick this off.


Our Hadoop infrastructure is critical since it is the end point for logging from all our backend services - including data used for financial reporting. 20 TB of data is not much for internal processing in the Hadoop cluster, but we normally do not pull this much data in from outside sources. After talking to some of our infrastructure teams, we got the all clear to start testing.

At this point we had no idea of how long this would take. After initial testing to see that the overall idea was sound by pulling down a few thousand files, we decided to break up the download into twenty pieces. That way if a job fails catastrophically with no usable output we would not have to restart from the beginning. A quick visit to Amazon’s website, using retail pricing, led to an estimate that the total download costs could top $1,000. That is enough money that it would be embarrassing to fail and have to completely start over again.


From testing we decided to pull data down in twenty chunks. The first job was kicked off on a Monday night. By the next morning the data had downloaded and we ran sanity checks on it the data. The number of unexpected failures turned out to be six out of 1.75 million attempted downloads. After a brief celebration for being six sigma compliant, we estimated that it would take around a week to download the data if we constrained ourselves to process one chunk at a time. We were satisfied with this schedule.

After a (happily) uneventful week we had around 6TB of packed Avro files in Hadoop.

From that we wrote regular, albeit heavy, Map Reduce jobs to calculate the audio attributes (tempo, key, etc.) for each of the tracks that we surface to other groups.

Fortunately or unfortunately–depending on how you look at it–the data is not static. Fortunately, since that means that there is new music released and coming into Spotify, and who doesn’t want that? Unfortunately that means our job is not done.

Analyzing new tracks

The problem now turns to how we analyze all new tracks provided to Spotify as new music is released.

Most of our datasets are stored in a structure of the form ‘/project/dataset/date’. For example, our track play data set divides up all the songs that were played on a particular day. Other data sets are snapshots. Today’s user data set contains a list of all our users and their particular attributes for that given day. Snapshot datasets typically do not vary much from day to day.

The analysis data set stays mostly static, but needs to be mated with incoming tracks for that day. Typically we’d join yesterday’s data with any new incoming data, emit a new data set, and call it a day. The analyzer data set is significantly larger than most other snapshots and it feels like overkill to process and copy 6TB of packed data each day in order to just add in new tracks.

Other groups at Spotify are interested in the audio attributes data set we compute from the track analysis, and they will expect that this data set looks similar to other snapshots, where today’s data set is a full dump with all track attributes. The attributes set is two orders of magnitude smaller, and thus much easier to deal with.

But others are not interested in the underlying analysis data set, so we could organize it as:

  •   Store periodic full copies of all the analysis (standard snapshot) and in between this

  •   Generate incremental daily data sets of new analysis files only

  •   Every once in a while mate the full data set with the following incremental sets and emit a new more up to date full data set making all older sets obsolete and delete them.

We can then generate daily attributes snapshots (remembering that they are much smaller) by:

  •   Taking yesterday’s attributes snapshot and join it with

  •   Today’s full or incremental analysis set (only one of them will exist) and

  •   Check if we are missing attributes for a given track analysis and if so calculate it, otherwise

  •   Emit the existing attributes for the given track to not redo work

This way we can commit to having standard daily snapshots without running the risk of being asked why we copy 6 TB of data daily to add in attributes for new tracks, and everybody will be happy.


As an ongoing project, there are a lot of interesting problems yet to consider. We want to make the attributes available in real time by making sure we provide Cassandra databases to speed up ad-hoc modeling as well as allowing backend services to make on the fly decisions about what songs to recommend based on song characteristics. Cool stuff.

Thank you for reading,

Gandalf Hernandez


Shane Conway Shane Conway on

Machine learning is often divided into three categories: supervised, unsupervised, and reinforcement learning.  Reinforcement learning concerns problems with sequences of decisions (where each decision affects subsequent opportunities), in which the effects can be uncertain, and with potentially long-term goals.  It has achieved immense success in various different fields, especially AI/Robotics and Operations Research, by providing a framework for learning from interactions with an environment and feedback in the form of rewards and penalties.

Unknown author on

While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Eric Sammer, CTO at ScalingData and author of Hadoop Operations talks about how ScalingData uses Kafka together with other open source systems such as Hadoop, Solr, and Impala/Hive to collect, transformation and aggregate event data and then build applications on top of this platform.

Kinshuk Mishra Kinshuk Mishra on

Spotify Tech Lead Kinshuk Mishra and Engineer Noel Cody share their experience about building personalized ad experiences for users through iterative engineering and product development. They explain their process of continuous problem discovery, hypothesis generation, product development and experimentation. Later they deep dive into the specific ad personalization problems Spotify is solving and explain their data infrastructure technology stack in detail. They also speak about how they've experimented various product hypothesis and iteratively evolved their infrastructure to keep up with the product requirements.

Adam Gibson Adam Gibson on

Deep learning is all the rage in advanced analytics. How does it work and how can it scale? Adam Gibson, Data Scientist and Co-founder of Skymind, explains why representational learning is an advance over traditional machine learning techniques. He also gives a demo of a working deep-belief net with a tour through DL4J's API, showing how a DBN extracts features and classifies data.


This talk was given at the SF Data Mining meetup at Trulia.

Michael Hwang Michael Hwang on

Here at SumAll, we provide analytics from over 40 different sources that range from social media data to health and fitness tracking. We collect and connect metrics from a wide range of 3rd party API data sources for our clients to help them make better business decisions. One of engineering's biggest challenge has been on building and maintaining a data pipeline that is both performant and reliable.

Hakka Labs Hakka Labs on

Check out the first 20 minutes of our previous Practical Machine Learning training taught by Juan M. Huerta, Senior Data Scientist at PlaceIQ.


Join us for our next 3-day training! November 10th-12th. This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. The course will be taught by Rachit Srivastava (Senior Data Scientist, PlaceIQ) and supervised by Juan.

By the end of this training, you will be able to:

  • Apply common classification methods for supervised learning when given a data set

  • Apply algorithms for unsupervised learning problems

  • Select/reduce features for both supervised and unsupervised learning problems

  • Optimize code for common machine learning tasks by correcting inefficiencies by using advanced data structures

  • Choose basic tools and criteria to perform predictive analysis

We screen applicants for engineering ability and drive, so you'll be in a room full of passionate devs who ask the right questions. Applicants should have 3+ years of coding experience, knowledge of Python, and previous exposure to linear algebra concepts.

You can apply for a seat on our course page.

Hakka Labs Hakka Labs on

Hosted by Hakka Labs

This 3-day course will demonstrate the fundamental concepts of machine learning by working on a dataset of moderate size, using open source software tools.

Course Goals
This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. By the end of this course, you will be able to:
-Apply common classification methods for supervised learning when given a data set
-Apply algorithms for unsupervised learning problems
-Select/reduce features for both supervised and unsupervised learning problems -Optimize code for common machine learning tasks by correcting inefficiencies by using advanced data structures
-Choose basic tools and criteria to perform predictive analysis

Intended Audience
The intended audience of this Machine Learning course is the engineer with strong programming skills as well as a certain level of exposure to linear algebra and probability. Students should understand the basic issue of prediction as well as Python.

Class Schedule

Day 1: Linear Algebra/Probability Fundamentals and Supervised Learning
The goal of day one is to give engineers the linear algebra/probability foundation they need to tackle problems during the rest of the course and introduce tools for supervised learning problems.

-Quick Introduction to Machine Learning
-Linear Algebra, Probability and Statistics,
-Regression Methods
-Linear and Quadratic Discriminant Analysis
-Support Vector Machines and Kernels
-Lab: Working on classification problems on a data set

Day 2: Unsupervised learning, Feature Selection and Reduction
The goal of day two is to help students understand the mindset and tools of data scientists.

-Classification Continued
-K nearest neighbors, Random Forests, Naive Bayes Classifier
-Boosting Methods
-Information Theoretic Approaches
-Feature Selection and Model Selection/Creation
-Unsupervised Learning
-Principal Component Analysis/Kernel PCA
-Independent Component Analysis
- Lab: Choosing Features and applying unsupervised learning methods to a data set

Day 3: Performance Optimization of Machine Learning Algorithms
The goal of day three is to help students understand how developers contribute to complex machine learning projects.
-Unsupervised Learning Continued
-DB-SCAN and K-D Trees
-Anomaly Detection
-Locality-Sensitive Hashing
-Recommendation Systems and Matrix Factorization Methods
-Lab: Longer lab working on back-end Machine Learning optimization programming problems in Python

Get your tickets here

Nick Gorski Nick Gorski on

TellApart Software Engineer Nick Gorski takes us through a technical deep-dive into TellApart's personalization system. He discusses the machine learning data pipeline at TellApart that powers the models, real-time calculations of the expected value of shoppers, and how to translate that value into a bid price for every bid request received (hundreds of thousands per second).

Anand Henry Anand Henry on

In this talk, Anand Henry, Senior Software Engineer at Eventbrite, talks about their use of Apache Cassandra. Anand focuses on the Eventbrite data model & access patterns and the architecture of an Apache Cassandra Powered Recommendation Engine. He also goes over Cassandra as a data store to serve recommendations based on email, mobile push notifications, and web APIs. Later in the talk he touches on user audit logging with Apache Cassandra.

Toby Matejovsky Toby Matejovsky on

Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, Tapad Senior Software Developer Toby Matejovsky speaks about the creation and evolution of the pipeline. He demonstrates a concrete example – a day in the life of an event tracking pixel. Toby also talks about common challenges that his team has overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.

Kiyoto Tamura Kiyoto Tamura on

Fluentd is an open source data collector started by Treasure Data, that helps simplify and scale log management. In this talk, Kiyoto Tamura, Director of Developer Relations at Treasure Data, Inc. gives an overview and demo of Fluentd and goes through its most popular use cases.

Chris Wiggins Chris Wiggins on

Nearly all fields have been or are being transformed by the availability of copious data and the tools to learn from them. Dr. Chris Wiggins (Chief Data Scientist, New York Times) will talk about using machine learning and large data in both academia and in business. He shares some ways re-framing domain questions as machine learning tasks has opened up new avenues for understanding both in academic research and in real-world applications.

Jeffrey Picard Jeffrey Picard on

Understanding the billions of data points we ingest each month is no easy task. Through the development of models that allow us to do so, we’ve noticed some commonalities in the process of converting raw data to real-world understanding. Although you can get pretty good results with simple models and algorithms, digging beyond the obvious abstractions and using more sophisticated methods requires a lot of effort. In school we often learn different techniques and algorithms in isolation, with neatly fitted input sets, and study their properties. In the real world, however, especially the world of location data, we often need to combine these approaches in novel ways in order to yield usable results.

Toby Matejovsky Toby Matejovsky on

Tapad Director of Engineering, Toby Matejovsky talks about how his team built and scaled their cross device digital advertising platform that handles over 50,000 queries per second per server with sub-millisecond latency, 95-99% of the time. Toby shares lessons learned, scaling tips and best practices as well as answer questions ranging from tools and technologies to people and processes.

Josh Schwartz Josh Schwartz on

Josh Schwartz, Chief Data Scientist at Chartbeat, talks through the data pipeline they've built to ingest data from billions of browsing sessions per day, as well as the analytics system that processes this data, computes quantitative facts, and parses those facts into text."]

Sravya Tirukkovalur Sravya Tirukkovalur on

Apache Sentry (incubating) is a new open source authorization module that integrates with Hadoop-based SQL query engines (Hive and Impala). Two Apache Sentry developers, Xuefu Zhang and Srayva Tirukkovalur, Software Engineers at Cloudera, provide details on its implementation and do a short demo.

Ian Hummel Ian Hummel on

At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is.

Joe Crobak Joe Crobak on

Big data processing with Apache Hadoop, Spark, Storm and friends is all the rage right now. But getting started with one of these systems requires an enormous amount of infrastructure, and there are an overwhelming number of decisions to be made. Oftentimes you don't even know what kinds of questions you can or should be answering with your data.

Rong Yan Rong Yan on

Machine learning applications like fraud detection and recommendation have played a key role in helping Square achieve their mission to rethink buying and selling. In this talk, Dr. Rong Yan (Director of Data Science and Infrastructure, Square), gives a high-level overview of data applications at Square followed by a deep dive on how machine learning is used in our industrial leading fraud detection models.

Andrew Geweke Andrew Geweke on

The last few years have seen an explosion of interest in NoSQL data-storage layers, and then some retrenchment as the limitations of these systems became increasingly apparent. (It turns out they’re not magic, after all!) Today we seem faced with a choice. On one hand, we can reach for some of the potential “big wins” of NoSQL systems, but many of them are still relatively immature — at least when compared to the RDBMS — and the things we give up (transactionality, durability, manageability) we often discover to be very painful losses. On the other hand, we can reach for the security of a traditional RDBMS; we get incredibly well-understood, robust, durable, manageable systems…but we often sacrifice a lot of potential future growth.

Reynold Xin Reynold Xin on

Mining Big Data can be an incredibly frustrating experience due to its inherent complexity and a lack of tools. Reynold Xin and Aaron Davidson are Committers and PMC Members for Apache Spark and use the framework to mine big data at Databricks. In this presentation and interactive demo, you'll learn about data mining workflows, the architecture and benefits of Spark, as well as practical use cases for the framework.

Debora Donato Debora Donato on

StumbleUpon indexes over 100 million web pages for serendipitous retrieval for over 25 million registered users. Debora Donato (Principal Data Scientist, StumbleUpon) walks through StumbleUpon's big data architecture, data pipelines, mobile optimization efforts, and data mining projects.

Mark Weiss Mark Weiss on

RTBkit is an open source software framework designed to make it easy to create real-time ad bidding systems. Current users of RTBkit include the Rubicon Project, Nexage, App Nexus, The Google Ad Exchange.  Mark Weiss (Head of Client Solutions, Datacratic) shares some of the challenges companies and developers face today as they move into real-time bidding. He covers the architecture, implementation choices, plugins, and use cases.

David Greenberg David Greenberg on

When you need to execute code on a cluster of machines, deciding which machine should run that code becomes a complex problem, known as scheduling. We're all familiar with routing problems, such as the recent RapGenius incident. It turns out that simple improvements to randomized routing can dramatically improve performance! Sparrow is a distributed scheduling algorithm for low latency, high throughput workloads.

Christos Kalantzis Christos Kalantzis on

In this talk, Christos Kalantzis (Cloud Persistence Engineering Manager, Netflix), explains why he believes companies should adopt an optimistic software design model that is similar to what is used at Netflix. This is a practical talk with strategies for implementing the design model, testing for eventual consistency, and convincing your organization that it's the right move.

Evan Chan Evan Chan on

Evan Chan (Software Engineer, Ooyala), describes his experience using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. He starts by surveying the Cassandra analytics landscape, including Hadoop and HIVE, and touches on the use of custom input formats to extract data from Cassandra. Then, he dives into Spark and Shark (two memory-based cluster computing frameworks) and explains how they enable often dramatic improvements in query speed and productivity.

Les Hazlewood Les Hazlewood on

Need to scale user session loads? Les Hazlewood (Co-Founder and CTO of Stormpath), explains Shiro's enterprise session management capabilities and how to use Cassandra as Shiro's session store. This enables a distributed session cluster supporting hundreds of thousands or even millions of concurrent sessions. As a working example, Les will show how to set up a session cluster in under 10 minutes using Cassandra.

Jimmy Mårdell Jimmy Mårdell on

Spotify users have generated 1 billion+ playlists. At peak usage, over there are over 40,000 requests per second - not to mention support for "offline mode" and concurrent changes. In this excellent talk from C* Summit EU, Jimmy Mardell (Developer, Spotify) gives an overview of Spotify's playlist architecture, the Cassandra data model, and lessons learned working with Cassandra.

Michael Kjellman Michael Kjellman on

So far, I've explained why you shouldn't migrate to C* and the origins and key terms. Now, I'm going to turn my attention to how Cassandra stores data.

Cassandra nodes, clusters, rings

At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each C* node in the cluster is responsible for and assigned a token range (which is essentially a range of hashes defined by a partitioner, which defaults to Murmur3Partitioner in C* v1.2+). By default this hash range is defined with a maximum number of possible hash values ranging from 0 to 2^127-1.

Matt Jurik Matt Jurik on

Hulu users view 400 million videos and 2  billion advertisements each month. Hugetop is the service that allows users to track their progress in video content. The Hulu engineering team switched to a Cassandra-based architecture in the wake of unbounded data growth, MySQL servers that were running out of space, and the horrors of manual resharding.

Al Tobey Al Tobey on

As we move into the world of big data, systems architectures and data models we've relied on for decades are hindering growth. At the core of the problem is the read-modify-write cycle. In this talk, Al Tobey (Open Source Mechanic, DataStax) explains  how to build systems that don't rely on RMW, with a focus on Cassandra. For those times when RMW is unavoidable, he covers how and when to use Cassandra's lightweight transactions and collections.

Patrick McFadin Patrick McFadin on

In this talk, Patrick McFadin (Chief Evangelist for Apache Cassandra, DataStax) explains how to work with data throughout the application life cycle. You'll learn how to store objects, index for fast retrieval, and select a data model for Cassandra apps.

Michael Kjellman Michael Kjellman on

A new class of databases (sometimes referred to as “NoSQL”) has been developed and designed with 18+ years worth of lessons learned from traditional relational databases such as MySQL. Cassandra (and other distributed or “NoSQL” databases) aim to make the “right” tradeoffs to ultimately deliver a database that provides the scalability, redundancy, and performance needed in todays applications. Although MySQL may have performed well for you in the past, new business requirements and/or the need to both scale and improve the reliability of your application might mean that MySQL is no longer the correct fit.

Neville Li Neville Li on

This is the first time that a Spotify engineer has spoken publicly about their deployment and use cases for Storm! In this talk, Software Engineer Neville Li describes:

  • Real-time features developed using Storm and Kafka including recommendations, social features, data visualization and ad targeting

  • Architecture

  • Production integration

  • Best practices for deployment


Tadas Vilkeliskis Tadas Vilkeliskis on

At Chartbeat we are thinking about adding probabilistic counters to our infrastructure, HyperLogLog (HLL) in particular. One of the challenges with something like this is to make it redundant and have somewhat good performance. Since HyperLogLog is a relatively new approach to cardinality approximation there are not many off the shelf solutions, so why not try and implement HLL in Cassandra?

Sandeep Jain Sandeep Jain on

The Gist: Ever wonder why you keep getting ads for Budweiser when you're clearly a Coors aficionado? In this Lyceum, Sandeep Jain, technical advisor at Axial, will dive into the algorithms and systems that decide how online ads are delivered on the internet. Never again will you wonder why you're being hawked bad beer. This Lyceum will mix behavioral economics, game theory, distributed systems, and graph theory into one fun and informative talk.

Click here to register for the event

Speaker Bio: Sandeep is currently a technical adviser to Axial. Before that, he cofounded Reschedge, a SaaS enterprise recruiting tool which was recently sold to Hirevue. He started his career at Google where he spent 5 years working on Google Maps and Doubleclick products. He finished his career there as the technical lead of the display advertising backend.

Daniel Krasner Daniel Krasner on

In this talk, Daniel Krasner covers rapid development of high performance scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling and general machine learning. He demonstrates how Python modules, in particular the Rosetta Python library, can be used to process, clean, tokenize, extract features, and build statistical models with large volumes of text data. The Rosetta library focuses on creating small and simple modules (each with command line interfaces) that use very little memory and are parallelized with the multiprocessing package. Daniel also touches on LDA topic modeling and different implementations thereof (Vowpal Wabbit and Gensim). The talk is part presentation, and part “real life” example tutorial. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

Thomas Meimarides Thomas Meimarides on

In early December, we held our first ever hack-day. Each product manager teamed up with one to two engineers for the day to think up and develop any idea that they wanted. At the end of hack-day, each team then presented their concepts, demoed the results, and explained how they thought their hack improved the application.

Radu Gheorghe Radu Gheorghe on

In this talk, Radu Gheorghe, from SemaText, talks about using Elasticsearch or Solr to index your logs, so you can search and also analyze them in real-time. The term “logs” can range from server logs and application events to metrics or even social media information.  This talk was recorded at the NYC Search, Discovery and Analytics meetup at Pivotal Labs.  

Pete Soderling Pete Soderling on

In response to a recent post from MongoHQ entitled “You don’t have big data," I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.

How does Stripe do it?

We spoke to Avi Bryant at Stripe who gave us a nice description of the way Stripe has approached the building of their data pipeline.

"Stripe feeds data to HDFS from various sources, many of them unstructured or semi-structured -  server logs, for example, or JSON and BSON documents. In every case, the first step is to translate these into a structured format. We've standardized on using Thrift to define the logical structure, and Parquet as the on-disk storage format.

"We chose Parquet because it's an efficient columnar format which is natively understood by Cloudera's Impala query engine, which gives us very fast relational access to our data for ad-hoc reporting. The combination of Parquet and Thrift can also be used efficiently and idiomatically from Twitter's Scalding framework, which is our tool of choice for complex batch processing.

"The next stage is "denormalization": to keep our analytical jobs and queries fast, we do the most common joins ahead of time, in Scalding, writing to new set of Thrift schemas. At the same time, we do a bunch of enhancing and annotating of the data: for example, geocoding IP addresses, parsing user agent strings, or cleaning up missing values.

"In many cases, this results in schemas with nested structure, which works well from Scalding and which Parquet is happy to store, but which Impala cannot currently query. We've developed a simple tool which will transform arbitrarily nested Parquet data into an equivalent flattened schema, and where necessary we use this to maintain a parallel copy of each data source for use from Impala. We're looking forward to future versions of Impala which might remove this extra step."

Tapad’s Data Pipeline

Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:

"- All ingested data flows through a message queue in a pub-sub fashion (we use Kafka and push multiple TB of data through it every hour)

- All data is encoded with a consistent denormalized schema that supports schema evolution (we use Avro and Protocol Buffers)

- Most of our data stores are updated in real-time from processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica and the raw events, often enriched with data from our Aerospike cluster, is stored in HDFS)
- Advanced analytics and data science computation is typically executed on the denormalized data in HDFS

- The real-time updates can always be reproduced through offline batch jobs over the HDFS stored data. We strive to make our computation logic so that it can be run in-stream *and* in batch MR-mode without any modification”

He notes that the last point allows them to retroactively change their streaming computation at-will and then backfill the other data stores with updated projections.

Dag also explains the “why” behind their use of multiple types of data technologies on the storage side and explains that each of them has its own particular “sweet-spot” which makes it attractive to them:

"- Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.

- Aerospike: Extremely fast random access read/write performance, by key (we have 3.2 billion keys and 4TB of replicated data), cross data center replication, high availability but very limited querying capabilities

- Cassandra: Medium random access read/write performance, atomic counters and a data model that lends it well to time-series storage. Flexible consistency model and cross data center replication.

- HDFS: High throughput and cheap storage.

- Vertica: Fast and powerful ad-hoc querying capabilities for interactive analysis, high availability, but no support for nested data structure, multi-valued attributes. Storage based pricing makes us limit the amount of data we put here."

How Etsy handles data

For another example, we reached Rafe Colburn, the engineering manager of Etsy's data team and asked how they handled their pipeline. Here's the scoop from Rafe:

"Etsy's analytics pipeline isn't especially linear. It starts with our instrumentation, which consists of an event logger that runs in the browser and another that can be called from the back end. Both ping some internal "beacon" servers.

"We actually use the good old logrotate program to ship the generated Apache access logs to HDFS when they reach a certain size, and process them using Hadoop. We also snapshot our production data (which resides in MySQL) nightly and copy it to HDFS as well, so that we can join our clickstream data to our transactional data.

"We usually send the output of our Hadoop jobs to our Vertica data warehouse, where we replicate our production data as well, for further analysis. We use that data to feed our homemade reporting and analytics tools.

"For features on that use data generated from Hadoop jobs, we have a custom tool that takes the output of a job and stores it on our sharded MySQL cluster where it can be accessed at scale. This year we're looking at integrating Kafka into our pipeline to move data both from our instruments to Hadoop (and to streaming analytics tools), and also to send data back from our analytics platforms to the public site."

Square’s approach

Another example from a company that has a sophisticated data pipeline is Square. We reached one of their engineering managers, Pascal-Louis Perez, who gave us a strategic view of their pipeline architecture.

Because of the importance of payments flowing through its system, Square has extended the concept of ‘reconciliation’ throughout its entire data pipeline; with each transformation data must be able to be audited and verified. The main issue with this approach, according to Pascal, is that it can be challenging to scale. For every payment received there are "roughly 10 to 15 accounting entries required and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.”

Square’s approach to solving this problem leverages stream processing, which allows them to map a corresponding data domain to a different stream. In Pascal’s words, "The streams represent the first level abstraction distancing the data-pipeline from the diversity of sources of data. The next level are operators combining one of multiple streams, and producing themselves one or multiple streams. One example operator is a "matcher" which takes two streams, extracts similarly kinded keys from those, and produces two streams separated based on matching criterias.”

Pascal notes that  the system of stream processing and stream based operators is similar to relational algebra and its operators, but in this case it’s in real-time and on infinite relations.

It’s pretty obvious that cramming data into Hadoop won’t give you these capabilities!

Interested in getting better with data?

Hakka Labs has launched Machine Learning for Engineers, a new 3-day training course that teaches engineers the theory and practice behind common machine learning algorithms. Get more info here.

Rosaria Silipo Rosaria Silipo on

Open source tools usually delegate their support service to community forums. How reliable is this strategy? In this talk, Rosaria Silipo answers that question and this one, "who says that Open Source Software does not have support?"  She measures the efficiency of the community forum from 2007 to 2012 of KNIME, an open source data analytics platform. Commonly used techniques in social media analysis, such as web crawling, web analytics, text mining, and network analytics, are used to investigate the forum characteristics. Each part is described in detail during this presentation. This talk was recorded at the SF Data Mining meetup at inPowered.

Evan Chan Evan Chan on

In this talk, Evan Chan, Software Engineer at Ooyala, presents on real-time analytics using Cassandra, Spark & Shark at Ooyala. He offers a review of the Cassandra analytics landscape (Hadoop & HIVE), goes over custom input formats to extract data from Cassandra, and shows how Spark & Shark increase query speed and productivity over standard solutions. This talk was recorded at the DataStax Cassandra South Bay Users meetup at Ooyala.

Joe Stein Joe Stein on

In this talk, Joe Stein, Apache Kafka committer, member of the PMC, and Founder and Principal Architect at Big Data Open Source Security, will talk on Apache Kafka an open source, distributed publish-subscribe messaging system. Joe will focus on how to get started with Apache Kafka, how replication works and more! Storm is a great system for real-time analytics and stream processing but to get the data into Storm, you need to collect your data streams with consistency and availability at high loads and large volumes. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.

Al Tobey Al Tobey on

Two exciting talks on Cassandra and Go in this video! In the first talk, Kyle Kingsbury, who has tested Cassandra's behavior with respect to consistency, isolation, and transactions as part of the Jepsen project to educate users about distributed consensus, shares his surprising test results. In the second talk, Al Tobey, Open Source Mechanic at DataStax presents a brief introduction to Go and Cassandra, explaining how they are a great fit for each other using code samples and a live demo. These talks were recorded at the DataStax Cassandra SF Users meetup at Disqus.

Max Sklar Max Sklar on

When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts.  In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents.  The Dirichlet distribution is one of the basic probability distributions for describing this type of data. In this talk, Max Sklar, from Foursquare, takes a closer look at the Dirichlet distribution and it's properties, as well as some of the ways it can be computed efficiently.  This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

Russell Jurney Russell Jurney on

In this talk, Russell Jurney (author of Agile Data) presents about rapidly prototyping analytics applications using the Hadoop stack to return to agility in light of the ever deepening analytics stack. This presentation uses Hadoop, Pig, NoSQL stores and lightweight web frameworks to rapidly connect end-users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.

Alexis Lê-Quôc Alexis Lê-Quôc on

Imagine you are tasked with building a platform to monitor the performance of 500,000 servers in real-time. How would you design it? What tools would you choose? (Cassandra? Storm? Spark? HBase? ...) What technical challenges would you expect? As a monitoring company, Datadog receives tens of billions of telemetry data points every day and is working to change the way operations teams understand and troubleshoot their infrastructure and applications. In this talk, Alexis Lê-Quôc from Datadog talks about how they built their (Python-based) low-latency, real-time analytics pipeline. This talk was recorded at the NYC Data Engineering meetup at The Huffington Post.

Fangjin Yang Fangjin Yang on

In this talk, Fangjin Yang of MetaMarkets will talk on their motivations for building druid, its architecture, how it works, and its real-time capabilities. Druid is open source infrastructure for Real²time Exploratory Analytics on Large Datasets. The system uses an always-on, distributed, shared-nothing, architecture designed for real-time querying and data ingestion. It leverages column-orientation and advanced indexing structures to allow for cost effective, arbitrary exploration of multi-billion-row tables with sub-second latencies. This talk was recorded at the SF Data Engineering meetup at Square.

Jon Hyman Jon Hyman on

In this talk, "MongoDB, Analytics, and Flexible Schemas," Jon Hyman, CTO and co-founder of Appboy, discusses how Appboy takes advantage of MongoDB's schemaless data modeling for analytic pre-aggregation. Jon will also discuss how Appboy uses the aggregation framework and statistical analysis to estimate results of ad-hoc queries over tens of millions of database documents. Furthermore, he will also cover other tips and hints that he learned from growing a MongoDB set up to support thousands of writes per second. This talk was recorded at the New York MongoDB user group meetup at Ebay NYC.

Jeroen Janssens Jeroen Janssens on

In this talk, Jeroen Janssens, senior data scientist at YPlan, introduces both the outlier selection and one-class classification setting. He then presents a novel algorithm called Stochastic Outlier Selection (SOS). The SOS algorithm computes for each data point an outlier probability. These probabilities are more intuitive than the unbounded outlier scores computed by existing outlier-selection algorithms. Jeroen has evaluated SOS on a variety of real-world and synthetic datasets, and compared it to four state-of-the-art outlier-selection algorithms. The results show that SOS has a superior performance while being more robust to data perturbations and parameter settings. Click Here for the link to Jeroen's blogpost on the subject, it contains links to the d3 demo! This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

Matt Story Matt Story on

About the talk: NoSQL databases seem to be everywhere you look these days, whether it's 10gen becoming MongoDB, AWS exposing DynamoDB as a service, or a heated argument overheard at a meetup pinning Riak against Voldemort. In all the hubbub, there is one key-value store replete with name-spacing support, backed by an open standard and supporting a robust and battle-tested authorization scheme that is consistently overlooked -- POSIX filesystems.

Paul Dix Paul Dix on

In this presentation, Paul Dix from Errplane gives and introduction to InfluxDB, an open source distributed time series database that he created. Paul talks about why one would want a database that's specifically for time series and also covers its API as well as some of the key features of InfluxDB, including:

Rich Hickey Rich Hickey on

In this talk, Rich Hickey from Datomic, gives an introductory talk on Datomic as a functional database. He talks on a Datomic as value Database, which means that you can write functions that take values as arguments and similarly can return a database value as its result. He also talks on the importance of a durable, consistent Database that can be shared across processes.  Rich also offers some hands on use from Clojure. This talk was recorded at the LispNYC meetup at Meetup HQ.

Cliff Click Cliff Click on

In this talk on Machine Learning Distributed GBM, Earl Hathaway, resident Data Scientist at 0xdata, talks about distributed GBM, one of the most popular machine learning algorithms used in data mining competitions. He will discuss where distributed GBM is applicable, and review recent KDD & Kaggle uses of machine learning and distributed GBM. Also, Cliff Click, CTO of 0xdata, will talk about implementation and design choices of a Distributed GBM. This talk was recorded at the SF Data Mining meetup at Trulia.

Patrick McFadin Patrick McFadin on

In this introduction to Cassandra, Patrick McFadin, Chief Evangelist for Apache Cassandra at DataStax, will be presenting on why Cassandra is a key player in database technologies. Both large and small companies alike choose to use Apache Cassandra as their database solution and Patrick will be presenting on why they made this choice. Patrick will also be discussing Cassandra's architecture, including: data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.

Joe Doliner Joe Doliner on

In this talk, "How RethinkDB Works," Joe Doliner, Lead Engineer at RethinkDB will discuss the value of RethinkDB's flexible schemas, ease of use, and how to scale a RethinkDB cluster from one to many nodes. He will also talk about how RethinkDB fits into the CAP theorem, and its persistence semantics. Finally, Joe will give a live demo, showing how to load and analyze data, how to scale out the cluster to achieve higher performance, and even destroy a node and show how RethinkDB handles failure. This talk was recorded at the SF Data Engineering meetup at StumbleUpon Offices.

Mike Curtis Mike Curtis on

Data Driven Growth at Airbnb by Mike Curtis -- As Airbnb's VP of Engineering, Mike Curtis is tasked with using big data infrastructure to provide a better UX and drive massive growth. He's also responsible for delivering simple, elegant ways to find and stay at the most interesting places in the world. He is currently working to build a team of engineers that will have a big impact as Airbnb continues to construct a bridge between the online and offline worlds. Mike's particular focus is on search and matching, systems infrastructure, payments, trust and safety, and mobile.

André Spiegel André Spiegel on

(Contributor article “Tracking Twitter Followers with MongoDB by André Spiegel,” Consulting Engineer at MongoDB. Originally appeared on MongoDB blog

As a recently hired engineer at MongoDB, part of my ramping-up training is to create a number of small projects with our software to get a feel for how it works, how it performs, and how to get the most out of it. I decided to try it on Twitter. It’s the age-old question that plagues every Twitter user: who just unfollowed me? Surprising or not, Twitter won’t tell you that. You can see who’s currently following you, and you get notified when somebody new shows up. But when your follower count drops, it takes some investigation to figure out who you just lost.

Tim Moreton Tim Moreton on

"Understanding and Managing Cassandra's Vnodes + Under the Hood: Acunu Analytics" - In this talk, Tim Moreton, Founder and CTO at Acunu Analytics, and Nicolas Favre-Felix, Software Engineer at Acunu Analytics, share the concept, implementation and benefits of virtual nodes in Apache Cassandra 1.2 & 2.0. They also go over why virtual nodes are a replacement to token management, and how to use Acunu Analytics to collect event data, build OLAP-style cubes and ask SQL-like queries via a RESTful API, on top of Cassandra. This talk was recorded at the DataStax Cassandra SF users group meetup.

Camille Fournier Camille Fournier on

This is an Apache Zookeeper introduction - In this talk, Camille Fournier, from Rent The Runway, gives an introduction to ZooKeeper. She talks on why it's useful and how you should use it once you have it running. Camille goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is to have a reliable Zookeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.

Avery Rosen Avery Rosen on

Big Data and Wee Data - We all know MongoDB is great for Big Data, but it's also great for work on the other end of the scale -- call it "Wee Data". In this talk, MongoDB expert and Principal at Bringing Fire Consulting, Avery Rosen, talks on how this type of data is far more common than Big Data scenarios. Avery discusses how just about every project starts with it. In this domain, we don't care about disk access and indices; instead, we care about skipping past the wheel inventing and getting right down to playing with the data. MongoDB lets you persist your prototype or small-working-set data without making you deal with freeze-drying and reconstitution, provides structure well beyond csv, gets out of your way as you evolve your schemas, and provides simple tools for introspecting data and crunching numbers. This talk was recorded at the New York MongoDB User Group meetup at

Dustin Mulcahey Dustin Mulcahey on

This is a friendly Lambda Calculus Introduction by Dustin Mulcahey. LISP has its syntactic roots in a formal system called the lambda calculus. After a brief discussion of formal systems and logic in general, Dustin will dive in to the lambda calculus and make enough constructions to convince you that it really is capable of expressing anything that is "computable". Dustin then talks about the simply typed lambda calculus and the Curry-Howard-Lambek correspondence, which asserts that programs and mathematical proofs are "the same thing". This talk was recorded at the Lisp NYC meetup at Meetup HQ.

Adam Laiacano Adam Laiacano on

In this talk, Adam Laiacano from Tumblr gives an "Introduction to Digital Signal Processing in Hadoop". Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and he works through a few simple examples of low-pass filter design and application. It's much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.

Unknown author on

(Contributor article by Alex Giames, Co-Founder and CTO of CareAcross. Originally appeared on MongoDB blog)

When releasing software, most teams focus on correctness, and rightly so. But great teams also QA their code for performance. MMS Monitoring can also be used to quantify the effect of code changes on your MongoDB database. Our staging environment is an exact mirror of our production environment, so we can test code in staging to reveal performance issues that are not evident in development. We take code changes to staging, where we pull data from MMS to determine if feature X will impact performance.

John Myles White John Myles White on

In this talk, "Streaming Data Analysis and Online Learning," John Myles White of Facebook surveys some basic methods for analyzing data in a streaming manner. He focuses on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks, discussing some basic implementation issues and demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization. He also describes how these methods allow ML systems to adapt to user data in real-time. This talk was recorded at the New York Open Statistical Programming meetup at Knewton.

Eric Sammer Eric Sammer on

In this talk, Eric Sammer, from Cloudera discusses Cloudera's new open source project, Cloudera Development Kit (CDK), which helps Hadoop developers get new projects off the ground more easily. The CDK is both a framework and long-term initiative for documenting proven development practices and providing helpful doc and APIs that will make Hadoop application development as easy as possible. This talk was recorded at the Video Big Data Gurus meetup at Samsung R&D.

Joe Crobak Joe Crobak on

In this talk, Joe Crobak, formerly from Foursquare, will give a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He will also give an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at Ebay.

Geoffrey Anderson Geoffrey Anderson on

In this talk, Geoffrey Anderson from Box, discusses how Box made a shift from the Cacti monitoring system to OpenTSDB. He details the changes made to their servers as well as daily interactions with monitoring to increase agility in identifying and addressing changes in database behavior. This talk was recorded at the SF MySQL Meetup Group at Lithium Technologies.

Ben Engber Ben Engber on

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions. He describes a NoSQL Database Comparison across Couchbase, Aerospike, MongoDB, Cassandra, HBase, and others in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results. This talk was recorded at the Scale Warriors of NYC meetup at adMarketplace.

John Jensen John Jensen on

John Jensen and Mike Sherman will be speaking about their problem domain over at Rich Relevance . At Rich Relevance, they provide content personalization as a service, mostly to retailers. Unlike Pandora, they don't use intrinsic similarity metrics with in-depth knowledge about the domain they are recommending. This talk was recorded at the SF Data Mining meetup at Pandora HQ.

Todd Holloway Todd Holloway on

Recommendation engines typically produce a list of recommendations in one of two ways - through collaborative or content-based filtering. Collaborative filtering approaches to build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users, then use that model to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.

Christian Posse Christian Posse on

christian posse a/b testingDr. Christian Posse was the last panelist at the recent The Hive Big Data Think Tank meetup at Microsoft. In this talk, Christian shares some of the problems he's seen in the social network field. Not a single piece of code, algorithm, feature, or user experience goes out without A/B Testing. He discusses their development of a system of hashing functions over at LinkedIn that allow them to run millions of A/B tests concurrently without interactions between them.

Dr. Christian Posse recently joined Google as Program Manager, Technology. Before that he was Principal Product Manager and Principal Data Scientist at LinkedIn where he led the development of recommendation products as well as the next generation online experimentation platform. Prior to LinkedIn, Dr. Posse was a founding member and technology lead of Cisco Systems Network Collaboration Business Unit where he designed the search and advanced social analytics of Pulse, Cisco’s network-based search and collaboration platform for the enterprise. Prior to Cisco, Dr. Posse worked in a wide range of environments, from holding faculty positions in US universities, to leading the R&D at software companies and a US National Laboratory in the social networks, biological networks and behavioral analytics fields. His interests are diverse and include search and recommendation engines, social networks analytics, computational social and behavioral sciences, online experimentation and information fusion. He has written over 40 scientific peer-reviewed publications and holds several patents in those fields. Dr. Posse has a PhD in Statistics from the Swiss Federal Institute of Technology, Switzerland.

Caitlin Smallwood Caitlin Smallwood on

Controlled Experimentation (or A/B testing) has evolved into a powerful tool for driving product strategy and innovation. The dramatic growth in online and mobile content, media, and commerce has enabled companies to make principled data-driven decisions. Large numbers of experiments are typically run to validate hypotheses, study causation, and optimize user experience, engagement, and monetization.

Rajesh Parekh Rajesh Parekh on

Controlled Experimentation (or A/B testing) has evolved into a powerful tool for driving product strategy and innovation. The dramatic growth in online and mobile content, media, and commerce has enabled companies to make principled data-driven decisions. Large numbers of experiments are typically run to validate hypotheses, study causation, and optimize user experience, engagement, and monetization.

Sean Grove Sean Grove on

Sean Grove from ZenBox will be presenting 'Future-proofing your app using Redis' in this recording of the second talk from Redis' first meetup (in a while).

More on Redis from their website:
Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets. You can run atomic operations on these types, like appending to a string; incrementing the value in a hash; pushing to a list; computing set intersection, union and difference; or getting the member with highest ranking in a sorted set.

Sham Kakade Sham Kakade on

We are happy to share with you a recent talk by Sham Kakade from Microsoft recorded at the NYC Machine Learning meetup . In this talk he discusses a general and (computationally and statistically) efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models and latent Dirichlet allocation---by exploiting a certain tensor structure in their low-order observable moments.


Claudia Perlich Claudia Perlich on

Here's a new talk on targeted online advertising recorded at one of the NYC Machine Learning meetups. Two presenters from Media6 labs spoke about their respective papers from the recent Knowledge Discover and Data Mining conference (KDD). Claudia Perlich presented "Bid Optimizing and Inventory Scoring in Targeted Online Advertising" and Troy Raeder presented "Design Principles of Massive, Robust Prediction Systems." Full abstracts and audio below.

Laurent Gautier Laurent Gautier on

We were lucky to attend the Bay Area R users group last week where we recorded Laurent Gautier's talk on the RPy2 bridge which allows one to use Python as the glue language to develop applications while using R for the statistics and data analysis engine. He also demonstrated how a web application could be developed around an existing R script.

Alex Baranau Alex Baranau on


In this talk from the HBase NYC group, hear Alex Baranau, Software Engineer at Sematext International, give an Introduction to HBase.

This presentation will consist of two parts and will cover the "Introduction to HBase" and "Introduction to HBase Internals" topics. In the first part you'll hear about the key features of HBase and their importance, what HBase setups look like, HBase usage patterns, and when to choose HBase. The second part will cover some aspects of HBase underlying architecture, as well as some schema design insights. Understanding this will help HBase users make better use of this powerful database technology and avoid common mistakes.