Unknown author

Anomaly detection in healthcare data is an enabling technology for detecting overpayment and fraud. In this talk, we show how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset and show real anomalies discovered in the data.

Gandalf Hernandez


Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!

During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.

As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.

My problem is that I have data for 35 million tracks sitting in 35 million individual files, totaling 15-20TB, in an S3 bucket that we need to get into our Hadoop cluster, which is the starting point for most of our pipelines.

Naive solution

One option is to just set up a script to copy each of the 35 million tracks into Hadoop as individual files.

No, that is too many files. There is no reason to burden the Hadoop namenodes with that many new items. It is also unlikely that we will want to only process one file at a time, so there is again no reason for them to sit in individual files. There is a high likelihood of eye rolling if we proceed down this path.


Probably also not an acceptable solution


Most of our data in Hadoop is stored in Avro format with one entity per line, so that is a good starting point. We are lacking a list of what Spotify songs are in the S3 bucket, and the bucket has other data too, so we cannot blindly download everything. But what we can do is:

  •   Take the latest internal list of all tracks we have at Spotify (which we call metadata dumps) and

  •   Check if the song exists in the S3 bucket, since we have the mapping from Spotify ID to S3 key and then

  •   Download the Echo Nest analysis of the song and

  •   Write out a line with the Spotify track ID followed by the analysis data

This simple algorithm looks very much like a map-only job using a metadata dump as the input. But there are a few considerations before we kick this off.
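The actual job isn't shown here, but a minimal Hadoop Streaming-style mapper along these lines illustrates the idea. The bucket name and the metadata-dump record layout below are assumptions made for the sketch, not the real ones.

```python
# Hypothetical Hadoop Streaming mapper: reads metadata-dump lines from stdin
# (assumed here to be "<spotify_track_id>\t<s3_key>"), fetches the Echo Nest
# analysis from S3, and emits "<spotify_track_id>\t<base64(analysis)>".
import base64
import sys

import boto3
from botocore.exceptions import ClientError

BUCKET = "echonest-analysis"  # illustrative bucket name, not the real one

s3 = boto3.client("s3")

def main():
    for line in sys.stdin:
        track_id, s3_key = line.rstrip("\n").split("\t")
        try:
            obj = s3.get_object(Bucket=BUCKET, Key=s3_key)
        except ClientError:
            # No analysis for this track (or the download failed): skip it
            # and bump a Hadoop counter so failures are visible in the job UI.
            sys.stderr.write("reporter:counter:analysis,missing,1\n")
            continue
        analysis = obj["Body"].read()
        # Base64-encode so the payload survives line-oriented output.
        print(track_id + "\t" + base64.b64encode(analysis).decode("ascii"))

if __name__ == "__main__":
    main()
```

In practice the output would be written as packed Avro rather than base64-encoded text lines; the encoding here just keeps the sketch line-oriented.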


Our Hadoop infrastructure is critical since it is the end point for logging from all our backend services - including data used for financial reporting. 20 TB of data is not much for internal processing in the Hadoop cluster, but we normally do not pull this much data in from outside sources. After talking to some of our infrastructure teams, we got the all clear to start testing.

At this point we had no idea how long this would take. After initial testing to confirm the overall idea was sound by pulling down a few thousand files, we decided to break the download up into twenty pieces. That way, if a job failed catastrophically with no usable output, we would not have to restart from the beginning. A quick visit to Amazon’s website, using retail pricing, led to an estimate that the total download costs could top $1,000. That is enough money that it would be embarrassing to fail and have to completely start over again.
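As a back-of-the-envelope sanity check on that number (a minimal sketch; the per-GB transfer rate below is an assumed illustrative figure, not Amazon's actual price at the time):

```python
# Rough download-cost estimate; the $/GB figure is an assumed illustrative
# rate, not the actual S3 data-transfer-out price in 2014.
total_tb = 17.5                  # midpoint of the 15-20 TB estimate
assumed_price_per_gb = 0.06      # assumed retail transfer-out rate, $/GB
print(total_tb * 1024 * assumed_price_per_gb)   # ~1,075 dollars
```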


From testing we decided to pull the data down in twenty chunks. The first job was kicked off on a Monday night. By the next morning the data had downloaded and we ran sanity checks on it. The number of unexpected failures turned out to be six out of 1.75 million attempted downloads, or roughly 3.4 per million. After a brief celebration for being six sigma compliant, we estimated that it would take around a week to download the data if we constrained ourselves to processing one chunk at a time. We were satisfied with this schedule.

After a (happily) uneventful week we had around 6TB of packed Avro files in Hadoop.

From that data we wrote regular, albeit heavy, MapReduce jobs to calculate the audio attributes (tempo, key, etc.) for each track, which we then surface to other groups.
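The attribute calculation itself isn't described in detail here, but conceptually each of those jobs boils down to pulling a handful of track-level fields out of each analysis record, roughly as below. The JSON field names are assumptions based on the analyzer's public output, not the exact internal schema.

```python
# Sketch of the attribute-extraction step: given the (track_id, analysis JSON)
# records loaded earlier, pull out a few track-level attributes.
import json

def extract_attributes(track_id, analysis_json):
    analysis = json.loads(analysis_json)
    track = analysis.get("track", {})   # assumed top-level "track" object
    return {
        "track_id": track_id,
        "tempo": track.get("tempo"),
        "key": track.get("key"),
        "mode": track.get("mode"),        # major/minor
        "loudness": track.get("loudness"),
    }
```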

Fortunately or unfortunately–depending on how you look at it–the data is not static. Fortunately, since that means that there is new music released and coming into Spotify, and who doesn’t want that? Unfortunately that means our job is not done.

Analyzing new tracks

The problem now turns to how we analyze all new tracks provided to Spotify as new music is released.

Most of our data sets are stored in a structure of the form ‘/project/dataset/date’. For example, each day’s partition of our track play data set contains only the songs that were played on that day. Other data sets are snapshots: today’s user data set contains a list of all our users and their attributes as of that day. Snapshot data sets typically do not vary much from day to day.
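Concretely, the two flavors look something like the listing below (the project and data set names are made up for illustration):

```
/playback/trackplays/2014-08-18   <- only the plays from 2014-08-18
/playback/trackplays/2014-08-19
/accounts/users/2014-08-18        <- full snapshot of all users as of that day
/accounts/users/2014-08-19        <- nearly identical full snapshot, one day later
```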

The analysis data set stays mostly static, but needs to be mated with incoming tracks for that day. Typically we’d join yesterday’s data with any new incoming data, emit a new data set, and call it a day. The analyzer data set is significantly larger than most other snapshots and it feels like overkill to process and copy 6TB of packed data each day in order to just add in new tracks.

Other groups at Spotify are interested in the audio attributes data set we compute from the track analysis, and they will expect that this data set looks similar to other snapshots, where today’s data set is a full dump with all track attributes. The attributes set is two orders of magnitude smaller, and thus much easier to deal with.

But no other group is interested in the underlying analysis data set itself, so we are free to organize it as follows (a sketch of the compaction step appears after the list):

  •   Store periodic full copies of all the analysis (a standard snapshot) and, in between,

  •   Generate incremental daily data sets containing only the new analysis files, then

  •   Every once in a while mate the full data set with the incremental sets that follow it and emit a new, more up-to-date full data set, making all older sets obsolete so they can be deleted.
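The post does not include the actual jobs, but that periodic "mate the full set with the incrementals" step is essentially a compaction. A minimal sketch of the logic, assuming records keyed by Spotify track ID with newer data winning:

```python
# Minimal sketch of the periodic compaction: merge the last full analysis
# snapshot with the daily incremental sets written after it, producing a
# new full snapshot.
def compact(full_snapshot, incrementals):
    """full_snapshot: dict track_id -> analysis record
       incrementals:  list of dicts, oldest first"""
    new_full = dict(full_snapshot)
    for daily in incrementals:
        new_full.update(daily)   # new or re-analyzed tracks win
    return new_full

# In the real pipeline this would be a MapReduce job over packed Avro files
# rather than in-memory dicts; the older full and incremental sets can be
# deleted once the new full snapshot is written.
```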

We can then generate daily attributes snapshots (remembering that they are much smaller) by:

  •   Taking yesterday’s attributes snapshot and joining it with

  •   Today’s full or incremental analysis set (only one of the two will exist), then

  •   Checking whether we are missing attributes for a given track analysis and, if so, calculating them, otherwise

  •   Emitting the existing attributes for the given track so we do not redo work

This way we can commit to having standard daily snapshots without running the risk of being asked why we copy 6 TB of data daily to add in attributes for new tracks, and everybody will be happy.
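A minimal sketch of the daily attributes step described above (just the logic, not the production job; extract_attributes stands in for the per-track attribute calculation sketched earlier):

```python
# Sketch of the daily attributes job: yesterday's attributes snapshot plus
# today's analysis set (full or incremental) -> today's attributes snapshot.
def daily_attributes(yesterday_attrs, todays_analysis, extract_attributes):
    """yesterday_attrs: dict track_id -> attributes record
       todays_analysis: dict track_id -> raw analysis record"""
    today = dict(yesterday_attrs)          # carry forward existing attributes
    for track_id, analysis in todays_analysis.items():
        if track_id not in today:          # only compute for unseen tracks
            today[track_id] = extract_attributes(track_id, analysis)
    return today
```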


This is an ongoing project, and there are a lot of interesting problems yet to consider. We want to make the attributes available in real time, both by providing Cassandra databases to speed up ad-hoc modeling and by allowing backend services to make on-the-fly decisions about which songs to recommend based on their characteristics. Cool stuff.

Thank you for reading,

Gandalf Hernandez


Sravya Tirukkovalur

Apache Sentry (incubating) is a new open source authorization module that integrates with Hadoop-based SQL query engines (Hive and Impala). Two Apache Sentry developers, Xuefu Zhang and Sravya Tirukkovalur, Software Engineers at Cloudera, provide details on its implementation and give a short demo.

Ian Hummel

At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is.

Evan Chan

Evan Chan (Software Engineer, Ooyala) describes his experience using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. He starts by surveying the Cassandra analytics landscape, including Hadoop and Hive, and touches on the use of custom input formats to extract data from Cassandra. Then he dives into Spark and Shark (two memory-based cluster computing frameworks) and explains how they enable often dramatic improvements in query speed and productivity.

Evan Chan

In this talk, Evan Chan, Software Engineer at Ooyala, presents on real-time analytics using Cassandra, Spark & Shark. He offers a review of the Cassandra analytics landscape (Hadoop & Hive), goes over custom input formats to extract data from Cassandra, and shows how Spark & Shark increase query speed and productivity over standard solutions. This talk was recorded at the DataStax Cassandra South Bay Users meetup at Ooyala.

Russell Jurney

In this talk, Russell Jurney (author of Agile Data) presents on rapidly prototyping analytics applications with the Hadoop stack in order to regain agility in light of the ever-deepening analytics stack. The presentation uses Hadoop, Pig, NoSQL stores, and lightweight web frameworks to rapidly connect end users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.

Camille Fournier

In this talk, Camille Fournier, from Rent The Runway, gives an introduction to Apache ZooKeeper: why it's useful and how you should use it once you have it running. She goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is a reliable ZooKeeper setup. When you're running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.

Adam Laiacano

In this talk, Adam Laiacano from Tumblr gives an "Introduction to Digital Signal Processing in Hadoop". Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and he works through a few simple examples of low-pass filter design and application. It's much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.

Eric Sammer

In this talk, Eric Sammer, from Cloudera, discusses Cloudera's new open source project, the Cloudera Development Kit (CDK), which helps Hadoop developers get new projects off the ground more easily. The CDK is both a framework and a long-term initiative for documenting proven development practices and providing helpful documentation and APIs that make Hadoop application development as easy as possible. This talk was recorded at the Video Big Data Gurus meetup at Samsung R&D.

Joe Crobak

In this talk, Joe Crobak, formerly of Foursquare, gives a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He also gives an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at eBay.