We have an S3 bucket with 35M files
Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!
During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.
As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.
My problem is that I have data for 35 million tracks sitting in 35 million individual files, totaling 15-20TB, in an S3 bucket that we need to get into our Hadoop cluster, which is the starting point for most of our pipelines.
One option is to just set up a script to copy each of the 35 million tracks into Hadoop as individual files.
No, that is too many files. There is no reason to burden the Hadoop namenodes with that many new items. It is also unlikely that we will want to only process one file at a time, so there is again no reason for them to sit in individual files. There is a high likelihood of eye rolling if we proceed down this path.
Probably also not an acceptable solution (source)
Most of our data in Hadoop is stored in Avro format with one entity per line, so that is a good starting point. We are lacking a list of what Spotify songs are in the S3 bucket, and the bucket has other data too, so we cannot blindly download everything. But what we can do is:
- Take the latest internal list of all tracks we have at Spotify (which we call metadata dumps) and
- Check if the song exists in the S3 bucket, since we have the mapping from Spotify ID to S3 key and then
- Download the Echo Nest analysis of the song and
- Write out a line with the Spotify track ID followed by the analysis data
This simple algorithm looks very much like a map-only job using a metadata dump as the input. But there are a few considerations before we kick this off.
Our Hadoop infrastructure is critical since it is the end point for logging from all our backend services - including data used for financial reporting. 20 TB of data is not much for internal processing in the Hadoop cluster, but we normally do not pull this much data in from outside sources. After talking to some of our infrastructure teams, we got the all clear to start testing.
At this point we had no idea of how long this would take. After initial testing to see that the overall idea was sound by pulling down a few thousand files, we decided to break up the download into twenty pieces. That way if a job fails catastrophically with no usable output we would not have to restart from the beginning. A quick visit to Amazon’s website, using retail pricing, led to an estimate that the total download costs could top $1,000. That is enough money that it would be embarrassing to fail and have to completely start over again.
From testing we decided to pull data down in twenty chunks. The first job was kicked off on a Monday night. By the next morning the data had downloaded and we ran sanity checks on it the data. The number of unexpected failures turned out to be six out of 1.75 million attempted downloads. After a brief celebration for being six sigma compliant, we estimated that it would take around a week to download the data if we constrained ourselves to process one chunk at a time. We were satisfied with this schedule.
After a (happily) uneventful week we had around 6TB of packed Avro files in Hadoop.
From that we wrote regular, albeit heavy, Map Reduce jobs to calculate the audio attributes (tempo, key, etc.) for each of the tracks that we surface to other groups.
Fortunately or unfortunately–depending on how you look at it–the data is not static. Fortunately, since that means that there is new music released and coming into Spotify, and who doesn’t want that? Unfortunately that means our job is not done.
Analyzing new tracks
The problem now turns to how we analyze all new tracks provided to Spotify as new music is released.
Most of our datasets are stored in a structure of the form ‘/project/dataset/date’. For example, our track play data set divides up all the songs that were played on a particular day. Other data sets are snapshots. Today’s user data set contains a list of all our users and their particular attributes for that given day. Snapshot datasets typically do not vary much from day to day.
The analysis data set stays mostly static, but needs to be mated with incoming tracks for that day. Typically we’d join yesterday’s data with any new incoming data, emit a new data set, and call it a day. The analyzer data set is significantly larger than most other snapshots and it feels like overkill to process and copy 6TB of packed data each day in order to just add in new tracks.
Other groups at Spotify are interested in the audio attributes data set we compute from the track analysis, and they will expect that this data set looks similar to other snapshots, where today’s data set is a full dump with all track attributes. The attributes set is two orders of magnitude smaller, and thus much easier to deal with.
But others are not interested in the underlying analysis data set, so we could organize it as:
- Store periodic full copies of all the analysis (standard snapshot) and in between this
- Generate incremental daily data sets of new analysis files only
- Every once in a while mate the full data set with the following incremental sets and emit a new more up to date full data set making all older sets obsolete and delete them.
We can then generate daily attributes snapshots (remembering that they are much smaller) by:
- Taking yesterday’s attributes snapshot and join it with
- Today’s full or incremental analysis set (only one of them will exist) and
- Check if we are missing attributes for a given track analysis and if so calculate it, otherwise
- Emit the existing attributes for the given track to not redo work
This way we can commit to having standard daily snapshots without running the risk of being asked why we copy 6 TB of data daily to add in attributes for new tracks, and everybody will be happy.
As an ongoing project, there are a lot of interesting problems yet to consider. We want to make the attributes available in real time by making sure we provide Cassandra databases to speed up ad-hoc modeling as well as allowing backend services to make on the fly decisions about what songs to recommend based on song characteristics. Cool stuff.
Thank you for reading,