Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!
During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.
As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.
My problem is that I have data for 35 million tracks sitting in 35 million individual files, totaling 15-20TB, in an S3 bucket that we need to get into our Hadoop cluster, which is the starting point for most of our pipelines.
One option is to just set up a script to copy each of the 35 million tracks into Hadoop as individual files.
No, that is too many files. There is no reason to burden the Hadoop namenodes with that many new items. It is also unlikely that we will want to only process one file at a time, so there is again no reason for them to sit in individual files. There is a high likelihood of eye rolling if we proceed down this path.
Probably also not an acceptable solution (source)
Most of our data in Hadoop is stored in Avro format with one entity per line, so that is a good starting point. We are lacking a list of what Spotify songs are in the S3 bucket, and the bucket has other data too, so we cannot blindly download everything. But what we can do is:
- Take the latest internal list of all tracks we have at Spotify (which we call metadata dumps) and
- Check if the song exists in the S3 bucket, since we have the mapping from Spotify ID to S3 key and then
- Download the Echo Nest analysis of the song and
- Write out a line with the Spotify track ID followed by the analysis data
This simple algorithm looks very much like a map-only job using a metadata dump as the input. But there are a few considerations before we kick this off.
Our Hadoop infrastructure is critical since it is the end point for logging from all our backend services - including data used for financial reporting. 20 TB of data is not much for internal processing in the Hadoop cluster, but we normally do not pull this much data in from outside sources. After talking to some of our infrastructure teams, we got the all clear to start testing.
At this point we had no idea of how long this would take. After initial testing to see that the overall idea was sound by pulling down a few thousand files, we decided to break up the download into twenty pieces. That way if a job fails catastrophically with no usable output we would not have to restart from the beginning. A quick visit to Amazon’s website, using retail pricing, led to an estimate that the total download costs could top $1,000. That is enough money that it would be embarrassing to fail and have to completely start over again.
From testing we decided to pull data down in twenty chunks. The first job was kicked off on a Monday night. By the next morning the data had downloaded and we ran sanity checks on it the data. The number of unexpected failures turned out to be six out of 1.75 million attempted downloads. After a brief celebration for being six sigma compliant, we estimated that it would take around a week to download the data if we constrained ourselves to process one chunk at a time. We were satisfied with this schedule.
After a (happily) uneventful week we had around 6TB of packed Avro files in Hadoop.
From that we wrote regular, albeit heavy, Map Reduce jobs to calculate the audio attributes (tempo, key, etc.) for each of the tracks that we surface to other groups.
Fortunately or unfortunately–depending on how you look at it–the data is not static. Fortunately, since that means that there is new music released and coming into Spotify, and who doesn’t want that? Unfortunately that means our job is not done.
Analyzing new tracks
The problem now turns to how we analyze all new tracks provided to Spotify as new music is released.
Most of our datasets are stored in a structure of the form ‘/project/dataset/date’. For example, our track play data set divides up all the songs that were played on a particular day. Other data sets are snapshots. Today’s user data set contains a list of all our users and their particular attributes for that given day. Snapshot datasets typically do not vary much from day to day.
The analysis data set stays mostly static, but needs to be mated with incoming tracks for that day. Typically we’d join yesterday’s data with any new incoming data, emit a new data set, and call it a day. The analyzer data set is significantly larger than most other snapshots and it feels like overkill to process and copy 6TB of packed data each day in order to just add in new tracks.
Other groups at Spotify are interested in the audio attributes data set we compute from the track analysis, and they will expect that this data set looks similar to other snapshots, where today’s data set is a full dump with all track attributes. The attributes set is two orders of magnitude smaller, and thus much easier to deal with.
But others are not interested in the underlying analysis data set, so we could organize it as:
- Store periodic full copies of all the analysis (standard snapshot) and in between this
- Generate incremental daily data sets of new analysis files only
- Every once in a while mate the full data set with the following incremental sets and emit a new more up to date full data set making all older sets obsolete and delete them.
We can then generate daily attributes snapshots (remembering that they are much smaller) by:
- Taking yesterday’s attributes snapshot and join it with
- Today’s full or incremental analysis set (only one of them will exist) and
- Check if we are missing attributes for a given track analysis and if so calculate it, otherwise
- Emit the existing attributes for the given track to not redo work
This way we can commit to having standard daily snapshots without running the risk of being asked why we copy 6 TB of data daily to add in attributes for new tracks, and everybody will be happy.
As an ongoing project, there are a lot of interesting problems yet to consider. We want to make the attributes available in real time by making sure we provide Cassandra databases to speed up ad-hoc modeling as well as allowing backend services to make on the fly decisions about what songs to recommend based on song characteristics. Cool stuff.
Thank you for reading,
Hosted by Hakka Labs
This 3-day course will demonstrate the fundamental concepts of machine learning by working on a dataset of moderate size, using open source software tools.
This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. By the end of this course, you will be able to:
-Apply common classification methods for supervised learning when given a data set
-Apply algorithms for unsupervised learning problems
-Select/reduce features for both supervised and unsupervised learning problems -Optimize code for common machine learning tasks by correcting inefficiencies by using advanced data structures
-Choose basic tools and criteria to perform predictive analysis
The intended audience of this Machine Learning course is the engineer with strong programming skills as well as a certain level of exposure to linear algebra and probability. Students should understand the basic issue of prediction as well as Python.
Day 1: Linear Algebra/Probability Fundamentals and Supervised Learning
The goal of day one is to give engineers the linear algebra/probability foundation they need to tackle problems during the rest of the course and introduce tools for supervised learning problems.
-Quick Introduction to Machine Learning
-Linear Algebra, Probability and Statistics,
-Linear and Quadratic Discriminant Analysis
-Support Vector Machines and Kernels
-Lab: Working on classification problems on a data set
Day 2: Unsupervised learning, Feature Selection and Reduction
The goal of day two is to help students understand the mindset and tools of data scientists.
-K nearest neighbors, Random Forests, Naive Bayes Classifier
-Information Theoretic Approaches
-Feature Selection and Model Selection/Creation
-Principal Component Analysis/Kernel PCA
-Independent Component Analysis
- Lab: Choosing Features and applying unsupervised learning methods to a data set
Day 3: Performance Optimization of Machine Learning Algorithms
The goal of day three is to help students understand how developers contribute to complex machine learning projects.
-Unsupervised Learning Continued
-DB-SCAN and K-D Trees
-Recommendation Systems and Matrix Factorization Methods
-Lab: Longer lab working on back-end Machine Learning optimization programming problems in Python