Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.
Spotify joined forces with The Echo Nest this spring. The Echo Nest specializes in, among other things, knowing as much as possible about individual songs. For example, they can figure out the tempo and key of a song, if it is acoustic or electric, if it is instrumental or has vocals, if it is a live recording, and so on. Very exciting stuff!
During the past couple of months The Echo Nest has been running their audio analyzer over a big portion of the tracks in our catalog, uploading the resulting analysis files to an S3 bucket for storage.
As we integrate The Echo Nest into Spotify, we want to start making use of the analysis output within the main Spotify pipelines.
The problem: I have analysis data for 35 million tracks sitting in 35 million individual files, totaling 15-20 TB, in an S3 bucket, and we need to get it into our Hadoop cluster, which is the starting point for most of our pipelines.
One option is to just set up a script to copy each of the 35 million tracks into Hadoop as individual files.
No, that is too many files. There is no reason to burden the Hadoop namenodes with that many new items. It is also unlikely that we will want to only process one file at a time, so there is again no reason for them to sit in individual files. There is a high likelihood of eye rolling if we proceed down this path.
Probably also not an acceptable solution (source)
Most of our data in Hadoop is stored in Avro format with one entity per line, so that is a good starting point. We are lacking a list of what Spotify songs are in the S3 bucket, and the bucket has other data too, so we cannot blindly download everything. But what we can do is:
- Take the latest internal list of all tracks we have at Spotify (which we call metadata dumps) and
- Check if the song exists in the S3 bucket, since we have the mapping from Spotify ID to S3 key and then
- Download the Echo Nest analysis of the song and
- Write out a line with the Spotify track ID followed by the analysis data
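The steps above amount to a per-track map function. Here is a minimal sketch of that function in Python, with all names (`build_analysis_record`, `fetch_analysis`, the tab-separated output format) purely illustrative; the real job runs on Hadoop against S3, not in-memory dicts.

```python
def build_analysis_record(track_id, s3_keys, fetch_analysis):
    """Map one metadata-dump entry to an output line, or None.

    track_id       -- Spotify track ID from the metadata dump
    s3_keys        -- dict mapping Spotify ID -> S3 key (may lack the track)
    fetch_analysis -- callable that downloads the analysis blob for an S3 key
    """
    key = s3_keys.get(track_id)
    if key is None:
        return None  # track has no analysis in the bucket; skip it
    analysis = fetch_analysis(key)
    # One entity per line: track ID followed by its analysis data.
    return "%s\t%s" % (track_id, analysis)
```

Because the S3 fetch is passed in as a callable, the mapping logic can be exercised locally with a stub before pointing it at the real bucket.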
This simple algorithm looks very much like a map-only job using a metadata dump as the input. But there are a few considerations before we kick this off.
Our Hadoop infrastructure is critical since it is the end point for logging from all our backend services - including data used for financial reporting. 20 TB of data is not much for internal processing in the Hadoop cluster, but we normally do not pull this much data in from outside sources. After talking to some of our infrastructure teams, we got the all clear to start testing.
At this point we had no idea of how long this would take. After initial testing to see that the overall idea was sound by pulling down a few thousand files, we decided to break up the download into twenty pieces. That way if a job fails catastrophically with no usable output we would not have to restart from the beginning. A quick visit to Amazon’s website, using retail pricing, led to an estimate that the total download costs could top $1,000. That is enough money that it would be embarrassing to fail and have to completely start over again.
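As a back-of-the-envelope check on that estimate, an assumed retail egress price of $0.05/GB (illustrative only; actual S3 pricing varies by tier and year) puts 20 TB right in the ballpark quoted above:

```python
# Rough transfer-cost estimate at an assumed $0.05/GB retail rate.
size_tb = 20
price_per_gb = 0.05
cost = size_tb * 1024 * price_per_gb
print(cost)  # prints 1024.0 -- roughly the $1,000 figure in the text
```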
The first job was kicked off on a Monday night. By the next morning the first chunk had downloaded and we ran sanity checks on the data. The number of unexpected failures turned out to be six out of 1.75 million attempted downloads. After a brief celebration for being six sigma compliant, we estimated that it would take around a week to download all the data if we constrained ourselves to processing one chunk at a time. We were satisfied with this schedule.
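The six sigma joke checks out: the classic six sigma benchmark is 3.4 defects per million opportunities (DPMO), and the failure rate here lands almost exactly on it:

```python
# Failure rate in defects per million opportunities (DPMO).
failures = 6
attempts = 1_750_000
dpmo = failures / attempts * 1_000_000
print(round(dpmo, 2))  # prints 3.43 -- right at the 3.4 DPMO six sigma benchmark
```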
After a (happily) uneventful week we had around 6TB of packed Avro files in Hadoop.
From that we wrote regular, albeit heavy, MapReduce jobs to calculate the audio attributes (tempo, key, etc.) for each track, which we surface to other groups.
Fortunately or unfortunately, depending on how you look at it, the data is not static. Fortunately, since that means new music is being released and coming into Spotify, and who doesn't want that? Unfortunately, it means our job is not done.
Analyzing new tracks
The problem now turns to how we analyze all new tracks provided to Spotify as new music is released.
Most of our datasets are stored in a structure of the form ‘/project/dataset/date’. For example, our track play data set divides up all the songs that were played on a particular day. Other data sets are snapshots. Today’s user data set contains a list of all our users and their particular attributes for that given day. Snapshot datasets typically do not vary much from day to day.
The analysis data set stays mostly static, but needs to be mated with incoming tracks for that day. Typically we’d join yesterday’s data with any new incoming data, emit a new data set, and call it a day. The analyzer data set is significantly larger than most other snapshots and it feels like overkill to process and copy 6TB of packed data each day in order to just add in new tracks.
Other groups at Spotify are interested in the audio attributes data set we compute from the track analysis, and they will expect that this data set looks similar to other snapshots, where today’s data set is a full dump with all track attributes. The attributes set is two orders of magnitude smaller, and thus much easier to deal with.
But others are not interested in the underlying analysis data set, so we could organize it as:
- Store periodic full copies of all the analysis (standard snapshot) and in between this
- Generate incremental daily data sets of new analysis files only
- Every once in a while, mate the full data set with the subsequent incremental sets and emit a new, more up-to-date full data set, making all older sets obsolete so they can be deleted.
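The periodic compaction step above can be sketched as a simple merge, here modeling analysis sets as in-memory dicts from track ID to analysis data (the real sets are packed Avro files processed by MapReduce; the function name is illustrative):

```python
def compact(full_snapshot, incrementals):
    """Mate a full analysis snapshot with later daily incremental sets.

    full_snapshot -- dict: track ID -> analysis (the last full dump)
    incrementals  -- iterable of dicts, oldest first
    Returns a new, more up-to-date full snapshot; newer entries win.
    """
    merged = dict(full_snapshot)
    for daily in incrementals:
        merged.update(daily)  # later days overwrite earlier entries
    return merged
```

Once the merged set is emitted, the old full snapshot and the consumed incrementals can be deleted.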
We can then generate daily attributes snapshots (remembering that they are much smaller) by:
- Taking yesterday’s attributes snapshot and join it with
- Today’s full or incremental analysis set (only one of them will exist) and
- Check if we are missing attributes for a given track's analysis and, if so, calculate them, otherwise
- Emit the existing attributes for the given track so we do not redo work
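The daily attributes join above can be sketched as follows, again modeling the data sets as dicts for clarity (names and the `compute_attrs` callable are illustrative; the real job is a MapReduce join over Avro snapshots):

```python
def daily_attributes(yesterday_attrs, todays_analysis, compute_attrs):
    """Produce today's attributes snapshot without redoing work.

    yesterday_attrs -- dict: track ID -> attributes (yesterday's snapshot)
    todays_analysis -- dict: track ID -> analysis (full or incremental set)
    compute_attrs   -- callable deriving attributes from one analysis
    """
    snapshot = dict(yesterday_attrs)  # carry over existing attributes as-is
    for track_id, analysis in todays_analysis.items():
        if track_id not in snapshot:  # only compute for new tracks
            snapshot[track_id] = compute_attrs(analysis)
    return snapshot
```

On most days `todays_analysis` is a small incremental set, so the expensive attribute computation only runs for newly released tracks.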
This way we can commit to having standard daily snapshots without running the risk of being asked why we copy 6 TB of data daily to add in attributes for new tracks, and everybody will be happy.
As an ongoing project, there are a lot of interesting problems yet to consider. We want to make the attributes available in real time by making sure we provide Cassandra databases to speed up ad-hoc modeling as well as allowing backend services to make on the fly decisions about what songs to recommend based on song characteristics. Cool stuff.
Thank you for reading,
Getting data from Kafka to Hadoop should be simple, which is why the community has so many options to choose from. Cloudera engineer Gwen Shapira reviews some popular solutions: Storm, Spark, Flume, and Camus. She goes over the pros and cons of each, recommends use cases, and covers future development plans.
Apache Sentry (incubating) is a new open source authorization module that integrates with Hadoop-based SQL query engines (Hive and Impala). Two Apache Sentry developers, Xuefu Zhang and Sravya Tirukkovalur, Software Engineers at Cloudera, provide details on its implementation and do a short demo.
At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is.
Apache Tez is a modern data processing engine designed for YARN on Hadoop 2. In this talk, Bikas Saha (Technical Staff Member, Hortonworks) explains Tez features via real use cases from early adopters like Hive, Pig and Cascading. He shows examples of using the Tez API to target both new and existing applications to the Tez engine.
Watch as Jeremy Carroll (Operations engineer, Pinterest) explains how Pinterest achieves scalability with HBase and EC2. He'll describe the architecture, deployment strategies, monitoring systems, and troubleshooting workflows.
Spotify has over 24 million active users. 1 out of every 4 users is a paying subscriber. Ad revenues allow 3 out of 4 users to enjoy a free experience.
Evan Chan (Software Engineer, Ooyala), describes his experience using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. He starts by surveying the Cassandra analytics landscape, including Hadoop and HIVE, and touches on the use of custom input formats to extract data from Cassandra. Then, he dives into Spark and Shark (two memory-based cluster computing frameworks) and explains how they enable often dramatic improvements in query speed and productivity.
In this talk, Evan Chan, Software Engineer at Ooyala, presents on real-time analytics using Cassandra, Spark & Shark at Ooyala. He offers a review of the Cassandra analytics landscape (Hadoop & HIVE), goes over custom input formats to extract data from Cassandra, and shows how Spark & Shark increase query speed and productivity over standard solutions. This talk was recorded at the DataStax Cassandra South Bay Users meetup at Ooyala.
In this talk, Russell Jurney (author of Agile Data) presents on rapidly prototyping analytics applications with the Hadoop stack, as a way to return to agility in light of the ever-deepening analytics stack. The presentation uses Hadoop, Pig, NoSQL stores, and lightweight web frameworks to rapidly connect end users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.
In this talk, Camille Fournier, from Rent The Runway, gives an introduction to Apache ZooKeeper: why it's useful and how you should use it once you have it running. She goes over the high-level purpose of ZooKeeper and covers some of the basic use cases and operational concerns. One of the requirements for running Storm or a Hadoop cluster is a reliable ZooKeeper setup. When you’re running a service distributed across a large cluster of machines, even tasks like reading configuration information, which are simple on single-machine systems, can be hard to implement reliably. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.
In this talk, Adam Laiacano from Tumblr gives an "Introduction to Digital Signal Processing in Hadoop". Adam introduces the concepts of digital signals, filters, and their interpretation in both the time and frequency domain, and he works through a few simple examples of low-pass filter design and application. It's much more application focused than theoretical, and there is no assumed prior knowledge of signal processing. This talk was recorded at the NYC Machine Learning Meetup at Pivotal Labs.
In this talk, Terence Yim, from Continuuity, discusses Weave, a simple set of libraries that allow you to easily manage distributed applications through an abstraction layer built on Hadoop YARN. Weave allows you to use YARN’s distributed capabilities with a programming model that is similar to running threads. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.
In this talk, Eric Sammer, from Cloudera, discusses Cloudera's new open source project, the Cloudera Development Kit (CDK), which helps Hadoop developers get new projects off the ground more easily. The CDK is both a framework and a long-term initiative for documenting proven development practices and providing helpful docs and APIs that will make Hadoop application development as easy as possible. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.
In this talk, Joe Crobak, formerly of Foursquare, gives a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He also gives an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at Ebay.
In this talk, Abhijit Lele from Hortonworks, discusses YARN architecture and how to get started developing for the next generation of Hadoop. This talk was recorded at the New York Hadoop User Group meetup at Gilt.
In this talk, Jeff Magnusson, Manager of Data Platform Architecture at Netflix, discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at the Big Data Gurus meetup at Samsung R&D.
Hey! Join us as we host our second SF Data Engineering meetup on July 25th at 7PM at the StumbleUpon offices at 301 Brannan, San Francisco, CA. We'll be on the 2nd floor with Wolfgang Hoschek, a senior software engineer at Cloudera, as he discusses using Morphline for on-the-fly ETL.
This talk is by Jan Vitek, a professor in computer science at Purdue University. In it, Jan discusses the design and implementation of Distributed Random Forest, a big data algorithm for H2O. This talk was recorded at the SF Data Mining meetup at Trulia.
This talk is by Adam Ilardi, a data scientist at eBay, and was recorded at the NY Scala meetup at eBay NYC. Adam talks about eBay's transition from Pig and raw Cascading to Scalding and explains other ways they use Scala.
Recently we recorded Josh Wills (Cloudera, Google), who gave an awesome overview of best practices for building analytical applications on Hadoop at the SF Data Mining group. The talk and slides are after the break.