Solving an Ad Tech Machine Learning Problem Using Clustering and Viterbi

Understanding the billions of data points we ingest each month is no easy task. Through the development of models that allow us to do so, we’ve noticed some commonalities in the process of converting raw data to real-world understanding. Although you can get pretty good results with simple models and algorithms, digging beyond the obvious abstractions and using more sophisticated methods requires a lot of effort. In school we often learn different techniques and algorithms in isolation, with neatly fitted input sets, and study their properties. In the real world, however, especially the world of location data, we often need to combine these approaches in novel ways in order to yield usable results.

In this article we look at the process of understanding the importance of different locations as they relate to consumers, which takes us from simple joins in Hadoop to a sophisticated time-series algorithm called Viterbi.

At PlaceIQ we’re focused on understanding the consumer journey so our clients can deliver the right message at the right time. And with 58% of American adults owning a smartphone (as of January 2014), targeted mobile ads have never been a more powerful medium. Where people go and what they do before and after they visit a place of business, buy a product, or procure a service, tells us a lot about what’s important to different consumers.

Challenging the status quo

Our engineering team uses mobile device movement data for analytics, which is essentially a time-series of latitudes and longitudes, keyed on device. We begin by abstracting the lat/long of location to 100-meter by 100-meter tiles. With this abstraction we can do basic analysis to understand what consumers are doing within these tiles. We do this by applying different types of filters we’ve developed for our tiles, which provide information on what sorts of buildings or businesses are found on that tile.

Our initial approach was to join the physical data we have on a tile with the device IDs found in that same space, and then conduct analysis from there. As initial attempts often are, this method proved too cumbersome.

In general, our data for both devices and tiles follows a power-law distribution. If we plot the number of devices on the y-axis and the number of requests on the x-axis (both log-scaled), our data looks like a line with a negative slope. See Figure 1. Devices or tiles with hundreds of thousands, or even millions, of ad requests are rare, but they do exist. These are the ones that cause issues with map-reduce, as the reducers computing those tiles can take much longer to finish.


Figure 1 Example from our data showing the power law distribution
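As a sketch, the shape of this distribution can be checked with a simple frequency-of-frequencies count. The toy request log below is entirely made up (real input would be billions of rows in HDFS); it just shows the mechanics: count requests per device, then count how many devices made exactly k requests.

```python
from collections import Counter

# Toy request log of (device_id, tile_id) pairs. Device "d0" is a
# deliberate heavy hitter; devices d2..d99 each appear once.
requests = (
    [("d0", "t1")] * 1000
    + [("d1", "t2")] * 50
    + [(f"d{i}", "t3") for i in range(2, 100)]
)

# Requests per device, then "how many devices made exactly k requests".
per_device = Counter(dev for dev, _ in requests)
freq_of_freq = Counter(per_device.values())

# Under a power law, log(count) vs log(k) is roughly a line with a
# negative slope: many devices with few requests, few with many.
for k in sorted(freq_of_freq):
    print(k, freq_of_freq[k])
```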

If we consider, for example, the number of ad requests per day in any 100m x 100m square in Times Square – where over 300,000 people pass through daily – we start to see reducers that take orders of magnitude longer to finish than others. This is a case where the data is legitimate but still causes issues. There are also tiles we refer to as "spammy". Many of these sit at zip code or metro centroids from which hundreds of thousands of ad requests originate daily, yet the physical locations tend to be entirely meaningless – an open field, say, or a graveyard.

These scaling problems make sense when you think of the scaling in reference to the complexity of the algorithm being run. Most of these issues wouldn't be so bad if everything ran in linear time. Once you start using O(n^2) or O(n^3) algorithms, however, data of this size can start costing you enough memory and compute time to make running it in map-reduce a challenge.
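A back-of-the-envelope illustration of why skew plus a superlinear algorithm hurts; the tile sizes and per-reducer cost model here are hypothetical, not measurements from our cluster:

```python
# Hypothetical per-reducer cost model: a tile with n requests costs
# n units under a linear algorithm and n**2 units under a quadratic
# one (e.g. an all-pairs comparison within the tile).
tile_sizes = {"typical": 1_000, "times_square": 300_000}

linear_cost = {t: n for t, n in tile_sizes.items()}
quadratic_cost = {t: n ** 2 for t, n in tile_sizes.items()}

# The hot tile holds 300x the data, but under a quadratic algorithm
# its reducer does 90,000x the work of a typical one -- which is why
# a handful of reducers run orders of magnitude longer than the rest.
ratio_linear = linear_cost["times_square"] // linear_cost["typical"]
ratio_quadratic = quadratic_cost["times_square"] // quadratic_cost["typical"]
```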

Cracking the code

Our solution to this problem was to pass the contextual information needed for each tile, as a pre-computed table, through the distributed cache. We then iterated over each device in the time series, looking up the tile associated with each location and adding the tile statistics we needed to the time series.
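The map-side join can be sketched locally as a dictionary lookup: the pre-computed tile table, shipped to every mapper via the distributed cache, is loaded into memory once, and each time-series point is annotated from it. The field names, tile IDs, and values below are illustrative, not our production schema:

```python
# Pre-computed tile table, as it might look once loaded from the
# distributed cache into memory (illustrative fields and values).
tile_table = {
    "tile_41_-73": {"category": "commercial", "daily_requests": 1200},
    "tile_42_-73": {"category": "residential", "daily_requests": 40},
}

# Device-keyed time series: (device, timestamp, tile) observations.
device_series = [
    ("device_a", "2014-06-01T09:00", "tile_41_-73"),
    ("device_a", "2014-06-01T21:00", "tile_42_-73"),
]

def annotate(series, tiles):
    """Attach tile context to each observation via a dict lookup,
    avoiding a reduce-side join entirely."""
    return [(dev, ts, tile, tiles.get(tile, {})) for dev, ts, tile in series]

annotated = annotate(device_series, tile_table)
```

Because every mapper holds the whole table, no single reducer ever has to absorb a hot tile's entire key group.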

At this point our data is a time series keyed on device that contains the location for each device, plus contextual information and statistics from the tile where it originated. This allows us to do some basic analysis, such as applying a residential or commercial filter, or finding unique IDs in each tile. While these approaches can be powerful, we still sought more sophisticated methods to understand this data and the consumer journey.

This leads us into DBSCAN, Hidden Markov Models (HMMs) and the Viterbi algorithm. One of the first things to notice about our time series is that the information we have, which we truly know, is just the data points: the time and location. In order to extract understanding and meaning from these points we need to infer what is going on behind that data. What is the hidden sequence of events that is causing the data we see?

Our previously mentioned tile filter is one way of doing this, but is simplistic. So we turned to HMMs, a mathematical tool for taking observed events and modeling them as if they were the result of some other hidden sequence of events, which we can't observe. We make some underlying assumptions about the probabilities of seeing each observed event due to the actual state, and the probability of transitioning from each state to the next. From this model we seek to extract the "hidden" time series of events, which caused our observed sequence. An efficient algorithm for doing this, which has become ubiquitous in the signal-processing world, is the Viterbi algorithm.


   Figure 2 Example of an HMM
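Viterbi itself is a standard dynamic-programming algorithm. Below is a compact sketch over a toy two-state model; the states, observation symbols, and every probability are invented for illustration and are not our production model:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence.
    Works in log space to avoid underflow on long series."""
    # Each entry maps state -> (log prob of best path ending here, path).
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            prob, path = max(
                (V[-1][prev][0] + math.log(trans_p[prev][s])
                 + math.log(emit_p[s][o]), V[-1][prev][1])
                for prev in states)
            layer[s] = (prob, path + [s])
        V.append(layer)
    return max(V[-1].values())[1]

# Toy model: hidden states "home"/"work" emit coarse day/night
# observation symbols; all numbers are made up.
states = ("home", "work")
start_p = {"home": 0.6, "work": 0.4}
trans_p = {"home": {"home": 0.7, "work": 0.3},
           "work": {"home": 0.4, "work": 0.6}}
emit_p = {"home": {"night": 0.8, "day": 0.2},
          "work": {"night": 0.1, "day": 0.9}}

path = viterbi(["night", "day", "day", "night"],
               states, start_p, trans_p, emit_p)
```

This is O(T * S^2) for T observations and S states, which is what makes it practical at scale compared to enumerating all S^T candidate sequences.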

In order to make use of HMMs we need to abstract our data into states with transitions between them. If we're trying to understand the significant places that a particular device frequents, our tile abstraction isn't necessarily right for this model. We may have data on a large number of tiles surrounding a single point that is significant to a consumer. Really, we want to treat that entire cluster of points as one state, and we're interested in when the device transitions from this state to another – from one cluster of ad requests to another.

Taking it to the next cluster

This clustering is done through an algorithm called DBSCAN, which starts at one point and looks for nearby points (within a pre-defined tolerance) to recruit to the cluster. It continues this process on the points it just included and iterates until the chain breaks and all the remaining points are too far away to be considered part of the cluster. This allows us to solve the problem of having ad requests over multiple tiles because we can cluster the points together and call that a single state.
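A minimal, illustrative DBSCAN might look like the following. It uses plain Euclidean distance on toy (x, y) points; real lat/long data would need a geodesic metric, and a production version would use a spatial index rather than scanning all points for each neighborhood query:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters by chaining eps-neighborhoods.
    points: list of (x, y). Returns one cluster label per point,
    with -1 meaning noise."""
    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if (points[i][0] - q[0]) ** 2
                   + (points[i][1] - q[1]) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                # recruit reachable points until the chain breaks
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise joins as a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)   # j is also core, so expand through it
    return labels

# Two tight groups of ad requests plus one far-away outlier.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=2.0, min_pts=2)
```

The two tight groups come out as two clusters and the outlier as noise, which is exactly the behavior we rely on: a pile of ad requests spread across adjacent tiles collapses into one state.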

The output of this clustering is a time series that no longer transitions from one lat/long to another at each time step, but from one state to another, which is exactly what we need for our HMM. We're now interested in finding the sequence of hidden events that best explains these observed transitions. This is where the Viterbi algorithm comes in. After obtaining the best-fitting sequence of events, we use our underlying assumptions about how people transition from one place or state to another to assign a significance to each state.
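The hand-off between the two steps can be sketched as a run-length collapse of cluster labels into a state sequence. The cluster IDs below are illustrative placeholders, not real states from our pipeline:

```python
from itertools import groupby

# After clustering, each time-series point carries a cluster label
# instead of a raw lat/long (illustrative labels).
clustered_series = ["c0", "c0", "c0", "c1", "c1", "c0", "c2", "c2"]

# Run-length collapse: keep one entry per visit to a cluster. This
# is the observed transition sequence the HMM operates on.
state_sequence = [label for label, _ in groupby(clustered_series)]

# Dwell time per visit (number of observations) is the kind of side
# information that feeds assumptions about a state's significance.
dwell = [(label, sum(1 for _ in run))
         for label, run in groupby(clustered_series)]
```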

An important caveat is worth making here. The HMM is a powerful tool, but without domain knowledge of consumer behavior it would be worthless, or might even give worse results than our initial naive approach. With this knowledge, however, we can use the results of Viterbi to better understand what a device is doing at each location and how that consumer’s journey continues.

To reiterate a point made earlier, one of the interesting aspects of the process we just outlined is that the first steps were relatively simple. Aside from the issues caused by the power-law distribution of our data, it took rather basic techniques to go from raw data to having some idea of what consumers might be doing at a location, based on our simple tile abstraction. Further refinement of these results required multiple approaches. While DBSCAN and Viterbi are powerful tools on their own, it was the combination of the two that finally got us where we wanted to be: DBSCAN to cluster our data and improve our tile abstraction, then Viterbi to make probabilistic inferences that improve on the static information we gained from the basic tile filters.

What we’ve been able to do in this area so far has worked very well; however, refinement and improvement is a continuous process. To keep an edge over competitors and provide the best results to our clients, we are constantly searching for and evaluating new techniques to get the most out of every data point we ingest.