Max Sklar and Maryam Aly from Foursquare lead this session in Tech@NYU's Startup Week. They cover the theory and history of natural language processing (NLP), as well as Foursquare's own journey in processing the millions of "tips" that its users write for one another.
About Foursquare Engineering
Foursquare is a small but highly ambitious company that aims to change the way people keep up with friends and discover what's nearby. We have 85 engineers distributed across New York and San Francisco, working to turn nearly 5 billion check-ins into automatic, personalized recommendations that ping your phone. We're not afraid to move fast and break things as we release, launch, iterate, update and announce -- sometimes all in the same day. We're a close-knit team and, especially at the end of a long day over beers, we feel like we're inventing the future together.
When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example, if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. In this talk, Max Sklar, from Foursquare, takes a closer look at the Dirichlet distribution and its properties, as well as some of the ways it can be computed efficiently. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.
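To make the two ideas in this description concrete, here is a minimal Python sketch (not from the talk; the vocabulary and documents are invented for illustration). It represents a document as a vector of word counts over a fixed vocabulary, and draws a sample from a Dirichlet distribution using the standard trick of normalizing independent Gamma draws.

```python
import random
from collections import Counter

def count_vector(doc, vocab):
    """Represent a document as a vector of word counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]

def sample_dirichlet(alpha, rng=random):
    """Draw one sample from Dirichlet(alpha) by normalizing independent
    Gamma(alpha_i, 1) draws -- each sample is a valid multinomial parameter
    vector (non-negative entries summing to 1)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# A toy 3-word vocabulary and document (illustrative only).
vocab = ["pizza", "coffee", "museum"]
vec = count_vector("great pizza and great coffee near the museum", vocab)
print(vec)  # [1, 1, 1]

# A symmetric Dirichlet prior over 3-outcome multinomials.
theta = sample_dirichlet([1.0, 1.0, 1.0])
print(sum(theta))  # ~1.0
```

The Gamma-normalization sampler is one of the efficient computation routes the talk alludes to: it needs only one Gamma draw per dimension.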
In this talk, Joe Crobak, formerly from Foursquare, will give a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He will also give an architectural overview of Azkaban, Luigi, and Oozie, elaborating on some features, tools, and practices that can help build a Hadoop workflow system from scratch or improve upon an existing one. This talk was recorded at the NYC Data Engineering meetup at eBay.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe talks about what features and qualities are important for a workflow system.
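The core job all three of these engines share -- running batch jobs in dependency order, which is exactly what cron lacks -- can be sketched in a few lines with Python's standard-library `graphlib`. The pipeline below is a hypothetical example, not from the talk:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: one ingress job, two batch jobs, one egress job.
# Each key maps to the set of jobs it depends on.
pipeline = {
    "ingest_logs": set(),
    "sessionize": {"ingest_logs"},
    "aggregate_checkins": {"ingest_logs"},
    "export_to_db": {"sessionize", "aggregate_checkins"},
}

def run_pipeline(dag, jobs):
    """Execute jobs in topological (dependency) order. A real workflow
    engine layers retries, scheduling, and failure alerting on top of
    exactly this ordering."""
    order = list(TopologicalSorter(dag).static_order())
    for name in order:
        jobs[name]()
    return order

ran = []
jobs = {name: (lambda n=name: ran.append(n)) for name in pipeline}
print(run_pipeline(pipeline, jobs))  # ingest_logs first, export_to_db last
```

With cron, each of these jobs would be scheduled at a fixed time and would silently produce garbage if an upstream job ran late or failed; the dependency graph is what makes the pipeline reliable as it grows past a handful of jobs.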
(Original post with audio and slides is here.)
Blake Shaw: Thank you all for coming. As was mentioned, my name is Blake, and today I’m going to be talking about machine learning with large networks of people and places. So, here at Foursquare, we think there’s a great opportunity to leverage massive amounts of location data to help people better understand and connect with places all over the world.
Foursquare is now aware of over 1.5 billion check-ins from 15 million people at 30 million different places all over the world. Each check-in can be thought of as an edge in a vast network connecting people to each other and to the places that they care about most.
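That edge view of check-ins is easy to make concrete. The sketch below (toy data, invented venue names -- not Foursquare's actual method) stores check-ins as (user, venue) edges in the bipartite people-places graph and scores venue similarity by the Jaccard overlap of their visitor sets, one simple signal such a network enables:

```python
from collections import defaultdict

# Toy check-ins: each tuple is one edge connecting a person to a place.
checkins = [
    ("alice", "Joe's Pizza"), ("alice", "MoMA"),
    ("bob", "Joe's Pizza"), ("bob", "MoMA"),
    ("carol", "Joe's Pizza"), ("carol", "Blue Bottle"),
]

# Invert the edge list: venue -> set of users who checked in there.
visitors = defaultdict(set)
for user, venue in checkins:
    visitors[venue].add(user)

def jaccard(a, b):
    """Similarity of two venues via overlap of their visitor sets."""
    return len(visitors[a] & visitors[b]) / len(visitors[a] | visitors[b])

print(jaccard("Joe's Pizza", "MoMA"))   # 2 shared of 3 total visitors -> 0.666...
print(jaccard("MoMA", "Blue Bottle"))   # no shared visitors -> 0.0
```

At Foursquare's scale the same idea runs over billions of edges rather than six, but the structure -- people and places as nodes, check-ins as edges -- is the same.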