Digging into the Dirichlet Distribution

Max Sklar

In recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example, if there are two possible user-generated ratings (like and dislike), then each item is represented as a vector of two counts. In a higher-dimensional case, each document may be expressed as a vector of word counts, with the vector large enough to cover all the important words in the corpus. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. In this talk, Max Sklar, from Foursquare, takes a closer look at the Dirichlet distribution and its properties, as well as some of the ways it can be computed efficiently. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.

55:15

The Dirichlet distribution is surprisingly expressive on its own, but it can also be used as a building block for even more powerful and deep models such as mixtures and topic models.
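As a rough illustration of the like/dislike example in the abstract, the sketch below treats an item's two rating counts as multinomial data and places a Dirichlet prior over the underlying rating probabilities. It is not taken from the talk: the function names, the uniform prior, and the counts (8 likes, 2 dislikes) are assumptions chosen for illustration, and sampling uses the standard construction of normalizing independent Gamma draws.

```python
import random

def dirichlet_posterior(prior, counts):
    """Conjugate update: add the observed counts to the prior pseudo-counts."""
    return [a + n for a, n in zip(prior, counts)]

def dirichlet_mean(alpha):
    """Mean of Dirichlet(alpha): each parameter divided by their sum."""
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_dirichlet(alpha, rng):
    """Draw one probability vector by normalizing independent Gamma(a, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical item with 8 likes and 2 dislikes (made-up numbers).
prior = [1.0, 1.0]   # assumed uniform prior over (like, dislike)
counts = [8, 2]

posterior = dirichlet_posterior(prior, counts)   # [9.0, 3.0]
mean = dirichlet_mean(posterior)                 # [0.75, 0.25]
sample = sample_dirichlet(posterior, random.Random(0))

print("posterior parameters:", posterior)
print("posterior mean:", mean)
print("one sampled probability vector:", sample)
```

With only two categories this reduces to the Beta distribution; the same functions work unchanged for longer count vectors, such as word counts per document.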

Bio: Max Sklar is an engineer and a machine learning specialist. At Foursquare, his continuing objective is to make the app smarter and more interesting. Over the last two years, Max has spearheaded the effort to apply Natural Language Processing technology to Foursquare’s user-generated text corpus. He has spoken at a variety of conferences and meetups in New York’s tech scene, and has been an adjunct instructor for NYU’s data structures course for four semesters. He holds an M.S. in Information Systems from NYU and a B.S. in Computer Science from Yale, and can be found on Twitter @maxsklar.

next read

Agile Analytics Applications

Russell Jurney

In this talk, Russell Jurney (author of Agile Data) shows how to rapidly prototype analytics applications on the Hadoop stack, returning to agility in light of the ever-deepening analytics stack. The presentation uses Hadoop, Pig, NoSQL stores and lightweight web frameworks to rapidly connect end users to real insights. This talk was recorded at the SF Data Mining meetup at Trulia.