Globally Scalable Web Document Classification Using Word2Vec

Extracting information from unstructured web documents is a common problem for many applications, and determining which category each document belongs to can be especially challenging at planetary scale.

In this talk, Kohei Nakaji, a software engineer at SmartNews, shows how the company achieves globally scalable, real-time web document classification using new machine learning techniques, especially Word2Vec's extended distributed representation model. He also discusses the pros and cons of using distributed representations from a real-world, operational standpoint, as well as new classification approaches being used in Japan.

This video was recorded at the SF Bayarea Machine Learning meetup at Digital Garage Development LLC.