(Easy) High performance text processing in Machine Learning

In this talk, Daniel Krasner covers the rapid development of high-performance, scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling, and general machine learning. He demonstrates how Python modules, in particular the Rosetta Python library, can be used to process, clean, and tokenize large volumes of text data, extract features from it, and build statistical models. The Rosetta library focuses on small, simple modules (each with a command-line interface) that use very little memory and are parallelized with the multiprocessing package. Daniel also touches on LDA topic modeling and on different implementations of it (Vowpal Wabbit and Gensim). The talk is part presentation and part “real life” example tutorial. It was recorded at the NYC Machine Learning meetup at Pivotal Labs.
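
To make the "small module, low memory, parallelized with multiprocessing" idea concrete, here is a minimal sketch using only the Python standard library. It is not Rosetta's API: the names `tokenize` and `count_tokens`, the regular expression, and the `processes` and `chunksize` defaults are all assumptions chosen for illustration. It streams lines through a tokenizer and accumulates token counts without building a list of every tokenized line.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Hypothetical tokenization rule: runs of lowercase letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(line):
    """Lowercase a line and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(line.lower())


def count_tokens(lines, processes=1, chunksize=1000):
    """Accumulate token counts from an iterable of text lines.

    With processes == 1 the lines are tokenized serially; otherwise they
    are sent to a worker pool in batches of `chunksize` lines.
    """
    counts = Counter()
    if processes == 1:
        for tokens in map(tokenize, lines):
            counts.update(tokens)
        return counts
    with Pool(processes=processes) as pool:
        # imap returns an iterator, so results are consumed one at a
        # time instead of being collected into a single list first.
        for tokens in pool.imap(tokenize, lines, chunksize=chunksize):
            counts.update(tokens)
    return counts
```

Because an open file is itself an iterable of lines, a call such as `count_tokens(open(path), processes=4)` would work, where `path` is a placeholder for a text file. On platforms that start worker processes by spawning, a call with `processes` greater than 1 should sit under an `if __name__ == "__main__":` guard.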