Spark MLlib: Making Practical Machine Learning Easy and Scalable

Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is the Apache Spark component that focuses on large-scale machine learning (ML). With 50+ contributing organizations and 110+ individual contributors, MLlib is one of the most active open-source ML projects. In this talk, Xiangrui shares his experience developing MLlib. The talk covers both the higher-level API, ML pipelines, which makes MLlib easy to use, and the lower-level optimizations that make MLlib scale to massive datasets.

ML workflows often involve a sequence of processing and learning stages. Realistic workflows are often even more complex, involving cross-validation to choose parameters and combining multiple data sources. Inspired by scikit-learn, we proposed simple APIs that help users quickly assemble and tune ML pipelines. Under the hood, the pipeline API integrates seamlessly with Spark SQL's DataFrames, leveraging its data sources, flexible column operations, rich data types, and execution plan optimization to produce efficient and scalable implementations.
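As a minimal sketch of what assembling and tuning such a pipeline looks like with the spark.ml API in Scala: the DataFrame `training`, its column names, and the parameter values below are illustrative assumptions, not examples from the talk.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    // Stages: tokenize raw text, hash tokens into feature vectors, fit a classifier.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Cross-validation over a small parameter grid, as described above.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    // `training` is assumed to be a DataFrame with "text" (string) and
    // "label" (double) columns; fitting selects the best parameter set.
    val model = cv.fit(training)

Because every stage reads from and writes to DataFrame columns, the same fitted model can be applied to new data with a single transform call.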

There are many factors that affect a parallel implementation of an ML algorithm, e.g., the choice of optimization algorithm, platform limitations, communication patterns, data locality, numerical stability, performance, and fault tolerance. Different implementations of the same ML algorithm can perform dramatically differently. Xiangrui shares lessons learned from optimizing the alternating least squares (ALS) implementation in MLlib.
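For reference, here is a rough sketch of invoking that ALS implementation through the DataFrame-based spark.ml API; the column names, block counts, and hyperparameter values are illustrative assumptions.

    import org.apache.spark.ml.recommendation.ALS

    // `ratings` is assumed to be a DataFrame with "userId", "movieId", and
    // "rating" columns.
    val als = new ALS()
      .setRank(10)       // number of latent factors
      .setMaxIter(10)
      .setRegParam(0.1)
      .setUserCol("userId")
      .setItemCol("movieId")
      .setRatingCol("rating")
      // Block counts control how users and items are partitioned, which in
      // turn shapes the communication pattern and data locality noted above.
      .setNumUserBlocks(8)
      .setNumItemBlocks(8)

    val model = als.fit(ratings)
    val predictions = model.transform(ratings)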

This talk was recorded at the NYC Machine Learning Meetup, hosted at Pivotal Labs in New York City.