K-nearest Neighbors for Apache Flink

Dan Blazevski from Insight Data Science presents some recent progress on Apache Flink's machine learning library, focusing on a new implementation of the k-nearest neighbors (kNN) algorithm for Flink.

Apache Flink is a distributed batch and stream processing tool that, in the spirit of the Kappa Architecture, treats batch as a special case of stream processing. Dan discusses a few ways, both exact and approximate, to perform distributed kNN queries, focusing on using quadtrees to spatially partition the training set and using z-value based hashing to reduce dimensionality.
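To give a flavor of the z-value idea mentioned above: a z-value (Morton code) interleaves the bits of a point's coordinates, so that sorting points by z-value roughly preserves spatial locality and turns a multi-dimensional neighbor search into a one-dimensional range problem. The sketch below is illustrative only, assuming 2-D points with small non-negative integer coordinates; the class and method names are hypothetical, not Flink ML's actual API.

```java
public class ZValue {
    // Interleave the low 16 bits of x and y into a single Morton code:
    // bit i of x goes to even position 2i, bit i of y to odd position 2i+1.
    static long interleave(int x, int y) {
        long z = 0;
        for (int i = 0; i < 16; i++) {
            z |= ((long) (x >> i) & 1L) << (2 * i);
            z |= ((long) (y >> i) & 1L) << (2 * i + 1);
        }
        return z;
    }

    public static void main(String[] args) {
        // x = 3 (binary 011), y = 5 (binary 101)
        // interleaved -> binary 100111 = 39
        System.out.println(interleave(3, 5)); // prints 39
    }
}
```

Points that are close in 2-D space tend to share high-order bits of their z-values, which is what makes z-order sorting a useful approximate spatial index.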

Dan Blazevski loves distributed computing. After completing his PhD in Mathematics at UT Austin, he worked in computational physics and engineering at ETH Zurich and Oak Ridge National Laboratory. Although he still occasionally misses the good ol' days of Fortran and MPI, he's excited to have made the transition to industry as a Data Engineering Insight Fellow in 2015, where he started working on Flink. He now helps lead the Fellows program in NYC.

This talk was given at the NY Scala meetup in May 2016.