
Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, Xiangrui shares his experience in developing MLlib. The talk covers both the higher-level APIs (ML pipelines) that make MLlib easy to use and the lower-level optimizations that make MLlib scale to massive datasets.

Hosted by Hakka Labs

This 3-day course will demonstrate the fundamental concepts of machine learning by working on a dataset of moderate size, using open source software tools.

Course Goals
This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. By the end of this course, you will be able to:
-Apply common classification methods for supervised learning when given a data set
-Apply algorithms for unsupervised learning problems
-Select/reduce features for both supervised and unsupervised learning problems
-Optimize code for common machine learning tasks by correcting inefficiencies with advanced data structures
-Choose basic tools and criteria to perform predictive analysis

Intended Audience
This Machine Learning course is intended for engineers with strong programming skills and some exposure to linear algebra and probability. Students should be familiar with Python and with the basic problem of prediction.

Class Schedule

Day 1: Linear Algebra/Probability Fundamentals and Supervised Learning
The goal of day one is to give engineers the linear algebra/probability foundation they need to tackle problems during the rest of the course and to introduce tools for supervised learning problems.

-Quick Introduction to Machine Learning
-Linear Algebra, Probability and Statistics
-Regression Methods
-Linear and Quadratic Discriminant Analysis
-Support Vector Machines and Kernels
-Lab: Working on classification problems on a data set

Day 2: Unsupervised Learning, Feature Selection and Reduction
The goal of day two is to help students understand the mindset and tools of data scientists.

-Classification Continued
-K nearest neighbors, Random Forests, Naive Bayes Classifier
-Boosting Methods
-Information Theoretic Approaches
-Feature Selection and Model Selection/Creation
-Unsupervised Learning
-Principal Component Analysis/Kernel PCA
-Independent Component Analysis
-Lab: Choosing features and applying unsupervised learning methods to a data set

Day 3: Performance Optimization of Machine Learning Algorithms
The goal of day three is to help students understand how developers contribute to complex machine learning projects.
-Unsupervised Learning Continued
-DBSCAN and K-D Trees
-Anomaly Detection
-Locality-Sensitive Hashing
-Recommendation Systems and Matrix Factorization Methods
-Lab: Longer lab working on back-end machine learning optimization problems in Python

Get your tickets here