Big Data - Distributed Random Forest by Jan Vitek

This talk is by Jan Vitek, a professor of computer science at Purdue University. In it, Jan discusses the design and implementation of Distributed Random Forest, a big data algorithm for H2O. The talk was recorded at the SF Data Mining meetup at Trulia.



Bio: Dr. Vitek has paired with 0xdata to make a better world for math. He is a full Professor of Computer Science at Purdue and is currently on sabbatical with 0xdata. Jan's students are solving some of the hardest problems in programming language and virtual machine implementation. Jan is also a hacker: he developed Distributed Random Forest for H2O.

History of the distributed random forest algorithm

The algorithm traces back to work by Yali Amit and Donald Geman. Their paper, "Shape Quantization and Recognition with Randomized Trees," was published in 1997. It was significant because it was the first to mention the idea of searching over a random subset of the available decisions when splitting a node, in the context of growing a single tree. But that wasn't the random forest algorithm as we know it today. In 2001, Leo Breiman introduced the algorithm in the form we recognize and use now. Breiman is generally regarded as one of the pioneers of data mining and machine learning.

[caption id="attachment_779" align="aligncenter" width="200"]Photo: Leo Breiman[/caption]

Breiman's approach builds a forest of uncorrelated trees using what is described as a CART-like procedure. These random trees are combined with randomized node optimization and a procedure called bagging. Bagging is a machine learning meta-algorithm designed to improve the stability and accuracy of algorithms used in statistical classification and regression, and it is usually applied to decision tree methods. And just a little bit of trivia about the algorithm: Leo Breiman originally wrote it in Fortran 77.
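To make the two ingredients above concrete, here is a minimal single-machine sketch in Python. It is not H2O's Distributed Random Forest and not Breiman's Fortran code; it only illustrates bagging (each tree is trained on a bootstrap sample) and randomized node optimization (each split searches a random subset of features). All function names, the Gini impurity criterion, the midpoint thresholds, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def best_split(rows, labels, n_candidates, rng):
    """Randomized node optimization: search only a random subset of features."""
    candidates = rng.sample(range(len(rows[0])), n_candidates)
    best = None  # (weighted impurity, feature index, threshold)
    for f in candidates:
        values = sorted({row[f] for row in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, n_candidates, rng, max_depth=5):
    """Grow one CART-style tree as nested dicts."""
    if max_depth == 0 or len(set(labels)) == 1:
        return {"leaf": majority(labels)}
    split = best_split(rows, labels, n_candidates, rng)
    if split is None:  # the sampled features cannot separate these rows
        return {"leaf": majority(labels)}
    _, f, t = split
    left = [i for i, row in enumerate(rows) if row[f] <= t]
    right = [i for i, row in enumerate(rows) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": grow_tree([rows[i] for i in left], [labels[i] for i in left],
                          n_candidates, rng, max_depth - 1),
        "right": grow_tree([rows[i] for i in right], [labels[i] for i in right],
                           n_candidates, rng, max_depth - 1),
    }

def predict_tree(tree, row):
    """Walk from the root to a leaf and return its label."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def grow_forest(rows, labels, n_trees=25, n_candidates=1, seed=0):
    """Bagging: train each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx],
                                n_candidates, rng))
    return forest

def predict_forest(forest, row):
    """Aggregate the trees by majority vote."""
    return majority([predict_tree(tree, row) for tree in forest])

# Toy data (hypothetical): two well-separated clusters in two dimensions.
rows = [(0, 1), (1, 0), (2, 1), (1, 2), (0, 0),
        (8, 9), (9, 8), (10, 9), (9, 10), (8, 8)]
labels = ["a"] * 5 + ["b"] * 5
forest = grow_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

Because every tree sees a different bootstrap sample and a different random choice of features, the trees make partly independent errors, and the majority vote tends to be more stable than any single tree. A distributed version would additionally have to decide how data and trees are spread across machines, which this sketch does not attempt.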

Random forest algorithm in use today

Today, random forest is widely regarded as one of the best available classification algorithms, and it is used to accurately classify large data sets. Big data is, of course, a hot topic. Many sites need to aggregate large quantities of information about their users, their users' friends, and friends of friends, and then apply business logic and analytics algorithms to that data in order to extract business intelligence. That takes specialized algorithms and people who know how to work with them. The need to handle big data is growing especially fast with the proliferation of social media and social features across the web.
