Workflow Engines for Hadoop by Joe Crobak

In this talk, Joe Crobak, formerly of Foursquare, gives a brief overview of how a workflow engine fits into a standard Hadoop-based analytics stack. He also gives an architectural overview of Azkaban, Luigi, and Oozie, elaborating on features, tools, and practices that can help you build a Hadoop workflow system from scratch or improve an existing one. This talk was recorded at the NYC Data Engineering meetup at eBay.


Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out using cron to manage workflows, but soon discover that it doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe talks about which features and qualities matter in a workflow system.
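
To make the contrast with cron concrete, here is a minimal sketch of the core thing a workflow engine adds: tasks run in dependency order rather than at fixed clock times, and a rerun skips work that already finished. This is an illustration only, not taken from the talk and not how Azkaban, Luigi, or Oozie are implemented. The task names, the `PIPELINE` mapping, and the `run_pipeline` helper are all hypothetical; the only library used is Python's standard `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# The three names mirror ingress, batch computation, and egress.
PIPELINE = {
    "ingest_logs": set(),
    "aggregate_daily": {"ingest_logs"},
    "export_report": {"aggregate_daily"},
}


def run_pipeline(dependencies, run_task, completed=None):
    """Run tasks in dependency order, skipping those already completed.

    `run_task` is any callable that takes a task name. Returns the list
    of task names that were actually run, in the order they ran.
    """
    completed = set() if completed is None else completed
    ran = []
    for task in TopologicalSorter(dependencies).static_order():
        if task in completed:
            continue  # finished in an earlier run, so a rerun resumes after it
        run_task(task)  # an exception here stops the run before dependents start
        completed.add(task)
        ran.append(task)
    return ran


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda name: print("running", name))
```

With cron, the same three jobs would be scheduled at guessed offsets (say, an hour apart), and a slow or failed ingest would let the later jobs run against missing data. Expressing the dependencies explicitly is what lets an engine wait, retry, or resume from the point of failure.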


Joe Crobak worked on Hadoop and analytics infrastructure at Foursquare, where he built internal tools and APIs used by dozens of engineers and analysts on a daily basis.
