Cloudera - Large Scale ETL with Hadoop by Eric Sammer

This talk, by Eric Sammer, Engineering Manager at Cloudera, was recorded at our SF Data Engineering meetup.

In it, Eric explains how to architect an ETL system that scales.

http://www.youtube.com/watch?feature=player_detailpage&v=1SQWzG3FIu4


Our next meetup, where Cloudera will be talking about using Morphlines for on-the-fly ETL, will take place on July 25. 

Register here.

Hadoop is commonly used for processing large swaths of data in batch. While many of the necessary building blocks for data processing exist within the Hadoop ecosystem – HDFS, MapReduce, HBase, Hive, Pig, Oozie and so on – it can be a challenge to assemble and operationalize them as a production ETL platform.

This presentation will cover one approach to data ingest, organization, format selection, process orchestration and external system integration, based on collective experience acquired across many production Hadoop deployments.

Bio: Eric Sammer is an Engineering Manager at Cloudera, where he focuses on highly available, efficient, distributed, and parallel back-end systems for data collection, analysis, and reporting. He has a solid background in software development, systems and networking, and data management systems, over a career spanning more than 10 years.

Introduction to ETL

ETL stands for Extract, Transform, and Load. It is a process used in databases and especially in data warehousing. It is commonly paired with Hadoop to handle very large data sets in batch, which lets the web systems serving users stay fast at scale by keeping heavy data processing out of their path.


Let's go over the three distinct parts of ETL.

The first part is extraction. That is a fancy way of saying that the data is retrieved: the extraction component of ETL is simply the process of getting the data out of the source database.

The second component is the transform. This can be a complex step, because a large amount of data is converted from one structure into another. The reason for the transformation is to fit that data into a different database, one that is structured differently, serves different needs, and may even belong to a different company. Typically, the transformation is done according to rules, lookup tables, or joins across data sets that give the data a new structure matching the second database's schema.

The last component is loading the data. This step is self-explanatory: the data that was extracted from the original database and converted to the second database's structure is simply written into that second database.
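To make the three steps concrete, here is a minimal sketch in Python (my illustration, not code from the talk). It assumes two in-memory SQLite databases: an orders table as the source, a hard-coded USD_RATES lookup table to drive the transform, and an orders_usd table as the differently structured target. All of these names are invented for the example.

    import sqlite3

    # Hypothetical ETL example: extract order rows from a source database,
    # transform them against a currency lookup table, and load them into a
    # differently structured target table.

    USD_RATES = {"USD": 1.0, "EUR": 1.1, "GBP": 1.3}  # assumed lookup table

    def extract(source):
        """Extract: retrieve the raw rows from the source database."""
        return source.execute(
            "SELECT order_id, amount, currency FROM orders"
        ).fetchall()

    def transform(rows):
        """Transform: normalize every amount to USD so the rows fit the
        target schema, which stores a single amount_usd column."""
        return [(oid, round(amt * USD_RATES[cur], 2)) for oid, amt, cur in rows]

    def load(warehouse, rows):
        """Load: write the reshaped rows into the target table."""
        warehouse.executemany("INSERT INTO orders_usd VALUES (?, ?)", rows)
        warehouse.commit()

    if __name__ == "__main__":
        source = sqlite3.connect(":memory:")
        source.execute(
            "CREATE TABLE orders (order_id INTEGER, amount REAL, currency TEXT)")
        source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                           [(1, 10.0, "USD"), (2, 20.0, "EUR")])

        warehouse = sqlite3.connect(":memory:")
        warehouse.execute(
            "CREATE TABLE orders_usd (order_id INTEGER, amount_usd REAL)")

        load(warehouse, transform(extract(source)))
        print(warehouse.execute("SELECT * FROM orders_usd").fetchall())
        # prints [(1, 10.0), (2, 22.0)] -- the EUR order normalized to USD

In a production Hadoop deployment the same three roles are played by much heavier machinery: ingest tools such as Flume or Sqoop handle the extract, MapReduce, Hive, or Pig jobs handle the transform, and bulk writes into the warehouse handle the load.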