Making your Local Hadoop More Like AWS Elastic MapReduce
At MediaMath, we’re big users of Elastic MapReduce (EMR). EMR’s incredible flexibility makes it a great fit for our data analytics team, which processes TBs of data each day to provide insights to our clients, to better understand our own business, and to power the various product back-ends that make Terminal 1 the “marketing operating system” that it is.
An extremely important best practice for any analytics project is to ensure the local development and test environments match the production environment as much as possible. This eliminates the nasty surprise of launching a job that takes hours, only to discover that it fails late into the run due to some unmet dependency or configuration mistake. Failing to invest time in the dev/test phase is a surefire way to blow big $$.
I’ve been investigating some configuration changes you can make to your local Hadoop install to bring it in line with what you’ll find when you run a job on an EMR cluster (read here for more info on our Hadoop configuration). This is especially important to us at MediaMath, because we use S3 as a sort of centralized filesystem. EMR is designed to work wonderfully with S3, specifically by:
- Using s3:// URIs everywhere instead of s3n:// URIs
- Embedding AWS access keys
- Supporting transparent LZO compression
I run all my Hadoop jobs on my laptop using Homebrew. Homebrew is a fantastic package manager for OS X that makes it a breeze to install general UNIX utilities as well as more complicated software packages (like Hadoop and Hive).
Once Hadoop is installed through Homebrew, you’re good!
s3:// vs s3n:// URIs in HDFS
Ever wondered what the difference between an s3:// URI and an s3n:// URI is?
Up until December 2010, S3 had a 5GB object size limit. So, if you used the default S3 HDFS implementation (by specifying an s3n:// URI), you couldn’t read or write files larger than 5GB. That said, when you did read or write a file with HDFS, there was a one-to-one correspondence with the object that got stored in S3.
To process files larger than 5GB, you had to use s3:// URIs in HDFS, which actually chunked the file into multiple pieces behind the scenes, before storing each piece as a separate object in S3. When accessing something via HDFS with an s3://bucket/object URI, you might actually be downloading multiple “chunks” from S3. This page has some more info.
Nowadays the S3 limit is 5TB, so there isn’t really a need to use s3:// URIs in HDFS anymore. In fact, in EMR, s3:// and s3n:// are both aliased to the same implementation (s3n).
Here’s the relevant config:
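As a rough illustration of that aliasing, here is a minimal Python sketch that renders the two property entries as a Hadoop-style configuration document. It assumes the stock Hadoop 1.x property names fs.s3.impl and fs.s3n.impl and the NativeS3FileSystem class name; the helper render_configuration is hypothetical and only exists for this example, so treat the output as a starting point rather than EMR's exact settings.

```python
import xml.etree.ElementTree as ET

# Assumption: stock Hadoop 1.x property and class names, not copied from EMR.
NATIVE_S3 = "org.apache.hadoop.fs.s3native.NativeS3FileSystem"
S3_ALIAS_PROPERTIES = {
    "fs.s3.impl": NATIVE_S3,   # s3:// URIs resolve to the native implementation
    "fs.s3n.impl": NATIVE_S3,  # s3n:// URIs keep their usual implementation
}


def render_configuration(properties):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Prints <property><name>...</name><value>...</value></property> entries
    # that can be pasted into your local Hadoop site configuration.
    print(render_configuration(S3_ALIAS_PROPERTIES))
```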
Embedding AWS Access Keys
If you want to run Hadoop jobs on your laptop but use data stored in S3, you’ll need to ensure your credentials are stored in mapred-site.xml. If you installed Hadoop via Homebrew, just edit $(brew --prefix hadoop)/libexec/conf/mapred-site.xml.
As a side note: I set this up for both the s3n and s3 HDFS filesystem implementations, but since I only ever use the NativeS3FileSystem via s3:// URIs, the s3 properties don’t really matter (because the S3FileSystem will never be used).
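To make the credential setup concrete, the sketch below builds the access-key properties for both the s3 and s3n schemes and merges them into an existing configuration document. It assumes the standard Hadoop property names (fs.s3.awsAccessKeyId and friends) and, purely for illustration, that your keys are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; the helper names are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def s3_credential_properties(access_key, secret_key):
    """Credential properties for both the s3 and s3n filesystem schemes."""
    props = {}
    for scheme in ("s3", "s3n"):
        props[f"fs.{scheme}.awsAccessKeyId"] = access_key
        props[f"fs.{scheme}.awsSecretAccessKey"] = secret_key
    return props


def upsert_properties(conf_xml, properties):
    """Add or overwrite <property> entries in a Hadoop configuration string."""
    root = ET.fromstring(conf_xml)
    existing = {p.findtext("name"): p for p in root.findall("property")}
    for name, value in properties.items():
        prop = existing.get(name)
        if prop is None:
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
        value_el = prop.find("value")
        if value_el is None:
            value_el = ET.SubElement(prop, "value")
        value_el.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumption: keys live in these environment variables; placeholders otherwise.
    props = s3_credential_properties(
        os.environ.get("AWS_ACCESS_KEY_ID", "YOUR_ACCESS_KEY"),
        os.environ.get("AWS_SECRET_ACCESS_KEY", "YOUR_SECRET_KEY"),
    )
    print(upsert_properties("<configuration/>", props))
```

Note that this sketch only round-trips the configuration element itself, so an XML declaration or comments in the original file would not be preserved. It is meant for generating the entries, not for rewriting your config in place.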
Hadoop and LZO compression
The log files we process in our analytics platform are compressed with LZO. EMR can transparently decompress these files, so no extra configuration is needed there. Our local Hadoop install, however, cannot. Luckily, Twitter has open-sourced code on GitHub that we can use.
Getting this to work requires a few steps:
- Compiling the native C code for LZO
- Compiling the Java wrapper used by Hadoop
- Setting up the relevant classpath/library paths so Hadoop can find them
- Configuring Hadoop to use LZO compression when it finds a .lzo file
That’s a lot to remember. Homebrew to the rescue!
Basically, this uses Homebrew to install lzo and maven so you can compile the Twitter Hadoop/LZO code for the first two steps. The compiled code will be installed into
Now we can configure Hadoop to use it. First of all, the default hadoop executable actually resets JAVA_LIBRARY_PATH, which means we can’t get the native libraries onto the right path. So, we have to edit $(brew --prefix hadoop)/libexec/bin/hadoop and comment out line #353.
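Line numbers can shift between Hadoop versions, so it may help to locate the reset rather than trust a fixed line. The sketch below scans the launcher script for lines that clear JAVA_LIBRARY_PATH. It assumes the reset is written as a bare assignment such as JAVA_LIBRARY_PATH='' (an assumption about the script, not something confirmed for every release), and find_library_path_resets is a hypothetical helper.

```python
import re
import sys

# Assumption: the reset is an empty assignment, e.g. JAVA_LIBRARY_PATH='' or
# JAVA_LIBRARY_PATH="" or JAVA_LIBRARY_PATH= with nothing after it.
RESET_PATTERN = re.compile(r"""^\s*JAVA_LIBRARY_PATH=(''|"")?\s*$""")


def find_library_path_resets(script_text):
    """Return 1-based line numbers where the script clears JAVA_LIBRARY_PATH."""
    return [
        number
        for number, line in enumerate(script_text.splitlines(), start=1)
        if RESET_PATTERN.match(line)
    ]


if __name__ == "__main__":
    # Usage: python find_resets.py /path/to/libexec/bin/hadoop
    with open(sys.argv[1]) as handle:
        for number in find_library_path_resets(handle.read()):
            print(f"line {number} clears JAVA_LIBRARY_PATH")
```

Lines that are already commented out, or that append to the variable instead of emptying it, are deliberately not reported.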
Now we need to edit our .bashrc. I’m also setting JAVA_HOME just to be safe.
Lastly, let’s edit mapred-site.xml again.
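For reference, here is a hedged sketch of the kind of properties involved. It assumes the codec class names and the io.compression.codecs and io.compression.codec.lzo.class property names documented in the hadoop-lzo README; the helper functions are hypothetical and simply append the LZO codecs to whatever codec list you already have.

```python
# Assumption: codec class names as published in the hadoop-lzo README.
LZO_CODECS = [
    "com.hadoop.compression.lzo.LzoCodec",
    "com.hadoop.compression.lzo.LzopCodec",
]


def with_lzo_codecs(existing_codecs):
    """Append the LZO codecs to a comma-separated list, skipping duplicates."""
    codecs = [c.strip() for c in existing_codecs.split(",") if c.strip()]
    for codec in LZO_CODECS:
        if codec not in codecs:
            codecs.append(codec)
    return ",".join(codecs)


def lzo_properties(existing_codecs=""):
    """Property name/value pairs to add to the Hadoop site configuration."""
    return {
        "io.compression.codecs": with_lzo_codecs(existing_codecs),
        "io.compression.codec.lzo.class": LZO_CODECS[0],
    }


if __name__ == "__main__":
    stock = ("org.apache.hadoop.io.compress.GzipCodec,"
             "org.apache.hadoop.io.compress.DefaultCodec")
    for name, value in lzo_properties(stock).items():
        print(f"{name} = {value}")
```

Appending rather than replacing keeps any codecs that are already configured, which matters if your jobs also read gzip or other compressed inputs.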
Now your jobs should be able to transparently decompress LZO files, even when running on your laptop!
For more information on installing Hadoop and LZO via Homebrew, check out the GitHub page for my Homebrew Tap!
Investigating more settings
If you enable debugging when you spin up your EMR cluster, you can actually inspect the jobconf by downloading the file from S3. This is a great way to see what other settings you may need to tweak to ensure that your jobs can be tested on your laptop before you send them off to EMR.
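One way to put that jobconf to use is to diff it against your local configuration. The following is a minimal sketch, assuming both files use the usual Hadoop property layout (a name and value inside each property element); the function names and the command-line usage are hypothetical.

```python
import sys
import xml.etree.ElementTree as ET


def load_properties(conf_xml):
    """Parse a Hadoop configuration string into a {name: value} dict."""
    root = ET.fromstring(conf_xml)
    return {
        prop.findtext("name"): prop.findtext("value", default="")
        for prop in root.iter("property")
    }


def diff_properties(emr, local):
    """Return {name: (emr_value, local_value)} for settings that differ.

    A value of None means the setting is absent on that side.
    """
    diff = {}
    for name in sorted(set(emr) | set(local)):
        if emr.get(name) != local.get(name):
            diff[name] = (emr.get(name), local.get(name))
    return diff


if __name__ == "__main__":
    # Usage: python diff_jobconf.py <downloaded-emr-jobconf.xml> <local-site.xml>
    emr_path, local_path = sys.argv[1:3]
    with open(emr_path) as emr_file, open(local_path) as local_file:
        changes = diff_properties(
            load_properties(emr_file.read()),
            load_properties(local_file.read()),
        )
    for name, (emr_value, local_value) in changes.items():
        print(f"{name}: emr={emr_value!r} local={local_value!r}")
```

Expect a long list on the first run, since a jobconf also contains per-job and cluster-specific values. The useful part is usually the handful of filesystem and compression settings that differ.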
This article first appeared on the MediaMath Developer Blog.