In response to a recent post from MongoHQ entitled “You don’t have big data," I would generally agree with many of the author’s points.
However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.
Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)
But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”
But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?
As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.
How does Stripe do it?
We spoke to Avi Bryant at Stripe who gave us a nice description of the way Stripe has approached the building of their data pipeline.
"Stripe feeds data to HDFS from various sources, many of them unstructured or semi-structured - server logs, for example, or JSON and BSON documents. In every case, the first step is to translate these into a structured format. We've standardized on using Thrift to define the logical structure, and Parquet as the on-disk storage format.
"We chose Parquet because it's an efficient columnar format which is natively understood by Cloudera's Impala query engine, which gives us very fast relational access to our data for ad-hoc reporting. The combination of Parquet and Thrift can also be used efficiently and idiomatically from Twitter's Scalding framework, which is our tool of choice for complex batch processing.
"The next stage is "denormalization": to keep our analytical jobs and queries fast, we do the most common joins ahead of time, in Scalding, writing to new set of Thrift schemas. At the same time, we do a bunch of enhancing and annotating of the data: for example, geocoding IP addresses, parsing user agent strings, or cleaning up missing values.
"In many cases, this results in schemas with nested structure, which works well from Scalding and which Parquet is happy to store, but which Impala cannot currently query. We've developed a simple tool which will transform arbitrarily nested Parquet data into an equivalent flattened schema, and where necessary we use this to maintain a parallel copy of each data source for use from Impala. We're looking forward to future versions of Impala which might remove this extra step."
Tapad’s Data Pipeline
Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:
"- All ingested data flows through a message queue in a pub-sub fashion (we use Kafka and push multiple TB of data through it every hour)
- All data is encoded with a consistent denormalized schema that supports schema evolution (we use Avro and Protocol Buffers)
- Most of our data stores are updated in real-time from processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica and the raw events, often enriched with data from our Aerospike cluster, is stored in HDFS)
- Advanced analytics and data science computation is typically executed on the denormalized data in HDFS
- The real-time updates can always be reproduced through offline batch jobs over the HDFS stored data. We strive to make our computation logic so that it can be run in-stream *and* in batch MR-mode without any modification”
He notes that the last point allows them to retroactively change their streaming computation at-will and then backfill the other data stores with updated projections.
Dag also explains the “why” behind their use of multiple types of data technologies on the storage side and explains that each of them has its own particular “sweet-spot” which makes it attractive to them:
"- Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.
- Aerospike: Extremely fast random access read/write performance, by key (we have 3.2 billion keys and 4TB of replicated data), cross data center replication, high availability but very limited querying capabilities
- Cassandra: Medium random access read/write performance, atomic counters and a data model that lends it well to time-series storage. Flexible consistency model and cross data center replication.
- HDFS: High throughput and cheap storage.
- Vertica: Fast and powerful ad-hoc querying capabilities for interactive analysis, high availability, but no support for nested data structure, multi-valued attributes. Storage based pricing makes us limit the amount of data we put here."
How Etsy handles data
For another example, we reached Rafe Colburn, the engineering manager of Etsy's data team and asked how they handled their pipeline. Here's the scoop from Rafe:
"Etsy's analytics pipeline isn't especially linear. It starts with our instrumentation, which consists of an event logger that runs in the browser and another that can be called from the back end. Both ping some internal "beacon" servers.
"We actually use the good old logrotate program to ship the generated Apache access logs to HDFS when they reach a certain size, and process them using Hadoop. We also snapshot our production data (which resides in MySQL) nightly and copy it to HDFS as well, so that we can join our clickstream data to our transactional data.
"We usually send the output of our Hadoop jobs to our Vertica data warehouse, where we replicate our production data as well, for further analysis. We use that data to feed our homemade reporting and analytics tools.
"For features on etsy.com that use data generated from Hadoop jobs, we have a custom tool that takes the output of a job and stores it on our sharded MySQL cluster where it can be accessed at scale. This year we're looking at integrating Kafka into our pipeline to move data both from our instruments to Hadoop (and to streaming analytics tools), and also to send data back from our analytics platforms to the public site."
Another example from a company that has a sophisticated data pipeline is Square. We reached one of their engineering managers, Pascal-Louis Perez, who gave us a strategic view of their pipeline architecture.
Because of the importance of payments flowing through its system, Square has extended the concept of ‘reconciliation’ throughout its entire data pipeline; with each transformation data must be able to be audited and verified. The main issue with this approach, according to Pascal, is that it can be challenging to scale. For every payment received there are "roughly 10 to 15 accounting entries required and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.”
Square’s approach to solving this problem leverages stream processing, which allows them to map a corresponding data domain to a different stream. In Pascal’s words, "The streams represent the first level abstraction distancing the data-pipeline from the diversity of sources of data. The next level are operators combining one of multiple streams, and producing themselves one or multiple streams. One example operator is a "matcher" which takes two streams, extracts similarly kinded keys from those, and produces two streams separated based on matching criterias.”
Pascal notes that the system of stream processing and stream based operators is similar to relational algebra and its operators, but in this case it’s in real-time and on infinite relations.
It’s pretty obvious that cramming data into Hadoop won’t give you these capabilities!
Interested in getting better with data?
Hakka Labs has launched Machine Learning for Engineers
, a new 3-day training course that teaches engineers the theory and practice behind common machine learning algorithms. Get more info here