anna smith anna smith on

Anna Smith from Rent the Runway talks about how they've evolved their data pipeline over time to deal with infrastructure constraints, disparate data sources, and changing data sources/quality all while still serving reports and data back to the website with minimal downtime. Anna also covers how they leveraged Luigi to ensure robust reporting without forcing non-technical analysts to learn Python.

38:54

This video was recorded at the NYC Data Engineering meetup at Spotify in NYC.

Continue
Rafe Coburn Rafe Coburn on

Three years ago, Etsy's analytics data pipeline was built around a pixel hosted on Akamai, FTP uploads, and Amazon EMR. Rafe Colburn, manager of the data engineering team at Etsy, talks about their migration to a data ingestion pipeline based on Kafka. He gives an overview on how they rebuilt their data pipeline without disrupting ongoing analytics work, as well as the tradeoffs made in building these systems.

Continue
Michael Hwang Michael Hwang on

Here at SumAll, we provide analytics from over 40 different sources that range from social media data to health and fitness tracking. We collect and connect metrics from a wide range of 3rd party API data sources for our clients to help them make better business decisions. One of engineering's biggest challenge has been on building and maintaining a data pipeline that is both performant and reliable.

Continue
Nick Gorski Nick Gorski on

TellApart Software Engineer Nick Gorski takes us through a technical deep-dive into TellApart's personalization system. He discusses the machine learning data pipeline at TellApart that powers the models, real-time calculations of the expected value of shoppers, and how to translate that value into a bid price for every bid request received (hundreds of thousands per second).

Continue
Toby Matejovsky Toby Matejovsky on

Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, realtime and batch graph-building, and logging. In this talk, Tapad Senior Software Developer Toby Matejovsky speaks about the creation and evolution of the pipeline. He demonstrates a concrete example – a day in the life of an event tracking pixel. Toby also talks about common challenges that his team has overcome such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.

Continue
Pete Soderling Pete Soderling on

In response to a recent post from MongoHQ entitled “You don’t have big data," I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.

How does Stripe do it?

We spoke to Avi Bryant at Stripe who gave us a nice description of the way Stripe has approached the building of their data pipeline.

"Stripe feeds data to HDFS from various sources, many of them unstructured or semi-structured -  server logs, for example, or JSON and BSON documents. In every case, the first step is to translate these into a structured format. We've standardized on using Thrift to define the logical structure, and Parquet as the on-disk storage format.

"We chose Parquet because it's an efficient columnar format which is natively understood by Cloudera's Impala query engine, which gives us very fast relational access to our data for ad-hoc reporting. The combination of Parquet and Thrift can also be used efficiently and idiomatically from Twitter's Scalding framework, which is our tool of choice for complex batch processing.

"The next stage is "denormalization": to keep our analytical jobs and queries fast, we do the most common joins ahead of time, in Scalding, writing to new set of Thrift schemas. At the same time, we do a bunch of enhancing and annotating of the data: for example, geocoding IP addresses, parsing user agent strings, or cleaning up missing values.

"In many cases, this results in schemas with nested structure, which works well from Scalding and which Parquet is happy to store, but which Impala cannot currently query. We've developed a simple tool which will transform arbitrarily nested Parquet data into an equivalent flattened schema, and where necessary we use this to maintain a parallel copy of each data source for use from Impala. We're looking forward to future versions of Impala which might remove this extra step."

Tapad’s Data Pipeline

Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:

"- All ingested data flows through a message queue in a pub-sub fashion (we use Kafka and push multiple TB of data through it every hour)

- All data is encoded with a consistent denormalized schema that supports schema evolution (we use Avro and Protocol Buffers)

- Most of our data stores are updated in real-time from processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica and the raw events, often enriched with data from our Aerospike cluster, is stored in HDFS)
- Advanced analytics and data science computation is typically executed on the denormalized data in HDFS

- The real-time updates can always be reproduced through offline batch jobs over the HDFS stored data. We strive to make our computation logic so that it can be run in-stream *and* in batch MR-mode without any modification”

He notes that the last point allows them to retroactively change their streaming computation at-will and then backfill the other data stores with updated projections.

Dag also explains the “why” behind their use of multiple types of data technologies on the storage side and explains that each of them has its own particular “sweet-spot” which makes it attractive to them:

"- Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.

- Aerospike: Extremely fast random access read/write performance, by key (we have 3.2 billion keys and 4TB of replicated data), cross data center replication, high availability but very limited querying capabilities

- Cassandra: Medium random access read/write performance, atomic counters and a data model that lends it well to time-series storage. Flexible consistency model and cross data center replication.

- HDFS: High throughput and cheap storage.

- Vertica: Fast and powerful ad-hoc querying capabilities for interactive analysis, high availability, but no support for nested data structure, multi-valued attributes. Storage based pricing makes us limit the amount of data we put here."

How Etsy handles data

For another example, we reached Rafe Colburn, the engineering manager of Etsy's data team and asked how they handled their pipeline. Here's the scoop from Rafe:

"Etsy's analytics pipeline isn't especially linear. It starts with our instrumentation, which consists of an event logger that runs in the browser and another that can be called from the back end. Both ping some internal "beacon" servers.

"We actually use the good old logrotate program to ship the generated Apache access logs to HDFS when they reach a certain size, and process them using Hadoop. We also snapshot our production data (which resides in MySQL) nightly and copy it to HDFS as well, so that we can join our clickstream data to our transactional data.

"We usually send the output of our Hadoop jobs to our Vertica data warehouse, where we replicate our production data as well, for further analysis. We use that data to feed our homemade reporting and analytics tools.

"For features on etsy.com that use data generated from Hadoop jobs, we have a custom tool that takes the output of a job and stores it on our sharded MySQL cluster where it can be accessed at scale. This year we're looking at integrating Kafka into our pipeline to move data both from our instruments to Hadoop (and to streaming analytics tools), and also to send data back from our analytics platforms to the public site."

Square’s approach

Another example from a company that has a sophisticated data pipeline is Square. We reached one of their engineering managers, Pascal-Louis Perez, who gave us a strategic view of their pipeline architecture.

Because of the importance of payments flowing through its system, Square has extended the concept of ‘reconciliation’ throughout its entire data pipeline; with each transformation data must be able to be audited and verified. The main issue with this approach, according to Pascal, is that it can be challenging to scale. For every payment received there are "roughly 10 to 15 accounting entries required and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.”

Square’s approach to solving this problem leverages stream processing, which allows them to map a corresponding data domain to a different stream. In Pascal’s words, "The streams represent the first level abstraction distancing the data-pipeline from the diversity of sources of data. The next level are operators combining one of multiple streams, and producing themselves one or multiple streams. One example operator is a "matcher" which takes two streams, extracts similarly kinded keys from those, and produces two streams separated based on matching criterias.”

Pascal notes that  the system of stream processing and stream based operators is similar to relational algebra and its operators, but in this case it’s in real-time and on infinite relations.

It’s pretty obvious that cramming data into Hadoop won’t give you these capabilities!

Interested in getting better with data?


Hakka Labs has launched Machine Learning for Engineers, a new 3-day training course that teaches engineers the theory and practice behind common machine learning algorithms. Get more info here.

Continue