Pete Soderling

We're excited to announce the Call for Papers for our next DataEngConf - to be held in NYC, late October 2016.

Talks fit into 3 categories - data engineering, data science and data analytics. We made it super-easy to apply, so submit your ideas here!

We'll be selecting two kinds of speakers for the event: engineers from top companies that are building fascinating systems to process huge amounts of data, and members of the Hakka Labs community whose submitted talks stand out.

Don't delay - CFP ends Aug 15th, 2016.

Neville Li

Learn about Scio, a Scala API for Google Cloud Dataflow (incubated as Apache Beam). Apache Beam offers a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level APIs many data engineers are already familiar with, e.g. Spark and Scalding. Neville will cover the design and implementation of the framework, including features like type-safe BigQuery macros, a REPL, and serialization. There will also be a live coding demo.
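
If you haven't seen Scio before, here's a minimal sketch of what a pipeline looks like (a word count with hypothetical input/output arguments, not code from Neville's talk) to show the Spark/Scalding-like feel of the API:

```scala
import com.spotify.scio._

// Minimal Scio word count -- a sketch, not code from the talk.
// --input and --output are hypothetical command-line flags.
object WordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))
      .flatMap(_.split("""\W+""").filter(_.nonEmpty))   // tokenize each line
      .countByValue                                      // (word, count) pairs
      .map { case (word, count) => s"$word: $count" }
      .saveAsTextFile(args("output"))

    sc.close()                                           // run the pipeline
  }
}
```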

Neville is a software engineer at Spotify who works mainly on data infrastructure and tools for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and old school distributed systems like MPI.

This talk was given at the NYC Data Engineering meetup in June 2016.

Reuven Lax

Reuven will cover the Beam programming model, and the advantages of hosted Google Cloud Dataflow.

Reuven has been a Google engineer since 2006. In that time, he's been instrumental in building Google's streaming data-processing systems, from MillWheel to Cloud Dataflow.

This talk was given at the NYC Data Engineering meetup in June 2016.

Sadayuki Furuhashi

In production environments, it usually takes several applications and team members working together to move data from one place to another. This problem can surface in companies of any size but is especially acute at scale, because the data being collected can come from many sources, and likely in many formats, which adds obvious complexity. Even when data is collected correctly, moving it at scale presents other challenges that need proper handling: duplicates, multiple destinations, exceptions and more.
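
As one hedged illustration of the duplicates problem (a generic sketch, not tied to any specific tool Sadayuki discusses): if every event carries a unique ID, a consumer can make delivery idempotent by remembering which IDs it has already written.

```scala
// A toy, in-memory sketch of idempotent ingestion keyed on a unique event ID.
// In a real pipeline the "seen" set would live in a store sized for the
// retention window (or dedup would happen at the destination); this only
// illustrates the idea.
case class Event(id: String, payload: String)

class Deduplicator {
  private val seen = scala.collection.mutable.Set.empty[String]

  /** Returns true if the event was newly written, false if it was a duplicate. */
  def ingest(event: Event)(write: Event => Unit): Boolean = {
    if (seen.contains(event.id)) false
    else {
      write(event)       // deliver to the destination once
      seen += event.id
      true
    }
  }
}
```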

Calvin French-Owen

Data is critical to building great apps. Engineers and analysts can understand how customers interact with their brand at any time of the day, from any place they go, from any device they're using - and use that information to build a product they love. But there are countless ways to track, manage, transform, and analyze that data. And when companies are also trying to understand experiences across devices and the effect of mobile marketing campaigns, data engineering can be even trickier. What’s the right way to use data to help customers better engage with your app?

In this all-star panel hear from mobile experts at Instacart, Branch Metrics, Pandora, Invoice2Go, Gametime and Segment on the best practices they use for tracking mobile data and powering their analytics.

Che Horder is the Director of Analytics at Instacart, and previously led a data science and engineering team at Netflix as Director of Marketing Analytics.

Gautam Joshi is the Engineering Program Manager of Analytics at Pandora and formerly worked at CNET/CBSi and Rdio. He helped create sustainable solutions for deriving meaning from large datasets. He’s a huge fan of music and technology, a California native and a proud Aggie.

Mada Seghete is the co-founder of Branch Metrics, a powerful tool that helps mobile app developers use data to grow and optimize their apps.

Beth Jubera is Senior Software Engineer at Invoice2Go, and was previously a Systems Engineer at IBM.

John Hession is VP of Growth at Gametime, and was previously Director of Mobile Operations and Client Strategy at Conversant.

Joey Echeverria

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.
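
To make that concrete, here's a hedged sketch of the kind of hand-written business logic Joey is describing, using Spark Streaming (not code from the talk; the log format, source and regex are invented): filter raw lines, pull fields out with a regular expression, and convert them into a simple record.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical raw log line: "2016-06-01T12:00:00 INFO user=42 action=click"
case class UserAction(timestamp: String, user: String, action: String)

object ExtractActions {
  // The filtering / regex-extraction / schema-conversion steps that
  // frameworks leave to application code.
  val Pattern = """(\S+) INFO user=(\d+) action=(\w+)""".r

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("extract-actions"), Seconds(10))

    ssc.socketTextStream("localhost", 9999)          // hypothetical source
      .filter(_.contains("INFO"))                    // drop everything else
      .flatMap {
        case Pattern(ts, user, action) => Some(UserAction(ts, user, action))
        case _                         => None       // lines that don't parse
      }
      .print()                                       // stand-in for a real sink

    ssc.start()
    ssc.awaitTermination()
  }
}
```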

Josh Wills

As a long-time practitioner in the data field (roles at Google, Cloudera and others), Josh Wills, currently Director of Data Science at Slack, explains some of the real-world motivations and tensions between data science and engineering teams.

In his own humorous way, Josh brings up some controversial ideas in this talk (ETL in JavaScript?!), which spurred some highly interesting Q&A from the audience as well as prolonged attendee discussions throughout the event.

(Sorry that the picture is a bit dark, we were playing w/ the lights - but the audio is good!)

This talk was a keynote recorded at our DataEngConf event in San Francisco.

Silviu Calinoiu

This talk shows how to build an ETL pipeline using Google Cloud Dataflow/Apache Beam that ingests textual data into a BigQuery table. Google engineer Silviu Calinoiu gives a live coding demo and discusses concepts as he codes. You don't need any previous background with big data frameworks, although people familiar with Spark or Flink will see some similar concepts. Because of the way the framework operates, the same code scales easily from GB-sized files to TB-sized files.
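
Silviu's demo uses the Dataflow/Beam SDK directly; purely as a hedged sketch of the same shape of pipeline, here's what a text-to-BigQuery ETL can look like with Scio's type-safe BigQuery support (the input path, table spec and schema are hypothetical):

```scala
import com.spotify.scio._
import com.spotify.scio.bigquery._

object TextToBigQuery {
  // Hypothetical output schema; @BigQueryType.toTable derives the BigQuery
  // table schema from the case class.
  @BigQueryType.toTable
  case class WordCount(word: String, count: Long)

  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args("input"))                        // e.g. gs://bucket/lines.txt
      .flatMap(_.split("""\W+""").filter(_.nonEmpty))
      .countByValue
      .map { case (w, c) => WordCount(w, c) }
      .saveAsTypedBigQuery(args("output"))            // e.g. project:dataset.word_counts

    sc.close()
  }
}
```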

Asim Jalis

Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.
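
One common way to bridge the two, sketched here under assumptions (and not necessarily the approach Asim presents), is to bulk-read an HBase table into Spark via TableInputFormat and write the flattened rows to HDFS; the table name, column family and paths are hypothetical:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-to-hdfs"))

    // Point the Hadoop input format at the (hypothetical) "events" table.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events")

    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    rows.map { case (_, result) =>
        val rowKey  = Bytes.toString(result.getRow)
        val payload = Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload")))
        s"$rowKey\t$payload"                       // flatten to a TSV line
      }
      .saveAsTextFile("hdfs:///warehouse/events")  // hypothetical HDFS path

    sc.stop()
  }
}
```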

Hakka Labs

Check out the first 20 minutes of our previous Practical Machine Learning training taught by Juan M. Huerta, Senior Data Scientist at PlaceIQ.

Join us for our next 3-day training! November 10th-12th. This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. The course will be taught by Rachit Srivastava (Senior Data Scientist, PlaceIQ) and supervised by Juan.

By the end of this training, you will be able to:


  • Apply common classification methods for supervised learning when given a data set

  • Apply algorithms for unsupervised learning problems

  • Select/reduce features for both supervised and unsupervised learning problems

  • Optimize code for common machine learning tasks by correcting inefficiencies and using advanced data structures

  • Choose basic tools and criteria to perform predictive analysis


We screen applicants for engineering ability and drive, so you'll be in a room full of passionate devs who ask the right questions. Applicants should have 3+ years of coding experience, knowledge of Python, and previous exposure to linear algebra concepts.

You can apply for a seat on our course page.

Reynold Xin

Mining Big Data can be an incredibly frustrating experience due to its inherent complexity and a lack of tools. Reynold Xin and Aaron Davidson are Committers and PMC Members for Apache Spark and use the framework to mine big data at Databricks. In this presentation and interactive demo, you'll learn about data mining workflows, the architecture and benefits of Spark, as well as practical use cases for the framework.
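
For a flavor of those workflows, here's a generic sketch of the kind of interactive exploration the demo is about; it runs in the Spark shell, and the log path and format are invented:

```scala
// Runs as-is in spark-shell, where `sc` is provided.
// Hypothetical input: one request log line per row, e.g. "GET /checkout 503".
val logs = sc.textFile("hdfs:///logs/requests/*.log")

val errorsByPath = logs
  .map(_.split(" "))
  .filter(fields => fields.length == 3 && fields(2).startsWith("5"))  // 5xx only
  .map(fields => (fields(1), 1))                                      // key by path
  .reduceByKey(_ + _)                                                 // count per path
  .cache()                                                            // keep in memory for re-querying

errorsByPath.top(10)(Ordering.by(_._2)).foreach(println)
```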

Debora Donato

StumbleUpon indexes over 100 million web pages for serendipitous retrieval for over 25 million registered users. Debora Donato (Principal Data Scientist, StumbleUpon) walks through StumbleUpon's big data architecture, data pipelines, mobile optimization efforts, and data mining projects.

Evan Chan

Evan Chan (Software Engineer, Ooyala) describes his experience using the Spark and Shark frameworks for running real-time queries on top of Cassandra data. He starts by surveying the Cassandra analytics landscape, including Hadoop and Hive, and touches on the use of custom input formats to extract data from Cassandra. Then, he dives into Spark and Shark (two memory-based cluster computing frameworks) and explains how they enable often dramatic improvements in query speed and productivity.
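
For readers trying this today, the DataStax Spark Cassandra Connector is a simpler route into Spark than the custom input formats Evan describes; here's a hedged sketch (keyspace, table and columns are hypothetical):

```scala
import com.datastax.spark.connector._   // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}

object VideoEventCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("video-event-counts")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read a (hypothetical) events table and count plays per video.
    val playsPerVideo = sc.cassandraTable("analytics", "video_events")
      .select("video_id", "event_type")
      .filter(row => row.getString("event_type") == "play")
      .map(row => (row.getString("video_id"), 1L))
      .reduceByKey(_ + _)

    playsPerVideo.take(20).foreach(println)
    sc.stop()
  }
}
```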

Michael Kjellman

So far, I've explained why you shouldn't migrate to C*, as well as its origins and key terms. Now I'm going to turn my attention to how Cassandra stores data.

Cassandra nodes, clusters, rings


At a very high level, Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each C* node in the cluster is responsible for and assigned a token range (essentially a range of hashes defined by a partitioner, which defaults to Murmur3Partitioner in C* v1.2+). With Murmur3Partitioner the possible token values span -2^63 to 2^63-1; the older RandomPartitioner used a range of 0 to 2^127-1.
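
For intuition, here's a small sketch of how evenly spaced initial tokens for an N-node ring fall across that range; it's purely illustrative arithmetic, not Cassandra code:

```scala
// Evenly spaced initial tokens across the Murmur3Partitioner range
// [-2^63, 2^63 - 1]. Purely illustrative arithmetic.
object TokenRanges {
  val RangeMin: BigInt  = -(BigInt(2).pow(63))   // -2^63
  val RangeSize: BigInt = BigInt(2).pow(64)      // total number of possible tokens

  /** Initial token for each of `nodeCount` nodes, spaced evenly around the ring. */
  def initialTokens(nodeCount: Int): Seq[BigInt] =
    (0 until nodeCount).map(i => RangeMin + RangeSize * i / nodeCount)

  def main(args: Array[String]): Unit =
    // For a 4-node ring this prints -2^63, -2^62, 0, 2^62.
    initialTokens(4).foreach(println)
}
```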

Neville Li

This is the first time that a Spotify engineer has spoken publicly about their deployment and use cases for Storm! In this talk, Software Engineer Neville Li describes:


  • Real-time features developed using Storm and Kafka including recommendations, social features, data visualization and ad targeting

  • Architecture

  • Production integration

  • Best practices for deployment


Pete Soderling

In response to a recent post from MongoHQ entitled “You don’t have big data,” I would generally agree with many of the author’s points.

However, regardless of whether you call it big data, small data, hot data or cold data - we are all in a position to admit that *more* data is here to stay - and that’s due to many different factors.

Perhaps primarily, as the article mentions, this is due to the decreasing cost of storage over time. Other factors include access to open APIs, the sheer volume of ever-increasing consumer activity online, as well as a plethora of other incentives that are developing (mostly) behind the scenes as companies “share” data with each other. (You know they do this, right?)

But one of the most important things I’ve learned over the past couple of years is that it’s crucial for forward thinking companies to start to design more robust data pipelines in order to collect, aggregate and process their ever-increasing volumes of data. The main reason for this is to be able to tee up the data in a consistent way for the seemingly-magical quant-like operations that infer relationships between the data that would have otherwise surely gone unnoticed - ingeniously described in the referenced article as correctly “determining the nature of needles from a needle-stack.”

But this raises the question - what are the characteristics of a well-designed data pipeline? Can’t you just throw all your data in Hadoop and call it a day?

As many engineers are discovering - the answer is a resounding "no!" We've rounded up four examples from smart engineers at Stripe, Tapad, Etsy & Square that show aspects of some real-world data pipelines you'll actually see in the wild.

How does Stripe do it?

We spoke to Avi Bryant at Stripe who gave us a nice description of the way Stripe has approached the building of their data pipeline.

"Stripe feeds data to HDFS from various sources, many of them unstructured or semi-structured -  server logs, for example, or JSON and BSON documents. In every case, the first step is to translate these into a structured format. We've standardized on using Thrift to define the logical structure, and Parquet as the on-disk storage format.

"We chose Parquet because it's an efficient columnar format which is natively understood by Cloudera's Impala query engine, which gives us very fast relational access to our data for ad-hoc reporting. The combination of Parquet and Thrift can also be used efficiently and idiomatically from Twitter's Scalding framework, which is our tool of choice for complex batch processing.

"The next stage is "denormalization": to keep our analytical jobs and queries fast, we do the most common joins ahead of time, in Scalding, writing to new set of Thrift schemas. At the same time, we do a bunch of enhancing and annotating of the data: for example, geocoding IP addresses, parsing user agent strings, or cleaning up missing values.

"In many cases, this results in schemas with nested structure, which works well from Scalding and which Parquet is happy to store, but which Impala cannot currently query. We've developed a simple tool which will transform arbitrarily nested Parquet data into an equivalent flattened schema, and where necessary we use this to maintain a parallel copy of each data source for use from Impala. We're looking forward to future versions of Impala which might remove this extra step."

Tapad’s Data Pipeline

Tapad is an ad-tech business in NYC that’s experienced lots of growth in both traffic and data over the past several years. So I reached out to their CTO, Dag Liodden, to find out how they’ve built their data pipeline, and some of the strategies and tools they use. In Dag’s own words, here’s how they do it:

"- All ingested data flows through a message queue in a pub-sub fashion (we use Kafka and push multiple TB of data through it every hour)

- All data is encoded with a consistent denormalized schema that supports schema evolution (we use Avro and Protocol Buffers)

- Most of our data stores are updated in real-time from processes consuming the message queues (hot data is pushed to Aerospike and Cassandra, real-time queryable data to Vertica and the raw events, often enriched with data from our Aerospike cluster, is stored in HDFS)

- Advanced analytics and data science computation is typically executed on the denormalized data in HDFS

- The real-time updates can always be reproduced through offline batch jobs over the HDFS stored data. We strive to make our computation logic so that it can be run in-stream *and* in batch MR-mode without any modification”

He notes that the last point allows them to retroactively change their streaming computation at-will and then backfill the other data stores with updated projections.
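
For readers who haven't used Kafka, the pub-sub ingestion step Dag describes looks roughly like this minimal sketch (the broker address, topic and string serialization are placeholders; Tapad encodes payloads with Avro or Protocol Buffers rather than plain strings):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventPublisher {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // hypothetical brokers
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // In a real pipeline the value would be an Avro/Protobuf-encoded record.
    producer.send(new ProducerRecord[String, String](
      "events", "user-42", """{"action":"click"}"""))

    producer.flush()
    producer.close()
  }
}
```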

Dag also explains the “why” behind their use of multiple types of data technologies on the storage side and explains that each of them has its own particular “sweet-spot” which makes it attractive to them:

"- Kafka: High-throughput parallel pub-sub, but relaxed delivery and latency guarantees, limited data retention and no querying capabilities.

- Aerospike: Extremely fast random access read/write performance, by key (we have 3.2 billion keys and 4TB of replicated data), cross data center replication, high availability but very limited querying capabilities

- Cassandra: Medium random access read/write performance, atomic counters and a data model that lends it well to time-series storage. Flexible consistency model and cross data center replication.

- HDFS: High throughput and cheap storage.

- Vertica: Fast and powerful ad-hoc querying capabilities for interactive analysis, high availability, but no support for nested data structure, multi-valued attributes. Storage based pricing makes us limit the amount of data we put here."

How Etsy handles data

For another example, we reached Rafe Colburn, the engineering manager of Etsy's data team and asked how they handled their pipeline. Here's the scoop from Rafe:

"Etsy's analytics pipeline isn't especially linear. It starts with our instrumentation, which consists of an event logger that runs in the browser and another that can be called from the back end. Both ping some internal "beacon" servers.

"We actually use the good old logrotate program to ship the generated Apache access logs to HDFS when they reach a certain size, and process them using Hadoop. We also snapshot our production data (which resides in MySQL) nightly and copy it to HDFS as well, so that we can join our clickstream data to our transactional data.

"We usually send the output of our Hadoop jobs to our Vertica data warehouse, where we replicate our production data as well, for further analysis. We use that data to feed our homemade reporting and analytics tools.

"For features on etsy.com that use data generated from Hadoop jobs, we have a custom tool that takes the output of a job and stores it on our sharded MySQL cluster where it can be accessed at scale. This year we're looking at integrating Kafka into our pipeline to move data both from our instruments to Hadoop (and to streaming analytics tools), and also to send data back from our analytics platforms to the public site."

Square’s approach

Another example from a company that has a sophisticated data pipeline is Square. We reached one of their engineering managers, Pascal-Louis Perez, who gave us a strategic view of their pipeline architecture.

Because of the importance of payments flowing through its system, Square has extended the concept of ‘reconciliation’ throughout its entire data pipeline; with each transformation data must be able to be audited and verified. The main issue with this approach, according to Pascal, is that it can be challenging to scale. For every payment received there are "roughly 10 to 15 accounting entries required and the reconciliation system must therefore scale at one order of magnitude above that of processing, which already is very large.”

Square’s approach to solving this problem leverages stream processing, which allows them to map a corresponding data domain to a different stream. In Pascal’s words, "The streams represent the first level abstraction distancing the data-pipeline from the diversity of sources of data. The next level are operators combining one of multiple streams, and producing themselves one or multiple streams. One example operator is a "matcher" which takes two streams, extracts similarly kinded keys from those, and produces two streams separated based on matching criterias.”

Pascal notes that the system of stream processing and stream-based operators is similar to relational algebra and its operators, but in this case it operates in real time and on infinite relations.
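
To give a flavor of that "matcher" operator, here's a toy in-memory sketch (not Square's implementation; real matchers operate on unbounded streams rather than lists):

```scala
// Toy sketch of a "matcher": given two streams of records and a way to extract
// a comparable key from each, emit matched pairs plus the leftovers that still
// need reconciling. In-memory Lists stand in for real, unbounded streams.
object Matcher {
  def matchStreams[A, B, K](left: List[A], right: List[B])
                           (leftKey: A => K, rightKey: B => K): (List[(A, B)], List[Either[A, B]]) = {
    val rightByKey = right.groupBy(rightKey)

    val matched = for {
      a <- left
      b <- rightByKey.getOrElse(leftKey(a), Nil)
    } yield (a, b)

    val matchedLeftKeys  = matched.map { case (a, _) => leftKey(a) }.toSet
    val matchedRightKeys = matched.map { case (_, b) => rightKey(b) }.toSet
    val unmatched =
      left.filterNot(a => matchedLeftKeys(leftKey(a))).map(a => Left(a): Either[A, B]) ++
      right.filterNot(b => matchedRightKeys(rightKey(b))).map(b => Right(b): Either[A, B])

    (matched, unmatched)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical payments and accounting entries keyed by payment id.
    val payments = List(("p1", 100), ("p2", 250))
    val entries  = List(("p1", "ledger-row-1"), ("p3", "ledger-row-9"))
    val (matched, unmatched) = matchStreams(payments, entries)(_._1, _._1)
    println(matched)    // List(((p1,100),(p1,ledger-row-1)))
    println(unmatched)  // List(Left((p2,250)), Right((p3,ledger-row-9)))
  }
}
```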

It’s pretty obvious that cramming data into Hadoop won’t give you these capabilities!

Interested in getting better with data?


Hakka Labs has launched Machine Learning for Engineers, a new 3-day training course that teaches engineers the theory and practice behind common machine learning algorithms. Get more info here.
