Fabrizio Milo

TensorFlow is one of the fastest-growing open source deep learning frameworks available today. It was developed internally by Google and released as open source in November 2015.

Although it is mainly known for modeling deep learning architectures, TensorFlow's flexible interface makes it a good candidate for production-level data science pipelines as well.

In this talk, you will learn about the fundamentals of distributing TensorFlow models over multiple computers.

Fabrizio Milo is a deep learning architect and early TensorFlow contributor @

Neville Li

Learn about Scio, a Scala API for Google Cloud Dataflow (incubated as Apache Beam). Apache Beam offers a simple, unified programming model for both batch and streaming data processing, while Scio brings it much closer to the high-level APIs many data engineers are familiar with, such as Spark and Scalding. Neville will cover the design and implementation of the framework, including features like type-safe BigQuery macros, a REPL, and serialization. There will also be a live coding demo.

Neville is a software engineer at Spotify who works mainly on data infrastructure and tools for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and on old-school distributed systems like MPI.

This talk was given at the NYC Data Engineering meetup in June 2016.

Reuven Lax

Reuven will cover the Beam programming model, and the advantages of hosted Google Cloud Dataflow.

Reuven has been a Google engineer since 2006. In that time, he's been instrumental in building Google's streaming data-processing systems, from MillWheel to Cloud Dataflow.

This talk was given at the NYC Data Engineering meetup in June 2016.

Sadayuki Furuhashi

In production environments, it usually takes several applications and team members working together to move data from one place to another. This problem can surface in companies of any size but is especially acute at scale, because the data being collected can come from different sources and likely in different formats, which adds complexity. Even if data is collected correctly, moving it at scale presents other challenges that need proper handling: duplicates, multiple destinations, exceptions and more.

In this presentation, Sadayuki will dissect the challenges described and share his experience developing two open source solutions to address these problems: Fluentd and Embulk.

Sadayuki Furuhashi is an open-source hacker who wrote the original code of the MessagePack, Fluentd and Embulk projects. He is also a founder and architect of Treasure Data, Inc. and works on distributed storage and query engines.

This talk was given at SF DataEngConf in April 2016.

Calvin French-Owen

Data is critical to building great apps. Engineers and analysts can understand how customers interact with their brand at any time of the day, from any place they go, from any device they're using - and use that information to build a product they love. But there are countless ways to track, manage, transform, and analyze that data. And when companies are also trying to understand experiences across devices and the effect of mobile marketing campaigns, data engineering can be even trickier. What’s the right way to use data to help customers better engage with your app?

In this all-star panel, hear from mobile experts at Instacart, Branch Metrics, Pandora, Invoice2Go, Gametime and Segment on the best practices they use for tracking mobile data and powering their analytics.


Joey Echeverria

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.

Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single raw stream and transform it using the same rules before storing the results in different repositories, such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
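To make the idea of configuration-based transformation concrete, here is a minimal Python sketch. It is not the library presented in the talk: the rule format, the `RULES` list, and the `apply_rules` helper are all hypothetical names invented for illustration. It only shows how filtering, regex extraction, and field renaming could be declared as data and applied to every record of a raw stream.

```python
import re

# Hypothetical rule set; in practice this might be loaded from a config file.
RULES = [
    {"op": "filter", "field": "level", "equals": "ERROR"},
    {"op": "extract", "field": "message", "pattern": r"user=(?P<user>\w+)"},
    {"op": "rename", "from": "ts", "to": "timestamp"},
]


def apply_rules(record, rules):
    """Return a transformed copy of record, or None if a filter drops it."""
    out = dict(record)
    for rule in rules:
        op = rule["op"]
        if op == "filter":
            # Drop records whose field does not match the expected value.
            if out.get(rule["field"]) != rule["equals"]:
                return None
        elif op == "extract":
            # Copy named regex groups into the record as new fields.
            match = re.search(rule["pattern"], str(out.get(rule["field"], "")))
            if match:
                out.update(match.groupdict())
        elif op == "rename":
            if rule["from"] in out:
                out[rule["to"]] = out.pop(rule["from"])
        else:
            raise ValueError(f"unknown op: {op}")
    return out


# Illustrative input; the same rules could feed several sinks (search, deep storage).
raw_stream = [
    {"ts": 1, "level": "ERROR", "message": "login failed user=alice"},
    {"ts": 2, "level": "INFO", "message": "heartbeat"},
]
transformed = [
    r for r in (apply_rules(rec, RULES) for rec in raw_stream) if r is not None
]
```

Because the rules are plain data, the same list can be reused unchanged by each consumer of the stream, which is the reuse the abstract describes.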


Calvin French-Owen

Segment’s API has scaled significantly over the past three years and has grown from processing a trickle of events to tens of thousands per second. Today, Segment processes tens of billions of events each month and sends them to hundreds of partner APIs.

It can be a very hostile environment. Partners fail frequently, customers send highly variable data and instances regularly die. As a result, Segment has invested heavily in tools for monitoring, failover and fairness when it comes to routing events through its system.

In this talk, CTO Calvin French-Owen will discuss how Segment continues to maintain a high quality of service, how its infrastructure has evolved over time and where it's heading in the future.

This talk is from DataEngConf SF in April 2016.

Karthik Ramasamy

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. Twitter open-sourced Heron this year.

In this talk, you will learn about the operating experiences and challenges of running Heron at scale and the approaches that the team at Twitter took to solve those challenges.


Krishna Gade

At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day.

Achieving scalable, reliable, low-latency logging poses several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services, where the virtual instances are less reliable than on-premises hardware; (3) moving tens of terabytes of data per day from Kafka to cloud storage reliably and efficiently, and guaranteeing exactly-once persistence per message.
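As an illustration of the third challenge, the sketch below shows one common way to get exactly-once persistence on top of at-least-once delivery: make each write idempotent by deriving the storage key from the message's position in the log. This is an assumption for illustration only, not a description of Pinterest's pipeline; `IdempotentSink` and its in-memory dictionary are hypothetical stand-ins for a real cloud storage client.

```python
class IdempotentSink:
    """Hypothetical sink: a dict stands in for cloud storage (key -> payload)."""

    def __init__(self):
        self.objects = {}

    def persist(self, topic, partition, offset, payload):
        """Store payload once per (topic, partition, offset); return True if written."""
        # A deterministic key means a redelivered message maps to the same object.
        key = f"{topic}/{partition}/{offset:020d}"
        if key in self.objects:
            return False  # duplicate delivery, already persisted
        self.objects[key] = payload
        return True


sink = IdempotentSink()
# The last delivery repeats offset 1, as can happen after a consumer restart.
deliveries = [("logs", 0, 0, b"a"), ("logs", 0, 1, b"b"), ("logs", 0, 1, b"b")]
results = [sink.persist(*d) for d in deliveries]
```

With this design the consumer can safely retry after a failure, since replaying a message never creates a second copy.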

In this talk, Krishna Gade (Head of Data Engineering) and Yu Yang (Data Engineer) will present Pinterest’s logging pipeline and share their experience addressing these challenges. They dive deep into three components they developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. They also share their experience in operating Kafka at scale in the cloud.

This talk was recorded at our DataEngConf event in San Francisco.

Josh Wills

A long-time practitioner in the data field (with roles at Google, Cloudera and others), Josh Wills, currently Director of Data Science at Slack, explains some of the real-world motivations and tensions between data science and engineering teams.

In his own humorous way, Josh brings up some controversial ideas in this talk (ETL in JavaScript?!), which spurred some highly interesting Q&A from the audience as well as prolonged attendee discussions throughout the event.

(Sorry that the picture is a bit dark, we were playing w/ the lights - but the audio is good!)

This talk was a keynote recorded at our DataEngConf event in San Francisco.
