Joey Echeverria

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.

Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This lets users write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single raw stream and transform it using the same rules before storing the results in different repositories, such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
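The library itself is covered in the talk; as an illustrative sketch of what configuration-driven transformation can look like (the rule format and field names here are hypothetical, not Echeverria's actual API), a handful of declarative rules can replace per-pipeline business logic:

    import re

    # Hypothetical rule format: each rule is pure configuration, so the same
    # list can be loaded by every pipeline that needs identical transformations.
    RULES = [
        {"type": "filter", "field": "status", "equals": "200"},
        {"type": "extract", "field": "path", "pattern": r"^/api/(?P<endpoint>\w+)"},
        {"type": "rename", "from": "ts", "to": "timestamp"},
    ]

    def transform(record, rules=RULES):
        """Apply configured rules to one record; return None to drop it."""
        for rule in rules:
            if rule["type"] == "filter":
                if record.get(rule["field"]) != rule["equals"]:
                    return None
            elif rule["type"] == "extract":
                match = re.search(rule["pattern"], record.get(rule["field"], ""))
                if match:
                    record.update(match.groupdict())
            elif rule["type"] == "rename":
                record[rule["to"]] = record.pop(rule["from"], None)
        return record

The same transform can then be mapped over the raw stream once per sink, so the copies landing in Solr and HDFS are guaranteed to agree.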


Calvin French-Owen

Segment’s API has scaled significantly over the past three years, growing from a trickle of events to tens of thousands per second. Today, Segment processes tens of billions of events each month and sends them to hundreds of partner APIs.

It can be a very hostile environment: partners fail frequently, customers send highly variable data, and instances regularly die. As a result, Segment has invested heavily in tools for monitoring, failover, and fairness when routing events through its system.
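Segment's internal tooling isn't shown here, but as a rough sketch of what fairness and failover can mean in an event router (the queueing scheme and names below are illustrative assumptions, not Segment's design), consider per-partner queues drained round-robin with bounded retries:

    import collections
    import time

    # One queue per partner, so a slow or failing partner can't starve
    # the others (fairness).
    queues = collections.defaultdict(collections.deque)

    def enqueue(partner, event):
        queues[partner].append((event, 0))  # (event, attempt count)

    def deliver(partner, event):
        """Stand-in for the real HTTP call to a partner API."""
        pass  # in real life: POST the event, raise on failure

    def drain(max_attempts=5):
        # Visit partners round-robin, retrying failures with exponential
        # backoff and dropping events after max_attempts (failover).
        while any(queues.values()):
            for partner in list(queues):
                if not queues[partner]:
                    continue
                event, attempts = queues[partner].popleft()
                try:
                    deliver(partner, event)
                except Exception:
                    if attempts + 1 < max_attempts:
                        time.sleep(2 ** attempts)  # a real system would schedule this
                        queues[partner].append((event, attempts + 1))

A production router would persist the queues and schedule retries rather than sleeping, but the isolation-per-partner idea is the core of it.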

In this talk, CTO Calvin French-Owen will discuss how Segment continues to maintain a high quality of service, how its infrastructure has evolved over time, and where it's heading in the future.

This talk is from DataEngConf SF in April 2016.

Karthik Ramasamy

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge, so Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. Twitter open-sourced Heron this year.

In this talk, you will learn about the operating experiences and challenges of running Heron at scale and the approaches that the team at Twitter took to solve those challenges.


Krishna Gade

At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day.

To achieve scalable, reliable, low-latency logging, Pinterest must overcome several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services, where virtual instances are less reliable than on-premises hardware; and (3) moving tens of terabytes of data per day from Kafka to cloud storage reliably and efficiently, while guaranteeing exactly-once persistence per message.

In this talk, Krishna Gade (Head of Data Engineering) and Yu Yang (Data Engineer) will present Pinterest’s logging pipeline and share their experience addressing these challenges. They dive deep into three components they developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. They also share their experience in operating Kafka at scale in the cloud.
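As a toy illustration of the exactly-once idea in the Kafka-to-S3 leg (an assumption-laden sketch, not Pinterest's actual pipeline), one common approach is to derive each S3 object's key deterministically from the Kafka offsets it covers, so a retried upload overwrites the same key instead of duplicating messages:

    import boto3
    from kafka import KafkaConsumer

    # Illustrative only; for simplicity this assumes a single-partition topic.
    consumer = KafkaConsumer("events",
                             bootstrap_servers="kafka.example.com:9092",
                             enable_auto_commit=False)
    s3 = boto3.client("s3")

    batch = []
    for msg in consumer:
        batch.append(msg)
        if len(batch) >= 1000:
            first = batch[0]
            # Deterministic key: topic/partition/first-offset. A retry after a
            # crash rewrites the identical object, making the upload idempotent.
            key = "logs/%s/%d/%020d" % (first.topic, first.partition, first.offset)
            body = b"\n".join(m.value for m in batch)
            s3.put_object(Bucket="example-log-bucket", Key=key, Body=body)
            consumer.commit()  # commit offsets only after the upload succeeds
            batch = []

Committing offsets only after a successful upload means a crash can re-send a batch, but never lose one; the deterministic key is what turns that re-send into a no-op.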

This talk was recorded at our DataEngConf event in San Francisco.

Josh Wills

As a long-time practitioner in the data field (with roles at Google, Cloudera, and others), Josh Wills, currently Director of Data Science at Slack, explains some of the real-world motivations and tensions between data science and engineering teams.

In his own humorous way, Josh raises some controversial ideas in this talk (ETL in JavaScript?!), which spurred highly interesting Q&A from the audience as well as prolonged attendee discussions throughout the event.

(Sorry the picture is a bit dark; we were playing with the lights. But the audio is good!)

This talk was a keynote recorded at our DataEngConf event in San Francisco.

Silviu Calinoiu

This talk shows how to build an ETL pipeline using Google Cloud Dataflow/Apache Beam that ingests textual data into a BigQuery table. Google engineer Silviu Calinoiu gives a live coding demo and discusses concepts as he codes. You don't need any previous background with big data frameworks, although people familiar with Spark or Flink will recognize some similar concepts. Because of the way the framework operates, the same code scales easily from gigabyte-sized files to terabyte-sized files.
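The demo itself is in the video; purely as a flavor of what such a pipeline looks like in Beam's Python SDK (the bucket, project, table, and line format below are made up), the core is only a few lines:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Assumes simple "name,score" text lines; real parsing would be sturdier.
        name, score = line.split(",")
        return {"name": name, "score": int(score)}

    options = PipelineOptions()  # pass --runner=DataflowRunner etc. to run on Dataflow
    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("gs://example-bucket/input/*.txt")
         | "Parse" >> beam.Map(parse_line)
         | "Write" >> beam.io.WriteToBigQuery(
               "example-project:example_dataset.scores",
               schema="name:STRING,score:INTEGER"))

Because the pipeline is a declarative graph, the runner decides how to parallelize it, which is why the identical code handles gigabytes or terabytes.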

This talk was given as a joint event from SF Data Engineering and SF Data Science.

Pete Soderling

We just posted the final schedule for DataEngConf San Francisco, April 7-8, 2016 and added even more great talks & workshops.

We're lucky to have two talks on Google's TensorFlow platform, including one by a TensorFlow committer! You can also participate in two free workshops we'll be running throughout the event: one on SparkSQL and one on scikit-learn.

Talks from these top companies:

[Image: logos of presenting companies]


DataEngConf is the first engineering conference that tackles real-world issues with data processing architectures and covers essential concepts of data science from an engineer's perspective.

Hear real-world war stories from data engineering & data science heroes at companies like Google, Airbnb, Slack, Stripe, Netflix, Clover Health, Segment, Lyft and many more.

Full info & tickets: http://www.dataengconf.com

Peter Bakas

Peter Bakas from Netflix discusses Keystone, their new data pipeline. Hear in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second at peak!

This talk was given as a joint event of SF Data Engineering and SF Data Science; the speaker is Peter Bakas, Director of Engineering, Real-Time Data Infrastructure at Netflix.

Asim Jalis

Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.

What if you could use a single system for both use cases? This could dramatically simplify your data pipeline architecture.

Enter Apache Kudu. Kudu is a storage system that lives between HDFS and HBase. It is good both at ingesting streaming data and at analyzing it using Spark, MapReduce, and SQL.
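As a hedged sketch of that "one system for both" workflow using the kudu-python client (the master host, table name, and columns below are invented, and the table is assumed to already exist):

    import kudu

    client = kudu.connect(host="kudu-master.example.com", port=7051)
    table = client.table("events")

    # Ingest: apply streaming inserts row by row as events arrive.
    session = client.new_session()
    session.apply(table.new_insert({"id": 1, "metric": "clicks", "value": 42}))
    session.flush()

    # Analyze: scan the same table immediately; no HBase-to-HDFS copy step.
    rows = table.scanner().open().read_all_tuples()

For heavier analytics, the same table can also be queried through the kudu-spark connector or SQL engines, which is the point of having a single store serve both workloads.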

This talk was given as a joint event from SF Data Engineering and SF Data Science, and Asim Jalis (Lead Instructor, Data Engineering Immersive, Galvanize SF) is the speaker.

Pete Soderling

After the success of our last event in NYC, we decided to bring DataEngConf to San Francisco, April 7-8, 2016!


DataEngConf is the first engineering conference that tackles real-world issues with data processing architectures and covers essential concepts of data science from an engineer's perspective.

Hear real-world war stories from data engineering & data science heroes at companies like Google, Airbnb, Slack, Stripe, Netflix, Clover Health, Yammer, Lyft and many more.

Use code "site20" for 20% off regularly priced tickets through 3/24.

Full info & tickets: http://www.dataengconf.com
