Sadayuki Furuhashi on

In production environments, it usually takes several applications and team members working together to move data from one place to another. This problem can surface in companies of any size but is especially acute at scale: data is collected from different sources, likely in different formats, which adds complexity. Even when data is collected correctly, moving it at scale presents other challenges that need proper handling: duplicates, multiple destinations, exceptions and more.

In this presentation, Sadayuki will dissect the challenges described and share his experience developing two open source solutions to address these problems: Fluentd and Embulk.

Sadayuki Furuhashi is an open-source hacker who wrote the original code of the MessagePack, Fluentd and Embulk projects. He is also a founder and architect of Treasure Data, Inc. and works on distributed storage and query engines.

This talk was given at SF DataEngConf in April 2016.

Calvin French-Owen on

Data is critical to building great apps. Engineers and analysts can understand how customers interact with their brand at any time of the day, from any place they go, from any device they're using - and use that information to build a product they love. But there are countless ways to track, manage, transform, and analyze that data. And when companies are also trying to understand experiences across devices and the effect of mobile marketing campaigns, data engineering can be even trickier. What’s the right way to use data to help customers better engage with your app?

In this all-star panel, hear from mobile experts at Instacart, Branch Metrics, Pandora, Invoice2Go, Gametime and Segment on the best practices they use for tracking mobile data and powering their analytics.


Krishna Gade on

At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day.

Achieving scalable, reliable, low-latency logging poses several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services, where the virtual instances are less reliable than on-premises hardware; (3) moving tens of terabytes of data per day from Kafka to cloud storage reliably and efficiently, while guaranteeing exactly-once persistence per message.

In this talk, Krishna Gade (Head of Data Engineering) and Yu Yang (Data Engineer) present Pinterest’s logging pipeline and share their experience addressing these challenges. They dive deep into three components they developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. They also share their experience in operating Kafka at scale in the cloud.

This talk was recorded at our DataEngConf event in San Francisco.

Noel Cody on


Spotify currently runs over 100 production-level Cassandra clusters. We use Cassandra across user-facing features, in our internal monitoring and analytics stack, paired with Storm for real-time processing, you name it. With scale come questions. “If I change my consistency level from ONE to QUORUM, how much performance am I sacrificing? What about a change to my data model or I/O pattern? I want to write data in really wide columns, is that ok?”
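As a back-of-envelope starting point for the ONE versus QUORUM question, the sketch below computes how many replica acknowledgements each consistency level waits for. It assumes a single data center and a replication factor of 3; the function names are hypothetical helpers rather than part of any Cassandra driver, and the result says nothing about actual latency, which still has to be measured.

```python
def quorum(replication_factor):
    """Replicas that must respond at QUORUM: a strict majority of all replicas."""
    return replication_factor // 2 + 1


def replicas_awaited(consistency_level, replication_factor):
    """Number of replica acknowledgements the coordinator waits for per request."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return quorum(replication_factor)
    raise ValueError("this sketch only covers ONE and QUORUM")


def reads_see_latest_write(read_acks, write_acks, replication_factor):
    """Read and write replica sets are guaranteed to overlap when R + W > RF."""
    return read_acks + write_acks > replication_factor


rf = 3  # assumed replication factor, not a figure from the talk
one = replicas_awaited("ONE", rf)
q = replicas_awaited("QUORUM", rf)
print(f"ONE waits for {one} replica, QUORUM waits for {q} of {rf}")
print("QUORUM reads + QUORUM writes overlap:", reads_see_latest_write(q, q, rf))
print("ONE reads + ONE writes overlap:", reads_see_latest_write(one, one, rf))
```

Moving from ONE to QUORUM at this replication factor means each request waits on the slower of two replicas instead of the fastest one, which is why the cost depends so heavily on the cluster and workload.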

Rules of thumb lead to basic answers, but we can do better. These questions are testable, and the best answers come from pre-launch load-testing and capacity planning. Any system with a strict SLA can and should simulate production traffic in staging before launch.


Unknown author on

Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.
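For readers unfamiliar with PageRank, here is a minimal sketch of the standard power-iteration form in plain Python. It is not the variant used in the talk and does not use Hadoop or SociaLite; the toy graph and its node names are invented purely for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Standard power-iteration PageRank over a list of (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: an edge points from a payer to the party it pays.
edges = [
    ("patient_a", "provider_x"),
    ("patient_b", "provider_x"),
    ("patient_c", "provider_x"),
    ("patient_c", "provider_y"),
]
ranks = pagerank(edges)
top = max(ranks, key=ranks.get)
print(top, round(ranks[top], 3))
```

In an anomaly-detection setting, one would then look for nodes whose score is far from that of comparable nodes; how the talk defines "comparable" is not stated in the abstract.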


Ofer Mendelevitch is Director of Data Science at Hortonworks, where he is responsible for professional services involving data science with Hadoop. Prior to joining Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, where he developed an investment strategy around big data. Before XSeed, Ofer served as VP of Engineering at Nor1, and before that he was Director of Engineering at Yahoo!, where he led multiple engineering and data science teams responsible for R&D of large-scale computational advertising projects, including CTR prediction (with Hadoop), a new front-end ad-serving system and sales tools.

Unknown author on

In this talk, Adam Gibson presents the ND4J framework with an iScala notebook.
Combined with Spark's dataframes, this is making real data science viable in Scala. ND4J is "Numpy for Java." It works with multiple architectures (or backends) that allow for run-time-neutral scientific computing as well as chip-specific optimizations, all while writing the same code. Algorithm developers and scientific engineers can write code for a Spark, Hadoop, or Flink cluster while keeping the underlying computations platform-agnostic. A modern runtime for the JVM with the capability to work with GPUs lets engineers leverage the best parts of the production ecosystem without having to pick which scientific library to use.

This video was recorded at the SF Bay Area Machine Learning meetup.

Unknown author on

Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, Xiangrui shares his experience in developing MLlib. The talk covers both the higher-level ML pipeline APIs that make MLlib easy to use and the lower-level optimizations that make MLlib scale to massive datasets.

ML workflows often involve a sequence of processing and learning stages. Realistic workflows are often even more complex, including cross-validation to choose parameters and combining multiple data sources. Inspired by scikit-learn, we proposed simple APIs to help users quickly assemble and tune ML pipelines. Under the hood, the pipeline API integrates seamlessly with Spark SQL’s DataFrames and uses their data sources, flexible column operations, rich data types, and execution plan optimization to create efficient and scalable implementations.
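The idea of chaining stages that each expose fit and transform can be sketched in a few lines of plain Python. This is only an illustration of the pattern: the class names below are hypothetical, the data is made up, and none of this is the MLlib or scikit-learn API.

```python
class MeanCenter:
    """Stage with learned state: fit() stores the mean, transform() subtracts it."""

    def fit(self, rows):
        self.mean = sum(rows) / len(rows)
        return self

    def transform(self, rows):
        return [x - self.mean for x in rows]


class SignLabel:
    """Stateless stage: labels each centered value 1 if positive, else 0."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return [1 if x > 0 else 0 for x in rows]


class Pipeline:
    """Fits stages in order, feeding each stage the output of the previous one."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


# Made-up training values; their mean is 3.0.
model = Pipeline([MeanCenter(), SignLabel()]).fit([1.0, 2.0, 3.0, 6.0])
predictions = model.transform([2.0, 5.0])
print(predictions)
```

Because the whole chain is one object, tuning steps such as cross-validation can refit every stage together instead of reusing state learned on the full dataset.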


Many factors affect a parallel implementation of an ML algorithm, e.g., the optimization algorithm, platform limitations, communication pattern, data locality, numerical stability and performance, and fault tolerance. Different implementations of the same ML algorithm can perform dramatically differently. Xiangrui shares lessons learned from optimizing the alternating least squares (ALS) implementation in MLlib.

This talk was recorded at the NYC ML Meetup at Pivotal Labs in NYC.

Paul Dix on

Over the past 4 months, Paul Dix and his team completely rewrote InfluxDB: from Go to Go. In this talk, he gives a quick overview of InfluxDB and shows how it's useful for metrics, analytics, and sensor data.

Paul also dives into the history of the project and why they chose to rewrite their previous Go implementation into the implementation they have now. He shows pain points with their legacy codebase and gives examples of how rewriting the code from scratch gave them the ability to do things they couldn't have done otherwise.

Paul closes out with some comparisons on usability, readability, and performance of the previous version against the new rewritten version.


This video was recorded at the GoSF meetup at Chain in SF.
