Daniel Blazevski on

Dan Blazevski from Insight Data Science presents some recent progress on Apache Flink's machine learning library, focusing on a new implementation of the k-nearest neighbors (knn) algorithm for Flink.

In the spirit of the Kappa Architecture, Apache Flink is a distributed batch and stream processing tool that treats batch as a special case of stream processing. Dan discusses a few ways, both exact and approximate, to do distributed knn queries, focusing on using quadtrees to spatially partition the training set and using z-value based hashing to reduce dimensionality.
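
As a rough illustration of the z-value idea mentioned above, here is a minimal sketch in plain Python, not Flink code and not the implementation from the talk. It assumes 2-D points in a known bounding range, and the function names (`interleave_bits`, `z_value`, `approx_knn`) are hypothetical. Each point is mapped to a single integer by interleaving the bits of its grid coordinates, so points that are close in space tend to be close in the sorted one-dimensional order; candidates near the query's z-value are then ranked by exact distance.

```python
import bisect
import math


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits land on even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits land on odd positions
    return z


def z_value(point, lo=0.0, hi=1.0, bits=16):
    """Quantize a 2-D point in [lo, hi] x [lo, hi] and return its z-value."""
    scale = (1 << bits) - 1
    gx = int((point[0] - lo) / (hi - lo) * scale)
    gy = int((point[1] - lo) / (hi - lo) * scale)
    return interleave_bits(gx, gy, bits)


def approx_knn(train, query, k, lo=0.0, hi=1.0):
    """Approximate kNN: look only at points whose z-values are near the query's."""
    keyed = sorted((z_value(p, lo, hi), p) for p in train)
    keys = [z for z, _ in keyed]
    pos = bisect.bisect_left(keys, z_value(query, lo, hi))
    # Assumption: a window of 2k neighbors on each side in z-order is enough.
    candidates = [p for _, p in keyed[max(0, pos - 2 * k): pos + 2 * k]]
    # Rank the small candidate set by exact Euclidean distance.
    return sorted(candidates, key=lambda p: math.dist(p, query))[:k]
```

The result is approximate because two points can be neighbors in space yet far apart in z-order; widening the candidate window trades speed for accuracy.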


Joey Echeverria on

Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping/joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to developers writing business logic.

Joey Echeverria presents an alternative approach based on a reusable library that provides configuration-based data transformation. This allows users to write common data-transformation rules once and reuse them in multiple contexts. A common pattern is to consume a single, raw stream and transform it using the same rules before storing it in different repositories, such as Apache Solr for search and Apache Hadoop HDFS for deep storage.
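
To make the idea of configuration-based transformation concrete, here is a minimal sketch in plain Python. It is not the library presented in the talk; the rule format (`op`, `field`, `pattern`, and so on) and the function names are assumptions invented for illustration. The rules are plain data, so the same list can be reused by any consumer of the raw stream.

```python
import re

# Hypothetical rule format: each rule is a dict describing one step.
RULES = [
    {"op": "filter", "field": "level", "equals": "ERROR"},
    {"op": "extract", "field": "message", "pattern": r"user=(?P<user>\w+)"},
    {"op": "rename", "from": "ts", "to": "timestamp"},
]


def apply_rules(record, rules):
    """Return a transformed copy of record, or None if a filter drops it."""
    out = dict(record)
    for rule in rules:
        if rule["op"] == "filter":
            if out.get(rule["field"]) != rule["equals"]:
                return None
        elif rule["op"] == "extract":
            # Named groups in the pattern become new fields on the record.
            match = re.search(rule["pattern"], str(out.get(rule["field"], "")))
            if match:
                out.update(match.groupdict())
        elif rule["op"] == "rename":
            if rule["from"] in out:
                out[rule["to"]] = out.pop(rule["from"])
        else:
            raise ValueError(f"unknown op: {rule['op']}")
    return out


def transform_stream(records, rules):
    """Lazily apply the same rules to every record in a stream."""
    for record in records:
        result = apply_rules(record, rules)
        if result is not None:
            yield result
```

Because `transform_stream` is a generator driven only by the rule list, the same `RULES` could feed a search indexer and an archival writer without duplicating business logic.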


Karthik Ramasamy on

Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production for nearly two years and is widely used by several teams for diverse use cases. Twitter open-sourced Heron this year.

In this talk, you will learn about the operating experiences and challenges of running Heron at scale and the approaches that the team at Twitter took to solve those challenges.


Peter Bakas on

Peter Bakas from Netflix discusses Keystone, their new data pipeline. Hear in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak!

This talk was given at a joint event of SF Data Engineering and SF Data Science. The speaker is Peter Bakas, Director of Engineering, Real-Time Data Infrastructure, Netflix.

Asim Jalis on

Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.

What if you could use a single system for both use cases? This could dramatically simplify your data pipeline architecture.

Enter Apache Kudu. Kudu is a storage system that sits between HDFS and HBase. It is good at both ingesting streaming data and analyzing it using Spark, MapReduce, and SQL.

This talk was given at a joint event of SF Data Engineering and SF Data Science. The speaker is Asim Jalis, Lead Instructor, Data Engineering Immersive, Galvanize SF.

Kinshuk Mishra on

Spotify has built several real-time pipelines using Apache Storm for use cases like ad targeting, music recommendation, and data visualization. Each of these real-time pipelines has Apache Storm wired to different systems like Kafka, Cassandra, Zookeeper, and other sources and sinks. Building applications for over 50 million active users globally requires perpetual thinking about scalability to ensure high availability and good system performance.

Interested in learning more about data engineering and data science? Don't miss our 2-day DataEngConf with top engineers in San Francisco, April 2016.


Gwen Shapira on

Getting data from Kafka to Hadoop should be simple, which is why the community has so many options to choose from. Cloudera engineer Gwen Shapira reviews some popular solutions: Storm, Spark, Flume, and Camus. She goes over the pros and cons of each, recommends use cases, and discusses future development plans.

This talk was given at the Apache Kafka NYC meetup at Tapad.


We have many more articles on Apache Kafka. Check out our collection of the top 12 tech talks on Apache Kafka.


Kinshuk Mishra on

Spotify has over 24 million active users. 1 out of every 4 users is a paying subscriber. Ad revenues allow 3 out of 4 users to enjoy a free experience.

Kinshuk Mishra, lead engineer on Spotify's ad targeting infrastructure, talks about how they approach near-real-time user personalization using a mix of commercial ad tech, open source software, and in-house technology, and gives a glimpse into optimizing for experience rather than clicks. You'll learn about Spotify's deployment of Storm, Kafka, and Hadoop.



This talk was given at the Developers Ad Tech & RTBkit Meetup organized by Datacratic and hosted by Spotify. If you enjoyed this tech talk, check out the others given by Spotify engineers.

Neville Li on

This is the first time that a Spotify engineer has spoken publicly about their deployment and use cases for Storm! In this talk, Software Engineer Neville Li describes:

  • Real-time features developed using Storm and Kafka including recommendations, social features, data visualization and ad targeting

  • Architecture

  • Production integration

  • Best practices for deployment



Joe Stein on

In this talk, Joe Stein, Apache Kafka committer, member of the PMC, and Founder and Principal Architect at Big Data Open Source Security, talks about Apache Kafka, an open source, distributed publish-subscribe messaging system. Joe focuses on how to get started with Apache Kafka, how replication works, and more! Storm is a great system for real-time analytics and stream processing, but to get data into Storm you need to collect your data streams with consistency and availability at high loads and large volumes. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.

This talk was recorded at the NYC Storm User Group meetup at WebMD Health.


