Apache Kafka: Top 12 Tutorials on Performance & More...

In this article we put together 12 of the top Kafka talks on Hakka Labs.

Introduction to Apache Kafka

Apache Kafka is a commit log for your entire data center and infrastructure. In this lightning talk, Joe Stein, founder of Big Data Open Source Security LLC, gives a brief introduction to Kafka and talks about the producers, consumers, and client libraries it has to offer. This talk was given at the Apache Kafka NYC meetup at Tapad.


Kafka and Hadoop

Getting data from Kafka to Hadoop should be simple, which is why the community has so many options to choose from. Cloudera engineer Gwen Shapira reviews some popular solutions: Storm, Spark, Flume, and Camus. She goes over the pros and cons of each, recommends use cases, and covers future development plans. This talk was given at the Apache Kafka NYC meetup at Tapad.


Apache Kafka: 0.8.2 and Beyond

Apache Kafka committer Jay Kreps of LinkedIn walks through a brief production timeline for Kafka. Jay goes over what's new in 0.8.2 and how to get the most out of new features like log compaction and the new Java producer. Jay also gives an overview of what to expect from 0.9(?): a new consumer, better security, and operational improvements. This talk was given at the Apache Kafka NYC meetup at Tapad.
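For readers unfamiliar with log compaction, here is a minimal sketch of the idea in plain Python. It is a toy model, not Kafka code and not taken from the talk: compaction keeps the most recent record for each key and discards older ones, and a record with an empty value (a "tombstone") marks the key as deleted. The `Record` type, the `compact` function, and the `drop_tombstones` flag are hypothetical names used only for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record shape for this sketch: (offset, key, value).
# A value of None stands in for a tombstone (a delete marker).
Record = Tuple[int, str, Optional[str]]


def compact(log: List[Record], drop_tombstones: bool = False) -> List[Record]:
    """Keep only the latest record per key, preserving offset order."""
    latest: Dict[str, Record] = {}
    for record in log:
        _offset, key, _value = record
        latest[key] = record  # a later record for the same key replaces the earlier one
    survivors = sorted(latest.values(), key=lambda r: r[0])
    if drop_tombstones:
        # Models the later stage in which delete markers themselves are removed.
        survivors = [r for r in survivors if r[2] is not None]
    return survivors


log: List[Record] = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),  # supersedes offset 0
    (3, "user-2", None),                 # tombstone for user-2
]

compacted = compact(log)
fully_cleaned = compact(log, drop_tombstones=True)
```

Surviving records keep their original offsets, which is why a compacted log can have gaps in its offset sequence.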


Real-time Streaming and Data Pipelines with Apache Kafka

In this talk, Joe Stein, Apache Kafka committer, member of the PMC, and Founder and Principal Architect at Big Data Open Source Security, talks about Apache Kafka, an open source, distributed publish-subscribe messaging system. Joe focuses on how to get started with Apache Kafka, how replication works, and more. Storm is a great system for real-time analytics and stream processing, but to get the data into Storm, you need to collect your data streams with consistency and availability at high loads and large volumes. Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. This talk was recorded at the NYC Storm User Group meetup at WebMD Health.

More info: Apache Kafka is fast: a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. It's also scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization, and it can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines, which allows data streams larger than the capacity of any single machine and allows clusters of coordinated consumers. It's also durable: messages are persisted on disk and replicated within the cluster to prevent data loss, and each broker can handle terabytes of messages without performance impact. Finally, it's distributed by design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
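The partitioning idea described above can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Kafka's implementation: `zlib.crc32` is only a stand-in for whatever hash a real client uses, the round-robin assignment is a simplification of how a consumer group divides partitions, and `partition_for` and `assign_partitions` are hypothetical names.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition; the same key always maps to the same partition."""
    # crc32 is an assumption for this sketch, chosen because it is deterministic.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over the consumers of one group, round-robin."""
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Six partitions shared by three consumers: each consumer reads two partitions,
# and every partition has exactly one reader within the group.
group = assign_partitions(6, ["consumer-a", "consumer-b", "consumer-c"])
target = partition_for("user-42", 6)
```

Because each partition has a single reader within a group, adding partitions and consumers is what lets a stream grow beyond what one machine can handle.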

Bio: Joe Stein is an Apache Kafka committer and member of the PMC and is the Founder and Principal Architect at Big Data Open Source Security LLC http://www.stealth.ly


Migrating to Kafka in Three Short Years

Rafe Colburn talks about Etsy's journey in deploying Kafka.

Three years ago, the Etsy analytics data pipeline was built around a pixel hosted on Akamai, FTP uploads, and Amazon EMR. They're now in the last steps of migrating to a data ingestion pipeline based on Kafka. This talk covers all of the crazy things they had to do to rebuild their data pipeline without disrupting their ongoing analytics work, as well as the tradeoffs they've made in building these systems.


Site Reliability Engineering at LinkedIn: Kafka as a Service

LinkedIn runs one of the largest installations of Kafka in the world. In this talk, Todd Palino and Clark Haskins (Site Reliability, LinkedIn) discuss Kafka from an operations point of view. You'll learn the use cases for Kafka and the tools LinkedIn has been developing to improve the management of deployed clusters. They also talk about some of the challenges of managing a multi-tenant data service and how to avoid getting woken up at 3 AM.


Apache Kafka at Wikimedia

Andrew Otto, Systems Engineer at Wikimedia Foundation, talks about the analytics cluster at Wikimedia (Kafka, Hadoop, Hive, etc.) that allows them to support ~20 billion page views a month. Andrew shares how and why they chose Kafka as a scalable log transport and how they've implemented it with four brokers, a custom-built producer, and two consumers, kafkatee and Camus.


How ScalingData uses Kafka for Event-Oriented Machine Data

While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Eric Sammer, CTO at ScalingData and author of Hadoop Operations, talks about how ScalingData uses Kafka together with other open source systems such as Hadoop, Solr, and Impala/Hive to collect, transform, and aggregate event data and then build applications on top of this platform.

This talk was given at the Apache Kafka NYC meetup at Tapad.


Spotify's Ad Targeting Infrastructure: Achieving Real-time Personalization for 24 million+ Users

Spotify has over 24 million active users. 1 out of every 4 users is a paying subscriber. Ad revenues allow 3 out of 4 users to enjoy a free experience.

Kinshuk Mishra, lead engineer on Spotify's ad targeting infrastructure, talks about how they approach near-real-time user personalization using a mix of commercial ad tech, open source software, and in-house technology, and gives a glimpse into optimizing for experience rather than clicks. You'll learn about Spotify's deployment of Storm, Kafka, and Hadoop.

This talk was given at the Developers Ad Tech & RTBkit Meetup organized by Datacratic and hosted by Spotify. If you enjoyed this tech talk, check out the others given by Spotify engineers.


Storm at Spotify: Deployment and Use Cases

This is the first time that a Spotify engineer has spoken publicly about their deployment and use cases for Storm! In this talk, Software Engineer Neville Li describes real-time features developed using Storm and Kafka, including recommendations, social features, data visualization, and ad targeting. He also covers architecture, production integration, and best practices for deployment.


Data Pipeline at Tapad

Tapad's data pipeline is an elastic combination of technologies (Kafka, Hadoop, Avro, Scalding) that forms a reliable system for analytics, real-time and batch graph-building, and logging. In this talk, Tapad Senior Software Developer Toby Matejovsky speaks about the creation and evolution of the pipeline. He demonstrates a concrete example: a day in the life of an event tracking pixel. Toby also talks about common challenges that his team has overcome, such as integrating different pieces of the system, schema evolution, queuing, and data retention policies.