Peter Bakas from Netflix discusses Keystone, their new data pipeline. Hear in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak!Continue
Big Data applications need to ingest streaming data and analyze it. HBase is great at ingesting streaming data but not so good at analytics. On the other hand, HDFS is great at analytics but not at ingesting streaming data. Frequently applications ingest data into HBase and then move it to HDFS for analytics.
What if you could use a single system for both use cases? This could dramatically simplify your data pipeline architecture.
Enter Apache Kudu. Kudu is a storage system that lives between HDFS and HBase. It is good at both ingesting streaming data and good at analyzing it using Spark, MapReduce, and SQL.Continue
In this talk, "Streaming Data Analysis and Online Learning," John Myles White of Facebook surveys some basic methods for analyzing data in a streaming manner. He focuses on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks, discussing some basic implementation issues and demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization. He also describes how these methods allow ML systems to adapt to user data in real-time. This talk was recorded at the New York Open Statistical Programming meetup at Knewton.