Real-time stream analysis starts with ingesting raw data and extracting structured records. While stream-processing frameworks such as Apache Spark and Apache Storm provide primitives for processing individual records, processing windows of records, and grouping or joining records, common actions such as filtering, applying regular expressions to extract data, and converting records from one schema to another are left to the developers writing business logic.
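To make the three common actions above concrete, here is a minimal sketch in plain Python, with no streaming framework involved. The log-line format, the field names, and the helper names (`parse_line`, `to_target_schema`, `extract_records`) are all assumptions made for illustration; they are not taken from Spark, Storm, or any particular library.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical input format, assumed for illustration:
# "<timestamp> <LEVEL> <free-text message>"
LINE_PATTERN = re.compile(r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*)$")


def parse_line(line: str) -> Optional[dict]:
    """Apply a regular expression to one raw line; return None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def to_target_schema(record: dict) -> dict:
    """Convert a parsed record to a different (hypothetical) output schema."""
    return {
        "timestamp": record["ts"],
        "severity": record["level"].lower(),
        "message": record["msg"],
    }


def extract_records(
    lines: Iterable[str],
    keep_levels: frozenset = frozenset({"ERROR", "WARN"}),
) -> Iterator[dict]:
    """Extract, filter, and convert records one at a time, as a stream would."""
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue  # drop lines that are not structured records
        if record["level"] not in keep_levels:
            continue  # filter out levels we are not interested in
        yield to_target_schema(record)


raw_lines = [
    "2016-11-03T10:15:00Z ERROR disk full on node-7",
    "2016-11-03T10:15:01Z INFO heartbeat ok",
    "not a log line",
    "2016-11-03T10:15:02Z WARN retrying connection",
]
results = list(extract_records(raw_lines))
```

Because `extract_records` is a generator over any iterable of lines, the same per-record logic could in principle be passed to a framework's map or filter primitive; how that wiring looks depends on the framework and is not shown here.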
Josh Wills, currently Director of Data Science at Slack and a long-time practitioner in the data field (with roles at Google, Cloudera, and others), explains some of the real-world motivations of, and tensions between, data science and engineering teams.
(Sorry that the picture is a bit dark, we were playing with the lights, but the audio is good!)
This talk was a keynote recorded at our DataEngConf event in San Francisco.
This talk shows how to build an ETL pipeline using Google Cloud Dataflow/Apache Beam that ingests textual data into a BigQuery table. Google engineer Silviu Calinoiu gives a live coding demo and discusses concepts as he codes. You don't need any previous background with big data frameworks, although people familiar with Spark or Flink will see some similar concepts. Because of the way the framework operates, the same code can scale from gigabyte-sized files to terabyte-sized files easily.