Jared Polivka

In this post, you’ll learn about a special data science giveaway and get a sneak peek at the three talks I’m most excited about at DataEngConf NYC.

Two of my favorite Data Science Conferences are coming up in November (one in NYC and the other in SF), and you have a chance to win tickets to both of them! Here are the giveaway details:

Grand Prize:


Thanks to Our Sponsors:

This giveaway was made possible by the Learn Data Science Meetup community and by our amazing sponsors:

Hakka Labs, the creators of DataEngConf

Hakka Labs is an amazing community for data engineers and scientists, made up of thought leaders at influential tech companies like Google, Netflix, LinkedIn, Airbnb, Slack and many others. And as you know, DataEngConf is one of my favorite conferences (the content and the people attending are incredible!).


MLconf

Created by my friend Courtney Burton, MLconf is a single-day, single-track event devoted to the Machine Learning and Data Science community in major cities, agnostic of any tool, platform or company.

*Note: You can read more about the backstories of MLconf and DataEngConf in my Quora Answer: “Data Science Conferences - One List to Rule Them All”

AltWork Stations

To me, AltWork represents a way to code comfortably and healthily. You can stand, sit or recline. These workstations are amazing (I feel like Captain Kirk while working at an AltWork Station...)

DataEngConf - The Three Talks I’m Most Excited About

Here are the three talks at DataEngConf that I’m most excited about:

1.) Peloton: the Self-Driving Database Management System

By Andy Pavlo - Carnegie Mellon University

Andy Pavlo, from Carnegie Mellon University, is one of the rising stars in databases. Even his job title is cool: Assistant Professor of Databaseology.

Andy argues that we need a DBMS that ‘manages’ itself and doesn’t require human decisions about the configuration or maintenance of the underlying database mechanisms.

2.) The Future of Column-Oriented Processing with Arrow and Parquet

By Julien Le Dem - Principal Architect at Dremio, Apache Parquet co-founder and PMC chair

Columnar storage has been one of the key innovations of the ‘big data’ era, and we’ll hear about the most up-to-date ways it’s currently being used in tools like Kudu, Ibis, Drill, Arrow and others.

3.) Kafka Streams: Stream Processing Made Easy

By Guozhang Wang - Kafka Committer & Software Engineer, Confluent

Considering how ubiquitous Kafka has become for processing large amounts of data on the largest web platforms (starting at LinkedIn), I’m fascinated to see how Kafka Streams compares to Spark Streaming and which one will take the top spot in modern streaming architectures post Twitter’s Storm.

About Jared Polivka:

Jared is the Director of the Developer Evangelist team at Galvanize - the learning community for technology. Learn about Galvanize’s data science training here.

About the Galvanize Developer Evangelist Team:

The Galvanize Developer Evangelists currently create content for the data science and web development communities in 7 cities (San Francisco, Seattle, Austin, New York City, Denver, Boulder and Phoenix). Join a data science community near you via Learn Data Science.

Ajay Sharma

I'm a data scientist from SF who relocated to NYC this spring. I prudently spent the prior 8 months scoping & planning, making sure there was a healthy appetite for data scientists in the region. But when I got here it didn't seem like I was getting the responses to my outreach I had anticipated ...

Why is No One Getting Back to Me?

I was a little skeptical that the slow start could be attributed solely to my own performance and the typical nature of a job search. From what I could tell, the pool of actively open jobs was quite shallow. Eagerly searching for an explanation, I decided to plot the number of data scientist job postings from this year and last year.

The data is from Gary's Guide, which does an excellent job of curating tech job postings in NYC (I used 'Data Scientist' as the search term). This isn't indicative of all the jobs in NYC and is quite biased given the curation, but I'd imagine the trend is similar for data science jobs across NYC, and at a minimum the data is insightful from a seasonality perspective.

What the Data Shows

Looking at hiring trends from last year, there are two peaks: the lion's share of hiring is done in the spring, followed by a lull in late summer/early fall and another upswing just before the holidays, which is typical seasonality.



Chris Johnson


This awesome talk by Chris Johnson and Edward Newett, machine learning engineers at Spotify, shows how they imagined, tested, iterated on and built Spotify's highly popular "Discover Weekly" feature from start to finish.

Learn how product-oriented engineers think in this talk and the tradeoffs they make as they're looking for ways to rapidly test ideas and iterate. Of course, this product was built on music recommendations, so you'll also get to see how they thought through the process to figure out exactly how to generate meaningful recommendations for their millions of users.


This talk was recorded at the DataEngConf 2015 event in NYC.

Daniel Blazevski

Dan Blazevski from Insight Data Science presents some recent progress on Apache Flink's machine learning library, focusing on a new implementation of the k-nearest neighbors (knn) algorithm for Flink.

In the spirit of the Kappa Architecture, Apache Flink is a distributed batch and stream processing tool that treats batch as a special case of stream processing. Dan discusses a few ways, both exact and approximate, to do distributed knn queries, focusing on using quadtrees to spatially partition the training set and using z-value based hashing to reduce dimensionality.
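To make the z-value idea a little more concrete, here is a minimal single-machine sketch in Python. It is not Flink's implementation and not necessarily the approach shown in the talk; the function names (`morton_code`, `build_index`, `approx_knn`) and the `window` parameter are hypothetical, chosen only for illustration. It assumes 2-D points with non-negative integer coordinates. Interleaving the bits of the coordinates maps each point to a single integer (its z-value), so points that are close in space often end up close in the sorted order, and a small window around the query's position gives a set of candidate neighbors.

```python
from bisect import bisect_left
from math import dist


def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one integer (the z-value)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return code


def build_index(points, bits: int = 16):
    """Sort the training points by z-value (hypothetical helper)."""
    return sorted((morton_code(x, y, bits), (x, y)) for x, y in points)


def approx_knn(index, query, k: int, window: int = 4, bits: int = 16):
    """Approximate kNN: look only at points whose z-values are near the query's."""
    codes = [code for code, _ in index]
    pos = bisect_left(codes, morton_code(query[0], query[1], bits))
    lo, hi = max(0, pos - window), min(len(index), pos + window)
    candidates = [point for _, point in index[lo:hi]]
    # Rank the small candidate set by exact Euclidean distance.
    return sorted(candidates, key=lambda p: dist(p, query))[:k]


if __name__ == "__main__":
    training = [(0, 0), (1, 1), (2, 2), (10, 10), (11, 11), (50, 50)]
    index = build_index(training)
    print(approx_knn(index, (10, 11), k=2))
```

The result is only approximate: two points can be neighbors in space yet far apart on the z-order curve, so a larger `window` trades speed for accuracy. An exact query would need a structure such as the quadtree mentioned above.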


Fabrizio Milo

TensorFlow is one of the fastest-growing open source deep learning frameworks available today. TensorFlow was developed internally by Google and released as open source in November 2015.

Although it is mainly known for modeling deep learning architectures, TensorFlow's flexible interface makes it a good candidate for production-level data science pipelines as well.

In this talk, you will learn about the fundamentals of distributing TensorFlow models over multiple computers.

Fabrizio Milo is a deep learning architect and early TensorFlow contributor @

Pete Soderling

We're excited to announce the Call for Papers for our next DataEngConf - to be held in NYC, late October 2016.

Talks fit into 3 categories - data engineering, data science and data analytics. We made it super-easy to apply, so submit your ideas here!

We'll be selecting two kinds of speakers for the event: some from top companies that are building fascinating systems to process huge amounts of data, and others from the Hakka Labs community who submit the best talks.

Don't delay - CFP ends Aug 15th, 2016.

Neville Li

Learn about Scio, a Scala API for Google Cloud Dataflow (incubated as Apache Beam). Apache Beam offers a simple, unified programming model for both batch and streaming data processing while Scio brings it much closer to the high level API many data engineers are familiar with, e.g. Spark and Scalding. Neville will cover design and implementation of the framework, including features like typesafe BigQuery macros, REPL, and serialization. There will also be a live coding demo.

Neville is a software engineer at Spotify who works mainly on data infrastructure and tools for machine learning and advanced analytics. In the past few years he has been driving the adoption of Scala and new data tools for music recommendation, including Scalding, Spark, Storm and Parquet. Before that he worked on search quality at Yahoo! and old school distributed systems like MPI.

This talk was given at the NYC Data Engineering meetup in June 2016.

Reuven Lax

Reuven will cover the Beam programming model, and the advantages of hosted Google Cloud Dataflow.

Reuven has been a Google engineer since 2006. In that time, he's been instrumental in building Google's streaming data-processing systems from MillWheel to Cloud Dataflow.

This talk was given at the NYC Data Engineering meetup in June 2016.

Sadayuki Furuhashi

In production environments, it usually takes several applications and team members working together to move data from one place to another. This problem can surface in companies of any size but is especially painful when working at scale. Data can come from different sources and likely in different formats, which adds obvious complexity. Even if the data is collected correctly, moving it at scale presents other challenges that need proper handling: duplicates, multiple destinations, exceptions and more.

In this presentation, Sadayuki will dissect the challenges described and share his experience developing two open source solutions to address these problems: Fluentd and Embulk.

Sadayuki Furuhashi is an open-source hacker who wrote the original code of the MessagePack, Fluentd and Embulk projects. He is also a founder and architect of Treasure Data, Inc. and works on distributed storage and query engines.

This talk was given at SF DataEngConf in April 2016.

Calvin French-Owen

Data is critical to building great apps. Engineers and analysts can understand how customers interact with their brand at any time of the day, from any place they go, from any device they're using - and use that information to build a product they love. But there are countless ways to track, manage, transform, and analyze that data. And when companies are also trying to understand experiences across devices and the effect of mobile marketing campaigns, data engineering can be even trickier. What’s the right way to use data to help customers better engage with your app?

In this all-star panel, hear from mobile experts at Instacart, Branch Metrics, Pandora, Invoice2Go, Gametime and Segment on the best practices they use for tracking mobile data and powering their analytics.

