Yael Elmatad on

Many data scientists work within the realm of machine learning, and their problems are often addressable with techniques such as classifiers and recommendation engines. At Tapad, however, they have often had to look outside the standard machine learning toolkit and draw inspiration from more traditional engineering algorithms. This has enabled them to solve a scaling problem in their Device Graph's connected-component computation and to maintain time-consistency in cluster identification week over week.

Pete Soderling on

We're excited to announce the Call for Papers for our next DataEngConf - to be held in NYC, late October 2016.

Talks fit into 3 categories - data engineering, data science and data analytics. We made it super-easy to apply, so submit your ideas here!

We'll be selecting two kinds of speakers for the event, some from top companies that are building fascinating systems to process huge amounts of data, as well as the best submitted talks by members of the Hakka Labs community.

Don't delay - CFP ends Aug 15th, 2016.

Sam Abrahams on

Machine learning, especially deep learning, is becoming more and more important to integrate into day-to-day business infrastructure across all industries. TensorFlow, open-sourced by Google in 2015, has become one of the more popular modern deep learning frameworks available today, promising to bridge the gap between the development of new models and their deployment.

Jeff Ma on

Jeff Ma's life and career were totally transformed by the advent of the big data movement. From beating blackjack to working with professional sports teams to a variety of entrepreneurial efforts leveraging analytics, all of that has led him to his current role at Twitter, where he works with some of the brightest minds and most interesting data available. Jeff will discuss his personal journey and where he sees the future of analytics in the workplace.

Peadar Coyle on

I've been working with Machine Learning models in both academic and industrial settings for a few years now. I've recently been watching the excellent Scalable ML series from Mikio Braun to learn some more about Scala and Spark.

His video series talks about the practicalities of 'big data', and it made me think about what I wish I had known earlier about Machine Learning:

  1. Getting models into production is a lot more than just microservices
  2. Feature selection and feature extraction are really hard to learn from a book
  3. The evaluation phase is really important
I'll take each in turn.

Getting models into production is a lot more than just microservices

I gave a talk on Data-Products and getting Ordinary Differential Equations into production. One thing that I didn't realize until some time afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, and dev-ops all by yourself. How hard this is depends on the resources you have, and there are platforms available to accelerate the time to value. As we all know from engineering, getting things from research and development into reliable, scalable production code is a huge challenge.

Some things I've learned are that iterating and focusing on business outcomes are what matter, and I'm keen to learn a lot more about deploying models.

Feature selection and feature extraction are really hard to learn

Feature selection and extraction are something I tried, and failed, to learn from a book. These skills come only from Kaggle competitions and real-world projects: the various tricks and methods are things you learn only by implementing them and using them. This work also eats up a lot of the data science workflow. In the new year I'll probably try to write a blog post just on feature extraction and feature selection.
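To make this concrete, here is a minimal sketch (not from the post) of one of the simplest feature-selection tricks: dropping near-constant columns, which carry little signal. Real projects layer many such heuristics on top of each other.

```python
import statistics

def variance_filter(rows, threshold=0.01):
    """Return the indices of columns whose variance exceeds threshold.

    A near-constant column carries little signal, so dropping it is one
    of the simplest feature-selection heuristics.
    """
    n_cols = len(rows[0])
    keep = []
    for j in range(n_cols):
        col = [row[j] for row in rows]
        if statistics.pvariance(col) > threshold:
            keep.append(j)
    return keep

# Toy data: column 1 is constant, so only columns 0 and 2 survive.
data = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.4],
]
print(variance_filter(data))  # [0, 2]
```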

The evaluation phase is really important

Unless you apply your models to test data, you're not doing predictive analytics. Evaluation techniques such as cross-validation and evaluation metrics are invaluable, as is simply splitting your data into training data and test data. Life rarely hands you a dataset with these sets defined, so there is a lot of creativity and empathy involved in defining them on a real-world dataset. There is a great set of posts on Dato about the challenges of model evaluation.
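As a minimal illustration of the split-and-evaluate workflow, here is a sketch with a made-up dataset and a trivial threshold "model"; the point is only that evaluation happens on data the model never trained on.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle and split rows into a training set and a held-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    correct = sum(1 for x, y in rows if model(x) == y)
    return correct / len(rows)

# Toy labelled data: the label is 1 when the feature exceeds 0.5.
data = [(i / 100, int(i / 100 > 0.5)) for i in range(100)]
train, test = train_test_split(data)
model = lambda x: int(x > 0.5)   # a "trained" threshold model
print(accuracy(model, test))     # 1.0 on this noise-free toy data
```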

I think the explanations by Mikio Braun are worth a read. I love his diagrams too and include one here in case you're not familiar with training and testing sets.


Source: Mikio Braun 2015

Often we don't discuss the evaluation of models in papers, at conferences, or even when we talk about which techniques we used to solve problems. 'We used SVM on that' doesn't really tell me anything: it doesn't tell me your data sources, your feature selection, your evaluation methods, how you got the model into production, or how you used cross-validation or model debugging. I think we need a lot more commentary about these 'dirty' aspects of machine learning. And I wish I had known that a lot earlier.

My friend Ian has some great remarks in 'Data Science Delivered', which is a great read for any professional (junior or senior) who builds machine learning models for a living. It is also a great read for recruiters hiring data scientists, or for managers interacting with data science teams who are looking for questions to ask, e.g. 'how did you handle that dirty data?'

Ryan Adams on

Ryan Adams is a machine learning researcher at Twitter and a professor of computer science at Harvard. He co-founded Whetlab, a machine learning startup that was acquired by Twitter in 2015. He co-hosts the Talking Machines podcast.

A big part of machine learning is optimization of continuous functions. Whether for deep neural networks, structured prediction or variational inference, machine learners spend a lot of time taking gradients and verifying them. It turns out, however, that computers are good at doing this kind of calculus automatically, and automatic differentiation tools are becoming more mainstream and easier to use. In his talk, Adams will give an overview of automatic differentiation, with a particular focus on Autograd. He will also give several vignettes about using Autograd to learn hyperparameters in neural networks, perform variational inference, and design new organic molecules.
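To give a flavour of what tools like Autograd automate, here is a toy reverse-mode automatic differentiation sketch. This is not Autograd's implementation (which handles arrays, control flow, and a large NumPy subset); it only shows the core trick of recording operations and propagating gradients backwards.

```python
class Var:
    """A scalar that records the operations applied to it, so gradients
    can be propagated backwards (reverse-mode automatic differentiation)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # (parent_var, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        """Accumulate the incoming gradient and push it to the parents."""
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# d/dx of f(x, y) = x*x + x*y at x=3, y=4 is 2x + y = 10.
x, y = Var(3.0), Var(4.0)
f = x * x + x * y
f.backward()
print(x.grad)  # 10.0
```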

This talk is from the SF Data Science meetup in June 2016.

Nick Chamandy on

Simple “random-user” A/B experiment designs fall short in the face of complex dependence structures. These can come in the form of large-scale social graphs or, more recently, spatio-temporal network interactions in a two-sided transportation marketplace. Naive designs are susceptible to statistical interference, which can lead to biased estimates of the treatment effect under study.

Ben Packer on

With the world’s largest residential energy dataset at their fingertips, Opower is uniquely situated to use Machine Learning to tackle problems in demand-side management. Their communication platform, which reaches millions of energy customers, allows them to build those solutions into their products and make a measurable impact on energy efficiency, customer satisfaction and cost to utilities.

In this talk, Opower surveys several Machine Learning projects that they’ve been working on. These projects vary from predicting customer propensity to clustering load curves for behavioral segmentation, and leverage supervised and unsupervised techniques.

Ben Packer is the Principal Data Scientist at Opower. Ben earned a bachelor's degree in Cognitive Science and a master's degree in Computer Science at the University of Pennsylvania. He then spent half a year living in a cookie factory before coming out to the West Coast, where he did his Ph.D. in Machine Learning and Artificial Intelligence at Stanford.

Justine Kunz is a Data Scientist at Opower. She recently completed her master’s degree in Computer Science at the University of Michigan with a concentration in Big Data and Machine Learning. Now she works on turning ideas into products from the initial Machine Learning research to the production pipeline.

This talk is from the Data Science for Sustainability meetup in June 2016.

Amir Najmi on

Scalable web technology has greatly reduced the marginal cost of serving users. Thus, an individual business today may support a very large user base. With so much data, one might imagine that it is easy to obtain statistical significance in live experiments. However, this is not always the case. Often, the very business models enabled by the web require answers for which our data is information-poor.

Greg Dingle on

Tech businesses know how they're doing by numbers on a screen. The weakest link in the process of analysis is usually the part in front of the keyboard. People are not designed to think about abstract quantities. Scientists in the field of decision science have described for decades now exactly how people go wrong. You can overcome your biases only by being aware of them. Greg Dingle will walk you through some common biases, examples, and corrective measures.

Josh Wills on

As a long-time practitioner in the data field (roles at Google, Cloudera and others) Josh Wills, currently Director of Data Science at Slack, explains some of the real-world motivations and tensions between data science and engineering teams.

In his own humorous way, Josh brings up some controversial ideas in this talk (ETL in Javascript?!) which spurred some highly interesting Q/A from the audience as well as prolonged attendee discussions throughout the event.

(Sorry that the picture is a bit dark, we were playing w/ the lights - but the audio is good!)

This talk was a keynote recorded at our DataEngConf event in San Francisco.

Unknown author on

The data science team at iHeartRadio has been developing collaborative filtering models to drive a wide variety of features, including recommendations and radio personalization. Collaborative filtering is a popular approach in recommendation systems that makes predictive suggestions to users based on the behavior of other users in a service. For example, it’s often used to recommend new artists to users based on what they and similar users are listening to. In this blog post we discuss our initial experiences applying these models to increase user engagement through targeted marketing campaigns.

One of the most successful approaches to collaborative filtering in recent years has been matrix factorization, which decomposes all the activity on a service into compact representations of its users and what they interact with, e.g. artists. At iHeartRadio we use matrix factorization models to represent our users and artists with ~100 numbers each (called their latent vectors) instead of processing the trillions of possible interactions between them. This simplifies our user activity by many orders of magnitude, and allows us to quickly match artists to users by applying a simple scoring function to the user’s and artist’s latent vectors.
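The scoring step can be sketched in a few lines. The three-dimensional latent vectors and artist names below are made up for illustration (the post describes ~100-dimensional vectors); the point is that ranking reduces to a dot product between a user's and an artist's latent vectors.

```python
def dot(u, v):
    """The simple scoring function: dot product of two latent vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical 3-dimensional latent vectors learned by matrix factorization.
user = [0.9, 0.1, 0.3]
artists = {
    "artist_a": [0.8, 0.0, 0.2],
    "artist_b": [0.1, 0.9, 0.0],
    "artist_c": [0.5, 0.2, 0.9],
}

# Rank artists for this user by score, best match first.
ranked = sorted(artists, key=lambda name: dot(user, artists[name]), reverse=True)
print(ranked)  # ['artist_a', 'artist_c', 'artist_b']
```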

Erik Bernhardsson on

Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and others. They are fast and convenient, and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increases, finding similar items gets challenging. Erik Bernhardsson developed a library called "Annoy" that uses a forest of random trees to do fast approximate nearest neighbor queries in high-dimensional spaces. The talk covers some specific applications of vector models and how Annoy works.
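A single random-projection tree of the kind Annoy builds can be sketched as follows. This is an illustrative toy, not Annoy's actual implementation (Annoy is C++ and builds a whole forest): points are recursively split by random hyperplanes, and a query descends to one small bucket of candidates.

```python
import random

def build_tree(points, leaf_size=2, rng=None):
    """Recursively split points with random hyperplanes, Annoy-style.
    Internal nodes store a random direction and a split threshold."""
    rng = rng or random.Random(42)
    if len(points) <= leaf_size:
        return points                      # leaf: a small bucket of candidates
    dim = len(points[0])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    project = lambda p: sum(a * b for a, b in zip(p, direction))
    median = sorted(project(p) for p in points)[len(points) // 2]
    left = [p for p in points if project(p) < median]
    right = [p for p in points if project(p) >= median]
    if not left or not right:              # degenerate split: stop recursing
        return points
    return (direction, median,
            build_tree(left, leaf_size, rng), build_tree(right, leaf_size, rng))

def query(tree, q):
    """Walk down to the leaf on q's side of every hyperplane."""
    while not isinstance(tree, list):      # internal nodes are 4-tuples
        direction, median, left, right = tree
        side = sum(a * b for a, b in zip(q, direction))
        tree = left if side < median else right
    return tree

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.2), (5.1, 5.0), (5.2, 5.1)]
candidates = query(build_tree(points), (0.05, 0.1))
```

A brute-force distance computation over the returned bucket then picks the best matches; building many such trees and merging their buckets is how this family of methods trades a little accuracy for a lot of speed.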

Unknown author on

Word2Vec is an interesting unsupervised way to construct vector representations of words to act as features for downstream algorithms or as a basis for similarity searches. We look at using the Spark implementation of Word2Vec shipped in MLlib to help us organize and make sense of some non-textual data by treating discrete clinical events (e.g. diagnoses, drugs prescribed, etc.) in a medical dataset as non-textual "words".
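The data preparation step can be sketched like this: each patient's event history, ordered by time, plays the role of a sentence. The patient records and event codes below are invented for illustration; the resulting lists of tokens are the shape of input a Word2Vec implementation such as MLlib's would consume.

```python
# Each patient's record becomes an ordered "sentence" of event tokens,
# so a Word2Vec implementation can treat events exactly like words.
records = {
    "patient_1": [("2016-01-02", "dx:diabetes"), ("2016-01-09", "rx:metformin")],
    "patient_2": [("2016-03-01", "dx:hypertension"),
                  ("2016-03-04", "rx:lisinopril"),
                  ("2016-04-11", "dx:diabetes")],
}

sentences = [
    [event for _, event in sorted(events)]   # order each patient's events by date
    for events in records.values()
]
print(sentences[0])  # ['dx:diabetes', 'rx:metformin']
```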

Unknown author on

Anomaly detection in healthcare data is an enabling technology for the detection of overpayment and fraud. In this talk, we demonstrate how to use PageRank with Hadoop and SociaLite (a distributed query language for large-scale graph analysis) to identify anomalies in healthcare payment information. We demonstrate a variant of PageRank applied to graph data generated from the Medicare-B dataset for anomaly detection, and show real anomalies discovered in the dataset.
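For readers unfamiliar with PageRank, a minimal iterative implementation looks like this. The toy graph is made up, not Medicare data; the idea in the talk is that nodes (e.g. providers) whose rank is far from their peers' stand out as anomalies.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank on an adjacency-list graph {node: [out_neighbours]}."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, out in graph.items():
            if not out:                      # dangling node: spread rank evenly
                for m in nodes:
                    new_rank[m] += damping * rank[n] / len(nodes)
            else:
                for m in out:
                    new_rank[m] += damping * rank[n] / len(out)
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # 'c' attracts the most links
```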

Chris Wiggins on

Nearly all fields have been or are being transformed by the availability of copious data and the tools to learn from them. Dr. Chris Wiggins (Chief Data Scientist, New York Times) will talk about using machine learning and large data in both academia and business. He shares some of the ways in which re-framing domain questions as machine learning tasks has opened up new avenues for understanding, both in academic research and in real-world applications.

Reynold Xin on

Mining Big Data can be an incredibly frustrating experience due to its inherent complexity and a lack of tools. Reynold Xin and Aaron Davidson are Committers and PMC Members for Apache Spark and use the framework to mine big data at Databricks. In this presentation and interactive demo, you'll learn about data mining workflows, the architecture and benefits of Spark, as well as practical use cases for the framework.

Jeroen Janssens on

In this talk, Jeroen Janssens, senior data scientist at YPlan, introduces both the outlier selection and one-class classification settings. He then presents a novel algorithm called Stochastic Outlier Selection (SOS). The SOS algorithm computes an outlier probability for each data point. These probabilities are more intuitive than the unbounded outlier scores computed by existing outlier-selection algorithms. Jeroen has evaluated SOS on a variety of real-world and synthetic datasets and compared it to four state-of-the-art outlier-selection algorithms. The results show that SOS has superior performance while being more robust to data perturbations and parameter settings. Jeroen's blog post on the subject contains a link to the d3 demo. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.
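A simplified sketch of the SOS idea follows: each point distributes "binding probability" to the others according to a Gaussian affinity, and a point that nobody binds to gets a high outlier probability. The real algorithm tunes a per-point bandwidth from a perplexity parameter; this toy uses one fixed sigma, and the data is made up.

```python
import math

def sos_outlier_probabilities(points, sigma=1.0):
    """Simplified Stochastic Outlier Selection with a fixed bandwidth.
    (Real SOS derives a per-point sigma from a perplexity parameter.)"""
    n = len(points)
    d2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
           for j in range(n)] for i in range(n)]
    # Affinity of i for j (never for itself), normalised per row into
    # binding probabilities b[i][j].
    affinity = [[math.exp(-d2[i][j] / (2 * sigma ** 2)) if i != j else 0.0
                 for j in range(n)] for i in range(n)]
    binding = [[affinity[i][j] / sum(affinity[i]) for j in range(n)]
               for i in range(n)]
    # Point j is an outlier if no other point binds to it.
    return [math.prod(1 - binding[i][j] for i in range(n) if i != j)
            for j in range(n)]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
probs = sos_outlier_probabilities(points)
print(probs.index(max(probs)))  # 3: the far-away point is the likeliest outlier
```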

Dustin Mulcahey on

This is a friendly introduction to the lambda calculus by Dustin Mulcahey. LISP has its syntactic roots in a formal system called the lambda calculus. After a brief discussion of formal systems and logic in general, Dustin dives into the lambda calculus and makes enough constructions to convince you that it really is capable of expressing anything that is "computable". Dustin then talks about the simply typed lambda calculus and the Curry-Howard-Lambek correspondence, which asserts that programs and mathematical proofs are "the same thing". This talk was recorded at the Lisp NYC meetup at Meetup HQ.
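As a small taste of the expressiveness claim, Church numerals encode natural numbers as pure functions ("apply f n times"), and arithmetic falls out of function composition. A sketch in Python rather than LISP:

```python
# Church numerals: the number n is "apply f n times".
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

def to_int(n):
    """Decode a Church numeral by counting applications of f."""
    return n(lambda k: k + 1)(0)

two = succ(succ(zero))
three = succ(two)
print(to_int(add(two)(three)))  # 5
```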

John Myles White on

In this talk, "Streaming Data Analysis and Online Learning," John Myles White of Facebook surveys some basic methods for analyzing data in a streaming manner. He focuses on using stochastic gradient descent (SGD) to fit models to data sets that arrive in small chunks, discussing some basic implementation issues and demonstrating the effectiveness of SGD for problems like linear and logistic regression as well as matrix factorization. He also describes how these methods allow ML systems to adapt to user data in real-time. This talk was recorded at the New York Open Statistical Programming meetup at Knewton.
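The core SGD loop for chunked streams can be sketched as follows; the linear-regression stream here is made up for illustration, and the parameters update a little with every example as each chunk arrives.

```python
import random

def sgd_linear_regression(chunks, lr=0.1, w=0.0, b=0.0):
    """Fit y ≈ w*x + b by stochastic gradient descent, one chunk at a time."""
    for chunk in chunks:                 # chunks arrive as a stream
        for x, y in chunk:
            error = (w * x + b) - y
            w -= lr * error * x          # gradient of squared error w.r.t. w
            b -= lr * error              # gradient of squared error w.r.t. b
    return w, b

# Simulate a stream of 20 small chunks; the true relationship is y = 2x + 1.
rng = random.Random(0)
stream = ([(x, 2 * x + 1) for x in (rng.uniform(0, 1) for _ in range(50))]
          for _ in range(20))
w, b = sgd_linear_regression(stream)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```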

Ben Engber on

Ben Engber, CEO and founder of Thumbtack Technology, will discuss how to perform tuned benchmarking across a number of NoSQL solutions. He describes a NoSQL Database Comparison across Couchbase, Aerospike, MongoDB, Cassandra, HBase, and others in a way that does not artificially distort the data in favor of a particular database or storage paradigm. This includes hardware and software configurations, as well as ways of measuring to ensure repeatable results. This talk was recorded at the Scale Warriors of NYC meetup at adMarketplace.

Todd Holloway on

Recommendation engines typically produce a list of recommendations in one of two ways: through collaborative or content-based filtering. Collaborative filtering approaches build a model from a user's past behavior (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users, then use that model to predict items (or ratings for items) that the user may have an interest in. Content-based filtering approaches utilize a series of discrete characteristics of an item in order to recommend additional items with similar properties.
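A minimal collaborative-filtering sketch along these lines, with made-up ratings: score each unseen item by other users' ratings, weighted by how similar those users' tastes are.

```python
import math

ratings = {                     # user -> {item: rating}; invented toy data
    "alice": {"item1": 5, "item2": 3, "item3": 4},
    "bob":   {"item1": 5, "item2": 3, "item4": 5},
    "carol": {"item2": 1, "item3": 2, "item4": 1},
}

def cosine(u, v):
    """Cosine similarity of two sparse rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    num = sum(u[i] * v[i] for i in shared)
    return num / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def recommend(user):
    """Score unseen items by similarity-weighted ratings of other users."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return max(scores, key=scores.get)

print(recommend("alice"))  # 'item4'
```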

Christian Posse on

Dr. Christian Posse was the last panelist at the recent The Hive Big Data Think Tank meetup at Microsoft. In this talk, Christian shares some of the problems he's seen in the social network field. At LinkedIn, not a single piece of code, algorithm, feature, or user experience goes out without A/B testing. He discusses the development of a system of hashing functions at LinkedIn that allows them to run millions of A/B tests concurrently without interactions between them.
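The standard way to keep concurrent experiments from interacting is to salt a hash with the experiment id, so that one user's assignments across experiments are statistically independent. A sketch of the general technique (not necessarily LinkedIn's exact system):

```python
import hashlib

def bucket(user_id, experiment_id, n_buckets=2):
    """Assign a user to a bucket for one experiment. Salting the hash with
    the experiment id makes assignments independent across experiments,
    which lets many tests run concurrently without interacting."""
    key = f"{experiment_id}:{user_id}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return digest % n_buckets

# Deterministic: the same user always lands in the same bucket for the
# same experiment, while different experiments hash independently.
assert bucket("user_42", "exp_a") == bucket("user_42", "exp_a")
```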

Dr. Christian Posse recently joined Google as Program Manager, Technology. Before that he was Principal Product Manager and Principal Data Scientist at LinkedIn where he led the development of recommendation products as well as the next generation online experimentation platform. Prior to LinkedIn, Dr. Posse was a founding member and technology lead of Cisco Systems Network Collaboration Business Unit where he designed the search and advanced social analytics of Pulse, Cisco’s network-based search and collaboration platform for the enterprise. Prior to Cisco, Dr. Posse worked in a wide range of environments, from holding faculty positions in US universities, to leading the R&D at software companies and a US National Laboratory in the social networks, biological networks and behavioral analytics fields. His interests are diverse and include search and recommendation engines, social networks analytics, computational social and behavioral sciences, online experimentation and information fusion. He has written over 40 scientific peer-reviewed publications and holds several patents in those fields. Dr. Posse has a PhD in Statistics from the Swiss Federal Institute of Technology, Switzerland.

Caitlin Smallwood on

Controlled Experimentation (or A/B testing) has evolved into a powerful tool for driving product strategy and innovation. The dramatic growth in online and mobile content, media, and commerce has enabled companies to make principled data-driven decisions. Large numbers of experiments are typically run to validate hypotheses, study causation, and optimize user experience, engagement, and monetization.

Rajesh Parekh on

Controlled Experimentation (or A/B testing) has evolved into a powerful tool for driving product strategy and innovation. The dramatic growth in online and mobile content, media, and commerce has enabled companies to make principled data-driven decisions. Large numbers of experiments are typically run to validate hypotheses, study causation, and optimize user experience, engagement, and monetization.

Sham Kakade on

We are happy to share with you a recent talk by Sham Kakade of Microsoft, recorded at the NYC Machine Learning meetup. In this talk he discusses a general and (computationally and statistically) efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models and latent Dirichlet allocation, by exploiting a certain tensor structure in their low-order observable moments.


Claudia Perlich on

Here's a new talk on targeted online advertising recorded at one of the NYC Machine Learning meetups. Two presenters from Media6 labs spoke about their respective papers from the recent Knowledge Discovery and Data Mining conference (KDD). Claudia Perlich presented "Bid Optimizing and Inventory Scoring in Targeted Online Advertising" and Troy Raeder presented "Design Principles of Massive, Robust Prediction Systems." Full abstracts and audio below.

Laurent Gautier on

We were lucky to attend the Bay Area R users group last week, where we recorded Laurent Gautier's talk on the RPy2 bridge, which allows one to use Python as the glue language for developing applications while using R as the statistics and data analysis engine. He also demonstrated how a web application could be developed around an existing R script.