I'm a data scientist from SF who relocated to NYC this spring. I prudently spent the prior 8 months scoping & planning, making sure there was a healthy appetite for data scientists in the region. But when I got here it didn't seem like I was getting the responses to my outreach I had anticipated ...
Many data scientists work within the realm of machine learning, and their problems are often addressable with techniques such as classifiers and recommendation engines. However, at Tapad, they have often had to look outside the standard machine learning toolkit to find inspiration from more traditional engineering algorithms. This has enabled them to solve a scaling problem with their Device Graph’s connected component, and to maintain time-consistency in cluster identification week over week.
TensorFlow is one of the fastest growing open source deep learning frameworks available today. TensorFlow was developed internally by Google and released as open source in November 2015.
We're excited to announce the Call for Papers for our next DataEngConf - to be held in NYC, late October 2016.
Talks fit into 3 categories - data engineering, data science and data analytics. We made it super-easy to apply, so submit your ideas here!
We'll be selecting two kinds of speakers for the event: some from top companies that are building fascinating systems to process huge amounts of data, and the best submitted talks by members of the Hakka Labs community.
Don't delay - CFP ends Aug 15th, 2016.
Machine learning, especially deep learning, is becoming more and more important to integrate into day-to-day business infrastructure across all industries. TensorFlow, open-sourced by Google in 2015, has become one of the more popular modern deep learning frameworks available today, promising to bridge the gap between the development of new models and their deployment.
Jeff Ma's life and career were totally transformed by the advent of the big data movement. From beating blackjack to working with professional sports teams to a variety of entrepreneurial efforts leveraging analytics, all of that has led him to his current role at Twitter where he works with some of the brightest minds and most interesting data available. Jeff will discuss his personal journey and where he sees the future of analytics in the workplace.
I've been working with Machine Learning models in both academic and industrial settings for a few years now. I've recently been watching the excellent Scalable ML series from Mikio Braun to learn some more about Scala and Spark.
His video series talks about the practicalities of 'big data', and it made me think about what I wish I had known earlier about Machine Learning:
- Getting models into production is a lot more than just micro services
- Feature selection and feature extraction are really hard to learn from a book
- The evaluation phase is really important
Getting models into production is a lot more than just micro services
I gave a talk on Data-Products and getting Ordinary Differential Equations into production. One thing that I didn't realize until sometime afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, and DevOps all by yourself. How hard this is depends on the resources you have, and there are platforms available to shorten the time to value. As we all know from engineering, getting stuff from Research and Development to reliable and scalable production code is a huge challenge.
One thing I've learned is that iterating and focusing on business outcomes are what matter most, and I'm keen to learn a lot more about deploying models.
Feature selection and feature extraction are really hard to learn
Something that I tried to learn from a book, but couldn't, is feature selection and extraction. These skills are only learned through Kaggle competitions and real-world projects: one picks up the various tricks and methods only by implementing them or using them in practice. This work eats up a lot of the data science workflow. In the new year I'll probably try to write a blog post devoted to feature extraction and feature selection.
The evaluation phase is really important
Unless you apply your models to test data, you're not doing predictive analytics. Evaluation techniques such as cross-validation and evaluation metrics are all invaluable, as is simply splitting your data into test data and training data. Life often doesn't hand you a dataset with these things defined, so there is a lot of creativity and empathy involved in defining these two sets on a real-world dataset. There is a great set of posts on Dato about the challenges of model evaluation.
I think the explanations by Mikio Braun are worth a read. I love his diagrams too, and include one here in case you're not familiar with training sets and testing sets.
Source: Mikio Braun 2015
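To make the training/testing split and cross-validation mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration only, not code from Mikio Braun's material: the helper names (`train_test_split`, `k_fold_indices`, `cross_validate`) are hypothetical, the "model" is just the training mean, and it assumes rows can be shuffled freely (which is not true for time-ordered data).

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and hold out the last test_fraction as test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[:-n_test], shuffled[-n_test:]

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs that cover range(n) in k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def fit_mean(ys):
    """Toy model (an assumption for illustration): always predict the training mean."""
    return sum(ys) / len(ys)

def mean_squared_error(prediction, ys):
    """Average squared difference between a constant prediction and the targets."""
    return sum((y - prediction) ** 2 for y in ys) / len(ys)

def cross_validate(ys, k=5):
    """Fit on k-1 folds, score on the held-out fold, and return one score per fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ys), k):
        model = fit_mean([ys[i] for i in train_idx])
        scores.append(mean_squared_error(model, [ys[i] for i in test_idx]))
    return scores
```

The point of the sketch is that the model never sees the fold it is scored on; on a real dataset the hard part is deciding which rows may legitimately end up on each side of that split.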
Often we don't discuss evaluation of models in papers, conferences or even when we talk about what techniques we use to solve problems. 'We used SVM on that' doesn't really tell me anything. It doesn't tell me your data sources, your feature selection, your evaluation methods, how you got into production and how you used cross-validation or model-debugging. I think we need a lot more commentary about these 'dirty' aspects of machine learning. And I wish I knew that a lot earlier.
My friend Ian has some great remarks on 'Data Science Delivered', which is a great read for any professional (junior or senior) who builds machine learning models for a living. It is also a great read for recruiters hiring data scientists or managers interacting with data science teams, if you're looking for questions to ask people, e.g. 'how did you handle that dirty data?'
Ryan Adams is a machine learning researcher at Twitter and a professor of computer science at Harvard. He co-founded Whetlab, a machine learning startup that was acquired by Twitter in 2015. He co-hosts the Talking Machines podcast.
A big part of machine learning is optimization of continuous functions. Whether for deep neural networks, structured prediction or variational inference, machine learners spend a lot of time taking gradients and verifying them. It turns out, however, that computers are good at doing this kind of calculus automatically, and automatic differentiation tools are becoming more mainstream and easier to use. In his talk, Adams will give an overview of automatic differentiation, with a particular focus on Autograd. He will also give several vignettes about using Autograd to learn hyperparameters in neural networks, perform variational inference, and design new organic molecules.
This talk is from the SF Data Science meetup in June 2016.
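As a rough illustration of what "doing this kind of calculus automatically" means, here is a minimal sketch of forward-mode automatic differentiation using dual numbers in plain Python. This is not Autograd and not code from the talk: Autograd works differently (it uses reverse-mode differentiation), and the `Dual` class, `sin` wrapper and `derivative` helper below are hypothetical names that support only addition, multiplication and sine.

```python
import math

class Dual:
    """A number that carries its value together with its derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def _lift(x):
        # Treat plain numbers as constants (derivative zero).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual._lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__

def sin(x):
    """Sine with the chain rule applied: (sin u)' = cos(u) * u'."""
    x = Dual._lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)

def derivative(f, x):
    """Evaluate df/dx at x by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv

def f(x):
    # Example function: f(x) = x^3 + 2x + sin(x), so f'(x) = 3x^2 + 2 + cos(x)
    return x * x * x + 2 * x + sin(x)
```

Calling `derivative(f, 1.5)` evaluates the exact derivative rule by rule, with no hand-written gradient and no finite-difference approximation, which is the property that makes such tools useful for checking gradients.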
Simple “random-user” A/B experiment designs fall short in the face of complex dependence structures. These can come in the form of large-scale social graphs or, more recently, spatio-temporal network interactions in a two-sided transportation marketplace. Naive designs are susceptible to statistical interference, which can lead to biased estimates of the treatment effect under study.
With the world’s largest residential energy dataset at their fingertips, Opower is uniquely situated to use Machine Learning to tackle problems in demand-side management. Their communication platform, which reaches millions of energy customers, allows them to build those solutions into their products and make a measurable impact on energy efficiency, customer satisfaction and cost to utilities.
In this talk, Opower surveys several Machine Learning projects that they’ve been working on. These projects vary from predicting customer propensity to clustering load curves for behavioral segmentation, and leverage supervised and unsupervised techniques.
Ben Packer is the Principal Data Scientist at Opower. Ben earned a bachelor's degree in Cognitive Science and a master's degree in Computer Science at the University of Pennsylvania. He then spent half a year living in a cookie factory before coming out to the West Coast, where he did his Ph.D. in Machine Learning and Artificial Intelligence at Stanford.
Justine Kunz is a Data Scientist at Opower. She recently completed her master’s degree in Computer Science at the University of Michigan with a concentration in Big Data and Machine Learning. Now she works on turning ideas into products from the initial Machine Learning research to the production pipeline.
This talk is from the Data Science for Sustainability meetup in June 2016.
Scalable web technology has greatly reduced the marginal cost of serving users. Thus, an individual business today may support a very large user base. With so much data, one might imagine that it is easy to obtain statistical significance in live experiments. However, this is not always the case. Often, the very business models enabled by the web require answers for which our data is information poor.
Tech businesses know how they're doing by numbers on a screen. The weakest link in the process of analysis is usually the part in front of the keyboard. People are not designed to think about abstract quantities. Scientists in the field of decision science have described for decades now exactly how people go wrong. You can overcome your biases only by being aware of them. Greg Dingle will walk you through some common biases, examples, and corrective measures.
A/B testing is a hallmark of Internet services: from e-commerce sites to social networks to marketplaces, nearly all online services use randomized experiments as a mechanism to make better business decisions. Such tests are generally analyzed using classical frequentist statistical measures: p-values and confidence intervals.
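For readers unfamiliar with the frequentist measures named above, here is a minimal sketch of one common way to compute them for a conversion-rate A/B test: a two-sided two-proportion z-test with a normal-approximation confidence interval. It is an assumption that this particular test matches what the talk covers; the function names and the example counts are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_test(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Compare conversion rates of variants A and B.

    Returns (difference, z statistic, two-sided p-value, confidence interval).
    z_crit=1.96 gives an approximate 95% interval.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a

    # Pooled standard error: used for the test, under the null of equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

    # Unpooled standard error: used for the interval around the observed difference.
    se_unpooled = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return diff, z, p_value, ci

# Hypothetical counts: 100 of 1000 users convert on A, 150 of 1000 on B.
diff, z, p_value, ci = two_proportion_test(100, 1000, 150, 1000)
```

The normal approximation is reasonable only when each variant has a fair number of conversions and non-conversions, and the sketch assumes users are independent of one another.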
Quantitative trading strategy creation is a unique intellectual undertaking that draws on human insight, proprietary data, and nearly all aspects of computer science.
Data scientists love to create exciting data visualizations and insightful models. However, before they get to that point, usually much effort goes into obtaining, scrubbing, and exploring the required data.