Many data scientists work within the realm of machine learning, and their problems are often addressable with techniques such as classifiers and recommendation engines. However, at Tapad, they have often had to look outside the standard machine learning toolkit to find inspiration from more traditional engineering algorithms. This has enabled them to solve a scaling problem with their Device Graph’s connected component, as well as maintaining time-consistency in cluster identification week over week.Continue
Tensorflow is one of the fastest growing open source deep learning frameworks available today. Tensorflow was developed internally by Google and released open source in November 2015.Continue
Machine learning, especially deep learning, is becoming more and more important to integrate into day-to-day business infrastructure across all industries. TensorFlow, open-sourced by Google in 2015, has become one of the more popular modern deep learning frameworks available today, promising to bridge the gap between the development of new models and their deployment.Continue
I've been working with Machine Learning models both in academic and industrial settings for a few years now. I've recently been watching the excellent Scalable ML from Mikio Braun, this is to learn some more about Scala and Spark.
His video series talks about the practicalities of 'big data' and so made me think what I wish I knew earlier about Machine Learning
- Getting models into production is a lot more than just micro services
- Feature selection and feature extraction are really hard to learn from a book
- The evaluation phase is really important
Getting models into production is a lot more than just micro services
I gave a talk on Data-Products and getting Ordinary Differential Equations into production. One thing that I didn't realize until sometime afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, dev-ops etc all by yourself. This depends on the resources you have and there are platforms available to accelerate this time to value. As we all know from engineering - getting stuff from Research and Development to reliable and scalable production code is a huge challenge.
Some things I've learned is that iterating, and focusing on business outcomes are the important things - and I'm keen to learn a lot more about deploying models.
Feature selection and feature extraction are really hard to learn
Something that I couldn't learn from a book, but tried to is feature selection and extraction. These skills are only learned by Kaggle competitions and real world projects. And learning about the various tricks and methods for this is something one learns only by implementing them or using them in real-world projects. This eats up a lot of the work flow of the data science process. In the new year I'll probably try to write out a blog post only on feature extraction and feature selection.
The evaluation phase is really important
Unless you apply your models to test data - you're not doing predictive analytics. Evaluation techniques such as cross-validation, evaluation metrics, etc are all invaluable as is simply splitting your data into test data and training data. Life often doesn't hand you a dataset with these things defined, so there is a lot of creativity and empathy involved in defining these two sets on a real world dataset. There is a great set of posts on Dato about the challenges of model evaluation.
I think the explanations by Mikio Braun are worth a read. I love his diagrams too and include it here in case you're not familiar with training sets and testing sets.
Source: Mikio Braun 2015
Often we don't discuss evaluation of models in papers, conferences or even when we talk about what techniques we use to solve problems. 'We used SVM on that' doesn't really tell me anything. It doesn't tell me your data sources, your feature selection, your evaluation methods, how you got into production and how you used cross-validation or model-debugging. I think we need a lot more commentary about these 'dirty' aspects of machine learning. And I wish I knew that a lot earlier.
My friend Ian has some great remarks on 'Data Science Delivered' which is a great read for any professional (junior or senior) who builds machine learning models for a living. It is also a great read for recruiters hiring data scientists or managers interacting with data science teams - if you're looking for questions to ask people about - i.e. 'how did you handle that dirty data?'Continue
Ryan Adams is a machine learning researcher at Twitter and a professor of computer science at Harvard. He co-founded Whetlab, a machine learning startup that was acquired by Twitter in 2015. He co-hosts the Talking Machines podcast.
A big part of machine learning is optimization of continuous functions. Whether for deep neural networks, structured prediction or variational inference, machine learners spend a lot of time taking gradients and verifying them. It turns out, however, that computers are good at doing this kind of calculus automatically, and automatic differentiation tools are becoming more mainstream and easier to use. In his talk, Adams will give an overview of automatic differentiation, with a particular focus on Autograd. I will also give several vignettes about using Autograd to learn hyperparameters in neural networks, perform variational inference, and design new organic molecules.
This talk is from the SF Data Science meetup in June 2016.Continue
With the world’s largest residential energy dataset at their fingertips, Opower is uniquely situated to use Machine Learning to tackle problems in demand-side management. Their communication platform, which reaches millions of energy customers, allows them to build those solutions into their products and make a measurable impact on energy efficiency, customer satisfaction and cost to utilities.
In this talk, Opower surveys several Machine Learning projects that they’ve been working on. These projects vary from predicting customer propensity to clustering load curves for behavioral segmentation, and leverage supervised and unsupervised techniques.
Ben Packer is the Principal Data Scientist at Opower. Ben earned a bachelor's degree in Cognitive Science and a master's degree in Computer Science at the University of Pennsylvania. He then spent half a year living in a cookie factory before coming out to the West Coast, where he did his Ph.D. in Machine Learning and Artificial Intelligence at Stanford.
Justine Kunz is a Data Scientist at Opower. She recently completed her master’s degree in Computer Science at the University of Michigan with a concentration in Big Data and Machine Learning. Now she works on turning ideas into products from the initial Machine Learning research to the production pipeline.
This talk is from the Data Science for Sustainability meetup in June 2016.Continue
Quantitative trading strategy creation is a unique intellectual undertaking that draws on human insight, proprietary data, and nearly all aspects of computer science.Continue
Dmitry Storcheus is an Engineer at Google Research NY, where he does scientific work on novel machine learning algorithms. Dmitry has a Masters of Science in Mathematics from the Courant Institute and despite his very young age he is already an internationally recognized scientist in his field of expertise. He has published in a top peer-reviewed machine learning journal JMLR and spoken at an international conference NIPS. Dmitry Storcheus got peer recognition for his foundational research contribution published in his paper “Foundations of Coupled Nonlinear Dimensionality Reduction”, which has been cited by scientists and engineers. He is a full member of reputable international academic associations: Sigma Xi, New York Academy of Sciences and American Mathematical Society. This year Dmitry is also a primary chair of the NIPS workshop “Feature Extraction: Modern Questions and Challenges”.Continue
Vector models are being used in a lot of different fields: natural language processing, recommender systems, computer vision, and other things. They are fast and convenient and are often state of the art in terms of accuracy. One of the challenges with vector models is that as the number of dimensions increase, finding similar items gets challenging. Erik Bernhardsson developed a library called "Annoy" that uses a forest of random tree to do fast approximate nearest neighbor queries in high dimensional spaces. We will cover some specific applications of vector models with and how Annoy works.Continue
Extracting information from unstructured web documents is a common problem for many applications and determining which category they belong to can be especially challenging at planetary scale.Continue
Xiangrui Meng, a committer on Apache Spark, talks about how to make machine learning easy and scalable with Spark MLlib. Xiangrui has been actively involved in the development of Spark MLlib and the new DataFrame API. MLlib is an Apache Spark component that focuses on large-scale machine learning (ML). With 50+ organizations and 110+ individuals contributing, MLlib is one of the most active open-source projects on ML. In this talk, Xiangrui shares his experience in developing MLlib. The talk covers both higher-level APIs, ML pipelines, that make MLlib easy to use, as well as lower-level optimizations that make MLlib scale to massive datasets.Continue
Amy is a digital personal assistant who schedules meetings for you. Amy understands your calendar-preferences and negotiates meeting times with other people on your behalf - through email. Your meeting guests talk to Amy as they would with a human.Continue
Yunliang Jiang, engineer at Thumbtack, shares from his PhD research about data mining techniques he's applied to the wealth of unstructured health data available online.Continue
In this talk, Simon Chan (co-founder of PredictionIO) introduces the latest developments and shows how to use PredictionIO to build and deploy predictive engines in real production environments. PredictionIO is an open source machine learning server built on Apache Spark and MLlib. It is designed for data scientists and developers to build predictive engines for real-world applications in a fraction of the time normally required.
Using PredictionIO’s DASE design pattern, Simon illustrates how developers can develop machine learning applications with the separation of concerns (SoC) in mind.
“D" stands for Data Source and the Data Preparator, which take care of the preparation of data for model training.
“A" stands for Algorithm, which is where the code of one or more algorithms are implemented. MLlib, the machine learning library of Apache Spark, is natively supported here.
“S” stands for Serving, which handles the application logic during the retrieval of predicted results.
Finally, “E” stands for Evaluation.
Simon also covers upcoming development work, including new Engine Templates for various business scenarios.
This video was recorded at the SF Data Mining meetup at Runway.io in SF.Continue
Matthew Zeiler, PhD, Founder and CEO of Clarifai Inc, speaks about large convolutional neural networks. These networks have recently demonstrated impressive object recognition performance making real world applications possible. However, there was no clear understanding of why they perform so well, or how they might be improved. In this talk, Matt covers a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the overall classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that perform exceedingly well.
This talk was presented at the NYC Machine Learning Meetup at Pivotal Labs.Continue
Machine learning researcher, Edouard Grave, gives a presentation on the field of information extraction (pulling structured data from unstructured documents). Edouard talks about current challenges in the field and introduces distant supervision for relation extraction.
Distant supervision is a recent paradigm for learning to extract information by using an existing knowledge base instead of label data as a form of supervision. The corresponding problem is an instance of multiple label, multiple instance learning. Edouard shows how to obtain a convex formulation of this problem, inspired by the discriminative clustering framework.
He also presents a method to learn to extract named entities from a seed list of such entities. This problem can be formulated as PU learning (learning from positive and unlabeled examples only) and Edouard describe a convex formulation for this problem.
This talk was presented at the NYC Machine Learning Meetup at Pivotal Labs.Continue
Machine learning is often divided into three categories: supervised, unsupervised, and reinforcement learning. Reinforcement learning concerns problems with sequences of decisions (where each decision affects subsequent opportunities), in which the effects can be uncertain, and with potentially long-term goals. It has achieved immense success in various different fields, especially AI/Robotics and Operations Research, by providing a framework for learning from interactions with an environment and feedback in the form of rewards and penalties.Continue
Natural language is a pervasive human skill not yet fully achievable by automated computing systems. The main challenge is understanding how to computationally model both the depth and the breadth of natural languages.Continue
Check out the first 20 minutes of our previous Practical Machine Learning training taught by Juan M. Huerta, Senior Data Scientist at PlaceIQ.
Join us for our next 3-day training! November 10th-12th. This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. The course will be taught by Rachit Srivastava (Senior Data Scientist, PlaceIQ) and supervised by Juan.
By the end of this training, you will be able to:
- Apply common classification methods for supervised learning when given a data set
- Apply algorithms for unsupervised learning problems
- Select/reduce features for both supervised and unsupervised learning problems
- Optimize code for common machine learning tasks by correcting inefficiencies by using advanced data structures
- Choose basic tools and criteria to perform predictive analysis
We screen applicants for engineering ability and drive, so you'll be in a room full of passionate devs who ask the right questions. Applicants should have 3+ years of coding experience, knowledge of Python, and previous exposure to linear algebra concepts.
You can apply for a seat on our course page.Continue
Hosted by Hakka Labs
This 3-day course will demonstrate the fundamental concepts of machine learning by working on a dataset of moderate size, using open source software tools.
This course is designed to help engineers collaborate with data scientists and create code that tackles increasingly complex machine learning problems. By the end of this course, you will be able to:
-Apply common classification methods for supervised learning when given a data set
-Apply algorithms for unsupervised learning problems
-Select/reduce features for both supervised and unsupervised learning problems -Optimize code for common machine learning tasks by correcting inefficiencies by using advanced data structures
-Choose basic tools and criteria to perform predictive analysis
The intended audience of this Machine Learning course is the engineer with strong programming skills as well as a certain level of exposure to linear algebra and probability. Students should understand the basic issue of prediction as well as Python.
Day 1: Linear Algebra/Probability Fundamentals and Supervised Learning
The goal of day one is to give engineers the linear algebra/probability foundation they need to tackle problems during the rest of the course and introduce tools for supervised learning problems.
-Quick Introduction to Machine Learning
-Linear Algebra, Probability and Statistics,
-Linear and Quadratic Discriminant Analysis
-Support Vector Machines and Kernels
-Lab: Working on classification problems on a data set
Day 2: Unsupervised learning, Feature Selection and Reduction
The goal of day two is to help students understand the mindset and tools of data scientists.
-K nearest neighbors, Random Forests, Naive Bayes Classifier
-Information Theoretic Approaches
-Feature Selection and Model Selection/Creation
-Principal Component Analysis/Kernel PCA
-Independent Component Analysis
- Lab: Choosing Features and applying unsupervised learning methods to a data set
Day 3: Performance Optimization of Machine Learning Algorithms
The goal of day three is to help students understand how developers contribute to complex machine learning projects.
-Unsupervised Learning Continued
-DB-SCAN and K-D Trees
-Recommendation Systems and Matrix Factorization Methods
-Lab: Longer lab working on back-end Machine Learning optimization programming problems in Python
TellApart Software Engineer Nick Gorski takes us through a technical deep-dive into TellApart's personalization system. He discusses the machine learning data pipeline at TellApart that powers the models, real-time calculations of the expected value of shoppers, and how to translate that value into a bid price for every bid request received (hundreds of thousands per second).Continue
Nearly all fields have been or are being transformed by the availability of copious data and the tools to learn from them. Dr. Chris Wiggins (Chief Data Scientist, New York Times) will talk about using machine learning and large data in both academia and in business. He shares some ways re-framing domain questions as machine learning tasks has opened up new avenues for understanding both in academic research and in real-world applications.Continue
Understanding the billions of data points we ingest each month is no easy task. Through the development of models that allow us to do so, we’ve noticed some commonalities in the process of converting raw data to real-world understanding. Although you can get pretty good results with simple models and algorithms, digging beyond the obvious abstractions and using more sophisticated methods requires a lot of effort. In school we often learn different techniques and algorithms in isolation, with neatly fitted input sets, and study their properties. In the real world, however, especially the world of location data, we often need to combine these approaches in novel ways in order to yield usable results.Continue
For most large-scale image retrieval systems, performance depends upon accurate meta-data. While content-based image retrieval has progressed in recent years, typically image contributors must provide appropriate keywords or tags that describe the image. Tagging, however, is a difficult and time-consuming task, especially for non-native English speaking contributors.Continue
Machine learning applications like fraud detection and recommendation have played a key role in helping Square achieve their mission to rethink buying and selling. In this talk, Dr. Rong Yan (Director of Data Science and Infrastructure, Square), gives a high-level overview of data applications at Square followed by a deep dive on how machine learning is used in our industrial leading fraud detection models.Continue
Adam Gibson (Data Scientist and Co-Founder, Blix.io) presents his open-source, distributed deep-learning framework, Deeplearning4j. He demos sentiment analysis and facial recognition tools.Continue
This talk is part lecture, part hands-on machine learning tutorial. Daniel Krasner (Co-founder KFit Solutions and Research Scholar at Columbia University) covers rapid development of high performance scalable text processing solutions for tasks such as classification, semantic analysis, topic modeling and general machine learning.Continue
When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. In this talk, Max Sklar, from Foursquare, takes a closer look at the Dirichlet distribution and it's properties, as well as some of the ways it can be computed efficiently. This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.Continue
Three exciting talks in this video: First Ben McRedmond will share his experiences with machine learning and go over some simple concepts (and practical details) which most web developers will benefit from knowing. In the second talk, James Rosen will talk about ways to make it simple for web developers to access front-end libraries at the HTTP layer for a faster and more automated development process. In the third talk Rudy Rigot, from prismic.io, will share his painful past experiences around manageable content, and will share his critical look on the various ways developers handle it today in their applications. These talks were recorded at the SF Ruby on Rails meetup group at Zendesk.Continue
In this talk, Jeroen Janssens, senior data scientist at YPlan, introduces both the outlier selection and one-class classification setting. He then presents a novel algorithm called Stochastic Outlier Selection (SOS). The SOS algorithm computes for each data point an outlier probability. These probabilities are more intuitive than the unbounded outlier scores computed by existing outlier-selection algorithms. Jeroen has evaluated SOS on a variety of real-world and synthetic datasets, and compared it to four state-of-the-art outlier-selection algorithms. The results show that SOS has a superior performance while being more robust to data perturbations and parameter settings. Click Here for the link to Jeroen's blogpost on the subject, it contains links to the d3 demo! This talk was recorded at the NYC Machine Learning meetup at Pivotal Labs.Continue
In this talk on Machine Learning Distributed GBM, Earl Hathaway, resident Data Scientist at 0xdata, talks about distributed GBM, one of the most popular machine learning algorithms used in data mining competitions. He will discuss where distributed GBM is applicable, and review recent KDD & Kaggle uses of machine learning and distributed GBM. Also, Cliff Click, CTO of 0xdata, will talk about implementation and design choices of a Distributed GBM. This talk was recorded at the SF Data Mining meetup at Trulia.Continue
This is a friendly Lambda Calculus Introduction by Dustin Mulcahey. LISP has its syntactic roots in a formal system called the lambda calculus. After a brief discussion of formal systems and logic in general, Dustin will dive in to the lambda calculus and make enough constructions to convince you that it really is capable of expressing anything that is "computable". Dustin then talks about the simply typed lambda calculus and the Curry-Howard-Lambek correspondence, which asserts that programs and mathematical proofs are "the same thing". This talk was recorded at the Lisp NYC meetup at Meetup HQ.Continue
This talk is by Jan Vitek, a professor in computer science at Purdue University. In it, Jan discusses the design and implementation of Distributed Random Forest, a big data algorithm for H2O. This talk was recorded at the SF Data Mining meetup at Trulia.Continue
We are happy to share with you a recent talk by Sham Kakade from Microsoft recorded at the NYC Machine Learning meetup . In this talk he discusses a general and (computationally and statistically) efficient parameter estimation method for a wide class of latent variable models---including Gaussian mixture models, hidden Markov models and latent Dirichlet allocation---by exploiting a certain tensor structure in their low-order observable moments.
Here's a new talk on targeted online advertising recorded at one of the NYC Machine Learning meetups. Two presenters from Media6 labs spoke about their respective papers from the recent Knowledge Discover and Data Mining conference (KDD). Claudia Perlich presented "Bid Optimizing and Inventory Scoring in Targeted Online Advertising" and Troy Raeder presented "Design Principles of Massive, Robust Prediction Systems." Full abstracts and audio below.Continue
Today we've got the last of our November R user talks. It's a lightning talk by Aurobindo Tripathy on Skin Detection Using Random Forest Decision Trees.
In this audio recording from at the NYC Machine Learning Meetup, Jeremy Stanley (SVP, Product and Data Sciences at Collective) talks about Using Pervasive Experimentation & Visualization to Hone Machine Learning on Big DataContinue