I've been working with Machine Learning models in both academic and industrial settings for a few years now. I've recently been watching the excellent Scalable ML series from Mikio Braun to learn some more about Scala and Spark.
His video series talks about the practicalities of 'big data', and it made me think about what I wish I had known earlier about Machine Learning:
- Getting models into production is a lot more than just microservices
- Feature selection and feature extraction are really hard to learn from a book
- The evaluation phase is really important
Getting models into production is a lot more than just microservices
I gave a talk on Data-Products and getting Ordinary Differential Equations into production. One thing that I didn't realize until some time afterwards was just how challenging it is to handle issues like model decay, evaluation of models in production, and dev-ops all by yourself. How hard this is depends on the resources you have, and there are platforms available to shorten the time to value. As we all know from engineering, getting work from Research and Development into reliable and scalable production code is a huge challenge.
Some things I've learned are that iterating and focusing on business outcomes are what matter most, and I'm keen to learn a lot more about deploying models.
Feature selection and feature extraction are really hard to learn
Something that I tried to learn from a book, but couldn't, is feature selection and extraction. These skills are only learned through Kaggle competitions and real-world projects, and the various tricks and methods only sink in once you implement or use them yourself. This work eats up a lot of the data science workflow. In the new year I'll probably try to write a blog post dedicated to feature extraction and feature selection.
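To make this a little more concrete, here is a minimal sketch of one of the simplest feature selection tricks: dropping features that barely vary. The function name `select_by_variance`, the toy rows, and the threshold are all hypothetical choices made for illustration; real projects usually need far more than this.

```python
from statistics import pvariance


def select_by_variance(rows, feature_names, threshold=0.0):
    """Return the names of features whose population variance exceeds `threshold`."""
    columns = list(zip(*rows))  # transpose: one tuple of values per feature
    return [
        name
        for name, column in zip(feature_names, columns)
        if pvariance(column) > threshold
    ]


# Hypothetical toy data: 'constant' never changes and 'nearly_constant'
# barely changes, so neither is likely to help a model much.
rows = [
    (1.0, 5.0, 0.1),
    (2.0, 5.0, 0.1),
    (3.0, 5.0, 0.2),
    (4.0, 5.0, 0.1),
]
feature_names = ["age", "constant", "nearly_constant"]

kept = select_by_variance(rows, feature_names, threshold=0.01)
```

A variance filter like this ignores the target entirely, so it can only remove obviously uninformative columns. Choosing among the remaining features is where the hands-on experience comes in.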
The evaluation phase is really important
Unless you apply your models to test data, you're not doing predictive analytics. Evaluation techniques such as cross-validation and evaluation metrics are invaluable, as is simply splitting your data into test data and training data. Life often doesn't hand you a dataset with these sets defined, so there is a lot of creativity and empathy involved in defining them on a real-world dataset. There is a great set of posts on Dato about the challenges of model evaluation.
I think the explanations by Mikio Braun are worth a read. I love his diagrams too, and I include one here in case you're not familiar with training sets and testing sets.
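The splitting and cross-validation described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper names `train_test_split` and `k_fold_indices`, the 80/20 split, and the seed are hypothetical choices, and the data is assumed to have no time ordering or grouping that a random shuffle would break.

```python
import random


def train_test_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and hold out roughly `test_fraction` of it as test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[:-n_test], shuffled[-n_test:]


def k_fold_indices(n_items, k=5):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_items))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


data = list(range(10))  # stand-in for ten labelled examples
train, test = train_test_split(data, test_fraction=0.2, seed=42)

# Cross-validate on the training portion only; the test set stays untouched
# until the very end.
folds = list(k_fold_indices(len(train), k=4))
```

For time-ordered data a random shuffle like this leaks future information into training, which is one reason defining the two sets on a real dataset takes some thought.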
Source: Mikio Braun 2015
Often we don't discuss evaluation of models in papers, at conferences or even when we talk about what techniques we use to solve problems. 'We used SVM on that' doesn't really tell me anything. It doesn't tell me your data sources, your feature selection, your evaluation methods, how you got into production, or how you used cross-validation or model debugging. I think we need a lot more commentary about these 'dirty' aspects of machine learning. I wish I had known that a lot earlier.
My friend Ian has some great remarks on 'Data Science Delivered', which is a great read for any professional (junior or senior) who builds machine learning models for a living. It is also a great read for recruiters hiring data scientists or managers interacting with data science teams, if you're looking for questions to ask people, e.g. 'how did you handle that dirty data?'