How to become a science genius at 24 and score a job at Google Research

Dmitry Storcheus is an engineer at Google Research NY, where he does scientific work on novel machine learning algorithms. Dmitry holds a Master of Science in Mathematics from the Courant Institute, and despite his young age he is already an internationally recognized scientist in his field. He has published in JMLR, a top peer-reviewed machine learning journal, and spoken at NIPS, an international conference. His paper “Foundations of Coupled Nonlinear Dimensionality Reduction” earned peer recognition for its foundational research contribution and has been cited by scientists and engineers. He is a full member of reputable international academic associations: Sigma Xi, the New York Academy of Sciences, and the American Mathematical Society. This year Dmitry is also a primary chair of the NIPS workshop “Feature Extraction: Modern Questions and Challenges”.


- Hi Dima, you were recently invited to give a talk at DataEngConf in NYC, and we are very excited to hear about your novel machine learning research. You are in a fairly unique situation: you joined Google Research right after your Master's, and as a very young scientist you work alongside top-notch professors and PhDs. Tell me about your path overall. How did you manage to get in?

- Let me first talk about my path. I studied at a Russian college called ICEF, and then I came to the USA for graduate studies, where I did my master's in math at the Courant Institute. I started machine learning research very early: back in Russia I used machine learning to forecast financial time series, and in the USA I continued studying machine learning at a theoretical level. Straight after graduation I was hired by Google Research in New York to work on machine learning algorithms. I think the key to my employment is that Google recognized my unique skills and strong technical background, along with the foundational scientific work I had already done and the sound machine learning algorithms I had developed: I derived generalization bounds for coupled dimensionality reduction and created an algorithm called SKPCA.

- So why did you choose Google over any other company?

- Two reasons. First, a job at Google followed naturally from my research work in graduate school: I could apply my research directly, and I wanted it to benefit the machine learning community. Second, I think Google is a company that values potential in people. It selects people based on their projected individual growth, and that appealed to me as a young professional because it guaranteed continued growth and mentorship by the best scientists in the industry.

- How did you get that job?

- As I mentioned before, Google highly valued my technical skills and machine learning expertise and recognized the foundational contribution in my thesis. Most of my colleagues in the research department are either PhDs with academic experience or well-known research professors who came to us from universities, and the research group rarely hires straight out of school. In my case, Google evaluated my achievements and credentials and judged them to match the PhD level, because the research I did in the master's program is directly relevant to Google's current tasks. When I came on board, I started implementing my research straight away. I think they picked me not only for my research but also for my general achievements in mathematics at the Courant Institute. I won the Best Master's Thesis Award in 2015 for “Generalization bounds for Supervised Kernel Principal Component Analysis”, which laid the foundation for my original research contribution to machine learning.

- What was your novel research contribution?

- I derived generalization bounds for coupled (supervised) dimensionality reduction and developed an algorithm called SKPCA, which solves the dimensionality reduction problem. Dimensionality reduction is an area of computer science that deals with optimal ways to compress data. The data available over the Internet today is enormous and growing exponentially, and no algorithm can process and analyze all of it at once; that's why we have to find ways to compress data and extract the most relevant pieces of information in a short amount of time. You can read more about my contribution in our paper “Foundations of Coupled Nonlinear Dimensionality Reduction”, published on arXiv, a repository for scientific papers.
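As a rough illustration of the general idea (not the SKPCA algorithm itself, which is not available in open-source libraries), here is a minimal sketch of plain, unsupervised kernel PCA with scikit-learn, the building block that SKPCA extends: many raw features are compressed into a few nonlinear components.

```python
# Minimal sketch of unsupervised kernel PCA with scikit-learn.
# This is NOT the SKPCA algorithm from the paper, just an illustration
# of compressing data down to a few nonlinear components.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.RandomState(0)
X = rng.randn(500, 100)          # 500 samples, 100 raw features

# Compress 100 features to 10 components using an RBF kernel.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.01)
X_reduced = kpca.fit_transform(X)
print(X_reduced.shape)           # (500, 10)
```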

- What are the practical implications of your novel research at a company like Google?

- Actually, dimensionality reduction can be used to improve pretty much any Google algorithm. Let me give you an example. One of the projects I'm working on right now is malware and spam detection. Let's say you are using Gmail and want spam mail to go to your Spam folder, not to your Inbox. You can receive a huge number of email messages, and each email carries lots of information. Some 'features', as we call them, are the sender's email address and the reply-to email address: if either of them is obscure, or they don't match, it may be a sign of spam. One can construct lots of such features, but for us scientists, only a small fraction of them is relevant for deciding whether an email should be categorized as spam. That's why Google scientists and developers apply dimensionality reduction to emails to extract the most relevant pieces of information and do a better job of categorizing inappropriate emails as spam. In practice, using dimensionality reduction for spam identification gives higher precision and makes spam detection faster.
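As a toy illustration of such features (hypothetical ones, not Gmail's actual pipeline), here is a sketch that turns an email into a numeric feature vector, including the sender/reply-to mismatch signal described above:

```python
# Toy illustration (hypothetical features, not Gmail's real pipeline):
# turn an email into a numeric feature vector, including the
# sender / reply-to mismatch signal described above.
def email_features(sender, reply_to, subject):
    return [
        float(sender != reply_to),   # reply-to address differs from sender
        float(subject.isupper()),    # ALL-CAPS subject line
        float("$" in subject),       # money talk in the subject
        float(len(subject)),         # subject length
    ]

print(email_features("promo@xyz-deals.biz", "other@unrelated.ru", "YOU WON $1000"))
# -> [1.0, 1.0, 1.0, 13.0]
```

In a real system one would construct hundreds of such features, and only a handful would turn out to carry signal, which is exactly where dimensionality reduction earns its keep.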

- Machine learning is a very in-demand subject in the USA right now, and the professionals are highly sought after in general. How did you pick that specific area?

- My major was actually mathematics, and from there I transitioned into machine learning. Machine learning is a combination of applied mathematics and computer science, so you have to know both. To understand how algorithms work (for example, how spam detection or Google Now works), we have to know mathematical theorems and laws. I wanted my skills to be practically implemented, and that's why I expanded my expertise from pure mathematics to machine learning.

- In our Hakka community we do see a gap between machine learning and data engineering. What's your take on machine learning vs. data engineering?

- I see the main gap between machine learning and data engineering in scalability. Your algorithms work well on small pieces of data but become very slow on huge amounts of data. I believe dimensionality reduction is a solution to that: the ability to extract key information from raw data will definitely help bridge the gap. At Google we call this a pre-processing pipeline, or feature extraction pipeline; it's how you process the data before you feed it into the algorithm.
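Here is a minimal sketch of such a pipeline, using plain scikit-learn rather than any Google-internal tooling: the raw features are compressed by a pre-processing step before they ever reach the learning algorithm.

```python
# Sketch of a pre-processing (feature extraction) pipeline: reduce the
# raw features before they reach the learning algorithm, so training
# scales to wider data. Plain scikit-learn, not Google-internal tooling.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 500)                  # 1000 emails, 500 raw features
y = rng.randint(0, 2, size=1000)          # 1 = spam, 0 = not spam

pipeline = Pipeline([
    ("reduce", PCA(n_components=20)),           # compress 500 features to 20
    ("classify", LogisticRegression(max_iter=1000)),  # then fit the classifier
])
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))
```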

- So by making data more accessible and measurable, you basically help data engineers process it more easily and quickly, and make it more applicable?

- Yes, applicable, and it also means processing data in a smart way. What we were doing before was unsupervised dimensionality reduction. Let's say your algorithm is supposed to forecast something. The variables you use for forecasting are called features, and what you are forecasting is called a label. Say we are forecasting whether an email is spam or not; the label is “spam” or “not spam”. Previously, researchers pre-processed the data without knowledge of the labels, for example with PCA or unsupervised clustering: you just have a bunch of emails, without labeling them “spam”/“not spam”, and you apply dimensionality reduction to project these emails. We instead follow the approach of supervised dimensionality reduction, which means the pipeline has access to the labels, and you fine-tune your feature extraction so that it knows about the labels and maximizes the correctness of the predictions. I believe it will be a big trend in the future.
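To make the contrast concrete, here is a minimal sketch in scikit-learn. SKPCA itself is not in the library, so Linear Discriminant Analysis stands in as an example of a supervised reducer; PCA is the unsupervised baseline that never sees the labels.

```python
# Unsupervised vs. supervised dimensionality reduction side by side.
# PCA ignores the labels; Linear Discriminant Analysis (a stand-in for
# a supervised reducer -- SKPCA itself is not in scikit-learn) uses
# the labels to pick a more predictive projection.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(200, 30)
y = rng.randint(0, 2, size=200)           # "spam" / "not spam" labels

X_unsup = PCA(n_components=1).fit_transform(X)                  # never sees y
X_sup = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
print(X_unsup.shape, X_sup.shape)         # (200, 1) (200, 1)
```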

- How can something like that be applied in a project that can improve the economy of our country?

- A few examples of those kinds of projects would be Google Now, Google Maps, Google Search, and Image Translate. Take the amazing new tool Image Translate, which translates, in 27 languages, any text on a picture you take with your phone or camera in real time. The challenge here is to distinguish the words from the rest of the picture, and that is solved efficiently by dimensionality reduction. The United States is a country of many languages, so the opportunity Image Translate provides to take pictures of street names, cafes, and road signs and translate them immediately will greatly benefit the US economy by making tourism and travel much easier. In addition, it will help Americans get local information while travelling abroad, so they can stay safe and enjoy their trips.

- What advice can you give to the young audience that is going to college and wants to pursue the machine learning route and then get into a company like Google?

- They need to start working on applied research early in their careers. Classes are great, but they are not enough to understand real-world issues and actual business problems; getting hands-on experience is key. I started my research back in Russia, and I presented my first paper at the Eastern Finance Association conference in Boston when I was 19 years old. I admit it was really stressful, but it helped me establish a strong research career. So number one, you have to start doing research early. To do so, you have to pick a good advisor: look at his or her publications, venues, topics, and research areas in general. I have been extremely fortunate to have had an excellent advisor, Professor Mehryar Mohri at the Courant Institute, who could guide my research and introduce me to the best venues.

And number two, you have to get out there and start building your brand. So far I have spoken at over 10 peer-reviewed scientific conferences. It's also important to speak at the top conferences; it's really competitive and tough to get in, but the audience is really strong and you can get great feedback on your research. My most memorable experience was my talk at the 9th Annual Machine Learning Symposium in New York, where the audience consisted of top scientists and professionals from all over the world. I was under pressure, but my talk was well received and recognized with an Honorable Mention by the Scientific Committee, and after the talk I received the Best Spotlight Talk Award, which means my original research contribution was recognized by scientists in the international machine learning community. I was so happy and proud that afterwards I was invited to join the New York Academy of Sciences as a supporting member and elected a full member of Sigma Xi, an international honor society of science.

- How do you get into such a conference? What does it take for a conference submission to pass peer review and get accepted?

- My advice on this matter is not just empty words; it comes from years of experience judging and reviewing the work of my peers. For two years I was a reviewer for the Neural Information Processing Systems (NIPS) conference, and my reviews of 13 papers were critical to acceptance and rejection decisions. I feel honored to be a reviewer for such a famous international conference as NIPS, since some of the papers I reviewed were authored by renowned professors and scientists in machine learning. This year I am also a primary chair of the NIPS workshop “Feature Extraction: Modern Questions and Challenges”, for which I reviewed more than 30 papers and developed the workshop agenda. From this experience I can say that, first, you have to make sure your research outline follows the conference standards. The paper has to be rigorous and stick to the facts; all the data needs to be verified and backed up. You also have to select a topic that makes a significant contribution to the scientific field, so that you bring something new and important to the table. It has to have the potential to be applied by a huge number of people to solve current problems at scale, and the suggested solution has to be significantly better than anything currently offered. Last but not least: as an experienced reviewer, I encourage authors to answer the question “What is your direct contribution in this paper?” briefly, in one or two sentences; it will make the reviewer's work so much easier!

 

- What other things should you take into consideration as a young student?

 

- Here is the full list I can come up with:

 


  1. Get accepted at a good school and pick a prominent advisor: a good advisor can not only guide you through research, but also introduce you to reputable research venues and help you build professional connections.


 

  2. Actively participate in conferences, as I mentioned. We participate in conferences too: at events such as NIPS, ICML, COLT, and ALT you will have an opportunity to meet people from Google Research and discuss your ideas with them. As an example, this year I am chairing the NIPS workshop "Feature Extraction: Modern Questions and Challenges"; graduate students are welcome to submit their papers and attend the workshop.


 

  3. Publish research papers! In our day-to-day work at Google Research we often read and analyze new machine learning papers. If we read your paper and find it interesting, it will be a great basis for starting a conversation.


 

  4. Get an internship to gain hands-on experience, not just theoretical knowledge. Apply for a Summer Research Internship at Google.


 

  5. Apply for the Google Research Scholarship for grad students if you intend to work on a PhD: https://research.google.com/university/student-support. This is a great way to share your graduate work with us and collaborate.


 

  6. Create and publish open-source code on GitHub, or contribute to Google open-source projects, to connect deeply with the existing community and show your programming skills in a friendly, collaborative environment.


 

  7. Prepare well for coding interviews. Specifically, I recommend solving every problem from these books:



When you solve problems, time yourself and write real code that compiles. A useful website that provides interview training has also appeared recently: https://www.interviewcake.com/.

 


  8. When doing a coding interview, I would advise you to first describe the full logic of your solution in words and only then start coding; if your initial logic is wrong, there is no sense in writing code. A quick illustration of this habit follows below.
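Here is one classic warm-up problem as an example (an illustrative pick, not taken from any particular interview or from the books above), written the way the advice suggests: the logic is spelled out in words first, then the code.

```python
# Classic warm-up problem: find the indices of two numbers in a list
# that sum to a given target.
#
# Logic first, in words: scan the list once, remembering each value's
# index in a hash map; for each new value, check whether its complement
# (target - value) has already been seen. One pass, O(n) time, O(n)
# space. Only after the logic is agreed on does the code get written:
def two_sum(nums, target):
    seen = {}                         # value -> index
    for i, value in enumerate(nums):
        if target - value in seen:
            return seen[target - value], i
        seen[value] = i
    return None                       # no pair sums to target

print(two_sum([2, 7, 11, 15], 9))     # (0, 1)
```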


 

  9. While in grad school, take more advanced fundamental math courses such as Linear Algebra, Real Analysis, Functional Analysis, and Analytic Geometry; they will be of great help in understanding machine learning research. One small illustration of why is sketched below.
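As a standard example of the connection (my illustration, assuming centered data): even basic PCA is at its core a linear algebra fact, and its kernelized variants live in the function spaces studied in functional analysis.

```latex
% The variance-maximizing direction in PCA solves
\max_{\|w\| = 1} \; w^\top C \, w,
\qquad
C = \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top ,
% and the maximizer is the top eigenvector of the covariance matrix C.
```

The solution is the top eigenvector of the covariance matrix, a pure linear algebra result; kernel PCA applies the same idea in a reproducing kernel Hilbert space, which is where functional analysis comes in.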


 

  10. Do some serious reading. Don't just skim: repeat the theoretical proofs yourself and solve the exercises in every chapter.

Some books that I recommend are: