Hadoop Summit 2015: Using Natural Language Processing on Non-Textual Data with MLLib

Word2Vec is an interesting unsupervised way to construct vector representations of words to act as features for downstream algorithms or as a basis for similarity searches. We look at using the Spark implementation of Word2Vec shipped in MLLib to help us organize and make sense of some non-textual data by treating discrete clinical events (I.e. Diagnoses, drugs prescribed, etc.) in a medical dataset as non-textual "words”.

Casey Stella is a principal architect at Hortonworks, focusing on Data Science in the consulting organization at Hortonworks. In the past, I`ve worked as an architect and senior engineer at a healthcare informatics startup spun out of the Cleveland Clinic, as a developer at Oracle and as a Research Geophysicist in the Oil & Gas industry. Before that, I was a poor graduate student in Math at Texas A&M. I specialize in writing software and solving problems where there are either scalability concerns due to large amounts of traffic or large amounts of data. I have a particular passion for data science problems or anything vaguely mathematical.