Information Extraction and Multi-class Classification for Unstructured Data

Yunliang Jiang, an engineer at Thumbtack, draws on his PhD research to present data mining techniques he has applied to the wealth of unstructured health data available online.

The development of Web 2.0 technologies has led to a proliferation of online communities across many areas of daily life. In medicine and healthcare, online services such as Yahoo! Groups, WebMD and MedHelp give patients and physicians a platform to discuss health problems, including diseases and drugs, diagnoses and treatments. These discussions also provide a large volume of data for researchers to analyze and explore. However, personal messages are noisy, unstructured and isolated from clinical practice, which makes the information hard for users to digest on the front end and complicates data analysis on the back end. Against this background, the objective of Yunliang's thesis is to apply advanced data mining, information retrieval and natural language processing techniques to analyze and reorganize the rich source of personal health messages from online medical communities, in order to satisfy patients’ information needs and support physicians’ clinical practice.

Yunliang introduces an SVM-based multi-class classification method that uses term-appearance, lexical and semantic features to classify health messages sampled from a unique dataset of Yahoo! Health Groups into three categories: News, User Comments and Spam. He also describes a comprehensive system, with an extensive evaluation framework, that organizes and clusters patient outcomes using a topic model, which groups large collections of personal comments into a series of topics guided by expert comments. In the third part, Yunliang addresses a novel and promising topic, Comparative Effectiveness Research (CER) hypothesis prediction, presenting a study that evaluates patients’ opinions on different treatments in the MedHelp dataset, using either machine-enabled sentiment analysis or human analysts. With three different methods for comparing these opinions, reliable conclusions about patients’ preferences among treatments can be drawn consistently, which in turn suggest how effective the treatments are.
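To make the classification step more concrete, here is a minimal sketch of a one-vs-rest linear SVM over term-appearance features, trained by subgradient descent on the hinge loss. It is an illustration under stated assumptions, not the method from the talk: the messages are invented toy examples with deliberately non-overlapping vocabularies, the function names and hyperparameters are hypothetical, there is no intercept term, and the lexical and semantic features mentioned above are left out.

```python
# Toy, hypothetical messages; NOT taken from the Yahoo! Health Groups dataset.
TRAIN = [
    ("study reports new trial results", "News"),
    ("researchers publish trial findings", "News"),
    ("i felt dizzy after my dose", "User Comments"),
    ("my doctor changed my dose", "User Comments"),
    ("buy cheap pills online now", "Spam"),
    ("cheap pills click now", "Spam"),
]


def features(text):
    """Term-appearance features: the set of distinct lowercase tokens."""
    return set(text.lower().split())


def train_one_vs_rest(data, epochs=20, lr=0.1, lam=0.01):
    """Train one linear SVM per class (hypothetical helper, no intercept)."""
    classes = sorted({label for _, label in data})
    weights = {c: {} for c in classes}
    for c in classes:
        w = weights[c]
        for _ in range(epochs):
            for text, label in data:
                x = features(text)
                y = 1.0 if label == c else -1.0
                margin = y * sum(w.get(t, 0.0) for t in x)
                # L2 regularization step: shrink every weight slightly.
                for t in w:
                    w[t] *= 1.0 - lr * lam
                # Hinge-loss step: only examples inside the margin move the weights.
                if margin < 1.0:
                    for t in x:
                        w[t] = w.get(t, 0.0) + lr * y
    return weights


def predict(weights, text):
    """Return the class whose SVM gives the highest score; unseen terms score 0."""
    x = features(text)
    scores = {c: sum(w.get(t, 0.0) for t in x) for c, w in weights.items()}
    return max(scores, key=scores.get)


weights = train_one_vs_rest(TRAIN)
print(predict(weights, "cheap pills for sale"))
```

On this toy data the unseen message "cheap pills for sale" is labeled Spam, because "cheap" and "pills" carry positive weight only in the Spam classifier. Real forum messages share vocabulary across categories, so a practical system would need far more data, richer features and a tuned SVM implementation.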


This video was recorded at the SF Bayarea Machine Learning meetup at Thumbtack in SF.