Machine Learning and Natural Language Processing

Scientists in many fields now collect massive, high-dimensional data on complex processes. The key research problems in these fields are increasingly those of coping with, and indeed benefiting from, scale. The Machine Learning and Natural Language Processing lab aims to support this developing mode of scientific research by addressing the statistical and computational challenges of building statistical models that make optimal interpretations of data from noisy, incomplete, and conflicting evidence. In particular, we investigate techniques for learning accurate models from data, performing efficient inference in complex models, and solving the difficult optimization and search problems that arise. The goal is to advance the state of the art in computer interpretation (natural language processing and computer perception), computer reasoning and decision making (automated reasoning and autonomous systems), and intelligent data analysis (data mining and bioinformatics), including the discovery of new patterns in large databases of medical, financial, or consumer-preference data.

Research projects

  • Large-scale distributed syntactic, semantic, and lexical language models: We aim to build large-scale distributed syntactic, semantic, and lexical language models, trained on corpora of up to web scale on a supercomputer, to substantially improve the performance of machine translation and speech recognition systems. The work is conducted under the directed Markov random field paradigm, integrating both topics and syntax to form complex distributions for natural language, and it uses hierarchical Pitman-Yor processes to model the long-tail properties of natural language. By exploiting the particular structure of these models, the seemingly complex statistical estimation and inference algorithms are decomposed and performed in a distributed environment. Moreover, a long-standing open problem, smoothing the fractional counts that arise from latent variables in a principled manner in the Kneser-Ney sense, might be solved. We demonstrate how to put the complex language models into the one-pass decoders of machine translation systems and into the lattice-rescoring decoder of a speech recognition system.
  • Scalable semi-supervised structured prediction: We demonstrate how to use tools from information theory, optimization theory, numerical continuation, and parallel/distributed computation for semi-supervised discriminative structured prediction. We illustrate a unified approach that minimizes cost functions that are upper bounds on the misclassification error for labeled data and on the expected misclassification error for unlabeled data. We present dual and primal-dual methods, which are naturally suited to a fully or partially distributed computing environment for handling large-scale data sets and which have global optimality guarantees for the dual optimization. We propose novel algorithms to compute simplicial approximations of regularization surfaces. We apply these techniques to real-world applications such as part-of-speech tagging, syntactic chunking, named entity recognition, relation and event extraction, Chinese word segmentation, RNA secondary structure prediction, and brain tumor image segmentation.
  • Direct loss minimization for classification and ranking problems: Classification and ranking are classic yet notoriously hard machine learning problems because the performance measures are non-convex, non-differentiable, and discontinuous. Standard and popular methods optimize surrogate functions that are upper bounds or smoothed approximations of the performance measures. Even though very promising results have been achieved, there is still a mismatch between the training objective and the testing measure. We present novel learning algorithms that directly optimize the performance measures without resorting to any upper bounds or approximations. Our approach is essentially an iterative, convergent coordinate descent method, either greedy or cyclic, that can be implemented in a parallel/distributed environment. We have applied our methods to boosting and ranking; our results show that our approach performs significantly better than existing boosting and ranking methods and is noise tolerant. We are extending this research to harder and more complex problems such as multi-class classification, structured prediction, and semi-supervised learning.
  • Deep learning for natural language processing
  • Open domain question answering and dialog systems
  • Automatic news generation
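
As a rough illustration of the idea behind the direct loss minimization project above, the following sketch runs cyclic coordinate descent on the raw 0-1 training error of a linear classifier. It is not the lab's algorithm: the linear model, the function names (`zero_one_errors`, `best_step`, `coordinate_descent_01`), the tie rule (a margin of exactly zero counts as an error), and the toy data are all assumptions made for the example. The point it shows is that, along a single coordinate, the 0-1 loss is piecewise constant, so an exact line search only needs to test one candidate step inside each interval between breakpoints.

```python
def zero_one_errors(w, X, y):
    """Count examples whose margin y * (w . x) is not strictly positive."""
    return sum(
        1 for xi, yi in zip(X, y)
        if yi * sum(wk * xk for wk, xk in zip(w, xi)) <= 0
    )


def best_step(w, X, y, j):
    """Exact line search on the 0-1 loss along coordinate j.

    Returns (delta, errors): the step for w[j] with the fewest training
    errors among one representative per constant piece, or (0.0, current
    errors) if no candidate strictly improves on the current weights.
    """
    scores = [sum(wk * xk for wk, xk in zip(w, xi)) for xi in X]

    def errors_at(delta):
        return sum(
            1 for s, xi, yi in zip(scores, X, y)
            if yi * (s + delta * xi[j]) <= 0
        )

    # The loss can only change where some score crosses zero.
    breaks = sorted({-s / xi[j] for s, xi in zip(scores, X) if xi[j] != 0})
    candidates = []
    if breaks:
        candidates = [breaks[0] - 1.0, breaks[-1] + 1.0]
        candidates += [(a + b) / 2 for a, b in zip(breaks, breaks[1:])]

    best_delta, best_err = 0.0, errors_at(0.0)
    for delta in candidates:
        err = errors_at(delta)
        if err < best_err:
            best_delta, best_err = delta, err
    return best_delta, best_err


def coordinate_descent_01(X, y, max_sweeps=100):
    """Cycle over coordinates until no single-coordinate step helps."""
    w = [0.0] * len(X[0])
    errors = zero_one_errors(w, X, y)
    for _ in range(max_sweeps):
        improved = False
        for j in range(len(w)):
            delta, err = best_step(w, X, y, j)
            if err < errors:
                w[j] += delta
                errors = err
                improved = True
        if not improved:
            break
    return w, errors


# Toy data: a constant bias feature plus one real-valued feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [-1, -1, 1, 1]
w, errors = coordinate_descent_01(X, y)
print(w, errors)
```

Because the error count is a non-negative integer and every accepted step strictly lowers it, the loop terminates, although in general only at a coordinate-wise local minimum rather than a global one. Each coordinate's line search reads the data independently of the others, which is one reason a method of this kind lends itself to a parallel or distributed implementation.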

Faculty: Shaojun Wang

Ph.D. students: Ming Tan, Tian Xia, Shaodan Zhai, Raymond Kulhanek, Songan Mao, Zhongliang Li

M.S. students: Lily Guo

Visiting scholars: Professor Baoguo Wei, Northwestern Polytechnical University, September 2011 - August 2012

Group wiki page:

Recruiting: Ph.D. Level Graduate Research Assistants in Machine Learning and NLP