Tutorial #1, Xiaodong He and Scott Wen-tau Yih, “Deep Learning and Continuous Representations for Language Processing”

Tutorial #2, Karen Livescu, Raman Arora and Kevin Gimpel, “Multi-view learning of representations for speech and language”

Tutorial #1: “Deep Learning and Continuous Representations for Language Processing”

Sunday, December 7th, 12:30 - 15:00

Xiaodong He (Microsoft Research, Redmond, WA, USA)

Scott Wen-tau Yih (Microsoft Research, Redmond, WA, USA)


Deep learning techniques have demonstrated tremendous success in the speech and language processing community in recent years, establishing new state-of-the-art performance in speech recognition and language modeling and showing great potential for many other natural language processing tasks. The focus of this tutorial is to provide an extensive overview of recent deep learning approaches to problems in language or text processing, with particular emphasis on important real-world applications including spoken language understanding, semantic representation modeling, information retrieval, semantic parsing, and question answering.

In this tutorial, we will first survey the latest deep learning technology, presenting both theoretical and practical perspectives that are most relevant to our topic. We plan to cover common methods of deep neural networks and more advanced methods of recurrent, recursive, stacking, and convolutional networks. In addition, we will introduce recently proposed continuous-space representations for both semantic word embedding and knowledge base embedding, which are modeled by either matrix/tensor decomposition or neural networks.

Next, we will review general problems and tasks in text/language processing, and underline the distinct properties that differentiate language processing from other tasks such as speech and image object recognition. More importantly, we highlight the general issues of language processing and elaborate on how newly proposed deep learning technologies address these issues at a fundamental level. We then place particular emphasis on several important applications: 1) spoken language understanding, 2) semantic information retrieval, and 3) semantic parsing and question answering. For each task, we discuss which deep learning architectures are suitable given the nature of the task, and how learning can be performed efficiently and effectively using end-to-end optimization strategies.

Besides providing a systematic tutorial on the general theory, we will also present hands-on experience in building state-of-the-art SLU/IR/QA systems. We will share our practice through concrete examples drawn from our first-hand experience with major research benchmarks and with industrial-scale applications that we have worked on extensively in recent years.


  1. Background of neural network learning architectures
    1. Background: A review of deep learning theory
    2. Advanced architectures for modeling language structure
    3. Common problems in language processing: why is deep learning needed?
    4. Learning techniques: regularization, optimization, GPU, etc.
  2. Continuous-space representation learning
    1. Linear algebra based models (matrix/tensor decomposition)
    2. Neural network based models
    3. Semantic word embedding
    4. Knowledge base embedding
  3. Deep learning in spoken language understanding
    1. Overview of SLU
    2. Semantic classification using DCN and kernel-DCN
    3. Slot filling using Recurrent NN (RNN) and Recursive NN (RecNN), bi-directional RNN, and embedding
  4. Deep learning in information retrieval
    1. Overview of IR
    2. Deep structured semantic models (DSSM) for IR
  5. Deep learning in semantic parsing and question answering
    1. Overview of SP/QA
    2. Recent deep learning approaches and embedding models for SP/QA
  6. Summary and discussion

Short Bios

Xiaodong He

Xiaodong He is a Researcher at Microsoft Research, Redmond, WA, USA. He is also an Affiliate Professor in Electrical Engineering at the University of Washington, Seattle, WA, USA. His research interests include deep learning, information retrieval, natural language understanding, machine translation, and speech recognition. Dr. He has published a book and more than 70 technical papers in these areas, and has given tutorials at international conferences in these fields. In benchmark evaluations, he and his colleagues have developed entries that took first place in the 2008 NIST Machine Translation Evaluation (NIST MT) and the 2011 International Workshop on Spoken Language Translation Evaluation (IWSLT), both in Chinese-English translation. He serves as Associate Editor of IEEE Signal Processing Magazine and IEEE Signal Processing Letters, as Guest Editor of IEEE TASLP for the Special Issue on Continuous-space and related methods in natural language processing, and as Area Chair of NAACL 2015. He has also served as guest editor for several IEEE journals, and on the organizing and program committees of major speech and language processing conferences. He is a senior member of IEEE and a member of ACL.

Scott Wen-tau Yih

Scott Wen-tau Yih is a Researcher in the Machine Learning Group at Microsoft Research Redmond. His research interests include natural language processing, machine learning, and information retrieval. Yih received his Ph.D. in computer science from the University of Illinois at Urbana-Champaign. His work on joint inference using integer linear programming (ILP) [Roth & Yih, 2004] helped the UIUC team win the CoNLL-05 shared task on semantic role labeling, and the approach has been widely adopted in the NLP community. Since joining MSR in 2005, he has worked on email spam filtering, keyword extraction, and search & ad relevance. His recent work focuses on continuous semantic representations using neural networks and matrix/tensor decomposition methods, with applications in lexical semantics and question answering. Yih received the best paper award from CoNLL-2011 and has served as area chair (HLT-NAACL-12, ACL-14) and program co-chair (CEAS-09, CoNLL-14) in recent years.

Tutorial #2: “Multi-view learning of representations for speech and language”

Sunday, December 7th, 15:30 - 18:00

Raman Arora (Johns Hopkins University)

Kevin Gimpel (Toyota Technological Institute at Chicago)

Karen Livescu (TTI-Chicago)


Speech and language data resources often include not only audio or text, but also associated images, video, articulatory measurements, and more. Multi-view learning includes a variety of techniques that use multiple (typically two) "views" of data to learn improved models for each of the views. The views can be multiple measurement modalities (audio + video, text + images, etc.) but also different information extracted from the same source (words + context, document text + links). Theoretical and empirical results show that multi-view techniques can improve over single-view ones in certain settings. Multiple views can help by reducing noise (what is noise in one view is not in the other) or improving confidence (when one view is more confident than the other).

In this tutorial, we will focus on multi-view learning of representations (features) for speech and language, especially canonical correlation analysis (CCA) and related techniques. Recent work has produced new varieties of multi-view techniques, including ones that are feasible for the first time for large-data applications, making the methods more practical than ever for our research community.

The tutorial will start from basic principles of linear algebra and machine learning and build up an understanding of CCA and its relationship with other techniques such as partial least squares (PLS) and linear discriminant analysis (LDA). We will then present various extensions, such as kernel, deep, sparse, and generalized ("many-view") CCA. We will make connections between methods, describe practical details, and review recent results in speech recognition and natural language processing. Finally, the tutorial will include visualizations and practical tools to improve intuition and enable quick application of the ideas. We anticipate that people will come away from the tutorial with the ability to try out the presented methods on new data and applications in spoken language technology.


  • Introduction: Definitions, motivation, sample applications
  • Methods
    • Background: linear algebra concepts, singular value decomposition (SVD)
    • Linear methods: partial least squares (PLS), canonical correlation analysis (CCA), others
    • Nonlinear methods: kernel CCA, deep CCA
    • Others: supervised variants, many-view (generalized) CCA, sparse CCA, ...
  • Theory: When and why are multi-view techniques helpful?
  • Applications
    • Speech: acoustic feature learning for ASR, data analysis, audio-visual synchronization, speaker clustering/identification
    • NLP: topic clustering, word embeddings for NER and chunking, multi-lingual word embeddings & translation lexicon learning, spectral algorithms for parsing and finite-state models
    • Other applications: robotics, genomics, ...
  • Practical issues: Tuning, scaling to large data, numerical issues, etc.
  • Open questions: Theoretical questions, large-data issues, unexplored applications
  • Resources: Publicly available data sets and code
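
As a rough illustration of the linear CCA listed under Methods, the sketch below computes canonical directions for two views by whitening each view's covariance and taking an SVD of the whitened cross-covariance. It is a minimal textbook-style sketch and not the presenters' code: the function name `linear_cca`, the ridge term `reg`, and the synthetic two-view data (one shared latent signal plus independent noise in each view) are all assumptions made for this example.

```python
import numpy as np


def linear_cca(X, Y, k, reg=1e-6):
    """Linear CCA sketch. X is (n, dx), Y is (n, dy); rows are paired samples.

    Returns projection matrices A (dx, k), B (dy, k) and the top-k
    canonical correlations. `reg` is a small ridge term (an assumption,
    added only for numerical stability).
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Within-view and cross-view covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k]


# Synthetic two-view data: a shared signal z appears in one coordinate of
# each view; everything else is independent noise.
rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 1))
X = np.hstack([z + 0.5 * rng.standard_normal((n, 1)),
               rng.standard_normal((n, 2))])
Y = np.hstack([rng.standard_normal((n, 2)),
               -z + 0.5 * rng.standard_normal((n, 1))])

A, B, corrs = linear_cca(X, Y, k=1)

# Correlation between the two projected views.
px = (X - X.mean(axis=0)) @ A[:, 0]
py = (Y - Y.mean(axis=0)) @ B[:, 0]
projected_corr = np.corrcoef(px, py)[0, 1]
```

With this synthetic setup the top canonical correlation should come out at roughly 0.8, since each noisy coordinate has correlation magnitude 1 / 1.25 with its counterpart, and the learned projections should place most of their weight on the coordinates that carry the shared signal.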

Short Bios

Karen Livescu

Karen Livescu is an Assistant Professor at TTI-Chicago, where she has been since 2008. Previously she completed her PhD at MIT in the Spoken Language Systems group of the Computer Science and Artificial Intelligence Laboratory, and was a post-doctoral lecturer in the MIT EECS department. Her main research interests are in speech and language processing, with a slant toward combining machine learning with knowledge about linguistics and speech science. Her recent work has included multi-view learning of speech representations, articulatory models of pronunciation variation, discriminative training with low resources for spoken term detection and pronunciation modeling, and automatic sign language recognition. She is a member of the IEEE Spoken Language Technical Committee, an associate editor for IEEE Transactions on Audio, Speech, and Language Processing and subject editor for Speech Communication, and an organizer/co-organizer of a number of recent workshops, including the ISCA SIGML workshops on Machine Learning in Speech and Language Processing, the Midwest Speech and Language Days, and the Interspeech Workshop on Speech Production in Automatic Speech Recognition.

Raman Arora

Raman Arora is an assistant professor in the Department of Computer Science at Johns Hopkins University. Prior to this he was a Research Assistant Professor at Toyota Technological Institute at Chicago (TTIC), a post-doctoral scholar at TTIC hosted by Karen Livescu, a visiting researcher at Microsoft Research Redmond, and a research associate at the University of Washington in Seattle. He received his M.S. and Ph.D. degrees in Electrical and Computer Engineering from the University of Wisconsin-Madison in 2005 and 2009, respectively. His research interests include machine learning, speech recognition, and statistical signal processing, with emphasis on dimensionality reduction and representation learning using multi-view learning, similarity-based learning, and deep learning, as well as methods from group theory, representation theory, and harmonic analysis. Central to his research is the theory and application of stochastic approximation algorithms that can scale to big data.

Kevin Gimpel

Kevin Gimpel is a research assistant professor at TTI-Chicago. He received his PhD in 2012 from the Language Technologies Institute at Carnegie Mellon University, where he was an inaugural member of Noah’s ARK. His research focuses on natural language processing, with applications such as machine translation, syntactic analysis of social media, and text-driven forecasting of real-world events. He also works on machine learning motivated by NLP, including approximate inference for structure prediction and learning criteria for supervised and unsupervised learning. He received a five-year retrospective best paper award for a paper at WMT 2008.