Multimodal Learning and Reasoning

Desmond Elliott, Douwe Kiela, and Angeliki Lazaridou


Natural Language Processing has broadened in scope to tackle increasingly challenging language understanding and reasoning tasks. The core NLP tasks remain predominantly unimodal, focusing on linguistic input, despite the fact that we humans acquire and use language while communicating in perceptually rich environments. Moving towards human-level AI will require the integration and modeling of multiple modalities beyond language. With this tutorial, our aim is to introduce researchers to the areas of NLP that have dealt with multimodal signals. The key advantage of using multimodal signals in NLP tasks is the complementarity of the data in different modalities. For example, we are less likely to find descriptions of yellow bananas or wooden chairs in text corpora, but these visual attributes can be readily extracted directly from images. Multimodal signals, such as visual, auditory or olfactory data, have proven useful for models of word similarity and relatedness, automatic image and video description, and even predicting the associated smells of words. Finally, multimodality offers a practical opportunity to study and apply multitask learning, a general machine learning paradigm that improves generalization performance on a task by using the training signals of other related tasks.

All material associated with the tutorial will be available at

Tutorial Overview 

Part I: Moving beyond language

In the first part of the tutorial, we will provide an overview of Multimodal NLP, focusing on landmark research and discussing current trends as well as the challenges and long-term goals related to how multimodality fits into the general AI picture, e.g., robotics, conversational agents, and image understanding. An immediate challenge for NLP researchers working with the visual modality is data preprocessing. We will provide an overview of the key background in visual feature extraction using convolutional neural networks.

Part II: Grounded Lexical Semantics

In the second part, we will elaborate on how different modalities can interact with language at the lexical level, i.e., on the interplay between multimodal signals and word embeddings, which have recently attracted a lot of attention in the NLP community. We will focus on two aspects of the relation between the semantic spaces of language and vision: their structural complementarity and their structural similarity. This antithesis gives rise to two applications, multimodal fusion and cross-modal mapping, and we will discuss how both can be used to boost performance in NLP tasks. For multimodal fusion, we will present different methods for learning word embeddings with multimodal signals. For cross-modal mapping, we will discuss different techniques for estimating missing information in one modality (e.g., vision) given another modality (e.g., language), and their applications in NLP tasks.
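
As a rough illustration of the two applications named in this part, the sketch below shows one simple form of each: fusion by weighted concatenation of normalized text and image vectors, and cross-modal mapping by a ridge-regularized linear map from the text space to the visual space. It is a minimal sketch under stated assumptions, not the methods the tutorial presents: the function names, the `alpha` and `ridge` parameters, and the random toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


def fuse(text_vecs, image_vecs, alpha=0.5):
    """Multimodal fusion (illustrative): weighted concatenation.

    alpha is a hypothetical mixing weight between the two modalities.
    """
    return np.concatenate(
        [alpha * l2_normalize(text_vecs),
         (1.0 - alpha) * l2_normalize(image_vecs)],
        axis=-1,
    )


def fit_cross_modal_map(text_vecs, image_vecs, ridge=1e-3):
    """Cross-modal mapping (illustrative): ridge regression.

    Returns W such that text_vecs @ W approximates image_vecs.
    """
    dim = text_vecs.shape[1]
    gram = text_vecs.T @ text_vecs + ridge * np.eye(dim)
    return np.linalg.solve(gram, text_vecs.T @ image_vecs)


# Toy data (assumption): 50 "words" with 8-d text vectors and 5-d image
# vectors that are an exact linear function of the text vectors.
rng = np.random.default_rng(0)
text = rng.normal(size=(50, 8))
true_map = rng.normal(size=(8, 5))
image = text @ true_map

# Fusion: one 13-d multimodal vector per word.
fused = fuse(text, image)

# Mapping: fit on the first 40 words, then estimate the "missing"
# image vectors of the 10 held-out words from their text vectors.
W = fit_cross_modal_map(text[:40], image[:40], ridge=1e-6)
predicted = text[40:] @ W
```

In this toy setting the held-out image vectors are recovered almost exactly because the data were generated by a linear map without noise; real text and image embeddings are only approximately related, so the quality of the estimates has to be measured empirically.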

Part III: Multimodal Reasoning and Understanding

In the final part, we will move from lexical tasks to full-fledged end-to-end applications that require a deeper understanding of the participating modalities. Specifically, we will focus on image description and visual question answering, introducing state-of-the-art neural network methods and discussing the available datasets and the challenges of evaluation in these tasks. We will also discuss future directions in these areas, including multilingual image description and video description.


Part I: Moving beyond language (30 minutes)

  • Overview of Multimodal NLP: Seminal works, current trends and long-term goals of multimodality and NLP
  • Basics of different modalities (with emphasis on the visual modality): CNNs, representations, software packages

Part II: Grounded Lexical Semantics (60 minutes)

  • Multimodal fusion: Learning multimodal embeddings, tasks
  • Cross-modal semantics: Learning models for cross-modal transfer, applications to NLP tasks

Break (15 minutes)

Part III: Multimodal Reasoning and Understanding (60 minutes)

  • Image description generation: Models, datasets and evaluation methods
  • Visual QA: Motivation, methods and resources 

Part IV: Final Remarks (15 minutes) 

About the Speakers

Desmond Elliott is a postdoc at the Institute for Logic, Language and Computation at the University of Amsterdam (The Netherlands). His main research interests are models and evaluation methods for automatic image description. He delivered a tutorial on Datasets and Evaluation Methods for Image Description at the 2015 Integrating Vision and Language Summer School, and is co-organising a shared task on Multimodal Machine Translation at the 2016 Workshop on Machine Translation.

Douwe Kiela is a final-year PhD student at the University of Cambridge's Computer Laboratory, supervised by Stephen Clark. He is interested in enriching NLP with additional resources, primarily by grounding representations in perceptual modalities, including vision but also the auditory and even olfactory modalities. He is a student board member of EACL and has published 8 top-tier conference papers over the three years of his PhD.

Angeliki Lazaridou is a final-year PhD student, supervised by Marco Baroni at the Center for Mind/Brain Sciences of the University of Trento (Italy). Her primary research interests are in the area of multimodal semantics, i.e., making purely text-based models of meaning interact with other modalities, such as visual and sensorimotor. She has focused on learning models with multimodal signals and using those for multimodal inference, work that has appeared at related venues (ACL, NAACL, TACL, EMNLP).