Understanding Short Texts

Zhongyuan Wang and Haixun Wang



Billions of short texts are produced every day in the form of search queries, ad keywords, tags, tweets, messenger conversations, social network posts, etc. Unlike documents, short texts have unique characteristics that make them difficult to handle. First, short texts, especially search queries, do not always observe the syntax of a written language. This means traditional NLP techniques, such as syntactic parsing, do not always apply to short texts. Second, short texts contain limited context. The majority of search queries contain fewer than five words, and tweets can have no more than 140 characters. For these reasons, short texts give rise to a significant amount of ambiguity. On the other hand, many applications, including search engines, automatic question answering, online advertising, and recommendation systems, rely on short text understanding. In all these applications, the necessary first step is to transform an input text into a machine-interpretable representation, namely to "understand" the short text. A growing number of approaches leverage external knowledge to compensate for the inadequate contextual information that accompanies short texts. These approaches can be classified into two categories: the Explicit Representation Model (ERM) and the Implicit Representation Model (IRM). In this tutorial, we will present a comprehensive overview of short text understanding based on explicit semantics (knowledge graph representation, acquisition, and reasoning) and implicit semantics (embedding and deep learning).
Specifically, we will go over various techniques in knowledge acquisition, representation, and inferencing that have been proposed for text understanding. We will also describe the massive structured and semi-structured data made available in the past decade that directly or indirectly encode human knowledge, turning the knowledge representation problem into a computational grand challenge with feasible solutions in sight.


Zhongyuan Wang (https://www.microsoft.com/en-us/research/people/zhowang/) is a Researcher at Microsoft Research. He leads two projects: Enterprise Dictionary (knowledge mining from the enterprise) and Probase (knowledge mining from the Web). He received his Ph.D. degree in computer science from Renmin University of China; his Ph.D. thesis is "Short Text Understanding". Zhongyuan Wang has published 20+ papers (including one that won the ICDE 2015 Best Paper Award, on short text understanding) in leading international conferences such as VLDB, ICDE, IJCAI, and CIKM. He is also a co-author of the book "Web Data Management: Concepts and Techniques" (published in 2014) and the author of the book "Short Text Understanding" (to be published in Sept. 2016). His research interests include knowledge bases, natural language processing, semantic networks, machine learning, and web data mining.

Haixun Wang (http://haixun.olidu.com/) is a research scientist and engineering manager at Facebook. Before Facebook, he was with Google Research, working on natural language processing. He led research in semantic search, graph data processing systems, and distributed query processing at Microsoft Research Asia. He was a research staff member at IBM T. J. Watson Research Center from 2000 to 2009. He was Technical Assistant to Stuart Feldman (Vice President of Computer Science of IBM Research) from 2006 to 2007, and Technical Assistant to Mark Wegman (Head of Computer Science of IBM Research) from 2007 to 2009. He received his Ph.D. degree in computer science from the University of California, Los Angeles in 2000. He has published more than 150 research papers in refereed international journals and conference proceedings. He has served as PC Chair of conferences such as CIKM 2012, and he is on the editorial boards of IEEE Transactions on Knowledge and Data Engineering (TKDE) and Journal of Computer Science and Technology (JCST). He won the best paper award at ICDE 2015, the 10-year best paper award at ICDM 2013, and the best paper award at ER 2009.