Machine Learning for Query Formulation in Question Answering
Natural Language Engineering
Research on question answering dates back to the 1960s but has more recently been revisited as part of TREC’s evaluation campaigns, where question answering is addressed as a subarea of information retrieval that focuses on specific answers to a user’s information need. Whereas document retrieval systems aim to return the documents that are most relevant to a user’s query, question answering systems aim to return actual answers to a user’s question. Despite this difference, question answering systems rely on information retrieval components to identify documents that contain an answer to a user’s question. The computationally more expensive answer extraction methods are then applied only to this subset of documents that are likely to contain an answer. Because information retrieval methods are used to filter the documents in the collection, the performance of this component is critical: documents that are not retrieved are never analyzed by the answer extraction component. The formulation of the queries used for retrieving those documents has a strong impact on the effectiveness of the retrieval component. In this paper, we focus on predicting the importance of terms from the original question. We use model tree machine learning techniques to assign weights to query terms according to their usefulness for identifying documents that contain an answer. Term weights are learned by inspecting a large number of query formulation variations and their respective accuracy in identifying documents containing an answer. Several linguistic features are used for building the models, including part-of-speech tags, degree of connectivity in the dependency parse tree of the question, and ontological information. All of these features are extracted automatically using several natural language processing tools.
Incorporating the learned weights into a state-of-the-art retrieval system results in statistically significant improvements in identifying answer-bearing documents.
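The idea of learning term weights from query formulation variations can be illustrated with a small sketch. It is not the procedure used in the paper: it assumes that variations are all non-empty subsets of the question terms, and that a term's usefulness can be approximated by the difference between the mean score of variations containing it and the mean score of those that do not. The names `query_variations`, `term_gains`, `toy_score`, and the four-document collection `DOCS` are hypothetical and exist only for illustration. The sketch covers only the computation of a per-term target value, not the linguistic features or the model tree itself.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy collection: (set of words in the document, contains an answer?)
DOCS = [
    ({"capital", "france", "paris"}, True),
    ({"capital", "investment", "fund"}, False),
    ({"france", "paris", "tourism"}, True),
    ({"what", "is", "the", "weather"}, False),
]


def query_variations(terms):
    """Yield every non-empty subset of the question terms as a candidate query."""
    for size in range(1, len(terms) + 1):
        for subset in combinations(terms, size):
            yield frozenset(subset)


def toy_score(query):
    """Assumed stand-in for a retrieval run: the fraction of documents
    matching all query terms that contain an answer (0.0 if none match)."""
    retrieved = [has_answer for words, has_answer in DOCS if query <= words]
    return sum(retrieved) / len(retrieved) if retrieved else 0.0


def term_gains(terms, score):
    """For each term, return the mean score of variations that include it
    minus the mean score of variations that exclude it."""
    scores = {q: score(q) for q in query_variations(terms)}
    gains = {}
    for term in terms:
        with_term = [s for q, s in scores.items() if term in q]
        without_term = [s for q, s in scores.items() if term not in q]
        # A single-term question has no variation that excludes the term.
        baseline = mean(without_term) if without_term else 0.0
        gains[term] = mean(with_term) - baseline
    return gains


question_terms = ["what", "capital", "france"]
gains = term_gains(question_terms, toy_score)
for term in sorted(gains, key=gains.get, reverse=True):
    print(f"{term}: {gains[term]:+.3f}")
```

On this toy data the content term "france" receives the highest gain and the function word "what" the lowest. In a setting like the one described above, values of this kind would serve as regression targets, with part-of-speech, dependency, and ontological features as inputs. Enumerating all subsets grows exponentially with question length, so the sketch is only practical for short questions.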