Impact of Linguistic Analysis on the Semantic Graph Coverage and Learning of Document Extracts
Jure Leskovec, Natasa Milic-Frayling and Marko Grobelnik
In: AAAI 2005, Pittsburgh, PA (2005).
Automatic document summarization is the problem of creating a document surrogate that adequately represents the full document content. We aim to build a summarization system that can replicate the quality of summaries created by humans. In this paper we investigate a machine learning method for extracting full sentences from documents based on the structure of the document's semantic graph. In particular, we explore how the Support Vector Machines (SVM) learning method is affected by the quality of the linguistic analysis and the corresponding semantic graph representation. We apply two types of linguistic analysis: (1) simple part-of-speech tagging of noun phrases and verbs and (2) full logical form analysis, which identifies Subject-Predicate-Object triples. We then build semantic graphs from the output of each analysis. We train the SVM classifier to identify summary nodes and use these nodes to extract sentences. Experiments with the DUC 2002 and CAST datasets show that the SVM-based extraction of sentences does not differ significantly between the simple and the sophisticated syntactic analysis. In both cases the graph attributes used in learning are essential for the classifier performance and the quality of the extracted summaries.
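To make the pipeline in the abstract concrete, the following minimal sketch builds a semantic graph from Subject-Predicate-Object triples, computes a few simple graph attributes per node, and maps a set of summary nodes back to sentences. It is an illustration under stated assumptions, not the authors' implementation: the function names, the triple format (sentence index, subject, predicate, object), and the three attributes are all hypothetical, and the SVM classification step is left out, so the summary nodes are supplied by hand.

```python
from collections import defaultdict


def build_semantic_graph(triples):
    """Build a directed graph from (sentence_index, subject, predicate, object) triples.

    The triple format is an assumption made for this sketch.
    """
    out_edges = defaultdict(set)       # node -> {(predicate, object)}
    in_edges = defaultdict(set)        # node -> {(predicate, subject)}
    node_sentences = defaultdict(set)  # node -> indices of sentences mentioning it
    for sent_idx, subj, pred, obj in triples:
        out_edges[subj].add((pred, obj))
        in_edges[obj].add((pred, subj))
        node_sentences[subj].add(sent_idx)
        node_sentences[obj].add(sent_idx)
    return out_edges, in_edges, node_sentences


def node_attributes(out_edges, in_edges, node_sentences):
    """Compute illustrative graph attributes that a classifier could use as features."""
    features = {}
    for node, sentences in node_sentences.items():
        features[node] = {
            "out_degree": len(out_edges.get(node, ())),
            "in_degree": len(in_edges.get(node, ())),
            "sentence_count": len(sentences),
        }
    return features


def sentences_for_nodes(summary_nodes, node_sentences):
    """Return sorted indices of sentences that mention any of the summary nodes."""
    selected = set()
    for node in summary_nodes:
        selected |= node_sentences.get(node, set())
    return sorted(selected)


# Hypothetical toy input: one triple per sentence.
triples = [
    (0, "cat", "chase", "mouse"),
    (1, "mouse", "eat", "cheese"),
    (2, "dog", "chase", "cat"),
]
out_edges, in_edges, node_sentences = build_semantic_graph(triples)
features = node_attributes(out_edges, in_edges, node_sentences)
# In the paper a trained SVM identifies the summary nodes; here they are given by hand.
extract = sentences_for_nodes({"cat"}, node_sentences)
```

In a full system, the per-node feature dictionaries would be turned into vectors and passed to a trained classifier, whose positive predictions would replace the hand-picked set of summary nodes.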