Mining Broadcast News data: Robust Information Extraction from Word Lattices
Fine-grained information extraction performance on spoken corpora is strongly correlated with the Word Error Rate (WER) of the automatic transcriptions being processed. Despite recent advances in Automatic Speech Recognition (ASR) methods, high-WER transcriptions are common when the documents to process do not match the conditions of the data used to train the ASR models. Such mismatch is inevitable when processing large spoken archives whose documents cover many time periods and topics. Moreover, from a text indexing point of view, rare events and entities are often the most interesting information to extract, and also the most likely to be poorly recognized. To deal with high-WER transcriptions, this paper proposes a robust Information Extraction method that mines the full ASR search space for specific entities through a three-step process: first, adaptation of the extraction models using metadata linked to the documents to process; second, transduction of the word lattice output by the ASR module into an entity lattice; third, a decision module that scores each entity hypothesis with different confidence scores. A first implementation of this model is proposed for the French Broadcast News Named Entity extraction task of the ESTER evaluation program.
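The three-step process can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy lattice, the `GAZETTEER` dictionary (standing in for extraction models adapted from document metadata), the function names, and the 0.5 threshold are all assumptions made for illustration. A plain dictionary lookup replaces the transduction step, and a single lattice posterior replaces the several confidence scores mentioned above.

```python
from collections import defaultdict

# Toy word lattice (assumed data): (start_node, end_node, word, probability).
# Node ids are integers in topological order; 0 is the start, 3 the end.
LATTICE = [
    (0, 1, "jacques", 0.6),
    (0, 1, "jack", 0.4),
    (1, 2, "chirac", 0.7),
    (1, 2, "shirak", 0.3),
    (2, 3, "said", 1.0),
]

# Step 1 (stand-in): entity entries that metadata-driven adaptation
# might supply. Hypothetical content.
GAZETTEER = {
    ("jacques", "chirac"): "PERSON",
    ("chirac",): "PERSON",
}


def forward_backward(edges, start, end):
    """Return the path mass reaching each node and leaving each node."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for s, e, _, p in sorted(edges, key=lambda x: x[0]):
        fwd[e] += fwd[s] * p
    bwd = defaultdict(float)
    bwd[end] = 1.0
    for s, e, _, p in sorted(edges, key=lambda x: x[1], reverse=True):
        bwd[s] += bwd[e] * p
    return fwd, bwd


def entity_lattice(edges, gazetteer, start, end):
    """Step 2 (stand-in): list every entity hypothesis found on any path.

    Each hypothesis is (start_node, end_node, label, text, posterior),
    where the posterior is the share of lattice path mass that passes
    through this exact word sequence at this position.
    """
    fwd, bwd = forward_backward(edges, start, end)
    total = fwd[end]
    out_edges = defaultdict(list)
    for edge in edges:
        out_edges[edge[0]].append(edge)
    hyps = []

    def extend(node, words, prob, first):
        for _, e, w, p in out_edges[node]:
            seq = words + (w,)
            if seq in gazetteer:
                post = fwd[first] * prob * p * bwd[e] / total
                hyps.append((first, e, gazetteer[seq], " ".join(seq), post))
            # Keep walking only while some entry still starts with seq.
            if any(k[:len(seq)] == seq for k in gazetteer):
                extend(e, seq, prob * p, first)

    for node in list(out_edges):
        extend(node, (), 1.0, node)
    return hyps


def decide(hyps, threshold=0.5):
    """Step 3 (stand-in): keep hypotheses whose posterior passes a threshold."""
    return [h for h in hyps if h[4] >= threshold]
```

Calling `decide(entity_lattice(LATTICE, GAZETTEER, 0, 3))` keeps only the hypotheses whose posterior reaches the threshold. Scoring over the whole lattice rather than the 1-best transcription means an entity can still be recovered when the top-ranked path misrecognizes it.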