Multimodal Interactive Transcription of Handwritten Text Images
This thesis presents an interactive multimodal approach for the efficient transcription of handwritten text images. Rather than pursuing full automation, this approach aims at assisting the expert in the transcription process. Current handwritten text recognition (HTR) systems are far from perfect, and heavy human intervention is often required to check and correct their results. HTR systems have indeed proven useful for restricted applications involving form-constrained handwriting and/or fairly limited vocabularies (such as postal addresses or bank check legal amounts), achieving relatively high recognition accuracy in such tasks. However, for unconstrained handwritten documents (such as old manuscripts and/or spontaneous text), current HTR technology typically achieves results which are far from being directly acceptable in practice. The interactive scenario studied in this thesis allows for a more effective approach, in which the automatic HTR system and the human transcriber cooperate to generate the final transcription of the text images. In this scenario, the system uses the text image and a previously validated part (prefix) of its transcription to propose a suitable continuation. The user then finds and corrects the next system error, thereby providing a longer prefix which the system uses to suggest a new, hopefully better continuation. The technology used in this work is based on Hidden Markov Models (HMMs) and $n$-gram language models, used in the same way as in current automatic speech recognition (ASR) systems. To take into account the feedback introduced by the user, some modifications of the conventional $n$-gram language models have been studied. To implement the decoding process in one step, as in conventional HTR systems, two main approaches are presented.
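The prefix-validation protocol described above can be sketched in a few lines of Python. This is a minimal, illustrative simulation only: `recognize_suffix` is a hypothetical stand-in for the HMM/$n$-gram decoder (here a toy stub that ignores the image), and the "user" is simulated by a reference transcription; none of these names come from the thesis system.

```python
def recognize_suffix(image, prefix):
    # A real system would run prefix-constrained HMM/n-gram decoding on the
    # text image. Toy stub: always propose the same (partly wrong) hypothesis.
    hypothesis = ["the", "quick", "brown", "fix", "jumps"]
    return hypothesis[len(prefix):]

def interactive_transcribe(image, reference):
    """Simulate the interactive protocol: at each step the user validates the
    longest correct prefix of the suggestion and corrects the first error."""
    prefix, corrections = [], 0
    while len(prefix) < len(reference):
        suffix = recognize_suffix(image, prefix)
        # The user accepts the leading words that are correct...
        i = 0
        while (i < len(suffix) and len(prefix) + i < len(reference)
               and suffix[i] == reference[len(prefix) + i]):
            i += 1
        prefix.extend(suffix[:i])
        if len(prefix) < len(reference):
            # ...and corrects the next wrong word, extending the prefix.
            prefix.append(reference[len(prefix)])
            corrections += 1
    return prefix, corrections

words, n = interactive_transcribe(None, ["the", "quick", "brown", "fox", "jumps"])
```

In this toy run the user needs a single correction ("fix" to "fox"), after which the remainder of the hypothesis is validated; counting such corrections is exactly how interactive effort is measured against post-editing a fully automatic output.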
The first consists in building a special language model; the second relies on more sophisticated word-graph techniques. The latter approach integrates efficient error-correcting algorithms in order to guarantee low response times while preserving adequate transcription accuracy. The system was tested on three corpora: two of them contain handwritten text in modern Spanish and English, whereas the third consists of cursive handwritten page images in old Spanish. The results on the three cursive handwriting tasks suggest that, using the interactive approach, considerable amounts of user effort can be saved with respect to both pure manual work and non-interactive HTR systems. In the interactive system presented here, the user interacts with the system repeatedly; hence, the quality and ergonomics of the interactive process are crucial for the success of the system. In this thesis, different ways to interact with the system, at different levels (whole word and keystroke), have been studied. Moreover, more ergonomic multimodal interfaces have been used in order to obtain an easier and more comfortable human-machine interaction. Among the many possible feedback modalities, we focus here on touchscreen communication, which is perhaps the most natural modality to provide the required feedback. The on-line feedback HTR subsystem is based on HMMs, in the same way as the main off-line HTR system. To train the on-line HTR feedback subsystem and test the multimodal approach, an on-line handwriting corpus has been used. The word instances that the user would have to handwrite in the multimodal interaction process were generated by concatenating random character instances from three categories: digits, lowercase letters and symbols. The obtained results show that, in spite of losing the deterministic accuracy of the traditional keyboard and mouse, the more ergonomic multimodal approach can save significant amounts of human effort.
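The word-graph approach with error-correcting parsing can be illustrated with a small sketch. The idea is to find the graph state whose best path from the start node has minimum word-level edit distance to the user-validated prefix; the suggested continuation is then read from that state. The tiny hand-built graph and the exhaustive (beam-free) dynamic program below are illustrative assumptions, not the thesis implementation.

```python
# Word graph as {state: [(word, next_state), ...]}; state 0 is initial, and
# the states happen to be listed in topological order.
GRAPH = {
    0: [("the", 1)],
    1: [("quick", 2), ("quiet", 3)],
    2: [("brown", 4)],
    3: [("brown", 4)],
    4: [("fox", 5)],
    5: [],
}

def error_correcting_match(prefix):
    """Dynamic programming over (state, prefix position): minimum cost of
    reaching `state` having consumed `i` prefix words, with unit-cost
    substitutions, insertions and deletions."""
    best = {(0, 0): 0}
    for s in GRAPH:  # states are already in topological order here
        for i in range(len(prefix) + 1):
            if (s, i) not in best:
                continue
            c = best[(s, i)]
            if i < len(prefix):  # deletion: skip a prefix word
                key = (s, i + 1)
                best[key] = min(best.get(key, c + 1), c + 1)
            for word, t in GRAPH[s]:
                if i < len(prefix):  # match (cost 0) or substitution (cost 1)
                    step = 0 if word == prefix[i] else 1
                    key = (t, i + 1)
                    best[key] = min(best.get(key, c + step), c + step)
                key = (t, i)         # insertion: extra graph word
                best[key] = min(best.get(key, c + 1), c + 1)
    # Return the cheapest state once the whole prefix has been consumed.
    candidates = {s: c for (s, i), c in best.items() if i == len(prefix)}
    return min(candidates, key=candidates.get)
```

Because the graph is built once per text line, each user interaction only requires this cheap matching step rather than a full re-decoding of the image, which is what keeps response times low even when the prefix contains words absent from the graph.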