Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010)
The LaTeCH (Language Technology for Cultural Heritage, Social Sciences, and Humanities) workshop series aims to provide a forum for researchers who are developing novel information technology for improved access to data from the Humanities, Social Sciences, and Cultural Heritage. The first LaTeCH workshop was held in 2007, as a satellite workshop at ACL (Annual Meeting of the Association for Computational Linguistics), in Prague. There have since been three further editions of the workshop: at LREC 2008 (Language Resources and Evaluation Conference), in Marrakech; at EACL 2009 (Conference of the European Chapter of the Association for Computational Linguistics), in Athens; and now at ECAI 2010, in Lisbon.
While the initial focus was on ‘Cultural Heritage’, it has gradually broadened to also include the Humanities and the Social Sciences. All three areas have in common that language data, i.e., text and, to a lesser extent, speech, play a central role, both as primary and secondary data sources. Current developments in these areas have resulted in an increased amount of data becoming available in electronic format, either as the outcome of recent digitisation efforts or in the form of born-digital data. What is often lacking, however, is the technology to process and access these data in an intelligent way. Information technology research and applications can provide solutions to this problem, such as methods for data cleaning and for enriching data with semantic information, so as to support more sophisticated querying as well as the discovery and visualisation of interesting data trends.
While the Humanities, Social Sciences and Cultural Heritage domains clearly benefit from this type of research, these domains also provide a challenging test bed for information technology. Traditionally, language technology has focused on other domains, such as newswire.
Data from the Humanities, Social Sciences and Cultural Heritage entail new challenges, such as noisy text (e.g., due to OCR problems), non-standard or archaic language varieties, the necessity to link data of diverse formats (e.g., text, database, video, speech) and languages, and the lack of large annotated data sets for supervised machine learning solutions. Researchers consequently have to be creative in developing robust methods for these domains.
While the main focus of LaTeCH is on language technology, for the current edition of the workshop we broadened the scope and invited papers from related areas, including machine learning, pattern recognition, knowledge representation, multi-modal systems, recommender systems, and neighbouring fields in AI. Papers were accepted for LaTeCH 2010 after a thorough peer-review process, and the selected papers give a good overview of the current breadth of this exciting and expanding area. On the technology side, the papers cover topics ranging from preprocessing and error detection, through semantic annotation and information extraction, to data visualisation. The contributions also deal with a wide variety of domains, including folk tales and ritual descriptions, cabinet minutes and political speeches, letters, legal documents, Hungarian codices, Alpine literature, and audio-video streams.