Towards interactive visual analysis of corpora
Jefrey Lijffijt, Harri Siirtola, Tanja Säily, Turo Vartiainen, Terttu Nevalainen and Heikki Mannila
In: ICAME 31, 26-30 May 2010, Giessen, Germany.
Over the past decades, the compilation of electronic text corpora has benefited from the increased processing power of computers, enabling the study of more diverse aspects of language. However, the development of tools for analyzing corpora has received somewhat less attention. We believe the field could benefit from more advanced, interactive software tools developed specifically for the analysis of natural language corpora. In this presentation we raise some questions that cannot be answered with the basic tools normally available, and introduce freeware that can help linguists address them.
To illustrate some of the limitations of current tools, consider BNCweb and Xaira, the most popular and powerful toolsets available for accessing the British National Corpus. They provide specialized solutions for querying a text corpus like a database. Although the usefulness of these tools is beyond dispute, they have limitations. First, there is no native support for cross-corpus searching or comparison. Second, their statistical capabilities are rather limited: they provide only features such as random sampling, counting, cross-tabulation of variables and collocations. These essential tools could be enhanced with methods for pattern discovery, clustering and other data-analysis tasks commonly incorporated into off-the-shelf databases. Moreover, interactive data visualization tools could support hypothesis building.
The aim of the DAMMOC project is to enable easier, faster and more powerful analysis of corpora. This vision embodies two concrete goals. The first goal is to identify methods in data analysis and data visualization that are useful for the study of language. We have started with standard pattern mining algorithms for finding differences between speakers or writers and with plots for comparing word counts across groups, and we have also experimented with less obvious methods such as data-driven periodization of a diachronic corpus. The second goal is to develop these methods into software tools accessible to corpus linguists. The linguistic topics we are currently using as test cases for this toolkit include linguistic complexity and language variation and change over time. To ensure their generality and extensibility, these tools will be provided as an R package and as extensions to Mondrian, a data visualization engine interfacing with R.
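As a rough illustration of what comparing word counts over different groups can involve, the sketch below normalizes raw counts to a rate per 1,000 tokens so that groups of different sizes become comparable. It is a minimal Python sketch, not the project's R package or its Mondrian extensions; the function name relative_frequencies, the whitespace tokenization and the toy sentences are all assumptions made for this example.

```python
from collections import Counter


def relative_frequencies(texts_by_group, words, per=1000):
    """Return {group: {word: occurrences per `per` tokens}}.

    Hypothetical helper for illustration only; not part of any
    existing corpus toolkit.
    """
    result = {}
    for group, texts in texts_by_group.items():
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        tokens = [tok for text in texts for tok in text.lower().split()]
        counts = Counter(tokens)
        total = len(tokens)
        # Normalize raw counts so groups of different sizes are comparable.
        result[group] = {
            w: (counts[w] * per / total if total else 0.0) for w in words
        }
    return result


# Toy, made-up data standing in for two groups of speakers or writers.
samples = {
    "group_a": ["the cat sat on the mat", "the dog barked"],
    "group_b": ["a cat and a dog", "a bird sang"],
}

freqs = relative_frequencies(samples, ["the", "a", "cat"])
for group, row in freqs.items():
    print(group, {w: round(v, 1) for w, v in row.items()})
```

In practice the tokenization would come from the corpus annotation rather than from whitespace splitting, and a table of this kind would more likely serve as input to a plot or a significance test than be read directly.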