PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Local and Global Lexicon: A Novel Approach to Quantifying Persistence
Jefrey Lijffijt
In: XXXVII Kielitieteen päivät Helsingin yliopistossa, 20-22 May 2010, Helsinki, Finland.

Abstract

The term "persistence" refers to the repetition of linguistic structures, such as words or phrases, in spoken or written discourse. There is ample evidence for the existence of lexical and syntactic persistence [1, 4], recently also from corpus analysis [3, 6]. In the existing literature, persistence is connected to the neurological concepts of priming [4] (it is easier to repeat than to invent) and mirroring [5] (humans are attracted to similar behaviour). Persistence naturally depends strongly on context, such as the topic, the audience, or the background of the writer or speaker. To assess the persistence of the lexicon in a text, we can search for words that occur with unexpected frequency; to find short-term persistence, it is useful to look at short parts of a written text or conversation. We define the concepts of local and global lexicon. We propose to measure lexical persistence in a text in two steps. First, we take a fixed-size sample of consecutive words and count the number of unique words; we call this the local lexicon. Next, we compute the statistical significance by comparing this count with fully random samples of words from the same text, which we refer to as the global lexicon. The latter sampling procedure has been used before, for example in studying vocabulary coverage between different genres [2]. The idea behind comparing the local and global lexicon is simple: if there is lexical persistence in the text, we expect the global lexicon to be significantly larger than the local lexicon. Using the British National Corpus, we investigate the presence or absence of persistence for different genres of text and sample sizes. Moreover, we study how, given part of a text, to explain significant persistence in terms of words that are repeated unexpectedly. We show that the introduced methodology is an effective tool for quantifying persistence in both a single text and collections of texts.
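The comparison of local and global lexicon described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the function names, the sampling without replacement, the number of random samples, and the empirical p-value with add-one correction are all assumptions, since the abstract does not specify these details. The text is assumed to be already tokenised into a list of words.

```python
import random


def local_lexicon_size(tokens, start, size):
    """Count unique words in a window of consecutive tokens (local lexicon)."""
    window = tokens[start:start + size]
    if len(window) < size:
        raise ValueError("window runs past the end of the text")
    return len(set(window))


def global_lexicon_sizes(tokens, size, trials, rng):
    """Count unique words in random samples from the whole text (global lexicon).

    Assumption: tokens are sampled without replacement.
    """
    return [len(set(rng.sample(tokens, size))) for _ in range(trials)]


def persistence_p_value(tokens, start, size, trials=1000, seed=0):
    """Return the local lexicon size and an empirical one-sided p-value.

    The p-value estimates how often a random sample has as few unique
    words as the consecutive window. The add-one correction is an
    assumption of this sketch.
    """
    rng = random.Random(seed)
    local = local_lexicon_size(tokens, start, size)
    global_sizes = global_lexicon_sizes(tokens, size, trials, rng)
    as_small = sum(1 for g in global_sizes if g <= local)
    return local, (as_small + 1) / (trials + 1)
```

A small p-value for a given window suggests that the window contains fewer unique words than expected from random sampling, which is what lexical persistence would predict.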
We find that lexical persistence is present in time frames of a few thousand words or shorter and verify that the strength of the effect decays over time.

References

[1] J. Bock. Syntactic persistence in language production. Cognitive Psychology, 18(3):355–387, 1986.
[2] F. Fengxiang. A corpus-based study on random textual vocabulary coverage. Corpus Linguistics and Linguistic Theory, 4(1):1–17, June 2008.
[3] S. T. Gries. Syntactic priming: A corpus-based approach. Journal of Psycholinguistic Research, 34(4):365–399, July 2005.
[4] M. J. Pickering and H. P. Branigan. Syntactic priming in language production. Trends in Cognitive Sciences, 3(4):136–141, April 1999.
[5] M. J. Pickering and S. Garrod. Alignment as the basis for successful communication. Research on Language and Computation, 4(2–3):203–228, October 2006.
[6] B. Szmrecsanyi. Language users as creatures of habit: A corpus-based analysis of persistence in spoken English. Corpus Linguistics and Linguistic Theory, 1(1):113–150, 2005.

EPrint Type: Conference or Workshop Item (Oral)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Natural Language Processing
ID Code: 7763
Deposited By: Jefrey Lijffijt
Deposited On: 17 March 2011