PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Lemmatization and Lexicalized Statistical Parsing of Morphologically Rich Languages: the Case of French
Djame Seddah, Grzegorz Chrupala, Özlem Çetinoğlu, Josef van Genabith and Marie Candito
In: NAACL SPMRL 2010 workshop (2010).

Abstract

This paper shows that training a lexicalized parser on a lemmatized morphologically-rich treebank such as the French Treebank slightly improves parsing results. We also show that lemmatizing a similarly sized subset of the English Penn Treebank has almost no effect on parsing performance with gold lemmas and leads to a small drop in performance when automatically assigned lemmas and POS tags are used. This highlights two facts: (i) lemmatization helps to reduce lexicon data-sparseness issues for French, and (ii) it also makes the parsing process sensitive to correct assignment of POS tags to unknown words.

EPrint Type: Conference or Workshop Item (Paper)
Subjects: Natural Language Processing
ID Code: 8850
Deposited By: Grzegorz Chrupala
Deposited On: 21 February 2012