PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

Semi-supervised Document Classification with a Mislabeling Error Model
Anastasia Krithara, Massih Amini, Jean-Michel Renders and Cyril Goutte
In: 30th European Conference on Information Retrieval (ECIR 2008), 31 March - 2 April 2008, Glasgow, England.


This paper investigates a new extension of the Probabilistic Latent Semantic Analysis (PLSA) model for text classification where the training set is only partially labeled. The proposed approach iteratively labels the unlabeled documents and estimates the probabilities of the resulting labeling errors. These probabilities are then taken into account when the model parameters are re-estimated before the next round. Our approach outperforms an earlier semi-supervised extension of PLSA, introduced in [Krithara06b] and based on the use of fake labels, while retaining that method's simplicity and its ability to solve multiclass problems. In addition, it gives valuable information about which classes are the most uncertain and difficult to label. We perform experiments on the News, WebKB and Reuters document collections and show the effectiveness of our approach compared with two other semi-supervised algorithms applied to these text classification problems.
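
The loop described in the abstract (label the unlabeled documents, estimate how likely those labels are to be wrong, re-estimate the model with that in mind) can be illustrated with a minimal sketch. This is not the authors' algorithm: it swaps PLSA for a multinomial naive Bayes classifier to stay short, and every name in it (train_nb, posterior, fit, beta, rounds, smooth) is a hypothetical choice made for illustration. The matrix beta[k][h] is assumed to approximate the probability that a document whose true class is h receives the imperfect label k.

```python
import math

def train_nb(docs, soft_labels, n_classes, vocab, alpha=1.0):
    """Fit multinomial naive Bayes from soft (probabilistic) class assignments."""
    prior = [alpha] * n_classes
    counts = [dict.fromkeys(vocab, alpha) for _ in range(n_classes)]
    for doc, probs in zip(docs, soft_labels):
        for c, p in enumerate(probs):
            prior[c] += p
            for w in doc:
                counts[c][w] += p
    z = sum(prior)
    log_prior = [math.log(x / z) for x in prior]
    log_word = []
    for c in range(n_classes):
        total = sum(counts[c].values())
        log_word.append({w: math.log(v / total) for w, v in counts[c].items()})
    return log_prior, log_word

def posterior(doc, model):
    """Return P(class | doc) as a list; words outside the vocabulary are ignored."""
    log_prior, log_word = model
    scores = [lp + sum(lw.get(w, 0.0) for w in doc)
              for lp, lw in zip(log_prior, log_word)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fit(labeled, unlabeled, n_classes, rounds=5, smooth=0.1):
    """Iteratively label `unlabeled` docs while modelling mislabeling errors.

    labeled:   list of (list_of_words, class_index)
    unlabeled: list of list_of_words
    Returns (model, beta), where beta[k][h] estimates
    P(assigned label k | true class h). Illustrative assumption only.
    """
    vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}
    l_docs = [d for d, _ in labeled]
    l_soft = [[1.0 if c == y else 0.0 for c in range(n_classes)]
              for _, y in labeled]
    model = train_nb(l_docs, l_soft, n_classes, vocab)
    beta = None
    for _ in range(rounds):
        # Step 1: label the unlabeled documents with the current model.
        post = [posterior(d, model) for d in unlabeled]
        assigned = [max(range(n_classes), key=p.__getitem__) for p in post]
        # Step 2: estimate the mislabeling probabilities from soft counts.
        beta = [[smooth] * n_classes for _ in range(n_classes)]
        for p, k in zip(post, assigned):
            for h in range(n_classes):
                beta[k][h] += p[h]
        for h in range(n_classes):
            col = sum(beta[k][h] for k in range(n_classes))
            for k in range(n_classes):
                beta[k][h] /= col
        # Step 3: reweight each imperfect label by its error probabilities.
        corrected = []
        for p, k in zip(post, assigned):
            w = [p[h] * beta[k][h] for h in range(n_classes)]
            z = sum(w)
            corrected.append([x / z for x in w])
        # Step 4: re-estimate the model on labeled plus corrected documents.
        model = train_nb(l_docs + list(unlabeled), l_soft + corrected,
                         n_classes, vocab)
    return model, beta
```

In this sketch, large off-diagonal entries of beta point to classes whose automatically assigned labels are often wrong, which is one plausible reading of the abstract's remark about identifying the most uncertain classes.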

EPrint Type: Conference or Workshop Item (Paper)
Project Keyword: Project Keyword UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Information Retrieval & Textual Information Access
ID Code: 3524
Deposited By: Massih Amini
Deposited On: 11 February 2008