Structured Multimedia Document Classification
Ludovic Denoyer, Patrick Gallinari, Jean-Noel Vittaut, Sylvie Brunesseaux and Stephan Brunesseaux
In: ACM DOCENG 2003, Grenoble, France (2003).
We propose a new statistical model for the classification of
structured documents and consider its use for multimedia
document classification. Its main originality is its ability to
simultaneously take into account the structural and the content
information present in a structured document, and also
to cope with different types of content (text, image, etc.).
We present experiments on the classification of multilingual
pornographic HTML pages using text and image data. The
system accurately classifies pornographic sites written in 8 European languages. The corpus was developed by EADS
in the context of a large Web site filtering application.