Machine Learning for Semi-Structured Multimedia Documents: Application to Pornographic Filtering and Thematic Categorization
We propose a generative statistical model for the classification of semi-structured multimedia documents. Its main originality is its ability to take into account both the structural and the content information present in a semi-structured document simultaneously, and to cope with different types of content (text, image, etc.). We then present the results of two sets of experiments: the first concerns the filtering of pornographic Web pages, and the second concerns the thematic classification of Wikipedia documents.