Weakly supervised clustering: hybridizing constructive induction and double clustering
Xiangliang Zhang, Michele Sebag and Cecile Germain
In: One-Day Statistical Workshop, June 29, 2007, Lisieux, Normandy, France.
Motivated by the behavioral modeling of a grid system based on the Logging and Bookkeeping
(L&B) files, which are job traces recorded by the EGEE grid broker, we face the following difficulties:
1) The jobs are weakly supervised, i.e., only two classes (“done” and “failed”) are known, according to the outcome of job execution.
2) The initial description of jobs is highly redundant: each job is described by three tables containing circa 100 attributes, among which strong dependencies exist.
3) Job traces use a structured representation (Job Description Language), and there is no natural metric on this representation space.
4) The distribution of the job data is complex and heterogeneous: it varies with the users who launched the jobs and with the grid load at the time, measured per week.
To address these difficulties, a two-step approach combining Constructive Induction and Double Clustering is proposed, in order to examine whether the two classes can be interpreted as finer-grained sub-clusters of the jobs. The quality of the clusters is guaranteed by their stability: the clusters are found from two independent representations of the jobs.
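
As an illustration of the stability criterion, one simple way to compare the clusterings obtained from two independent representations is to measure how often they agree on whether two jobs belong together. The sketch below uses the Rand index for this purpose. This is an assumption made for illustration only: the abstract does not state which stability measure is used, and the function name `rand_index` and the cluster labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of job pairs on which two partitions agree.

    A pair counts as an agreement when both partitions put the two jobs
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same jobs")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two jobs")
    agree = 0
    for i, j in pairs:
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agree += 1
    return agree / len(pairs)


# Hypothetical cluster labels for six jobs, one list per representation.
# Label values are arbitrary; only the grouping matters.
clusters_rep1 = [0, 0, 1, 1, 2, 2]
clusters_rep2 = [1, 1, 0, 0, 2, 1]

score = rand_index(clusters_rep1, clusters_rep2)
print(f"Rand index between the two clusterings: {score:.2f}")
```

A score close to 1 would indicate that the two representations induce nearly the same grouping of jobs, which is the sense of stability referred to above. A chance-corrected variant (such as the adjusted Rand index) may be preferable when cluster sizes are very unbalanced.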