PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

The 2009 Knowledge Discovery and Data Mining Competition (KDD Cup 2009)
Gideon Dror, Marc Boullé, Isabelle Guyon, Vincent Lemaire and David Vogel, ed. (2011) Challenges in Machine Learning, Volume 3. Microtome Publishing, Brookline, MA. ISBN 978-0-9719777-3-0


The annual ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD) is dedicated to facilitating interactions between data mining researchers and practitioners from academia, industry, and government. The 15th edition took place in Paris, France, and hosted the KDD Cup competition, which attracted a very large number of participants, three times more than any previous KDD Cup. The organizing team of the KDD Cup 2009 is pleased to welcome you to this edition of the proceedings of the workshop, where the challenge results were presented and analyzed. We organized the KDD Cup 2009 around a marketing problem with the goal of identifying data mining techniques capable of rapidly building predictive models and scoring new entries on a large database.
Customer Relationship Management (CRM) is a key element of modern marketing strategies. The KDD Cup 2009 offered the opportunity to work on large marketing databases from the French telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling). The most practical way to build knowledge on customers in a CRM system is to produce scores. A score (the output of a model) is an evaluation of the target variables to be explained (i.e., churn, appetency or up-selling). Tools producing scores provide quantifiable information on a given population. The score is computed using customer records represented by a number of variables or features. Scores are then used by the information system (IS), for example, to personalize the customer relationship. Building tens to hundreds of CRM scores can be a key element in a marketing application. In this context, the automation of the data preparation and modeling steps of the data mining process is a challenging issue.
The three CRM tasks of the KDD Cup 2009, based on the data donated by Orange, encompass several scientific and technical challenges: large datasets with 100,000 samples and 15,000 variables, noisy data, mixed types with numerical and categorical variables taking up to thousands of values, many missing values, and unbalanced classes. In practice, scoring methods have to fulfill several objectives, such as full automation, effectiveness, time efficiency both for training and deployment, and understandability of the models. Full automation is necessary in order to meet the increasing demand for building numerous scores. Effectiveness involves the accuracy of the scores, and has a direct impact on the marketing payoff: the better customers are targeted, the higher the response rate. The standard scientific accuracy indicator for scores is the area under the ROC curve (AUC), evaluated on test data. In practice, the marketing quality indicator is the payoff of a campaign, which includes modeling costs, campaign costs (by mail, email, phone) and revenue related to the response. Training time efficiency allows the scores to be updated frequently, and involves modeling with training datasets of up to hundreds of thousands of samples and thousands of variables. Deployment time efficiency permits the scoring of tens of millions of customers in order to select the most responsive customers. Finally, the understandability of the models provides useful information to marketing teams. In a challenge, the tasks must be approachable, and we chose to focus on effectiveness given training time constraints. The performance indicator is the AUC on test data. The participants had one month to get familiar with the data tables without the target values, and then five days to submit their test results once the training target values were made available.
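Because the challenge ranks entries by AUC on test data, a minimal sketch of that metric may be useful. The Python function below is an illustration only, not the organizers' evaluation code: the function name `auc` and the toy labels and scores are assumptions. It relies on the standard interpretation of the AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 target values (1 = positive, e.g. a churner).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two negatives, two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))
```

The quadratic pairwise loop is chosen for readability. At the scale described here (100,000 samples), a rank-based computation would be a more practical choice.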
The participants exploited a wide variety of preprocessing, feature selection, classification, model selection and ensemble methods, providing a large and significant evaluation of the techniques effective for problems with large numbers of samples and variables, mixed types of variables, many missing values and unbalanced classes. This volume gathers the material of the challenge on Fast Scoring on a Large Marketing Database organized for the conference on Knowledge Discovery and Data Mining, June 28, 2009 in Paris. The book contains a collection of papers first published in JMLR W&CP, including a paper summarizing the results of the challenge and contributions of the top-ranking entrants. The book is complemented by a web site from which the datasets can be downloaded and post-challenge submissions can be made to benchmark new algorithms, see http: //

EPrint Type: Book
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation
Theory & Algorithms
ID Code: 9182
Deposited By: Isabelle Guyon
Deposited On: 21 February 2012