PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning

AntiPhish: Lessons Learnt
Andre Bergholz
In: KDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD), 28 Jun 2009, Paris, France.

Abstract

Phishing emails usually contain a message from a credible-looking source asking the user to click a link to a website, where she or he is asked to enter a password or other confidential information. Most phishing emails aim at withdrawing money from financial institutions or gaining access to private information. Phishing has increased enormously in recent years and is a serious threat to global security and the economy. There are a number of possible countermeasures, ranging from communication-oriented approaches such as authentication protocols, through blacklisting, to content-based filtering [3]. We argue that the first two approaches are currently either not broadly implemented or exhibit deficits. Content-based phishing filters are therefore necessary and widely used to increase communication security. In such a filter, a number of features are extracted that capture the content and structural properties of an email. A statistical classifier is then trained on these features using a training set of emails labeled as ham (legitimate), spam, or phishing. This classifier can then be applied to an email stream to estimate the classes of new incoming emails.

AntiPhish is a specific targeted research project funded by the European Union under Framework Program 6. It aims at developing improved anti-phishing technologies that help to protect and secure the global email communication infrastructure. The project developed the filter methodology in a test laboratory setting and also implemented this technology in real-world settings, to be used to filter all email traffic online in real time. In this talk we summarize our experience with phishing filtering on benchmark data and on different real-life email streams.

First, we describe a number of novel features that are particularly well suited to identifying phishing emails [1]. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators for hidden salting [2]. Hidden salting is the intentional addition or distortion of content that is not perceivable by the reader. For empirical evaluation we have obtained a large, realistic corpus of emails pre-labeled as spam, phishing, and ham (legitimate). In experiments with benchmark data our methods outperform other published approaches for classifying phishing emails.

The second part of the talk describes the application of these approaches to real-life email streams. In a first experiment we investigate how to identify new phishing emails arriving from a honeypot system. This allows us to spot new types of phishing emails, whose characteristics can subsequently be used to update client-based phishing filters. A second experiment investigates the capabilities of the AntiPhish system when monitoring emails in an ISP framework. It turns out that active learning approaches are very efficient at maintaining and improving filtering accuracy. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. Finally, we describe a strategy for updating the filters and adapting them to new types of phishing.
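
As a rough illustration of the content-based pipeline outlined in the abstract (extract features, train a statistical classifier on emails labeled ham, spam or phishing, then pick uncertain emails for manual labeling), the following Python sketch may help. It is not the AntiPhish implementation: the multinomial naive Bayes classifier, the two structural link markers, the margin-based uncertainty sampling, and all names and toy emails are assumptions chosen only for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(email_text):
    """Bag of lowercase word tokens plus two hypothetical structural markers."""
    features = Counter(re.findall(r"[a-z]+", email_text.lower()))
    links = re.findall(r"https?://\S+", email_text)
    if links:
        features["__HAS_LINK__"] += 1
    # Assumed cue: a link that points at a raw IP address instead of a host name.
    if any(re.match(r"https?://\d+\.\d+\.\d+\.\d+", link) for link in links):
        features["__IP_LINK__"] += 1
    return features


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (a stand-in classifier)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def fit(self, labeled_emails):
        for text, label in labeled_emails:
            feats = extract_features(text)
            self.class_counts[label] += 1
            self.feature_counts[label].update(feats)
            self.vocabulary.update(feats)
        return self

    def predict_proba(self, text):
        feats = extract_features(text)
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_emails in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_emails / total)
            for feat, n in feats.items():
                if feat in self.vocabulary:  # ignore features never seen in training
                    score += n * math.log((counts[feat] + 1) / denom)
            log_scores[label] = score
        # Convert log scores to normalized probabilities in a numerically stable way.
        top = max(log_scores.values())
        exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
        norm = sum(exp_scores.values())
        return {k: v / norm for k, v in exp_scores.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


def select_for_labeling(model, unlabeled, budget):
    """Uncertainty sampling: return the emails with the smallest top-two margin."""
    def margin(text):
        probs = sorted(model.predict_proba(text).values(), reverse=True)
        return probs[0] - probs[1] if len(probs) > 1 else 1.0
    return sorted(unlabeled, key=margin)[:budget]


# Toy training set, invented for this sketch.
TRAINING = [
    ("Meeting agenda for the project review on Monday", "ham"),
    ("Lunch tomorrow? Let me know what time works", "ham"),
    ("Cheap pills and watches, buy now at http://shop.example/deals", "spam"),
    ("You won the lottery, claim your prize now http://win.example", "spam"),
    ("Your bank account is suspended, verify your password at "
     "http://192.0.2.7/login", "phishing"),
    ("Confirm your account password now or lose access "
     "http://198.51.100.9/verify", "phishing"),
]

model = NaiveBayes().fit(TRAINING)
```

In an active learning loop of the kind mentioned for the ISP setting, the emails returned by `select_for_labeling` would be labeled by a person, appended to the training set, and the model refitted. A production filter would use far richer features and much larger corpora than this toy example.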

EPrint Type: Conference or Workshop Item (Invited Talk)
Project Keyword: UNSPECIFIED
Subjects: Learning/Statistics & Optimisation; Natural Language Processing
ID Code: 6776
Deposited By: Andre Bergholz
Deposited On: 08 March 2010