Fraud Detection by Generating Positive Samples for Classification from Unlabeled Data
In many real-world (binary) classification problems, unlabeled data are easy to obtain, but labeled data are very expensive or simply unavailable. In certain cases, however, such as detecting fraud in (computer) games or insider trading in stock markets, one can assume that the unlabeled data contain very few samples from one class (fraudulent plays or insider trades), while synthetic data from this class can be generated. Training a naive classifier on these data is particularly well suited to detecting fraud in Markov decision problems if the classifier's feature vectors are composed of the frequency with which a player deviates from the optimal policy in each state and the associated excess reward. Using a synthetic example in blackjack, we demonstrate that this classification method can perform quite well even when the generated positive samples come from a distribution different from the real one. The method is also applied to identify possibly fraudulent trades in the stock market.
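The feature construction described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function name `deviation_features`, the state labels, and the baseline reward values are hypothetical. The sketch also assumes that "excess reward" means the observed reward minus the expected reward under the optimal policy, averaged over the plays in which the player deviated.

```python
from collections import defaultdict


def deviation_features(plays, optimal_action, baseline_reward, states):
    """Build one player's feature vector (illustrative sketch only).

    plays: list of (state, action, reward) tuples for one player.
    optimal_action: dict mapping state -> action of the optimal policy.
    baseline_reward: dict mapping state -> expected reward under the
        optimal policy (assumed to be known or estimated elsewhere).
    states: ordered list of states that fixes the feature layout.

    Returns [freq_s1, excess_s1, freq_s2, excess_s2, ...], where freq is
    the fraction of visits to the state with a non-optimal action and
    excess is the mean (reward - baseline) over those deviating plays.
    """
    visits = defaultdict(int)
    deviations = defaultdict(int)
    excess = defaultdict(float)

    for state, action, reward in plays:
        visits[state] += 1
        if action != optimal_action[state]:
            deviations[state] += 1
            excess[state] += reward - baseline_reward[state]

    features = []
    for s in states:
        n, d = visits[s], deviations[s]
        features.append(d / n if n else 0.0)          # deviation frequency
        features.append(excess[s] / d if d else 0.0)  # mean excess reward
    return features


# Hypothetical blackjack-like states and made-up baseline values.
states = ["hard16_vs_10", "hard12_vs_4"]
optimal_action = {"hard16_vs_10": "hit", "hard12_vs_4": "stand"}
baseline_reward = {"hard16_vs_10": -0.5, "hard12_vs_4": -0.25}
plays = [
    ("hard16_vs_10", "stand", 1.0),  # deviates from the optimal policy and wins
    ("hard16_vs_10", "hit", -1.0),   # follows the optimal policy
    ("hard12_vs_4", "stand", 1.0),   # follows the optimal policy
]
features = deviation_features(plays, optimal_action, baseline_reward, states)
```

Vectors of this kind, computed for each player in the unlabeled data and for each synthetically generated fraudulent player, would then serve as the training set for the classifier.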