Mining probabilistic automata: a statistical view of sequential pattern mining
During the past decade, sequential pattern mining has been the core of numerous research efforts. It is now possible to efficiently extract knowledge of users’ behavior from a huge set of sequences collected over time. This has applications in various domains such as purchases in supermarkets, Web site visits, etc. However, sequence mining algorithms do little to control the risks of extracting false discoveries or overlooking true knowledge. In this paper, the theoretical conditions to achieve a relevant sequence mining process are examined. Then, the article offers a statistical view of sequence mining which has the following advantages: First, it uses a compact and generalized representation of the original sequences in the form of a probabilistic automaton. Second, it integrates statistical constraints to guarantee the extraction of significant patterns. Finally, it provides an interesting solution in a privacy preserving context in order to respect individuals’ information. An application in car flow modeling is presented, showing the ability of our algorithm (acsm) to discover frequent routes without any private information. Comparisons with a classical sequence mining algorithm (spam) are made, showing the effectiveness of our approach.