Mining Poly-regions in DNA
We study the problem of mining poly-regions in DNA sequences. A poly-region is a bursty DNA area, i.e., an area in which a DNA pattern occurs with high frequency. We introduce a general formulation that covers all possibly meaningful types of poly-regions in DNA and develop three efficient methods to detect them. The first is entropy-based and applies a recursive segmentation technique to produce a set of candidate segments, each of which may lead to a poly-region. The second uses a set of sliding windows over the sequence: each window covers a sequence segment and keeps a summary that mainly records the number of occurrences of each item or pattern in that segment, and combining these summaries yields the complete set of poly-regions in the given sequence. The third applies a technique based on majority vote and achieves linear running time with a minimal number of false negatives. In addition, we apply an existing method to discover frequently occurring arrangements of these poly-regions in several types of DNA regions, such as introns, exons, and nucleosomes. The proposed algorithms are evaluated on DNA sequences of four different organisms in terms of recall and runtime.
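To make the sliding-window idea concrete, the following is a minimal sketch and not the algorithm proposed in this paper. It assumes a single fixed pattern, a fixed window length, and a simple count threshold; the function name `poly_windows` and the parameters `window` and `min_count` are hypothetical and chosen only for illustration. Each window keeps one summary value (the number of pattern occurrences that lie fully inside it), and overlapping qualifying windows are merged into one region.

```python
def poly_windows(seq, pattern, window, min_count):
    """Return merged (start, end) spans, end exclusive, covered by windows
    that contain at least min_count full occurrences of pattern.

    Illustrative only: the fixed window length and count threshold are
    assumptions, not the paper's definition of a poly-region.
    """
    k, n = len(pattern), len(seq)
    if k == 0 or window < k or n < window:
        return []

    # hits[i] is 1 when an occurrence of the pattern starts at position i.
    hits = [1 if seq[i:i + k] == pattern else 0 for i in range(n - k + 1)]

    # An occurrence lies fully inside window [s, s + window) when it
    # starts in [s, s + span).
    span = window - k + 1
    count = sum(hits[:span])  # summary of the first window

    regions = []
    for s in range(n - window + 1):
        if s > 0:
            # Slide by one: add the start position that enters the window,
            # drop the one that leaves it.
            count += hits[s + span - 1] - hits[s - 1]
        if count >= min_count:
            end = s + window
            if regions and s <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((s, end))
    return regions


if __name__ == "__main__":
    print(poly_windows("GGGGATATATATCCCC", "AT", window=6, min_count=3))
```

Because the count is updated incrementally, the scan takes time linear in the sequence length for one pattern. Tracking every item or pattern at once, as the window summaries described above do, would require a table of counts per window rather than a single integer.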