Finding the Longest Multiple Repeat solved by 612

Aug. 23, 2012, midnight by btjaden

Topics: Graph Algorithms

Long Repeats

We saw in “Introduction to Pattern Matching” that a data structure commonly used to encode the relationships among a collection of strings was the trie, which is particularly useful when the strings represent a collection of patterns that we wish to match to a larger text.

The trie is helpful when processing multiple strings at once, but when we want to analyze a single string, we need something different.

In this problem, we will use a new data structure to handle the problem of finding long repeats in the genome. Recall from “Finding a Motif in DNA” that cataloguing these repeats is a problem of the utmost interest to molecular biologists, as a natural correlation exists between the frequency of a repeat and its influence on RNA transcription. Our aim is therefore to identify long repeats that occur more than some predetermined number of times.

Problem

Figure 1. The suffix tree for s = GTCCGAAGCTCCGG. Note that the dollar sign has been appended to a substring of the tree to mark the end of s. Every path from the root to a leaf corresponds to a unique suffix of GTCCGAAGCTCCGG, and each leaf is labeled with the location in s of the suffix ending at that leaf.

A repeated substring of a string $s$ of length $n$ is simply a substring that appears in more than one location of $s$; more specifically, a k-fold substring appears in at least k distinct locations.

The suffix tree of $s$, denoted $T(s)$, is defined as follows:

See Figure 1 for an example of a suffix tree.

Given: A DNA string $s$ (of length at most 20 kbp) with $ appended, a positive integer $k$, and a list of edges defining the suffix tree of $s$. Each edge is represented by four components:

  1. the label of its parent node in $T(s)$;
  2. the label of its child node in $T(s)$;
  3. the location of the substring $t$ of $s^{*}$ assigned to the edge; and
  4. the length of $t$.

Return: The longest substring of $s$ that occurs at least $k$ times in $s$. (If multiple solutions exist, you may return any single solution.)

Sample Dataset

CATACATAC$
2
node1 node2 1 1
node1 node7 2 1
node1 node14 3 3
node1 node17 10 1
node2 node3 2 4
node2 node6 10 1
node3 node4 6 5
node3 node5 10 1
node7 node8 3 3
node7 node11 5 1
node8 node9 6 5
node8 node10 10 1
node11 node12 6 5
node11 node13 10 1
node14 node15 6 5
node14 node16 10 1

Sample Output

CATAC

Hint

How can repeated substrings of $s$ be located in $T(s)$?

Please login to solve this problem.