July 2, 2012, midnight by Rosalind Team
Topics: String Algorithms
Shortening the Motif Search
In “Finding a Motif in DNA”, we discussed the problem of searching a genome for a known motif. Because of the large scale of eukaryotic genomes, we need to accomplish this computational task as efficiently as possible.
The standard method for locating one string
$t$ as a substring of another string$s$ (and perhaps one you implemented in “Finding a Motif in DNA”) is to move a sliding window across the larger string, at each step starting at$s[k]$ and matching subsequent symbols of$t$ to symbols of$s$ . After we have located a match or mismatch, we then shift the window backwards to begin searching at$s[k+1]$ .The potential weakness of this method is as follows: say we have matched 100 symbols of
$t$ to$s$ before reaching a mismatch. The window-sliding method would then move back 99 symbols of$s$ and start comparing$t$ to$s$ ; can we avoid some of this sliding?For example, say that we are looking for
$t = \mathrm{ACGTACGT}$ in$s = \mathrm{TAGGTACGTACGGCATCACG}$ . From$s[6]$ to$s[12]$ , we have matched seven symbols of$t$ , and yet$s[13]$ = G produces a mismatch with$t[8]$ = T. We don't need to go all the way back to$s[7]$ and start matching with$t$ because$s[7] = \mathrm{C}$ ,$s[8] = \mathrm{G}$ , and$s[9] = \mathrm{T}$ are all different from$t[1] = \mathrm{A}$ . What about$s[10]$ ? Because$t[1:4] = t[5:8] = \mathrm{ACGT}$ , the previous mismatch of$s[13] = \mathrm{G}$ and$t[8] = \mathrm{T}$ guarantees the same mismatch with$s[13]$ and$t[4]$ . Following this analysis, we may advance directly to$s[14]$ and continue sliding our window, without ever having to move it backward.This method can be generalized to form the framework behind the Knuth-Morris-Pratt algorithm (KMP), which was published in 1977 and offers an efficiency boost for determining whether a given motif can be located within a larger string.
A prefix of a length
The failure array of
Given: A DNA string
Return: The failure array of
>Rosalind_87 CAGCATGGTATCACAGCAGAG
0 0 0 1 2 0 0 0 0 0 0 1 2 1 2 3 4 5 3 0 0
Extra Information
If you would like a more precise technical explanation of the Knuth-Morris-Pratt algorithm, please take a look at this site