July 13, 2012, midnight by Rosalind Team
Topics: String Algorithms
Motifs Are Rarely Contiguous
In “Finding a Motif in DNA”, we searched for occurrences of a motif as a substring of a larger database genetic string. However, because a DNA strand coding for a protein is often interspersed with introns (see “RNA Splicing”), we need a way to recognize a motif that has been chopped up into pieces along a chromosome.
A subsequence of a string is a collection of symbols contained in order (though not necessarily contiguously) in the string (e.g., ACG is a subsequence of TATGCTAAGATC). The indices of a subsequence are the positions in the string at which the symbols of the subsequence appear; thus, the indices of ACG in TATGCTAAGATC can be represented by (2, 5, 9).
As a substring can have multiple locations, a subsequence can have multiple collections of indices, and the same index can be reused in more than one appearance of the subsequence; for example, ACG is a subsequence of AACCGGTT in 8 different ways.
Given: Two DNA strings
Return: One collection of indices of
>Rosalind_14 ACGTACGTGACG >Rosalind_18 GTA
3 8 10
Extra Information
For the mathematically inclined, we may equivalently say that
$t = t_1 t_2 \cdots t_m$ is a subsequence of$s = s_1 s_2 \cdots s_n$ if the characters of$t$ appear in the same order within$s$ . Even more formally, a subsequence of$s$ is a string$s_{i_1} s_{i_2} \ldots s_{i_k}$ , where$1 \leq i_1 < i_2 \cdots < i_k \leq n$ .