Aug. 7, 2012, midnight by Rosalind Team
Topics: String Algorithms
Getting Repetitive
We have seen that every genome contains a large number of repeats and noted that the Alu repeat recurs around a million times on the human genome. Yet exactly how repetitive is the human genome?
To frame such a vague question mathematically, we first need to make the observation that if the genome were formed by adding nucleobases randomly, with each base having a 1/4 probability of being added at each nucleotide position, then we should expect to see a huge number of different substrings in the genome. Yet (to take a simple case) the genome containing only adenosine and forming the DNA string "AAAAAA...AAA" has relatively very few distinct substrings.
Now, real genomes are formed by a process that chooses nucleotides somewhere in between these two extreme cases, and so to quantify just how random this process is, we need to take the percentage of distinct substrings appearing in a genome with respect to the maximum possible number of distinct substrings that could appear in a genome of the same length.
Given a length
The linguistic complexity of
As an example, consider the DNA string (
| | |
1 | 3 | 4 |
2 | 5 | 8 |
3 | 6 | 7 |
4 | 6 | 6 |
5 | 5 | 5 |
6 | 4 | 4 |
7 | 3 | 3 |
8 | 2 | 2 |
9 | 1 | 1 |
Total | 35 | 40 |
Given: A DNA string
Return: The linguistic complexity
ATTTGGATT
0.875
Hint
Why does this problem follow “Encoding Suffix Trees”?