July 2, 2012, midnight by Mikhail Dvorkin
Topics: Genome Assembly, Graph Algorithms
Putting the Puzzle Together
In practical genome sequencing, even if we assume that reads have been sequenced without errors, we have no idea of knowing immediately the particular strand of DNA a read has come from.
Also, our reads may not have the same length. In 1995, Idury and Waterman proposed a way to boost read coverage and achieve uniform read length by breaking long reads into overlapping
$k$ -mers for some fixed value of$k$ . For example, a 100 bp read could be split into 51 overlapping 50-mers.
A directed cycle is simply a cycle in a directed graph in which the head of one edge is equal to the tail of the next (so that every edge in the cycle is traversed in the same direction).
For a set of DNA strings
Given: A collection
Return: A cyclic superstring of minimal length containing every read or its reverse complement.
AATCT TGTAA GATTA ACAGA
GATTACA
Note
The reads "AATCT" and "TGTAA" are not present in the answer, but their reverse complements "AGATT" and "TTACA" are present in the circular string (GATTACA).