July 2, 2012, midnight by Mikhail Dvorkin
Topics: Genome Assembly
Wading Through the Reads
Because we use multiple copies of the genome to generate and identify reads for the purposes of fragment assembly, the total length of all reads will be much longer than the genome itself. This motivates the definition of read coverage as the average number of times that each nucleotide from the genome appears in the reads. In other words, if the total length of our reads is 30 billion bp for a 3 billion bp genome, then we have 10x read coverage.
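As a quick check of the arithmetic above, coverage is just total read length divided by genome length. The sketch below is only an illustration; the function name `read_coverage` and its parameters are hypothetical and not part of the original problem.

```python
def read_coverage(total_read_length: int, genome_length: int) -> float:
    """Average number of times each genome position appears in the reads."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_read_length / genome_length


# 30 billion bp of reads over a 3 billion bp genome.
print(read_coverage(30_000_000_000, 3_000_000_000))  # 10.0
```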
To handle such a large number of $k$-mers for the purposes of sequencing the genome, we need an efficient and simple structure.
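Consider a set $S$ of $(k+1)$-mers of some unknown DNA string, and let $S^{\textrm{rc}}$ denote the set of reverse complements of the elements of $S$. Because these are sets, repeated strings count only once.
The de Bruijn graph of order $k$ corresponding to $S \cup S^{\textrm{rc}}$ is the directed graph whose nodes are the $k$-mers occurring as substrings of the $(k+1)$-mers in $S \cup S^{\textrm{rc}}$, with one directed edge from the length-$k$ prefix to the length-$k$ suffix of each such $(k+1)$-mer.
Given: A collection of up to 1000 (possibly repeating) DNA strings of equal length (not exceeding 50 bp) corresponding to a set $S$ of $(k+1)$-mers.
Return: The adjacency list of the de Bruijn graph corresponding to $S \cup S^{\textrm{rc}}$.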
Sample Dataset
TGAT
CATG
TCAT
ATGC
CATC
CATC
Sample Output
(ATC, TCA)
(ATG, TGA)
(ATG, TGC)
(CAT, ATC)
(CAT, ATG)
(GAT, ATG)
(GCA, CAT)
(TCA, CAT)
(TGA, GAT)
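One way to build this adjacency list is sketched below in Python. It assumes, as the sample output suggests, that the graph is built from the distinct input strings together with their reverse complements, and that each such string contributes one edge from its prefix to its suffix. The function names are hypothetical, and the sorted output order is a choice made here to match the sample, not a stated requirement.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def de_bruijn_edges(reads):
    """Return sorted (prefix, suffix) edges for the reads and their reverse complements."""
    # Using sets collapses repeated reads, so each distinct string yields one edge.
    strings = set(reads) | {reverse_complement(r) for r in reads}
    # Each string of length k+1 joins its first k symbols to its last k symbols.
    return sorted((s[:-1], s[1:]) for s in strings)


def format_adjacency_list(edges) -> str:
    """Render edges one per line in the (A, B) style of the sample output."""
    return "\n".join(f"({a}, {b})" for a, b in edges)


sample = ["TGAT", "CATG", "TCAT", "ATGC", "CATC", "CATC"]
print(format_adjacency_list(de_bruijn_edges(sample)))
```

On the sample dataset this prints the nine edges of the sample output. Note that CATG is its own reverse complement and CATC appears twice in the input, so the six input strings yield only nine distinct strings.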