ROSALIND | Constructing a De Bruijn Graph

Wading Through the Reads

Because we use multiple copies of the genome to generate and identify reads for the purposes of fragment assembly, the total length of all reads will be much longer than the genome itself. This begs the definition of read coverage as the average number of times that each nucleotide from the genome appears in the reads. In other words, if the total length of our reads is 30 billion bp for a 3 billion bp genome, then we have 10x read coverage.

To handle such a large number of $k$-mers for the purposes of sequencing the genome, we need an efficient and simple structure.

Problem

Consider a set $S$ of $(k+1)$-mers of some unknown DNA string. Let $S^{\textrm{rc}}$ denote the set containing all reverse complements of the elements of $S$. (recall from “Counting Subsets” that sets are not allowed to contain duplicate elements).

The de Bruijn graph $B_k$ of order $k$ corresponding to $S \cup S^{\textrm{rc}}$ is a digraph defined in the following way:

Nodes of $B_k$ correspond to all $k$-mers that are present as a substring of a $(k+1)$-mer from $S \cup S^{\textrm{rc}}$.
Edges of $B_k$ are encoded by the $(k+1)$-mers of $S \cup S^{\textrm{rc}}$ in the following way: for each $(k+1)$-mer $r$ in $S \cup S^{\textrm{rc}}$, form a directed edge ($r[1:k]$, $r[2:k+1]$).

Given: A collection of up to 1000 (possibly repeating) DNA strings of equal length (not exceeding 50 bp) corresponding to a set $S$ of $(k+1)$-mers.

Return: The adjacency list corresponding to the de Bruijn graph corresponding to $S \cup S^{\textrm{rc}}$.

Sample Dataset

TGAT
CATG
TCAT
ATGC
CATC
CATC

Sample Output

(ATC, TCA)
(ATG, TGA)
(ATG, TGC)
(CAT, ATC)
(CAT, ATG)
(GAT, ATG)
(GCA, CAT)
(TCA, CAT)
(TGA, GAT)

Constructing a De Bruijn Graph solved by 1169

Wading Through the Reads

Problem

Sample Dataset

Sample Output

Report a typo