ROSALIND | Genome Assembly with Perfect Coverage

Cyclic Chromosomes

Recall that although chromosomes taken from eukaryotes have a linear structure, many bacterial chromosomes are actually circular. We represented a linear chromosome with a DNA string, so we only need to modify the definition of string to model circular chromosomes.

Perfect coverage is the phenomenon in fragment assembly of having a read (or $k$-mer) begin at every possible location in the genome. Unfortunately, perfect coverage is still difficult to achieve, but fragment assembly technology continues to improve by leaps and bounds, and perfect coverage is perhaps not the fantasy it once was.

Problem

A circular string is a string that does not have an initial or terminal element; instead, the string is viewed as a necklace of symbols. We can represent a circular string as a string enclosed in parentheses. For example, consider the circular DNA string (ACGTAC), and note that because the string "wraps around" at the end, this circular string can equally be represented by (CGTACA), (GTACAC), (TACACG), (ACACGT), and (CACGTA). The definitions of substrings and superstrings are easy to generalize to the case of circular strings (keeping in mind that substrings are allowed to wrap around).

Given: A collection of (error-free) DNA $k$-mers ($k \leq 50$) taken from the same strand of a circular chromosome. In this dataset, all $k$-mers from this strand of the chromosome are present, and their de Bruijn graph consists of exactly one simple cycle.

Return: A cyclic superstring of minimal length containing the reads (thus corresponding to a candidate cyclic chromosome).

Sample Dataset

ATTAC
TACAG
GATTA
ACAGA
CAGAT
TTACA
AGATT

Sample Output

GATTACA

Note

The assumption made above that all reads derive from the same strand is practically unrealistic; in reality, researchers will not know the strand of DNA from which a given read has been sequenced.

Genome Assembly with Perfect Coverage solved by 838

Cyclic Chromosomes

Problem

Sample Dataset

Sample Output

Note

Report a typo