July 2, 2012, midnight by Rosalind Team
Topics: Phylogeny
From Characters Toward Alignments
In “Creating a Character Table from Genetic Strings”, we used strings to create a collection of characters from which we could create a phylogeny. However, the strings all had to share the same length, which was a problem. In practice, we would like to create a phylogeny from genetic strings having differing lengths; specifically, our aim is to construct a phylogeny from a multiple alignment.
Unfortunately, constructing a phylogeny from the ground up based only on an alignment can be difficult. In order to produce an efficient solution, we will need to assume that the structure of the phylogeny has already been provided (perhaps from character-based methods), and our aim instead is to reconstruct the genetic strings corresponding to the internal nodes (i.e., ancestors) in the tree.
The ancestor strings should have the property that the total number of point mutations separating adjacent nodes in the tree is minimized (in keeping with parsimony).
Say that we have
Recall that the number of single-symbol substitutions required to transform one string into
another is the Hamming distance between the strings (see “Counting Point Mutations”). Say that we have a rooted binary tree
Thus, our aim is to minimize
Given: A rooted binary tree
Return: The minimum possible value of
(((ostrich,cat)rat,(duck,fly)mouse)dog,(elephant,pikachu)hamster)robot; >ostrich AC >cat CA >duck T- >fly GC >elephant -T >pikachu AA
8 >rat AC >mouse TC >dog AC >hamster AT >robot AC
Note
Given internal strings minimizing
$d_{\textrm{H}}(T)$ , the alignment between any two adjacent strings is not necessarily an optimal global paired alignment. In other words, it may not be the case that$d_{\textrm{H}}(\overline{s}_i, \overline{s}_j)$ is equal to the edit distance$d_{\textrm{E}}(s_i, s_j)$ .