Creating a Character Table solved by 667

Aug. 21, 2012, midnight by Rosalind Team

Topics: Phylogeny

Introduction to Character-Based Phylogeny

Before the modern genetics revolution, phylogenies were constructed from physical characters resulting from direct structural comparison of taxa. A great deal of analysis relied on the fossil record, as fossils provided the only concrete framework for studying the appearance of extinct species and for inferring how they could have evolved into present-day organisms.

A classic case illustrating the utility of the fossil record is the case of dinosaur pelvic bones. In 1887, Harry Seeley proposed a new classification of dinosaurs into two orders, Saurischia and Ornithischia: the former possessed hip bones shaped like those found in reptiles, whereas the latter had a much different hip shape that resembled birds. Seeley's pelvic classification has survived to the present day as the principal division of dinosaurs.

The key point is that hip bone shape is a physical character that separates all dinosaurs into two different groups. Our hope is to construct a phylogeny solely from a collection of characters. Throughout character-based phylogeny, our two-part assumption is that all taxa possessing a character must have evolved from a single ancestor that introduced this character, and conversely, any taxon not possessing the character cannot be descended from this ancestor.

Problem

Given a collection of $n$ taxa, any subset $S$ of these taxa can be seen as encoding a character that divides the taxa into the sets $S$ and $S^{\textrm{c}}$; we can represent the character by $S \mid S^{\textrm{c}}$, which is called a split. Alternately, the character can be represented by a character array $A$ of length $n$ for which $A[j] = 1$ if the $j$th taxon belongs to $S$ and $A[j] = 0$ if the $j$th taxon belongs to $S^{\textrm{c}}$ (recall the "ON"/"OFF" analogy from “Counting Subsets”).

At the same time, observe that the removal of an edge from an unrooted binary tree produces two separate trees, each one containing a subset of the original taxa. So each edge may also be encoded by a split $S \mid S^{\textrm{c}}$.

A trivial character isolates a single taxon into a group of its own. The corresponding split $S \mid S^{\textrm{c}}$ must be such that $S$ or $S^{\textrm{c}}$ contains only one element; the edge encoded by this split must be incident to a leaf of the unrooted binary tree, and the array for the character contains exactly one 0 or exactly one 1. Trivial characters are of no phylogenetic interest because they fail to provide us with information regarding the relationships of taxa to each other. All other characters are called nontrivial characters (and the associated splits are called nontrivial splits).

A character table is a matrix $C$ in which each row represents the array notation for a nontrivial character. That is, entry $C_{i,j}$ denotes the "ON"/"OFF" position of the $i$th character with respect to the $j$th taxon.

Given: An unrooted binary tree $T$ in Newick format for at most 200 species taxa.

Return: A character table having the same splits as the edge splits of $T$. The columns of the character table should encode the taxa ordered lexicographically; the rows of the character table may be given in any order. Also, for any given character, the particular subset of taxa to which 1s are assigned is arbitrary.

Sample Dataset

(dog,((elephant,mouse),robot),cat);

Sample Output

00110
00111

Please login to solve this problem.