Molecular phylogeny

Overview
Molecular phylogeny is the use of the structure of molecules to gain information on an organism's evolutionary relationships. The result of a molecular phylogenetic analysis is expressed in a so-called phylogenetic tree.

Every living organism contains DNA, RNA, and proteins. Closely related organisms generally show a high degree of agreement in the molecular structure of these substances, while the molecules of distantly related organisms usually show a pattern of dissimilarity. Molecular phylogeny uses such data to build a "relationship tree" that shows the probable evolution of various organisms. Only in recent decades, however, has it become possible to isolate and identify these molecular structures.

Another application of the techniques that make this possible can be seen in the very limited field of human genetics, such as the ever more popular use of genetic testing to determine a child's paternity, as well as the emergence of a new branch of criminal forensics focused on genetic evidence.

The effect on traditional scientific classification schemes in the biological sciences has been dramatic as well. Work that was once immensely labor- and materials-intensive can now be done quickly and easily, making yet another source of information available for systematic and taxonomic appraisal. This particular kind of data has become so popular that taxonomic schemes based solely on molecular data may be encountered. Proponents even claim that taxonomy was previously based on morphology alone, which is not the case.

Theoretical background
Early attempts at molecular systematics were also termed chemotaxonomy and made use of proteins, enzymes, carbohydrates and other molecules, which were separated and characterized using techniques such as chromatography. These have been largely replaced in recent times by DNA sequencing, which produces the exact sequences of nucleotides or bases in either DNA or RNA segments extracted using different techniques. Sequence data are generally considered superior for evolutionary studies, since the actions of evolution are ultimately reflected in the genetic sequences. At present it is still a long and expensive process to sequence the entire DNA of an organism (its genome), and this has been done for only a few species. However, it is quite feasible to determine the sequence of a defined area of a particular chromosome.

Typical molecular systematic analyses require the sequencing of around 1000 base pairs. At any location within such a sequence, the bases found in a given position may vary between organisms. The particular sequence found in a given organism is referred to as its haplotype. In principle, since there are four base types, with 1000 base pairs we could have 4^1000 distinct haplotypes. However, for organisms within a particular species, or in a group of related species, it turns out as a matter of empirical fact that
 * only a minority of sites show any variation at all
 * most of the variations that are found are correlated, so that the number of distinct haplotypes that are found is relatively small.
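
The gap between the theoretical and the observed number of haplotypes can be made concrete with a short Python sketch. The aligned sequences below are invented for illustration (they are not real data), and the helper name `variable_sites` is hypothetical. The sketch only assumes that all sequences are already aligned and of equal length.

```python
def variable_sites(haplotypes):
    """Return the positions at which at least two aligned sequences differ."""
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("all haplotypes must have the same length")
    return [i for i in range(length) if len({h[i] for h in haplotypes}) > 1]


# Hypothetical aligned sequences of 12 bases each, made up for illustration.
samples = [
    "ACGTACGTACGT",
    "ACGTACGTACGT",
    "ACGAACGTACCT",
    "ACGAACGTACCT",
    "ACGTACGTACGT",
]

sites = variable_sites(samples)        # positions that vary at all
distinct = sorted(set(samples))        # haplotypes actually observed
possible = 4 ** len(samples[0])        # haplotypes possible in principle

print("variable sites:", sites)
print(len(distinct), "distinct haplotypes observed out of", possible, "possible")
print("4**1000 has", len(str(4 ** 1000)), "decimal digits")
```

In this toy sample only two of twelve sites vary, and the two variable sites change together, so five individuals yield just two haplotypes.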

In a molecular systematic analysis, the haplotypes are determined for a defined area of genetic material. Ideally a substantial sample of individuals of the target species or other taxon is used; however, many current studies are based on single individuals. Haplotypes of individuals of closely related, but supposedly different, taxa are also determined. Finally, haplotypes from a smaller number of individuals from a definitely different taxon are determined: these are referred to as an outgroup. The base sequences for the haplotypes are then compared. In the simplest case, the difference between two haplotypes is assessed by counting the number of locations where they have different bases: this is referred to as the number of substitutions. (Other kinds of differences between haplotypes can also occur, for example the insertion of a section of nucleic acid in one haplotype that is not present in another.) Usually the difference between organisms is re-expressed as a percentage divergence, by dividing the number of substitutions by the number of base pairs analysed: the hope is that this measure will be independent of the location and length of the section of DNA that is sequenced.
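
The simplest case described above, counting substitutions and re-expressing them as a percentage divergence, can be sketched in a few lines of Python. The function names and the two sequences are hypothetical, and the sketch assumes the sequences are already aligned and of equal length; insertions and deletions are not handled.

```python
def count_substitutions(a, b):
    """Count the positions at which two aligned sequences carry different bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def percent_divergence(a, b):
    """Number of substitutions divided by the number of bases compared, as a percentage."""
    return 100.0 * count_substitutions(a, b) / len(a)


# Hypothetical 10-base haplotypes, made up for illustration.
hap_1 = "ACGTACGTAC"
hap_2 = "ACGAACGTCC"

print(count_substitutions(hap_1, hap_2), "substitutions")
print(percent_divergence(hap_1, hap_2), "percent divergence")
```

Here the two haplotypes differ at two of ten positions, giving a divergence of 20 percent.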

An older and superseded approach was to determine the divergences between the genotypes of individuals by DNA-DNA hybridisation. The advantage claimed for using hybridisation rather than gene sequencing was that it was based on the entire genotype, rather than on particular sections of DNA. Modern sequence comparison techniques overcome this objection by the use of multiple sequences.

Once the divergences between all pairs of samples have been determined, the resulting triangular matrix of differences is submitted to some form of statistical cluster analysis, and the resulting dendrogram is examined in order to see whether the samples cluster in the way that would be expected from current ideas about the taxonomy of the group, or not. Any group of haplotypes that are all more similar to one another than any of them is to any other haplotype may be said to constitute a clade. Statistical techniques such as bootstrapping and jackknifing help in providing reliability estimates for the positions of haplotypes within the evolutionary trees.
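
As a rough illustration of this step, the following Python sketch builds the triangular matrix of pairwise divergences and clusters it by average linkage. Average linkage is only one possible choice of cluster analysis, picked here for brevity, and the text above does not prescribe it. The sample names and sequences are invented, and the function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(names, seqs):
    """Repeatedly merge the two closest clusters; return the topology as nested tuples."""
    # Triangular matrix of divergences, keyed by (i, j) with i < j.
    dist = {(i, j): p_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}
    # Each cluster is (subtree, list of member sample indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(c1, c2):
        # Mean divergence over all pairs drawn from the two clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1[1] for j in c2[1]]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]))
        merged = ((clusters[a][0], clusters[b][0]), clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0][0]


# Hypothetical samples: two individuals of one taxon, one of a related
# taxon, and one outgroup. The sequences are made up for illustration.
names = ["taxonA_1", "taxonA_2", "taxonB_1", "outgroup"]
seqs = ["ACGTACGTAC", "ACGTACGTAT", "ACGAACGTCC", "TCGAATGACG"]

tree = average_linkage(names, seqs)
print(tree)
```

In this toy data set the two taxon A samples join first, taxon B joins them next, and the outgroup attaches last, which is the pattern one would expect if the current taxonomy of the group were correct. Reliability estimates such as bootstrapping are not shown.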

Characteristics and assumptions of molecular systematics
This example illustrates several characteristics of molecular systematics and its underlying assumptions.
 * 1) Molecular systematics is an essentially cladistic approach: it assumes that classification must correspond to phylogenetic descent, and that all valid taxa must be at least paraphyletic and preferably monophyletic.
 * 2) Molecular systematics often uses the molecular clock assumption that quantitative similarity of genotype is a sufficient measure of the recency of genetic divergence. Particularly in relation to speciation, this assumption could be wrong if either (a) some relatively small genotypic modification acted to prevent interbreeding between two groups of organisms, or (b) in different subgroups of the organisms being considered, genetic modification proceeded at different rates.
 * 3) In animals, it is often convenient to use mitochondrial DNA for molecular systematic analysis. However, since in mammals mitochondria are inherited only from the mother, this is not fully satisfactory: inheritance in the paternal line might not be detected. In the example above, Vilà et al. cite more limited studies with chromosomal DNA that support their conclusions.
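
The caveat about unequal rates listed above can be illustrated numerically. The Python sketch below assumes the simplest form of a molecular clock, in which divergence accumulates linearly along both lineages, so that the time since the split is estimated as t = d / (2r). That formula, the function names, and the rates used are assumptions made for illustration; they do not come from the text, and effects such as repeated substitutions at the same site are ignored.

```python
def clock_time(divergence, rate_per_lineage):
    """Time since divergence under a strict clock, t = d / (2r)."""
    return divergence / (2.0 * rate_per_lineage)


def expected_divergence(time, rate_a, rate_b):
    """Divergence accumulated when the two lineages change at different rates."""
    return time * (rate_a + rate_b)


# Hypothetical rate: 0.01 substitutions per site per million years.
ASSUMED_RATE = 0.01
TRUE_SPLIT = 2.0  # both pairs actually diverged 2 million years ago

# Pair 1: both lineages follow the assumed rate.
d_equal = expected_divergence(TRUE_SPLIT, 0.01, 0.01)
# Pair 2: one lineage changes three times as fast.
d_unequal = expected_divergence(TRUE_SPLIT, 0.01, 0.03)

print("equal rates:   estimated split", clock_time(d_equal, ASSUMED_RATE))
print("unequal rates: estimated split", clock_time(d_unequal, ASSUMED_RATE))
```

Although both hypothetical pairs split at the same time, the pair with one fast-changing lineage appears to have diverged about twice as long ago when a single rate is assumed.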

These characteristics and assumptions are not wholly uncontroversial among biological systematists. As a cladistic method, molecular systematics is open to the same criticisms as cladistics in general. It can also be argued that it is a mistake to replace a classification based on visible and ecologically relevant characteristics with one based on genetic details that may not even be expressed in the phenotype. However, the molecular approach to systematics, and its underlying assumptions, are gaining increasing acceptance. As gene sequencing becomes easier and cheaper, molecular systematics is being applied to more and more groups, and in some cases is leading to radical revisions of accepted taxonomies.

Telephone directory of Earth's species
On September 14, 2007, a team of scientists from 50 countries initiated a global DNA barcoding database project for Earth's 1.8 million known species, which are to be identified from tiny samples of genetic material. David Schindel, a Smithsonian Institution paleontologist and executive secretary of the Consortium for the Barcode of Life, stated that it will create a global reference library: "a kind of telephone directory for all species." At that point 30,000 species had been entered in the database, with the aim of reaching 500,000 within five years. The consortium is sponsored by the Smithsonian Institution's Museum of Natural History. A database of DNA barcodes identifying all species was proposed in a 2003 research paper by geneticist Paul Hebert of the University of Guelph, Ontario.