mebioda

Big data phylogenetics

Outline

Phylogenetic data

About representation formats

The Newick / New Hampshire format

Newick representation:

(((A,B),(C,D)),E)Root;
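
To make the nesting explicit, the string above can be read with a small recursive-descent parser. This is only a minimal sketch: it assumes plain labels with no whitespace, branch lengths, quoting or comments, and the function names `parse_newick` and `leaves` are made up for this illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts: {'name', 'children'}."""
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            pos += 1  # consume ")"
        # Read the (possibly empty) label that follows the node
        start = pos
        while pos < len(text) and text[pos] not in "(),;":
            pos += 1
        return {"name": text[start:pos].strip(), "children": children}

    return parse_node()


def leaves(node):
    """Return the tip labels below a node, left to right."""
    if not node["children"]:
        return [node["name"]]
    return [tip for child in node["children"] for tip in leaves(child)]


tree = parse_newick("(((A,B),(C,D)),E)Root;")
print(tree["name"], leaves(tree))
```

Each pair of parentheses becomes one internal node, and the label after a closing parenthesis (here `Root`) names that node.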

New Hampshire eXtended

Example:

(
	(
		( A[&&NHX:S=Homo sapiens], B[&&NHX:S=Homo sapiens] )[&&NHX:D=T],
		( C[&&NHX:S=Pan paniscus], D[&&NHX:S=Pan troglodytes] )[&&NHX:D=F]
	) , E
);
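
Because the annotations live in comments, they can be pulled out without a full tree parser. The sketch below is a rough illustration rather than a complete NHX reader: it assumes tags are written as `key=value` separated by colons, as in the example above, and the names `NHX_PATTERN` and `nhx_annotations` are invented here.

```python
import re

# Matches an optional label followed by one NHX comment, e.g. A[&&NHX:S=Homo sapiens]
NHX_PATTERN = re.compile(r"([^\s()\[\],;]*)\s*\[&&NHX:([^\]]*)\]")


def nhx_annotations(newick):
    """Yield (label, {key: value}) for every NHX comment in a tree string."""
    for label, body in NHX_PATTERN.findall(newick):
        # Tags inside one comment are separated by ":" and written as key=value
        tags = dict(field.split("=", 1) for field in body.split(":") if "=" in field)
        yield label, tags


nhx = (
    "((A[&&NHX:S=Homo sapiens],B[&&NHX:S=Homo sapiens])[&&NHX:D=T],"
    "(C[&&NHX:S=Pan paniscus],D[&&NHX:S=Pan troglodytes])[&&NHX:D=F]);"
)
for label, tags in nhx_annotations(nhx):
    print(label or "(internal node)", tags)
```

Tips carry the `S` (species) tag, while the unlabeled internal nodes carry the `D` tag that marks a node as a duplication or not.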

The technique of ‘overloading’ comments in square brackets to embed data is also used in other contexts, for example in the [&...] node annotations that BEAST and FigTree write into Nexus tree files.

Example of gene tree research: TreeFam data mining

  1. Download the TreeFam data dump
  2. Extract and clean up NHX trees and FASTA data
  3. Perform fossil calibration on NHX trees (using a template command file for r8s)
  4. Extract rate as function of distance from duplication
  5. Draw a plot
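
One possible reading of step 4, sketched under loud assumptions: the tree is a hypothetical nested dict in which every node has a branch `length`, an estimated `rate` for that branch, and a `dup` flag. The function name `rate_by_distance` and the toy values are invented for illustration and are not taken from the actual pipeline.

```python
def rate_by_distance(node, since_dup=None, out=None):
    """Collect (distance since nearest ancestral duplication, rate) pairs.

    `node` is a hypothetical nested dict with keys:
      length   - branch length leading to this node
      rate     - substitution rate estimated for that branch
      dup      - True if the node is a duplication
      children - list of child nodes
    """
    if out is None:
        out = []
    if since_dup is not None:
        # Distance is measured to the end of this node's branch
        since_dup += node["length"]
        out.append((since_dup, node["rate"]))
    # Restart the clock at every duplication node
    below = 0.0 if node.get("dup") else since_dup
    for child in node.get("children", []):
        rate_by_distance(child, below, out)
    return out


# Toy tree with made-up numbers: a duplication at the root
toy = {
    "dup": True,
    "children": [
        {"length": 1.0, "rate": 0.5, "children": [{"length": 2.0, "rate": 0.2}]},
        {"length": 3.0, "rate": 0.1},
    ],
}
pairs = rate_by_distance(toy)
print(pairs)
```

The resulting pairs are what step 5 would plot, with distance on one axis and rate on the other. Branches above the first duplication are skipped because they have no duplication to measure from.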

PhyloXML

<?xml version="1.0" encoding="UTF-8"?>
<phyloxml xmlns="http://www.phyloxml.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd">
  <phylogeny rooted="true">
    <clade>
      <clade>
        <clade>
          <events><duplications>1</duplications></events>
          <clade>
            <name>A</name>
            <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>
          </clade>
          <clade>
            <name>B</name>
            <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>            
          </clade>
        </clade>
        <clade>
          <events><speciations>1</speciations></events>
          <clade>
            <name>C</name>
            <taxonomy><scientific_name>Pan paniscus</scientific_name></taxonomy>            
          </clade>
          <clade>
            <name>D</name>
            <taxonomy><scientific_name>Pan troglodytes</scientific_name></taxonomy>            
          </clade>
        </clade>
      </clade>
      <clade>
        <name>E</name>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>
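
Since PhyloXML is ordinary XML, a generic XML library is enough to query it. The sketch below uses Python's standard `xml.etree.ElementTree` on a cut-down copy of the document above (only the duplication clade is kept). The helper `tips_below` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {"phy": "http://www.phyloxml.org"}
CLADE = "{http://www.phyloxml.org}clade"

PHYLOXML = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <events><duplications>1</duplications></events>
      <clade>
        <name>A</name>
        <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>
      </clade>
      <clade>
        <name>B</name>
        <taxonomy><scientific_name>Homo sapiens</scientific_name></taxonomy>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def tips_below(clade):
    """Return the names of all tips (clades without child clades) below a clade."""
    return [
        c.findtext("phy:name", namespaces=NS)
        for c in clade.iter(CLADE)
        if c.find("phy:clade", NS) is None
    ]


root = ET.fromstring(PHYLOXML)
# List the tips under every clade that records a duplication event
duplications = [
    tips_below(clade)
    for clade in root.iter(CLADE)
    if clade.find("phy:events/phy:duplications", NS) is not None
]
print(duplications)
```

Note that every element sits in the `http://www.phyloxml.org` namespace, so lookups must be namespace-qualified.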

The Nexus format

Nexus representation:

#NEXUS
begin taxa;
	dimensions ntax=5;
	taxlabels
		A
		B
		C
		D
		E
	;		
end;
begin trees;
	translate
		1 A,
		2 B,
		3 C,
		4 D,
		5 E;
	tree t1 = (((1,2),(3,4)),5);
end;
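
The translate table means the tree string by itself is not enough: the numeric tokens have to be mapped back to the taxon labels. A minimal sketch of that step follows. It assumes a single, well-formed translate command with unquoted labels and no comments, and the function names `read_translate` and `expand_tree` are invented here.

```python
import re

NEXUS = """#NEXUS
begin trees;
    translate
        1 A,
        2 B,
        3 C,
        4 D,
        5 E;
    tree t1 = (((1,2),(3,4)),5);
end;
"""


def read_translate(nexus):
    """Return the {token: label} mapping from the first translate command."""
    match = re.search(r"\btranslate\b(.*?);", nexus, re.IGNORECASE | re.DOTALL)
    if match is None:
        return {}
    table = {}
    for entry in match.group(1).split(","):
        token, label = entry.split(None, 1)
        table[token] = label.strip()
    return table


def expand_tree(newick, table):
    """Replace translate tokens in a tree string with the full labels."""
    # A token sits directly after "(" or "," and directly before ",", ")" or ":"
    return re.sub(
        r"(?<=[(,])\s*([^\s(),:;]+)\s*(?=[,):])",
        lambda m: table.get(m.group(1), m.group(1)),
        newick,
    )


table = read_translate(NEXUS)
tree_match = re.search(r"\btree\s+(\S+)\s*=\s*([^;]*;)", NEXUS, re.IGNORECASE)
expanded = expand_tree(tree_match.group(2), table)
print(tree_match.group(1), expanded)
```

For real files, a dedicated Nexus parser is the safer choice, since the format also allows quoted labels, comments and multiple blocks.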

NeXML

#!/usr/bin/perl
use strict;
use warnings;
use Bio::Phylo::IO qw'parse';

# read the Nexus file as a project and serialize it as NeXML
print parse(
	-format     => 'nexus',
	-file       => 'tree.nex',
	-as_project => 1,
)->to_xml;

Example of species tree research: TreeBASE data mining

  1. Fetch the TreeBASE sitemap.xml
  2. Download studies as NeXML through the TreeBASE API
  3. Convert trees to MRP matrices (Baum, 1992; Ragan, 1992) using a script
  4. Extract all species, and normalize the taxa
  5. Partition the data by taxonomic class
  6. Perform MP analyses with PAUP* and visualize the result (example: Arachnida)

This workflow was scripted using make for parallelization.
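
Step 3 of the workflow can be illustrated with the usual matrix representation coding: every non-root clade of every source tree becomes one binary character, scored 1 for members, 0 for other taxa in that tree, and ? for taxa the tree does not contain. This is a minimal sketch, not the script used in the workflow. Trees are given as nested tuples of tip names purely for convenience, the second toy tree is invented, and the names `clades` and `mrp_matrix` are hypothetical.

```python
def clades(tree):
    """Return (tips, clade list) for a tree given as nested tuples of tip names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    tips, found = frozenset(), []
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found += child_clades
    found.append(tips)
    return tips, found


def mrp_matrix(trees):
    """Code each non-root clade of each source tree as one binary character."""
    all_taxa = sorted(set().union(*(clades(t)[0] for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        tips, found = clades(tree)
        for clade in found:
            if clade == tips:
                continue  # the root clade holds every tip and is uninformative
            for taxon in all_taxa:
                if taxon not in tips:
                    rows[taxon] += "?"  # taxon missing from this source tree
                else:
                    rows[taxon] += "1" if taxon in clade else "0"
    return rows


trees = [
    ((("A", "B"), ("C", "D")), "E"),
    (("A", "C"), "F"),
]
matrix = mrp_matrix(trees)
for taxon, row in matrix.items():
    print(taxon, row)
```

The rows of such a matrix are what a parsimony program would then analyze as if they were ordinary character data.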

RDF and CDAO

Tabular representations

// load the external data
d3.csv("Arachnida.csv", function(error, data) {
	if ( error ) throw error;

	// create a name: node map
	var dataMap = data.reduce(function(map, node) {
		map[node.name] = node;
		return map;
	}, {});

	// populate the tree structure: attach each node to its parent;
	// a node whose parent is not in the table becomes the root
	var root;
	data.forEach(function(node) {
		var parent = dataMap[node.parent];
		if ( parent ) {
			( parent.children || ( parent.children = [] ) ).push(node);
		}
		else {
			root = node;
		}
	});
});
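
The same parent-lookup idea works outside the browser. The sketch below rebuilds a tree from a table with `name` and `parent` columns, the two fields the code above relies on. The table contents are a made-up stand-in, since the real Arachnida.csv is not shown here, and `build_tree` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical table contents; the real Arachnida.csv is not shown here
TABLE = """name,parent
Arachnida,
Araneae,Arachnida
Scorpiones,Arachnida
Salticidae,Araneae
"""


def build_tree(rows):
    """Link rows that carry 'name' and 'parent' columns into a nested tree."""
    by_name = {row["name"]: dict(row, children=[]) for row in rows}
    root = None
    for node in by_name.values():
        parent = by_name.get(node["parent"])
        if parent is not None:
            parent["children"].append(node)
        else:
            root = node  # no known parent: treat as the root
    return root


root = build_tree(csv.DictReader(io.StringIO(TABLE)))
print(root["name"], [child["name"] for child in root["children"]])
```

Building the name-to-node map first means the rows can appear in any order, at the cost of holding the whole table in memory.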