SUPERSMART

Introduction

The Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages and Relationships of Taxa (SUPERSMART) is a self-contained, easy-to-install analytical environment for large-scale phylogenetic data mining, taxonomic name resolution, tree inference and fossil-based tree calibration.

Unlike supertree and supermatrix approaches, SUPERSMART decomposes the tree searching problem into subproblems so that complicated recent divergences can be reconstructed using the current state-of-the-art (the multispecies, multilocus coalescent), while retaining the ability to build very large, composite estimates of phylogeny with branch lengths proportional to time.

Installation

We have made every effort to make installation of SUPERSMART as simple as possible on as many different computers as possible. The three easy steps described below are known to work on the majority of modern computers with recent versions of Windows, Mac OS X and Linux-like operating systems. The end result is an installation that allows you to try out SUPERSMART, but it does not take full advantage of the speedups provided by the kind of high-performance computing (HPC) clusters that are available at certain academic computing centres^*. More detailed installation instructions are on the wiki.

(^*To install the HPC version of SUPERSMART go to the advanced topics.)

Step 1: Install VirtualBox

VirtualBox extends the capabilities of your existing computer so that it can run multiple operating systems (inside multiple "virtual machines", or VMs) at the same time. Because SUPERSMART consists of many different tools that depend on one another, we make it available as a VM that can run on any operating system that VirtualBox can run on. VirtualBox is free and open source software. Download it from the VirtualBox website, then run the installer.

Step 2: Install Vagrant

Vagrant is a system for downloading, managing and launching virtual machines. It integrates tightly with VirtualBox to fetch and run the most up-to-date version of the SUPERSMART VM and to connect it to your existing computer (e.g. to exchange data). Vagrant is free and open source software. Download it from the Vagrant website, then run the installer.

Step 3: Launch SUPERSMART

In the last step, SUPERSMART is downloaded, installed, and started using three simple commands issued from the command line: vagrant init, vagrant up and vagrant ssh. The transcript below shows each command (on the lines starting with '>') followed by the output of a typical installation:

> vagrant init Naturalis/supersmart
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
> vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'Naturalis/supersmart' could not be found. Attempting to find and install...
    default: Box Provider: virtualbox
    default: Box Version: >= 0
==> default: Loading metadata for box 'Naturalis/supersmart'
    default: URL: https://vagrantcloud.com/Naturalis/supersmart
==> default: Adding box 'Naturalis/supersmart' (v0.1.0) for provider: virtualbox
    default: Downloading: https://vagrantcloud.com/Naturalis/boxes/supersmart/versions/0.1.0/providers/virtualbox.box
==> default: Successfully added box 'Naturalis/supersmart' (v0.1.0) for 'virtualbox'!
==> default: Importing base box 'Naturalis/supersmart'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'Naturalis/supersmart' is up to date...
==> default: Setting the name of the VM: supersmart-vagrant_default_1428676504351_12521
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Mounting shared folders...
    default: /vagrant => /Users/rutger.vos/Documents/local-projects/supersmart-vagrant
> vagrant ssh
Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-30-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
Last login: Tue Mar 31 07:21:40 2015 from 10.0.2.2

Please note: the vagrant up step downloads a large amount of data (~20 GB). This is a one-time operation, so make sure that you have a fast internet connection and enough drive space. The vagrant ssh command logs you into the SUPERSMART environment, which you will recognize by the different-looking command prompt:

SUPERSMART v0.1.22 - aadc17a3b0 ~ $

You are now inside the SUPERSMART environment and can start running analyses!
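Because the download is large, it can be worth checking free disk space before running vagrant up. The following is a minimal sketch for Linux and Mac OS X shells and is not part of SUPERSMART itself. It assumes that Vagrant and VirtualBox keep their data under your home directory (their usual default, which can be reconfigured). The function names free_gb and has_room are made up for this illustration, and the 20 GB threshold simply mirrors the approximate download size, so treat it as a lower bound.

```shell
#!/bin/sh
# Print the free space, in whole gigabytes, on the filesystem holding directory $1.
free_gb() {
    # -P forces one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of that format is the available space.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed when directory $1 has at least $2 GB free.
has_room() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Assumption: boxes and VMs end up below the home directory.
if has_room "${HOME:-.}" 20; then
    echo "Enough free space for the SUPERSMART VM download."
else
    echo "Less than 20 GB free; 'vagrant up' may run out of space." >&2
fi
```

If your Vagrant or VirtualBox data lives on another drive, pass that location to has_room instead of the home directory.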

Running

The SUPERSMART pipeline consists of a number of different steps that can be chained together by issuing commands in a terminal window, or by using scripting tools such as shell scripts, Makefiles, and so on. A full analysis run-through is provided on the wiki.

The `smrt` command

When logged into the SUPERSMART environment, steps of the pipeline can be executed by issuing the smrt command in the terminal with the appropriate global options, subcommands, and their arguments, as follows:

smrt [OPTIONS] [commands|help] [SUBCOMMAND] [ARGUMENTS]

Examples of different ways in which smrt can be used:

smrt

Prints brief, overall help message and returns to terminal.

smrt --version

Prints version number and returns to terminal.

smrt commands

Lists all available subcommands and returns to terminal.

smrt help [SUBCOMMAND]

Prints help message about the specified subcommand and returns to terminal.

smrt [SUBCOMMAND] [ARGUMENTS]

Runs the specified subcommand with the provided arguments.

smrt [OPTIONS] [SUBCOMMAND] [ARGUMENTS]

Runs the specified subcommand with the provided arguments and with alternative global options in effect. The following global options are available:

--workdir=<location>: Specifies where data files are written to and read from. By default this is the current working directory.
--verbose: Sets the global verbosity; can be repeated multiple times. By default, verbosity is set such that warning and error messages are shown, but informational and debugging messages are hidden.
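As an illustration of how the global options combine with a subcommand, here is a small wrapper sketch. It only uses the two options documented above. The function name run_smrt and the SMRT_BIN override are hypothetical helpers introduced for this example (the override exists so the sketch can be tried without a SUPERSMART installation); they are not provided by the pipeline.

```shell
#!/bin/sh
# Usage: run_smrt WORKDIR VERBOSITY SUBCOMMAND [ARGUMENTS...]
# Runs smrt with --workdir set to WORKDIR and --verbose repeated VERBOSITY times.
run_smrt() {
    workdir=$1
    verbosity=$2
    shift 2

    # Prepend one --verbose per requested level (the option is repeatable).
    i=0
    while [ "$i" -lt "$verbosity" ]; do
        set -- --verbose "$@"
        i=$((i + 1))
    done

    # Global options go before the subcommand, as in the synopsis above.
    set -- "--workdir=$workdir" "$@"

    # SMRT_BIN is a hypothetical override; by default the real smrt is called.
    "${SMRT_BIN:-smrt}" "$@"
}

# Example (replace "help taxize" with a subcommand and its arguments):
#   run_smrt "$HOME/my-analysis" 2 help taxize
```

Keeping each analysis in its own working directory makes it easier to rerun a single step without mixing up intermediate files from different runs.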

Available subcommands

The following subcommands, which represent steps of the pipeline, are available. For detailed usage, consult their respective usage messages by issuing smrt help [SUBCOMMAND].

taxize: Performs taxonomic name resolution by mapping a provided list of input names onto the NCBI taxonomy. Optionally expands the higher input taxa to the specified lower taxon, e.g. to expand a named Order to its constituent species. Produces a table that lists all resolved (and expanded) taxa and their higher classification.
align: Given an input table of resolved taxa, performs multiple sequence alignment of all potentially useful PhyLoTA clusters for the taxa of interest. Produces a list of aligned candidate clusters. The alignment method can be configured; effective methods include those provided by muscle and mafft.
orthologize: Given a list of aligned candidate clusters, assigns orthology among the clusters by performing reciprocal blast searches on the seed sequences around which the clusters were assembled. Produces a list of re-aligned superclusters.
bbmerge: Given a list of superclusters and an input table of resolved taxa, creates a supermatrix that can be used to infer a backbone phylogeny. The supermatrix contains up to two distal exemplar species from each genus and an optimized selection of superclusters that minimizes sparseness of the supermatrix while attaining a user-specified minimum number of distinct markers for each taxon and omitting any alignments that exceed a user-specified maximum amount of sequence divergence.
bbinfer: Given a supermatrix and a starting tree, infers a backbone phylogeny. This backbone is typically a tree that reconstructs the relationships among a large number of genera (represented by up to two exemplar species) and so this step employs large-scale phylogenetic inference methods such as those provided by ExaML, RAxML or ExaBayes. Bootstrap analysis is supported for non-Bayesian tree inference.
bbreroot: Given a backbone phylogeny and a table of resolved taxa, reroots the phylogeny to approximate the rooting implied by the classification hierarchy of the provided, resolved taxa. Rerooting at user-specified outgroup taxa is supported. Also works on sets of trees (e.g. bootstrap replicates).
bbcalibrate: Given a rooted molecular backbone phylogeny and a table of fossils, performs tree calibration using treePL. Produces an ultrametric tree with branch lengths proportional to evolutionary time (i.e. a "chronogram"). Also works on sets of trees (e.g. bootstrap replicates).
consense: Given a file containing a set of trees (bootstrap replicates or Bayesian posterior set), calculates a consensus tree annotated with bootstrap support or posterior support values, respectively.
bbdecompose: Given a rooted backbone phylogeny, a list of superclusters and a table of resolved taxa, decomposes the backbone into its constituent, most recent, monophyletic clades, expands these clades into all taxa from the provided table and assembles sets of alignments for each clade that can be used for further tree inference within that clade.
clademerge: For each decomposed clade, merges the set of alignments assembled for this clade into an input file for tree inference.
cladeinfer: For each decomposed clade, infers the species tree. This is done using the multispecies, multilocus coalescent as implemented in *BEAST on the basis of the merged set of alignments for that clade.
cladegraft: Grafts the inferred clade trees on the backbone chronogram.
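Since the arguments of each step are documented in its own usage message, one convenient way to get an overview is to print all of those messages in one go. The sketch below does that with the documented smrt help [SUBCOMMAND] form. The step names are copied from the list above, in the order listed there. The names SMRT_STEPS, show_all_help and the SMRT_BIN override are hypothetical and introduced only for this example.

```shell
#!/bin/sh
# Subcommands in the order in which they are listed above.
SMRT_STEPS="taxize align orthologize bbmerge bbinfer bbreroot bbcalibrate consense bbdecompose clademerge cladeinfer cladegraft"

# Print the usage message of every step, each preceded by a header line.
show_all_help() {
    for step in $SMRT_STEPS; do
        printf '=== %s ===\n' "$step"
        # SMRT_BIN is a hypothetical override; by default the real smrt is called.
        "${SMRT_BIN:-smrt}" help "$step"
    done
}

# Example: page through all usage messages.
#   show_all_help | less
```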

Advanced topics

Installing on HPC systems

To take advantage of the infrastructure provided by HPC clusters it is generally best to compile key components of the SUPERSMART pipeline from source code. To simplify this process we provide a "provisioning" script. This script is written for puppet, which consequently must be available on your system. The script compiles and installs all dependencies, data, and code that SUPERSMART needs. At present this is done in system-wide locations, so the commands below must be executed by a privileged user^*:

$ git clone https://github.com/naturalis/supersmart.git
$ cd supersmart/conf
$ puppet apply

The provisioning that commences when the puppet apply command is issued can take up to several hours to complete.

(^*The components that need to be installed system-wide are packages that are installed by apt-get. In addition, the installer expects there to be a user called 'vagrant' under whose account various tools will be downloaded, compiled and installed. Updating this user name at the top of the puppet manifest should allow you to install these tools under some other account. Ask a system administrator for help if you don't know how to change this or what the consequences might be.)
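Before starting a provisioning run that may take hours, it can help to confirm that the two required tools, git and puppet, can actually be found. The following is a minimal sketch of such a pre-flight check. The function name check_tools is made up for this example and is not part of the SUPERSMART installer.

```shell
#!/bin/sh
# Report every named tool that cannot be found on the PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# The HPC installation described above needs git and puppet.
if check_tools git puppet; then
    echo "git and puppet found."
else
    echo "Install the missing tools before provisioning." >&2
fi
```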

Updating SUPERSMART

The SUPERSMART pipeline is continuously under development. To ensure that you are always using the latest functionality, the VM updates itself every time you log in (vagrant ssh). The exact version that you are using is shown in the command prompt:

SUPERSMART v0.1.22 - aadc17a3b0 ~ $

Note the identifier that follows the version number (here 'aadc17a3b0'). If you encounter an issue that you wish to report, mention the full version string, including this identifier, so that the developers know exactly what you are working with.

Advanced configuration

The SUPERSMART pipeline is configured using the file $SUPERSMART_HOME/conf/supersmart.ini. For convenience, SUPERSMART provides the system-wide command smrt-config, which opens the configuration file in the system's default editor. Here are some of the variables in this file that can be configured by users:

EXAML_MODEL: The substitution model used when inferring the backbone phylogeny. Possible values are GAMMA for a gamma model of rate heterogeneity with 4 discrete rates, or PSR for a per-site rate category model.
TREEPL_SMOOTH: Smoothing factor for the penalized likelihood tree calibration.
BURNIN: Fraction of burn-in to omit for the clade inference step.
MSA_TOOL: Multiple sequence alignment tool to use. Wrappers are available for a variety of possible aligners (which can be specified here as: mafft, clustalw, kalign, muscle, probalign, probcons, tcoffee, amap) but the VM comes bundled with mafft and muscle.
ALN_MERGE: Specifies the method to use for assigning orthology among aligned clusters. The value 'blast' refers to reciprocal BLAST searches and is the most thoroughly tested. Other, experimental values are 'inparanoid', where orthology is assigned transitively from orthology assessments produced by the InParanoid project (for coding regions), or 'feature', which depends on standardized locus names in sequence annotations.
MERGE_OVERLAP: Specifies the minimum amount of reciprocal overlap among orthologous clusters for them to be merged into a supercluster.
BACKBONE_INFERENCE_TOOL: Specifies the tool used for inferring the backbone phylogeny. Possible values are examl or exabayes.
BACKBONE_MAX_DISTANCE: Maximum amount of average pairwise sequence distance that the pipeline accepts for superclusters to be included in the backbone inference step.
BACKBONE_MIN_COVERAGE: Minimum number of loci that must be assembled for each taxon in the supermatrix.
CLADE_MAX_DISTANCE: Maximum amount of average pairwise sequence distance that the pipeline accepts for superclusters to be included in the clade refinement step.
CLADE_MIN_DENSITY: Minimum fraction of clade members (taxa) that must be sequenced in an alignment for it to be included in the clade refinement step.
FOSSIL_BEST_PRACTICE_CUTOFF: Defines the minimum 'best practice score' for records from the fossil table to be used as calibration points in the tree calibration.
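When recording which settings an analysis used, it can be handy to read a single variable back from the configuration file. The sketch below shows one way to do so. It assumes that the file stores one KEY=value pair per line, which is common for ini files but is not confirmed here for supersmart.ini. The function name get_setting and the sample file contents are hypothetical; only the variable names and values come from the list above.

```shell
#!/bin/sh
# Usage: get_setting KEY FILE
# Print the value assigned to KEY in an ini-style FILE.
# Assumption: one KEY=value (or KEY = value) pair per line.
# If KEY occurs more than once, the last assignment wins.
get_setting() {
    awk -F= -v key="$1" '
        index($0, "=") > 0 {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim the key
            if (name == key) {
                value = substr($0, index($0, "=") + 1)
                gsub(/^[ \t]+|[ \t]+$/, "", value) # trim the value
                found = value
            }
        }
        END { if (found != "") print found }
    ' "$2"
}

# Demonstration on a hypothetical excerpt (not the shipped supersmart.ini).
sample=$(mktemp)
cat > "$sample" <<'EOF'
; hypothetical excerpt
MSA_TOOL = mafft
BACKBONE_INFERENCE_TOOL=examl
EOF
get_setting MSA_TOOL "$sample"                  # prints: mafft
get_setting BACKBONE_INFERENCE_TOOL "$sample"   # prints: examl
rm -f "$sample"
```

Inside the SUPERSMART environment, the same function could be pointed at $SUPERSMART_HOME/conf/supersmart.ini, provided the file follows the assumed layout.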

Resources

Installation prerequisites

To install the SUPERSMART pipeline as a virtual machine:

VirtualBox
Vagrant

To install the pipeline on HPC infrastructure:

git
puppet

Analysis tools used internally

The following analysis tools are used by the SUPERSMART pipeline. If you use results obtained by the pipeline in a scholarly publication, please cite these tools as well. Listed are the tools that the pipeline uses by default. If you alter these (by editing the configuration file), update your cited references accordingly:

Taxonomic name resolution is performed using TaxoSaurus.
The sequence data have been pre-processed using the PhyLoTA pipeline.
The default multiple sequence alignment tool is mafft.
The default profile alignment tool is muscle.
The default method for assessing orthology uses BLAST.
The default inference tool for the backbone phylogeny is ExaML.
The default inference tool for the clade phylogenies is BEAST.
The default tool used for tree calibration is treePL.
Data transformation and file conversion is performed using Bio::Phylo and BioPerl.

Getting help

The Wiki has detailed installation instructions, an analysis run-through and a growing FAQ.
The SUPERSMART-users mailing list is used by developers and users to ask questions, discuss problems, and so on.
The SUPERSMART-dev mailing list is for developers (and those interested in the arcana) to discuss implementation details of the pipeline.
The bug tracker is used for managing progress on specific bugs, feature requests and enhancements.

Citing SUPERSMART

SUPERSMART has been published in Systematic Biology and can be cited as follows:

Antonelli A., Hettling H., Condamine F.L., Vos K., Nilsson R.H., Sanderson M.J., Sauquet H., Scharn R., Silvestro D., Töpel M., Bacon C.D., Oxelman B., Vos R.A. 2016. Toward a Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa. Systematic Biology, first published online September 10, 2016 [doi:10.1093/sysbio/syw066]

Relevant literature

Altschul S. F., Gish W., Miller W., Myers E. W. and D. J. Lipman 1990. Basic local alignment search tool Journal of Molecular Biology 215: 403-410 [doi:10.1016/S0022-2836(05)80360-2]

Drummond A. J., Suchard M. A., Xie D. and A. Rambaut 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7 Molecular Biology And Evolution 29: 1969-1973 [doi:10.1093/molbev/mss075]

Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput Nucleic Acids Research 32:1792-1797 [doi:10.1093/nar/gkh340]

Katoh K. and D. M. Standley 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability Molecular Biology and Evolution 30:772-780 [doi:10.1093/molbev/mst010]

Sanderson, M. J., D. Boss, D. Chen, K. A. Cranston, and A. Wehe. 2008. The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology 57:335-346. [doi:10.1080/10635150802158688]

Smith, S. A. and B. C. O'Meara. 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28: 2689-2690. [doi:10.1093/bioinformatics/bts492]

Stamatakis, A. and A. J. Aberer 2013. Novel Parallelization Schemes for Large-Scale Likelihood-based Phylogenetic Inference. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium 1195-1204 [doi:10.1109/IPDPS.2013.70]

Stoltzfus A., Lapp H., Matasci N., Deus H., Sidlauskas B., Zmasek C. M., Vaidya G., Pontelli E., Cranston K., Vos R. A., Webb C. O., Harmon L. J., Pirrung M., O'Meara B. C., Pennell M. W., Mirarab S., Rosenberg M. S., Balhoff J. P., Bik H. M., Heath T., Midford P., Brown J. W., McTavish E. J., Sukumaran J., Westneat M., Alfaro M. E. and A. Steele. 2013. Phylotastic! Making Tree-of-Life Knowledge Accessible, Re-usable and Convenient. BMC Bioinformatics 14:158 [doi:10.1186/1471-2105-14-158]

Vos R. A., Biserkov J. V., Balech B., Beard N., Blissett M., Brenninkmeijer C., van Dooren T., Eades D., Gosline G., Groom Q. J., Hamann T. D., Hettling H., Hoehndorf R., Holleman A., Hovenkamp P., Kelbert P., King D., Kirkup D., Lammers Y., DeMeulemeester T., Mietchen D., Miller J. A., Mounce R., Nicolson N., Page R. D. M., Pawlik A., Pereira S., Penev L., Richards K., Sautter G., Shorthouse D. P., Tähtinen M., Weiland C., Williams A. R. and S. Sierra. 2014. Enriched biodiversity data as a resource and service. Biodiversity Data Journal 2:e1125 [doi:10.3897/BDJ.2.e1125]
