Introduction
The Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages and Relationships of Taxa (SUPERSMART) is a self-contained, easy to install analytical environment for large-scale phylogenetic data mining, taxonomic name resolution, tree inference and fossil-based tree calibration.
Unlike supertree and supermatrix approaches, SUPERSMART decomposes the tree searching problem into subproblems so that complicated recent divergences can be reconstructed using the current state-of-the-art (the multispecies, multilocus coalescent), while retaining the ability to build very large, composite estimates of phylogeny with branch lengths proportional to time.
Installation
We have made every effort to make installation of SUPERSMART as simple as possible on as many different computers as possible. The three easy steps described below are known to work on the majority of modern computers with recent versions of Windows, Mac OS X and Linux-like operating systems. The end result is an installation that allows you to try out SUPERSMART, but it does not take full advantage of the speedups provided by the high-performance computing (HPC) clusters that are available at certain academic computing centres*. More detailed installation instructions are on the wiki.
(*To install the HPC version of SUPERSMART, go to the advanced topics.)
Step 1: Install VirtualBox
VirtualBox extends the capabilities of your existing computer so that it can run multiple operating systems (inside multiple "virtual machines", or VMs) at the same time. Because SUPERSMART consists of many different tools that depend on one another, we make it available as a VM that can run on any operating system that VirtualBox can run on. VirtualBox is free and open source software. Download it from the VirtualBox website, then run the installer.
Step 2: Install Vagrant
Vagrant is a system for downloading, managing and launching virtual machines. It tightly integrates with VirtualBox to fetch and run the most up-to-date version of the SUPERSMART VM and to connect it to your existing computer (e.g. to exchange data). Vagrant is free and open source software. Download it from the Vagrant website, then run the installer.
Step 3: Launch SUPERSMART
In the last step, SUPERSMART is downloaded, installed, and started using three simple commands issued from the command line (shown here together with the output of a typical installation):
> vagrant init Naturalis/supersmart
A `Vagrantfile` has been placed in this directory. You are now
ready to `vagrant up` your first virtual environment! Please read
the comments in the Vagrantfile as well as documentation on
`vagrantup.com` for more information on using Vagrant.
> vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'Naturalis/supersmart' could not be found. Attempting to find and install...
    default: Box Provider: virtualbox
    default: Box Version: >= 0
==> default: Loading metadata for box 'Naturalis/supersmart'
    default: URL: https://vagrantcloud.com/Naturalis/supersmart
==> default: Adding box 'Naturalis/supersmart' (v0.1.0) for provider: virtualbox
    default: Downloading: https://vagrantcloud.com/Naturalis/boxes/supersmart/versions/0.1.0/providers/virtualbox.box
==> default: Successfully added box 'Naturalis/supersmart' (v0.1.0) for 'virtualbox'!
==> default: Importing base box 'Naturalis/supersmart'...
==> default: Matching MAC address for NAT networking...
==> default: Checking if box 'Naturalis/supersmart' is up to date...
==> default: Setting the name of the VM: supersmart-vagrant_default_1428676504351_12521
==> default: Clearing any previously set network interfaces...
==> default: Preparing network interfaces based on configuration...
    default: Adapter 1: nat
==> default: Forwarding ports...
    default: 22 => 2222 (adapter 1)
==> default: Booting VM...
==> default: Waiting for machine to boot. This may take a few minutes...
    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
    default: Warning: Connection timeout. Retrying...
    default: Warning: Remote connection disconnect. Retrying...
==> default: Machine booted and ready!
==> default: Checking for guest additions in VM...
==> default: Mounting shared folders...
    default: /vagrant => /Users/rutger.vos/Documents/local-projects/supersmart-vagrant
> vagrant ssh
Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.16.0-30-generic x86_64)
 * Documentation: https://help.ubuntu.com/
Last login: Tue Mar 31 07:21:40 2015 from 10.0.2.2
Please note: the vagrant up step will download a large amount of data (~20 GB). This is a one-time operation. Ensure that you have a fast internet connection and enough drive space for this. The vagrant ssh command will log you into the SUPERSMART environment, which you will recognize by the different-looking command prompt:
SUPERSMART v0.1.22 - aadc17a3b0 ~ $
You are now inside the SUPERSMART environment and can start running analyses!
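Before issuing the three commands, it can be worth confirming that both tools are installed and that enough disk space is free. The following is a minimal sketch and not part of SUPERSMART: the function names are hypothetical, it assumes that VirtualBox exposes its VBoxManage command and Vagrant its vagrant command on the PATH, and the 20 GB threshold simply mirrors the approximate download size mentioned above.

```shell
#!/bin/sh
# Pre-flight check for the VM-based installation (illustrative sketch).
# Assumptions: `VBoxManage` (VirtualBox) and `vagrant` are on the PATH;
# 20 GB mirrors the approximate download size, it is not an official limit.

# Print the name of every listed tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Succeed if the file system holding directory $1 has at least $2 GB free.
has_free_gb() {
    # df -P gives a stable column layout; -k reports 1024-byte blocks.
    free_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# Report on the prerequisites; returns non-zero if something is missing.
preflight() {
    missing=$(missing_tools VBoxManage vagrant)
    if [ -n "$missing" ]; then
        echo "Missing tools:" $missing
        return 1
    fi
    if ! has_free_gb . 20; then
        echo "Less than 20 GB free in $(pwd)"
        return 1
    fi
    echo "Ready: run 'vagrant init Naturalis/supersmart', 'vagrant up', 'vagrant ssh'"
}
```

Calling preflight in the directory where you intend to run vagrant init prints either what is missing or the three commands to run next.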
Running
The SUPERSMART pipeline consists of a number of different steps that can be chained together by issuing commands in a terminal window, or by using scripting tools such as shell scripts, Makefiles, and so on. A full analysis run-through is provided on the wiki.
The smrt command
When logged into the SUPERSMART environment, steps of the pipeline can be executed by issuing the smrt command in the terminal with the appropriate global options, subcommands, and their arguments, as follows:
smrt [OPTIONS] [commands|help] [SUBCOMMAND] [ARGUMENTS]
Examples of different ways in which smrt can be used:
- smrt
- Prints brief, overall help message and returns to terminal.
- smrt --version
- Prints version number and returns to terminal.
- smrt commands
- Lists all available subcommands and returns to terminal.
- smrt help [SUBCOMMAND]
- Prints help message about the specified subcommand and returns to terminal.
- smrt [SUBCOMMAND] [ARGUMENTS]
- Runs the specified subcommand with the provided arguments.
- smrt [OPTIONS] [SUBCOMMAND] [ARGUMENTS]
- Runs the specified subcommand with the provided arguments and with alternative global options in effect. The following global options are available:
- --workdir=<location>
- Specifies where data files are written to and read from. By default this is the current working directory.
- --verbose
- Sets the global verbosity, can be repeated multiple times. By default, verbosity is set such that warning and error messages are shown, but informational and debugging messages are hidden.
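The calling convention above can be sketched as a small helper that composes a command line from the two global options. This is an illustration only: the function name smrt_cmdline and the working directory in the example are hypothetical, and the helper merely prints the command rather than running it. It uses only the options and subcommands documented above.

```shell
#!/bin/sh
# Compose a smrt command line from the documented global options.
# Usage: smrt_cmdline WORKDIR VERBOSITY SUBCOMMAND [ARGUMENTS...]
# VERBOSITY is the number of times --verbose is repeated (0 = default).
smrt_cmdline() {
    workdir=$1
    verbosity=$2
    shift 2
    line="smrt --workdir=$workdir"
    i=0
    while [ "$i" -lt "$verbosity" ]; do
        line="$line --verbose"
        i=$((i + 1))
    done
    # Everything that remains is the subcommand and its arguments.
    printf '%s %s\n' "$line" "$*"
}

# Hypothetical working directory; the command is printed, not executed.
smrt_cmdline /home/vagrant/analysis 2 help taxize
# prints: smrt --workdir=/home/vagrant/analysis --verbose --verbose help taxize
```

Printing instead of executing makes it easy to inspect the command before running it inside the SUPERSMART environment.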
Available subcommands
The following subcommands, which represent steps of the pipeline, are available. For detailed usage, consult their respective usage messages by issuing smrt help [SUBCOMMAND].
- taxize
- Performs taxonomic name resolution by mapping a provided list of input names onto the NCBI taxonomy. Optionally expands the higher input taxa to the specified lower taxon, e.g. to expand a named Order to its constituent species. Produces a table that lists all resolved (and expanded) taxa and their higher classification.
- align
- Given an input table of resolved taxa, performs multiple sequence alignment of all potentially useful PhyLoTA clusters for the taxa of interest. Produces a list of aligned candidate clusters. The alignment method can be configured; effective methods include those provided by muscle and mafft.
- orthologize
- Given a list of aligned candidate clusters, assigns orthology among the clusters by performing reciprocal blast searches on the seed sequences around which the clusters were assembled. Produces a list of re-aligned superclusters.
- bbmerge
- Given a list of superclusters and an input table of resolved taxa, creates a supermatrix that can be used to infer a backbone phylogeny. The supermatrix contains up to two distal exemplar species from each genus and an optimized selection of superclusters that minimizes sparseness of the supermatrix while attaining a user-specified minimum number of distinct markers for each taxon and omitting any alignments that exceed a user-specified maximum amount of sequence divergence.
- bbinfer
- Given a supermatrix and a starting tree, infers a backbone phylogeny. This backbone is typically a tree that reconstructs the relationships among a large number of genera (represented by up to two exemplar species), so this step employs large-scale phylogenetic inference methods such as those provided by ExaML, RAxML or ExaBayes. Bootstrap analysis is supported for non-Bayesian tree inference.
- bbreroot
- Given a backbone phylogeny and a table of resolved taxa, reroots the phylogeny to approximate the rooting implied by the classification hierarchy of the provided, resolved taxa. Rerooting at user-specified outgroup taxa is supported. Also works on sets of trees (e.g. bootstrap replicates).
- bbcalibrate
- Given a rooted molecular backbone phylogeny and a table of fossils, performs tree calibration using treePL. Produces an ultrametric tree with branch lengths proportional to evolutionary time (i.e. a "chronogram"). Also works on sets of trees (e.g. bootstrap replicates).
- consense
- Given a file containing a set of trees (bootstrap replicates or a Bayesian posterior set), calculates a consensus tree annotated with bootstrap support or posterior support values, respectively.
- bbdecompose
- Given a rooted backbone phylogeny, a list of superclusters and a table of resolved taxa, decomposes the backbone into its constituent, most recent, monophyletic clades, expands these clades into all taxa from the provided table and assembles sets of alignments for each clade that can be used for further tree inference within that clade.
- clademerge
- For each decomposed clade, merges the set of alignments assembled for this clade into an input file for tree inference.
- cladeinfer
- For each decomposed clade, infers the species tree. This is done using the multispecies, multilocus coalescent as implemented in *BEAST on the basis of the merged set of alignments for that clade.
- cladegraft
- Grafts the inferred clade trees on the backbone chronogram.
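Because each step consumes what an earlier step produces, it can help to keep the sequence of subcommands in one place when scripting a run. The sketch below assumes that the order in which the subcommands are listed above is also the order in which they are executed. The function names are hypothetical, and only the documented smrt help [SUBCOMMAND] form is used, since the arguments of each subcommand are not covered here.

```shell
#!/bin/sh
# The subcommands, one per line, in the order they are listed above
# (assumed here to be the order of execution).
pipeline_steps() {
    printf '%s\n' taxize align orthologize \
        bbmerge bbinfer bbreroot bbcalibrate consense \
        bbdecompose clademerge cladeinfer cladegraft
}

# Print the step that follows step $1 (prints nothing after the last step).
next_step() {
    pipeline_steps | awk -v s="$1" 'prev == s { print; exit } { prev = $0 }'
}

# Print the documented help command for every step, numbered in order.
print_help_commands() {
    n=0
    for step in $(pipeline_steps); do
        n=$((n + 1))
        printf '%2d. smrt help %s\n' "$n" "$step"
    done
}
```

Running print_help_commands yields a numbered checklist of help commands, which is a convenient starting point for looking up the arguments of each step before chaining them in a shell script or Makefile.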
Advanced topics
Installing on HPC systems
To take advantage of the infrastructure provided by HPC clusters it is generally best to compile key components of the SUPERSMART pipeline from source code. To simplify this process we provide a "provisioning" script. This script is written for puppet, which consequently must be available on your system. The script compiles and installs all dependencies, data, and code that SUPERSMART needs. At present this is done in system-wide locations, so the commands below must be executed by a privileged user*:
$ git clone https://github.com/naturalis/supersmart.git
$ cd supersmart/conf
$ puppet apply
The provisioning that commences when the puppet apply command is issued can take up to several hours to complete.
(*[...] apt-get. In addition, the installer expects there to be a user called 'vagrant' under whose account various tools will be downloaded, compiled and installed. Updating this user name at the top of the puppet manifest should allow you to install these tools under some other account. Ask a system administrator for help if you don't know how to change this or what the consequences might be.)
Updating SUPERSMART
The SUPERSMART pipeline is continuously under development. To ensure that you are always using the latest functionality, the VM updates itself every time you log in (vagrant ssh). The exact version that you are using is shown in the command prompt:
SUPERSMART v0.1.22 - aadc17a3b0 ~ $
Note the identifier that follows the version number ('aadc17a3b0'). If you encounter any issues that you wish to report, it is a good idea to mention this version, including the identifier, so that the developers know exactly what you're working with.
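When writing a bug report, the two parts of the prompt can be pulled apart mechanically. The following sketch assumes that the prompt always has the shape shown above (a release number prefixed with "v", a hyphen, then a hexadecimal identifier); the function name describe_prompt is hypothetical.

```shell
#!/bin/sh
# Extract the release number and the identifier from a SUPERSMART prompt.
# Assumption: the prompt looks like "SUPERSMART v<release> - <hex id> ...".
describe_prompt() {
    printf '%s\n' "$1" |
        sed -n 's/^SUPERSMART v\([0-9][0-9.]*\) - \([0-9a-f][0-9a-f]*\) .*$/release=\1 id=\2/p'
}

describe_prompt 'SUPERSMART v0.1.22 - aadc17a3b0 ~ $'
# prints: release=0.1.22 id=aadc17a3b0
```

A string that does not match the assumed shape produces no output, so an empty result signals that the prompt format differs from the one shown here.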
Advanced configuration
The SUPERSMART pipeline is configured using the file $SUPERSMART_HOME/conf/supersmart.ini. For convenience, SUPERSMART provides the system-wide command smrt-config, which opens the configuration file in the system's default editor. Here are some of the variables in this file that can be configured by users:
- EXAML_MODEL
- The substitution model used when inferring the backbone phylogeny. Possible values are GAMMA for a gamma model of rate heterogeneity with 4 discrete rates, or PSR for a per-site rate category model.
- TREEPL_SMOOTH
- Smoothing factor for the penalized likelihood tree calibration.
- BURNIN
- Fraction of burn-in to omit for the clade inference step.
- MSA_TOOL
- Multiple sequence alignment tool to use. Wrappers are available for a variety of possible aligners (which can be specified here as: mafft, clustalw, kalign, muscle, probalign, probcons, tcoffee, amap) but the VM comes bundled with mafft and muscle.
- ALN_MERGE
- Specifies the method to use for assigning orthology among aligned clusters. The value 'blast' refers to reciprocal BLAST searches, and is most tested. Other, experimental values are 'inparanoid', where orthology is assigned transitively from orthology assessments produced by the InParanoid project (for coding regions), or 'feature', which depends on standardized locus names in sequence annotations.
- MERGE_OVERLAP
- Specifies the minimum amount of reciprocal overlap among orthologous clusters for them to be merged into a supercluster.
- BACKBONE_INFERENCE_TOOL
- Specifies the tool used for inferring the backbone phylogeny. Possible values are examl or exabayes.
- BACKBONE_MAX_DISTANCE
- Maximum amount of average pairwise sequence distance that the pipeline accepts for superclusters to be included in the backbone inference step.
- BACKBONE_MIN_COVERAGE
- Minimum number of loci that must be assembled for each taxon in the supermatrix.
- CLADE_MAX_DISTANCE
- Maximum amount of average pairwise sequence distance that the pipeline accepts for superclusters to be included in the clade refinement step.
- CLADE_MIN_DENSITY
- Minimum fraction of clade members (taxa) that must be sequenced in an alignment for it to be included in the clade refinement step.
- FOSSIL_BEST_PRACTICE_CUTOFF
- Defines the minimum 'best practice score' for records from the fossil table to be used as calibration points in the tree calibration.
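To check the current value of one of these variables without opening an editor, a lookup along the following lines may be useful. It is a sketch under a stated assumption: that the configuration file stores each setting on its own line as NAME=value, which is not confirmed here. The function name config_value is hypothetical.

```shell
#!/bin/sh
# Look up one variable in a configuration file.
# Assumption: settings are stored one per line as NAME=value
# (optional spaces around the "=" are tolerated).
# Usage: config_value NAME FILE
config_value() {
    awk -F= -v key="$1" '
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
        name == key {
            # Keep everything after the first "=" and trim the whitespace.
            value = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            print value
            exit
        }
    ' "$2"
}
```

For example, config_value EXAML_MODEL "$SUPERSMART_HOME/conf/supersmart.ini" would print the configured substitution model if the file follows the assumed layout, and nothing if the variable is absent.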
Resources
Installation prerequisites
To install the SUPERSMART pipeline as a virtual machine: VirtualBox and Vagrant.
To install the pipeline on HPC infrastructure: puppet, plus git to fetch the source code.
Analysis tools used internally
The following analysis tools are used by the SUPERSMART pipeline. If you use results obtained by the pipeline in a scholarly publication, please cite these tools as well. Listed are the tools that the pipeline uses by default; if you alter these (by editing the configuration file), update your cited references accordingly:
- Taxonomic name resolution is performed using TaxoSaurus
- The sequence data have been pre-processed using the PhyLoTA pipeline.
- The default multiple sequence alignment tool is mafft
- The default profile alignment tool is muscle
- The default method for assessing orthology uses BLAST
- The default inference tool for the backbone phylogeny is ExaML
- The default inference tool for the clade phylogenies is BEAST
- The default tool used for tree calibration is treePL
- Data transformation and file conversion is performed using Bio::Phylo and BioPerl
Getting help
- The Wiki has detailed installation instructions, an analysis run-through and a growing FAQ
- The SUPERSMART-users mailing list is used by developers and users to ask questions, discuss problems, and so on.
- The SUPERSMART-dev mailing list is for developers (and those interested in the arcana) to discuss implementation details of the pipeline.
- The bug tracker is used for managing progress on specific bugs, feature requests and enhancements.
Citing SUPERSMART
SUPERSMART has been published in Systematic Biology and can be cited as follows:
Antonelli A., Hettling H., Condamine F.L., Vos K., Nilsson R.H., Sanderson M.J., Sauquet H., Scharn R., Silvestro D., Töpel M., Bacon C.D., Oxelman B., Vos R.A. 2016. Toward a Self-Updating Platform for Estimating Rates of Speciation and Migration, Ages, and Relationships of Taxa. Systematic Biology, first published online September 10, 2016 [doi:10.1093/sysbio/syw066]
Relevant literature
Altschul S. F., Gish W., Miller W., Myers E. W. and D. J. Lipman 1990. Basic local alignment search tool Journal of Molecular Biology 215: 403-410 [doi:10.1016/S0022-2836(05)80360-2]
Drummond A. J., Suchard M. A., Xie D. and A. Rambaut 2012. Bayesian phylogenetics with BEAUti and the BEAST 1.7 Molecular Biology And Evolution 29: 1969-1973 [doi:10.1093/molbev/mss075]
Edgar, R.C. 2004. MUSCLE: multiple sequence alignment with high accuracy and high throughput Nucleic Acids Research 32:1792-1797 [doi:10.1093/nar/gkh340]
Katoh K. and D. M. Standley 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability Molecular Biology and Evolution 30:772-780 [doi:10.1093/molbev/mst010]
Sanderson, M. J., D. Boss, D. Chen, K. A. Cranston, and A. Wehe. 2008. The PhyLoTA Browser: Processing GenBank for molecular phylogenetics research. Systematic Biology 57:335-346. [doi:10.1080/10635150802158688]
Smith, S. A. and B. C. O'Meara 2012. treePL: divergence time estimation using penalized likelihood for large phylogenies. Bioinformatics 28: 2689-2690. [doi:10.1093/bioinformatics/bts492]
Stamatakis, A. and A. J. Aberer 2013. Novel Parallelization Schemes for Large-Scale Likelihood-based Phylogenetic Inference. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium 1195-1204 [doi:10.1109/IPDPS.2013.70]
Stoltzfus A., Lapp H., Matasci N., Deus H., Sidlauskas B., Zmasek C. M., Vaidya G., Pontelli E., Cranston K., Vos R. A., Webb C. O., Harmon L. J., Pirrung M., O'Meara B. C., Pennell M. W., Mirarab S., Rosenberg M. S., Balhoff J. P., Bik H. M., Heath T., Midford P., Brown J. W., McTavish E. J., Sukumaran J., Westneat M., Alfaro M. E. and A Steele. 2013. Phylotastic! Making Tree-of-Life Knowledge Accessible, Re-usable and Convenient. BMC Bioinformatics 14:158 [doi:10.1186/1471-2105-14-158]
Vos R. A., Biserkov J. V., Balech B., Beard N., Blissett M., Brenninkmeijer C., van Dooren T., Eades D., Gosline G., Groom Q. J., Hamann T. D., Hettling H., Hoehndorf R., Holleman A., Hovenkamp P., Kelbert P., King D., Kirkup D., Lammers Y., DeMeulemeester T., Mietchen D., Miller J. A., Mounce R., Nicolson N., Page R. D. M., Pawlik A., Pereira S., Penev L., Richards K., Sautter G., Shorthouse D. P., Tähtinen M., Weiland C., Williams A. R. and S. Sierra. 2014. Enriched biodiversity data as a resource and service. Biodiversity Data Journal 2:e1125 [doi:10.3897/BDJ.2.e1125]
People
- Alexandre Antonelli
- Rutger Vos
- Hannes Hettling
- Mike Sanderson
- Bengt Oxelman
- Karin Nilsson
- Mats Töpel
- Hervé Sauquet
- Henrik Nilsson
- Daniele Silvestro
- Fabien Condamine
- Ruud Scharn