mebioda

Species delimitation with DNA barcodes

Species concepts and delimitation intro

The following slide show provides an overview of species concepts and the application of species delimitation techniques to natural history collection specimens:

Species delimitation - species limits and character evolution

Barcode Index Number (BIN)

Ratnasingham S & Hebert PDN 2013. A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLoS ONE 8(7): e66213 doi:10.1371/journal.pone.0066213

BIN divergence thresholds

Correspondence between species present in eight datasets and OTUs recognized through single linkage clustering with sequence divergence thresholds ranging from 0.1–6.0%.
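
To make the thresholding in this comparison concrete, here is a minimal Python sketch of single linkage clustering at a fixed divergence threshold. It assumes uncorrected p-distances over aligned sequences. The function names (`p_distance`, `single_linkage`) and the toy sequences are hypothetical and made up for illustration; this is not the BIN pipeline itself.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites, skipping positions with gaps or ambiguity codes."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def single_linkage(seqs, threshold):
    """Cluster sequences: any pair at or below the threshold ends up in one OTU."""
    parent = {name: name for name in seqs}

    def find(name):
        # Follow parent links to the cluster representative (union-find).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if p_distance(seqs[a], seqs[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy aligned sequences (made up for illustration, 20 sites each).
toy = {
    "s1": "ACGTACGTACGTACGTACGT",
    "s2": "ACGTACGTACGTACGTACGA",
    "s3": "ACGTACGTACGTACGTACAA",
    "s4": "TTTTACGTACGTACGTTTTT",
}

for threshold in (0.01, 0.06, 0.40):
    print(threshold, single_linkage(toy, threshold))
```

At the 0.06 threshold, `s1` and `s3` land in the same OTU although they differ at 10% of sites, because each is within 5% of `s2`. This chaining is characteristic of single linkage and is one reason the number of OTUs depends so strongly on the threshold chosen.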

BIN pipeline

Automatic Barcode Gap Discovery (ABGD)

Puillandre N, Lambert A, Brouillet S & Achaz G 2012. ABGD, Automatic Barcode Gap Discovery for primary species delimitation. Molecular Ecology 21(8): 1864–1877 doi:10.1111/j.1365-294X.2011.05239.x

The ABGD command line tool

$ curl -L -O http://wwwabi.snv.jussieu.fr/public/abgd/last.tgz
$ tar xzvf last.tgz
$ cd Abgd
$ make
$ export PATH="${PATH}:$(pwd)"

We should now have an executable called abgd on the $PATH. This accepts aligned FASTA as input, so let’s analyze one of the files we have:

# inside w1d3 folder
$ mkdir Danaus_ABGD
$ abgd -o Danaus_ABGD -a Danaus.mafft.fas

The resulting files in Danaus_ABGD show the barcode gap inflection point.
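
As a rough intuition for what a barcode gap is, the Python sketch below sorts a list of pairwise distances and reports the widest empty interval between consecutive values. This is only an illustration under simplifying assumptions: ABGD itself infers the gap recursively from a prior on intraspecific divergence, which is not reproduced here. The function name `widest_gap` and the distances are hypothetical.

```python
def widest_gap(distances):
    """Return the (lower, upper) distances that bound the widest empty interval."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        raise ValueError("need at least two distances")
    # Compare each distance with its successor and keep the widest step.
    width, lower, upper = max(
        (b - a, a, b) for a, b in zip(ordered, ordered[1:])
    )
    return lower, upper


# Made-up pairwise distances: small within-species values,
# larger between-species values.
toy_distances = [0.001, 0.002, 0.004, 0.005, 0.081, 0.090, 0.100]
lower, upper = widest_gap(toy_distances)
print(f"widest gap lies between {lower} and {upper}")
```

Any threshold placed inside the reported interval separates the small (putatively intraspecific) distances from the large (putatively interspecific) ones in this toy data.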

Generalized Mixed Yule Coalescent (GMYC)

Fujisawa T & Barraclough TG 2013. Delimiting Species Using Single-Locus Data and the Generalized Mixed Yule Coalescent Approach: A Revised Method and Evaluation on Simulated Data Sets. Systematic Biology 62(5): 707–724 doi:10.1093/sysbio/syt033

The GMYC web service

The analysis can be performed through a web service. For the Danaus consensus tree, it results in the following clusters:

These clusters are distributed across the clades near the tips:

How many distinct taxonomic names do we have in the alignment? We can count them with a line count (wc -l):

$ grep '>' Danaus.mafft.fas | cut -f 1 -d '-' | sort | uniq | wc -l
      15

Poisson Tree Process (PTP)

Zhang J, Kapli P, Pavlidis P & Stamatakis A 2013. A general species delimitation method with applications to phylogenetic placements. Bioinformatics 29(22): 2869–2876 doi:10.1093/bioinformatics/btt499

Running the same tree as for GMYC through the bPTP web server gives a maximum likelihood estimate of 16 species, with potential for (far) greater splitting:

Accptance rate: 0.69020000000000004 
Merge: 49798 
Split: 50202 
Estimated number of species is between 14 and 135 
Mean: 78.03

Monophyly, polyphyly, paraphyly

Mutanen M et al. 2016. Species-Level Para- and Polyphyly in DNA Barcode Gene Trees: Strong Operational Bias in European Lepidoptera. Systematic Biology 65(6): 1024–1040 doi:10.1093/sysbio/syw044

So how are the putative species from BOLD actually entangled?

For each taxon:

  1. Collect all tips that belong to it
  2. Find the MRCA for the collected tips
  3. Collect all descendants of the MRCA. If this set is identical to the set from step 1, then the taxon is monophyletic and the analysis moves on to the next taxon.
  4. Collect all nodes that subtend tips from the focal taxon as well as at least one other taxon and sort these by their post-order index.
  5. Group the collected, sorted nodes into distinct root-to-tip paths. Internal nodes that are nested in each other are identified (and collected in the same group) by checking that the pre-order index of the focal node is larger, and the post-order index of the focal node is smaller than that of the next node in the sorted list. If there is more than one distinct root-to-tip path (i.e., group), the taxon is considered polyphyletic, otherwise paraphyletic.
  6. For each first (i.e. most recent) node in each group, collect all subtended species. The union of these sets across groups forms the set of entangled species.
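
The steps above can be sketched in Python as follows. This is a minimal illustration, not the web service's implementation. It assumes a hypothetical tree representation (nested tuples with tip labels as strings) and tip labels of the form taxon, separator, identifier. Steps 4 and 5 are simplified: instead of comparing pre- and post-order indices, the sketch counts the most recent mixed nodes directly, which is assumed to give one node per distinct root-to-tip path. All function names are made up for the example.

```python
def taxon_of(label, separator="-"):
    """Taxon name is the part of the tip label before the separator."""
    return label.split(separator)[0]


def tips(node):
    """All tip labels below a node; tips are strings, internal nodes are tuples."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in tips(child)]


def internal_nodes(node):
    """Yield internal nodes in post-order (descendants before ancestors)."""
    if not isinstance(node, str):
        for child in node:
            yield from internal_nodes(child)
        yield node


def is_mixed(node, taxon, separator="-"):
    """True if an internal node subtends the focal taxon and at least one other."""
    if isinstance(node, str):
        return False
    names = {taxon_of(label, separator) for label in tips(node)}
    return taxon in names and len(names) > 1


def assess(tree, separator="-"):
    """Map each taxon to (status, sorted list of entangled taxa)."""
    result = {}
    all_tips = tips(tree)
    for taxon in sorted({taxon_of(label, separator) for label in all_tips}):
        # Step 1: tips of the focal taxon.
        focal = {t for t in all_tips if taxon_of(t, separator) == taxon}
        # Steps 2-3: the MRCA is the first post-order node holding all focal
        # tips; the taxon is monophyletic if that node has no other descendants.
        mrca_tips = next(
            (set(tips(n)) for n in internal_nodes(tree) if focal <= set(tips(n))),
            focal,
        )
        if len(focal) == 1 or mrca_tips == focal:
            result[taxon] = ("monophyletic", [])
            continue
        # Steps 4-5 (simplified): keep mixed nodes that have no mixed child,
        # i.e. the most recent mixed node on each root-to-tip path.
        recent = [
            n for n in internal_nodes(tree)
            if is_mixed(n, taxon, separator)
            and not any(is_mixed(child, taxon, separator) for child in n)
        ]
        status = "polyphyletic" if len(recent) > 1 else "paraphyletic"
        # Step 6: union of the taxa subtended by those most recent nodes.
        entangled = {taxon_of(t, separator) for n in recent for t in tips(n)}
        result[taxon] = (status, sorted(entangled))
    return result


# Toy tree (made up): taxon A occurs in two separate mixed clades.
toy_tree = ((("A-1", "A-2"), "B-1"), (("A-3", "C-1"), ("D-1", "D-2")))
for name, (status, entangled) in assess(toy_tree).items():
    print(name, status, entangled)
```

In the toy tree, taxon A is reported as polyphyletic and entangled with B and C, while B, C and D are monophyletic. Following step 6 literally, the entangled set includes the focal taxon itself.
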

This procedure can be run on the Danaus consensus tree through the Monophylizer web service:

$ curl \
  -F "infile=@BEAST/Danaus.consensus.trees.nwk" \
  -F "format=newick" \
  -F "separator=-" \
  -F "astsv=true" \
  -F "cgi=true" \
  http://monophylizer.naturalis.nl/cgi-bin/monophylizer.pl > Danaus.monophyly.tsv

This produces a tab-separated spreadsheet that identifies the exact matches (i.e. monophyletic taxa) and the cases where there is entanglement among species (i.e. paraphyletic or polyphyletic taxa).