- sequence pre-processing, e.g. de-replication, filtering low-quality and overly
short reads, trimming low-quality ends, merging pairs. This is typically done with
generic HTS tools.
- demultiplexing, e.g. to split by sampling locations and/or times. This is done, for
example, with QIIME, OBITools
or mothur
- clustering using tools such as CD-HIT,
UCLUST, or
OCTUPUS
- outlier detection, e.g. chimeric sequences or singletons
- taxonomic assignment, e.g. by BLAST
searches against reference databases, or with usearch
or vsearch
- phylogenetic analysis, e.g. phylogenetic placement, computation of diversity metrics
- comparing treatments, e.g. by rarefaction of OTU tables
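As a rough illustration of the last step, rarefaction can be sketched as drawing the same number of reads, without replacement, from every sample so that samples of different sequencing depth become comparable. The OTU table, sample names, and function names below are hypothetical and only serve the example; dedicated pipelines provide their own implementations.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample one sample's OTU counts to `depth` reads without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError(f"cannot rarefy {total} reads to depth {depth}")
    # expand the counts into one entry per read, then draw `depth` of them
    reads = [otu for otu, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    rarefied = {otu: 0 for otu in counts}
    for otu in rng.sample(reads, depth):
        rarefied[otu] += 1
    return rarefied


def rarefy_table(otu_table, seed=None):
    """Rarefy every sample to the depth of the shallowest sample."""
    depth = min(sum(counts.values()) for counts in otu_table.values())
    return {sample: rarefy(counts, depth, seed)
            for sample, counts in otu_table.items()}


# hypothetical OTU table: sample -> {OTU -> read count}
otu_table = {
    "sample_A": {"OTU1": 50, "OTU2": 30, "OTU3": 20},
    "sample_B": {"OTU1": 180, "OTU2": 15, "OTU3": 5},
}
for sample, counts in rarefy_table(otu_table, seed=1).items():
    print(sample, counts)
```

Rarefying to the shallowest sample discards reads from deeper samples, so in practice the depth is often chosen by inspecting rarefaction curves first.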
Species identification of gut contents of permafrost grazers
- ancient DNA sequencing of the gut contents of permafrost grazers
- two chloroplast markers (rbcL and trnL-trnF) amplified with forward and reverse
primers
- findings corroborated with morphological analysis of macroremains and pollen
Analysis workflow:
- demultiplex on IonTorrent adaptors; Phred quality (Q20) and length (>=100bp) filter
- cluster reads with CD-HIT
- BLAST against NCBI nr
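The quality and length filter in this workflow could look roughly like the sketch below. It assumes FASTQ input with Phred+33 quality encoding and interprets "Q20" as a minimum mean quality per read; the studies may have applied the threshold differently (for example per base or by trimming), and the function names are made up for this example.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def parse_fastq(lines):
    """Yield (name, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # the '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual


def filter_reads(lines, min_q=20, min_len=100):
    """Keep reads that are long enough and have a high enough mean quality."""
    for name, seq, qual in parse_fastq(lines):
        if len(seq) >= min_len and mean_phred(qual) >= min_q:
            yield name, seq, qual


# tiny made-up input: one good read, one too short, one of low quality
example = [
    "@good", "A" * 100, "+", "I" * 100,
    "@short", "ACGT", "+", "IIII",
    "@lowq", "C" * 100, "+", "#" * 100,
]
for name, seq, qual in filter_reads(example):
    print(name, len(seq), mean_phred(qual))
```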
Mid-Holocene horse
B Gravendeel, A Protopopov, I Bull, E Duijm, F Gill, A Nieman, N Rudaya, A N Tikhonov,
S Trofimova, GBA van Reenen, R Vos, S Zhilich & B van Geel. 2014. Multiproxy study of
the last meal of a mid-Holocene Oyogos Yar horse, Sakha Republic, Russia.
The Holocene 24(10): 1288-1296
doi:10.1177/0959683614540953
- c. 5,400 years ago, Oyogos Yar, Russia
- Pollen grains and the aDNA record give information about taxa that occurred in the
landscape.
- The combined data point to an open landscape of a coastal tundra dominated by
graminoids (Poaceae, Cyperaceae) with a limited amount of birch and alder.
Early Holocene Yakutian bison
B van Geel, A Protopopov, I Bull, E Duijm, F Gill, Y Lammers, A Nieman, N Rudaya,
S Trofimova, A N Tikhonov, R Vos, S Zhilich, B Gravendeel. 2014. Multiproxy diet
analysis of the last meal of an early Holocene Yakutian bison.
Journal of Quaternary Science 29(3): 261-268
doi:10.1002/jqs.2698
- c. 10,500 years ago, Chuckchalakh lake, Russia
- Remains of shrubs (Alnus, Betula, Salix) and Poaceae indicate that the animal
probably lived in a landscape of predominantly dry soils, intermixed with wetlands
containing herbaceous plant species, as indicated by remains of Comarum palustre,
Caltha palustris, Eriophorum, Sparganium, Menyanthes trifoliata and
Utricularia.
- All recorded taxa still occur in the present-day Yakutian tundra vegetation.
Joining CITES listing with species detection in organic mixtures
Y Lammers, T Peelen, R A Vos & B Gravendeel. 2014. The HTS barcode checker pipeline,
a tool for automated detection of illegally traded species from high-throughput
sequencing data. BMC Bioinformatics 15:44
doi:10.1186/1471-2105-15-44
First, species from the CITES appendices are joined with the
species in the NCBI taxonomy using the
GlobalNames taxonomic name resolution service.
A simple request to this service would be:
curl -o globalnames.json 'http://resolver.globalnames.org/name_resolvers.json?names=Homo+sapiens'
This results in a large JSON file. In a script, you might process
such data as follows:
# doing a single request to the globalnames service
# usage: tnrs.py 'Homo sapiens'
import sys
import requests

url = 'http://resolver.globalnames.org/name_resolvers.json'
try:
    response = requests.get(url, params={'names': sys.argv[1]}, allow_redirects=True)
    response.raise_for_status()
    result = response.json()
except (requests.RequestException, ValueError):
    # network failure or a non-JSON reply: treat as "no matches"
    result = {}

# each item in 'data' is one submitted name; each of its 'results' is a
# match in one data source, so print the taxon ID of every NCBI match
for data_dict in result.get('data', []):
    for results_dict in data_dict.get('results', []):
        if results_dict.get('data_source_title') == 'NCBI':
            print(results_dict['taxon_id'])
Using this logic, a file-based database is populated:
Querying the CITES-annotated reference database
HM Bik, KM Halanych, J Sharma & WK Thomas. 2012. Dramatic Shifts in Benthic Microbial
Eukaryote Communities following the Deepwater Horizon Oil Spill. PLoS ONE
7(6): e38550
doi:10.1371/journal.pone.0038550
Deepwater Horizon sampling design
This study used 454 data processed with the QIIME pipeline. The reads were assumed to be
structured according to the following primer and amplicon construct:
The data follow this experimental design:
- sampled at two points in time (pre- and post-spill)
- in 7 localities (Bayfront Park, Shellfish Lab, Ryan Ct, Cadillac Ave, Dauphin Bay,
Belleair Blvd, Grand Isle), mostly in the vicinity of
Mobile, Alabama
- sequencing two markers (regions of the 18S gene) with two
primer pairs (F04/R22 and NF1/18Sr2b, respectively)
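A design like this implies that every read has to be assigned to a combination of time point, locality, and marker. The sketch below shows the general idea of demultiplexing under the assumption that each read starts with a fixed-length sample barcode. The barcode sequences and the function name are invented for illustration and are not taken from the study's mapping file.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Bin reads by their leading barcode and strip the barcode off."""
    lengths = {len(barcode) for barcode in barcode_to_sample}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of equal length")
    (k,) = lengths
    bins = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:k])
        if sample is None:
            bins["unassigned"].append(read)  # unknown barcode: keep read as-is
        else:
            bins[sample].append(read[k:])
    return dict(bins)


# hypothetical barcodes -> (time point, locality, primer pair)
barcode_to_sample = {
    "ACGTACGT": ("pre-spill", "Grand Isle", "F04/R22"),
    "TGCATGCA": ("post-spill", "Grand Isle", "F04/R22"),
    "GATCGATC": ("post-spill", "Dauphin Bay", "NF1/18Sr2b"),
}
reads = ["ACGTACGTTTGACC", "TGCATGCAGGATCA", "NNNNNNNNACGTAC"]
for sample, seqs in demultiplex(reads, barcode_to_sample).items():
    print(sample, seqs)
```

Real demultiplexers typically also tolerate a small number of barcode mismatches and check for the primer sequence after the barcode; both are left out here for brevity.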
Oil spill impact: dramatic shifts in benthic microbial eukaryote communities
Accordingly, the reads were demultiplexed following
this complex mapping. The reads were then clustered with
UCLUST and denoised. Finally,
taxonomic identification of each cluster was performed using MegaBLAST, resulting in a
sample-by-taxon table, which can alternatively be visualized as follows:
Phylogenetic diversity
To quantify the turnover between sites and treatments, it is useful to compute metrics
of phylogenetic β diversity, such as UniFrac.
Squares, triangles, and circles denote sequences derived from different communities.
Branches attached to nodes are colored black if they are unique to a particular
environment and gray if they are shared.
- A Tree representing phylogenetically similar communities, where a significant
fraction of the branch length in the tree is shared (gray).
- B Tree representing two communities that are maximally different so that 100% of
the branch length is unique to either the circle or square environment.
- C Using the UniFrac metric to determine if the circle and square communities are
significantly different. For n replicates (r), the environment assignments of the
sequences were randomized, and the fraction of unique (black) branch lengths was
calculated. The reported P value is the fraction of random trees that have at least
as much unique branch length as the true tree (arrow). If this P value is below a
defined threshold, the samples are considered to be significantly different.
- D The UniFrac metric can be calculated for all pairwise combinations of
environments in a tree to make a distance matrix. This matrix can be used with standard
multivariate statistical techniques such as UPGMA and principal coordinate analysis to
compare the biotas in the environments.
- A common approach is to align ribosomal RNA against the SILVA
reference database (in this case using PyNAST) and
infer a tree (e.g. with FastTree)
- This tree then becomes the input tree for the UniFrac distance calculations (β
diversity), resulting in a distance matrix
- From the distance matrix, a tree can be clustered that, in this case, shows the similarity
among the post-spill sites
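To make the unweighted UniFrac idea concrete, the sketch below computes the fraction of branch length that is unique to one of two environments, plus the randomization test described above. It is a simplified illustration, not the implementation used by QIIME: the tree is given as a hypothetical list of (branch length, descendant tips) pairs, and the toy tree and environment labels are invented.

```python
import random


def unifrac(branches, env_of_tip):
    """Fraction of branch length leading to tips from only one environment."""
    unique = total = 0.0
    for length, tips in branches:
        envs = {env_of_tip[tip] for tip in tips if tip in env_of_tip}
        if not envs:
            continue  # branch leads to no sampled tip
        total += length
        if len(envs) == 1:
            unique += length
    return unique / total


def permutation_p(branches, env_of_tip, n_perm=1000, seed=None):
    """Fraction of label permutations with at least the observed distance."""
    rng = random.Random(seed)
    observed = unifrac(branches, env_of_tip)
    tips = sorted(env_of_tip)
    labels = [env_of_tip[tip] for tip in tips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # randomize environment assignments
        if unifrac(branches, dict(zip(tips, labels))) >= observed - 1e-12:
            hits += 1
    return hits / n_perm


# toy tree ((a:1,b:1):1,(c:1,d:1):1) as (branch length, descendant tips)
branches = [
    (1.0, {"a"}), (1.0, {"b"}), (1.0, {"c"}), (1.0, {"d"}),
    (1.0, {"a", "b"}), (1.0, {"c", "d"}),
]
separated = {"a": "circle", "b": "circle", "c": "square", "d": "square"}
mixed = {"a": "circle", "b": "square", "c": "circle", "d": "square"}
print(unifrac(branches, separated), unifrac(branches, mixed))
print(permutation_p(branches, separated, seed=1))
```

Computing `unifrac` for every pair of environments would fill the distance matrix that the clustering and ordination steps start from.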
Principal Coordinate Analysis
- Because the UniFrac distance matrix is multidimensional, methods that reduce
dimensionality (e.g. to 3D), such as PCoA, are used to explore and visualize similarities
or dissimilarities in the data. PCoA starts with a similarity matrix or dissimilarity
matrix (= distance matrix) and assigns each item a location in a low-dimensional space,
e.g. as a 3D graphic
- The 2D visualization broadly shows the same as the environment clustering: post-spill
sites are similar to one another.
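The core of PCoA can be sketched in a few steps: square the distances, double-centre the matrix, and use its leading eigenvectors, scaled by the square root of their eigenvalues, as coordinates. The pure-Python version below uses simple power iteration to stay dependency-free. It is meant as an illustration under the assumption of a symmetric distance matrix whose dominant eigenvalues are positive; real analyses would normally rely on a numerical library or on the pipeline's own PCoA.

```python
import math


def pcoa(dist, n_axes=2, n_iter=500):
    """Return n_axes principal coordinates for a symmetric distance matrix."""
    n = len(dist)
    # Gower centring of the squared distances
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in a]
    grand_mean = sum(row_mean) / n
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]

    def times_b(vec):
        return [sum(b[i][j] * vec[j] for j in range(n)) for i in range(n)]

    coords = [[0.0] * n_axes for _ in range(n)]
    for axis in range(n_axes):
        # power iteration from a fixed, arbitrary start vector
        vec = [float(i + 1) for i in range(n)]
        for _ in range(n_iter):
            new = times_b(vec)
            norm = math.sqrt(sum(x * x for x in new))
            if norm < 1e-12:
                break
            vec = [x / norm for x in new]
        length = math.sqrt(sum(x * x for x in vec))
        vec = [x / length for x in vec]
        eigval = sum(v * bv for v, bv in zip(vec, times_b(vec)))
        if eigval <= 1e-9:
            break  # no further positive axes; remaining coordinates stay 0
        for i in range(n):
            coords[i][axis] = vec[i] * math.sqrt(eigval)
        # deflate: remove the axis just found before searching for the next
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * vec[i] * vec[j]
    return coords


# made-up distances between four sites (corners of a 3 x 4 rectangle)
dist = [
    [0, 3, 4, 5],
    [3, 0, 5, 4],
    [4, 5, 0, 3],
    [5, 4, 3, 0],
]
for site, xy in zip(["site1", "site2", "site3", "site4"], pcoa(dist)):
    print(site, [round(v, 3) for v in xy])
```

The sign of each axis is arbitrary, so plots from different tools may appear mirrored while showing the same configuration.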
The authors claim that the pre-spill sites were more diverse. This is not very obvious
from the 2D plot, but perhaps it is clearer in an interactive 3D view.
The supplementary data with the paper have
PCoA results in *.kin files next to folders that have a king.jar
file in them. Use
this to view some of the *.kin files. Were the pre-spill sites more diverse?