Using the ARISE barcode metadata DB

1. Loading data

This section describes how to populate the DB with data from various sources. The steps are as follows:

Initializing a new database
Loading the NSR taxonomic topology
Loading the NSR synonyms table
Fetching BOLD metadata
Indexing the topology for faster queries
Loading the Geneious metadata table

1.1 Initializing a new database

To generate a new, empty SQLite database file that implements the schema implied by the object-relational mapping in arise.barcode.metadata.orm.*, run:

$ arise_create_barcode_metadata_db.py -outfile arise-barcode-metadata.db
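To check that the file was created with a schema in it, the tables can be listed with Python's standard sqlite3 module. This is a minimal sketch and not part of the ARISE tooling: the helper name list_tables is hypothetical, and the file name in the usage note is the one from the command above.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file, sorted alphabetically."""
    # Note: sqlite3.connect() creates an empty file if db_path does not exist yet.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling list_tables("arise-barcode-metadata.db") should then return a list that includes the tables referred to in the following steps, such as nsr_species and node.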

1.2 Loading the backbone topology

This operation loads the accepted species names from the NSR into the table nsr_species and the higher taxonomy into the node table, whose structure is based on Vos, 2019. It requires an empty database file (see 1.1). The contents are fetched from the NSR endpoint for DwC data dumps at http://api.biodiversitydata.nl/v2/taxon/dwca/getDataSet/nsr. Use the script as follows:

$ arise_load_backbone.py -db <sqlite.db> -endpoint <url>

The terms in angle brackets are placeholders; the script uses sensible defaults, i.e. the actual endpoint location and the name of the default output file from 1.1.

Note that the database file is updated in place.
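Because the file is modified in place, it can be useful to keep a copy of the database before each loading step. The following is a minimal sketch using the backup method of Python's sqlite3 module; the helper name backup_database and any file names passed to it are hypothetical, not part of the ARISE scripts.

```python
import sqlite3


def backup_database(src_path, dst_path):
    """Copy the SQLite database at src_path to dst_path via SQLite's backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Connection.backup copies all pages of src into dst (Python 3.7+).
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

For example, backup_database("arise-barcode-metadata.db", "arise-barcode-metadata.bak.db") writes a snapshot that can be restored by copying it back over the working file.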

1.3 Loading the synonyms table

Here we load the set of taxonomic synonyms that is curated by the NSR and available as a separate table dump. The purpose is to improve the matching between the NSR and the other data sources (i.e. internal sequencing and BOLD). This table is assumed to have characteristics similar to those of the canonical names table:

a preamble to ignore
a header row, but here we are looking for the columns "synonym" and "taxon"
tab-separated records

With these assumptions, the loading is then executed as:

$ arise_load_synonyms.py -infile <synonyms.csv> -db <sqlite.db>

Note that we currently match only species, so anything above or below that rank will trigger a ‘no match’ message.
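The file layout assumed in this step (a preamble, then a header row containing "synonym" and "taxon", then tab-separated records) can be read roughly as sketched below. This only illustrates the stated assumptions and is not the implementation in arise_load_synonyms.py; the function name read_synonyms is hypothetical.

```python
def read_synonyms(handle):
    """Yield (synonym, taxon) pairs from a tab-separated dump with a preamble.

    Every line before the header row is skipped. The header row is taken to be
    the first line whose tab-separated fields include both "synonym" and "taxon".
    """
    header = None
    for line in handle:
        fields = [field.strip() for field in line.rstrip("\r\n").split("\t")]
        if header is None:
            # Still in the preamble: look for the header row, skip everything else.
            if "synonym" in fields and "taxon" in fields:
                header = fields
            continue
        # Skip blank lines and records that are shorter than the header.
        if not line.strip() or len(fields) < len(header):
            continue
        record = dict(zip(header, fields))
        yield record["synonym"], record["taxon"]
```

Opening the dump with open(path, encoding="utf-8") and iterating over read_synonyms(handle) gives the pairs that would then be matched against the species names; the encoding is also an assumption.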

1.4 Loading BOLD metadata

In this step we load the output from BOLD’s Full Data Retrieval (Specimen + Sequence) web service. When filtered on Netherlands|Belgium|Germany, this output is approximately 300 MB and takes about an hour to download. The step can also be run from a cached copy. In either case, the format must be TSV. The operation is as follows:

$ arise_load_bold.py -tsv <file.tsv>
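Before loading a large or cached TSV file, a quick sanity check on its contents can save time. The sketch below counts records per value of one column using Python's csv module. The default column name "country" is an assumption about the BOLD export header, and count_by_column is a hypothetical helper, not part of arise_load_bold.py.

```python
import csv
from collections import Counter


def count_by_column(handle, column="country"):
    """Count the records in a TSV file per value of the given column."""
    # QUOTE_NONE treats quote characters as ordinary data in tab-separated fields.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        # Empty or absent values are grouped under one placeholder key.
        counts[row.get(column) or "(missing)"] += 1
    return counts
```

If the download was filtered as described above, the resulting counter should contain only the expected countries, plus possibly the "(missing)" key.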

1.5 Indexing the topology for faster queries

Now the tree topology should be indexed. This is most easily done using the Bio::Phylo::Forest::DBTree package:

$ perl -MBio::Phylo::Forest::DBTree -e \
    'Bio::Phylo::Forest::DBTree->connect("arise-barcode-metadata.db")->get_root->_index'

Alternatively, run the indexing step in a Docker container by executing the following command at the root of the project:

docker-compose -f docker-compose.yml run dbtree

(This is the docker-compose 1.xx syntax; with v2.XX, try docker compose -f ... instead.)

Indexing may take 15 to 30 minutes to complete.
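To give an idea of why indexing a tree speeds up queries, the sketch below shows nested-set numbering, one common way to index a topology. It assumes, without confirmation from this document, that the index is of this general kind; it is not the code behind the _index call above. The function names and the dictionary-based tree representation are hypothetical.

```python
def nested_set_index(children, root):
    """Assign (left, right) numbers to every node by depth-first traversal.

    children maps a node id to the list of its child ids. The result maps each
    node id to a (left, right) pair such that a descendant's pair always lies
    strictly inside its ancestor's pair.
    """
    index = {}
    left = {}
    counter = 0
    # An explicit stack avoids recursion limits on very deep trees.
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        counter += 1
        if children_done:
            index[node] = (left[node], counter)
        else:
            left[node] = counter
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return index


def descendants(index, node):
    """Return all descendants of node using only comparisons on the index."""
    lo, hi = index[node]
    return sorted(n for n, (l, r) in index.items() if lo < l and r < hi)
```

With such numbers stored per node, finding all descendants of a taxon becomes a range comparison rather than a repeated walk along parent-child links, which is the kind of query a precomputed index is meant to make fast.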

1.6 Query and visualize the database content using Jupyter Lab

Jupyter Lab runs in a Docker container. To start it:

NB_UID=`id -u` docker-compose -f docker-compose.yml up jupyter

and follow the instructions displayed in the terminal.