More specifically, consider lexical semantics:
When is the plant in bloom? | plant flowering begin | age_first_flowering | age_first_flowering | Bloom_Period
How much do the seeds weigh? | seed mass | seed_mass, SeedMass, Seed.per.Pound | seed_mass | Seeds_per_Pound
The databases appear to have the same information - or close enough so that one type of record (Seed.per.Pound) can be converted to another (perhaps seed mass = 1 lb. / Seed.per.Pound). However, there are at least two challenges:
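The conversion mentioned above can be sketched in R as follows. This is a minimal illustration, not code from any of the databases: the function name `seeds_per_pound_to_mass_g` is hypothetical, and it assumes that "pound" means the avoirdupois pound (453.59237 g) and that the result is wanted in grams.

```r
# Convert a seeds-per-pound count to mean seed mass in grams.
# Assumption: "pound" is the avoirdupois pound (453.59237 g).
GRAMS_PER_POUND <- 453.59237

seeds_per_pound_to_mass_g <- function(seeds_per_pound) {
  # Non-positive counts have no meaningful mass; return NA for those.
  ifelse(seeds_per_pound > 0, GRAMS_PER_POUND / seeds_per_pound, NA_real_)
}

# Hypothetical counts, for illustration only
seeds_per_pound_to_mass_g(c(1000, 250000))
```

Even this small conversion has to commit to a unit and to a rule for invalid values, neither of which is stated by the field names alone.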
In computer science and information science, an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that exist in a particular domain of discourse. At its simplest it resembles a taxonomy, although an ontology can also express relationships other than "is a".
An ontology compartmentalizes the variables needed for some set of computations and establishes the relationships between them.
The fields of artificial intelligence, the Semantic Web, systems engineering, software engineering, biomedical informatics, library science, enterprise bookmarking, and information architecture all create ontologies to limit complexity and organize information. The ontology can then be applied to problem solving.
Data-driven ontologies have very many classes, which are (semi-)automatically generated from activities such as expression analyses (the Gene Ontology, GO) or text mining (the Flora Phenotype Ontology, FLOPO).
These types of ontologies can be used for analyses such as enrichment tests. An enrichment test assesses whether a data set (for example, a list of genes expressed in a transcriptome) is significantly biased towards part of the ontology, for instance a functional group of genes involved in immune response.
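As a rough sketch of what such an enrichment test computes, the following R snippet runs a one-sided hypergeometric test with base R's `phyper()`. All counts and variable names are hypothetical and chosen only for illustration; real analyses typically also correct for testing many ontology terms at once.

```r
# Hypothetical counts, for illustration only
n_universe   <- 20000  # annotated genes in the genome
n_term       <- 400    # of those, genes annotated with the term of interest
n_sample     <- 200    # genes in our expressed list
n_sample_hit <- 15     # expressed genes annotated with the term

# One-sided hypergeometric test: probability of drawing at least
# n_sample_hit term genes when sampling n_sample genes at random.
p_value <- phyper(n_sample_hit - 1, n_term, n_universe - n_term, n_sample,
                  lower.tail = FALSE)
print(p_value)
```

A small p-value suggests that the term is over-represented in the gene list relative to the genome as a whole.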
Vocabularies to describe the essential concepts within a domain, such as:
Terms from such vocabularies are used to structure and annotate data. For example, the tabular occurrence records from GBIF use Darwin Core terms as column names. GenBank records use Sequence Ontology (SO) terms to label features on the sequence (such as exons, introns, and CDSs).
Data from the EoL TraitBank is annotated with terms from Darwin Core. We can query the database as follows:
library(traits)
library(taxize)
library(dplyr)
# Example species to look up
species <- "Cocos nucifera"
# For EoL traitbank data we need to provide the EoL taxon ID as input parameter. Hence,
# we first need to do a TNRS lookup of these, as follows:
sources <- gnr_datasources() # frame with global names sources
eol_id <- sources[sources$title == "Encyclopedia of Life", "id"] # lookup the id of the EOL source
eol_tnrs <- gnr_resolve(species, data_source_ids = c(eol_id), fields = "all") # resolve species
eol_taxon_id <- eol_tnrs[eol_tnrs$matched_name == species,]$local_id # lookup integer id
# Now that we have the taxon id, we query the traitbank
eol_results <- traitbank(eol_taxon_id)
eol_graph <- eol_results[["graph"]] # the interesting bit in the results is the graph
# Here we select the fields with Darwin Core terms
eol_triples <- select( eol_graph, 'dwc:scientificname', 'dwc:measurementtype', 'dwc:measurementvalue', 'units' )
# Write to file
write.csv(eol_triples, file = "eol_triples.csv")
This produces a CSV file, eol_triples.csv.
“A coconut will not resprout following top (above ground biomass) removal”
The three links can be viewed as the semantic anchors of this fact: a subject (the coconut), a predicate (the ability to resprout), and an object (no).
Together, they form a triple.
Our statement about coconuts, and all the other rows in our CSV file, can be reformatted in this simple RDF representation:
<http://eol.org/1091712> <http://eol.org/schema/terms/ResproutAbility> <http://eol.org/schema/terms/ResproutNo> .
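The reformatting step can be sketched in R as below. This is a minimal illustration under stated assumptions: the data frame `statements` and its column names (`subject`, `predicate`, `object`) are hypothetical, and every value is assumed to already be a full URI. Literal values such as numbers with units would need different handling.

```r
# Hypothetical data frame: one row per statement, with full URIs
statements <- data.frame(
  subject   = "http://eol.org/1091712",
  predicate = "http://eol.org/schema/terms/ResproutAbility",
  object    = "http://eol.org/schema/terms/ResproutNo",
  stringsAsFactors = FALSE
)

# Wrap each URI in angle brackets and end the statement with " ."
to_triples <- function(df) {
  sprintf("<%s> <%s> <%s> .", df$subject, df$predicate, df$object)
}

triples <- to_triples(statements)
writeLines(triples)
```

Because `sprintf()` is vectorized, the same function handles a data frame with many rows, producing one statement per row.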
By doing so, we integrate our data set in a web of knowledge representations, linking us to, for example:
All these triples together form a graph that, when published, can participate in the linked data graph:
RDF is fundamentally an abstract data model (the triples), which can be represented in many ways. The example above (<> <> <> .) is a terse form called Turtle; others include:
Large RDF data sets are typically not stored as files but in a special kind of database called a triple store (this is closer to the graph database Neo4j used by OpenTree and EoL than to a relational SQL database).
SQL, the language for querying relational databases, has a lot of functionality for expressing various types of joins and other clauses that combine tables. For graphs this is far less useful, because the patterns of connections between nodes matter more. Hence, dedicated languages for querying graphs have been developed. For RDF and the semantic web, the standard language is SPARQL, which is endorsed by the W3C. For Neo4j there is the (vendor-specific) language Cypher.
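To give a feel for pattern-based querying, here is a small R sketch that matches a triple pattern against an in-memory table of triples. It is only an analogy for what a SPARQL engine does, not a SPARQL implementation. The function `match_pattern` and the `example.org` URIs are hypothetical; only the first row comes from the coconut example.

```r
# Hypothetical in-memory triple table; the second row is made up
triples <- data.frame(
  s = c("http://eol.org/1091712", "http://example.org/taxon/1"),
  p = c("http://eol.org/schema/terms/ResproutAbility",
        "http://eol.org/schema/terms/ResproutAbility"),
  o = c("http://eol.org/schema/terms/ResproutNo",
        "http://example.org/terms/ResproutYes"),
  stringsAsFactors = FALSE
)

# Match a triple pattern; NULL plays the role of a SPARQL variable
match_pattern <- function(triples, s = NULL, p = NULL, o = NULL) {
  keep <- rep(TRUE, nrow(triples))
  if (!is.null(s)) keep <- keep & triples$s == s
  if (!is.null(p)) keep <- keep & triples$p == p
  if (!is.null(o)) keep <- keep & triples$o == o
  triples[keep, , drop = FALSE]
}

# Roughly equivalent SPARQL:
#   SELECT ?s WHERE {
#     ?s <http://eol.org/schema/terms/ResproutAbility>
#        <http://eol.org/schema/terms/ResproutNo> .
#   }
non_resprouters <- match_pattern(
  triples,
  p = "http://eol.org/schema/terms/ResproutAbility",
  o = "http://eol.org/schema/terms/ResproutNo"
)$s
print(non_resprouters)
```

The query fixes the predicate and object and leaves the subject open, so the result is every subject that participates in a matching statement.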