1. maximum entropy

Introduction

The following R markdown file performs the steps needed to calculate niche overlap between the extant Ungulates. Here we implement two different approaches to calculate the niche per species:

  1. Maxent species distribution models, and
  2. Outlying Mean Index (OMI)

The following libraries and scripts need to be loaded:

library(raster, quietly = T)
library(knitr, quietly = T)
library(maxent, quietly = T)
library(rJava, quietly = T)
library(maptools, quietly = T)
library(dismo, quietly = T)
library(sp, quietly = T)
library(ape, quietly = T)
library(rgdal, quietly=T)
library(dplyr, quietly=T)

# our local code. the goal is that this will eventually be properly portable 
# code that lives inside the ./script folder of the repository
source("../script/MaxEnt_function.R")

In addition, here we define the global variables that we will reuse throughout the code:

# this is the location of the root directory of the git repository in the local 
# file system. normally you would be running the code from within rstudio, using
# a project that is initialized within that directory, so the working directory 
# should automatically be set.
REPO_HOME <- paste(getwd(), "/../", sep = "")

Environmental data

The environmental data used in this research are based on climatic variables, topography, and soil characteristics. Climatic information about the present was extracted from the widely used Bioclim dataset, which includes 19 bioclimatic layers. The datasets contain information such as ‘precipitation in the driest quarter’ or ‘maximum temperatures of the coldest month’ and are constructed based on monthly remote sensing data between 1950 and 2000 (Hijmans et al., 2005, Title et al., 2018). The dataset can directly be downloaded with the getData() function from the raster package. It is possible to adjust the spatial resolution res to 30 arc seconds, 2.5, 5, and 10 arc minutes.

# This function from the raster package loads bioclimatic layers. It first 
# attempts to do this from files in data/GIS/wc5, but if these are not found it 
# will attempt to download them and cache them locally. Since the files are 
# small enough to commit to github we will normally have these files already. 
# This could change if we move to a resolution that is finer than 5 arcminutes, 
# as the layer files will then grow above 15MB.
gis.layers <- raster::getData(
  "worldclim",
  var = "bio",
  res = 5,
  path = paste(REPO_HOME, "/data/GIS", sep = ""),
  download = T
)

The other environmental datasets that we used are the new ENVIREM variables that give additional climatic information to the Bioclim datasets. We used median elevation variables from the Harmonized World Soil Database (HWSD), which are based on NASA’s Shuttle Radar Topographic Mission to calculate worldwide slope and aspect. We used indirect height measures such as slope and aspect because the height variables are directly correlated with the temperature Bioclim datasets. To capture soil characteristics, we used organic carbon, pH CaCL, bulk density and clay percentage datasets obtained from the land-atmosphere interaction research group at Sun Yat-sen University.

# The location of the TIFF file with the stacked layers described directly above
files.names <- list.files(paste(REPO_HOME, "/data/GIS/5_deg", sep = ""))

# Turn the file names into layer names: strip the prefix (which might include
# the resolution) and strip the file extension
gis.layers.names <- files.names
gis.layers.names <- gsub('current_5arcmin_','',gis.layers.names)
gis.layers.names <- gsub('.tif','',gis.layers.names)

# Combine the layer names with those we've already read from BIOCLIM
gis.layers.names <- c(names(gis.layers),gis.layers.names)

# Iterate over files
for (i in 1:length(files.names)) {
  
  # Stack with previously read layers
  gis.layers <- stack(
    gis.layers,
    
    # Read as raster
    raster(
      
      # Construct file name
      paste(REPO_HOME, "/data/GIS/5_deg/", files.names[i], sep = "")
    )
  )
}

# Apply all names
names(gis.layers) <- gis.layers.names
rm(gis.layers.names, files.names, i)
plot(gis.layers)

Taxon data

In a project that includes many taxa, such as this one, it is helpful if we can work incrementally by having separate input and output files per taxon. To set this up we first read in the list of taxa from /data/filtered/taxa.txt and create corresponding output directories:

# read listing
taxa <- scan(paste(REPO_HOME, "/data/filtered/taxa.txt", sep = ""), sep = "\n", what = character())

# make output directories, if needed
for (i in 1:length(taxa)) {
  taxon.dir.name <- sprintf("%s/results/per_species/%s/", REPO_HOME, taxa[i])
  if (!dir.exists(taxon.dir.name)) {
    dir.create(taxon.dir.name, recursive = T)
  }
}
rm(i, taxon.dir.name)

When calculating the species distribution models a dataframe containing all the occurence points is needed, which is created and filled below:

# create an empty dataframe 
occurrences.df = data.frame(
  matrix(
    vector(), 0, 4,
    dimnames=list(
      c(), 
      c("gbif_id", "taxon_name", "decimal_latitude", "decimal_longitude")
    )
  ),
  stringsAsFactors=F
)

# loop to add the occurence data to the empty dataframe
for ( i in 1:length(taxa) ) {
  csv.file <- sprintf("%s/data/filtered/%s.csv", REPO_HOME, taxa[i])
  csv <- read.csv(csv.file, header = T)
  occurrences.df <- rbind(occurrences.df, csv)
}

# finalize data frame: parse longitudes and latitudes
occurrences.df$decimal_longitude <- as.numeric(as.character(occurrences.df$decimal_longitude))
occurrences.df$decimal_latitude <- as.numeric(as.character(occurrences.df$decimal_latitude))

rm(csv.file,csv,i)

Below the raw occurrences are plotted on a map for each species, which goes to /results/per_species/<taxon>/occurrences.png:

data(wrld_simpl)

# example of the occurrence point plots 
taxon <- taxa[68]
taxon.name <- gsub( "_", " ", taxon)

# select the longitude and latitude of the focal taxon from the data frame
taxon.occurrences <- dplyr::filter(occurrences.df, taxon_name == taxon.name)
lon <- dplyr::select(taxon.occurrences, decimal_longitude)$decimal_longitude
lat <- dplyr::select(taxon.occurrences, decimal_latitude)$decimal_latitude

# plot a simple map ±5 degrees long and lat around the occurrences
plot(
      wrld_simpl,
      axes = TRUE,
      xlim = c(min(lon) - 5, max(lon) + 5),
      ylim = c(min(lat) - 5, max(lat) + 5),
      col = "gray93",
      main = sprintf("Filtered occurrences for %s", taxon.name)
    )
    points(lon, lat, col = "red", pch = 20, cex = 0.75)
    box()

# loop through all the occurrence files and save them in their repositories 
for (i in 1:length(taxa)) {
  
  # construct the name of the occurrences file for the current taxon
  taxon <- taxa[i]
  occ.map.file <- sprintf("%s/results/per_species/%s/occurrences.png", REPO_HOME, taxon)

  # generate the map
  if (!file.exists(occ.map.file)) {
    taxon.name <- gsub( "_", " ", taxon)

    # select the longitude and latitude of the focal taxon from the data frame
    taxon.occurrences <- dplyr::filter(occurrences.df, taxon_name == taxon.name)
    lon <- dplyr::select(taxon.occurrences, decimal_longitude)$decimal_longitude
    lat <- dplyr::select(taxon.occurrences, decimal_latitude)$decimal_latitude

    # plot a simple map ±5 degrees long and lat around the occurrences
    png(file = occ.map.file)
    plot(
      wrld_simpl,
      axes = TRUE,
      xlim = c(min(lon) - 5, max(lon) + 5),
      ylim = c(min(lat) - 5, max(lat) + 5),
      col = "gray93",
      main = sprintf("Filtered occurrences for %s", taxon.name)
    )
    points(lon, lat, col = "red", pch = 20, cex = 0.75)
    box()
    dev.off()
    
    # clean up
    rm(taxon.name,taxon.occurrences,lat,lon)
  }
  rm(occ.map.file, taxon)
}
rm(i, wrld_simpl)

Maxent: species distribution modeling

For this study we use the dismo::maxent algorithm, which we invoke from the Maxent_function as implemented in this script.

In the following section we run the Maxent analysis for every taxon. To assess the validity of the model we follow the method of Raes and Ter Steege. This approach seeks to solve the problem that we do not have true absences (because absence of evidence is no evidence of absence). We do this by considering the area within the buffer around the true occurrences of the focal species, and then sample from among the localities within that buffer where occurrences for other species have been recorded, but not for the focal species. This way we avoid sampling pseudo-absences in locations that are, for example, not reachable, i.e. observer biases.

By doing the resampling a sufficient number of times we can establish a null distribution and assess whether the true AUC fits better than the AUCs generated given pseudo-absences. If the true AUC fits better, it is written to a file valid_maxent_model.rda, otherwise to a file invalid_maxent_model.rda.

# containers for models with low/high accuracy
list_species_model_low_accuracy <- list()
list_species_model_high_accuracy <- list()

# iterate over n species
for (i in 1: length(taxa) )
    {
  name <- taxa[i]
  valid.model.file <- sprintf(
    "%s/results/per_species/%s/valid_maxent_model.rda", REPO_HOME, name)
  invalid.model.file <- sprintf(
    "%s/results/per_species/%s/invalid_maxent_model.rda", REPO_HOME, name)

  # check if we have the model cached
  if (!file.exists(valid.model.file) && !file.exists(invalid.model.file)) {
    taxon.name <- gsub('_', ' ', name)
    message(taxon.name)
    taxon.occurrences <- dplyr::filter(occurrences.df, taxon_name == taxon.name)

    # Run maxent:
    # - buffer environment to ± 9 degrees lat/long relative to the occurrences
    # - remove colinear layers
    # - run dismo::maxent()
    species_model <- Maxent_function(taxon.occurrences, gis.layers)
    # select species model from the output list
    spmod <- species_model[[1]] 
    # select trainingAUC
    trainingAUC <- spmod@results[[5,1]]
    
    ## null model 
    # - sample occurence points within the buffered environment 
    # - run the maxent model 100 times using the selected layers
    visited_areas <- dplyr::filter(occurrences.df, taxon_name != taxon.name)
    modelenvironment <- species_model[[2]] # selected layers
    polygonextent <- species_model[[4]] # buffered environment
    AUCvalues <- nullModel_without_spatial(
      visited_areas, 
      modelenvironment, 
      polygonextent, 
      nrow(taxon.occurrences), 
      rep = 100
    )
    
    auc <- sort(unlist(AUCvalues), decreasing = FALSE)
    # select 95th number in the list 
    confidence.interval<- auc[95]
    
    # populate two stacks, one with the models with 
    # low accuracy and one with high accuracy 
    if (trainingAUC > confidence.interval) {
      model <- species_model[[1]]
      list_species_model_high_accuracy[[name]] <- model
      save(model, file = valid.model.file)
    }
    else {
      model <- species_model[[1]]
      list_species_model_low_accuracy[[name]] <- model
      save(model, file = invalid.model.file)
    }

    # valid model was previously constructed, load it
  } else if (file.exists(valid.model.file)) {
    env <- new.env()
    nm <- load(valid.model.file, env)[[1]]
    list_species_model_high_accuracy[[name]] <- env[[nm]]

    # invalid model was previously constructed, load it
  } else if (file.exists(invalid.model.file)) {
    env <- new.env()
    nm <- load(invalid.model.file, env)[[1]]
    list_species_model_low_accuracy[[name]] <- env[[nm]]
  }
}
rm(env,i,name,nm,invalid.model.file,valid.model.file)

Having constructed the models, we now record their scores in a spreadsheet:

# combine two lists of valid and invalid models
AUC.csv <- paste(REPO_HOME, "/results/maxent/model_summaries/AUCvalues.csv", sep = "")

if (!file.exists(AUC.csv)) {
output_AUC_valid <- data.frame(
  matrix(
    ncol = 3, 
    nrow = length(list_species_model_high_accuracy)
  )
)
colnames(output_AUC_valid) <- c("taxon", "trainingAUC", "validation")

# iterate over accurate models and add to data frame
for (i in 1:length(list_species_model_high_accuracy)) {
  open_species_model <- list_species_model_high_accuracy[[i]]
  name <- names(list_species_model_high_accuracy[i])
  name_underscore <- gsub("_", " ", name)
  trainingAUC <- open_species_model@results[[5, 1]]
  output_AUC_valid[i, ] <- c(name_underscore, trainingAUC, "valid")
}

# initialize data frame
output_AUC_invalid <- data.frame(
  matrix(
    ncol = 3, 
    nrow = length(list_species_model_low_accuracy)
  )
)
colnames(output_AUC_invalid) <- c("taxon", "trainingAUC", "validation")

# iterate over inaccurate models and add to data frame
for (i in 1:length(list_species_model_low_accuracy)) {
  open_species_model <- list_species_model_low_accuracy[[i]]
  name <- names(list_species_model_low_accuracy[i])
  name_underscore <- gsub("_", " ", name)
  trainingAUC <- open_species_model@results[[5, 1]]
  output_AUC_invalid[i, ] <- c(name_underscore, trainingAUC, "invalid")
}

# concatenate data frames and write to CSV
combined_auc <- rbind(output_AUC_invalid, output_AUC_valid)
write.csv(combined_auc, file = AUC.csv)
}

For the accurate models we then write the respective importance of the selected variables to a figure, and the same for the response curves to these variables:

# plot variable importance 
maxent.model <- list_species_model_high_accuracy[[10]]
taxon.name <- names(list_species_model_high_accuracy[10])
plot.title <- sprintf("Variable importance for %s", taxon.name)
plot(maxent.model, main = plot.title)

# plot response curves 
response(maxent.model)

# iterate over accurate models
for (i in 1:length(list_species_model_high_accuracy)) {
  maxent.model <- list_species_model_high_accuracy[[i]]
  taxon.name <- names(list_species_model_high_accuracy[i])
  plot.title <- sprintf("Variable importance for %s", taxon.name)

  # construct file name for variable importance
  variable.map.file <- sprintf(
    "%s/results/per_species/%s/valid_maxent_variable_importance.png", 
    REPO_HOME, 
    taxon.name
  )
  if (!file.exists(variable.map.file)) {
    png(file = variable.map.file)
    plot(maxent.model, main = plot.title)
    dev.off()
  }
  
  # construct file for variable responses
  response.map.file <- sprintf(
    "%s/results/per_species/%s/valid_maxent_response_curve.png", 
    REPO_HOME, 
    taxon.name
  )
  if (!file.exists(response.map.file)) {
    png(file = response.map.file, res = 80, units = "px")
    response(maxent.model)
    dev.off()
  }
}

The outcome of the species distribution models can be used to create a prediction of the areas that are suitable for the species. We can do this world- wide, which allows us to calculate (potential) niche overlap even for species that natively live on different continents. However, the projections that we thus generate disregard barriers to dispersal and are therefore quite unnatural.

For more plausible projections we restrict the extent in which we project to the biomes in which the occurrences were recorded to begin with.

# combine all prediction rasters, either by doing the projection or by reading from a file
prediction_rasters <- stack()
crs.string <- "+proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs"

# the stacking of all the suitability rasters uses a lot of memory. You can change 1: lenth(taxa) to for example the first 20 species 1:20. When calculating the Schoener's distance in {calc_dist} all the suitability rasters need to be stacked (use 1:length(taxa)) and therefore it is recommended to use a computer with sufficient memory. 
for ( i in 1:20
      #length(taxa) 
      ) {
  taxon.name <- gsub( '_', ' ', taxa[i] )

  # file name: load if exists, create otherwise
  valid.model.prediction <- sprintf(
    "%s/results/per_species/%s/valid_maxent_prediction.rda", 
    REPO_HOME, 
    taxa[i]
  )
  if ( file.exists(valid.model.prediction) ) {

    # load file
    message(sprintf("loading prediction for taxon %s from file %s", taxon.name, valid.model.prediction))
    env <- new.env()
    nm <- load(valid.model.prediction, envir = env)[[1]]
    prediction_rasters <- stack(prediction_rasters, env[[nm]])
  }
  else {
    message(sprintf("computing prediction for taxon %s, will write to file %s", taxon.name, valid.model.prediction))
    maxent.model <- list_species_model_high_accuracy[[i]]

    # select non correlated layers from model environment in the world environment
    noncorrelated <- names(maxent.model@presence)
    subsetworldEnv <- subset(gis.layers, noncorrelated)

    # make projection of the whole earth based on the maxent SDM
    species.pred <- dismo::predict(maxent.model, subsetworldEnv)

    # create occurrence points for focal taxon
    points <- dplyr::filter( occurrences.df, taxon_name == taxon.name )
    points <- dplyr::select( points, decimal_longitude, decimal_latitude )
    coordinates(points) <- ~ decimal_longitude + decimal_latitude
    proj4string(points) <- CRS(crs.string)
    
    # select realms in which occurrence points are recorded
    #polys.sub <- realms[!is.na(sp::over(realms, sp::geometry(points))), ] 
    #polys.sub <- gUnaryUnion(polys.sub)

    # set all the values outside the polygon to 0
    #maskedprediction <- mask(
      #species.pred, 
      #polys.sub, 
      #updatevalue=0, 
      #updateNA=TRUE
   # )
    
    # remove the 10% lowest prediction values under the occurence points 
    projection.values <- extract(species.pred, points, method="simple")
    sort.projection.values <- sort(unlist(projection.values), decreasing = FALSE)
    percentage.number <- round((0.1* dplyr::count(as.data.frame(sort.projection.values))), digits = 0)
    threshold.value <- sort.projection.values[percentage.number[1,1]]
    species.pred[species.pred < threshold.value] <- 0
    # save file, stack projection contents
    save(species.pred, file = valid.model.prediction)
    prediction_rasters <- stack(prediction_rasters, species.pred)
  }
}
# if you only stack the first 20 used taxa[1:20]
names(prediction_rasters) <- taxa[1:20]
taxa.prediction.rasters<-names(prediction_rasters)
crs.string <- "+proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs"
realms <- readOGR(paste(REPO_HOME, "/data/GIS/Realms/newRealms.shp", sep = ""))
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\elkeh\Documents\Stage_Naturalis\Scripts_R\trait-geo-diverse-ungulates\Data\GIS\Realms\newRealms.shp", layer: "newRealms"
## with 11 features
## It has 5 fields
realms <- spTransform(realms, CRS(crs.string))
plot(realms)

for ( i in 1: length(taxa.prediction.rasters) ) {
# construct restricted projection file name
  prediction.restricted.file <- sprintf(
    "%s/results/per_species/%s/valid_maxent_prediction_restricted.rda", 
    REPO_HOME, 
    taxa.prediction.rasters[i]
  )
  if (!file.exists(prediction.restricted.file)) {
   projection.raster<- prediction_rasters[[i]]
   taxa.name<- gsub("_", " ", taxa.prediction.rasters[i])

  # create occurrence points for focal taxon
  points <- dplyr::filter( occurrences.df, taxon_name == taxa.name )
  points <- dplyr::select( points, decimal_longitude, decimal_latitude )
  coordinates(points) <- ~ decimal_longitude + decimal_latitude
  proj4string(points) <- CRS(crs.string)
    
  #select realms in which occurrence points are recorded
  polys.sub <- realms[!is.na(sp::over(realms, sp::geometry(points))), ] 
  polys.sub <- gUnaryUnion(polys.sub)

  # set all the values outside the polygon to 0
  maskedprediction <- mask(
      projection.raster, 
      polys.sub, 
      updatevalue=0, 
      updateNA=TRUE )
  
  save(maskedprediction, file = prediction.restricted.file)

  }

  # construct prediction map file name
  prediction.map.file <- sprintf(
    "%s/results/per_species/%s/prediction_map.png", 
    REPO_HOME, 
    taxa.prediction.rasters[i]
  )
  if (!file.exists(prediction.map.file)) {
    png(file = prediction.map.file)
    plot(maskedprediction)
    dev.off()
  }

  # construction prediction map with occurrences file name
  prediction.occurence.map.file <- sprintf(
    "%s/results/per_species/%s/prediction_occurence_map.png", 
    REPO_HOME, 
      taxa.prediction.rasters[i]
  )
  if (!file.exists(prediction.occurence.map.file)) {
    
    # get occurrences of focal taxon, superimpose on prediction map
    taxa.name<- gsub("_", " ", taxa.prediction.rasters[i])
    png(file = prediction.occurence.map.file)
    plot(maskedprediction)
    points(
      points$decimal_longitude, 
      points$decimal_latitude, 
      col = "red", 
      pch = 20, 
      cex = 0.2
    )
    box()
    dev.off()
  }

}

Write species-specific documentation file

# location of a triangular, inverse distance matrix file
for ( i in 1:length(taxa)) {
  name <- taxa[i]
  readme.file <- sprintf('%s/results/per_species/%s/README.md', REPO_HOME, taxa[i])
  
  if(!file.exists(readme.file)) {
    taxa_remove<- gsub( "_", " ", name)
    
    template<- 
'# {{taxa_name}} 

## Distribution of occurrence points 

The following map shows the distribution of the filtered 
[occurrences](../../../data/filtered/{{taxa_underscore}}.csv) for {{taxa_name}} 
used in [the Maxent model](valid_maxent_model.rda). 

![](occurrences.png)
    
## Variable importance 

The variable importance graph shows the relative importance of the abiotic 
raster layers in the model

![](valid_maxent_variable_importance.png)
    
## Response curves

The response curve graphs show the modeled response of {{taxa_name}} to the 
selected abiotic raster layers. 

![](valid_maxent_response_curve.png)
    
## Model predictions

To be able to compute the potential distribution overlap, worldwide, between
all pairs of species, a model [prediction](valid_maxent_prediction.Rda) is
available. However, for prediction maps that take into account large-scale 
biogeographic barriers to dispersal, we use a 
[restricted prediction](valid_maxent_prediction_restricted.rda) that limits 
the projection to the 
[zoogeographical realm(s)](../../../data/GIS/Realms/newRealms.shp) in which the
raw occurrences are located. The maps below are based on this restricted
projection.

The first map shows the predicted suitable areas on earth based on the niche 
preferences for {{taxa_name}} calculated in the Maxent model. The second map 
shows the suitable area map with the original occurrence points superimposed.

![](prediction_map.png)
![](prediction_occurence_map.png)
    '
    data <- list( taxa_name = taxa_remove, taxa_underscore = name )
    
    text <- whisker.render(template, data)
    
    
    writeLines(text, readme.file)
  }
}

Niche overlap using Schoener’s D

Calculate and store overlap values

To assess whether the predicted suitable niche spaces differ per species we calculate their niche overlap. We use Schoener’s D to calculate niche overlap since it has been suggested to be the best suited index for maxent SDM outputs (Rödder & Engler, 2011). The index ranges from 0 which is no overlap to 1 which is a complete overlap and is based on the model prediction maps.

# location of a triangular, inverse distance matrix file
overlap.file <- paste(REPO_HOME, "/results/maxent/comparative/schoener/overlap.csv", sep = "")
if (!file.exists(overlap.file)) {

  # file doesn't exist, calculate Schoener's Distance and store file
  overlap <- calc.niche.overlap(prediction_rasters, stat = "D", maxent.args)
  write.table(overlap, file = overlap.file, sep = ",", quote = F)
} else {

  # file exists, read it
  overlap <- read.table(overlap.file, sep = ",")
}

Construct and store NJ tree

Having calculated the Schoener’s D distances, we can now use these to make a symmetrical distance matrix, and perform a neighbor joining clustering to visualize the distances in niche space.

dendrogram.file <- paste(REPO_HOME, "/results/maxent/comparative/schoener/inverse_overlap_nj.tree", sep = "")
if (!file.exists(dendrogram.file)) {

  # create a distance matrix, i.e. the inverse of the overlap
  dmatrix <- as.dist(1 - overlap)

  # plot an unrooted dendrogram
  tree <- nj(dmatrix)
  write.tree(tree, file = dendrogram.file)
} else {
  tree <- read.tree(file = dendrogram.file)
}