mebioda

Phylogenetic comparative analysis

The problem: non-independence in comparative analysis

The problem of analyzing phylogenetically structured data with conventional statistical methods. Ignoring phylogeny, one would conclude that X and Y are positively correlated (Pearson r = 0.48, 2-tailed P = 0.034), when in fact this relationship emerges primarily from the high divergence in X and Y between the two clades at the root of the phylogeny.

False positives (type I errors)

Increased type I error rates of conventional statistics in analyses of interspecific data. When two traits evolve independently along a phylogeny according to Brownian motion, the probability of rejecting the null hypothesis of no correlation (type I error) increases with the amount of phylogenetic structure of the data.

Simulations with a star phylogeny result in the error rates of 5%, which is the expected type I error rate if conventional (nonphylogenetic) analyses are used.

The shaded area represents simulations where the resulting ordinary Pearson coefficient falls above the tabular critical value of +0.476 (11 degrees of freedom), which would incorrectly suggest that the two traits are correlated.

Type I error rates can be higher than 25% if the data shows a strong phylogenetic structure.

Brownian motion

Brownian simulation

Here is some code to perform discrete-time (non-phylogenetic) Brownian motion simulation:

# discrete time BM simulation
n<-100; t<-100; sig2<-1/t # set parameters
time<-0:t
X<-rbind(rep(0,n),matrix(rnorm(n*t,sd=sqrt(sig2)),t,n))
Y<-apply(X,2,cumsum)
plot(time,Y[,1],ylim=range(Y),xlab="time",ylab="phenotype", type="l")
apply(Y,2,lines,x=time)

And here is the result:

Independent contrasts

Calculation of phylogenetic independent contrasts for two hypothetical variables X and Y. Contrasts estimate the amount of phenotypic divergence across sister lineages standardized by the amount of time they had to diverge (the square root of the sum of the two branches).

The algorithm runs iteratively from the tips to the root of the phylogeny, transforming n phenotypic measurements that are not independent in n–1 contrast that are statistically independent. Because phenotypic estimates at intermediate nodes (X’ and Y’) are not measured, but inferred from the tip data, divergence times employed to calculate these contrasts include an additional component of variance that reflects the uncertainty associated with these estimates. In practice, this involves lengthening the branches (dashed lines) by an amount that, assuming Brownian motion, can be calculated as:

(daughter branch length 1 × daughter branch length 2)
----------------------------------------------------- 
(daughter branch length 1 + daughter branch length 2)

Contrasts results and statistical analysis

As a result, the association between the hypothetical phenotypic variables X and Y analyzed employing conventional statistics and independent contrasts may seem remarkably different. Because independent contrasts estimate phenotypic divergence after speciation and are expressed as deviations from zero (i.e., the daughter lineages were initially phenotypically identical), correlation and regression analyses employing contrasts do not include an intercept term and must be always calculated through the origin.

Note that the sign of each contrast is arbitrary; hence many studies have adopted the convention to give a positive sign to contrasts in the x-axis and invert the sign of the contrast in the y-axis accordingly (this procedure does not affect regression or correlation analyses through the origin). Even though the classic algorithm to calculate contrasts neglects important sources of uncertainty such as individual variation and measurement error, recent methods can account for these sources of error.

R/ape tutorial for independent contrasts

Independent contrasts drawbacks

An example case

Ordinary least squares

In an ordinary least squares (OLS) regression model, the relationship of a response variable Y to a predictor variable X1 can be given using the regression equation:

For a simple regression with one predictor (X), the slope of the regression line b1 is given by:

The intercept b0 then follows:

Generalizing the model: the variance-covariance matrix

In the case of OLS, the implicit assumption is that there is no covariance between residuals (i.e. all species are independent of each other, and residuals from closely related species are not more similar on average than residuals from distantly related species).

This (n x n) variance–covariance matrix is denoted as C, and for five species under the assumption of no phylogenetic effects on the residuals, it looks like:

Variance-covariance matrix from relatedness

Phylogenetic Generalized Least Squares (PGLS) solution

When this new version of C is applied to the GLS calculation (e.g. with geiger), we eventually end with the PGLS solution:

log(propodus length) = - 0.276 + 1.616 x log(carapace breadth)

Categorical data and character analysis

Continuous-time Markov models

Transitions between states can be modeled as a rate matrix (Q). For example, for a character with two states (i and j), matrix looks like this:

  i j
i 1-qij qij
j qji 1-qji

Likelihood

Quick recap

The likelihood framework


Here is a tutorial that applies these concepts using the R package geiger.