A comment on the distribution of residuals (and data) for phylogenetic ANOVA

I get inquiries (with some regularity) about "testing for normality in phylogenetic (i.e., species) data" before phylogenetic regression or ANOVA; or about "satisfying the assumptions of parametric tests," by which is usually meant the assumption of normality.

I could probably write a whole paper about this (à la Revell 2009 or Revell 2010), but instead I'll make the simple point: we do not expect normality of the dependent (or independent, for continuous x) variables in phylogenetic data. Instead, what we do expect is multivariate normality of the residual error in y given X (or, equivalently, normality of y given X, controlling for the tree). This is actually a generally under-appreciated property of non-phylogenetic parametric statistical tests - but it is one that is entirely logical. Think: do we expect normality of human height, say, in order to fit an ANOVA model in which height depends on sex? Of course not, the response variable (height) is bimodal. ANOVA is appropriate to test for an effect of sex on height so long as the residual error in height controlling for sex is normal (and, like many such tests, may be fairly robust to mild violations of this assumption). Phylogenetic data are just a little more complicated because even after controlling for our main effects, the residual error can still be bi- or multi-modal due to phylogenetic correlations.

We can still test the parametric assumptions of our model - and I applaud those inclined to do so, as this is relatively seldom done in comparative studies. In the example below, I will first simulate data under the assumptions of the generalized phylogenetic ANCOVA; test normality of the response variable, y (it should fail) and my continuous covariate, x₂ (it should fail); fit the phylogenetic ANCOVA model anyway, and then test normality of the residuals (these should fail, because the residuals are phylogenetically autocorrelated, see Revell 2009); mathematically "remove" the phylogeny, following Butler et al. (2000), and test for normality again (this time, it should pass). For normality testing, I'm using the Lilliefors (Kolmogorov-Smirnov) test, implemented in the R package nortest. A significant result means the data are not normally distributed.

> # load required packages
> require(phytools)
> require(nlme)
> require(nortest)
> # first simulate a completely balanced tree
> tree<-compute.brlen(stree(n=128,type="balanced"))
> # now simulate a discrete character on the tree
> Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
> rownames(Q)<-colnames(Q)<-c(0,1,2)
> mtree<-sim.history(tree,Q)
> cols<-c("white","blue","red"); names(cols)<-0:2
> plotTree(mtree,ftype="off",lwd=5)
> plotSimmap(mtree,cols,pts=FALSE,ftype="off",lwd=3, add=TRUE)

This is the distribution of our effect on the tree.

> # now simulate data under an arbitrary ANCOVA model
> # the same principle applies to regression or ANOVA
> x1<-as.factor(mtree$states)
> x2<-fastBM(tree,sig2=2)
> e<-fastBM(tree)
> y<-2*as.numeric(x1)+0.75*x2+e
> # is y normal? (should fail)
> lillie.test(y)

Lilliefors (Kolmogorov-Smirnov) normality test

data: y
D = 0.1049, p-value = 0.00149

> # is x2 normal? (should fail)
> lillie.test(x2)

Lilliefors (Kolmogorov-Smirnov) normality test

data: x2
D = 0.1113, p-value = 0.0005154

> # fit the model
> fit<-gls(y~x1+x2,data=data.frame(x1,x2,y), correlation=corBrownian(1,tree))
> fit
Generalized least squares fit by REML
Model: y ~ x1 + x2
Data: data.frame(x1, x2, y)
Log-restricted-likelihood: 40.7237

Coefficients:
(Intercept) x11 x12 x2
1.7388578 1.8929459 3.9681291 0.8418073

Correlation Structure: corBrownian
Formula: ~1
Parameter estimate(s):
numeric(0)
Degrees of freedom: 128 total; 124 residual
Residual standard error: 0.9261019
> # are the residuals normal? (should fail)
> lillie.test(residuals(fit))

Lilliefors (Kolmogorov-Smirnov) normality test

data: residuals(fit)
D = 0.1156, p-value = 0.0002458

> # are the residuals controlling for phylogeny normal?
> # (should pass)
> lillie.test(chol(solve(vcv(tree)))%*%residuals(fit))

Lilliefors (Kolmogorov-Smirnov) normality test

data: chol(solve(vcv(tree))) %*% residuals(fit)
D = 0.0694, p-value = 0.1371

The basic point is that we do not expect our input data in phylogenetic ANOVA or regression to be normally distributed - just the residual error controlling both for the main effects in our model, and (importantly, because this is most often forgotten) the tree.

A comment on the distribution of residuals (and data) for phylogenetic ANOVA

Trending Articles

Black Angus Grilled Artichokes

Attharintiki Daaredhi: Bappu Gari Bommo Lyrics Translation

Kusvirana kana mambotukana kunonaka sei? – Makwirirwo anodiwa nevakadzi!

Lady Gaga & Bruno Mars – Die With A Smile (Acoustic) – Single [iTunes Plus M4A]

Moondru Mudichu 01-05-2017 – Polimer tv Serial

Download: Bicko Bicko ft Rich Bizzy & Crew G- Wanfulanganya (Prod by: Bicko...

Karimnagar District Police Office Mobile Numbers List in Telangana State

Punjab School Education Board Latest Exam Result 2016 www.pseb.ac.in

Senior High School (SHS) DLL - Organization and Management

99 God Status for Whatsapp, Facebook

Practice Sheet of Right form of verbs for HSC Students

VMOU RSCIT Result 2017, RSCIT Result VMOU rkcl.vmou.ac.in Name Wise

RNS 510 C14 bricked after NAND erase

Michel Roux roast duck with cherries, cherry sauce and potatoes recipe on...

Online এ তৈরি করুন Fake Smart ID Card

मुख मैथुन से उठाएं सेक्स का भरपूर मज़ा, जानें क्या है इसका सही तरीकामुख मैथुन...

NCERT Solutions for Class 9th Sanskrit Chapter 3 पाथेयम्

[GET] High Ticket PDF Profits by Glynn Kosky - superhyped

Aoi Teshima – Mori no Chiisana Restaurant – Single [iTunes Plus M4A]

The Personal Assistant (JL Creation) (ENG+RUS) [L] [1.79GB]