Phylogenetic Tools for Comparative Biology

My new article about ancestral character estimation under the threshold model is now available online 'Accepted' (i.e., final submitted pre-proof ms.) at the journal Evolution. In this manuscript, I develop a new approach based on Bayesian MCMC to reconstruct the states for an ordered, discretely valued character at internal nodes of the tree based on the threshold model from evolutionary quantitative genetics. Under the threshold model, the state for a discrete character is determined by an underlying, unobservable continuous trait called liability. Whenever liability crosses a 'threshold' value, the discrete character changes state. In the figure above (which is from my article), I have simulated the evolution of liability on the branches of a tree using Brownian motion, and the plotted color changes track changes to the discrete character.
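To make the threshold model concrete, here is a minimal base R sketch (not the code used in the article). Liability evolves by Brownian motion along a single lineage, and the discrete state is read off from where liability sits relative to the thresholds. The helper name liab_to_state, the threshold positions, the rate, and the state labels are all assumptions made up for illustration.

```r
## Sketch only: one lineage, not a whole tree.
set.seed(1)
n_steps <- 1000
sig2 <- 1              # BM rate (assumed value)
dt <- 1 / n_steps
## liability: cumulative sum of normal increments, starting at 0
liability <- c(0, cumsum(rnorm(n_steps, mean = 0, sd = sqrt(sig2 * dt))))
## thresholds separating three ordered states (assumed positions)
thresholds <- c(-0.5, 0.5)
states <- c("a", "b", "c")
## hypothetical helper: map liability to the ordered discrete state
liab_to_state <- function(x, thresholds, states)
  states[findInterval(x, thresholds) + 1]
disc <- liab_to_state(liability, thresholds, states)
## number of times the discrete character changes state along the lineage
n_changes <- sum(disc[-1] != disc[-length(disc)])
```

Because the states are ordered along the liability axis, a lineage can only get from "a" to "c" by passing through "b".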

The threshold model is originally due to Wright (1934); however, it was first used for phylogenetic comparative analysis by Felsenstein (2005; 2012). My contribution is relatively minor. I just show how this model can also be used for ancestral state reconstruction of discrete characters using Bayesian MCMC; and that when the evolutionary process is 'threshold-like', the resulting ancestral states are more accurate than the states reconstructed using existing alternative models.

Unfortunately, the appearance (not content, so far as I know) of the equations in this pre-release article version is messed up. Hopefully that is fixed in the proofs! Please feel free to contact me if your institution does not have access to Evolution and you would like a copy of this article. I would be happy to send you the PDF.

A colleague and friend emailed me the other day to ask me if I had any R code for the so-called "random skewers" matrix comparison method of Cheverud (1996). According to this method, we 'skewer' our target pair of covariance matrices with a set of random selection vectors and then measure the (vector) correlation between the responses. The strength of this correlation (compared to the correlation of random vectors) is a measure of the similarity of our matrices. The method is perhaps most clearly described in Cheverud & Marroig (2007).

Well, I don't have code - but it was relatively easy to come up with. The biggest challenge is trying to figure out where the null distribution for the correlation comes from. In Cheverud & Marroig (2007) they suggest that we can just generate random vectors where the elements come from a uniform distribution & then compute the correlations between these random vectors. In my limited trials, though, this seems to result in elevated type I error. Alternatively, perhaps we should generate pairs of random covariance matrices using some model, and then compute the mean correlation between random vectors used to skewer our random covariance matrices. If the model for our covariance matrices is the same one used to generate our null distribution, then we seem to end up with type I error at or around the nominal level; as well as p-values more or less uniform on the interval [0, 1], which is another good sign. Unfortunately, this method is much more computationally intensive.

I've posted code for this function on the phytools page here. It depends on clusterGeneration, so this should be installed before attempting to run the code!

Let's try it using genPositiveDefMat(...,covMethod="unifcorrmat") for our covariance matrix & null hypothesis test:

> source("skewers.R")
> library(clusterGeneration)
> foo<-function(){
X1<-genPositiveDefMat(dim=5,covMethod="unifcorrmat")$Sigma
X2<-genPositiveDefMat(dim=5,covMethod="unifcorrmat")$Sigma
skewers(X1,X2,nsim=100,method="unifcorrmat")
}
> X<-replicate(200,foo(),simplify=TRUE)
> mean(unlist(X["p",])<=0.05)
[1] 0.065
> hist(unlist(X["p",]),main="histogram of p",xlab="p", col="grey")

Now let's try it for simulated empirical VCV matrices that are derived from data simulated using the same underlying covariance structure:

> library(mnormt)
> V<-genPositiveDefMat(dim=5,covMethod="unifcorrmat")$Sigma
> X<-rmnorm(n=30,varcov=V)
> Y<-rmnorm(n=30,varcov=V)
> X<-cov(X)
> Y<-cov(Y)
> skewers(X,Y,method="unifcorrmat")
$r
[1] 0.9397734
$p
[1] 0

This is offered without any guarantees. Please let me know if you find any errors.
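For readers who want to see the core computation, the random skewers correlation described above can be sketched in a few lines of base R. This is not the code of the posted skewers function: the names vec_cor and random_skewers are hypothetical, and drawing skewer elements uniformly on (-1, 1) is an assumption. It also leaves out the null distribution entirely.

```r
## Sketch only; hypothetical names, not the posted skewers() function.
## vector correlation (cosine of the angle) between two vectors
vec_cor <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

random_skewers <- function(G1, G2, nsim = 1000) {
  k <- nrow(G1)
  r <- numeric(nsim)
  for (i in seq_len(nsim)) {
    ## random selection vector, elements uniform on (-1, 1) (assumed range),
    ## scaled to unit length
    beta <- runif(k, min = -1, max = 1)
    beta <- beta / sqrt(sum(beta^2))
    ## response of each matrix to the same 'skewer'
    r[i] <- vec_cor(G1 %*% beta, G2 %*% beta)
  }
  mean(r)
}

set.seed(1)
## a random symmetric matrix (positive definite almost surely)
A <- crossprod(matrix(rnorm(25), 5, 5))
r_same <- random_skewers(A, A, nsim = 100)        # identical matrices
r_diff <- random_skewers(A, diag(5), nsim = 100)  # A vs. the identity
```

Two identical matrices respond identically to every skewer, so r_same should be 1 up to rounding; how far below 1 a value has to be to count as "different" is exactly the null distribution question discussed above.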

Recently an R-sig-phylo subscriber made the following request:

"I need to add a list of 230 species in a phylogenetic tree. Is there a logical way to add the species to the root of the genera to which it belongs, in a systematic way, that is, to make a function recognizes the name of the genera to which the species belongs, and so it adds the species to that root?"

I posted a solution to this, but I also thought that other phytools users might face the same problem so I have now automated this in the function add.species.to.genus. This, along with the matrix comparison function skewers, is also in a new minor phytools build (phytools 0.3-73).

The function works by first peeling the genus name out of the species name. It does this by looking for either the underscore character, "_", or a space character, " ". It then proceeds to identify the clade containing congeners in the tree by matching the genus name to the tip labels. Finally, it can either attach the new species to the root node of the most inclusive clade containing congeners; or it can attach the new species randomly within that clade. In general, the function works best if the input tree is ultrametric. Otherwise, it may return a tree without edge lengths!

Here's a quick demo:

> ## load phytools
> library(phytools)
Loading required package: ape
Loading required package: maps
Loading required package: rgl
> packageVersion("phytools")
[1] ‘0.3.73’
> ## here's our starting tree
> plotTree(tree,ftype="i")

> ## add a species to 'Genus2' at root of genus
> species<-"Genus2_sp3"
> t1<-add.species.to.genus(tree,species)
> plotTree(t1,ftype="i")

> ## add a species to 'Genus4' randomly in genus
> species<-"Genus4 sp7"
> t2<-add.species.to.genus(tree,species,where="random")
> plotTree(t2,ftype="i")

> ## add a set of species in a vector
> species<-c("Genus1_sp2", "Genus2_sp3", "Genus3_sp2", "Genus4_sp7", "Genus4_sp8", "Genus5_sp3")
> for(i in 1:length(species)) tree<-add.species.to.genus(tree,species[i],where="random")
> plotTree(tree,ftype="i")

If the user supplies a species name for a genus with only one representative in the tree; a species with no congeners in the tree; or a genus that is non-monophyletic, in each case the function will try to do something rational and return a warning.
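The genus-matching step described above can be sketched in base R as follows. This is only an illustration of the idea, not the internals of add.species.to.genus; the helper genus_of and the tip labels are made up.

```r
## Sketch only; hypothetical helper, not the internals of add.species.to.genus.
## genus = everything before the first underscore or space
genus_of <- function(x) sapply(strsplit(x, "[_ ]"), function(s) s[1])

tips <- c("Genus1_sp1", "Genus2_sp1", "Genus2_sp2", "Genus3_sp1")
new_species <- "Genus2 sp3"
## tip labels that share a genus with the new species
congeners <- tips[genus_of(tips) == genus_of(new_species)]
## how many congeners are already in the tree (0, 1, or more)
n_congeners <- length(congeners)
```

With a tree in hand, the node to attach to could then be found with something like the ape function getMRCA applied to the matching tips, which is only defined when there are at least two of them. This is one reason the cases of zero or one congener need special handling.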

That's it.

I'm working on a short comment (perhaps for publication, perhaps not) on ancestral state reconstruction methods when data for some tips are missing or uncertain. The reason for this is twofold: (1) enough people have asked me about this that it seems some kind of literature reference might be handy; and (2) little to no modification of existing approaches is required to reconstruct ancestral states when some tip states are uncertain or unknown. For instance, this is a routine matter in the reconstruction of ancestral DNA sequences when some tip states are ambiguous.

Possibly the least interesting case for reconstructing ancestral states when some of the data for tips are unknown is ML estimation of ancestral values of a continuous character under constant-rate Brownian motion. This is (1) because ML ancestral states for the parent nodes of lineages leading to tip taxa with missing data are exactly the same as the states we'd obtain by linearly interpolating on the basis of the states at subtending nodes, and (2) because ML states for the missing tips in the tree are (under BM which has an expected change of zero with variance σ²t along any edge of length t, remember) identical to the reconstructed state at the parent node.

Uninteresting as it might be in most cases, we can nonetheless do this. I just modified the phytools function anc.ML to attempt this if some tips in the tree are missing from the data vector, x. Here's a simple demo to show what I mean:

> library(phytools)
> source("anc.ML.R")
> tree<-pbtree(n=26,scale=1,tip.label=LETTERS)
> yy<-x<-fastBM(tree)
> x<-sample(x,13)
> X<-anc.ML(tree,x)
> setdiff(names(yy),names(x))
[1] "B" "D" "F" "J" "K" "L" "M" "N" "R" "T" "U" "V" "X"
> phenogram(tree,c(x,X$missing.x,X$ace),spread.labels=TRUE, spread.cost=c(1,0))

If you look closely at the list of missing taxa returned by setdiff and the plotted traitgram you'll see (just as promised) the reconstructed phenotypes of missing tips are connected to their ancestors (interpolated perfectly along the subtending branch) by precisely horizontal lines.
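The interpolation claim can be checked numerically on a toy case. The sketch below is not anc.ML: it takes a single node whose parent state and one daughter state are treated as known (the other daughter being a tip with no data), maximizes the BM likelihood for the node state, and compares the result to linear interpolation. All of the numbers are made up.

```r
## Sketch only: a single node with a known parent state and one known daughter.
p  <- 1.0    # state at the parent node
d  <- 3.0    # state at the daughter with data
t1 <- 0.4    # edge length from the parent to the node
t2 <- 0.6    # edge length from the node to the daughter
sig2 <- 1    # BM rate (does not affect the location of the maximum)
## log-likelihood of node state a under BM, conditioning on p and d
loglik <- function(a)
  dnorm(a, mean = p, sd = sqrt(sig2 * t1), log = TRUE) +
  dnorm(d, mean = a, sd = sqrt(sig2 * t2), log = TRUE)
a_ml <- optimize(loglik, interval = c(-10, 10), maximum = TRUE,
  tol = 1e-10)$maximum
## linear interpolation along the path from parent to daughter
a_interp <- p + (d - p) * t1 / (t1 + t2)
```

The missing tip contributes nothing to this likelihood, and its own most likely value is simply the state at the node, which is why it plots as a horizontal line in the traitgram.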

That's it.

This morning Joan Maspons kindly supplied code for several S3 methods that can be used with the phytools function phyl.pca to return results via print, summary, and biplot that are similar to the equivalent S3 methods for objects of class prcomp. After a few minor tweaks (source code here) I have added these methods to the phytools package. The object returned by phyl.pca is in no way changed (it's merely been assigned the class attribute "phyl.pca"), so any prior functions designed to work with phyl.pca should still function. The updates are in a new minor phytools release phytools 0.3-74.

Here's a quick demo:

> library(phytools)
Loading required package: ape
Loading required package: maps
Loading required package: rgl
> packageVersion("phytools")
[1] ‘0.3.74’
> ## simulate tree
> tree<-pbtree(n=26,scale=1,tip.label=LETTERS)
> ## simulate data
> X<-fastBM(tree,nsim=4)
> ## phylogenetic PCA
> pca<-phyl.pca(tree,X,mode="cov")
> ## S3 print method
> pca
Phylogenetic pca
Starndard deviations:
      PC1       PC2       PC3       PC4
1.1851495 1.1200792 0.9641014 0.4812711
Loads:
            PC1        PC2         PC3        PC4
[1,] -0.1915766  0.9282404  0.15015897 -0.2812837
[2,]  0.9346040 -0.2431641 -0.03290792 -0.2574947
[3,]  0.4657425  0.5602278 -0.65456104  0.2019371
[4,]  0.5296652  0.3231947  0.74759143  0.2368691
> summary(pca)
Importance of components:
                         PC1   PC2   PC3   PC4
Standard deviation     1.185 1.120 0.964 0.481
Proportion of Variance 0.368 0.328 0.243 0.060
Cumulative Proportion  0.368 0.696 0.939 1.000
> biplot(pca)
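For anyone unfamiliar with S3, the reason adding a class attribute is harmless can be shown with a toy object. The sketch below does not use phyl.pca at all; the class name "toy.pca" and the list components are invented for illustration.

```r
## Sketch only: a toy list standing in for the object returned by a PCA function.
obj <- list(values = c(4, 1), note = "toy values")
before <- obj
class(obj) <- "toy.pca"          # hypothetical class name
## S3 print method, dispatched on the class attribute
print.toy.pca <- function(x, ...) {
  cat("Toy pca\nStandard deviations:\n")
  print(sqrt(x$values))
  invisible(x)
}
out <- capture.output(print(obj))
## the underlying list is unchanged by adding the class
same <- identical(unclass(obj), before)
```

Code written before the class was added can still pull components out of the list with $ exactly as it did previously; only generic functions such as print now behave differently.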

The phytools function ancThresh, which conducts ancestral state reconstruction of a discrete character under the threshold model using Bayesian MCMC (Revell In press), also allows uncertain tip nodes in the tree. Uncertainty in the tip states is treated as a set of prior probability distributions on the states of terminal species in the tree. In this case, we can also compute posterior probabilities that each tip is in each state. These are automatically plotted by ancThresh if ancThresh(...,control=list(tipcol="estimated")). The burn-in that is used to compute these posterior probabilities is the user supplied burn-in in ancThresh(...,control=list(burnin)).

Unfortunately, and unlike the posterior probabilities for internal nodes, these probabilities are not returned to the user nor are they especially easy to compute. This is because the list component $mcmc only contains the implied states at internal nodes; and thus to get the posterior probabilities for tips, we need to use $liab and $par (which contains the sampled positions of the thresholds for each generation of the MCMC) to get the states.

Here is some code that can be used to compute these probabilities. In the example, mcmc is our object returned by ancThresh:

burnin<-200000 ## for instance
## find the row with the first post-burnin sample
ii<-which(mcmc$par[,"gen"]==burnin)
## number of tips
n<-length(tree$tip.label)
## states for the discrete character
states<-colnames(mcmc$par)[2:(ncol(mcmc$par)-1)]
## create and populate our matrix of posterior probabilities
PP<-matrix(NA,n,length(states),dimnames=
  list(colnames(mcmc$liab)[1:n],states))
for(i in 1:n){
  x<-vector(length=nrow(mcmc$liab)-ii+1)
  for(j in ii:nrow(mcmc$liab))
    x[j-ii+1]<-threshState(mcmc$liab[j,i],
      mcmc$par[j,2:(ncol(mcmc$par)-1)])
  PP[i,]<-summary(factor(x,levels=states))/length(x)
}

That's it.

I just wrote some code for my phylogeny methods class (class materials here) to simulate evolution on the phylogeny by the Ornstein-Uhlenbeck process using a constant θ and α without transforming the tree. I think that what I did is correct. I have posted the code (it is an internally called function of fastBM) here. It is also in a minor phytools version (phytools 0.3-77) which can be downloaded & installed from source.

One of the main reasons it could be interesting to simulate OU using my function - rather than by first transforming the tree using transform.phylo (formerly ouTree) in the geiger package & then simulating on the transformed tree using BM - is that it can be used to simulate with a value of θ (the position of the optimum) that is different from x₀, the state at the root node. As many of you probably know, it is not possible to fit this model to a tree with contemporaneous tips (the model is non-identifiable); however, simulating under the model presents no problem theoretically.

Here's a quick demo:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.3.77’
> tree<-pbtree(n=26,tip.label=LETTERS,scale=10)
> x<-fastBM(tree,a=0,theta=3,alpha=0.2,sig2=0.1, internal=TRUE)
> phenogram(tree,x,spread.labels=TRUE,spread.cost= c(1,0.01))

To create the plot above also requires a new version of phenogram, which handles label spreading slightly differently (using the full range on the vertical axis, rather than simply the range of tip values for extant species). This is also in the latest phytools version.
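To show what simulating OU with θ different from x₀ looks like without any tree transformation, here is a base R sketch for a single lineage. It is not the code called by fastBM; the function sim_ou_lineage is hypothetical, and it uses the standard exact transition density of the OU process over each small time step.

```r
## Sketch only; hypothetical function, not the code called by fastBM.
sim_ou_lineage <- function(x0, theta, alpha, sig2, t, n = 100) {
  dt <- t / n
  x <- numeric(n + 1)
  x[1] <- x0
  ## standard deviation of the exact OU transition over a step of length dt
  sd_step <- sqrt(sig2 / (2 * alpha) * (1 - exp(-2 * alpha * dt)))
  for (i in seq_len(n))
    x[i + 1] <- theta + (x[i] - theta) * exp(-alpha * dt) +
      rnorm(1, mean = 0, sd = sd_step)
  x
}
set.seed(1)
## same parameter values as the fastBM call above, total time 10
x <- sim_ou_lineage(x0 = 0, theta = 3, alpha = 0.2, sig2 = 0.1, t = 10)
## with no noise the lineage just relaxes from x0 towards theta
x_det <- sim_ou_lineage(x0 = 0, theta = 3, alpha = 0.2, sig2 = 0, t = 10)
```

The deterministic path ends at θ + (x₀ - θ)exp(-αt), so after 10 time units with α = 0.2 the expected value has covered most, but not all, of the distance from the root state to the optimum.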

A phytools user reported that phyl.cca breaks if either X or Y contains only one column. Even though we would not normally use CCA when one or the other of our xs or ys contains only one variable (why not use multivariable regression instead?) there is no theoretical reason why it should not work. The main reason seems to be R's different behavior towards vectors & matrices with one column.

The updated code is here. Please let me know if you run into any problems.
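The behavior of R in question is easy to reproduce. The following is just an illustration of the general pitfall in base R, not the actual change made to phyl.cca:

```r
X <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), c("v1", "v2")))
x_vec <- X[, 1]                  # dimensions are dropped: a named vector
x_mat <- X[, 1, drop = FALSE]    # still a 3 x 1 matrix
is.null(dim(x_vec))              # TRUE
ncol(x_vec)                      # NULL, which trips up code expecting a matrix
dim(x_mat)                       # 3 1
## coercing defensively, e.g. at the top of a function
x_safe <- as.matrix(x_vec)       # back to a 3 x 1 matrix, row names kept
```

Any function that calls ncol, or does matrix algebra on its inputs, will therefore behave differently when handed a single column extracted without drop=FALSE.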

I just learned of this recent article by Ryan Ellingson et al. which makes very neat use of the phytools functions densityMap and contMap for plotting a posterior density from stochastic mapping on a tree, and for mapping a continuous character on the tree (respectively). See Revell (2013) for more details. Below is a reproduction of their figure. On the left is a posterior density from stochastic mapping of 'benthic' vs. 'infaunal' habit; whereas on the right is a projection of the observed or reconstructed values of a continuous character, relative eye size. For more details see the original article, Ellingson et al. (2013).


**Disclaimer: neither densityMap nor contMap can be used (directly) to add tiny line images of fishes to your tree!

phytools has a function, anc.Bayes, for ancestral state estimation of continuous traits using Bayesian MCMC. In preparing a tutorial on ancestral character estimation for my phylogeny methods graduate class, I uncovered a serious error in how the prior probability was calculated. I have now fixed it and posted the code, along with a new phytools build (phytools 0.3-79) that contains this fix.

If you used anc.Bayes with an uninformative prior, this bug probably did not affect you. However, if you tried a strong informative prior in your analysis, then prior versions of this function will not have worked properly.

I also (re-) discovered that the way this function handles specification of the prior and other control parameters of the MCMC is pretty annoying. The function is probably due an overhaul! Hopefully, I can get to that at some point.

I just discovered a small bug in phenogram (the phytools function for plotting traitgrams in various ways) which causes phenogram(...,spread.labels=TRUE) to fail when labels actually don't overlap. So, for instance:

> library(phytools)
> tree<-pbtree(n=10,tip.label=LETTERS[10:1])
> ## no problem
> phenogram(tree,setNames(1:10,tree$tip.label))

> ## problem
> phenogram(tree,setNames(1:10,tree$tip.label), spread.labels=TRUE)
Error in optim(zz, ff, yy = yy, mo = mo, ms = ms, cost = cost, method = "L-BFGS-B", :
L-BFGS-B needs finite values of 'fn'

The bug arises because, if the labels do not overlap at all when we start our optimization to minimize overlap, the objective function becomes undefined. The fix just checks to ensure that there is some overlap, and if not spits back the original vertical positions of the tip labels.

> source("http://www.phytools.org/phenogram/v1.1/phenogram.R")
> ## fixed
> phenogram(tree,setNames(1:10,tree$tip.label), spread.labels=TRUE)

I also changed the default spread.cost, which controls how much to penalize overlap & deviation from the vertical position, x, respectively; although this is still totally under user control.

A few days ago, I received an email with the following text:

"What I have been trying to do is to color the branches of a tree according to the number of steps under parsimony that are required if certain species is attached to the corresponding branch of a backbone tree. As I am paleontologist, I usually need to place certain morphological-defined species onto a backbone constraint tree. But I could not find yet the function to do that. Any help is appreciated!"

Like many such emails (unfortunately) I flagged it to (hopefully) return to later, and then I rediscovered it today. Upon closer look, my first reaction was "why do people think I know how to do these things?" As I started to respond as such, it struck me that in fact this shouldn't be that hard (thanks, in large part, to Klaus Schliep's great package phangorn).

First, let's simulate the empirical case in which the 'backbone' tree for our N taxa is 'known', but we have data for N+1 taxa and we want to know where that extra tip belongs:

## load libraries
require(phytools)
require(phangorn)

## simulate tree
tt<-pbtree(n=26,tip.label=LETTERS[26:1])
## this is our true, full, unrooted phylogeny
plot(unroot(tt),use.edge.length=FALSE,type="unrooted", edge.width=2)

## simulate data on the true tree
X<-genSeq(tt,l=2000,format="phyDat",rate=0.01)
## now let's drop one of our tips, tip "Z" in this case
tree<-drop.tip(tt,LETTERS[26])
tree<-unroot(tree) # unroot

OK, now we have a data set (X) containing 26 taxa; but an unrooted tree (tree) containing only 25 tips, and we want to know (& plot) the change in parsimony score that results from attaching the one remaining tip to all of the 47 places (one for each of the 2 × 25 - 3 edges) in the tree to which it could be attached!

## first set all branch lengths to 1.0
## this is just for bind.tip
tree$edge.length<-rep(1,nrow(tree$edge))
## vector for parsimony scores
ps<-vector(length=nrow(tree$edge))
names(ps)<-apply(tree$edge,1,paste,collapse=",")
## tip to attach
tip.label<-LETTERS[26]
## loop over all edges in the tree
for(i in 1:nrow(tree$edge)){
  ## attach tip
  x<-bind.tip(tree,tip.label,1,where=tree$edge[i,2],
    position=0.5)
  x$edge.length<-NULL
  ## compute parsimony score
  ps[i]<-parsimony(x,X)
}
## subtract parsimony score on 25-taxon tree
ps<-ps-parsimony(tree,X)

Let's see what ps (the parsimony cost of attaching the tip to all edges of the tree) looks like in this one case:

> ps
 26,1 26,27  27,2  27,3 26,28  28,4 28,29 29,30  30,5
  121   122   172   172    97   124   121   146   178
30,31 31,32  32,6  32,7  31,8 29,33 33,34 34,35 35,36
  178   211   211   212   211   147   149   155   168
 36,9 36,10 35,37 37,38 38,11 38,12 37,13 34,39 39,40
  216   216   169   212   224   222   210   153   183
40,41 41,14 41,15 40,16 39,42 42,43 43,44 44,17 44,18
  202   216   216   203   183   210   210   217   217
43,19 42,20 33,45 45,46 46,47 47,21 47,22 46,23 45,48
  209   209   148   200   219   235   235   219   200
48,24 48,25
  216   218

names(ps) here gives the starting & ending node numbers for each edge in the tree & ps its corresponding parsimony cost.
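If we just want the single cheapest attachment point rather than a plot, we can pull it straight out of the named vector. The snippet below uses a handful of the values printed above, typed in by hand so that it runs on its own; with the real object we would use ps in place of ps_sub.

```r
## A few of the values printed above, entered by hand for illustration
ps_sub <- c("26,1" = 121, "26,27" = 122, "27,2" = 172, "26,28" = 97,
  "28,4" = 124)
## name of the edge with the lowest parsimony cost
best <- names(ps_sub)[which.min(ps_sub)]
## parent and daughter node numbers of that edge
best_edge <- as.integer(strsplit(best, ",")[[1]])
```

Note that which.min returns only the first minimum, so if several edges tie for the lowest cost something like which(ps == min(ps)) would be needed to see them all.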

The neatest thing is that we can easily plot these on the tree using the phytools function plotBranchbyTrait (which, unlike other plotting functions in phytools, calls the ape function plot.phylo internally):

plotBranchbyTrait(tree,ps,type="unrooted",prompt=TRUE, title="parsimony cost")

It's pretty clear that the cost of attaching tip "Z" is indeed lowest along the edge (the internode connecting the terminal edges leading to tips "V" and "Y") where tip "Z" actually belongs. Cool.

Note that if we had a model for the evolution of our characters (in this case, our data are DNA sequences - but the query pertained to morphological data from fossils so this is not guaranteed) we could easily construct a similar plot showing differences in the likelihood across the tree.

That's it.

A phytools user reported a small, but measurable, difference in Pybus & Harvey's γ computed by gammaStat in ape and ltt in the phytools package. Since I know that ltt and gammaStat return the same value of γ for trees that are genuinely ultrametric, my guess was that the tree might not be precisely ultrametric. In fact, this turns out to be the case. Even though the tree passed the check is.ultrametric, trees can only be ultrametric to the degree of numerical precision that can be represented in the computer, and trees that are read in from file will generally be considerably less precisely ultrametric than that (due to rounding in the input file).

The user was kind enough to send me her tree so I was able to verify that this was indeed the case. The tree passes the check of is.ultrametric, but as we decrease tol, it eventually fails. Furthermore, it turns out to be the case that γ reported by gammaStat for a slightly non-ultrametric tree is exactly the same as a precisely ultrametric tree where all the tips have the same height as the highest tip; and γ reported by ltt for a slightly non-ultrametric tree is (nearly) exactly the same as for a precisely ultrametric tree where all the tips have the same height as the lowest tip. Here's how I found out (using the problematic tree):

> tree<-read.nexus("Trees.nex")
> gammaStat(tree)
[1] 2.320419
> ltt(tree,plot=FALSE)$gamma
[1] 2.321524
>
> ## compute the heights of all tips
> H<-nodeHeights(tree)
> h<-sapply(1:length(tree$tip.label),function(ii,h,e) h[which(e==ii)],h=H[,2],e=tree$edge[,2])
>
> ## the difference from the max height
> d<-max(h)-h
> ## add each difference to the tip
> for(i in 1:length(d))
+ tree$edge.length[which(tree$edge[,2]==i)] <- tree$edge.length[which(tree$edge[,2]==i)]+d[i]
>
> ## now they're the same & equal to gammaStat
> gammaStat(tree)
[1] 2.320419
> ltt(tree,plot=FALSE)$gamma
[1] 2.320419
>
> ## start again
> tree<-read.nexus("Trees.nex")
> gammaStat(tree)
[1] 2.320419
> ltt(tree,plot=FALSE)$gamma
[1] 2.321524
>
> ## the difference from the min
> d<-min(h)-h
> ## add each difference to the tip
> for(i in 1:length(d))
+ tree$edge.length[which(tree$edge[,2]==i)] <- tree$edge.length[which(tree$edge[,2]==i)]+d[i]
>
> ## now they're the same & equal to ltt
> gammaStat(tree)
[1] 2.321529
> ltt(tree,plot=FALSE)$gamma
[1] 2.321529

I just posted a new version of ltt that first checks whether the tree passes is.ultrametric and, if it does, corrects the tree to be precisely ultrametric so that the computed γ lines up with gammaStat.
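The logic of 'ultrametric up to a tolerance' and of the correction can be sketched with nothing more than a vector of tip heights. The heights and tolerances below are made up, and this is not the code in ltt; it only mirrors the max-height adjustment used in the demo above.

```r
## Sketch only; tip heights standing in for a tree read from a rounded file.
h <- c(10, 10 + 3e-7, 10 - 2e-7, 10 + 1e-7)
## spread of the tip heights
spread <- diff(range(h))
passes_loose  <- spread < 1e-6    # TRUE: 'ultrametric' at this tolerance
passes_strict <- spread < 1e-9    # FALSE: fails once the tolerance is small
## amount to add to each terminal edge so all tips reach the highest tip
d <- max(h) - h
h_fixed <- h + d
```

In a real tree, d[i] would be added to the terminal edge leading to tip i, just as in the loop over tree$edge.length shown earlier.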

Just towards the end of this week I started working on a new project, Rphylip, which is designed to provide an R interface for the longstanding, popular phylogenetics package PHYLIP. PHYLIP does a really remarkable number of different analyses - from a wide range of phylogeny inference methods, to a variety of comparative methods (including several important new approaches that are implemented nowhere else), to tree drawing and other things. A complete list of the programs in the PHYLIP package is here.

My goal is to create functions that interface with the programs of PHYLIP, but allow the user to operate completely within R. That is, all inputs for the PHYLIP programs are created by R; and all outputs are parsed back into R or printed to the R console. Finally, Rphylip will even clean up the input & output files it has created for the PHYLIP programs.
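As a rough illustration of the 'write input, run, parse, clean up' pattern, here is a base R sketch of the first and last steps only. The helper write_phylip_infile is hypothetical and is not an Rphylip function; it writes sequences in the sequential PHYLIP layout (a header with the number of taxa and sites, then each name padded to 10 characters followed by its sequence). Running the PHYLIP program itself is left as a comment.

```r
## Sketch only; hypothetical helper, not an Rphylip function.
write_phylip_infile <- function(seqs, file) {
  ## header: number of taxa and number of sites
  header <- paste(length(seqs), nchar(seqs[1]))
  ## taxon names truncated or padded to exactly 10 characters
  nm <- sprintf("%-10s", substr(names(seqs), 1, 10))
  writeLines(c(header, paste0(nm, seqs)), file)
}
seqs <- c(Lemur = "ACGTAC", Tarsier = "ACGTTC", Human = "ACCTAC")
infile <- file.path(tempdir(), "infile")
write_phylip_infile(seqs, infile)
written <- readLines(infile)
## ... here the PHYLIP program would be run and its output files parsed ...
## clean up the file we created
unlink(infile)
```

Writing to a temporary directory in this sketch avoids clobbering any existing file named infile in the working directory.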

So far, I have created test functions to interface with only a few of the programs of PHYLIP: Rdnaml (for dnaml); Rcontml (for contml); and Rcontrast (for contrast).

To use Rphylip one has to, of course, first install PHYLIP on one's computer. Having done this, it is straightforward to run any of the programs of PHYLIP through R. Below is a simple demo using Rdnaml under the default conditions (which are slightly different than the defaults for dnaml). Some output excluded:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> X<-read.dna("primates.dna")
> X
12 DNA sequences in binary format stored in a matrix.

All sequences of same length: 232
Labels: Lemur Tarsier Sq.Monkey J.Macaque R.Macaque E.Macaque ...

Base composition:
a    c    g    t
0.364 0.414 0.041 0.181

> tree<-Rdnaml(X,path="C:/Users/Liam/phylip/exe")
Warning:
One or more of "infile", "outfile", "outtree"
was found in your current working directory and may be overwritten

Press ENTER to continue or q to quit:

...

Nucleic acid sequence Maximum Likelihood method, version 3.695

...

[ASCII diagram of the estimated unrooted tree]

remember: this is an unrooted tree!
...

Translation table
-----------------
 1    Lemur
 2    Tarsier
 3    Sq.Monkey
 4    J.Macaque
 5    R.Macaque
 6    E.Macaque
 7    B.Macaque
 8    Gibbon
 9    Orangutan
10    Gorilla
11    Chimp
12    Human

> tree$logLik
[1] -2200.766
> require(phangorn)
Loading required package: phangorn
Loading required package: rgl
> plot(midpoint(tree),edge.width=2,no.margin=TRUE)

The source code for the functions I've written so far in Rphylip is here and I posted a little package build with some minimal documentation for the three interface functions here: http://www.phytools.org/Rphylip/. Feedback welcome!

I've spent a little more time working on Rphylip. Here are a few updates since my last post: I have now added the functions Rdnapars (for dnapars) and Rneighbor (for neighbor). I have also cleaned up the code in various ways by pulling out (for instance) repeated operations like writing a PHYLIP formatted DNA input file, etc.

Perhaps the most interesting addition (in my opinion) - at least for Windows users who work in R - is the addition of an internal function findPath. (See code here.) This function means that users of Rphylip may no longer be required to specify the path to the directory containing the PHYLIP executable files. Instead, if no path is supplied, R will check various places on the computer where it is likely to be found (for instance, the current working directory, C:\Program Files\, etc.). To my surprise, this option works quite well - and also does not bomb R if the directory containing PHYLIP cannot be found (instead returning a sensible error message).

For instance, witness the following example in which PHYLIP is hidden:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> data(primates)
> tree<-Rdnapars(primates)
Error in Rdnapars(primates) :
No path provided and was not able to find path to dnapars

OK, now let's put the PHYLIP folder back someplace sensible, in this case in C:/Program Files:

> tree<-Rdnapars(primates)
...

DNA parsimony algorithm, version 3.695

One most parsimonious tree found:

[ASCII diagram of the most parsimonious tree]

requires a total of 593.000

...

Translation table
-----------------
 1    Lemur
 2    Tarsier
 3    Sq.Monkey
 4    J.Macaque
 5    R.Macaque
 6    E.Macaque
 7    B.Macaque
 8    Gibbon
 9    Orangutan
10    Gorilla
11    Chimp
12    Human

> plot(tree,edge.width=2,no.margin=TRUE)

That worked.
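The kind of search that findPath performs can be sketched roughly as follows. This is not the Rphylip code: find_exe_dir is a hypothetical helper, and the candidate directories here are temporary stand-ins rather than the real locations that get checked.

```r
## Sketch only; hypothetical helper, not the Rphylip findPath function.
find_exe_dir <- function(exe, candidates) {
  for (dir in candidates)
    if (file.exists(file.path(dir, exe))) return(dir)
  NULL   # not found; the caller can then raise an informative error
}
## demonstrate with a stand-in file in a temporary directory
fake_dir <- file.path(tempdir(), "phylip_sketch")
empty_dir <- file.path(tempdir(), "phylip_sketch_empty")
dir.create(fake_dir, showWarnings = FALSE)
file.create(file.path(fake_dir, "dnapars.exe"))
found <- find_exe_dir("dnapars.exe", c(empty_dir, fake_dir))
missing_exe <- find_exe_dir("no_such_program.exe", c(empty_dir, fake_dir))
## clean up
unlink(fake_dir, recursive = TRUE)
```

Returning NULL rather than failing inside the search is what allows the calling function to stop with a readable message like the one shown above.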

Rphylip is also now on GitHub - so you can follow updates & changes to the code there.

I just tried Rphylip in R on Mac OS X. It appears to work fine, although the internal function findPath will not - so the full path to the PHYLIP executables needs to be supplied by the user. However, I also discovered that it is very important to follow the installation protocol for Mac OS X described here on the PHYLIP installation page, or else it will not work. Since this protocol involves commands that can be executed from the terminal, I may write a simple R script for Rphylip that will run these commands for the user to make it even easier.

More soon.

I've been gradually plugging away at Rphylip (my R wrapper for PHYLIP), which now includes working interfaces for contml, contrast, dnaml, dnamlk, dnapars, neighbor, threshml, and treedist. Obviously, I have a long way to go; however some of the new additions cover functionality that can't be found in any other software. For example, Rcontrast (interface for contrast) does regular phylogenetic contrasts - but also allows for within-species sampling (Felsenstein 2008). Rthreshml (interface for threshml, which is not actually yet in PHYLIP but will be in the next release) implements the threshold model which can be used for estimating the correlation between discrete and continuous characters, among other things (Felsenstein 2012).

Here's a quick demo of Rthreshml:

> library(Rphylip)
Loading required package: ape
> require(phytools)
Loading required package: phytools
Loading required package: maps
Loading required package: rgl
> tree<-pbtree(n=100)
> V<-matrix(c(1,0,0.8,0,
0,2,0,1.2,
0.8,0,1,0,
0,1.2,0,1),4,4)
> V
   [,1] [,2] [,3] [,4]
[1,] 1.0 0.0 0.8 0.0
[2,] 0.0 2.0 0.0 1.2
[3,] 0.8 0.0 1.0 0.0
[4,] 0.0 1.2 0.0 1.0
> X<-sim.corrs(tree,V)
> th<-setNames(c(0,Inf),LETTERS[1:2])
> X<-data.frame(X[,1],X[,2],sapply(X[,3],threshState,th), sapply(X[,4],threshState,th))
> names(X)<-paste("v",1:ncol(X),sep="")
> X
   v1 v2 v3 v4
t15   2.8841985 4.98211106 B B
t26   2.1335369 7.73244895 A B
t27   0.9770008 6.58944668 A B
t44   3.5651266 4.94113242 B B
t85   1.6813201 3.73859795 A B
t86   1.3738674 3.36676417 A B
t69   1.1805336 2.17284155 A B
t48   2.2652202 3.93811812 B B
....

> fit<-Rthreshml(tree,X,proposal=0.1)

....

Threshold character Maximum Likelihood method version 3.7a

....

Markov chain Monte Carlo (MCMC) estimation of covariances:

                                 Acceptance   Norm of change
Chains (20)                         rate       in transform
------                           ----------   --------------

Burn-in: 1000 updates
Chain 1: 100000 updates ........   0.8623        0.339614
Chain 2: 100000 updates ........   0.8651        0.215640
Chain 3: 100000 updates ........   0.8674        0.160055
Chain 4: 100000 updates ....

....

Covariance matrix of continuous characters
and liabilities of discrete characters
(the continuous characters are first)

          1         2         3         4
1   0.94584   0.09582   0.73779  -0.15732
2   0.09582   2.07466  -0.17442   1.13575
3   0.73779  -0.17442   1.00000  -0.31895
4  -0.15732   1.13575  -0.31895   1.00000

....

> fit
$Covariance_matrix
         v1       v2       v3       v4
v1  0.94584  0.09582  0.73779 -0.15732
v2  0.09582  2.07466 -0.17442  1.13575
v3  0.73779 -0.17442  1.00000 -0.31895
v4 -0.15732  1.13575 -0.31895  1.00000

$Transform_indepvar_liab
         v1       v2       v3      v4
v1  0.97254  0.00000  0.00000 0.00000
v2  0.09853  1.43699  0.00000 0.00000
v3  0.75862 -0.17339  0.62804 0.00000
v4 -0.16177  0.80146 -0.09118 0.56849

$Var_change
      v1       v2       v3       v4
2.859484 1.708908 0.238306 0.213801

$Transform_liab_cont
         v1       v2       v3      v4
v1 -0.08096  0.81423 -0.20088 0.53863
v2 -0.69130 -0.26046 -0.67287 0.03887
v3 -0.08890 -0.47040  0.32064 0.81732
v4  0.71249 -0.21889 -0.63568 0.20090

We can see that the estimated covariance matrix is actually very close to our generating matrix. Cool.

Unlike the threshml standalone, we can provide our discrete & continuous characters in any order, and our discrete character can be coded any way we want - so long as it has only two states. Rthreshml creates the input file, reads the output back in, and sorts the estimated covariance matrix back into the order of the columns of X.
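
To make the two-state coding concrete, here is a minimal base R sketch of how a liability value maps to a discrete state under a single threshold. The helper liab_to_state is hypothetical (it is not the phytools threshState function, although it is meant to mirror what the threshState call above appears to do), and the liability values are made up for illustration.

```r
## Hypothetical helper (not part of phytools or Rphylip): map liabilities
## to discrete states, given a named vector of upper thresholds.
liab_to_state <- function(liability, th) {
  ## findInterval() with left.open = TRUE counts the thresholds lying
  ## strictly below each liability; adding 1 gives the index of the state
  idx <- findInterval(liability, th, left.open = TRUE) + 1
  names(th)[idx]
}

## same threshold vector as in the demo: state "A" up to 0, "B" above it
th <- setNames(c(0, Inf), LETTERS[1:2])

## made-up liabilities, purely for illustration
liab <- c(-1.3, 0.2, 2.5, -0.1)
states <- liab_to_state(liab, th)
print(states)
```

Because only the position of the liability relative to the threshold matters, the labels themselves ("A" & "B" here) are arbitrary.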

Since my last post on Rphylip I've added R interfaces for a few more PHYLIP programs: consense (Rconsense), proml (Rproml), and promlk (Rpromlk). The latter two do ML phylogeny estimation (without & with a molecular clock, respectively) from protein sequences.

In addition to these new interfaces, I also added a few different functions for handling protein sequences in R. I should say that I first attempted to establish whether any such methods already existed in other phylogeny packages by running help.search("protein") on all of my installed packages. What I didn't think to do was search for "amino acid". Had I done so, I would have found the function read.aa in the phangorn package, with which at least my Rphylip function read.protein is basically redundant.

The protein handling functions in Rphylip are as follows:

read.protein, an amino acid sequence reading function that can accept data files in formats "fasta" or "sequential". If format="sequential" or format="fasta" with all sequences of the same length then read.protein will store the input data in a matrix; whereas if format="fasta" and sequences have different lengths, then the data will be stored as a list. In both cases the object is assigned the class attribute "proseq" (protein sequence, in case it wasn't obvious enough).

print.proseq is an S3 print method for objects of class "proseq". The print output is designed to mimic that of print.DNAbin in the ape package. (See demo below.)

Finally, as.proseq converts protein sequences from other formats. At the moment, it only recognizes & converts from objects of class phyDat (with attr(x,"type")="AA") from the phangorn package.
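
For readers who want to see the matrix-versus-list behavior described above in miniature, the following is a rough base R sketch. It is not the Rphylip implementation of read.protein: the function name read_fasta_sketch is hypothetical, it takes lines of text rather than a file name, and it does not assign the "proseq" class.

```r
## Hypothetical sketch (not Rphylip's read.protein): parse FASTA-formatted
## lines into a matrix if all sequences have equal length, else a list.
read_fasta_sketch <- function(lines) {
  hdr <- grep("^>", lines)                    # positions of header lines
  labels <- sub("^>\\s*", "", lines[hdr])     # sequence labels
  ends <- c(hdr[-1] - 1, length(lines))       # last line of each record
  seqs <- mapply(function(a, b) {
    s <- if (b > a) paste(lines[(a + 1):b], collapse = "") else ""
    strsplit(toupper(gsub("\\s", "", s)), "")[[1]]
  }, hdr, ends, SIMPLIFY = FALSE)
  names(seqs) <- labels
  if (length(unique(lengths(seqs))) == 1) do.call(rbind, seqs) else seqs
}

## toy input; with a real file one might use readLines("chloroplast.aa")
aligned <- read_fasta_sketch(c(">sp1", "MKV", "LA", ">sp2", "MRVLA"))
ragged <- read_fasta_sketch(c(">sp1", "MKVLA", ">sp2", "MRV"))

## amino acid composition, similar in spirit to the print method's summary
comp <- table(aligned) / length(aligned)
print(round(comp, 3))
```

In this sketch, aligned comes back as a 2 x 5 character matrix with the labels as row names, whereas ragged comes back as a named list.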

OK - here is a brief illustration of Rproml, the proml interface, using the chloroplast amino acid dataset from phangorn, written to file:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> X<-read.protein("chloroplast.aa")
> X
19 protein sequences in character format stored in a matrix.

All sequences of same length: 5144

Labels: Trico Nostoc Syn6301 Prochl Syn8102 Thermo ...

Amino acid composition:
    A     C     D     E     F     G     H     I     K     L
0.087 0.007 0.040 0.047 0.051 0.091 0.029 0.074 0.040 0.102
    M     N     P     Q     R     S     T     V     W     Y
0.025 0.036 0.048 0.040 0.052 0.054 0.052 0.073 0.022 0.032

> tree<-Rproml(X,speedier=TRUE,global=FALSE)

...

Amino acid sequence Maximum Likelihood method, version 3.695

...

Adding species:
   1. 14
   2. 10
   3. 11
   4. 7
   5. 16
   6. 19
   7. 3
   8. 2
   9. 9
10. 8
11. 13
12. 12
13. 5
14. 1
15. 15
16. 18
17. 4
18. 17
19. 6

...

Jones-Taylor-Thornton model of amino acid change

+-2
   +-12
+--3 +-16
| |
| +-----7
|
| +-----18
| |
|    +-15    +----12
|    | | +-11
|    | | | +--10
|    | +--9
|    |    | +--------11
| +--2    +--6
| | | +-------9
   | | |
| | |    +----15
| | |   +--13
| | |   |   +--17
|    +--1 +--10
|    | | | +------13
|    | | +--5
|    | |    | +-----19
|    | |    +--4
| +-14 | +-------14
| | | |
| | | +-----------8
| | |
| | | +---3
16--8 +-17
| |    | +------4
| |    +----7
| | +---5
| |
| +----6
|
+----1

remember: this is an unrooted tree!

...

Translation table
-----------------
1    Trico
2    Nostoc
3    Syn6301
4    Prochl
5    Syn8102
6    Thermo
7    Syn6803
8    Gloeo
9    Odont
10 Porph
11 Cyanid
12 Gracil
13 Nephros
14 Chlamy
15 Arabid
16 Anabae
17 March
18 Cyanoph
19 Chlorel

> plot(tree,type="unrooted",no.margin=TRUE,edge.width=2, cex=0.8)

Rphylip can be obtained from its webpage or from github.

I keep plugging away at the Rphylip project. I have now written R interfaces covering 15* of the 35* programs in the PHYLIP package (*including threshml, which is technically not yet in PHYLIP), as well as a number of helper functions. The latest two added to that list are protpars (for MP tree inference from protein sequences) and protdist (for evolutionary distance estimation from protein sequences). The following is a quick demo of the latter:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> packageVersion("Rphylip")
[1] ‘0.1.9’
> data(chloroplast)
> chloroplast
19 protein sequences in character format stored in a matrix.

All sequences of same length: 5144

Labels: Trico Nostoc Syn6301 Prochl Syn8102 Thermo ...

Amino acid composition:
    A     C     D     E     F     G     H     I     K     L
0.087 0.007 0.040 0.047 0.051 0.091 0.029 0.074 0.040 0.102
    M     N     P     Q     R     S     T     V     W     Y
0.025 0.036 0.048 0.040 0.052 0.054 0.052 0.073 0.022 0.032

> D<-Rprotdist(chloroplast,model="PAM")

...

Protein distance algorithm, version 3.695

Settings for this run:
P Use JTT, PMB, PAM, Kimura, categories model? Dayhoff PAM matrix
G Gamma distribution of rates among positions? No
C    One category of substitution rates? Yes
W Use weights for positions? No
M    Analyze multiple data sets? No
I Input sequences interleaved? Yes
0    Terminal type (IBM PC, ANSI)? IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes

Are these settings correct? (type Y or the letter for one to change)

Computing distances:
1
2 .
3 ..
4 ...
5 ....
6 .....
7 ......
8 .......
9 ........
10    .........
11    ..........
12    ...........
13    ............
14    .............
15    ..............
16    ...............
17    ................
18    .................
19    ..................

Output written to file "outfile"

Done.

Press enter to quit.

> tree<-Rneighbor(D)

...

Neighbor-Joining/UPGMA method version 3.695

Neighbor-joining method

Negative branch lengths allowed

+----7
!
!    +-------9
!    +-8
!    ! ! +---12
!    +-9 +-5
!    ! !   +--10
!    ! !
!    ! +--------11
!    !
! +-10   +-----13
! ! ! +-6
! ! ! ! ! +-------14
! ! ! ! +-4
! ! +-7   +------19
!    +-11 !
!    ! ! ! +--17
!    ! ! +--2
! +-13 !    +----15
! ! ! !
! ! ! +-----18
! ! !
!    +-14 +----------8
!    ! !
!    ! !    +-----4
!    ! ! +--1
! +-15 +-12 +---5
! ! !    !
! ! !    +--3
17-16 !
! ! +----6
! !
! ! +-16
! +--3
!    +-2
!
+----1

remember: this is an unrooted tree!

...

Translation table
-----------------
1    Trico
2    Nostoc
3    Syn6301
4    Prochl
5    Syn8102
6    Thermo
7    Syn6803
8    Gloeo
9    Odont
10 Porph
11 Cyanid
12 Gracil
13 Nephros
14 Chlamy
15 Arabid
16 Anabae
17 March
18 Cyanoph
19 Chlorel

> plot(tree,type="unrooted",no.margin=TRUE,edge.width=2)

Unfortunately - as I don't know where these data came from, I can't tell whether this is a good tree, a bad tree, or an average tree; however I believe that it is quite similar to the tree obtained using Rproml.
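
One way to put a number on "quite similar" would be a topological distance between the two trees, for instance with dist.topo in ape. The sketch below uses two small made-up trees, because the demos above both store their result in an object called tree; the names tree_ml & tree_nj are hypothetical stand-ins for the Rproml and Rneighbor results saved under separate names.

```r
library(ape)

## toy stand-ins (assumed names) for the ML and NJ trees; both are written
## in unrooted form, with a basal trichotomy
tree_ml <- read.tree(text = "((a,b),(c,d),e);")
tree_nj <- read.tree(text = "((a,c),(b,d),e);")

## topological distance (the default method counts the bipartitions that
## are not shared between the two trees)
d_same <- dist.topo(tree_ml, tree_ml)
d_diff <- dist.topo(tree_ml, tree_nj)
print(c(same = as.numeric(d_same), different = as.numeric(d_diff)))
```

A distance of zero means the two topologies are identical; larger values mean that fewer splits are shared.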

Rphylip is available here and on github.

Here are three different ways to compute the among-species variance-covariance matrix for multiple characters on the tree. If we assume that the characters are evolving by Brownian motion, this matrix is an unbiased estimate of the instantaneous diffusion matrix of the evolutionary process.

The three options are (1) the GLS estimating equation (used in Revell 2009); (2) the method of equation 5 in Butler et al. (2000); and (3) phylogenetically independent contrasts, or PICs (Felsenstein 1985).

Here is code to do each of these in R:

## compute evolutionary VCV matrix using method of
## Revell (2009)
A<-matrix(1,nrow(X),1)%*%apply(X,2,fastAnc,tree=tree)[1,]
V1<-t(X-A)%*%solve(vcv(tree))%*%(X-A)/(nrow(X)-1)

## compute evolutionary VCV matrix using method of
## Butler et al. (2000)
Z<-solve(t(chol(vcv(tree))))%*%(X-A)
V2<-t(Z)%*%Z/(nrow(X)-1)

## compute the evolutionary VCV matrix using pics
Y<-apply(X,2,pic,phy=tree)
V3<-t(Y)%*%Y/nrow(Y)

OK - now let's try it with some simulated data:

> ## load packages
> require(phytools)
Loading required package: phytools
Loading required package: ape
Loading required package: maps
Loading required package: rgl
> ## simulate tree & data
> simV<-matrix(c(1,1.2,0,1.2,2,0,0,0,3),3,3)
> simV ## generating covariance matrix
     [,1] [,2] [,3]
[1,]  1.0  1.2    0
[2,]  1.2  2.0    0
[3,]  0.0  0.0    3
> tree<-pbtree(n=100,scale=1) ## simulate tree
> X<-sim.corrs(tree,simV) ## simulate data
>
> ## Revell (2009)
> A<-matrix(1,nrow(X),1)%*%apply(X,2,fastAnc,tree=tree)[1,]
> V1<-t(X-A)%*%solve(vcv(tree))%*%(X-A)/(nrow(X)-1)
>
> ## Butler et al. (2000)
> Z<-solve(t(chol(vcv(tree))))%*%(X-A)
> V2<-t(Z)%*%Z/(nrow(X)-1)
>
> ## pics
> Y<-apply(X,2,pic,phy=tree)
> V3<-t(Y)%*%Y/nrow(Y)
>
> ## compare
> V1
          [,1]     [,2]      [,3]
[1,] 1.0896188 1.375287 0.1157135
[2,] 1.3752866 2.280321 0.2996510
[3,] 0.1157135 0.299651 2.9491762
> V2
          [,1]     [,2]      [,3]
[1,] 1.0896188 1.375287 0.1157135
[2,] 1.3752866 2.280321 0.2996510
[3,] 0.1157135 0.299651 2.9491762
> V3
          [,1]     [,2]      [,3]
[1,] 1.0896188 1.375287 0.1157135
[2,] 1.3752866 2.280321 0.2996510
[3,] 0.1157135 0.299651 2.9491762

Although these calculations look different, they produce the same result. The PIC method is a bit trickier, but for methods (1) & (2) it is fairly easy to see why this is the case. In case (1) we compute X^T C^-1 X; whereas in case (2) we first compute Z = C^-1/2 X (Cholesky decomposition is a kind of matrix square root), followed by Z^T Z, which is the same as X^T C^-1/2 C^-1/2 X (in which the two C^-1/2 terms are the inverses of the upper & lower Cholesky matrices, respectively), which is in turn equivalent to X^T C^-1 X. In both cases we need to center X on the vector of ancestral values & divide by n - 1. (The PIC version divides by the number of contrasts, which is also n - 1.)
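
The equivalence of methods (1) & (2) can also be checked numerically without a tree at all. The following base R sketch uses an arbitrary symmetric positive-definite matrix as a stand-in for vcv(tree) and random numbers as a stand-in for the trait data. The GLS calculation of the root state is my assumption about what fastAnc returns for the root node; the object names are otherwise chosen to match the code above.

```r
set.seed(1)
n <- 8

## arbitrary symmetric positive-definite matrix standing in for vcv(tree)
B <- matrix(rnorm(n * n), n, n)
C <- crossprod(B) + diag(n)

## random stand-in for the trait data (3 characters)
X <- matrix(rnorm(n * 3), n, 3)

## GLS estimate of the root state for each character (assumed to match
## the root value from fastAnc), used to center X
invC <- solve(C)
one <- matrix(1, n, 1)
a <- solve(t(one) %*% invC %*% one) %*% t(one) %*% invC %*% X
Xc <- X - one %*% a

## method (1): X^T C^-1 X
V1 <- t(Xc) %*% invC %*% Xc / (n - 1)

## method (2): Z = L^-1 X, where L is the lower Cholesky factor of C
L <- t(chol(C))
Z <- solve(L) %*% Xc
V2 <- t(Z) %*% Z / (n - 1)

same <- isTRUE(all.equal(V1, V2))
print(same)
```

Here chol returns the upper triangular factor, so t(chol(C)) is the lower factor L with L L^T = C, which is why solve(t(chol(vcv(tree)))) plays the role of C^-1/2 in the code above.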
