Quantcast
Viewing all 802 articles
Browse latest View live

Important updates to internal function used by make.simmap

I just rewrotethe function that is used internally to compute the likelihood of any transition matrix Q given the data & tree; and to estimate the ML value of Q. This function is highly similar (but slightly different in important ways), both functionally & structurally, to ace(...,type="discrete") in the ape package. It is also in some ways functionally redundant with ace and fitDiscretein that they fit the same model - although they differ in that the prior distribution on the root node can be controlled in this function. In fact, this is the main reason for this 're-write' because in previous versions of phytools, I suspect that this prior was being used improperly in the calculations. Now, when this function is used internally by make.simmap (which is it's main purpose, although it can also be used alone) the prior, pi, can influence not only the states sampled at the root (and thus, indirectly, for other nodes in the tree), but it can also affect the estimated value of Q under likelihood.

Here's a quick demo of using the function alone to fit the Mk discrete character evolution model:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)

Simulate tree & data:

library(phytools)
tree<-pbtree(n=100,scale=1)
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
Q
##    a  b  c
## a -1 1 0
## b 1 -2 1
## c 0 1 -1
x<-sim.history(tree,Q,anc="a")$states
## Done simulation(s).
x
##  t16  t17  t63  t64  t39  t29  t54  t55   t5  t18  t50  t51  t33  t19  t20 
## "a" "a" "a" "a" "a" "a" "c" "b" "c" "b" "c" "c" "c" "b" "c"
## t10 t99 t100 t88 t76 t79 t80 t93 t94 t89 t90 t44 t95 t96 t68
## "c" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "c"
## t69 t65 t81 t82 t77 t78 t60 t9 t40 t83 t84 t72 t73 t56 t57
## "c" "c" "c" "c" "c" "c" "c" "a" "c" "b" "b" "b" "b" "c" "c"
## t26 t21 t61 t62 t42 t43 t52 t53 t6 t7 t2 t58 t59 t11 t12
## "b" "b" "b" "b" "b" "b" "c" "c" "a" "a" "a" "a" "a" "a" "a"
## t1 t22 t23 t3 t4 t45 t46 t27 t85 t86 t35 t36 t37 t38 t28
## "a" "c" "c" "b" "a" "a" "a" "a" "a" "a" "b" "b" "b" "a" "a"
## t24 t25 t30 t74 t75 t14 t15 t13 t8 t87 t91 t92 t70 t71 t49
## "a" "b" "b" "a" "b" "a" "a" "a" "a" "b" "b" "b" "b" "b" "b"
## t47 t48 t41 t34 t66 t67 t31 t32 t97 t98
## "b" "a" "b" "b" "b" "b" "a" "b" "a" "a"

Fit the model:

obj<-fitMk(tree,as.factor(x),model="SYM")
obj
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b c
## a -0.930216 0.930216 0.00000
## b 0.930216 -2.094586 1.16437
## c 0.000000 1.164370 -1.16437
##
## Fitted (or set) value of pi:
## a b c
## 0.3333333 0.3333333 0.3333333
##
## Log-likelihood: -66.577856

It is also used internally by make.simmap. Here, with a prior that is not flat:

trees<-make.simmap(tree,x,pi=setNames(c(1,0,0),letters[1:3]),nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b        c
## a -0.8965016 0.8965016 0.00000
## b 0.8965016 -2.0898515 1.19335
## c 0.0000000 1.1933498 -1.19335
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c 
## 1 0 0
## Done.
obj<-summary(trees)
obj
## 100 trees with a mapped discrete character with states:
## a, b, c
##
## trees have 34.17 changes between states on average
##
## changes are of the following types:
## a,b a,c b,a b,c c,a c,b
## x->y 12 0 6.36 9.63 0 6.18
##
## mean total time spent in each state is:
## a b c total
## raw 12.1857664 8.5412422 4.3792713 25.10628
## prop 0.4853673 0.3402034 0.1744293 1.00000
plot(obj,ftype="off",type="fan")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

That's it.


Imperceptible update to rerootingMethod for ancestral state reconstruction

I just made an updateto the phytools function rerootingMethodwhich does marginal ancestral state estimation using the 'rerooting method' of Yang (1996). This method is essentially redundant (now) with ace(...,type="discrete") under its default settings, although it does not permit assymetric models of trait evolution (i.e., models in which the backward & forward rates for a particular transition type are allowed to assume different values). The only advantage of this method is that the function does permit polytomies. Prior versions alsopermitted polytomies; however this was by resolving polytomies internally, estimating ancestral states, and then matching the nodes between the fully resolved tree and its original, multifurcating counterpart. Now that rerootingMethod uses my new Mk model fitting function internally (fitMk), this is no longer necessary.

We can see this as follows:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)
## Loading required package: ape
## Loading required package: maps
set.seed(1)
## simulate tree with polytomies & data
tree<-rtree(n=26,tip.label=LETTERS)
tree$edge.length[which(tree$edge[,2]==47)]<-0
tree$edge.length[which(tree$edge[,2]==38)]<-0
tree$edge.length[which(tree$edge[,2]==29)]<-0
tree<-di2multi(tree)
plotTree(tree)
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
x<-sim.history(tree,Q)$states
## Done simulation(s).
x
##   S   Y   J   R   U   E   N   C   V   G   A   F   M   W   O   Q   T   B 
## "a" "c" "a" "b" "c" "c" "b" "b" "c" "a" "b" "b" "b" "b" "a" "a" "a" "a"
## K X H P Z I D L
## "c" "c" "a" "c" "b" "a" "b" "b"
## fit model & estimate ancestral states using rerootingMethod
model<-matrix(c(0,1,0,1,0,1,0,1,0),3,3)
rownames(model)<-colnames(model)<-letters[1:3]
fit1<-rerootingMethod(tree,x,model=model)
plotTree(tree)
nodelabels(pie=fit1$marginal.anc,piecol=setNames(c("blue","red","green"),
c("a","b","c")),cex=0.6)
tiplabels(pie=to.matrix(x[tree$tip.label],c("a","b","c")),
piecol=setNames(c("blue","red","green"),c("a","b","c")),cex=0.3)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

## compare to ace
fit2<-ace(x,tree,type="discrete",model=model) ## doesn't work
## Error in ace(x, tree, type = "discrete", model = model): "phy" is not rooted AND fully dichotomous.
fit3<-ace(x,multi2di(tree),type="discrete",model=model) ## doesn't work
## Error in ace(x, multi2di(tree), type = "discrete", model = model): some branches have length zero or negative
tt<-multi2di(tree)
tt$edge.length[tt$edge.length==0]<-1e-8
fit4<-ace(x,tt,type="discrete",model=model) ## works
## compare to rerootingMethod
M<-matchNodes(tree,tt) ## first match nodes between the trees
plotTree(tt)
nodelabels(pie=fit4$lik.anc,piecol=setNames(c("blue","red","green"),
c("a","b","c")),cex=0.6)
tiplabels(pie=to.matrix(x[tt$tip.label],c("a","b","c")),
piecol=setNames(c("blue","red","green"),c("a","b","c")),cex=0.3)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-3

plot(fit1$marginal.anc,fit4$lik.anc[M[,2]-Ntip(tt),],
xlab="rerootingMethod",ylab="ace")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

That's it.

Additional method for fitPagel to fit Pagel's (1994) model of correlated binary character evolution

I just updated fitPagel, a phytools function that fit the Pagel (1994) method for testing whether theh evolutiokn of one binary character affects a second character (or vice versa).

The way this method works is it simply re-codes the two binary characters into a single four state character - e.g., 0|0, 0|1, etc. Then it fits two different models. In the first model, the 'independent' model, the rate of transition in the first character is constrained to be equal for a given transition type, irregardless of the state for the second trait. So, for instance, the rate of transition 0|0 -> 1|0 is the same as 0|1 -> 1|1 because both transitions involve the same change (0 -> 1) in character 1. The second model, the 'dependent' model, allows each type of transition to have a different rate. In other words 0|0 -> 1|0 and 0|1 -> 1|1 can have different rates (different probabilities of occuring on a given time interval), even though the exact same type of change is occuring for the same character in both instances. Obviously, transitions 0|0 -> 1|1 are constrained to have a rate of zero in both models.

The updatenow permits the phytools function fitMk to be used internally to fit the Mk model. Previously the two methods available were method = "ace" and method = "fitDiscrete", and these are both still available as options.

Let's compare them:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)
## Loading required package: ape
## Loading required package: maps

We can start by simulating data under the 'dependent' model

## simulate tree
tree<-pbtree(n=300,scale=1)
## simulate data for the two characters
Q<-matrix(c(0,0.4,0.4,0,2,0,0,2,2,0,0,2,0,0.4,0.4,0),4,4,byrow=TRUE)
rownames(Q)<-colnames(Q)<-c("aa","ab","ba","bb")
diag(Q)<--rowSums(Q)
tt<-sim.history(tree,Q)
## Note - the rate of substitution from i->j should be given by Q[j,i].
## Detecting that rows, not columns, of Q sum to zero :
## Transposing Q for internal calculations.
## Done simulation(s).
## split the 'aa','ab','bb','ba' data into two characters
t1<-mergeMappedStates(tt,c("aa","ab"),"a")
t1<-mergeMappedStates(t1,c("ba","bb"),"b")
t2<-mergeMappedStates(tt,c("aa","ba"),"a")
t2<-mergeMappedStates(t2,c("ab","bb"),"b")
t1$states<-getStates(t1,"tips")
t2$states<-getStates(t2,"tips")
## visualize the correlated evolution of the two traits
par(mfrow=c(1,2))
plotSimmap(t1,setNames(c("red","blue"),letters[1:2]),lwd=1,ftype="off")
plotSimmap(t2,setNames(c("red","blue"),letters[1:2]),lwd=1,ftype="off",
direction="leftwards")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

## extract the tip data from our simulation
x<-getStates(t1,"tips")
y<-getStates(t2,"tips")

Now we're ready to fit our model using all three methods:

fitPagel(tree,x,y) ## default method is "fitMk"
## 
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.8726028 0.6514011 0.2212017 0.0000000
## a|b 0.3220958 -0.5432975 0.0000000 0.2212017
## b|a 0.6189176 0.0000000 -1.2703187 0.6514011
## b|b 0.0000000 0.6189176 0.3220958 -0.9410134
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.5951317 0.3393103 0.2558215 0.0000000
## a|b 4.6486390 -4.7698708 0.0000000 0.1212318
## b|a 2.2086339 0.0000000 -4.5275279 2.3188940
## b|b 0.0000000 0.4113254 0.3105573 -0.7218827
##
## Model fit:
## log-likelihood
## independent -172.8745
## dependent -150.4001
##
## Hypothesis test result:
## likelihood-ratio: 44.94869
## p-value: 4.074835e-09
##
## Model fitting method used was fitMk
fitPagel(tree,x,y,method="ace")
## 
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.8725883 0.6513844 0.2212039 0.0000000
## a|b 0.3220944 -0.5432984 0.0000000 0.2212039
## b|a 0.6189141 0.0000000 -1.2702985 0.6513844
## b|b 0.0000000 0.6189141 0.3220944 -0.9410086
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.595134 0.3393112 0.2558228 0.0000000
## a|b 4.648654 -4.7698873 0.0000000 0.1212328
## b|a 2.208634 0.0000000 -4.5275253 2.3188912
## b|b 0.000000 0.4113245 0.3105577 -0.7218823
##
## Model fit:
## log-likelihood
## independent -171.4882
## dependent -149.0138
##
## Hypothesis test result:
## likelihood-ratio: 44.94869
## p-value: 4.074835e-09
##
## Model fitting method used was ace
fitPagel(tree,x,y,method="fitDiscrete")
## Loading required package: geiger
## Loading required package: parallel
## Warning in fitDiscrete(tree, xy, model = iQ): Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
## Warning in fitDiscrete(tree, xy, model = dQ): Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
## 
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.8815200 0.6639046 0.2176153 0.0000000
## a|b 0.3054217 -0.5230370 0.0000000 0.2176153
## b|a 0.6199298 0.0000000 -1.2838344 0.6639046
## b|b 0.0000000 0.6199298 0.3054217 -0.9253515
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.5943124 0.3350400 0.2592725 0.00000000
## a|b 4.0225133 -4.0854456 0.0000000 0.06293222
## b|a 2.0872453 0.0000000 -4.5241365 2.43689116
## b|b 0.0000000 0.3640895 0.2807990 -0.64488849
##
## Model fit:
## log-likelihood
## independent -171.5709
## dependent -149.5117
##
## Hypothesis test result:
## likelihood-ratio: 44.11832
## p-value: 6.06277e-09
##
## Model fitting method used was fitDiscrete

That's it.

Update to rateshift method that results in much better optimization for models with multiple shifts

I have been working on the phytools function rateshift, for locating one or multiple temporal shifts in the rate of evolution for a continuous character on the tree.

This function has nice print and plot methods, and would seem to be quite nice - however it suffers from one fairly critical problem which is that it doesn't really work. That is to say, it frequently fails to find the MLE solution.

I have put a little effort into overhauling the guts of this function now; with the main and most critical update being that the function now uses the phytools function brownie.lite to find the MLE rates conditioned on a given set of shift points; rather than simultaneously optimizing rates & shift points. This effectively halves the dimensionality of the optimization, and this seems to have had a very nice effect on performance (although there is still the chance of being stuck on local optima, as we'll see below).

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)
## Loading required package: ape
## Loading required package: maps

Now, we can start by simulating a tree and data in which there are two shift points, and thus three Brownian rates:

set.seed(99)
tree<-pbtree(n=50,scale=100)
x<-sim.rates(tree<-make.era.map(tree,c(0,60,85)),
setNames(c(5,20,1),1:3))
## these are our simulated regimes:
plot(tree,setNames(c("purple","red","blue"),1:3),fsize=0.8,ftype="i",
mar=c(3.1,0.1,0.1,0.1))
axis(1)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

Let's fit one, two, three, and four rate models to these data:

fit1<-rateshift(tree,x,nrates=1,niter=10)
## Optimization progress:
## |..........|
## Done.
fit1
## ML 1-rate model:
## s^2(1) se(1) k logL
## value 7.4779 1.4956 2 -206.1193
##
## This is a one-rate model.
##
## Frequency of best fit: 1
##
## R thinks it has found the ML solution.
fit2<-rateshift(tree,x,nrates=2,niter=10)
## Optimization progress:
## |..........|
## Done.
fit2
## ML 2-rate model:
## s^2(1) se(1) s^2(2) se(2) k logL
## value 13.8711 3.3579 0.5263 0.2124 4 -193.311
##
## Shift point(s) between regimes (height above root):
## 1|2 se(1|2)
## value 88.4271 0.0436
##
## Frequency of best fit: 0.9
##
## R thinks it has found the ML solution.
fit3<-rateshift(tree,x,nrates=3,niter=10)
## Optimization progress:
## |..........|
## Done.
fit3
## ML 3-rate model:
## s^2(1) se(1) s^2(2) se(2) s^2(3) se(3) k logL
## value 7e-04 NaN 16.5352 4.7994 0.5158 0.2057 6 -190.4577
##
## Shift point(s) between regimes (height above root):
## 1|2 se(1|2) 2|3 se(2|3)
## value 48.8951 10.6356 88.4271 0.0387
##
## Frequency of best fit: 0.7
##
## R thinks it has found the ML solution.
fit4<-rateshift(tree,x,nrates=4,niter=10)
## Optimization progress:
## |..........|
## Done.
fit4
## ML 4-rate model:
## s^2(1) se(1) s^2(2) se(2) s^2(3) se(3) s^2(4) se(4) k logL
## value 7e-04 NaN 833.039 NaN 5.0879 2.2729 0.5437 0.2229 8 -188.5406
##
## Shift point(s) between regimes (height above root):
## 1|2 se(1|2) 2|3 se(2|3) 3|4 se(3|4)
## value 68.2645 NaN 69.0033 NaN 90.5243 0.07
##
## Frequency of best fit: 0.2
##
## R thinks it has found the ML solution.
AIC(fit1,fit2,fit3,fit4)
##      df      AIC
## fit1 2 416.2387
## fit2 4 394.6221
## fit3 6 392.9153
## fit4 8 393.0813

We can plot our best model, in this case the generating, three-rate model:

plot(fit3,fsize=0.8)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

Finally, I mentioned that it is possible to get trapped on local optima. This is true, somewhat surprisingly, even for the relatively simple, two-rate model. We can see why by visualizing the likelihood surface across the length of the tree:

shift<-1:99/100*max(nodeHeights(tree))
logL<-sapply(shift,function(s,tree,x) logLik(rateshift(tree,x,fixed.shift=s,
quiet=TRUE)),tree=tree,x=x)
plot(shift,logL,type="l",lwd=2)
plotTree(tree,color=rgb(0,0,1,0.25),ftype="off",mar=c(5.1,4.1,4.1,2.1),
add=TRUE)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-5

We can also do this in two dimensions for the three-rate model:

shift<-seq(2,98,by=2)/100*max(nodeHeights(tree))
logL<-sapply(shift,function(s1,s2,tree,x)
sapply(s2,function(s2,s1,tree,x)
logLik(rateshift(tree,x,fixed.shift=if(s1!=s2) c(s1,s2) else s1,
quiet=TRUE)),s1=s1,tree=tree,x=x),s2=shift,tree=tree,x=x)
image(shift,shift,logL,col=gray.colors(60,0,1),xlab="shift 1 (or 2)",
ylab="shift 2 (or 1)")
points(fit3$shift[1],fit3$shift[2],pch=4) ## MLE
points(fit3$shift[2],fit3$shift[1],pch=4)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-6

Or, alternatively:

library(lattice)
wireframe(logL,row.values=shift,column.values=shift,xlab="shift 1 (or 2)",
ylab="shift 2 (or 1)",colorkey=TRUE,drape=TRUE)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-7

Interesting!

The difference between different methods for fitting the Mk model in R: It's all about the prior

There are multiple functions in R to fit the so-called Mk model for discrete character evolution. For instance, there is ace in the ape package, fitDiscrete in geiger, and now fitMk in my package, phytools - which has a lot of similarity to ace, but some important differences. This does not even consider the functions of phangorn and diversitree, which can also be used to fit this model.

A casual user may discovery, however, that although these methods all purport to be fitting the same model, parameter estimates and likelihoods often differ ever so slightly (but more than one would expect based on the limits of numerical precision) between methods.

What accounts for this, it turns out, is different implicit or explicit assumptions about the prior probability distribution at the global root of the tree. This is difficult to track down in documentation for any of these functions, although in fitMk the assumed prior (pi) is reported explicitly (and can be modified), and (as I show below) it is also possible to use fitDiscrete to fit the Mk model with any prior.

First, load packages

## load packages
library(phytools)
library(geiger)

Next, let's simulate some data under the Mk model:

## simulate some data
tree<-pbtree(n=100,scale=1)
Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
Q
##    a  b  c
## a -2 1 1
## b 1 -2 1
## c 1 1 -2
x<-sim.history(tree,Q,anc="a")$states
## Done simulation(s).
x
##  t53  t54  t89  t90   t9  t10   t6  t11  t24  t25  t12  t44  t62  t67  t74 
## "a" "a" "c" "c" "a" "a" "b" "b" "a" "a" "a" "a" "a" "b" "a"
## t75 t65 t66 t45 t81 t82 t69 t35 t36 t70 t71 t1 t5 t17 t23
## "a" "c" "c" "a" "a" "c" "a" "a" "a" "b" "c" "c" "a" "a" "a"
## t91 t92 t55 t56 t46 t47 t50 t51 t22 t78 t85 t86 t61 t63 t64
## "c" "b" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c" "c"
## t18 t95 t96 t13 t42 t43 t93 t94 t99 t100 t15 t4 t40 t41 t2
## "c" "b" "b" "c" "c" "c" "c" "c" "a" "a" "b" "c" "b" "b" "b"
## t3 t76 t77 t39 t38 t30 t27 t31 t83 t84 t57 t58 t37 t14 t19
## "b" "b" "a" "a" "a" "b" "a" "a" "a" "a" "a" "a" "c" "a" "a"
## t20 t7 t79 t80 t21 t8 t68 t87 t88 t52 t72 t73 t33 t34 t16
## "a" "c" "b" "b" "b" "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"
## t26 t28 t29 t48 t49 t97 t98 t32 t59 t60
## "b" "c" "a" "a" "a" "b" "b" "b" "a" "b"

Now, we can fit each model under the generating model (model="ER"):

## fit models
fit.phytools<-fitMk(tree,x,model="ER",pi="equal") ## flat root
fit.phytools
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b c
## a -1.773072 0.886536 0.886536
## b 0.886536 -1.773072 0.886536
## c 0.886536 0.886536 -1.773072
##
## Fitted (or set) value of pi:
## a b c
## 0.3333333 0.3333333 0.3333333
##
## Log-likelihood: -86.449618
fit.ape<-ace(x,tree,type="discrete",model="ER")
fit.ape
## 
## Ancestral Character Estimation
##
## Call: ace(x = x, phy = tree, type = "discrete", model = "ER")
##
## Log-likelihood: -85.35101
##
## Rate index matrix:
## a b c
## a . 1 1
## b 1 . 1
## c 1 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.8865 0.1524
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## a b c
## 0.4114971 0.3071304 0.2813725
fit.geiger<-fitDiscrete(tree,x,model="ER")
fit.geiger
## GEIGER-fitted comparative model of discrete data
## fitted Q matrix:
## a b c
## a -1.7668479 0.8834239 0.8834239
## b 0.8834239 -1.7668479 0.8834239
## c 0.8834239 0.8834239 -1.7668479
##
## model summary:
## log-likelihood = -86.421424
## AIC = 174.842847
## AICc = 174.883663
## free parameters = 1
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 1.00
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates

We can see that each is slightly different - either in the fitted model, or the likelihood, or both!

To show that this is due to different priors on the root node, we can change the prior using the object created by fitDiscrete. This is not super-straightforward, but kind of neat in the end. (There may be some way to do this within a call of fitDiscrete, but I was not able to do that.)

## the likelihood function
lik<-fit.geiger$lik
## optimize it using a flat/"equal" prior
fit.flat<-optimize(lik,c(1e-8,1000),root=ROOT.FLAT,maximum=TRUE)
fit.flat
## $maximum
## [1] 0.8865379
##
## $objective
## [1] -86.44962

If we compare this rate parameter to fitMk or ace, or the likelihood to fitMk, we should see that it is equal. Cool.

That's all I have to say about this for now.

Modifications to brownieREML & new REML version of rateshift

I just updatedthe phytools function brownieREML, which implements a REML (restricted maximum likelihood) version of brownie.lite which is in turn based on Brian O'Meara et al. (2006; Evolution) method to test for multiple rates of continuous character evolution on the tree, where the branches associated with each rate regime are specified a prioriby the user.

The updates are relatively simple:

(1) Using the function optimHess I estimate the variance on the parameter estimates from the negative inverse of the Hessian matrix.

(2) I change some of the names of the elements of the list returned by brownieREML to match the corresponding elements from the function brownie.lite.

(3) Finally, I added an object class ("brownieREML") and an S3 print method to the object returned by brownieREML.

Here's a very quick demo of these attributes:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)
## Loading required package: ape
## Loading required package: maps
## set seed
set.seed(1)
## simulate multiple temporal regimes with different rates:
tree<-pbtree(n=100,scale=100)
mtree<-make.era.map(tree,c(0,75))
x<-sim.rates(mtree,setNames(c(10,1),1:2))
plot(mtree,ftype="off",colors=setNames(c("red","blue"),1:2))
add.simmap.legend(colors=setNames(c("red","blue"),
c("fast rate","slow rate")),prompt=FALSE,x=0,y=Ntip(mtree))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

## fit using brownie.lite
fit.ml<-brownie.lite(mtree,x)
fit.ml
## ML single-rate model:
## s^2 se a k logL
## value 2.7271 0.3856 8.5444 2 -323.1423
##
## ML multi-rate model:
## s^2(1) se(1) s^2(2) se(2) a k logL
## value 14.2251 4.3892 0.7519 0.1221 8.5015 3 -285.7067
##
## P-value (based on X^2): 0
##
## R thinks it has found the ML solution.
## fit using brownieREML
fit.reml<-brownieREML(mtree,x)
fit.reml
## REML single-rate model:
## s^2 se k logL
## value 2.7547 0.3915 1 -320.173
##
## REML multi-rate model:
## s^2(1) se(1) s^2(2) se(2) k logL
## value 14.9208 4.691 0.7511 0.1221 2 -281.947
##
## R thinks it has found the REML solution.

The main reason for this update is to permit brownieREML to be used internally by rateshift for rateshift(...,method="REML").

Here's more or less how that would look (along with the speed-up that results):

system.time(
fit.ml<-rateshift(tree,x,method="ML",nrates=2)
)
## Optimization progress:
## |..........|
## Done.
##    user  system elapsed 
## 402.82 0.53 413.34
fit.ml
## ML 2-rate model:
## s^2(1) se(1) s^2(2) se(2) k logL
## value 13.7226 4.3593 0.7304 0.1196 4 -285.3171
##
## Shift point(s) between regimes (height above root):
## 1|2 se(1|2)
## value 75.6848 1.2073
##
## Model fit using ML.
##
## Frequency of best fit: 0.4
##
## R thinks it has found the ML solution.
system.time(
fit.reml<-rateshift(tree,x,method="REML",nrates=2)
)
## Optimization progress:
## |..........|
## Done.
##    user  system elapsed 
## 31.63 0.19 32.15
fit.reml
## ML 2-rate model:
## s^2(1) se(1) s^2(2) se(2) k logL
## value 14.4349 4.6601 0.7303 0.1196 4 -281.5724
##
## Shift point(s) between regimes (height above root):
## 1|2 se(1|2)
## value 75.6301 1.1231
##
## Model fit using REML.
##
## Frequency of best fit: 0.3
##
## R thinks it has found the REML solution.

So, it is at least a little faster.

New version of phytools (0.5-00) submitted to CRAN

I have just submitted a new version of phytools (phytools 0.5-00) to CRAN. Hopefully, it survives the CRAN maintainer scrutiny & posts within the next 24 hours; however it can already be installed from GitHub using devtools:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
packageVersion("phytools")
## [1] '0.5.0'

This version of phytools has many changes and new functionality when compared to the previous CRAN version from early July, so it will be hard to be fully comprehensive about the updates; however, among the changes in this CRAN version, users will find:

  1. A new method for co-phylogenetic plotting (e.g., 1, 2, 3, 4).

  2. A bug fix in plot.ltt with a plotted tree overlain that caused the tip labels to be spaced exponentially.

  3. An important bug fix in the phytools function to test for differences in the evolutionary correlation between traits in different parts of the tree, based on Revell & Collar (2009; Evolution).

  4. A very nice update/improvement to the phytols function (cladelabels) for adding clade labels to a plotted tree (here).

  5. A function (markChanges) to demarcated & extract the timing of changes from a stochastic character map style tree (1, 2, 3, 4).

  6. A change in the way object class is checked in phytools that affected the source code of many different phytools function (here).

  7. Some changes to plotSimmap and cladelabels to permit them to be used more effectively with the function split.plotTree(here).

  8. A major update to the object class of "phylo" objects with a mapped discrete character, along with S3 methods for this object class (print, plot, summary, etc.; e.g., 1, 2).

  9. An update to the function treeSlice to slice the tree at a particular point & return all subtrees (that, among other things, allows this to be done interactively).

  10. A fixto the default color scheme used by phenogram with a mapped discrete character.

  11. An updateto the phytools function phylomorphospace that allows it to play with ape functions such as nodelabels (and thus permits us to plot labels at internal or external nodes of a phylogeny projected into 2D morphospace.

  12. A newS3 plotting method for objects of class "rateshift" created by the function to estimate the temporal position of a shift in the evolutionary rate for a continuous character through time.

  13. A small bug fix in the phytools function drop.tip.densityMap.

  14. An update to plotSimmap to allow the painted color to be split along the vertical edge plotted at a node, for when the character changes state exactly at a node (1, 2).

  15. A new S3 print methods for objects created by the phytools ancestral state estimation function fastAnc(here).

  16. An important update to the internal functions used by make.simmap to compute the likelihood, as well as the conditional likelihoods at internal nodes. This primarily affects users who are using a non-default prior at the root node of the tree.

  17. A new function, fitMk, to fit the Mk model. (Functionally similar to fitDiscrete in geiger.)

  18. A more or less imperceptible update to the function rerootingMethod. This basically allows the function to handle polytomies “natively” (i.e., without first having to resolve them, perform calculations using the resolved tree, and then match nodes back to the original tree).

  19. An additional estimation method (and a change in the default estimation method) for the phytools function fitPagel, which fits the Pagel (1994) model for testing whether the evolution of two binary characters is correlated.

  20. An updateto the function rateshift for testing for temporal shifts in the rate of evolution for a continuous trait that permits much more effective (if not faster) optimization. (Prior versions really could not maximize the likelihood effectively for more than - at best - two shifts on reasonably sized trees.)

  21. Finally, some modifications to the phytools function brownieREML, and a REML version of the function rateshift, which runs much faster than the ML version (here).

Feedback is always welcome on any issues created by these changes. (Of course, I'm always happy to take positive feedback too!)

New phytools version on CRAN; bug fix in make.simmap(...,Q="mcmc")

A new version of phytools is now on CRAN, although it will probably take a few days for binaries to be built and then percolate through all the mirror repositories.

Unfortunately, as soon it this version was accepted, I discovered a small bug with the function make.simmap. make.simmap is a popular phytools function that implements the method of stochastic character mapping. The bug is present in only the full hierarchical Bayesian method (Q="mcmc"), and was introduced in the latest version of phytools because I know use the function fitMk to compute the likelihood of the Mk model internally.

Here is how the bug manifests:

library(phytools)
## Loading required package: ape
## Loading required package: maps
packageVersion("phytools")
## [1] '0.5.0'
## load data
data(anoletree)
## pull out the data for tips so we can test make.simmap
x<-getStates(anoletree,"tips")
## default method (empirical Bayesian method)
eb.trees<-make.simmap(anoletree,x,nsim=100,model="ER")
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##             CG          GB          TC          TG          Tr          Tw
## CG -0.13884868 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145
## GB 0.02314145 -0.13884868 0.02314145 0.02314145 0.02314145 0.02314145
## TC 0.02314145 0.02314145 -0.13884868 0.02314145 0.02314145 0.02314145
## TG 0.02314145 0.02314145 0.02314145 -0.13884868 0.02314145 0.02314145
## Tr 0.02314145 0.02314145 0.02314145 0.02314145 -0.13884868 0.02314145
## Tw 0.02314145 0.02314145 0.02314145 0.02314145 0.02314145 -0.13884868
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##        CG        GB        TC        TG        Tr        Tw 
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## Done.
## try full Bayesian method (should fail)
fb.trees<-make.simmap(anoletree,x,nsim=100,model="ER",Q="mcmc")
## Running MCMC burn-in. Please wait....
## Error in if (p.odds >= runif(n = 1)) {: argument is of length zero

I have fixedthis bug and the fixed version can be already be installed from GitHub using devtools. Here I will demo it by loading the function from source:

source("../phytools/R/make.simmap.R")
fb.trees<-make.simmap(anoletree,x,nsim=100,model="ER",Q="mcmc")
## Running MCMC burn-in. Please wait....
## Running 10000 generations of MCMC, sampling every 100 generations.
## Please wait....
##
## make.simmap is simulating with a sample of Q from
## the posterior distribution
##
## Mean Q from the posterior is
## Q =
## CG GB TC TG Tr Tw
## CG -0.11975495 0.02395099 0.02395099 0.02395099 0.02395099 0.02395099
## GB 0.02395099 -0.11975495 0.02395099 0.02395099 0.02395099 0.02395099
## TC 0.02395099 0.02395099 -0.11975495 0.02395099 0.02395099 0.02395099
## TG 0.02395099 0.02395099 0.02395099 -0.11975495 0.02395099 0.02395099
## Tr 0.02395099 0.02395099 0.02395099 0.02395099 -0.11975495 0.02395099
## Tw 0.02395099 0.02395099 0.02395099 0.02395099 0.02395099 -0.11975495
## and (mean) root node prior probabilities
## pi =
## [1] 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
## Done.
obj<-summary(fb.trees)
plot(obj,fsize=0.6,ftype="i",ylim=c(-2,Ntip(anoletree)))
colors<-setNames(palette()[1:length(unique(x))],sort(unique(x)))
add.simmap.legend(colors=colors,x=0,y=-2,prompt=FALSE,vertical=FALSE)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

This fix can be obtained most easily by installing phytools from GitHub using devtools. In a new R session, run:

library(devtools)
install_github("liamrevell/phytools")

That's it.


New, reasonably fast method to compute the patristic distance between a pair of species in a large tree

A R-sig-phylo questionyesterday read as follows (very slightly paraphrased):

“Given a large newick tree e.g. from the new open tree of life (304959 tips) I want to calculate a pairwise distance between a small subset of species. The tricks I've seen so far [Ed. ape::cophenetic, which computes a N× N patristic distance for a tree with N tips] do not work on a tree of that size. I want to determine single species to single species distances without having an algorithm attempt to calculate the distance for all possible species (which is what ape is doing). Is there a package/trick to accomplish this?”

As a matter of fact, this can be done in a reasonably efficient way using the phytools function fastHeight, which computes the height above the root of the MRCA of any pair of species. That is because, for two species i and j, the patristic distane between them is simply the sum of the heights above the root for species i and j minus two times the height above the root of the common ancestor of i& j.

To show that this works fairly well, let's try to write a little custom function, fastDist, to compute this distance using the heights as mentioned:

library(phytools)
fastDist<-function(tree,sp1,sp2){
fastHeight(tree,sp1,sp1)+fastHeight(tree,sp2,sp2)-
2*fastHeight(tree,sp1,sp2)
}

Now, let's test it on a somewhat large tree:

## simulate tree at random
tree<-rtree(n=5000)
sp1<-sample(tree$tip.label)[1]
sp1
## [1] "t941"
sp2<-sample(tree$tip.label)[2]
sp2
## [1] "t2799"
## compute using cophenetic
system.time(d<-cophenetic(tree)[sp1,sp2])
##    user  system elapsed 
## 5.20 0.68 5.96
d
## [1] 11.34199
## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
##    user  system elapsed 
## 0.05 0.00 0.04
d
## [1] 11.34199

Having done that, let's go for broke with a tree of 304,959, just as described:

tree<-rtree(n=304959)
tree
## 
## Phylogenetic tree with 304959 tips and 304958 internal nodes.
##
## Tip labels:
## t51258, t83258, t60186, t263493, t117091, t210153, ...
##
## Rooted; includes branch lengths.
sp1<-sample(tree$tip.label)[1]
sp1
## [1] "t51585"
sp2<-sample(tree$tip.label)[2]
sp2
## [1] "t244730"
## just to check that our MRCA is not the root
fastMRCA(tree,sp1,sp2)
## [1] 427604
## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
##    user  system elapsed 
## 4.80 1.34 6.21
d
## [1] 12.93951

Cool.

Just for the heck of it, I have also added this function to phytools & it is available from GitHub already.

A few minor fixes & updates

I made a few minor updates to phytools in the past couple of days.

First, I fixed a bug in plotSimmap (and, consequently, in any function that uses plotSimmap internally, such as plotTree). The bug seems to result when the edges of the tree are neither in "cladewise" nor in "pruningwise"order, and I have fixed it by substituting the more standard "postorder" edge order (for post-order tree traversal) for ape's related "pruningwise" order. The fix can be seen here.

Second, I updated the phytools function collapse.to.star which collapses a user-specified subtree to a star (unsurprisingly) so that it now accepts trees without edge lengths. Details of this update can be seen here.

Finally, third, and perhaps most amusingly, I corrected the spelling of "neighber" to "neighbor" in a message printed to report the number of nearest-neighbor-interchanges required to find the best tree in the phytools function for unweighted least-squares phylogeny inference from a distance matrix, optim.phylo.ls.

That's it.

Scale bar for cophylo plotting

I just added the option to the phytools function cophylo (for co-phylogenetic plotting) to show separate scale bars for the right & left plotted trees.

Here's a quick demo of how it works:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
## Installing 1 packages: expm
## package 'expm' successfully unpacked and MD5 sums checked
library(phytools)
## Loading required package: ape
## Loading required package: maps
##
## # ATTENTION: maps v3.0 has an updated 'world' map. #
## # Many country borders and names have changed since 1990. #
## # Type '?world' or 'news(package="maps")'. See README_v3. #

First, simulate some trees:

t1<-rtree(n=26,tip.label=LETTERS)
t2<-rtree(n=26,tip.label=LETTERS)

The basic plots, no scale bar, look as follows:

obj<-cophylo(t1,t2)
## Rotating nodes to optimize matching...
## Done.
plot(obj)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-3

Now let's try with the same scale bar, 2.5 units in length, on either side:

plot(obj,scale.bar=rep(2.5,2))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

We can use different scale bars:

plot(obj,scale.bar=c(2.5,3.5))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-5

And, finally, if we set either to zero it turns off:

plot(obj,scale.bar=c(3,0))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-6

Note that the latter plot might be misinterpreted as a single scale bar for both the right & left plotted trees - but this would be wrong (unless the two trees happen to have exactly the same total height) as the trees are always rescaled internally before plotting in this function!

That's it!

Re-write of internally used function phyloDesign

I have totally re-written the internally used function, phyloDesign, which computes a design matrix for least-squares phylogeny estimation in the phytools function optim.phylo.ls, which does unweighted least-squares phylogeny estimation from a distance matrix.

Here's a simple demo using a DNA datasetfrom Jackman et al. (1999; Syst. Biol.).

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
packageVersion("phytools")
## [1] '0.5.3'
library(phytools)
## load data
dna<-read.nexus.data(file="http://www.phytools.org/UMB2015/data/Jackman-etal.nex")
dna<-as.DNAbin(dna)
dna
## 55 DNA sequences in binary format stored in a list.
##
## All sequences of same length: 1456
##
## Labels: Diplolaemus_darwinii Polychrus_acutirostris Anolis_acutus Anolis_cristatellus Anolis_krugi Anolis_stratulus ...
##
## Base composition:
## a c g t
## 0.339 0.261 0.118 0.281
## compute distance matrix
D<-dist.dna(dna)
## estimate tree, using NJ as a starting tree:
ls.fit<-optim.phylo.ls(D,phangorn::NJ(D))
## 1 set(s) of nearest neighbor interchanges. best Q so far = 0.187309346 
## 2 set(s) of nearest neighbor interchanges. best Q so far = 0.185881266
## 3 set(s) of nearest neighbor interchanges. best Q so far = 0.1847435677
## 4 set(s) of nearest neighbor interchanges. best Q so far = 0.1839579959
## 5 set(s) of nearest neighbor interchanges. best Q so far = 0.1835163321
## 6 set(s) of nearest neighbor interchanges. best Q so far = 0.1830769343
## 7 set(s) of nearest neighbor interchanges. best Q so far = 0.1827496093
## 8 set(s) of nearest neighbor interchanges. best Q so far = 0.18261176
## 9 set(s) of nearest neighbor interchanges. best Q so far = 0.1823425143
## 10 set(s) of nearest neighbor interchanges. best Q so far = 0.1821113405
## 11 set(s) of nearest neighbor interchanges. best Q so far = 0.1819861917
## 12 set(s) of nearest neighbor interchanges. best Q so far = 0.1819386781
## 13 set(s) of nearest neighbor interchanges. best Q so far = 0.1819014877
## 14 set(s) of nearest neighbor interchanges. best Q so far = 0.1814029139
## 15 set(s) of nearest neighbor interchanges. best Q so far = 0.1813641191
## 16 set(s) of nearest neighbor interchanges. best Q so far = 0.180996378
## best Q score of 0.180996378 found after 16 nearest neighbor interchange(s).
plotTree(root(ls.fit,outgroup="Diplolaemus_darwinii",resolve.root=TRUE),
fsize=0.8,ftype="i")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

It's hard to summarize where the speed-up in phyloDesign (which is many-fold for reasonable sized trees) comes from. The original version was written with loops & logical statements. The new version with apply family functions and set comparisons. Basically, I'm a bit better at R programming nowadays than I was five years ago when I wrote the earlier version of this function. (But still not all that good….)

That's it!

Small update to the small phytools function to fix malconformed trees

I just made a small update to the phytools function untangle, which basically tries to 'fix' malconformed "phylo" class objects that occasionally result from certain types of tree manipulation.

The updatebasically allows the returned "phylo" object to inherit the attributes of the input tree (except the ones that change in the fix) - which it otherwise may not depending on the method that is used to fix the malconformation.

Here's a quick demo:

library(phytools)
print(anole.trees,details=TRUE)
## 9 phylogenetic trees
## tree 1 : 18 tips
## tree 2 : 21 tips
## tree 3 : 11 tips
## tree 4 : 29 tips
## tree 5 : 26 tips
## tree 6 : 28 tips
## tree 7 : 17 tips
## tree 8 : 26 tips
## tree 9 : 19 tips
## estimate an MRP super tree
super.tree<-mrp.supertree(anole.trees)
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## [1] "Best pscore so far: 174"
## The MRP supertree, optimized via pratchet(),
## has a parsimony score of 174 (minimum 174)
super.tree<-root(super.tree,outgroup="Iguana_iguana",resolve.root=TRUE) ## fails
## Warning in newNb[phy$edge[sndcol, 2]] <- n + 2:phy$Nnode: number of items
## to replace is not a multiple of replacement length
## Error in phy$edge[sndcol, 2] <- newNb[phy$edge[sndcol, 2]] <- n + 2:phy$Nnode: number of items to replace is not a multiple of replacement length
attributes(super.tree)
## $names
## [1] "edge" "Nnode" "tip.label"
##
## $class
## [1] "phylo"
##
## $order
## [1] "postorder"
##
## $pscore
## [1] 174
super.tree<-untangle(super.tree,method="read.tree")
attributes(super.tree)
## $names
## [1] "edge" "tip.label" "Nnode"
##
## $class
## [1] "phylo"
##
## $order
## [1] "cladewise"
##
## $pscore
## [1] 174
super.tree<-root(super.tree,outgroup="Iguana_iguana",resolve.root=TRUE) ## works
plotTree(super.tree,ftype="i",fsize=0.7)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-1

That's it.

Co-phylogenetic plot for trees with no edge lengths

The phytools function plot.cophylo will no plot an object of class "cophylo" containing trees without edge lengths (upon user request).

Here's what I mean:

library(devtools)
install_github("liamrevell/phytools")

Now try it….

library(phytools)
t1<-rtree(n=26,tip.label=LETTERS,br=NULL)
t2<-rtree(n=26,tip.label=LETTERS,br=NULL)
obj<-cophylo(t1,t2)
## Rotating nodes to optimize matching...
## Done.
plot(obj)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

This was super easy. I just plugged the line:

if(is.null(tree$edge.length)) tree<-compute.brlen(tree)

into the internally used plotting function, phylogram, and that was all there was too it.

Node, edge, & tip labels for plotted cophylo objects

A recent blog comment asked if there was some way to 'maintain support values' (i.e., include node, edge, or tip labels) for a plotted "cophylo"object produced by phytools.

This is not quite as straightforward as it might seem because, unlike for a single plotted tree, there are two sets of node, two sets of tips, two sets of edges etc.

My solutionwas as follows:

First, I have the internally used tree plotting function, phylogram, create an environmental variable "last_plot.phylo", as do other tree plotting methods such as plot.phylo and plotSimmap.

Next, I had the S3 plot method for objects of class "cophylo"pull each of these two "last_plot.phylo" objects from their environment (one for each tree plotted) and create a new environmental variable, "last_plot.cophylo", in the same environment, consisting of a simple list of these two objects.

Finally, I created a series of wrappers around nodelabels, tiplabels, and edgelabels, which take as input which tree is to have labels added to it, and then sets the corresponding element of "last_plot.cophylo"to "last_plot.phylo" so that nodelabels etc. can be called internally.

Here's a demo using the following two trees which have matching tree labels. (Remember, we can use non-matching labels too if we are prepared to supply an association matrix.)

library(phytools)
packageVersion("phytools")
## [1] '0.5.4'
par(mfrow=c(1,2))
plotTree(t1)
plotTree(t2)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-1

Now, let's create an object of class "cophylo" and then try each of our new labeling methods:

obj<-cophylo(t1,t2)
## Rotating nodes to optimize matching...
## Done.
plot(obj)
nodelabels.cophylo() ## left side
nodelabels.cophylo(which="right")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

We can also change the values represented, of course. For instance:

library(stringr)
n.left<-vector()
for(i in 1:obj$trees[[1]]$Nnode){
ii<-getDescendants(obj$trees[[1]],i+Ntip(obj$trees[[1]]))
n.left[i]<-str_to_lower(paste(sort(obj$trees[[1]]$tip.label[ii[ii<=Ntip(obj$trees[[1]])]]),
collapse=""))
}
n.right<-vector()
for(i in 1:obj$trees[[2]]$Nnode){
ii<-getDescendants(obj$trees[[2]],i+Ntip(obj$trees[[2]]))
n.right[i]<-str_to_lower(paste(sort(obj$trees[[2]]$tip.label[ii[ii<=Ntip(obj$trees[[2]])]]),
collapse=""))
}
plot(obj)
nodelabels.cophylo(n.left,cex=0.9,bg="white",srt=-90)
nodelabels.cophylo(n.right,which="right",cex=0.9,bg="white",srt=-90)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-3

Functions to label edges & tips also work for this object class, for instance:

plot(obj)
## show rounded edge lengths:
edgelabels.cophylo(round(obj$trees[[1]]$edge.length,2),cex=0.7,
frame="none",adj=c(0.5,1.2))
edgelabels.cophylo(round(obj$trees[[2]]$edge.length,2),cex=0.7,
frame="none",adj=c(0.5,1.2),which="right")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

plot(obj,ftype="off")
## add tip labels
tiplabels.cophylo(pch=21,frame="none",bg="grey",cex=1.5)
tiplabels.cophylo(pch=21,frame="none",bg="grey",which="right",cex=1.5)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-5

We can use any of the options available from nodelabels and its sister functions. So, for instance, if we want to overlay reconstructed ancestral states for a discrete character, we can do so easily:

## tip data
x
##   C   E   H   L   J   B   A   D   K   G   F   I 
## "b" "a" "b" "b" "b" "b" "b" "a" "a" "a" "b" "a"
y
##   H   C   A   L   J   B   F   E   G   D   I   K 
## "a" "a" "b" "b" "a" "b" "a" "a" "a" "b" "b" "b"
fit.x<-ace(x,obj$trees[[1]],type="discrete")
fit.y<-ace(y,obj$trees[[2]],type="discrete")
## plot pies
plot(obj)
nodelabels.cophylo(pie=fit.x$lik.anc,piecol=c("grey","white"),cex=0.7)
nodelabels.cophylo(pie=fit.y$lik.anc,piecol=c("grey","white"),cex=0.7,
which="right")
tiplabels.cophylo(pie=to.matrix(x[obj$trees[[1]]$tip.label],c("a","b")),
piecol=c("grey","white"),cex=0.4)
tiplabels.cophylo(pie=to.matrix(y[obj$trees[[2]]$tip.label],c("a","b")),
piecol=c("grey","white"),cex=0.4,which="right")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-6

Something very important to keep in mind is that the order of the tips, nodes, & edges in the object of class "cophylo" may be different than it was in the input trees, particulary if we set rotate=TRUEwhen we made the object.

That's it!


New features in phytools function plotTree.wBars

On the suggestion of Frederico Faleiro I just added two new features to the phytools plotting function plotTree.wBars.

First, the user can now set the bar colors. This can be a single color, or different colors for different bars. Second, the user can now set or turn off bar borders.

Here's a quick demo:

library(phytools)
tree<-pbtree(n=60,scale=1)
x<-fastBM(tree)
## basic method:
plotTree.wBars(tree,x,scale=0.5)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-1

Now, greyscale by value:

col<-grey((x-min(x))/(max(x)-min(x)))
col<-setNames(col,names(x))
col[1:10] ## e.g.
##       t44       t45       t31       t32       t23       t24       t42 
## "#3B3B3B" "#626262" "#252525" "#212121" "#767676" "#8F8F8F" "#060606"
## t43 t34 t35
## "#1D1D1D" "#606060" "#626262"
plotTree.wBars(tree,x,col=col,scale=0.5,border="grey")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-2

Rainbow by value (a little trickier):

## just one way to do this:
obj<-contMap(tree,x,plot=FALSE)
foo<-function(i,obj){
ind<-which(obj$tree$edge[,2]==i)
n<-names(obj$tree$maps[[ind]])[length(obj$tree$maps[[ind]])]
obj$cols[n]
}
col<-setNames(sapply(1:Ntip(tree),foo,obj=obj),obj$tree$tip.label)
plotTree.wBars(tree,x,col=col,scale=0.5,border="transparent",width=0.8,
ylim=c(-10,Ntip(tree)))
add.color.bar(1,cols=obj$cols,title="trait value",lims=obj$lims,
prompt=FALSE,x=mean(par()$usr[1:2])-0.5,y=-7)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-3

Something like that….

Simulating a species tree from a genus tree in which the genera arise (stochastically) under a Yule process

Today, Karla Shikev askedthe following question:

I need to add species go a genus-level phylogeny, such that the generated nodes are consistent with a Yule process. I tried phytools' add.species.to.genus function, but the generated branch lengths tend to be too recent.

add.species.to.genus adds tips randomly - not under some model. However, if we have a purely genus-level tree (that is, a tree in which only genera are represented) and we want to produce random species trees, in which each subtree arose by a Yule process - then this is not too hard either.

Here is a quick demo:

library(phytools)
## first let's simulate our genus tree:
genus.tree<-pbtree(n=7,tip.label=c("Abc","Def","Ghi","Jkl","Mno","Qrs","Tuv"))
## here it is:
plotTree(genus.tree,ftype="i")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-1

Next, let's simulate some random species within each genus to attach to our tree. Here I simulate 1-5 species per genus; however, note that if there is only one species in a genus, then the tip of our genus tree will simply be renamed.

tips<-c()
for(i in 1:Ntip(genus.tree)){
n.genus<-sample(1:5,1)
for(j in 1:n.genus) tips<-c(tips,paste(genus.tree$tip.label[i],paste(sample(letters,6),collapse=""),sep="_"))
}
## these are the tips we want for our species tree:
tips
##  [1] "Abc_svhmwl" "Abc_xhjdvy" "Def_ykhsrx" "Ghi_dxuhtl" "Ghi_avzyks"
## [6] "Ghi_buyawg" "Jkl_nbgmsk" "Jkl_ryialh" "Jkl_nwuqhe" "Jkl_ufmixj"
## [11] "Jkl_bclftw" "Mno_sonmkr" "Mno_kpydtl" "Mno_ujalsy" "Mno_cyeojk"
## [16] "Mno_sohent" "Qrs_xlbphe" "Qrs_nvsziu" "Qrs_iuwkjm" "Qrs_tscqbr"
## [21] "Qrs_azcfnl" "Tuv_tidfwy"

OK, now let's imagine how we want to create our species tree in which each genus arose via a Yule process. We can simply take the genus tree and glue on a Yule tree somewhere along the terminal edge leading to each tip in the genus tree! Here is an example function in which this is done at a random position along that edge - but this needn't be the case, and could be modified easily:

genus.to.pbSpeciesTree<-function(tree,tips){
N<-Ntip(tree)
genera<-tree$tip.label
for(i in 1:N){
jj<-grep(paste(genera[i],"_",sep=""),tips)
nn<-which(tree$tip.label==genera[i])
if(length(jj)>1){
h<-runif(n=1)*tree$edge.length[which(tree$edge[,2]==nn)]
tree$edge.length[which(tree$edge[,2]==nn)]<-
tree$edge.length[which(tree$edge[,2]==nn)]-h
sub.tree<-pbtree(n=length(jj),scale=h,tip.label=tips[jj])
tree<-bind.tree(tree,sub.tree,where=nn)
} else tree$tip.label[nn]<-tips[jj]
}
tree
}
species.tree<-genus.to.pbSpeciesTree(genus.tree,tips)
plotTree(species.tree,ftype="i")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-3

As an alternative to rescaling, we could have simulated conditioned on time & a birth rate; however I'm not sure it would make much difference, and it would require that some birthrate be supplied.

That's it!

Interactive mode for bind.tip & reroot

I recently added interactive modes to both the phytools functions bind.tip (for attaching a single tip to a tree) and reroot (for re-rooting a phylogeny on a node or edge).

Key for both of these updates is the helper function get.treepos. This function uses a plotted tree and the R base function locatorto find allow the user to supply a 'location' on a plotted tree. For now, the function only works for "rightward" plotted "phylogram"s (technically, square phylograms or cladograms - but plot.phylo calls these phylograms, and slanted phylograms or cladograms "cladogram"s.

Here is the code for get.treepos, for those people interested in such things:

## get a position in the tree interactively
## written by Liam J. Revell 2015
get.treepos<-function(message=TRUE){
obj<-get("last_plot.phylo",envir=.PlotPhyloEnv)
if(obj$type=="phylogram"&&obj$direction=="rightwards"){
if(message){
cat("Click on the tree position you want to capture...\n")
flush.console()
}
x<-unlist(locator(1))
y<-x[2]
x<-x[1]
d<-pos<-c()
for(i in 1:nrow(obj$edge)){
x0<-obj$xx[obj$edge[i,]]
y0<-obj$yy[obj$edge[i,2]]
if(x<x0[1]||x>x0[2]){
d[i]<-min(dist(rbind(c(x,y),c(x0[1],y0))),
dist(rbind(c(x,y),c(x0[2],y0))))
pos[i]<-if(x>x0[2]) 0 else diff(obj$xx[obj$edge[i,]])
} else {
d[i]<-abs(y0-y)
pos[i]<-obj$xx[obj$edge[i,2]]-x
}
}
ii<-which(d==min(d))
list(where=obj$edge[ii,2],pos=pos[ii])
} else stop("Does not work for the plotted tree type.")
}

Obviously, this is somewhat hard to demonstrate using knitted R markdown, so instead I shot a screen video:

That's it!

Stochastic mapping with an ordered character

Yesterday I received the following question:

I have three character states (1,2,3) and I’d like to create stochastic maps for a situation where character change is constrained to go through state 2, so that the 1->3 and 3->1 transitions are not allowed. In my analyses so far I’ve just been using and ARD model, but I wondered if there is a way to implement these constrained pathways so that the Q matrix only allows certain specified transitions? I was also wondering if there is an easy way to constrain the root state for the analysis?

Both of these things are pretty easy to do. Here, I'll demonstrate with simulated data & tree:

library(phytools)
tree
## 
## Phylogenetic tree with 100 tips and 99 internal nodes.
##
## Tip labels:
## t14, t15, t51, t67, t68, t55, ...
##
## Rooted; includes branch lengths.
x
##  t14  t15  t51  t67  t68  t55  t46  t26  t11   t1   t2   t3   t4  t97  t98 
## "1" "1" "1" "1" "1" "1" "1" "1" "2" "1" "1" "1" "1" "2" "2"
## t78 t93 t94 t41 t79 t80 t47 t48 t24 t25 t29 t30 t31 t58 t59
## "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "3" "3"
## t6 t53 t54 t87 t88 t32 t81 t82 t95 t96 t83 t84 t21 t22 t18
## "3" "2" "2" "3" "3" "3" "2" "2" "3" "3" "2" "3" "2" "2" "2"
## t19 t99 t100 t44 t45 t42 t43 t64 t69 t70 t7 t73 t74 t75 t17
## "2" "2" "2" "2" "2" "2" "3" "2" "2" "2" "2" "2" "2" "2" "2"
## t85 t86 t12 t10 t5 t39 t91 t92 t20 t40 t62 t63 t89 t90 t38
## "2" "2" "2" "3" "2" "2" "2" "2" "1" "2" "3" "2" "3" "3" "3"
## t71 t72 t8 t9 t36 t37 t33 t34 t35 t60 t61 t65 t66 t52 t13
## "3" "3" "2" "2" "3" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2"
## t49 t50 t16 t23 t76 t77 t27 t28 t56 t57
## "2" "2" "2" "3" "3" "3" "2" "2" "2" "2"

First, let's make the model we want to fit. Here, I assume that there is two rates - one of forward transition, and a second of backward transition- but I could have also fit a model with a single rate, or one in which each type of permitted change had a different rate. (This just depends on my preference.)

model<-matrix(c(0,1,0,2,0,1,0,2,0),3,3)
rownames(model)<-colnames(model)<-1:3
model
##   1 2 3
## 1 0 2 0
## 2 1 0 2
## 3 0 1 0
prior<-setNames(c(1,0,0),1:3)
prior
## 1 2 3 
## 1 0 0

OK, now we are ready to do our stochastic mapping:

trees<-make.simmap(tree,x,model=model,pi=prior,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## 1 2 3
## 1 -1.0920856 1.0920856 0.0000000
## 2 0.1074346 -1.1995202 1.0920856
## 3 0.0000000 0.1074346 -0.1074346
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## 1 2 3
## 1 0 0
## Done.

First, here is our result:

par(mfrow=c(10,10))
colors<-setNames(c("blue","purple","red"),1:3)
plot(trees,lwd=1,ftype="off",colors=colors)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-4

We can also do any specific tree:

plot(trees[[1]],colors=colors,fsize=0.5)
markChanges(trees[[1]],colors=colors)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-5

Or summarize our results across trees:

obj<-summary(trees)
obj
## 100 trees with a mapped discrete character with states:
## 1, 2, 3
##
## trees have 18.27 changes between states on average
##
## changes are of the following types:
## 1,2 1,3 2,1 2,3 3,1 3,2
## x->y 5.63 0 1.19 11.25 0 0.2
##
## mean total time spent in each state is:
## 1 2 3 total
## raw 4.3721937 11.2234788 1.7854335 17.38111
## prop 0.2515486 0.6457287 0.1027227 1.00000
plot(obj,colors=colors,ftype="off")
add.simmap.legend(colors=colors,prompt=FALSE,x=0,y=Ntip(tree))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-6

Finally, it is also possible to use Bayesian MCMC to sample the parameters of the transition process from their posterior distribution rather than setting them to their ML values (as is the default). This otherwise is done in a very similar way to the above. For instance:

trees.mcmc<-make.simmap(tree,x,model=model,pi=prior,Q="mcmc",nsim=100)
## Running MCMC burn-in. Please wait....
## Running 10000 generations of MCMC, sampling every 100 generations.
## Please wait....
##
## make.simmap is simulating with a sample of Q from
## the posterior distribution
##
## Mean Q from the posterior is
## Q =
## 1 2 3
## 1 -1.0707285 1.0707285 0.0000000
## 2 0.2334955 -1.3042240 1.0707285
## 3 0.0000000 0.2334955 -0.2334955
## and (mean) root node prior probabilities
## pi =
## 1 2 3
## 1 0 0
## Done.
obj<-summary(trees.mcmc)
plot(obj,colors=colors,ftype="off")
add.simmap.legend(colors=colors,prompt=FALSE,x=0,y=Ntip(tree))

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-7

(Note that for this to work is written you will need a bug fix for make.simmap that can be obtained by installing from GitHub.)

BTW, these were done on simulated data. Simulation code was as follows:

library(phytools)
Q<-matrix(c(-1,1,0,0.2,-1.2,1,0,0.2,-0.2),3,3)
rownames(Q)<-colnames(Q)<-1:3
tree<-pbtree(n=100,scale=1)
x<-sim.history(tree,t(Q),anc="1")$states

That's all for now.

phylo.to.map with right-facing trees; preliminary t-shirt design for Valdivia workshop

I will be teaching a workshop at Universidad Austral de Chile in Valdivia in December. Right now I'm working on the design for the t-shirt that we have made a tradition in prior similar workshops (e.g., 1, 2).

I decided it might be cool to project a phylogeny onto the geographic outline of Chile. The only problem is that Chile is a very long country, and phylo.to.map previously only permits a projection of the tree from above facing down.

Well - that has now been addressed. I just pusheda phytools update that allows a phylogeny to be plotted to the left of a geographic map facing rightwards. Here is a demo.

First, we need to install from GitHub:

library(devtools)
install_github("liamrevell/phytools")

Load packages:

library(phytools)
library(phangorn) ## we will use phangorn too

For fun, let's use the lat. & long. coordinates of some major Chilean cities:

X<-read.csv("Chile-cities.csv",row.names=1)
X<-as.matrix(X)
X
##                 Lat   Long
## Santiago -33.45 -70.67
## Valparaiso -33.05 -71.62
## Antofagasta -23.65 -70.40
## Temuco -38.75 -72.67
## Concepcion -36.82 -73.05
## Rancagua -34.17 -70.75
## Talca -35.43 -71.67
## Arica -18.48 -70.33
## Iquique -20.22 -70.15
## Puerto_Montt -41.47 -72.93
## La_Serena -29.90 -71.25
## Chillan -36.60 -72.12
## Osorno -40.57 -73.14
## Valdivia -39.75 -72.50
## Calama -22.47 -68.93
## Copiapo -27.36 -70.33
## Los_Angeles -37.47 -72.35
## Punta_Arenas -53.17 -70.93

(Hopefully I got most of these right!)

Rather than use a random “tree” - I figured why not compute a tree from the latitudinal distances between cities:

D<-dist(X[,1])
D
##              Santiago Valparaiso Antofagasta Temuco Concepcion Rancagua
## Valparaiso 0.40
## Antofagasta 9.80 9.40
## Temuco 5.30 5.70 15.10
## Concepcion 3.37 3.77 13.17 1.93
## Rancagua 0.72 1.12 10.52 4.58 2.65
## Talca 1.98 2.38 11.78 3.32 1.39 1.26
## Arica 14.97 14.57 5.17 20.27 18.34 15.69
## Iquique 13.23 12.83 3.43 18.53 16.60 13.95
## Puerto_Montt 8.02 8.42 17.82 2.72 4.65 7.30
## La_Serena 3.55 3.15 6.25 8.85 6.92 4.27
## Chillan 3.15 3.55 12.95 2.15 0.22 2.43
## Osorno 7.12 7.52 16.92 1.82 3.75 6.40
## Valdivia 6.30 6.70 16.10 1.00 2.93 5.58
## Calama 10.98 10.58 1.18 16.28 14.35 11.70
## Copiapo 6.09 5.69 3.71 11.39 9.46 6.81
## Los_Angeles 4.02 4.42 13.82 1.28 0.65 3.30
## Punta_Arenas 19.72 20.12 29.52 14.42 16.35 19.00
## Talca Arica Iquique Puerto_Montt La_Serena Chillan Osorno
## Valparaiso
## Antofagasta
## Temuco
## Concepcion
## Rancagua
## Talca
## Arica 16.95
## Iquique 15.21 1.74
## Puerto_Montt 6.04 22.99 21.25
## La_Serena 5.53 11.42 9.68 11.57
## Chillan 1.17 18.12 16.38 4.87 6.70
## Osorno 5.14 22.09 20.35 0.90 10.67 3.97
## Valdivia 4.32 21.27 19.53 1.72 9.85 3.15 0.82
## Calama 12.96 3.99 2.25 19.00 7.43 14.13 18.10
## Copiapo 8.07 8.88 7.14 14.11 2.54 9.24 13.21
## Los_Angeles 2.04 18.99 17.25 4.00 7.57 0.87 3.10
## Punta_Arenas 17.74 34.69 32.95 11.70 23.27 16.57 12.60
## Valdivia Calama Copiapo Los_Angeles
## Valparaiso
## Antofagasta
## Temuco
## Concepcion
## Rancagua
## Talca
## Arica
## Iquique
## Puerto_Montt
## La_Serena
## Chillan
## Osorno
## Valdivia
## Calama 17.28
## Copiapo 12.39 4.89
## Los_Angeles 2.28 15.00 10.11
## Punta_Arenas 13.42 30.70 25.81 15.70
tree<-untangle(upgma(D),"read.tree") ## untangle is hack

(Use of untangle is a hack to ensure that the tree is in good conformation for phylo.to.map.)

Now let's proceed!

obj<-phylo.to.map(tree,X,regions="Chile",direction="rightwards",
plot=FALSE)
## objective: 32
## objective: 32
## objective: 32
## objective: 26
## objective: 26
## objective: 26
## objective: 8
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 2
## objective: 0
obj
## Object of class "phylo.to.map" containing:
##
## (1) A phylogenetic tree with 18 tips and 17 internal nodes.
##
## (2) A geographic map with range:
## -55.89N, -17.51N
## -109.43W, -66.44W.
##
## (3) A table containing 18 geographic coordinates.
plot(obj,direction="rightwards",ftype="i")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-5

One issue here is the spacing - which actually turns out to be the result of the inclusion in “Chile” of Chile's pacific island territories, such as Easter Island. Let's unceremoniously chop off those islands!

ii<-which(obj$map$x<(-77))
obj$map$x<-obj$map$x[-ii]
obj$map$y<-obj$map$y[-ii]
obj$map$range[1]<--81
plot(obj,direction="rightwards",split=c(0.6,0.4),ftype="i")

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-6

Finally, let's plot it with the tip labels removed & a special star at the location of our upcoming workshop!

plot(obj,direction="rightwards",split=c(0.3,0.7),ftype="off",
colors=c(grey(0.3),"black"))
points(X["Valdivia",2],X["Valdivia",1],pch=8,cex=1.2)

Image may be NSFW.
Clik here to view.
plot of chunk unnamed-chunk-7

Not bad! Of course, as always, we can export a higher quality version using pdf:

pdf(file="phylo.to.map-Chile.pdf",width=5,height=8)
plot(obj,direction="rightwards",split=c(0.3,0.7),ftype="off",
colors=c(grey(0.3),"black"))
points(X["Valdivia",2],X["Valdivia",1],pch=8,cex=1.2)
dev.off()
## windows 
## 2

Here it is!

That's all for now.

Viewing all 802 articles
Browse latest View live