Quantcast
Channel: Phylogenetic Tools for Comparative Biology
Viewing all 801 articles
Browse latest View live

Significant update to phyl.pairedttest for phylogenetic paired t-test in R

$
0
0

A good long time ago now some co-authors & I (Lindenfors et al. 2010) published a 'phylogenetic paired t-test.' This is a really simple method in which we ask if a set of paired traits (say, male and female RBC counts; or summer & winter body mass) differ significantly one from the other across species, while taking into account non-independence due to the phylogeny. The method can also account for (known) uncertainty in either or both trait vectors, and can relax the assumption of pure Brownian motion for the correlation structure of the errors (using Pagel's λ model).

The method has been implemented in the function phyl.pairedttest; however a phytools user recently pointed out some issues with optimizing the model (which uses likelihood). This turned out to be due to trait scale. For instance, for (numerically) large mean differences between the two trait vectors, the corresponding element of the Hessian matrix from the likelihood surface was sometimes numerically indistinguishable from zero.

I fixed this by using an internal rescaling, which also then required that I back-transform the estimated parameter by (some function of) the scaling factor. I was also able to make other improvements and I added an S3 print method for the object class. All of my updates can be seen here.

The following is a simple demo using simulated data:

library(phytools)
packageVersion("phytools")
## [1] '0.6.46'
phylo.heatmap(tree,cbind(x1,x2,x3),fsize=c(0.5,0.9,0.9),
standardize=TRUE)

plot of chunk unnamed-chunk-1

phyl.pairedttest(tree,x1,x2)
## 
## Phylogenetic paired t-test:
##
## t = 0.52022, df = 57, p-value = 0.604925
##
## alternative hypothesis:
## true difference in means is not equal to 0
##
## 95 percent confidence interval on the phylogenetic
## difference in mean:
## [-0.999187, 1.72124]
##
## estimates:
## phylogenetic mean difference = 0.361026
## sig^2 of BM model = 0.879318
## lambda (fixed or estimated) = 1
##
## log-likelihood:
## -78.3204
phyl.pairedttest(tree,x1,x3)
## 
## Phylogenetic paired t-test:
##
## t = -2.44073, df = 57, p-value = 0.0177879
##
## alternative hypothesis:
## true difference in means is not equal to 0
##
## 95 percent confidence interval on the phylogenetic
## difference in mean:
## [-2.95937, -0.323278]
##
## estimates:
## phylogenetic mean difference = -1.64132
## sig^2 of BM model = 0.826258
## lambda (fixed or estimated) = 0.998969
##
## log-likelihood:
## -77.2733

Note that the method should yield (nearly approximately) the same values as a standard paired t-test, if we fix λ to zero. The only difference is that we compute the standard error of the mean difference from the Hessian matrix instead of analytically:

phyl.pairedttest(tree,x1,x2,lambda=0,fixed=TRUE)
## 
## Phylogenetic paired t-test:
##
## t = 1.33211, df = 58, p-value = 0.188035
##
## alternative hypothesis:
## true difference in means is not equal to 0
##
## 95 percent confidence interval on the phylogenetic
## difference in mean:
## [-0.129139, 0.677096]
##
## estimates:
## phylogenetic mean difference = 0.273978
## sig^2 of BM model = 0.720126
## lambda (fixed or estimated) = 0
##
## log-likelihood:
## -113.078
t.test(x1,x2,paired=TRUE,var.equal=TRUE)
## 
## Paired t-test
##
## data: x1 and x2
## t = 1.321, df = 59, p-value = 0.1916
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1410439 0.6890009
## sample estimates:
## mean of the differences
## 0.2739785

In this example, the data were simulated with a genuine difference between x1 and x3, but not between x1 and x2 as follows:

tree<-pbtree(n=60)
x1<-fastBM(tree)
x2<-x1+fastBM(tree)
x3<-x1+fastBM(tree,a=2)

That's it.


Running make.simmap in parallel

$
0
0

Today a phytools user askedon my GitHub page if there was any way to permit make.simmap run in parallel.

Note that the issue motivating this report likely stems from a replacement of the method to do matrix exponentiation in the latest CRAN version of phytools. This has already been fixed - although, of course, if we are using make.simmap with large trees, complex models, or generating many samples, it will still be slow!

make.simmap does not have a parallelization option; however, for the basic method we are just estimating the transition matrix, Q, via Maximum Likelihood, and then sampling histories for the discrete character conditioned on this fitted model. Consequently, we ought to be able to first compute Q once and then run our sampler on multiple cores. In theory, this can be done using parallel::mclapply.

Unfortunately, parallelization is not supported in R for Windows. Nonetheless, it is possible to run mclapply which pseudo-parallelizes our analysis. This is so that functions & packages using parallelization don't break on Windows - but it will not run any faster!

Here is what that might look like. I have assumed we have 4 cores to play with.

library(phytools)
## our data:
cols<-setNames(colorRampPalette(c("blue","red"))(3),
c("a","b","c"))
dotTree(tree,x,fsize=0.7,ftype="i",colors=cols)

plot of chunk unnamed-chunk-1

## first fit our model
fit<-fitMk(tree,x,model="ARD")
fit
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b c
## a -0.667821 0.667821 0.000000
## b 0.337475 -0.715334 0.377859
## c 0.989418 0.070697 -1.060115
##
## Fitted (or set) value of pi:
## a b c
## 0.3333333 0.3333333 0.3333333
##
## Log-likelihood: -51.296883
##
## Optimization method used was "nlminb"
## extracted the fitted transition matrix:
fittedQ<-matrix(NA,length(fit$states),length(fit$states))
fittedQ[]<-c(0,fit$rates)[fit$index.matrix+1]
diag(fittedQ)<-0
diag(fittedQ)<--rowSums(fittedQ)
colnames(fittedQ)<-rownames(fittedQ)<-fit$states
## ready to run our analysis
library(parallel)
trees<-mclapply(1:4,function(n,tree,x,fixedQ) make.simmap(tree,x,Q=fixedQ,nsim=100),
tree=tree,x=x,fixedQ=fittedQ,mc.cores=if(.Platform$OS.type=="windows") 1L else 4L)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.6678208 0.66782077 0.0000000
## b 0.3374750 -0.71533391 0.3778589
## c 0.9894179 0.07069738 -1.0601152
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.6678208 0.66782077 0.0000000
## b 0.3374750 -0.71533391 0.3778589
## c 0.9894179 0.07069738 -1.0601152
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.6678208 0.66782077 0.0000000
## b 0.3374750 -0.71533391 0.3778589
## c 0.9894179 0.07069738 -1.0601152
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.6678208 0.66782077 0.0000000
## b 0.3374750 -0.71533391 0.3778589
## c 0.9894179 0.07069738 -1.0601152
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
trees
## [[1]]
## 100 phylogenetic trees with mapped discrete characters
##
## [[2]]
## 100 phylogenetic trees with mapped discrete characters
##
## [[3]]
## 100 phylogenetic trees with mapped discrete characters
##
## [[4]]
## 100 phylogenetic trees with mapped discrete characters

At the end, though, instead of one "multiSimmap" object, we have a list of 4! Fortunately, we can combine these in a fairly easy way as follows:

trees<-do.call(c,trees)
if(!("multiSimmap"%in%class(trees))) class(trees)<-c("multiSimmap",class(trees))
trees
## 400 phylogenetic trees with mapped discrete characters

Neat.

Now, let's compute a summary & visualize the results:

obj<-summary(trees)
obj
## 400 trees with a mapped discrete character with states:
## a, b, c
##
## trees have 43.7325 changes between states on average
##
## changes are of the following types:
## a,b a,c b,a b,c c,a c,b
## x->y 18.29 0 7.945 8.745 8.1425 0.61
##
## mean total time spent in each state is:
## a b c total
## raw 27.2549076 23.1011251 8.2695555 58.62559
## prop 0.4648978 0.3940451 0.1410571 1.00000
plot(obj,colors=cols,ftype="i",fsize=0.7)
add.simmap.legend(colors=cols,prompt=FALSE,vertical=FALSE,
x=0,y=2)

plot of chunk unnamed-chunk-3

The tree & data for this example were simulated as follows:

tree<-pbtree(n=60)
Q<-matrix(c(-1,1,0,
1,-2,1,
0,1,-1),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
x<-as.factor(sim.history(tree,Q)$states)

I would love to hear a report from someone working on a Unix-alike system (e.g., Mac OS) who can report if this indeed works on their machine!

Running make.simmap in parallel in R for Windows using snow

$
0
0

Shortly after I postedearlier today about parallelizing make.simmap using mcapply, which does not work in R for Windows, a colleaguecommentedthat this could indeed be done in Windows, just with slightly greater difficulty.

In fact, this turns out to be precisely true using the CRAN package snow.

Below I demonstrate that this works & that it results (on my machine, which has two cores but 4 hyper-threaded logical processors) in about a 50% speed-up relative to running the same process on a single core.

First, here's our data again:

library(phytools)
cols<-setNames(colorRampPalette(c("blue","red"))(3),
c("a","b","c"))
dotTree(tree,x,fsize=0.7,ftype="i",colors=cols)

plot of chunk unnamed-chunk-1

Next we'll fit our model so we don't have to recompute the transition matrix Q for each process.

fit<-fitMk(tree,x,model="ARD")
fittedQ<-matrix(NA,length(fit$states),length(fit$states))
fittedQ[]<-c(0,fit$rates)[fit$index.matrix+1]
diag(fittedQ)<-0
diag(fittedQ)<--rowSums(fittedQ)
colnames(fittedQ)<-rownames(fittedQ)<-fit$states

Now let's run our analysis, first on a single core:

system.time(trees.sc<-make.simmap(tree=tree,x=x,Q=fittedQ,nsim=1000))
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.6678208 0.66782077 0.0000000
## b 0.3374750 -0.71533391 0.3778589
## c 0.9894179 0.07069738 -1.0601152
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
##    user  system elapsed 
## 47.60 0.06 47.90
trees.sc
## 1000 phylogenetic trees with mapped discrete characters

now on multiple cores using snow:

library(snow)
cl<-makeSOCKcluster(rep("localhost",4))
system.time(trees.mc<-clusterApply(cl,x=replicate(4,x,simplify=FALSE),
fun=make.simmap,tree=tree,Q=fittedQ,nsim=250))
##    user  system elapsed 
## 0.11 0.07 24.94
trees.mc<-do.call(c,trees.mc)
if(!("multiSimmap"%in%class(trees.mc)))
class(trees.mc)<-c("multiSimmap",class(trees.mc))
stopCluster(cl)
trees.mc
## 1000 phylogenetic trees with mapped discrete characters

I did one peculiar thing which is sending a list of length equivalent to the number of cores consisting of replicates of our character vector x. This is because both clusterApply and make.simmap take the argument x. I could have instead written a simple wrapper function with different argument names if I wanted.

It's easy to show that our results are (more or less - remember this is stochastic mapping) identical in the two cases:

plot(summary(trees.sc)$ace,summary(trees.mc)$ace,cex=1.5,pch=21,
bg="grey",xlab="posterior probabilities from single core",
ylab="posterior probabilities from multi-core")

plot of chunk unnamed-chunk-5

That's pretty neat right?

Thanks to Will Gearty for the tip.

More user control in co-phylogenetic plotting with phytools

$
0
0

I just addeda little more user control of the S3 plot method for the object class "cophylo"for co-phylogenetic plotting.

Firstly, I allow the user to access the internally-used argument part which can be adjusted to permit more space between the two plotted trees. For instance:

library(phytools)
t1
## 
## Phylogenetic tree with 26 tips and 25 internal nodes.
##
## Tip labels:
## R, Q, A, P, F, B, ...
##
## Rooted; includes branch lengths.
t2
## 
## Phylogenetic tree with 20 tips and 19 internal nodes.
##
## Tip labels:
## t4, t6, t13, t16, t15, t14, ...
##
## Rooted; includes branch lengths.
assoc
##    tree 1 tree 2
## 1 S t14
## 2 V t18
## 3 L t19
## 4 X t12
## 5 D t20
## 6 E t16
## 7 I t15
## 8 Z t11
## 9 N t7
## 10 R t2
## 11 F t1
## 12 G t4
## 13 B t3
## 14 Y t13
## 15 U t8
## 16 Q t9
## 17 F t5
## 18 J t10
## 19 N t6
## 20 C t17
obj<-cophylo(t1,t2,assoc=assoc)
## Rotating nodes to optimize matching...
## Done.
plot(obj) ## the default

plot of chunk unnamed-chunk-1

plot(obj,part=0.3)

plot of chunk unnamed-chunk-1

Secondly, I added control over the lines linking the tips of the tree to the tip labels. This is because sometimes when rendered as a PDF this lines can seem just too fine! So, for instance:

plot(obj,tip.lwd=2,link.type="curved",link.lwd=3,
link.col=make.transparent("grey",0.4),
part=0.35,lwd=2)

plot of chunk unnamed-chunk-2

## or
plot(obj,tip.lty="solid",link.lwd=3,link.lty="solid",
link.col=make.transparent("blue",0.2),
part=0.3,lwd=3,pts=F)

plot of chunk unnamed-chunk-2

Kind of neat. Of course, in this case the two trees have no genuine association - but nonetheless.

These updates can be obtained by installing phytools directly from GitHub using the package devtools.

Object class & S3 methods for multiple Mantel test in phytools

$
0
0

The phytools package contains a small function, multi.mantel, for multiple matrix regression. The idea of this function is merely to fit a model in which distance or correlation matrix Y ~ X1 + X2 and so on, but then obtain our p-values for both the full model & the model coefficients via (Mantel) permutations of the rows & columns together in Y.

A colleague today contacted me with some questions about the function & so I decided it needed a small re-boot. None of the internal operation of the function has been changed; however, I have now updatedit with a new object class, as well as print, residuals, and fitted S3 methods.

For fun, this is what the print method looks like:

print.multi.mantel<-function(x,...){
if(hasArg(digits)) digits<-list(...)$digits
else digits<-6
star<-function(p){
obj<-if(p>0.1) "" else if(p<=0.1&&p>0.05) "." else
if(p<=0.05&&p>0.01) "*" else if(p<=0.01&&p>0.001) "**" else
if(p<=0.001) "***"
obj
}
cat("\nResults from a (multiple) Mantel regression using \"multi.mantel\":\n\n")
cat("Coefficients:\n")
object<-data.frame(x$coefficients,
x$tstatistic,x$probt,
sapply(x$probt,star))
rownames(object)<-names(x$coefficients)
colnames(object)<-c("Estimate","t value","Pr(>|t|)","")
print(object)
cat("---\n")
cat(paste("Signif. codes: 0 \u2018***\u2019 0.001 \u2018**\u2019 0.01",
"\u2018*\u2019 0.05 \u2018.\u2019 0.1 \u2018 \u2019 1\n"))
cat(paste("Pr(>|t|) based on",x$nperm,
"(Mantel) permutations of rows & columns together in Y.\n\n"))
cat(paste("Multiple R-squared:",round(x$r.squared,digits),"\n"))
cat(paste("F-statistic: ",round(x$fstatistic,digits),
", p-value (based on ",x$nperm," permutations): ",
x$probF,"\n\n",sep=""))
}

Here's a small example. It takes as arguments a single dependent matrix, Y, and either a single independent matrix or a list of such matrices, X. Either matrices or class "dist" objects are acceptable. Here I use the latter.

library(phytools)
packageVersion("phytools")
## [1] '0.6.48'
dY
##             1          2          3          4          5          6
## 2 0.54603980
## 3 1.69212862 2.23816843
## 4 0.16256209 0.70860190 1.52956653
## 5 0.58192166 1.12796146 1.11020697 0.41935957
## 6 0.83933577 0.29329596 2.53146439 1.00189786 1.42125743
## 7 0.58679271 0.04075291 2.27892134 0.74935480 1.16871437 0.25254306
## 8 1.75571440 2.30175420 0.06358577 1.59315231 1.17379274 2.59505017
## 9 0.93129804 1.47733784 0.76083058 0.76873595 0.34937638 1.77063381
## 10 0.97782473 1.52386454 0.71430389 0.81526264 0.39590307 1.81716050
## 7 8 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8 2.34250711
## 9 1.51809075 0.82441636
## 10 1.56461744 0.77788967 0.04652669
dX
## [[1]]
## 1 2 3 4 5 6
## 2 1.75416715
## 3 0.63607244 1.11809471
## 4 3.05970185 1.30553469 2.42362940
## 5 0.77190528 2.52607244 1.40797773 3.83160713
## 6 0.58102311 1.17314404 0.05504933 2.47867874 1.35292839
## 7 2.09863486 0.34446771 1.46256242 0.96106698 2.87054015 1.51761176
## 8 1.45771945 0.29644770 0.82164701 1.60198239 2.22962474 0.87669634
## 9 0.46813108 1.28603607 0.16794136 2.59157077 1.24003636 0.11289203
## 10 1.23194515 0.52222200 0.59587271 1.82775670 2.00385043 0.65092204
## 7 8 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.64091541
## 9 1.63050378 0.98958837
## 10 0.86668972 0.22577430 0.76381407
##
## [[2]]
## 1 2 3 4 5 6
## 2 0.54790843
## 3 1.16512317 1.71303160
## 4 0.26507653 0.28283189 1.43019970
## 5 0.95864760 1.50655603 0.20647557 1.22372414
## 6 0.36213942 0.18576901 1.52726259 0.09706289 1.32078703
## 7 0.03566574 0.51224269 1.20078891 0.22941080 0.99431334 0.32647369
## 8 1.29414610 1.84205453 0.12902294 1.55922264 0.33549850 1.65628553
## 9 1.73415785 2.28206628 0.56903468 1.99923439 0.77551025 2.09629728
## 10 0.30273318 0.24517525 1.46785635 0.03765665 1.26138078 0.05940624
## 7 8 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8 1.32981184
## 9 1.76982359 0.44001175
## 10 0.26706744 1.59687928 2.03689103
fit<-multi.mantel(dY,dX)
fit
## 
## Results from a (multiple) Mantel regression using "multi.mantel":
##
## Coefficients:
## Estimate t value Pr(>|t|)
## (intercept) 0.9667831 4.196181 0.618
## X1 -0.1771598 -1.579220 0.103
## X2 0.3950495 2.802433 0.023 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Pr(>|t|) based on 1000 (Mantel) permutations of rows & columns together in Y.
##
## Multiple R-squared: 0.213525
## F-statistic: 5.701406, p-value (based on 1000 permutations): 0.028
residuals(fit)
##               1            2            3            4            5
## 2 -0.326426449
## 3 0.377750562 0.792734343
## 4 -0.366883315 -0.138625616 0.427153292
## 5 -0.626824208 0.013532473 0.311292771 -0.352048579
## 6 -0.167576456 -0.539041195 0.971089391 0.435892244 0.172382466
## 7 -0.022286474 -1.067365644 1.096874319 -0.137794548 0.317672448
## 8 0.535928683 0.659786867 -0.808605010 0.294205815 0.469470892
## 9 -0.637629371 -0.163140682 -0.400996988 -0.528721733 -0.704087161
## 10 0.109698119 0.552741765 -0.726790552 0.157408232 -0.714186278
## 6 7 8 9
## 2
## 3
## 4
## 5
## 6
## 7 -0.574353601
## 8 1.129267512 0.963926842
## 9 -0.004290689 0.140999293 -0.140877962
## 10 0.942226161 0.645871995 -0.779741779 -1.589612195
fitted(fit)
##            1         2         3         4         5         6         7
## 2 0.8724663
## 3 1.3143781 1.4454341
## 4 0.5294454 0.8472275 1.1024132
## 5 1.2087459 1.1144290 0.7989142 0.7714081
## 6 1.0069122 0.8323372 1.5603750 0.5660056 1.2488750
## 7 0.6090792 1.1081186 1.1820470 0.8871494 0.8510419 0.8268967
## 8 1.2197857 1.6419673 0.8721908 1.2989465 0.7043218 1.4657827 1.3785803
## 9 1.5689274 1.6404785 1.1618276 1.2974577 1.0534635 1.7749245 1.3770915
## 10 0.8681266 0.9711228 1.4410944 0.6578544 1.1100894 0.8749343 0.9187454
## 8 9
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 0.9652943
## 10 1.5576314 1.6361389

You get the idea.

I simulated the distance matrices for the preceding as follows:

dY<-dist(y<-rnorm(10))
dX<-list(dist(rnorm(10)),dist(sqrt(0.5)*y+rnorm(10,sd=0.5)))

That's it.

Visualizing the rate of change in a discrete character through time

$
0
0

A friend & colleague asked me about trying to track reconstructed changes in a discrete character across the tree with the idea of testing the hypothesis that changes are distributed heterogeneously through time.

The following is one relatively simple idea that is to conduct stochastic mapping & then visualize the mean number of sampled changes by time segment across the tree. This is not precisely the same as fitting a model that is heterogeneous in rate across time; however it might nonetheless give a reasonable idea of what is going on.

library(phytools)
tree
## 
## Phylogenetic tree with 200 tips and 199 internal nodes.
##
## Tip labels:
## t8, t24, t25, t60, t61, t21, ...
##
## Rooted; includes branch lengths.
x
##   t8  t24  t25  t60  t61  t21  t22  t41 t162 t163  t76  t17 t137 t138 t100 
## b b b b b b b b b b b b b b b
## t67 t50 t179 t180 t69 t127 t128 t132 t133 t82 t2 t34 t35 t44 t45
## c b b b c c c b b b b b b b b
## t62 t63 t47 t146 t147 t68 t93 t94 t74 t197 t198 t199 t200 t87 t71
## a a b a a b b a b b b c c c c
## t38 t154 t155 t7 t27 t28 t9 t10 t11 t152 t153 t149 t125 t135 t136
## c c c b b b a b b b b b b b b
## t54 t14 t15 t181 t182 t12 t13 t108 t161 t175 t176 t42 t37 t6 t78
## b b b b b b c c c c c c c b b
## t79 t18 t30 t31 t88 t89 t53 t55 t114 t115 t49 t109 t110 t20 t118
## b b b b b b b b b b b b b b b
## t119 t104 t171 t172 t148 t26 t65 t66 t1 t91 t92 t70 t23 t195 t196
## b b b b b b a b b b b b b b b
## t158 t64 t173 t174 t166 t167 t29 t111 t112 t105 t19 t116 t117 t83 t52
## b b b b c c b b b b b b b b b
## t39 t40 t120 t121 t193 t194 t150 t151 t98 t99 t80 t81 t46 t58 t59
## b b b b b b a a a a a a c a a
## t130 t131 t129 t169 t170 t168 t159 t160 t191 t192 t5 t102 t103 t77 t95
## a a a a a a a a a a a b b b b
## t96 t43 t144 t145 t90 t97 t124 t156 t157 t139 t51 t189 t190 t101 t33
## b b b a b b b b b b c c c c c
## t86 t185 t186 t113 t183 t184 t164 t165 t106 t107 t75 t32 t140 t141 t134
## c c c c c c c c c c c c c c c
## t122 t123 t142 t143 t126 t187 t188 t84 t85 t16 t72 t73 t177 t178 t3
## c c c c c c c b b b b a b b b
## t4 t56 t57 t48 t36
## b c c c c
## Levels: a b c
trees<-make.simmap(tree,x,model="ER",nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.004767771 0.002383886 0.002383886
## b 0.002383886 -0.004767771 0.002383886
## c 0.002383886 0.002383886 -0.004767771
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters

First, phytools has a function for plotting the posterior distribution of changes directly on the tree. It also returns to the environment the positions of all these changes on the tree.

library(RColorBrewer)
colors<-setNames(brewer.pal(3,"Set1"),letters[1:3])
plotTree(tree,ftype="off",lwd=1)
par(fg="transparent")
tiplabels(pie=to.matrix(x,letters[1:3]),piecol=colors,cex=0.2)
par(fg="black")
changes<-sapply(trees,markChanges,
colors=sapply(colors,make.transparent,alpha=0.3))

plot of chunk unnamed-chunk-2

Note that these are not marked changes from one history, but from a sample of 100 such histories on the tree.

The object returned by this set of calls to markChanges looks as follows:

head(changes,n=2)
## [[1]]
## x y
## b->c 92.39156 16.00000
## b->c 79.08969 20.75000
## b->a 62.74820 33.65625
## a->b 95.08825 33.00000
## a->b 97.62994 36.00000
## a->b 97.40935 37.00000
## b->c 75.98806 46.28125
## b->a 78.73534 52.00000
## b->c 81.91755 67.00000
## b->c 74.75532 71.71875
## b->a 88.76125 97.00000
## b->c 82.54870 110.50000
## b->c 62.94025 121.56250
## c->b 64.89966 121.56250
## b->a 31.11766 139.70312
## a->c 90.50957 133.00000
## b->a 96.94503 154.00000
## b->c 25.29105 181.05078
## c->a 47.65342 192.20312
## a->b 60.05642 192.20312
## b->c 79.20952 183.50000
## b->c 81.67096 185.75000
## b->a 94.56798 192.00000
##
## [[2]]
## x y
## b->c 99.36067 16.00000
## b->c 75.13884 20.75000
## b->a 69.79316 31.50000
## b->a 87.57828 34.50000
## b->a 98.22282 38.00000
## b->c 26.20428 47.64062
## c->b 85.50603 49.00000
## b->a 75.34404 52.00000
## b->c 91.32295 67.00000
## b->c 76.94021 71.71875
## b->a 99.43863 97.00000
## b->c 86.20844 110.50000
## b->a 22.31200 139.70312
## a->c 97.90611 133.00000
## b->a 96.23615 154.00000
## b->c 36.15047 166.43750
## b->c 74.66465 184.62500
## b->a 87.48430 192.00000
## b->c 52.34096 199.12500

Now let's go about trying to compute the average number of changes for each of 20 (say) temporal segments from the root of the tree to the present.

h<-max(nodeHeights(tree))
b<-20
segs<-cbind(seq(0,h-h/b,h/b),
seq(1/b*h,h,h/b))
segs
##       [,1] [,2]
## [1,] 0 5
## [2,] 5 10
## [3,] 10 15
## [4,] 15 20
## [5,] 20 25
## [6,] 25 30
## [7,] 30 35
## [8,] 35 40
## [9,] 40 45
## [10,] 45 50
## [11,] 50 55
## [12,] 55 60
## [13,] 60 65
## [14,] 65 70
## [15,] 70 75
## [16,] 75 80
## [17,] 80 85
## [18,] 85 90
## [19,] 90 95
## [20,] 95 100

These will be our segments. (The total tree length is 100.)

Now let's compute the mean number of sampled changes for each segment:

nchanges<-rep(0,b)
for(i in 1:length(changes)){
for(j in 1:nrow(changes[[i]])){
ind<-which((changes[[i]][j,1]>segs[,1])+
(changes[[i]][j,1]<=segs[,2])==2)
nchanges[ind]<-nchanges[ind]+1/length(changes)
}
}
plot(h-as.vector(t(segs)),rbind(nchanges,nchanges),type="l",lwd=2,
xlim=c(max(segs),min(segs)),
lend=0,xlab="time since the present",ylab="mean number of changes")
plotTree(tree,add=TRUE,ftype="off",lwd=1,color=make.transparent("blue",0.1),
mar=par()$mar,direction="leftwards",xlim=c(max(segs),min(segs)))

plot of chunk unnamed-chunk-5

Neat. Of course, under a constant process we expect the number of changes to increase towards the present due to the greater edge length of that part of the tree. To control for this we could also compute the total edge length of each segment. This is a bit tricky, but we can do it as follows:

LTT<-ltt(tree,plot=FALSE)
LTT<-cbind(LTT$ltt[2:(length(LTT$ltt)-1)],
LTT$times[2:(length(LTT$ltt)-1)],
LTT$times[3:length(LTT$ltt)])
ii<-1
edge.length<-rep(0,b)
for(i in 1:nrow(segs)){
done.seg<-FALSE
while(LTT[ii,2]<=segs[i,2]&&done.seg==FALSE){
edge.length[i]<-edge.length[i]+
LTT[ii,1]*(min(segs[i,2],LTT[ii,3])-
max(segs[i,1],LTT[ii,2]))
if(LTT[ii,3]>=segs[i,2]) done.seg<-TRUE
if(LTT[ii,3]<=segs[i,2]) ii<-if(ii<nrow(LTT)) ii+1 else ii
}
}

Now let's plot the mean number of changes per segment, divided by the total edge length encompassed by the segment:

plot(h-as.vector(t(segs)),
rbind(nchanges/edge.length,nchanges/edge.length),type="l",lwd=2,
xlim=c(max(segs),min(segs)),
lend=0,xlab="time since the present",
ylab="mean number of changes / total edge length")
plotTree(tree,add=TRUE,ftype="off",lwd=1,color=make.transparent("blue",0.1),
mar=par()$mar,direction="leftwards",xlim=c(max(segs),min(segs)))

plot of chunk unnamed-chunk-7

Cool.

Now we can undertaken the same procedure, but this time with a different character, y:

trees<-make.simmap(tree,y,model="ER",nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.002287677 0.001143839 0.001143839
## b 0.001143839 -0.002287677 0.001143839
## c 0.001143839 0.001143839 -0.002287677
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
plotTree(tree,ftype="off",lwd=1)
par(fg="transparent")
tiplabels(pie=to.matrix(x,letters[1:3]),piecol=colors,cex=0.2)
par(fg="black")
changes<-sapply(trees,markChanges,
colors=sapply(colors,make.transparent,alpha=0.3))

plot of chunk unnamed-chunk-8

h<-max(nodeHeights(tree))
b<-20
segs<-cbind(seq(0,h-h/b,h/b),
seq(1/b*h,h,h/b))
nchanges<-rep(0,b)
for(i in 1:length(changes)){
for(j in 1:nrow(changes[[i]])){
ind<-which((changes[[i]][j,1]>segs[,1])+
(changes[[i]][j,1]<=segs[,2])==2)
nchanges[ind]<-nchanges[ind]+1/length(changes)
}
}
LTT<-ltt(tree,plot=FALSE)
LTT<-cbind(LTT$ltt[2:(length(LTT$ltt)-1)],
LTT$times[2:(length(LTT$ltt)-1)],
LTT$times[3:length(LTT$ltt)])
ii<-1
edge.length<-rep(0,b)
for(i in 1:nrow(segs)){
done.seg<-FALSE
while(LTT[ii,2]<=segs[i,2]&&done.seg==FALSE){
edge.length[i]<-edge.length[i]+
LTT[ii,1]*(min(segs[i,2],LTT[ii,3])-
max(segs[i,1],LTT[ii,2]))
if(LTT[ii,3]>=segs[i,2]) done.seg<-TRUE
if(LTT[ii,3]<=segs[i,2]) ii<-if(ii<nrow(LTT)) ii+1 else ii
}
}
plot(h-as.vector(t(segs)),
rbind(nchanges/edge.length,nchanges/edge.length),type="l",lwd=2,
xlim=c(max(segs),min(segs)),
lend=0,xlab="time since the present",
ylab="mean number of changes / total edge length")
plotTree(tree,add=TRUE,ftype="off",lwd=1,color=make.transparent("blue",0.1),
mar=par()$mar,direction="leftwards",xlim=c(max(segs),min(segs)))

plot of chunk unnamed-chunk-9

This plot shows a greater concentration of changes (controlling for total edge length) towards the base of the tree. In fact, that is what I simulated by using an 'EB' transformation of the tree for simulation!

Data were simulated as follows:

tree<-pbtree(n=200,scale=100)
Q<-matrix(c(-0.004,0.002,0.002,
0.002,-0.004,0.002,
0.002,0.002,-0.004),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
x<-as.factor(sim.history(tree,Q)$states)
EB<-phytools:::ebTree(tree,-0.06)
EB$edge.length<-EB$edge.length/max(nodeHeights(EB))*100
y<-as.factor(sim.history(EB,10*Q)$states)

More on visualizing the rate of discrete evolution through time

$
0
0

Last week I posted on visualizing the mean number of changes (or changes per unit of edge length) per interval of time from the root of the tree to the present day. This can create a nice plot; however it is more or less meaningless unless we can compare it to what we might expect under some kind of neutral process.

Consequently, today I have addedsome new function & methods that:

1) Automate calculation of this CTT or 'changes-through-time' plot based on stochastic mapped trees.

2) Simulate CTT plots for a given transition matrix, Q and (optionally) ancestral state, by first simulating a discrete character & then sampling stochastic character maps consistent with this character. This can be done a large number of times to generate a null distribution of the accumulation of changes through time under the model.

3) Visualize our empirical CTT plot and a 100×(1-α)% distribution of CTTs under the a null model.

Here's what that looks like:

library(phytools)
packageVersion("phytools")
## [1] '0.6.49'
tree
## 
## Phylogenetic tree with 200 tips and 199 internal nodes.
##
## Tip labels:
## t5, t29, t68, t199, t200, t11, ...
##
## Rooted; includes branch lengths.
y
##   t5  t29  t68 t199 t200  t11  t18 t107 t108  t72   t3  t24  t85  t86  t44 
## b c c c c c c c c c c c c c b
## t48 t51 t52 t186 t195 t196 t115 t116 t91 t134 t135 t113 t42 t90 t97
## b b b b b b c c c c c c c a a
## t98 t158 t159 t50 t144 t145 t32 t43 t151 t152 t112 t30 t53 t54 t47
## b a a a a a a a a a a a a a a
## t82 t83 t57 t58 t157 t176 t177 t76 t75 t92 t93 t95 t139 t140 t153
## a a a a a a a a a a a a a a a
## t154 t36 t37 t80 t101 t102 t73 t74 t31 t124 t125 t33 t162 t163 t96
## a a a a a a a a a a a a a a a
## t114 t122 t123 t132 t133 t78 t79 t71 t17 t21 t22 t14 t25 t26 t89
## a a a a a b a a a a a a a a a
## t187 t188 t170 t171 t23 t172 t173 t84 t148 t149 t156 t182 t183 t81 t197
## a a a a a c c a a a c c c c c
## t198 t141 t49 t189 t190 t19 t20 t184 t185 t155 t110 t111 t106 t117 t118
## c c c c c c c c c c c c c c c
## t13 t4 t38 t39 t12 t150 t180 t181 t45 t46 t8 t142 t143 t40 t164
## b c c c c c c c c c c c c c c
## t165 t178 t179 t55 t56 t41 t59 t87 t88 t35 t193 t194 t138 t174 t175
## c c c c c c a a a a a a c c c
## t15 t9 t128 t129 t160 t161 t119 t62 t63 t10 t2 t191 t192 t69 t70
## c a a a a a a a a a a b b b b
## t146 t147 t103 t77 t109 t168 t169 t136 t137 t130 t131 t120 t121 t126 t127
## b b b b b b b b b b b b b b b
## t166 t167 t7 t64 t65 t34 t16 t27 t66 t67 t99 t100 t60 t61 t104
## b b b a a a b b b b b b b b b
## t105 t94 t28 t6 t1
## b b b b b
## Levels: a b c
trees<-make.simmap(tree,y,nsim=100,model="ER")
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.003042964 0.001521482 0.001521482
## b 0.001521482 -0.003042964 0.001521482
## c 0.001521482 0.001521482 -0.003042964
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
object<-ctt(trees)
object
## Object of class "ctt" consisting of:
## (1) a matrix (segments) with the beginning & ending time of each segment.
## (2) a vector (nchanges) with the mean number of changes in each segment.
## (3) a vector (edge.length) containing the total edge length of each segement.
## (4) an object of class "phylo".
plot(object,type="number")

plot of chunk unnamed-chunk-1

More often it seems more likely that we'd be interested in tracking the number of changes per unit of edge length rather than the total number of changes - because in all reconstructed phylogeny of extant taxa there is more edge length towards the tips of the tree than towards the root.

plot(object,type="rate")

plot of chunk unnamed-chunk-2

Of course, as I mentioned in the preamble, it probably doesn't make much sense to generate this type of visualization unless we have a sense of what might be expected under some kind of reasonable null hypothesis - such as a constant rate of character evolution through time & among lineages. Luckily, we can simulate this too. The function sim.ctt will simulate a CTT plot for a given transition matrix, Q, by first simulating a discrete character and then by sampling stochastic character maps for that character. This is still somewhat slow, so it may take a few minutes to run.

I'll use the function sim.multiCtt to simulate various rather than a single CTT:

Q<-trees[[1]]$Q
Q
##              a            b            c
## a -0.003042964 0.001521482 0.001521482
## b 0.001521482 -0.003042964 0.001521482
## c 0.001521482 0.001521482 -0.003042964
nulo<-sim.multiCtt(tree,Q,nsim=100)
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
## Starting stochastic mapping with simulated data vector.... Done.
nulo
## 100 objects of class "ctt" in a list.
plot(nulo,type="number")
plot(object,add=TRUE,type="number")

plot of chunk unnamed-chunk-3

That's kind of neat. Now, for the rate. Here, I will change the α level for our confidence interval.

plot(nulo,alpha=0.2,ylim=c(0,0.025))
plot(object,add=TRUE)

plot of chunk unnamed-chunk-4

In this case, the data were simulated under a process of declining rate towards the present. Perhaps that was captured by our plots?

tree<-pbtree(n=200,scale=100)
Q<-matrix(c(-0.004,0.002,0.002,
0.002,-0.004,0.002,
0.002,0.002,-0.004),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
EB<-phytools:::ebTree(tree,-0.06)
EB$edge.length<-EB$edge.length/max(nodeHeights(EB))*100
y<-as.factor(sim.history(EB,10*Q)$states)

Fitting a variable-process model of discrete character evolution on the tree using phytools

$
0
0

Now for something a little different.

Today, I have builta new method that fits a model of discrete character evolution in which the transition matrix Q varies among different parts of the tree.

These might be edges of clades specified arbitrarily by the user (for instance, using paintBranches or paintSubTree in phytools), or they could be regimes mapped onto the tree using the procedure of stochastic character mapping.

The way I did this was pretty simple. I just took the function that implements Felsenstein's famous pruning algorithm to compute the likelihood, but then I modified so that it could use a different Q for different edges. The only complication was that we might like our regime to change along an edge rather than merely at a node. To solve that, I used the phytoolsfunction map.to.singleton to convert our "simmap" object with singleton nodes and only a single regime per edge. Problem solved.

Note that this function, like fitMk, uses code for the pruning algorithm adapted from Emmanuel Paradis' ape package.

Let's try it.

First, our data:

library(phytools)
packageVersion("phytools")
## [1] '0.6.50'
plot(tree,ftype="off",colors=setNames(c("blue","red"),
mapped.states(tree)),xlim=c(0,1.05*max(nodeHeights(tree))))
tiplabels(pie=to.matrix(x,c(0,1)),piecol=c("black","white"),
cex=0.4,offset=0.01)

plot of chunk unnamed-chunk-1

So the idea is simply that we will fit a model in which the rate of transition between 0&1 (show here as black & white) depends on the state (red or blue) mapped onto the edges of the tree. Right?

Let's fit our model:

fitERmulti<-fitmultiMk(tree,x,model="ER")
fitERmulti
## Object of class "fitmultiMk".
##
## Fitted value of Q[a]:
## 0 1
## 0 -0.646581 0.646581
## 1 0.646581 -0.646581
##
## Fitted value of Q[b]:
## 0 1
## 0 -10.14495 10.14495
## 1 10.14495 -10.14495
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -35.883658
##
## Optimization method used was "nlminb"

Or an "ARD" model that differs between parts of the tree:

fitARDmulti<-fitmultiMk(tree,x,model="ARD")
fitARDmulti
## Object of class "fitmultiMk".
##
## Fitted value of Q[a]:
## 0 1
## 0 -2.218063 2.218063
## 1 0.604146 -0.604146
##
## Fitted value of Q[b]:
## 0 1
## 0 -25.36092 25.36092
## 1 12.73165 -12.73165
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -34.515745
##
## Optimization method used was "nlminb"

Of course, we can compare this, if we'd like, to a model with but a single regime on the tree. For instance:

fitER<-fitMk(tree,x,model="ER")
fitER
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -1.651121 1.651121
## 1 1.651121 -1.651121
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -41.613061
##
## Optimization method used was "nlminb"

This suggests that our data justifies the greater model complexity of multiple regimes on the tree. That's good, because we simulated them that way!

tree<-pbtree(n=100,tip.label=LETTERS,scale=0.5)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
tree<-sim.history(tree,Q,anc="a")
sim.tree<-as.phylo(tree)
q<-setNames(c(1,10),letters[1:2])
sim.tree$edge.length<-colSums(t(tree$mapped.edge[,letters[1:2]])*q)
rownames(Q)<-colnames(Q)<-0:1
x<-as.factor(sim.history(sim.tree,Q)$states)

(As of yet there is not function to simulate multiple Mk models in different parts of the tree - so what I did above is to stretchthe edge lengths of the tree by regime, and simulate under a constant regime on the stretched tree.)


Type I error rates for variable-rate/process Mk discrete character evolution model

$
0
0

Earlier today I describeda new method for testing hypotheses about heterogeneity in the rate (or process) of discrete character evolution across a tree.

Although mathematically quite different, this methods owes its intellectual provenance to an article by Brian O'Meara & colleagues published nearly a dozen years ago in 2006. The tactic here is basically the same, but applied to discrete rather than continuous traits.

I just pushedanother small update that modifies the logLik methods of the "fitMk"and "fitmultiMk" object classes to facilitate comparison of models in which the rate (or process) is either homogeneous ("fitMk") or non-homogeneous ("fitmultiMk") across the edges of the tree.

In ensuring too that full credit goes where it is due, I should note that for computing the likelihood I adapted Felsenstein's pruning algorithmas implemented in Emmanuel Paradis's apepackage.

Below, I'll examine the type I error of the method when used in hypothesis testing against a homogeneous rate/process model.

First, we need to simulate our 'regimes' on the trees. These could be specified arbitrarily, but in this case I will just simulate a binary character on a set of pure-birth phylogenies using sim.history:

library(phytools)
packageVersion("phytools")
## [1] '0.6.51'
trees<-pbtree(n=100,scale=1,nsim=200)
Q<-matrix(c(-0.5,0.5,0.5,-0.5),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
trees<-lapply(trees,sim.history,Q=Q,message=FALSE)
class(trees)<-c("multiSimmap","multiPhylo")

I generated 200 trees. Let's plot the first 100 of these:

par(mfrow=c(10,10))
nulo<-sapply(trees[1:100],plot,colors=setNames(c("blue","red"),
letters[1:2]),ftype="off",lwd=1)

plot of chunk unnamed-chunk-2

Next, we can simulate a discrete character on each of these trees, but in which the transition process is unrelated to the mapped regimes - that is, in which the null hypothesis of no difference in rate or process exists:

Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-0:1
X<-lapply(trees,function(x,Q) getStates(sim.history(x,Q,
message=FALSE),"tips"),Q=Q)

Now let's fit our two models to each data vector & tree:

fits.single<-mapply(fitMk,tree=trees,x=X,MoreArgs=list(model="ER"),
SIMPLIFY=FALSE)
fits.multi<-mapply(fitmultiMk,tree=trees,x=X,MoreArgs=list(model="ER"),
SIMPLIFY=FALSE)

We can look at a single fitted model of each type:

fits.single[[1]]
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -0.912078 0.912078
## 1 0.912078 -0.912078
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -41.127165
##
## Optimization method used was "nlminb"
fits.multi[[1]]
## Object of class "fitmultiMk".
##
## Fitted value of Q[a]:
## 0 1
## 0 -0.888472 0.888472
## 1 0.888472 -0.888472
##
## Fitted value of Q[b]:
## 0 1
## 0 -1.032995 1.032995
## 1 1.032995 -1.032995
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -41.116092
##
## Optimization method used was "nlminb"

We can also conduct a likelihood-ratio test on each pair of fitted models. I will do that using lmtest::lrtest as follows:

library(lmtest)
suppressWarnings(LR.test<-mapply(lrtest,fits.single,fits.multi,
SIMPLIFY=FALSE))
## for example
LR.test[[1]]
## Likelihood ratio test
##
## Model 1: dots[[1L]][[1L]]
## Model 2: dots[[2L]][[1L]]
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 1 -41.127
## 2 2 -41.116 1 0.0221 0.8817

Let's pull out the P-values from these tests & plot them:

P<-sapply(LR.test,function(x) x[["Pr(>Chisq)"]][2])
obj<-hist(P,20,plot=FALSE)
plot(obj$mids,obj$counts/sum(obj$counts),type="h",
lwd=18,col=make.transparent("blue",0.4),lend=1,
xlab="P-value",ylab="relative frequency",ylim=c(0,0.2))
title(main="P-values for LR-test for data simulated under the null")
abline(h=0.05,col=make.transparent("red",0.5),lwd=1,
lty="dotted")
text(0.96,0.05,"0.05",pos=3)

plot of chunk unnamed-chunk-7

Neat. This is close to what we'd hope for.

Pybus & Harvey's γ through time

$
0
0

A phytools user today contacted me to ask about an evident discrepancy between an LTT plot (which on average seemed to be curved upward), and a calculated value of Pybus & Harvey's (2000) γ statistic (which was nonsignificant but negative). Visually, to me it seemed that this was most likely explained by a notable slightly negative curvature of the LTT plot towards the present day where there are many more internode distances. However, it also occurred to me that we can easily plot γ through time as well, and this might help us to see what parts of our plot are influencial in the measured value of γ.

Note that this assumes a pure-birth process of lineage accumulation through time. If we have extinction (which, in general, should result in positive γ), we need the 'pull of the present' to detect it, and by slicing of the end of our tree in computing γ we might find a value close to zero rather than positive!

Here are our functions:

gtt<-function(tree,n=100,...){
if(hasArg(plot)) plot<-list(...)$plot
else plot<-FALSE
obj<-ltt(tree,plot=FALSE)
t<-obj$times[which(obj$ltt==3)[1]]
h<-max(nodeHeights(tree))
x<-seq(t,h,by=(h-t)/(n-1))
trees<-lapply(x,treeSlice,tree=tree,orientation="rootwards")
gamma<-sapply(trees,function(x,plot){
obj<-unlist(gammatest(ltt<-ltt(x,plot=FALSE)));
if(plot) plot(ltt,xlim=c(0,h),ylim=c(1,Ntip(tree)),
log.lineages=FALSE,log="y");
Sys.sleep(0.01);
obj},plot=plot)
object<-list(t=x,gamma=gamma[1,],p=gamma[2,],tree=tree)
class(object)<-"gtt"
object
}
## plot method
plot.gtt<-function(x,...){
args<-list(...)
args$x<-x$t
args$y<-x$gamma
if(!is.null(args$show.tree)){
show.tree<-args$show.tree
args$show.tree<-NULL
} else show.tree<-TRUE
if(is.null(args$xlim)) args$xlim<-c(0,max(x$t))
if(is.null(args$xlab)) args$xlab<-"time"
if(is.null(args$ylab)) args$ylab<-expression(gamma)
if(is.null(args$lwd)) args$lwd<-3
if(is.null(args$type)) args$type<-"s"
do.call(plot,args)
if(show.tree) plotTree(x$tree,add=TRUE,ftype="off",mar=par()$mar,
xlim=args$xlim,color=make.transparent("blue",0.1))
}
print.gtt<-function(x,...)
cat("Object of class \"gtt\".\n\n")

Let's try it with a simulated tree:

library(phytools)
tree
## 
## Phylogenetic tree with 500 tips and 499 internal nodes.
##
## Tip labels:
## t106, t107, t97, t451, t452, t75, ...
##
## Rooted; includes branch lengths.
object<-gtt(tree)
object
## Object of class "gtt".
plot(object)

plot of chunk unnamed-chunk-2

Note that since our intervals are evenly spaced, rather than linked to events on the tree, the choice of a 'step' line plot is purely cosmetic and I could've also used straight lines:

plot(object,type="l")

plot of chunk unnamed-chunk-3

The function gtt has an option plot=TRUEthat will create an animation of the LTT plot for the tree. Something like this:

object<-gtt(tree,plot=TRUE)

The .gif was generated in R using ImageMagick as follows:

png(file="gtt-%03d.png",width=600,height=600)
object<-gtt(tree,plot=TRUE)
dev.off()
system("ImageMagick convert -delay 10 -loop 0 *.png gtt-anim.gif")
file.remove(list.files(pattern=".png"))

Matching edges between topologically identical trees with different node rotations & edge lengths (and computing their average values)

$
0
0

A few days ago a phytools user wrote:

“I apologize if this has been discussed elsewhere, but I am curious if you have an easy way to extract and compare homologous edges of a tree that all contain the same taxa but have different edge lengths and are written differently in there newick format? I would use grep to extract these normally but I have a case where I dated a large topology and the tree output is written differently than the tree I input in newick format. I want to check correlation of dates to the actual branch lengths but they are not in the same order in the newick format.”

I'm sure there are a number of ways to do this in R; however, one way is using the phytools function matchNodes.

library(phytools)
trees
## 9 phylogenetic trees
## first, the trees are all topologically identical:
densityTree(trees,type="cladogram",nodes="intermediate")

plot of chunk unnamed-chunk-1

## now that they have different node rotations & numbering
par(mfrow=c(3,3))
plotTree(trees,lwd=1,node.numbers=T,fsize=0.8)

plot of chunk unnamed-chunk-1

The function matchNodes matches node numbers between different trees based on (by default) commonality in the identity of the descendants of each node. In the following code I will match each of the trees in the set to the first - but this is arbitrary. Note that we must do this separately for labels (using sister function matchLabels) and then combine the two matrices.

M1<-matrix(NA,Ntip(trees[[1]]),length(trees),
dimnames=list(trees[[1]]$tip.label,
paste("t[[",1:length(trees),"]]",sep="")))
M2<-matrix(NA,trees[[1]]$Nnode,length(trees),
dimnames=list(1:trees[[1]]$Nnode+Ntip(trees[[1]]),
paste("t[[",1:length(trees),"]]",sep="")))
for(i in 1:length(trees)){
M1[,i]<-matchLabels(trees[[1]],trees[[i]])[,2]
M2[,i]<-matchNodes(trees[[1]],trees[[i]])[,2]
}
M<-rbind(M1,M2)
M
##    t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H 1 25 9 21 5 22 6 5 1
## I 2 26 8 20 6 21 7 6 2
## F 3 22 6 18 7 24 9 3 4
## E 4 23 7 17 8 23 8 2 5
## G 5 24 5 19 9 25 10 4 3
## A 6 20 3 25 4 17 2 8 7
## C 7 18 1 24 3 19 4 10 9
## B 8 19 2 23 2 18 3 9 8
## D 9 21 4 22 1 20 5 7 6
## J 10 17 10 26 10 26 1 1 10
## U 11 4 12 13 12 6 16 23 23
## V 12 3 13 11 14 4 14 24 22
## W 13 2 14 12 13 5 15 25 21
## Z 14 7 17 16 15 3 11 22 26
## Y 15 6 15 15 17 2 12 20 25
## X 16 5 16 14 16 1 13 21 24
## T 17 1 11 10 11 7 17 26 20
## O 18 15 25 4 23 8 22 18 16
## P 19 12 24 1 24 9 25 15 17
## R 20 14 23 3 25 11 24 17 18
## Q 21 13 22 2 26 10 23 16 19
## K 22 8 19 7 20 12 18 13 12
## L 23 9 18 8 19 13 19 14 13
## M 24 11 20 6 21 14 21 11 15
## N 25 10 21 5 22 15 20 12 14
## S 26 16 26 9 18 16 26 19 11
## 27 27 27 27 27 27 27 27 27 27
## 28 28 43 28 43 28 43 28 28 28
## 29 29 44 29 44 29 44 29 29 29
## 30 30 48 33 45 33 48 33 30 30
## 31 31 51 36 48 34 49 34 33 31
## 32 32 49 34 46 35 50 35 31 32
## 33 33 50 35 47 36 51 36 32 33
## 34 34 45 30 49 30 45 30 34 34
## 35 35 46 31 50 31 46 31 35 35
## 36 36 47 32 51 32 47 32 36 36
## 37 37 28 37 28 37 28 37 37 37
## 38 38 29 38 37 38 29 38 46 46
## 39 39 30 39 38 39 30 39 47 47
## 40 40 31 40 39 40 33 42 50 48
## 41 41 32 41 40 41 34 43 51 49
## 42 42 33 42 41 42 31 40 48 50
## 43 43 34 43 42 43 32 41 49 51
## 44 44 35 44 29 44 35 44 38 38
## 45 45 36 45 30 45 36 45 39 39
## 46 46 40 49 31 49 37 49 43 43
## 47 47 41 50 32 50 38 50 44 44
## 48 48 42 51 33 51 39 51 45 45
## 49 49 37 46 34 46 40 46 40 40
## 50 50 38 47 36 47 41 47 42 41
## 51 51 39 48 35 48 42 48 41 42

The way to interpret this is that the i,jth element of M matches the ith taxon label or node number in the row names to the nodes in the column jth tree.

Next we want to get the edge lengths of all the corresponding edges in all the trees. This should be fairly easy now. Note that we have one few edge length in each tree than the number of rows in M because (generally) the root node does not have an edge length.

M<-M[-Ntip(trees[[1]])-1,] ## trim root node from M
E<-matrix(NA,nrow(M),ncol(M),dimnames=dimnames(M))
for(i in 1:ncol(M)) for(j in 1:nrow(M))
E[j,i]<-trees[[i]]$edge.length[which(trees[[i]]$edge[,2]==M[j,i])]
print(E,digits=3)
##    t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H 41.507 38.43 40.89 40.93 35.380 36.741 40.698 36.70 39.46
## I 41.507 38.43 40.89 40.93 35.380 36.741 40.698 36.70 39.46
## F 25.372 25.51 22.98 24.74 23.830 27.529 25.377 26.37 25.94
## E 25.372 25.51 22.98 24.74 23.830 27.529 25.377 26.37 25.94
## G 34.043 31.93 31.06 32.62 29.796 33.251 33.857 31.85 32.47
## A 28.602 33.14 29.13 29.46 28.879 25.782 32.828 26.38 26.89
## C 3.499 3.74 2.98 1.27 1.862 0.522 1.192 2.49 5.84
## B 3.499 3.74 2.98 1.27 1.862 0.522 1.192 2.49 5.84
## D 36.666 38.67 36.18 37.60 39.167 37.088 35.332 36.74 37.25
## J 97.374 94.51 92.68 90.95 93.051 97.177 91.018 94.77 91.71
## U 12.275 16.01 15.30 16.34 14.012 13.259 15.002 13.60 11.46
## V 5.248 4.90 4.79 3.60 3.848 4.543 9.768 6.24 4.03
## W 5.248 4.90 4.79 3.60 3.848 4.543 9.768 6.24 4.03
## Z 7.952 12.70 13.57 13.49 8.555 13.479 10.517 12.95 8.01
## Y 5.860 12.70 13.57 10.00 7.919 13.479 10.517 10.32 8.01
## X 5.860 12.70 13.57 10.00 7.919 13.479 10.517 10.32 8.01
## T 77.345 78.97 78.68 79.36 77.708 80.346 80.827 81.22 78.95
## O 33.773 30.63 32.62 32.31 29.050 29.524 31.983 35.08 35.77
## P 29.987 25.67 30.19 25.47 28.086 28.537 25.924 29.73 26.43
## R 18.898 15.71 16.82 15.43 16.459 13.036 12.350 15.81 15.68
## Q 18.898 15.71 16.82 15.43 16.459 13.036 12.350 15.81 15.68
## K 21.646 19.60 18.16 19.51 22.487 18.644 18.263 19.11 19.64
## L 21.646 19.60 18.16 19.51 22.487 18.644 18.263 19.11 19.64
## M 29.576 24.77 29.13 25.65 27.042 24.809 28.559 22.98 24.85
## N 29.576 24.77 29.13 25.65 27.042 24.809 28.559 22.98 24.85
## S 66.926 60.54 69.51 66.75 64.339 70.155 69.758 62.07 62.50
## 28 2.466 5.65 9.39 2.68 3.798 5.466 12.050 5.17 7.41
## 29 6.992 8.90 7.17 4.12 8.859 6.314 5.507 7.45 7.40
## 30 19.058 18.65 18.28 18.35 22.574 18.007 16.631 18.48 16.88
## 31 29.817 28.53 26.34 27.56 26.239 36.115 28.183 32.14 27.96
## 32 37.280 35.04 36.17 35.86 31.822 39.605 35.023 36.99 34.95
## 33 8.672 6.42 8.08 7.88 5.966 5.722 8.480 5.47 6.53
## 34 53.716 46.94 49.33 49.24 45.026 53.775 50.179 50.58 47.05
## 35 8.064 5.53 7.05 8.13 10.288 11.306 2.504 10.35 10.36
## 36 25.102 29.41 26.15 28.20 27.016 25.259 31.636 23.89 21.05
## 37 19.982 21.20 17.92 14.27 19.142 20.337 17.424 18.72 20.17
## 38 2.513 0.00 5.46 0.00 0.000 1.960 4.817 0.00 0.00
## 39 64.567 56.63 60.85 59.94 60.023 61.396 65.656 61.40 64.04
## 40 0.502 6.32 2.54 3.07 3.672 5.691 0.169 6.22 3.46
## 41 7.027 11.11 10.51 12.74 10.165 8.716 5.234 7.36 7.43
## 42 4.825 9.63 4.27 5.93 9.130 5.471 4.654 6.87 6.90
## 43 2.093 0.00 0.00 3.49 0.636 0.000 0.000 2.64 0.00
## 44 12.931 18.43 14.63 12.61 13.368 12.151 15.886 19.15 16.45
## 45 24.173 21.19 26.63 23.24 26.666 27.403 22.921 20.84 20.47
## 46 8.980 8.72 10.26 11.20 8.623 13.228 14.854 6.15 6.27
## 47 3.786 4.96 2.43 6.84 0.964 0.987 6.059 5.36 9.34
## 48 11.089 9.96 13.37 10.04 11.627 15.501 13.574 13.92 10.75
## 49 6.915 8.75 11.83 12.77 5.908 8.559 13.627 10.35 10.51
## 50 14.191 11.00 12.89 11.23 9.278 15.549 14.946 11.77 11.88
## 51 6.261 5.83 1.92 5.09 4.723 9.384 4.651 7.91 6.67

Neat.

Of course, if we want we could get the average edge lengths for corresponding edges across the set of trees as follows:

edge.length<-rowMeans(E)
ii<-sapply(trees[[1]]$edge[,2],function(x,y) which(y==x),y=M[,1])
tree<-trees[[1]]
tree$edge.length<-edge.length[ii]
plotTree(tree)

plot of chunk unnamed-chunk-4

We can see that when we average the edge lengths from a set of topologically identical ultrametric trees, the resulting tree is also ultrametric.

And, to see that this is indeed the average, let's overlay it on our plot made with densityTree, this time using a square phylogram plotting style:

densityTree(trees,nodes="centered",compute.consensus=FALSE)
par(fg="transparent")
plotTree(tree,nodes="centered",mar=par()$mar,
direction="leftwards",xlim=get("last_plot.phylo",
envir=.PlotPhyloEnv)$x.lim,add=TRUE,
color=make.transparent("darkgrey",0.5),lwd=6,ftype="i")

plot of chunk unnamed-chunk-5

par(fg="black")

The 'average' tree is show in grey, whereas our sample of trees are given in blue.

Note that it is also possible to do this automatically (that is, obtaining the average values of corresponding edges from a set of trees) using the phytools function consensus.edges.

I hope this addresses the user's question.

For this exercise I simulated my set of trees as follows:

tree<-pbtree(n=26,tip.label=LETTERS,scale=100)
trees<-list()
for(i in 1:9){
nodes<-sample(1:tree$Nnode+Ntip(tree),10)
trees[[i]]<-tree
trees[[i]]$edge.length<-abs(tree$edge.length+
rnorm(n=nrow(tree$edge),sd=3))
trees[[i]]<-force.ultrametric(trees[[i]])
for(j in 1:length(nodes))
trees[[i]]<-untangle(rotate(trees[[i]],nodes[j]),"read.tree")
}
class(trees)<-"multiPhylo"

although I assume that in the empirical case they would come from my inference method.

That's it.

Drawing colored boxes around phylogenetic tip labels using R base graphics

$
0
0

I recently saw a post describing how to plot a tree with colored boxes around tip labels using the neat package ggtree.

Of course, it is also straightforward to do this using R base graphics. The following is just one example of how to do that:

library(phytools)
## custom function I'm going to use for the box labels
boxlabel<-function(x,y,text,cex=1,bg="transparent",offset=0){
w<-strwidth(text)*cex*1.1
h<-strheight(text)*cex*1.4
os<-offset*strwidth("W")*cex
rect(x+os,y-0.5*h,x+w+os,y+0.5*h,col=bg,border=0)
text(x,y,text,pos=4,offset=offset,font=3)
}
## our tree
tree
## 
## Phylogenetic tree with 26 tips and 25 internal nodes.
##
## Tip labels:
## M.tlxdfmsc, Y.fnblkxm, T.njirxywqec, D.brdqgfkwz, A.cmegtpxbjn, P.gdpbcqm, ...
##
## Rooted; includes branch lengths.
## our character for the colors:
x
##   M.tlxdfmsc    Y.fnblkxm T.njirxywqec  D.brdqgfkwz A.cmegtpxbjn 
## c b b b a
## P.gdpbcqm R.pyjxbva N.brklwqft G.opbufyimec K.atmhkx
## b b a c b
## C.oithaf V.zmplhwtjcs F.ualphzk Z.iwghoftx L.myxblzdj
## c a a b a
## S.cxmvdgeuy W.lpsafhe E.zaxwcnjq I.kavmis B.pxiwkbzum
## a a c c c
## X.zoabkhc Q.wyhvmegitz H.cjrunxewak U.nuxeic O.kcwyhl
## c c b b c
## J.ytcumhoagw
## b
## Levels: a b c
## our colors:
cols<-setNames(RColorBrewer::brewer.pal(length(unique(x)),"Accent"),sort(unique(x)))
cols
##         a         b         c 
## "#7FC97F" "#BEAED4" "#FDC086"
## now our plot:
par(fg="transparent")
plotTree(tree)
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
N<-Ntip(tree)
par(fg="black")
for(i in 1:Ntip(tree)) boxlabel(pp$xx[i],pp$yy[i],tree$tip.label[i],bg=cols[x[i]])

plot of chunk unnamed-chunk-1

Of course, if the colors for the tip data come from a character purported to have evolved on the tree, perhaps we want to also show a stochastic character map of the same character evolving up the tree as follows:

par(fg="transparent")
plot(make.simmap(tree,x),cols,lwd=4)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.5411151 0.5411151 0.0000000
## b 0.5411151 -1.2691823 0.7280672
## c 0.0000000 0.7280672 -0.7280672
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
par(fg="black")
for(i in 1:Ntip(tree)) boxlabel(pp$xx[i],pp$yy[i],tree$tip.label[i],bg=cols[x[i]])
add.simmap.legend(colors=cols,prompt=F,x=5.7,y=26,fsize=2,shape="circle")

plot of chunk unnamed-chunk-2

Or maybe we want to show posterior probabilities at nodes of the tree from a set of stochastic maps:

trees<-make.simmap(tree,x,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## a b c
## a -0.5411151 0.5411151 0.0000000
## b 0.5411151 -1.2691823 0.7280672
## c 0.0000000 0.7280672 -0.7280672
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
par(fg="transparent")
plot(trees[[1]],cols,lwd=4)
nodelabels(pie=summary(trees)$ace,piecol=cols,cex=0.8)
par(fg="black")
for(i in 1:Ntip(tree)) boxlabel(pp$xx[i],pp$yy[i],tree$tip.label[i],bg=cols[x[i]])
add.simmap.legend(colors=cols,prompt=F,x=5.7,y=26,fsize=2,shape="circle")

plot of chunk unnamed-chunk-3

That's pretty neat too.

The tree & data were simulated (to have realistic looking tip labels) as follows:

tree<-rtree(26,tip.label=LETTERS)
for(i in 1:Ntip(tree))
tree$tip.label[i]<-paste(tree$tip.label[i],".",paste(sample(letters,sample(6:10,1)),
collapse=""),sep="")
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3,dimnames=list(letters[1:3],letters[1:3]))
x<-as.factor(sim.history(tree,Q)$states)

Adding a geological legend to a fan-style tree using concentric circles

$
0
0

Today an R-sig-phylo subscriber asked:

“Does anyone know of a function to plot a geologic time scale as a series of concentric circles on a circularly plotted tree?”

Here is a demo using solid, rather than semi-transparent (as in geo.legend colors. This avoids the problem of having to plot 'donuts' rather than filled circles. The former scenario I will consider later.

Note that in the following tree - though a genuine empirical phylogeny - has been arbitrarily rescaled to have a total depth of 100 my for illustrative purposes only!

library(phytools)
## Loading required package: ape
## Loading required package: maps
library(plotrix)
tree
## 
## Phylogenetic tree with 82 tips and 81 internal nodes.
##
## Tip labels:
## Anolis_ahli, Anolis_allogus, Anolis_rubribarbus, Anolis_imias, Anolis_sagrei, Anolis_bremeri, ...
##
## Rooted; includes branch lengths.
plotTree(tree,ftype="off",ylim=c(-0.2*Ntip(tree),Ntip(tree)),lwd=1,
xlim=c(max(nodeHeights(tree)),0),direction="leftwards")
obj<-geo.legend() ## this is just to get the colors

plot of chunk unnamed-chunk-1

r<-max(obj$leg[,1])-obj$leg[,2]

plotTree(tree,type="fan",fsize=0.7,lwd=1,ftype="i")
for(i in 1:nrow(obj$leg)){
color<-paste(strsplit(obj$colors[i],"")[[1]][1:7],collapse="")
draw.circle(0,0,radius=r[i],col=color,border="transparent")
}
par(fg="transparent")
plotTree(tree,type="fan",add=TRUE,fsize=0.7,lwd=1,ftype="i")
par(fg="black")

add.simmap.legend(colors=sapply(obj$colors[rownames(obj$leg)],
function(x) paste(strsplit(x,"")[[1]][1:7],collapse="")),
prompt=FALSE,x=0.95*par()$usr[1],y=0.7*par()$usr[3])

plot of chunk unnamed-chunk-1

Neat.

Function to simulate variable-rate Mk model mapped onto the tree

$
0
0

A little while ago I posteda function to fit multiple Mk models to different painted 'regimes' on the tree.

This method differs from others such as the 'covarion' model of Penny et al. (2001) or the related hidden rates model of Beaulieu et al. (2013) in that it treats the regimes as 'known' - rather than integrating over uncertainty in their painting. Thus, the approach is more appropriate for comparing among a limited number of a priori hypotheses - such as a difference in rate between clades or across time periods (such as geological eras), rather than for testing for a relationship between two or more discrete traits or for trying to identify changes in the rate of character evolution for a single discrete trait, absent an a priori hypothesis regarding how the rate may have changed.

I'm trying to write a short note describing this method, but first I need a simulatorfor the model. In a prior post, I simulated under a variable-rate model merely by stretching the branch lengths of the tree. Unfortunately, this only works if Q2 is a scalar multiple of Q1. For more complex scenarios in which the various Qi we'd like to simulate are not proportional we'll need a tool that is a tiny bit more sophisticated.

Using the phytools"simmap" object class to encode our paintings, the way I decided to do this was first by converting our "simmap" object to an ordinary "phylo" object with singleton nodes instead of mappings along edges. This way, we can only have one regime per edge. Then I proceeded across all the edges of the tree with singletons and for each edge computed P=expm(Qi×t) in which Qi indicates the correct transition matrix for that edge, t is the length of the edge, expm is the matrix exponential, and P is the matrix of transition probabilities. Next, I proceeded to carry out a single preorder tree traversal in which every node is visited, a character state is assigned using the state at the parent node and the computed matrix of transition probabilities, P, and then the algorithm proceeds to the daughters of the current node until all nodes have been visited.

Here is what that code looks like:

sim.multiMk<-function(tree,Q,anc=NULL,...){
ss<-rownames(Q[[1]])
tt<-map.to.singleton(reorder(tree))
P<-vector(mode="list",length=nrow(tt$edge))
for(i in 1:nrow(tt$edge))
P[[i]]<-expm(Q[[names(tt$edge.length)[i]]]*tt$edge.length[i])
if(is.null(anc)) anc<-sample(ss,1)
STATES<-matrix(NA,nrow(tt$edge),2)
root<-Ntip(tt)+1
STATES[which(tt$edge[,1]==root),1]<-anc
for(i in 1:nrow(tt$edge)){
new<-ss[which(rmultinom(1,1,P[[i]][STATES[i,1],])[,1]==1)]
STATES[i,2]<-new
ii<-which(tt$edge[,1]==tt$edge[i,2])
if(length(ii)>0) STATES[ii,1]<-new
}
x<-as.factor(
setNames(sapply(1:Ntip(tt),function(n,S,E) S[which(E==n)],
S=STATES[,2],E=tt$edge[,2]),tt$tip.label))
x
}

For completeness, here's an equivalent simulator for a single Q matrix:

sim.Mk<-function(tree,Q,anc=NULL,...){
ss<-rownames(Q)
tt<-reorder(tree)
P<-vector(mode="list",length=nrow(tt$edge))
for(i in 1:nrow(tt$edge))
P[[i]]<-expm(Q*tt$edge.length[i])
if(is.null(anc)) anc<-sample(ss,1)
STATES<-matrix(NA,nrow(tt$edge),2)
root<-Ntip(tt)+1
STATES[which(tt$edge[,1]==root),1]<-anc
for(i in 1:nrow(tt$edge)){
new<-ss[which(rmultinom(1,1,P[[i]][STATES[i,1],])[,1]==1)]
STATES[i,2]<-new
ii<-which(tt$edge[,1]==tt$edge[i,2])
if(length(ii)>0) STATES[ii,1]<-new
}
x<-as.factor(
setNames(sapply(1:Ntip(tt),function(n,S,E) S[which(E==n)],
S=STATES[,2],E=tt$edge[,2]),tt$tip.label))
x
}

For fun, let's imagine the following regimes:

library(phytools)
colors<-setNames(c("blue","red"),0:1)
plot(tree,lwd=1,colors=colors,ftype="off")

plot of chunk unnamed-chunk-3

& then let's simulate two characters, x& y on this tree in which x evolved via a variable-rate process, whereas y evolved under a single set of transition rates:

Qx<-setNames(
list(matrix(c(-0.5,0.5,0.5,-0.5),2,2,dimnames=list(letters[1:2],
letters[1:2])),
matrix(c(-5,5,5,-5),2,2,dimnames=list(letters[1:2],
letters[1:2]))),0:1)
Qx
## $`0`
## a b
## a -0.5 0.5
## b 0.5 -0.5
##
## $`1`
## a b
## a -5 5
## b 5 -5
Qy<-matrix(c(-1,1,1,-1),2,2,dimnames=list(letters[1:2],letters[1:2]))
Qy
##    a  b
## a -1 1
## b 1 -1
x<-sim.multiMk(tree,Qx)
plot(tree,colors,lwd=1,ftype="off",type="fan",mar=c(1.1,1.1,4.1,1.1))
par(fg="transparent")
tiplabels(pie=to.matrix(x,letters[1:2]),
piecol=RColorBrewer::brewer.pal(n=3,"PRGn")[c(1,3)],
cex=0.3)
par(fg="black")
title(main="Discrete trait simulated via a variable-rate process")

plot of chunk unnamed-chunk-4

y<-sim.Mk(tree,Qy)
plot(tree,colors,lwd=1,ftype="off",type="fan",mar=c(1.1,1.1,4.1,1.1))
par(fg="transparent")
tiplabels(pie=to.matrix(y,letters[1:2]),
piecol=RColorBrewer::brewer.pal(n=3,"PRGn")[c(1,3)],
cex=0.3)
par(fg="black")
title(main="Discrete trait simulated via a constant-rate process")

plot of chunk unnamed-chunk-4

Finally, we can fit both the constant- and variable-rate models to each data vector:

fit1.x<-fitMk(tree,x)
fit1.x
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b
## a -1.140655 1.140655
## b 1.140655 -1.140655
##
## Fitted (or set) value of pi:
## a b
## 0.5 0.5
##
## Log-likelihood: -102.305606
##
## Optimization method used was "nlminb"
fitmulti.x<-fitmultiMk(tree,x)
fitmulti.x
## Object of class "fitmultiMk".
##
## Fitted value of Q[0]:
## a b
## a -0.641288 0.641288
## b 0.641288 -0.641288
##
## Fitted value of Q[1]:
## a b
## a -5.843021 5.843021
## b 5.843021 -5.843021
##
## Fitted (or set) value of pi:
## a b
## 0.5 0.5
##
## Log-likelihood: -95.203038
##
## Optimization method used was "nlminb"
fit1.y<-fitMk(tree,y)
fit1.y
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b
## a -1.079919 1.079919
## b 1.079919 -1.079919
##
## Fitted (or set) value of pi:
## a b
## 0.5 0.5
##
## Log-likelihood: -101.832729
##
## Optimization method used was "nlminb"
fitmulti.y<-fitmultiMk(tree,y)
fitmulti.y
## Object of class "fitmultiMk".
##
## Fitted value of Q[0]:
## a b
## a -1.141888 1.141888
## b 1.141888 -1.141888
##
## Fitted value of Q[1]:
## a b
## a -0.967775 0.967775
## b 0.967775 -0.967775
##
## Fitted (or set) value of pi:
## a b
## 0.5 0.5
##
## Log-likelihood: -101.770944
##
## Optimization method used was "nlminb"

and even compare between models:

library(lmtest)
LR.x<-suppressWarnings(lrtest(fit1.x,fitmulti.x))
LR.x
## Likelihood ratio test
##
## Model 1: fit1.x
## Model 2: fitmulti.x
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 1 -102.306
## 2 2 -95.203 1 14.205 0.0001639 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LR.y<-suppressWarnings(lrtest(fit1.y,fitmulti.y))
LR.y
## Likelihood ratio test
##
## Model 1: fit1.y
## Model 2: fitmulti.y
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 1 -101.83
## 2 2 -101.77 1 0.1236 0.7252

Neat.

Another technique for including a time axis in a circular fan-style tree

$
0
0

In a prior post I showed how to add a time axis to a plotted fan tree using the plotTree / plot.simmap argument part.

Though this works well, our open angle is between part× 360o and 360o, rather than centered around 0o / 3600. To do this, we should ignore the argument part& instead combine the poorly documented arguments tips and maxY. These arguments exist to allow the user to supply a custom tip height spacing/order for a square phylogram; however, we used in a fan-style tree the effect is to transform these values into the circular space.

Here's what I mean:

library(phytools)
library(plotrix)
tree
## 
## Phylogenetic tree with 400 tips and 399 internal nodes.
##
## Tip labels:
## t307, t308, t181, t128, t129, t110, ...
##
## Rooted; includes branch lengths.
plotTree(tree,type="fan",tips=seq(5,380,by=375/399),maxY=400,
ftype="off",lwd=1)
T<-max(nodeHeights(tree))
tick.spacing<-10
min.tick<-10
obj<-axis(1,pos=0,at=seq(T,min.tick,by=-tick.spacing),cex.axis=0.5,
labels=FALSE)

for(i in 1:length(obj)){
a1<-0
a2<-2*pi
draw.arc(0,0,radius=obj[i],a1,a2,lwd=1,
col=make.transparent("blue",0.1))
}

axis(1,pos=0,at=seq(T,min.tick,by=-tick.spacing),cex.axis=0.5,
labels=FALSE)
text(obj,rep(-5,length(obj)),T-obj,cex=0.6)
text(mean(obj),-10,"time (mybp)",cex=0.8)

plot of chunk unnamed-chunk-1

Neat.

Note that in theory we could do this with the ape function axisPhylo, although this function seems to be somewhat unpredictable as to whether or not it works as expected. For instance:

plotTree(tree,type="fan",tips=seq(15,385,by=370/399),maxY=400,
ftype="off",lwd=1)
axisPhylo()

plot of chunk unnamed-chunk-2


Macroevolution summer 2018 workshop in Mexico City

$
0
0

(Spanish version follows.)

We are pleased to announce a new graduate-level intensive short course on the use of R for phylogenetic comparative analysis and downstream implementation in macroevolutionary studies. The course will be four days in length and free of charge, and will take place at the Universidad Nacional Autónoma de México in Mexico City from the 26th to the 29th of June, 2018. This course is partially funded by the National Science Foundation, with additional support from the University of Massachusetts Boston and the Universidad Nacional Autónoma de México. There are a number of full stipends available to cover the cost of travel and lodging for qualified students and post-docs. Applicants are welcome from any country; however, we expect that most admitted students will come from Mexico, Central America, the Caribbean, and from other countries within Latin America. Accepted students from further afield may be offered only partial funding for their travel expenses.

Topics covered will include: an introduction to the R scientific computing environment, tree manipulation, independent contrasts and phylogenetic generalized least squares, ancestral state reconstruction, models of character evolution, diversification analysis, and visualization methods for phylogenies and comparative data. Course instructors will include Dr. Liam Revell (University of Massachusetts Boston), Dr. Luke Harmon (University of Idaho), Dr. Michael Alfaro (UCLA), and Dr. Alejandro Gonzalez-Voyer (Universidad Nacional Autónoma de México), along with the possibility of other co-instructors yet to be announced. Dr. Gonzalez-Voyer will also be in charge of coordinating the course.

Instruction in the course will be primarily in English; however some of the instructors and TAs of the course are competent or fluent in Spanish and English. Discussion, exercises, and activities will be conducted in both languages.

To apply for the course, please submit your CV along with a short (1 page) description of your research interests, background, and reasons for taking the course. Admission is competitive, and preference will go towards students with background in phylogenetics and a compelling motivation for taking the course. In your application please indicate your preferred travel airport, if appropriate. Applications should be submitted by email to mexico@phytools.org by March 1st, 2018. Applications may be written in English or Spanish; however all students must have a basic working knowledge of scientific English. Questions can be directed to Drs. Liam Revell (liam.revell@umb.edu) or Alejandro Gonzalez-Voyer (alejandro.gonzalez@iecologia.unam.mx).


Curso 2018 en macroevolución y uso de métodos filogenéticos comparativos en R

Nos complace anunciar un nuevo curso intensivo, con modalidad de taller, destinado a estudiantes graduados / de posgrado, acerca del uso de métodos filogenéticos comparativos en R. Estos métodos tienen diversas aplicaciones en estudios macroevolutivos. El curso será gratuito, tendrá una duración de cuatro días, y se dictará en la Universidad Nacional Autónoma de México, Ciudad Universitaria, Ciudad de México entre los días 26 a 29 de junio de 2018. Este curso estará parcialmente financiado por la National Science Foundation (Estados Unidos), y contará con el apoyo adicional de la University of Massachusetts Boston y de la Universidad Nacional Autónoma de México. El financiamiento cubriría los costos de los pasajes de avión y del alojamiento de los alumnos que sean aceptados en el curso, si bien la totalidad de la cobertura podría estar sujeta a cambios, dependiendo de la localización geográfica de los postulantes seleccionados.

El curso se encuentra destinado a estudiantes avanzados, estudiantes de maestría o doctorado en ciencias biológicas o carreras afines, investigadores y profesionales interesados en la temática. Recibiremos solicitudes de cualquier país; sin embargo anticipamos que los postulantes mexicanos, centroamericanos, caribeños y de otros países latinoamericanos constituirán la mayoría de los estudiantes admitidos al programa. Los estudiantes provenientes de países más lejanos que resulten elegidos tendrán la posibilidad de recibir únicamente apoyo parcial para costear sus gastos del viaje.

Los temas que serán discutidos en el curso incluyen: una introducción al ambiente computacional de R, manipulación de árboles filogenéticos, mínimos cuadrados generalizados en un contexto filogenético, reconstrucción de estados ancestrales, modelos evolutivos de rasgos, análisis de diversificación filogenética, y visualización de filogenias y datos comparativos, entre otros. El curso estará a cargo de los instructores Dr. Liam Revell (University of Massachusetts Boston), Dr. Luke Harmon (University of Idaho), Dr. Michael Alfaro (University of California, Los Angeles) y Dr. Alejandro Gonzalez-Voyer (UNAM) contándose con la posible participación de instructores adicionales. El Dr. Alejandro Gonzalez-Voyer también será el coordinador de este curso.

El curso será dictado principalmente en inglés; sin embargo, algunos de los instructores y ayudantes de enseñanza del curso hablan español fluido. Las discusiones, los ejercicios, y las actividades del curso se harán en español e inglés.

Los interesados en solicitar la admisión deberán enviar su currículum vitae y una descripción corta (1 página) de sus intereses científicos, experiencia, y razones por las cuales quieren tomar el curso. El proceso de admisión será competitivo, y se dará preferencia a estudiantes con conocimientos de filogenética y que estén desarrollando investigaciones relacionadas a los temas del curso. Se espera que todos los estudiantes tengan un nivel básico de inglés científico. En la solicitud debe indicarse el aeropuerto de viaje preferido (si aplica). Las solicitudes pueden estar escritas en inglés o en español y deben ser enviadas por email a mexico@phytools.org antes del 1 marzo de 2018. Preguntas adicionales pueden ser dirigidas al Dr. Liam Revell (liam.revell@umb.edu) o al Dr. Alejandro Gonzalez-Voyer (alejandro.gonzalez@iecologia.unam.mx).

How to fit a λ tree transformation for discrete character data with uncertainty on the phylogeny

$
0
0

The following is a small demo of how to fit a λ tree transformation to a discrete character with uncertainty.

This is possible to do in cases without uncertainty using the argument value transform="lambda" in the geiger function fitDiscrete. In the following code I'll use fitMkfrom phytools, which permits uncertainty in the tip states, and then one of R's built-in optimizers to find the ML value of the tree transformation parameter, λ.

The phylogeny in the following example comes from Gamble et al. (2014), and the data from a study on the evolutionary correlates of urban tolerance by my current graduate student (but soon to be postdoc) Kristin Winchell.

First, let's get our data:

library(phytools)
tree<-read.tree("lizard-tree.tre")
tree
## 
## Phylogenetic tree with 219 tips and 218 internal nodes.
##
## Tip labels:
## grahami, conspersus, garmani, opalinus, valencienni, lineatopus, ...
## Node labels:
## 1, 0.999900039984006, 0.999800079968013, 1, 0.289984006397441, 0.880347860855658, ...
##
## Rooted; includes branch lengths.
X<-read.csv("tip-states.csv",row.names=1)
## this is what our data looks like
head(X)
##         urban tolerant avoid
## acutus 0.27 0.73 0.00
## aeneus 0.52 0.48 0.00
## agueroi 0.33 0.33 0.33
## ahli 0.00 0.24 0.76
## alayoni 0.00 0.41 0.59
## alfaroi 0.33 0.33 0.33
library(geiger)
obj<-name.check(tree,X)
obj
## $tree_not_data
## [1] "aequatorialis" "agassizi" "altae"
## [4] "anatoloros" "annectens" "anoriensis"
## [7] "aquaticus" "auratus" "bicaorum"
## [10] "biporcatus" "bitectus" "bonairensis"
## [13] "breslini" "calimae" "capito"
## [16] "carolinensis" "carpenteri" "casildae"
## [19] "chloris" "chocorum" "chrysolepis"
## [22] "crassulus" "cupreus" "danieli"
## [25] "euskalerriari" "festae" "fitchi"
## [28] "fraseri" "frenatus" "fuscoauratus"
## [31] "garridoi" "gemmosus" "heterodermus"
## [34] "huilae" "humilis" "inderanae"
## [37] "insignis" "intermedius" "isthmicus"
## [40] "jacare" "koopmani" "L_carinatus"
## [43] "laeviventris" "lemurinus" "limifrons"
## [46] "lineatus" "lionotus" "longiceps"
## [49] "loveridgei" "maculigula" "meridionalis"
## [52] "microtus" "neblininus" "nebuloides"
## [55] "nebulosus" "nicefori" "nitens"
## [58] "onca" "ortonii" "oscelloscapularis"
## [61] "oxylophus" "P_acutirostris_1" "P_acutirostris_2"
## [64] "pachypus" "pandoensis" "peraccae"
## [67] "podocarpus" "poecilopus" "polylepis"
## [70] "polyrhachis" "princeps" "punctatus"
## [73] "purpurgularis" "quercorum" "rejectus"
## [76] "sericeus" "sminthus" "sp_nov_1"
## [79] "sp_nov_2" "sp_nov_3" "tigrinus"
## [82] "trachyderma" "tranquillus" "transversalis"
## [85] "tropidogaster" "tropidonotus" "uniformis"
## [88] "utilensis" "vanzolinii" "ventrimaculatus"
## [91] "woodi" "zeus"
##
## $data_not_tree
## [1] "agueroi" "fairchildi" "litoralis" "roosevelti" "terraealtae"
tree<-drop.tip(tree,obj$tree_not_data)
X<-X[tree$tip.label,]
name.check(tree,X)
## [1] "OK"
X<-as.matrix(X)
plotTree(tree,type="fan",lwd=1,ftype="off")
cols<-setNames(c("grey","white","darkgreen"),colnames(X))
tiplabels(pie=X,cex=0.5,piecol=cols)
add.simmap.legend(colors=cols,prompt=FALSE,shape="circle",
x=-81,y=81)

plot of chunk unnamed-chunk-1

Now let's fit an ordered model the regular way:

ordered<-matrix(c(0,1,0,
2,0,1,
0,2,0),3,3,byrow=TRUE,
dimnames=list(colnames(X),colnames(X)))
ordered
##          urban tolerant avoid
## urban 0 1 0
## tolerant 2 0 1
## avoid 0 2 0
fit<-fitMk(tree,X,model=ordered)

Now let's fit our λ model:

## our likelihood function
lk.lambda<-function(lambda,tree,x,...)
-logLik(fitMk(phytools:::lambdaTree(tree,lambda),
x,...))
opt<-optimize(lk.lambda,c(0,phytools:::maxLambda(tree)),tree=tree,
x=X,model=ordered)
opt
## $minimum
## [1] 0.8745017
##
## $objective
## [1] 128.1338
## attr(,"df")
## [1] 2

This tells us that the ML value of λ is:

lam<-opt$minimum
lam
## [1] 0.8745017

Let's visualize the likelihood surface just to make sure we got it right:

lambda<-seq(0,1,by=0.001)
lik<--sapply(lambda,lk.lambda,tree=tree,x=X,model=ordered)
plot(lambda,lik,type="l",xlab=expression(lambda),ylab="log(likelihood)",
lwd=2,col="darkgrey")
abline(v=lam,lty="dashed")
text(x=lam,y=-1.005*opt$objective,expression(paste("ML(",lambda,")")),
pos=4)

plot of chunk unnamed-chunk-6

Wow. The super curious thing about this surface is its irregular appearance which we do not generally see for λ with continuous characters. Neat.

Now that we're confident that we have the ML solution, we can even do many of the ordinary things we might do with our fitted model. For instance, let's compare it to a λ = 0 model. One interpretation of this might be as a test for significant 'phylogenetic signal':

fit.lambda<-fitMk(phytools:::lambdaTree(tree,lam),X,model=ordered)
fit.h0<-fitMk(phytools:::lambdaTree(tree,0),X,model=ordered)
LR<--2*(logLik(fit.h0)-logLik(fit.lambda))
P.chisq<-pchisq(LR,df=1,lower.tail=FALSE)
P.chisq
## [1] 4.120873e-06
## attr(,"df")
## [1] 2

This tells us that - at least by this measure - there is phylogenetic signal: our data are more probable on our structured tree than on a star tree.

That's it.

Update to Mk simulators for multiple simulations

$
0
0

I just pushedupdates for the relatively newphytoolsfunctions sim.Mk and sim.multiMk that permit the functions to simulate more than one character at a time.

Though previously this would have been trivial to script using (for instance) a single replicate call, the advantage gained by adding it to the function directly is that the different transition matrices for each edge (or edge segment, in the case of fit.multiMk) only need to be computed once by the function instead of separately for each simulation.

Here is an example of the the speed-up:

library(phytools)
packageVersion("phytools")
## [1] '0.6.55'
tree<-pbtree(n=100,scale=2)
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3,byrow=T,
dimnames=list(0:2,0:2))
Q
##    0  1  2
## 0 -1 1 0
## 1 1 -2 1
## 2 0 1 -1
## old way:
system.time(X<-replicate(100,sim.Mk(tree,Q)))
##    user  system elapsed 
## 3.14 0.00 3.14
## new way:
system.time(X<-sim.Mk(tree,Q,nsim=100))
##    user  system elapsed 
## 0.32 0.00 0.31

Much faster.

For fun, let's take our second set of simulations and fit an ordered, single-rate Mk model to each of them. (This is the model we used for simulation.)

ordered<-matrix(c(0,1,0,1,0,1,0,1,0),3,3)
fits<-apply(X,2,fitMk,tree=tree,model=ordered)
q<-sapply(fits,function(x) x$rates)
mean(q)
## [1] 1.037991

This should be around 1.0 - the value we used for simulation.

Now let's do the same for a heterogeneous rate process:

Q<-setNames(list(
matrix(c(-0.5,0.5,0,0.5,-1,0.5,0,0.5,-0.5),3,3,dimnames=list(0:2,0:2)),
matrix(c(-2,2,0,2,-4,2,0,2,-2),3,3,dimnames=list(0:2,0:2))),c("a","b"))
Q
## $a
## 0 1 2
## 0 -0.5 0.5 0.0
## 1 0.5 -1.0 0.5
## 2 0.0 0.5 -0.5
##
## $b
## 0 1 2
## 0 -2 2 0
## 1 2 -4 2
## 2 0 2 -2
tree<-paintSubTree(tree,108,"b","a")
cols<-setNames(c("blue","red"),c("a","b"))
plot(tree,cols,ftype="off")

plot of chunk unnamed-chunk-5

X<-sim.multiMk(tree,Q,nsim=100)
fits<-apply(X,2,fitmultiMk,tree=tree,model=ordered)
q<-t(sapply(fits,function(x) x$rates))
colMeans(q)
## [1] 0.5427954 2.1405881

Again, hopefully these are similar to 0.5 and 2.0.

Fitting continuous character models when some or all internal nodes are known

$
0
0

I'm working on some code to fit continuous character models in which some or all of the internal (node) states are node.

Here is the function that can (so far) fit only a simple Brownian model:

fitInternal<-function(tree,x,a,model="BM"){
root<-as.character(Ntip(tree)+1)
if(root%in%names(a)){
ii<-which(names(a)==root)
a0<-a[ii]
a<-a[-ii]
} else a0<-NULL
if(model=="BM"){
if(is.null(a0)){
new.root<-sample(names(a),1)
a0<-a[new.root]
a<-a[-which(names(a)==new.root)]
new.tree<-root(tree,node=as.numeric(new.root))
M<-matchNodes(tree,new.tree,"distances")
new.nodes<-sapply(as.numeric(names(a)), function(n,M)
as.character(M[which(M[,1]==n),2]),M=M)
names(a)<-new.nodes
tree<-new.tree
}
nn<-c(names(x),names(a))
V<-vcvPhylo(tree,TRUE)[nn,nn]
likBM<-function(sig2,V,y,a0)
-as.numeric(-t(y-a0)%*%solve(sig2*V)%*%(y-a0)/2-
nrow(V)*log(2*pi)/2-determinant(sig2*V)$modulus[1]/2)
fit<-optimize(likBM,c(0,10),V=V,y=c(x,a),a0=a0)
obj<-list(sig2=fit$minimum,logL=-fit$objective)
}
obj
}

One of the things that I realized is that since Brownian motion is reversible, as long as I had one internal node value, I could just move the root to that node and then we'd no longer have to estimate the root node state as a second parameter in the Brownian model. Obviously, this won't work for models without this property, such as the 'early-burst' model.

Let's see how this very simple function works.

First, I'll start by simulating a tree & some data:

library(phytools)
tree<-pbtree(n=100)
x<-fastBM(tree,sig2=0.5,internal=TRUE)
a<-sample(x[Ntip(tree)+1:tree$Nnode],20) ## 20 internal nodes chosen at random
x<-x[1:Ntip(tree)] ## our tip data only

Now let's fit our model:

fit<-fitInternal(tree,x,a)
fit
## $sig2
## [1] 0.4366435
##
## $logL
## [1] -98.52515

As I've documented in the past, another way to fit this model is by adding terminal edges of zero length to each node for which we have data. Let's do this to cross-check our result:

nodes<-as.numeric(names(a))
tt<-tree
for(i in 1:length(nodes)){
M<-matchNodes(tree,tt,"distances")
node<-M[which(M[,1]==nodes[i]),2]
tt<-bind.tip(tt,nodes[i],edge.length=0,where=node)
}
plotTree(tt,fsize=0.5,lwd=1,color="darkgrey",ftype='i')

plot of chunk unnamed-chunk-4

fitContinuous(multi2di(tt),c(x,a))
## GEIGER-fitted comparative model of continuous data
## fitted 'BM' model parameters:
## sigsq = 0.433003
## z0 = 0.180955
##
## model summary:
## log-likelihood = -98.214025
## AIC = 200.428050
## AICc = 200.530614
## free parameters = 2
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 1.00
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates

These are almost the same - but not quite. It turns out that our first estimate, though estimed using ML, is identical (to a high degree of numerical precision) to the REML estimation of σ2. We can show that using the phytools function brownieREML as follows:

tt<-paintSubTree(multi2di(tt),Ntip(tt)+1,"1")
fitREML<-brownieREML(tt,c(x,a))
fitREML
## REML single-rate model:
## s^2 se k logL
## value 0.4366 0.0566 1 -98.5251
##
## REML multi-rate model:
## s^2(1) se(1) k logL
## value 0.4366 0.0566 1 -98.5251
##
## R thinks it has found the REML solution.
fitREML$sig2.single
## [1] 0.4366413
fit$sig2
## [1] 0.4366435

Nonetheless, the purpose of doing this is to fit more complicated models such as OU or EB, so I will plan to add this later.

Small fix for 'static' 3D phylomorphospace plotting

$
0
0

I recently pusheda tiny update to the phytools function phylomorphospace3d (also used internally by fancyTree(...,type="traitgram3d") for method="static". This method uses the package scatterplot3d to simulate a three-dimensional projection of the tree into morphospace (or, in the case of "traitgram3d", a projection of the tree into a three-dimensional space defined by two morphological traits and time since the root). This update is in response to user reports that with some tree & trait scale combinations the space of the plot can appear extremely stretched or flattened.

Let's see how the method works with some simulated data:

library(phytools)
tree<-pbtree(n=30,scale=100)
vcv<-diag(c(0.01,1.0,10.0))
X<-sim.corrs(tree,vcv)
X
##            [,1]       [,2]       [,3]
## t8 -1.83242085 6.7210134 -8.664145
## t12 -1.15607024 6.9973451 -13.088127
## t24 -1.20569729 9.7727382 -15.425783
## t25 -1.10653325 9.7555249 -16.819261
## t5 -1.48039787 10.4446110 -23.209189
## t27 -1.31163205 19.3352026 -38.736942
## t28 -1.42809690 19.8008715 -35.600357
## t26 -1.40833515 20.4864698 -34.721207
## t23 -1.70902492 20.1792875 -38.754384
## t19 -1.68176620 1.4706946 -36.622222
## t20 -2.24323061 1.5492804 -37.739377
## t10 -1.19887289 3.7941856 -33.382013
## t11 -1.17648353 12.5826137 -19.722484
## t13 0.13900173 -1.1595256 -14.034705
## t21 -0.03889558 7.5848620 -36.576738
## t22 0.37449846 -0.6209873 -30.928140
## t14 0.53322638 4.2027966 -39.983553
## t9 0.24085843 2.3645035 -63.081849
## t29 -1.48428219 4.8378941 -14.046100
## t30 -1.47328983 5.4626160 -14.496791
## t2 -0.88728348 7.3949668 -24.898061
## t1 -1.84599443 5.6916087 -56.213488
## t4 -0.68412064 3.6165038 13.830777
## t15 -1.95272282 -0.3652135 49.609482
## t16 -1.38347887 2.9467029 67.053588
## t6 -0.66320389 -0.4777789 14.183655
## t7 -2.03189993 1.1912653 67.740006
## t17 -1.71885755 4.9178237 40.701543
## t18 -1.32472518 8.6595917 33.794385
## t3 -1.72539854 -1.0492751 31.683496
phylomorphospace3d(tree,X,method="static")

plot of chunk unnamed-chunk-2

Now a 3D traitgram with the 1st & 3rd dimensions:

fancyTree(tree,X=X[,c(1,3)],type="traitgram3d",method="static")
## Warning in traitgram3d(tree, ..., control = control): anc.ML may not have
## converged; consider increasing maxit.

plot of chunk unnamed-chunk-3

Neat.

To get this fix users can update in a clean R session from GitHub as follows:

library(devtools)
install_github("liamrevell/phytools")

Another related method is also in fancyTree:

fancyTree(tree,X=X,type="scattergram")
## Computing multidimensional phylogenetic scatterplot matrix...

plot of chunk unnamed-chunk-5

That's it.

Viewing all 801 articles
Browse latest View live