Quantcast
Channel: Phylogenetic Tools for Comparative Biology
Viewing all 800 articles
Browse latest View live

Picking a tree or set of trees at random from a "multiPhylo" object

$
0
0

A friend recently asked:

"Does anybody know how to, in R, select a single tree at random from a posterior distribution of trees (e.g., generated via MrBayes)?"

This is not too hard. A set of trees read into memory using read.tree or read.nexus is just a list of trees with the class attribute set to "multiPhylo". To pick one at random we can just do:

random.tree<-sample(trees,size=1)[[1]]
We include the index [[1]] to (non-recursively) unlist the object returned by sample.

If we want to sample more than one, say 100 random trees from a posterior sample of 1,000, the procedure is the same. So we can just do:

random.trees<-sample(trees,size=100)

That's it.


Combining results from a phylogenetic regression on multiple trees in a posterior sample

$
0
0

A recent query on the R-sig-phylo email list asked:

"Imagine that I run a PIC analysis on two traits using 1000 post-burn-in trees. What would be the best way to summarize these results? Average p-values across all analyses? Perhaps a specific method to combine the resulting probabilities e.g. Fisher's test?"

A little bit of discussion ensued, but my suggestion read as follows:

...we should probably (begin by) combin(ing) the variance among estimates obtained from different trees in the posterior sample with the sampling variance of any single estimate... (to get the total variance of our parameter estimates for hypothesis testing)

One fairly sensible way to do this is to compute the variance due to phylogenetic uncertainty as the variance among estimates obtained from the trees of the posterior sample; and then to compute the sampling variance as the mean variance of the estimator from each tree; and then add the two variances them. The standard error of our estimate (computed as the mean across trees) is the square-root of this variance. To conduct a hypothesis test on the regression coefficient, then, you would compute the mean across trees and then the ratio of the parameter and its standard error should have a t-distribution with n-2 degrees of freedom for n taxa (not contrasts).

To do this from a practical perspective from trees in 'multiPhylo' object (here "trees") and data in x & y for a simple bivariate regression, we do the following:

# first define the following custom function
ff<-function(tree,x,y){
   pic.x<-pic(x,tree)
   pic.y<-pic(y,tree)
   fit<-lm(pic.y~pic.x-1)
   setNames(c(coef(fit),vcov(fit)),c("beta","var(beta)"))
}
# now apply to all trees in your sample
BB<-t(sapply(trees,ff,x,y))
# total variance in beta estimated by
varBeta<-var(BB[,"beta"])+mean(BB[,"var(beta)"])
t.beta<-mean(BB[,"beta"])/sqrt(varBeta)
P.beta<-2*pt(abs(t.beta),df=length(trees[[1]]$tip)-2, lower.tail=FALSE)

I think that's right.

Now, someone wisely pointed out that this approach is somewhat inelegant in combining Bayesian, maximum likelihood, & frequentist hypothesis testing. I can't argue with this and would agree that a better & more elegant solution would be to simultaneously sample trees with branch lengths & the parameters of our evolutionary model for trait evolution from their joint posterior probability distribution. Presently, this is impractical.

Using make.simmap on a set of trees

$
0
0

A recent commenter asked:

"I wonder if it would be possible to apply make.simmap to an object multiphylo (to deal with phylogenetic uncertainty) and to summarize the outcome on a consensus tree."

Let's take this one bit at a time. First, the task of applying make.simmap, the phytools function for stochastic character mapping, to a set of trees - say a sample from the posterior distribution in a Bayesian analysis.

At present, make.simmap takes a single tree and data vector as input; and can return as many simulated stochastic maps as the user demands. It is possible to iterate over a list of trees and then combine the results into a single object of class "multiPhylo" - but this is a little annoying. This is because make.simmap(...,nsim>1) returns a list of trees; and thus lapply(trees,make.simmap,...,nsim>1) returns a list of lists. Various attempts to first unlist and then relist left me more & more annoyed - but the following hack seems to do the trick:

ff<-function(tree,x){
   zz<-make.simmap(tree,x,nsim=10)
   class(zz)<-NULL
   zz
}

mtrees<-unlist(sapply(trees,ff,x,simplify=FALSE), recursive=FALSE)
class(mtrees)<-"multiPhylo"

I have now added this to the latest version of make.simmap, & also built a new version of phytools (phytools 0.2-36), which can be downloaded & installed from source.

Instead of only taking a single tree as input, this tree can take a list of trees (an object of class "multiPhylo") & will automatically generate nsim stochastic character maps per input tree.

OK, here's a demo of the new version using a set of 15,001 trees from the posterior distribution of a real Bayesian run (thanks Graham Reynolds), and a simulated binary character with states a and b.

> packageVersion("phytools")
[1] ‘0.2.36’
> trees<-read.nexus("posterior.sample.trees")
> trees
15001 phylogenetic trees
> # too many, let's randomly subsample
trees<-sample(trees[5001:15001],100)
> trees
100 phylogenetic trees
> # ok, now generate 10 stochastic maps for each tree
> mtrees<-make.simmap(trees,x,nsim=10,message=FALSE)
> mtrees
1000 phylogenetic trees
> # now let's visualize the variability
> # again, 1000 is too many
> layout(matrix(1:100,10,10,byrow=TRUE))
> cols<-setNames(c("blue","red"),letters[1:2])
> plotSimmap(mtrees[0:99*10+1],cols,pts=F,ftype="off")
Waiting to confirm page change...

Cool. Now let's try describe.simmap:

> XX<-describe.simmap(mtrees)
1000 trees with a mapped discrete character with states:
 a, b

trees have 15.509 changes between states on average

changes are of the following types:
       a,b   b,a
x->y 9.078 6.431

mean total time spent in each state is:
               a           b    total
raw  176.5796569 145.4146415 321.9943
prop   0.5482752   0.4517248   1.0000

The times & state changes computed by describe.simmap will be correct - however the posterior probabilities for ancestral nodes (here, XX$ace) will not because different trees in the posterior sample have different nodes & node numbers.

Nonetheless, cool!

Bug fix for make.simmap with asymmetric substitution model; new version of phytools

$
0
0

Yesterday I received a user report of some problems with make.simmap(...,model="ARD") when it resulted in some of the fitted transition rates being zero. This was a known (to me) issue with make.simmap, and it is because although we can compute the conditional likelihoods with this matrix no problem - when we are trying to draw waiting times from an exponential distribution to map character changes along internal branches, rexp(...,rate=0) won't evaluate. One solution to this would be to return Inf or some arbitrarily large number when the rate is 0. Instead, and for other reasons of computation, I decide to add a small number, tol=1e-08 to off-diagonal position of Q that are 0 in the MLE. (There are also other calculations that make this necessary.)

Fixing this issue turned up another more serious problem and that is that recent versions of make.simmap have been using the transpose of Q in simulating along edges for asymmetric transition models, instead of Q itself (or, alternatively, that it has been calling a row index instead of a column index during an important stage in calculation). I believe that this bug appeared during my recent major re-write of make.simmap. Obviously, this doesn't affect symmetrical models of character change in which Q==t(Q) (such as model="ER" or model="SYM" - the default), but it will affect model="ARD".

Here's a little more specific detail on the error. In the internally called function smap, I had:

Q<-t(Q)
where I should not have; or, alternatively:
p<-expm(Q*tree$edge.length[j])[NN[j,1],]* L[as.character(tree$edge[j,2]),]
instead of:
p<-expm(Q*tree$edge.length[j])[,NN[j,1]]* L[as.character(tree$edge[j,2]),]

Source code for the fixed version of make.simmap is here. In this version, users can also control the value of tol by way of the optional argument, well, tol. tol is only used if any of the off-diagonal elements of Q are less than tol, which has a default value of tol=1e-08, as noted above.

This update to make.simmap is also in a new phytools package build, phytools 0.2-37, which can be installed from source.

Finally, here is a demo in which I simulate with a very low backward rate & then show what the new version of make.simmap does (instead of failing) if that backward transition rate has a MLE of 0. Note that if make.simmap seems to hang - it may be possible to resolve this by increasing tol.

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.37’
> tree<-pbtree(n=200,scale=1)
> Q<-matrix(c(-1,1,0.01,-0.01),2,2)
> rownames(Q)<-colnames(Q)<-letters[1:2]
> tree<-sim.history(tree,Q,anc="a")
> cols<-setNames(c("black","red"),letters[1:2])
> # this is the true character history
> plotSimmap(tree,cols,pts=F,ftype="off")
> mtrees<-make.simmap(tree,tree$states,model="ARD", nsim=100)

Warning: some elements of Q not numerically distinct from 0; setting to 1e-08

make.simmap is sampling character histories conditioned on the transition matrix
Q =
            a           b
a -0.96697341  0.96697341
b  0.00000001 -0.00000001
(estimated using likelihood);
and root node prior probabilities
pi =
  a   b
0.5 0.5

Done.

And a little reality check:

> # true history
> describe.simmap(tree)
1 tree with a mapped discrete character with states:
 a, b

tree has 26 changes between states

changes are of the following types:
  a  b
a 0 26
b 0  0

mean total time spent in each state is:
             a         b    total
raw  26.703708 18.438526 45.14223
prop  0.591546  0.408454  1.00000

> # stochastic maps
> describe.simmap(mtrees,plot=T,show.tip.label=FALSE)
100 trees with a mapped discrete character with states:
 a, b

trees have 25.48 changes between states on average

changes are of the following types:
       a,b b,a
x->y 25.48   0

mean total time spent in each state is:
              a          b    total
raw  26.2457674 18.8964671 45.14223
prop  0.5814016  0.4185984  1.00000

In the new phytools build I've also added the function getStates (which can be used to pull the states at nodes or tips from a tree with a mapped discrete character and is called internally by describe.simmap) to the NAMESPACE so that it can be called by phytools users.

Please don't hesitate to report any bugs or issues with the present version of make.simmap or phytools.

Thanks!

New completely re-written version of phylomorphospace with mapped discrete trait

$
0
0

I just posted a new, completely re-written version of phylmorphospace - the phytools function that does a projection of the tree into a two dimensional morphospace. Since I wrote the original version way back in 2010, and the guts of the function had persisted largely unchanged since that time, this was probably overdue.

There was no major issue with the prior version; however I wanted to add some features and realized that the code could be a lot nicer - so I decided to re-write the function (pretty much) from scratch.

Phylomorphospace plots are pretty easy in principle. We just need to supply tip states & compute or supply all the states at internal nodes. Having done that, we can just plot all the tip & node states in our bivariate morphospace, and then add edges connecting all parent & daughter nodes.

Source code for the updated version of phylomorphospace is here; I have also posted a new build of phytools with these updates (phytools 0.2-38).

The update that inspired the re-write was that I wanted to be able to plot the state of a mapped discrete character along the edges of the tree, à la (for example) this version of phenogram. To do this for a projection of the tree into two dimensions is a little more complicated, because in phenogram the time spent in a mapped state is just plotted on the interval demarcated by the horizontal (i.e., time) axis. In two phenotypic trait dimensions, this is a little more complicated. Here, we have to compute the proportion of time spent in each state on each edge and then color the edge proportionally by those states, accordingly.

OK, here's a quick demo:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.38’
> # first the standard version
> tree<-pbtree(n=20,scale=1)
> X<-fastBM(tree,nsim=2)
> phylomorphospace(tree,X,xlab="X1",ylab="X2")

OK, now for something more interesting let's simulate a discrete character on the tree; and then generate data for two continuous traits in which both the rate & evolutionary correlation differ depending on the mapped discrete character:

> Q<-0.5*matrix(c(-1,1,1,-1),2,2)
> rownames(Q)<-colnames(Q)<-letters[1:2]
> mtree<-sim.history(tree,Q,anc="a")
> # here's our discrete character history on the tree
> plotSimmap(mtree,colors=cols,pts=F)>
> # this is for simulation
> R<-list(matrix(c(0.5,0.45,0.45,0.5),2,2),
+ matrix(c(2,0,0,2),2,2))
> names(R)<-letters[1:2]
> X<-sim.corrs(mtree,R)
> cols<-c("blue","red"); names(cols)<-letters[1:2]
> phylomorphospace(mtree,X,xlab="X1",ylab="X1",colors=cols)

Pretty cool, I guess.... The evolutionary pattern that we simulated - low rate & high evolutionary correlation on the blue branches; & high rate but low evolutionary correlation on the red branches - is pretty evident in the plot.

One little note about plotting tip labels. During the re-write I noticed that I'd used the function textxy from the package 'calibrate' in place of the base graphics function text - but I'd forgotten why. Turns out textxy is a neat function that plots text labels for points with an offset that varies depending on the plot quadrant. This is perfect for a function like phylomorphospace, because it helps push the labels away from other plotted lines & points.

That's it for now.

phytools 0.2-40 on CRAN

$
0
0

A new version of phytools is now available on CRAN. I submitted the last CRAN version of phytools March 20th, so there are not a huge number of updates in the present version - but some of them are important or very cool, so I decided to get the new phytools version on CRAN anyway. Over the new few days, Windows & Mac OS X binaries should be built and then gradually percolate through the CRAN mirror repositories.

Here are some of the updates in this version relative to phytools 0.2-30, the last CRAN phytools version:

1. A new function to get the marginal ancestral state reconstructions of a discrete character using the re-rooting method.

2. A new function to summarize the results of stochastic mapping (1, 2, 3).

3. An update to make.simmap to allow uncertain tip states.

4. A small update to phylosig in the calculation of P-values.

5. An update to make.simmap to allow it to performing mapping on a set of input trees, for instance from the posterior distribution of a Bayesian analysis.

6. An important bug fix in make.simmap for asymmetric substitution models (e.g., model="ARD").

7. Finally, a new, totally rewritten version of phylomorphospace that also allows users to show a mapped discrete character on the tree.

Please report any bugs or issues.

Estimating ancestral states when species values are uncertain or unknown, part II

$
0
0

Last month I described a method whereby stochastic mapping can be used to estimate ancestral states when tip states are uncertain or unknown. An ancillary benefit of this approach (now implemented in the 'phytools' function make.simmap) is that it can also be used to get posterior probabilities on the states for those uncertain or unknown terminal taxa.

Well, the same general tactic can be used to get marginal ancestral state reconstructions using likelihood. (I.e., the empirical Bayes marginal reconstructions of Yang 2006. Yang calls these "empirical Bayes" reconstructions, because they are Bayesian posterior probabilities - but they treat the empirical tree & MLE model of evolution as if they are known without error.)

Marginal ancestral state reconstruction using the re-rooting method of Yang is implemented in the phytools function rerootingMethod. This method works by taking advantage of the fact that the normalized likelihoods at the root node from the pruning algorithm are the same as the (posterior) probabilities of each state at that node (Yang 2006) - that is, assuming our model of evolution is symmetric. That means that we should be able to re-root the tree at each internal node and compute the marginal ancestral states (posterior probabilities) for that node via one post-order tree traversal.

When tip nodes are unknown or uncertain this is treated as a prior probability distribution on the state for the tip. For example, we might specify a totally unknown tip as having an equal prior probability of being in any of the states; or we might know that a tip is in state a or b, but not c, in which case we might specify a flat prior probability distribution but only on a&b, with state c getting a prior probability of 0. We can then use these prior probabilities just as we would the conditional likelihoods of internal nodes during the pruning algorithm.

To get the posterior probability of tip nodes is simple - we just re-root the tree at the tip of interest and then compute the normalized conditional likelihoods for that node. These are are empirical Bayes posterior probabilities for the tip states, given our data & fitted model of evolution.

The code for this new function is here; however if you want to try it out, you might want to get the new non-CRAN build of phytools (phytools 0.2-41) because this function calls some internal functions of the phytools package (which it won't be able to do if you just load the function from source).

Here's a quick demo of how it works:

> require(phytools)
Loading required package: phytools
> # simulate a tree & data
> tree<-pbtree(n=20,scale=1)
> Q<-matrix(c(-1,1,1,-1),2,2)
> rownames(Q)<-colnames(Q)<-letters[1:2]
> x<-sim.history(tree,Q)$states
> plot(tree,no.margin=T,edge.width=2,label.offset=0.02)
> tiplabels(pie=to.matrix(x,letters[1:2])[tree$tip.label,], piecol=c("blue","red"),cex=0.6)
> # ok, get the marginal ASRs without uncertainty
> PP<-rerootingMethod(tree,x)
> nodelabels(pie=PP$marginal.anc,piecol=c("blue","red"), cex=0.6)

OK, now let's pretend that some of our tip states are uncertain/unknown:

> # now let's pretend we have some uncertainty in
> # our tip states
> Pr<-to.matrix(x,letters[1:2])
> Pr["t3",]<-c(0.5,0.5)
> Pr["t12",]<-c(0.5,0.5)
> Pr["t19",]<-c(0.5,0.5)
> QQ<-rerootingMethod(tree,Pr)
> tiplabels(pie=QQ$marginal.anc[tree$tip.label,], piecol=c("blue","red"),cex=0.6)
> nodelabels(pie=QQ$marginal[as.character(1:tree$Nnode +length(tree$tip)),],piecol=c("blue","red"),cex=0.6)

The posterior probabilities for the tip nodes here also seem totally sensible. For example, tip t3 is on the end of a very long branch - so the posterior probability is dominated by the prior. By contrast, tip t19 is on a short branch & nested within a clade in state b (i.e., red); and thus the empirical Bayes posterior probability that t19 is also in state b is very high - as shown in the figure.

Pretty cool!

Small bug fix in make.simmap for multiple input trees

$
0
0

For some reason in a recent version of make.simmap with multiple input trees I had the brilliant idea that it would be smart to print Done. at the end of every set of simulations. The consequence of that self-inflicted wound is a fatal error when multiple input trees are read in! This is because I evaluate the optional argument messageinside a conditional if(class(tree)=="phylo"), but then try to compute if(message) message("Done.")outside that if statement. The result is that, after calling itself recursively many times, at the very endmake.simmap fails when trying to evaluate if(message). Oh no!

I also identified a different, non-fatal bug that affected make.simmap(...,nsim=1) for multiple input trees. This bug came about because for every input tree I called make.simmap(...,nsim) and then non-recursively unlisted the resulting object. The problem is that if nsim=1 then the non-recursive unlisting unlists the tree object. Oops!

Both of these bugs are in the latest CRAN release, which is annoying - but I have posted fixed code for make.simmaphere and an updated build of phytools (phytools 0.2-42).


Some performance tests of make.simmap

$
0
0

The phytools function make.simmap performs stochastic mapping following Bollback (2006); however it does not implement a full hierarchical Bayesian model because stochastic maps are drawn conditional on the most probable value of the transition matrix, Q, found using maximum likelihood (given our data & tree - which are also assumed to be fixed). This is different from what (I understand) is implemented for morphological characters in the program SIMMAP, in which stochastic character histories & transition rates are jointly sampled, with a prior distribution on the parameters of the evolutionary model specified a priori by the user. Conditioning on the MLE(Q) (rather than integrating over the posterior distribution of Q) would seem to make what is done in make.simmap an empirical Bayes (EB) method.

From stochastic mapping we can compute posterior probability distributions on the number transitions between each pair of states; the total number of transitions; and the proportion of or total time spent in each state on the tree. (We can also get posterior probabilities of each state at internal nodes - i.e., ASRs - but these are boring &make.simmap does fine with this, so I won't focus on that here.)

I anticipated that make.simmap would have the following properties:

1. The point estimates of our variables of interest should be fine - i.e., unbiased and as good as the full hierarchical model.

2. The variance around those estimates computed from the posterior sample of stochastically mapped trees should be too small. This is because we have effectively ignored uncertainty in Q by fixing the transition matrix at its most likely value.

Finally, 3. the discrepancy in 2. should decline asymptotically for increasing tree size. This is because the more tips we have in our tree, the more information they will contain about the substitution process. In other words, the influence of ignoring uncertainty in Q should decrease as the information in our data about Q is increased.

These predictions are intuitive, but at least 1. & 2. are also typical of EB generally. Number 3. just seems sensible.

Unfortunately, since stochatic mapping is a computationally intensive Monte Carlo methods, testing these predictions is somewhat time consuming - and what I'll show now is only a very limited test. Basically, I simulated trees & data containing either 100 or 200 tips (and then 40 tips, see below) using pbtree and sim.history. The data are a binary character with states a and b, and the substitution model is symmetric - i.e., Qa,b = Qb,a. I wanted to see if our variables were estimated without (much) bias; and if (1-α)% CIs based on the posterior sample included the observed values (which we know from simulation, of course) the correct fraction of the time (e.g., 95% of the time for α = 0.05).

OK - I won't give all the code for simulation, but here is a figure showing a visualization of a single example result from one simulated 100-taxon tree & dataset. The panels give the transition rates between state, the total changes, or the relatively time spent in state a. (Since this last quantity is expressed as a proportion of the total time, the time spent in b is just 1 minus this.) The vertical dashed line is the value of each variable on the true tree.

This result was chosen at random, believe it or not (actually, it was the last of 100 simulations under the same conditions), but it happens to be a replicate in which make.simmap did really quite well.

Here is a table containing a summary of the first 10 of 100 simulations using 100-taxon trees, just to give the reader a sense of the data being collected. Hopefully the column headers are obvious.

> result[1:10,1:4]
   a,b low.a,b mean.a,b high.a,b
1    8       2    5.820       11
2    7       3    5.750       10
3    9      10   19.155       30
4   15      11   15.660       20
5    5       5    8.780       12
6    4       2    7.030       14
7   13       9   11.940       16
8    8      11   17.420       25
9    5       3    6.810       12
10   7      10   18.650       28
> result[1:10,5:8]
   b,a low.b,a mean.b,a high.b,a
1   16      12   15.070       19
2   10       7   10.410       14
3   21      16   25.510       34
4    5       2    5.940       12
5    5       3    5.875       10
6   17      13   17.265       22
7    8       5    8.295       12
8   18      10   17.490       24
9   12       7   11.005       15
10  18       9   18.150       26
> result[1:10,9:12]
    N low.N mean.N high.N
1  24    15 20.890     28
2  17    11 16.160     22
3  30    32 44.665     60
4  20    15 21.600     29
5  10    11 14.655     20
6  21    16 24.295     32
7  21    14 20.235     27
8  26    25 34.910     45
9  17    13 17.815     25
10 25    27 36.800     48
> result[1:10,13:16]
   time.a low.time.a mean.time.a high.time.a
1   0.229      0.196       0.244       0.310
2   0.296      0.225       0.289       0.361
3   0.335      0.271       0.392       0.541
4   0.762      0.651       0.743       0.805
5   0.364      0.305       0.394       0.467
6   0.182      0.173       0.230       0.322
7   0.485      0.410       0.456       0.505
8   0.320      0.333       0.441       0.549
9   0.201      0.209       0.284       0.397
10  0.260      0.294       0.423       0.565

We can first check to see if the point estimates, obtained by averaging over stochastic maps for each simulation, give good estimates of the generating values for the variables that we are interested in. So, let's take the transitions from b to a as an example:

> plot(RR[,"b,a"],RR[,"mean.b,a"],xlab="true b->a", ylab="mean(b->a)")
> lines(c(0,max(RR[,"b,a"],RR[,"mean.b,a"])), c(0,max(RR[,"b,a"],RR[,"mean.b,a"])),lty="dashed")
We see that our point estimates track the known true number of transitions fairly well - so clearly we're not doing too bad with regard to bias. Let's quantify it across all the variables of interest:
> mean(RR[,"mean.a,b"]-RR[,"a,b"])
[1] 1.97765
> mean(RR[,"mean.b,a"]-RR[,"b,a"])
[1] 1.5159
> mean(RR[,"mean.N"]-RR[,"N"])
[1] 3.49355
> mean(RR[,"mean.time.a"]-RR[,"time.a"])
[1] 0.005261939
There looks to be a slight upward bias in the estimated number of substitutions - but this might just be due to the fact that our posterior sample is truncated at 0. (I.e., we might do better with the mode or median from the posterior sample instead of the arithmetic mean.) The time spent in a (and thus also b) seems to be estimated unbiasedly.

We can also ask, for instance, if the interval defined by [α/2, 1-α/2]% of the posterior sample includes the generating values (i.e., from our simulated tree) (1-α)% of the time. Setting α to 0.05:

> colMeans(on95)
   a,b    b,a      N time.a
  0.76   0.89   0.79   0.82
we see that indeed, and as expected, the variance on our variables is too small. Not way too small - what should be our 95% CI is actually our "76-89% CI" for these trees & data - but too small nonetheless.

Finally, prediction 3. Since I expect that the fact that our CIs are too small is due to fixing Q rather than integrating over uncertainty in Q (as we'd do in the full hierarchical model), I predicted that the variance of our parameters should asymptotically approach the true variance for more & more data. To "test" this, I simulated trees with 200 taxa and repeated the analyses above.

Here is one representative result, as before:

And let's check the average outcome across 100 simulations:
> colMeans(on95)
   a,b    b,a      N time.a
  0.85   0.87   0.77   0.86
To be honest, I was kind of surprised not to have found a larger effect here, so I decided to go in the other direction - and try simulating with quite small trees, say 40-taxa. First, here's a representative set of visualizations of the posterior densities/distributions for our focal variables in one randomly chosen simulation:
And here is our measure of the fraction of results on the (ostensible) 95% CI from the posterior:
> colMeans(on95)
   a,b    b,a      N time.a
  0.83   0.75   0.73   0.74
This result does seem to support premise 3., although, as for when we went from 100 to 200 tips, the effect of going from 100 to 40 is not especially large.

I should also be careful to note that this doesn't mean, of course, that we don't get much better parameter estimation from larger trees with more data - we do. It is just to say that convergence of our [α/2, 1-α/2]% to the true (1-α)% is flatter than I expected as sample size is increased, if it happens at all.

So, in summary - make.simmap does pretty well in point estimation. 95% CIs from the posterior sample will be too small - but not way too small, even for relatively small trees.

What's the way forward from here? Well, it would be nice to compare to SIMMAP, as I'm not aware of this kind of analysis having been done & published for morphological traits. In addition, there are steps that could be taken with make.simmap other than going to the full hierarchical model - for instance, instead of fixing Q at its MLE, I could use MLE(Q) to parameterize a prior probability distribution of the transition matrix. This would still be EB, just of a different flavor.

OK - that's all for now!

New version of tree simulator with taxa & time stops (and both)

$
0
0

This is not what I set out to do when I reopened the function pbtree that I use often, but have barely looked at in two years (that'll come in a future post); however last night & this morning I thought it would be neat to add a time-stop criterion to the existing taxa-stop in the stochastic pure-birth tree simulator in phytools, pbtree. Having done that, I realized it would be straightforward to just use rejection sampling to simulate simultaneously conditioning on both N&t.

Tree simulation is much less complicated than it might seem at first glance. The waiting time between speciation events on a Yule tree are exponentially distributed with the rate parameter, λ = b× m where m is the number of "open" edges. Having drawn a random waiting time, we first add the wait time to all the open edges of the tree; and then we speciate any of the open edges at random. We repeat this procedure until we reach a set number of tips or time.

One way to condition on both the total time & the number of tips is to repeatedly simulate trees under the time-stop criterion until a tree with the correct number of tips is obtained. In theory this could take a long time, even if we use a birth-rate where E[N]=eb×t. This is because the variance on stochastic pure-birth trees is very large. For instance:

> source("pbtree.R")
> # simulate 1000 trees with a time-stop
> trees<-pbtree(b=1,t=log(100)-log(2),nsim=1000)
(I use t=log(100)-log(2) because this is the amount of time we expect will result in 100 tips given b=1, on average.)
> # how many tips in each tree?
> N<-sapply(trees,function(x) length(x$tip.label))
> mean(N)
[1] 100.499
> var(N)
[1] 4946.41
> hist(N,0:17*25,col="grey",xlab="frequency",ylab="number of tips",main=NULL)

Nonetheless, pbtree is sufficiently speedy that it is still possible to simulate using both a taxon & time-stop. Here's a demo:

> system.time(tree<-pbtree(b=1,t=log(100)-log(2),n=100))
simulating with both taxa-stop (n) and time-stop is
performed via rejection sampling & may be slow

  126 trees rejected before finding a tree

   user  system elapsed
   0.61    0.00    0.61
> tree

Phylogenetic tree with 100 tips and 99 internal nodes.

Tip labels:
        t17, t18, t69, t70, t21, t12, ...

Rooted; includes branch lengths.
Cool.

The code for the new version of pbtree is here.

Discrete- & continuous-time phylogeny simulation

$
0
0

Earlier today I posted a new version of the phytools pure-birth tree simulator: pbtree. This version is neat because it allows you to condition on the total number of tips in the tree, the total time, or (via rejection sampling) both.

However, when I started looking at pbtree it was actually because I wanted to add in discrete-time simulations, in addition to the continuous-time tree simulations already available in pbtree. (There is at least some interest in this.) Unlike discrete-time continuous character simulations, such as in the visualization tool bmPlot, simulating trees under a discrete-time model is not the same as simulating every time-step from 0 to the total time of the simulation. Rather, it just means we draw our waiting times between speciation events from a different probability distribution.

For continuous-time simulations we draw waiting times from an exponential distribution in which the rate parameter, λ, is equal to the speciation rate × the number of "open" edges at the last event. For discrete-time simulations, we draw waiting times from the discrete analog - the geometric distribution - which describes the probability that the first success in a series of Bernoulli trials occurs in trial x. Since the outcome of a discrete-time speciation process can be viewed as a Bernoulli trial (i.e., a lineage speciates, or it does not, in every time-step), this is the correct distribution of waiting times under our model.

There is a little wrinkle, here, though. Do you recall that to get the next wait time given m open edges we simulated rexp(n=1,lambda=b*m)? Well, in discrete-time the analogy is imperfect. What we need to do instead is simulate rgeom(n=m,prob=b) and then the wait-time to the next even is the minimum of that. Why? This is because we can think about the m open lineages as m different series of Bernoulli trials, and we want to find the wait time to the next event on any lineage in the tree.

In addition, we have to allow for the possibility that there are multiple speciation events within a single time-step - as long as all take place on different branches. To do this, we need to get the set of minimal rgeom(n=m,prob=b), rather than just a single minimum.

Now that we have a waiting time or set of (equal) waiting times, we just add a new node to any of the "open" edges in the tree, all with equal probability. We do that until we reach n or t. For discrete-time we sometimes have a problem when our stop criterion is the number of tips, not time. This is that sometimes multiple speciation events will occur within a timestep. The current version of the function allows this and spits a warning if the number of tips in the simulated tree are different from the stopping criterion.

The code for this updated version of pbtree is here; however since it uses an internal phytools function, to try it out you should install the latest version of phytools from source (phytools 0.2-43). Here's a quick demo:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.43’
>
> ## this is our speciation prob to get 100 tips after
> ## 1000 time-steps, on average in discrete time
> b<-exp((log(100)-log(2))/1000)-1
>
> # simulate
> trees<-pbtree(b=b,t=1000,type="discrete",nsim=2000)
> N<-sapply(trees,function(x) length(x$tip.label))
> mean(N)
[1] 98.7815
> var(N)
[1] 4848.484
> range(N)
[1] 3 595
> hist(N,0:24*25,col="grey",xlab="frequency",ylab="number of tips",main=NULL)

As before, we can also condition on both N and t; however, since this is done via rejection sampling, it is probably wise (at least for illustrative purposes here) to choose a reasonable combination of b, n, and t. For instance:

> # use our speciation probability from before
> tree<-pbtree(b=b,n=100,t=1000,type="discrete")
simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

  41 trees rejected before finding a tree

> plotTree(tree,ftype="off")

Cool. That's it for now.

New version of tree simulator with death (as well as birth)

$
0
0

As described in a couple of earlier posts (1, 2) I have been doing some work on the function pbtree which does pure-birth (i.e., Yule process) phylogeny simulations.

Well, as the title of this post would suggest, the function can now do birth-death simulation, as well as pure-birth. Remember, pbtree can now condition on total time, the number of tips, or both (by rejection sampling); and can simulate in continuous or discrete time.

Simulating with death (i.e., extinction) is not too hard. For continuous time simulation, whereas before we drew our waiting times from the exponential distribution with λ = b× m for birth-rate b& number of 'open' lineages m, with extinction we draw our wait times from the exponential distribution with λ = (b + d) × m. This is because if the wait time to an event of class A is exponentially distributed with rate λA& the wait time to an event of class B is exponentially distributed with rate λB, and A&B are independent, then the wait times to AorB are exponentially distributed with λ = λA + λB. The probabilities that an event is a birth or a death given that it has occurred are just b/(b+d) and d/(b+d), respectively. This means that our procedure for growing the tree should be (1) draw a random wait time; (2) pick a lineage at random; and (3) and then have it speciate or die with the probabilities given above.

For discrete time the procedure is only slightly different. Now we draw a random wait time from the geometric distribution with p = b + d.** (**Remember that we are actually drawing m independent waiting times & then using the one or more smallest times.) However, since p is a probability (whereas λ was a rate), we are now constrained such that b + d≤ 1.0. This is because a given lineage can speciate, go extinct, or do nothing within a time-step - but it can't do more than one of these. (b + d = 1.0 guarantees that an event of some kind - speciation or extinction - occurs in every lineage in every generation.)

Code for the new version of pbtree is here. There is also a new phytools build (phytools 0.2-44) with these updates, and it can be downloaded & installed from source.

Here's a little demo of the function, using continuous time simulations:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.44’
> # continuous time, taxa stop
> tree<-pbtree(d=0.2,n=50)
> plotTree(tree,fsize=0.8,ftype="i")
> tree<-pbtree(d=0.2,n=50)
> plotTree(tree,fsize=0.8,ftype="i")
> # continuous time, time stop
> tt<-log(50)-log(2); tt
[1] 3.218876
> tree<-pbtree(b=1.2,d=0.2,t=tt)
> plotTree(tree,fsize=0.8,ftype="i")
> max(nodeHeights(tree))
[1] 3.218876
> # retain only extant lineages
> tree<-pbtree(d=0.5,n=100,extant.only=TRUE)
> plotTree(tree,ftype="off")
> ltt(tree,plot=FALSE)$gamma # pull of the recent
[1] 2.298986
> # continuous time, taxa & time stop
> tt # time for simulation
[1] 3.218876
> tree<-pbtree(b=1.2,d=0.2,n=50,t=tt)
simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

  13 trees rejected before finding a tree

> max(nodeHeights(tree))
[1] 3.218876
> length(getExtant(tree))
[1] 50
> plotTree(tree,fsize=0.8,ftype="i")
> # multiple trees, time stop
> # function to get the number of extant tips
> ff<-function(x){
  tol<-1e-12
  if(max(nodeHeights(x))<(tt-tol)) 0
  else length(getExtant(x))
 }
> # low extinction rate
> lowD<-pbtree(b=1.1,d=0.1,t=tt,nsim=1000)
> lowD
1000 phylogenetic trees
> NlD<-sapply(lowD,ff)
> mean(NlD)
[1] 48.283
> var(NlD)
[1] 1360.986
> # high extinction rate
> highD<-pbtree(b=1.5,d=0.5,t=tt,nsim=1000)
> highD
1000 phylogenetic trees
> NhD<-sapply(highD,ff)
> mean(NhD)
[1] 48.681
> var(NhD)
[1] 2321.631
> # distribution of tree sizes
> hist(NlD,0:ceiling(max(c(NlD,NhD))/10)*10,col="grey", xlab="frequency",ylab="number of tips",main=NULL)
> hist(NhD,0:ceiling(max(c(NlD,NhD))/10)*10,col="grey", xlab="frequency",ylab="number of tips",main=NULL)

Pretty cool. The purpose of the final exercise was just to illustrate that although the expected number of lineages at time t can be computed as N = e(b-dt, the variance among simulations goes up with increasing b.

That's it. I've been testing out the new pbtree extensively - but please report if you are able to break it! Thanks!

Simulating trees conditioned on taxa & time with 'direct' sampling

$
0
0

The day before yesterday I posted a new version of pbtree that could simulate first pure-birth& then birth-death phylogenies conditioned on both the total number of tips & the total time via rejection sampling. That is, it simulates trees for total time t and then rejects any phylogeny that doesn't have the correct number of tips, N.

Well, as I was out walking & eating a cookie today, it suddenly occurred to me how we might sample birth-death trees conditioned on both N&t 'directly' (rather than via rejection). It would be by the following two step procedure:

1. Simulate a set of wait-times to total time t. For each wait time, we also need to compute a birth or death event and keep count of the births minus the deaths (m) so that we can draw our exponential waiting times with λ = m× (b + d). Repeat until the number of lineages, m, at t is equal to N.

2. With this set of waiting-times & birth or death events in hand, simulate the tree.

By simulating the waiting-times first, followed by the tree, we know before we start that our tree will be of length t with N taxa. This should reduce computation time dramatically.

I've posted an experimental version of this here. The following is a demo:

> source("pbtree.R")
> tt<-log(100)-log(2)
> system.time(tree1<-pbtree(b=1.2,d=0.2,n=100,t=tt))
simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

  114 trees rejected before finding a tree

   user  system elapsed
   3.88    0.00    3.88
> length(getExtant(tree1))
[1] 100
> max(nodeHeights(tree1))
[1] 3.912023
>
> system.time(tree2<-pbtree(b=1.2,d=0.2,n=100,t=tt, method="direct"))
simulating with both taxa-stop (n) & time-stop (t) using
'direct' sampling. this is experimental
   user  system elapsed
   0.13    0.00    0.12
> length(getExtant(tree2))
[1] 100
> max(nodeHeights(tree2))
[1] 3.912023

We can also do somethings that are just impossible with rejection sampling - for instance, simulating with a very high extinction rate, e.g.:

> system.time(tree3<-pbtree(b=4,d=3,n=100,t=tt, method="direct"))
simulating with both taxa-stop (n) & time-stop (t) using
'direct' sampling. this is experimental
   user  system elapsed
  10.30    0.00   10.31
> tree3

Phylogenetic tree with 365 tips and 364 internal nodes.

Tip labels:
        t1, t2, t3, t12, t53, t56, ...

Rooted; includes branch lengths.
> length(getExtant(tree3))
[1] 100
> max(nodeHeights(tree3))
[1] 3.912023
> plotTree(tree,ftype="off",lwd=1)

This isn't without its limits, of course - but we could never do this via rejection sampling. Neat.

CI on LTT plot from a sample of trees

$
0
0

Today, someone posted the following query to R-sig-phylo:

Does anyone know of a good implementation of an LTT plot that can draw a Confidence Interval or HPD interval from a set of trees? I've seen things like `ltt` in phytools that can draw one line for each tree in the sample. However, this can look a bit messy, and I'd ideally love to just plot the 95% CI or HPD of the ages/lineages in the trees. Has anyone seen anything like this?

Dave Bapst promptly responded that there is a function in his paleotree that can do this. For example:

> require(phytools)
Loading required package: phytools
> require(paleotree)
Loading required package: paleotree
> trees<-pbtree(n=50,t=log(50)-log(2),method="direct", nsim=200,quiet=T)
> multiDiv(trees,(log(50)-log(2))/100,plotLogRich=TRUE)

This is pretty cool.

It occurred to me that there are two different CIs that we might be interested in: the CI(lineages), given a time in the past; or the CI(time) given a number of lineages. The former, CI(n|t), can be read as the (say) 95% CI of the number of lineages alive at time t; whereas the latter, CI(t|n), is the 95% CI on the timing of the nth speciation event.

Even before Dave responded, especially because phytools was mentioned in the query, I'd already started working on this. Here's a link to a function that does this, and also computes the mean (rather than median), and returns the result invisibly. So, for instance:

trees<-pbtree(n=50,t=log(50)-log(2),method="direct",nsim=200,quiet=T)
> # same as paleotree
> XX<-ltt95(trees,log=TRUE)
> # here is the object returned
> XX
            time low(lineages) lineages high(lineages)
 [1,] 0.00000000             1      1.0              1
 [2,] 0.03218876             2      2.0              3
 [3,] 0.06437752             2      2.0              3
 [4,] 0.09656627             2      2.0              3
 [5,] 0.12875503             2      2.0              4
 [6,] 0.16094379             2      2.0              4
 [7,] 0.19313255           ...

Now on time with the mode changed to "mean":

> # now on time + mode="mean"
> ltt95(trees,log=TRUE,method="times",mode="mean")

It also works for trees with varying total length or number of tips (although in the latter case, only for method="lineages". So, for instance:

> treesN<-pbtree(n=50,nsim=200)
> ltt95(treesN,log=TRUE,method="lineages",mode="mean")

Or:
> treesT<-pbtree(t=log(50)-log(2),nsim=200)
> ltt95(treesT,log=TRUE,method="lineages",mode="mean")

Finally, we can set an arbitrary α level. For instance:

> XX<-ltt95(trees,alpha=0.1,mode="mean",log=FALSE)

Basically, you get the idea. Please note that the version of pbtree that I'm using in the above simulations is in a non-CRAN phytools update (phytools 0.2-46).

Critical fix for make.simmap

$
0
0

I just identified a critical error in make.simmap that I introduced when I allowed make.simmap to accept uncertain tip states. The priors on our tip states are treated as conditional likelihoods during node sampling following Bollback (2006) and consequently I just put these probabilities in the 1 through Nth element of the matrix of conditional likelihoods from the pruning algorithm. Unfortunately, in an internally called function that assigns node states I neglected to update:

NN[which(tree$edge[,1]==root),1]<- rstate(L[1,]*pi/sum(L[1,]*pi))
which assumes that the root node is in row 1 of L (as it was, through phytools 0.2-33), to:
NN[which(tree$edge[,1]==root),1]<- rstate(L[as.character(root),]*pi/sum(L[as.character(root),]*pi))

Source code for the fixed version of make.simmap is here, but I will post a new phytools version shortly & this fix will be in the next CRAN release.


New version of reroot

$
0
0

I just posted a new version of the function reroot, which I've moved to the source file utilities.R. The main different between reroot and the 'ape' function root is that reroot allows the user to root the tree along any internal or terminal edge, rather than just at nodes.

The previous version of this function had two problems: (1) it behaved incorrectly along edges arising from the root node of the tree; and (2) it depended on root(...,node=xx) which seems to act weird under some conditions & may have a bug. The new version of this function works totally differently. It still uses root - but with root(...,outgroup="XX") instead of using option node. This seems to fix the problems in the prior version. Instead, it uses the phytools functions splitTree to split the tree at the new root position, and then it reroots the basal subtree (i.e., the subtree rootward of the split point) at the tip, and then attaches the two subtrees together at this new root using the phytools function paste.tree.

Just for interest, here's a quick step-by-step demo of the process:

> require(phytools)
Loading required package: phytools
> # simulate tree
> tree<-rtree(n=10); tree$edge.length<-rep(1,nrow(tree$edge))
> plotTree(tree,node.numbers=T)
> # let's re-root halfway along the branch ending in 18
> node.number<-18
> position=0.5*tree$edge.length[which(tree$edge[,2]==18)]
> # split the tree
> tt<-splitTree(tree,list(node=node.number,bp=position))
> plot(tt,no.margin=TRUE,root.edge=TRUE)

Note that the top & bottom trees have a different scale - I show them for illustration of the split only.

> p<-tt[[1]]; d<-tt[[2]]
> # re-root the rootward subtree
> p<-root(p,outgroup="NA",resolve.root=T)
> # adjust branch lengths so that all the edge length
> # leading to "NA" is moved to the other side of the root
> bb<-which(p$tip.label=="NA")
> ee<-p$edge.length[which(p$edge[,2]==bb)]
> p$edge.length[which(p$edge[,2]==bb)]<-0
> cc<-p$edge[which(p$edge[,2]==bb),1]
> dd<-setdiff(p$edge[which(p$edge[,1]==cc),2],bb)
> p$edge.length[which(p$edge[,2]==dd)]<- p$edge.length[which(p$edge[,2]==dd)]+ee
> plot(p,no.margin=TRUE,root.edge=TRUE)
> plot(d,no.margin=TRUE,root.edge=TRUE)
> # re-attach
> tt<-paste.tree(p,d)
> plotTree(tt)
That's it.

New version of estDiversity with multiple methods

$
0
0

I just posted a new version of the function estDiversity, which also adds a new method for computing the reconstructed diversity at each internal node. This function basically implements the method of Mahler et al. (2010) in which we first reconstructed historical biogeography of Greater Antillean Anolis - and then we estimated the standing 'island-wise' lineage density at each node, integrating over the islands on which the node might have been found.

Our method for doing this is relatively simple. We just reconstructed ancestral states first at each node, and then along each edge at the height of the node; and then we just added the probabilities from the latter & multiplied them by the probabilities of the former. So, for instance, if a given target node had a (empirical Bayesian) posterior probability of being on Puerto Rico of 0.5, and Hispaniola of 0.5; and a hypothetical competing lineage was on Puerto Rico with probability 0.5 & Hispaniola with probability 0.5 (for convenience of illustration), then our point estimate of the number of competing lineages for the first node is:

Pt(PR)×Pc(PR)+Pt(H)×Pc(H)
=0.5 × 0.5 + 0.5 × 0.5
=0.5

in which subscripts t&c are target&competitor, respectively.

This makes perfect sense because the following four scenarios are expected with equal probability: (PRt , PRc), (PRt , Hc), (Ht , PRc), and (Ht , Hc) - and exactly 1/2 of the time our target node arises with one competitor. The same logic extends to multiple competing lineages, probabilities different from 0.5, & more than two islands.

The only identified issue with prior versions of estDiversity is that I had been using scaled conditional likelihoods for Pt instead of marginal reconstructed ancestral states; whereas for Pc I was (properly) using marginal reconstructions. (For more information on the distinction, search my blog.) I've realized that this was not right for some time, but had not heard any interest in the function so I hadn't gotten around to fixing it.

In addition to this, though, the new version of the function also gives the option of using stochastic maps to compute ancestral lineage density directly. It does this just by performing stochastic mapping using make.simmap, and then - at each internal node - counting the number of lineages at the height of the node with the same state as the node. For each stochastic map, each time a new node coexists with a competitor from mapped to the same island, we add 1/nsim. Otherwise, we add nothing.

We should probably actually prefer to this because this will account for the possibility (virtually guaranteed when nodes are close to each other in the tree) that the probabilities at a node and along competing edges are not independent.

Both methods are really slow, unfortunately. They seem to yield similar results:

> d.asr<-estDiversity(tree,x)
Please wait. . . . Warning - this may take a while!
Completed 10 nodes
Completed 20 nodes
Completed 30 nodes
Completed 40 nodes
> d.asr
       48         49         50         51         52
 0.0000000  0.8721167  1.8606261  3.2793402  4.7093087
        53         54         55         56         57
15.0590177  6.5464430  9.5738490 11.7358135  0.2020746
        58         59         60         61         62
 1.0892816  2.0400146  3.0361920  3.6605135  0.4269534
        63         64        ...
 1.2145138        ...
> d.maps<-estDiversity(tree,x,method="simulation")
Please wait. . . . Warning - this may take a while!
Completed 10 nodes
Completed 20 nodes
Completed 30 nodes
Completed 40 nodes
>> d.maps
   48    49    50    51    52    53    54    55    56    57
 0.00  0.99  2.67  3.64  6.28 16.08  9.10 12.02 14.04  0.22
   58    59    60    61    62    63    64    65   ...
 1.03  2.03  3.03  4.20  0.39  1.19  2.19   ...
> plot(d.asr,d.maps)

This new version depends on a number of other updates, so to test it (and please do!), I recommend downloading the newest phytools build (phytools 0.2-47), rather than trying to run the function from source.

Bug fix in new version of estDiversity for method="asr"

$
0
0

After building a new version of the function estDiversity, which estimates the lineage diversity at each internal node using a method based on (but modified from) Mahler et al. (2010), I realized today that there was a bug for estDiversity(...,method="asr"). Basically the lines:

tr<-reroot(tree,node.number=ee[j],position=tt-hh[j])
D[i,]<-D[i,]+apeAce(tr,x[tree$tip.label],model= model)$lik.anc[1,]
should have read:
tr<-reroot(tree,node.number=ee[j],position=tt-hh[j])
D[i,]<-D[i,]+apeAce(tr,x[tr$tip.label],model= model)$lik.anc[1,]
This is now fixed, but I have also modified the internally called function apeAce (a function for computing the conditional likelihoods which is modified from code in 'ape') so that it no longer requires that the phenotypic vector is in the order of tree$tip.label.

Yesterday I also made a little bit of hay over the fact that the two methods - one based on empirical Bayesian marginal ancestral state reconstruction at nodes & along edges in the tree; and the second based on stochastic mapping - could return different results. Well, it turns out I'm full of $h!t. At least in the one empirical example I've been running to test the code (from Luke Mahler) that discrepancy is entirely due to the bug. Here's what I mean:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.48’
> d.asr<-estDiversity(tree,x)
Please wait. . . . Warning - this may take a while!
Completed 10 nodes
Completed 20 nodes
Completed 30 nodes
Completed 40 nodes
> d.sim<-estDiversity(tree,x,method="simulation")
Please wait. . . . Warning - this may take a while!
Completed 10 nodes
Completed 20 nodes
Completed 30 nodes
Completed 40 nodes
> plot(d.asr,d.sim)

This is a little disappointing as I thought that this might have been a nice illustration of the difference between marginal & joint reconstructions - turns out it is simply due to a bug! This distinction may still be important for some datasets - just not this one under the model we have chosen.

Since I was updating this function anyway, I now give user control of the model of evolution for the discrete character to the user; however be warned that for method="asr" only symmetric models are allowed. Non-symmetric models will be changed to method="ER". (This is not true of estDiversity(...,method="simulation") which should work for both symmetric and non-symmetric models.)

The new build of phytools (phytools 0.2-48) containing this updates is here.

Computing the Strahler numbers for nodes on a tree

$
0
0

An R-sig-phylo subscriber asks:

Can anyone suggest an easy way of determining Strahler numbers of nodes on a given phylogenetic tree (see http://en.wikipedia.org/wiki/Strahler_number for details).

This should be possible with a simple post-order tree traversal, which we can do easily by reordering our tree "pruningwise" using reorder.phylo(...,"pruningwise") from the ape package.

I think the following code does this, but I'm going to ask the poster to check this for me:

strahlerNumber<-function(tree,plot=TRUE){
  cw<-reorder(tree,"pruningwise")
  oo<-sapply(tree$edge[,2],function(x,y)
   which(x==y),y=cw$edge[,2])
  SS<-matrix(0,nrow(cw$edge),2)
  SS[cw$edge[,2]<=length(cw$tip.label),2]<-1
  nn<-unique(cw$edge[,1])
  for(i in 1:cw$Nnode){
    jj<-which(cw$edge[,1]==nn[i])
    s<-sort(SS[jj,2],decreasing=TRUE)
    SS[jj,1]<-if(all(sapply(s[2:length(s)],"<",s[1]))) s[1]
     else s[1]+1
    SS[which(cw$edge[,2]==cw$edge[jj[1],1]),2]<-SS[jj[1],1]
  }
  ss<-setNames(c(SS[oo,][1,1],SS[oo,2]),
   c(tree$edge[1,1],tree$edge[,2]))
  ss<-ss[order(as.numeric(names(ss)))]
  names(ss)[1:length(tree$tip.label)]<-tree$tip.label
  if(plot){
    plot(tree,no.margin=TRUE,edge.width=2)
    nodelabels(ss[1:tree$Nnode+length(tree$tip.label)])
  }
  return(ss)
}

E.g.,

> tree<-pbtree(n=12)
> ss<-strahlerNumber(tree,plot=TRUE)
> ss
t11 t12  t3  t2  t5  t6  t4  t9 t10  t7  t8  t1
  1  1  1  1  1  1  1  1  1  1  1  1
 13  14  15  16  17  18  19  20  21  22  23
  3  3  2  2  3  3  2  3  2  2  2

Extracting the set of most inclusive clades with Strahler number i

$
0
0

This is a response to the following question about Strahler numbers of trees:

I have an additional question, if I may: how could I go about extracting clades of a certain order as unique groups? .... I've been trying to extract the groups based solely on the assigned Strahler number for each node, but the lower ranking identical numbers mess with the extraction (i.e create unique clades that are just subsets of a bigger clade of the same order).

This is not too hard. Here is some code that should do this:

sn<-strahlerNumber(tree)
i<-3 # this is the order we want to extract
sn<-sn[sn==i]
# get descendant tip numbers for all clades
ll<-lapply(as.numeric(names(sn)),getDescendants,tree=tree)
# figure out which ones are most inclusive
ff<-function(x,y) !all(sapply(x,"%in%",y))
GG<-sapply(ll,function(x,y) sapply(y,ff,x=x),y=ll)
ii<-which(colSums(GG)==(ncol(GG)-1))
# extract these clades
trees<-lapply(as.numeric(names(sn))[ii],extract.clade,
 phy=tree)
class(trees)<-"multiPhylo"

OK - let's try it out:

> tree<-pbtree(n=30)
> sn<-strahlerNumber(tree,plot=TRUE)
Apply the code from above, then plot with Strahler numbers to verify:
> nplots<-2*ceiling(length(trees)/2)
> layout(matrix(1:nplots,ceiling(nplots/2),2,byrow=TRUE))
> sNN<-lapply(trees,strahlerNumber,plot=TRUE)

Cool.

Viewing all 800 articles
Browse latest View live