
Bounded Brownian motion & the most important phylogenetic comparative methods paper you've probably never read


I’m going to make a fairly bold assertion.

That assertion is as follows. The most important paper in phylogenetic comparative methods you’ve never heard of (at least from the last ten years or so) is Boucher & Démery’s (2016) article on bounded Brownian motion. You might be excused for not knowing about this paper – in nearly 8 years it’s only been cited 32 times.

The paper adds an interesting model to the toolkit of phylogenetic comparative biologists: Brownian motion with reflective bounds. Bounded Brownian motion is an evolutionary scenario that’s easy to conceptualize and trivial to simulate, but (prior to Boucher & Démery’s 2016 article) was effectively impossible to fit to observed data. That, on its own, makes this paper important.

In the abstract of the article, however, the authors mention that they “provide analytical expressions for the likelihood of BBM [bounded Brownian motion] and present a method to calculate the likelihood numerically…” (emphasis mine). To be frank, I didn’t get this at first – but when I finally did, my head just about exploded.

To put it in the simple terms that I myself understand, Boucher & Démery merely observe that the probability (density) of a set of data evolving via Brownian motion and its probability under an ordered, symmetric, continuous-time Markov chain (Mk) model converge as the step size between adjacent discrete states goes to zero – and get quite close to converging so long as the step size is small relative to the range of our observed data.

Being someone who is used to teaching about the fundamental underlying differences between the ways in which we model discrete & continuous traits in phylogenetic comparative biology, I found this claim surprising at first, and it triggered a liberal measure of disbelief. Upon further reflection, however, it is fundamentally no different from the assertion that Brownian motion and a discrete-time random walk converge as the time interval diminishes to zero – something that is rather easy to accept, in fact!
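
If you’d like to convince yourself of this numerically, here’s a minimal simulation sketch (my own, not from the paper) comparing the end state of a simple ±step random walk to its Brownian expectation as the time step shrinks. (The variance matches exactly for any step size by construction – what converges is the shape of the distribution, which we can check against the Brownian, i.e., normal, CDF.)

t_total<-1
sig2<-1
for(dt in c(0.1,0.01,0.001)){
  nsteps<-round(t_total/dt)
  ## each step is +/- sqrt(sig2*dt); the end state is their sum
  x_end<-replicate(10000,
    sum(sample(c(-1,1),nsteps,replace=TRUE)*sqrt(sig2*dt)))
  cat("dt =",dt,": P(x <= 1) =",mean(x_end<=1),
    "vs BM expectation",round(pnorm(1,0,sqrt(sig2*t_total)),3),"\n")
}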

Diving into the weeds only a tiny bit, we can also say that in the limit, the relationship between the transition rate between adjacent states, \(q\), and the Brownian evolutionary rate, \(\sigma^2\), is given by the following expression:

$$ \sigma^2 = 2q \left( \frac{w}{k} \right)^2 $$

In which \(w\) is the width of the interval between the lower and upper limits of our trait (set either well outside the observed range of the trait, if we are approximating unbounded Brownian motion, or at the trait limits, if we are modeling bounded evolution), and \(k\) is the number of discrete trait categories that we’re using to approximate the likelihood. In general (for a given value of \(w\)), as \(k\) is increased we should approach the true likelihood – though not always precisely monotonically!
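
Just to make that concrete, here’s a quick worked example (with numbers of my own choosing):

## rearranging sig2 = 2*q*(w/k)^2 gives the Mk transition rate implied
## by a Brownian rate of sig2 = 0.1 on an interval of width w = 10,
## discretized into k = 200 levels
sig2<-0.1
w<-10
k<-200
q<-sig2/(2*(w/k)^2)
q ## = 20 expected transitions between adjacent states, per unit time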

Astute readers might wonder how this could be true: when we fit a continuous trait model, we’re maximizing a likelihood defined as the probability density of our observed data (say, \(\mathbf{x}\)) given a set of model parameters (say, \(\sigma^2\) and \(x_0\), in the case of Brownian motion); whereas when we fit a discrete trait model, we’re maximizing a likelihood defined as the probability (not probability density) of our data, given our model. Indeed, this confused me as well!

The answer can be seen when we appreciate that the probability of observing a continuous random variable on an interval can be obtained from a density function by integrating over that interval; and that, in the limit as the interval width approaches zero, the integral of a function is merely its width times the function’s height. Conversely, then, the density can be approximated by taking the probability of a bin & dividing by the width of that bin. Using the same terms as above, our log-likelihood is thus approximated as:

$$ \log(L) = \log(L') - N\log\left(\frac{w}{k}\right) $$

In which \(\log(L)\) is our approximation of the log-likelihood density (on our data’s original scale), \(\log(L')\) is the log-likelihood of our discretized continuous data, \(N\) is the number of tips in our tree, and both \(w\) and \(k\) are defined as before. Once again, after transformation, & in the limit as our bin width shrinks (or, equivalently, as \(k \rightarrow \infty\)), this log-likelihood is directly comparable to a log-likelihood density obtained on the original, undiscretized trait data.
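
This probability-to-density relationship is easy to verify outside of any phylogenetic context. Here’s a minimal sketch using a standard normal distribution, in which the probability of a narrow bin, divided by the bin width, converges on the density at the bin’s midpoint:

## probability of a bin / bin width -> density, as the bin narrows
x0<-0.5
for(width in c(1,0.1,0.001)){
  p_bin<-pnorm(x0+width/2)-pnorm(x0-width/2)
  cat("width =",width,": p/width =",round(p_bin/width,6),
    "vs dnorm(0.5) =",round(dnorm(x0),6),"\n")
}

Applying the equivalent correction of \(-\log(w/k)\) at each of the \(N\) tips of the tree is what yields the \(-N\log(w/k)\) term above.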

So, what’s the big deal? We can fit bounded Brownian motion? No! We can fit any model in which the back & forth transitions between character states can be expressed as an ordered, continuous-time Markov chain. This was appreciated by a second very important article that you probably haven’t read: Boucher et al. (2018). (To be fair, this one’s been cited 40 times rather than a mere 32 – so perhaps more people have read it.) Excellent as this article is, I think even it underappreciates the significance & power of this approach. Expect to hear more about this from me later!

Boucher has bounded Brownian motion and other models from his collaborative 2018 article implemented in R software here – but (in case I haven’t made the point sufficiently clear yet) I think this is a very important model for phylogenetic comparative methods, so I’ve also recently added bounded BM (along with some other related functionality, such as ancestral state estimation – more on this later as well) to phytools. (Please check out my code here. It’s remarkably simple!)

Implementation ran into a number of quite interesting computational challenges. For example, optimizing the model requires very large matrix exponentiations – one per edge, every time the likelihood is computed! I was able to speed this up considerably by realizing that I could first compute the set of large matrix exponentials for all the edges of the tree in parallel, using the foreach package, rather than serially during the pruning itself.
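
Schematically, the trick looks something like the following toy sketch (illustrative only – this is not the actual phytools internals, and the tree & rate matrix here are stand-ins built on the spot):

library(ape)
library(expm)
library(foreach)
library(doParallel)
registerDoParallel(cores=2)
## stand-in ingredients: a random tree, plus the tri-diagonal rate matrix
## of an ordered, symmetric Mk chain with k levels (k = 5 here to keep
## things small; in bounded_bm, k is in the hundreds)
tree<-rtree(n=10)
k<-5
q<-1.0
Q<-matrix(0,k,k)
for(i in 1:(k-1)) Q[i,i+1]<-Q[i+1,i]<-q
diag(Q)<--rowSums(Q)
## pre-compute P(t) = expm(Q*t) for all edges at once, in parallel,
## rather than serially inside the pruning recursion
P<-foreach(t_e=tree$edge.length,.packages="expm") %dopar% expm(Q*t_e)
length(P) ## one transition-probability matrix per edge of the tree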

To demonstrate this function (called bounded_bm), let’s first ‘prove’ its equivalence to geiger::fitContinuous with no bounds (or, rather, with bounds much broader than the observed range of the trait). Then we can proceed to demonstrate that we’re able to correctly measure bounded evolution when it occurs. Finally, I’ll apply it to an empirical case: the evolution of maximum length in eels.

library(phytools)
## Loading required package: ape
## Loading required package: maps

(This version of phytools must be obtained from GitHub.)

packageVersion("phytools")
## [1] '2.1.11'

We can start off by simulating unbounded Brownian motion using phytools::fastBM.

For no reason at all, I’ll use a genuine empirical tree: the phytools data object "mammal.tree" which comes originally from Garland et al. (1992).

data(mammal.tree)
x<-fastBM(mammal.tree,sig2=0.1,a=1.5)
head(x)
##  U._maritimus     U._arctos U._americanus     N._narica      P._lotor   M._mephitis 
##     0.5136932     0.5178826    -1.2364921     2.9156133     4.1869694     3.5433668

Now, let’s fit our unbounded model with phytools::bounded_bm.

If we don’t specify a particular value for our bounds, our function automatically sets them to 3 \(\times\) the range of observed values – making this (essentially) an unbounded evolutionary scenario. (This optimization can take a while, with most of the computing power consumed by \(500 \times 500\) matrix exponentiation. Boucher & Démery suggest that using a smaller value of levs – the number of discrete categories of the discretized trait, \(k\) above – such as levs = 200 will normally be sufficient.)

mammal_unbounded<-bounded_bm(mammal.tree,x,lik.func="parallel",
  root="mle",levs=500)
mammal_unbounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 500 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ -10.346509 , 13.065879 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.080119 
##      x0: 1.055324 
## 
## Log-likelihood: -75.734562 
## 
## R thinks it has found the ML solution.

Now let’s compare this result to geiger::fitContinuous. Be forewarned. If you haven’t believed me to this point about the equivalency (in the limit) of continuous-time Brownian motion and our Mk process, your mind may be about to be blown….

geiger::fitContinuous(mammal.tree,x)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.080204
## 	z0 = 1.063045
## 
##  model summary:
## 	log-likelihood = -75.764287
## 	AIC = 155.528573
## 	AICc = 155.789443
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

(Seriously. What?) If you don’t believe it, or think the similarity could be accidental, try different generating parameters, and try increasing or decreasing levs!
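
For instance, something along the following lines should work (a sketch – each fit is fairly slow, and I’m assuming that logLik has a method for "bounded_bm" objects, which seems safe given that AIC works on them, as we’ll see below):

## refit with increasing numbers of levels; the approximated
## log-likelihood should approach fitContinuous's -75.76
## (though, as noted above, not necessarily monotonically)
for(k in c(100,200,400)){
  fit_k<-bounded_bm(mammal.tree,x,lik.func="parallel",root="mle",levs=k)
  cat("levs =",k,": log(L) =",round(logLik(fit_k),4),"\n")
}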

Well, if we could only fit equivalent models to those of fitContinuous, this would be a curiosity. (I’d still be interested, but it might be harder to persuade empiricists to share that enthusiasm.)

Obviously, the whole point of this post is that this property of our discretized Markov model permits us to fit models for which the likelihood function is unknown, or for which computing the likelihood is impractical. Bounded Brownian motion is just one example of such a model.

Conveniently, our previous simulator (phytools::fastBM) can also simulate the bounded case, so let’s test it out.

y<-fastBM(mammal.tree,sig2=0.4,a=1,bounds=c(0,5))
head(y)
##  U._maritimus     U._arctos U._americanus     N._narica      P._lotor   M._mephitis 
##     3.6066062     3.6149849     0.1062355     2.4257928     4.9685050     3.6812999
range(y)
## [1] 0.1062355 4.9685050

Boucher & Démery assert that the ML solution for the bounds of a bounded Brownian process will always be the observed trait range. This makes sense.

In this simulation, however, we happen to know the generating bounds (we set them), so we can fix the bounds where we know them to be – and we should expect our estimates to improve as a result. (I suspect that unknown bounds are rather more common in nature than known ones, but I could be wrong.)

When I fit my model, I’m going to update one other aspect – I’ll set root="nuisance" instead of root="mle". (I’ll explain this choice in another post.)

mammal_bounded<-bounded_bm(mammal.tree,y,lims=c(0,5),
  lik.func="parallel",root="nuisance",levs=200)
mammal_bounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ 0 , 5 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.405921 
##      x0: 2.470428 
## 
## Log-likelihood: -75.125155 
## 
## R thinks it has found the ML solution.

Now let’s compare to unbounded Brownian motion fit using geiger::fitContinuous.

geiger::fitContinuous(mammal.tree,y)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.175411
## 	z0 = 2.430212
## 
##  model summary:
## 	log-likelihood = -94.936936
## 	AIC = 193.873871
## 	AICc = 194.134741
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

We should typically find that our bounded model fits much better, and provides a much better estimate of the generating value of \(\sigma^2\), than does its unbounded counterpart!
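
(If you’d like to see that as a formal comparison, the same AIC trick I use for the eels below should work here as well – first assigning the geiger fit to a variable:)

## a quick sketch of the comparison
mammal_bm<-geiger::fitContinuous(mammal.tree,y)
AIC(mammal_bounded,mammal_bm) ## bounded BM should win handily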

Finally, let’s try applying this to a real empirical dataset, rather than an empirical tree combined with simulated data (as we’ve been doing so far).

For this example, I’ll use a phylogeny and dataset for maximum total length in eels from Collar et al. (2014) – and only because the length of long skinny things, like snakes & eels, always struck me as something that might evolve by a bounded Brownian process. (Don’t @ me eel biologists.)

data(eel.tree)
data(eel.data)
max_TL<-setNames(eel.data$Max_TL_cm,rownames(eel.data))
eel_bounded<-bounded_bm(eel.tree,max_TL,lims=range(max_TL),
  lik.func="parallel",root="mle")
eel_bounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ 15 , 250 ]
## 
## Fitted model parameters:
## 	 sigsq: 188.95333 
##      x0: 15.5875 
## 
## Log-likelihood: -322.74907 
## 
## R thinks it has found the ML solution.

(Here I left levs at its default value of levs = 200 – just to speed along computation!)

eel_bm<-geiger::fitContinuous(eel.tree,max_TL,model="BM")
eel_bm
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 113.836440
## 	z0 = 85.702289
## 
##  model summary:
## 	log-likelihood = -338.935210
## 	AIC = 681.870419
## 	AICc = 682.077316
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

Lastly, let’s compare both of these to a different constrained model: the Ornstein-Uhlenbeck model.

eel_ou<-geiger::fitContinuous(eel.tree,max_TL,model="OU")
eel_ou
## GEIGER-fitted comparative model of continuous data
##  fitted 'OU' model parameters:
## 	alpha = 0.043393
## 	sigsq = 278.039659
## 	z0 = 92.068484
## 
##  model summary:
## 	log-likelihood = -329.253814
## 	AIC = 664.507628
## 	AICc = 664.928681
## 	free parameters = 3
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 2
## 	frequency of best fit = 0.020
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates
AIC(eel_bounded,eel_bm,eel_ou)
##             df      AIC
## eel_bounded  4 653.4981
## eel_bm       2 681.8704
## eel_ou       3 664.5076

This should tell us that – even accounting for the increased parameter complexity of the model – bounded Brownian motion is a substantially better explanation of these data than either of the other two models we’ve considered.
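
We could take this one step further & convert those AIC scores into Akaike model weights using phytools::aic.w, which takes a vector of AIC values:

aic<-AIC(eel_bounded,eel_bm,eel_ou)
aic.w(setNames(aic$AIC,rownames(aic)))
## given the AIC differences above, bounded BM should carry
## essentially all of the weight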

That’s all I really have to say about this, but here is a cool plot illustrating bounded Brownian evolution for when I post about this on Twitter!

data("salamanders")
h<-max(nodeHeights(salamanders))
mt<-make.era.map(salamanders,seq(0,h,length.out=100))
st<-map.to.singleton(mt)
x<-fastBM(st,sig2=0.8,internal=TRUE,bounds=c(-20,20))
par(mar=c(5.1,4.1,1.1,1.1))
phenogram(st,x,ftype="i",fsize=0.7,
  mar=c(5.1,4.1,1.1,1.1),lwd=4,las=1,cex.axis=0.8)
segments(x0=c(0,0),y0=c(-20,20),x1=c(h,h),y1=c(-20,20),
  lwd=2,col=palette()[2])
clip(par()$usr[1],h,min(x),max(x))
grid()
phenogram(st,x,ftype="off",add=TRUE,lwd=2,
  color="white")

(Figure: traitgram of the simulated bounded phenotype on the salamander tree, with the bounds at ±20 overlaid.)

Thanks for reading, y’all!

