Tuesday, April 3, 2012

Simulating the Fossil Record: Diversification and Morpho-Taxa

Hello all! Fair warning: this is going to be a really long post.

Recently I submitted a short description of my library, paleotree, as an Applications paper to Methods in Ecology and Evolution and got back some great reviews. However, based on the confusion of my reviewers on some of the points, I clearly flubbed some of the explaining about how the most basic simulation functions work. I've also had similar questions from some users based on some of the concepts I discuss in the documentation. Unfortunately, Applications papers in MEE have rather restrictive word-count limits, so while I think of how to write a clear SHORT description for MEE, here’s an extra-long version which probably won’t see the light of day in a real publication. I'm going to pitch this as if readers aren't familiar with things like 'birth-death models' so bear with me if you are all too familiar with birth-death models.

So, here's the central question: how does paleotree simulate the fossil record and what makes it different from the sort of simulators already out there? Today I'm just going to talk about how diversification simulation in paleotree works, specifically with respect to the different modes of morphological-branching offered by the function simFossilTaxa.

First, let's lay out what models and simulations are.

First, let's talk about models. Models (in my view) are just sets of simplified assumptions we make to describe the world around us; you could make an argument that hypotheses and theories are just models which are testing or have tested repeatedly. All of science, pretty much, is just coming up with models and asking how relatively well they describe our observations. I'd argue that scientists really can't know much more than the relative fit of models to our observations; there is always the potential that we haven't yet considered the One Awesome Model to Explain Everything. (I should jump in here and point out that I'm a weak-instrumentalist when it comes to philosophy of science... I think there is really some true system of how things work but I definitely don't think we can actually describe that system, just approximate it with our feeble little models.) The important thing about working with models is to always recognize the vast full sum of the assumptions and pinpoint which assumptions might make a big difference.

Models stack on top of each other in ways that can make things pretty hard to do this sort of thought-dissection, but its so important to do it! Think about this: when we take the standard arithmetic mean of some values when we want a single value to describe the central tendency of our observations (say, we take the mean of all the kids in a second grade class), we are assuming the underlying data structure is normally-distributed. If that assumption is really wrong, then the mean won't... heh, mean much at all! The mean time you spend waiting in lines probably isn't the value that you experience most often when waiting in lines at the bank, as the distribution of those waiting times is probably exponentially or gamma distributed. However, the mean in that cause could give you valuable information about the rate that bank tellers are processing customers. There's a whole field of statisticians devoted to understand the math of such queues, which are particularly important for designing effective call centers (who knew those were based on science... geesh...).

So, if that what models are, what's a simulation? What do I mean when I say we are interested in 'simulating the fossil record'? Simulations are models of stochastic processes (processes where the result can differ in outcome, as opposed to deterministic processes) which attempt to recreate the steps that create those outcomes. A likelihood equation that describes how likely a given parameter ism given some observed data, is not a simulation. A simulation is generally more like a board game, and in fact does not really need to be computational: Monopoly is a simulation of what its like to be a real estate mogul in Atlantic City, it just makes a bunch of assumptions we probably don't think are realistic about that system. Of course, for really complex processes with many steps, we'd like to automate them and run them many many times very quickly, which is why turning simulations into computer code is so valuable. That said, I find that coding simulations is incredibly analogous to game design, which happens to be my one hobby. Just gotta keep track of where little particles are moving through and simplifying the rules and the steps involved without negatively affecting the outcome.

Okay, so that's Dave's view on Models and Simulations. Obviously, simulating the fossil record is actually a meaningless phrase, so let's go a step further and now what we really want is to simulate the fossil record, specifically, the patterns of taxonomic diversification in the fossil record. For example, maybe we would like to create a little model that can recreate patterns like Sepkoski’s classic curve of taxonomic diversity across the Phanerozoic (the last 500 million years). For example, here's the family-level curve from Raup and Sepkoski, 1982.

If you haven't seen this before, this plot basically juxtaposes the number of marine taxonomic families (y-axis) that Jack found described in the paleontological literature as occurring in a given time interval against time before present (x-axis). That's a bit of a simplification, actually; technically Jack only tallied the first time the family showed up and the last time it was found, as the vast number of lineages do not have continuously sampled temporal ranges in the fossil record.

Note that this is a plot of taxonomic families: Sepkoski's curves were done at the supraspecific (above species) level. There's a lot of reasons for this and some people think this is a pretty big problem with Sepkoski's work, but let's just sidestep that for now. For our intents, we want to make simulated datasets like Sepkoski's, where the units function more or less like single species lineages do.

Obviously, this plot and its ilk have had a huge affect on paleontology as a science and the understanding of the history of life for even non-scientists, as it reveals a pattern of ups and downs (evolutionary radiations and extinction events) with a general trend of increase to the present. There's also a general appearance of flat plateaus, where diversity may follow an equilibrium, maybe diversity-dependent diversification. So, we might infer a lot of things at face value from this curve, but we might want a deeper understand than that. Maybe we'd like to know how the processes that produced this curve work, and for that we would turn to describing those processes as models and simulating them. In other words, simulating the fossil record.

Okay, so, how would we actually make a simulation that creates data like that...?

A good starting point is to talk about how biologists simulate such processes, such as if they want to simulate the phylogenetic relationships among Darwin's finches. Here's an example of such a tree, using the example data in geiger (Harmon et al., 2008; I have heard this is probably not the real phylogeny of Darwin's finches (there's apparently a lot of uncertainty involved), but I don't know anything at all about bird phylogenetics and it makes a nice example).

This is a time-scaled phylogeny, so time proceeds from the left (where the most recent common ancestor is) to the right, where the living descendants are listed by their species names or genus name. Cool!

Notice that time-scaled molecular phylogenies of living animals produces these phylogenies where all the tips are at the same point in time. We call these ultrametric phylogenies, as opposed to the non-ultrametric trees you might expect of extinct fossil taxa, where the tips are at different distances from the roots. Note also that the vertical axis doesn't really mean anything here at all; the tree is only presented as it is so we can easily see the relationships and the tip labels.

Evolutionary biologists have it a bit easier, really, than a paleontologists trying to simulate the full fossil record. In evolutionary biology and paleontology, there are a set of special models which describe population growth that can also be applied to describe a branching process like a phylogeny. In general, you see people use a pure-birth process, where there is some speciation (birth) rate and no extinction (extinction rate = 0), or a birth-death process, where there is speciation and a non-zero extinction rate. Events (speciation or extinction) are modeled as Poisson processes, with exponentially-distributed waiting times between them. Biologists can just simulate a pure-birth or birth-death process (Nee, 2007) until they hit some maximum time or maximum number of co-extant lineages and save the relationships among those taxa. For example, we might get a tree with a hundred taxa like this:

The result is the sort of phylogeny we would expect to observe for the living members of a clade, if we (a) sampled all the taxa alive today and (b) perfectly reconstructed the branching relationships and the timing of the branching events. Do these assumptions matter? Probably, but in general when evolutionary biologists want to test whether a method works, they assume these, maybe relaxing (a).

So, let's take a tree without dropping the extinct taxa. That will be like simulating the fossil record, right? Right? (Note this isn't the same tree as above.)
Note that since this is a simulation, the timescale is arbitrary, just as the number of turns in a board game is only informative relative to how much happens in a turn (the third turn of Diplomacy can be three hours into the game!). We could take a tree like that, count up when speciation and extinction events happen and construct a curve like Jack Sepkoski's above (the resulting curve this particular tree would superficially similar to Jack's curve, actually).

But how good of an approximation is this simulation? Well, its kind-of-but-not-really like the fossil record. Obviously, there are some assumptions here which might make using such output as a simulation an unrealistic way of simulating diversification in the fossil record, just as some of the simplicities of Monopoly would cause us not to think its a realistic simulator of the ups and downs of property values. Let's sort our brains and think about how it differs:

  1. Almost every real dataset from the fossil record is incompletely sampled. The above is perfectly, completely sampled. (Even ole Chuck Darwin would say this is unrealistic! ;) For those of you who don't get the joke, Darwin famously blamed the lack of intermediates in the fossil record on it being incomplete, like a half-burned book with half the pages missing. Probably not a bad allusion.)
  2. The relationships among the taxa is perfectly known, with no uncertainty. This is probably unrealistic, as we rarely can resolve relationships with zero uncertainty in the fossil record. This doesn't really affect things at the moment, though if we're just interested in the diversity curve as long as we still independently know the times of speciation and extinction and we're measuring diversity like Jack did.
  3. Taxa in the fossil record are identified, distinguished and recognized based on morphology; if we accept the above as a 'fossil' record, then every time a branch starts and ends, we say they are morphologically distinct.
  4. Events in the fossil record rarely can be resolved down to a continuous time-scale; rather a discrete time-scale is generally necessary where first and last appearances are only known to occur within bins (discrete intervals) that may last as long as 10 million years. This could have a big impact on our data.
  5. To return to a point previously made, Sepkoski's data was at the familial or generic taxonomic level, not the species level. This obviously could make a big difference on the sort of patterns we pick up; for example, if it was a plot of marine phyla, the curve would flatten after the Cambrian! There are various ways of dealing with this in simulations by having a hierarchical branching model(Patzkowsky, 1993; Foote, 2011) but for now let's ignore it. I think the more phylogenetically-minded paleontologists who will be using paleotree will be interested more in simulating species-level patterns. Creating a quick function for converting data to higher level taxa is a neat idea for a future improvement for paleotree, though!

So, 1 is a big problem, because we'd like to know how incomplete sampling affects things. We know that sampling can make a huge difference in the fossil record based on decades of literature, and as Matt Friedman once told me (and I paraphrase), paleontology is the science of studying a degraded biological record. Dealing with that degradation is central to the art of being paleobiologists, I think. 2 isn't a problem if we just want a diversity curve, so let's ignore that for now. We don't really know whether 3 is a good or a bad assumption and we will just ignore 4 and 5 for the moment. So, maybe if we can just account for sampling we'll have some usable simulations?

Unfortunately, morphological identification and sampling are tangled. Why is this? It is because the basic unit of paleontological study is the morphologically-defined taxon, and it is these units which are sampled or not sampled in time. Consider one of the most basic assumption of most paleobiological data, which is range-through. For those who do not know what range-through is, let's return to how Jack was analyzing his data. He was looking for the first and last appearances of taxonomic families and genera in the systematic literature and considering any time in between those dates as the taxon being present. This is range-through. It's practical a Lyellian law of biochronology: 'If the paleontology community can morphologically recognize a specimen as being of a taxon at time 1 and at time 2, then it is present in all the samples in between.' This is why many paleo datasets are in the form of first and last appearances (this isn't true for the PBDB, though). Obviously, this sort of an assumption could be misled by convergence. At the same time, paleontologists can't identify truly cryptic species in the fossil record: two lineages which are morphologically identical would be defined as a single morphotaxon.

My point here is that no matter we choose here, we need to make some sort of assumption, either implicitly or explicitly, about the morphological distinctiveness of taxa. It is always better to be explicit when it comes to models. There is a variety of ways in which taxa could relate to each other morphologically and change morphology with respect to branching events. Some of this is obviously tied up with puncuated equilibrium, which posited that morphological change only occurred as sudden 'punctuated' shifts at cladogenesis (another term for branching events which produce new species). This idea is still in dispute.

A simple model that is often used in paleontology is budding cladogenesis, where at each branching event there is an ancestral taxon which persists and a single daughter lineage which 'buds' off and is immediately morphologically distinct from its ancestor. Morphological change never happens outside of these budding events. (Foote, 1996, describes these modes, as does Wagner and Erwin (1995) who simply refer to this as 'cladogenesis'. Rohlf et al., 1990, also use these models but refer to it as the 'punctuational' model.) The budding cladogenesis model does not really necessitate punctuated equilibrium: all that would be required is for new species to become morphologically distinct rapidly after speciation, relative to the geological timescale. The lack of change outside of the budding events is generally treated more as a 'nuisance parameter' (a term for an assumption considered to be of small effect in a model) instead of biological reality.

For your reference, here's a little schematic of what this mode of morpho-taxon-branching is supposed to look like, from Foote (1996), along with some other modes I'll introduce in a bit.

Again, like the phylogenies above, the horizontal axis isn't meaningful; it just designates shifts between morphotaxa we would consistently recognize.

For the purpose of approximating diversity count data like what Sepkoski analyzed, we could probably just assume this single model of budding morphological change. Along with a simple model of sampling in the fossil record such as sampleRanges in paleotree (which I'll cover in a future post), we could probably have an acceptably realistic model of those sort of dynamics.

There's a very similar model of morphological change, generally referred to as bifurcating cladogenesis. In that model, each branching event produces two new species which are morphologically distinct from their ancestor, which to an observer looking at data from the fossil record would appear to superficially go extinct at that branching event. We generally call such events 'pseudo-extinction'. Just like in the pure-budding model, there's not morphological shifts at all except at the branching events.

This model, just like the budding model, is probably an adequate model for morphotaxa and simulating continuous-time diversity curves. The datasets will differ from the budding model because the morphological shifts along both daughter lineages mean that taxa cannot persist through speciation events. Remember range-through? No more range-through. If A is our ancestor and B and C are its descendants, the nature of the fossil record would lead us to expect some gap in time between when we last sample A and its pseudo extinction, and a gap in time between the origination of B and C and their first observed appearances in the sampled fossil record. Maybe not a huge bias, but we begin to see how our assumptions about morphology are important to how we simulate even the diversification of lineages.

What if we want to go further? What if we bin our temporal data like Sepkoski did? For either the budding model or the bifurcating model, we would expect some bias because long intervals will appear to have many taxa present which might individually have been short-lived, but we can intuitively expect this bias. If we bin datasets produced under bifurcation, we find a new bias has appeared which wasn't present with the budding cladogenesis datasets. The presence of a pseudo-extinct ancestor in the same interval as both of its descendants will inflate the taxonomic diversity counts and also make the data itself look more volatile, with more apparent speciations and extinction per interval.

Enough to make a difference? Well, it really depends on the rates and what sort of a difference we're interested in. It's different, but whether that matters really comes down to what matters for the question you're asking. My point here is that considering how morphological change is occurring is an important aspect of simulating diversification dynamics in the fossil record. If we want to talk about how we recognize changes in diversity, we should be thinking about is how morphotaxa are changing and being differentiated. Its as important as the assumptions we make about whether there is extinction or not and thus inseparable from any real simulator of diversification in the fossil record.

For this reason, simFossilTaxa, which is a function in paleotree designed to simulate the birth and death of species in the fossil record, explicitly uses a budding cladogenesis model, as described in the documentation for paleotree. The function also allows for varying levels of bifurcating cladogenesis and for anagenesis, when a lineage in the fossil record shifts morphologically, such that it is no longer identified as the same morphotaxon and thus creating apparent pseudo-extinction event and pseudo-speciation event. Thus, one can even simulate datasets where morphological change was occurring under budding, bifurcation and anagenesis.

In terms of what this actually means, it means differences in how the units of the output get broken down. All else held constant, a clade simulated with a positive rate of anagenesis will just have more arbitrary devisions into morphotaxa wit shorter durations. Because it makes many other things easier, the output of the simFossilTaxa is meant to already be in the units of morphotaxa that would be recognize by some imaginary systematicist. Also, as the number of morphotaxa identified across the evolutionary history of a group is often the criteria of sampling used in paleobiological studies, I feel it is important to have the morphological shifts which distinguish morphotaxa interwoven into a diversification model rather than place these shifts secondarily. Otherwise, it would actually be much more complicated to simulate a dataset of '1000 taxa'.

So, yes, simFossilTaxa isn't simulating morphology, but models of morphological change are still necessary to touch upon to describe diversification in the fossil record. That said, the models offered in simFossilTaxa allow for a lot of variability previous to other simulators, but not the full range. What if you don't think that speciation/branching events and morphological change need to be tied at all? (A world without any punc-eq like phenomena or character displacement, basically.) simFossilTaxa is not designed to simulate data under this model. It's on the to-do list, though. In that model, taxa would be 'seen' as their ancestor until some anagenetic shift happened to distinguish them from their descendants.

We would imagine the mean duration of these nascent, cryptic lineages to be something like the reciprocal of the summed instantaneous rates of anagenesis and extinction combined, in other words, they would last as long until either extinction or anagenesis occurred (remember that the reciprocal of the rate in an exponential process is the mean waiting time, and the rate of an exponential process where either event A or event B can happen is the sum of the respective rates; see the really detailed and handy Wikipedia article on the Exponential Distribution). It would be a little more complicated than this, because there's also the chance of these nascent lineages branching off and producing new also-cryptic species. Humbug!

Now, you can approximate the 'pure-anagenesis' model in simFossilTaxa under a particular scenario where you thought anagenesis was happening very frequently (so much so that stasis was very rare). You could just assume new taxa will never be cryptic then and simulate this scenario as bifurcating cladogenesis and increase the rate of anagenesis arbitrary high in simFossilTaxa. Of course, this may violate some assumptions of the basic way simFossilTaxa does things. As in the birth-death models I described above, I treat anagenesis as a Poisson process, a statistical model which assumes events are fairly rare.

You might wonder which of any of these models of morphological change is supported by the fossil record. Well, Folmer Bokma and his lab has done some work with molecular phylogenies and models of trait change which suggests that large portion of morphological change, but not the majority, occurs at branching events. Gene Hunt has found similar things that he has presented at conferences, using fossil data, and he's previously published that stasis seems to be pretty common in the fossil record. The only studies of whether change is generally more by budding or bifurcation are by Wagner and Erwin (1995), who found generally much more probable events of budding cladogenesis compared to bifurcation.

That last finding is a bit shocking, though! Under a model where speciation and morphological shifts are unrelated, if the rate of the later was high enough not to commonly have those un-seen cryptic lineages, you would expect apparent bifurcation to happen sometime to (i.e. have anagenesis occur shortly after in both daughter lineages). So, either (a) the past history of life is rife with unidentified cryptic species or (b) there really is a relationship between branching events and morphological change and it tends to be asymmetrical.

Anyway, to sum up all of the above, modeling and simulating the fossil record is complex and it is important to consider all the caveats and assumptions one is making, as these may have a large effect on the result. In simFossilTaxa, the diversification of lineages, which is generally modeled as non-morphological births and deaths, is tightly interwoven with models of how recognizable morphotaxa start and end in the fossil record. How one morphologically interprets taxa has big implications for how we reconstruct things like history of diversity. It matters even more for our next topic: how do we simulate extracting morphological relationships from the fossil record?

Later, I'll be posting more about how the simulators in paleotree work. The next post will be about how I can take patterns of diversification from simFossilTaxa and pull different approximations of relationships from the taxonomic units within, and I'll discuss a little about a recent result I had. The third and last post will be on incomplete sampling in the fossil record and I'll discuss how I'm implementing some new features in sampleRanges to include a variety of models of sampling in the fossil record.

Edit 04-04-12:
I mention at the start about this post being initiated by working on some revisions, but I just want to be clear: the reviews for my paleotree paper were pretty nice! Thanks, reviewers! I just don't have a venue without a word count, other than my blog, where I can fully get into these issues about how we simulate diversification in the fossil record. And yet, this information needs to be out there.