Sunday, April 29, 2012
First things first: I released a new version of paleotree, version 1.3! It has a number of new elements, particularly one item I will talk to you about today: cryptic cladogenesis.
In the last missive, I argued that any simulator of "diversification in the fossil record" had to be enmeshed with some assumptions (i.e. model) of how morphologically distinguishable taxonomic units arise, as these are the basic units that we (paleontologists) can identify, relate and measure. These morphologically-deliminated, sometimes temporally-extensive units can represent something very different than equivalent taxa in evolutionary biology (Forey et al., 2004; Ezard et al., 2012).
To clarify some of my explanations last time, it may be helpful to think of two general classes of events. First, one could have 'anagenesis', where a lineage experiences a morphological change that is geologically-sudden, producing a new 'descendant' morphotaxon distinguishable from the previous 'ancestor' which no longer appears. As we generally define and distinguish taxa based on discrete or meristic characters, like the presence or absence of spines, one can think about these events as the change of one or more such characters. If they ever change again, then that would be another 'anagenetic' event and another new morphotaxon would origination, and so on.
(Note that this is different from how most people define anagenesis! Most of the literature using this term is referring to changes in continuous traits, particularly traits which don't (generally) get used to distinguish taxa. I'm only interested in shifts between recognizable morphotaxa, so I'm limiting my usage of anagenesis to describe that and being totally agnostic to how non-systematically-informative traits vary within lineages.)
Now let's think about how branching events, cladogenesis, comes into this. Let's limit ourselves to only bifurcating events, which produce two daughter lineages. We can contextualize the 'bifurcating and budding cladogenesis' of the last post by considering these as all part of a system of whether morphological change must happen in one, both or neither of the daughter lineages. Like so:
The function simFossilTaxa can simulate all of these and any mixture of these processes, within the (generally assumed) constraint that diversification and the morphological shifts occur as Poisson processes. The big major change I had to do to allow for cryptic cladogenesis in paletree 1.3 was a new column which describes which morphotaxon each lineage would be assigned to (due to being functionally identical).
With this new feature, we can do fun things like simulating only under cryptic cladogenesis and anagenesis. This gives us patterns like these, using a particularly relevant example.
Okay, now, this post is supposed to be about how we can turn simulated data from simFossilTaxa into cladograms and phylogenies, using the functions taxa2cladogram and taxa2phylo. Just how does paleotree do that, really?
Well, check out this figure that I totally wish I had room for putting in my MEE submission on paleotree.
So, let me walk you through this. In (a), we have three morphotaxa, related to each other by budding cladogenesis and (b), (c) and (d) are various phylogenetic interpretations of that data.
In particular, (b) is the result of transforming such a dataset into an uncscaled cladogram with taxa2cladogram. This is an unscaled set of nesting relationships (i.e. clades), containing all the clades that could be resolved with morphological data, assuming that shifts in systematic characters can only occur when new morpho-taxa originate. (This is a pretty good assumption: if we see shifts in systematic characters within a lineage, we generally start calling the critters a new name in the fossil record!) The distinctions between morphotaxa are captured in all that information output by simFossilTaxa.
Note that in this case, you get a polytomy. For cases where there is a single ancestor, static in systematic characters and multiple descendants via budding cladogenesis, you get a polytomy, which was originally shown by Smith (1994) and Wagner and Erwin (1995). This is true if you sample two descendants and an ancestor or just three descendants. You can also get it if you have bifurcating cladogenesis and sample ancestors. You will end up with more than two taxa that contain no actual synapomorphies, although in practice this would actually look like either some poorly-supported relationships (on a set of most parsimonious trees) or a polytomy (on a consensus tree).
I've been looking at this issue in great detail lately, with respect to varying how shifts occur under the various models of morphological differentiation we've discussed and with varying rates of sampling in the fossil record. I've decided to write this up as a chapter for my dissertation, so I can't say much at the moment, but the short answer is that it could be a very serious issue: some simulations have have very few resolvable clades at realistic sampling parameters.
Now, (b) and (c) are a little more complicated, representing different ways of translating (a) into time-scaled phylogenies using taxa2phylo. The first thing to understand is that there is no such thing as a single time-scaled tree that will describe the relationships for lineages that span intervals of time. None!
All we can do is talk about the relationships about populations at particular points in time. We might want to do this, for example, if we want to simulate continuous traits evolving on the tree using any of the typically used trait simulators in ape or geiger. We would need to pick a particular 'time' of observation of our simulated morphotaxa in order to even have a time-scaled description of relationships at those dates.
That's what taxa2phylo does. It constructs the time-scaled tree which perfectly describes the set of relationships among for particular points in time within the simulated ranges of taxa. The taxonomic identity of branches is lost, leaving only the historical patterns of branching that get us to our points of interest. 'Ancestral' taxa which have multiple descendants (like taxon A) get chopped up into segments which become separate branches on the resulting output tree.
The figure above shows how different the result can be for different choices of 'observation times'. For (b), the time of interest is the first appearance times of the taxa (I call this the observation times or 'obs times' in the arguments for the function taxa2phylo). For (c), these are the mid-points of the taxon ranges. By default, the observation times used in taxa2phylo are the last appearances times which are not directly figured above but would essentially produce a tree with the branch lengths and branching events equivalent to the simulated dataset, except that taxa which went pseudo-extinct (such as in a bifurcation or anagenesis event) would be attached to the tree as a tip with a zero-length terminal branch.
taxa2phylo should not be used for any purpose but simulation: it doesn't represent anything but a perfect representation of the phylogenetic and temporal relationships. In particular, this is good for simulating datasets in simulators that require a tree (like rTraitCont) but not for testing whether a tree-based analysis works. Using the unscaled partially-unresolved cladograms (from taxa2cladogram) and sampled fossil occurrences, in particular on discrete interval time-scales, will be a more accurate description of the type of data recoverable in the fossil record.
Okay, so that's how I can turn simulated fossil records into trees with paleotree! Next post will be about the aforementioned sampling of the fossil record and how paleotree simulates it!
Tuesday, April 3, 2012
Recently I submitted a short description of my library, paleotree, as an Applications paper to Methods in Ecology and Evolution and got back some great reviews. However, based on the confusion of my reviewers on some of the points, I clearly flubbed some of the explaining about how the most basic simulation functions work. I've also had similar questions from some users based on some of the concepts I discuss in the documentation. Unfortunately, Applications papers in MEE have rather restrictive word-count limits, so while I think of how to write a clear SHORT description for MEE, here’s an extra-long version which probably won’t see the light of day in a real publication. I'm going to pitch this as if readers aren't familiar with things like 'birth-death models' so bear with me if you are all too familiar with birth-death models.
So, here's the central question: how does paleotree simulate the fossil record and what makes it different from the sort of simulators already out there? Today I'm just going to talk about how diversification simulation in paleotree works, specifically with respect to the different modes of morphological-branching offered by the function simFossilTaxa.
First, let's lay out what models and simulations are.
First, let's talk about models. Models (in my view) are just sets of simplified assumptions we make to describe the world around us; you could make an argument that hypotheses and theories are just models which are testing or have tested repeatedly. All of science, pretty much, is just coming up with models and asking how relatively well they describe our observations. I'd argue that scientists really can't know much more than the relative fit of models to our observations; there is always the potential that we haven't yet considered the One Awesome Model to Explain Everything. (I should jump in here and point out that I'm a weak-instrumentalist when it comes to philosophy of science... I think there is really some true system of how things work but I definitely don't think we can actually describe that system, just approximate it with our feeble little models.) The important thing about working with models is to always recognize the vast full sum of the assumptions and pinpoint which assumptions might make a big difference.
Models stack on top of each other in ways that can make things pretty hard to do this sort of thought-dissection, but its so important to do it! Think about this: when we take the standard arithmetic mean of some values when we want a single value to describe the central tendency of our observations (say, we take the mean of all the kids in a second grade class), we are assuming the underlying data structure is normally-distributed. If that assumption is really wrong, then the mean won't... heh, mean much at all! The mean time you spend waiting in lines probably isn't the value that you experience most often when waiting in lines at the bank, as the distribution of those waiting times is probably exponentially or gamma distributed. However, the mean in that cause could give you valuable information about the rate that bank tellers are processing customers. There's a whole field of statisticians devoted to understand the math of such queues, which are particularly important for designing effective call centers (who knew those were based on science... geesh...).
So, if that what models are, what's a simulation? What do I mean when I say we are interested in 'simulating the fossil record'? Simulations are models of stochastic processes (processes where the result can differ in outcome, as opposed to deterministic processes) which attempt to recreate the steps that create those outcomes. A likelihood equation that describes how likely a given parameter ism given some observed data, is not a simulation. A simulation is generally more like a board game, and in fact does not really need to be computational: Monopoly is a simulation of what its like to be a real estate mogul in Atlantic City, it just makes a bunch of assumptions we probably don't think are realistic about that system. Of course, for really complex processes with many steps, we'd like to automate them and run them many many times very quickly, which is why turning simulations into computer code is so valuable. That said, I find that coding simulations is incredibly analogous to game design, which happens to be my one hobby. Just gotta keep track of where little particles are moving through and simplifying the rules and the steps involved without negatively affecting the outcome.
Okay, so that's Dave's view on Models and Simulations. Obviously, simulating the fossil record is actually a meaningless phrase, so let's go a step further and now what we really want is to simulate the fossil record, specifically, the patterns of taxonomic diversification in the fossil record. For example, maybe we would like to create a little model that can recreate patterns like Sepkoski’s classic curve of taxonomic diversity across the Phanerozoic (the last 500 million years). For example, here's the family-level curve from Raup and Sepkoski, 1982.
If you haven't seen this before, this plot basically juxtaposes the number of marine taxonomic families (y-axis) that Jack found described in the paleontological literature as occurring in a given time interval against time before present (x-axis). That's a bit of a simplification, actually; technically Jack only tallied the first time the family showed up and the last time it was found, as the vast number of lineages do not have continuously sampled temporal ranges in the fossil record.
Note that this is a plot of taxonomic families: Sepkoski's curves were done at the supraspecific (above species) level. There's a lot of reasons for this and some people think this is a pretty big problem with Sepkoski's work, but let's just sidestep that for now. For our intents, we want to make simulated datasets like Sepkoski's, where the units function more or less like single species lineages do.
Obviously, this plot and its ilk have had a huge affect on paleontology as a science and the understanding of the history of life for even non-scientists, as it reveals a pattern of ups and downs (evolutionary radiations and extinction events) with a general trend of increase to the present. There's also a general appearance of flat plateaus, where diversity may follow an equilibrium, maybe diversity-dependent diversification. So, we might infer a lot of things at face value from this curve, but we might want a deeper understand than that. Maybe we'd like to know how the processes that produced this curve work, and for that we would turn to describing those processes as models and simulating them. In other words, simulating the fossil record.
Okay, so, how would we actually make a simulation that creates data like that...?
A good starting point is to talk about how biologists simulate such processes, such as if they want to simulate the phylogenetic relationships among Darwin's finches. Here's an example of such a tree, using the example data in geiger (Harmon et al., 2008; I have heard this is probably not the real phylogeny of Darwin's finches (there's apparently a lot of uncertainty involved), but I don't know anything at all about bird phylogenetics and it makes a nice example).
This is a time-scaled phylogeny, so time proceeds from the left (where the most recent common ancestor is) to the right, where the living descendants are listed by their species names or genus name. Cool!
Notice that time-scaled molecular phylogenies of living animals produces these phylogenies where all the tips are at the same point in time. We call these ultrametric phylogenies, as opposed to the non-ultrametric trees you might expect of extinct fossil taxa, where the tips are at different distances from the roots. Note also that the vertical axis doesn't really mean anything here at all; the tree is only presented as it is so we can easily see the relationships and the tip labels.
Evolutionary biologists have it a bit easier, really, than a paleontologists trying to simulate the full fossil record. In evolutionary biology and paleontology, there are a set of special models which describe population growth that can also be applied to describe a branching process like a phylogeny. In general, you see people use a pure-birth process, where there is some speciation (birth) rate and no extinction (extinction rate = 0), or a birth-death process, where there is speciation and a non-zero extinction rate. Events (speciation or extinction) are modeled as Poisson processes, with exponentially-distributed waiting times between them. Biologists can just simulate a pure-birth or birth-death process (Nee, 2007) until they hit some maximum time or maximum number of co-extant lineages and save the relationships among those taxa. For example, we might get a tree with a hundred taxa like this:
The result is the sort of phylogeny we would expect to observe for the living members of a clade, if we (a) sampled all the taxa alive today and (b) perfectly reconstructed the branching relationships and the timing of the branching events. Do these assumptions matter? Probably, but in general when evolutionary biologists want to test whether a method works, they assume these, maybe relaxing (a).
So, let's take a tree without dropping the extinct taxa. That will be like simulating the fossil record, right? Right? (Note this isn't the same tree as above.)
Note that since this is a simulation, the timescale is arbitrary, just as the number of turns in a board game is only informative relative to how much happens in a turn (the third turn of Diplomacy can be three hours into the game!). We could take a tree like that, count up when speciation and extinction events happen and construct a curve like Jack Sepkoski's above (the resulting curve this particular tree would superficially similar to Jack's curve, actually).
But how good of an approximation is this simulation? Well, its kind-of-but-not-really like the fossil record. Obviously, there are some assumptions here which might make using such output as a simulation an unrealistic way of simulating diversification in the fossil record, just as some of the simplicities of Monopoly would cause us not to think its a realistic simulator of the ups and downs of property values. Let's sort our brains and think about how it differs:
- Almost every real dataset from the fossil record is incompletely sampled. The above is perfectly, completely sampled. (Even ole Chuck Darwin would say this is unrealistic! ;) For those of you who don't get the joke, Darwin famously blamed the lack of intermediates in the fossil record on it being incomplete, like a half-burned book with half the pages missing. Probably not a bad allusion.)
- The relationships among the taxa is perfectly known, with no uncertainty. This is probably unrealistic, as we rarely can resolve relationships with zero uncertainty in the fossil record. This doesn't really affect things at the moment, though if we're just interested in the diversity curve as long as we still independently know the times of speciation and extinction and we're measuring diversity like Jack did.
- Taxa in the fossil record are identified, distinguished and recognized based on morphology; if we accept the above as a 'fossil' record, then every time a branch starts and ends, we say they are morphologically distinct.
- Events in the fossil record rarely can be resolved down to a continuous time-scale; rather a discrete time-scale is generally necessary where first and last appearances are only known to occur within bins (discrete intervals) that may last as long as 10 million years. This could have a big impact on our data.
- To return to a point previously made, Sepkoski's data was at the familial or generic taxonomic level, not the species level. This obviously could make a big difference on the sort of patterns we pick up; for example, if it was a plot of marine phyla, the curve would flatten after the Cambrian! There are various ways of dealing with this in simulations by having a hierarchical branching model(Patzkowsky, 1993; Foote, 2011) but for now let's ignore it. I think the more phylogenetically-minded paleontologists who will be using paleotree will be interested more in simulating species-level patterns. Creating a quick function for converting data to higher level taxa is a neat idea for a future improvement for paleotree, though!
So, 1 is a big problem, because we'd like to know how incomplete sampling affects things. We know that sampling can make a huge difference in the fossil record based on decades of literature, and as Matt Friedman once told me (and I paraphrase), paleontology is the science of studying a degraded biological record. Dealing with that degradation is central to the art of being paleobiologists, I think. 2 isn't a problem if we just want a diversity curve, so let's ignore that for now. We don't really know whether 3 is a good or a bad assumption and we will just ignore 4 and 5 for the moment. So, maybe if we can just account for sampling we'll have some usable simulations?
Unfortunately, morphological identification and sampling are tangled. Why is this? It is because the basic unit of paleontological study is the morphologically-defined taxon, and it is these units which are sampled or not sampled in time. Consider one of the most basic assumption of most paleobiological data, which is range-through. For those who do not know what range-through is, let's return to how Jack was analyzing his data. He was looking for the first and last appearances of taxonomic families and genera in the systematic literature and considering any time in between those dates as the taxon being present. This is range-through. It's practical a Lyellian law of biochronology: 'If the paleontology community can morphologically recognize a specimen as being of a taxon at time 1 and at time 2, then it is present in all the samples in between.' This is why many paleo datasets are in the form of first and last appearances (this isn't true for the PBDB, though). Obviously, this sort of an assumption could be misled by convergence. At the same time, paleontologists can't identify truly cryptic species in the fossil record: two lineages which are morphologically identical would be defined as a single morphotaxon.
My point here is that no matter we choose here, we need to make some sort of assumption, either implicitly or explicitly, about the morphological distinctiveness of taxa. It is always better to be explicit when it comes to models. There is a variety of ways in which taxa could relate to each other morphologically and change morphology with respect to branching events. Some of this is obviously tied up with puncuated equilibrium, which posited that morphological change only occurred as sudden 'punctuated' shifts at cladogenesis (another term for branching events which produce new species). This idea is still in dispute.
A simple model that is often used in paleontology is budding cladogenesis, where at each branching event there is an ancestral taxon which persists and a single daughter lineage which 'buds' off and is immediately morphologically distinct from its ancestor. Morphological change never happens outside of these budding events. (Foote, 1996, describes these modes, as does Wagner and Erwin (1995) who simply refer to this as 'cladogenesis'. Rohlf et al., 1990, also use these models but refer to it as the 'punctuational' model.) The budding cladogenesis model does not really necessitate punctuated equilibrium: all that would be required is for new species to become morphologically distinct rapidly after speciation, relative to the geological timescale. The lack of change outside of the budding events is generally treated more as a 'nuisance parameter' (a term for an assumption considered to be of small effect in a model) instead of biological reality.
For your reference, here's a little schematic of what this mode of morpho-taxon-branching is supposed to look like, from Foote (1996), along with some other modes I'll introduce in a bit.
Again, like the phylogenies above, the horizontal axis isn't meaningful; it just designates shifts between morphotaxa we would consistently recognize.
For the purpose of approximating diversity count data like what Sepkoski analyzed, we could probably just assume this single model of budding morphological change. Along with a simple model of sampling in the fossil record such as sampleRanges in paleotree (which I'll cover in a future post), we could probably have an acceptably realistic model of those sort of dynamics.
There's a very similar model of morphological change, generally referred to as bifurcating cladogenesis. In that model, each branching event produces two new species which are morphologically distinct from their ancestor, which to an observer looking at data from the fossil record would appear to superficially go extinct at that branching event. We generally call such events 'pseudo-extinction'. Just like in the pure-budding model, there's not morphological shifts at all except at the branching events.This model, just like the budding model, is probably an adequate model for morphotaxa and simulating continuous-time diversity curves. The datasets will differ from the budding model because the morphological shifts along both daughter lineages mean that taxa cannot persist through speciation events. Remember range-through? No more range-through. If A is our ancestor and B and C are its descendants, the nature of the fossil record would lead us to expect some gap in time between when we last sample A and its pseudo extinction, and a gap in time between the origination of B and C and their first observed appearances in the sampled fossil record. Maybe not a huge bias, but we begin to see how our assumptions about morphology are important to how we simulate even the diversification of lineages.
What if we want to go further? What if we bin our temporal data like Sepkoski did? For either the budding model or the bifurcating model, we would expect some bias because long intervals will appear to have many taxa present which might individually have been short-lived, but we can intuitively expect this bias. If we bin datasets produced under bifurcation, we find a new bias has appeared which wasn't present with the budding cladogenesis datasets. The presence of a pseudo-extinct ancestor in the same interval as both of its descendants will inflate the taxonomic diversity counts and also make the data itself look more volatile, with more apparent speciations and extinction per interval.
Enough to make a difference? Well, it really depends on the rates and what sort of a difference we're interested in. It's different, but whether that matters really comes down to what matters for the question you're asking. My point here is that considering how morphological change is occurring is an important aspect of simulating diversification dynamics in the fossil record. If we want to talk about how we recognize changes in diversity, we should be thinking about is how morphotaxa are changing and being differentiated. Its as important as the assumptions we make about whether there is extinction or not and thus inseparable from any real simulator of diversification in the fossil record.
For this reason, simFossilTaxa, which is a function in paleotree designed to simulate the birth and death of species in the fossil record, explicitly uses a budding cladogenesis model, as described in the documentation for paleotree. The function also allows for varying levels of bifurcating cladogenesis and for anagenesis, when a lineage in the fossil record shifts morphologically, such that it is no longer identified as the same morphotaxon and thus creating apparent pseudo-extinction event and pseudo-speciation event. Thus, one can even simulate datasets where morphological change was occurring under budding, bifurcation and anagenesis.
In terms of what this actually means, it means differences in how the units of the output get broken down. All else held constant, a clade simulated with a positive rate of anagenesis will just have more arbitrary devisions into morphotaxa wit shorter durations. Because it makes many other things easier, the output of the simFossilTaxa is meant to already be in the units of morphotaxa that would be recognize by some imaginary systematicist. Also, as the number of morphotaxa identified across the evolutionary history of a group is often the criteria of sampling used in paleobiological studies, I feel it is important to have the morphological shifts which distinguish morphotaxa interwoven into a diversification model rather than place these shifts secondarily. Otherwise, it would actually be much more complicated to simulate a dataset of '1000 taxa'.
So, yes, simFossilTaxa isn't simulating morphology, but models of morphological change are still necessary to touch upon to describe diversification in the fossil record. That said, the models offered in simFossilTaxa allow for a lot of variability previous to other simulators, but not the full range. What if you don't think that speciation/branching events and morphological change need to be tied at all? (A world without any punc-eq like phenomena or character displacement, basically.) simFossilTaxa is not designed to simulate data under this model. It's on the to-do list, though. In that model, taxa would be 'seen' as their ancestor until some anagenetic shift happened to distinguish them from their descendants.
We would imagine the mean duration of these nascent, cryptic lineages to be something like the reciprocal of the summed instantaneous rates of anagenesis and extinction combined, in other words, they would last as long until either extinction or anagenesis occurred (remember that the reciprocal of the rate in an exponential process is the mean waiting time, and the rate of an exponential process where either event A or event B can happen is the sum of the respective rates; see the really detailed and handy Wikipedia article on the Exponential Distribution). It would be a little more complicated than this, because there's also the chance of these nascent lineages branching off and producing new also-cryptic species. Humbug!
Now, you can approximate the 'pure-anagenesis' model in simFossilTaxa under a particular scenario where you thought anagenesis was happening very frequently (so much so that stasis was very rare). You could just assume new taxa will never be cryptic then and simulate this scenario as bifurcating cladogenesis and increase the rate of anagenesis arbitrary high in simFossilTaxa. Of course, this may violate some assumptions of the basic way simFossilTaxa does things. As in the birth-death models I described above, I treat anagenesis as a Poisson process, a statistical model which assumes events are fairly rare.
You might wonder which of any of these models of morphological change is supported by the fossil record. Well, Folmer Bokma and his lab has done some work with molecular phylogenies and models of trait change which suggests that large portion of morphological change, but not the majority, occurs at branching events. Gene Hunt has found similar things that he has presented at conferences, using fossil data, and he's previously published that stasis seems to be pretty common in the fossil record. The only studies of whether change is generally more by budding or bifurcation are by Wagner and Erwin (1995), who found generally much more probable events of budding cladogenesis compared to bifurcation.
That last finding is a bit shocking, though! Under a model where speciation and morphological shifts are unrelated, if the rate of the later was high enough not to commonly have those un-seen cryptic lineages, you would expect apparent bifurcation to happen sometime to (i.e. have anagenesis occur shortly after in both daughter lineages). So, either (a) the past history of life is rife with unidentified cryptic species or (b) there really is a relationship between branching events and morphological change and it tends to be asymmetrical.
Anyway, to sum up all of the above, modeling and simulating the fossil record is complex and it is important to consider all the caveats and assumptions one is making, as these may have a large effect on the result. In simFossilTaxa, the diversification of lineages, which is generally modeled as non-morphological births and deaths, is tightly interwoven with models of how recognizable morphotaxa start and end in the fossil record. How one morphologically interprets taxa has big implications for how we reconstruct things like history of diversity. It matters even more for our next topic: how do we simulate extracting morphological relationships from the fossil record?
I mention at the start about this post being initiated by working on some revisions, but I just want to be clear: the reviews for my paleotree paper were pretty nice! Thanks, reviewers! I just don't have a venue without a word count, other than my blog, where I can fully get into these issues about how we simulate diversification in the fossil record. And yet, this information needs to be out there.