Sunday, April 29, 2012

Simulating the Fossil Record: Cryptic Species, Phylogenies and Resolvable Clades

Okay, so let's pick up where we left off with a small tangent.

First things first: I released a new version of paleotree, version 1.3! It has a number of new elements, particularly one item I will talk to you about today: cryptic cladogenesis.

In the last missive, I argued that any simulator of "diversification in the fossil record" had to be enmeshed with some assumptions (i.e., a model) of how morphologically distinguishable taxonomic units arise, as these are the basic units that we (paleontologists) can identify, relate and measure. These morphologically delimited, sometimes temporally extensive units can represent something very different from equivalent taxa in evolutionary biology (Forey et al., 2004; Ezard et al., 2012).

To clarify some of my explanations last time, it may be helpful to think of two general classes of events. First, one could have 'anagenesis', where a lineage experiences a geologically sudden morphological change, producing a new 'descendant' morphotaxon distinguishable from the previous 'ancestor', which no longer appears. As we generally define and distinguish taxa based on discrete or meristic characters, like the presence or absence of spines, one can think about these events as the change of one or more such characters. If they ever change again, then that would be another 'anagenetic' event and another new morphotaxon would originate, and so on.

(Note that this is different from how most people define anagenesis! Most of the literature using this term is referring to changes in continuous traits, particularly traits which don't (generally) get used to distinguish taxa. I'm only interested in shifts between recognizable morphotaxa, so I'm limiting my usage of anagenesis to describe that and being totally agnostic to how non-systematically-informative traits vary within lineages.)

Now let's think about how branching events, or cladogenesis, come into this. Let's limit ourselves to only bifurcating events, which produce two daughter lineages. We can contextualize the 'bifurcating and budding cladogenesis' of the last post by considering these as all part of a system of whether morphological change must happen in one, both or neither of the daughter lineages. Like so:

In cryptic cladogenesis, both daughters continue to be diagnosed as the same morphotaxon as the ancestor. In budding cladogenesis, one of the daughter lineages becomes a new morphotaxon while the other experiences no morphological shifts. In bifurcating cladogenesis, two new morphotaxa arise and the ancestor no longer exists. We can describe any model of how distinguishable morphotaxa arise in the fossil record as some mixture of these four event classes (anagenesis, cryptic clado., budding clado. and bifurcating clado.), which even more simply can be described as 'shifts within branches and shifts at branching events'. If we can describe these processes in a model, then we can include most previously described models of morphological differentiation, at least the ones described for processes on geologic timescales.

The function simFossilTaxa can simulate all of these and any mixture of these processes, within the (generally assumed) constraint that diversification and the morphological shifts occur as Poisson processes. The big change I had to make to allow for cryptic cladogenesis in paleotree 1.3 was adding a new column which describes which morphotaxon each lineage would be assigned to (due to being functionally identical).
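To make the "mixture of Poisson processes" idea concrete, here is a rough Python sketch of how the next event on a single lineage could be drawn. This is not paleotree's code: the event names, the rates, the inclusion of extinction as a fifth competing process, and the helper functions are all assumptions made up for illustration.

```python
import random

# Hypothetical per-lineage rates (events per unit time), for illustration
# only; these are not paleotree's argument names or defaults.
RATES = {
    "anagenesis": 0.05,
    "cryptic": 0.02,
    "budding": 0.06,
    "bifurcating": 0.02,
    "extinction": 0.10,
}


def next_event(rates, rng):
    """Draw the waiting time and the class of the next event on one lineage.

    For independent Poisson processes, the waiting time to the first event
    of any kind is exponential with the summed rate, and the event class is
    picked with probability proportional to its own rate.
    """
    total = sum(rates.values())
    wait = rng.expovariate(total)
    kinds = list(rates)
    kind = rng.choices(kinds, weights=[rates[k] for k in kinds])[0]
    return wait, kind


def new_morphotaxa(kind):
    """Number of new morphotaxa that an event of this class produces."""
    counts = {
        "anagenesis": 1,   # the lineage shifts to a new morphotaxon
        "cryptic": 0,      # both daughters keep the ancestral identity
        "budding": 1,      # one daughter shifts, the other does not
        "bifurcating": 2,  # both daughters shift, the ancestor ends
        "extinction": 0,
    }
    return counts[kind]


rng = random.Random(1)
events = [next_event(RATES, rng) for _ in range(5)]
```

Setting the budding and bifurcating rates to zero in this sketch would leave only cryptic cladogenesis and anagenesis as sources of change.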

With this new feature, we can do fun things like simulating only under cryptic cladogenesis and anagenesis. This gives us patterns like these, using a particularly relevant example.

Okay, now, this post is supposed to be about how we can turn simulated data from simFossilTaxa into cladograms and phylogenies, using the functions taxa2cladogram and taxa2phylo. Just how does paleotree do that, really?

Well, check out this figure that I totally wish I had room for putting in my MEE submission on paleotree.

So, let me walk you through this. In (a), we have three morphotaxa, related to each other by budding cladogenesis, and (b), (c) and (d) are various phylogenetic interpretations of that data.

In particular, (b) is the result of transforming such a dataset into an unscaled cladogram with taxa2cladogram. This is an unscaled set of nesting relationships (i.e. clades), containing all the clades that could be resolved with morphological data, assuming that shifts in systematic characters can only occur when new morphotaxa originate. (This is a pretty good assumption: if we see shifts in systematic characters within a lineage, we generally start calling the critters a new name in the fossil record!) The distinctions between morphotaxa are captured in all that information output by simFossilTaxa.

Note that in this case, you get a polytomy. This happens whenever there is a single ancestor, static in systematic characters, with multiple descendants via budding cladogenesis, as was originally shown by Smith (1994) and Wagner and Erwin (1995). This is true if you sample two descendants and an ancestor or just three descendants. You can also get it if you have bifurcating cladogenesis and sample ancestors. You will end up with more than two taxa with no actual synapomorphies to resolve the relationships among them, although in practice this would actually look like either some poorly-supported relationships (on a set of most parsimonious trees) or a polytomy (on a consensus tree).
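The polytomy argument can be sketched in a few lines of Python. This is a toy illustration, not how taxa2cladogram is actually written. It assumes no cryptic taxa and applies the assumption stated above: character shifts happen only when a morphotaxon originates, so each morphotaxon defines a clade made up of itself and all of its descendants. The taxon table and function names are hypothetical.

```python
# Hypothetical taxon table: each morphotaxon maps to its ancestor
# (None for the root). Taxon A buds off both B and C.
BUDDING = {"A": None, "B": "A", "C": "A"}


def descendants(taxon, ancestor):
    """Return every direct and indirect descendant of a morphotaxon."""
    found = set()
    for other, anc in ancestor.items():
        if anc == taxon:
            found.add(other)
            found |= descendants(other, ancestor)
    return found


def resolvable_clades(ancestor):
    """Return the clades (two or more taxa) supported by character shifts.

    Each morphotaxon's origination is assumed to carry the only character
    changes, so the clade it supports is itself plus all its descendants.
    """
    clades = set()
    for taxon in ancestor:
        clade = frozenset({taxon} | descendants(taxon, ancestor))
        if len(clade) > 1:
            clades.add(clade)
    return clades


# Only the all-inclusive clade {A, B, C} is supported: a polytomy.
polytomy = resolvable_clades(BUDDING)

# If B also buds off a descendant D, the clade {B, D} becomes resolvable.
nested = resolvable_clades({"A": None, "B": "A", "C": "A", "D": "B"})
```

In the first case nothing groups any two of the three taxa to the exclusion of the third, which is the polytomy described above.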

I've been looking at this issue in great detail lately, with respect to varying how shifts occur under the various models of morphological differentiation we've discussed and with varying rates of sampling in the fossil record. I've decided to write this up as a chapter for my dissertation, so I can't say much at the moment, but the short answer is that it could be a very serious issue: some simulations have very few resolvable clades at realistic sampling parameters.

Now, (c) and (d) are a little more complicated, representing different ways of translating (a) into time-scaled phylogenies using taxa2phylo. The first thing to understand is that there is no such thing as a single time-scaled tree that will describe the relationships for lineages that span intervals of time. None!

All we can do is talk about the relationships among populations at particular points in time. We might want to do this, for example, if we want to simulate continuous traits evolving on the tree using any of the typically used trait simulators in ape or geiger. We would need to pick a particular 'time' of observation for each of our simulated morphotaxa in order to even have a time-scaled description of relationships at those dates.

That's what taxa2phylo does. It constructs the time-scaled tree which perfectly describes the set of relationships among lineages at particular points in time within the simulated ranges of taxa. The taxonomic identity of branches is lost, leaving only the historical patterns of branching that get us to our points of interest. 'Ancestral' taxa which have multiple descendants (like taxon A) get chopped up into segments which become separate branches on the resulting output tree.

The figure above shows how different the result can be for different choices of 'observation times' (or 'obs times', as I call them in the arguments for the function taxa2phylo). For (c), the times of interest are the first appearance times of the taxa. For (d), they are the mid-points of the taxon ranges. By default, the observation times used in taxa2phylo are the last appearance times. These are not figured above, but they would essentially produce a tree with branch lengths and branching events equivalent to the simulated dataset, except that taxa which went pseudo-extinct (such as in a bifurcation or anagenesis event) would be attached to the tree as tips with zero-length terminal branches.
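As a small illustration of what choosing observation times means, the Python sketch below computes the three choices mentioned here (first appearance, range mid-point, last appearance) from a table of taxon ranges. The ranges, the function name and the `when` argument are invented for the example and are not taxa2phylo's interface. Times are assumed to be measured in time units before the present, so a first appearance is a larger number than a last appearance.

```python
# Hypothetical taxon ranges as (first appearance, last appearance),
# measured in time units before present.
RANGES = {"A": (10.0, 2.0), "B": (7.0, 4.0), "C": (5.0, 0.0)}


def observation_times(ranges, when="last"):
    """Pick one observation time per taxon from its simulated range.

    `when` may be "first", "mid" or "last"; "last" is the default here to
    mirror the default described in the text.
    """
    times = {}
    for taxon, (first, last) in ranges.items():
        if when == "first":
            times[taxon] = first
        elif when == "mid":
            times[taxon] = (first + last) / 2
        elif when == "last":
            times[taxon] = last
        else:
            raise ValueError("when must be 'first', 'mid' or 'last'")
    return times


first_times = observation_times(RANGES, when="first")
mid_times = observation_times(RANGES, when="mid")
last_times = observation_times(RANGES)
```

Each choice places the tips at different dates along the same simulated ranges, which is why the resulting time-scaled trees can look so different.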

taxa2phylo should not be used for any purpose but simulation: it gives nothing but a perfect representation of the phylogenetic and temporal relationships. In particular, this is good for simulating datasets in simulators that require a tree (like rTraitCont) but not for testing whether a tree-based analysis works. The unscaled, partially unresolved cladograms (from taxa2cladogram), together with sampled fossil occurrences, in particular on discrete-interval time-scales, are a more accurate description of the type of data recoverable from the fossil record.

Okay, so that's how I can turn simulated fossil records into trees with paleotree! Next post will be about the aforementioned sampling of the fossil record and how paleotree simulates it!