Thursday, July 26, 2012

Resolving Polytomies According to Temporal Order

Hello everyone!

A user of paleotree recently asked about resolving non-bifurcating nodes according to the order of stratigraphic appearance. If you're familiar with the library ape, you know there's a function called multi2di, which resolves polytomies randomly into bifurcating nodes. But what if we want to include information we have about WHEN taxa show up in the fossil record, assuming that taxa which show up later are more likely to be closely related?

My library paleotree has the Sampling Rate Calibrated time-scaling methods (SRC timescaling), which will resolve polytomies according to a probabilistic approach about gaps in the fossil record. This function can allow for ancestor-descendant relationships, although that aspect can be shut off or minimized to the user's discretion. However, how SRC methods work and whether it works well isn't something that's known to anyone but me, (maybe) my committee and whoever has asked me about it. Also, it needs an estimate of sampling rate, which isn't always obtainable.

So, as another option for resolving polytomies, I've made a new function called timeLadderTree, which takes polytomies and turns them into little pectinate (ladder-like) sub-trees. Each lineage in the pectinate sub-tree is ordered according to the time of first-appearance datum (FAD) for each lineage, for both clades and single taxa. This method of resolving polytomies assumes that the order of stratigraphic appearance perfectly depicts the order of branching. This may not be a good assumption for poorly sampled fossil records, but hey, maybe it's still better than assuming a completely random solution for a polytomy.

I'd paste the code here, but it's really long. If you really want it, email me or you can get this function either from the public paleotree library source file on github, which I'm pushing to as we speak, or just wait until the next paleotree release (sometime in the next month). Also in the newest release, I'll be including this function as an additional option for timePaleoPhy and bin_timePaleoPhy to resolve the tree with. The new argument will be called 'timeres' and setting it to TRUE will cause trees output with those functions to have polytomies resolved according to time of first appearance. (Obviously, this is incompatible with also setting the argument 'randres' for random resolution via multi2di to TRUE and doing such will return a warning).

Okay, so how does it work? Well, let's say we have data and some stratigraphic ranges that look like so...

...And we get a nice pectinate tree.

What that poorly-scrawled M$Paint image means is we need a tree in R (in 'phylo' format) with polytomies and a set of temporal data, in timeData format as is standard for taxon ranges in continuous time in paleotree. timeData matrices look like the following, where row-names are the taxon names.

            FAD         LAD
t1   170.391157 147.8839782
t2   158.519694 152.6571314
t3   149.415686 128.5232837
t4    ...                  ...

Note that this function is for resolving tree when a continuous time-scale is known. For discrete time-scales, where appearance dates are known from intervals rather than from specific points in time, users should use the function bin_timePaleoPhy, which stochastically produces arrangements of dates in discrete intervals in the course of time-scaling. This would be the preferred way of doing things with datasets where taxa are known on discrete time-scales (most data in paleontology, probably...).

You might wonder how this function handles ties. Well, taxa with the same identical first appearance date will be ordered randomly. Thus, the output could be slightly stochastic (it'll differ each time you run it), but this only occurs when taxa descended from the first node have the same first appearance datum in continuous time exist. This is probably uncommon with real data on continuous time-scales. Thus, resolving polytomies based on time-order will probably produce a single same tree each time you use it, unlike multi2di which is guaranteed to not produce the same tree each time. It's a pretty strong assumption though, to make, that order of appearance perfectly predicts branching order, though!

Because simulating clades in paleotree often produces partially unresolved trees (for reasons I explained last time) we can test this function pretty easily.

library(paleotree)
set.seed(444)
taxa<-simFossilTaxa(0.1,0.1,mintaxa=100)
tree<-taxa2cladogram(taxa)
ranges<-sampleRanges(taxa,r=0.5)
tree1<-timeLadderTree(tree,ranges)

And we can see the difference in our two trees, with the second have some very pectinate-looking regions...

layout(1:2)
plot(ladderize(tree),show.tip.label=FALSE)
plot(ladderize(tree1),show.tip.label=FALSE)
(Apologies for all the white-space, I didn't crop this one before pasting it...)

As always, let me know if you have any new ideas for paleotree that you would find useful in your own work!
-Dave

4 comments:

  1. Hi Dave,

    I can see how this is useful. Although depending on what you are doing, "biasing" your tree towards something that best fits stratigraphy might not be ideal (e.g. it artificially increases "fit" to stratigraphy, and favours shorter branch lengths for PCM).

    What I would actually like (although not enough to spend the time coding it!) is a function that produces all possible random resolutions of a tree with polytom(ies), as a means of constraining an analysis to include all possible bifurcating trees. Of course, in practical terms this may be a vast number, but in theory a short function that simply works out what this number is might be a nice preliminary.

    Another thought whilst reading this is that in practical terms polytomies tend to be the result of using a consensus tree. In such cases I wouldn't want to use all random resolutions, nor a temporally scaled resolution, as these would likely contain suboptimal trees (i.e. those with a longer length than the shortest tree under maximum parsimony). In other words, the most parsimonious trees that form a strict consensus are often a smaller set than the number of all possible bifurcating solutions for their consensus. (If the number of possible bifurcations could be calculated - see above - I suspect I could show this with a nice graph using my set of dinosaur analyses.) Of course, when multiple trees are known there is no need for a function to deal with this: just use the most parsimonious trees and not the strict consensus. (This seems obvious to me, but I have seen cases of people randomly resolving a consensus tree to get bifurcating solutions instead of just using the MPTs that ultimately created it!)

    Good stuff though!

    Graeme

    ReplyDelete
  2. Graeme-

    As for the first thing, you're correct: a small number of polytomies are all that's necessary to completely overload the total number of possible bifurcating solutions. If we just want a bunch of possible bifurcating solutions (without being complete), that's basically what multi2di does: it randomly resolves polytomies and repeated use will produce a large sample of possible resolutions. Also check out allTrees in the phangorn package.

    The number of possible bifurcations has a known equation. I think its in Felsenstein's book.

    Actually, I'm not certain using MPTs really is the best idea. I have much more to say on this subject than I can say here, but consider a simple case where you really do have several taxa which are descended from a single relatively static ancestor. The result of this in a cladistic analysis would be a polytomy with very poor support for any particular solution, unless there is considerable homoplasy. Using only the MPTs and not the consensus would get rid of the information that the relationships among these taxa are contentious. My SRC methods can actually consider the possibility of this multi-budding scenario if the taxa are input as a polytomy, while using the MPTs wouldn't allow this possibility. Of course, there are also cases where using the MPTs are the best idea. It just isn't clear to me that the case where polytomies are informative are really that uncommon.

    The most optimal solution would be to consider support from shared characters and from time-scaling under a model that considers sampling rate intensity, but we're pretty far from that solution at the moment...

    ReplyDelete
  3. Hi Dave,

    Boy, the ancestry thing really affects everything...

    BTW, have you seen this paper:

    http://onlinelibrary.wiley.com/doi/10.1111/j.1096-0031.2010.00320.x/abstract

    Graeme

    ReplyDelete
  4. I had seen that paper, but I hadn't read it as I am kind of blaise about stratigraphic consistency (...I'd rather deal with likelihoods). But I notice their point about polytomies and strict consensus trees: I'll have to print this out and bring it to South Dakota with me.

    Thanks Graeme!
    -Dave

    ReplyDelete

Note: Only a member of this blog may post a comment.