Thursday, February 21, 2013

How to Find a Paleontological Phylogeny (For Those With a Rapidly Approaching Deadline)

Hello all! Recently I got some email requests about where to find paleontological phylogenies, particularly for invertebrate groups. I wrote some responses and then asked if I could post my answer to them here, as I expect they aren't the only ones to have such issues. I mainly got the sense these individuals wanted trees for use with comparative analyses (macroevolutionary analyses) and found me through my authorship of paleotree, the predominant function of which is to prepare a phylogenetic dataset for such evolutionary analyses.

So, I'll assume for this that you, dear reader, are a person in need of a phylogenetic tree of some fossil taxa, particularly invertebrate fossil taxa. Furthermore, I'll assume that the best answer doesn't apply here: that if one really wants a good, believable tree, one should go make a tree the hard way, using cladistics or other phylogeny-building methods and a morphological matrix. The main reason this may not apply is that you, dear reader, want a tree for a class project or some other short-term goal, such as just proving you can apply some method of interest, even partially, to a group. So you are looking for a tree without needing to make a completely new tree yourself.

First off, not all groups have experienced equal effort in understanding their relationships using algorithmic phylogenetics (i.e. maximum parsimony, likelihood, Bayesian, etc). The difference in relative effort to make trees using quantitative methods from morphological matrices is particularly large between vertebrate and invertebrate groups, but there is even variation among invertebrate groups. Check out this figure (below) from Neige et al. (2007). The main point of that paper was to show how little cladistic work has been done on ammonites, but they also reviewed phylogenetic effort in a number of other major fossiliferous groups. Some groups, such as bivalves, also have not been the focus of much analysis. In contrast, echinoderms and trilobites have been the focus of considerable cladistic work. Partly it's cultural (sometimes there are just too few workers who are acquainted with cladistic methodology), and partly it's just because some groups, like bivalves, often don't preserve many of the important systematic characters needed for cladistic analysis. In some groups, systematic characters are dominated by continuous measurements, and there is still a lively debate about how to use those traits to determine relationships reliably.

This means if you want to work on a specific group, you may not be able to find a tree built using an explicit cladistic analysis. But let's hope this isn't the case for the moment. Where would you find such a phylogenetic hypothesis, if it existed? There's no general database specifically for paleontological phylogenies. While paleontologists were ahead of the curve by putting collection and occurrence data online in the PBDB years ago, the collecting of phylogenetic data has lagged behind. There are some morphological datasets and tree files on MorphBank, TreeBASE and Dryad, but if you have a specific group you're interested in, particularly invertebrates, they probably don't contain it.

EDIT (02-22-13):
Graeme Lloyd, who I knew had been keeping a number of fossil vertebrate character matrices and trees on his website, pointed out that he expanded at some point to include invertebrate groups too! So, thanks for proving me wrong, Graeme! (I clearly don't read Graeme's website often enough...)

EDIT (02-24-13): Graeme has commented he doesn't currently have any invert matrices up just yet, so his lists at the moment just reflect literature where you could find matrices. Still, that's quite a service!
Also, I noticed a Bristol database Graeme mentioned which I was unaware of contains a number of matrices. Again though, coverage is still pretty spotty for some groups.

So that means you'll have to turn to the primary literature to find trees to use, probably using some well-chosen keywords in Google Scholar. Now, let's assume you find the tree you want, which in 99% of cases will be some cladogram (I recommend the consensus trees...). The tree itself may be offered in supplemental files, or you might be able to get it by contacting the authors, but let's be perfectly realistic and admit that sometimes those avenues won't work, or they may only work on very long time-scales (hard drives get fried, babies are had and emails go unanswered, etc). Instead, you'll probably need to use what's actually at hand: the tree as printed on the page. I've done this myself quite a bit, and there are some handy programs out there that automate this, but I've always just copied the tree out by hand.

For small trees, the simplest way to do this is to copy trees out as a Newick string. Newick (sometimes also called 'phylip' format) is just a quick way to write relationships among taxa as a series of nested parentheses, with sister lineages separated by commas and the whole string ended by a semicolon. A quick example of Newick format for ctenophore, man, graptolite, fly and brachiopod genera: (Pleurobrachia,((Homo,Nemagraptus),(Drosophila,Terebratula))); You can read that into the programming language R as a text file using read.tree() in library ape. For large trees this can get tedious: you could also write out part of the tree and then modify it, by adding taxa, removing taxa and collapsing clades in a GUI, such as Mesquite. R has some of these tools but it doesn't make wholesale tree editing easy. If your eventual goal is to put the tree in R, I've found it is sometimes necessary to save Mesquite files as PHYLIP format and then open them and resave them in FigTree before using read.tree. If you're curious how long this will take, I've often found I can copy over a large tree in half a day or so.
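Hand-copied Newick strings are easy to get subtly wrong (a dropped comma, an unclosed parenthesis), so a quick sanity check before loading the tree anywhere can save some head-scratching. Below is a minimal sketch in Python, using only the standard library. The function name newick_tips is my own invention for illustration, and it assumes a plain topology-only string: no branch lengths, no internal node labels, no quoted names. It is not a replacement for read.tree(), just a check that the parentheses balance and that the tip names are the ones you meant to type.

```python
import re

def newick_tips(newick):
    """Check a simple Newick string and return its tip labels in order.

    Assumes topology only: no branch lengths, node labels, or quoted names.
    """
    text = newick.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string should end with a semicolon")
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis with no matching opening one")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    # Two clades written back to back, ")(", mean a comma was left out.
    if ")(" in text.replace(" ", ""):
        raise ValueError("missing comma between sister clades")
    # Everything that is not punctuation or whitespace is a tip label.
    return re.findall(r"[^(),;\s]+", text)

tree = "(Pleurobrachia,((Homo,Nemagraptus),(Drosophila,Terebratula)));"
tips = newick_tips(tree)
print(tips)
```

Comparing the printed tip list against the taxon list in the paper is a cheap way to catch typos before the names silently fail to match your range data later on.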

Now, let's continue on the path of assuming you found a cladogram. (Yes, reader who cannot find a cladogram, I'll get to you in a moment.) If your question just needs the nesting relationships shown by the cladogram, you're done. You got what you wanted. But maybe you want to go further. Maybe you want to make time-scaled phylogenies, which is what paleotree is for, and probably why you're reading this blog to begin with. These time-scaled phylogenies are what we want for making most evolutionary inferences.

If you want a time-scaled phylogeny, you'll need to find stratigraphic data to use, in addition to the cladogram. These stratigraphic data should record at least the intervals or dates in which the taxa you want to analyze first and last appear. These might be in the same publication as the cladogram you found, but they may not. You will probably need to go look at published range charts for this group and spend some time figuring out how obscure regional stages correlate to the global time-scale (this is where the Gradstein and Ogg volumes become very useful). You should also check the PBDB, although for some groups the data can be very coarse and will require some cleaning. Again, there is a right way to do this, but again I'm imagining you have a class project you want to complete with some rapidly approaching deadline. To do anything with paleotree, you just need to get the tree to the point you can read it into R as a 'phylo' object and read the ranges in as a matrix. The functions in paleotree can be applied once you have that. Note, by the way, that for these stratigraphic ranges I'm assuming you have almost all the known species in your dataset. If you are doing something like sampling one species per family, the first appearance times of taxa in your dataset will be poor indicators of when clades branched from each other, at least using the sort of methods I offer in paleotree.
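To make the bookkeeping concrete, here is a rough Python sketch of turning stage-level first and last appearances into a two-column table of dates, the sort of thing you could then read in as a matrix. Everything in it is hypothetical: the stage names, the ages, the taxon names and the function name are placeholders I made up for illustration, not real dates, so substitute boundaries from a published time scale. It also takes the simplest possible approach, giving each taxon the widest range its stages allow (base of the first stage to top of the last), which is an assumption and not the only defensible choice.

```python
# Hypothetical stage boundaries in millions of years ago (Ma): (base, top).
# These names and ages are placeholders, NOT real dates.
STAGE_AGES = {
    "StageA": (460.0, 455.0),
    "StageB": (455.0, 450.0),
    "StageC": (450.0, 445.0),
}

# Hypothetical occurrences: taxon -> (stage of first appearance, stage of last).
OCCURRENCES = {
    "Taxon_one": ("StageA", "StageB"),
    "Taxon_two": ("StageB", "StageB"),
    "Taxon_three": ("StageA", "StageC"),
}

def stage_ranges_to_dates(occurrences, stage_ages):
    """Return {taxon: (FAD, LAD)} in Ma, using the widest dates the stages allow."""
    ranges = {}
    for taxon, (first_stage, last_stage) in occurrences.items():
        fad = stage_ages[first_stage][0]  # base of the first stage (oldest date)
        lad = stage_ages[last_stage][1]   # top of the last stage (youngest date)
        if fad < lad:
            raise ValueError(f"{taxon}: first appearance is younger than last")
        ranges[taxon] = (fad, lad)
    return ranges

ranges = stage_ranges_to_dates(OCCURRENCES, STAGE_AGES)

# Print a tab-delimited table, one row per taxon.
print("taxon\tFAD\tLAD")
for taxon, (fad, lad) in ranges.items():
    print(f"{taxon}\t{fad}\t{lad}")
```

The one thing to be careful about is that the taxon names in this table match the tip labels on your tree exactly, since mismatched names are the usual reason a tree and its ranges refuse to line up.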

Okay, so what if you couldn't find a cladogram for the group you wanted to work on? Or what if you found a cladogram, but it's at the genus level and your data is at the species level? Or what if the cladogram has half the species you want to analyze, but not the other half, and the missing half is non-random with respect to your question (like, let's say body size affects taphonomy in your group, so the cladogram has all the big individuals, but you're interested in body size evolution...)? Well, you have several possibilities. You may just be out of luck, and maybe you will need to start over or consider making a tree yourself.

If the problem is just that your taxa are on a bunch of trees and you don't know how to make any sense of all the conflicting relationships, don't worry: supertree methods were invented for a reason and you might want to look into those. However, more likely the issue is that the taxa you want on the tree have never been on a tree. So, what you might want to do is look for data on relationships that isn't the product of an explicit phylogenetic analysis. For example, some invert groups have a number of stratophenetic diagrams, which are built based on expert opinion relating to morphological and stratigraphic data. There is also an increasing number of analyses being conducted with informal trees built from a combination of cladistic hypotheses and traditional taxonomic data (like the widely-used Phylomatic for plants; Webb and Donoghue, 2006), or just from taxonomic data alone (e.g. Green et al., 2011, or 'Common Tree'). These non-cladistic options, or options which are only partly based on cladistic evidence, *might* be okay to use.

Whether such shortcuts are acceptable approaches or not depends on you and your question. You should carefully consider how sensitive your analyses are to uncertain relationships. It's not really an easy question to answer, but maybe for a first-pass rough cut (such as for a class project), a summary hypothesis based on taxonomic information in place of a tree may be okay. Some people will feel very strongly about this one way or another, and you'll ultimately be the one who will have to defend what you did and how you did it. The function expandTaxonTree in paleotree can be useful if you do decide to go this route, as it can do helpful things like turning a rough genus-level tree into a species-level tree by treating each genus as a soft polytomy. (Also, as I stated above, if you want a time-scaled tree of fossil taxa, you still need stratigraphic ranges, unless you are using a stratophenetic tree, in which case it should already be time-scaled.)
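If the idea of treating each genus as a polytomy sounds abstract, here is a toy illustration in Python of what that expansion does to a Newick string. To be clear, this is not how expandTaxonTree is implemented and it is no substitute for it; the function name, genus names and species names are all made up. It assumes every genus is monophyletic (a big assumption!) and a topology-only Newick string with no branch lengths. Note also that Newick itself has no way to mark a polytomy as 'soft', so that interpretation is something you carry into the analysis yourself.

```python
import re

def expand_genus_tree(newick, species_by_genus):
    """Replace each genus tip in a Newick string with a polytomy of its species.

    Genera missing from the lookup are left as they are; a genus with a single
    species is swapped for that species without extra parentheses.
    """
    def expand(match):
        genus = match.group(0)
        species = species_by_genus.get(genus, [genus])
        if len(species) == 1:
            return species[0]
        return "(" + ",".join(species) + ")"

    # Every run of characters that is not punctuation or whitespace is a tip.
    return re.sub(r"[^(),;\s]+", expand, newick)

# Hypothetical genus-level tree and genus membership, for illustration only.
genus_tree = "((GenusA,GenusB),GenusC);"
species_by_genus = {
    "GenusA": ["GenusA_alpha", "GenusA_beta", "GenusA_gamma"],
    "GenusB": ["GenusB_alpha"],
    "GenusC": ["GenusC_alpha", "GenusC_beta"],
}

species_tree = expand_genus_tree(genus_tree, species_by_genus)
print(species_tree)
```

Seeing the expanded string makes the cost of the shortcut obvious: all the resolution within each genus is gone, which is exactly the uncertainty your analysis needs to be able to tolerate.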

Now, for those of you who read this and maybe think this post could inadvertently serve to inspire risky behavior, I agree you might have a point. Phylogenetics can be slow, because good phylogenetics requires patience, care and due consideration. All I can say is that students will always end up in situations where they need datasets as part of coursework (or whatever) and these questions will thus always come up, because making a new morphology-based tree from scratch isn't a feasible solution for most classes (or most students). Overall, I hope people go and do simple analyses with back-of-the-envelope trees, see how awesome phylogeny-based analyses are and get inspired to construct trees for doing those same analyses as part of a larger project where they have more time.

Anyway, I hope all this helps!