Nemagraptus gracilis: 2013

Friday, December 13, 2013

Stratigraphic Uncertainty, timePaleoPhy versus bin_timePaleoPhy, and My Own Paleontological Myopia

Hi all!

So, its been a rough week! I've gotten more emails about issues or questions about paleotree this week than I have in the past four months! Which is awesome, cause I love talking to people about using paleotree, but there's something that's been worrying me in some of these emails. These emails have given me a grave concern that a number of users don't understand the distinction between two key time-scaling functions in paleotree: timePaleoPhy and bin_timePaleoPhy. The problem is, not understanding this difference is extremely dangerous, as they could generate time-scaled trees that don't make sense without realizing it.

But I can't really blame them, because I think its my fault as the package author to explain this distinction clearly enough... the problem is I didn't understand something critical and fundamental three years ago: Our Fossil Records are Different.

See, the fossil record for your group is probably really different from the fossil record I'm working on, and that means not only how we do things is different, but the sort of assumptions we come to the table with and what we think X means is different. And that sounds kind of obvious, until you meet someone whose fossil record is radically different from yours and they don't even have a concept for this one thing you thought everybody knew about.

The problem is that people are using timePaleoPhy who probably shouldn't be using timePaleoPhy, when they should be using bin_timePaleoPhy instead. It isn't helped by the argument 'rand.obs' in timePaleoPhy, which some interpreted as representing how to bring stratigraphic uncertainty into an analysis. After explaining to several users over email why I was concerned about their use of paleotree, I felt like a specific blog post on the topic was needed.

But to understand where this misconception of some users comes from, and why I think it stems from my own misconceptions in the past that maybe fossil records were more similar than I know now, I've also got to explain my reasons for writing both functions the way I did.

So, I wrote timePaleoPhy first, when I had some stratigraphic data for graptolite ranges, with stratigraphic meters scaled to an absolute time-scale. Since graptolite species often persist over time, I had a matrix of first and last appearance dates (FADs and LADs, respectively), like so...

Normalograptus_normalis 457.9 438.3
Climacograptus_typicalis 463.2 442.8

(This is totally fake data; I don't even know if typicalis is still a Climacograptus, but maybe its the type?...I can't remember at this very moment.)

The dates here represent actual ranges; i.e. N. normalis is first found in rocks dated 457.9 Ma and is found fairly continuously until you get to rocks dated 438.3 Ma. Of course, the rocks aren't actually 'dated' so precisely, more like we're assuming continuous sedimentation rates and so we're inferring those dates using the absolute dates we do have.

This means when we time-scale the tree, unless we're allowing for ancestors (which we don't normally do unless we know something about apoomorphic and plesiomorphic taxa) the clade containing N. normalis has to be at least as old as 457.9 Ma.

And that's what timePaleoPhy does, if you hand it data like above, you get back a tree where the nodes are as old as the dates in the first column (the taxon names should be row labels), because it thinks those are first appearance dates, known very precisely.

A different question is where the tips go (or 'when', really). As I've talked about before here on this blog and in published papers, this is a difficult question. For most things, such as looking at a tree or estimating lineage diversity through time, you'd want the whole taxon range 'added' to the tree, so you should use the LAD as the tip-date. However, comparative methods for studying trait evolution assume that the date of tips represents the date of the population from which trait values were taken. Of course, maybe your trait doesn't vary much (or at all) over the duration of the taxon, but that's a story for another time. In the papers I've written, I often call these tip-dates 'times of observation'... cause I think some reviewer didn't like 'tip-date', or maybe it was a committee member. Well, reviewers seem to dislike 'times of observation' too.

(I can't freakin' win when it comes to vocabulary.)

At the time, I was most interested in comparative methods, so I assumed this would be true of everyone else. In general, it seemed like people were using the FAD primarily in the literature (or at least, I thought so at the time) as the tip-date for comparative methods, so I made the FAD the default tip-dates that timePaleoPhy spits out. To get timePaleoPhy to use the LAD as the tip-date, you have to set add.term=TRUE, which adds the range to the terminal branch the taxon sits on.

But using the FAD or the LAD wasn't really a satisfying solution: the tip-date in a comparative analysis supposed to be the time of the sampled population (hence why I call them 'times of observation'). For my long-ago purpose, I had I had size measurements from a bunch of figured specimens for graptolites but with very little data on 'when' those specimens came from. They were from 'the Late Ordovician' or 'Early Silurian' and I didn't have enough information to figure out a more specific tip-date. So that's pretty much useless information for figuring out the 'when' for these specimens, and if I want to do some PCM, well the tips are supposed to represent the date of the population where I measured the trait. I just know the Normalograptus normalis specimen I measured has a 4.2 mm long doodad. Using the FAD or the LAD wasn't really an honest approach given the uncertainty involved.

So... rand.obs is how I dealt with this uncertainty, still using the FAD for N. normalis to adjust the clade age, but allowing the tips to fall at some date randomly selected between the FAD and the LAD. Maybe I had measured doodad length from a normalis specimen from 441.3 Ma, so, the tip would be at that date, but the clade constraining that taxon would still be constrained by the 457.9 Ma first appearance date. Note that rand.obs only works if we also set add.term=TRUE, cause the way I decided to do it (and no, I'm not changing it) was that these random observation dates were a temporary replacement of the LADs. Thankfully, timePaleoPhy returns an error if you ever try using rand.obs=TRUE while add.term=FALSE, or else people might get very confused.

Later, I needed to make use of a much less precise dataset, where taxon first and last occurrences were graptolite zones. So like, the different zones had minimum and maximum dates associated with their starting and ending, like this...

zone_A 466.2 452.4
zone_B 452.4 448.5
zone_C 448.5 441.1
zone_D 441.1 436.2
zone_E 436.2 431.0

(This is totally fake cause graptolite zones are usually shorter than that.)

To keep track of everything, I needed some way of recording the dates for the intervals and which intervals taxa first and last showed up in. Normally you see this thing displaying in papers as a range chart, where species A will have Xs in intervals 1, 2, 3, and then not 4, meaning it last appeared in interval 4. This seemed like a fairly non-straightforward way of handling the information, particularly as I was only really interested in the first and last interval each taxon appeared in.

So, what I did was I listed the first and last interval that each taxon was known from as the row of the interval matrix. Like so:

N._normalis 1 4
C._typicalis 1 3

So, if we look at this in reference to the prior matrix, this tells us that N. normalis first appears in zone A and last appears in zone D, which means it might have first shown up anywhere between 466.2 and 452.4 Ma and last appeared somewhere between 441.1 and 436.2 Ma. And so now, if we have both this 'taxon times' matrix and the 'interval times' matrix above, we have all the information on the dates associated with when these taxa were first and last occurring.

In paleotree, I decided to call a list that contained both matrices a 'timeList' object because as you have no doubt deduced, I have terrible taste when it comes to coming up with new vocabulary. A lot of functions in paleotree use these timeList objects. (It's not a real object class, cause I didn't understand S3 classes yet when I wrote paleotree.) Anyway, I went with a list of two matrices rather than two seperate matrices because the taxonTimes matrix doesn't make sense without reference to the intervalTimes matrix and so I felt I needed to keep them sorted together. To get a timeList, you can just make the two matrices by hand in your favorite spreadsheet program, read them into R and then combine as a list. The last step is as simple as typing time<-list(intervalTimes,taxonTimes) into the terminal. Note that the interval names and taxon names are rownames, so the matrices should both have only two columns each.

Now, to time-scale a tree with this sort of data, we need to somehow get precise first and last appearance dates and then make a tree from there. So I made bin_timePaleoPhy to draw dates from these intervals, under a uniform probability distribution, and then use the randomly drawn dates to make a time-scaled tree. So, maybe normalis was first seen at 454.2 or 461.8 Ma, and last seen at 440.7 Ma, or 438.8 Ma. Or lots of other combinations! We can resample those dates and get all sorts of new FADs. and LADs and make lots of time-scaled trees. Unlike timePaleoPhy, bin_timePaleoPhy actually does something about stratigraphic uncertainty, at least as far as having data in discrete intervals goes.

But this doesn't fix the unknown times of observation issue for trait data... so bin_timePaleoPhy also has a rand.obs argument just like , so that we can let our observed doodad length for N. normalis come from 452.6 Ma or 446.3 Ma or whatever, depending on the set of FADs and LADs we plucked from the intervals in a given run. Again, just like with timePaleoPhy, the date of the nodes are still going to depend on the FADs, rand.obs is only going to impact the tip-dates, those enigmatic 'times of observation'.

So, let's be clear:

1) timePaleoPhy does *not* deal in stratigraphic uncertainty of appearance dates: it believes the numbers you are handing it are very precise first and last appearance dates, which exist for some fossil records but probably not most. It does deal with uncertainty in the times of observation of specimen measurement, though, which is what rand.obs is for.

2) bin_timePaleoPhy *does* deal with stratigraphic uncertainty, but it requires slightly more complicated input: two matrices, as a list.

And I think, somehow, I said something that is confusing people about these two facts. I'm repeatedly getting emails from people that think the matrix for timePaleoPhy is a matrix of the minimum and maximum ages for a point occurrence of a fossil. In other words, their fossil taxa only appear at a single point in time which is imprecisely known, and so their dates are the bounds on that single date, rather than two very precisely known dates that represent the first and last appearances.

This practice is very worrying, because if they put those dates in and then set rand.obs=TRUE and add.term=TRUE, they'll think their pulling their taxon occurrence dates randomly from their min and max dates... which is true for the tip-dates. But the nodes are being pushed back in time to those minimum dates, which really means that clades are 'as old as they could possibly be' or something wacky like that. And I bet people have done this without realizing that's what is happening! Its absolute nonsense and this nonsense has to stop.

I think the real issues is that when I wrote these functions, I was excessively myopic. I looked at things like a graptolite worker and I thought everyone had data like graptolite data, with taxa that typically first and last appeared at different times. I actually didn't realize that other people had data where it was more common to see things like 'species A is known from one skeleton, found somewhere in the early Cretaceous'. Some vert paleo friends of mine have kindly helped me understand that this isn't true, at least, for lots of vert datasets: they actually have taxa known from single collections that may be as poorly unconstrained as 'anywhere within a 40 million year interval', which, well, that blows my mind every single time I read this sentence.

So, given the apparent and continual misuse of timePaleoPhy, I have actually considered pulling the function entirely from the CRAN version of paleotree, or hiding the function, making it unavailable to regular users. This isn't ideal: some people, such as those who work with detailed biostratigraphic records such as for index fossils and microfossils, actually have precisely dated first and last appearances. I don't know if anyone like that uses paleotree yet, but I want to keep the door open for them! Plus, bin_timePaleoPhy is really just a dumb wrapper for timePaleoPhy. If someone really wants to write their own wrapper for timePaleoPhy, that's fine! I'm happy for them.

But honestly, the majority of people won't have data that's infinitely precise. They should be using bin_timePaleoPhy. (Or you know, bin_cal3TimePaleoPhy, but maybe I shouldn't get ahead of myself...)

So, the real question is how to convince people to want to use bin_timePaleoPhy. Well, although I just explained the function from a grapt-point-of-view, I didn't code it for that only that type of data. The timeList objects takes by bin_timePaleoPhy and other paleotree functions can actually be much less formal than my example (well, a few specific functions have very specific needs, but that isn't true of bin_timePaleoPhy).

For one thing, the intervals in the interval matrix can be completely messy. For example, you could have an interval matrix that looks like:
Lower Cumberbatchian 134.5 129.4
Cumberbatchian-Martinian 134.5 109.6
Cumberbatchian-Bloomean 134.5 99.8
Middle Upper Martinian 107.5 105.8
Martinian-Bloomean 115.4 99.8

What I've discovered in a recent collaborative project is if you get data from the PBDB, you'll probably have to have an interval matrix like this. It's okay. But if you have multiple collections for a taxa, try to get the most precise first and last intervals.

For example, say your taxa only ever appear in single interval, well than the first and last intervals are identical, so you'd just structure the second part of the input list so the columns matched, like so:

Rareosaurus 1 1
Uniqueodon 2 2

This is a little more tricky if your taxa also only ever appear in a single collection. By default, bin_timePaleoPhy assumes taxa first and last appear at different dates within an interval. For example, Rareosaurus will not be assigned matching FADs and LADs. However, if you set the argument point.occur=TRUE, then bin_timePaleoPhy will treat all your taxa as if their first and last occurrences are identical: i.e. all your taxa will be treated as point occurrences.

Of course, I have made it so that bin_timePaleoPhy has a little more functionality, to deal with those special cases that often crop up with uncertainty in appearance times; these can be usually handled with what I called the 'sites' matrix. For example, perhaps the first specimens of two taxa appear in the same fossil assemblage. That single fossil assemblage may be very poor resolved to a wide interval, but we know the first appearance date for both taxa should be the same date. If you make a site matrix as I describe in the bin_timePaleoPhy help file, and set the site number the same for the first appearance of both taxa, then bin_timePaleoPhy will always use the same randomly-drawn date for those taxa. You could do this for lots of taxa, or do this to account for a site where some taxa have their last appearance at and other taxa first appear at, or all sorts of neat modifications. I have to be honest and point out that this sites matrix idea was first suggested by Jon Mitchell over lunch about three years ago.

In reality, the point.occur argument I described is just a simplified way of modifying the site matrix so all first and last intervals are coded as having the same 'site' for each taxon. If you want to have only some taxa in a dataset constrained to be point occurrances, you can just specify a custom-made site matrix where only those taxa have their first and last interval constrained to be the same.

To be truly honest though, I don't think its a lack of advertising about the options offered by bin_timePaleoPhy that stops people from using it... really, its the timeList input format. A large number of people have suggested that this is a very obscure way of recording this data and more than a few have complained that its just not amendable to their use. Peter Smits told me that the timeList was specifically very 'ugly' to have to use two matrices. Overall, I have a feeling that I just did a shoddy job of describing timeList objects in the help files to begin with.

Well, I don't know what to do about it guys! Coming up with the two-matrix fix was the best idea I had after thinking about it for a while, and while I admit its inelegant and while I admit that the timeList structure is pretty dissimilar from the typical data structure you might see in a paleontological dataset (as opposed to a taxon by interval range chart), I just don't see a better solution, even though I can that its actively driving people to misuse the functions in paleotree in ways that I think are unpredictable and a little dangerous.

But I know that I don't know everything... so if anyone has a better idea of how to make the bin_timePaleoPhy input work better, I'd love to hear it and implement your idea in paleotree! Do you see a better, simpler way of conveying the same information? Or, is there a data format that you use, that you'd like to have a function for converting into a timeList? Let me know in the comments below.

For those of you interested, I give a real example of a very nicely behaved timeList object in my time-scaling method tutorial from this summer, which you can find here:
http://nemagraptus.blogspot.com/2013/06/a-tutorial-to-cal3-time-scaling-using.html

On a more philosophical note, I think those of us who write methods for paleontological data have to keep in mind that Fossil Records are Different. Had I been less nearsighted and graptolite-oriented in my design and help file writing in the beginning, maybe this user preference for the wrong function could have been avoided, and so for that, I really only can blame myself.

Finally, if you're reading this, and you think you might be one of those people who sent me an email this week and contributed to my concerns, I just want to thank you for emailing me and forcing me to confront the fact that there was something about my software that users were finding consistently confusing. I still love reading and responding to every paleotree email I get.

-Dave

Monday, June 17, 2013

Sampling Rates and Dealing with Fossil Records Less Than Conducive to Their Estimation

So, in my last post, I went at length into a lot of the nuances we need to consider when we estimate sampling rates and sampling probabilities from the frequency distribution of ranges in the fossil record. This gave me the impetus to do something I'd been meaning to do for a while: actually track down the various estimates of sampling probability or sampling rate, turn them all into per Lmy sampling rates and compare them. Here they are, ordered from greatest to least:

Sampling Rate (per Lmy)    Taxon    Reference
17.4577776    Oklahoma Trilobites Species (Single Locality)    Foote and Raup, 1996
1.660731207    Cenozoic Macroperforate Planktic Forams    Ezard et al., 2011
1.046292901    Neogene Iberian Mammal Genera    Alba et al., 2001
0.868900035    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.797580429    Neogene Iberian Mammal Species    Alba et al., 2001
0.483501825    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.437808292    Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.434450018    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.410974389    N.A. Cenozoic Mammal Species    Foote and Raup, 1996
0.408044166    European Jurassic Bivalve Species    Foote and Raup, 1996
0.385502461    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.357475065    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.334331480 Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.244922481    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.214987601    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.201575023    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.196147211    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.192764386    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.182563024    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.159239636    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.153449104    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.145183217    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.133448941    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.127046142    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.122426282    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.116261536    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.115432413    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.112799434    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.099021026    Early Paleozoic Crinoid Genera    Foote and Raup, 1996
0.093263457    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.086915600    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.084205114    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.078324167    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.075548263    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.067297159    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.054279636    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.052305831    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.039758685    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.029548896    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.023242431    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.011674604    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.009677980 Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.009326054    Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.008051675    Cenozoic Chiroptera Genera    Eiting and Gunnell, 2009

Most of these were published as sampling probabilities, so I just found the mean interval length and got the sampling rater per Lmy using my paleotree function 'sProb2sRate'. The Foote and Sepkoski values weren't published as numbers, just shown in a plot, so I used a ruler rather than send an email to my old advisor. It's just a blog post, not a research paper.

Note that every group in Foote and Sepkoski had two sampling probabilities plotted, for two different systems of how to break the Phanerozoic into smaller intervals. Interestingly, this reveals considerable variation in the estimated sampling rate, suggesting that the estimates from freqRat or from the maximum likelihood variant can be sensitive to how we break up intervals (as I suggested in the last blog post).

The rate estimated for the Oklahoma trilobite data is wildly high, but note this is a densely sampled single locality in the Ordovician. Beware the disconnect between global estimates and local estimates... and beware the fact that they don't cleanly scale between each other. If all of your taxa are from Australia, if you do a global analysis, it will be just an analysis of Australia that you're only pretending is global. Maybe that is the 'global' rate, but it depends on your assumptions about whether there are taxa that you didn't sample in non-Australia regions of the world. There's also taxonomic issues here too: many of these analyses were done at the supraspecific level (note that doesn't mean persistent taxa are only at the genus level and above; paleontologists just have a tendency to analyze genera, not species).

Now let's look at the other end of the scale. We've got some fish groups, then polychaetes, then bats. Makes sense right? If you had to imagine a really bad fossil record, squishy worms and fragile little bats sounds about right. But hold on. Let's ask ourselves, what's missing from this list. Well, there aren't any terrestrial groups on this other than mammals. I used to think this was just because no one had done it yet.

Then I started talking to various vertebrate paleontologists and researchers working with datasets from non-Cenozoic vertebrate paleontology datasets and they told me that in other vertebrates groups, like dinosaurs, every species is found in what essentially amounts to a single stratigraphic interval, at least at the global scale. They might be found throughout a unit or formation that represents a limited window of time, like the Morrison or whatever, but you don't find the same morphotaxon persisting across intervals like you do in invertebrates. The people I talk to often call it 'point occurrence' data so I'll use that term for the rest of this post, even though that isn't exactly ideal: 'point-occurrence' really isn't completely accurate if some of those taxa are found at multiple horizons in a unit or formation, its less descriptive than we might desire and 'point occurrence' is sometimes used to mean other things. (There's other terms too like 'singleton' which could apply here, but their usage is even more muddled.)

Disclaimer: I know nothing about vertebrates or their fossil record. As evidence, I present that I squeaked by with a C- in Primate Anthropology, my one attempt to learn anything about vertebrates. I felt it was important to start off this post with that, because, great golly, I don't want anyone reading this and thinking I know anything about the vertebrate fossil record. Heck, I haven't even seen any datasets firsthand. All my information is hearsay from other scientists, most of whom contacted me in trying to time-scale their trees. If someone can show me a range chart that has a bunch of sauropod species persisting through multiple geologic intervals, then great, I'd love to see it.

So, anyway, I found this description of vertebrate paleo datasets to be really strange. Why are they so different from the ones we can get sampling rates from, like those above? One explanation I have been given is that this occurs because groups like dinosaurs are just more poorly sampled than shelly invertebrate groups. If I try to simulate a really poorly sampled clade but still have a number of sampled taxa, under a simple model where extinction and sampling are homogenous Poisson processes, I can't make all the taxa point occurrences. There's a real good reason for this: as the sampling rate decreases, the number of original true taxa must increase, such that the sampled taxa is a tiny portion of a vast unsampled diversity. As the number of total taxa increases, you get more and more extremely long-lived taxa, a tiny minority, but they become so increasingly long-live that the probability of NOT sampling any of them one of them twice (and thus getting a persistent taxon) is incredibly low. I cannot avoid simulated datasets with persistent taxa, not under this simple model of extinction and sampling. This is true even if I break up the time-scale of these simulations into discrete intervals. I still get persistent taxa with first and last appearances sampled in multiple intervals.

So, to me, that means it isn't just 'poorer sampling'. There needs to be something else going on. We're missing some explanatory factor that could cause morphotaxa to either be very 'short-lived' in the fossil record and/or would make it very difficult to sample any morphotaxon that might survive into another stratigraphic interval. Of the former, it is possible to imagine a fossil record of point occurrences resulting from extremely high anagenesis rates, such that species really don't have any opportunity to be persistent. However, that means there's something about dinosaurs and other such groups that is fundamentally different from groups like North American mammals and fossil invertebrates. It doesn't need to represent a real evolutionary difference: it might be that differences in the available material gives more characters and thus allows for much much finer species delimination (or, possibly, that species delimination is being overly aggressive). The later, where some factor makes it difficult to sample any persist morphotaxon more than once, could occur if there is some sort of complex regional alternation in the deposition and preservation of rock packages that could preserve vertebrates; i.e. North America get the Morrison, then nothing for a long time, something like that (I have no idea if the preceding statement is true). But then why the shift going from Mesozoic dinosaurs to Cenozoic mammals?

Remember, to use the freqRat or the variants of it, you need to have some persistent taxa so that means there are some persistent bat taxa, found in multiple stages (as that's what Eitling and Gunnell used) but not dinosaurs. Now, an important distinction here is maybe it's that Eitling and Gunnell used genera. So, okay, forget bats. But why do Alroy's Cenozoic mammal species have persistent taxa? (as analyzed in Foote and Raup) Whatever the reason for this point occurrence pattern in some groups, it will probably be obscure for a while. I'd bet it is probably the result of multiple factors.

But, to get back to the practical point, what's a researcher with a dataset unanimously dominated by point occurrence taxa supposed to do if they want to use cal3, which absolutely requires an estimate of the sampling rate? (We'll ignore also needing branching and extinction rate estimates for the moment.)

It's really quite a conundrum. These are my thoughts.

First is that the freqRat and the ML variant aren't the only ways to get sampling rate. Foote (2004) also introduced a survivorship curve method, but that also won't work if you don't have persistent taxa. Solow and Smith (1997) presented a method that obtained sampling rate estimates from the waiting time distribution between sampling events in single taxa; that information definitely isn't available here, at least not at the global, phylogenetic scale we're trying to get sampling rates for.. There's also capture-mark-recapture methods, which also estimate sampling probabilities, but I don't know much about them or whether they could be applied to data that effectively has no recaptures. An idea that's been banging around in my skull is that Friedman and Brazeau (2012) introduced an idea that might bear water: get sampling estimates from the ghost branch distribution. But I think that issue is a lot more complicated than Friedman and Brazeau suggested, and there's some theoretical roadblocks relating to differentiation patterns (budding versus bifurcation, anagenesis, etc) that I haven't thought around yet.

So, that exhausts the methods (that I know of) for actually estimating the sampling rate.

One utter kludge would be to just pull them out of thin air. Ask yourself what fossil records on the chart above are probably of similar quality and use those values. Even better, choose a range of likely values and use those. The problem with this is that maybe your fossil record is even worse than the ones described here. Also, these values are for a limited selection of data: most of them are global and their relationship to regional datasets is thus obscure. Many of them are at the generic level, particularly the more poorly sampled records, and it is difficult to envision any way of converting a generic estimate of sampling to a species-level estimate (although maybe paraclade models might offer such a route; see Patzkowsky, 1995; Foote, 2012). Overall, I cannot recommend that pulling values out of thin air is the right way to use cal3. It's a quick and dirty way of getting something done, just like using a 'taxon tree' as a replacement for a cladogram, but its ultimately difficult to robustly defend the assumptions you'd be making in such an analysis. Class projects and pilot work can have quick and dirty techniques, but for real research we want to be able to trust the conclusions of, we would want something that involves less guesswork.

And there's an addition problem to such cross-eyed guesstimations: remember my simulation woes above? We can't get point occurrence data under our typical models of sampling and extinction events, where they occur as under a Poisson process. We know the assumptions of our models are not valid in those cases. And that means even ignoring the 'thin air' part of pulling sampling rates out of thin air, the models that cal3 assumes probably don't apply to this type of data. And it's not just cal3 that assumes those models; that's the typical assumptions of many paleobiological methods that involve putting any sort of expectation on a sampling process. Of course, just because model assumptions are violated doesn't mean a method will return the wrong answer. Methods can be more robust than that. However, without knowing what leads to point occurrence type data it is hard to make predictions, since we don't know exactly what about our model assumptions are incorrect. My presumption would be that if this data type occurs because sampling is much more complicated than a global Poisson process, then cal3 will give fairly wrong answers. Again, though, that's probably also true of many other methods too. If, instead, it's because these are groups where morphotaxa are extremely finely delimited, such that you can't get persistent taxa, then maybe the cal3 method is okay. This violation of model assumptions just means morphotaxa are 'terminating' via anagenetic pseudoextinction much more often than lineages terminating due to extinction. In other words, there's a disconnect between the morphological differentiation of lineages and their birth-death dynamics. cal3's assumption have more to do with the probability of not sampling a clade at all or not, something which should be independent of the morphological differentiation pattern within that clade.

Regardless, and some people have pushed me on this point, I do not claim that the cal3 method is applicable to all datasets. I fully admit that it is not! I did this to time-scale graptolite phylogenies and while I want the method to be as useful and general for as many people as possible, the only reasonable starting point for making an explicitly model-based time-scaling method was to start with the simplest models of sampling and diversification possible. Simple models are the only logical foundation for moving forward. And while I might be able to realistically defend the assumptions of those simple models in graptolites or other 'well-sampled' groups, as a scientific community, we have a lot more work ahead of us to tackle every dataset. Even though sampling processes are as important as origination and extinction in shaping the fossil record, if not more so, I think we still spend much more effort in understanding and modeling the dynamics of diversification in paleobiology than sampling processes.

But this difficulty in applying cal3 to every dataset poses a new question: are the other time-scaling methods that we have available sufficient for the sort of phylogeny-based analyses of evolution ('phylogenetic comparative methods') that we want to do in the fossil record?

Well, we can talk more about that this Sunday in Snowbird, if you'll be there. ;)
http://evo2013.evolutionmeeting.org/engine/search/index.php?func=detail&aid=149

Cheers,
-Dave

Friday, June 14, 2013

A Tutorial to cal3 Time-Scaling Using a Real Dataset

Hi guys!

So, my paper on time-scaling methods for trees of fossil taxa has been accepted to Methods in Ecology and Evolution! You won't see it right away though; it's part of a special issue of MEE and all the papers included in it will be held back for a simultaneous online release, probably in August (I guess?). Anyway, until then, you can go read chapter 3 of my dissertation, it's mostly the same thing, and its up on UMI Proquest now. Or you can contact me for a copy.

Anyway, to commemorate the acceptance, I have actually pulled a real dataset out of the literature and written up a tutorial about applying the various time-scaling methods to this data, in particular the cal3 method. The data in question (of course) a graptolite group, the Silurian retiolitids. It's a generic dataset, but, hey, it's just a tutorial right! Chill, dudes.

Such a tutorial was persuasively argued for by one of my reviewers at MEE, but I had to respond with "I don't have the room!" However, I've got infinite room here on this blog and also it means anyone can come see it here, for free, forever (well as long as Blogger exists...). So here it is! Thanks to Reviewer #1 for making such a strong case for it. Sorry, Reviewer #1, but I did not take you up on that other suggestion that I use a vertebrate dataset example. I just know graptolites better, and vertebrates are a headache to get sampling rates from (I'll touch on that in a later blog post, thanks).

Now, this is a very LONG tutorial. Why? I want to cover in detail all the hiccups that can happen on the way to wanting to time-scale a tree, and I really wanted to nail down what the data structure looks like. So, apologies for the length, but I think anyone actually attempting this stuff will appreciate the detail I go into, particularly, how the degree we should exercise caution and skepticism in estimating sampling rates.

I've written this tutorial much like a cave man from 1984 might have, as an overly chatty, heavily commented R script. If someone who knows LaTEX and Sweave better than me wants to helpfully convert this to a document suitable for release as a vignette file for paleotree well... Well, I will buy that person a few beers at Evolution! Otherwise, you guys are getting just this: an R script.

I've pasted the script text below, with figures inserted and a few key console outputs. The images overhang the margins. I don't know why and I can't seem to fix it. Stupid Blogger! But rather than read my poorly formatted blog post, I instead recommend downloading the script and the dataset and going through it yourself in R! It's more... interactive, that way!

You can get the retiolitid data file and the R script as a zipped archive here:
http://webpages.sdsmt.edu/~dbapst/cal3tutorial/cal3tutorial.zip

This gave me the idea for another post about published estimates of sampling rates, which I'll put up in a few days. I hope you all enjoy this tutorial as much as I enjoyed writing it,
-Dave

#############################################################
#A Tutorial to Time-scaling Methods with the Retiolitinae
#David W. Bapst
#06-12-13

#Introduction

#The package paleotree has functions for time-scaling phylogenies of fossil
#taxa but none of the help examples give a concrete case of the functions being
#applied to real data. Instead, I opted to only give examples using simulated data.
#The reasons for this are simply that real data tends to be very complex, and
#although the functions can certainly be applied to real data, it will often require
#more effort than what can really be put in an R help file.
#
#However, without a clear example of applying the functions to real data, I have
#opened the door for lots of confusion about how a user should do this. Here, I
#use a simple dataset of graptoloid generic ranges and relationships from a recent
#publication (Bates et al., 2005) to exemplify how the time-scaling functions in
#in paleotree work, what the data input looks like and just how complicated even
#a simple real dataset can be.
#
#By the way, our example data today is of the Silurian retiolitids, which are really
#cool looking. Check 'em out!
#http://www.insugeo.org.ar/libros/cg_18/FIG_1.jpg

######################################
#Know Your Data

#First, we need to load paleotree...
library(paleotree)

#Next, we need to load the retiolitid data file, which needs to be in our working dir.
load("retiolitinae.rda")

#You can also load the data directly from the current version of paleotree:
data(retiolitinae)
#if this doesn't work, you have an old version of paleotree! Go download a new one.

#And that should have added two new objects to our workspace, retioTree and retioRanges.

#Let's look at retioRanges first. This is our temporal/interval data.

#retioRanges

str(retioRanges)

> str(retioRanges)
List of 2
$ int.times :'data.frame':    20 obs. of 2 variables:
..$ start_time: num [1:20] 440 439 437 437 436 ...
..$ end_time : num [1:20] 439 437 437 436 435 ...
$ taxon.times:'data.frame':    22 obs. of 2 variables:
..$ first_int: int [1:22] 8 6 2 8 8 8 8 13 8 13 ...
..$ last_int : int [1:22] 8 10 10 8 14 14 14 14 14 14 ...

#It's a list with two components, both matrices. Let's see what they are!

#First component:
retioRanges[[1]]

> retioRanges[[1]]
                                                      start_time end_time
Coronograptus cyphus                                      440.18   439.37
Demirastrites triangulatus - Demirastrites pectinatus     439.37   437.46
Pernerograptus argenteus                                  437.46   436.97
Lituigraptus convolutus                                   436.97   435.85
Stimulograptus sedgwicki                                  435.85   434.90
Spirograptus guerichi                                     434.90   432.50
Spirograptus turriculatus - Stretograptus crispus         432.50   430.44
Monoclimacis griestoniensis - Monoclimacis crenulata      430.44   429.43
Spirograptus spiralis                                     429.43   429.01
Nanograptus lapworthi - Cyrtograptus insectus             429.01   427.07
Crytograptus centhfugus - Cyrtograptus murchisoni         427.07   426.53
Monograptus riccartonensis - Monograptus belophorus       426.53   426.06
Cyrtograptus rigidus - Cyrtograptus perneri               426.06   424.92
Cyrtograptus lundgreni                                    424.92   424.02
Gothograptus nassa - Pristiograptus dubius                424.02   423.09
Colonograptus preadeubeli - Colonograptus deubeli         423.09   422.81
Colonograptus ludensis                                    422.81   422.65
Neolobograptus nilssoni - Lobograptus progenitor          422.65   422.35
Lobograptus scanicus                                      422.35   421.27
Saetograptus leintwardnensis                              421.27   420.44

#A matrix. Each row is a different zone (a biostratigraphically defined interval,
#marked by the appearance of new 'index' fossil taxa), the row names are the index
#fossils for those zones and the start and end times are listed in the two columns
#of the matrix. I sometimes refer to this as the 'interval times' or 'int.times'
#matrix in paleotree documentation.

#Now let's look at the second component of retioRanges:

retioRanges[[2]]

> retioRanges[[2]]
                      first_int last_int
Rotaretiolites                8        8
Pseudoplegmatograptus         6       10
Pseudoretiolites              2       10
Dabashanograptus              8        8
Retiolites                    8       14
Stomatograptus                8       14
Paraplectograptus             8       14
Pseudoplectograptus          13       14
Sokolovograptus               8       14
Eisenackograptus             13       14
Sagenograptus                14       14
Cometograptus                14       14
Plectograptus                17       19
Plectodinemagraptus          20       20
Semiplectograptus            19       20
Baculograptus                14       16
Doliograptus                 15       15
Gothograptus                 14       16
Papiliograptus               16       16
Spinograptus                 16       17
Holoretiolites               18       18
Neogothograptus              18       18

#Another matrix. Each row in this matrix is a taxon, the rownames tell you what
#taxon (note that this was a generic-level analysis) and the columns are which
#interval (graptolite zones) each taxon first and last appeared in the fossil record.
#The numbers in these columns are references to the rows of the interval times matrix,
#for example, Rotaretiolites first and last appears in interval on the 8th row (which
#is the Monoclimacis griestoniensis - Monoclimacis crenulata zone). I usually refer
#to this matrix as the 'taxon times' or 'taxon.times' matrix.

############################################################
#A Digression on Discrete Interval Data in paleotree

#This list format composed of two matrices, an int.times matrix and a taxon.times
#matrix with two columns each, is the keystone data input of all analyses in paleotree
#using data where taxa are only in discrete time intervals. I usually call such objects
#"timeList" objects in paleotree, particularly in the input arguments of functions.
#
#Most paleontological data is only know from discrete intervals, because it is
#fundamentally difficult to place the beds that fossils are found at a specific date.
#Instead, most fossils can only be placed as coming from discrete intervals, and our
#ability to resolve the appearance dates of a fossil can differ widely based on the
#stratigraphy of a locality, such that one fossil taxon may be first known from a two
#million year long interval in the early Triassic, while its sister taxon may be
#very hard to pin down, maybe as generalized as listed as first appearing in
#the 'Triassic'.
#
#The graptolite data here are exceptional well resolved; note that in the interval-time
#matrix (the first component of retioRanges), many of the intervals are less than a
#million years long. All of the occurrences are also resolved to a single zone, and
#none of the zones overlap or contain one another. This doesn't have to be the case.
#In our example above with the two taxa known from the Triassic, we would list the
#short eartly Triassic interval as one row in the interval times matrix and the entire
#Triassic as another row. The respective taxa would refer to those respective intervals.
#Most paleotree functions won't care about overlapping intervals, but there is an
#important exception. getSampProbDisc and the freqRat, both of which are applied below,
#CANNOT use such data. Those functions assume the intervals they are containing are
#successive and of roughly similar length. Here, the graptolite data are all successive
#but as we can see, there is some variation in interval length.
#
#Now, there are exceptional datasets composed of continuous time data, which require
#a much simpler data structure than discrete interval data. All that is needed in
#paleotree isjust a matrix of first and last appearance dates). Such datasets tend to
#be for very recent fossil records, microfossil data from cores or computationally
#correlated stratigraphic columns, such as the output from CONOP (Sadler, 2001).
#
#However, if you know your fossil appearances aren't well resolved, you really
#should take that uncertainty into account using this 'timeList' structure.
#Getting data into a 'timeList' structure isn't too difficult: just compose
#the interval times and taxon times matrices seperately in your favorite spreadsheet
#program, read them into R with read.table() and then compose them into a list (using,
#of course, list()...). You can then save that list with save() or dput() and reuse it.

###################################################################
#Okay, enough of that! Let's try drawing the diversity curve for this
#discrete interval data.

taxicDivDisc(retioRanges)

#taxicDivDisc counts up all the taxa that COULD occur in each interval (remember
#my point about overlapping intervals, above). We're okay here, because our data is
#in successive intervals. Notice how there's a big generic diversity drop around 424 MYA.
#Also notice how the 'blocky'-ness of the diversity curve also displays a little
#information about interval length: notice the extremely short graptolite zone near
#423 MYA.

#Alright, let's move on to our other data object, retioTree. We can plot it using ape...

plot(retioTree)

#And we get an unscaled tree, partially unresolved (i.e. some polytomies). Cool.

#Now that I've explained discrete interval data in paleotree ad nauseum to you, let's
#actually apply time-scaling methods. Since this is discrete interval data,
#we'll time-scale using the bin_timePaleoPhy function, which applies the simplest
#time-scaling algorithms to discrete interval data, by randomly assigning first and
#last appearance dates within each interval to each taxon, every time it creates a
#new time-scaled tree. The idea is to do this many times, create many time-scaled trees
#but for now, let's just create one.
#
#The simplest time-scaling algorithm for cladograms of fossil taxa is the 'basic'
#method (Norell, 1992; Smith, 1994) which just means every clade is as old as the
#oldest appearance of any taxon in that clade. Let's try that one for now.

#Basic Time-Scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Hey look, it warned us not to interpret a single tree. Heed that warning!
#
#Okay, we can see the cladogram and the time-scaled tree produced. Note three things.
#
#First, the polytomies were not resolved. We'll deal with that in a moment.
#
#Second, the little time-axis at the bottom is relative to the latest tip of the tree
#being at 0, which is unfortunate, as the latest tip of the tree is at ~420 MYA. (I have
#tried but failed to figure out how to alter this. It would require me adding an altered
#plot.phylo to paleotree, which I am incredibly loathe to do.) So, that time axis just
#tells us about relative timing, which still makes it pretty informative.
#
#Third, the 'location' of the tips of our time-scaled tree have been placed at the
#FIRST appearance dates of our taxa, not the LAST appearance dates ('FADs' versus
#'LADs'). I call this choice of 'location' for our tips the 'time of observation' issue.
#This issue really only rears its head when you have persistent morphotaxa, like these
#graptolite genera, which occur/are morphologically recognizable across multiple
#time intervals. Obviously, if we want to know something about lineage diversification,
#we want to include those 'terminal' ranges of the taxa and add them to the terminal
#branches of our time-scaled tree. We might want to do something different if we're
#dealing with analyses of trait evolution. We should think more carefully about what our
#'time of observation' for our taxa really is!
#
#Note that if our time of observation is FAD, then all the branches are essentially
#posited unsampled 'ghost' branches or 'ghost' lineages.
#
#Now let's try including the ranges, using argument 'add.term':

#Basic Time-Scaling with Terminal Taxon Ranges

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Alright, we can compare and see what changed. For example, we can see that the tip with
#Pseudoretiolites shifted, because it's a "long-lived" morphotaxa, so whether
#its time of observation is its first or last appearance makes a large difference.

#Polytomies are still unresolved. Let's randomly resolve those nodes, forcing the trees
#to be dichotomous. This is another action which is stochastic and that shouldn't be done
#to create a single tree, but rather should be done to create many time-scaled trees.
#The argument of importance here to change is 'randres'.

#Basic Time-Scaling with Terminal Ranges and Polytomies Randomly Resolved

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Alright, we've resolved those nodes, as we can see in the cladogram plotted at the top.
#But look at the time-scaled tree! Why are there those 'apparent' polytomies?
#The reason is Zero-Length Branches, which occur in time-scaling phylogenies of fossil
#taxa when a more derived taxon nested within a clade occurs earlier in the fossil record
#than other taxa in that clade. It pushes nodes up against each other under the
#basic time-scaling method, forcing those divergences to occur simultaneously.

#To avoid this, we need to choose a new time-scaling algorithm. One option is the
#'Minimum Branch Length' method (MBL; Laurin, 2004), where we will force all the branch
#lengths in the tree to be some minimum length and thus push nodes back in time.
#To do this, we'll change the the 'type' argument to "mbl" and the minimum length is set
#using the 'variable time' or 'vartime' argument, which we'll set to 1 MY as an example.

#min branch length time-scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
+ vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Wow, that changed things, didn't it? You can see that the MBL method has added ALOT of
#extra branch-length to our tree, essentially positing a lot more unobserved
#evolutionary history across this clade. Notice also the long 'spine' of nodes seperated
#by 1 MY going all the way down most of the tree, due to the ladder-like structure of
#the tree.
#
#Obviously, the choices we make while time-scaling a tree makes assumptions, whether
#they are implicit or explicit, about how much evolutionary history is unsampled.
#Perhaps 1 MY is too long of a minimum branch length for this dataset; we wouldn't
#expect this much 'ghost' evolutionary history. But how can we test that or even take
#into account how much 'ghost' evolutionary history we would expect? And even if we did
#know how much missing time there should be on the phylogeny, we would be extremely
#uncertain in where to place it.
#
#The solution I suggest in a recent paper is a stochastic time-scaling method, where
#node ages are shifted around in time at random while keeping the tree consisten with
#the appearance times of the taxa, loosely alike to jiggling a zipper. This method
#biases the node ages to be more often sampled at further or closer in time, relative
#to likely we think the corresponding amount of unobserved evolutionary is. That amount
#is not just a function of how frequently sampled the fossil record is: it is also
#dependent on how often new lineages originate via branching and go extinct. Thus, this
#time-scaling method requires three rates to calibrate its use: sampling, branching
#and extinction, and thus I named it 'cal3'.

#These three rates must be calculated before we can apply cal3.
#How? paleotree has the tools!

#########################################
#Getting Sampling Rate, Other Rates and Learning to Be Skeptical

#For discrete interval data, there are two paleotree functions of use for ultimately
#calculating sampling rates from our range data.

#First, we can try the freqRat (Foote and Raup, 1996), which is just a simple
#ratio of the frequencies of taxa found in just one, two and three intervals.
#Let's apply that to retioRanges and tell it to plot the histogram of taxon durations.

freqRat(retioRanges,plot=TRUE)

> freqRat(retioRanges,plot=TRUE)
freqRat
0.5925926

#Okay, so the number the freqRat spit out (0.59) is a sampling PROBABILITY, the probability of
#a taxon being sampled at least once per interval. It's not a sampling rate... yet.
#
#But we need to consider the assumptions and drawbacks of the freqRat first.
#
#The freqRat, like getSampProbDisc below, assumes that extinction and sampling are
#Poisson processes, such that the waiting times between events will be exponentially
#distributed. Under such a process, we'd expect the histogram of taxon durations to be
#roughly similar to an exponential distribution still, with the slope of a log scale
#indicative of the extinction rate, but with the number of taxa that occur in only a
#single interval increased greatly. In addition, it is assumed that extinction and
#sampling event frequency has been relatively constant across the entire sample.
#
#The retiolitid data kind of match our expectation: there is an over abundance of taxa
#known from a single interval. However the number of taxa with longer durations is
#patchy. In particular, there are no taxa known from three graptolite zones. This
#probably just reflects the low sample size of the dataset (22) but also reveals a
#a weakness of the method: just the first three frequency values are used in the
#freqRat. If our sample size is low, or if intervals are so finely broke up that there
#are few taxa found in only one, two or three intervals, the freqRat may give poor if
#not extremely misleading results.
#
#getSampProbDisc is an alternative which uses a maximum likelihood optimization to find
#the best fitting sampling probability and extinction rate, using the entire distribution
#of taxon durations.

SPres <- getSampProbDisc(retioRanges)
SPres

> SPres
$Title
[1] "Analysis with 1 time bins and 0 groupings ( 0 and 0 States),with 2 parameters and 22 taxa"

$parameters
extRate sampProb Completeness
0.2028359 0.5454152 0.8672921

$log.likelihood
[1] -41.40539

$AICc
[1] 87.44237

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

#We can see that getSampProbDisc gives a lot of output! Of importance is whether the
#optimizer converged; if $convergence is 0, then we're probably okay. The best-fit
#parameters are at the top. The sampling probability per graptolite zone is 0.54.
#
#Now is the time to be suspicious. Is this a realistic value? To do that, we need to
#compare this value to other published values, but those don't exist, not for graptolite
#zones. Instead, we need to find out how long our graptolite zones are and relate it to
#time. As I pointed out above, our zones are uneven in length, an unfortunate reality of
#the geologic record. But how uneven?

intLength<--apply(retioRanges[[1]],1,diff)
hist(intLength)

meanInt<-mean(intLength)
meanInt

> meanInt
[1] 0.987

#Okay, so we can see there's a lot of short zones and some zones longer than one, but
#altogether the mean is close to 1 million years. Is this an okay amount of variation?
#It's hard to tell, or even to test. My recommendation if you're dataset was like this
#would be to create several versions of the range data, glomming smaller zones onto one
#another to try to create as even length intervals as possible, and then run your
#analyses across those. If there's no difference, you're fine. I'm not going to
#give an example of that here, but if you're playing with real data, I'd recommend you
#try that. Instead, let's go with the idea that our graptolite zones are roughly a
#million years long, so we have a 0.54 probability of sampling a retiolitid genus at
#least once per million year interval.
#
#Another way we could use to consider our estimated sampling probability is compare
#the completeness values, which are not dependent on interval length. These represent
#the proportion of total taxa expected to be sampled in the fossil record. In this case,
#the retiolitid fossil record is estimated as being 87% complete.
#
#Unfortunately, both completeness and sampling probability can have a dependency
#on extinction rate, because these variables depend on how long taxa survive before
#being sampled or the end of a given interval. The instantaneous rate of sampling
#rates, the rate of a sampling event occurring per lineage million years, is the only
#sampling variable that can be estimated independently of extinction rate.
#
#We can transform sampling probability to sampling rate using the function sProb2sRate
#and the mean interval length:

sRate <- sProb2sRate(SPres[[2]][2],int.length=meanInt)
sRate

> sRate
[1] 0.7987547

#So, a sampling rate of 0.79 per Lmy.
#
#So, to return to the issue I raised earlier: are our estimates of sampling reasonable?
#Well, there's aren't a whole lot of published estimates of sampling variables. I got
#rather sidetracked by that and went through the literature and collected rate estimates
#from as many publications as possible. (I'll post those later on my blog.)
#
#Foote and Sepkoski gave the only previous graptoloid sampling estimates; specifically
#generic sampling probabilities. I've converted those to sampling rates, which
#are about 0.18-0.35 per Lmy (Foote and Sepkoski gave different estimates based on
#different ways of breaking up intervals). So, our estimate for the retiolitids is
#nearly twice that. This makes our estimates suspicious but not unrealistic: its
#possible that retiolitids are twice better sampled than other graptolite genera. It
#may also be because of factors we haven't accounted for. Remember how the diversity
#curve showed what looked like an extinction event? That was the well-known 'lundgreni
#event'. That, or some other variation in sampling or extinction rate, may be enough
#to have biased our estimates. Alternatively, maybe our sample size it too low to
#get as good an estimate as we'd like.
#
#getSampProbDisc allows for us to fit different parameters to different groups of taxa
#in our dataset, such as taxa in different time-intervals or different clades. This
#can allow us to consider such potential sources of variation, and use AIC to compare
#the fit of these models. Unfortunately, in this case with only 22 taxa, splitting the
#dataset doesn't seem very feasible, so I won't try that here.
#
#For the sake of this tutorial, I'll continue with the sampling rate we estimated,
#but in a real analysis we would probably want to spend a lot more time assessing the
#accuracy of our estimate.
#
#Now, for cal3, we also need extinction rate and branching rate. One way to get those
#would be to estimate them using per capita rates (Foote, 2000), but getSampProbDisc
#also calculates extinction rate, and its better to take this joint estimate then get
#extinction rate from a seperate analysis. We might also assume that branching rate is
#equal to extinction rate. This is an extremely simplifying assumption, but one that
#oddly seems to hold true in the fossil record (Stanely, 1979; Sepkoski, 1998). In
#general, I'd suggest getting rates from both getSampProbDisc and from a per capita
#rates analysis (not yet implemented in paleotree), and comparing the rates they
#estimate, and also to test if branching rate is similar to extinction rate.

#To get the extinction rate (also the branching rate for this tutorial) from
#getSampProbDisc, we simply divide the extinction rate it calculated by the interval
#length.

divRate<-SPres[[2]][1]/meanInt

#Alright and now we have everything we need to use cal3 with the retiolitid dataset.
#Note however that all of the above won't apply for every sort of dataset we might
#want to analyze: if your data doesn't have persistent taxa, then all taxa occur in
#a single interval and freqRat and getSampProbDisc will be of no use to you, whatsoever.
#There's are really any direct method that let you estimate sampling rates in those
#cases, but there might be some workarounds. I'll talk more about that in the blog post
#where I give all those published sampling rate estimates I've been collecting.

########################################################
#And Now, We Can Finally Time-Scale with cal3!

#By default, cal3 randomly resolves polytomies and adds on the terminal ranges of
#taxa, unlike the timePaleoPhy function we applied in an earlier part of this tutorial.
#Just like with timePaleoPhy, there is a 'bin' version of cal3TimePaleoPhy for
#discrete time interval data.

ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
sampRate=sRate,ntrees=1,plot=TRUE)

> ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+ sampRate=sRate,ntrees=1,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

#Note well the copious warnings about not interpreting a single tree! cal3 is a
#stochastic time-scaling method.
#
#With such high sampling rates, nodes seperated by ZLBs under basic only move a little
#back further in time: it's unlikely that there is much more unobserved evolutionary
#history.
#
#We can see the diversity curve assumed by this tree...

phyloDiv(ttree)

#But if we really wanted to see the uncertainty in the phylogenetic diversity estimate
#we would need to create a sample of time-scaled phylogenies and calculate
#the median diversity curve across that sample. (Function multiDiv can do this!)
#
#Let's try making a hundred trees and putting them in multiDiv.

ttrees <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
sampRate=sRate,ntrees=100,plot=FALSE)
multiDiv(ttrees)

#The thick black line is the median diversity, the gray blocks are 95% quantiles
#across the sample of time-scaled trees.

#Lastly, we can also change the times of observation in cal3 using the argument
#'FAD.only'...

bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
    sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)

> bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+     sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
        Rotaretiolites, Pseudoretiolites, Pseudoplegmatograptus, Dabashanograptus, Stomatograptus, Retiolites, ...

Rooted; includes branch lengths.

#Cool? Now you should know everything you need to play around with cal3!

###################################################
#References
#Source for cladogram and zonal ranges for genera:
#Bates, D. E. B., A. Kozlowska, and A. C. Lenz. 2005. Silurian retiolitid graptolites: Morphology and evolution. Acta Palaeontologica Polonica 50(4):705-720.

#Source for interval dates for graptolite zones:
#Sadler, P. M., R. A. Cooper, and M. Melchin. 2009. High-resolution, early Paleozoic (Ordovician-Silurian) time scales. Geological Society of America Bulletin 121(5-6):887-906.

#Other references:
#Foote, M. 1997. Estimating Taxonomic Durations and Preservation Probability. Paleobiology 23(3):278-300.
#Foote, M. 2000. Origination and extinction components of taxonomic diversity: general problems. Pp. 74-102. In D. H. Erwin, and S. L. Wing, eds. Deep Time: Paleobiology's Perspective. The Paleontological Society, Lawrence, Kansas.
#Foote, M., and D. M. Raup. 1996. Fossil preservation and the stratigraphic ranges of taxa. Paleobiology 22(2):121-140.
#Foote, M., and J. J. Sepkoski. 1999. Absolute measures of the completeness of the fossil record. Nature 398(6726):415-417.
#Laurin, M. 2004. The Evolution of Body Size, Cope's Rule and the Origin of Amniotes. Systematic Biology 53(4):594-622.
#Norell, M. A. 1992. Taxic origin and temporal diversity: the effect of phylogeny. Pp. 89-118. In M. J. Novacek, and Q. D. Wheeler, eds. Extinction and phylogeny. Columbia University Press, New York.
#Sadler, P. 2001. Constrained optimization approaches to the paleobiologic correlation and seriation problems: a user's guide and reference manual to the CONOP program family, v.6.1. University of California, Riverside.
#Sepkoski, J. J. 1998. Rates of speciation in the fossil record. Philosophical Transactions: Biological Sciences 353(1366):315-326.
#Smith, A. B. 1994. Systematics and the fossil record: documenting evolutionary patterns. Blackwell Scientific, Oxford.
#Stanley, S. M. 1979. Macroevolution: Patterns and Process. W. H. Freeman, Co., San Francisco.