Monday, June 17, 2013

Sampling Rates and Dealing with Fossil Records Less Than Conducive to Their Estimation



So, in my last post, I went at length into a lot of the nuances we need to consider when we estimate sampling rates and sampling probabilities from the frequency distribution of ranges in the fossil record. This gave me the impetus to do something I'd been meaning to do for a while: actually track down the various estimates of sampling probability or sampling rate, turn them all into per Lmy sampling rates and compare them. Here they are, ordered from greatest to least:

Sampling Rate (per Lmy)    Taxon    Reference
17.4577776    Oklahoma Trilobites Species (Single Locality)    Foote and Raup, 1996
1.660731207    Cenozoic Macroperforate Planktic Forams    Ezard et al., 2011
1.046292901    Neogene Iberian Mammal Genera    Alba et al., 2001
0.868900035    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.797580429    Neogene Iberian Mammal Species    Alba et al., 2001
0.483501825    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.437808292    Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.434450018    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.410974389    N.A. Cenozoic Mammal Species    Foote and Raup, 1996
0.408044166    European Jurassic Bivalve Species    Foote and Raup, 1996
0.385502461    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.357475065    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.334331480    Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.244922481    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.214987601    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.201575023    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.196147211    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.192764386    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.182563024    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.159239636    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.153449104    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.145183217    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.133448941    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.127046142    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.122426282    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.116261536    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.115432413    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.112799434    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.099021026    Early Paleozoic Crinoid Genera    Foote and Raup, 1996
0.093263457    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.086915600    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.084205114    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.078324167    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.075548263    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.067297159    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.054279636    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.052305831    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.039758685    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.029548896    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.023242431    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.011674604    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.009677980    Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.009326054    Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.008051675    Cenozoic Chiroptera Genera    Eiting and Gunnell, 2009

Most of these were published as sampling probabilities, so I just found the mean interval length and got the sampling rater per Lmy using my paleotree function 'sProb2sRate'. The Foote and Sepkoski values weren't published as numbers, just shown in a plot, so I used a ruler rather than send an email to my old advisor. It's just a blog post, not a research paper.

Note that every group in Foote and Sepkoski had two sampling probabilities plotted, for two different systems of how to break the Phanerozoic into smaller intervals. Interestingly, this reveals considerable variation in the estimated sampling rate, suggesting that the estimates from freqRat or from the maximum likelihood variant can be sensitive to how we break up intervals (as I suggested in the last blog post).

The rate estimated for the Oklahoma trilobite data is wildly high, but note this is a densely sampled single locality in the Ordovician. Beware the disconnect between global estimates and local estimates... and beware the fact that they don't cleanly scale between each other. If all of your taxa are from Australia, if you do a global analysis, it will be just an analysis of Australia that you're only pretending is global. Maybe that is the 'global' rate, but it depends on your assumptions about whether there are taxa that you didn't sample in non-Australia regions of the world. There's also taxonomic issues here too: many of these analyses were done at the supraspecific level (note that doesn't mean persistent taxa are only at the genus level and above; paleontologists just have a tendency to analyze genera, not species).

Now let's look at the other end of the scale. We've got some fish groups, then polychaetes, then bats. Makes sense right? If you had to imagine a really bad fossil record, squishy worms and fragile little bats sounds about right. But hold on. Let's ask ourselves, what's missing from this list. Well, there aren't any terrestrial groups on this other than mammals. I used to think this was just because no one had done it yet.

Then I started talking to various vertebrate paleontologists and researchers working with datasets from non-Cenozoic vertebrate paleontology datasets and they told me that in other vertebrates groups, like dinosaurs, every species is found in what essentially amounts to a single stratigraphic interval, at least at the global scale. They might be found throughout a unit or formation that represents a limited window of time, like the Morrison or whatever, but you don't find the same morphotaxon persisting across intervals like you do in invertebrates. The people I talk to often call it 'point occurrence' data so I'll use that term for the rest of this post, even though that isn't exactly ideal: 'point-occurrence' really isn't completely accurate if some of those taxa are found at multiple horizons in a unit or formation, its less descriptive than we might desire and 'point occurrence' is sometimes used to mean other things. (There's other terms too like 'singleton' which could apply here, but their usage is even more muddled.)

Disclaimer: I know nothing about vertebrates or their fossil record. As evidence, I present that I squeaked by with a C- in Primate Anthropology, my one attempt to learn anything about vertebrates. I felt it was important to start off this post with that, because, great golly, I don't want anyone reading this and thinking I know anything about the vertebrate fossil record. Heck, I haven't even seen any datasets firsthand. All my information is hearsay from other scientists, most of whom contacted me in trying to time-scale their trees. If someone can show me a range chart that has a bunch of sauropod species persisting through multiple geologic intervals, then great, I'd love to see it.

So, anyway, I found this description of vertebrate paleo datasets to be really strange. Why are they so different from the ones we can get sampling rates from, like those above? One explanation I have been given is that this occurs because groups like dinosaurs are just more poorly sampled than shelly invertebrate groups. If  I try to simulate a really poorly sampled clade but still have a number of sampled taxa, under a simple model where extinction and sampling are homogenous Poisson processes, I can't make all the taxa point occurrences. There's a real good reason for this: as the sampling rate decreases, the number of original true taxa must increase, such that the sampled taxa is a tiny portion of a vast unsampled diversity. As the number of total taxa increases, you get more and more extremely long-lived taxa, a tiny minority, but they become so increasingly long-live that the probability of NOT sampling any of them one of them twice (and thus getting a persistent taxon) is incredibly low. I cannot avoid simulated datasets with persistent taxa, not under this simple model of extinction and sampling. This is true even if I break up the time-scale of these simulations into discrete intervals. I still get persistent taxa with first and last appearances sampled in multiple intervals.

So, to me, that means it isn't just 'poorer sampling'. There needs to be something else going on. We're missing some explanatory factor that could cause morphotaxa to either be very 'short-lived' in the fossil record and/or would make it very difficult to sample any morphotaxon that might survive into another stratigraphic interval. Of the former, it is possible to imagine a fossil record of point occurrences resulting from extremely high anagenesis rates, such that species really don't have any opportunity to be persistent. However, that means there's something about dinosaurs and other such groups that is fundamentally different from groups like North American mammals and fossil invertebrates. It doesn't need to represent a real evolutionary difference: it might be that differences in the available material gives more characters and thus allows for much much finer species delimination (or, possibly, that species delimination is being overly aggressive). The later, where some factor makes it difficult to sample any persist morphotaxon more than once, could occur if there is some sort of complex regional alternation in the deposition and preservation of rock packages that could preserve vertebrates; i.e. North America get the Morrison, then nothing for a long time, something like that (I have no idea if the preceding statement is true). But then why the shift going from Mesozoic dinosaurs to Cenozoic mammals?

Remember, to use the freqRat or the variants of it, you need to have some persistent taxa so that means there are some persistent bat taxa, found in multiple stages (as that's what Eitling and Gunnell used) but not dinosaurs. Now, an important distinction here is maybe it's that Eitling and Gunnell used genera. So, okay, forget bats. But why do Alroy's Cenozoic mammal species have persistent taxa? (as analyzed in Foote and Raup) Whatever the reason for this point occurrence pattern in some groups, it will probably be obscure for a while. I'd bet it is probably the result of multiple factors.

But, to get back to the practical point, what's a researcher with a dataset unanimously dominated by point occurrence taxa supposed to do if they want to use cal3, which absolutely requires an estimate of the sampling rate? (We'll ignore also needing branching and extinction rate estimates for the moment.)

It's really quite a conundrum. These are my thoughts.

First is that the freqRat and the ML variant aren't the only ways to get sampling rate. Foote (2004) also introduced a survivorship curve method, but that also won't work if you don't have persistent taxa. Solow and Smith (1997) presented a method that obtained sampling rate estimates from the waiting time distribution between sampling events in single taxa; that information definitely isn't available here, at least not at the global, phylogenetic scale we're trying to get sampling rates for.. There's also capture-mark-recapture methods, which also estimate sampling probabilities, but I don't know much about them or whether they could be applied to data that effectively has no recaptures. An idea that's been banging around in my skull is that Friedman and Brazeau (2012) introduced an idea that might bear water: get sampling estimates from the ghost branch distribution. But I think that issue is a lot more complicated than Friedman and Brazeau suggested, and there's some theoretical roadblocks relating to differentiation patterns (budding versus bifurcation, anagenesis, etc) that I haven't thought around yet.

So, that exhausts the methods (that I know of) for actually estimating the sampling rate.

One utter kludge would be to just pull them out of thin air. Ask yourself what fossil records on the chart above are probably of similar quality and use those values. Even better, choose a range of likely values and use those. The problem with this is that maybe your fossil record is even worse than the ones described here. Also, these values are for a limited selection of data: most of them are global and their relationship to regional datasets is thus obscure. Many of them are at the generic level, particularly the more poorly sampled records, and it is difficult to envision any way of converting a generic estimate of sampling to a species-level estimate (although maybe paraclade models might offer such a route; see Patzkowsky, 1995; Foote, 2012). Overall, I cannot recommend that pulling values out of thin air is the right way to use cal3. It's a quick and dirty way of getting something done, just like using a 'taxon tree' as a replacement for a cladogram, but its ultimately difficult to robustly defend the assumptions you'd be making in such an analysis. Class projects and pilot work can have quick and dirty techniques, but for real research we want to be able to trust the conclusions of, we would want something that involves less guesswork.

And there's an addition problem to such cross-eyed guesstimations: remember my simulation woes above? We can't get point occurrence data under our typical models of sampling and extinction events, where they occur as under a Poisson process. We know the assumptions of our models are not valid in those cases. And that means even ignoring the 'thin air' part of pulling sampling rates out of thin air, the models that cal3 assumes probably don't apply to this type of data. And it's not just cal3 that assumes those models; that's the typical assumptions of many paleobiological methods that involve putting any sort of expectation on a sampling process. Of course, just because model assumptions are violated doesn't mean a method will return the wrong answer. Methods can be more robust than that. However, without knowing what leads to point occurrence type data it is hard to make predictions, since we don't know exactly what about our model assumptions are incorrect. My presumption would be that if this data type occurs because sampling is much more complicated than a global Poisson process, then cal3 will give fairly wrong answers. Again, though, that's probably also true of many other methods too. If, instead, it's because these are groups where morphotaxa are extremely finely delimited, such that you can't get persistent taxa, then maybe the cal3 method is okay. This violation of model assumptions just means morphotaxa are 'terminating' via anagenetic pseudoextinction much more often than lineages terminating due to extinction. In other words, there's a disconnect between the morphological differentiation of lineages and their birth-death dynamics. cal3's assumption have more to do with the probability of not sampling a clade at all or not, something which should be independent of the morphological differentiation pattern within that clade.

Regardless, and some people have pushed me on this point, I do not claim that the cal3 method is applicable to all datasets. I fully admit that it is not! I did this to time-scale graptolite phylogenies and while I want the method to be as useful and general for as many people as possible, the only reasonable starting point for making an explicitly model-based time-scaling method was to start with the simplest models of sampling and diversification possible. Simple models are the only logical foundation for moving forward. And while I might be able to realistically defend the assumptions of those simple models in graptolites or other 'well-sampled' groups, as a scientific community, we have a lot more work ahead of us to tackle every dataset. Even though sampling processes are as important as origination and extinction in shaping the fossil record, if not more so, I think we still spend much more effort in understanding and modeling the dynamics of diversification in paleobiology than sampling processes.

But this difficulty in applying cal3 to every dataset poses a new question: are the other time-scaling methods that we have available sufficient for the sort of phylogeny-based analyses of evolution ('phylogenetic comparative methods') that we want to do in the fossil record?

Well, we can talk more about that this Sunday in Snowbird, if you'll be there. ;)
http://evo2013.evolutionmeeting.org/engine/search/index.php?func=detail&aid=149

Cheers,
-Dave

Friday, June 14, 2013

A Tutorial to cal3 Time-Scaling Using a Real Dataset

Hi guys!

So, my paper on time-scaling methods for trees of fossil taxa has been accepted to Methods in Ecology and Evolution! You won't see it right away though; it's part of a special issue of MEE and all the papers included in it will be held back for a simultaneous online release, probably in August (I guess?). Anyway, until then, you can go read chapter 3 of my dissertation, it's mostly the same thing, and its up on UMI Proquest now. Or you can contact me for a copy.

Anyway, to commemorate the acceptance, I have actually pulled a real dataset out of the literature and written up a tutorial about applying the various time-scaling methods to this data, in particular the cal3 method. The data in question (of course) a graptolite group, the Silurian retiolitids. It's a generic dataset, but, hey, it's just a tutorial right! Chill, dudes.

Such a tutorial was persuasively argued for by one of my reviewers at MEE, but I had to respond with "I don't have the room!" However, I've got infinite room here on this blog and also it means anyone can come see it here, for free, forever (well as long as Blogger exists...). So here it is! Thanks to Reviewer #1 for making such a strong case for it. Sorry, Reviewer #1, but I did not take you up on that other suggestion that I use a vertebrate dataset example. I just know graptolites better, and vertebrates are a headache to get sampling rates from (I'll touch on that in a later blog post, thanks).

Now, this is a very LONG tutorial. Why? I want to cover in detail all the hiccups that can happen on the way to wanting to time-scale a tree, and I really wanted to nail down what the data structure looks like. So, apologies for the length, but I think anyone actually attempting this stuff will appreciate the detail I go into, particularly, how the degree we should exercise caution and skepticism in estimating sampling rates.

I've written this tutorial much like a cave man from 1984 might have, as an overly chatty, heavily commented R script. If someone who knows LaTEX and Sweave better than me wants to helpfully convert this to a document suitable for release as a vignette file for paleotree well... Well, I will buy that person a few beers at Evolution! Otherwise, you guys are getting just this: an R script.

I've pasted the script text below, with figures inserted and a few key console outputs. The images overhang the margins. I don't know why and I can't seem to fix it. Stupid Blogger! But rather than read my poorly formatted blog post, I instead recommend downloading the script and the dataset and going through it yourself in R! It's more... interactive, that way!

You can get the retiolitid data file  and the R script as a zipped archive here:
http://webpages.sdsmt.edu/~dbapst/cal3tutorial/cal3tutorial.zip

This gave me the idea for another post about published estimates of sampling rates, which I'll put up in a few days. I hope you all enjoy this tutorial as much as I enjoyed writing it,
-Dave


#############################################################
#A Tutorial to Time-scaling Methods with the Retiolitinae
#David W. Bapst
#06-12-13

#Introduction

#The package paleotree has functions for time-scaling phylogenies of fossil
#taxa but none of the help examples give a concrete case of the functions being
#applied to real data. Instead, I opted to only give examples using simulated data.
#The reasons for this are simply that real data tends to be very complex, and
#although the functions can certainly be applied to real data, it will often require
#more effort than what can really be put in an R help file.
#
#However, without a clear example of applying the functions to real data, I have
#opened the door for lots of confusion about how a user should do this. Here, I
#use a simple dataset of graptoloid generic ranges and relationships from a recent
#publication (Bates et al., 2005) to exemplify how the time-scaling functions in
#in paleotree work, what the data input looks like and just how complicated even
#a simple real dataset can be.
#
#By the way, our example data today is of the Silurian retiolitids, which are really
#cool looking. Check 'em out!
#http://www.insugeo.org.ar/libros/cg_18/FIG_1.jpg

######################################
#Know Your Data

#First, we need to load paleotree...
library(paleotree)

#Next, we need to load the retiolitid data file, which needs to be in our working dir.
load("retiolitinae.rda")

#You can also load the data directly from the current version of paleotree:
data(retiolitinae)
#if this doesn't work, you have an old version of paleotree! Go download a new one.


#And that should have added two new objects to our workspace, retioTree and retioRanges.

#Let's look at retioRanges first. This is our temporal/interval data.

#retioRanges

str(retioRanges)

> str(retioRanges)
List of 2
 $ int.times  :'data.frame':    20 obs. of  2 variables:
  ..$ start_time: num [1:20] 440 439 437 437 436 ...
  ..$ end_time  : num [1:20] 439 437 437 436 435 ...
 $ taxon.times:'data.frame':    22 obs. of  2 variables:
  ..$ first_int: int [1:22] 8 6 2 8 8 8 8 13 8 13 ...
  ..$ last_int : int [1:22] 8 10 10 8 14 14 14 14 14 14 ...

#It's a list with two components, both matrices. Let's see what they are!

#First component:
retioRanges[[1]]

> retioRanges[[1]]
                                                      start_time end_time
Coronograptus cyphus                                      440.18   439.37
Demirastrites triangulatus - Demirastrites pectinatus     439.37   437.46
Pernerograptus argenteus                                  437.46   436.97
Lituigraptus convolutus                                   436.97   435.85
Stimulograptus sedgwicki                                  435.85   434.90
Spirograptus guerichi                                     434.90   432.50
Spirograptus turriculatus - Stretograptus crispus         432.50   430.44
Monoclimacis griestoniensis - Monoclimacis crenulata      430.44   429.43
Spirograptus spiralis                                     429.43   429.01
Nanograptus lapworthi - Cyrtograptus insectus             429.01   427.07
Crytograptus centhfugus - Cyrtograptus murchisoni         427.07   426.53
Monograptus riccartonensis - Monograptus belophorus       426.53   426.06
Cyrtograptus rigidus - Cyrtograptus perneri               426.06   424.92
Cyrtograptus lundgreni                                    424.92   424.02
Gothograptus nassa - Pristiograptus dubius                424.02   423.09
Colonograptus preadeubeli - Colonograptus deubeli         423.09   422.81
Colonograptus ludensis                                    422.81   422.65
Neolobograptus nilssoni - Lobograptus progenitor          422.65   422.35
Lobograptus scanicus                                      422.35   421.27
Saetograptus leintwardnensis                              421.27   420.44

#A matrix. Each row is a different zone (a biostratigraphically defined interval,
#marked by the appearance of new 'index' fossil taxa), the row names are the index
#fossils for those zones and the start and end times are listed in the two columns
#of the matrix. I sometimes refer to this as the 'interval times' or 'int.times'
#matrix in paleotree documentation.

#Now let's look at the second component of retioRanges:

retioRanges[[2]]

> retioRanges[[2]]
                      first_int last_int
Rotaretiolites                8        8
Pseudoplegmatograptus         6       10
Pseudoretiolites              2       10
Dabashanograptus              8        8
Retiolites                    8       14
Stomatograptus                8       14
Paraplectograptus             8       14
Pseudoplectograptus          13       14
Sokolovograptus               8       14
Eisenackograptus             13       14
Sagenograptus                14       14
Cometograptus                14       14
Plectograptus                17       19
Plectodinemagraptus          20       20
Semiplectograptus            19       20
Baculograptus                14       16
Doliograptus                 15       15
Gothograptus                 14       16
Papiliograptus               16       16
Spinograptus                 16       17
Holoretiolites               18       18
Neogothograptus              18       18

#Another matrix. Each row in this matrix is a taxon, the rownames tell you what
#taxon (note that this was a generic-level analysis) and the columns are which
#interval (graptolite zones) each taxon first and last appeared in the fossil record.
#The numbers in these columns are references to the rows of the interval times matrix,
#for example, Rotaretiolites first and last appears in interval on the 8th row (which
#is the Monoclimacis griestoniensis - Monoclimacis crenulata  zone). I usually refer
#to this matrix as the 'taxon times' or 'taxon.times' matrix.

############################################################
#A Digression on Discrete Interval Data in paleotree

#This list format composed of two matrices, an int.times matrix and a taxon.times
#matrix with two columns each, is the keystone data input of all analyses in paleotree
#using data where taxa are only in discrete time intervals. I usually call such objects
#"timeList" objects in paleotree, particularly in the input arguments of functions.
#
#Most paleontological data is only know from discrete intervals, because it is
#fundamentally difficult to place the beds that fossils are found at a specific date.
#Instead, most fossils can only be placed as coming from discrete intervals, and our
#ability to resolve the appearance dates of a fossil can differ widely based on the
#stratigraphy of a locality, such that one fossil taxon may be first known from a two
#million year long interval in the early Triassic, while its sister taxon may be
#very hard to pin down, maybe as generalized as listed as first appearing in
#the 'Triassic'.
#
#The graptolite data here are exceptional well resolved; note that in the interval-time
#matrix (the first component of retioRanges), many of the intervals are less than a
#million years long. All of the occurrences are also resolved to a single zone, and
#none of the zones overlap or contain one another. This doesn't have to be the case.
#In our example above with the two taxa known from the Triassic, we would list the
#short eartly Triassic interval as one row in the interval times matrix and the entire
#Triassic as another row. The respective taxa would refer to those respective intervals.
#Most paleotree functions won't care about overlapping intervals, but there is an
#important exception. getSampProbDisc and the freqRat, both of which are applied below,
#CANNOT use such data. Those functions assume the intervals they are containing are
#successive and of roughly similar length. Here, the graptolite data are all successive
#but as we can see, there is some variation in interval length.
#
#Now, there are exceptional datasets composed of continuous time data, which require
#a much simpler data structure than discrete interval data. All that is needed in
#paleotree isjust a matrix of first and last appearance dates). Such datasets tend to
#be for very recent fossil records, microfossil data from cores or computationally
#correlated stratigraphic columns, such as the output from CONOP (Sadler, 2001).
#
#However, if you know your fossil appearances aren't well resolved, you really
#should take that uncertainty into account using this 'timeList' structure.
#Getting data into a 'timeList' structure isn't too difficult: just compose
#the interval times and taxon times matrices seperately in your favorite spreadsheet
#program, read them into R with read.table() and then compose them into a list (using,
#of course, list()...). You can then save that list with save() or dput() and reuse it.

###################################################################
#Okay, enough of that! Let's try drawing the diversity curve for this
#discrete interval data.

taxicDivDisc(retioRanges)



#taxicDivDisc counts up all the taxa that COULD occur in each interval (remember
#my point about overlapping intervals, above). We're okay here, because our data is
#in successive intervals. Notice how there's a big generic diversity drop around 424 MYA.
#Also notice how the 'blocky'-ness of the diversity curve also displays a little
#information about interval length: notice the extremely short graptolite zone near
#423 MYA.

#Alright, let's move on to our other data object, retioTree. We can plot it using ape...

plot(retioTree)



#And we get an unscaled tree, partially unresolved (i.e. some polytomies). Cool.

#Now that I've explained discrete interval data in paleotree ad nauseum to you, let's
#actually apply time-scaling methods. Since this is discrete interval data,
#we'll time-scale using the bin_timePaleoPhy function, which applies the simplest
#time-scaling algorithms to discrete interval data, by randomly assigning first and
#last appearance dates within each interval to each taxon, every time it creates a
#new time-scaled tree. The idea is to do this many times, create many time-scaled trees
#but for now, let's just create one.
#
#The simplest time-scaling algorithm for cladograms of fossil taxa is the 'basic'
#method (Norell, 1992; Smith, 1994) which just means every clade is as old as the
#oldest appearance of any taxon in that clade. Let's try that one for now.

#Basic Time-Scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
    ntrees=1,plot=TRUE)


> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
        Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.


#Hey look, it warned us not to interpret a single tree. Heed that warning!
#
#Okay, we can see the cladogram and the time-scaled tree produced. Note three things.
#
#First, the polytomies were not resolved. We'll deal with that in a moment.
#
#Second, the little time-axis at the bottom is relative to the latest tip of the tree
#being at 0, which is unfortunate, as the latest tip of the tree is at ~420 MYA. (I have
#tried but failed to figure out how to alter this. It would require me adding an altered
#plot.phylo to paleotree, which I am incredibly loathe to do.) So, that time axis just
#tells us about relative timing, which still makes it pretty informative.
#
#Third, the 'location' of the tips of our time-scaled tree have been placed at the
#FIRST appearance dates of our taxa, not the LAST appearance dates ('FADs' versus
#'LADs'). I call this choice of 'location' for our tips the 'time of observation' issue.
#This issue really only rears its head when you have persistent morphotaxa, like these
#graptolite genera, which occur/are morphologically recognizable across multiple
#time intervals. Obviously, if we want to know something about lineage diversification,
#we want to include those 'terminal' ranges of the taxa and add them to the terminal
#branches of our time-scaled tree. We might want to do something different if we're
#dealing with analyses of trait evolution. We should think more carefully about what our
#'time of observation' for our taxa really is!
#
#Note that if our time of observation is FAD, then all the branches are essentially
#posited unsampled 'ghost' branches or 'ghost' lineages.
#
#Now let's try including the ranges, using argument 'add.term':

#Basic Time-Scaling with Terminal Taxon Ranges

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
    add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
        Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.


#Alright, we can compare and see what changed. For example, we can see that the tip with
#Pseudoretiolites shifted, because it's a "long-lived" morphotaxa, so whether
#its time of observation is its first or last appearance makes a large difference.

#Polytomies are still unresolved. Let's randomly resolve those nodes, forcing the trees
#to be dichotomous. This is another action which is stochastic and that shouldn't be done
#to create a single tree, but rather should be done to create many time-scaled trees.
#The argument of importance here to change is 'randres'.

#Basic Time-Scaling with Terminal Ranges and Polytomies Randomly Resolved

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
    randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
        Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.


#Alright, we've resolved those nodes, as we can see in the cladogram plotted at the top.
#But look at the time-scaled tree! Why are there those 'apparent' polytomies?
#The reason is Zero-Length Branches, which occur in time-scaling phylogenies of fossil
#taxa when a more derived taxon nested within a clade occurs earlier in the fossil record
#than other taxa in that clade. It pushes nodes up against each other under the
#basic time-scaling method, forcing those divergences to occur simultaneously.

#To avoid this, we need to choose a new time-scaling algorithm. One option is the
#'Minimum Branch Length' method (MBL; Laurin, 2004), where we will force all the branch
#lengths in the tree to be some minimum length and thus push nodes back in time.
#To do this, we'll change the the 'type' argument to "mbl" and the minimum length is set
#using the 'variable time' or 'vartime' argument, which we'll set to 1 MY as an example.

#min branch length time-scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
    vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
+ vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
        Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.


#Wow, that changed things, didn't it? You can see that the MBL method has added ALOT of
#extra branch-length to our tree, essentially positing a lot more unobserved
#evolutionary history across this clade. Notice also the long 'spine' of nodes seperated
#by 1 MY going all the way down most of the tree, due to the ladder-like structure of
#the tree.
#
#Obviously, the choices we make while time-scaling a tree makes assumptions, whether
#they are implicit or explicit, about how much evolutionary history is unsampled.
#Perhaps 1 MY is too long of a minimum branch length for this dataset; we wouldn't
#expect this much 'ghost' evolutionary history. But how can we test that or even take
#into account how much 'ghost' evolutionary history we would expect? And even if we did
#know how much missing time there should be on the phylogeny, we would be extremely
#uncertain in where to place it.
#
#The solution I suggest in a recent paper is a stochastic time-scaling method, where
#node ages are shifted around in time at random while keeping the tree consisten with
#the appearance times of the taxa, loosely alike to jiggling a zipper. This method
#biases the node ages to be more often sampled at further or closer in time, relative
#to likely we think the corresponding amount of unobserved evolutionary is. That amount
#is not just a function of how frequently sampled the fossil record is: it is also
#dependent on how often new lineages originate via branching and go extinct. Thus, this
#time-scaling method requires three rates to calibrate its use: sampling, branching
#and extinction, and thus I named it 'cal3'.

#These three rates must be calculated before we can apply cal3.
#How? paleotree has the tools!

#########################################
#Getting Sampling Rate, Other Rates and Learning to Be Skeptical

#For discrete interval data, there are two paleotree functions of use for ultimately
#calculating sampling rates from our range data.

#First, we can try the freqRat (Foote and Raup, 1996), which is just a simple
#ratio of the frequencies of taxa found in just one, two and three intervals.
#Let's apply that to retioRanges and tell it to plot the histogram of taxon durations.

freqRat(retioRanges,plot=TRUE)

> freqRat(retioRanges,plot=TRUE)
  freqRat
0.5925926


#Okay, so the number the freqRat spit out (0.59) is a sampling PROBABILITY, the probability of
#a taxon being sampled at least once per interval. It's not a sampling rate... yet.
#
#But we need to consider the assumptions and drawbacks of the freqRat first.
#
#The freqRat, like getSampProbDisc below, assumes that extinction and sampling are
#Poisson processes, such that the waiting times between events will be exponentially
#distributed. Under such a process, we'd expect the histogram of taxon durations to be
#roughly similar to an exponential distribution still, with the slope of a log scale
#indicative of the extinction rate, but with the number of taxa that occur in only a
#single interval increased greatly. In addition, it is assumed that extinction and
#sampling event frequency has been relatively constant across the entire sample.
#
#The retiolitid data kind of match our expectation: there is an over abundance of taxa
#known from a single interval. However the number of taxa with longer durations is
#patchy. In particular, there are no taxa known from three graptolite zones. This
#probably just reflects the low sample size of the dataset (22) but also reveals a
#a weakness of the method: just the first three frequency values are used in the
#freqRat. If our sample size is low, or if intervals are so finely broke up that there
#are few taxa found in only one, two or three intervals, the freqRat may give poor if
#not extremely misleading results.
#
#getSampProbDisc is an alternative which uses a maximum likelihood optimization to find
#the best fitting sampling probability and extinction rate, using the entire distribution
#of taxon durations.

SPres <- getSampProbDisc(retioRanges)
SPres

> SPres
$Title
[1] "Analysis with 1 time bins and 0 groupings ( 0 and 0 States),with 2 parameters and 22 taxa"

$parameters
     extRate     sampProb Completeness
   0.2028359    0.5454152    0.8672921

$log.likelihood
[1] -41.40539

$AICc
[1] 87.44237

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

#We can see that getSampProbDisc gives a lot of output! Of importance is whether the
#optimizer converged; if $convergence is 0, then we're probably okay. The best-fit
#parameters are at the top. The sampling probability per graptolite zone is 0.54.
#
#Now is the time to be suspicious. Is this a realistic value? To do that, we need to
#compare this value to other published values, but those don't exist, not for graptolite
#zones. Instead, we need to find out how long our graptolite zones are and relate it to
#time. As I pointed out above, our zones are uneven in length, an unfortunate reality of
#the geologic record. But how uneven?

intLength<--apply(retioRanges[[1]],1,diff)
hist(intLength)

meanInt<-mean(intLength)
meanInt

> meanInt
[1] 0.987

#Okay, so we can see there's a lot of short zones and some zones longer than one, but
#altogether the mean is close to 1 million years. Is this an okay amount of variation?
#It's hard to tell, or even to test. My recommendation if you're dataset was like this
#would be to create several versions of the range data, glomming smaller zones onto one
#another to try to create as even length intervals as possible, and then run your
#analyses across those. If there's no difference, you're fine. I'm not going to
#give an example of that here, but if you're playing with real data, I'd recommend you
#try that. Instead, let's go with the idea that our graptolite zones are roughly a
#million years long, so we have a 0.54 probability of sampling a retiolitid genus at
#least once per million year interval.
#
#Another way we could use to consider our estimated sampling probability is compare
#the completeness values, which are not dependent on interval length. These represent
#the proportion of total taxa expected to be sampled in the fossil record. In this case,
#the retiolitid fossil record is estimated as being 87% complete.
#
#Unfortunately, both completeness and sampling probability can have a dependency
#on extinction rate, because these variables depend on how long taxa survive before
#being sampled or the end of a given interval. The instantaneous rate of sampling
#rates, the rate of a sampling event occurring per lineage million years, is the only
#sampling variable that can be estimated independently of extinction rate.
#
#We can transform sampling probability to sampling rate using the function sProb2sRate
#and the mean interval length:

sRate <- sProb2sRate(SPres[[2]][2],int.length=meanInt)
sRate

> sRate
[1] 0.7987547

#So, a sampling rate of 0.79 per Lmy.
#
#So, to return to the issue I raised earlier: are our estimates of sampling reasonable?
#Well, there's aren't a whole lot of published estimates of sampling variables. I got
#rather sidetracked by that and went through the literature and collected rate estimates
#from as many publications as possible. (I'll post those later on my blog.)
#
#Foote and Sepkoski gave the only previous graptoloid sampling estimates; specifically
#generic sampling probabilities. I've converted those to sampling rates, which
#are about 0.18-0.35 per Lmy (Foote and Sepkoski gave different estimates based on
#different ways of breaking up intervals). So, our estimate for the retiolitids is
#nearly twice that. This makes our estimates suspicious but not unrealistic: its
#possible that retiolitids are twice better sampled than other graptolite genera. It
#may also be because of factors we haven't accounted for. Remember how the diversity
#curve showed what looked like an extinction event? That was the well-known 'lundgreni
#event'. That, or some other variation in sampling or extinction rate, may be enough
#to have biased our estimates. Alternatively, maybe our sample size it too low to
#get as good an estimate as we'd like.
#
#getSampProbDisc allows for us to fit different parameters to different groups of taxa
#in our dataset, such as taxa in different time-intervals or different clades. This
#can allow us to consider such potential sources of variation, and use AIC to compare
#the fit of these models. Unfortunately, in this case with only 22 taxa, splitting the
#dataset doesn't seem very feasible, so I won't try that here.
#
#For the sake of this tutorial, I'll continue with the sampling rate we estimated,
#but in a real analysis we would probably want to spend a lot more time assessing the
#accuracy of our estimate.
#
#Now, for cal3, we also need extinction rate and branching rate. One way to get those
#would be to estimate them using per capita rates (Foote, 2000), but getSampProbDisc
#also calculates extinction rate, and its better to take this joint estimate then get
#extinction rate from a seperate analysis. We might also assume that branching rate is
#equal to extinction rate. This is an extremely simplifying assumption, but one that
#oddly seems to hold true in the fossil record (Stanely, 1979; Sepkoski, 1998). In
#general, I'd suggest getting rates from both getSampProbDisc and from a per capita
#rates analysis (not yet implemented in paleotree), and comparing the rates they
#estimate, and also to test if branching rate is similar to extinction rate.

#To get the extinction rate (also the branching rate for this tutorial) from
#getSampProbDisc, we simply divide the extinction rate it calculated by the interval
#length.

divRate<-SPres[[2]][1]/meanInt

#Alright and now we have everything we need to use cal3 with the retiolitid dataset.
#Note however that all of the above won't apply for every sort of dataset we might
#want to analyze: if your data doesn't have persistent taxa, then all taxa occur in
#a single interval and freqRat and getSampProbDisc will be of no use to you, whatsoever.
#There's are really any direct method that let you estimate sampling rates in those
#cases, but there might be some workarounds. I'll talk more about that in the blog post
#where I give all those published sampling rate estimates I've been collecting.

########################################################
#And Now, We Can Finally Time-Scale with cal3!

#By default, cal3 randomly resolves polytomies and adds on the terminal ranges of
#taxa, unlike the timePaleoPhy function we applied in an earlier part of this tutorial.
#Just like with timePaleoPhy, there is a 'bin' version of cal3TimePaleoPhy for
#discrete time interval data.

ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
    sampRate=sRate,ntrees=1,plot=TRUE)

> ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+     sampRate=sRate,ntrees=1,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions


#Note well the copious warnings about not interpreting a single tree! cal3 is a
#stochastic time-scaling method.
#
#With such high sampling rates, nodes seperated by ZLBs under basic only move a little
#back further in time: it's unlikely that there is much more unobserved evolutionary
#history.
#
#We can see the diversity curve assumed by this tree...

phyloDiv(ttree)



#But if we really wanted to see the uncertainty in the phylogenetic diversity estimate
#we would need to create a sample of time-scaled phylogenies and calculate
#the median diversity curve across that sample. (Function multiDiv can do this!)
#
#Let's try making a hundred trees and putting them in multiDiv.

ttrees <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
    sampRate=sRate,ntrees=100,plot=FALSE)
multiDiv(ttrees)



#The thick black line is the median diversity, the gray blocks are 95% quantiles
#across the sample of time-scaled trees.

#Lastly, we can also change the times of observation in cal3 using the argument
#'FAD.only'...

bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
    sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)

> bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+     sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
        Rotaretiolites, Pseudoretiolites, Pseudoplegmatograptus, Dabashanograptus, Stomatograptus, Retiolites, ...

Rooted; includes branch lengths.


#Cool? Now you should know everything you need to play around with cal3!

###################################################
#References
#Source for cladogram and zonal ranges for genera:
#Bates, D. E. B., A. Kozlowska, and A. C. Lenz. 2005. Silurian retiolitid graptolites: Morphology and evolution. Acta Palaeontologica  Polonica 50(4):705-720.

#Source for interval dates for graptolite zones:
#Sadler, P. M., R. A. Cooper, and M. Melchin. 2009. High-resolution, early Paleozoic (Ordovician-Silurian) time scales. Geological  Society of America Bulletin 121(5-6):887-906.

#Other references:
#Foote, M. 1997. Estimating Taxonomic Durations and Preservation Probability. Paleobiology 23(3):278-300.
#Foote, M. 2000. Origination and extinction components of taxonomic diversity: general problems. Pp. 74-102. In D. H. Erwin, and S. L. Wing, eds. Deep Time: Paleobiology's Perspective. The Paleontological Society, Lawrence, Kansas.
#Foote, M., and D. M. Raup. 1996. Fossil preservation and the stratigraphic ranges of taxa. Paleobiology 22(2):121-140.
#Foote, M., and J. J. Sepkoski. 1999. Absolute measures of the completeness of the fossil record. Nature 398(6726):415-417.
#Laurin, M. 2004. The Evolution of Body Size, Cope's Rule and the Origin of Amniotes. Systematic Biology 53(4):594-622.
#Norell, M. A. 1992. Taxic origin and temporal diversity: the effect of phylogeny. Pp. 89-118. In M. J. Novacek, and Q. D. Wheeler, eds. Extinction and phylogeny. Columbia University Press, New York.
#Sadler, P. 2001. Constrained optimization approaches to the paleobiologic correlation and seriation problems: a user's guide and reference manual to the CONOP program family, v.6.1. University of California, Riverside.
#Sepkoski, J. J. 1998. Rates of speciation in the fossil record. Philosophical Transactions: Biological Sciences 353(1366):315-326.
#Smith, A. B. 1994. Systematics and the fossil record: documenting evolutionary patterns. Blackwell Scientific, Oxford.
#Stanley, S. M. 1979. Macroevolution: Patterns and Process. W. H. Freeman, Co., San Francisco.