Nemagraptus gracilis: A Tutorial to cal3 Time-Scaling Using a Real Dataset

Hi guys!

So, my paper on time-scaling methods for trees of fossil taxa has been accepted to Methods in Ecology and Evolution! You won't see it right away though; it's part of a special issue of MEE and all the papers included in it will be held back for a simultaneous online release, probably in August (I guess?). Anyway, until then, you can go read chapter 3 of my dissertation, it's mostly the same thing, and its up on UMI Proquest now. Or you can contact me for a copy.

Anyway, to commemorate the acceptance, I have actually pulled a real dataset out of the literature and written up a tutorial about applying the various time-scaling methods to this data, in particular the cal3 method. The data in question (of course) a graptolite group, the Silurian retiolitids. It's a generic dataset, but, hey, it's just a tutorial right! Chill, dudes.

Such a tutorial was persuasively argued for by one of my reviewers at MEE, but I had to respond with "I don't have the room!" However, I've got infinite room here on this blog and also it means anyone can come see it here, for free, forever (well as long as Blogger exists...). So here it is! Thanks to Reviewer #1 for making such a strong case for it. Sorry, Reviewer #1, but I did not take you up on that other suggestion that I use a vertebrate dataset example. I just know graptolites better, and vertebrates are a headache to get sampling rates from (I'll touch on that in a later blog post, thanks).

Now, this is a very LONG tutorial. Why? I want to cover in detail all the hiccups that can happen on the way to wanting to time-scale a tree, and I really wanted to nail down what the data structure looks like. So, apologies for the length, but I think anyone actually attempting this stuff will appreciate the detail I go into, particularly, how the degree we should exercise caution and skepticism in estimating sampling rates.

I've written this tutorial much like a cave man from 1984 might have, as an overly chatty, heavily commented R script. If someone who knows LaTEX and Sweave better than me wants to helpfully convert this to a document suitable for release as a vignette file for paleotree well... Well, I will buy that person a few beers at Evolution! Otherwise, you guys are getting just this: an R script.

I've pasted the script text below, with figures inserted and a few key console outputs. The images overhang the margins. I don't know why and I can't seem to fix it. Stupid Blogger! But rather than read my poorly formatted blog post, I instead recommend downloading the script and the dataset and going through it yourself in R! It's more... interactive, that way!

You can get the retiolitid data file and the R script as a zipped archive here:
http://webpages.sdsmt.edu/~dbapst/cal3tutorial/cal3tutorial.zip

This gave me the idea for another post about published estimates of sampling rates, which I'll put up in a few days. I hope you all enjoy this tutorial as much as I enjoyed writing it,
-Dave

#############################################################
#A Tutorial to Time-scaling Methods with the Retiolitinae
#David W. Bapst
#06-12-13

#Introduction

#The package paleotree has functions for time-scaling phylogenies of fossil
#taxa but none of the help examples give a concrete case of the functions being
#applied to real data. Instead, I opted to only give examples using simulated data.
#The reasons for this are simply that real data tends to be very complex, and
#although the functions can certainly be applied to real data, it will often require
#more effort than what can really be put in an R help file.
#
#However, without a clear example of applying the functions to real data, I have
#opened the door for lots of confusion about how a user should do this. Here, I
#use a simple dataset of graptoloid generic ranges and relationships from a recent
#publication (Bates et al., 2005) to exemplify how the time-scaling functions in
#in paleotree work, what the data input looks like and just how complicated even
#a simple real dataset can be.
#
#By the way, our example data today is of the Silurian retiolitids, which are really
#cool looking. Check 'em out!
#http://www.insugeo.org.ar/libros/cg_18/FIG_1.jpg

######################################
#Know Your Data

#First, we need to load paleotree...
library(paleotree)

#Next, we need to load the retiolitid data file, which needs to be in our working dir.
load("retiolitinae.rda")

#You can also load the data directly from the current version of paleotree:
data(retiolitinae)
#if this doesn't work, you have an old version of paleotree! Go download a new one.

#And that should have added two new objects to our workspace, retioTree and retioRanges.

#Let's look at retioRanges first. This is our temporal/interval data.

#retioRanges

str(retioRanges)

> str(retioRanges)
List of 2
$ int.times :'data.frame':    20 obs. of 2 variables:
..$ start_time: num [1:20] 440 439 437 437 436 ...
..$ end_time : num [1:20] 439 437 437 436 435 ...
$ taxon.times:'data.frame':    22 obs. of 2 variables:
..$ first_int: int [1:22] 8 6 2 8 8 8 8 13 8 13 ...
..$ last_int : int [1:22] 8 10 10 8 14 14 14 14 14 14 ...

#It's a list with two components, both matrices. Let's see what they are!

#First component:
retioRanges[[1]]

> retioRanges[[1]]
                                                      start_time end_time
Coronograptus cyphus                                      440.18   439.37
Demirastrites triangulatus - Demirastrites pectinatus     439.37   437.46
Pernerograptus argenteus                                  437.46   436.97
Lituigraptus convolutus                                   436.97   435.85
Stimulograptus sedgwicki                                  435.85   434.90
Spirograptus guerichi                                     434.90   432.50
Spirograptus turriculatus - Stretograptus crispus         432.50   430.44
Monoclimacis griestoniensis - Monoclimacis crenulata      430.44   429.43
Spirograptus spiralis                                     429.43   429.01
Nanograptus lapworthi - Cyrtograptus insectus             429.01   427.07
Crytograptus centhfugus - Cyrtograptus murchisoni         427.07   426.53
Monograptus riccartonensis - Monograptus belophorus       426.53   426.06
Cyrtograptus rigidus - Cyrtograptus perneri               426.06   424.92
Cyrtograptus lundgreni                                    424.92   424.02
Gothograptus nassa - Pristiograptus dubius                424.02   423.09
Colonograptus preadeubeli - Colonograptus deubeli         423.09   422.81
Colonograptus ludensis                                    422.81   422.65
Neolobograptus nilssoni - Lobograptus progenitor          422.65   422.35
Lobograptus scanicus                                      422.35   421.27
Saetograptus leintwardnensis                              421.27   420.44

#A matrix. Each row is a different zone (a biostratigraphically defined interval,
#marked by the appearance of new 'index' fossil taxa), the row names are the index
#fossils for those zones and the start and end times are listed in the two columns
#of the matrix. I sometimes refer to this as the 'interval times' or 'int.times'
#matrix in paleotree documentation.

#Now let's look at the second component of retioRanges:

retioRanges[[2]]

> retioRanges[[2]]
                      first_int last_int
Rotaretiolites                8        8
Pseudoplegmatograptus         6       10
Pseudoretiolites              2       10
Dabashanograptus              8        8
Retiolites                    8       14
Stomatograptus                8       14
Paraplectograptus             8       14
Pseudoplectograptus          13       14
Sokolovograptus               8       14
Eisenackograptus             13       14
Sagenograptus                14       14
Cometograptus                14       14
Plectograptus                17       19
Plectodinemagraptus          20       20
Semiplectograptus            19       20
Baculograptus                14       16
Doliograptus                 15       15
Gothograptus                 14       16
Papiliograptus               16       16
Spinograptus                 16       17
Holoretiolites               18       18
Neogothograptus              18       18

#Another matrix. Each row in this matrix is a taxon, the rownames tell you what
#taxon (note that this was a generic-level analysis) and the columns are which
#interval (graptolite zones) each taxon first and last appeared in the fossil record.
#The numbers in these columns are references to the rows of the interval times matrix,
#for example, Rotaretiolites first and last appears in interval on the 8th row (which
#is the Monoclimacis griestoniensis - Monoclimacis crenulata zone). I usually refer
#to this matrix as the 'taxon times' or 'taxon.times' matrix.

############################################################
#A Digression on Discrete Interval Data in paleotree

#This list format composed of two matrices, an int.times matrix and a taxon.times
#matrix with two columns each, is the keystone data input of all analyses in paleotree
#using data where taxa are only in discrete time intervals. I usually call such objects
#"timeList" objects in paleotree, particularly in the input arguments of functions.
#
#Most paleontological data is only know from discrete intervals, because it is
#fundamentally difficult to place the beds that fossils are found at a specific date.
#Instead, most fossils can only be placed as coming from discrete intervals, and our
#ability to resolve the appearance dates of a fossil can differ widely based on the
#stratigraphy of a locality, such that one fossil taxon may be first known from a two
#million year long interval in the early Triassic, while its sister taxon may be
#very hard to pin down, maybe as generalized as listed as first appearing in
#the 'Triassic'.
#
#The graptolite data here are exceptional well resolved; note that in the interval-time
#matrix (the first component of retioRanges), many of the intervals are less than a
#million years long. All of the occurrences are also resolved to a single zone, and
#none of the zones overlap or contain one another. This doesn't have to be the case.
#In our example above with the two taxa known from the Triassic, we would list the
#short eartly Triassic interval as one row in the interval times matrix and the entire
#Triassic as another row. The respective taxa would refer to those respective intervals.
#Most paleotree functions won't care about overlapping intervals, but there is an
#important exception. getSampProbDisc and the freqRat, both of which are applied below,
#CANNOT use such data. Those functions assume the intervals they are containing are
#successive and of roughly similar length. Here, the graptolite data are all successive
#but as we can see, there is some variation in interval length.
#
#Now, there are exceptional datasets composed of continuous time data, which require
#a much simpler data structure than discrete interval data. All that is needed in
#paleotree isjust a matrix of first and last appearance dates). Such datasets tend to
#be for very recent fossil records, microfossil data from cores or computationally
#correlated stratigraphic columns, such as the output from CONOP (Sadler, 2001).
#
#However, if you know your fossil appearances aren't well resolved, you really
#should take that uncertainty into account using this 'timeList' structure.
#Getting data into a 'timeList' structure isn't too difficult: just compose
#the interval times and taxon times matrices seperately in your favorite spreadsheet
#program, read them into R with read.table() and then compose them into a list (using,
#of course, list()...). You can then save that list with save() or dput() and reuse it.

###################################################################
#Okay, enough of that! Let's try drawing the diversity curve for this
#discrete interval data.

taxicDivDisc(retioRanges)

#taxicDivDisc counts up all the taxa that COULD occur in each interval (remember
#my point about overlapping intervals, above). We're okay here, because our data is
#in successive intervals. Notice how there's a big generic diversity drop around 424 MYA.
#Also notice how the 'blocky'-ness of the diversity curve also displays a little
#information about interval length: notice the extremely short graptolite zone near
#423 MYA.

#Alright, let's move on to our other data object, retioTree. We can plot it using ape...

plot(retioTree)

#And we get an unscaled tree, partially unresolved (i.e. some polytomies). Cool.

#Now that I've explained discrete interval data in paleotree ad nauseum to you, let's
#actually apply time-scaling methods. Since this is discrete interval data,
#we'll time-scale using the bin_timePaleoPhy function, which applies the simplest
#time-scaling algorithms to discrete interval data, by randomly assigning first and
#last appearance dates within each interval to each taxon, every time it creates a
#new time-scaled tree. The idea is to do this many times, create many time-scaled trees
#but for now, let's just create one.
#
#The simplest time-scaling algorithm for cladograms of fossil taxa is the 'basic'
#method (Norell, 1992; Smith, 1994) which just means every clade is as old as the
#oldest appearance of any taxon in that clade. Let's try that one for now.

#Basic Time-Scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Hey look, it warned us not to interpret a single tree. Heed that warning!
#
#Okay, we can see the cladogram and the time-scaled tree produced. Note three things.
#
#First, the polytomies were not resolved. We'll deal with that in a moment.
#
#Second, the little time-axis at the bottom is relative to the latest tip of the tree
#being at 0, which is unfortunate, as the latest tip of the tree is at ~420 MYA. (I have
#tried but failed to figure out how to alter this. It would require me adding an altered
#plot.phylo to paleotree, which I am incredibly loathe to do.) So, that time axis just
#tells us about relative timing, which still makes it pretty informative.
#
#Third, the 'location' of the tips of our time-scaled tree have been placed at the
#FIRST appearance dates of our taxa, not the LAST appearance dates ('FADs' versus
#'LADs'). I call this choice of 'location' for our tips the 'time of observation' issue.
#This issue really only rears its head when you have persistent morphotaxa, like these
#graptolite genera, which occur/are morphologically recognizable across multiple
#time intervals. Obviously, if we want to know something about lineage diversification,
#we want to include those 'terminal' ranges of the taxa and add them to the terminal
#branches of our time-scaled tree. We might want to do something different if we're
#dealing with analyses of trait evolution. We should think more carefully about what our
#'time of observation' for our taxa really is!
#
#Note that if our time of observation is FAD, then all the branches are essentially
#posited unsampled 'ghost' branches or 'ghost' lineages.
#
#Now let's try including the ranges, using argument 'add.term':

#Basic Time-Scaling with Terminal Taxon Ranges

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 13 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Alright, we can compare and see what changed. For example, we can see that the tip with
#Pseudoretiolites shifted, because it's a "long-lived" morphotaxa, so whether
#its time of observation is its first or last appearance makes a large difference.

#Polytomies are still unresolved. Let's randomly resolve those nodes, forcing the trees
#to be dichotomous. This is another action which is stochastic and that shouldn't be done
#to create a single tree, but rather should be done to create many time-scaled trees.
#The argument of importance here to change is 'randres'.

#Basic Time-Scaling with Terminal Ranges and Polytomies Randomly Resolved

bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="basic",
+ randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Alright, we've resolved those nodes, as we can see in the cladogram plotted at the top.
#But look at the time-scaled tree! Why are there those 'apparent' polytomies?
#The reason is Zero-Length Branches, which occur in time-scaling phylogenies of fossil
#taxa when a more derived taxon nested within a clade occurs earlier in the fossil record
#than other taxa in that clade. It pushes nodes up against each other under the
#basic time-scaling method, forcing those divergences to occur simultaneously.

#To avoid this, we need to choose a new time-scaling algorithm. One option is the
#'Minimum Branch Length' method (MBL; Laurin, 2004), where we will force all the branch
#lengths in the tree to be some minimum length and thus push nodes back in time.
#To do this, we'll change the the 'type' argument to "mbl" and the minimum length is set
#using the 'variable time' or 'vartime' argument, which we'll set to 1 MY as an example.

#min branch length time-scaling
bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)

> bin_timePaleoPhy(tree=retioTree,timeList=retioRanges,type="mbl",
+ vartime=1,randres=TRUE,add.term=TRUE,ntrees=1,plot=TRUE)
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions
Warning: Do not interpret a single randomly-resolved tree

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
Neogothograptus, Holoretiolites, Spinograptus, Papiliograptus, Gothograptus, Doliograptus, ...

Rooted; includes branch lengths.

#Wow, that changed things, didn't it? You can see that the MBL method has added ALOT of
#extra branch-length to our tree, essentially positing a lot more unobserved
#evolutionary history across this clade. Notice also the long 'spine' of nodes seperated
#by 1 MY going all the way down most of the tree, due to the ladder-like structure of
#the tree.
#
#Obviously, the choices we make while time-scaling a tree makes assumptions, whether
#they are implicit or explicit, about how much evolutionary history is unsampled.
#Perhaps 1 MY is too long of a minimum branch length for this dataset; we wouldn't
#expect this much 'ghost' evolutionary history. But how can we test that or even take
#into account how much 'ghost' evolutionary history we would expect? And even if we did
#know how much missing time there should be on the phylogeny, we would be extremely
#uncertain in where to place it.
#
#The solution I suggest in a recent paper is a stochastic time-scaling method, where
#node ages are shifted around in time at random while keeping the tree consisten with
#the appearance times of the taxa, loosely alike to jiggling a zipper. This method
#biases the node ages to be more often sampled at further or closer in time, relative
#to likely we think the corresponding amount of unobserved evolutionary is. That amount
#is not just a function of how frequently sampled the fossil record is: it is also
#dependent on how often new lineages originate via branching and go extinct. Thus, this
#time-scaling method requires three rates to calibrate its use: sampling, branching
#and extinction, and thus I named it 'cal3'.

#These three rates must be calculated before we can apply cal3.
#How? paleotree has the tools!

#########################################
#Getting Sampling Rate, Other Rates and Learning to Be Skeptical

#For discrete interval data, there are two paleotree functions of use for ultimately
#calculating sampling rates from our range data.

#First, we can try the freqRat (Foote and Raup, 1996), which is just a simple
#ratio of the frequencies of taxa found in just one, two and three intervals.
#Let's apply that to retioRanges and tell it to plot the histogram of taxon durations.

freqRat(retioRanges,plot=TRUE)

> freqRat(retioRanges,plot=TRUE)
freqRat
0.5925926

#Okay, so the number the freqRat spit out (0.59) is a sampling PROBABILITY, the probability of
#a taxon being sampled at least once per interval. It's not a sampling rate... yet.
#
#But we need to consider the assumptions and drawbacks of the freqRat first.
#
#The freqRat, like getSampProbDisc below, assumes that extinction and sampling are
#Poisson processes, such that the waiting times between events will be exponentially
#distributed. Under such a process, we'd expect the histogram of taxon durations to be
#roughly similar to an exponential distribution still, with the slope of a log scale
#indicative of the extinction rate, but with the number of taxa that occur in only a
#single interval increased greatly. In addition, it is assumed that extinction and
#sampling event frequency has been relatively constant across the entire sample.
#
#The retiolitid data kind of match our expectation: there is an over abundance of taxa
#known from a single interval. However the number of taxa with longer durations is
#patchy. In particular, there are no taxa known from three graptolite zones. This
#probably just reflects the low sample size of the dataset (22) but also reveals a
#a weakness of the method: just the first three frequency values are used in the
#freqRat. If our sample size is low, or if intervals are so finely broke up that there
#are few taxa found in only one, two or three intervals, the freqRat may give poor if
#not extremely misleading results.
#
#getSampProbDisc is an alternative which uses a maximum likelihood optimization to find
#the best fitting sampling probability and extinction rate, using the entire distribution
#of taxon durations.

SPres <- getSampProbDisc(retioRanges)
SPres

> SPres
$Title
[1] "Analysis with 1 time bins and 0 groupings ( 0 and 0 States),with 2 parameters and 22 taxa"

$parameters
extRate sampProb Completeness
0.2028359 0.5454152 0.8672921

$log.likelihood
[1] -41.40539

$AICc
[1] 87.44237

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

#We can see that getSampProbDisc gives a lot of output! Of importance is whether the
#optimizer converged; if $convergence is 0, then we're probably okay. The best-fit
#parameters are at the top. The sampling probability per graptolite zone is 0.54.
#
#Now is the time to be suspicious. Is this a realistic value? To do that, we need to
#compare this value to other published values, but those don't exist, not for graptolite
#zones. Instead, we need to find out how long our graptolite zones are and relate it to
#time. As I pointed out above, our zones are uneven in length, an unfortunate reality of
#the geologic record. But how uneven?

intLength<--apply(retioRanges[[1]],1,diff)
hist(intLength)

meanInt<-mean(intLength)
meanInt

> meanInt
[1] 0.987

#Okay, so we can see there's a lot of short zones and some zones longer than one, but
#altogether the mean is close to 1 million years. Is this an okay amount of variation?
#It's hard to tell, or even to test. My recommendation if you're dataset was like this
#would be to create several versions of the range data, glomming smaller zones onto one
#another to try to create as even length intervals as possible, and then run your
#analyses across those. If there's no difference, you're fine. I'm not going to
#give an example of that here, but if you're playing with real data, I'd recommend you
#try that. Instead, let's go with the idea that our graptolite zones are roughly a
#million years long, so we have a 0.54 probability of sampling a retiolitid genus at
#least once per million year interval.
#
#Another way we could use to consider our estimated sampling probability is compare
#the completeness values, which are not dependent on interval length. These represent
#the proportion of total taxa expected to be sampled in the fossil record. In this case,
#the retiolitid fossil record is estimated as being 87% complete.
#
#Unfortunately, both completeness and sampling probability can have a dependency
#on extinction rate, because these variables depend on how long taxa survive before
#being sampled or the end of a given interval. The instantaneous rate of sampling
#rates, the rate of a sampling event occurring per lineage million years, is the only
#sampling variable that can be estimated independently of extinction rate.
#
#We can transform sampling probability to sampling rate using the function sProb2sRate
#and the mean interval length:

sRate <- sProb2sRate(SPres[[2]][2],int.length=meanInt)
sRate

> sRate
[1] 0.7987547

#So, a sampling rate of 0.79 per Lmy.
#
#So, to return to the issue I raised earlier: are our estimates of sampling reasonable?
#Well, there's aren't a whole lot of published estimates of sampling variables. I got
#rather sidetracked by that and went through the literature and collected rate estimates
#from as many publications as possible. (I'll post those later on my blog.)
#
#Foote and Sepkoski gave the only previous graptoloid sampling estimates; specifically
#generic sampling probabilities. I've converted those to sampling rates, which
#are about 0.18-0.35 per Lmy (Foote and Sepkoski gave different estimates based on
#different ways of breaking up intervals). So, our estimate for the retiolitids is
#nearly twice that. This makes our estimates suspicious but not unrealistic: its
#possible that retiolitids are twice better sampled than other graptolite genera. It
#may also be because of factors we haven't accounted for. Remember how the diversity
#curve showed what looked like an extinction event? That was the well-known 'lundgreni
#event'. That, or some other variation in sampling or extinction rate, may be enough
#to have biased our estimates. Alternatively, maybe our sample size it too low to
#get as good an estimate as we'd like.
#
#getSampProbDisc allows for us to fit different parameters to different groups of taxa
#in our dataset, such as taxa in different time-intervals or different clades. This
#can allow us to consider such potential sources of variation, and use AIC to compare
#the fit of these models. Unfortunately, in this case with only 22 taxa, splitting the
#dataset doesn't seem very feasible, so I won't try that here.
#
#For the sake of this tutorial, I'll continue with the sampling rate we estimated,
#but in a real analysis we would probably want to spend a lot more time assessing the
#accuracy of our estimate.
#
#Now, for cal3, we also need extinction rate and branching rate. One way to get those
#would be to estimate them using per capita rates (Foote, 2000), but getSampProbDisc
#also calculates extinction rate, and its better to take this joint estimate then get
#extinction rate from a seperate analysis. We might also assume that branching rate is
#equal to extinction rate. This is an extremely simplifying assumption, but one that
#oddly seems to hold true in the fossil record (Stanely, 1979; Sepkoski, 1998). In
#general, I'd suggest getting rates from both getSampProbDisc and from a per capita
#rates analysis (not yet implemented in paleotree), and comparing the rates they
#estimate, and also to test if branching rate is similar to extinction rate.

#To get the extinction rate (also the branching rate for this tutorial) from
#getSampProbDisc, we simply divide the extinction rate it calculated by the interval
#length.

divRate<-SPres[[2]][1]/meanInt

#Alright and now we have everything we need to use cal3 with the retiolitid dataset.
#Note however that all of the above won't apply for every sort of dataset we might
#want to analyze: if your data doesn't have persistent taxa, then all taxa occur in
#a single interval and freqRat and getSampProbDisc will be of no use to you, whatsoever.
#There's are really any direct method that let you estimate sampling rates in those
#cases, but there might be some workarounds. I'll talk more about that in the blog post
#where I give all those published sampling rate estimates I've been collecting.

########################################################
#And Now, We Can Finally Time-Scale with cal3!

#By default, cal3 randomly resolves polytomies and adds on the terminal ranges of
#taxa, unlike the timePaleoPhy function we applied in an earlier part of this tutorial.
#Just like with timePaleoPhy, there is a 'bin' version of cal3TimePaleoPhy for
#discrete time interval data.

ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
sampRate=sRate,ntrees=1,plot=TRUE)

> ttree <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+ sampRate=sRate,ntrees=1,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

#Note well the copious warnings about not interpreting a single tree! cal3 is a
#stochastic time-scaling method.
#
#With such high sampling rates, nodes seperated by ZLBs under basic only move a little
#back further in time: it's unlikely that there is much more unobserved evolutionary
#history.
#
#We can see the diversity curve assumed by this tree...

phyloDiv(ttree)

#But if we really wanted to see the uncertainty in the phylogenetic diversity estimate
#we would need to create a sample of time-scaled phylogenies and calculate
#the median diversity curve across that sample. (Function multiDiv can do this!)
#
#Let's try making a hundred trees and putting them in multiDiv.

ttrees <- bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
sampRate=sRate,ntrees=100,plot=FALSE)
multiDiv(ttrees)

#The thick black line is the median diversity, the gray blocks are 95% quantiles
#across the sample of time-scaled trees.

#Lastly, we can also change the times of observation in cal3 using the argument
#'FAD.only'...

bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
    sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)

> bin_cal3TimePaleoPhy(retioTree,retioRanges,brRate=divRate,extRate=divRate,
+     sampRate=sRate,ntrees=1,FAD.only=TRUE,plot=TRUE)
Warning: Do not interpret a single cal3 time-scaled tree
Warning: Do not interpret a single tree; dates are stochastically pulled from uniform distributions

Phylogenetic tree with 22 tips and 21 internal nodes.

Tip labels:
        Rotaretiolites, Pseudoretiolites, Pseudoplegmatograptus, Dabashanograptus, Stomatograptus, Retiolites, ...

Rooted; includes branch lengths.

#Cool? Now you should know everything you need to play around with cal3!

###################################################
#References
#Source for cladogram and zonal ranges for genera:
#Bates, D. E. B., A. Kozlowska, and A. C. Lenz. 2005. Silurian retiolitid graptolites: Morphology and evolution. Acta Palaeontologica Polonica 50(4):705-720.

#Source for interval dates for graptolite zones:
#Sadler, P. M., R. A. Cooper, and M. Melchin. 2009. High-resolution, early Paleozoic (Ordovician-Silurian) time scales. Geological Society of America Bulletin 121(5-6):887-906.

#Other references:
#Foote, M. 1997. Estimating Taxonomic Durations and Preservation Probability. Paleobiology 23(3):278-300.
#Foote, M. 2000. Origination and extinction components of taxonomic diversity: general problems. Pp. 74-102. In D. H. Erwin, and S. L. Wing, eds. Deep Time: Paleobiology's Perspective. The Paleontological Society, Lawrence, Kansas.
#Foote, M., and D. M. Raup. 1996. Fossil preservation and the stratigraphic ranges of taxa. Paleobiology 22(2):121-140.
#Foote, M., and J. J. Sepkoski. 1999. Absolute measures of the completeness of the fossil record. Nature 398(6726):415-417.
#Laurin, M. 2004. The Evolution of Body Size, Cope's Rule and the Origin of Amniotes. Systematic Biology 53(4):594-622.
#Norell, M. A. 1992. Taxic origin and temporal diversity: the effect of phylogeny. Pp. 89-118. In M. J. Novacek, and Q. D. Wheeler, eds. Extinction and phylogeny. Columbia University Press, New York.
#Sadler, P. 2001. Constrained optimization approaches to the paleobiologic correlation and seriation problems: a user's guide and reference manual to the CONOP program family, v.6.1. University of California, Riverside.
#Sepkoski, J. J. 1998. Rates of speciation in the fossil record. Philosophical Transactions: Biological Sciences 353(1366):315-326.
#Smith, A. B. 1994. Systematics and the fossil record: documenting evolutionary patterns. Blackwell Scientific, Oxford.
#Stanley, S. M. 1979. Macroevolution: Patterns and Process. W. H. Freeman, Co., San Francisco.

Nemagraptus gracilis

Friday, June 14, 2013

A Tutorial to cal3 Time-Scaling Using a Real Dataset

No comments:

Post a Comment