Wednesday, February 11, 2015

How Do We Treat Fossil Age Data? Dates, Ranges and Occurrences

(Hello all! I’m trying Rmarkdown for writing blog posts! Let’s see how this goes.)

I spend a lot of time thinking about how the datasets we (paleontologists and biologists) work with differ in their structure and information content. The obvious difference is how much paleontological data differs from the sort of data we have for living organisms in evolutionary biology (neontology… whatever…) but the one we so often overlook is how datasets from the fossil record can differ greatly from each other. I talk a lot in my book chapter last year on how much paleontological datasets for different groups of organisms or time-intervals can differ and how this impacts our intention to try phylogeny-based analyses of macroevolution (you can check out my chapter here, thanks to the publisher, including its horrid easter eggs referring to a certain popular series of Game Boy games).

That chapter barely scratched the surface in terms of ‘things to know about paleo data’. Recently, while playing with Paleobiology Database data using the recently release R package paleobioDB, I had some additional thoughts on how we think of and use chronostratigraphic data for fossil taxa.

Warning: Many of you who are paleontology are probably already familiar with a lot of what I’m about to say. In some regards, I’m very mindful that some of my biologist friends read this, so I’m trying to cover what they might not be so familiar with. Also, much of it is a retread of previous blog posts. A major goal of this blog post is an attempt to describe a set of terms for relating how different types of fossil age datasets relate to one another in terms of their information content (in a sense, their ontology), and that requires discussing some of the tiniest of minutia. I also don’t think I’m right about everything I say below–I think there is a lot more thinking we need to have about the ‘ontology’ of paleontological age data.

Four-Date Age Data (“timeList”)

I’ve thought a few ways about to talk about these issues, and the best I can do is to start with how I encountered taxonomic age data initially. So, graptoloids (the planktonic graptolites, my favorite group… why aren’t they yours too?) have a very well known fossil record. Skipping through some of the vagueries of Paleozoic time-scale making, you can generally tell based upon the species composition of a graptolite fossil assemblage which graptolite biostratigraphic zone you are in, meaning that the rocks you’re looking at come before ‘this graptolite’ showing up for the first time but after some other graptolite species shows up. The same taxa are often found across multiple continents, so there is a global correlation of such zones. The use of some complicated annealing software allows us to put absolute dates on the first appearance of graptolite taxa (Sadler et al., 2009), so if you know the graptolite zone of a rock, you know what little segment of (about) a few million years that rock is from. What nice little time-keepers… Its really too bad graptoloids are extinct, isn’t it?

Anyway, most of the data recorded in papers referring to when particular graptoloid taxa appear in the geologic record is often just which graptolite zones these species or genera first and last appear within. This is understandable, as some taxa can be found pretty continuously throughout their stratigraphic range, and all the information important for stratigraphic correlation are in these times of first and last appearance. In many cases, the data is also very well-behaved in that these graptolite zones don’t overlap and are of a rougly equal length (although maybe with more variation than we’d like; Sadler et al., 2009). This is exactly the sort of data I was first exposed to and dealt with in some undergrad research (published in Bapst et al., 2012). You can find the data here), or you can play with it in paleotree because I recently added it as example data. Let’s take a look at that sort of data.

## Loading required package: ape

Our item of interest is graptRanges, which is a timeList object I’ve talked about on this blog before and composed of two matrices: a matrix of the earliest and latest ages for some intervals, in this case graptolite biostratigraphic zones, which are intervals denoted by the first appearance of the taxa the zones are named after.

##                                     start_time end_time
## Nemagraptus gracilis (Gi1)              460.86   456.35
## Orthograptus calcaratus (Gi2)           456.35   455.29
## Diplograptus lanceolatus (Ea1)          455.29   452.21
## Diplacanthograptus spiniferus (Ea2)     452.21   449.73
## Dicellograptus kirki (Ea3)              449.73   448.96
## Dicranograptus gravis (Ea4)             448.96   448.57

…and first and last intervals of appearance for some taxa.

##                                 first_int last_int
## 'Bulmanograptus' macilentus            16       16
## 'Monograptus' arciformis               16       17
## 'Monograptus' austerus                 15       17
## 'Paramplexograptus' kiliani            12       13
## 'Paramplexograptus' paucispinus        12       13
## 'Prisitiograptus' fragilis             16       19

Information-wise, this means every taxon has four dated values associated with it: the earliest and latest dates for the intervals in which a taxon first and last appears. By separating the two into separate matrices (effectively making the data structure no longer ‘flat’) we remove some redundancy of having to list the same dates for different taxa which first or last appear in the same interval. Essentially, we’re compressing the information content of our taxon range data. Losing this redundancy is good because it minimizes needing to update the dates more than once.

Note that when we have consecutive, non-overlapping intervals (like with the graptoloid data above), its possible we could compress the information content even further; for example, we could just have the start dates of intervals listed in the interval times matrix, and their end-date is, by implication, the start date of the next interval listed. paleotree retains the second column however so that it can handle those cases where intervals do overlap or are non-consecutive. Going in the other direction, we might want more than two numbers to define dates for a given collection, if we had more information about dates than simply a minimum and maximum bound with a flat probability density inbetween (for example, if the fossils themselves can be dated using geochemistry, such as Strontium isotopes), but that sort of age information is not common.

Two-Date Age Data (“timeData”)

Particularly in discussions I had on this blog last year, I stressed that there was two other sorts of datasets that might be confused with this. If each taxon was known from only a single collection in the fossil record, such that their first appearance was their last appearance, than we would only have two dates known: the earliest and latest dates for that particular rock it was found in. Alternatively, in an extremely well resolved fossil record, we might have exceptional certainty about the dating of fossil remains, such the first and last appearance dates (FADs and LADs) of taxa were known from strata of extremely precise dating with negligble uncertainty.

Both situations are simply special cases of the four-date datasets; you could still describe them with four dates, but the values would largely be redundant. For comparability across datasets, though, it may be preferable to define taxa as

In both of these cases, each listed taxon has two dated values associated with it, but they imply very different things. I’ve encountered issues with people mistaking one for the other when it comes to providing input to paleotree functions, like timePaleoPhy, hence some recent (in the past year) changes to how timePaleoPhy handles age data (which the help file refers to as “timeData”) to make it more explicit. It doesn’t help that data reflecting age uncertainty of point observations is probably the most common type of data that people seem to have in-hand when attempting to time-scale phylogenies of fossil taxa, while I myself had in hand data of precise first and last appearances when I initially developed paleotree.

One could imagine cases where both of the above special cases are true: i.e., (1) taxa are known from singular remains and appear to not have any persistent morphotaxon duration and (2) the stratigraphic time-scale (this is sometimes called an ‘age model’) is so well-resolved that dates are known with exceptional precision; this means every fossil taxon’s age could be defined by a single dated value. Such datasets are very rare, and you won’t find anything that assumes such data as input in paleotree.

Ultimately, the commonly used functions for time-scaling phylogenies in the fossil record, my timePaleoPhy and strap’s DatePhylo use two-date age data like this. paleotree’s bin_timePaleoPhy and bin_cal3TimePaleoPhy are just wrappers which take the four-date timeList datasets, randomly sample ages within the intervals and hand their respective time-scaling functions two-date age data.

Occurrence Data

Although its possible to describe the age ranges of fossil taxa from many groups as the four-date-value type above, this format is a reduction in information content; a summary of only what is needed to know about the timing of their first and last appearances. Other than those cases where some taxon is found continuously from its first to last appearance, a given taxon may be observed to occur in a larger number of fossil collections. Thus for each of a given taxon, we might end up with two dates, reflecting the earliest and latest dates bounding the age of each collection. I’ll call this occurrence data (which I think is what everyone else calls it, but I’m not certain 100% because, well, paleontologists can be inconsistent with their terminology).

The Paleobiology Database is probably the data source most accessible to anyone who wants age data from the fossil record, and its raw data is in the form of such occurrences within collections. With the recently introduction of the new paleobioDB package (Varela et al., in press), the use of paleobioDB (aka PBDB) occurrence data is probably going to increase.

Let’s look at an example of occurrence data, using the paleobioDB package to download some PBDB records. To make this fun, we’ll look up this blog’s mascot…

## Loading required package: raster
## Loading required package: sp
## Attaching package: 'raster'
## The following objects are masked from 'package:ape':
##     rotate, zoom
## Loading required package: maps
nemaData<-pbdb_occurrences(limit="all", base_name="Nemagraptus", show=c("phylo","ident"))
nemaData<-nemaData[nemaData$rnk==3,]   #keep only occurrences of taxa resolved to species level
##         oid typ   cid                    tna rnk   tid               oei
## 1:4   93753 occ  6986   Nemagraptus gracilis   3 33700 Middle Ordovician
## 1:16 118184 occ  8776   Nemagraptus gracilis   3 33700           Caradoc
## 1:20 118736 occ  8887   Nemagraptus gracilis   3 33700          Actonian
## 1:22 118977 occ  9035   Nemagraptus gracilis   3 33700        Gisbornian
## 1:36 598661 occ 63445   Nemagraptus gracilis   3 33700        Gisbornian
## 1:38 601803 occ 63967 Nemagraptus ? gracilis   3 33700        Gisbornian
##        eag   lag   rid          odl   odn           cll   cln          phl
## 1:4  470.0 458.4   437 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:16 460.9 449.5   613 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:20 455.8 449.5   607 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:22 460.9 456.1   623 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:36 460.9 456.1 18179 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:38 460.9 456.1 18428 Graptoloidea 33606 Graptolithina 33534 Hemichordata
##        phn         idt      ids         mna mra    oli  rst  rss
## 1:4  33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:16 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:20 33518 Nemagraptus gracilis Nemagraptus   5 Onnian <NA> <NA>
## 1:22 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:36 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:38 33518 Nemagraptus gracilis Nemagraptus   5   <NA>    ? <NA>

We can see each row of the obtained data-frame corresponds to a different individual occurrence of some named taxon. Myltiple occurrences are listed for the same taxon. We can simplify this by looking at just the taxon name listed for the occurrence and the earliest and latest age bounds for the associated collections.

##                         tna   eag   lag
## 1:4    Nemagraptus gracilis 470.0 458.4
## 1:16   Nemagraptus gracilis 460.9 449.5
## 1:20   Nemagraptus gracilis 455.8 449.5
## 1:22   Nemagraptus gracilis 460.9 456.1
## 1:36   Nemagraptus gracilis 460.9 456.1
## 1:38 Nemagraptus ? gracilis 460.9 456.1

We can already see that many of these occurrences overlap with each other and are not from intervals of the same size. Now, PBDB data is noisy and some of that is due to noise, but some of it is just due to differences in how well collections can be chronostratigraphically constrained. So, here we have a data set where taxa may be represented by hundreds of date-values, effectively two date bounds for every collection a taxon occurs in.

We could try visualizing this; first let’s group the occurrences by species…

occList<-lapply(unique(nemaData$ids),function(x) nemaData[x==nemaData$ids,c("eag","lag"), drop=FALSE])
occList<-lapply(occList,function(x) x[order(x[,1]),, drop=FALSE])

## $gracilis
##        eag   lag
## 1:20 455.8 449.5
## 1:48 458.4 453.0
## 1:16 460.9 449.5
## 1:22 460.9 456.1
## 1:36 460.9 456.1
## 1:38 460.9 456.1
## 1:41 460.9 456.1
## 1:43 460.9 456.1
## 1:45 460.9 449.5
## 1:46 460.9 449.5
## 1:49 460.9 456.1
## 1:4  470.0 458.4
## $exilis
##        eag   lag
## 1:39 460.9 456.1
## 1:40 460.9 456.1
## 1:42 460.9 456.1
## 2    460.9 449.5

And now let’s plot the bounds of the age uncertainty for each of the occurrences, color-coded for their species.

plot(0, 0, type="n", xlim=c(max(sapply(occList,max))*1.02, min(sapply(occList,min)))*.98,
     ylab = "", xlab = "Time (Ma)", main="Age Uncertainty of PBDB Occurrences for Nemagraptus Species")


for(i in 1:length(occList)){
  for(j in 1:nrow(occList[[i]])){

That wasn’t so impressive, as Nemagraptus is a rather species-poor graptolite genus. Let’s try its close dicranograptid cousin genus, Dicellograptus. This time, we’ll make the plotting code a function.

  #where pbdbOccData is a matrix returned by function pbdb_occurrences in package paleobioDB
  #order taxa by species (add more taxonomic levels later?)
  occList<-lapply(unique(pbdbOccData$ids),function(x) pbdbOccData[x==pbdbOccData$ids,c("eag","lag"), drop=FALSE])
  #order taxa by earliest occurrence
  #order the occurrences within taxa
  occList<-lapply(occList,function(x) x[order(x[,1]),, drop=FALSE])
  #set xlims
   xlims<-c(max(sapply(occList,max)), min(sapply(occList,min))) 
  #initiate the plot with a modifiable main title
  plot(0, 0, type="n", xlim=xlims,
       ylab = "", xlab = "Time (Ma)",
       main=ifelse(is.null(groupLabel),"Age Uncertainty of PBDB Occurrences",
             paste("Age Uncertainty of PBDB Occurrences for",groupLabel)))
  #set colors
    occColors<-sample(rainbow(length(occList)))  #scramble colors
  #set line width
  #now plot the occurrences as lines
  for(i in 1:length(occList)){
    for(j in 1:nrow(occList[[i]])){
  #return the occList as an invisible data object

dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
dicelloData<-dicelloData[dicelloData$rnk==3,]   #keep only occurrences of taxa resolved to species level

plotOccPBDB(dicelloData,"Dicellograptus Species")

Each color shown here is a different taxon, each line is the age uncertainty of a different occurrence. We can see for dicellograptids, the age uncertainty may potentially be a lot greater than the true sampled age ranges themselves. Many of these collections have age uncertainties that overlap and are not contiguous.

Now, the issue here is that for many statistical analyses that we apply in the fossil record, they use only the first and last appearance times, and they also often presume that these appearances are reported in contiguous, non-overlapping intervals of roughly equal length (let’s call that ‘well-behaved’ four-date data). Why? Well, most of these methods depend on the simple modeling trick that if we have contiguous intervals of roughly equal size, it becomes very easy to calculate the probability of observing a taxon X intervals into the future (or the past), given a sampling parameter. Obviously, there are considerable challenges in translating a dataset of loose occurrences to such well-behaved interval data, and that’s where my mind is right now.

Now, all that said, there are some analytical programs that take occurrence data natively. In particular, PyRate is a Python package that applies a Bayesian MCMC algorithm to estimate taxonomic origination, extinction and sampling rates from occurrence data. The method is described here and the program itself is described in MEE here. One could easily set up a functions to use [paleobioDB] package to download PBDB and use PyRate’s data preparations R functions to produce datasets for analysis in Python. Ultimately, PyRate solves the issue using a similar solution to the one I advocate via bin_timePaleoPhy (I discuss this more below): treating the age uncertainty of occurrences as uniform uncertainty bounds on the actual age of taxa and repeatedly randomly sampling ages from within those bounds and running the PyRate analysis. The sampling of ages is technically done in R; the Python component of PyRate only accepts single-date ‘point’ age data for occurrences as input. The results of multiple analyses can then be considered to see how much influence the uncertainty in taxon ages has on the resulting.

But PyRate doesn’t replace every pre-existing analysis of paleobiological data, and the ‘repeatedly sample and consider the resulting distributions’ trick isn’t an applicable solution to those analyses. So, how do we translate occurrence data into ranges within contiguous intervals? How does the manner in which we do this matter (or not) for the analyses we do? I’ve given this question a lot of thought previously with respect to time-scaling phylogenies of fossil taxa, but what about analyses where phylogenies aren’t involved? In particular, there are a number of functions in paleotree for estimating sampling rates and sampling probabilities that depend on having data from contiguous intervals (e.g., getSampProbDisc or make_durationFreqDisc or make_inverseSurv). These are questions I’ll try to address in future posts, along with a long-promised post about porting data from paleobioDB into paleotree format.

However, on a related node…

What does the Paleobiology Database report as two-date age ranges?

Some analyses only accept the two-date range data, and so people have invented algorithms for obtaining such datasets from occurrence data or four-date range data. However, such datasets must represent either precise age ranges or age uncertainty on a single occurrence, and thus information is being lost when we translate to two-dates (except in the rare cases where each taxon is known from only a single collection, and thus the two-dates are simply age uncertainty on singular finds). One can imagine some simple algorithms for obtaining two-date ranges, such as ‘maximum range’, i.e. take the earliest possible age of the oldest occurrence for a taxon and the latest possible age of the youngest occurrence. Although algorithms such as this retain some information about the geologic age of taxa, my opinion is that the information loss about age uncertainty may impact many analyses, perhaps adding an analytical bias. An alternative philosophy, which I espouse in the bin_timePaleoPhy function in paleotree, is to random draw precise FADs and LADs from four-date ranges (or occurrence data), run an analysis (such as time-scaling a phylogeny of fossil taxa), save the resulting test statistic or parameters of interest, and repeat this procedure many times to generate sampled distributions of the test statistics / parameters. Such a procedure retains (rather than loses) the uncertainty component of the ages into our results, but as I said above when discussing PyRate, this procedure isn’t applicable to every type of analysis.

Two of the various utilities or “apps” (…ugh) associated with the Paleobiology Database will return two-date age ranges for a search query. I became very curious how these ranges were calculated and so I wanted to report on my investigations.


Before I get into these various ways of returning ranges from occurrence data, I should mention a relavant not-range-returning application that deals with fossil taxon ages, FirstApp, a module by Matt Clapham. FirstApp takes paleobioDB data for a given taxon and displays occurrences through time as a density histogram, with occurrences placed as points at the earliest age boundary of the collections they occur within.

Ranges in Paleobiology Database Classic / Fossilworks

First, if you go to the present Paleobiology Database website, you have the option of using PBDB Classic which is a search query engine, with a download option (located here). You can access an almost identical interface at FossilWorks, which mirrors the paleobioDB data along with a number of analytical tools previously housed at the PBDB. FossilWork’s basic download option is located here. The algorithm for calculating two-date ranges used is somewhat complex but doesn’t appear to be described in any documentation, so I went straight to the “horse’s mouth” (his words), and asked John Alroy himself. I’ll paraphrase our conversation.

John has developed this “zone-of-overlap” algorithm to deal with obtaining ranges from the realities of messy occurrence data described above: age uncertainties (i.e. stratigraphic intervals) that overlap, of different lengths and don’t sort into an orderly hierarchy. Rather than returning the earliest possible date and the latest possible date like with the maximum-range algorithm described above, the zone-of-overlap algorithm “finds the oldest base that is older than at least part of all the intervals and the youngest that is younger than at least part of all the intervals” (direct quote, John Alroy).

Here’s three specific examples of how these ranges are calculated, taken from direct quotes from our email conversation. Two are from John and the third is from myself; as such, they are almost comedic examples of our taxonomic preferences:

  • “…for example, if the two oldest occurrences are respectively Paleocene (not better resolved than that) and Thanetian then the base has to be the base of the Thanetian.”

  • “Another example would be Hemphillian (Mio-Pliocene) and Pliocene. These work out to a range going from base of the Pliocene to the top of the Hemphillian even though the Hemphillian is ‘older’ (based on its base) and the Pliocene ‘younger.’”

  • “If I follow correctly, some taxon listed in the PBDB with two occurrences, in a Katian collection (base ~456 Ma) and a Dicellograptus anceps zone (base is ~449 Ma) collection respectively, the lower age bound returned returned will be ~449 Ma.”

In other words, for calculating the first appearance datum (FAD), the algorithm looks for all occurrences that overlap with the age range of the earliest-most occurrence, obtains their earliest boundary ages and returns the latest-most earliest age boundary among these overlapping occurrences. Similarly, for calculating the latest appearance datum (LAD), the algorithm looks for all occurrences that overlap with the age range of the latest-most occurrence, obtains their latest boundary ages and returns the earliest-most latest age boundary among these overlapping occurrences. On theoretical grounds, one could probably describe the zone-of-overlap algorithm as minimizing taxonomic age ranges by assuming that overlapping occurrences probably describe a very similar FAD or LAD, and thus picks the one that extends the taxonomic range the least.

However, this does come with a downside that if these occurrences are not repeated attempts to capture the same FAD or LAD, then the zone-of-overlap algorithm isn’t an accurate depiction of the uncertainty in the ages; the age of the rocks containing the observed FADs and LADs may actually be outside of the reported range, unlike the maximum-range algorithm, which must contain the entire observed age range for a taxon as long as ourour interval dates for collections is accurate. Consider my graptolite zone example above; there is a possibility that the collection from the Katian is actually from an earlier graptolite zone than the D. anceps zone, maybe the Dicranograptus kirki zone (~453 Ma). Better stratigraphic resolution on a collection might very well provide such information in the future, but zone-of-overlap algorithm doesn’t give any weight to that possibility. On the other hand, John argues that a large number of taxa in the PBDB are listed from collections with very broadly defined intervals of time, and so some way is needed to downweight the effect of that lack of stratigraphic interval resolution. Given the nature of the data, the loss of uncertainty may be a necessary evil under the zone-of-overlap algorithm if you want to be as conservative as possible (say with providing molecular clock calibration dates for phylogenetic divergence dates for crown clades). I think the likely answer here is dependent strongly on the group we’re interested in and the reason we want range dates.

Ranges in R package paleobioDB

The paleobioDB package also has a function for obtaining and plotting two-date age ranges, the function pbdb_temp_range. Here’s a species-level example using the Dicellograptus data above, with the plotting functionality turned off.

pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

So what are these ranges? The manual file for pbdb_temp_range just refers to them as temporal ranges or time spans of taxa.

Well, we can go to the Github repo for paleobioDB and read the code for pbdb_temp_range directly, here’s the relevant bit for species. It appears that the ranges reported by pbdb_temp_range are the maximum-ranges, the earliest and latest possible bounds on a taxon’s age range.

Wait, wait, hold on, why are there only four species listed…

So, tangentially, you’re probably wondering why so few species are reported, as we can see from the occurrences figure above that a much larger number appears to be contained within the downloaded occurrences table. Uh… I’ll address that in my next blog post.

PS: I just discovered how paleobioDB’s API allows you to call PhyloPic images for taxa, woah! Here, check out Nemagraptus: