Monday, February 23, 2015

Identifying Unique Species in Paleobiology Database Occurrence Downloads


Hello all!

I’ve had an interesting week or so having conversations with several individuals about best practices for identifying ‘species’ level taxa in fossil occurrence data from the Paleobiology Database following my post two weeks ago where I found that several utilities associated with the PBDB identified ‘species’ in different ways. Let’s try to hit this issue really hard today.

Paleobiology Database Vocabulary

First, in my last post, I was a little confusing since I referred to the same variable by its full name and a shortening that seemed to be used only by the paleobioDB package (Varela et al., in press); it turns out this is simply the difference between variable headings returned by the Paleobiology Database’s API depending on special parameter vocab, which can either be pbdb (full variable names) or com (for the ‘compact’, shortened terminology). You can easily change this in paleobioDB data-pulling functions using argument vocab.

## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="com"))
##      oid typ cid                tna rnk   tid            mna mra   mid
## 1:1 3393 occ 328 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:2 4589 occ 378 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:3 4634 occ 379 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:4 4688 occ 380 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:5 8957 occ 403 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:6 9127 occ 424 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
##                   oei   eag   lag rid            idt ids  oli  rst  rss
## 1:1       Black River 460.9 457.5  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:2             Harju 452.0 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:3             Harju 452.0 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:4           Caradoc 460.9 449.5  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:5           Ashgill 449.5 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:6 Middle Ordovician 470.0 458.4  13 Dicellograptus sp. <NA> <NA> <NA>
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb"))
##     occurrence_no record_type collection_no         taxon_name taxon_rank
## 1:1          3393  occurrence           328 Dicellograptus sp.      genus
## 1:2          4589  occurrence           378 Dicellograptus sp.      genus
## 1:3          4634  occurrence           379 Dicellograptus sp.      genus
## 1:4          4688  occurrence           380 Dicellograptus sp.      genus
## 1:5          8957  occurrence           403 Dicellograptus sp.      genus
## 1:6          9127  occurrence           424 Dicellograptus sp.      genus
##     taxon_no   matched_name matched_rank matched_no    early_interval
## 1:1    33650 Dicellograptus        genus      33650       Black River
## 1:2    33650 Dicellograptus        genus      33650             Harju
## 1:3    33650 Dicellograptus        genus      33650             Harju
## 1:4    33650 Dicellograptus        genus      33650           Caradoc
## 1:5    33650 Dicellograptus        genus      33650           Ashgill
## 1:6    33650 Dicellograptus        genus      33650 Middle Ordovician
##     early_age late_age reference_no     genus_name species_name
## 1:1     460.9    457.5           13 Dicellograptus          sp.
## 1:2     452.0    443.7           13 Dicellograptus          sp.
## 1:3     452.0    443.7           13 Dicellograptus          sp.
## 1:4     460.9    449.5           13 Dicellograptus          sp.
## 1:5     449.5    443.7           13 Dicellograptus          sp.
## 1:6     470.0    458.4           13 Dicellograptus          sp.
##     late_interval genus_reso species_reso
## 1:1          <NA>       <NA>         <NA>
## 1:2          <NA>       <NA>         <NA>
## 1:3          <NA>       <NA>         <NA>
## 1:4          <NA>       <NA>         <NA>
## 1:5          <NA>       <NA>         <NA>
## 1:6          <NA>       <NA>         <NA>

Note that some variables change their formatting depending on whether pbdb or com vocabulary are used; for example, taxon_rank is full nouns (‘genus’, ‘species’, etc) with pbdb, but is a numerical coding with com.

For this post (and future posts), I’ll try to use the full pbdb terminology; you can always check the API page for PBDB occurrence data if you want the corresponding com label though.

A brief aside… paleobioDB package versus the raw API in R

Shanan Peters has usefully pointed out that we shouldn’t mistake paleobioDB as the only way to access the PBDB API. After all, the API will presumably continue to evolve and thus the paleobioDB package may not be up-to-date with the API’s various arguments. Instead of using paleobioDB we could instead access the API directly, for example:

##   occurrence_no record_type reid_no superceded collection_no
## 1          3393  occurrence      NA         NA           328
## 2          4589  occurrence      NA         NA           378
## 3          4634  occurrence      NA         NA           379
## 4          4688  occurrence      NA         NA           380
## 5          8957  occurrence      NA         NA           403
## 6          9127  occurrence      NA         NA           424
##           taxon_name taxon_rank taxon_no   matched_name matched_rank
## 1 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 2 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 3 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 4 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 5 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 6 Dicellograptus sp.      genus    33650 Dicellograptus        genus
##   matched_no    early_interval     late_interval early_age late_age
## 1      33650       Black River       Black River     460.9    457.5
## 2      33650             Harju             Harju     452.0    443.7
## 3      33650             Harju             Harju     452.0    443.7
## 4      33650           Caradoc           Caradoc     460.9    449.5
## 5      33650           Ashgill           Ashgill     449.5    443.7
## 6      33650 Middle Ordovician Middle Ordovician     470.0    458.4
##   reference_no     genus_name genus_reso subgenus_name subgenus_reso
## 1           13 Dicellograptus                       NA            NA
## 2           13 Dicellograptus                       NA            NA
## 3           13 Dicellograptus                       NA            NA
## 4           13 Dicellograptus                       NA            NA
## 5           13 Dicellograptus                       NA            NA
## 6           13 Dicellograptus                       NA            NA
##   species_name species_reso
## 1          sp.             
## 2          sp.             
## 3          sp.             
## 4          sp.             
## 5          sp.             
## 6          sp.

Note, however, some small differences between these otherwise very similar requests. In particular, paleobioDB seems to have removed several variables related to data cleaning, like whether an occurrence has been superceded by another occurrence or new taxonomic assignment. I’ve traced the code used in the function through paleobioDB’s github repo and I’m not sure why these don’t show up.

PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")

# same number of columns in both?
## [1] FALSE
# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"       "superceded"    "subgenus_name" "subgenus_reso"

Perhaps there is a cleaning step where paleobioDB (or Paleobiology Database, when it responds to JSON requests) is removing empty columns (i.e. contains only NAs).

# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all([,x])))
##       reid_no    superceded subgenus_name subgenus_reso 
##          TRUE          TRUE          TRUE          TRUE

There’s no mention to this phenomenon that I can find in paleobioDB’s manual (or any evidence of column dropping in the code of the package’s repo) or the documentation for the Paleobiology Database’s API, however. It’s a little odd because it means one could query, for example, for abundance data (show=c("abund")) and not get back the columns relating to abundance because they are empty. This could create issues for code that expects variables to always be present in a given data download.

For example, here is an extremely contrived example for the bizarre reticulate graptoloid genus Reteograptus (see Finney, 1980, for some morphological details), which is probably a lazarus taxon of the abrograptids (according to Finney).

PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")

# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"     "superceded"  "abund_value" "abund_unit"
#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(
## abund_value  abund_unit 
##        TRUE        TRUE

There does not appear to be an argument to alter this variable-dropping behavior. I’ve opened an issue ticket here at the paleobioDB repo.

The ultimate result of the is that we can’t always depend on the same variables to (a) be present, (b) have the same column headings or (c) have the same variable codings for tables of PBDB occurrence data, because it depends greatly on how the data was downloaded and which options were used, going beyond the choices involving show =. The authors of paleobioDB package seem to have dealt with this by writing two separate versions of functions for analyzing occurrence data within each function (example), which is inefficient but perhaps necessary.

Distinguishing Species in Paleobiology Database Occurrence Data

It turns out there is no single ‘right’ answer to how to identify unique species in a set of PBDB occurrence data and assign occurrences to these distinct species. To understand the different ways we might perform this task, we should cover the relevant variables and what the information contained within is. The following variables can be obtained by setting the show to both ident and phylo, whether within the API or paleobioDB.

Some of the following information comes from a very informative discussion I have had with Matthew Clapham. However, personallylly, I’m still working through understanding all these variables, and intentionally trying to do this with as little fact-checking by email as possible. This is partly to demonstrate the limitations of the current documentation and also to avoid drowning certain individuals in emails. Better I save up all my questions for the end, right? I will be honest and say I still don’t entirely understand reid_no and superceded. So don’t blame my sources, blame me for when I get things wrong, because that’s part of this little mental experiment.

An important distinction is that one set of variables refers to the taxonomic identification for an occurrence reported within the original reference:

  • taxon_name described as “The taxonomic name by which this occurrence is identified.”

  • taxon_rank described as “The taxonomic rank of the name, if this can be determined.”

  • genus_name described as “The taxonomic name (less species) by which this occurrence was identified. This is often a genus, but may be a higher taxon.”

  • genus_reso described as “The resolution of this taxonomic name, i.e. sensu lato or aff.”

  • species_name described as “The species name (if any) by which this occurrence was identified”

  • species_reso described as “The resolution of the species name, i.e. sensu lato or n. sp”

Related to this group, we have the variable taxon_no:

  • taxon_no described as “The unique identifier of the identified taxonomic name. If this is empty, then the name was never entered into the taxonomic hierarchy stored in this database and we have no further information about the classification of this occurrence.”

…which identifies which taxon ID is assigned to the original taxon identification. However, many taxa apparently get entered without being given unique taxon IDs, and are instead assigned tentatively to a higher taxon. For example, there are many occurrences listed with distinct species_names that are listed under a taxon_no for a genus-level taxon.

A different set of variables refers to the formal taxon assignment currently given within the PBDB’s present state of taxonomic standardization, which attempts to control for synonymies:

  • matched_name described as “The senior synonym and/or currently accepted spelling of the closest matching name in the database to the identified taxonomic name, if any is known, and if this name is different from the value of taxon_name.”

  • matched_rank described as “The taxonomic rank of the matched name, if different from the value of taxon_rank”

  • matched_no described as “The unique identifier of the closest matching name in the database to the identified taxonomic name, if any is known.”

  • genus described as “The name of the genus in which this occurrence is classified”

  • genus_no described as “The identifier of the genus in which this occurrence is classified”

Again, because many lower-taxa get informally assigned to the same higher-taxon ID on a tentative tentatively, taxon_no (and thus matched_no and matched_name in addition) may be the same for occurences listed as multiple distinct species going by taxon_rank and species_name.

Finally, the occurrence itself might be reidentified or superceded with a new taxonomic assignment, which can be found in:

  • reid_no described as “If this occurrence was reidentified, a positive integer that uniquely identifies the reidentification”

  • superceded described as “The value of this field will be true if this occurrence was later identified under a different taxon”

These aren’t exhaustive lists, and if we wanted to discuss pulling out unique supraspecific taxa, we would need to consider additional variables such as family_name reported by the show=phylo parameter.

It turns out the Dicellograptus example I keep using provides an example of one type of disagreement we might see in groups where there has been little standardization and where many species have been entered ‘informally’ as tentative assignments to a higher taxon. So let’s focus on the above variables in that dataset:

#taxonomic variables of interest
taxonVar<-c("reid_no","superceded",      #PBDB occurrence reidentification info
            #original taxon info for occurrence
            #standardized formal taxon info

# due to the column dropping issue, we can't use paleobioDB package:
#dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident","phylo"), vocab="pbdb")

# so we'll use the API instead
##   reid_no superceded         taxon_name taxon_rank     genus_name
## 1      NA         NA Dicellograptus sp.      genus Dicellograptus
## 2      NA         NA Dicellograptus sp.      genus Dicellograptus
## 3      NA         NA Dicellograptus sp.      genus Dicellograptus
## 4      NA         NA Dicellograptus sp.      genus Dicellograptus
## 5      NA         NA Dicellograptus sp.      genus Dicellograptus
## 6      NA         NA Dicellograptus sp.      genus Dicellograptus
##   genus_reso species_name species_reso taxon_no   matched_name
## 1                     sp.                 33650 Dicellograptus
## 2                     sp.                 33650 Dicellograptus
## 3                     sp.                 33650 Dicellograptus
## 4                     sp.                 33650 Dicellograptus
## 5                     sp.                 33650 Dicellograptus
## 6                     sp.                 33650 Dicellograptus
##   matched_rank matched_no          genus genus_no
## 1        genus      33650 Dicellograptus    33650
## 2        genus      33650 Dicellograptus    33650
## 3        genus      33650 Dicellograptus    33650
## 4        genus      33650 Dicellograptus    33650
## 5        genus      33650 Dicellograptus    33650
## 6        genus      33650 Dicellograptus    33650

Just from the first few lines, we can also see a number of Dicellograptus occurrences that were never resolved beyond the genus level and thus are just assigned to the Dicellograptus genus taxon ID (taxon_no = 33650), and the matched_name and other ‘matched’ information is the same, because Dicellograptus has never been synonymized (which is good, because its a valid genus, so it shouldn’t be synonymized).

We can see these listed as “Species lacking formal opinion data” here.

But how many unique taxon_name and species_name values are assigned to taxon_no = 33650?

##  [1] Dicellograptus sp.                 Dicellograptus ? sp.              
##  [3] Dicellograptus alector n. sp.      Dicellograptus gurleyi            
##  [5] Dicellograptus mensurans           Dicellograptus sextans cf.        
##  [7] Dicellograptus spp.                Dicellograptus sextans            
##  [9] Dicellograptus vagus cf.           Dicellograptus intortus           
## [11] Dicellograptus divaricatus cf.     Dicellograptus caduceus cf.       
## [13] Dicellograptus forchammeri cf.     Dicellograptus intortus cf.       
## [15] Dicellograptus moffatensis cf.     Dicellograptus intermedius n. sp. 
## [17] Dicellograptus divaricatus         Dicellograptus russonioides n. sp.
## [19] Dicellograptus johnstrupi          Dicellograptus morrisi            
## [21] Dicellograptus gurleyi cf.         Dicellograptus elegans cf.        
## [23] Dicellograptus morrisi cf.         Dicellograptus smithi cf.         
## [25] Dicellograptus sp. ?               Dicellograptus alector            
## [27] Dicellograptus alector cf.         Dicellograptus ? angulatus cf.    
## [29] Dicellograptus angulatus cf.       Dicellograptus flexuosus          
## [31] Dicellograptus minor               Dicellograptus tumidus            
## [33] Dicellograptus turgidus            Dicellograptus mirabilis          
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.
##  [1] sp.          alector      gurleyi      mensurans    sextans     
##  [6] spp.         vagus        intortus     divaricatus  caduceus    
## [11] forchammeri  moffatensis  intermedius  russonioides johnstrupi  
## [16] morrisi      elegans      smithi       angulatus    flexuosus   
## [21] minor        tumidus      turgidus     mirabilis   
## 27 Levels: alector anceps angulatus caduceus complanatus ... vagus

Thankfully, all of these are given a taxon_rank of species… right? We can check this by removing any occurrence not given a taxon_rank of species and looking at the unique taxon_name values.

unique(dicelloData[dicelloData$taxon_no==33650 & dicelloData$taxon_rank!="species","taxon_name"])
## [1] Dicellograptus sp.   Dicellograptus ? sp. Dicellograptus spp. 
## [4] Dicellograptus sp. ?
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.

Okay, in Dicellograptus, at least, all the informal species assigned formally to a genus-level taxon are listed correctly as species.

We can examine another sort of the disagreement among these fields by looking at datasets where there has been a lot of taxonomic standarization in the PBDB (i.e. not graptolites). Matt Clapham showed me a particularly interesting read-out for a query involving Acosarina minuta, a Permian brachiopod (…perhaps a minute brachiopod?), filtering for the same variables as above:

# so we'll use the API, like above...

# Note that the following won't work except from web browser address bar or other utility 
  # which automatically (and silently) replaces " " with "%20"
acoData1<-read.csv(" minuta&show=ident,phylo")
##  [1] "occurrence_no"  "record_type"    "reid_no"        "superceded"    
##  [5] "collection_no"  "taxon_name"     "taxon_rank"     "taxon_no"      
##  [9] "matched_name"   "matched_rank"   "matched_no"     "early_interval"
## [13] "late_interval"  "early_age"      "late_age"       "reference_no"
# the query terminated at the " " in the taxon name and never got to other parameters, like 'show'
# (This tripped me up for an hour. Don't judge me!)

# Okay, now the right way
##   reid_no superceded             taxon_name taxon_rank  genus_name
## 1      NA         NA     Orthotichia indica    species Orthotichia
## 2   20839         NA       Acosarina minuta    species   Acosarina
## 3      NA         NA       Acosarina minuta    species   Acosarina
## 4      NA         NA Acosarina dorsisulcata    species   Acosarina
## 5      NA         NA Acosarina dorsisulcata    species   Acosarina
## 6      NA         NA Acosarina dorsisulcata    species   Acosarina
##   genus_reso species_name species_reso taxon_no     matched_name
## 1                  indica                177488 Acosarina minuta
## 2                  minuta                125301 Acosarina minuta
## 3                  minuta                125301 Acosarina minuta
## 4            dorsisulcata                126711 Acosarina minuta
## 5            dorsisulcata                126711 Acosarina minuta
## 6            dorsisulcata                126711 Acosarina minuta
##   matched_rank matched_no     genus genus_no
## 1      species     125301 Acosarina    26652
## 2      species     125301 Acosarina    26652
## 3      species     125301 Acosarina    26652
## 4      species     125301 Acosarina    26652
## 5      species     125301 Acosarina    26652
## 6      species     125301 Acosarina    26652

All of the occurrences listed here are formally identified in PBDB’s standardized taxonomy as Acosarina minuta. That’s pretty easy to verify by looking at unique values of matched_name and matched_no:

## [1] Acosarina minuta
## Levels: Acosarina minuta
## [1] 125301

However, we can see even just from the head() that there are multiple values of taxon_name, taxon_no, genus_name and species_name formally reassigned to the ‘matched’ taxon Acosarina minuta. Here’s a full list:

##                           taxon_name taxon_no   genus_name species_name
## 1                 Orthotichia indica   177488  Orthotichia       indica
## 2                   Acosarina minuta   125301    Acosarina       minuta
## 4             Acosarina dorsisulcata   126711    Acosarina dorsisulcata
## 15     Acosarina dorsisulcata n. sp.   126711    Acosarina dorsisulcata
## 49                     Orthis indica   125296       Orthis       indica
## 58                  Acosarina indica   175512    Acosarina       indica
## 99                 Kotlaia capillosa   123689      Kotlaia    capillosa
## 100 Kotlaia n. gen. capillosa n. sp.   123689      Kotlaia    capillosa
## 103               Orthotichia minuta   199434  Orthotichia       minuta
## 110             Acosarina minuta cf.   125301    Acosarina       minuta
## 111             Sunacosarina campana   190895 Sunacosarina      campana
## 134        Acosarina kanmerai n. sp.   206333    Acosarina     kanmerai
## 142   Orthis (Schizophoria) indica ?   125296       Orthis       indica
## 143     Orthis (Schizophoria) indica   125296       Orthis       indica
## 217                Dalmanella indica   255744   Dalmanella       indica
## 234               Acosarina ? minuta   125301    Acosarina       minuta

So, how do we navigate these variables to identify unique species in our datasets?

Potentially Misleading Ways To Identify Species in PBDB Occurrence Data

In the last post, I mentioned that a number of my friends, as it turned out, used very different criteria for distinguishing taxa from PBDB occurrence data, criteria which also varied among the formal ‘apps’ of the Paleobiology Database.

Personally, I’d intuitively chosen (two blog posts ago) to use species_name alone, filtered on taxon_rank = "species". It turns out several of my friends also use species_name. Another friend uses a combination of taxon_name and taxon_no. Strategies based on taxon_name or species_name are (at best) not making use of the potential for information of taxonomic synonymies in the PBDB and (at worst) in groups that have undergone a considerable amount of qualified taxonomic standardization, using these variables may be very misleading relative to the current state of knowledge.

It should also be clear that taxon_no alone could be a misleading way to distinguish unique species in occurrence data, even if filtered to species-ranked occurrences. Referring to the last post, I showed that paleobioDB package uses taxon_no to identify unique species in occurrences, filtered to taxon_rank of species. The function match() is then used to find genus_name and species_name which are then combined into a single species label. We can see the code here. As we can see above, even filtering taxon_no to species-level occurrences via taxon_rank would result in mistakingly grouping a number of informally entered species that are formally assigned to their genus as a single species-level taxon with an exaggerated number of occurrences assigned to it. Furthermore, the use of match() returns the first and only the first match in a list, so the assigned species-label is an arbitrary function of whatever species comes first in a table of occurrences. This results in the mistaken reporting of Dicellograptus alector from Dicellograptus occurrence data below:

dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

I’ve opened an issue ticket for this issue on the paleobioDB repo here.

It turns out this isn’t what PBDB Navigator does, as I had incorrectly guessed last post. PBDB Navigator is doing something much more complex, counting formal taxa that are listed as daughter taxa. The reason why Dicellograptus gravis appears despite having no occurrences in the PBDB is because that taxon was formally entered without any occurrences. This is discussed more here, where I had thought it was a bug and reported it as such, but it was more of a… misinterpretation. Similarly, PBDB Classic / Fossilworks appears to use the taxonomic standarization in the database to return ranges for all formally-defined species appropriately (under the Download function), but returns any informally defined species in addition.

A Framework for Identifying Unique Species

So, how should we go about pulling species-level taxa from our occurrence data? If we want all ‘species’ information possible, then we must do two seperate processes:

  1. Pull out all formally-defined species, with their matched_name species name. This should be as simple as using matched_rank to filter on species-level formal taxa and pulling out the unique matched_no values.

  2. Pull out all informal species that are assigned formally to supraspecific taxa. This should be possible on filtering only for taxa that do not have a matched_rank = species assignment but have a taxon_rank = species value, and then pulling on unique values of species_name.

Obviously, the second is optional if we want to only keep formal species; depends on our data.

There are some complexities to the above for actually generating the correct labels for these species-level taxa. First, there is no corresponding species_name-like variable for matched_taxon as there is with genus_name and genus. The most reasonable method seems to be splitting the species name from the genus name in matched_name and combining it with genus, but presumably this should have the same effect as just using matched_name whole.

Let’s look at a function that tries to follow this ‘informal versus formal’ dichotomy. Here’s taxonClean from Matthew Clapham’s github repo PBDB-R-scripts, function script located here:

taxonClean <- function(occurrences, tax_level="genus", formal_id="no") {
  #file preparation and data cleaning for species level analysis
    #removes rows where species qualified by cf. or aff., question mark, ex gr. or quotation mark
    resolved_occs <- subset(occurrences, occurrences$species_reso=="" | occurrences$species_reso=="n. sp.")
    if (formal_id=="yes") {
      #deletes occurrences not resolved to species level using formally-classified species
      cleaned_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
      cleaned_occs$final_taxon <- cleaned_occs$matched_name
    } else {
      #deletes occurrences not resolved to species level using classified and unclassified species
      resolved_occs <- subset(resolved_occs, resolved_occs$taxon_rank=="species")
      #strips all extraneous qualifiers from taxon name
      resolved_occs$taxon_name <- gsub("\" ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("cf. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("aff. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("\\? ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("n. gen. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" n. sp.", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" sensu lato", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" informal", "", resolved_occs$taxon_name)
      #separates classified and unclassified occurrences
      classified_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
      unclassified_occs <- subset(resolved_occs, resolved_occs$matched_rank!="species")
      classified_occs$final_taxon <- classified_occs$matched_name
      unclassified_occs$final_taxon <- unclassified_occs$taxon_name
      #combines classified and unclassified occurrences
      cleaned_occs <- rbind(classified_occs, unclassified_occs)
    #or file preparation and data cleaning for family level analysis
  } else if (tax_level=="family") {
      #deletes occurrences not resolved to at least family level
      resolved_occs <- subset(occurrences, occurrences$matched_rank <= 9)    
      #deletes occurrences where genus or family are qualified with question mark, quotations, cf. or aff.
      if(length(grep("\\? ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("\\? ", resolved_occs$taxon_name),]
      if(length(grep("\" ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("\" ", resolved_occs$taxon_name),]
      if(length(grep("cf. ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("cf. ", resolved_occs$taxon_name),]
      if(length(grep("aff. ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("aff. ", resolved_occs$taxon_name),]
      cleaned_occs <- resolved_occs
      cleaned_occs$final_taxon <- cleaned_occs$family
    #or file preparation and data cleaning for genus level analysis
  } else {
      #deletes occurrences not resolved to at least genus level using classified and unclassified species
      resolved_occs <- subset(occurrences, occurrences$matched_rank <= 5)
      #deletes occurrences where genus is qualified with question mark, quotations, cf. or aff.
      cleaned_occs <- subset(resolved_occs, resolved_occs$genus_reso=="" | resolved_occs$genus_reso=="n. gen.")
      #extracts genus name from matched_name string
      cleaned_occs$final_taxon <- gsub(" .*", "", cleaned_occs$matched_name)


(I made some small code alterations because I am just personally terrified by the idea of not embracing the lines of control structure (i.e. if, for, etc.) in curly braces! There was several actual errors: an erroneous resolved_occs before resolved_occs was called, also an attempt to call matched_name = 3 rather than matched_name = "species"…)

Note that taxonClean depends on vocab = "pbdb", which differs from the default for data functions in package paleobioDB. This function will return a table of occurrences, pared down only to the occurrences of taxa at the appropriate level and with their valid taxon labels in a new variable labeled final_taxon.

Let’s see what it does with our Dicellograptus data, using the API.


# with informal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="no")
##  [1] Dicellograptus anceps       Dicellograptus complanatus 
##  [3] Dicellograptus ornatus      Dicellograptus alector     
##  [5] Dicellograptus gurleyi      Dicellograptus mensurans   
##  [7] Dicellograptus sextans      Dicellograptus intortus    
##  [9] Dicellograptus intermedius  Dicellograptus divaricatus 
## [11] Dicellograptus russonioides Dicellograptus johnstrupi  
## [13] Dicellograptus morrisi      Dicellograptus flexuosus   
## [15] Dicellograptus minor        Dicellograptus tumidus     
## [17] Dicellograptus turgidus     Dicellograptus mirabilis   
## 19 Levels: Dicellograptus ... Dicellograptus mirabilis
# only formal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="yes")
## [1] Dicellograptus anceps      Dicellograptus complanatus
## [3] Dicellograptus ornatus    
## 4 Levels: Dicellograptus ... Dicellograptus ornatus

For me though, I’d probably approach this issue in a slight different way than taxonClean… I’ll show what I cook up next time, when I (hopefully) finally show some stuff I did weeks ago integrating paleobioDB data downloads with paleotree functions.

Wednesday, February 11, 2015

What is the right way to identify unique species in Paleobiology Database downloads?

(Continuing with the RMarkdown-to-Blogger experiment! I apologize for any mysterious indentations that I can’t seem to make go away! Sorry!)

So, let’s recap:

Last post, we had some fun pulling data out of the Paleobiology Database using R package paleobioDB (Varela et al., in press). We even made a nice plot of age uncertainty in occurrences sorted by species in the delightful graptolite genus Dicellograptus. Let’s see that pretty plot again, although we’ll hide some of the R code this time!

dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
dicelloData<-dicelloData[dicelloData$rnk==3,]   #keep only occurrences of taxa resolved to species level

plotOccPBDB(dicelloData,"Dicellograptus Species")

Now, we had a little mystery when we decided to look at species ranges from this same data using function pbdb_temp_range in package paleobioDB.

pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

The occurrences I plotted above show something like more than twenty species. pbdb_temp_range shows only four species. What is going on?

Well, for the occurrence plot above, I used the species_name column (the column $ids, see here for its description in the paleobioDB API documentation) to distinguish different species, which we can see contains more than 20 unique identifiers:

##  [1] "anceps"       "alector"      "gurleyi"      "mensurans"   
##  [5] "sextans"      "complanatus"  "vagus"        "intortus"    
##  [9] "divaricatus"  "caduceus"     "forchammeri"  "moffatensis" 
## [13] "intermedius"  "russonioides" "johnstrupi"   "morrisi"     
## [17] "elegans"      "smithi"       "angulatus"    "ornatus"     
## [21] "flexuosus"    "minor"        "tumidus"      "turgidus"    
## [25] "mirabilis"

However, while pbdb_temp_range uses the various ‘name’ entries for identifying independent taxa at other taxonomic levels, pbdb_temp_range uses taxon_no (column $tid) to identify unique species-level taxa. We can see this in the code in paleobioDB’s GitHub repo.

This is suggestive that taxon_no is the Paleobiology Database’s way of handling taxonomic synonymy among species. A quick survey of a few friends who actually use PBDB data on a regular basis (unlike me, I’m a PBDB-newb) reveals some… uh, lack of clarity on this point. Well, a lack of clarity often suggests lack of clear documentation. Indeed, the paleobioDB API documentation describes taxon_no simply as the “unique identifier of the identified taxonomic name” and is remarkably silent on suggesting whether it might indicate unique species-level identities. Looking through the rest of the API documentation, its never said explicitly, but one might infer that this is true, as long as taxonomic rank (column $rnk) has been been limited to species-level taxa only. I.e., all species-level taxa with unique taxon-IDs are valid species-level taxa.

I wonder though: aren’t there cases of synonymized supraspecific taxa? These must be missed if uniqueness at those levels in paleobioDB package is determined based on the various ‘names’ qualifiers. This issue seems like quite the tough nut. It seems properly sorting data out of an occurrances download from the PBDB may require directly referencing the taxonomic database of the PBDB as well.

Anyway, we can see that $tid matches the species returned by pbdb_temp_range, which finds its names for the unique tid values using match(), which just means the species name reported by pbdb_temp_range is whatever species name is listed first with that tid value in a given data table.

## [1] 306364  33650 306226 306367
as.character(sapply(unique(dicelloData$tid),function(x) dicelloData$ids[match(x,dicelloData$tid)]))
## [1] "anceps"      "alector"     "complanatus" "ornatus"

Well, that solves that mystery, but is tid actually the right way to identify unique valid species in a dataset? I wonder, as this isn’t written down anywhere…

I decided to pursue this further and went to Fossilworks, a mirror database of the Paleobiology Database which I mentioned in the previous post. I used the download form at Fossilworks to pull the species ranges for Dicellograptus. It returns ranges for 27 separate species; here’s a summary showing just the first four columns and small selection of ranges:

genus species base of range (Ma) top of range (Ma)
Dicellograptus alector 456.1 443.7
Dicellograptus anceps 449.5 445.6
Dicellograptus angulatus 456.1 449.5
Dicellograptus caduceus 460.9 456.1
Dicellograptus complanatus 449.5 445.6
Dicellograptus divaricatus 468.1 456.1
Dicellograptus vagus 460.9 456.1

I also get 27 species with the download form using PBDB classic at paleobioDB (also discussed in the last post). So, these applications are clearly using species_name and not taxon_no to distinguish what species are in the occurrences data.

Okay, I thought, this is, uh… inconsistent.

Recently, the Paleobiology Database revealed a new interface, called PBDB Navigator, which uses a map and various chronological and taxonomic filters to sort through the PBDB collection data. How many species does Navigator report for Dicellograptus?

Taxon name (number of occurrences in the database)
    Dicellograptus complanatus  (1) 
    Dicellograptus ornatus  (1) 
    Dicellograptus anceps  (1) 
    Dicellograptus gravis  (1) 

Huh. Okay, so Navigator uses taxon_no to identify individual unique species, unlike the Classic PBDB which is on the same website but uses species_name.

This is making my head hurt. What’s the answer? Maybe someone could pipe up in the comments.

Anyway, until next time!

How Do We Treat Fossil Age Data? Dates, Ranges and Occurrences

(Hello all! I’m trying Rmarkdown for writing blog posts! Let’s see how this goes.)

I spend a lot of time thinking about how the datasets we (paleontologists and biologists) work with differ in their structure and information content. The obvious difference is how much paleontological data differs from the sort of data we have for living organisms in evolutionary biology (neontology… whatever…) but the one we so often overlook is how datasets from the fossil record can differ greatly from each other. I talk a lot in my book chapter last year on how much paleontological datasets for different groups of organisms or time-intervals can differ and how this impacts our intention to try phylogeny-based analyses of macroevolution (you can check out my chapter here, thanks to the publisher, including its horrid easter eggs referring to a certain popular series of Game Boy games).

That chapter barely scratched the surface in terms of ‘things to know about paleo data’. Recently, while playing with Paleobiology Database data using the recently release R package paleobioDB, I had some additional thoughts on how we think of and use chronostratigraphic data for fossil taxa.

Warning: Many of you who are paleontology are probably already familiar with a lot of what I’m about to say. In some regards, I’m very mindful that some of my biologist friends read this, so I’m trying to cover what they might not be so familiar with. Also, much of it is a retread of previous blog posts. A major goal of this blog post is an attempt to describe a set of terms for relating how different types of fossil age datasets relate to one another in terms of their information content (in a sense, their ontology), and that requires discussing some of the tiniest of minutia. I also don’t think I’m right about everything I say below–I think there is a lot more thinking we need to have about the ‘ontology’ of paleontological age data.

Four-Date Age Data (“timeList”)

I’ve thought a few ways about to talk about these issues, and the best I can do is to start with how I encountered taxonomic age data initially. So, graptoloids (the planktonic graptolites, my favorite group… why aren’t they yours too?) have a very well known fossil record. Skipping through some of the vagueries of Paleozoic time-scale making, you can generally tell based upon the species composition of a graptolite fossil assemblage which graptolite biostratigraphic zone you are in, meaning that the rocks you’re looking at come before ‘this graptolite’ showing up for the first time but after some other graptolite species shows up. The same taxa are often found across multiple continents, so there is a global correlation of such zones. The use of some complicated annealing software allows us to put absolute dates on the first appearance of graptolite taxa (Sadler et al., 2009), so if you know the graptolite zone of a rock, you know what little segment of (about) a few million years that rock is from. What nice little time-keepers… Its really too bad graptoloids are extinct, isn’t it?

Anyway, most of the data recorded in papers referring to when particular graptoloid taxa appear in the geologic record is often just which graptolite zones these species or genera first and last appear within. This is understandable, as some taxa can be found pretty continuously throughout their stratigraphic range, and all the information important for stratigraphic correlation are in these times of first and last appearance. In many cases, the data is also very well-behaved in that these graptolite zones don’t overlap and are of a rougly equal length (although maybe with more variation than we’d like; Sadler et al., 2009). This is exactly the sort of data I was first exposed to and dealt with in some undergrad research (published in Bapst et al., 2012). You can find the data here), or you can play with it in paleotree because I recently added it as example data. Let’s take a look at that sort of data.

## Loading required package: ape

Our item of interest is graptRanges, which is a timeList object I’ve talked about on this blog before and composed of two matrices: a matrix of the earliest and latest ages for some intervals, in this case graptolite biostratigraphic zones, which are intervals denoted by the first appearance of the taxa the zones are named after.

##                                     start_time end_time
## Nemagraptus gracilis (Gi1)              460.86   456.35
## Orthograptus calcaratus (Gi2)           456.35   455.29
## Diplograptus lanceolatus (Ea1)          455.29   452.21
## Diplacanthograptus spiniferus (Ea2)     452.21   449.73
## Dicellograptus kirki (Ea3)              449.73   448.96
## Dicranograptus gravis (Ea4)             448.96   448.57

…and first and last intervals of appearance for some taxa.

##                                 first_int last_int
## 'Bulmanograptus' macilentus            16       16
## 'Monograptus' arciformis               16       17
## 'Monograptus' austerus                 15       17
## 'Paramplexograptus' kiliani            12       13
## 'Paramplexograptus' paucispinus        12       13
## 'Prisitiograptus' fragilis             16       19

Information-wise, this means every taxon has four dated values associated with it: the earliest and latest dates for the intervals in which a taxon first and last appears. By separating the two into separate matrices (effectively making the data structure no longer ‘flat’) we remove some redundancy of having to list the same dates for different taxa which first or last appear in the same interval. Essentially, we’re compressing the information content of our taxon range data. Losing this redundancy is good because it minimizes needing to update the dates more than once.

Note that when we have consecutive, non-overlapping intervals (like with the graptoloid data above), its possible we could compress the information content even further; for example, we could just have the start dates of intervals listed in the interval times matrix, and their end-date is, by implication, the start date of the next interval listed. paleotree retains the second column however so that it can handle those cases where intervals do overlap or are non-consecutive. Going in the other direction, we might want more than two numbers to define dates for a given collection, if we had more information about dates than simply a minimum and maximum bound with a flat probability density inbetween (for example, if the fossils themselves can be dated using geochemistry, such as Strontium isotopes), but that sort of age information is not common.

Two-Date Age Data (“timeData”)

Particularly in discussions I had on this blog last year, I stressed that there was two other sorts of datasets that might be confused with this. If each taxon was known from only a single collection in the fossil record, such that their first appearance was their last appearance, than we would only have two dates known: the earliest and latest dates for that particular rock it was found in. Alternatively, in an extremely well resolved fossil record, we might have exceptional certainty about the dating of fossil remains, such the first and last appearance dates (FADs and LADs) of taxa were known from strata of extremely precise dating with negligble uncertainty.

Both situations are simply special cases of the four-date datasets; you could still describe them with four dates, but the values would largely be redundant. For comparability across datasets, though, it may be preferable to define taxa as

In both of these cases, each listed taxon has two dated values associated with it, but they imply very different things. I’ve encountered issues with people mistaking one for the other when it comes to providing input to paleotree functions, like timePaleoPhy, hence some recent (in the past year) changes to how timePaleoPhy handles age data (which the help file refers to as “timeData”) to make it more explicit. It doesn’t help that data reflecting age uncertainty of point observations is probably the most common type of data that people seem to have in-hand when attempting to time-scale phylogenies of fossil taxa, while I myself had in hand data of precise first and last appearances when I initially developed paleotree.

One could imagine cases where both of the above special cases are true: i.e., (1) taxa are known from singular remains and appear to not have any persistent morphotaxon duration and (2) the stratigraphic time-scale (this is sometimes called an ‘age model’) is so well-resolved that dates are known with exceptional precision; this means every fossil taxon’s age could be defined by a single dated value. Such datasets are very rare, and you won’t find anything that assumes such data as input in paleotree.

Ultimately, the commonly used functions for time-scaling phylogenies in the fossil record, my timePaleoPhy and strap’s DatePhylo use two-date age data like this. paleotree’s bin_timePaleoPhy and bin_cal3TimePaleoPhy are just wrappers which take the four-date timeList datasets, randomly sample ages within the intervals and hand their respective time-scaling functions two-date age data.

Occurrence Data

Although its possible to describe the age ranges of fossil taxa from many groups as the four-date-value type above, this format is a reduction in information content; a summary of only what is needed to know about the timing of their first and last appearances. Other than those cases where some taxon is found continuously from its first to last appearance, a given taxon may be observed to occur in a larger number of fossil collections. Thus for each of a given taxon, we might end up with two dates, reflecting the earliest and latest dates bounding the age of each collection. I’ll call this occurrence data (which I think is what everyone else calls it, but I’m not certain 100% because, well, paleontologists can be inconsistent with their terminology).

The Paleobiology Database is probably the data source most accessible to anyone who wants age data from the fossil record, and its raw data is in the form of such occurrences within collections. With the recently introduction of the new paleobioDB package (Varela et al., in press), the use of paleobioDB (aka PBDB) occurrence data is probably going to increase.

Let’s look at an example of occurrence data, using the paleobioDB package to download some PBDB records. To make this fun, we’ll look up this blog’s mascot…

## Loading required package: raster
## Loading required package: sp
## Attaching package: 'raster'
## The following objects are masked from 'package:ape':
##     rotate, zoom
## Loading required package: maps
nemaData<-pbdb_occurrences(limit="all", base_name="Nemagraptus", show=c("phylo","ident"))
nemaData<-nemaData[nemaData$rnk==3,]   #keep only occurrences of taxa resolved to species level
##         oid typ   cid                    tna rnk   tid               oei
## 1:4   93753 occ  6986   Nemagraptus gracilis   3 33700 Middle Ordovician
## 1:16 118184 occ  8776   Nemagraptus gracilis   3 33700           Caradoc
## 1:20 118736 occ  8887   Nemagraptus gracilis   3 33700          Actonian
## 1:22 118977 occ  9035   Nemagraptus gracilis   3 33700        Gisbornian
## 1:36 598661 occ 63445   Nemagraptus gracilis   3 33700        Gisbornian
## 1:38 601803 occ 63967 Nemagraptus ? gracilis   3 33700        Gisbornian
##        eag   lag   rid          odl   odn           cll   cln          phl
## 1:4  470.0 458.4   437 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:16 460.9 449.5   613 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:20 455.8 449.5   607 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:22 460.9 456.1   623 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:36 460.9 456.1 18179 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:38 460.9 456.1 18428 Graptoloidea 33606 Graptolithina 33534 Hemichordata
##        phn         idt      ids         mna mra    oli  rst  rss
## 1:4  33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:16 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:20 33518 Nemagraptus gracilis Nemagraptus   5 Onnian <NA> <NA>
## 1:22 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:36 33518 Nemagraptus gracilis Nemagraptus   5   <NA> <NA> <NA>
## 1:38 33518 Nemagraptus gracilis Nemagraptus   5   <NA>    ? <NA>

We can see each row of the obtained data-frame corresponds to a different individual occurrence of some named taxon. Myltiple occurrences are listed for the same taxon. We can simplify this by looking at just the taxon name listed for the occurrence and the earliest and latest age bounds for the associated collections.

##                         tna   eag   lag
## 1:4    Nemagraptus gracilis 470.0 458.4
## 1:16   Nemagraptus gracilis 460.9 449.5
## 1:20   Nemagraptus gracilis 455.8 449.5
## 1:22   Nemagraptus gracilis 460.9 456.1
## 1:36   Nemagraptus gracilis 460.9 456.1
## 1:38 Nemagraptus ? gracilis 460.9 456.1

We can already see that many of these occurrences overlap with each other and are not from intervals of the same size. Now, PBDB data is noisy and some of that is due to noise, but some of it is just due to differences in how well collections can be chronostratigraphically constrained. So, here we have a data set where taxa may be represented by hundreds of date-values, effectively two date bounds for every collection a taxon occurs in.

We could try visualizing this; first let’s group the occurrences by species…

occList<-lapply(unique(nemaData$ids),function(x) nemaData[x==nemaData$ids,c("eag","lag"), drop=FALSE])
occList<-lapply(occList,function(x) x[order(x[,1]),, drop=FALSE])

## $gracilis
##        eag   lag
## 1:20 455.8 449.5
## 1:48 458.4 453.0
## 1:16 460.9 449.5
## 1:22 460.9 456.1
## 1:36 460.9 456.1
## 1:38 460.9 456.1
## 1:41 460.9 456.1
## 1:43 460.9 456.1
## 1:45 460.9 449.5
## 1:46 460.9 449.5
## 1:49 460.9 456.1
## 1:4  470.0 458.4
## $exilis
##        eag   lag
## 1:39 460.9 456.1
## 1:40 460.9 456.1
## 1:42 460.9 456.1
## 2    460.9 449.5

And now let’s plot the bounds of the age uncertainty for each of the occurrences, color-coded for their species.

plot(0, 0, type="n", xlim=c(max(sapply(occList,max))*1.02, min(sapply(occList,min)))*.98,
     ylab = "", xlab = "Time (Ma)", main="Age Uncertainty of PBDB Occurrences for Nemagraptus Species")


for(i in 1:length(occList)){
  for(j in 1:nrow(occList[[i]])){

That wasn’t so impressive, as Nemagraptus is a rather species-poor graptolite genus. Let’s try its close dicranograptid cousin genus, Dicellograptus. This time, we’ll make the plotting code a function.

  #where pbdbOccData is a matrix returned by function pbdb_occurrences in package paleobioDB
  #order taxa by species (add more taxonomic levels later?)
  occList<-lapply(unique(pbdbOccData$ids),function(x) pbdbOccData[x==pbdbOccData$ids,c("eag","lag"), drop=FALSE])
  #order taxa by earliest occurrence
  #order the occurrences within taxa
  occList<-lapply(occList,function(x) x[order(x[,1]),, drop=FALSE])
  #set xlims
   xlims<-c(max(sapply(occList,max)), min(sapply(occList,min))) 
  #initiate the plot with a modifiable main title
  plot(0, 0, type="n", xlim=xlims,
       ylab = "", xlab = "Time (Ma)",
       main=ifelse(is.null(groupLabel),"Age Uncertainty of PBDB Occurrences",
             paste("Age Uncertainty of PBDB Occurrences for",groupLabel)))
  #set colors
    occColors<-sample(rainbow(length(occList)))  #scramble colors
  #set line width
  #now plot the occurrences as lines
  for(i in 1:length(occList)){
    for(j in 1:nrow(occList[[i]])){
  #return the occList as an invisible data object

dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
dicelloData<-dicelloData[dicelloData$rnk==3,]   #keep only occurrences of taxa resolved to species level

plotOccPBDB(dicelloData,"Dicellograptus Species")

Each color shown here is a different taxon, each line is the age uncertainty of a different occurrence. We can see for dicellograptids, the age uncertainty may potentially be a lot greater than the true sampled age ranges themselves. Many of these collections have age uncertainties that overlap and are not contiguous.

Now, the issue here is that for many statistical analyses that we apply in the fossil record, they use only the first and last appearance times, and they also often presume that these appearances are reported in contiguous, non-overlapping intervals of roughly equal length (let’s call that ‘well-behaved’ four-date data). Why? Well, most of these methods depend on the simple modeling trick that if we have contiguous intervals of roughly equal size, it becomes very easy to calculate the probability of observing a taxon X intervals into the future (or the past), given a sampling parameter. Obviously, there are considerable challenges in translating a dataset of loose occurrences to such well-behaved interval data, and that’s where my mind is right now.

Now, all that said, there are some analytical programs that take occurrence data natively. In particular, PyRate is a Python package that applies a Bayesian MCMC algorithm to estimate taxonomic origination, extinction and sampling rates from occurrence data. The method is described here and the program itself is described in MEE here. One could easily set up a functions to use [paleobioDB] package to download PBDB and use PyRate’s data preparations R functions to produce datasets for analysis in Python. Ultimately, PyRate solves the issue using a similar solution to the one I advocate via bin_timePaleoPhy (I discuss this more below): treating the age uncertainty of occurrences as uniform uncertainty bounds on the actual age of taxa and repeatedly randomly sampling ages from within those bounds and running the PyRate analysis. The sampling of ages is technically done in R; the Python component of PyRate only accepts single-date ‘point’ age data for occurrences as input. The results of multiple analyses can then be considered to see how much influence the uncertainty in taxon ages has on the resulting.

But PyRate doesn’t replace every pre-existing analysis of paleobiological data, and the ‘repeatedly sample and consider the resulting distributions’ trick isn’t an applicable solution to those analyses. So, how do we translate occurrence data into ranges within contiguous intervals? How does the manner in which we do this matter (or not) for the analyses we do? I’ve given this question a lot of thought previously with respect to time-scaling phylogenies of fossil taxa, but what about analyses where phylogenies aren’t involved? In particular, there are a number of functions in paleotree for estimating sampling rates and sampling probabilities that depend on having data from contiguous intervals (e.g., getSampProbDisc or make_durationFreqDisc or make_inverseSurv). These are questions I’ll try to address in future posts, along with a long-promised post about porting data from paleobioDB into paleotree format.

However, on a related node…

What does the Paleobiology Database report as two-date age ranges?

Some analyses only accept the two-date range data, and so people have invented algorithms for obtaining such datasets from occurrence data or four-date range data. However, such datasets must represent either precise age ranges or age uncertainty on a single occurrence, and thus information is being lost when we translate to two-dates (except in the rare cases where each taxon is known from only a single collection, and thus the two-dates are simply age uncertainty on singular finds). One can imagine some simple algorithms for obtaining two-date ranges, such as ‘maximum range’, i.e. take the earliest possible age of the oldest occurrence for a taxon and the latest possible age of the youngest occurrence. Although algorithms such as this retain some information about the geologic age of taxa, my opinion is that the information loss about age uncertainty may impact many analyses, perhaps adding an analytical bias. An alternative philosophy, which I espouse in the bin_timePaleoPhy function in paleotree, is to random draw precise FADs and LADs from four-date ranges (or occurrence data), run an analysis (such as time-scaling a phylogeny of fossil taxa), save the resulting test statistic or parameters of interest, and repeat this procedure many times to generate sampled distributions of the test statistics / parameters. Such a procedure retains (rather than loses) the uncertainty component of the ages into our results, but as I said above when discussing PyRate, this procedure isn’t applicable to every type of analysis.

Two of the various utilities or “apps” (…ugh) associated with the Paleobiology Database will return two-date age ranges for a search query. I became very curious how these ranges were calculated and so I wanted to report on my investigations.


Before I get into these various ways of returning ranges from occurrence data, I should mention a relavant not-range-returning application that deals with fossil taxon ages, FirstApp, a module by Matt Clapham. FirstApp takes paleobioDB data for a given taxon and displays occurrences through time as a density histogram, with occurrences placed as points at the earliest age boundary of the collections they occur within.

Ranges in Paleobiology Database Classic / Fossilworks

First, if you go to the present Paleobiology Database website, you have the option of using PBDB Classic which is a search query engine, with a download option (located here). You can access an almost identical interface at FossilWorks, which mirrors the paleobioDB data along with a number of analytical tools previously housed at the PBDB. FossilWork’s basic download option is located here. The algorithm for calculating two-date ranges used is somewhat complex but doesn’t appear to be described in any documentation, so I went straight to the “horse’s mouth” (his words), and asked John Alroy himself. I’ll paraphrase our conversation.

John has developed this “zone-of-overlap” algorithm to deal with obtaining ranges from the realities of messy occurrence data described above: age uncertainties (i.e. stratigraphic intervals) that overlap, of different lengths and don’t sort into an orderly hierarchy. Rather than returning the earliest possible date and the latest possible date like with the maximum-range algorithm described above, the zone-of-overlap algorithm “finds the oldest base that is older than at least part of all the intervals and the youngest that is younger than at least part of all the intervals” (direct quote, John Alroy).

Here’s three specific examples of how these ranges are calculated, taken from direct quotes from our email conversation. Two are from John and the third is from myself; as such, they are almost comedic examples of our taxonomic preferences:

  • “…for example, if the two oldest occurrences are respectively Paleocene (not better resolved than that) and Thanetian then the base has to be the base of the Thanetian.”

  • “Another example would be Hemphillian (Mio-Pliocene) and Pliocene. These work out to a range going from base of the Pliocene to the top of the Hemphillian even though the Hemphillian is ‘older’ (based on its base) and the Pliocene ‘younger.’”

  • “If I follow correctly, some taxon listed in the PBDB with two occurrences, in a Katian collection (base ~456 Ma) and a Dicellograptus anceps zone (base is ~449 Ma) collection respectively, the lower age bound returned returned will be ~449 Ma.”

In other words, for calculating the first appearance datum (FAD), the algorithm looks for all occurrences that overlap with the age range of the earliest-most occurrence, obtains their earliest boundary ages and returns the latest-most earliest age boundary among these overlapping occurrences. Similarly, for calculating the latest appearance datum (LAD), the algorithm looks for all occurrences that overlap with the age range of the latest-most occurrence, obtains their latest boundary ages and returns the earliest-most latest age boundary among these overlapping occurrences. On theoretical grounds, one could probably describe the zone-of-overlap algorithm as minimizing taxonomic age ranges by assuming that overlapping occurrences probably describe a very similar FAD or LAD, and thus picks the one that extends the taxonomic range the least.

However, this does come with a downside that if these occurrences are not repeated attempts to capture the same FAD or LAD, then the zone-of-overlap algorithm isn’t an accurate depiction of the uncertainty in the ages; the age of the rocks containing the observed FADs and LADs may actually be outside of the reported range, unlike the maximum-range algorithm, which must contain the entire observed age range for a taxon as long as ourour interval dates for collections is accurate. Consider my graptolite zone example above; there is a possibility that the collection from the Katian is actually from an earlier graptolite zone than the D. anceps zone, maybe the Dicranograptus kirki zone (~453 Ma). Better stratigraphic resolution on a collection might very well provide such information in the future, but zone-of-overlap algorithm doesn’t give any weight to that possibility. On the other hand, John argues that a large number of taxa in the PBDB are listed from collections with very broadly defined intervals of time, and so some way is needed to downweight the effect of that lack of stratigraphic interval resolution. Given the nature of the data, the loss of uncertainty may be a necessary evil under the zone-of-overlap algorithm if you want to be as conservative as possible (say with providing molecular clock calibration dates for phylogenetic divergence dates for crown clades). I think the likely answer here is dependent strongly on the group we’re interested in and the reason we want range dates.

Ranges in R package paleobioDB

The paleobioDB package also has a function for obtaining and plotting two-date age ranges, the function pbdb_temp_range. Here’s a species-level example using the Dicellograptus data above, with the plotting functionality turned off.

pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

So what are these ranges? The manual file for pbdb_temp_range just refers to them as temporal ranges or time spans of taxa.

Well, we can go to the Github repo for paleobioDB and read the code for pbdb_temp_range directly, here’s the relevant bit for species. It appears that the ranges reported by pbdb_temp_range are the maximum-ranges, the earliest and latest possible bounds on a taxon’s age range.

Wait, wait, hold on, why are there only four species listed…

So, tangentially, you’re probably wondering why so few species are reported, as we can see from the occurrences figure above that a much larger number appears to be contained within the downloaded occurrences table. Uh… I’ll address that in my next blog post.

PS: I just discovered how paleobioDB’s API allows you to call PhyloPic images for taxa, woah! Here, check out Nemagraptus: