Monday, February 23, 2015

Identifying Unique Species in Paleobiology Database Occurrence Downloads

Introduction

Hello all!

I’ve had an interesting week or so having conversations with several individuals about best practices for identifying ‘species’ level taxa in fossil occurrence data from the Paleobiology Database following my post two weeks ago where I found that several utilities associated with the PBDB identified ‘species’ in different ways. Let’s try to hit this issue really hard today.

Paleobiology Database Vocabulary

First, in my last post, I was a little confusing since I referred to the same variable by its full name and a shortening that seemed to be used only by the paleobioDB package (Varela et al., in press); it turns out this is simply the difference between variable headings returned by the Paleobiology Database’s API depending on special parameter vocab, which can either be pbdb (full variable names) or com (for the ‘compact’, shortened terminology). You can easily change this in paleobioDB data-pulling functions using argument vocab.

library(paleobioDB)
## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="com"))
##      oid typ cid                tna rnk   tid            mna mra   mid
## 1:1 3393 occ 328 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:2 4589 occ 378 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:3 4634 occ 379 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:4 4688 occ 380 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:5 8957 occ 403 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:6 9127 occ 424 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
##                   oei   eag   lag rid            idt ids  oli  rst  rss
## 1:1       Black River 460.9 457.5  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:2             Harju 452.0 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:3             Harju 452.0 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:4           Caradoc 460.9 449.5  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:5           Ashgill 449.5 443.7  13 Dicellograptus sp. <NA> <NA> <NA>
## 1:6 Middle Ordovician 470.0 458.4  13 Dicellograptus sp. <NA> <NA> <NA>
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb"))
##     occurrence_no record_type collection_no         taxon_name taxon_rank
## 1:1          3393  occurrence           328 Dicellograptus sp.      genus
## 1:2          4589  occurrence           378 Dicellograptus sp.      genus
## 1:3          4634  occurrence           379 Dicellograptus sp.      genus
## 1:4          4688  occurrence           380 Dicellograptus sp.      genus
## 1:5          8957  occurrence           403 Dicellograptus sp.      genus
## 1:6          9127  occurrence           424 Dicellograptus sp.      genus
##     taxon_no   matched_name matched_rank matched_no    early_interval
## 1:1    33650 Dicellograptus        genus      33650       Black River
## 1:2    33650 Dicellograptus        genus      33650             Harju
## 1:3    33650 Dicellograptus        genus      33650             Harju
## 1:4    33650 Dicellograptus        genus      33650           Caradoc
## 1:5    33650 Dicellograptus        genus      33650           Ashgill
## 1:6    33650 Dicellograptus        genus      33650 Middle Ordovician
##     early_age late_age reference_no     genus_name species_name
## 1:1     460.9    457.5           13 Dicellograptus          sp.
## 1:2     452.0    443.7           13 Dicellograptus          sp.
## 1:3     452.0    443.7           13 Dicellograptus          sp.
## 1:4     460.9    449.5           13 Dicellograptus          sp.
## 1:5     449.5    443.7           13 Dicellograptus          sp.
## 1:6     470.0    458.4           13 Dicellograptus          sp.
##     late_interval genus_reso species_reso
## 1:1          <NA>       <NA>         <NA>
## 1:2          <NA>       <NA>         <NA>
## 1:3          <NA>       <NA>         <NA>
## 1:4          <NA>       <NA>         <NA>
## 1:5          <NA>       <NA>         <NA>
## 1:6          <NA>       <NA>         <NA>

Note that some variables change their formatting depending on whether pbdb or com vocabulary are used; for example, taxon_rank is full nouns (‘genus’, ‘species’, etc) with pbdb, but is a numerical coding with com.

For this post (and future posts), I’ll try to use the full pbdb terminology; you can always check the API page for PBDB occurrence data if you want the corresponding com label though.


A brief aside… paleobioDB package versus the raw API in R

Shanan Peters has usefully pointed out that we shouldn’t mistake paleobioDB as the only way to access the PBDB API. After all, the API will presumably continue to evolve and thus the paleobioDB package may not be up-to-date with the API’s various arguments. Instead of using paleobioDB we could instead access the API directly, for example:

rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all")
head(rawAPI)
##   occurrence_no record_type reid_no superceded collection_no
## 1          3393  occurrence      NA         NA           328
## 2          4589  occurrence      NA         NA           378
## 3          4634  occurrence      NA         NA           379
## 4          4688  occurrence      NA         NA           380
## 5          8957  occurrence      NA         NA           403
## 6          9127  occurrence      NA         NA           424
##           taxon_name taxon_rank taxon_no   matched_name matched_rank
## 1 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 2 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 3 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 4 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 5 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 6 Dicellograptus sp.      genus    33650 Dicellograptus        genus
##   matched_no    early_interval     late_interval early_age late_age
## 1      33650       Black River       Black River     460.9    457.5
## 2      33650             Harju             Harju     452.0    443.7
## 3      33650             Harju             Harju     452.0    443.7
## 4      33650           Caradoc           Caradoc     460.9    449.5
## 5      33650           Ashgill           Ashgill     449.5    443.7
## 6      33650 Middle Ordovician Middle Ordovician     470.0    458.4
##   reference_no     genus_name genus_reso subgenus_name subgenus_reso
## 1           13 Dicellograptus                       NA            NA
## 2           13 Dicellograptus                       NA            NA
## 3           13 Dicellograptus                       NA            NA
## 4           13 Dicellograptus                       NA            NA
## 5           13 Dicellograptus                       NA            NA
## 6           13 Dicellograptus                       NA            NA
##   species_name species_reso
## 1          sp.             
## 2          sp.             
## 3          sp.             
## 4          sp.             
## 5          sp.             
## 6          sp.

Note, however, some small differences between these otherwise very similar requests. In particular, paleobioDB seems to have removed several variables related to data cleaning, like whether an occurrence has been superceded by another occurrence or new taxonomic assignment. I’ve traced the code used in the function through paleobioDB’s github repo and I’m not sure why these don’t show up.

PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")

# same number of columns in both?
ncol(rawAPI)==ncol(PBDBpack)
## [1] FALSE
# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"       "superceded"    "subgenus_name" "subgenus_reso"

Perhaps there is a cleaning step where paleobioDB (or Paleobiology Database, when it responds to JSON requests) is removing empty columns (i.e. contains only NAs).

# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all(is.na(rawAPI[,x])))
##       reid_no    superceded subgenus_name subgenus_reso 
##          TRUE          TRUE          TRUE          TRUE

There’s no mention to this phenomenon that I can find in paleobioDB’s manual (or any evidence of column dropping in the code of the package’s repo) or the documentation for the Paleobiology Database’s API, however. It’s a little odd because it means one could query, for example, for abundance data (show=c("abund")) and not get back the columns relating to abundance because they are empty. This could create issues for code that expects variables to always be present in a given data download.

For example, here is an extremely contrived example for the bizarre reticulate graptoloid genus Reteograptus (see Finney, 1980, for some morphological details), which is probably a lazarus taxon of the abrograptids (according to Finney).

rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Reteograptus&show=abund&limit=all")
PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")

# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"     "superceded"  "abund_value" "abund_unit"
#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(is.na(x)))
## abund_value  abund_unit 
##        TRUE        TRUE

There does not appear to be an argument to alter this variable-dropping behavior. I’ve opened an issue ticket here at the paleobioDB repo.

The ultimate result of the is that we can’t always depend on the same variables to (a) be present, (b) have the same column headings or (c) have the same variable codings for tables of PBDB occurrence data, because it depends greatly on how the data was downloaded and which options were used, going beyond the choices involving show =. The authors of paleobioDB package seem to have dealt with this by writing two separate versions of functions for analyzing occurrence data within each function (example), which is inefficient but perhaps necessary.


Distinguishing Species in Paleobiology Database Occurrence Data

It turns out there is no single ‘right’ answer to how to identify unique species in a set of PBDB occurrence data and assign occurrences to these distinct species. To understand the different ways we might perform this task, we should cover the relevant variables and what the information contained within is. The following variables can be obtained by setting the show to both ident and phylo, whether within the API or paleobioDB.

Some of the following information comes from a very informative discussion I have had with Matthew Clapham. However, personallylly, I’m still working through understanding all these variables, and intentionally trying to do this with as little fact-checking by email as possible. This is partly to demonstrate the limitations of the current documentation and also to avoid drowning certain individuals in emails. Better I save up all my questions for the end, right? I will be honest and say I still don’t entirely understand reid_no and superceded. So don’t blame my sources, blame me for when I get things wrong, because that’s part of this little mental experiment.

An important distinction is that one set of variables refers to the taxonomic identification for an occurrence reported within the original reference:

  • taxon_name described as “The taxonomic name by which this occurrence is identified.”

  • taxon_rank described as “The taxonomic rank of the name, if this can be determined.”

  • genus_name described as “The taxonomic name (less species) by which this occurrence was identified. This is often a genus, but may be a higher taxon.”

  • genus_reso described as “The resolution of this taxonomic name, i.e. sensu lato or aff.”

  • species_name described as “The species name (if any) by which this occurrence was identified”

  • species_reso described as “The resolution of the species name, i.e. sensu lato or n. sp”

Related to this group, we have the variable taxon_no:

  • taxon_no described as “The unique identifier of the identified taxonomic name. If this is empty, then the name was never entered into the taxonomic hierarchy stored in this database and we have no further information about the classification of this occurrence.”

…which identifies which taxon ID is assigned to the original taxon identification. However, many taxa apparently get entered without being given unique taxon IDs, and are instead assigned tentatively to a higher taxon. For example, there are many occurrences listed with distinct species_names that are listed under a taxon_no for a genus-level taxon.

A different set of variables refers to the formal taxon assignment currently given within the PBDB’s present state of taxonomic standardization, which attempts to control for synonymies:

  • matched_name described as “The senior synonym and/or currently accepted spelling of the closest matching name in the database to the identified taxonomic name, if any is known, and if this name is different from the value of taxon_name.”

  • matched_rank described as “The taxonomic rank of the matched name, if different from the value of taxon_rank”

  • matched_no described as “The unique identifier of the closest matching name in the database to the identified taxonomic name, if any is known.”

  • genus described as “The name of the genus in which this occurrence is classified”

  • genus_no described as “The identifier of the genus in which this occurrence is classified”

Again, because many lower-taxa get informally assigned to the same higher-taxon ID on a tentative tentatively, taxon_no (and thus matched_no and matched_name in addition) may be the same for occurences listed as multiple distinct species going by taxon_rank and species_name.

Finally, the occurrence itself might be reidentified or superceded with a new taxonomic assignment, which can be found in:

  • reid_no described as “If this occurrence was reidentified, a positive integer that uniquely identifies the reidentification”

  • superceded described as “The value of this field will be true if this occurrence was later identified under a different taxon”

These aren’t exhaustive lists, and if we wanted to discuss pulling out unique supraspecific taxa, we would need to consider additional variables such as family_name reported by the show=phylo parameter.

It turns out the Dicellograptus example I keep using provides an example of one type of disagreement we might see in groups where there has been little standardization and where many species have been entered ‘informally’ as tentative assignments to a higher taxon. So let’s focus on the above variables in that dataset:

#taxonomic variables of interest
taxonVar<-c("reid_no","superceded",      #PBDB occurrence reidentification info
            #original taxon info for occurrence
            "taxon_name","taxon_rank","genus_name","genus_reso","species_name","species_reso", 
            #standardized formal taxon info
            "taxon_no","matched_name","matched_rank","matched_no","genus","genus_no") 

# due to the column dropping issue, we can't use paleobioDB package:
#dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident","phylo"), vocab="pbdb")

# so we'll use the API instead
dicelloData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")
head(dicelloData[,taxonVar])
##   reid_no superceded         taxon_name taxon_rank     genus_name
## 1      NA         NA Dicellograptus sp.      genus Dicellograptus
## 2      NA         NA Dicellograptus sp.      genus Dicellograptus
## 3      NA         NA Dicellograptus sp.      genus Dicellograptus
## 4      NA         NA Dicellograptus sp.      genus Dicellograptus
## 5      NA         NA Dicellograptus sp.      genus Dicellograptus
## 6      NA         NA Dicellograptus sp.      genus Dicellograptus
##   genus_reso species_name species_reso taxon_no   matched_name
## 1                     sp.                 33650 Dicellograptus
## 2                     sp.                 33650 Dicellograptus
## 3                     sp.                 33650 Dicellograptus
## 4                     sp.                 33650 Dicellograptus
## 5                     sp.                 33650 Dicellograptus
## 6                     sp.                 33650 Dicellograptus
##   matched_rank matched_no          genus genus_no
## 1        genus      33650 Dicellograptus    33650
## 2        genus      33650 Dicellograptus    33650
## 3        genus      33650 Dicellograptus    33650
## 4        genus      33650 Dicellograptus    33650
## 5        genus      33650 Dicellograptus    33650
## 6        genus      33650 Dicellograptus    33650

Just from the first few lines, we can also see a number of Dicellograptus occurrences that were never resolved beyond the genus level and thus are just assigned to the Dicellograptus genus taxon ID (taxon_no = 33650), and the matched_name and other ‘matched’ information is the same, because Dicellograptus has never been synonymized (which is good, because its a valid genus, so it shouldn’t be synonymized).

We can see these listed as “Species lacking formal opinion data” here.

But how many unique taxon_name and species_name values are assigned to taxon_no = 33650?

unique(dicelloData[dicelloData$taxon_no==33650,"taxon_name"])
##  [1] Dicellograptus sp.                 Dicellograptus ? sp.              
##  [3] Dicellograptus alector n. sp.      Dicellograptus gurleyi            
##  [5] Dicellograptus mensurans           Dicellograptus sextans cf.        
##  [7] Dicellograptus spp.                Dicellograptus sextans            
##  [9] Dicellograptus vagus cf.           Dicellograptus intortus           
## [11] Dicellograptus divaricatus cf.     Dicellograptus caduceus cf.       
## [13] Dicellograptus forchammeri cf.     Dicellograptus intortus cf.       
## [15] Dicellograptus moffatensis cf.     Dicellograptus intermedius n. sp. 
## [17] Dicellograptus divaricatus         Dicellograptus russonioides n. sp.
## [19] Dicellograptus johnstrupi          Dicellograptus morrisi            
## [21] Dicellograptus gurleyi cf.         Dicellograptus elegans cf.        
## [23] Dicellograptus morrisi cf.         Dicellograptus smithi cf.         
## [25] Dicellograptus sp. ?               Dicellograptus alector            
## [27] Dicellograptus alector cf.         Dicellograptus ? angulatus cf.    
## [29] Dicellograptus angulatus cf.       Dicellograptus flexuosus          
## [31] Dicellograptus minor               Dicellograptus tumidus            
## [33] Dicellograptus turgidus            Dicellograptus mirabilis          
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.
unique(dicelloData[dicelloData$taxon_no==33650,"species_name"])
##  [1] sp.          alector      gurleyi      mensurans    sextans     
##  [6] spp.         vagus        intortus     divaricatus  caduceus    
## [11] forchammeri  moffatensis  intermedius  russonioides johnstrupi  
## [16] morrisi      elegans      smithi       angulatus    flexuosus   
## [21] minor        tumidus      turgidus     mirabilis   
## 27 Levels: alector anceps angulatus caduceus complanatus ... vagus

Thankfully, all of these are given a taxon_rank of species… right? We can check this by removing any occurrence not given a taxon_rank of species and looking at the unique taxon_name values.

unique(dicelloData[dicelloData$taxon_no==33650 & dicelloData$taxon_rank!="species","taxon_name"])
## [1] Dicellograptus sp.   Dicellograptus ? sp. Dicellograptus spp. 
## [4] Dicellograptus sp. ?
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.

Okay, in Dicellograptus, at least, all the informal species assigned formally to a genus-level taxon are listed correctly as species.

We can examine another sort of the disagreement among these fields by looking at datasets where there has been a lot of taxonomic standarization in the PBDB (i.e. not graptolites). Matt Clapham showed me a particularly interesting read-out for a query involving Acosarina minuta, a Permian brachiopod (…perhaps a minute brachiopod?), filtering for the same variables as above:

# so we'll use the API, like above...

# Note that the following won't work except from web browser address bar or other utility 
  # which automatically (and silently) replaces " " with "%20"
acoData1<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Acosarina minuta&show=ident,phylo")
colnames(acoData1)
##  [1] "occurrence_no"  "record_type"    "reid_no"        "superceded"    
##  [5] "collection_no"  "taxon_name"     "taxon_rank"     "taxon_no"      
##  [9] "matched_name"   "matched_rank"   "matched_no"     "early_interval"
## [13] "late_interval"  "early_age"      "late_age"       "reference_no"
# the query terminated at the " " in the taxon name and never got to other parameters, like 'show'
# (This tripped me up for an hour. Don't judge me!)

# Okay, now the right way
acoData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Acosarina%20minuta&show=ident,phylo")
head(acoData[,match(taxonVar,colnames(acoData))])
##   reid_no superceded             taxon_name taxon_rank  genus_name
## 1      NA         NA     Orthotichia indica    species Orthotichia
## 2   20839         NA       Acosarina minuta    species   Acosarina
## 3      NA         NA       Acosarina minuta    species   Acosarina
## 4      NA         NA Acosarina dorsisulcata    species   Acosarina
## 5      NA         NA Acosarina dorsisulcata    species   Acosarina
## 6      NA         NA Acosarina dorsisulcata    species   Acosarina
##   genus_reso species_name species_reso taxon_no     matched_name
## 1                  indica                177488 Acosarina minuta
## 2                  minuta                125301 Acosarina minuta
## 3                  minuta                125301 Acosarina minuta
## 4            dorsisulcata                126711 Acosarina minuta
## 5            dorsisulcata                126711 Acosarina minuta
## 6            dorsisulcata                126711 Acosarina minuta
##   matched_rank matched_no     genus genus_no
## 1      species     125301 Acosarina    26652
## 2      species     125301 Acosarina    26652
## 3      species     125301 Acosarina    26652
## 4      species     125301 Acosarina    26652
## 5      species     125301 Acosarina    26652
## 6      species     125301 Acosarina    26652

All of the occurrences listed here are formally identified in PBDB’s standardized taxonomy as Acosarina minuta. That’s pretty easy to verify by looking at unique values of matched_name and matched_no:

unique(acoData$matched_name)
## [1] Acosarina minuta
## Levels: Acosarina minuta
unique(acoData$matched_no)
## [1] 125301

However, we can see even just from the head() that there are multiple values of taxon_name, taxon_no, genus_name and species_name formally reassigned to the ‘matched’ taxon Acosarina minuta. Here’s a full list:

unique(acoData[,c("taxon_name","taxon_no","genus_name","species_name")])
##                           taxon_name taxon_no   genus_name species_name
## 1                 Orthotichia indica   177488  Orthotichia       indica
## 2                   Acosarina minuta   125301    Acosarina       minuta
## 4             Acosarina dorsisulcata   126711    Acosarina dorsisulcata
## 15     Acosarina dorsisulcata n. sp.   126711    Acosarina dorsisulcata
## 49                     Orthis indica   125296       Orthis       indica
## 58                  Acosarina indica   175512    Acosarina       indica
## 99                 Kotlaia capillosa   123689      Kotlaia    capillosa
## 100 Kotlaia n. gen. capillosa n. sp.   123689      Kotlaia    capillosa
## 103               Orthotichia minuta   199434  Orthotichia       minuta
## 110             Acosarina minuta cf.   125301    Acosarina       minuta
## 111             Sunacosarina campana   190895 Sunacosarina      campana
## 134        Acosarina kanmerai n. sp.   206333    Acosarina     kanmerai
## 142   Orthis (Schizophoria) indica ?   125296       Orthis       indica
## 143     Orthis (Schizophoria) indica   125296       Orthis       indica
## 217                Dalmanella indica   255744   Dalmanella       indica
## 234               Acosarina ? minuta   125301    Acosarina       minuta

So, how do we navigate these variables to identify unique species in our datasets?

Potentially Misleading Ways To Identify Species in PBDB Occurrence Data

In the last post, I mentioned that a number of my friends, as it turned out, used very different criteria for distinguishing taxa from PBDB occurrence data, criteria which also varied among the formal ‘apps’ of the Paleobiology Database.

Personally, I’d intuitively chosen (two blog posts ago) to use species_name alone, filtered on taxon_rank = "species". It turns out several of my friends also use species_name. Another friend uses a combination of taxon_name and taxon_no. Strategies based on taxon_name or species_name are (at best) not making use of the potential for information of taxonomic synonymies in the PBDB and (at worst) in groups that have undergone a considerable amount of qualified taxonomic standardization, using these variables may be very misleading relative to the current state of knowledge.

It should also be clear that taxon_no alone could be a misleading way to distinguish unique species in occurrence data, even if filtered to species-ranked occurrences. Referring to the last post, I showed that paleobioDB package uses taxon_no to identify unique species in occurrences, filtered to taxon_rank of species. The function match() is then used to find genus_name and species_name which are then combined into a single species label. We can see the code here. As we can see above, even filtering taxon_no to species-level occurrences via taxon_rank would result in mistakingly grouping a number of informally entered species that are formally assigned to their genus as a single species-level taxon with an exaggerated number of occurrences assigned to it. Furthermore, the use of match() returns the first and only the first match in a list, so the assigned species-label is an arbitrary function of whatever species comes first in a table of occurrences. This results in the mistaken reporting of Dicellograptus alector from Dicellograptus occurrence data below:

require(paleobioDB)
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

I’ve opened an issue ticket for this issue on the paleobioDB repo here.

It turns out this isn’t what PBDB Navigator does, as I had incorrectly guessed last post. PBDB Navigator is doing something much more complex, counting formal taxa that are listed as daughter taxa. The reason why Dicellograptus gravis appears despite having no occurrences in the PBDB is because that taxon was formally entered without any occurrences. This is discussed more here, where I had thought it was a bug and reported it as such, but it was more of a… misinterpretation. Similarly, PBDB Classic / Fossilworks appears to use the taxonomic standarization in the database to return ranges for all formally-defined species appropriately (under the Download function), but returns any informally defined species in addition.

A Framework for Identifying Unique Species

So, how should we go about pulling species-level taxa from our occurrence data? If we want all ‘species’ information possible, then we must do two seperate processes:

  1. Pull out all formally-defined species, with their matched_name species name. This should be as simple as using matched_rank to filter on species-level formal taxa and pulling out the unique matched_no values.

  2. Pull out all informal species that are assigned formally to supraspecific taxa. This should be possible on filtering only for taxa that do not have a matched_rank = species assignment but have a taxon_rank = species value, and then pulling on unique values of species_name.

Obviously, the second is optional if we want to only keep formal species; depends on our data.

There are some complexities to the above for actually generating the correct labels for these species-level taxa. First, there is no corresponding species_name-like variable for matched_taxon as there is with genus_name and genus. The most reasonable method seems to be splitting the species name from the genus name in matched_name and combining it with genus, but presumably this should have the same effect as just using matched_name whole.

Let’s look at a function that tries to follow this ‘informal versus formal’ dichotomy. Here’s taxonClean from Matthew Clapham’s github repo PBDB-R-scripts, function script located here:

taxonClean <- function(occurrences, tax_level="genus", formal_id="no") {
  
  #FILE PREPARATION AND DATA CLEANING
  #file preparation and data cleaning for species level analysis
  if(tax_level=="species"){
    #removes rows where species qualified by cf. or aff., question mark, ex gr. or quotation mark
    resolved_occs <- subset(occurrences, occurrences$species_reso=="" | occurrences$species_reso=="n. sp.")
    
    if (formal_id=="yes") {
      #deletes occurrences not resolved to species level using formally-classified species
      cleaned_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
      cleaned_occs$final_taxon <- cleaned_occs$matched_name
      
    } else {
      #deletes occurrences not resolved to species level using classified and unclassified species
      resolved_occs <- subset(resolved_occs, resolved_occs$taxon_rank=="species")
      
      #strips all extraneous qualifiers from taxon name
      resolved_occs$taxon_name <- gsub("\" ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("cf. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("aff. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("\\? ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub("n. gen. ", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" n. sp.", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" sensu lato", "", resolved_occs$taxon_name)
      resolved_occs$taxon_name <- gsub(" informal", "", resolved_occs$taxon_name)
      
      #separates classified and unclassified occurrences
      classified_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
      unclassified_occs <- subset(resolved_occs, resolved_occs$matched_rank!="species")
      
      classified_occs$final_taxon <- classified_occs$matched_name
      unclassified_occs$final_taxon <- unclassified_occs$taxon_name
      
      #combines classified and unclassified occurrences
      cleaned_occs <- rbind(classified_occs, unclassified_occs)
    }
    
    #or file preparation and data cleaning for family level analysis
  } else if (tax_level=="family") {
    
      #deletes occurrences not resolved to at least family level
      resolved_occs <- subset(occurrences, occurrences$matched_rank <= 9)    
      
      #deletes occurrences where genus or family are qualified with question mark, quotations, cf. or aff.
      if(length(grep("\\? ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("\\? ", resolved_occs$taxon_name),]
        }
      
      if(length(grep("\" ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("\" ", resolved_occs$taxon_name),]
        }
      
      if(length(grep("cf. ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("cf. ", resolved_occs$taxon_name),]
        }
      
      if(length(grep("aff. ", resolved_occs$taxon_name)) > 0){
        resolved_occs <- resolved_occs[-grep("aff. ", resolved_occs$taxon_name),]
        }
      
      cleaned_occs <- resolved_occs
      
      cleaned_occs$final_taxon <- cleaned_occs$family
      
    #or file preparation and data cleaning for genus level analysis
  } else {
    
      #deletes occurrences not resolved to at least genus level using classified and unclassified species
      resolved_occs <- subset(occurrences, occurrences$matched_rank <= 5)
      
      #deletes occurrences where genus is qualified with question mark, quotations, cf. or aff.
      cleaned_occs <- subset(resolved_occs, resolved_occs$genus_reso=="" | resolved_occs$genus_reso=="n. gen.")
      
      #extracts genus name from matched_name string
      cleaned_occs$final_taxon <- gsub(" .*", "", cleaned_occs$matched_name)
    
  }
  
  cleaned_occs

}

(I made some small code alterations because I am just personally terrified by the idea of not embracing the lines of control structure (i.e. if, for, etc.) in curly braces! There was several actual errors: an erroneous resolved_occs before resolved_occs was called, also an attempt to call matched_name = 3 rather than matched_name = "species"…)

Note that taxonClean depends on vocab = "pbdb", which differs from the default for data functions in package paleobioDB. This function will return a table of occurrences, pared down only to the occurrences of taxa at the appropriate level and with their valid taxon labels in a new variable labeled final_taxon.

Let’s see what it does with our Dicellograptus data, using the API.

dicelloData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")

# with informal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="no")
unique(cleanOccs$final_taxon)
##  [1] Dicellograptus anceps       Dicellograptus complanatus 
##  [3] Dicellograptus ornatus      Dicellograptus alector     
##  [5] Dicellograptus gurleyi      Dicellograptus mensurans   
##  [7] Dicellograptus sextans      Dicellograptus intortus    
##  [9] Dicellograptus intermedius  Dicellograptus divaricatus 
## [11] Dicellograptus russonioides Dicellograptus johnstrupi  
## [13] Dicellograptus morrisi      Dicellograptus flexuosus   
## [15] Dicellograptus minor        Dicellograptus tumidus     
## [17] Dicellograptus turgidus     Dicellograptus mirabilis   
## 19 Levels: Dicellograptus ... Dicellograptus mirabilis
# only formal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="yes")
unique(cleanOccs$final_taxon)
## [1] Dicellograptus anceps      Dicellograptus complanatus
## [3] Dicellograptus ornatus    
## 4 Levels: Dicellograptus ... Dicellograptus ornatus

For me though, I’d probably approach this issue in a slight different way than taxonClean… I’ll show what I cook up next time, when I (hopefully) finally show some stuff I did weeks ago integrating paleobioDB data downloads with paleotree functions.