Wednesday, February 11, 2015

What is the right way to identify unique species in Paleobiology Database downloads?

(Continuing with the RMarkdown-to-Blogger experiment! I apologize for any mysterious indentations that I can’t seem to make go away! Sorry!)

So, let’s recap:

Last post, we had some fun pulling data out of the Paleobiology Database using R package paleobioDB (Varela et al., in press). We even made a nice plot of age uncertainty in occurrences sorted by species in the delightful graptolite genus Dicellograptus. Let’s see that pretty plot again, although we’ll hide some of the R code this time!

library(paleobioDB)
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
dicelloData<-dicelloData[dicelloData$rnk==3,]   #keep only occurrences of taxa resolved to species level

plotOccPBDB(dicelloData,"Dicellograptus Species")

Now, we had a little mystery when we decided to look at species ranges from this same data using function pbdb_temp_range in package paleobioDB.

pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus alector     471.8 443.4
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

The occurrences I plotted above show something like more than twenty species. pbdb_temp_range shows only four species. What is going on?

Well, for the occurrence plot above, I used the species_name column (the column $ids, see here for its description in the paleobioDB API documentation) to distinguish different species, which we can see contains more than 20 unique identifiers:

as.character(unique(dicelloData$ids))
##  [1] "anceps"       "alector"      "gurleyi"      "mensurans"   
##  [5] "sextans"      "complanatus"  "vagus"        "intortus"    
##  [9] "divaricatus"  "caduceus"     "forchammeri"  "moffatensis" 
## [13] "intermedius"  "russonioides" "johnstrupi"   "morrisi"     
## [17] "elegans"      "smithi"       "angulatus"    "ornatus"     
## [21] "flexuosus"    "minor"        "tumidus"      "turgidus"    
## [25] "mirabilis"

However, while pbdb_temp_range uses the various ‘name’ entries for identifying independent taxa at other taxonomic levels, pbdb_temp_range uses taxon_no (column $tid) to identify unique species-level taxa. We can see this in the code in paleobioDB’s GitHub repo.

This is suggestive that taxon_no is the Paleobiology Database’s way of handling taxonomic synonymy among species. A quick survey of a few friends who actually use PBDB data on a regular basis (unlike me, I’m a PBDB-newb) reveals some… uh, lack of clarity on this point. Well, a lack of clarity often suggests lack of clear documentation. Indeed, the paleobioDB API documentation describes taxon_no simply as the “unique identifier of the identified taxonomic name” and is remarkably silent on suggesting whether it might indicate unique species-level identities. Looking through the rest of the API documentation, its never said explicitly, but one might infer that this is true, as long as taxonomic rank (column $rnk) has been been limited to species-level taxa only. I.e., all species-level taxa with unique taxon-IDs are valid species-level taxa.

I wonder though: aren’t there cases of synonymized supraspecific taxa? These must be missed if uniqueness at those levels in paleobioDB package is determined based on the various ‘names’ qualifiers. This issue seems like quite the tough nut. It seems properly sorting data out of an occurrances download from the PBDB may require directly referencing the taxonomic database of the PBDB as well.

Anyway, we can see that $tid matches the species returned by pbdb_temp_range, which finds its names for the unique tid values using match(), which just means the species name reported by pbdb_temp_range is whatever species name is listed first with that tid value in a given data table.

unique(dicelloData$tid)
## [1] 306364  33650 306226 306367
as.character(sapply(unique(dicelloData$tid),function(x) dicelloData$ids[match(x,dicelloData$tid)]))
## [1] "anceps"      "alector"     "complanatus" "ornatus"

Well, that solves that mystery, but is tid actually the right way to identify unique valid species in a dataset? I wonder, as this isn’t written down anywhere…

I decided to pursue this further and went to Fossilworks, a mirror database of the Paleobiology Database which I mentioned in the previous post. I used the download form at Fossilworks to pull the species ranges for Dicellograptus. It returns ranges for 27 separate species; here’s a summary showing just the first four columns and small selection of ranges:

genus species base of range (Ma) top of range (Ma)
Dicellograptus alector 456.1 443.7
Dicellograptus anceps 449.5 445.6
Dicellograptus angulatus 456.1 449.5
Dicellograptus caduceus 460.9 456.1
Dicellograptus complanatus 449.5 445.6
Dicellograptus divaricatus 468.1 456.1
Dicellograptus vagus 460.9 456.1

I also get 27 species with the download form using PBDB classic at paleobioDB (also discussed in the last post). So, these applications are clearly using species_name and not taxon_no to distinguish what species are in the occurrences data.

Okay, I thought, this is, uh… inconsistent.

Recently, the Paleobiology Database revealed a new interface, called PBDB Navigator, which uses a map and various chronological and taxonomic filters to sort through the PBDB collection data. How many species does Navigator report for Dicellograptus?

Taxon name (number of occurrences in the database)
    Dicellograptus complanatus  (1) 
    Dicellograptus ornatus  (1) 
    Dicellograptus anceps  (1) 
    Dicellograptus gravis  (1) 

Huh. Okay, so Navigator uses taxon_no to identify individual unique species, unlike the Classic PBDB which is on the same website but uses species_name.

This is making my head hurt. What’s the answer? Maybe someone could pipe up in the comments.

Anyway, until next time!