Tuesday, March 17, 2015

Converting PBDB Occurence data to paleotree's timeList format

Hello all! Let’s finally get into what I wanted to talk about a month ago: converting Paleobiology Database (PBDB) data from paleobioDB (or the API) into a format for paleotree functions!

But first…

Some Recent Updates to R Package paleobioDB Since My Last Post

# install newest development version of paleobioDB
library(devtools)
install_github("ropensci/paleobioDB")

Hot on the heels of my last post and the related tickets I opened on the paleobioDB github repo, Sara Varela has swiftly updated pbdb_temp_range in paleobioDB so that only taxa formally recognized (and thus, synoymized) by the PBDB, as described here. This function now only identifies and return the taxa formally recognized by the PBDB.

library(paleobioDB)
## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
##                              max   min
## Dicellograptus anceps      455.8 443.7
## Dicellograptus ornatus     453.0 443.4
## Dicellograptus complanatus 453.0 443.7

This now returns the three Dicellograptus species ‘formally’ identified (i.e. given species-level taxon IDs and have occurrences listed for them) within the PBDB. This misses informal taxa (see last post), but that’s really the problem of anyone who works on a group with poorly-kept taxonomy in the PBDB.

(cough graptolites cough)

Let’s look at a non-graptolite example then.

acoData<-pbdb_occurrences(limit="all", base_name="Acosarina minuta", show=c("phylo","ident"), vocab="pbdb")
pbdb_temp_range(acoData, rank="species", do.plot=FALSE)
##                    max   min
## Acosarina minuta 303.7 251.3
unique(acoData$matched_no)
## [1] 125301

Only a single species, just as it should be because all of the listed occurrences are synonymized to a single species within the PBDB’s taxonomic database.

pbdb_temp_range(acoData, rank="genus", do.plot=FALSE)
##             max   min
## Acosarina 303.7 251.3

And only a single genus, so the results at different levels of the taxonomic hierarchy are consistent!

Getting PBDB Data into paleotree: Differences between API and paleobioDB downloads

Preferably, we’d like to implement utilities that work on both data collected using paleobioDB or data collected using the PBDB API. So, let’s also download some data using the API:

dicelloData2<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")

Unfortunately, here’s where we run into a snag. We can see these datasets differ in their column names and, in some cases, contents.

head(dicelloData)
##      oid typ cid                tna rnk   tid            mna mra   mid
## 1:1 3393 occ 328 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:2 4589 occ 378 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:3 4634 occ 379 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:4 4688 occ 380 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:5 8957 occ 403 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
## 1:6 9127 occ 424 Dicellograptus sp.   5 33650 Dicellograptus   5 33650
##                   oei   eag   lag rid            gnl   gnn          odl
## 1:1       Black River 460.9 457.5  13 Dicellograptus 33650 Graptoloidea
## 1:2             Harju 452.0 443.7  13 Dicellograptus 33650 Graptoloidea
## 1:3             Harju 452.0 443.7  13 Dicellograptus 33650 Graptoloidea
## 1:4           Caradoc 460.9 449.5  13 Dicellograptus 33650 Graptoloidea
## 1:5           Ashgill 449.5 443.7  13 Dicellograptus 33650 Graptoloidea
## 1:6 Middle Ordovician 470.0 458.4  13 Dicellograptus 33650 Graptoloidea
##       odn           cll   cln          phl   phn            idt ids  oli
## 1:1 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:2 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:3 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:4 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:5 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:6 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
##      rst  rss
## 1:1 <NA> <NA>
## 1:2 <NA> <NA>
## 1:3 <NA> <NA>
## 1:4 <NA> <NA>
## 1:5 <NA> <NA>
## 1:6 <NA> <NA>
head(dicelloData2)
##   occurrence_no record_type reid_no superceded collection_no
## 1          3393  occurrence      NA         NA           328
## 2          4589  occurrence      NA         NA           378
## 3          4634  occurrence      NA         NA           379
## 4          4688  occurrence      NA         NA           380
## 5          8957  occurrence      NA         NA           403
## 6          9127  occurrence      NA         NA           424
##           taxon_name taxon_rank taxon_no   matched_name matched_rank
## 1 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 2 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 3 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 4 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 5 Dicellograptus sp.      genus    33650 Dicellograptus        genus
## 6 Dicellograptus sp.      genus    33650 Dicellograptus        genus
##   matched_no    early_interval     late_interval early_age late_age
## 1      33650       Black River       Black River     460.9    457.5
## 2      33650             Harju             Harju     452.0    443.7
## 3      33650             Harju             Harju     452.0    443.7
## 4      33650           Caradoc           Caradoc     460.9    449.5
## 5      33650           Ashgill           Ashgill     449.5    443.7
## 6      33650 Middle Ordovician Middle Ordovician     470.0    458.4
##   reference_no     genus_name genus_reso subgenus_name subgenus_reso
## 1           13 Dicellograptus                       NA            NA
## 2           13 Dicellograptus                       NA            NA
## 3           13 Dicellograptus                       NA            NA
## 4           13 Dicellograptus                       NA            NA
## 5           13 Dicellograptus                       NA            NA
## 6           13 Dicellograptus                       NA            NA
##   species_name species_reso          genus genus_no family family_no
## 1          sp.              Dicellograptus    33650     NA        NA
## 2          sp.              Dicellograptus    33650     NA        NA
## 3          sp.              Dicellograptus    33650     NA        NA
## 4          sp.              Dicellograptus    33650     NA        NA
## 5          sp.              Dicellograptus    33650     NA        NA
## 6          sp.              Dicellograptus    33650     NA        NA
##          order order_no         class class_no       phylum phylum_no
## 1 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518
## 2 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518
## 3 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518
## 4 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518
## 5 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518
## 6 Graptoloidea    33606 Graptolithina    33534 Hemichordata     33518

Why is this? Because of the two different ‘vocabularies’, explained in the last blog post, we can end up with PBDB data downloads with different column heads and contents. By default, paleobioDB returns data with the ‘com’ (compact) vocab, while PBDB API by default uses the more extended ‘pbdb’ vocab. While ‘com’ saves on data transfer for large tables, and the API documentation is clear about the changes in variable names, the changes in contents is less clear. For example, taxon_rank numbers the ranks in the ‘com’ vocab, but there doesn’t appear to be any documentation on what these numbers mean. All one can do is download the same data under both vocabs to figure it out; e.g. a rank of ‘3’ appears to be ‘species’.

As I noted before, this makes a mess of things because it means those of us writing code to manipulate PBDB data either (a) write two seperate versions of their code to anticipate either ‘pbdb’ or ‘com’ datasets, which is what the authors of paleobioDB have done or (b) pick one and require users to only use one vocab. Personally, I find the first option to be dangerous from a programming view, because writing the same code twice can lead to toxic inconsistencies, such as when you have to go back to fix bugs at a later date and forget to fix them in the second version of the code. The second option is the easiest from a programming point of view, but likely to cause issues since the default for paleobioDB is ‘com’ vocab and the default for the PBDB API is ‘pbdb’ vocab, and any utilities written are likely to get use from both. Speaking from maintaining paleotree for the last three years, something like a warning statement about ‘you need to use the right vocab!!’ is exactly the sort of statement that just confuses users and fills your inbox. A third option might be to include lines that transform the needed variables of one vocab into the other, but without more documentation, that’s not entirely possible.

For the moment, I’m going to go with the second option and stick to ‘pbdb’ vocab since it is the clearest and easiest to read and manipulate. This means I’ll need to adjust the arguments for my paleobioDB Dicellograptus download to use ‘pbdb’ vocab.

dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"), vocab="pbdb")
head(dicelloData)
##     occurrence_no record_type collection_no         taxon_name taxon_rank
## 1:1          3393  occurrence           328 Dicellograptus sp.      genus
## 1:2          4589  occurrence           378 Dicellograptus sp.      genus
## 1:3          4634  occurrence           379 Dicellograptus sp.      genus
## 1:4          4688  occurrence           380 Dicellograptus sp.      genus
## 1:5          8957  occurrence           403 Dicellograptus sp.      genus
## 1:6          9127  occurrence           424 Dicellograptus sp.      genus
##     taxon_no   matched_name matched_rank matched_no    early_interval
## 1:1    33650 Dicellograptus        genus      33650       Black River
## 1:2    33650 Dicellograptus        genus      33650             Harju
## 1:3    33650 Dicellograptus        genus      33650             Harju
## 1:4    33650 Dicellograptus        genus      33650           Caradoc
## 1:5    33650 Dicellograptus        genus      33650           Ashgill
## 1:6    33650 Dicellograptus        genus      33650 Middle Ordovician
##     early_age late_age reference_no          genus genus_no        order
## 1:1     460.9    457.5           13 Dicellograptus    33650 Graptoloidea
## 1:2     452.0    443.7           13 Dicellograptus    33650 Graptoloidea
## 1:3     452.0    443.7           13 Dicellograptus    33650 Graptoloidea
## 1:4     460.9    449.5           13 Dicellograptus    33650 Graptoloidea
## 1:5     449.5    443.7           13 Dicellograptus    33650 Graptoloidea
## 1:6     470.0    458.4           13 Dicellograptus    33650 Graptoloidea
##     order_no         class class_no       phylum phylum_no     genus_name
## 1:1    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
## 1:2    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
## 1:3    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
## 1:4    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
## 1:5    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
## 1:6    33606 Graptolithina    33534 Hemichordata     33518 Dicellograptus
##     species_name late_interval genus_reso species_reso
## 1:1          sp.          <NA>       <NA>         <NA>
## 1:2          sp.          <NA>       <NA>         <NA>
## 1:3          sp.          <NA>       <NA>         <NA>
## 1:4          sp.          <NA>       <NA>         <NA>
## 1:5          sp.          <NA>       <NA>         <NA>
## 1:6          sp.          <NA>       <NA>         <NA>

There’s also another issue of trying to write functions useful for both PBDB API and paleobioDB downloads. In addtion to paleobioDB (dropping empty variables)[https://github.com/ropensci/paleobioDB/issues/18], empty variable entries are assigned NA while the read.csv procedure used above for downloading PBDB API data replaces empty variables with "" (i.e. an empty string). Empty variables, however, get assigned NA by read.csv, which is very inconsistent.

Here’s some examples from the API downloaded data:

#an empty variable
as.character(dicelloData2$subgenus_reso[1:20])
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#a variable with both empty and filled entries
as.character(dicelloData2$genus_reso[1:20])
##  [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "" 
## [18] ""  "?" ""

And, for comparison, the paleobioDB data (note the empty variable has been dropped):

#an empty variable
as.character(dicelloData$subgenus_reso[1:20])
## character(0)
#a variable with both empty and filled entries
as.character(dicelloData$genus_reso[1:20])
##  [1] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
## [18] NA  "?" NA

One way to fix this is via the argument na.strings in read.csv.

dicelloData3<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all",
  na.strings=c("NA",""))

#a variable with both empty and filled entries
as.character(dicelloData3$genus_reso[1:20])
##  [1] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
## [18] NA  "?" NA

A better solution, however, may be to simply put a check in any function cleaning the data to convert empty string values ("") to NAs, like so.

#making a copy of the original API download
dicelloData3<-dicelloData2     

#replace empty values with NA
dicelloData3[dicelloData3==""]<-NA

#a variable with both empty and filled entries
as.character(dicelloData3$genus_reso[1:20])
##  [1] NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 
## [18] NA  "?" NA

Looks good.

Sorting Out Unique Taxa

Alright, now that we picked a consistent vocabulary, we need to write a function that sorts occurrence data into the unique taxa of a user-selected taxonomic rank. I think the preferable way to do this is to make a list where each element is a ‘unique’ taxon and all the occurrences attributed to it are under that element as a table. Additionally, we’ll have to allow the function to pull either just the ‘formal’ identified taxa or ‘all’ the identified taxa. This is necessary if we want species that are listed under their genus and never got assigned a species-level taxonomic ID in the PBDB.

Its important we consider the order of data manipulations we need to perform in this process. Formal identified taxa need to be pulled first, and then informal taxa pulled from whatever occurrences are left over. Furthermore, its possible (but hopefully rare) that the same taxon might be listed both formally and informally in a set of PBDB occurrences, and it would be preferable that the ‘informal’ occurrences for a taxon be concatenated to the ‘formal’ occurrences for that same taxon, rather than be treated as separate taxon.

Now, the functions I used as references when writing this function were paleobioDB’s pbdb_temp_range function (code here), and Matthew Clapham’s taxonClean function (code here). I took a somewhat different approach though. Both of these write seperate code for the different taxonomic ranks, which is vaguely unsatisfying from a programming perspective, as it could lead to inconsistencies where you fail to update all parts. Thankfully, except for species, one can get the formal taxonomic name for a given taxonomic rank by looking for the variable with the same name as the taxonomic rank. Additionally, we can look for informal occurrences of a taxon by then checking for occurrences with taxon_rank identical to the required rank (although it seems this is rarely important).

And, as noted above, we need a check to convert empty "" values to NAs.

taxonSortPBDBocc<-function(data,rank, onlyFormal=TRUE, cleanUncertain=TRUE,
                           cleanResoValues=c(NA, '"', "", "n. sp.", "n. gen.")){
  #this function inspired by Matt Clapham's taxonClean and paleobioDB's pbdb_temp_range
    #onlyFormal=FALSE;rank="species"
    #onlyFormal=FALSE;rank="genus"
    #pull occurrences out of a data table and sort by unique taxa into a list structure
    #First, some warning checks to see if occurrences downloaded correctly: 
    # pbdb vocab and showing phylo and ident data
        #the following is taken with minor modification from code in paleobioDB package
  if (any("tna"==colnames(data))){
    stop("need to add 'vocab='pbdbd'' to pbdb_occurrences query\n  *or*\n 'vocab=pbdb' to your PBDB API query")
    }
    if (!any("genus_name"==colnames(data)) | !any("genus"==colnames(data))){    
        stop("need to add 'show=c('phylo', 'ident')' to pbdb_occurrences query\n *or*\n  'show=ident,phylo' to PBDB API query")
        }
  #additional checks for rank
  if(length(rank)!=1){stop("length of rank must be 1")}
  if(!any(sapply(c("species","genus","family","order","class","phylum"),function(x) x==rank))){
    stop("rank must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'")}
  #for inconsistencies between rank and onlyFormal - NOT TRUE, IGNORE THIS CHECK
  #if(!onlyFormal & (rank!="species" | rank!="genus")){
  #  stop("Informal taxon does not exist for above genus level, please use 'onlyFormal=TRUE'")}
  #need to replace any empty string values with NAs (due perhaps to use of read.csv with the API)
  data[data==""]<-NA
  #now,  pull taxa using the relevant variable from 'phylo' for formal ID
    #this matches what paleobioDB now does
  #sort occurrences by unique taxa in each level and then append valid names
  if(rank=="species"){
    if(cleanUncertain){ #remove uncertain species taxonomy
      data<-data[data[,"species_reso"] %in% cleanResoValues,,drop=FALSE]
      }
    #first formal taxa
    #get species names for taxa formally recognized as species level
    whichFormal<-(data[,"matched_rank"]==rank)
    taxonVar<-as.character(data[,"matched_name"])
    taxaNames<-as.character(unique(taxonVar[whichFormal]))  #get unique taxa
    #drop empty entries (occurrences listed at a higher taxonomic level, formally)
    taxaNames<-taxaNames[!is.na(taxaNames)]
    #take rows these taxa are in and drop them into elements of a list
        sortedOcc<-lapply(taxaNames,function(x) data[which(x==taxonVar),,drop=FALSE])
    names(sortedOcc)<-taxaNames
    if(!onlyFormal){
      #now use taxon_rank to identify useful informal occurrence
      taxonVar2<-data[,c("genus_name","species_name")]
      taxonVar2<-apply(taxonVar2,1,paste,collapse=" ")
      #which of these occs are useful as informal occs
      stillUseful<-which(!whichFormal & (data[,"taxon_rank"]==rank))
      taxaNames2<-as.character(unique(taxonVar2[stillUseful]))
        #drop any weird empties
      taxaNames2<-taxaNames2[taxaNames2!=" " & !is.na(taxaNames2)]  
      sortedOcc2<-lapply(taxaNames2,function(x) data[which(x==taxonVar2),,drop=FALSE])
      names(sortedOcc2)<-taxaNames2
      # merge sortedocc2 with sortedOcc
      share1<-sapply(taxaNames,function(x) any(x==taxaNames2))
      share2<-sapply(taxaNames2,function(x) any(x==taxaNames))
      if(sum(!share1)>0){sortedOccU<-sortedOcc[!share1]}else{sortedOccU<-list()}
      if(sum(!share2)>0){sortedOcc2U<-sortedOcc2[!share2]}else{sortedOcc2U<-list()}
      #and for shared names
      if(sum(share1)>0){
        shared<-lapply(taxaNames[share1],function(x) 
          cbind(sharedOcc[[taxaNames==x]],sharedOcc2[[taxaNames2==x]]))
        names(shared)<-names(taxaNames[share1])
      }else{shared<-list()}
      sortedOcc<-c(sortedOccU,sortedOcc2U,shared)
      }
  }else{   #if not at the species rank
    if(cleanUncertain & rank=="genus"){  #if rank=genus and removing uncertain taxonomy
      data<-data[data[,"genus_reso"] %in% cleanResoValues,,drop=FALSE]
      }
    taxonVar<-data[,rank] #then our taxonomic variable of interest is just so!
    taxaNames<-as.character(unique(taxonVar))  #get unique taxa
    taxaNames<-taxaNames[!is.na(taxaNames)]
    #take rows these taxa are in and drop them into elements of a list
      sortedOcc<-lapply(taxaNames,function(x) data[which(x==taxonVar),,drop=FALSE])
     names(sortedOcc)<-taxaNames
      if(!onlyFormal){
          #now use taxon_rank to identify useful informal occurrence
          taxonVar2<-data[,"taxon_name"]
          #which of these occs are useful as informal occs
      stillUseful<-which(is.na(taxonVar) & (data[,"taxon_rank"]==rank))
      taxaNames2<-as.character(unique(taxonVar2[stillUseful]))
        taxaNames2<-taxaNames2[!is.na(taxaNames2)]
        sortedOcc2<-lapply(taxaNames2,function(x) data[which(x==taxonVar2),,drop=FALSE])
        names(sortedOcc2)<-taxaNames2
      # merge sortedocc2 with sortedOcc
      share1<-sapply(taxaNames,function(x) any(x==taxaNames2))
      share2<-sapply(taxaNames2,function(x) any(x==taxaNames))
      if(sum(!share1)>0){sortedOccU<-sortedOcc[!share1]}else{sortedOccU<-list()}
      if(sum(!share2)>0){sortedOcc2U<-sortedOcc2[!share2]}else{sortedOcc2U<-list()}
      #and for shared names
      if(sum(share1)>0){
        shared<-lapply(taxaNames[share1],function(x)
          cbind(sharedOcc[[taxaNames==x]],sharedOcc2[[taxaNames2==x]]))
        names(shared)<-names(taxaNames[share1])
      }else{shared<-list()}
      sortedOcc<-c(sortedOccU,sortedOcc2U,shared)
      }
    }
  # sort occurrences by taxon name
  sortedOcc<-sortedOcc[order(names(sortedOcc))]
  return(sortedOcc)
  }

(I apologize for the terribly uneven indenting above. A strange copy/paste error involving tabs in Rstudio, Notepad++ and R GUI, and the more I tab, the worse it gets. I promise a clean-looking version in the paleotree github repo soon.)

Now, in the following, I have dropped or revised some of the checks from taxonClean and pbdb_temp_ranges.

  • pbdb_temp_ranges originally (before my issue tickets) checked to see if the same taxon was listed under multiple taxon_no ID numbers, and returned an error if this was true. This resulted in some interesting cases where the error arose due to subtle differences in taxonomic name, like Pseudoclimacograptus (Metaclimacograptus) hughesi and Pseudoclimacograptus hughesi being listed (for whatever reason) under different taxon_no ID numbers. I’m blanket presuming that if a taxon’s listed taxon name (e.g. formal names matched_name, genus, etc. as well as informal like genus_name + species_name and taxon_name.) are the same, its the same taxon. This assumption is tempered by the order of operations I discussed above: formal taxa are identified first, and informal taxa are identified from the occurrences that are ‘leftover’, and informal occurrences assigned to a taxon with a formal ID are concatenated to the ‘formal’ occurrences. Otherwise, we could mistakingly link the same occurrences to multiple taxa.
    • For what its worth, with respect to my above example, it looks like all taxa with taxon_name of Pseudoclimacograptus (Metaclimacograptus) hughesi or Pseudoclimacograptus hughesi have identical genus_name and species_name variables… but not identical matched_name variables, as only some of these occurrences have been assigned to formal taxon Diplograptus_hughesi, while others are assigned to genus-level formal taxa. But that’s a whole other can of worms: the inconsistent taxonomy of graptolites in the PBDB.
  • taxonClean removes taxa with question marks, cf. or other taxonomic-uncertainty flim-flam from the names it uses to identify unique taxa. Now there are two seperate concerns here:
    1. Occurrences with taxonomic uncertainy in our data. This really only pertains when rank is "species" or "genus", and can be easily taken care of by removing occurrences where species_reso or genus_reso does not equal a ‘safe’ value, a functionality controlled by the argument cleanUncertain, which is TRUE by default. The ‘safe’ values for species_reso or genus_reso are listed in the cleanResoValues argument, and by default includes common notifiers like ‘n. sp.’ which indicate something other than taxonomic uncertainty.
    2. The ‘cleaning’ of names so they don’t have unwanted taxonomic cruff on them, like that one sock in your laundry that attracts all the distasteful lint. This is particularly problematic as I use names, and not taxon ID numbers, to match occurrences as having the same tazon. However, my use-case is somewhat different from Clapham’s, as cleanTaxon uses taxon_name, while I use matched_name for formal species and the appropriate taxonomic variable for higher-taxa (i.e. genus for genera) and, for informal taxa, combining genus_name and species_name (for species), with taxon_name only being called for referencing informal supraspecific taxa. My working assumption is that taxonomic name rubbish has been kept out of matched_name and the various ‘formal’ supraspecific taxon variables, as well as genus_name and species_name for the taxonomic names given in the reference for that occurrence. That could be a bad assumption, but so far I haven’t found any. I know this rubbish exists in taxon_name, but (a) its hard to identify every use-case of such name-trash, and plus the case where an occurrence isn’t given a formal higher-taxon is very rare (in the data I’ve looked at).

Now, let’s test this function with data from paleobioDB. If we look at the Dicellograptus data, we should find only a single genus.

x<-taxonSortPBDBocc(dicelloData, rank="genus", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus"

Yep! And with the dataset downloaded using the API, we should get the same:

x<-taxonSortPBDBocc(dicelloData2, rank="genus", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus"

Yes. Now, let’s try to just retrieve ‘formal’ species from the paleobioDB dataset, and look at what’s in the first element:

x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus anceps"      "Dicellograptus complanatus"
## [3] "Dicellograptus ornatus"
head(x[[1]])
##       occurrence_no record_type collection_no            taxon_name
## 1:17          52248  occurrence          3875 Dicellograptus anceps
## 1:214       1143907  occurrence        145603 Dicellograptus anceps
## 1:227       1143994  occurrence        145612 Dicellograptus anceps
##       taxon_rank taxon_no          matched_name matched_rank matched_no
## 1:17     species   306364 Dicellograptus anceps      species     306364
## 1:214    species   306364 Dicellograptus anceps      species     306364
## 1:227    species   306364 Dicellograptus anceps      species     306364
##       early_interval early_age late_age reference_no          genus
## 1:17       Rawtheyan     455.8    445.6          227 Dicellograptus
## 1:214     Ashgillian     449.5    443.7        47085 Dicellograptus
## 1:227     Ashgillian     449.5    443.7        47087 Dicellograptus
##       genus_no        order order_no         class class_no       phylum
## 1:17     33650 Graptoloidea    33606 Graptolithina    33534 Hemichordata
## 1:214    33650 Graptoloidea    33606 Graptolithina    33534 Hemichordata
## 1:227    33650 Graptoloidea    33606 Graptolithina    33534 Hemichordata
##       phylum_no     genus_name species_name late_interval genus_reso
## 1:17      33518 Dicellograptus       anceps  Kralodvorian       <NA>
## 1:214     33518 Dicellograptus       anceps          <NA>       <NA>
## 1:227     33518 Dicellograptus       anceps          <NA>       <NA>
##       species_reso
## 1:17          <NA>
## 1:214         <NA>
## 1:227         <NA>

Okay, 3 formal species and occurrence data for a single species (Dicellograptus anceps), just like we’d expect. And now both formal and informal species:

x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE)
names(x)
##  [1] "Dicellograptus alector"      "Dicellograptus anceps"      
##  [3] "Dicellograptus complanatus"  "Dicellograptus divaricatus" 
##  [5] "Dicellograptus flexuosus"    "Dicellograptus gurleyi"     
##  [7] "Dicellograptus intermedius"  "Dicellograptus intortus"    
##  [9] "Dicellograptus johnstrupi"   "Dicellograptus mensurans"   
## [11] "Dicellograptus minor"        "Dicellograptus mirabilis"   
## [13] "Dicellograptus morrisi"      "Dicellograptus ornatus"     
## [15] "Dicellograptus russonioides" "Dicellograptus sextans"     
## [17] "Dicellograptus tumidus"      "Dicellograptus turgidus"

A big increase in the number of species recovered! Now, we might be curious if this number further increases if we retain occurrences with uncertain taxonomy:

x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE, cleanUncertain=FALSE)
names(x)
##  [1] "Dicellograptus alector"      "Dicellograptus anceps"      
##  [3] "Dicellograptus angulatus"    "Dicellograptus caduceus"    
##  [5] "Dicellograptus complanatus"  "Dicellograptus divaricatus" 
##  [7] "Dicellograptus elegans"      "Dicellograptus flexuosus"   
##  [9] "Dicellograptus forchammeri"  "Dicellograptus gurleyi"     
## [11] "Dicellograptus intermedius"  "Dicellograptus intortus"    
## [13] "Dicellograptus johnstrupi"   "Dicellograptus mensurans"   
## [15] "Dicellograptus minor"        "Dicellograptus mirabilis"   
## [17] "Dicellograptus moffatensis"  "Dicellograptus morrisi"     
## [19] "Dicellograptus ornatus"      "Dicellograptus pumilis"     
## [21] "Dicellograptus russonioides" "Dicellograptus sextans"     
## [23] "Dicellograptus smithi"       "Dicellograptus tumidus"     
## [25] "Dicellograptus turgidus"     "Dicellograptus vagus"

Oooh, even more species of questionable taxonomic validity!

That’s enough Dicellograptus for the moment. Now, let’s try our example brachiopod data from before. We should only recover one species, even if we allow it to pull informal taxa too:

x<-taxonSortPBDBocc(acoData, rank="species", onlyFormal=FALSE)
names(x)
## [1] "Acosarina minuta"

Yep, looks good!

Converting Occurrences Data to a timeList Data Object

Alright, now we can finally do what I meant to show you in January: turn these PBDB occurrences into a paleotree timelist data format. As you might recall from some previous posts, timeList data objects are lists composed of two matrices, the first matrix giving the start and end times of intervals, and the second matrix giving the intervals in which taxa are first and last observed.

Now, I imagine there might be more reasons to need to convert occurrence data to a timeList object than just those involving the PBDB (such as output from simulation function sampleRanges), so I’ll allow this function to look either for (a) variables named early_age and late_age as under ‘pbdb’ vocab, or, (b) if each element of the list is a two-column matrix, use each pair of values as the earliest and latest time bounds for the listed occurrences.

occData2timeList<-function(occList){
    # the following is all original, though inspired by paleobioDB code
    #need checks
    #pull first list entry
    exOcc<-occList[[1]]
    #test if all occList entries have same number of columns
    if(!all(sapply(occList,function(x) ncol(x)==ncol(exOcc)))){
        stop("Not all occList entries have same number of columns?")}
    #will assume all data given in this manner either has columns named "" and ""
        #or has only two columns
    if(!any(colnames(exOcc)=="early_age") | !any(colnames(exOcc)=="late_age")){
        if(!ncol(exOcc)!=2){stop("")}else{
            ageSelector<-1:2}
        }else{ageSelector<-c("early_age","late_age")}
    #get intervals in which taxa appear
    taxaInt<-lapply(occList,function(x) x[,ageSelector])
    #get earliest and latest intervals the taxa appear in
    taxaMin<-lapply(taxaInt,function(x) x[x[,1]==max(x[,1]),,drop=FALSE])
    taxaMax<-lapply(taxaInt,function(x) x[x[,2]==min(x[,2]),,drop=FALSE])
    #get smallest intervals
    taxaMin<-sapply(taxaMin,function(x) if(nrow(x)>1){
        x[which((-apply(x,1,diff))==min(-apply(x,1,diff)))[1],]
        }else{x})
    taxaMax<-sapply(taxaMax,function(x) if(nrow(x)>1){
        x[which((-apply(x,1,diff))==min(-apply(x,1,diff)))[1],]
        }else{x})
    #transpose and remove weird list attribute
    taxaMin<-t(taxaMin)
    taxaMax<-t(taxaMax)
    taxaMin<-cbind(unlist(taxaMin[,1]),unlist(taxaMin[,2]))
    taxaMax<-cbind(unlist(taxaMax[,1]),unlist(taxaMax[,2]))
    #make interval list
    intTimes<-unique(rbind(taxaMin,taxaMax))
    intTimes<-intTimes[order(-intTimes[,1],-intTimes[,2]),]
    #now assign taxa first and last intervals
    firstInt<-apply(taxaMin,1,function(x)
        which(apply(intTimes,1,function(y) identical(y,x))))
    lastInt<-apply(taxaMax,1,function(x)
        which(apply(intTimes,1,function(y) identical(y,x))))
    taxonTimes<-cbind(firstInt,lastInt)
    rownames(taxonTimes)<-names(occList)
    #package it together
    dimnames(intTimes)<-list(NULL,c("startTime","endTime"))
    res<-list(intTimes=intTimes,taxonTimes=taxonTimes)
    return(res)
    }

Let’s try it out!

sortDicelloOcc<-taxonSortPBDBocc(dicelloData,rank="species",onlyFormal=FALSE)
dicelloTimeList<-occData2timeList(occList=sortDicelloOcc)
dicelloTimeList
## $intTimes
##       startTime endTime
##  [1,]     471.8   457.5
##  [2,]     467.3   458.4
##  [3,]     460.9   456.1
##  [4,]     460.9   449.5
##  [5,]     458.4   443.4
##  [6,]     455.8   445.6
##  [7,]     453.0   445.2
##  [8,]     449.5   445.6
##  [9,]     449.5   443.7
## [10,]     445.2   443.4
## 
## $taxonTimes
##                             firstInt lastInt
## Dicellograptus alector             4       9
## Dicellograptus anceps              6       9
## Dicellograptus complanatus         7       9
## Dicellograptus divaricatus         3       3
## Dicellograptus flexuosus           4       4
## Dicellograptus gurleyi             4       4
## Dicellograptus intermedius         2       3
## Dicellograptus intortus            3       3
## Dicellograptus johnstrupi          4       4
## Dicellograptus mensurans           4       4
## Dicellograptus minor               7       9
## Dicellograptus mirabilis           9       9
## Dicellograptus morrisi             4       8
## Dicellograptus ornatus             7      10
## Dicellograptus russonioides        3       3
## Dicellograptus sextans             1       5
## Dicellograptus tumidus             9       9
## Dicellograptus turgidus            9       9

Looks like a timeList to me!

Okay, now we’re ready to try using the data with paleotree! Now, the main use will probably be for time-scaling a tree using functions like bin_timePaleoPhy, but since we lack a tree at this very moment to time-scale, let’s just use one of paleotree’s diversity curve for now. I realize that’s a bit of a letdown, maybe, since paleobioDB already has diversity curve functions (although they work different from paleotree’s).

library(paleotree)
taxicDivDisc(dicelloTimeList)

Other things we could do is to get at sampling rates from these estimates, which I’ll be covering in a future blog post (perhaps the next post?? Who knows!).

A Full Worked Example

My intention is to add taxonSortPBDBocc and occData2timeList to paleotree very soon, but for now users will have to source them into R manually. Assuming you do source these two functions, or that you are reading this in the future and have a new version of paleotree with these two functions, here’s a complete worked example with all graptoloids, using paleobioDB to download the data:

library(paleotree)
library(paleobioDB)

taxon<-"Graptoloidea"

graptOcc<-pbdb_occurrences(limit="all", base_name=taxon,
    show=c("phylo","ident"), vocab="pbdb")

#for species (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="species",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)

#for species (all)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="species",onlyFormal=FALSE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)

#for genera (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="genus",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)

#for family (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="family",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)

# number of occurrences listed for each family
sapply(sortGraptOcc,nrow)
##   Abrograptidae  Diplograptidae Phyllograptidae    Retiolitidae 
##               4             625              95             100 
##   Sinograptidae 
##               5

Unsurprisingly, we can see that there are more genera given formal IDs in the PBDB than there are formal species. Of course, the number of (unsynonymized) informal species swamp out both, with an enormous number of informal species in the early Silurian, creating a noticeable and somewhat odd peak. Some of the dicordence between the species and genus counts is almost certainly due to the general trend of simplification in graptolite colony structure, which leads to there being many species of Silurian monograptids placed in the polyphyletic wastebin taxon Monograptus, which means Silurian generic richness of graptolites is almost certainly not a very good metaphor for changes in species-level diversity. Simultaneously, though, it would take some investigation to figure out how the PBDB has over 200 contemporaneous early Silurian graptoloids. Very curious.

You might note from the table of number of occurrences per family that searching for graptoloids has skipped the paraphyletic family Anisograptidae. The ‘anisograptids’ were the paraphyletic ancestors for the traditional graptoloids (i.e. Section Graptoloidea Lapworth) and included in the Subdivision Graptoloida by Maletz et al. (2009), as this group was also planktonic and had a nematophorous sicula like all other graptoloids (well, except for the ones who lose their sicula…). It isn’t traditionally included in the graptoloids as the ‘anisograptids’ had not lost the bithecae of their dendroid graptolite ancestors. Anyway, all this means is that the above figures do not depict all the planktonic graptolites (i.e. the graptoloids as defined by Maletz et al., 2009).

Anyway, there we go: how to get PBDB API and paleobioDB data downloads to talk to paleotree.

PS: Hopefully, most of this will not become irrelevant with (version 1.2 of the PBDB API)[https://twitter.com/meclapham/status/577896469207748608]!