Hello all! Let’s finally get into what I wanted to talk about a month ago: converting Paleobiology Database (PBDB) data from paleobioDB
(or the API) into a format for paleotree
functions!
But first…
Some Recent Updates to R Package paleobioDB Since My Last Post
# install newest development version of paleobioDB
library(devtools)
install_github("ropensci/paleobioDB")
Hot on the heels of my last post and the related tickets I opened on the paleobioDB
github repo, Sara Varela has swiftly updated pbdb_temp_range
in paleobioDB
so that only taxa formally recognized (and thus, synoymized) by the PBDB, as described here. This function now only identifies and return the taxa formally recognized by the PBDB.
library(paleobioDB)
## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
## max min
## Dicellograptus anceps 455.8 443.7
## Dicellograptus ornatus 453.0 443.4
## Dicellograptus complanatus 453.0 443.7
This now returns the three Dicellograptus species ‘formally’ identified (i.e. given species-level taxon IDs and have occurrences listed for them) within the PBDB. This misses informal taxa (see last post), but that’s really the problem of anyone who works on a group with poorly-kept taxonomy in the PBDB.
(cough graptolites cough)
Let’s look at a non-graptolite example then.
acoData<-pbdb_occurrences(limit="all", base_name="Acosarina minuta", show=c("phylo","ident"), vocab="pbdb")
pbdb_temp_range(acoData, rank="species", do.plot=FALSE)
## max min
## Acosarina minuta 303.7 251.3
unique(acoData$matched_no)
## [1] 125301
Only a single species, just as it should be because all of the listed occurrences are synonymized to a single species within the PBDB’s taxonomic database.
pbdb_temp_range(acoData, rank="genus", do.plot=FALSE)
## max min
## Acosarina 303.7 251.3
And only a single genus, so the results at different levels of the taxonomic hierarchy are consistent!
Getting PBDB Data into paleotree: Differences between API and paleobioDB downloads
Preferably, we’d like to implement utilities that work on both data collected using paleobioDB
or data collected using the PBDB API. So, let’s also download some data using the API:
dicelloData2<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")
Unfortunately, here’s where we run into a snag. We can see these datasets differ in their column names and, in some cases, contents.
head(dicelloData)
## oid typ cid tna rnk tid mna mra mid
## 1:1 3393 occ 328 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:2 4589 occ 378 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:3 4634 occ 379 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:4 4688 occ 380 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:5 8957 occ 403 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:6 9127 occ 424 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## oei eag lag rid gnl gnn odl
## 1:1 Black River 460.9 457.5 13 Dicellograptus 33650 Graptoloidea
## 1:2 Harju 452.0 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:3 Harju 452.0 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:4 Caradoc 460.9 449.5 13 Dicellograptus 33650 Graptoloidea
## 1:5 Ashgill 449.5 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:6 Middle Ordovician 470.0 458.4 13 Dicellograptus 33650 Graptoloidea
## odn cll cln phl phn idt ids oli
## 1:1 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:2 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:3 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:4 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:5 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## 1:6 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus sp. <NA>
## rst rss
## 1:1 <NA> <NA>
## 1:2 <NA> <NA>
## 1:3 <NA> <NA>
## 1:4 <NA> <NA>
## 1:5 <NA> <NA>
## 1:6 <NA> <NA>
head(dicelloData2)
## occurrence_no record_type reid_no superceded collection_no
## 1 3393 occurrence NA NA 328
## 2 4589 occurrence NA NA 378
## 3 4634 occurrence NA NA 379
## 4 4688 occurrence NA NA 380
## 5 8957 occurrence NA NA 403
## 6 9127 occurrence NA NA 424
## taxon_name taxon_rank taxon_no matched_name matched_rank
## 1 Dicellograptus sp. genus 33650 Dicellograptus genus
## 2 Dicellograptus sp. genus 33650 Dicellograptus genus
## 3 Dicellograptus sp. genus 33650 Dicellograptus genus
## 4 Dicellograptus sp. genus 33650 Dicellograptus genus
## 5 Dicellograptus sp. genus 33650 Dicellograptus genus
## 6 Dicellograptus sp. genus 33650 Dicellograptus genus
## matched_no early_interval late_interval early_age late_age
## 1 33650 Black River Black River 460.9 457.5
## 2 33650 Harju Harju 452.0 443.7
## 3 33650 Harju Harju 452.0 443.7
## 4 33650 Caradoc Caradoc 460.9 449.5
## 5 33650 Ashgill Ashgill 449.5 443.7
## 6 33650 Middle Ordovician Middle Ordovician 470.0 458.4
## reference_no genus_name genus_reso subgenus_name subgenus_reso
## 1 13 Dicellograptus NA NA
## 2 13 Dicellograptus NA NA
## 3 13 Dicellograptus NA NA
## 4 13 Dicellograptus NA NA
## 5 13 Dicellograptus NA NA
## 6 13 Dicellograptus NA NA
## species_name species_reso genus genus_no family family_no
## 1 sp. Dicellograptus 33650 NA NA
## 2 sp. Dicellograptus 33650 NA NA
## 3 sp. Dicellograptus 33650 NA NA
## 4 sp. Dicellograptus 33650 NA NA
## 5 sp. Dicellograptus 33650 NA NA
## 6 sp. Dicellograptus 33650 NA NA
## order order_no class class_no phylum phylum_no
## 1 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
## 2 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
## 3 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
## 4 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
## 5 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
## 6 Graptoloidea 33606 Graptolithina 33534 Hemichordata 33518
Why is this? Because of the two different ‘vocabularies’, explained in the last blog post, we can end up with PBDB data downloads with different column heads and contents. By default, paleobioDB
returns data with the ‘com’ (compact) vocab, while PBDB API by default uses the more extended ‘pbdb’ vocab. While ‘com’ saves on data transfer for large tables, and the API documentation is clear about the changes in variable names, the changes in contents is less clear. For example, taxon_rank
numbers the ranks in the ‘com’ vocab, but there doesn’t appear to be any documentation on what these numbers mean. All one can do is download the same data under both vocabs to figure it out; e.g. a rank of ‘3’ appears to be ‘species’.
As I noted before, this makes a mess of things because it means those of us writing code to manipulate PBDB data either (a) write two seperate versions of their code to anticipate either ‘pbdb’ or ‘com’ datasets, which is what the authors of paleobioDB
have done or (b) pick one and require users to only use one vocab. Personally, I find the first option to be dangerous from a programming view, because writing the same code twice can lead to toxic inconsistencies, such as when you have to go back to fix bugs at a later date and forget to fix them in the second version of the code. The second option is the easiest from a programming point of view, but likely to cause issues since the default for paleobioDB
is ‘com’ vocab and the default for the PBDB API is ‘pbdb’ vocab, and any utilities written are likely to get use from both. Speaking from maintaining paleotree
for the last three years, something like a warning statement about ‘you need to use the right vocab!!’ is exactly the sort of statement that just confuses users and fills your inbox. A third option might be to include lines that transform the needed variables of one vocab into the other, but without more documentation, that’s not entirely possible.
For the moment, I’m going to go with the second option and stick to ‘pbdb’ vocab since it is the clearest and easiest to read and manipulate. This means I’ll need to adjust the arguments for my paleobioDB
Dicellograptus download to use ‘pbdb’ vocab.
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"), vocab="pbdb")
head(dicelloData)
## occurrence_no record_type collection_no taxon_name taxon_rank
## 1:1 3393 occurrence 328 Dicellograptus sp. genus
## 1:2 4589 occurrence 378 Dicellograptus sp. genus
## 1:3 4634 occurrence 379 Dicellograptus sp. genus
## 1:4 4688 occurrence 380 Dicellograptus sp. genus
## 1:5 8957 occurrence 403 Dicellograptus sp. genus
## 1:6 9127 occurrence 424 Dicellograptus sp. genus
## taxon_no matched_name matched_rank matched_no early_interval
## 1:1 33650 Dicellograptus genus 33650 Black River
## 1:2 33650 Dicellograptus genus 33650 Harju
## 1:3 33650 Dicellograptus genus 33650 Harju
## 1:4 33650 Dicellograptus genus 33650 Caradoc
## 1:5 33650 Dicellograptus genus 33650 Ashgill
## 1:6 33650 Dicellograptus genus 33650 Middle Ordovician
## early_age late_age reference_no genus genus_no order
## 1:1 460.9 457.5 13 Dicellograptus 33650 Graptoloidea
## 1:2 452.0 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:3 452.0 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:4 460.9 449.5 13 Dicellograptus 33650 Graptoloidea
## 1:5 449.5 443.7 13 Dicellograptus 33650 Graptoloidea
## 1:6 470.0 458.4 13 Dicellograptus 33650 Graptoloidea
## order_no class class_no phylum phylum_no genus_name
## 1:1 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## 1:2 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## 1:3 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## 1:4 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## 1:5 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## 1:6 33606 Graptolithina 33534 Hemichordata 33518 Dicellograptus
## species_name late_interval genus_reso species_reso
## 1:1 sp. <NA> <NA> <NA>
## 1:2 sp. <NA> <NA> <NA>
## 1:3 sp. <NA> <NA> <NA>
## 1:4 sp. <NA> <NA> <NA>
## 1:5 sp. <NA> <NA> <NA>
## 1:6 sp. <NA> <NA> <NA>
There’s also another issue of trying to write functions useful for both PBDB API and paleobioDB
downloads. In addtion to paleobioDB
(dropping empty variables)[https://github.com/ropensci/paleobioDB/issues/18], empty variable entries are assigned NA
while the read.csv
procedure used above for downloading PBDB API data replaces empty variables with ""
(i.e. an empty string). Empty variables, however, get assigned NA
by read.csv
, which is very inconsistent.
Here’s some examples from the API downloaded data:
#an empty variable
as.character(dicelloData2$subgenus_reso[1:20])
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#a variable with both empty and filled entries
as.character(dicelloData2$genus_reso[1:20])
## [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
## [18] "" "?" ""
And, for comparison, the paleobioDB
data (note the empty variable has been dropped):
#an empty variable
as.character(dicelloData$subgenus_reso[1:20])
## character(0)
#a variable with both empty and filled entries
as.character(dicelloData$genus_reso[1:20])
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [18] NA "?" NA
One way to fix this is via the argument na.strings
in read.csv
.
dicelloData3<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all",
na.strings=c("NA",""))
#a variable with both empty and filled entries
as.character(dicelloData3$genus_reso[1:20])
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [18] NA "?" NA
A better solution, however, may be to simply put a check in any function cleaning the data to convert empty string values (""
) to NA
s, like so.
#making a copy of the original API download
dicelloData3<-dicelloData2
#replace empty values with NA
dicelloData3[dicelloData3==""]<-NA
#a variable with both empty and filled entries
as.character(dicelloData3$genus_reso[1:20])
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [18] NA "?" NA
Looks good.
Sorting Out Unique Taxa
Alright, now that we picked a consistent vocabulary, we need to write a function that sorts occurrence data into the unique taxa of a user-selected taxonomic rank. I think the preferable way to do this is to make a list where each element is a ‘unique’ taxon and all the occurrences attributed to it are under that element as a table. Additionally, we’ll have to allow the function to pull either just the ‘formal’ identified taxa or ‘all’ the identified taxa. This is necessary if we want species that are listed under their genus and never got assigned a species-level taxonomic ID in the PBDB.
Its important we consider the order of data manipulations we need to perform in this process. Formal identified taxa need to be pulled first, and then informal taxa pulled from whatever occurrences are left over. Furthermore, its possible (but hopefully rare) that the same taxon might be listed both formally and informally in a set of PBDB occurrences, and it would be preferable that the ‘informal’ occurrences for a taxon be concatenated to the ‘formal’ occurrences for that same taxon, rather than be treated as separate taxon.
Now, the functions I used as references when writing this function were paleobioDB
’s pbdb_temp_range
function (code here), and Matthew Clapham’s taxonClean
function (code here). I took a somewhat different approach though. Both of these write seperate code for the different taxonomic ranks, which is vaguely unsatisfying from a programming perspective, as it could lead to inconsistencies where you fail to update all parts. Thankfully, except for species, one can get the formal taxonomic name for a given taxonomic rank by looking for the variable with the same name as the taxonomic rank. Additionally, we can look for informal occurrences of a taxon by then checking for occurrences with taxon_rank
identical to the required rank (although it seems this is rarely important).
And, as noted above, we need a check to convert empty ""
values to NA
s.
taxonSortPBDBocc<-function(data,rank, onlyFormal=TRUE, cleanUncertain=TRUE,
cleanResoValues=c(NA, '"', "", "n. sp.", "n. gen.")){
#this function inspired by Matt Clapham's taxonClean and paleobioDB's pbdb_temp_range
#onlyFormal=FALSE;rank="species"
#onlyFormal=FALSE;rank="genus"
#pull occurrences out of a data table and sort by unique taxa into a list structure
#First, some warning checks to see if occurrences downloaded correctly:
# pbdb vocab and showing phylo and ident data
#the following is taken with minor modification from code in paleobioDB package
if (any("tna"==colnames(data))){
stop("need to add 'vocab='pbdbd'' to pbdb_occurrences query\n *or*\n 'vocab=pbdb' to your PBDB API query")
}
if (!any("genus_name"==colnames(data)) | !any("genus"==colnames(data))){
stop("need to add 'show=c('phylo', 'ident')' to pbdb_occurrences query\n *or*\n 'show=ident,phylo' to PBDB API query")
}
#additional checks for rank
if(length(rank)!=1){stop("length of rank must be 1")}
if(!any(sapply(c("species","genus","family","order","class","phylum"),function(x) x==rank))){
stop("rank must be one of 'species', 'genus', 'family', 'order', 'class' or 'phylum'")}
#for inconsistencies between rank and onlyFormal - NOT TRUE, IGNORE THIS CHECK
#if(!onlyFormal & (rank!="species" | rank!="genus")){
# stop("Informal taxon does not exist for above genus level, please use 'onlyFormal=TRUE'")}
#need to replace any empty string values with NAs (due perhaps to use of read.csv with the API)
data[data==""]<-NA
#now, pull taxa using the relevant variable from 'phylo' for formal ID
#this matches what paleobioDB now does
#sort occurrences by unique taxa in each level and then append valid names
if(rank=="species"){
if(cleanUncertain){ #remove uncertain species taxonomy
data<-data[data[,"species_reso"] %in% cleanResoValues,,drop=FALSE]
}
#first formal taxa
#get species names for taxa formally recognized as species level
whichFormal<-(data[,"matched_rank"]==rank)
taxonVar<-as.character(data[,"matched_name"])
taxaNames<-as.character(unique(taxonVar[whichFormal])) #get unique taxa
#drop empty entries (occurrences listed at a higher taxonomic level, formally)
taxaNames<-taxaNames[!is.na(taxaNames)]
#take rows these taxa are in and drop them into elements of a list
sortedOcc<-lapply(taxaNames,function(x) data[which(x==taxonVar),,drop=FALSE])
names(sortedOcc)<-taxaNames
if(!onlyFormal){
#now use taxon_rank to identify useful informal occurrence
taxonVar2<-data[,c("genus_name","species_name")]
taxonVar2<-apply(taxonVar2,1,paste,collapse=" ")
#which of these occs are useful as informal occs
stillUseful<-which(!whichFormal & (data[,"taxon_rank"]==rank))
taxaNames2<-as.character(unique(taxonVar2[stillUseful]))
#drop any weird empties
taxaNames2<-taxaNames2[taxaNames2!=" " & !is.na(taxaNames2)]
sortedOcc2<-lapply(taxaNames2,function(x) data[which(x==taxonVar2),,drop=FALSE])
names(sortedOcc2)<-taxaNames2
# merge sortedocc2 with sortedOcc
share1<-sapply(taxaNames,function(x) any(x==taxaNames2))
share2<-sapply(taxaNames2,function(x) any(x==taxaNames))
if(sum(!share1)>0){sortedOccU<-sortedOcc[!share1]}else{sortedOccU<-list()}
if(sum(!share2)>0){sortedOcc2U<-sortedOcc2[!share2]}else{sortedOcc2U<-list()}
#and for shared names
if(sum(share1)>0){
shared<-lapply(taxaNames[share1],function(x)
cbind(sharedOcc[[taxaNames==x]],sharedOcc2[[taxaNames2==x]]))
names(shared)<-names(taxaNames[share1])
}else{shared<-list()}
sortedOcc<-c(sortedOccU,sortedOcc2U,shared)
}
}else{ #if not at the species rank
if(cleanUncertain & rank=="genus"){ #if rank=genus and removing uncertain taxonomy
data<-data[data[,"genus_reso"] %in% cleanResoValues,,drop=FALSE]
}
taxonVar<-data[,rank] #then our taxonomic variable of interest is just so!
taxaNames<-as.character(unique(taxonVar)) #get unique taxa
taxaNames<-taxaNames[!is.na(taxaNames)]
#take rows these taxa are in and drop them into elements of a list
sortedOcc<-lapply(taxaNames,function(x) data[which(x==taxonVar),,drop=FALSE])
names(sortedOcc)<-taxaNames
if(!onlyFormal){
#now use taxon_rank to identify useful informal occurrence
taxonVar2<-data[,"taxon_name"]
#which of these occs are useful as informal occs
stillUseful<-which(is.na(taxonVar) & (data[,"taxon_rank"]==rank))
taxaNames2<-as.character(unique(taxonVar2[stillUseful]))
taxaNames2<-taxaNames2[!is.na(taxaNames2)]
sortedOcc2<-lapply(taxaNames2,function(x) data[which(x==taxonVar2),,drop=FALSE])
names(sortedOcc2)<-taxaNames2
# merge sortedocc2 with sortedOcc
share1<-sapply(taxaNames,function(x) any(x==taxaNames2))
share2<-sapply(taxaNames2,function(x) any(x==taxaNames))
if(sum(!share1)>0){sortedOccU<-sortedOcc[!share1]}else{sortedOccU<-list()}
if(sum(!share2)>0){sortedOcc2U<-sortedOcc2[!share2]}else{sortedOcc2U<-list()}
#and for shared names
if(sum(share1)>0){
shared<-lapply(taxaNames[share1],function(x)
cbind(sharedOcc[[taxaNames==x]],sharedOcc2[[taxaNames2==x]]))
names(shared)<-names(taxaNames[share1])
}else{shared<-list()}
sortedOcc<-c(sortedOccU,sortedOcc2U,shared)
}
}
# sort occurrences by taxon name
sortedOcc<-sortedOcc[order(names(sortedOcc))]
return(sortedOcc)
}
(I apologize for the terribly uneven indenting above. A strange copy/paste error involving tabs in Rstudio, Notepad++ and R GUI, and the more I tab, the worse it gets. I promise a clean-looking version in the paleotree
github repo soon.)
Now, in the following, I have dropped or revised some of the checks from taxonClean
and pbdb_temp_ranges
.
pbdb_temp_ranges
originally (before my issue tickets) checked to see if the same taxon was listed under multipletaxon_no
ID numbers, and returned an error if this was true. This resulted in some interesting cases where the error arose due to subtle differences in taxonomic name, like Pseudoclimacograptus (Metaclimacograptus) hughesi and Pseudoclimacograptus hughesi being listed (for whatever reason) under differenttaxon_no
ID numbers. I’m blanket presuming that if a taxon’s listed taxon name (e.g. formal namesmatched_name
,genus
, etc. as well as informal likegenus_name + species_name
andtaxon_name
.) are the same, its the same taxon. This assumption is tempered by the order of operations I discussed above: formal taxa are identified first, and informal taxa are identified from the occurrences that are ‘leftover’, and informal occurrences assigned to a taxon with a formal ID are concatenated to the ‘formal’ occurrences. Otherwise, we could mistakingly link the same occurrences to multiple taxa.- For what its worth, with respect to my above example, it looks like all taxa with
taxon_name
of Pseudoclimacograptus (Metaclimacograptus) hughesi or Pseudoclimacograptus hughesi have identicalgenus_name
andspecies_name
variables… but not identicalmatched_name
variables, as only some of these occurrences have been assigned to formal taxon Diplograptus_hughesi, while others are assigned to genus-level formal taxa. But that’s a whole other can of worms: the inconsistent taxonomy of graptolites in the PBDB.
- For what its worth, with respect to my above example, it looks like all taxa with
taxonClean
removes taxa with question marks, cf. or other taxonomic-uncertainty flim-flam from the names it uses to identify unique taxa. Now there are two seperate concerns here:- Occurrences with taxonomic uncertainy in our data. This really only pertains when
rank
is"species"
or"genus"
, and can be easily taken care of by removing occurrences wherespecies_reso
orgenus_reso
does not equal a ‘safe’ value, a functionality controlled by the argumentcleanUncertain
, which isTRUE
by default. The ‘safe’ values forspecies_reso
orgenus_reso
are listed in thecleanResoValues
argument, and by default includes common notifiers like ‘n. sp.’ which indicate something other than taxonomic uncertainty. - The ‘cleaning’ of names so they don’t have unwanted taxonomic cruff on them, like that one sock in your laundry that attracts all the distasteful lint. This is particularly problematic as I use names, and not taxon ID numbers, to match occurrences as having the same tazon. However, my use-case is somewhat different from Clapham’s, as
cleanTaxon
usestaxon_name
, while I usematched_name
for formal species and the appropriate taxonomic variable for higher-taxa (i.e.genus
for genera) and, for informal taxa, combininggenus_name
andspecies_name
(for species), withtaxon_name
only being called for referencing informal supraspecific taxa. My working assumption is that taxonomic name rubbish has been kept out ofmatched_name
and the various ‘formal’ supraspecific taxon variables, as well asgenus_name
andspecies_name
for the taxonomic names given in the reference for that occurrence. That could be a bad assumption, but so far I haven’t found any. I know this rubbish exists intaxon_name
, but (a) its hard to identify every use-case of such name-trash, and plus the case where an occurrence isn’t given a formal higher-taxon is very rare (in the data I’ve looked at).
- Occurrences with taxonomic uncertainy in our data. This really only pertains when
Now, let’s test this function with data from paleobioDB
. If we look at the Dicellograptus data, we should find only a single genus.
x<-taxonSortPBDBocc(dicelloData, rank="genus", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus"
Yep! And with the dataset downloaded using the API, we should get the same:
x<-taxonSortPBDBocc(dicelloData2, rank="genus", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus"
Yes. Now, let’s try to just retrieve ‘formal’ species from the paleobioDB
dataset, and look at what’s in the first element:
x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=TRUE)
names(x)
## [1] "Dicellograptus anceps" "Dicellograptus complanatus"
## [3] "Dicellograptus ornatus"
head(x[[1]])
## occurrence_no record_type collection_no taxon_name
## 1:17 52248 occurrence 3875 Dicellograptus anceps
## 1:214 1143907 occurrence 145603 Dicellograptus anceps
## 1:227 1143994 occurrence 145612 Dicellograptus anceps
## taxon_rank taxon_no matched_name matched_rank matched_no
## 1:17 species 306364 Dicellograptus anceps species 306364
## 1:214 species 306364 Dicellograptus anceps species 306364
## 1:227 species 306364 Dicellograptus anceps species 306364
## early_interval early_age late_age reference_no genus
## 1:17 Rawtheyan 455.8 445.6 227 Dicellograptus
## 1:214 Ashgillian 449.5 443.7 47085 Dicellograptus
## 1:227 Ashgillian 449.5 443.7 47087 Dicellograptus
## genus_no order order_no class class_no phylum
## 1:17 33650 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:214 33650 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## 1:227 33650 Graptoloidea 33606 Graptolithina 33534 Hemichordata
## phylum_no genus_name species_name late_interval genus_reso
## 1:17 33518 Dicellograptus anceps Kralodvorian <NA>
## 1:214 33518 Dicellograptus anceps <NA> <NA>
## 1:227 33518 Dicellograptus anceps <NA> <NA>
## species_reso
## 1:17 <NA>
## 1:214 <NA>
## 1:227 <NA>
Okay, 3 formal species and occurrence data for a single species (Dicellograptus anceps), just like we’d expect. And now both formal and informal species:
x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE)
names(x)
## [1] "Dicellograptus alector" "Dicellograptus anceps"
## [3] "Dicellograptus complanatus" "Dicellograptus divaricatus"
## [5] "Dicellograptus flexuosus" "Dicellograptus gurleyi"
## [7] "Dicellograptus intermedius" "Dicellograptus intortus"
## [9] "Dicellograptus johnstrupi" "Dicellograptus mensurans"
## [11] "Dicellograptus minor" "Dicellograptus mirabilis"
## [13] "Dicellograptus morrisi" "Dicellograptus ornatus"
## [15] "Dicellograptus russonioides" "Dicellograptus sextans"
## [17] "Dicellograptus tumidus" "Dicellograptus turgidus"
A big increase in the number of species recovered! Now, we might be curious if this number further increases if we retain occurrences with uncertain taxonomy:
x<-taxonSortPBDBocc(dicelloData, rank="species", onlyFormal=FALSE, cleanUncertain=FALSE)
names(x)
## [1] "Dicellograptus alector" "Dicellograptus anceps"
## [3] "Dicellograptus angulatus" "Dicellograptus caduceus"
## [5] "Dicellograptus complanatus" "Dicellograptus divaricatus"
## [7] "Dicellograptus elegans" "Dicellograptus flexuosus"
## [9] "Dicellograptus forchammeri" "Dicellograptus gurleyi"
## [11] "Dicellograptus intermedius" "Dicellograptus intortus"
## [13] "Dicellograptus johnstrupi" "Dicellograptus mensurans"
## [15] "Dicellograptus minor" "Dicellograptus mirabilis"
## [17] "Dicellograptus moffatensis" "Dicellograptus morrisi"
## [19] "Dicellograptus ornatus" "Dicellograptus pumilis"
## [21] "Dicellograptus russonioides" "Dicellograptus sextans"
## [23] "Dicellograptus smithi" "Dicellograptus tumidus"
## [25] "Dicellograptus turgidus" "Dicellograptus vagus"
Oooh, even more species of questionable taxonomic validity!
That’s enough Dicellograptus for the moment. Now, let’s try our example brachiopod data from before. We should only recover one species, even if we allow it to pull informal taxa too:
x<-taxonSortPBDBocc(acoData, rank="species", onlyFormal=FALSE)
names(x)
## [1] "Acosarina minuta"
Yep, looks good!
Converting Occurrences Data to a timeList Data Object
Alright, now we can finally do what I meant to show you in January: turn these PBDB occurrences into a paleotree
timelist
data format. As you might recall from some previous posts, timeList
data objects are lists composed of two matrices, the first matrix giving the start and end times of intervals, and the second matrix giving the intervals in which taxa are first and last observed.
Now, I imagine there might be more reasons to need to convert occurrence data to a timeList
object than just those involving the PBDB (such as output from simulation function sampleRanges
), so I’ll allow this function to look either for (a) variables named early_age
and late_age
as under ‘pbdb’ vocab, or, (b) if each element of the list is a two-column matrix, use each pair of values as the earliest and latest time bounds for the listed occurrences.
occData2timeList<-function(occList){
# the following is all original, though inspired by paleobioDB code
#need checks
#pull first list entry
exOcc<-occList[[1]]
#test if all occList entries have same number of columns
if(!all(sapply(occList,function(x) ncol(x)==ncol(exOcc)))){
stop("Not all occList entries have same number of columns?")}
#will assume all data given in this manner either has columns named "" and ""
#or has only two columns
if(!any(colnames(exOcc)=="early_age") | !any(colnames(exOcc)=="late_age")){
if(!ncol(exOcc)!=2){stop("")}else{
ageSelector<-1:2}
}else{ageSelector<-c("early_age","late_age")}
#get intervals in which taxa appear
taxaInt<-lapply(occList,function(x) x[,ageSelector])
#get earliest and latest intervals the taxa appear in
taxaMin<-lapply(taxaInt,function(x) x[x[,1]==max(x[,1]),,drop=FALSE])
taxaMax<-lapply(taxaInt,function(x) x[x[,2]==min(x[,2]),,drop=FALSE])
#get smallest intervals
taxaMin<-sapply(taxaMin,function(x) if(nrow(x)>1){
x[which((-apply(x,1,diff))==min(-apply(x,1,diff)))[1],]
}else{x})
taxaMax<-sapply(taxaMax,function(x) if(nrow(x)>1){
x[which((-apply(x,1,diff))==min(-apply(x,1,diff)))[1],]
}else{x})
#transpose and remove weird list attribute
taxaMin<-t(taxaMin)
taxaMax<-t(taxaMax)
taxaMin<-cbind(unlist(taxaMin[,1]),unlist(taxaMin[,2]))
taxaMax<-cbind(unlist(taxaMax[,1]),unlist(taxaMax[,2]))
#make interval list
intTimes<-unique(rbind(taxaMin,taxaMax))
intTimes<-intTimes[order(-intTimes[,1],-intTimes[,2]),]
#now assign taxa first and last intervals
firstInt<-apply(taxaMin,1,function(x)
which(apply(intTimes,1,function(y) identical(y,x))))
lastInt<-apply(taxaMax,1,function(x)
which(apply(intTimes,1,function(y) identical(y,x))))
taxonTimes<-cbind(firstInt,lastInt)
rownames(taxonTimes)<-names(occList)
#package it together
dimnames(intTimes)<-list(NULL,c("startTime","endTime"))
res<-list(intTimes=intTimes,taxonTimes=taxonTimes)
return(res)
}
Let’s try it out!
sortDicelloOcc<-taxonSortPBDBocc(dicelloData,rank="species",onlyFormal=FALSE)
dicelloTimeList<-occData2timeList(occList=sortDicelloOcc)
dicelloTimeList
## $intTimes
## startTime endTime
## [1,] 471.8 457.5
## [2,] 467.3 458.4
## [3,] 460.9 456.1
## [4,] 460.9 449.5
## [5,] 458.4 443.4
## [6,] 455.8 445.6
## [7,] 453.0 445.2
## [8,] 449.5 445.6
## [9,] 449.5 443.7
## [10,] 445.2 443.4
##
## $taxonTimes
## firstInt lastInt
## Dicellograptus alector 4 9
## Dicellograptus anceps 6 9
## Dicellograptus complanatus 7 9
## Dicellograptus divaricatus 3 3
## Dicellograptus flexuosus 4 4
## Dicellograptus gurleyi 4 4
## Dicellograptus intermedius 2 3
## Dicellograptus intortus 3 3
## Dicellograptus johnstrupi 4 4
## Dicellograptus mensurans 4 4
## Dicellograptus minor 7 9
## Dicellograptus mirabilis 9 9
## Dicellograptus morrisi 4 8
## Dicellograptus ornatus 7 10
## Dicellograptus russonioides 3 3
## Dicellograptus sextans 1 5
## Dicellograptus tumidus 9 9
## Dicellograptus turgidus 9 9
Looks like a timeList to me!
Okay, now we’re ready to try using the data with paleotree
! Now, the main use will probably be for time-scaling a tree using functions like bin_timePaleoPhy
, but since we lack a tree at this very moment to time-scale, let’s just use one of paleotree
’s diversity curve for now. I realize that’s a bit of a letdown, maybe, since paleobioDB
already has diversity curve functions (although they work different from paleotree
’s).
library(paleotree)
taxicDivDisc(dicelloTimeList)
Other things we could do is to get at sampling rates from these estimates, which I’ll be covering in a future blog post (perhaps the next post?? Who knows!).
A Full Worked Example
My intention is to add taxonSortPBDBocc
and occData2timeList
to paleotree
very soon, but for now users will have to source them into R manually. Assuming you do source these two functions, or that you are reading this in the future and have a new version of paleotree
with these two functions, here’s a complete worked example with all graptoloids, using paleobioDB
to download the data:
library(paleotree)
library(paleobioDB)
taxon<-"Graptoloidea"
graptOcc<-pbdb_occurrences(limit="all", base_name=taxon,
show=c("phylo","ident"), vocab="pbdb")
#for species (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="species",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)
#for species (all)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="species",onlyFormal=FALSE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)
#for genera (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="genus",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)
#for family (formal ID only)
sortGraptOcc<-taxonSortPBDBocc(graptOcc,rank="family",onlyFormal=TRUE)
graptTimeList<-occData2timeList(occList=sortGraptOcc)
taxicDivDisc(graptTimeList)
# number of occurrences listed for each family
sapply(sortGraptOcc,nrow)
## Abrograptidae Diplograptidae Phyllograptidae Retiolitidae
## 4 625 95 100
## Sinograptidae
## 5
Unsurprisingly, we can see that there are more genera given formal IDs in the PBDB than there are formal species. Of course, the number of (unsynonymized) informal species swamp out both, with an enormous number of informal species in the early Silurian, creating a noticeable and somewhat odd peak. Some of the dicordence between the species and genus counts is almost certainly due to the general trend of simplification in graptolite colony structure, which leads to there being many species of Silurian monograptids placed in the polyphyletic wastebin taxon Monograptus, which means Silurian generic richness of graptolites is almost certainly not a very good metaphor for changes in species-level diversity. Simultaneously, though, it would take some investigation to figure out how the PBDB has over 200 contemporaneous early Silurian graptoloids. Very curious.
You might note from the table of number of occurrences per family that searching for graptoloids has skipped the paraphyletic family Anisograptidae. The ‘anisograptids’ were the paraphyletic ancestors for the traditional graptoloids (i.e. Section Graptoloidea Lapworth) and included in the Subdivision Graptoloida by Maletz et al. (2009), as this group was also planktonic and had a nematophorous sicula like all other graptoloids (well, except for the ones who lose their sicula…). It isn’t traditionally included in the graptoloids as the ‘anisograptids’ had not lost the bithecae of their dendroid graptolite ancestors. Anyway, all this means is that the above figures do not depict all the planktonic graptolites (i.e. the graptoloids as defined by Maletz et al., 2009).
Anyway, there we go: how to get PBDB API and paleobioDB
data downloads to talk to paleotree
.
PS: Hopefully, most of this will not become irrelevant with (version 1.2 of the PBDB API)[https://twitter.com/meclapham/status/577896469207748608]!