Introduction
Hello all!
I’ve had an interesting week or so having conversations with several individuals about best practices for identifying ‘species’ level taxa in fossil occurrence data from the Paleobiology Database following my post two weeks ago where I found that several utilities associated with the PBDB identified ‘species’ in different ways. Let’s try to hit this issue really hard today.
Paleobiology Database Vocabulary
First, in my last post, I was a little confusing since I referred to the same variable by its full name and a shortening that seemed to be used only by the paleobioDB
package (Varela et al., in press); it turns out this is simply the difference between variable headings returned by the Paleobiology Database’s API depending on special parameter vocab
, which can either be pbdb
(full variable names) or com
(for the ‘compact’, shortened terminology). You can easily change this in paleobioDB
data-pulling functions using argument vocab
.
library(paleobioDB)
## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="com"))
## oid typ cid tna rnk tid mna mra mid
## 1:1 3393 occ 328 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:2 4589 occ 378 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:3 4634 occ 379 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:4 4688 occ 380 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:5 8957 occ 403 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## 1:6 9127 occ 424 Dicellograptus sp. 5 33650 Dicellograptus 5 33650
## oei eag lag rid idt ids oli rst rss
## 1:1 Black River 460.9 457.5 13 Dicellograptus sp. <NA> <NA> <NA>
## 1:2 Harju 452.0 443.7 13 Dicellograptus sp. <NA> <NA> <NA>
## 1:3 Harju 452.0 443.7 13 Dicellograptus sp. <NA> <NA> <NA>
## 1:4 Caradoc 460.9 449.5 13 Dicellograptus sp. <NA> <NA> <NA>
## 1:5 Ashgill 449.5 443.7 13 Dicellograptus sp. <NA> <NA> <NA>
## 1:6 Middle Ordovician 470.0 458.4 13 Dicellograptus sp. <NA> <NA> <NA>
head(pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb"))
## occurrence_no record_type collection_no taxon_name taxon_rank
## 1:1 3393 occurrence 328 Dicellograptus sp. genus
## 1:2 4589 occurrence 378 Dicellograptus sp. genus
## 1:3 4634 occurrence 379 Dicellograptus sp. genus
## 1:4 4688 occurrence 380 Dicellograptus sp. genus
## 1:5 8957 occurrence 403 Dicellograptus sp. genus
## 1:6 9127 occurrence 424 Dicellograptus sp. genus
## taxon_no matched_name matched_rank matched_no early_interval
## 1:1 33650 Dicellograptus genus 33650 Black River
## 1:2 33650 Dicellograptus genus 33650 Harju
## 1:3 33650 Dicellograptus genus 33650 Harju
## 1:4 33650 Dicellograptus genus 33650 Caradoc
## 1:5 33650 Dicellograptus genus 33650 Ashgill
## 1:6 33650 Dicellograptus genus 33650 Middle Ordovician
## early_age late_age reference_no genus_name species_name
## 1:1 460.9 457.5 13 Dicellograptus sp.
## 1:2 452.0 443.7 13 Dicellograptus sp.
## 1:3 452.0 443.7 13 Dicellograptus sp.
## 1:4 460.9 449.5 13 Dicellograptus sp.
## 1:5 449.5 443.7 13 Dicellograptus sp.
## 1:6 470.0 458.4 13 Dicellograptus sp.
## late_interval genus_reso species_reso
## 1:1 <NA> <NA> <NA>
## 1:2 <NA> <NA> <NA>
## 1:3 <NA> <NA> <NA>
## 1:4 <NA> <NA> <NA>
## 1:5 <NA> <NA> <NA>
## 1:6 <NA> <NA> <NA>
Note that some variables change their formatting depending on whether pbdb
or com
vocabulary are used; for example, taxon_rank
is full nouns (‘genus’, ‘species’, etc) with pbdb
, but is a numerical coding with com
.
For this post (and future posts), I’ll try to use the full pbdb
terminology; you can always check the API page for PBDB occurrence data if you want the corresponding com
label though.
A brief aside… paleobioDB package versus the raw API in R
Shanan Peters has usefully pointed out that we shouldn’t mistake paleobioDB
as the only way to access the PBDB API. After all, the API will presumably continue to evolve and thus the paleobioDB
package may not be up-to-date with the API’s various arguments. Instead of using paleobioDB
we could instead access the API directly, for example:
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all")
head(rawAPI)
## occurrence_no record_type reid_no superceded collection_no
## 1 3393 occurrence NA NA 328
## 2 4589 occurrence NA NA 378
## 3 4634 occurrence NA NA 379
## 4 4688 occurrence NA NA 380
## 5 8957 occurrence NA NA 403
## 6 9127 occurrence NA NA 424
## taxon_name taxon_rank taxon_no matched_name matched_rank
## 1 Dicellograptus sp. genus 33650 Dicellograptus genus
## 2 Dicellograptus sp. genus 33650 Dicellograptus genus
## 3 Dicellograptus sp. genus 33650 Dicellograptus genus
## 4 Dicellograptus sp. genus 33650 Dicellograptus genus
## 5 Dicellograptus sp. genus 33650 Dicellograptus genus
## 6 Dicellograptus sp. genus 33650 Dicellograptus genus
## matched_no early_interval late_interval early_age late_age
## 1 33650 Black River Black River 460.9 457.5
## 2 33650 Harju Harju 452.0 443.7
## 3 33650 Harju Harju 452.0 443.7
## 4 33650 Caradoc Caradoc 460.9 449.5
## 5 33650 Ashgill Ashgill 449.5 443.7
## 6 33650 Middle Ordovician Middle Ordovician 470.0 458.4
## reference_no genus_name genus_reso subgenus_name subgenus_reso
## 1 13 Dicellograptus NA NA
## 2 13 Dicellograptus NA NA
## 3 13 Dicellograptus NA NA
## 4 13 Dicellograptus NA NA
## 5 13 Dicellograptus NA NA
## 6 13 Dicellograptus NA NA
## species_name species_reso
## 1 sp.
## 2 sp.
## 3 sp.
## 4 sp.
## 5 sp.
## 6 sp.
Note, however, some small differences between these otherwise very similar requests. In particular, paleobioDB
seems to have removed several variables related to data cleaning, like whether an occurrence has been superceded by another occurrence or new taxonomic assignment. I’ve traced the code used in the function through paleobioDB
’s github repo and I’m not sure why these don’t show up.
PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")
# same number of columns in both?
ncol(rawAPI)==ncol(PBDBpack)
## [1] FALSE
# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no" "superceded" "subgenus_name" "subgenus_reso"
Perhaps there is a cleaning step where paleobioDB
(or Paleobiology Database, when it responds to JSON requests) is removing empty columns (i.e. contains only NA
s).
# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all(is.na(rawAPI[,x])))
## reid_no superceded subgenus_name subgenus_reso
## TRUE TRUE TRUE TRUE
There’s no mention to this phenomenon that I can find in paleobioDB’s manual (or any evidence of column dropping in the code of the package’s repo) or the documentation for the Paleobiology Database’s API, however. It’s a little odd because it means one could query, for example, for abundance data (show=c("abund")
) and not get back the columns relating to abundance because they are empty. This could create issues for code that expects variables to always be present in a given data download.
For example, here is an extremely contrived example for the bizarre reticulate graptoloid genus Reteograptus (see Finney, 1980, for some morphological details), which is probably a lazarus taxon of the abrograptids (according to Finney).
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Reteograptus&show=abund&limit=all")
PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")
# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no" "superceded" "abund_value" "abund_unit"
#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(is.na(x)))
## abund_value abund_unit
## TRUE TRUE
There does not appear to be an argument to alter this variable-dropping behavior. I’ve opened an issue ticket here at the paleobioDB
repo.
The ultimate result of the is that we can’t always depend on the same variables to (a) be present, (b) have the same column headings or (c) have the same variable codings for tables of PBDB occurrence data, because it depends greatly on how the data was downloaded and which options were used, going beyond the choices involving show =
. The authors of paleobioDB
package seem to have dealt with this by writing two separate versions of functions for analyzing occurrence data within each function (example), which is inefficient but perhaps necessary.
Distinguishing Species in Paleobiology Database Occurrence Data
It turns out there is no single ‘right’ answer to how to identify unique species in a set of PBDB occurrence data and assign occurrences to these distinct species. To understand the different ways we might perform this task, we should cover the relevant variables and what the information contained within is. The following variables can be obtained by setting the show
to both ident
and phylo
, whether within the API or paleobioDB
.
Some of the following information comes from a very informative discussion I have had with Matthew Clapham. However, personallylly, I’m still working through understanding all these variables, and intentionally trying to do this with as little fact-checking by email as possible. This is partly to demonstrate the limitations of the current documentation and also to avoid drowning certain individuals in emails. Better I save up all my questions for the end, right? I will be honest and say I still don’t entirely understand reid_no
and superceded
. So don’t blame my sources, blame me for when I get things wrong, because that’s part of this little mental experiment.
An important distinction is that one set of variables refers to the taxonomic identification for an occurrence reported within the original reference:
taxon_name
described as “The taxonomic name by which this occurrence is identified.”taxon_rank
described as “The taxonomic rank of the name, if this can be determined.”genus_name
described as “The taxonomic name (less species) by which this occurrence was identified. This is often a genus, but may be a higher taxon.”genus_reso
described as “The resolution of this taxonomic name, i.e. sensu lato or aff.”species_name
described as “The species name (if any) by which this occurrence was identified”species_reso
described as “The resolution of the species name, i.e. sensu lato or n. sp”
Related to this group, we have the variable taxon_no
:
taxon_no
described as “The unique identifier of the identified taxonomic name. If this is empty, then the name was never entered into the taxonomic hierarchy stored in this database and we have no further information about the classification of this occurrence.”
…which identifies which taxon ID is assigned to the original taxon identification. However, many taxa apparently get entered without being given unique taxon IDs, and are instead assigned tentatively to a higher taxon. For example, there are many occurrences listed with distinct species_names
that are listed under a taxon_no
for a genus-level taxon.
A different set of variables refers to the formal taxon assignment currently given within the PBDB’s present state of taxonomic standardization, which attempts to control for synonymies:
matched_name
described as “The senior synonym and/or currently accepted spelling of the closest matching name in the database to the identified taxonomic name, if any is known, and if this name is different from the value of taxon_name.”matched_rank
described as “The taxonomic rank of the matched name, if different from the value of taxon_rank”matched_no
described as “The unique identifier of the closest matching name in the database to the identified taxonomic name, if any is known.”genus
described as “The name of the genus in which this occurrence is classified”genus_no
described as “The identifier of the genus in which this occurrence is classified”
Again, because many lower-taxa get informally assigned to the same higher-taxon ID on a tentative tentatively, taxon_no
(and thus matched_no
and matched_name
in addition) may be the same for occurences listed as multiple distinct species going by taxon_rank
and species_name
.
Finally, the occurrence itself might be reidentified or superceded with a new taxonomic assignment, which can be found in:
reid_no
described as “If this occurrence was reidentified, a positive integer that uniquely identifies the reidentification”superceded
described as “The value of this field will be true if this occurrence was later identified under a different taxon”
These aren’t exhaustive lists, and if we wanted to discuss pulling out unique supraspecific taxa, we would need to consider additional variables such as family_name
reported by the show=phylo
parameter.
It turns out the Dicellograptus example I keep using provides an example of one type of disagreement we might see in groups where there has been little standardization and where many species have been entered ‘informally’ as tentative assignments to a higher taxon. So let’s focus on the above variables in that dataset:
#taxonomic variables of interest
taxonVar<-c("reid_no","superceded", #PBDB occurrence reidentification info
#original taxon info for occurrence
"taxon_name","taxon_rank","genus_name","genus_reso","species_name","species_reso",
#standardized formal taxon info
"taxon_no","matched_name","matched_rank","matched_no","genus","genus_no")
# due to the column dropping issue, we can't use paleobioDB package:
#dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident","phylo"), vocab="pbdb")
# so we'll use the API instead
dicelloData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")
head(dicelloData[,taxonVar])
## reid_no superceded taxon_name taxon_rank genus_name
## 1 NA NA Dicellograptus sp. genus Dicellograptus
## 2 NA NA Dicellograptus sp. genus Dicellograptus
## 3 NA NA Dicellograptus sp. genus Dicellograptus
## 4 NA NA Dicellograptus sp. genus Dicellograptus
## 5 NA NA Dicellograptus sp. genus Dicellograptus
## 6 NA NA Dicellograptus sp. genus Dicellograptus
## genus_reso species_name species_reso taxon_no matched_name
## 1 sp. 33650 Dicellograptus
## 2 sp. 33650 Dicellograptus
## 3 sp. 33650 Dicellograptus
## 4 sp. 33650 Dicellograptus
## 5 sp. 33650 Dicellograptus
## 6 sp. 33650 Dicellograptus
## matched_rank matched_no genus genus_no
## 1 genus 33650 Dicellograptus 33650
## 2 genus 33650 Dicellograptus 33650
## 3 genus 33650 Dicellograptus 33650
## 4 genus 33650 Dicellograptus 33650
## 5 genus 33650 Dicellograptus 33650
## 6 genus 33650 Dicellograptus 33650
Just from the first few lines, we can also see a number of Dicellograptus occurrences that were never resolved beyond the genus level and thus are just assigned to the Dicellograptus genus taxon ID (taxon_no = 33650
), and the matched_name
and other ‘matched’ information is the same, because Dicellograptus has never been synonymized (which is good, because its a valid genus, so it shouldn’t be synonymized).
We can see these listed as “Species lacking formal opinion data” here.
But how many unique taxon_name
and species_name
values are assigned to taxon_no = 33650
?
unique(dicelloData[dicelloData$taxon_no==33650,"taxon_name"])
## [1] Dicellograptus sp. Dicellograptus ? sp.
## [3] Dicellograptus alector n. sp. Dicellograptus gurleyi
## [5] Dicellograptus mensurans Dicellograptus sextans cf.
## [7] Dicellograptus spp. Dicellograptus sextans
## [9] Dicellograptus vagus cf. Dicellograptus intortus
## [11] Dicellograptus divaricatus cf. Dicellograptus caduceus cf.
## [13] Dicellograptus forchammeri cf. Dicellograptus intortus cf.
## [15] Dicellograptus moffatensis cf. Dicellograptus intermedius n. sp.
## [17] Dicellograptus divaricatus Dicellograptus russonioides n. sp.
## [19] Dicellograptus johnstrupi Dicellograptus morrisi
## [21] Dicellograptus gurleyi cf. Dicellograptus elegans cf.
## [23] Dicellograptus morrisi cf. Dicellograptus smithi cf.
## [25] Dicellograptus sp. ? Dicellograptus alector
## [27] Dicellograptus alector cf. Dicellograptus ? angulatus cf.
## [29] Dicellograptus angulatus cf. Dicellograptus flexuosus
## [31] Dicellograptus minor Dicellograptus tumidus
## [33] Dicellograptus turgidus Dicellograptus mirabilis
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.
unique(dicelloData[dicelloData$taxon_no==33650,"species_name"])
## [1] sp. alector gurleyi mensurans sextans
## [6] spp. vagus intortus divaricatus caduceus
## [11] forchammeri moffatensis intermedius russonioides johnstrupi
## [16] morrisi elegans smithi angulatus flexuosus
## [21] minor tumidus turgidus mirabilis
## 27 Levels: alector anceps angulatus caduceus complanatus ... vagus
Thankfully, all of these are given a taxon_rank
of species… right? We can check this by removing any occurrence not given a taxon_rank
of species and looking at the unique taxon_name
values.
unique(dicelloData[dicelloData$taxon_no==33650 & dicelloData$taxon_rank!="species","taxon_name"])
## [1] Dicellograptus sp. Dicellograptus ? sp. Dicellograptus spp.
## [4] Dicellograptus sp. ?
## 40 Levels: Dicellograptus ? angulatus cf. ... Dicellograptus vagus cf.
Okay, in Dicellograptus, at least, all the informal species assigned formally to a genus-level taxon are listed correctly as species.
We can examine another sort of the disagreement among these fields by looking at datasets where there has been a lot of taxonomic standarization in the PBDB (i.e. not graptolites). Matt Clapham showed me a particularly interesting read-out for a query involving Acosarina minuta, a Permian brachiopod (…perhaps a minute brachiopod?), filtering for the same variables as above:
# so we'll use the API, like above...
# Note that the following won't work except from web browser address bar or other utility
# which automatically (and silently) replaces " " with "%20"
acoData1<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Acosarina minuta&show=ident,phylo")
colnames(acoData1)
## [1] "occurrence_no" "record_type" "reid_no" "superceded"
## [5] "collection_no" "taxon_name" "taxon_rank" "taxon_no"
## [9] "matched_name" "matched_rank" "matched_no" "early_interval"
## [13] "late_interval" "early_age" "late_age" "reference_no"
# the query terminated at the " " in the taxon name and never got to other parameters, like 'show'
# (This tripped me up for an hour. Don't judge me!)
# Okay, now the right way
acoData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Acosarina%20minuta&show=ident,phylo")
head(acoData[,match(taxonVar,colnames(acoData))])
## reid_no superceded taxon_name taxon_rank genus_name
## 1 NA NA Orthotichia indica species Orthotichia
## 2 20839 NA Acosarina minuta species Acosarina
## 3 NA NA Acosarina minuta species Acosarina
## 4 NA NA Acosarina dorsisulcata species Acosarina
## 5 NA NA Acosarina dorsisulcata species Acosarina
## 6 NA NA Acosarina dorsisulcata species Acosarina
## genus_reso species_name species_reso taxon_no matched_name
## 1 indica 177488 Acosarina minuta
## 2 minuta 125301 Acosarina minuta
## 3 minuta 125301 Acosarina minuta
## 4 dorsisulcata 126711 Acosarina minuta
## 5 dorsisulcata 126711 Acosarina minuta
## 6 dorsisulcata 126711 Acosarina minuta
## matched_rank matched_no genus genus_no
## 1 species 125301 Acosarina 26652
## 2 species 125301 Acosarina 26652
## 3 species 125301 Acosarina 26652
## 4 species 125301 Acosarina 26652
## 5 species 125301 Acosarina 26652
## 6 species 125301 Acosarina 26652
All of the occurrences listed here are formally identified in PBDB’s standardized taxonomy as Acosarina minuta. That’s pretty easy to verify by looking at unique values of matched_name
and matched_no
:
unique(acoData$matched_name)
## [1] Acosarina minuta
## Levels: Acosarina minuta
unique(acoData$matched_no)
## [1] 125301
However, we can see even just from the head()
that there are multiple values of taxon_name
, taxon_no
, genus_name
and species_name
formally reassigned to the ‘matched’ taxon Acosarina minuta. Here’s a full list:
unique(acoData[,c("taxon_name","taxon_no","genus_name","species_name")])
## taxon_name taxon_no genus_name species_name
## 1 Orthotichia indica 177488 Orthotichia indica
## 2 Acosarina minuta 125301 Acosarina minuta
## 4 Acosarina dorsisulcata 126711 Acosarina dorsisulcata
## 15 Acosarina dorsisulcata n. sp. 126711 Acosarina dorsisulcata
## 49 Orthis indica 125296 Orthis indica
## 58 Acosarina indica 175512 Acosarina indica
## 99 Kotlaia capillosa 123689 Kotlaia capillosa
## 100 Kotlaia n. gen. capillosa n. sp. 123689 Kotlaia capillosa
## 103 Orthotichia minuta 199434 Orthotichia minuta
## 110 Acosarina minuta cf. 125301 Acosarina minuta
## 111 Sunacosarina campana 190895 Sunacosarina campana
## 134 Acosarina kanmerai n. sp. 206333 Acosarina kanmerai
## 142 Orthis (Schizophoria) indica ? 125296 Orthis indica
## 143 Orthis (Schizophoria) indica 125296 Orthis indica
## 217 Dalmanella indica 255744 Dalmanella indica
## 234 Acosarina ? minuta 125301 Acosarina minuta
So, how do we navigate these variables to identify unique species in our datasets?
Potentially Misleading Ways To Identify Species in PBDB Occurrence Data
In the last post, I mentioned that a number of my friends, as it turned out, used very different criteria for distinguishing taxa from PBDB occurrence data, criteria which also varied among the formal ‘apps’ of the Paleobiology Database.
Personally, I’d intuitively chosen (two blog posts ago) to use species_name
alone, filtered on taxon_rank = "species"
. It turns out several of my friends also use species_name
. Another friend uses a combination of taxon_name
and taxon_no
. Strategies based on taxon_name
or species_name
are (at best) not making use of the potential for information of taxonomic synonymies in the PBDB and (at worst) in groups that have undergone a considerable amount of qualified taxonomic standardization, using these variables may be very misleading relative to the current state of knowledge.
It should also be clear that taxon_no
alone could be a misleading way to distinguish unique species in occurrence data, even if filtered to species-ranked occurrences. Referring to the last post, I showed that paleobioDB
package uses taxon_no
to identify unique species in occurrences, filtered to taxon_rank
of species. The function match()
is then used to find genus_name
and species_name
which are then combined into a single species label. We can see the code here. As we can see above, even filtering taxon_no
to species-level occurrences via taxon_rank
would result in mistakingly grouping a number of informally entered species that are formally assigned to their genus as a single species-level taxon with an exaggerated number of occurrences assigned to it. Furthermore, the use of match()
returns the first and only the first match in a list, so the assigned species-label is an arbitrary function of whatever species comes first in a table of occurrences. This results in the mistaken reporting of Dicellograptus alector from Dicellograptus occurrence data below:
require(paleobioDB)
dicelloData<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("phylo","ident"))
pbdb_temp_range(dicelloData, rank="species", do.plot=FALSE)
## max min
## Dicellograptus alector 471.8 443.4
## Dicellograptus anceps 455.8 443.7
## Dicellograptus ornatus 453.0 443.4
## Dicellograptus complanatus 453.0 443.7
I’ve opened an issue ticket for this issue on the paleobioDB
repo here.
It turns out this isn’t what PBDB Navigator does, as I had incorrectly guessed last post. PBDB Navigator is doing something much more complex, counting formal taxa that are listed as daughter taxa. The reason why Dicellograptus gravis appears despite having no occurrences in the PBDB is because that taxon was formally entered without any occurrences. This is discussed more here, where I had thought it was a bug and reported it as such, but it was more of a… misinterpretation. Similarly, PBDB Classic / Fossilworks appears to use the taxonomic standarization in the database to return ranges for all formally-defined species appropriately (under the Download function), but returns any informally defined species in addition.
A Framework for Identifying Unique Species
So, how should we go about pulling species-level taxa from our occurrence data? If we want all ‘species’ information possible, then we must do two seperate processes:
Pull out all formally-defined species, with their
matched_name
species name. This should be as simple as usingmatched_rank
to filter on species-level formal taxa and pulling out the uniquematched_no
values.Pull out all informal species that are assigned formally to supraspecific taxa. This should be possible on filtering only for taxa that do not have a
matched_rank = species
assignment but have ataxon_rank = species
value, and then pulling on unique values ofspecies_name
.
Obviously, the second is optional if we want to only keep formal species; depends on our data.
There are some complexities to the above for actually generating the correct labels for these species-level taxa. First, there is no corresponding species_name
-like variable for matched_taxon
as there is with genus_name
and genus
. The most reasonable method seems to be splitting the species name from the genus name in matched_name
and combining it with genus
, but presumably this should have the same effect as just using matched_name
whole.
Let’s look at a function that tries to follow this ‘informal versus formal’ dichotomy. Here’s taxonClean
from Matthew Clapham’s github repo PBDB-R-scripts, function script located here:
taxonClean <- function(occurrences, tax_level="genus", formal_id="no") {
#FILE PREPARATION AND DATA CLEANING
#file preparation and data cleaning for species level analysis
if(tax_level=="species"){
#removes rows where species qualified by cf. or aff., question mark, ex gr. or quotation mark
resolved_occs <- subset(occurrences, occurrences$species_reso=="" | occurrences$species_reso=="n. sp.")
if (formal_id=="yes") {
#deletes occurrences not resolved to species level using formally-classified species
cleaned_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
cleaned_occs$final_taxon <- cleaned_occs$matched_name
} else {
#deletes occurrences not resolved to species level using classified and unclassified species
resolved_occs <- subset(resolved_occs, resolved_occs$taxon_rank=="species")
#strips all extraneous qualifiers from taxon name
resolved_occs$taxon_name <- gsub("\" ", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub("cf. ", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub("aff. ", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub("\\? ", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub("n. gen. ", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub(" n. sp.", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub(" sensu lato", "", resolved_occs$taxon_name)
resolved_occs$taxon_name <- gsub(" informal", "", resolved_occs$taxon_name)
#separates classified and unclassified occurrences
classified_occs <- subset(resolved_occs, resolved_occs$matched_rank=="species")
unclassified_occs <- subset(resolved_occs, resolved_occs$matched_rank!="species")
classified_occs$final_taxon <- classified_occs$matched_name
unclassified_occs$final_taxon <- unclassified_occs$taxon_name
#combines classified and unclassified occurrences
cleaned_occs <- rbind(classified_occs, unclassified_occs)
}
#or file preparation and data cleaning for family level analysis
} else if (tax_level=="family") {
#deletes occurrences not resolved to at least family level
resolved_occs <- subset(occurrences, occurrences$matched_rank <= 9)
#deletes occurrences where genus or family are qualified with question mark, quotations, cf. or aff.
if(length(grep("\\? ", resolved_occs$taxon_name)) > 0){
resolved_occs <- resolved_occs[-grep("\\? ", resolved_occs$taxon_name),]
}
if(length(grep("\" ", resolved_occs$taxon_name)) > 0){
resolved_occs <- resolved_occs[-grep("\" ", resolved_occs$taxon_name),]
}
if(length(grep("cf. ", resolved_occs$taxon_name)) > 0){
resolved_occs <- resolved_occs[-grep("cf. ", resolved_occs$taxon_name),]
}
if(length(grep("aff. ", resolved_occs$taxon_name)) > 0){
resolved_occs <- resolved_occs[-grep("aff. ", resolved_occs$taxon_name),]
}
cleaned_occs <- resolved_occs
cleaned_occs$final_taxon <- cleaned_occs$family
#or file preparation and data cleaning for genus level analysis
} else {
#deletes occurrences not resolved to at least genus level using classified and unclassified species
resolved_occs <- subset(occurrences, occurrences$matched_rank <= 5)
#deletes occurrences where genus is qualified with question mark, quotations, cf. or aff.
cleaned_occs <- subset(resolved_occs, resolved_occs$genus_reso=="" | resolved_occs$genus_reso=="n. gen.")
#extracts genus name from matched_name string
cleaned_occs$final_taxon <- gsub(" .*", "", cleaned_occs$matched_name)
}
cleaned_occs
}
(I made some small code alterations because I am just personally terrified by the idea of not embracing the lines of control structure (i.e. if
, for
, etc.) in curly braces! There was several actual errors: an erroneous resolved_occs
before resolved_occs
was called, also an attempt to call matched_name = 3
rather than matched_name = "species"
…)
Note that taxonClean
depends on vocab = "pbdb"
, which differs from the default for data functions in package paleobioDB
. This function will return a table of occurrences, pared down only to the occurrences of taxa at the appropriate level and with their valid taxon labels in a new variable labeled final_taxon
.
Let’s see what it does with our Dicellograptus data, using the API.
dicelloData<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident,phylo&limit=all")
# with informal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="no")
unique(cleanOccs$final_taxon)
## [1] Dicellograptus anceps Dicellograptus complanatus
## [3] Dicellograptus ornatus Dicellograptus alector
## [5] Dicellograptus gurleyi Dicellograptus mensurans
## [7] Dicellograptus sextans Dicellograptus intortus
## [9] Dicellograptus intermedius Dicellograptus divaricatus
## [11] Dicellograptus russonioides Dicellograptus johnstrupi
## [13] Dicellograptus morrisi Dicellograptus flexuosus
## [15] Dicellograptus minor Dicellograptus tumidus
## [17] Dicellograptus turgidus Dicellograptus mirabilis
## 19 Levels: Dicellograptus ... Dicellograptus mirabilis
# only formal species
cleanOccs<-taxonClean(dicelloData, tax_level="species", formal_id="yes")
unique(cleanOccs$final_taxon)
## [1] Dicellograptus anceps Dicellograptus complanatus
## [3] Dicellograptus ornatus
## 4 Levels: Dicellograptus ... Dicellograptus ornatus
For me though, I’d probably approach this issue in a slight different way than taxonClean
… I’ll show what I cook up next time, when I (hopefully) finally show some stuff I did weeks ago integrating paleobioDB
data downloads with paleotree
functions.