Friday, April 3, 2015

What I Accomplished at the Paleobiology Database Hackathon, March 2015

Hello all! Recently I was able to attend the Paleobiology Database (PBDB) Hackathon that was held at UC Santa Cruz, and I wanted to talk to you about the experience and present some interesting products from my work at the hackathon.

Thoughts on the Hackathon Structure

Now, I’ve never attended another hackathon, but my preconceived notion is they tend to be entirely devoted to developing new software in a working-group setting. I think this particular meeting was ultimately more of a mixture between a hackathon and a PBDB API workshop. Much of the focus in shared discussions was on becoming familiar with the API, particularly the in-development version 1.2 of the API, and reporting issues as we encountered them while programming. The last was critical: if we hadn’t been trying to all program various items based on the API, we would have not encountered many of the issues we encountered; we were basically acting as Quality Assurance for the API software and the database, which is invaluable for future users to use the API effectively.

Now, I think that format worked quite well, but I think future PBDB hackathons (if there are plans for such) will probably hew closer to the typical hackathon model, as hopefully participants of future PBDB hacking events will encounter fewer issues and better documentation. There was also a learning curve issue: we probably needed more time so to first become proficient at the API as a group and then be able to work together on focused group projects. Overall, though, I think the main goal was to promote community excitement about the PBDB API, and I think that was accomplished in spades.

For me, the ‘workshop on the API’ aspect of the get-together was invaluable. I’ve been trying pretty hard for the last two months to understand the API’s output (as my previous blog posts will attest to) but one can only get so far by bothering Matt Clapham and others with emails.

So What Did I Do?

Well, readers, I wrote a bunch of functions in R! I have even added the majority of them to paleotree on github, along with documentation, which means you can go play with them now! Just go install the in-development version of paleotree directly from github with package devtools

#get in-development paleotree version 2.4

Which we can check and see is named version 2.4:

## [1] '2.4'

So, now let’s load paleotree!

#load paleotree
## Loading required package: ape

You can use ?paleotree to go peruse the help files for the more than seventy functions available in paleotree.

Functions to download PBDB data

Now, what you won’t find in paleotree is the functions I wrote at the hackathon for downloading PBDB data. Why? Well: first, there’s already an R package for that, paleobioDB (Varela et al., 2015). I have little interest in doing what others have already made their focus. Second, maintaining such functions to ensure functionality forever as I essentialy would like to ensure for all paleotree functions is difficult, as issues and corrections will be needed every time a new PBDB API version appeared. Note how the functions below call version 1.1 of the API.

The functions below were written to automate some aspects of the API for the ‘occurrences’ and ‘taxa’ download functionalities, respectively. In particular, the longer of the two, easygetPBDBocc, was written to strip out warning messages returned by the API, which can cause problems with simple uses of read.csv to read in PBDA data downloads using the API. This could be particularly useful if a user wants to repeatedly query the PBDB, say for a series of taxon names from a list, particularly if it is unknown whether all the taxon on that list have been formally entered into the PBDB and have occurrence data listed for them.

  #cleans PBDB occurrence downloads of warnings
    command<-paste(unlist(strsplit(command,split=" ")),collapse="%20")
        warn<-sapply(warn, function(x) 
  #let's get some taxonomic data

Note well, that easyGetPBDBtaxa will only return the senior names of taxa, so that we don’t have to remove junior synonyms from the resulting dataset.

Now, we can use these functions to download some example data for graptoloids from the Paleobiology Database API, version 1.1:



And let’s look at these datasets very briefly…

##   occurrence_no record_type reid_no superceded collection_no
## 1          2319  occurrence      NA         NA           270
## 2          2432  occurrence      NA         NA           279
## 3          2461  occurrence      NA         NA           281
## 4          2604  occurrence      NA         NA           288
## 5          2761  occurrence      NA         NA           297
## 6          2762  occurrence      NA         NA           297
##            taxon_name taxon_rank taxon_no  matched_name matched_rank
## 1    Hallograptus sp.      genus    33673  Hallograptus        genus
## 2   Schizograptus sp.      genus    33761 Schizograptus        genus
## 3 Didymograptus ? sp.      genus    33655 Didymograptus        genus
## 4   Didymograptus sp.      genus    33655 Didymograptus        genus
## 5    Diplograptus sp.      genus    33660  Diplograptus        genus
## 6   Didymograptus sp.      genus    33655 Didymograptus        genus
##   taxon_no orig_no record_type associated_records      rank
## 1    33606   33606       taxon                 NA     order
## 2   166989  166989       taxon                 NA    family
## 3   166991  166991       taxon                 NA subfamily
## 4   150197  150197       taxon                 NA    family
## 5   166988  166988       taxon                 NA subfamily
## 6    33650   33650       taxon                 NA     genus
##        taxon_name common_name     status parent_no senior_no
## 1    Graptoloidea          NA belongs to     33534     33606
## 2    Retiolitidae          NA belongs to     33606    166989
## 3 Plectograptinae          NA belongs to    166989    166991
## 4  Diplograptidae          NA belongs to     33606    150197
## 5    Retiolitinae          NA belongs to    166989    166988
## 6  Dicellograptus          NA belongs to     33606     33650

Now what to we do with these big tables of data? Let’s look at what we can do with the occurrence data first.

Sorting Unique Taxa From Occurrence Datasets with taxonSortPBDBocc

Having the occurrence data in a big table isn’t going to do us much good without some sorting of these occurrence into those assigned to separate, unique taxa. We can break these tables down into lists, where each element is a table of occurrences assigned to taxa at various taxonomic levels using the new paleotree function taxonsortPBDBocc, which debuted in the last blog post. This function received a lot of attention from me while at the hackathon and can now handle data from almost any vocabularly or API version.

As discussed in that previous post, there are several ways to pull taxa: from different taxonomic levels, but also deciding whether to pull the ‘informal’ taxa that have never been officially entered and yet are listed in the original information from the identification of the occurrence. We can also decide whether we want to keep occurrence that had some sort of indicator of taxonomic uncertainty in their identified taxon name.

One neat thing we can do is use taxonSortPBDBocc to count the number of taxa available for different taxonomic levels and levels of data ‘cleanliness’.

First, we can count just the formal genera:

occGenus<-taxonSortPBDBocc(graptOccPBDB, rank="genus")
## [1] 133

And then just formal species:

occSpeciesFormal<-taxonSortPBDBocc(graptOccPBDB, rank="species")
## [1] 20

And, yep, there are fewer ‘formal’ graptoloid species in the PBDB then there are ‘formal’ genera. This must mean a majority of genera have no species formally assigned to them.

Now let’s also count the informal species, along with the formal species:

occSpeciesInformal<-taxonSortPBDBocc(graptOccPBDB, rank="species",
## [1] 642

And our numbers increase to something that might be realistic (to my eye), now that we have those ‘informal’ species.

Now let’s have the informal and formal species altogether, but let’s not throwout any occurrences with suspicious/uncertain taxon identifiers. This is really everything and the kitchen sink, as they say.

occSpeciesEverything<-taxonSortPBDBocc(graptOccPBDB, rank="species",
        onlyFormal=FALSE, cleanUncertain=FALSE)
## [1] 734

And we get even more species recovered.

Now, we can visualize the age uncertainty of the occurrences assigned to our species using a plotting function I wrote a few weeks ago and posted to this blog. This function is now in paleotree as plotOccData, and it takes taxon-sorted lists of occurrence data as its input, just as given by taxonSortPBDBocc. Let’s plot the formal species data for now:


Each of the horizontal lines is the age uncertainty of a single occurrence, and occurrences are visually sorted and color-coded by the taxa (in this case, species) that they below to. We can get something a little more complex if we try genera:


But as there are many more taxa, there is a lot more going on in this figure.

Get Taxon Occurrence Data into a ‘timeList’ object with occData2timeList

Once we have a taxon-sorted list of occurrence tables, we can obtain a timeList object useable by many other paleotree functions via the function occData2timeList. This function also initially debuted in the last blog post. It is much-much improved now, in particular it returns (by default) the smallest bounds possible for the first and last appearance of each taxon, which are values that maximize the use of information content from all the occurrence data.

Let’s apply it to the dataset that contains cleaned occurrences, for both formal and informal species.

# use occData2timeList

Let’s look at what we have. Every timeList object is composed of two matrices each with two columns: (1) the age bounds on the intervals, and (2) the respective first and last intervals of each taxon, given as the interval’s rownumber in the first matrix.

##      startTime endTime
## [1,]     488.3   471.8
## [2,]     485.4   477.7
## [3,]     485.4   473.9
## [4,]     478.6   470.0
## [5,]     478.6   468.9
## [6,]     478.6   468.1
##                           firstInt lastInt
## Abiesgraptus tenuiramosus       90      90
## Akidograptus acuminatus         59      62
## Akidograptus ascensus           59      59
## Amplexograptus arctus           19      19
## Amplexograptus bohemicus        58      58
## Amplexograptus confertus        18      34

Our main purpose for getting a timeList object in paleotree is probably to time-scale a tree, but with one not being handy at the moment, let’s just do something a little more boring and compare the diversity curves for (formal and informal) species and for genera:




We can see that the two are very different, as I noted last time. The early Silurian spike in species diversity looks artificial to my eye, but there may also be something going on whether too many taxa have been lumped into the wastebin taxon Monograptus.

If only there was a way of visualizing the taxonomic structures in the PBDB…

Creating a ‘Taxon-Tree’ from the Paleobiology Database’s Taxonomic Data

Most of us are probably familiar with the superficial similarities between taxonomy, as a nested hierarchy, and phylogenies, which also are hierarchies with a nesting structure (except not everything needs to be equally nested from the topology’s point of view…). A commonly invoked metaphore is of taxonomic groups as to imagine that nested taxonomic groups are alike nested monophyletic clades, with no additional resolution. In other words, we could describe the taxonomy of a group as a phylogeny-like arrangedment of mostly-unresolved clusters nested within each other. It’ll look like a phylogeny but it is really just the taxonomic information portrayed in a new way.

However, in some ways, such ‘taxon-trees’ are already widely used in some fields as the analytical basis for various phylogenetic comparative methods. For example, much of Phylomatic’s lower-taxonomic structure is derived from a taxon-tree like approach. Recently, Soul and Friedman examined the use of taxon-trees versus real cladograms in the fossil record and found (excitingly) that the use of outdated non-cladistic-based taxon-trees performed just as well in many ways as actual cladograms for a number of groups in the fossil record.

I don’t know if the PBDB’s taxonomy will ever be good enough that we could use a ‘taxon-tree’ for some group as the basis for comparative analyses, but for now we can use a taxon-tree approach to visualize what taxonomy is in the PBDB and visually search for weird errors.

The function I’ve written for this is makePBDBtaxontree which takes a taxonomic download for some group from the PBDB. This function is not 100% optimal, however, as the taxon-tree produced only captures the original Linnaen ranks. When version 1.2 of the API is released, I’ll be able to query the name of a taxon’s direct, most senior parent’s name and construct taxon-trees that way, which will likely add additional branching levels to the produced taxon-trees.

Let’s look at some example taxon-trees from various taxonomic groups:



#asaphid trilobites

#need to drop repeated taxon first: Hylaeosaurus

One thing you’ll notice in these trees is not all the edges seem to stretch to the same height from the root: that’s deliberate. If we are looking at genera in an order and there are genera on edges of length=1, those genera have been loosely assigned to that order with no class or family information. A longer edge would indicate the genus is simply part of a monotypic higher-taxon which has been lost as ape hates ‘singles’ (i.e. nodes with only a single descendant).

Now, those are pretty neat. However, we can go a step further and use our occurrence data to help us generate a time-scaled taxon-tree, which should be even more useful for seeing weird outliers in the taxonomy or occurrence data.

We can time-scale using function bin_timePaleoPhy and we’ll select arguments like nonstoch.bin=TRUE and type="mbl" to make a pretty time-scaled tree when we plot it.

#can use the time data from occurrences, generated above

#let's time-scale this tree with paleotree
  #in such a way to maximize prettiness
## Warning: Following taxa dropped from tree: Trichograptus, Anthograptus, Yutagraptus, Joamgsjamotes...
## Warning: Following taxa dropped from timeList: Nicholsonograptus

This drops alot of taxa; why? Presumably some are due to mispelled taxon names, but it must be more than that. There must be a large number of senior graptoloid genera which exist in the taxonomic database but have no corresponding occurrences.

Now let’s load Mark Bell and Graeme Lloyd’s library strap and take geoscalePhylo for a spin.

## Loading required package: geoscale
geoscalePhylo(timeTree, ages=timeTree$ranges.used)

Cool! For those who haven’t used geoscalePhylo before, the thicker black bars on the edges are each taxon’s stratigraphic range, in this case the maximal range of each taxon.

We can see some genera with suspiciously long ranges relative to closely related taxa with much shorter ranges, and some groups that seems to be out of place (e.g. many of these genera should be in the diplograptid group, there is no monograptids…).

A Worked Example: Rhynchonellida

Alright, now let’s take the above, where we mostly follow me playing around with a graptolite dataset, and let’s go back over these function with an entirely different group, the rynchonellid brachiopods. I have spent considerable time poking and proding the character data of this group as part of my current post-doctoral position with Sandy Carlson at UC Davis. So, what does the Rhynchonellida look like in the PBDB?

#need to drop repeated taxon first: Rhynchonelloidea


Well, we can see a number of families with genera nested within them, and a number of genera that appear to be in monotypic families. Let’s get the occurrence data in the right format and time-scale the taxon-tree:


## Warning: Following taxa dropped from tree: Nayunnella, Hispanirhynchia, Sphenarina, Xiaobangdaia, Colophragma...
## Warning: Following taxa dropped from timeList: Aethirhynchia, Agarhyncha, Akopovorhynchia, Allorhynchoides, Almerarhynchia...

Even more taxa dropped than with the graptoloids… again, this is probably the effect of having taxa listed in the taxonomic part of the PBDB and not in the occurrence database. I’d say ‘or vice versa’, but we should only be placing formal senior genera on the taxon-tree, so they can’t be in the occurrence data but not the taxonomy data.

Let’s plot the time-scaled tree…

geoscalePhylo(rynchTimeTree, ages=rynchTimeTree$ranges.used)

Well, the number of overlapping node labels make this a little messy, but hey, looks cool!

Anyway, you can find everything above (except the two functions for pulling PBDB data) in the version of paleotree now up on Github, soon coming to CRAN!

Well, until next time…