Monday, October 26, 2015

What is a College Degree? or Why an Engineering Student Takes French and Why a Paleontology Student Takes Calculus



Recently, I read a conversation online where an individual asked why a typical ‘paleontology’ degree (i.e. geology or, sometimes, biology) requires calculus. People gave the typical answers, but they rarely give the answer that I think makes sense. This answer does not even appear to readily exist in people’s minds. I see similar questions get asked all the time: forget calculus, why does a paleontologist have to take hard-rock petrology? Why does an engineer have to take French? Why does an English major have to take biology? These aren’t likely to be directly ‘useful’, at least not professionally, to these students. So why take them?

I’d like to talk about that today. I don’t usually talk about my philosophical view of teaching, even privately. One of my mentors in academia has actually complimented me on how practically my teaching statements are written, detailing precisely what activities and tasks would be involved in particular courses. But I don’t believe we can do anything well without a strong theoretical foundation for structuring how we should move forward, so here it is. Here’s my theory.

First, I’m going to preface all of this by saying this is entirely just my viewpoint on the subject. I’m an academic, but I am poorly read on philosophy of education, or the larger end-goals of education, or the historical purpose of higher education. That is probably pretty typical for most scientists. Also, my perception of college and its apparent goals is strongly colored by my observation of American universities. So, I might be a little off my rocker here, and I apologize if anyone with actual background in those areas reads this and shakes their head sadly. That’s fine, and if so, let me know what you know that shows maybe there is some other intention designed into our college system. I’ll also more or less use ‘university’ and ‘college’ interchangeably, which might annoy some who see a very clear distinction.

What I’ll say here largely comes from my own thinking, discussion with others and two sources, both of which I was first exposed to as a graduate student in an education seminar at the University of Chicago:

Booker, H. G. 1963. University Education and Applied Science. Science 141(3580):486-576.

Richter, F. 1991. Geology and the university. Geotimes 36(9):5.


They aren’t very long reads; if you can’t get these and want a copy, please contact me. Another thought-provoking read that I encountered later, which expresses somewhat opposing arguments, is:

Crow, M. M. 2012. Citizen science U: The best way to teach today's hyperconnected students is to get rid of the departments of geology and biology. Scientific American 307(4):48-49.

So, let’s cover the ‘typical’ answers to the question of why a paleontologist needs calculus. Well, the most immediately proximal cause, and least helpful answer, is that that’s how modern universities work: you have ‘common’ or ‘general’ education requirements you need to fulfill for any major or specialty. Some are for any student enrolled at all, while some are for certain degrees, like how geology bachelor’s degrees often require calculus in addition to ‘general’ requirements. Ultimately, students are effectively required to take classes from nearly every broad area of education in the college. You don’t have to be extremely shrewd or know much about how financial matters are solved ‘behind the scenes’ at a university to know that this is very financially convenient for certain departments, especially those with small numbers of students actually majoring in that field. It is also very convenient for graduate students in those departments, as it means there are teaching assistantships associated with those classes, and thus they can have financial support while they get their degree.

But there are better answers than this. If nothing else, we can reject this simple answer because if a student were only fulfilling general course requirements for the university’s financial gain, then (a) the courses required would be completely random, when generally a very similar set of course requirements exists across all universities, and (b) whether you actually took these courses, or your performance in them, wouldn’t be of concern to most graduate admissions committees, when generally, they are actually quite important (if sometimes of less consideration than other qualities as an indicator of future success).

The most common rationale I see given to this line of questioning is to attempt to find some specific reason why that specific course is somehow related to the actual field. And, sure, there are plenty of applications of calculus in paleontology, mainly related to biomechanics issues. There’s also plenty of engineering literature in French. But these are corner cases: the vast majority of paleontologists never do any calculus. That isn’t to say math is unimportant: there’s lots of math in paleontology today, and I would recommend anyone in the natural sciences (paleontology, geology, biology) to be familiar with univariate and multivariate statistics at a bare minimum. Now, I work with math and quantitative analysis much more often than most paleontologists, and even I handed off the one derivation I’ve encountered in my work to a statistician colleague, rather than do it myself. In general, you will be hard pressed to show how anyone found specific uses for every single general requirement class they took in college. And, to be honest, I doubt that many non-engineering students retain the knowledge and capability to differentiate and integrate for more than a year or two after passing Calculus, anyway. At this point, I only vaguely remember how to integrate. So, if the goal is that students should gain and retain specific skills for future use, the system doesn't seem optimal for that.

One answer I see given infrequently is that a student takes such classes to 'broaden' themselves; it may be infrequently given because it sounds like a line from a university propaganda pamphlet. But no one seems to know what that means. I’d argue that the best reason is precisely this ‘broadening’, but we need to be able to explain what that is. How does ‘broadening’ aid in our professional development?

The answer is that higher education has *nothing* to do with your professional development: getting a job and getting a college degree are at cross purposes. Booker (1963) and Richter (1991) argue, collectively, that higher education is ultimately about developing yourself as an individual. You take courses in a subject area to expose yourself to the particular mode of thinking applied by that subject area. Every subject area in a liberal arts college is ultimately a different approach to addressing some set of questions. Why does a student take a basic English class? To learn how comparative analysis of literature and writing can be used to address issues. Why does a student take French? To be exposed to how the vocabulary of another language works, and the basic concept that the language you use can make it easier (or harder) to communicate and express certain thoughts.

Why does a student take a basic geology class? To learn how geologists approach scientific questions about the earth’s history and environment, and that of other planets. Think about it: one of the most common topics in an introductory geology class is training students to understand and appreciate the magnitude of the geological timescale. Dealing with time and spatial scales that are much larger than those we interact with daily is part of the mental toolbox of a geologist. Taking an introductory geology course exposes you to this toolbox, and allows you to add those tools to how *you* go forward and approach problems. While I imagine that many of those who have taken an introductory geology class do not recall the names of the geologic periods or their exact order, hopefully what they do retain is a lasting impression of the immensity of time in earth history.

And that’s really what I think college is: building a mental toolbox that helps you see how to approach problems. They could be problems you encounter in your work, your personal life, your hobbies, whatever! And the key to getting that toolbox is being exposed to a diverse array of fields of study and gaining those important conceptual insights and perspectives unique to those disciplines. So, why does a paleontologist need calculus? So that they comprehend concepts like the relationship among successive derivatives of a given function, the relationship of derivatives to the concept of the area under a curve defined by a function, etc.

And I think that many students who get through a calculus sequence gain that conceptual understanding for the long-term. In my opinion, college classes generally succeed at the unstated goal of broadening the perspective of students. However, I don’t think this is entirely intentional on the part of the educators. First, I should state the caveat that many college educators give considerably more time and effort to the art of education than public perception gives them credit for, and I think many give a great deal of consideration to understanding what the ultimate goal of education is. I think the stereotype of the professor who treats teaching as a burden to be avoided at all costs mainly results from the fact that the majority of time spent ‘teaching’ isn’t contact hours (i.e. time spent lecturing or managing a lab session) but rather the many hours spent outside of the classroom preparing lectures, assignments, exams and grading. However, teaching is hard work, and I think it is not uncommon to lose sight of what the greater goal of course work is. Being actively cognizant of such a goal, and the need for the course to transmit a new way of thinking to the students, and actively working toward that is a difficult mental task to juggle, and so I think many do not actively think about this when making course materials, because often, just making course materials at all is a high enough bar. Thus, my perception (flavored mainly by my state school undergraduate education) is that the end result of this is that many college educators do not think of their role in terms of exposing students to different lines of investigation, or any similar goal. Thankfully, the nature of the college system seems to counteract the need for active recognition of the end result, because that end-result is hard-wired in the system, a result of how majors are designed and how a diversity of courses are required from outside of the discipline. Thus, I think the system generally achieves the goal of expanding student perspectives even without directed intent.

That said, even having a discussion about the ultimate objectives of higher education could have a great benefit. Regardless of whether the system works currently (and I think it does), it could work even better. We should design our courses to embrace the presentation of new perspectives and the process of investigation in that field. I think there are many ways of doing this, but I think there are also many ways that do not work toward this goal. In particular, courses that depend on rote memorization and a small number of examinations (particularly multiple choice) incentivize a superficial understanding of the material, focused on memorization of details rather than an understanding of the larger-scale patterns and processes. This results in students cramming before exams as their study strategy, followed by regurgitating that information at every written answer question in hopes that some phrase in that mess is close enough to score some more valuable grade points. Little, if anything, is retained long-term in such classes in my experience, which is ironic as many courses with this design are intended to play a critical role in various degree programs, providing background information for later classes. Furthermore, this course structure may also incentivize cheating, with grades often determined more by exams than any other form of assessment, and made easier when exams are based on repeating a litany of facts rather than critical assessment of ideas.

Now, you might say ‘but students have to know an exhaustive amount of knowledge from introductory courses that are prerequisites for later courses’, and on those grounds, you perhaps will argue that it is necessary to test all of those areas, to make sure they know all of it. Well, personally, I don’t see how testing for it in one class guarantees, or even necessarily correlates with, knowing it for a later class. My guess is that most students forget most of the factual details, but do retain the cognitive, investigative tools that I’m arguing are the actual purpose of college. For example, I’d wager that the majority of students who take introductory geology probably don’t retain information like what time interval the breakup of Rodinia occurred during, although they might recall that Rodinia is the name of some paleo-continent; rather, they’ll retain the ability to read a time-scale, the ability to use stratigraphic relationships to interpret the order of deposition of rock units, general knowledge of how continental drift works, the general position of the antecedents of modern continents over the Phanerozoic, etc. If, for example, a class two years later gives them a reading assignment that assumes they know the what, when and where of Rodinia, a student well-prepared by previous courses may not know off the top of their head, but can go look it up very easily and understand what they find. As college educators, perhaps we shouldn’t rely on students coming into a class knowing any particular background details, but rather that they have the thinking skills and conceptual basis to comprehend new material, and the ability to go back and fill in holes in their knowledge about old material.

If we accept this theory that university curriculum is intended to diversify a student's ability to handle novel problems, then that idea disagrees strongly with the belief that college should encompass any significant element of professional training. College isn’t about preparing you for job skills: in general, you will learn day-to-day job skills on the job. College is about teaching you how to think, and my sense is that this is also the opinion of many employers: they want employees who can think and thus learn new skills quickly, rather than students who have the wrong skillset and have to be retrained. Do not interpret this as my admonishing courses that teach any sort of practical ability. Learning how to do certain procedures, such as applying statistical techniques or writing computer code, brings a student face-to-face with the investigative approach used by that field. Courses with such practical material are important components of many programs and absolutely essential elements for graduate. What I am opposed to are courses that attempt to teach based on the demands of the mercurial job market, particularly if done at the expense of demonstrating to the students the broader aspects of the course's subject matter. Choosing to teach specific skills for the sole purpose of making a student into an ideal job candidate is making a risky gamble that the job market will remain the same for any length of time, or that the targeted industry will still exist in a decade (and has that ever been a good bet?). It seems like a better bet, and more efficient, to leave job training for when graduates become employed, and instead teach them how to think, which gives them the flexibility to deal with the unpredictable things that will happen over their lifetime.

If there is any particular demographic that is the source for the belief that college should be job training, it is current college students, many of whom appear to think that they have been conned into taking classes that are a waste of time, distracting them from obtaining that desired employment. A university degree provides no special ingredient essential to professional development, but rather makes the statement that the bearer has had a series of courses in which they encountered a diverse set of approaches to handling questions of knowledge, and thus might show more creativity and comprehension than a job candidate who does not hold such a degree. But you can certainly have many of the positive attributes of a college graduate without attending college, and there are many experiences that trump college in providing raw perspective. Of course, the flip-side of this is that college degrees shouldn’t be a prerequisite for getting a job. In some sense, our modern society expects applicants for many positions to hold a college degree, thus reflecting the cultural saturation of degrees, and further reinforcing the notion that a college education is job training, and thus committing a great disservice to our colleges and universities by perpetuating that erroneous belief.

In Pirsig’s Zen and the Art of Motorcycle Maintenance (a book with both considerable strengths and weaknesses), an extended discussion of the meaning of higher education ends with the suggestion that college shouldn’t be seen as a requirement thrust upon every citizen, but an experience selected as the result of a careful, thoughtful decision to devote additional years to learning and personal betterment. I think this describes an ideal world, one I would personally prefer to live in, rather than our current society which treats college as an inevitability, considers the choice to actively not go to college (if able) as a mild insanity, and deems the lack of a college degree a serious disadvantage to one’s opportunities. If you are honest about what a college education is, you should recognize many individuals already have the perspective and thinking skills that college would grant them without ever attending, and unless they want to go even further with their understanding of a subject matter, it isn’t actually useful for them to attend college. Furthermore, some individuals may never want what college offers, regardless of whether they already have it or not. I think in an ideal world, these sets of individuals would get to opt out of our societal expectations, because we shouldn’t be able to simply force higher education on them.


However… we don’t get to live in that ideal world. The reality is that many jobs arbitrarily require a college education, and increasingly require more advanced graduate degrees as the marketplace saturates with bachelor’s degrees. This puts an enormous amount of pressure on the core mission of colleges and universities to change, to fit the perceived notion of what ‘college is for’: mainly, to match the perception of students and their families who mistake college for glorified job training. Some have suggested that the very concept of college is on the verge of changing dramatically. It is hard in this modern digital age to argue that any traditional institution is invulnerable or unchangeable, as we have already seen many areas of society where technological ‘disruption’ has, well, disrupted an entire industry. However, universities and higher education are centuries-old institutions, and it is equally hubristic to point at any particular long-lived institution and forecast that it must change to survive, or to predict that the institution is ripe for extinction. Regardless, it is unclear how our college system will adapt to changing pressures. My hope is that we can communicate the value of the current system, where students are required to sample a diverse array of academic fields, and where courses emphasize theory and discussion, rather than moving toward conveying technical skills for professional development (after all, we already have tech schools).

So let’s go back to where I started this discussion.

We’ve now built this argument that the purpose of the structure of college education is to increase one’s exposure to the diversity of thought, not professional development. But let’s step back a bit. Does that actually argue for requiring that an undergraduate student interested in paleontology take calculus? No, not quite. The ultimate goal cannot support the claim that each and every typical course requirement is a necessary element of the degree. What I think is important is that every liberal arts degree reflect a diverse set of experiences, rather than a hyper-focused, specialized formula that provides no broader perspective. What needs to be attained is the diversity. Some programs of study recognize this, and are lenient and allow exceptions or course substitutions, although not for every student. Going beyond that, many students have ‘creative study’ majors, where students build their own ‘major’, their own program of study from available courses. Instead of being hyper-focused and escaping prerequisites they find aren’t useful, though, students in such programs generally tend to take diverse courses, as it was their desire for an interdisciplinary focus that drove them to creative studies to begin with. I think there’s a lot to be celebrated in that. Now, that said, many undergraduates interested in paleontology do end up taking calculus, and I think in the long-term, that course has benefits that are difficult to measure or enumerate, even if they later lose the ability to differentiate and integrate.

So what do you think?

Friday, April 3, 2015

What I Accomplished at the Paleobiology Database Hackathon, March 2015

Hello all! Recently I was able to attend the Paleobiology Database (PBDB) Hackathon that was held at UC Santa Cruz, and I wanted to talk to you about the experience and present some interesting products from my work at the hackathon.

Thoughts on the Hackathon Structure

Now, I’ve never attended another hackathon, but my preconceived notion is they tend to be entirely devoted to developing new software in a working-group setting. I think this particular meeting was ultimately more of a mixture between a hackathon and a PBDB API workshop. Much of the focus in shared discussions was on becoming familiar with the API, particularly the in-development version 1.2 of the API, and reporting issues as we encountered them while programming. The last was critical: if we hadn’t all been trying to program various items based on the API, we would not have run into many of the issues we did; we were basically acting as Quality Assurance for the API software and the database, which will be invaluable for future users trying to use the API effectively.

Now, I think that format worked quite well, but I think future PBDB hackathons (if there are plans for such) will probably hew closer to the typical hackathon model, as hopefully participants of future PBDB hacking events will encounter fewer issues and better documentation. There was also a learning curve issue: we probably needed more time to first become proficient at the API as a group and then be able to work together on focused group projects. Overall, though, I think the main goal was to promote community excitement about the PBDB API, and I think that was accomplished in spades.

For me, the ‘workshop on the API’ aspect of the get-together was invaluable. I’ve been trying pretty hard for the last two months to understand the API’s output (as my previous blog posts will attest to) but one can only get so far by bothering Matt Clapham and others with emails.

So What Did I Do?

Well, readers, I wrote a bunch of functions in R! I have even added the majority of them to paleotree on github, along with documentation, which means you can go play with them now! Just go install the in-development version of paleotree directly from github with the package devtools:

#get in-development paleotree version 2.4
library(devtools)
install_github("dwbapst/paleotree")

Which we can check and see is named version 2.4:

packageVersion("paleotree")
## [1] '2.4'

So, now let’s load paleotree!

#load paleotree
library(paleotree)
## Loading required package: ape

You can use ?paleotree to go peruse the help files for the more than seventy functions available in paleotree.

Functions to download PBDB data

Now, what you won’t find in paleotree are the functions I wrote at the hackathon for downloading PBDB data. Why? Well: first, there’s already an R package for that, paleobioDB (Varela et al., 2015). I have little interest in doing what others have already made their focus. Second, maintaining such functions to ensure functionality forever, as I essentially would like to ensure for all paleotree functions, is difficult, as issues and corrections will be needed every time a new PBDB API version appears. Note how the functions below call version 1.1 of the API.

The functions below were written to automate some aspects of the API for the ‘occurrences’ and ‘taxa’ download functionalities, respectively. In particular, the longer of the two, easyGetPBDBocc, was written to strip out warning messages returned by the API, which can cause problems with simple uses of read.csv to read in PBDB data downloads using the API. This could be particularly useful if a user wants to repeatedly query the PBDB, say for a series of taxon names from a list, particularly if it is unknown whether all the taxa on that list have been formally entered into the PBDB and have occurrence data listed for them.

easyGetPBDBocc<-function(taxa,show=c("ident","phylo")){
    #downloads PBDB occurrence data and cleans it of API warnings
    #collapse multiple taxa into a single comma-separated query string
    taxa<-paste(taxa,collapse=",")
    #replace underscores with URL-encoded spaces
    taxa<-paste(unlist(strsplit(taxa,"_")),collapse="%20")
    show<-paste(show,collapse=",")
    command<-paste0("http://paleobiodb.org/data1.1/occs/list.txt?base_name=",
        taxa,"&show=",show,"&limit=all")
    #URL-encode any remaining spaces
    command<-paste(unlist(strsplit(command,split=" ")),collapse="%20")
    downData<-readLines(command)
    if(length(grep("Warning",downData))!=0){
        #warnings come before the first line containing 'Records'
        start<-grep("Records",downData)[1]
        warn<-downData[1:(start-1)]
        #strip quotes and commas from the warning lines
        warn<-sapply(warn, function(x)
            paste0(unlist(strsplit(unlist(strsplit(x,'"')),",")),collapse=""))
        warn<-paste0(warn,collapse="\n")
        #everything after the 'Records' line is the data table
        mat<-read.csv(text=downData[-(1:start)])
        #report the warnings as a message
        message(warn)
    }else{
        mat<-read.csv(text=downData)
        }
    return(mat)
    }

easyGetPBDBtaxa<-function(taxon){
    #downloads taxonomic data for a taxon and all of its children
    taxaData<-read.csv(paste0("http://paleobiodb.org/",
        "data1.1/taxa/list.txt?base_name=",taxon,
        "&rel=all_children&show=phylo,img&status=senior"))
    return(taxaData)
    }

Note that easyGetPBDBtaxa will only return the senior names of taxa, so that we don’t have to remove junior synonyms from the resulting dataset.
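To see the warning-stripping step on its own, without touching the network, here is a minimal sketch in base R. The mock response lines are an assumption about how warnings might be laid out (warning lines first, then a line containing 'Records', then the table), and splitWarnings is a hypothetical helper written just for this illustration, not a paleotree function.

```r
# Mock API response; the exact layout of real PBDB warning lines is assumed
mockLines <- c(
  '"Warning:","Unknown taxon: Fakeograptus"',
  '"Records:"',
  '"occurrence_no","taxon_name","taxon_rank"',
  '"2319","Hallograptus sp.","genus"',
  '"2432","Schizograptus sp.","genus"'
)

# Split a response into its warning lines and its data table
splitWarnings <- function(lines) {
  if (!any(grepl("Warning", lines))) {
    # no warnings: the whole response is the table
    return(list(warnings = character(0), data = read.csv(text = lines)))
  }
  # the first line containing 'Records' separates warnings from data
  start <- grep("Records", lines)[1]
  list(
    warnings = lines[seq_len(start - 1)],
    data = read.csv(text = lines[-seq_len(start)])
  )
}

res <- splitWarnings(mockLines)
print(res$warnings)
print(res$data)
```

Passing the character vector through the text argument of read.csv avoids opening a text connection by hand.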

Now, we can use these functions to download some example data for graptoloids from the Paleobiology Database API, version 1.1:

graptOccPBDB<-easyGetPBDBocc("Graptoloidea")

graptTaxaPBDB<-easyGetPBDBtaxa("Graptoloidea")

And let’s look at these datasets very briefly…

head(graptOccPBDB)[,1:10]
##   occurrence_no record_type reid_no superceded collection_no
## 1          2319  occurrence      NA         NA           270
## 2          2432  occurrence      NA         NA           279
## 3          2461  occurrence      NA         NA           281
## 4          2604  occurrence      NA         NA           288
## 5          2761  occurrence      NA         NA           297
## 6          2762  occurrence      NA         NA           297
##            taxon_name taxon_rank taxon_no  matched_name matched_rank
## 1    Hallograptus sp.      genus    33673  Hallograptus        genus
## 2   Schizograptus sp.      genus    33761 Schizograptus        genus
## 3 Didymograptus ? sp.      genus    33655 Didymograptus        genus
## 4   Didymograptus sp.      genus    33655 Didymograptus        genus
## 5    Diplograptus sp.      genus    33660  Diplograptus        genus
## 6   Didymograptus sp.      genus    33655 Didymograptus        genus
head(graptTaxaPBDB)[,1:10]
##   taxon_no orig_no record_type associated_records      rank
## 1    33606   33606       taxon                 NA     order
## 2   166989  166989       taxon                 NA    family
## 3   166991  166991       taxon                 NA subfamily
## 4   150197  150197       taxon                 NA    family
## 5   166988  166988       taxon                 NA subfamily
## 6    33650   33650       taxon                 NA     genus
##        taxon_name common_name     status parent_no senior_no
## 1    Graptoloidea          NA belongs to     33534     33606
## 2    Retiolitidae          NA belongs to     33606    166989
## 3 Plectograptinae          NA belongs to    166989    166991
## 4  Diplograptidae          NA belongs to     33606    150197
## 5    Retiolitinae          NA belongs to    166989    166988
## 6  Dicellograptus          NA belongs to     33606     33650
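For the case mentioned earlier, querying a whole list of taxon names when some may not be in the database, a wrapper along the following lines might help. This is only a sketch: queryTaxonList and mockFetch are hypothetical names, and it assumes a failed query surfaces as an R error, which I have not checked against the API. The mock download function stands in for easyGetPBDBocc so the example runs offline.

```r
# Apply a download function to each taxon name, skipping names that fail
# instead of stopping at the first error. 'fetchFun' is any function that
# takes a single taxon name (for example, easyGetPBDBocc).
queryTaxonList <- function(taxa, fetchFun) {
  results <- lapply(taxa, function(taxon) {
    tryCatch(fetchFun(taxon), error = function(e) {
      message("No data retrieved for ", taxon, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- taxa
  # keep only the taxa for which something came back
  results[!vapply(results, is.null, logical(1))]
}

# Offline stand-in for a real download function (hypothetical)
mockFetch <- function(taxon) {
  if (taxon == "Fakeograptus") stop("taxon not found")
  data.frame(taxon_name = taxon, occurrence_no = 1L)
}

found <- queryTaxonList(
  c("Didymograptus", "Fakeograptus", "Diplograptus"), mockFetch)
print(names(found))
```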

Now what do we do with these big tables of data? Let’s look at what we can do with the occurrence data first.

Sorting Unique Taxa From Occurrence Datasets with taxonSortPBDBocc

Having the occurrence data in a big table isn’t going to do us much good without some sorting of these occurrences into those assigned to separate, unique taxa. We can break these tables down into lists, where each element is a table of occurrences assigned to taxa at various taxonomic levels using the new paleotree function taxonSortPBDBocc, which debuted in the last blog post. This function received a lot of attention from me while at the hackathon and can now handle data from almost any vocabulary or API version.

As discussed in that previous post, there are several ways to pull taxa: from different taxonomic levels, but also deciding whether to pull the ‘informal’ taxa that have never been officially entered and yet are listed in the original information from the identification of the occurrence. We can also decide whether we want to keep occurrences that had some sort of indicator of taxonomic uncertainty in their identified taxon name.
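The basic idea of sorting occurrences by taxon can be sketched in a few lines of base R. This is a deliberately simplified stand-in, not how taxonSortPBDBocc works internally: sortByTaxon is a hypothetical helper, it only handles formal taxa through the matched_name and matched_rank columns seen in the download above, and it treats a '?' in the identified name as the only marker of uncertainty.

```r
# Mock occurrence table, using column names from the download shown above
mockOcc <- data.frame(
  occurrence_no = c(2319, 2432, 2461, 2604, 2761),
  taxon_name = c("Hallograptus sp.", "Schizograptus sp.",
    "Didymograptus ? sp.", "Didymograptus sp.", "Diplograptus sp."),
  matched_name = c("Hallograptus", "Schizograptus",
    "Didymograptus", "Didymograptus", "Diplograptus"),
  matched_rank = "genus",
  stringsAsFactors = FALSE
)

# Split occurrences into a list of tables, one per taxon of the given rank
sortByTaxon <- function(occ, rank = "genus", cleanUncertain = TRUE) {
  occ <- occ[occ$matched_rank == rank, , drop = FALSE]
  if (cleanUncertain) {
    # drop identifications flagged with a question mark (simplification)
    uncertain <- grepl("?", occ$taxon_name, fixed = TRUE)
    occ <- occ[!uncertain, , drop = FALSE]
  }
  split(occ, occ$matched_name)
}

cleaned <- sortByTaxon(mockOcc)
everything <- sortByTaxon(mockOcc, cleanUncertain = FALSE)
print(length(cleaned))
print(nrow(everything$Didymograptus))
```

Counting taxa is then just a matter of taking the length of the resulting list, which is the same trick used with the real function below.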

One neat thing we can do is use taxonSortPBDBocc to count the number of taxa available for different taxonomic levels and levels of data ‘cleanliness’.

First, we can count just the formal genera:

occGenus<-taxonSortPBDBocc(graptOccPBDB, rank="genus")
length(occGenus)
## [1] 133

And then just formal species:

occSpeciesFormal<-taxonSortPBDBocc(graptOccPBDB, rank="species")
length(occSpeciesFormal)
## [1] 20

And, yep, there are fewer ‘formal’ graptoloid species in the PBDB than there are ‘formal’ genera. This must mean a majority of genera have no species formally assigned to them.

Now let’s also count the informal species, along with the formal species:

occSpeciesInformal<-taxonSortPBDBocc(graptOccPBDB, rank="species",
   onlyFormal=FALSE)
length(occSpeciesInformal)
## [1] 642

And our numbers increase to something that might be realistic (to my eye), now that we have those ‘informal’ species.

Now let’s have the informal and formal species all together, but let’s not throw out any occurrences with suspicious/uncertain taxon identifiers. This is really everything and the kitchen sink, as they say.

occSpeciesEverything<-taxonSortPBDBocc(graptOccPBDB, rank="species",
        onlyFormal=FALSE, cleanUncertain=FALSE)
length(occSpeciesEverything)
## [1] 734

And we get even more species recovered.

Now, we can visualize the age uncertainty of the occurrences assigned to our species using a plotting function I wrote a few weeks ago and posted to this blog. This function is now in paleotree as plotOccData, and it takes taxon-sorted lists of occurrence data as its input, just as given by taxonSortPBDBocc. Let’s plot the formal species data for now:

plotOccData(occSpeciesFormal)

Each of the horizontal lines is the age uncertainty of a single occurrence, and occurrences are visually sorted and color-coded by the taxa (in this case, species) that they belong to. We can get something a little more complex if we try genera:

plotOccData(occGenus)

But as there are many more taxa, there is a lot more going on in this figure.

Get Taxon Occurrence Data into a ‘timeList’ object with occData2timeList

Once we have a taxon-sorted list of occurrence tables, we can obtain a timeList object usable by many other paleotree functions via the function occData2timeList. This function also debuted in the last blog post. It is much improved now; in particular, it returns (by default) the smallest bounds possible for the first and last appearance of each taxon, which are the values that make the most use of the information in all the occurrence data.
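To make ‘smallest bounds possible’ a bit more concrete, here is a rough base-R sketch of the reasoning for a single taxon. This is only an illustration of the logic, not the code inside occData2timeList, and the toy table and its column names (oldBound, youngBound) are invented for the example.

```r
# Toy occurrence table for one taxon; ages are in Ma, so larger is older.
# The column names here are hypothetical, not PBDB field names.
occ <- data.frame(
    oldBound   = c(470.0, 468.1, 460.9),
    youngBound = c(466.0, 460.9, 455.8)
)

# The first appearance is the oldest occurrence. It cannot be older than
# the oldest 'old' bound, and it cannot be younger than the oldest 'young'
# bound, because some occurrence is at least that old.
firstApp <- c(oldest = max(occ$oldBound), youngest = max(occ$youngBound))

# The last appearance is the youngest occurrence, so the same argument
# applies with the minimum of each column.
lastApp <- c(oldest = min(occ$oldBound), youngest = min(occ$youngBound))

rbind(firstApp, lastApp)
```

Every occurrence contributes to these bounds, which is the sense in which they use all of the available information rather than just the oldest and youngest records.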

Let’s apply it to the dataset that contains cleaned occurrences, for both formal and informal species.

# use occData2timeList
graptTimeSpecies<-occData2timeList(occList=occSpeciesInformal)

Let’s look at what we have. Every timeList object is composed of two matrices, each with two columns: (1) the age bounds on the intervals, and (2) the respective first and last intervals of each taxon, given as the interval’s row number in the first matrix.

head(graptTimeSpecies[[1]])
##      startTime endTime
## [1,]     488.3   471.8
## [2,]     485.4   477.7
## [3,]     485.4   473.9
## [4,]     478.6   470.0
## [5,]     478.6   468.9
## [6,]     478.6   468.1
head(graptTimeSpecies[[2]])
##                           firstInt lastInt
## Abiesgraptus tenuiramosus       90      90
## Akidograptus acuminatus         59      62
## Akidograptus ascensus           59      59
## Amplexograptus arctus           19      19
## Amplexograptus bohemicus        58      58
## Amplexograptus confertus        18      34
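If the cross-referencing between the two matrices is not obvious, it may help to build a tiny timeList by hand. The sketch below uses invented intervals and placeholder taxon names; it only mimics the structure shown in the output above.

```r
# First matrix: age bounds of each interval (made-up values, in Ma)
intTimes <- cbind(startTime = c(488.3, 478.6, 471.8),
                  endTime   = c(471.8, 468.1, 460.9))

# Second matrix: first and last interval of each taxon, given as
# row numbers of intTimes (taxon names are placeholders)
taxonTimes <- cbind(firstInt = c(1, 2, 2), lastInt = c(1, 3, 2))
rownames(taxonTimes) <- c("taxonA", "taxonB", "taxonC")

toyTimeList <- list(intTimes, taxonTimes)

# look up the age bounds on taxonB's first and last intervals
firstBounds <- toyTimeList[[1]][toyTimeList[[2]]["taxonB", "firstInt"], ]
lastBounds  <- toyTimeList[[1]][toyTimeList[[2]]["taxonB", "lastInt"], ]
firstBounds
lastBounds
```

So a taxon with firstInt and lastInt equal to the same row number, like taxonA here, is known from a single interval.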

Our main purpose for getting a timeList object in paleotree is probably to time-scale a tree, but with one not being handy at the moment, let’s just do something a little more boring and compare the diversity curves for (formal and informal) species and for genera:

graptTimeGenus<-occData2timeList(occList=occGenus)

taxicDivDisc(graptTimeSpecies)

taxicDivDisc(graptTimeGenus)

We can see that the two are very different, as I noted last time. The early Silurian spike in species diversity looks artificial to my eye, but there may also be something going on where too many taxa have been lumped into the wastebasket taxon Monograptus.

If only there were a way of visualizing the taxonomic structures in the PBDB…

Creating a ‘Taxon-Tree’ from the Paleobiology Database’s Taxonomic Data

Most of us are probably familiar with the superficial similarities between taxonomy, as a nested hierarchy, and phylogenies, which also are hierarchies with a nesting structure (except not everything needs to be equally nested from the topology’s point of view…). A commonly invoked metaphor is to imagine that nested taxonomic groups are like nested monophyletic clades, with no additional resolution. In other words, we could describe the taxonomy of a group as a phylogeny-like arrangement of mostly-unresolved clusters nested within each other. It’ll look like a phylogeny, but it is really just the taxonomic information portrayed in a new way.
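As a toy illustration of that idea, the base-R sketch below turns a two-level taxonomy table into a Newick string in which each family is an unresolved cluster of its genera. The table, the taxon names and the name OrderX are all invented, and this is not how makePBDBtaxontree works internally; it just shows the ‘taxonomy as nested polytomies’ picture.

```r
# A made-up taxonomy: five genera in two families of one order
taxonomy <- data.frame(
    family = c("FamilyA", "FamilyA", "FamilyB", "FamilyB", "FamilyB"),
    genus  = c("GenusA1", "GenusA2", "GenusB1", "GenusB2", "GenusB3"),
    stringsAsFactors = FALSE
)

# group the genera by family
generaByFamily <- split(taxonomy$genus, taxonomy$family)

# each family becomes an unresolved clade labeled with the family name
familyClades <- mapply(
    function(genera, family) {
        paste0("(", paste(genera, collapse = ","), ")", family)
    },
    generaByFamily, names(generaByFamily)
)

# the families are then nested, again unresolved, within the order
taxonTreeNewick <- paste0("(", paste(familyClades, collapse = ","), ")OrderX;")
taxonTreeNewick
```

If you have ape installed, a string like this can be read with read.tree(text=taxonTreeNewick) and plotted like any other tree.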

However, in some ways, such ‘taxon-trees’ are already widely used in some fields as the analytical basis for various phylogenetic comparative methods. For example, much of Phylomatic’s lower-taxonomic structure is derived from a taxon-tree like approach. Recently, Soul and Friedman examined the use of taxon-trees versus real cladograms in the fossil record and found (excitingly) that the use of outdated non-cladistic-based taxon-trees performed just as well in many ways as actual cladograms for a number of groups in the fossil record.

I don’t know if the PBDB’s taxonomy will ever be good enough that we could use a ‘taxon-tree’ for some group as the basis for comparative analyses, but for now we can use a taxon-tree approach to visualize what taxonomy is in the PBDB and visually search for weird errors.

The function I’ve written for this is makePBDBtaxontree, which takes a taxonomic download for some group from the PBDB. This function is not 100% optimal, however, as the taxon-tree produced only captures the original Linnaean ranks. When version 1.2 of the API is released, I’ll be able to query the name of a taxon’s direct, most senior parent and construct taxon-trees that way, which will likely add additional branching levels to the produced taxon-trees.

Let’s look at some example taxon-trees from various taxonomic groups:

#graptoloids
graptTree<-makePBDBtaxontree(graptTaxaPBDB,"genus")
plot(graptTree,show.tip.label=FALSE,no.margin=TRUE,edge.width=0.35)
nodelabels(graptTree$node.label,adj=c(0,1/2))

#conodonts
conoData<-easyGetPBDBtaxa("Conodonta")
conoTree<-makePBDBtaxontree(conoData,"genus")
plot(conoTree,show.tip.label=FALSE,no.margin=TRUE,edge.width=0.35)
nodelabels(conoTree$node.label,adj=c(0,1/2))

#asaphid trilobites
asaData<-easyGetPBDBtaxa("Asaphida")
asaTree<-makePBDBtaxontree(asaData,"genus")
plot(asaTree,show.tip.label=FALSE,no.margin=TRUE,edge.width=0.35)
nodelabels(asaTree$node.label,adj=c(0,1/2))

#Ornithischia
ornithData<-easyGetPBDBtaxa("Ornithischia")
#need to drop repeated taxon first: Hylaeosaurus
ornithData<-ornithData[-(which(ornithData[,"taxon_name"]=="Hylaeosaurus")[1]),]
ornithTree<-makePBDBtaxontree(ornithData,"genus")
plot(ornithTree,show.tip.label=FALSE,no.margin=TRUE,edge.width=0.35)
nodelabels(ornithTree$node.label,adj=c(0,1/2))

One thing you’ll notice in these trees is not all the edges seem to stretch to the same height from the root: that’s deliberate. If we are looking at genera in an order and there are genera on edges of length=1, those genera have been loosely assigned to that order with no class or family information. A longer edge would indicate the genus is simply part of a monotypic higher-taxon which has been lost as ape hates ‘singles’ (i.e. nodes with only a single descendant).

Now, those are pretty neat. However, we can go a step further and use our occurrence data to help us generate a time-scaled taxon-tree, which should be even more useful for seeing weird outliers in the taxonomy or occurrence data.

We can time-scale using function bin_timePaleoPhy and we’ll select arguments like nonstoch.bin=TRUE and type="mbl" to make a pretty time-scaled tree when we plot it.

#can use the time data from occurrences, generated above

#let's time-scale this tree with paleotree
  #in such a way to maximize prettiness
timeTree<-bin_timePaleoPhy(graptTree,timeList=graptTimeGenus,
  nonstoch.bin=TRUE,type="mbl",vartime=3)
## Warning: Following taxa dropped from tree: Trichograptus, Anthograptus, Yutagraptus, Joamgsjamotes...
## Warning: Following taxa dropped from timeList: Nicholsonograptus

This drops a lot of taxa; why? Presumably some are due to misspelled taxon names, but it must be more than that. There must be a large number of senior graptoloid genera which exist in the taxonomic database but have no corresponding occurrences.
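The warnings truncate the lists of dropped names, so if you want to see all of them, a simple set comparison should do it. The sketch below runs on made-up name vectors; with the real objects, I would expect graptTree$tip.label and rownames(graptTimeGenus[[2]]) to play these roles, but treat that as an assumption.

```r
# Stand-ins for the tip labels of the taxon-tree and for the taxon names
# in the second matrix of the timeList (both vectors are made up)
treeTaxa <- c("Trichograptus", "Monograptus", "Akidograptus")
timeTaxa <- c("Monograptus", "Akidograptus", "Nicholsonograptus")

# on the tree, but with no occurrence-based ranges
onlyInTree <- setdiff(treeTaxa, timeTaxa)

# with occurrence-based ranges, but missing from the tree
onlyInTimeList <- setdiff(timeTaxa, treeTaxa)

onlyInTree
onlyInTimeList
```

Scanning the two vectors side by side is also a quick way to spot near-identical spellings of the same name.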

Now let’s load Mark Bell and Graeme Lloyd’s library strap and take geoscalePhylo for a spin.

library(strap)
## Loading required package: geoscale
geoscalePhylo(timeTree, ages=timeTree$ranges.used)
nodelabels(timeTree$node.label,cex=0.7,adj=c(0.3,0))

Cool! For those who haven’t used geoscalePhylo before, the thicker black bars on the edges are each taxon’s stratigraphic range, in this case the maximal range of each taxon.

We can see some genera with suspiciously long ranges relative to closely related taxa with much shorter ranges, and some groups that seem to be out of place (e.g. many of these genera should be in the diplograptid group, there are no monograptids…).

A Worked Example: Rhynchonellida

Alright, now let’s take the above, where you mostly followed me playing around with a graptolite dataset, and go back over these functions with an entirely different group, the rhynchonellid brachiopods. I have spent considerable time poking and prodding the character data of this group as part of my current post-doctoral position with Sandy Carlson at UC Davis. So, what does the Rhynchonellida look like in the PBDB?

rynchData<-easyGetPBDBtaxa("Rhynchonellida")
#need to drop repeated taxon first: Rhynchonelloidea
rynchData<-rynchData[-(which(rynchData[,"taxon_name"]=="Rhynchonelloidea")[1]),]
rynchTree<-makePBDBtaxontree(rynchData,"genus")

plot(rynchTree,show.tip.label=FALSE,no.margin=TRUE,edge.width=0.35)
nodelabels(rynchTree$node.label,adj=c(0,1/2))

Well, we can see a number of families with genera nested within them, and a number of genera that appear to be in monotypic families. Let’s get the occurrence data in the right format and time-scale the taxon-tree:

rynchOcc<-easyGetPBDBocc("Rhynchonellida")
rynchSortOcc<-taxonSortPBDBocc(rynchOcc,"genus")
rynchTimeList<-occData2timeList(rynchSortOcc)

rynchTimeTree<-bin_timePaleoPhy(rynchTree,timeList=rynchTimeList,
    nonstoch.bin=TRUE,type="mbl",vartime=3)
## Warning: Following taxa dropped from tree: Nayunnella, Hispanirhynchia, Sphenarina, Xiaobangdaia, Colophragma...
## Warning: Following taxa dropped from timeList: Aethirhynchia, Agarhyncha, Akopovorhynchia, Allorhynchoides, Almerarhynchia...

Even more taxa dropped than with the graptoloids… again, this is probably the effect of having taxa listed in the taxonomic part of the PBDB and not in the occurrence database. I’d say ‘or vice versa’, but we should only be placing formal senior genera on the taxon-tree, so they can’t be in the occurrence data but not the taxonomy data.

Let’s plot the time-scaled tree…

geoscalePhylo(rynchTimeTree, ages=rynchTimeTree$ranges.used)
nodelabels(rynchTimeTree$node.label,cex=0.5,adj=c(0.3,0))

Well, the number of overlapping node labels makes this a little messy, but hey, looks cool!

Anyway, you can find everything above (except the two functions for pulling PBDB data) in the version of paleotree now up on Github, soon coming to CRAN!

Well, until next time…