Friday, December 13, 2013

Stratigraphic Uncertainty, timePaleoPhy versus bin_timePaleoPhy, and My Own Paleontological Myopia

Hi all!

So, its been a rough week! I've gotten more emails about issues or questions about paleotree this week than I have in the past four months! Which is awesome, cause I love talking to people about using paleotree, but there's something that's been worrying me in some of these emails. These emails have given me a grave concern that a number of users don't understand the distinction between two key time-scaling functions in paleotree: timePaleoPhy and bin_timePaleoPhy. The problem is, not understanding this difference is extremely dangerous, as they could generate time-scaled trees that don't make sense without realizing it.

But I can't really blame them, because I think its my fault as the package author to explain this distinction clearly enough... the problem is I didn't understand something critical and fundamental three years ago: Our Fossil Records are Different.

See, the fossil record for your group is probably really different from the fossil record I'm working on, and that means not only how we do things is different, but the sort of assumptions we come to the table with and what we think X means is different. And that sounds kind of obvious, until you meet someone whose fossil record is radically different from yours and they don't even have a concept for this one thing you thought everybody knew about.

The problem is that people are using timePaleoPhy who probably shouldn't be using timePaleoPhy, when they should be using bin_timePaleoPhy instead. It isn't helped by the argument 'rand.obs' in timePaleoPhy, which some interpreted as representing how to bring stratigraphic uncertainty into an analysis. After explaining to several users over email why I was concerned about their use of paleotree, I felt like a specific blog post on the topic was needed.

But to understand where this misconception of some users comes from, and why I think it stems from my own misconceptions in the past that maybe fossil records were more similar than I know now, I've also got to explain my reasons for writing both functions the way I did.

So, I wrote timePaleoPhy first, when I had some stratigraphic data for graptolite ranges, with stratigraphic meters scaled to an absolute time-scale. Since graptolite species often persist over time, I had a matrix of first and last appearance dates (FADs and LADs, respectively), like so...

Normalograptus_normalis 457.9 438.3
Climacograptus_typicalis  463.2 442.8

(This is totally fake data; I don't even know if typicalis is still a Climacograptus, but maybe its the type?...I can't remember at this very moment.)

The dates here represent actual ranges; i.e. N. normalis is first found in rocks dated 457.9 Ma and is found fairly continuously until you get to rocks dated 438.3 Ma. Of course, the rocks aren't actually 'dated' so precisely, more like we're assuming continuous sedimentation rates and so we're inferring those dates using the absolute dates we do have.

This means when we time-scale the tree, unless we're allowing for ancestors (which we don't normally do unless we know something about apoomorphic and plesiomorphic taxa) the clade containing N. normalis has to be at least as old as 457.9 Ma.

And that's what timePaleoPhy does, if you hand it data like above, you get back a tree where the nodes are as old as the dates in the first column (the taxon names should be row labels), because it thinks those are first appearance dates, known very precisely.

A different question is where the tips go (or 'when', really). As I've talked about before here on this blog and in published papers, this is a difficult question. For most things, such as looking at a tree or estimating lineage diversity through time, you'd want the whole taxon range 'added' to the tree, so you should use the LAD as the tip-date. However, comparative methods for studying trait evolution assume that the date of tips represents the date of the population from which trait values were taken. Of course, maybe your trait doesn't vary much (or at all) over the duration of the taxon, but that's a story for another time. In the papers I've written, I often call these tip-dates 'times of observation'... cause I think some reviewer didn't like 'tip-date', or maybe it was a committee member. Well, reviewers seem to dislike 'times of observation' too.

(I can't freakin' win when it comes to vocabulary.)

At the time, I was most interested in comparative methods, so I assumed this would be true of everyone else. In general, it seemed like people were using the FAD primarily in the literature (or at least, I thought so at the time) as the tip-date for comparative methods, so I made the FAD the default tip-dates that timePaleoPhy spits out. To get timePaleoPhy to use the LAD as the tip-date, you have to set add.term=TRUE, which adds the range to the terminal branch the taxon sits on.

But using the FAD or the LAD wasn't really a satisfying solution: the tip-date in a comparative analysis supposed to be the time of the sampled population (hence why I call them 'times of observation'). For my long-ago purpose, I had I had size measurements from a bunch of figured specimens for graptolites but with very little data on 'when' those specimens came from. They were from 'the Late Ordovician' or 'Early Silurian' and I didn't have enough information to figure out a more specific tip-date. So that's pretty much useless information for figuring out the 'when' for these specimens, and if I want to do some PCM, well the tips are supposed to represent the date of the population where I measured the trait. I just know the Normalograptus normalis specimen I measured has a 4.2 mm long doodad. Using the FAD or the LAD wasn't really an honest approach given the uncertainty involved.

So... rand.obs is how I dealt with this uncertainty, still using the FAD for N. normalis to adjust the clade age, but allowing the tips to fall at some date randomly selected between the FAD and the LAD. Maybe I had measured doodad length from a normalis specimen from 441.3 Ma,  so, the tip would be at that date, but the clade constraining that taxon would still be constrained by the 457.9 Ma first appearance date. Note that rand.obs only works if we also set add.term=TRUE, cause the way I decided to do it (and no, I'm not changing it) was that these random observation dates were a temporary replacement of the LADs. Thankfully, timePaleoPhy returns an error if you ever try using rand.obs=TRUE while add.term=FALSE, or else people might get very confused.

Later, I needed to make use of a much less precise dataset, where taxon first and last occurrences were graptolite zones. So like, the different zones had minimum and maximum dates associated with their starting and ending, like this...

zone_A 466.2 452.4
zone_B 452.4 448.5
zone_C 448.5 441.1
zone_D 441.1 436.2
zone_E 436.2 431.0

(This is totally fake cause graptolite zones are usually shorter than that.)

To keep track of everything, I needed some way of recording the dates for the intervals and which intervals taxa first and last showed up in. Normally you see this thing displaying in papers as a range chart, where species A will have Xs in intervals 1, 2, 3, and then not 4, meaning it last appeared in interval 4. This seemed like a fairly non-straightforward way of handling the information, particularly as I was only really interested in the first and last interval each taxon appeared in.

So, what I did was I listed the first and last interval that each taxon was known from as the row of the interval matrix. Like so:

N._normalis 1 4
C._typicalis 1 3

So, if we look at this in reference to the prior matrix, this tells us that N. normalis first appears in zone A and last appears in zone D, which means it might have first shown up anywhere between 466.2 and 452.4 Ma and last appeared somewhere between 441.1 and 436.2 Ma. And so now, if we have both this 'taxon times' matrix and the 'interval times' matrix above, we have all the information on the dates associated with when these taxa were first and last occurring.

In paleotree, I decided to call a list that contained both matrices a 'timeList' object because as you have no doubt deduced, I have terrible taste when it comes to coming up with new vocabulary. A lot of functions in paleotree use these timeList objects. (It's not a real object class, cause I didn't understand S3 classes yet when I wrote paleotree.) Anyway, I went with a list of two matrices rather than two seperate matrices because the taxonTimes matrix doesn't make sense without reference to the intervalTimes matrix and so I felt I needed to keep them sorted together. To get a timeList, you can just make the two matrices by hand in your favorite spreadsheet program, read them into R and then combine as a list. The last step is as simple as typing time<-list(intervalTimes,taxonTimes) into the terminal. Note that the interval names and taxon names are rownames, so the matrices should both have only two columns each.

Now, to time-scale a tree with this sort of data, we need to somehow get precise first and last appearance dates and then make a tree from there. So I made bin_timePaleoPhy to draw dates from these intervals, under a uniform probability distribution, and then use the randomly drawn dates to make a time-scaled tree. So, maybe normalis was first seen  at 454.2 or 461.8 Ma, and last seen at 440.7 Ma, or 438.8 Ma. Or lots of other combinations! We can resample those dates and get all sorts of new FADs. and LADs and make lots of time-scaled trees. Unlike timePaleoPhy, bin_timePaleoPhy actually does something about stratigraphic uncertainty, at least as far as having data in discrete intervals goes.

But this doesn't fix the unknown times of observation issue for trait data... so bin_timePaleoPhy also has a rand.obs argument just like , so that we can let our observed doodad length for N. normalis come from 452.6 Ma or 446.3 Ma or whatever, depending on the set of FADs and LADs we plucked from the intervals in a given run. Again, just like with timePaleoPhy, the date of the nodes are still going to depend on the FADs, rand.obs is only going to impact the tip-dates, those enigmatic 'times of observation'.

So, let's be clear:

1) timePaleoPhy does *not* deal in stratigraphic uncertainty of appearance dates: it believes the numbers you are handing it are very precise first and last appearance dates, which exist for some fossil records but probably not most. It does deal with uncertainty in the times of observation of specimen measurement, though, which is what rand.obs is for.

2) bin_timePaleoPhy *does* deal with stratigraphic uncertainty, but it requires slightly more complicated input: two matrices, as a list.

And I think, somehow, I said something that is confusing people about these two facts. I'm repeatedly getting emails from people that think the matrix for timePaleoPhy is a matrix of the minimum and maximum ages for a point occurrence of a fossil. In other words, their fossil taxa only appear at a single point in time which is imprecisely known, and so their dates are the bounds on that single date, rather than two very precisely known dates that represent the first and last appearances.

This practice is very worrying, because if they put those dates in and then set rand.obs=TRUE and add.term=TRUE, they'll think their pulling their taxon occurrence dates randomly from their min and max dates... which is true for the tip-dates. But the nodes are being pushed back in time to those minimum dates, which really means that clades are 'as old as they could possibly be' or something wacky like that. And I bet people have done this without realizing that's what is happening! Its absolute nonsense and this nonsense has to stop.

I think the real issues is that when I wrote these functions, I was excessively myopic. I looked at things like a graptolite worker and I thought everyone had data like graptolite data, with taxa that typically first and last appeared at different times. I actually didn't realize that other people had data where it was more common to see things like 'species A is known from one skeleton, found somewhere in the early Cretaceous'. Some vert paleo friends of mine have kindly helped me understand that this isn't true, at least, for lots of vert datasets: they actually have taxa known from single collections that may be as poorly unconstrained as 'anywhere within a 40 million year interval', which, well, that blows my mind every single time I read this sentence.

So, given the apparent and continual misuse of timePaleoPhy, I have actually considered pulling the function entirely from the CRAN version of paleotree, or hiding the function, making it unavailable to regular users. This isn't ideal: some people, such as those who work with detailed biostratigraphic records such as for index fossils and microfossils, actually have precisely dated first and last appearances. I don't know if anyone like that uses paleotree yet, but I want to keep the door open for them! Plus, bin_timePaleoPhy is really just a dumb wrapper for timePaleoPhy. If someone really wants to write their own wrapper for timePaleoPhy, that's fine! I'm happy for them.

But honestly, the majority of people won't have data that's infinitely precise. They should be using bin_timePaleoPhy. (Or you know, bin_cal3TimePaleoPhy, but maybe I shouldn't get ahead of myself...)

So, the real question is how to convince people to want to use bin_timePaleoPhy. Well, although I just explained the function from a grapt-point-of-view, I didn't code it for that only that type of data. The timeList objects takes by bin_timePaleoPhy and other paleotree functions can actually be much less formal than my example (well, a few specific functions have very specific needs, but that isn't true of bin_timePaleoPhy).

For one thing, the intervals in the interval matrix can be completely messy. For example, you could have an interval matrix that looks like:
Lower Cumberbatchian 134.5 129.4
Cumberbatchian-Martinian 134.5 109.6
Cumberbatchian-Bloomean 134.5 99.8
Middle Upper Martinian 107.5 105.8
Martinian-Bloomean  115.4 99.8

What I've discovered in a recent collaborative project is if you get data from the PBDB, you'll probably have to have an interval matrix like this. It's okay. But if you have multiple collections for a taxa, try to get the most precise first and last intervals.

For example, say your taxa only ever appear in single interval, well than the first and last intervals are identical, so you'd just structure the second part of the input list so the columns matched, like so:

Rareosaurus 1 1
Uniqueodon 2 2

This is a little more tricky if your taxa also only ever appear in a single collection. By default, bin_timePaleoPhy assumes taxa first and last appear at different dates within an interval. For example, Rareosaurus will not be assigned matching FADs and LADs. However, if you set the argument point.occur=TRUE, then bin_timePaleoPhy will treat all your taxa as if their first and last occurrences are identical: i.e. all your taxa will be treated as point occurrences.

Of course, I have made it so that bin_timePaleoPhy has a little more functionality, to deal with those special cases that often crop up with uncertainty in appearance times; these can be usually handled with what I called the 'sites' matrix. For example, perhaps the first specimens of two taxa appear in the same fossil assemblage. That single fossil assemblage may be very poor resolved to a wide interval, but we know the first appearance date for both taxa should be the same date. If you make a site matrix as I describe in the bin_timePaleoPhy help file, and set the site number the same for the first appearance of both taxa, then bin_timePaleoPhy will always use the same randomly-drawn date for those taxa. You could do this for lots of taxa, or do this to account for a site where some taxa have their last appearance at and other taxa first appear at, or all sorts of neat modifications. I have to be honest and point out that this sites matrix idea was first suggested by Jon Mitchell over lunch about three years ago.

In reality, the point.occur argument I described is just a simplified way of modifying the site matrix so all first and last intervals are coded as having the same 'site' for each taxon. If you want to have only some taxa in a dataset constrained to be point occurrances, you can just specify a custom-made site matrix where only those taxa have their first and last interval constrained to be the same.

To be truly honest though, I don't think its a lack of advertising about the options offered by bin_timePaleoPhy that stops people from using it... really, its the timeList input format. A large number of people have suggested that this is a very obscure way of recording this data and more than a few have complained that its just not amendable to their use. Peter Smits told me that the timeList was specifically very 'ugly' to have to use two matrices. Overall, I have a feeling that I just did a shoddy job of describing timeList objects in the help files to begin with.

Well, I don't know what to do about it guys! Coming up with the two-matrix fix was the best idea I had after thinking about it for a while, and while I admit its inelegant and while I admit that the timeList structure is pretty dissimilar from the typical data structure you might see in a paleontological dataset (as opposed to a taxon by interval range chart), I just don't see a better solution, even though I can that its actively driving people to misuse the functions in paleotree in ways that I think are unpredictable and a little dangerous.

But I know that I don't know everything... so if anyone has a better idea of how to make the bin_timePaleoPhy input work better, I'd love to hear it and implement your idea in paleotree! Do you see a better, simpler way of conveying the same information? Or, is there a data format that you use, that you'd like to have a function for converting into a timeList? Let me know in the comments below.

For those of you interested, I give a real example of a very nicely behaved timeList object in my time-scaling method tutorial from this summer, which you can find here:

On a more philosophical note, I think those of us who write methods for paleontological data have to keep in mind that Fossil Records are Different. Had I been less nearsighted and graptolite-oriented in my design and help file writing in the beginning, maybe this user preference for the wrong function could have been avoided, and so for that, I really only can blame myself.

Finally, if you're reading this, and you think you might be one of those people who sent me an email this week and contributed to my concerns, I just want to thank you for emailing me and forcing me to confront the fact that there was something about my software that users were finding consistently confusing. I still love reading and responding to every paleotree email I get.