Monday, June 17, 2013

Sampling Rates and Dealing with Fossil Records Less Than Conducive to Their Estimation

So, in my last post, I went at length into a lot of the nuances we need to consider when we estimate sampling rates and sampling probabilities from the frequency distribution of ranges in the fossil record. This gave me the impetus to do something I'd been meaning to do for a while: actually track down the various estimates of sampling probability or sampling rate, turn them all into per Lmy sampling rates and compare them. Here they are, ordered from greatest to least:

Sampling Rate (per Lmy)    Taxon    Reference
17.4577776    Oklahoma Trilobites Species (Single Locality)    Foote and Raup, 1996
1.660731207    Cenozoic Macroperforate Planktic Forams    Ezard et al., 2011
1.046292901    Neogene Iberian Mammal Genera    Alba et al., 2001
0.868900035    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.797580429    Neogene Iberian Mammal Species    Alba et al., 2001
0.483501825    Phanerozoic Brachiopoda Genera    Foote and Sepkoski,1999
0.437808292    Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.434450018    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.410974389    N.A. Cenozoic Mammal Species    Foote and Raup, 1996
0.408044166    European Jurassic Bivalve Species    Foote and Raup, 1996
0.385502461    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.357475065    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.334331480    Phanerozoic Cephalopoda Genera    Foote and Sepkoski,1999
0.244922481    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.214987601    Phanerozoic Bryozoa Genera    Foote and Sepkoski,1999
0.201575023    Phanerozoic Conodonta Genera    Foote and Sepkoski,1999
0.196147211    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.192764386    Phanerozoic Trilobita Genera    Foote and Sepkoski,1999
0.182563024    Phanerozoic Graptolithina Genera    Foote and Sepkoski,1999
0.159239636    Phanerozoic Echinodea Genera    Foote and Sepkoski,1999
0.153449104    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.145183217    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.133448941    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.127046142    Phanerozoic Ostracoda Genera    Foote and Sepkoski,1999
0.122426282    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.116261536    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.115432413    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.112799434    Phanerozoic Bivalvia Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Blastozoa Genera    Foote and Sepkoski,1999
0.099553348    Phanerozoic Gastropoda Genera    Foote and Sepkoski,1999
0.099021026    Early Paleozoic Crinoid Genera    Foote and Raup, 1996
0.093263457    Phanerozoic Anthozoa Genera    Foote and Sepkoski,1999
0.086915600    Phanerozoic Porifera Genera    Foote and Sepkoski,1999
0.084205114    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.078324167    Phanerozoic Crinoidea Genera    Foote and Sepkoski,1999
0.075548263    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.067297159    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.054279636    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.052305831    Phanerozoic Asterozoa Genera    Foote and Sepkoski,1999
0.039758685    Phanerozoic Malacostraca Genera    Foote and Sepkoski,1999
0.029548896    Phanerozoic Osteichtyes Genera    Foote and Sepkoski,1999
0.023242431    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.011674604    Phanerozoic Chondrichtyes Genera    Foote and Sepkoski,1999
0.009677980    Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.009326054    Phanerozoic Polychaeta Genera    Foote and Sepkoski,1999
0.008051675    Cenozoic Chiroptera Genera    Eiting and Gunnell, 2009

Most of these were published as sampling probabilities, so I just found the mean interval length and got the sampling rater per Lmy using my paleotree function 'sProb2sRate'. The Foote and Sepkoski values weren't published as numbers, just shown in a plot, so I used a ruler rather than send an email to my old advisor. It's just a blog post, not a research paper.

Note that every group in Foote and Sepkoski had two sampling probabilities plotted, for two different systems of how to break the Phanerozoic into smaller intervals. Interestingly, this reveals considerable variation in the estimated sampling rate, suggesting that the estimates from freqRat or from the maximum likelihood variant can be sensitive to how we break up intervals (as I suggested in the last blog post).

The rate estimated for the Oklahoma trilobite data is wildly high, but note this is a densely sampled single locality in the Ordovician. Beware the disconnect between global estimates and local estimates... and beware the fact that they don't cleanly scale between each other. If all of your taxa are from Australia, if you do a global analysis, it will be just an analysis of Australia that you're only pretending is global. Maybe that is the 'global' rate, but it depends on your assumptions about whether there are taxa that you didn't sample in non-Australia regions of the world. There's also taxonomic issues here too: many of these analyses were done at the supraspecific level (note that doesn't mean persistent taxa are only at the genus level and above; paleontologists just have a tendency to analyze genera, not species).

Now let's look at the other end of the scale. We've got some fish groups, then polychaetes, then bats. Makes sense right? If you had to imagine a really bad fossil record, squishy worms and fragile little bats sounds about right. But hold on. Let's ask ourselves, what's missing from this list. Well, there aren't any terrestrial groups on this other than mammals. I used to think this was just because no one had done it yet.

Then I started talking to various vertebrate paleontologists and researchers working with datasets from non-Cenozoic vertebrate paleontology datasets and they told me that in other vertebrates groups, like dinosaurs, every species is found in what essentially amounts to a single stratigraphic interval, at least at the global scale. They might be found throughout a unit or formation that represents a limited window of time, like the Morrison or whatever, but you don't find the same morphotaxon persisting across intervals like you do in invertebrates. The people I talk to often call it 'point occurrence' data so I'll use that term for the rest of this post, even though that isn't exactly ideal: 'point-occurrence' really isn't completely accurate if some of those taxa are found at multiple horizons in a unit or formation, its less descriptive than we might desire and 'point occurrence' is sometimes used to mean other things. (There's other terms too like 'singleton' which could apply here, but their usage is even more muddled.)

Disclaimer: I know nothing about vertebrates or their fossil record. As evidence, I present that I squeaked by with a C- in Primate Anthropology, my one attempt to learn anything about vertebrates. I felt it was important to start off this post with that, because, great golly, I don't want anyone reading this and thinking I know anything about the vertebrate fossil record. Heck, I haven't even seen any datasets firsthand. All my information is hearsay from other scientists, most of whom contacted me in trying to time-scale their trees. If someone can show me a range chart that has a bunch of sauropod species persisting through multiple geologic intervals, then great, I'd love to see it.

So, anyway, I found this description of vertebrate paleo datasets to be really strange. Why are they so different from the ones we can get sampling rates from, like those above? One explanation I have been given is that this occurs because groups like dinosaurs are just more poorly sampled than shelly invertebrate groups. If  I try to simulate a really poorly sampled clade but still have a number of sampled taxa, under a simple model where extinction and sampling are homogenous Poisson processes, I can't make all the taxa point occurrences. There's a real good reason for this: as the sampling rate decreases, the number of original true taxa must increase, such that the sampled taxa is a tiny portion of a vast unsampled diversity. As the number of total taxa increases, you get more and more extremely long-lived taxa, a tiny minority, but they become so increasingly long-live that the probability of NOT sampling any of them one of them twice (and thus getting a persistent taxon) is incredibly low. I cannot avoid simulated datasets with persistent taxa, not under this simple model of extinction and sampling. This is true even if I break up the time-scale of these simulations into discrete intervals. I still get persistent taxa with first and last appearances sampled in multiple intervals.

So, to me, that means it isn't just 'poorer sampling'. There needs to be something else going on. We're missing some explanatory factor that could cause morphotaxa to either be very 'short-lived' in the fossil record and/or would make it very difficult to sample any morphotaxon that might survive into another stratigraphic interval. Of the former, it is possible to imagine a fossil record of point occurrences resulting from extremely high anagenesis rates, such that species really don't have any opportunity to be persistent. However, that means there's something about dinosaurs and other such groups that is fundamentally different from groups like North American mammals and fossil invertebrates. It doesn't need to represent a real evolutionary difference: it might be that differences in the available material gives more characters and thus allows for much much finer species delimination (or, possibly, that species delimination is being overly aggressive). The later, where some factor makes it difficult to sample any persist morphotaxon more than once, could occur if there is some sort of complex regional alternation in the deposition and preservation of rock packages that could preserve vertebrates; i.e. North America get the Morrison, then nothing for a long time, something like that (I have no idea if the preceding statement is true). But then why the shift going from Mesozoic dinosaurs to Cenozoic mammals?

Remember, to use the freqRat or the variants of it, you need to have some persistent taxa so that means there are some persistent bat taxa, found in multiple stages (as that's what Eitling and Gunnell used) but not dinosaurs. Now, an important distinction here is maybe it's that Eitling and Gunnell used genera. So, okay, forget bats. But why do Alroy's Cenozoic mammal species have persistent taxa? (as analyzed in Foote and Raup) Whatever the reason for this point occurrence pattern in some groups, it will probably be obscure for a while. I'd bet it is probably the result of multiple factors.

But, to get back to the practical point, what's a researcher with a dataset unanimously dominated by point occurrence taxa supposed to do if they want to use cal3, which absolutely requires an estimate of the sampling rate? (We'll ignore also needing branching and extinction rate estimates for the moment.)

It's really quite a conundrum. These are my thoughts.

First is that the freqRat and the ML variant aren't the only ways to get sampling rate. Foote (2004) also introduced a survivorship curve method, but that also won't work if you don't have persistent taxa. Solow and Smith (1997) presented a method that obtained sampling rate estimates from the waiting time distribution between sampling events in single taxa; that information definitely isn't available here, at least not at the global, phylogenetic scale we're trying to get sampling rates for.. There's also capture-mark-recapture methods, which also estimate sampling probabilities, but I don't know much about them or whether they could be applied to data that effectively has no recaptures. An idea that's been banging around in my skull is that Friedman and Brazeau (2012) introduced an idea that might bear water: get sampling estimates from the ghost branch distribution. But I think that issue is a lot more complicated than Friedman and Brazeau suggested, and there's some theoretical roadblocks relating to differentiation patterns (budding versus bifurcation, anagenesis, etc) that I haven't thought around yet.

So, that exhausts the methods (that I know of) for actually estimating the sampling rate.

One utter kludge would be to just pull them out of thin air. Ask yourself what fossil records on the chart above are probably of similar quality and use those values. Even better, choose a range of likely values and use those. The problem with this is that maybe your fossil record is even worse than the ones described here. Also, these values are for a limited selection of data: most of them are global and their relationship to regional datasets is thus obscure. Many of them are at the generic level, particularly the more poorly sampled records, and it is difficult to envision any way of converting a generic estimate of sampling to a species-level estimate (although maybe paraclade models might offer such a route; see Patzkowsky, 1995; Foote, 2012). Overall, I cannot recommend that pulling values out of thin air is the right way to use cal3. It's a quick and dirty way of getting something done, just like using a 'taxon tree' as a replacement for a cladogram, but its ultimately difficult to robustly defend the assumptions you'd be making in such an analysis. Class projects and pilot work can have quick and dirty techniques, but for real research we want to be able to trust the conclusions of, we would want something that involves less guesswork.

And there's an addition problem to such cross-eyed guesstimations: remember my simulation woes above? We can't get point occurrence data under our typical models of sampling and extinction events, where they occur as under a Poisson process. We know the assumptions of our models are not valid in those cases. And that means even ignoring the 'thin air' part of pulling sampling rates out of thin air, the models that cal3 assumes probably don't apply to this type of data. And it's not just cal3 that assumes those models; that's the typical assumptions of many paleobiological methods that involve putting any sort of expectation on a sampling process. Of course, just because model assumptions are violated doesn't mean a method will return the wrong answer. Methods can be more robust than that. However, without knowing what leads to point occurrence type data it is hard to make predictions, since we don't know exactly what about our model assumptions are incorrect. My presumption would be that if this data type occurs because sampling is much more complicated than a global Poisson process, then cal3 will give fairly wrong answers. Again, though, that's probably also true of many other methods too. If, instead, it's because these are groups where morphotaxa are extremely finely delimited, such that you can't get persistent taxa, then maybe the cal3 method is okay. This violation of model assumptions just means morphotaxa are 'terminating' via anagenetic pseudoextinction much more often than lineages terminating due to extinction. In other words, there's a disconnect between the morphological differentiation of lineages and their birth-death dynamics. cal3's assumption have more to do with the probability of not sampling a clade at all or not, something which should be independent of the morphological differentiation pattern within that clade.

Regardless, and some people have pushed me on this point, I do not claim that the cal3 method is applicable to all datasets. I fully admit that it is not! I did this to time-scale graptolite phylogenies and while I want the method to be as useful and general for as many people as possible, the only reasonable starting point for making an explicitly model-based time-scaling method was to start with the simplest models of sampling and diversification possible. Simple models are the only logical foundation for moving forward. And while I might be able to realistically defend the assumptions of those simple models in graptolites or other 'well-sampled' groups, as a scientific community, we have a lot more work ahead of us to tackle every dataset. Even though sampling processes are as important as origination and extinction in shaping the fossil record, if not more so, I think we still spend much more effort in understanding and modeling the dynamics of diversification in paleobiology than sampling processes.

But this difficulty in applying cal3 to every dataset poses a new question: are the other time-scaling methods that we have available sufficient for the sort of phylogeny-based analyses of evolution ('phylogenetic comparative methods') that we want to do in the fossil record?

Well, we can talk more about that this Sunday in Snowbird, if you'll be there. ;)