Friday, May 27, 2016

A New (Or Maybe Not?) Metric for Measuring How Much Two Topologies Contradict Each Other

A New (Or Maybe Not?) Metric for Measuring How Much Two Topologies Contradict Each Other

Recently, I’ve been working on finishing up a study where my collaborators and I did more than 15 different phylogenetic analyses: using different character data, different sets of taxa, different methods (parsimony versus Bayesian), different settings (different priors in Bayesian analyses), etc. Many of them produced somewhat less-than-well-resolved results. This makes comparing and contrasting all these topologies a real headache. Nevermind the additional need to also know how the topologies differed from previous analyses in this group!

What I needed was a metric that captured how much two trees disagreed in the relationships inferred (for taxa shared between the two analyses). For example, let’s simulate some trees with ape.

# let's simulate two trees
library(ape)
set.seed(1)
treeA<-rtree(30,br=NULL)
treeB<-rtree(30,br=NULL)

We can visually compare these trees with cophylo in package phytools.

library(phytools)
plot(cophylo(treeA,treeB))
## Rotating nodes to optimize matching...
## Done.

A mess, as expected. But how different are they? Obviously, my first thought was the symmmetric Robinson-Foulds distance, or some other similar topological distance, as is available in package phangorn in function treedist.

library(phangorn)
treedist(treeA,treeB)
## symmetric.difference      path.difference 
##             54.00000             69.77822

In this case, the symmetric.distance is the symmetric RF distance. However, this doesn’t actually capture what I want to capture: how much do two topologies conflict? RF will say two trees differ even if one tree is just the other tree with less resolution (i.e., more polytomies). What about a metric that just captures the disagreements, the incongruences between two topologies?

To take an extreme example, we’d want the difference between a star tree (a topology with no splits) and a well-resolved topology of the same taxa to be zero. For example…

# simulate a star tree
treeC<-stree(30)

# plot the tanglegram between A and C
plot(cophylo(treeA,treeC))
## Rotating nodes to optimize matching...
## Done.

(This is an utter mess because phytools’s cophylo doesn’t rotate lineages within polytomies; oh well.)

But as we can test, this isn’t the case with Robinson-Foulds distance.

treedist(treeA,treeC)
## Warning in treedist(treeA, treeC): Trees are not binary!
## symmetric.difference      path.difference 
##               27.000              146.806

Notice that, as implied by that warning, that topological distances aren’t even stable when non-binary trees are assessed. Regardless, the distance between the two trees is essentially relative to the number of nodes missing from the star tree. This is problematic: I want to know if somewhere in the 17+ by 17+ table of pair-wise topological distances we’ll end up building whether two tree actively disagree. If one tree agrees with a tree but just has lower resolution, well, I don’t care. There’s no actual contradicting relationships being inferred.

The solution is to invent my own metric for topological comparison, calculated as the total number of conflicting splits across the two topologies, after dropping any tip taxa not shared between the two trees. By dividing this number by the \(2 * (Number\ of\ shared\ tips - 2)\), the maximum number of splits that could conflict between two given topologies of the same size (remember, we dropped all the unshared taxa), we can scale this metric to between 0 (no conflicting relationships) and 1 (two entirely conflicting topologies), similar to the rescaling in Colless’s consensus fork index.

Algorithmically, we can identify those conflicting splits by counting the number of splits (via ape’s prop.part) on one tree that disagree with at least one split on the other tree: for example, split (AB)CD would be contradicted by split (AC)BD. To put it another way, all we need to test for is whether the taxa segregated by that split were found to be more closely related to some other taxa, not so segregated by the considered split. Here’s the code for calculating this metric, essentially a function for testing whether two splits contradict each other, another function which uses the first to count the number of contradictions across the splits from two trees, and a third which takes these counts of contradictions and scales them:

testContradiction<-function(namesA,namesB){
    matchA<-namesA %in% namesB
    matchB<-namesB %in% namesA
    if(any(matchB)){
        res<-!(all(matchA) | all(matchB))
    }else{
        res<-FALSE
        }
    return(res)
}
nContradiction<-function(partA,partB){
  partContra<-sapply(partA,function(x) 
      any(sapply(partB,function(y) 
        testContradiction(x,y))))  
  res<-sum(partContra)
  return(res)
  }

treeContradiction<-function(tree1,tree2,rescale=TRUE){
  # checks
  if(!inherits(tree1, "phylo")){
        stop("tree1 is not of class phylo")
    }
  if(!inherits(tree2, "phylo")){
        stop("tree2 is not of class phylo")
    }
  #
  tree1<-drop.tip(tree1,setdiff(tree1$tip.label,tree2$tip.label))
  tree2<-drop.tip(tree2,setdiff(tree2$tip.label,tree1$tip.label))
  #
  # more checks
  if(Ntip(tree1)!=Ntip(tree2)){
      stop("Trees do not contain same number of tips after pruning to tips with identical labels (?!)")}
  if(Ntip(tree1)<2 | Ntip(tree2)<2){
      stop("Trees contain less than one tip after pruning")}
  #
  # now measure number of contraditioncs
  nUnshared<-sum(1==attr(prop.part(tree1,tree2),"number"))
  part1<-lapply(prop.part(tree1),function(x) tree1$tip.label[x])
  part2<-lapply(prop.part(tree2),function(x) tree2$tip.label[x])
  nContra1<-nContradiction(part1,part2)
  nContra2<-nContradiction(part2,part1)
  res<-nContra1+nContra2
  #
  # rescale to 0-1 scale?
  if(rescale){
    #number of possible nodes that could contradict on one unrooted tree
    nPossNodes<-Ntip(tree1)-2   # per tree   
    res<-res/(2*nPossNodes)     # per two trees
    }
  return(res)
  }

Now, this ‘contradiction distance’ metric has some unusual but fully-intended mathematical properties; for example, this metric is non-Euclidean, as a number of very different well-resolved trees would be 0 distance from a star tree of the same taxa (but would have non-zero distances between each other). However, as we are using this metric solely as an indicator of the extent of disagreement, such properties are of little concern. If we were using this metric as the basis for an ordination, god-help-us. Instead, this allows us to quantify the difference of whether a poorly resolved tree only slightly or greatly contradicts a well-resolved tree.

We can see this when we try out the contradiction distances between our three trees from before; remember, C is a star tree.

treeContradiction(treeA,treeB)  # should be non-zero distance
## [1] 1
treeContradiction(treeA,treeC)  # should be zero distance
## [1] 0
treeContradiction(treeB,treeC)  # should be zero distance
## [1] 0

And between two identical trees?

treeContradiction(treeA,treeA)  # should be zero distance
## [1] 0

Another perhaps less ideal property is that a two taxa on opposite ends of the tree, moving from side of the topology to the other while holding the rest of the structure constant, would produce two trees with the maximum contradiction distance possible (i.e., = 1).

treeAA<-read.tree(text="(A,(B,(C,(D,(E,F)))));")
treeBB<-read.tree(text="(E,(B,(C,(D,(A,F)))));")
plot(cophylo(treeAA,treeBB))
## Rotating nodes to optimize matching...
## Done.

treeContradiction(treeAA,treeBB)
## [1] 1

But that’s not just a property of this metric, but RF distance too:

treedist(treeAA,treeBB)
## symmetric.difference      path.difference 
##             6.000000             6.324555

Basically, RF thinks the two are pretty distant too.

Playing around with this metric has been a lot of fun, and helped me identify all sorts of interesting patterns in my data, like Bayesian analyses being fairly consistent with each other, but small variations in maximum-parsimony producing wildly different topologies. However, its usefulness worries me a great deal.

Now… here’s my problem. I just invented such a metric, but my guess is someone else has already invented it. Phylogenetics is, as any graduate student will tell you, a vast field with a decades-deep history. It is extremely hard to never everything, in every nook and cranny of the field (yes, comprehending phylogenetics is like buttering an English muffin, I guess). People often invent things in papers using terms you could never guess to search for, in papers with ridiculous titles that have nothing to do with the grand thing they just invented. Maybe every field is a little like that; certainly, paleontology is. But that makes someone like me, who worries a lot about priority, very worried when he stumbles on something very useful and very simple. It means that someone probably though of it before. The problem, I can’t find anything like it in the literature.

So help me out here! Have you seen anything like this before? Whose wheel am I reinventing? Or is it really, somehow, a new wheel?