Just published online in Nature - https://go.nature.com/2F9uV5q
Renewing Felsenstein’s Phylogenetic Bootstrap in the Era of Big Data
F. Lemoine, J.-B. Domelevo Entfellner, E. Wilkinson, D. Correia, M. Dávila Felipe, T. De Oliveira, O. Gascuel*
* Correspondence: email@example.com
The phylogenetic bootstrap was proposed by Joseph Felsenstein more than 30 years ago. This method, based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. Its usefulness, simplicity and interpretability made it extremely popular in evolutionary studies, to the point that it is generally required for publication of phylogenies. Felsenstein’s article has been cited more than 35,000 times and is ranked in the top 100 of the most cited scientific papers of all time. In 2017, it was cited more than 2,000 times.
However, it is commonly acknowledged that Felsenstein’s bootstrap is not appropriate for large datasets containing hundreds or thousands of taxa, which are now common thanks to high-throughput sequencing technologies. While such datasets generally contain a lot of phylogenetic information, Felsenstein’s bootstrap proportions (FBP) tend to be low, especially when the tree is inferred from a single gene or only a few genes. This degradation is explained by the core methodology of Felsenstein’s bootstrap: a bootstrap branch must exactly match a branch in the original tree estimate to be counted in the bootstrap support of that branch. A difference of just one taxon is sufficient for the bootstrap branch to be counted as absent, even though it is nearly identical to the original branch. The standard approach is to remove “rogue” (phylogenetically unstable) taxa and relaunch the analysis, but this is statistically questionable and computationally expensive. Moreover, with large trees, inferred branches are likely to contain errors and a large fraction of taxa may be unstable, even in the absence of any model misspecification and without long branches.
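The exact-match rule can be illustrated with a minimal Python sketch. It assumes each tree is given as a list of bipartitions, each written as the set of taxa on one side of a branch. The function names `canonical` and `fbp` and the six-taxon toy data are hypothetical and serve only as an illustration; this is not the code of any particular bootstrap program.

```python
from typing import FrozenSet, Iterable, List, Set


def canonical(side: Iterable[str], taxa: Set[str]) -> FrozenSet[str]:
    """Pick one fixed side of a split so that ABC|DEF and DEF|ABC compare equal."""
    side = frozenset(side)
    reference = min(taxa)  # arbitrary but fixed reference taxon
    # Always return the side that does NOT contain the reference taxon.
    return frozenset(taxa) - side if reference in side else side


def fbp(branch: Iterable[str], boot_trees: List[List[Set[str]]], taxa: Set[str]) -> float:
    """Fraction of bootstrap trees containing exactly the same bipartition."""
    target = canonical(branch, taxa)
    hits = sum(
        1
        for tree in boot_trees
        if any(canonical(b, taxa) == target for b in tree)
    )
    return hits / len(boot_trees)


# Toy data (hypothetical): internal branches of three bootstrap trees on six taxa.
taxa = {"A", "B", "C", "D", "E", "F"}
boot_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}],  # contains ABC|DEF
    [{"A", "B"}, {"C", "D"}, {"E", "F"}],       # closest branch is off by one taxon
    [{"C", "D", "E", "F"}, {"D", "E", "F"}, {"A", "B", "C", "D"}],  # first tree again, opposite sides listed
]

support = fbp({"A", "B", "C"}, boot_trees, taxa)
print(round(support, 3))  # 0.667
```

In the second bootstrap tree, the branch AB|CDEF differs from ABC|DEF by the single taxon C, yet it contributes nothing to the support, which is exactly the behavior described above.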
In this article, we propose a new version of the phylogenetic bootstrap, in which the presence of original branches in bootstrap trees is measured using a gradual “transfer” distance, as opposed to the binary presence/absence index of the original version. This distance is normalized to the [0, 1] range and averaged over all bootstrap trees. We thus obtain the “transfer bootstrap expectation” (TBE), which replaces the branch presence frequency of FBP (i.e. the expectation of a 0/1 function) with the expectation of a nearly continuous function. By construction, TBE supports are necessarily higher than FBP’s, and the difference is substantial for deep branches. When combined with consistent tree estimation, TBE rarely supports poor branches. Our results with mammal, HIV and simulated data sets clearly demonstrate its usefulness, especially with deep branches and large trees, where branches known to be essentially correct are supported by TBE but not by FBP. Importantly, TBE supports are easily interpreted as fractions of unstable taxa, and the ability of TBE to identify the most unstable taxa (e.g. recombinant HIV sequences) makes it possible to study them further, understand why they are phylogenetically unstable, and revise the branch supports. TBE computation and other phylogenetic tools are available from http://booster.c3bi.pasteur.fr.
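The transfer-based support can be sketched in Python under stated assumptions. Each tree is again a list of bipartitions, each given as the set of taxa on one side of a branch. The transfer distance between two branches is taken as the number of taxa that must change sides to make them identical. The normalization by p - 1, where p is the size of the smaller side of the original branch, is an assumption not spelled out in the text above. The function names and toy data are hypothetical, and this naive version is not the implementation available from the website.

```python
from typing import List, Set


def transfer_distance(b: frozenset, b_star: frozenset, n: int) -> int:
    """Number of taxa to move across the branch so that b_star becomes b."""
    hamming = len(b ^ b_star)        # taxa on different sides, one orientation
    return min(hamming, n - hamming)  # the other orientation gives n - hamming


def tbe(branch: Set[str], boot_trees: List[List[Set[str]]], taxa: Set[str]) -> float:
    """Average normalized closeness of `branch` to its nearest bootstrap branch."""
    n = len(taxa)
    b = frozenset(branch)
    p = min(len(b), n - len(b))  # size of the smaller side
    if p < 2:
        raise ValueError("this sketch handles internal branches only (p >= 2)")
    total = 0
    for tree in boot_trees:
        nearest = min(transfer_distance(b, frozenset(s), n) for s in tree)
        # Assumption: cap at p - 1, the distance to a single-taxon branch,
        # since only internal branches are listed in the toy input.
        total += min(nearest, p - 1)
    return 1.0 - total / (len(boot_trees) * (p - 1))


# Toy data (hypothetical): internal branches of three bootstrap trees on six taxa.
taxa = {"A", "B", "C", "D", "E", "F"}
boot_trees = [
    [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}],  # contains ABC|DEF: distance 0
    [{"A", "B"}, {"C", "D"}, {"E", "F"}],       # nearest branch is one transfer away
    [{"C", "D", "E", "F"}, {"D", "E", "F"}, {"A", "B", "C", "D"}],  # distance 0
]

support = tbe({"A", "B", "C"}, boot_trees, taxa)
print(round(support, 3))  # 0.833
```

On this toy input the exact-match frequency of ABC|DEF would be 2/3, whereas the transfer-based value is 5/6: the second tree, which misses the branch by a single taxon, contributes 1/2 rather than 0.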