Every scientist who has ever approached the task of phylogeny reconstruction knows that the very first step is to determine which evolutionary model best describes the relationships among the genes or organisms under investigation. There is a wide range of models, and each is a potential candidate to explain the data. They range from the Jukes and Cantor (JC) model, which portrays simplistic patterns of evolution, to the most parameter-rich nucleotide model, GTR+I+G, which assumes a complex combination of independent patterns. This step of selecting the most suitable model, known as Model Selection, is considered mandatory for producing statistically reliable results. Over the years, several methods have been devised to tackle this problem, including the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the hierarchical Likelihood Ratio Test (hLRT), and more. This, however, confronts researchers with a new problem: how do we choose the method for choosing the best model?
The choice of Model Selection criteria in 300 studies that were done during 2017-2018
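To make the difference between two of these criteria concrete, here is a minimal sketch of how AIC and BIC score candidate models from their maximized log-likelihoods. The log-likelihood values and the alignment length are invented for illustration and do not come from the study. The parameter counts cover only the substitution model and omit branch lengths, which is a simplifying assumption that adds the same penalty to every model when the tree is shared.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical (log-likelihood, free substitution parameters) per model.
# These numbers are made up purely to illustrate the calculation.
candidates = {
    "JC": (-5230.0, 0),
    "HKY": (-5105.0, 4),
    "GTR+I+G": (-5098.0, 10),
}
n_sites = 1000  # assumed alignment length, used as the sample size for BIC

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)

print("AIC picks:", best_aic)
print("BIC picks:", best_bic)
```

With these invented numbers the two criteria disagree: BIC penalizes each extra parameter by ln(1000), roughly 6.9, rather than 2, so it favors the simpler model. This is the kind of disagreement that raises the question of which criterion to trust.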
Numerous studies have been devoted to this puzzling problem, analyzing the performance of the various methods on simulated datasets and trying to identify the single most accurate one. However, these studies reached conflicting conclusions. It could be, we hypothesized, that each method performs well only within certain ranges of data characteristics. If so, a study that simulated datasets with certain sample sizes or evolutionary processes could reach different results from a study that used different simulation parameters. To resolve this ambiguity, we performed a thorough examination spanning a wide range of realistic data characteristics. We collected thousands of empirical datasets from several databases and simulated synthetic ones based on the features extracted from them. Then, for each dataset, the best-fitting model was selected according to each of several Model Selection criteria and used to reconstruct a tree. Finally, we measured the distance between each reconstructed tree and the true tree (the one used for the simulation) using both topological distances and branch-length distances. We observed that the average topological distances over all datasets were highly similar across the different methods. Suspecting that the simulations might not be challenging enough, we increased the complexity of the simulation schemes, but similar results emerged.
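As an illustration of what a topological distance measures, the sketch below computes the Robinson-Foulds distance between two small unrooted trees. The text above does not name the specific metric used, so Robinson-Foulds is an assumption here, chosen because it is a widely used topological distance. The nested-tuple tree representation and the function names are likewise hypothetical conveniences for the example.

```python
def leaf_sets(tree, out):
    """Return the leaves under `tree`, appending each internal clade to `out`."""
    if isinstance(tree, str):  # a leaf is represented by its taxon name
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of the unrooted tree, in a canonical form."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    ref = min(all_leaves)  # reference taxon used to pick one side of each split
    result = set()
    for clade in clades:
        # Always store the side of the split that excludes the reference taxon.
        side = all_leaves - clade if ref in clade else clade
        # Skip trivial splits (single leaf versus the rest) and the empty side.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Toy five-taxon example: the inferred tree swaps the positions of B and C.
true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred_tree = ((("A", "C"), "B"), ("D", "E"))

print(rf_distance(true_tree, true_tree))
print(rf_distance(true_tree, inferred_tree))
```

A distance of zero means the two topologies are identical. Averaging such distances over many simulated datasets gives the kind of per-method summary compared in the text.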
In terms of topology inference, the performance of the different methods is highly similar. Why? Is it because all methods tend to choose the same models? The answer is no: for example, AIC and BIC chose different models in 62% of the cases, yet their average distances differed only in the fourth decimal place. So, is it possible that all models lead to similar topologies? To our surprise, our analysis confirmed this hypothesis. We reconstructed all phylogenies with the most complex model, GTR+I+G, regardless of what the Model Selection methods chose, and the inferences were even better than those obtained with the selected models. Then, we examined the most simplistic model, JC. Although it was inferior to the other reconstruction strategies (i.e., using the models selected by the methods or consistently using GTR+I+G), the differences were surprisingly small. That is, the loss in accuracy under JC was negligible compared to the other models, suggesting that, indeed, any model could serve practically as well as the best-fitting one.
Percentage of correct topologies for each of the reconstruction strategies over 7,200 simulated datasets
We began our study expecting to uncover the best Model Selection criterion, but instead we found that this step could be skipped. Researchers invest resources in lengthy statistical computations to select the best-fitting model, believing that the validity of their results relies on this basic step, when they could simply have used the most parameter-rich one. It must be noted that the Model Selection step may still be necessary for some applications. For example, while our results pointed to the futility of Model Selection for topology and ancestral sequence reconstruction, it had some benefit over a predefined model when branch lengths were examined. We hypothesize that when the parameter estimates directly determine the inferred outcome, as when measuring the branch lengths of the phylogeny, choosing a suitable model is beneficial. However, when the model parameters act as nuisance parameters with only an indirect impact, Model Selection may not be necessary and can be avoided by employing the most complex model.