Dinosaur bones, the great pyramids, and the diamonds and oil hidden deep in the earth are all remains of an earlier life. However, we do not need to spend days sieving through dirt and sand, travel to faraway lands, or journey to the centre of the earth to find ancient relics. Every cell in our body carries an abundance of molecular fossils hidden in its DNA: the pseudogenes. As with most relics, pseudogenes have long been considered “junk”. However, with the development of new genome sequencing technologies and the advent of big data science, their key role in understanding the evolution of life has become apparent, and they are gaining their rightful recognition from the wider scientific community.
The first pseudogene was discovered in 1977 in the African clawed frog by Jacq, Miller and Brownlee, who described it as a non-functional DNA sequence that was an almost perfect copy of a functional 5S RNA gene [1]. Fast forward forty-or-so years, and pseudogenes have been found in every branch of the tree of life. They are formed when a gene acquires a loss-of-function mutation that becomes fixed in the population, when a gene is duplicated and one of the copies is disabled and rendered non-functional, or when a gene’s mRNA product is reverse-transcribed and inserted back into the genome. Thus, they have long been regarded as genomic fossils: remnants of disabled genes or gene copies scattered throughout the genome.
Through their formation, degradation and potential for revival, pseudogenes are part of a complex biological cycle (see Figure 1). In most cases, acquiring a disabling mutation is synonymous with a death sentence: as soon as they are formed, new pseudogenes are prone to decay and are slowly removed from the genome. However, some retain a number of features commonly associated with functional protein-coding genes, allowing their transcripts and RNA products to contribute to genome activity by entering the regulatory circuit. And, through the same evolutionary chance, some pseudogenes acquire new mutations that revert the initial disablement and even provide new functions to the emerging proto-genes, giving a once-extinct gene a new chance at life.
For over two decades, our group has used pseudogenes as ideal models to study and understand various aspects of genome evolution. Our earlier work focused on developing computational tools, such as PseudoPipe [2], for automating pseudogene identification and characterisation, and on annotating and comparing the pseudogene complements of a myriad of organisms (e.g. yeast, worm, fly, chimp, gorilla and human, to name just a few). Thus, we were excited when the development of strain-specific genome assemblies for a large number of mice, ranging from Gairdner's shrewmouse (Mus pahari) and the Algerian mouse (Mus spretus) to the common European house mouse (Mus musculus domesticus) and classical laboratory inbred mice, enabled us to annotate and analyse pseudogenes in such an important collection of organisms. The mouse is one of the most widely researched model organisms and has frequently been used to study human diseases, owing to its experimental tractability and the similarity of its genetic makeup to that of humans.
Tackling this project was no mean feat, as pseudogene annotation requires high-quality genome assemblies and protein-coding gene annotations as inputs. To obtain reliable results, we started by mapping manually and computationally curated pseudogene annotations from the mouse reference genome onto each of the available strains. This allowed us to build a pan-genome map of pseudogenes and to evaluate their patterns of formation and degradation across the various strains. We learned that, although mice and humans are separated by 80 million years of evolution, the pseudogene repertoire in mice is similar to that in humans in terms of size, biotype distribution, and family composition (e.g. GAPDH and ribosomal proteins are the largest pseudogene families). However, some notable differences also emerged. For example, by analysing the mobile element content, we found that the pseudogene pool in mice is continuously refreshed through multiple successive retrotransposition bursts. By contrast, the human pseudogene complement is defined by a single such event, which took place 40 million years ago at the dawn of the primate lineage and created the majority of pseudogenes found in the human genome today.
This result made us realise that, while we commonly observe genomes as static snapshots, we can take advantage of the pan-genome map of pseudogenes in such closely related organisms to capture evolutionary dynamics in real time. To this end, we looked at the conservation of pseudogene loci. We found that, following the Gairdner's shrewmouse and Ryukyu mouse speciations, the Mus taxa underwent numerous genome remodelling processes, as indicated by a very low level of pseudogene locus retention in these species compared with that observed for the classical laboratory strains. Moreover, we were able to correlate the speciation times with the number of conserved pseudogene loci, giving us a quantifiable metric of genome evolution.
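As a rough illustration of what measuring locus retention could look like, here is a minimal Python sketch. It assumes each strain is represented by the set of reference pseudogene identifiers that could be found at a conserved locus in that strain. The strain labels, identifiers and the `retention` helper are hypothetical placeholders, not data or code from the study.

```python
def retention(reference_loci, strain_loci):
    """Return the fraction of reference pseudogene loci also present in a strain."""
    if not reference_loci:
        raise ValueError("reference set is empty")
    return len(reference_loci & strain_loci) / len(reference_loci)


# Toy, made-up identifiers standing in for annotated pseudogene loci.
reference = {"pg1", "pg2", "pg3", "pg4"}
strains = {
    "lab_strain": {"pg1", "pg2", "pg3"},   # hypothetical closely related strain
    "distant_species": {"pg1"},            # hypothetical distantly related species
}

# Retention fraction per strain, relative to the reference annotation.
fractions = {name: retention(reference, loci) for name, loci in strains.items()}
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

Pairing such per-strain fractions with divergence-time estimates is the kind of comparison that would allow retention to be related to speciation time.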
Finally, we set out to determine whether pseudogenes are in any way related to the large phenotypic diversity observed in mice. Across all mouse genomes analysed, transcriptional and functional analysis showed that conserved pseudogenes are enriched in housekeeping functions. However, we also identified strain-specific functional annotations. For example, the New Zealand Obese mouse (as the name suggests, an obesity-prone strain) was characterised by an enrichment in pseudogenes associated with defensin, a potential obesity biomarker [3].
In summary, all our analyses point to pseudogenes as ideal markers of genome remodelling processes.
The full study can be freely accessed here.
1. Jacq, C., Miller, J. R., & Brownlee, G. G. A pseudogene structure in 5S DNA of Xenopus laevis. Cell 12, 109-20 (1977).
2. Zhang, Z., Carriero, N., Zheng, D., Karro, J., Harrison, P. M., & Gerstein, M. PseudoPipe: an automated pseudogene identification pipeline. Bioinformatics 22, 1437-9 (2006).
3. Prats-Puig, A., Gispert-Saüch, M., Carreras-Badosa, G., Osiniri, I., Soriano-Rodríguez, P., Planella-Colomer, M., de Zegher, F., Ibáñez, L., Bassols, J., & López-Bermejo, A. α-Defensins and bactericidal/permeability-increasing protein as new markers of childhood obesity. Pediatric Obesity 2, e10-e13 (2016).