Leaping from frogs to plants - in quest of repeats

Ajinkya, a frog enthusiast, joined IISER Bhopal's doctoral degree program hoping to uncover evolutionary patterns within frog genomes. The repeat-rich genomes of frogs became an impediment. He turned adversity into opportunity by studying the effect of repeats on plant population genomics.
Published in Ecology & Evolution

The yearlong project at the end of the Master of Science program is the first time most students get a chance to dig into a topic and apply what they have learned to solve a problem. During my master's dissertation, I worked with amphibians and became fascinated by their exquisite biology. While reading the literature for my project, I came across several evolutionary studies that used genetic tools to answer questions that could not have been addressed otherwise. My master's project resolved a century-old problem of cryptic species using bioacoustics and genetic data [1].

My fascination with frogs continued when I joined IISER Bhopal's doctoral degree program in evolutionary genomics. The beautiful campus of IISER Bhopal is home to myriad species. While working through the coursework, I observed frogs and other animal taxa in and around the campus, hoping to settle on a study system for my Ph.D. work and find the most feasible options for fieldwork. I decided to work on frogs because many species were abundant in and around the campus (Fig. 1). However, before I could start sampling, I had to look at genome size estimates and plan the amount of data required for assembling their genomes. During the literature review, I realized that the frogs I wanted to study have large genomes. Amphibian genomes are highly challenging to sequence and assemble because they are heavily repeat-rich, which has led to their under-representation in genomic datasets. Generating a good-quality reference genome for these species would require multiple datasets from different sequencing technologies and is fairly expensive. Advances in sequencing technologies and genome assembly strategies make it easier to resolve repeat regions, but the conquest of repeats is still a long haul. Hence, I decided to work on the impediment itself and its effects on population genomic analyses, selecting another set of repeat-rich genomes to work on - plants.

Fig. 1: Herpetofauna of IISER Bhopal (Photos: Ajinkya).

Top to bottom row; left to right: Minervarya syhadrensis, Minervarya agricola (juvenile), Sphaerotheca sp., Sphaerotheca sp., Euphlyctis cyanophlyctis, Uperodon globulosus, Hoplobatrachus tigerinus, Polypedates maculatus, Chamaeleo zeylanicus, Minervarya agricola, Euphlyctis cyanophlyctis, Microhyla ornata.
 

It was the festival of lights (Diwali) in Bhopal, but I was more worried than joyful, wondering which plant would be most suitable for my Ph.D. Diwali is a time for lighting lamps and celebrating the triumph of good over evil. A friend from Assam told me how they use the seeds of the plant Nagakesar (Mesua ferrea) as lamps. Intrigued, I searched the literature on this plant and found that it has many beneficial properties. Mesua ferrea is a typical forest tree species with a long generation time and a long senescence period. Its timber is unusually dense and sinks in water, which is why it is known as Ironwood in the Indian subcontinent. Beyond the traditional Indian system of medicine (Ayurveda), many pharmacological studies also ascribe medicinal properties to this tree that are useful in multiple ailments. All these properties in a single forest plant warranted our attention and encouraged us to sequence it. We were able to sequence and assemble a high-quality draft genome for this plant [2]. Initial exploration of the demographic history showed that this plant had undergone extensive bottleneck events since the Mid-Pleistocene glaciations. Further investigation and fine-tuning of the various parameters of PSMC (Pairwise Sequentially Markovian Coalescent) revealed unusual behavior in the trajectories. As the -t parameter of PSMC (maximum time to the most recent common ancestor, i.e., TMRCA) increased, the trajectories could infer estimates of effective population size (Ne) for older time points (Fig. 2).
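A parameter sweep like this one is easy to script. The snippet below is only a minimal sketch, not the commands used in the study: it builds one psmc command line per -t value using the program's -t, -p and -o options. The input file name (mesua.psmcfa) and the output prefix are hypothetical placeholders, and all other PSMC options are left at their defaults.

```python
import shlex


def psmc_commands(psmcfa, t_values, pattern, prefix="run"):
    """Build one psmc command line per maximum-TMRCA (-t) value.

    psmcfa, prefix and the output file names are illustrative placeholders.
    """
    commands = []
    for t in t_values:
        out = f"{prefix}_t{t}.psmc"
        args = ["psmc", "-t", str(t), "-p", pattern, "-o", out, psmcfa]
        # Quote each argument so the '*' in the pattern is not expanded by the shell.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical input file; -t values and -p pattern as reported for Fig. 2.
for command in psmc_commands("mesua.psmcfa", [35, 45, 55, 65], "3*2+1*10+15*2+14+4"):
    print(command)
```

Keeping the -p pattern fixed while only -t changes makes it easier to attribute any difference between the trajectories to the maximum TMRCA setting alone.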

Fig. 2: Demographic inference of Mesua ferrea by PSMC and effect of different values of maximum TMRCA.

PSMC-inferred trajectories with the same -p parameter (3*2+1*10+15*2+14+4) but several values of the maximum TMRCA parameter (-t): 35 (cyan), 45 (blue), 55 (green) and 65 (brown). For -t 500 (red), the -p pattern “4+25*2+4+6” was used, but this run did not have a sufficient number of recombination events in some of the last atomic intervals. The demographic scenario shows a steep decline in Ne after the MPT (Mid-Pleistocene transition), i.e. ∼700 KYA, followed by a second bottleneck during the LGM (Last Glacial Maximum) of the last glacial period, i.e. ∼30 KYA. (Code available: https://github.com/Ajinkya-IISERB/CoalRep/blob/main/blog/code/plot1.r)
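The -p patterns in this caption are compact, so it can help to expand them. The sketch below assumes the convention described in the PSMC documentation: a term "n*m" stands for n free parameters that each span m atomic intervals, and a bare number "k" stands for one free parameter spanning k atomic intervals. The function names are my own and are not part of PSMC.

```python
def parse_psmc_pattern(pattern):
    """Expand a -p pattern into a list with one entry per free parameter.

    Each entry is the number of atomic intervals that parameter spans.
    """
    groups = []
    for term in pattern.split("+"):
        if "*" in term:
            repeats, width = (int(x) for x in term.split("*"))
            groups.extend([width] * repeats)
        else:
            groups.append(int(term))
    return groups


def summarize_pattern(pattern):
    """Return the total atomic intervals and free parameters of a pattern."""
    groups = parse_psmc_pattern(pattern)
    return {"atomic_intervals": sum(groups), "free_parameters": len(groups)}


for p in ("3*2+1*10+15*2+14+4", "4+25*2+4+6"):
    print(p, summarize_pattern(p))
```

Under this reading, both patterns cover 64 atomic intervals; they differ in how those intervals are grouped, with 21 free parameters for the first pattern and 28 for the second.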

How to optimize the parameters of software programs is generally described in user manuals but tends to be missing from the scientific literature. Results, however, can be influenced by parameter settings, so these settings need careful attention. Hence, to better understand how increasing the value of the -t parameter allows inferences at older time points, we extracted the genomic regions contributing to each atomic interval across a continuous stretch of the genome in each PSMC run. The distribution of atomic intervals along scaffold 1281 changed as the -t parameter increased, demonstrating the importance of appropriate parameter settings when using PSMC (Fig. 3).

Fig. 3: Distribution of atomic intervals across scaffold1281 for different maximum TMRCA values for Mesua ferrea PSMC.

For each PSMC run with a different -t value, the genomic regions along this scaffold obtained from PSMC decoding are shown with their corresponding atomic intervals. The atomic intervals that spanned scaffold1281 are shown with their respective colors. The callability of bases in these regions is shown to highlight the quality of the variants identified; heterozygosity is shown to demarcate hypervariable regions. The same genomic coordinates move from older atomic intervals (AIs) to more recent ones, which hints at a redistribution of the positions of atomic intervals as the maximum TMRCA parameter changes. (Code available: https://github.com/Ajinkya-IISERB/CoalRep/blob/main/blog/code/plot2.r)
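One way to build intuition for this redistribution is to look at how PSMC discretizes time. The sketch below uses the interval boundaries given in the original PSMC paper (Li & Durbin, 2011), t_i = 0.1 * (exp((i / n) * ln(1 + 10 * Tmax)) - 1) for i = 0..n, in units of 2N0. Treating n as the number of atomic intervals minus one is my assumption, and the function names are illustrative. This only shows the time grid; the actual assignment of genomic segments comes from the HMM decoding, which this snippet does not reproduce.

```python
import math
from bisect import bisect_right


def psmc_time_boundaries(t_max, n_intervals=64):
    """Lower boundaries of the atomic intervals, in units of 2*N0."""
    n = n_intervals - 1  # assumption: the last interval is open-ended
    return [
        0.1 * (math.exp(i / n * math.log(1 + 10 * t_max)) - 1)
        for i in range(n + 1)
    ]


def interval_index(coalescent_time, boundaries):
    """Index of the atomic interval that contains a given coalescent time."""
    return bisect_right(boundaries, coalescent_time) - 1


for t_max in (35, 45, 55, 65):
    bounds = psmc_time_boundaries(t_max)
    # The same scaled time (1.0) is looked up on each grid.
    print(t_max, interval_index(1.0, bounds))
```

Because the boundaries are spaced on a log scale up to Tmax, raising -t stretches the whole grid, so a fixed coalescent time lands in a lower-numbered (more recent) atomic interval. That is at least consistent with the shift seen in Fig. 3.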

We hypothesized that other trees from a similar habitat should have experienced similar demographic events, which may be related to large-scale changes in past climate. To test the hypothesis that events such as the Mid-Pleistocene glaciations had a major impact on all tropical forest tree species, we compiled genomic datasets of 14 other forest tree species and compared their demographic histories. During this study, we came across Populus trichocarpa, which has a higher repeat content than most other plants. In species with such high repeat content, simply masking repeats means that demographic inferences are drawn from a small fraction of the genome. However, using less than 70% of the genome has previously been shown to produce incorrect results. Hence, we decided to evaluate how the inclusion or exclusion of repeats changes the inferred demographic trajectory (Fig. 4). Populus trichocarpa has a high-quality genome assembly, high repeat content, and abundant genomic resources, making it ideal for addressing this question. Repeat-masked and unmasked genomic data resulted in different demographic trajectories. We also included a high-quality human dataset to delve further into the effect of individual repeat abundances and method-specific changes. After all these analyses, we concluded that the extent of the effect of specific repeats differs between species and affects their inferred demographic histories differently at different time points.
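A quick check of how much of the genome survives repeat masking can flag this problem early. The sketch below is not part of the CoalRep pipeline. It assumes a soft-masked sequence in which repeats are lowercase and hard-masked or unknown bases are N, and it compares the unmasked fraction against the 70% figure mentioned above. The function names and the toy sequence are illustrative only.

```python
def unmasked_fraction(sequence):
    """Fraction of bases that are uppercase A/C/G/T (neither soft- nor hard-masked)."""
    total = len(sequence)
    if total == 0:
        return 0.0
    kept = sum(1 for base in sequence if base in "ACGT")
    return kept / total


def enough_genome_retained(sequence, minimum=0.70):
    """True if the unmasked fraction reaches the chosen minimum (70% by default)."""
    return unmasked_fraction(sequence) >= minimum


# Toy sequence: 4 unmasked bases, 4 soft-masked repeat bases, 2 hard-masked bases.
toy = "ACGTacgtNN"
print(unmasked_fraction(toy), enough_genome_retained(toy))
```

For a real assembly, the same count would be accumulated over all scaffolds of the masked FASTA file rather than over a single string.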

Fig. 4: Schematic of workflow.

Hence, we recommend performing an extensive de novo repeat annotation and evaluating the effect of repeat regions on demographic analyses. We have developed a handy pipeline for performing all these steps, and it is available at https://github.com/Ajinkya-IISERB/CoalRep.

Check out the full paper [3] to learn how repetitive genomic regions affect the inference of demographic history.

References:

1. Phuge, S. et al. Importance of genetic data in resolving cryptic species: A century old problem of understanding the distribution of Minervarya syhadrensis Annandale 1919, (Anura: Dicroglossidae). Zootaxa 4869, 451–492 (2020).

2. Patil, A. B. et al. The genome sequence of Mesua ferrea and comparative demographic histories of forest trees. Gene 769, 145214 (2021).

3. Patil, A. B. & Vijay, N. Repetitive genomic regions and the inference of demographic history. Heredity (2021).
