Hot topics close

The evolutionary drivers and correlates of viral host jumps

The evolutionary drivers and correlates of viral host jumps
Analysis of publicly available viral genomes shows that humans may give more viruses to animals than they give to us, and reveals evolutionary mechanisms underpinning viral host jumps.

An incomplete picture of global vertebrate viral diversity

Global genomic surveillance of viruses from different hosts is key to preparing for emerging and re-emerging infectious diseases in humans and animals13,16. To identify the scope of viral genomic data collected thus far, we downloaded the metadata of all viral sequences hosted on NCBI Virus (n = 11,645,803; accessed 22 July 2023; Supplementary Data 1). Most (68%) of these sequences were associated with SARS-CoV-2, reflecting the intense sequencing efforts during the COVID-19 pandemic. In addition, of these sequences, 93.6%, 3.3%, 1.5%, 1.1% and 0.6% were of viruses with single-stranded (ss)RNA, double-stranded (ds)DNA, dsRNA, ssDNA and unspecified genome compositions, respectively. The dominance of ssRNA viruses is not entirely explained by the high number of SARS-CoV-2 genomes, as ssRNA viruses still represent 80% of all viral genomes if SARS-CoV-2 is discounted.

Vertebrate-associated viral sequences represent 93% of this dataset, of which 93% were human associated. The next four most-sequenced viruses are associated with domestic animals (Sus, Gallus, Bos and Anas) and, after excluding SARS-CoV-2, represent 15% of vertebrate viral sequences, while viruses isolated from the remaining vertebrate genera occupy a mere 9% (Fig. 1a and Extended Data Fig. 1a), highlighting the human-centric nature of viral genomic surveillance. Further, only a limited number of non-human vertebrate families have at least ten associated viral genome sequences deposited (Fig. 1b), reinforcing the fact that a substantial proportion of viral diversity in vertebrates remains uncharacterized. Viral sequences obtained from non-human vertebrates thus far also display a strong geographic bias, with most samples collected from the United States of America and China, whereas countries in Africa, Central Asia, South America and Eastern Europe are highly underrepresented (Fig. 1c). This geographical bias varies among the four most-sequenced non-human host genera Sus, Gallus, Anas and Bos (Extended Data Fig. 1b). Finally, the user-submitted host metadata associated with viral sequences, which is key to understanding global trends in the evolution and spread of viruses in wildlife, remains poor, with 45% and 37% of non-human viral sequences having no associated host information provided at the genus level, or sample collection year, respectively. The proportion of missing metadata also varies extensively between viral families and between countries (Extended Data Fig. 2). Overall, these results highlight the massive gaps in the genomic surveillance of viruses in wildlife globally and the need for more conscientious reporting of sample metadata.

Fig. 1: Current state of the global genomic surveillance of vertebrate viruses.
figure 1

a, Proportion of non-SARS-CoV-2, vertebrate-associated viral sequences deposited in public sequence databases (n = 2,874,732), stratified by host. Viral sequences associated with humans and the next four most-sampled vertebrate hosts are shown. Sequences with no host metadata resolved at the genus level are denoted as ‘missing’. b, Proportion of host families represented by at least 10 associated viral sequences for the five major vertebrate host groups. c, Global heat map of sequencing effort, generated from all viral sequences deposited in public sequence databases that are not associated with human hosts (n = 1,599,672). d, Number of vertebrate viral species on NCBI Virus used for the genomic analyses in this study, stratified by viral family. The 32 vertebrate-associated viral families considered in this study are shown and the remaining 21 families that were not considered are denoted as ‘others’.

Full size image

Humans give more viruses to animals than they do to us

To investigate the relative frequency of anthroponotic and zoonotic host jumps, we retrieved 58,657 quality-controlled viral genomes spanning 32 viral families, associated with 62 vertebrate host orders and representing 24% of all vertebrate viral species on NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/) (Fig. 1d). We found that the user-submitted species identifiers of these viral genomes are poorly ascribed, with only 37% of species names consistent with those in the ICTV viral taxonomy20. In addition, the genetic diversity represented by different viral species is highly variable since they are conventionally defined on the basis of the genetic, phenotypic and ecological attributes of viruses18. Thus, we implemented a species-agnostic approach based on network theory to define ‘viral cliques’ that represent discrete taxonomic units with similar degrees of genetic diversity, similar to the concept of operational taxonomic units21 (Fig. 2a and Methods). A similar approach was previously shown to effectively partition the genomic diversity of plasmids in a biologically relevant manner22. Using this approach, we identified 5,128 viral cliques across the 32 viral families that were highly concordant with ICTV-defined species (median adjusted Rand index, ARI = 83%; adjusted mutual information, AMI = 75%) and of which 95% were monophyletic (Fig. 2a). Some clique assignments aggregated multiple viral species identifiers, while others disaggregated species into multiple cliques (Fig. 2b; clique assignments for Coronaviridae illustrated in Extended Data Fig. 3). Despite the human-centric nature of genomic surveillance, viral cliques involving only animals represent 62% of all cliques, highlighting the extensive diversity of animal viruses in the global viral-sharing network (Extended Data Fig. 4a).

Fig. 2: Taxonomy-agnostic approach for identifying equivalent units of viral diversity.
figure 2

a, Workflow for taxonomy-agnostic clique assignments. Briefly, the alignment-free Mash53 distances between complete viral genomes in each viral family are computed and dense networks where nodes and edges representing viral genomes and the pairwise Mash distances, respectively, are constructed. From these networks, edges representing Mash distances >0.15 are removed to produce sparse networks, on which the community-detection algorithm, Infomap54, is applied to identify viral cliques. Concordance with the ICTV taxonomy was assessed using ARI and AMI. b, Sparse networks of representative viral cliques identified within the Coronaviridae (ssRNA), Picobirnaviridae (dsRNA), Genomoviridae (ssDNA) and Adenoviridae (dsDNA). Some viral clique assignments aggregated multiple viral species, while others disaggregated species into multiple cliques. Nodes, node shapes and edges represent individual genomes, their associated host and their pairwise Mash distances, respectively. The list of viral families considered in our analysis are shown on the bottom-left corner of each panel. Silhouettes were sourced from Flaticon.com and Adobe Stock Images (https://stock.adobe.com) with a standard licence.

Full size image

We then identified putative host jumps within these viral cliques by producing curated whole-genome alignments to which we applied maximum-likelihood phylogenetic reconstruction. For segmented viruses, we instead used single-gene alignments as the high frequency of reassortment23 precludes robust phylogenetic reconstruction using whole genomes. Phylogenetic trees were rooted with suitable outgroups identified using metrics of alignment-free distances (see Methods). We subsequently reconstructed the host states of all ancestral nodes in each tree, allowing us to determine the most probable direction of a host jump for each viral sequence (approach illustrated in Fig. 3a). To minimize the uncertainty in the ancestral reconstructions, we considered only host jumps where the likelihood of the ancestral host state was twofold higher than alternative host states (Fig. 3a and Supplementary Methods). Varying the stringency of this likelihood threshold yielded highly consistent results (Extended Data Fig. 5a), indicating that the inferred host jumps are robust to our choice of threshold. In total, we identified 12,676 viral lineages comprising 2,904 putative vertebrate host jumps across 174 of these viral cliques.

Fig. 3: Humans give more viruses to animals than they give to us.
figure 3

a, Illustration of ancestral host reconstruction approach used to infer the directionality of putative host jumps. Putative host jumps are identified if the ancestral host state has a twofold higher likelihood than alternative host states. The mutational distance (substitutions per site) represents the sum of the branch lengths between the tip sequence and the ancestral node for which the first host state transition occurred in a tip-to-root traverse. b, Number of distinct putative host jumps involving humans across all viral families considered (n = 32). Black dots represent the observed point estimates for each type of host jump. The violin plots show the bootstrap distributions of these estimates, where the host jumps within each viral clique were resampled with replacement for 1,000 iterations. Black lines show the 95% confidence intervals associated with these bootstrap distributions. Silhouettes were sourced from Flaticon.com and Adobe Stock Images (https://stock.adobe.com) with a standard licence. A two-tailed paired t-test was performed to test for a difference in the zoonotic and anthroponotic bootstrap distributions.

Full size image

Among the putative host jumps inferred to involve human hosts (599/2,904; 21%), we found a much higher frequency of anthroponotic compared with zoonotic host jumps (64% vs 36%, respectively; Fig. 3b). This finding was statistically significant as assessed via a bootstrap paired t-test (t = 227, d.f. = 999, P P = 0.035; see Methods). In addition, this result was robust to our choice of likelihood thresholds used during ancestral reconstruction (Extended Data Fig. 5b), the tree depth at which the host jump was identified (Extended Data Fig. 5c), and to sampling bias (Supplementary Notes and Fig. 1). The highest number of anthroponotic jumps was contributed by the cliques representing SARS-CoV-2 (132/383; 34%), MERS-CoV (39/383; 10%) and influenza A (37/383; 10%). This is concordant with the repeated independent anthroponotic spillovers into farmed, captive and wild animals described for SARS-CoV-2 (refs. 13,24,25,26,27) and influenza A28,29. Meanwhile, there has only been circumstantial evidence for human-to-camel transmission of MERS-CoV30,31,32. Noting the disproportionate number of anthroponotic jumps contributed by these viral cliques, we reperformed the analysis without them and found a significantly higher frequency of anthroponotic than zoonotic jumps (53.5% vs 46.5%; bootstrap paired t-test, t = 40, d.f. = 999, P 5d), indicating that this finding is generalizable across most viruses. Overall, our results highlight the high but largely underappreciated frequency of anthroponotic jumps among vertebrate viruses.

Host jumps of multihost viruses require fewer adaptations

Before jumping to a new host, a virus in its natural reservoir may fortuitously acquire pre-adaptive mutations that facilitate its transition to a new host. This may be followed by the further acquisition of adaptive mutations as the virus adapts to its new host environment16.

For each host jump inferred, we estimated the extent of both pre-jump and post-jump adaptations through the sum of branch lengths from the observed tip to the ancestral node where the host transition occurred (Fig. 3a). However, in practice, the degree of adaptation inferred may vary on the basis of different factors, including sampling intensity and the time interval between when the host jump occurred and when the virus was isolated from its new host. As such, for each viral clique, we considered only the minimum mutational distance associated with a host jump.

We first examined whether the minimum mutational distance associated with a host jump for each viral clique was higher than the minimum for a random selection of viral lineages not involved in host jumps (Fig. 3a and Methods). Indeed, the minimum mutational distance for a putative host jump within each clique was significantly higher than that for non-host jumps (Fig. 4a; two-tailed Mann–Whitney U-test, U = 6,767, P host jump = 1.31; two-tailed Z-test for slope = 0, Z = 6.58, d.f. = 289, P

Fig. 4: The strength of adaptative signals associated with host jumps decreases with broader viral host ranges.
figure 4

a,b, Distributions (Gaussian kernel densities and boxplots) of (a) minimum mutational distance and (b) minimum dN/dS for inferred host jump events and non-host jump controls on the logarithmic scale. Differences in distributions were assessed using two-sided Mann–Whitney U-tests. c,d, Scatterplots of the (c) minimum mutational distance and (d) minimum dN/dS for host jump and non-host jumps. Lines represent univariate linear regression smooths fitted on the data. We corrected for the effects of sequencing effort and viral family membership using Poisson regression models. The parameter estimates in these Poisson models and their statistical significance, as assessed using two-tailed Z-tests, after performing these corrections are annotated. For all panels, each data point represents the minimum distance or minimum dN/dS across all host jump or randomly selected non-host jump lineages in a single clique. Boxplot elements are defined as follows: centre line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range.

Full size image

We then considered the commonly used measure of directional selection acting on genomes, the ratio of non-synonymous mutations per non-synonymous site (dN) to the number of synonymous mutations per synonymous site (dS). Comparing the minimum dN/dS for host jumps within each clique, we observed that minimum dN/dS was also significantly higher for host jumps compared with non-host jumps (Fig. 4b; ORhost jump = 2.39; Z = 4.84, d.f. = 263, P F(1,528) = 2.23, P = 0.136) or dN/dS estimates (F(1,338) = 1.66, P = 0.198) between zoonotic and anthroponotic jumps, or between forward and reverse cross-species jumps (mutational distance: F(1,1588) = 0.538, P = 0.463; dN/dS: F(1,1168) = 0.0311, P = 0.860), indicating that there are no direction-specific biases in these measures of adaptation. Overall, these results are consistent with the hypothesized heightened selection following a change in host environment and additionally provide confidence in our ancestral-state reconstruction method for assigning host jump status.

However, the extent of adaptive change required for a viral host jump may vary. For instance, some zoonotic viruses may require minimal adaptation to infect new hosts while in other cases, more substantial genetic changes might be necessary for the virus to overcome barriers that prevent efficient infection or transmission in the new host. We therefore tested the hypothesis that the strength of selection associated with a host jump decreases for viruses that tend to have broader host ranges. To do so, we compared the minimum mutational distance between ancestral and observed host states to the number of host genera sampled for each viral clique. We found that the observed host range for each viral clique is positively associated with greater sequencing intensity (that is, the number of viral genomes in each clique; Pearson’s r = 0.486; two-tailed t-test for r = 0, t = 34.9, d.f. = 3,932, P 2,3,8. After correcting for both sequencing effort and viral family membership, we found that the mutational distance for host jumps tends to decrease with broader host ranges (Poisson regression, slope = −0.113; two-tailed Z-test for slope = 0, Z = −9.40, d.f. = 129, P Z = 7.16, d.f. = 127, P 4c). Similarly, the minimum dN/dS for a host jump decreases more substantially for viral cliques with broader host ranges (slope = −0.427; Z = −9.18, d.f. = 116, P Z = 3.08, d.f. = 116, P 4d). These trends in mutational distance and dN/dS were consistent when the same analysis was performed for ssDNA, dsDNA, +ssRNA and −ssRNA viruses separately (Extended Data Fig. 6). These results indicate that, on average, ‘generalist’ multihost viruses experience lower degrees of adaptation when jumping into new vertebrate hosts.

Host jump adaptations are gene and family specific

We next examined whether genes with different established functions displayed distinctive patterns of adaptive evolution linked to host jump events. Since gene function remains poorly characterized in the large and complex genomes of dsDNA viruses, we focused on the shorter ssRNA and ssDNA viral families. We selected for analysis the four non-segmented viral families with the greatest number of host jump lineages in our dataset: Coronaviridae (+ssRNA; n = 2,537), Rhabdoviridae (−ssRNA; n = 1,097), Paramyxoviridae (−ssRNA; n = 787) and Circoviridae (ssDNA; n = 695). For these viral families, we extracted all annotated protein-coding regions from their genomes and categorized them as either being associated with cell entry (termed ‘entry’), viral replication (‘replication-associated’) or virion formation (‘structural’), and classifying the remaining genes as ‘auxiliary’ genes.

For the Coronaviridae, Paramyxoviridae and Rhabdoviridae, the entry genes encode surface glycoproteins that could also be considered structural but were not categorized as such given their important role in mediating cell entry. The capsid gene of circoviruses, however, encodes the sole structural protein that is also the key mediator of cell entry and was therefore categorized as structural. To estimate putative signatures of adaptation in relation to lineages that have experienced host jumps for the different gene categories, we modelled the change in log10(dN/dS) in host jumps versus non-host jumps using a linear model, while correcting for the effects of clique membership (see Methods). Contrary to our expectation that entry genes would generally be under the strongest adaptive pressures during a host jump, we found that the strength of adaptation signals for each gene category varied by family. Indeed, the strongest signals were observed for structural proteins in coronaviruses (effect = 0.375, two-tailed t-test for difference in parameter estimates, t = 4.35, d.f. = 10,121, P t = 2.15, d.f. = 4,225, P = 0.02) (Fig. 5). Meanwhile, no significant adaptive signals were observed in the entry genes of all families (minimum P = 0.3), except for the capsid gene in circoviruses (effect = 0.325, t = 2.68, d.f. = 1,367, P = 0.004) (Fig. 5). These findings suggest that selective pressures acting on viral genomes in relation to host jumps are likely to differ by gene function and viral family.

Fig. 5: Signals of adaptation are gene and family specific.
figure 5

The strength of adaptation signals in genes associated with host jump and non-host jump lineages were estimated using linear models for Coronaviridae (n = 10,129), Paramyxoviridae (n = 4,233), Rhabdoviridae (n = 3,321), and Circoviridae (n = 1,373). We modelled the effects of gene type and host jump status on log(dN/dS) while correcting for viral clique membership and, for each gene type, inferred the strength of adaptive signal (denoted ‘effect’) as the difference in parameter estimates for host jumps versus non-host jumps. Points and lines represent the parameter estimates and their standard errors, respectively. Differences in parameter estimates were tested against zero using a one-tailed t-test. Subpanels for each gene type were ordered from left to right with increasing effect estimates.

Full size image

Given the lack of adaptive signals in the entry proteins, we further hypothesized that within each gene, adaptative changes are likely to be localized to regions of functional importance and/or that are under relatively stronger selective pressures exerted by host immunity. To test this, we focused on the spike gene (entry) of viral cliques within the Coronaviridae since the key region involved in viral entry is well characterized (that is, the receptor-binding domain (RBD))33. We found that dN/dS estimates consistent with adaptive evolution were indeed localized to the RBDs, but also to the N-terminal domains (NTD), of SARS-CoV-2 (genus Betacoronavirus), avian infectious bronchitis virus (IBV; Gammacoronavirus) and MERS (genus Alphacoronavirus) (Extended Data Fig. 7). This is consistent with the strong immune pressures exerted on these regions of the spike protein34,35 and the central role of the RBD in host-cell recognition and entry36,37,38. Overall, our results indicate that the extent of adaptation associated with a host jump likely varies by gene function, gene region and viral family.

Similar news
News Archive
  • Volvo Cars
    Volvo Cars
    Winter testing: Pushing a Volvo EX30 to the limit in Swedish Lapland
    21 Jul 2024
    28
  • Avaí FC
    Avaí FC
    Avai FC SC vs Santos FC live score 07-11-19 football
    7 Nov 2019
    2
  • Epl table
    Epl table
    Premier League Winners and Losers After Wednesday's Updated Week 21 EPL Table
    2 Jan 2020
    3
  • Software as a service
    Software as a service
    USD 72.2 Billion Network-as-a-service Market Expected to Reach by 2031 | Top Players such as - Cloudflare, NTT ...
    28 Oct 2023
    6
  • Mark McGhee
    Mark McGhee
    Gordon Strachan to help pick next Dundee manager as Mark McGhee heavily linked in search for new boss
    17 Feb 2022
    2