High G+C Content of Herpes Simplex Virus DNA: Proposed Role in Protection Against Retrotransposon Insertion
Jay C Brown*
Identifiers and Pagination:Year: 2007
First Page: 33
Last Page: 42
Publisher ID: TOBIOCJ-1-33
Article History:Received Date: 30/10/2007
Revision Received Date: 15/11/2007
Acceptance Date: 20/11/2007
Electronic publication date: 4/12/2007
Collection year: 2007
open-access license: This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http: //creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.
The herpes simplex virus dsDNA genome is distinguished by an unusually high G+C nucleotide content. HSV-1 and HSV-2, for instance, have GC contents of 68% and 70% respectively, while that of the host (human) genome is 41%. To determine how GC content varies with genome location, GC content was measured separately in coding and intergenic regions of HSV-1 DNA. The results showed that the 75 genes constitute a uniform population with a mean GC content of 66.9 ± 4.1%. In contrast, intergenic regions were found in two non-overlapping populations, one with a mean GC content (69.3 ± 4.6% n=32) similar to the coding regions and another where the GC content is lower (56.0 ± 4.9 n=30). Compared to other regions of the genome, intergenic regions with reduced GC content were found to be enriched in local GC minima, CACACA sequences and a primary target sequence (TTAAAA) for retrotransposition events. The results are interpreted to suggest that a high GC content is part of the way HSV-1 protects its genes from invasion by mobile genetic elements active during cell differentiation in the nervous system.
The basic features of the herpes simplex virus (HSV-1) genome are now well known. Each virion contains a single molecule of dsDNA 152 kb in length. A total of 75 genes for known proteins are encoded with 69 of these present in a single copy and three in two copies each [1-3]. The genome is divided into two segments, L and S, which encode 57 (UL) and 12 (US) single copy genes, respectively. The L and S segments are separated by repeated genes, which are also found at the genome ends (Fig. 1). As HSV-1 DNA is replicated, the L and S segments invert with respect to each other yielding four genome “isomers” that are found in equal proportions in wild-type HSV-1 populations [4-6].
Schematic drawing of the HSV-1 genome. The genome is 152,261 bp in length and is divided into two segments, L and S.Each segment is organized with single copy genes (UL and US) in the middle and repeated genes (RL and RS) at the ends.
Among the genes encoded by HSV-1 are 43 ancestral or core genes present in all α-, β- and γ-herpesviruses [3, 7, 8]. All 43 are located in UL and most are conserved genes involved in vital virus functions such as entry of the virus into a host cell, DNA replication, capsid assembly, packaging DNA into the capsid and exit of the capsid from the host cell nucleus. The non-core HSV-1 genes include all US genes and highly divergent genes found at the segment ends . Non-core genes encode proteins involved in lineage- or species-specific functions such as transcriptional transactivation, immune evasion and host cell recognition.
A high GC content is one of the most unusual features of the HSV-1 genome. HSV-1 and HSV-2 have GC contents of 68% and 70%, respectively, while the value for the host (human) genome is 41%. In contrast, 54.4 ± 11.5% is the average GC content for the 44 herpesviruses whose genome sequences have been determined (as of March 2007). Only eight herpesviruses, all α-herpesviruses, have GC contents of 68% or greater. In an effort to understand the reason for the high GC content, I have focused on HSV-1 and examined the extent to which GC content varies with position in the genome. The GC content was determined in the coding regions and in the regions between genes. The results have identified a group of intergenic regions whose GC content is reduced compared to the coding regions and to the genome as a whole.
MATERIALS AND METHODOLOGY
The HSV-1 DNA sequence (NC_001806) was extracted from public databases and manipulated with programs in GCG (version 11.1.3; Accelrys) run on a Dell PowerEdge 1950 server with dual 3 GHz Xeon cpu’s and 10GB of RAM). Composition was used to determine GC contents and FindPatterns+ was used to locate specific sequences. GC content in a sliding 120 bp window was determined with Artemis (Sanger Institute). Local GC minima were identified by visual inspection of the trace with all identified minima at least 10% GC lower than the adjacent background.
GC Content in Coding and Intergenic Regions
Coding and intergenic regions were extracted from the HSV-1 DNA sequence (NC_001806) using the editor function of GCG, and GC contents were determined with Composition. The 75 genes were found to have a mean GC content of 66.9 ± 4.1% with a range of 58.1% (UL1) to 82.9% (RL1; Table 1). Gene GC contents constituted a single population centered at the mean value as shown in Fig. 2 (black bars). The GC contents of gene subgroups such as essential and non-essential genes or core and non-core genes were examined and found not to differ significantly from each other or from the genome as a whole (Table 2).
GC Content of Coding and Intergenic Regions in the HSV-1 Genome
|Coding Region||Intergenic Region (rightward)||Coding Region||Intergenic Region (rightward)|
|Gene||Length (bp)||GC||Length (bp)||%GCa||Gene||Length (bp)||%GC||Length (bp)||%GC|
|RL2 (ex 2)||1604||77.2||5588||67.8||UL37||3373||69.3||448||68.3|
|UL15 (ex 1)||1029||64.3||127||60.6||UL51||735||68.4||38||76.5|
|UL15 (ex 2)||1179||61.5||283||56.5||UL54||1539||69.3||225||60.0|
a Intergenic regions with reduced GC content are highlighted. Others are in the genome-like group
Mean GC Content of Selected HSV-1 Gene Groupsa
|Gene group||Mean %GC ± SD||P valueb|
|Essential||67.3 ± 3.9 (n=34)||0.55|
|Non-essential||66.5 ± 4.3 (n=38)|
|Core||66.7 ± 2.8 (n=43)||0.39|
|Non-core||67.3 ± 5.4 (n=38)|
|Beta kinetic class||65.5 ± 2.8 (n=13)||0.87|
|Gamma kinetic class||66.5 ± 4.3 (n=38)|
|All UL||66.6 ± 3.1 (n=58)||0.91|
|All US||65.2 ± 1.4 (n=12)|
Compared to gene GC contents, intergenic regions were found to have a broader range of values, which extend from 46.3% (UL3-UL4) to 81.5% (UL16-UL17). Intergenic region GC contents were considered to form two non-overlapping populations as shown in Fig. 2 (gray bars). One coincided approximately with the distribution of gene GC contents while the other spanned a range of lower values. Further analysis was done with the two populations, which are identified as the genome-like GC content group (32 regions) and reduced GC group (30 regions), respectively (Table 1). Intergenic regions in the genome-like group had GC contents in the range of 62.1% (UL43)-81.5% (UL16); the range of the reduced GC group was 46.3% (UL3)-61.7% (US1).
Both genome-like and reduced GC intergenic regions are found throughout the HSV-1 genome (Table 1). In each case, however, there are areas of concentration. Reduced GC regions, for instance, occur preferentially at three sites, UL20-UL26, UL35-UL45 and US3-US7 that contain evolutionarily divergent genes (Table 1) . In contrast, genome-like intergenic regions are enriched in two areas, UL16-UL20 and UL27-UL34, where evolutionarily conserved genes are concentrated (Table 1).Genome-like intergenic regions are also enriched between genes found near the genome ends and near the inter-segment junction.
The lengths of intergenic sequences vary from zero (i.e. overlapping genes) to several hundred base pairs (excepting very long intergenic regions that occur near the segment ends). Genome-like and reduced GC intergenic regions differ only slightly in average length, the values being 258.7±195.1 and 230±138.8, respectively. Many reduced GC intergene lengths occur in a single population with a mean of approximately 150 bp, while genome-like lengths are more evenly distributed in the range of 0-600 bp (Fig. 3).
Local GC Minima, CA Repeats and TTAAAA Sites
Inspection of intergenic sequences revealed additional features that distinguish genome-like from reduced GC regions. For instance, measurement of GC content in a sliding window 120 bp in length showed that intergenic regions were often the sites of pronounced local minima in GC content. This feature is illustrated in Fig. 4, which shows the GC content in two regions of the genome. Local GC minima were found to occur, for instance, between UL42-UL43, UL43-UL44, and US5-US6, but not between US6-US7. The presence of local GC minima is indicated for intergenic regions with genome-like and reduced GC content in Tables 3 and 4, respectively. Of the 29 intergenic regions with reduced GC content, 19 were found to be sites of local GC minima while only 8 of 29 genome-like intergenic regions have the same feature.
Intergenic Regions with Genome-Like GC Contenta
|Intergenic Region||Local GC minb||CACACA||TTAAAA|
|RL2 (ex 2)-UL1||Yes||Yes (2)||No|
|UL14-UL15 (ex 1)||No||No||No|
|UL17-UL15 (ex 2)||No||No||No|
|Total per 10,000 bp||4.1||4.1||2.0|
a Intergenic regions with genome-like GC content are those with GC contents in the range of 62.1%-81.5%
b Local GC minima are those identified by visual inspection of the GC content trace as defined in a sliding 120 bp window as shown in Fig. ((4)).
Intergenic Regions with Reduced GC Contentaa
|Intergenic Region||Local GC Minb||CACACA||TTAAAA|
|UL15 (ex 1)-UL16||>No||Yes (2)||No|
|UL15 (ex 2)-UL18||Yes||Yes||No|
|Total per 10,000 bp||28.4||23.9||9.0|
a Intergenic regions with reduced GC content are those with GC contents in the range of 46.3%-61.7%.
b Local GC minima are those identified by visual inspection of the GC content trace as defined in a sliding 120 bp window as shown in Fig. ((4)).
The HSV-1 genome was examined for the frequency of two specific sequence motifs, CA repeats and TTAAAA sites. CA repeats are a common feature of the host (human) genome where they constitute 0.3% of the total sequence . CA repeats occur predominantly as microsatellite DNA, but other functions have also been defined (see below). In the HSV-1 genome the sequence CACACA was taken as an overall measure of the CA repeat frequency. It occurs 79 times, a value in excess of the statistical expectation for a random 68% GC genome (49 see Table 5). Compared to the statistical expectation, CACACA sequences are enriched in intergenic regions; 17 are expected on a statistical basis and 24 are found (Table 5). Further, among the 24 intergenic CACACA sequences, most (16) are found in reduced GC regions (Tables 4 and 5).
CA Repeat and TTAAAA Retrotransposition Insertion Sites in the HSV-1 Genomea
|CACACA (in genes)||39||55|
|CACACA (intergenic regions)||10||24|
|TTAAAA (in genes)||4||18|
|TTAAAA (intergenic regions)||1||10|
a Statistically expected sequence numbers were calculated based on both strands of the 152,261 bp HSV-1 genome with 68% GC content. A 16%; T 16%; G 34%; C 34%. Calculated values were rounded to the nearest whole number. The proportions of gene and intergene regions in the HSV-1 genome were taken as 79% and 21%, respectively [1, 2]
TTAAAA sequences were identified because these are prominent target sites for introduction of novel DNA by retrotransposition [11-14]. Such DNA inserts, including the long interspersed nuclear elements (LINE or L1 sequences) could affect the virus by inactivating it if incorporation was into a required gene. Retrotransposition could also affect the genome by introduction of a novel gene into the HSV-1 genome if incorporation follows reverse transcription of a cellular mRNA [15-18].
Analysis demonstrated that the HSV-1 genome contains a total of 28 TTAAAA sequences, a value greater than the number expected on a statistical basis (Table 5). Eighteen occur in coding regions and 10 in intergenic segments. Of those in intergenic regions, 6 are found in reduced GC and 4 in genome-like sequences (Tables 3 and 4. Of the four present in genome-like intergenic regions, all are in intergenic segments (at the ends of UL and in US) that contain evolutionarily divergent genes .
Intergenic Regions with Reduced GC Content
The high GC content observed for HSV-1 genes was fully expected in light of the high GC content of the genome as a whole and of the RL, RS, UL and US regions . It was remarkable to notice, however, that coding region GC contents occur in a single, relatively narrow distribution (Fig. 2). In view of the diversity of HSV-1 genes, it might have been expected that different populations would be observed corresponding, for instance, to essential compared to non-essential genes, UL compared to US genes, core compared to lineage specific genes or early (beta) compared to late (gamma) genes (see Table 2). The existence of a single population indicates that coding region GC content is affected by factors that apply to all genes in a more or less equal way.
The greater diversity of GC contents in intergenic regions contrasts with the narrow distribution in the genes (Fig. 2). While some intergenic GC contents are similar to the genes and to the genome as a whole, there are others with lower GC contents. These are distributed over a wide range (~46%- ~61%) as shown in Table 1 and Fig. 2. The presence of a reduced GC population of intergenic regions suggests that evolutionary forces constraining other sequences apply differently in this case, and understanding the difference might be revealing about HSV-1 evolution, replication and pathogenesis.
High GC content in genes and the effects of L1 retrotransposition
I suggest the high GC content of genes may be related to the way HSV-1 adapts to retrotransposition events in neurons. HSV-1 enters the host by infecting epithelial cells, but it promptly traffics to neurons in local (trigeminal) ganglia where it establishes a latent infection that most often lasts for the lifetime of the host. The virus can be reactivated from neurons causing recurrent infections that are the hallmark of HSV-1 disease .
Neurons arise during development from progenitor cells that undergo a large-scale proliferation accompanied by generation of extensive cell diversity [20, 21]. Functioning neurons are selected from this highly diverse population that may initially include 1010 or more cell types. Selection is based on the activity of individual neurons. Those that become incorporated into active networks survive while the others are lost. It is estimated that no more than 15%-40% of post-mitotic neurons survive this experience-based selection process [22-23].
Although the mechanisms used for generation of neuronal cell diversity are not thoroughly understood, it is considered that retrotransposition by L1 elements plays a significant role . The human genome contains approximately 105 L1 elements (non-LTR retrotransposons) that together account for ~15% of the genome. Active L1 elements are 5 kb-6 kb in length, but most L1s are inactive due to 5’ deletions, rearrangements or mutations in the open reading frames. Only 80-100 human L1s are considered to be retrotransposition competent, and of these ~10% are considered to be highly active [11, 16].
Two genes, an endonuclease and reverse transcriptase are encoded in L1 retrotransposons, and transposition involves reverse transcription of the L1 mRNA. Reverse transcription takes place at specific sites in the host cell genome where primers are created following single-strand cuts introduced by the L1-encoded endonuclease. L1 elements are introduced at these sites, which occur preferentially at TTAAAA and related AT-rich sequences [11-14]. As a rare event, the L1 reverse transcription machinery can act in trans on cellular mRNAs to introduce genes and processed pseudogenes at L1 retrotransposition sites [17, 18].
In the developing nervous system, L1 retrotransposition can, in principle, contribute to generation of cell diversity by any process in which insertion of exogenous DNA can affect gene expression. Introduction into an exon, for instance, could block protein expression while similar insertion events could affect promoter function, alternative splicing or other processes. Recent studies have demonstrated that L1 retrotransposition can promote other forms of genetic instability by way of the DNA repair machinery . L1 retrotransposition has been observed experimentally in mice, in HeLa cells and in neural progenitor cells in culture [11, 24]. In the latter case, retrotransposition is found to be enhanced following downregulation of Sox2, a presumptive inhibitor of L1 mobility .
It is suggested that a high GC content is a part of the way HSV-1 protects itself from harmful effects that might result from L1 insertion, particularly insertion into genes. A high GC content is expected to minimize the number of TTAAAA and other AT-rich insertion sites present in genes and therefore to protect the genes from L1 retrotransposition events. Such protection would be particularly beneficial in the case of latently infected cells where the number of HSV-1 genomes is reduced compared to lytic infections . The proposed high rate of L1 retrotransposition in neurons makes a high GC content particularly attractive for viruses such as HSV-1 that replicate and establish latent infections in neurons. It is consistent with this idea to note that of the 44 sequenced herpesviruses, all eight with GC contents of 68% or greater have a tropism for the nervous system. (The eight are: HSV-1, 68% HSV-2, 70% cercopithecine herpesvirus 1, 74% cercopithecine herpesvirus 2, 70% cercopithecine herpesvirus 16, 76% bovine herpesvirus 1 72% bovine herpesvirus 5, 74% pseudorabies virus, 73%.) In contrast, varicella-zoster virus (VZV) has a comparatively low GC content (46%) despite the fact that it becomes latent in neurons . The ability of VZV to produce such infections despite a low GC content may be due at least in part to the diversity of ganglia in which latent infections occur. Latent VZV infections are found in ganglia along the entire length of the central neural axis  while those of HSV-1 and HSV-2 are concentrated in a more restricted number of ganglia (e.g. the trigeminal in the case of HSV-1).
Retrotransposition and Reduced GC Intergenic Regions
The existence of a population of intergenic regions with reduced GC content may also be related to retrotransposition events in neurons. Among the ways herpesviruses are thought to adapt to environmental change, on an evolutionary time scale, is by incorporating entire genes from the host cell genome. Evidence for such incorporations has been obtained most recently in the case of certain gammaherpesviruses [3, 27-29]. Because the incorporated genes lack introns, it is presumed that their incorporation occurs by way of a cDNA intermediate and L1 retrotransposition. In the case of HSV-1, the existence of intergenic regions with reduced GC content would be preferred as sites for gene insertion due to the greater probability that they will contain AT-rich L1 insertion sequences. Such regions would arise by default if their creation involved simply resisting the evolutionary pressure favoring increased GC content. Evidence for preference of TTAAAA sequences in reduced GC compared to genome-like intergenic regions is described in this study (Tables 4 and 5).Six TTAAAA sequences are found in 6688 bp of reduced GC intergenic DNA while 4 are found in 19,651 bp of genome-like intergenic sequence.
In the future it should be possible to test the idea expressed here that a high GC content is involved in resistance of the HSV-1 genome to invasion by L1 retrotransposons. As noted above, it is now possible to observe L1 transposition in cultured cells with the L1 element derived either from the cell genome or from a transfected plasmid [11, 24, 30]. Similar tests with herpesvirus-infected cells should demonstrate that L1 transposition into the virus genome occurs at a higher rate in low GC compared to high GC viruses. In HSV-1-infected cells, transposition should occur preferentially into reduced GC intergene regions and should avoid genes.
Local GC Minima and CA Repeats
Determination of the HSV-1 GC content in a 120 bp window was carried out as a supplement to measurements of the gene and intergene values. It was expected that intergenic regions might have a reduced GC content because of the measurements described above and also because intergenic regions are often the sites of AT-rich polyadenylation signals (AATAAA). In spite of such expectations, it was nevertheless noteworthy that GC minima were as pronounced and as focused on intergenic regions as the results demonstrated (Fig. 4). Intergenic regions with pronounced GC minima were found throughout the HSV-1 genome with a preference for reduced GC intergenic sequences as shown in Tables 3 and 4. Further information about the function of intergenic GC minima might best be pursued by examining the genomes of other viruses for this feature, and appropriate studies are now underway in our laboratory.
CA dinucleotide repeats are an abundant feature of the human genome where they account for ~0.3% of the total sequence [10, 31]. Most occur in microsatellite sequences, but others are involved in functions such as regulation of gene expression [32, 33], control of alternative mRNA splicing [34, 35] and mRNA stability . Such functions depend on the ability of CA repeats to bind heterogeneous nuclear ribonucleoprotein L (hnRNP L [33, 36]). In the HSV-1 genome, one CACACA sequence has been identified within the thymidine kinase coding region (UL23) where it acts by way of hnRNP L binding to enhance mRNA polyadenylation and nuclear export of intronless mRNAs . Others of the 55 coding region CACACA sequences may act in the same or a similar way. CACACA sequences are also found in intergenic DNA. These are located throughout the HSV-1 genome with most (16 of 24) occurring in reduced GC intergenic regions (Tables 3 and 4). Analysis of deletions in such CACACA sequences suggests itself as a promising way to identify their function.
For contributions during the course of this investigation I gratefully acknowledge Michael Black, Fred Homa, Bill Newcomb Ann Campbell, Anna Maria Copeland, Lou Hammarskjöld and Jeffrey Brown. This work was supported by NIH award AI41466.