Feng-Hsu Lin & Donald R. Forsdyke
Prokaryotes that grow optimally in acid have
purine-poor codons in long open reading frames
Extremophiles (2007) 11, 9-18
Copyright held by Springer-Verlag. The original publication is available at http://www.springerlink.com
Keywords Acid environment . Base composition . Codons .
Extremophiles . Optimum growth . Purine-loading
ORF = open reading frame. CUTG = Codon Usage Tables from GenBank
The archaeon Picrophilus torridus grows optimally at pH 0.7 (Fütterer et al. 2004; Ciaramella et al. 2005). However, under acid conditions purines are removed from DNA, generating apurinic acid (Tamm et al. 1952; Bernstein and Bernstein 1991). How do organisms with low pH growth optima preserve their DNA? Mechanisms to ensure near neutral cytoplasm may aid preservation in some acidophiles. Yet P. torridus is probably like P. oshimae in having an internal pH of 4.6 and being unviable in media of pH above 4.0 (Jack et al. 1998). Since in duplex DNA a pyrimidine in one strand must be matched by a purine in the other strand, vulnerability to depurination cannot be decreased by eliminating purines from DNA. However, if one strand were particularly vulnerable (Beletskii and Bhagwat 1996), or if transcription products were vulnerable, a decreased content of purines in a segment of one strand might be advantageous.
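The base-pairing constraint in the paragraph above can be illustrated with a minimal sketch. The sequence is hypothetical and chosen only for illustration: whatever the purine content of one strand, the duplex as a whole stays at 50% purines, so purines can only be shifted from one strand to the other.

```python
# Minimal sketch: purine content of one strand versus the whole duplex.
# The sequence below is a hypothetical example, not taken from any genome.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def purine_percent(seq):
    """Percentage of bases in seq that are purines (A or G)."""
    return 100.0 * sum(base in "AG" for base in seq) / len(seq)

top = "AAGGAGCT"                      # purine-rich strand (hypothetical)
bottom = reverse_complement(top)      # its pyrimidine-rich partner
duplex_percent = purine_percent(top + bottom)

print(purine_percent(top), purine_percent(bottom), duplex_percent)
```

Unloading purines from one strand therefore loads them onto the complementary strand, which is why only a strand-specific (or transcript-specific) vulnerability could make purine-unloading advantageous.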
In most organisms transcripts are purine-loaded (i.e. their purine content exceeds 50%; Forsdyke and Mortimer 2000). This trend is especially prevalent in organisms with high optimum growth temperatures (Lao and Forsdyke 2000; Lambros et al. 2003; Lobry and Chessel 2003; Paz et al. 2004; Basak et al. 2004). In acidophiles vulnerability to depurination might be less if there were decreased use of purine-rich codons. This would be especially important in long open reading frames (ORFs): since mutation tends to be a one-hit process, long ORFs that contain many purines constitute larger potential targets for mutation than short ORFs. But P. torridus grows optimally at 60°C. Can nucleic acid base composition be adjusted in P. torridus to adapt both to environmental pH and to temperature?
To detect possible adaptation of the base composition of a species to an environmental variable, it is usual to plot the base composition values of different species against values for the variable. For example, although a subject of some contention (Musto et al. 2005; Basak and Ghosh 2005), linear regression plots of species base composition against optimum growth temperature show a negative correlation with species GC% and a positive correlation with species AG%. We here apply this approach to examine adaptation to pH. We first demonstrate for P. torridus that ORF base compositions vary with ORF length. For a given base compositional parameter (AG%, GC%) there is a unique slope when that parameter is plotted against ORF length. We then economically summarize a large amount of data by expressing the corresponding slopes for various acidophiles as a function of their optimum growth pH. Our results confirm that increased purine-loading is an adaptation to temperature that affects all ORFs, whereas purine-unloading appears as an adaptation to acid growth conditions that preferentially affects long ORFs.
"Codon Usage Tables from GenBank" (CUTG) are derived from the available annotated protein-encoding sequences of a species and are automatically updated with each new GenBank release (Nakamura et al. 2000). It was assumed that any errors in ORF assignments would be of insufficient magnitude to affect our aggregate results. Base compositions at different codon positions were calculated from the CUTG tables (GenBank release 143) using programs written in Perl (Mortimer and Forsdyke 2003).
Prokaryotes for which optimum growth pH values had been determined were
identified from the literature (Schleper et al. 1995; Wiegel and Kevbrin
2003; Macario et al. 2004). For many of these the genomic sequences had
been designated complete. Two species for which there is extensive, but
still incomplete, sequence information (Geobacillus stearothermophilus
and Sulfolobus acidocaldarius) were also included (Table 1).
Approximate values for total genomic base compositions were calculated
after summing the base compositions of the available ORFs (Zavala et al.
Organisms identified as growing optimally at extremes of pH are currently scarce. The five with lowest pH values are also thermophiles, so it is necessary to distinguish adaptation to pH from adaptation to temperature. Since there are many organisms that grow optimally in neutral media, a small selection was made from this pH region, with a bias towards organisms that, like P. torridus, are thermophilic and are also archaea (Table 1). Values for optimum growth temperatures were found in a Prokaryote Growth Temperature Database (PGTdb; Huang et al. 2004), and in the records at the National Center for Biotechnology Information (Washington). Where a range of optimum values was given, the arithmetic center of the range was selected.
ORF lengths were calculated from CUTG tables. When base compositions of entire ORFs, or of individual codon positions in ORFs, are plotted against ORF lengths, the multiple data points are distributed as rightward-pointing arrowheads (Paz et al. 2004). Many features of the distributions are captured by first order linear regression analysis. Although smaller proteins dominate the statistics, and the few points near the tips of arrowheads that correspond to very long proteins usually depart a little from regression lines, higher order regressions offer little improvement. Further details of methods and terminology may be found in Rayment and Forsdyke (2005).
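As a rough illustration of how ORF length and codon-position base compositions might be derived from per-ORF codon counts, the following Python sketch may help. It is not the authors' Perl program. The input format (a dictionary of codon counts for one ORF), the function names, and the toy counts are all assumptions made for this example, and whether the stop codon is counted towards the length is left to whoever prepares the counts.

```python
from collections import Counter

PURINES = ("A", "G")
GC_BASES = ("G", "C")

def percentages(counts):
    """Return (AG%, GC%) for a Counter of base counts."""
    total = sum(counts.values())
    ag = sum(counts[base] for base in PURINES)
    gc = sum(counts[base] for base in GC_BASES)
    return 100.0 * ag / total, 100.0 * gc / total

def summarize_orf(codon_counts):
    """Summarize one ORF given {codon: count} (DNA or RNA alphabet).

    Returns the ORF length in kilobases plus (AG%, GC%) tuples for the
    whole ORF (key "all") and for codon positions 1, 2 and 3.
    """
    by_position = {1: Counter(), 2: Counter(), 3: Counter()}
    for codon, n in codon_counts.items():
        codon = codon.upper().replace("U", "T")
        if len(codon) != 3:
            raise ValueError(f"not a codon: {codon!r}")
        for position, base in enumerate(codon, start=1):
            by_position[position][base] += n
    overall = by_position[1] + by_position[2] + by_position[3]
    summary = {"length_kb": 3 * sum(codon_counts.values()) / 1000.0}
    summary["all"] = percentages(overall)
    for position, counts in by_position.items():
        summary[position] = percentages(counts)
    return summary

# Hypothetical codon counts for a single short ORF (50 codons).
toy_orf = {"AUU": 30, "GGA": 10, "UCC": 10}
toy_summary = summarize_orf(toy_orf)
print(toy_summary)
```

Each ORF then contributes one point (length in kb, AG% or GC%) to the plots of base composition against ORF length.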
Most organisms have many small proteins and few very large ones. For an individual species, plots of the base compositions of each protein-encoding region (ORF) against the corresponding lengths (kilobases), show the multiple data points to be distributed as rightward-pointing arrowheads, with each point corresponding to an individual gene (Paz et al. 2004). Many features of the distributions can be captured by first order linear regression analysis (Rayment and Forsdyke 2005). The 1535 genes of P. torridus were studied in this way.
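A first order linear regression of this kind can be sketched in a few lines of Python. This is an illustrative reconstruction, not the software used in the study. The function name and the toy data points are hypothetical, and the adjusted r2 uses the standard single-predictor formula 1 - (1 - r2)(n - 1)/(n - 2).

```python
def linear_fit(lengths_kb, values):
    """Least-squares fit of values (e.g. AG%) against ORF lengths (kb).

    Returns (slope, intercept, adjusted_r2). Raises ZeroDivisionError if
    all lengths, or all values, are identical.
    """
    n = len(lengths_kb)
    if n != len(values) or n < 3:
        raise ValueError("need two equal-length series of at least 3 points")
    mean_x = sum(lengths_kb) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths_kb)
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths_kb, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy * sxy / (sxx * syy)
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)
    return slope, intercept, adjusted_r2

# Hypothetical points: one (length, AG%) pair per gene.
toy_lengths = [0.3, 0.6, 0.9, 1.2, 1.5, 3.0]
toy_ag = [58.0, 57.0, 57.5, 56.0, 55.5, 53.0]
slope, intercept, adj_r2 = linear_fit(toy_lengths, toy_ag)
print(f"slope = {slope:.2f} AG%/kb, adjusted r2 = {adj_r2:.3f}")
```

With real data the adjusted r2 would be expected to be small, since length explains only a few percent of the variation between genes.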
In P. torridus, with increasing gene length the overall AG% values of ORFs decrease (Fig. 1a), and some 4% of the variation between points can be accounted for on this basis (adjusted r2 = 0.038). All codon positions, but especially first and second positions, contribute to the decrease (Figs. 1c, e, g). In contrast, GC% values increase with increasing gene length (Fig. 1b), but less than 1% of the variation between points can be accounted for on this basis (r2 = 0.006). However, when analyzed in terms of codon positions, a significant role of second codon positions emerges (r2 = 0.048; Fig. 1f), which is partially countermanded by third codon positions (Fig. 1h). Thus, for both AG% and GC% the amino acid-determining codon positions (first and/or second) are involved in changes in average base compositions as ORF lengths increase.
The contributions of subsets of ORFs of increasing length to these results were examined by eliminating all ORFs above certain lengths (i.e. the arrowheads were progressively blunted). From first order linear regression analyses of these distributions, values for slopes and coefficients of determination (adjusted r2) were extracted. Figure 2a shows that for AG%, even when longer ORFs are omitted, slope values (AG%/kb) are still negative (filled symbols). Indeed, when all ORFs above 1.5 kb are omitted some 8% of the variation can be explained on this basis (r2 = 0.08; open symbols). Although all codon positions contribute, among the codon positions, negative slope values and r2 values tend to be consistently greatest at second positions (Fig. 2e). However, third codon positions make an increased contribution when all ORFs above 1 kb are omitted (r2 = 0.06; Fig. 2g).
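The progressive "blunting" of the arrowheads can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure in detail: the data points are hypothetical, the helper names are invented for the example, and the choice to skip cut-offs that leave fewer than three ORFs is an assumption.

```python
def regress(points):
    """Least-squares fit of y on x for (x, y) pairs.

    Returns (slope, adjusted_r2). Raises ZeroDivisionError if all x values,
    or all y values, are identical.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    r2 = sxy * sxy / (sxx * syy)
    return sxy / sxx, 1.0 - (1.0 - r2) * (n - 1) / (n - 2)

def blunted_fits(orfs, cutoffs_kb, min_points=3):
    """Refit after omitting all ORFs longer than each cut-off.

    orfs is a list of (length_kb, composition_percent) pairs. Returns
    {cutoff: (slope, adjusted_r2)}; cut-offs leaving too few ORFs are skipped.
    """
    fits = {}
    for cutoff in cutoffs_kb:
        kept = [(length, value) for length, value in orfs if length <= cutoff]
        if len(kept) >= min_points:
            fits[cutoff] = regress(kept)
    return fits

# Hypothetical (length in kb, AG%) points for seven ORFs.
toy_orfs = [(0.3, 58.0), (0.5, 57.0), (0.8, 57.5), (1.0, 56.0),
            (1.4, 55.5), (2.0, 55.0), (3.5, 52.0)]
fits = blunted_fits(toy_orfs, cutoffs_kb=[0.4, 1.0, 1.5, 2.5, 10.0])
for cutoff, (slope, adj_r2) in sorted(fits.items()):
    print(f"<= {cutoff} kb: slope = {slope:.2f} %/kb, adjusted r2 = {adj_r2:.2f}")
```

Plotting the slope and adjusted r2 values against the cut-off gives curves of the kind summarized in Figure 2.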
Figure 2b shows that for GC%, even when longer ORFs are omitted, slope values (GC%/kb) remain weakly positive, but become negative when ORFs above 0.75 kb are omitted. The positive slopes are largely due to second codon positions. When ORFs above 1.5 and 2.5 kb are omitted (Fig. 2f), from 7% to 8% of the variation between points can be explained on this basis (r2 = 0.07 to 0.08). Figure 2h shows that negative slope values, when smaller ORFs alone are considered (i.e. when cut-off points are low), are largely due to changes at third codon positions. Here weakly negative slope values become progressively more negative as longer ORFs are omitted from the distributions. When all ORFs above 1 kb are omitted, some 2% of the variation between third codon positions can be explained on this basis (r2 = 0.02). Apart from some differences when cut-off points were low, similar results were obtained in a study of Thermoplasma acidophilum, which has an optimum growth pH of 2.0 (Fig. 3).
Thus, long proteins tend to have a greater proportion of purine-poor codons than relatively shorter proteins. This holds over a wide range of length scales and, indeed, becomes statistically more certain over the low and middle range of lengths. With respect to the effect of ORF length, a need to shed purines, perhaps due to a low optimum growth pH, would seem to trump a need to load purines due to a high optimum growth temperature. Does this amplify a trend towards purine-loss that affects all ORFs? That is, irrespective of their lengths, are ORFs generally purine-depleted in organisms that have low growth pH optima? To examine this, and to determine the extent to which our observations on P. torridus are applicable to other acidophiles, we compared the overall base compositions of the ORFs of a variety of species that differ in their optimum growth conditions.
Measurements of base compositions at different codon positions were made for 25 prokaryotes (including P. torridus), that vary both in their optimum growth pHs and temperatures (Table 1). Whereas in Figures 1-3 each point corresponds to a subset of genes within a species, in Figure 4 each point refers to all the sequenced genes of a species. There is no significant effect of optimum growth pH on either overall AG% values (Fig. 4a), or the AG% values for individual codon positions (Figs. 4c, e, g). Albeit at an extreme, the points for P. torridus (indicated by arrows) follow the general trend (i.e. purine-loading, mostly at first codon positions).
The same values were plotted against optimum growth temperatures. Consistent with previous studies showing that AG% increases as optimum growth temperature increases (Lambros et al. 2003), the slope is positive (r2 = 0.15; P = 0.031; Fig. 4b). This is due largely to AG% values at first and third codon positions (Figs. 4d, h), particularly the latter (r2 = 0.37; P = 0.001). Again, P. torridus follows the general trend.
Thus, apart from participating in a barely significant decline in AG% at
second codon positions (r2 = 0.098; P = 0.070; Fig.
4e), from direct measurement of base composition it must be concluded
that P. torridus has resolved the putative conflict between pH
and temperature in favor of temperature. As with other thermophiles,
purines are loaded, not shed.
Values for GC% were plotted similarly. These decrease as optimum growth pH decreases (Fig. 4a). The decrease is attributable to bases at all codon positions, but is significant only at first codon positions (Fig. 4c; r2 = 0.197; P = 0.015). The points for P. torridus follow the general trend among organisms with low optimum growth pH. Thus, P. torridus has a lower overall GC% (i.e. it is relatively AT-rich) relative to organisms that have higher optimum growth pHs. The trend is weakly countermanded in P. torridus by a relative enrichment of GC% in ORFs encoding longer proteins (Figs. 1b, d, f; Figs 2b, d, f). The same group of organisms shows no significant trend in GC% with optimum growth temperature (Figs. 4b, d, f, h). Thus, differences that correlate with optimum growth pH are unlikely to be explained by differences in optimum growth temperature.
For P. torridus there is an apparent discord between the purine-unloading of long ORFs relative to short ORFs (Figs. 1a, c, e, g; Figs. 2a, c, e, g), and an overall pH-independent purine-loading of all ORFs (Figs. 4a, c). To investigate this, slope values from arrow-head plots of base composition versus ORF length were obtained for the 24 other organisms whose base compositions are shown in Table 1 and Figure 4. In every case all available ORFs were included, since a complete set of ORFs can give significant results (as in pointed arrow plots; Fig. 1), albeit not with the best r2 values (as with blunted arrow plots; Figs. 2, 3). We chose not to make arbitrary cuts (to generate blunted arrow plots) among genomes that differ greatly in ORF number (Table 1). The slope values, including those for P. torridus, were plotted against the optimum growth parameters (pH and temperature) for each organism.
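This two-stage analysis (a slope for each organism, then the slopes plotted against an optimum growth parameter) can be sketched as below. The organism names, pH values and data points are hypothetical placeholders, and the helper function is an assumption made for this example rather than the authors' code.

```python
def fit_line(points):
    """Least-squares fit of y on x for (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical input: organism -> (optimum pH, [(ORF length in kb, AG%), ...]).
organisms = {
    "acidophile_A": (1.0, [(0.5, 58.0), (1.0, 57.0), (2.0, 55.0)]),
    "acidophile_B": (3.0, [(0.5, 57.5), (1.0, 57.0), (2.0, 56.0)]),
    "neutrophile_C": (7.0, [(0.5, 56.0), (1.0, 56.0), (2.0, 56.0)]),
}

# Stage 1: one slope (AG% per kb) per organism, using all of its ORFs.
slope_by_organism = {
    name: fit_line(orfs)[0] for name, (_, orfs) in organisms.items()
}

# Stage 2: regress those slopes on optimum growth pH.
ph_vs_slope = [
    (organisms[name][0], slope) for name, slope in slope_by_organism.items()
]
trend, offset = fit_line(ph_vs_slope)
print(slope_by_organism)
print(f"change in slope per pH unit = {trend:.3f}")
```

In this toy data set the per-organism slopes become less negative as optimum pH rises, so the second-stage slope is positive. The same second stage can be repeated with optimum growth temperature in place of pH.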
The unloading of purines as ORF length increases, as observed with P. torridus, occurs in other organisms with a low optimum growth pH (negative slopes of plots at low pH; Fig. 5a). This is largely due to differences at first and second codon positions, particularly the latter (r2 = 0.32; P = 0.002; Fig. 5e). Thus, the weak tendency to unload purines in ORFs at second codon positions (Fig. 4e), is probably due to a preferential purine-loss from the codons of longer ORFs.
These data indicate that in acidophilic microorganisms there is an increasing substitution (or insertion) of amino acids corresponding to purine-poor codons as protein length increases. No correlations were noted when the same slope values were plotted against optimum growth temperatures (Figs. 5b, d, f, h). Thus, the correlations relate to differences in optimum growth pH. Whereas increased optimum environmental temperature elevates purines generally (Fig. 4b) but does not differentially affect purines in long ORFs (Fig. 5b), decreased optimum environmental pH does not decrease purines generally (Fig. 4a) but does differentially affect purines in long ORFs (Fig. 5a).
Figure 6 shows similar slope plots for GC%. The weak increase in GC% with ORF length observed in P. torridus (Figs. 1b, f) appears as part of a general trend, which actually becomes more evident as optimum growth pH increases. While remaining positive, slope values progressively decline as optimum growth pH decreases (Fig. 6a). This, however, is due mainly to differences at third codon positions (r2 = 0.16; P = 0.026; Fig. 6g), so does not involve amino acid differences. Proteins would have needed to exchange (or insert) no particular amino acid to account for this. But for P. torridus the second codon position is primarily involved (Figs. 1f, 2f), so for acidophiles a differential usage of amino acids is implied (Fig. 6e).
In contrast to these results concerning optimum growth pH (Fig. 6a), with
increasing optimum growth temperature slope values for GC% versus ORF
length decrease (r2 = 0.266; P = 0.005; Fig. 6b) but, in
general, remain positive. This decrease primarily involves second codon
positions (Fig. 6f), indicating a difference in amino acid content.
Thus, relative to organisms growing optimally at 37°C, differences in
GC% between ORFs encoding long and short proteins, although present, are
less marked in thermophiles. ORFs for long proteins are enriched in GC
relative to ORFs for short proteins, but this is a characteristic of
most of the 25 species irrespective of optimum growth temperatures.
There has been much interest in identifying differences between the proteins of mesophiles and the proteins of thermophiles that can account for the greater stability of the latter (Jaenicke and Bohm 1998; Kumar and Nussinov 2001; Fukuchi and Nishikawa 2001). Proteins of thermophiles are under pressure to reduce interdomain space, and their structures are more compact with fewer external loops corresponding to interdomain regions than in the orthologous structures of mesophiles (Thomson and Eisenberg 1999; Schafer et al. 2004). There are also differences in amino acid composition (Suhre and Claverie 2003). However, some amino acid differences appear not as a primary adaptation to thermophilia at the protein level, but secondary to adaptations at the nucleic acid level that result in non-synonymous (therefore, amino acid-changing) codon changes (Lao and Forsdyke 2000; Lambros et al. 2003; Lobry and Chessel 2003; Paz et al. 2004; Basak et al. 2004). While changes in amino acid composition that might aid stability at low pH were anticipated in acidophiles, the only change in P. torridus is a slight increase in isoleucine (Fütterer et al., 2004). Whether there is a general reduction in interdomain space in acidophiles is not currently known (Schafer et al. 2004), but there is a reduction in intergenic space and coding sequences account for 91.7% of the P. torridus genome (Fütterer et al. 2004).
The primary nucleic acid level adaptation in thermophiles is enhanced loading of ORFs with purines. This is a phenomenon of protein-encoding regions, not of intergenic regions or introns, suggesting that it might reflect an adaptation operative at the mRNA level. The adaptation might reflect an evolutionary pressure primarily affecting transcripts, rather than differentially affecting one of the strands of ORFs encoding those transcripts. It has been proposed that the loading of the loop regions of mRNA secondary structures with one type of Watson-Crick-pairing base (e.g. purines rather than pyrimidines), would militate against the RNA-RNA interactions that could lead to formation of lengths of double-strand RNA extending beyond two helical turns. Such structures could delay mRNA-dependent processes (i.e. protein synthesis), and trigger false alarms indicative of cell stress (Cristillo et al. 2001; Forsdyke 2001, 2006; Forsdyke et al. 2002).
According to this argument, wherever possible, over evolutionary time pyrimidine-to-purine mutations have been accepted. Multiple such mutations would be needed to effectively "purine-load" an ORF, and a single mutation would have little impact. Since multiple mutations would be necessary, they would be expected to be predominantly synonymous (Fig. 4h; Lambros et al. 2003). Organisms with multiple non-synonymous mutations would more likely be counterselected due to impairment of protein functions, but the pressure to purine-load does invoke some non-synonymous mutations (Lao and Forsdyke 2000; Mortimer and Forsdyke 2003).
We envisage a selective pressure on nucleic acids in acidophiles relating to their increased susceptibility to depurination under acid conditions. Error-prone DNA repair processes at a site of depurination would then lead to base changes (Bernstein and Bernstein 1991). A single mutation in a first or second (i.e. mainly non-synonymous) codon position in an ORF could, depending on its nature and position, impact the function of an entire protein. Long ORFs that are likely to contain more purines than short ORFs would be particularly vulnerable. So, wherever feasible, adaptation to acid environments should involve non-synonymous purine-to-pyrimidine substitutions in long ORFs (Fig. 5). Since acidophiles other than Picrophilus are, despite their acid environment, held to maintain their internal pH close to neutral (Ciaramella et al. 2004), our observation that these organisms follow the same depurination trend in long ORFs as P. torridus (Fig. 5), suggests that their neutrality controls may be imperfect. In other words, while they may maintain near-neutral internal pH under laboratory conditions, environmental fluctuations (e.g. nutrient deprivation, hypothermia) may decrease this control, so allowing occasional transient excursions of internal pH to that of the environment. While decreasing susceptibility to future mutation, purine-to-pyrimidine substitutions could also introduce amino acids that are more consistent with protein function under acid conditions. In long ORFs the advantage with respect to thermophilia conferred by purine-loading, would be trumped by the advantage with respect to acidophilia conferred by purine-unloading. This predicts, for those seeking to detect amino acid differences between organisms growing optimally in different environments (Suhre and Claverie 2003), that it would be more informative to compare amino acid differences between long orthologs.
The above explanation for the negative effect of ORF length on purine-loading in acidophiles (Figs. 1-3, 5) seems more likely than another argument for length-dependent effects on ORF base composition, namely that long proteins are more likely to consist of multiple independent domains, so that the corresponding ORFs are likely to have a greater content of more-mutable interdomain sequence, which can accept new "placeholder" amino acids (Rayment and Forsdyke 2005). As discussed above, we expect extremophile proteins to have compact structures with fewer opportunities for interdomain mutations. Furthermore, adding a pyrimidine-rich codon would do nothing to address the problem, namely a need to decrease the number of purine-rich codons.
Another explanation for ORF length effects on base-composition is that mutations to stop-codons (classically UAA, UAG and UGA) are more likely to be lethal in long ORFs (Oliver and Marin 1996). These codons are all AG-rich and GC-poor. Accordingly, members of the set of codons able to undergo one-step mutation to a stop-codon would be likely to accept mutations that would generate AG-poor and GC-rich synonymous codons. For example, serine codons UCA and UCG would accept pyrimidines at third codon positions generating the codons UCU and UCC, which would not give rise to stop-codons by one-step mutation. The glycine codon GGA would be more likely to accept a pyrimidine at its third position, so generating GGU and GGC. Thus, long ORFs should be relatively AG-depleted (Fig. 1a) and GC-enriched (Fig. 1b). However, it is difficult to envisage why this should be specific to acidophiles.
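The codon examples in this argument can be checked by enumerating single-base changes. The sketch below assumes the three classical stop codons named above; the function names are invented for the example.

```python
from itertools import product

BASES = "UCAG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

def one_step_mutants(codon):
    """All nine codons that differ from codon at exactly one position."""
    return [codon[:i] + base + codon[i + 1:]
            for i in range(3) for base in BASES if base != codon[i]]

def can_mutate_to_stop(codon):
    """True if a single base change can turn codon into a stop codon."""
    return any(mutant in STOP_CODONS for mutant in one_step_mutants(codon))

all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]
at_risk = sorted(codon for codon in sense_codons if can_mutate_to_stop(codon))

# The serine and glycine codons discussed in the text.
for codon in ("UCA", "UCG", "UCU", "UCC", "GGA", "GGU", "GGC"):
    print(codon, can_mutate_to_stop(codon))
print(len(at_risk), "sense codons are one step from a stop codon:", at_risk)
```

The enumeration agrees with the examples given: UCA, UCG and GGA are one step from a stop codon, whereas their pyrimidine-ending synonyms UCU, UCC, GGU and GGC are not.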
Regarding amino acid content, we should note that, consistent with the AT-richness of acidophiles (Fig. 4a), the reported slight elevation in isoleucine in P. torridus would correspond to an increase in one or more GC-poor codons (AUU, AUC, AUA). Whether the isoleucine elevation serves a function at the protein level, or is actually secondary to adaptations at the nucleic acid level, remains to be determined. Regarding nucleic acid level adaptations, there is a reciprocal relationship between AG% and GC% (Lao and Forsdyke 2000; Saccone et al. 2001; Mortimer and Forsdyke 2003). So a primary effect on one may secondarily affect the other. The underlying basis of this relationship can be determined from first principles. There are two ways to modulate GC% when the total number of bases is constant: either by changing the number of G’s, or by changing the number of C’s. As GC% increases, trading A for G would not affect the AG%. Likewise, trading T for C would not affect the AG%. However, if T were replaced by G, AG% would increase as GC% increases (non-reciprocal relationship). Conversely, if A were replaced with C, AG% would decrease as GC% increases (reciprocal relationship; Lambros et al. 2003). So A-for-C trading may partly explain some of the effects observed here. For example, such trading would decrease the vulnerability of a strand to cytosine deamination, which is favored at extremes of pH, and preferentially affects single strands of DNA (Beletskii and Bhagwat 1996).
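The first-principles argument about base trading can be made concrete with a small worked example. The starting composition (25 of each base in a 100-base strand) and the number of bases traded are hypothetical values chosen only to keep the arithmetic simple.

```python
def ag_gc_percent(counts):
    """Return (AG%, GC%) for a dict of single-strand base counts."""
    total = sum(counts.values())
    ag = 100.0 * (counts["A"] + counts["G"]) / total
    gc = 100.0 * (counts["G"] + counts["C"]) / total
    return ag, gc

def trade(counts, old, new, n):
    """Replace n copies of base old with base new; total length is unchanged."""
    if n > counts[old]:
        raise ValueError(f"only {counts[old]} copies of {old} available")
    result = dict(counts)
    result[old] -= n
    result[new] += n
    return result

# Hypothetical 100-base strand with equal base frequencies.
start = {"A": 25, "C": 25, "G": 25, "T": 25}

# Each of the four trades raises GC% by the same amount.
results = {}
for old, new in (("A", "G"), ("T", "C"), ("T", "G"), ("A", "C")):
    results[(old, new)] = ag_gc_percent(trade(start, old, new, 5))
    ag, gc = results[(old, new)]
    print(f"{old} replaced by {new}: AG% = {ag:.0f}, GC% = {gc:.0f}")
```

Starting from AG% = 50 and GC% = 50, all four trades raise GC% to 55. Replacing A with G or T with C leaves AG% at 50, replacing T with G raises AG% to 55, and replacing A with C lowers AG% to 45, which is the reciprocal case.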
We have not here considered direct effects of low pH on mRNAs. Since mRNAs are multiple and ephemeral, a random change in a base in one mRNA is unlikely to have a long-term consequence for the species. Intriguingly, Amosova et al. (2006) have noted that when duplex DNA adopts extruded stem-loop configurations, certain short sequences of guanine residues can catalyze self-depurination under physiological conditions. Thus, there are depurination-mediated "hot-spots" for mutation in DNA. Such self-depurination increases when the pH is lowered, but decreases at high temperature due to disruption of the required DNA stem-loop structure. Possible implications of the present study for the base composition of nucleic acids in the relatively alkaline environment of mitochondria are discussed elsewhere (Forsdyke 2006).
This page was established in May 2006 and was last edited on 22 Jan 2007 by Donald Forsdyke