What is a Gene?
Scherrer and Jost’s symposium.
The gene concept in 2008
128, 157-161. DOI: 10.1007/s12064-009-0071-2
In 1900 Henry Bernard of the Natural History section of the British Museum, who regarded Ernst Haeckel of Jena as his “friend and teacher,” devised a numerical taxonomic system to replace the classical Linnaean system. Describing the latter as “philosophically absurd and practically disastrous,” he provoked several leading biologists to contribute their thoughts on “the species concept” to a letter that was passed from correspondent to correspondent, finally ending up in an envelope labeled “Bernard’s Symposium.” The year was ominous. Mendelism was about to burst on the scene and, after a bitter battle with the biometricians, a new science – Genetics – was to emerge triumphant. Decades later Bernard’s envelope was discovered in the archives of the geneticist William Bateson (Cock 1977). Despite its imperfections, the Linnaean system of species nomenclature had survived because it worked. It remains unchallenged today.
Many of the same points can be made concerning the present “symposium.” Describing the discovery that eukaryotic genes are fragmented as “devastating for the original gene concept,” a molecular biologist and a mathematician have teamed up to amend, and provoke our thoughts on, the meaning of the word “gene” (Scherrer and Jost 2007). Again, the preferred solution is numerical. It is hoped that a refined definition will allow better application of “mathematical algorithms that can analyse gene storage and expression in terms of information processing.” Again the year is ominous. From the torrent of sequence information that began in the 1980s (Benson et al. 2009), a new science – Evolutionary Bioinformatics – is emerging (Forsdyke 2006). Whether it will emerge triumphant remains to be seen. But from its basic tenets this correspondent is led to believe that the Mendelian gene concept is safe. Despite its imperfections, it will be the concept preferred by most of those attempting to identify and tackle biological problems in the twenty-first century.
Although the word had not then been coined, Gregor Mendel’s “gene” of 1865 was something segregating intact among offspring that determined a character. The latter was some morphological or physiological feature that we now refer to as being part of the “conventional phenotype” – the phenotype that is most obviously responsible for interactions of an organism with its environment. More than a century later, in his “genome hypothesis” Richard Grantham (1980) referred to an apparently more inward-looking, genome-based, phenotype, for which the term “genome phenotype” was suggested (Bernardi and Bernardi 1986). The genome phenotype concept allowed a better understanding of genomes (i.e. the multiple forms of information, including genic information, that pass through the generations in the form of nucleic acid) and appeared to explain the occurrence of “placeholder” bases and amino acids in nucleic acids and proteins – a subject relevant to our present task.
Rather than “characters,” the genome phenotype deals with “pressures” that relate, in ways that remain to be fully explored, to fundamental biological themes such as self/not-self discrimination, the preservation of genome integrity, and the abeyance of that integrity needed for speciation. Among the pressures are the genome-wide pressures exerted by pairs of bases (e.g. GC-pressure, which can be regarded as the “accent” of DNA) and the potential to extrude stem-loop structures from duplex DNA (fold pressure). Some pressures are local, being confined to specific regions. AG-pressure and RNY pressure apply to exons, which are the DNA sequences corresponding to what is left as messenger RNA (mRNA) after introns have been removed from a primary transcript.
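As a rough illustration of how such compositional pressures might be measured, the following Python sketch reports the fraction of G+C bases and of purines (A+G) in successive windows along a sequence. The function names, the window size and the toy sequence are assumptions made for this example; they are not the metrics used in the studies cited here.

```python
def base_fraction(seq, bases):
    """Return the fraction of positions in seq occupied by any of the given bases."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(1 for b in seq if b in bases) / len(seq)


def windowed_fractions(seq, bases, window=6, step=6):
    """Fractions of the given bases in successive windows (window size is an assumption)."""
    return [
        base_fraction(seq[i:i + window], bases)
        for i in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence, invented for illustration only.
toy = "GGCGCCAATATTAGGAGA"

gc_profile = windowed_fractions(toy, "GC")  # GC content per window
ag_profile = windowed_fractions(toy, "AG")  # purine content per window
print(gc_profile, ag_profile)
```

In this toy case the first window is GC-rich and the last is purine-rich, which is the kind of regional difference that a local pressure would be expected to produce.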
Molecules of RNA and protein must assume higher-order structures in order to perform various structural and/or catalytic roles. Although sometimes requiring the assistance of molecular chaperones (Cristofari and Darlix 2002), the information for such structures is mainly encoded in their primary sequences. However, there can be conflicts. When incorporated into a stem-loop structure, purines tend to occupy the less stable loops. Thus, fold pressure (quantified as the stability of stem-loop structures) tends to be countermanded by AG-pressure. It is important to distinguish general, genome-wide, fold pressure, which appears primarily to relate to function at the DNA level in the nucleus, from local fold pressure, which appears primarily to relate to function at the RNA level in the cytoplasm. Thus, the DNA from which a ribosomal RNA (rRNA) is transcribed is under two, potentially conflicting, fold pressures – general and local. The potential to fold that satisfies the needs of a segment of DNA may not be the same as that which satisfies the needs of the RNA transcribed from that segment. There must be some compromise – perhaps post-transcriptional RNA editing by removal of segments and/or base modifications (Scherrer and Darnell 1962; Greenberg and Penman 1966; Bass 2002).
Similarly, a protein-encoding exon can be considered under a local “protein pressure,” which must contend for genome space with other pressures. Many features of a genome (e.g. introns) can be understood in terms of the way the “hand of nature” has resolved contending pressures over evolutionary time to arrive at a form that best satisfies the needs of members of the species (Forsdyke and Mortimer 2000; Forsdyke 2001, 2002).
Through classical recombination mapping, Mendel’s “gene” was localized, first to a linkage group (chromosome) and then to a distinctive chromosomal region (Cock and Forsdyke 2008). When a break happened to move that region from one chromosome to another, the gene moved with the region. Mendel’s “gene” was further localized through mutation. In general, mutations elsewhere in the genome did not disturb the function attributed to a gene. Mutations in the region of the gene often, but not always, disturbed the function attributed to it.
When DNA sequencing techniques emerged in the 1970s, the types and fine-resolution locations of mutations affecting a function could be determined. This suggested a strict, but largely unworkable, gene definition, which nevertheless could serve as a frame of reference for more practical definitions. Pertinent to this are “placeholder” bases or amino acids, which may occur both in genes that have RNA transcripts but no protein product and in protein-encoding genes (Xue and Forsdyke 2003; Rayment and Forsdyke 2005). For example, genome compactness being a virtue in viruses, it was surprising that a glycine-alanine repeat region in the EBNA1 protein of Epstein-Barr virus could be removed without interfering with the major functions assigned to the protein (Yates and Camiolo 1988; Wu et al. 2002). The paradox appeared to be resolved when it was reported that the repeat region could inhibit antigen processing, so permitting the virus to evade host immune defences (Levitskaya et al. 1995; Levitsky and Masucci 2002).
However, supporting evidence derived from an expression construct that was transferred to target cells. Here the gene was transcribed into mRNA, which was then translated into the protein. Was it the transcribed mRNA, or its translation product, that was responsible for the effects observed (Cristillo et al. 2001)? The two amino acids in the repeat could each be encoded by any of four codons. It has recently been shown that changing from one synonymous codon to another can prevent the inhibition of antigen processing, even though the encoded amino acids have not changed. While it is possible that the rate of translation, and hence protein folding, might have been affected by different codon usage, it is most likely that, at the protein level, the glycines and alanines in the repeat region are mere placeholders, perhaps with no functional role (Starck et al. 2008; Tellam et al. 2008).
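The point that the nucleic acid sequence can change while the encoded repeat does not can be made concrete with a short Python sketch. It builds the standard genetic code, lists the four codons for glycine and for alanine, and compares two synonymous encodings of a short Gly-Ala repeat. The two sequences are invented for illustration and are not the constructs used in the cited experiments; the function names are likewise assumptions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding-strand DNA sequence, ignoring any incomplete final codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def codons_for(amino_acid):
    """Return all codons that encode the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def purine_fraction(dna):
    """Fraction of bases that are purines (A or G)."""
    return sum(1 for b in dna if b in "AG") / len(dna)


# Two hypothetical encodings of the same Gly-Ala-Gly-Ala-Gly-Ala repeat.
version_1 = "GGAGCA" * 3  # third positions occupied by a purine
version_2 = "GGTGCT" * 3  # third positions occupied by a pyrimidine

print(codons_for("G"), codons_for("A"))
print(translate(version_1), translate(version_2))
print(purine_fraction(version_1), purine_fraction(version_2))
```

Both versions translate to the same peptide, yet they differ in base composition, so any effect that distinguishes them must arise at the nucleic acid level rather than from the amino acids themselves.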
Hence, from the perspective of protein function, the repeat-encoding region in viral DNA would not be part of the gene, because a mutation that affected the region would be presumed not to impact the function of the gene. If needed at the DNA level for some other purpose (Schaap 1971), the region encoding the repeat could have evolved as an intron and then have been spliced out during mRNA processing. That the region was not spliced out and remained intact in mRNA, despite not appearing to function at the protein level, suggests function at the level of mRNA itself. Thus, from the perspective of mRNA function, the repeat-encoding region in viral DNA is part of the gene, since a mutation would affect function at the RNA level. In this case RNA trumps protein when it comes to gene definition.
While the possibility that the glycine-alanine repeat has some protein-level function is not excluded (Daskalogianni et al. 2008), for present purposes we can regard the amino acids as mere placeholders – a secondary consequence of the sequence requirements of the corresponding mRNA. Also for our purposes, we consider nucleic acids to have four bases (disregarding modified bases), and proteins to have twenty amino acids. We do not consider epigenetic effects or modified amino acids.
For a gene with an RNA end-product (e.g. the intron-containing Xist gene; Pfeifer and Tilghman 1994), a base that could be substituted with any other base without affecting RNA function would have satisfied one criterion for exclusion of that position from the gene. But the base could still be a “placeholder” regarding the gene’s assigned function. This role would be tested by deleting or adding a base at that position. If function were still not impaired, then the base would not be deemed a placeholder and the position would not be considered part of the gene, although the position might serve some other genomic pressure. Even if the position were deemed a placeholder (and hence was part of the gene), it could still concomitantly serve another genomic role. This argument would apply to both exons and introns in a gene with an RNA end-product. A mutation in an intronic splice site might result in mis-splicing and so affect the function of the final RNA product. By this criterion some parts of introns would be considered to contribute to the gene.
To the extent that mutations in them can affect a protein product quantitatively, the same criteria would apply to the exons and introns corresponding to the 5’ and 3’ untranslated segments of mRNAs. However, we should note that, perhaps influenced by Scherrer and Jost, many consider that such segments should not be considered as having derived from exons (Griffiths and Stotz 2006;
Protein-encoding exons have many base positions where function of the protein would be affected by mutation. The first and second positions of amino acid-encoding triplets would most likely affect the nature of an amino acid, and hence protein function. Third codon positions are often redundant so, in simple form, the protein-encoding parts of a gene would consist of sets of two bases followed, in many cases, by an irrelevant third base. However, since deleting, or adding to, this placeholder base would upset the reading frame, the entire coding region can be considered part of the gene. Similarly, many intron mutations would affect the protein (e.g. omission of a protein-encoding exon through mis-splicing) so, to that extent, its introns would also be part of the gene.
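The contrast between substituting and deleting a redundant third-position base can be sketched in Python as follows. The short coding sequence is invented for this example, and the helper names are assumptions; only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate all complete codons; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


# Hypothetical coding sequence: ATG GCT GGA AAA TGA
original = "ATGGCTGGAAAATGA"
pos = 5  # 0-based index of the third base of the second codon (GCT)

# Substitution test: replace the third-position base with another base.
substituted = original[:pos] + "C" + original[pos + 1:]

# Deletion test: remove the same base, shifting the downstream reading frame.
deleted = original[:pos] + original[pos + 1:]

print(translate(original))     # protein from the unmodified sequence
print(translate(substituted))  # unchanged: the third base is redundant here
print(translate(deleted))      # altered downstream of the deletion
```

The substitution leaves the protein unchanged, whereas the deletion alters every downstream codon. The base therefore fails the first exclusion test but passes the second, which is why it counts as a placeholder and the position remains part of the gene.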
Since it is normally not feasible to put the region assigned to a gene through all these mutational tests, a practical compromise is to define a gene at one of three levels, depending on the context of the particular discourse. 1. The most fundamental definition of a gene is that region of the genome that is transcribed. Its boundaries would be the first and last bases of the transcript, prior to any 5’ capping or 3’ trimming. Since the latter may be hard to define, the 3’ end of the trimmed transcript prior to polyadenylation would be an acceptable compromise. 2. Beyond this, there are usually related regions on either side of the transcribed region, mutations of which would affect production of the gene product. So in some contexts these neighboring regions would be included. 3. Finally, if a distant region could be shown to affect the quality or quantity of the gene product, with a high degree of specificity for that gene, this could also be considered part of the gene. Since molecular chaperones are not specific, usually having a variety of “client” proteins, their genes would not be considered as part of the genes of their clients. Many other processing factors (e.g. RNA splicing proteins) are likewise non-specific. This three-level definition may not meet all of Scherrer and Jost’s concerns, but it is close to that of many textbooks (e.g. Lewin 2006).
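One way to picture the three levels described above is as a nested set of genomic intervals, where each level adds regions to those of the level before. The Python sketch below is only an illustration under that assumption: the class name, the field names and all coordinates are hypothetical and do not come from any annotation standard.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRegions:
    """Hypothetical container for the three definitional levels of one gene."""
    transcribed: tuple                             # level 1: (start, end) of the transcript
    flanking: list = field(default_factory=list)   # level 2: neighboring regions
    distant: list = field(default_factory=list)    # level 3: gene-specific distant regions

    def intervals(self, level):
        """Return every interval counted as part of the gene at the given level."""
        if level not in (1, 2, 3):
            raise ValueError("level must be 1, 2 or 3")
        regions = [self.transcribed]
        if level >= 2:
            regions += self.flanking
        if level >= 3:
            regions += self.distant
        return regions

    def contains(self, position, level):
        """True if the position lies within the gene as defined at that level."""
        return any(start <= position <= end for start, end in self.intervals(level))


# Invented coordinates, for illustration only.
example = GeneRegions(
    transcribed=(1000, 5000),
    flanking=[(800, 999), (5001, 5200)],
    distant=[(20000, 20300)],
)

print(example.contains(900, level=1), example.contains(900, level=2))
print(example.contains(20100, level=2), example.contains(20100, level=3))
```

Whether a given mutation falls "in the gene" then depends on the level chosen for the discourse, which is the practical point of the compromise.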
The notion of the genome phenotype has been long on the table. So also has the notion of competition between hierarchical levels of information. Acknowledging possible conflicts between different types (layers) of information, Gregory Bateson (1979) observed that in biological systems:
Scherrer and Jost (2007) acknowledge “the nuclear DNA not only carries several types of information, but is at the same time the mechanistic carrier of the information contained,” and they recognize “the existence of an independent mechanism which lays down signals for meiotic alignment which seems to be largely independent of all other genomic information.” However, for their purposes these considerations are given little weight, even though “in the perspective of evolutionary biology, different conceptual emphasis has lead to utilizations of the term ‘gene’ that are different from ours.” They appear most concerned with regulatory aspects of the transfer of information from nucleus to cytoplasm and do not appear to have explored the ideas of Richard Grantham and Gregory Bateson that, with the subsequent appearance of vast quantities of DNA sequence information in the 1980s, led to deeper analyses in the 1990s. A new science – for which the name “Evolutionary Bioinformatics” was suggested – emerged (Forsdyke 2006). Just as Bernard’s symposium now serves to illustrate the controversies of 1900, so the present symposium on a proposed “new information theoretic scheme” will perhaps best serve as an indicator to future historians of the disparate lines of thought prevailing in the first decade of the 21st century.
I thank Klaus Scherrer for suggesting that I be invited to contribute to this debate. Queen’s University hosts my webpages where some of the references may be found.
Bass BL (2002) RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem 71:817-846.
Bateson G (1979) Mind and Nature. A Necessary Unity. Dutton, New York, p 21
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2009) GenBank. Nucleic Acids Res 37:D26-D31
Bernardi G, Bernardi G (1986) Compositional constraints and genome evolution. J Mol Evol 24:1-11
Catania F, Lynch M (2008) Where do introns come from? PLoS Biology 6:11, e283
Cock AG (1977) Bernard’s symposium. The species concept in 1900. Biol J Linn Soc 9:1-30
Cristillo AD, Mortimer JR, Barrette IH, Lillicrap TP, Forsdyke DR (2001) Double-stranded RNA as a not-self alarm signal: to evade, most viruses purine-load their RNAs, but some (HTLV-1, Epstein-Barr) pyrimidine-load. J Theor Biol 208:475-491
Cristofari G, Darlix J-L (2002) The ubiquitous nature of RNA chaperone proteins. Prog Nuc Acid Res Mol Biol 72:223-268
Daskalogianni C, Apcher S, Candeias MM, Naski N, Calvo F, Fahräeus R (2008) Gly-Ala repeats induce position- and substrate-specific regulation of 26S proteasome-dependent partial processing. J Biol Chem 283:30090-30100
Forsdyke DR (2002) Selective pressures that decrease synonymous mutations in Plasmodium falciparum. Trends Parasitol 18:411-418
Grantham R (1980) Workings of the genetic code. Trends Biochem Sci 5:327-331
Greenberg H, Penman S (1966) Methylation and processing of ribosomal RNA in HeLa cells. J Mol Biol 21:527-535
Lewin B (2006) Genes IX. Jones and
Levitskaya J, Coram M, Levitsky V, Imreh S, Steigerwald-Mullen PM, Klein G, Kurilla MG, Masucci MG (1995) Inhibition of antigen processing by the internal repeat region of the Epstein-Barr virus nuclear antigen-1. Nature 375:685-688
Levitsky V, Masucci MG (2002) Manipulation of immune responses by Epstein-Barr virus. Virus Res 88:71-86
Pfeifer K, Tilghman SM (1994) Allele-specific gene expression in mammals: the curious case of imprinted RNAs. Genes Devel 8:1867-1874
Rayment JH, Forsdyke DR (2005) Amino acids as placeholders: Base composition pressures on protein length in malaria parasites and prokaryotes. Applied Bioinformatics 4:117-130
Schaap T (1971) Dual information in DNA and the evolution of the genetic code. J Theor Biol 32:293-298
Scherrer K, Darnell JE (1962) Sedimentation characteristics of rapidly labeled RNA from Hela cells. Biochem Biophys Res Comm 7:486-490
Scherrer K, Jost J (2007) Gene and genon concept: coding versus regulation. A conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theory Biosci 126:65-113
Starck SR, Cardinaud S, Shastri N (2008) Immune surveillance obstructed by viral mRNA. Proc Natl Acad Sci
Stoltzfus A, Spencer DF, Zuker M, Logsdon JM, Doolittle WF (1994) Testing the exon theory of genes: the evidence from protein structure. Science 265:202-207
Tellam J, Smith C, Rist M, Webb N, Cooper L, Vuocolo T, Connolly G, Tscharke DC, Devoy MP, Khanna R (2008) Regulation of protein translation through mRNA structure influences MHC class 1 loading and T cell recognition. Proc Natl Acad Sci
Wu H, Kapoor P, Frappier L (2002) Separation of the DNA replication, segregation, and transcriptional activation functions of Epstein-Barr nuclear antigen 1. J Virol 76:2480-2490
Xue HY, Forsdyke DR (2003) Low complexity segments in Plasmodium falciparum are primarily nucleic acid level adaptations. Mol Biochem Parasitol 128:21-32
Yates JL, Camiolo SM (1988) Dissection of DNA replication and enhancer activation functions of Epstein-Barr virus nuclear antigen 1. Cancer Cells 6:197-205
End Note August 2009
This paper was part of a special issue of Theory in Biosciences for which several authors with different backgrounds had been invited to consider the problem of gene definition. Hence the description “symposium” in the title.
This page was begun in December 2008 and last edited on 15 May 2010 by Donald Forsdyke