This article by Dr. Kenji Ikehara and Dr. Ryoko Oi has been published in Current Proteomics, Volume 14, 2017
Diverse organisms carrying many kinds of genes inhabit most of the Earth. As is already known, these organisms share a fundamental biological system comprising genes, the genetic code, and proteins. An important area of research is therefore the mechanism underlying the production of an entirely new (EntNew) gene, or the first protein belonging to a new family.
Previously, the design of a base sequence encoding a protein with a required function was not possible. Therefore, every EntNew gene would need to be created by random concatenation of monomeric units, that is, mononucleotides. However, it would also be impossible to create an EntNew gene through this random process: the number of possible base sequences encoding a small protein of 100 amino acids is (4^3)^100, or approximately 10^180, implying that a gene encoding even a small protein cannot be created directly by the random polymerization of mononucleotides.
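The size of this sequence space can be checked with a few lines of Python. This is only an illustrative calculation, assuming 64 possible triplets (4 bases at each of 3 positions) for each of 100 codons; the variable names are chosen here for illustration and do not come from the article.

```python
import math

CODONS = 4 ** 3   # 64 possible triplets from 4 bases
LENGTH = 100      # codons needed for a 100-residue protein

# Exact count of distinct 300-base coding sequences (Python integers are unbounded).
sequence_space = CODONS ** LENGTH

# The same quantity on a log10 scale, which is easier to read.
log10_space = LENGTH * math.log10(CODONS)

print(f"(4^3)^100 has {len(str(sequence_space))} decimal digits")
print(f"log10((4^3)^100) = {log10_space:.1f}")  # prints 180.6
```

The exponent of about 180.6 is why the figure is usually quoted as roughly 10^180.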
Nonetheless, various protein families, each originating from an EntNew gene, exist in extant organisms on Earth. It is therefore certain that, during evolution, organisms have acquired various genes encoding proteins that work like precision molecular machines, allowing them to adapt to various environments on Earth. This indicates the existence of a specific mechanism by which various EntNew genes have been created through substantially random processes.
We have proposed the GC-NSF(a) hypothesis for the formation of an EntNew gene, which suggests that an EntNew gene is generated from a non-stop frame on the antisense strand of a GC-rich gene (GC-NSF(a)). NSF(a) is the non-stop frame codon sequence on the antisense strand in the reading frame corresponding to the gene on the sense strand.
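As a minimal sketch of this definition, the following Python reads the antisense strand in register with the sense codons and checks that it contains no stop codon. The function names and the toy sequences are hypothetical, the standard stop codons (TAA, TAG, TGA) are assumed, and this is not presented as the authors' own software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def antisense_frame(sense_cds: str) -> list[str]:
    """Codons of the antisense strand, read 5'->3' in register with the sense codons."""
    if len(sense_cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    # Reverse complement of the whole coding sequence; because its length is a
    # multiple of 3, reading from position 0 keeps the sense codon boundaries.
    antisense = sense_cds.upper().translate(COMPLEMENT)[::-1]
    return [antisense[i:i + 3] for i in range(0, len(antisense), 3)]


def is_nsf_a(sense_cds: str) -> bool:
    """True if the corresponding antisense frame contains no stop codon."""
    return not any(codon in STOP_CODONS for codon in antisense_frame(sense_cds))


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


toy_gene = "ATGGCCGGCGACTGA"      # hypothetical GC-rich toy sequence
print(antisense_frame(toy_gene))  # ['TCA', 'GTC', 'GCC', 'GGC', 'CAT']
print(is_nsf_a(toy_gene))         # True
print(is_nsf_a("ATGTTAGCC"))      # False: sense TTA pairs with antisense TAA
```

A stop codon appears in this antisense frame only where the sense strand carries TTA, CTA, or TCA, all of which are AT-containing codons, so the check tends to pass more often as GC content rises.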
The GC-NSF(a) hypothesis assumes that an immature and flexible protein with weak catalytic activity, which is produced by GC-NSF(a) expression, evolves gradually into a mature enzyme with higher catalytic activity and more rigid structure as necessary base replacements accumulate onto GC-NSF(a).
Thereafter, to obtain direct evidence for the hypothesis, every amino acid sequence (AAS) of the imaginary protein encoded by GC-NSF(a) of the Pseudomonas aeruginosa PAO1 genome (GC content = 66.6%) was homology-searched against all AASs of extant proteins encoded by the same genome. We used NCBI BLASTP for computational investigation.
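The step that precedes such a search, deriving the imaginary AAS from a gene's antisense frame, might look like the sketch below. It assumes the standard genetic code and unambiguous A/C/G/T input. The helper names, the record name, and the toy sequence are hypothetical. The article does not describe the authors' actual pipeline, so this only illustrates the idea of producing a FASTA record that could be submitted to BLASTP.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_antisense(sense_cds: str) -> str:
    """Translate the antisense strand in the frame matching the sense codons."""
    if len(sense_cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    antisense = sense_cds.upper().translate(COMPLEMENT)[::-1]
    codons = (antisense[i:i + 3] for i in range(0, len(antisense), 3))
    # '*' marks a stop codon; a non-stop frame yields none.
    return "".join(CODON_TABLE[codon] for codon in codons)


def to_fasta(name: str, aas: str) -> str:
    """Format one amino acid sequence as a FASTA record."""
    return f">{name}\n{aas}\n"


imaginary_aas = translate_antisense("ATGGCCGGCGACTGA")  # hypothetical toy gene
print(to_fasta("toy_NSFa", imaginary_aas))
```

For the toy input the antisense codons are TCA, GTC, GCC, GGC, and CAT, which translate to SVAGH.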
The results suggested that the GC-NSF(a) AAS of tal encoding the C-terminal domain of transaldolase B has sufficient homology with the AAS of ftsZ encoding the C-terminal domain of cell division protein FtsZ. In addition, three other AASs were obtained by similar analysis of 57 GC-rich microbial genomes. Thus, we conclude that the EntNew gene encoding the EntNew protein was generated according to the GC-NSF(a) hypothesis.
An EntNew gene can be created from GC-NSF(a) with high probability because the 0th-order structure (pre-primary structure) of a protein, that is, a specific amino acid composition (in practice, an amino acid sequence), is written in the non-stop frame on the antisense strand (NSF(a)) of GC-rich, but not AT-rich, genes. In other words, GC-NSF(a) can encode the AAS of an immature protein that differs from any previously existing protein. Furthermore, the AAS encoded by GC-NSF(a) satisfies, with high probability, the six conditions for the formation of a water-soluble globular structure. In addition, the structure of this protein is slightly more flexible than that of extant proteins, making it easy to adjust surface amino acids to newly encountered substrates.
The spread of organisms currently present on Earth is a result of the emergence of the first EntNew gene on primitive Earth, followed by the emergence of other homologous genes within the same gene family and their corresponding proteins; together, these emergent proteins allowed these organisms to adapt to the various environmental conditions on this planet.