Biomedical scientists and biologists routinely consider how selection shapes the structure and function of proteins of interest. Less commonly, I suspect, do we consider how selection for attributes other than protein structure and function can favor or disfavor the nucleotide sequences that encode particular amino acid sequences. A new study (Stergachis et al., 2013), published in the December 13 issue of Science, presents strong evidence for one such source of selection, unrelated to protein function, that acts on the coding regions of genes, known as exons. This form of selection arises from the fact, as revealed by the authors, that many transcription factors (TF), proteins that bind to specific nucleotide sequences and regulate the frequency and pace of gene transcription (i.e., gene expression), bind in exonic regions of genes.

The main experimental method that Stergachis et al. (primarily from the lab of John Stamatoyannopoulos) used to identify sites of TF binding in genomic DNA was protection of DNA sequences from enzymatic cleavage by deoxyribonuclease I (DNase I). This approach to mapping TF occupancy was applied to genomic DNA from 81 different human cell types. Roughly 14% of coding nucleotides were part of a TF-binding site in at least one cell type. Importantly, 86.9% of genes contained at least one exonic site of TF binding. The authors suggest that, because of methodological and experimental limitations (e.g., more cell types could have been studied), the estimates for the number of TF-binding sites and for the number of genes containing at least one such site in a coding region are more likely to be underestimates than overestimates.

The chief implication of their analysis, at least for our purposes, is that variation in the identity of even one nucleotide in a site of TF binding can, but does not necessarily, dramatically affect the affinity of the TF for that site. In other words, if a TF binds in a coding region, the corresponding exon sequence is constrained by selection for (or potentially against) maintaining TF binding and function. Thus, a given codon and the encoded amino acid may be present in a gene, or in the protein encoded by that gene, because of the ability of the codon to interact with a TF rather than the ability of the encoded amino acid to influence the function of the gene product (i.e., the protein). This dual role prompted the authors to call these sites, which both encode amino acids in polypeptide chains and specify TF binding in genes, “duons.”
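To make the dual role concrete, here is a minimal Python sketch of how a synonymous substitution can leave the protein unchanged while disrupting a binding motif. The motif and the coding fragment are invented for illustration and are not taken from Stergachis et al.; counting mismatches against a consensus string is a crude stand-in for real TF affinity, which the authors assessed experimentally.

```python
from itertools import product

# Standard genetic code, built in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(seq):
    """Translate an in-frame DNA sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def min_mismatches(seq, motif):
    """Fewest mismatches between the motif and any same-length window of seq."""
    k = len(motif)
    return min(
        sum(a != b for a, b in zip(seq[i:i + k], motif))
        for i in range(len(seq) - k + 1)
    )

# Hypothetical consensus motif and hypothetical coding fragment (assumptions).
MOTIF = "GATAAG"
reference = "ATGGATAAGCTG"   # codons: ATG GAT AAG CTG
variant = "ATGGACAAGCTG"     # synonymous change GAT -> GAC (both encode Asp)

print(translate(reference), translate(variant))
print(min_mismatches(reference, MOTIF), min_mismatches(variant, MOTIF))
```

In this toy case both sequences translate to the same peptide, yet the variant no longer contains an exact match to the motif. That is the sense in which a "silent" change in a duon need not be silent with respect to TF binding.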

Stamatoyannopoulos and colleagues provide abundant evidence supporting the hypothesis that genomic sites of TF binding within exons are more evolutionarily constrained than genomic sites where TF do not bind. Interestingly, on average, both synonymous and nonsynonymous mutations in these exonic TF-binding sites appear to be evolutionarily younger than synonymous and nonsynonymous mutations outside of TF-binding sites. The authors suggest that these results support the claim that binding of TF to coding regions constrains the evolution of both the nucleotide sequences of codons and the amino acid sequences of polypeptides.

Other novel insights are presented in this stimulating study. For example, first exons of protein-encoding genes are the most likely to contain TF-binding sequences. Also of interest, for amino acids encoded by two or more codons, whichever codon is preferred genome-wide is also the most frequent within TF-binding sites.
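The codon-preference comparison can be sketched in a few lines of Python. This is only an illustration of the kind of tally involved, not the authors' analysis: the sequences below are made up, and the codon families cover just three two-codon amino acids from the standard genetic code.

```python
from collections import Counter

# Three amino acids that are each encoded by exactly two codons.
SYNONYMOUS = {"D": ("GAT", "GAC"), "E": ("GAA", "GAG"), "K": ("AAA", "AAG")}

def codons(seq):
    """Split an in-frame sequence into codons, ignoring any trailing bases."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def preferred_codon(seqs, amino_acid):
    """Return the most frequent synonymous codon for amino_acid across seqs."""
    counts = Counter(c for s in seqs for c in codons(s))
    return max(SYNONYMOUS[amino_acid], key=lambda c: counts[c])

# Invented in-frame fragments (assumptions, not real data).
all_coding = ["GACGAGAAGGAC", "GATGAGAAGGAC", "GACGAAAAG"]
in_tf_sites = ["GACGAGAAG", "GACGAG"]

for aa in SYNONYMOUS:
    print(aa, preferred_codon(all_coding, aa), preferred_codon(in_tf_sites, aa))
```

With these toy inputs the preferred codon for each amino acid is the same in both sets, which mirrors the pattern described above. A real analysis would use annotated coding sequences and footprint coordinates, and would need to handle ties and sampling noise.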

These results also have implications for gene-disease associations. Even if a disease-associated single nucleotide variant is located in a coding region, the variant may influence disease risk by affecting either protein function or the extent of protein synthesis.

This article is densely packed with interesting and novel observations and will repay close reading with stimulating insights and provocative questions. It also points to a larger perspective: the well-known codon bias (for amino acids with two or more codons) seen in some protein-coding genes can arise for a multitude of selective reasons. As noted in a commentary (Weatheritt and Babu, 2013) in the same issue of Science, many gene-related processes and phenomena depend on the precise nucleotide sequence of the gene, or of the messenger RNA transcribed from it, and may thereby influence the amount of functional protein ultimately produced in the cell. These include, for example, chromatin organization, enhancer function, mRNA splicing, microRNA target sites, and translational efficiency.

Finally, Weatheritt and Babu note that these new results permit the framing of many new experimental questions that will likely stimulate additional studies. For example, how are the trade-offs between optimizing protein function and protein production resolved by evolution? Or, do the mechanisms by which TF binding to exonic regions regulates gene transcription differ from the mechanisms by which TF binding to non-exonic regions regulates gene transcription? The answers to such questions are likely to inform our understanding of the genetic causes of disease.


Stergachis AB, Haugen E, Shafer A, Fu W, Vernot B, Reynolds A, Raubitschek A, Ziegler S, LeProust EM, Akey JM, Stamatoyannopoulos JA. Exonic transcription factor binding directs codon choice and affects protein evolution. Science. 2013 Dec 13;342(6164):1367-72. doi: 10.1126/science.1243490. PubMed PMID: 24337295.

Weatheritt RJ, Babu MM. Evolution. The hidden codes that shape protein evolution. Science. 2013 Dec 13;342(6164):1325-6. doi: 10.1126/science.1248425. PubMed PMID: 24337281.