In lay publications, writers commonly refer to the deoxynucleotide sequence of an individual’s nuclear genome as that individual’s “code” and to the determination of that sequence as “deciphering the code.” By the “genetic code,” however, molecular biologists mean not a DNA sequence but the relationships between RNA (or DNA) nucleotide triplets and particular amino acids. For those interested in clinical genetics, the real code-deciphering challenge is much more daunting than determining nucleotide sequences: it is the mapping of genotypes to medically relevant phenotypes, i.e., predicting diseases from the totality of sequences in a genome.

The paradox at the heart of genome-based personalized medicine, at the present state of our understanding, is easily put. Determining the identities of the alleles at every locus (whole genome sequencing, WGS) or at every protein-coding locus (whole exome sequencing, WES) by modern methods of high-throughput DNA sequencing provides highly individual information. The estimates for the corresponding allele-specific disease risks, however, are based on average results for particular samples of people previously studied. Unfortunately, some of these study populations may not adequately represent the genetic variability present in any particular individual.

Thus, the risk estimates for alleles associated with many conditions, especially those substantially influenced by genes at multiple loci and by environmental factors, are not necessarily truly personalized. Such complexity is readily explained by the significant extent of epistasis (functional interactions among genes at different loci) and by the potential for functional interactions between genes and aspects of the internal and external environments. A new study in Human Mutation (Cassa et al., 2013) demonstrates the potential for variation in phenotype in relation to alleles implicated, to one degree or another, in the pathogenesis of medical conditions.

These authors used data from the Human Gene Mutation Database (HGMD), the largest assemblage of putatively pathogenic human genetic variants, and the 1000 Genomes Project (TGP), the largest publicly available compilation of WGS data from individuals not manifesting known clinical conditions.  Specifically, they determined the prevalence of HGMD variants in the genomes of the TGP in relation to their predicted functional effects and pathogenicity classification.  The pathogenicity classification is based on the strength of the evidence associating a genomic alteration with clinical effects.  The results of Cassa et al. clearly reveal the scale of the challenge facing those who wish to transform genomic DNA sequence information into reliable risk estimates for multiple diseases or other medical conditions for individual subjects.

A total of over 6,900 HGMD variants of all pathogenicity classes were identified in at least one TGP genome (i.e., in one asymptomatic individual), and many variants were found repeatedly in different TGP genomes. More than 3,700 variants exhibited minor allele frequencies (MAF) of ≥0.01 and more than 2,800 variants exhibited MAF of ≥0.05. Over 60 variants classified as either “disease causing mutations” or “disease-associated nonsense mutations” had MAF of ≥0.01. The majority of variants classified as “disease causing” are predicted to be pathogenic. Key to interpreting these results is that the classification of genetic variants is based on predictive algorithms that assess the likelihood that a given mutation alters gene product function or expression, based on known structure-function correlations and evolutionary considerations.
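
As a rough illustration of the kind of tally described above, the following Python sketch computes a minor allele frequency from per-individual genotype counts and then counts how many catalogued variants meet a frequency threshold. The toy cohort, the variant names, and both helper functions are hypothetical and invented for illustration; this is not the pipeline used by Cassa et al.

```python
from typing import Dict, List


def minor_allele_frequency(alt_allele_counts: List[int]) -> float:
    """Return the MAF for one variant in a diploid cohort.

    alt_allele_counts holds, for each individual, the number of copies
    (0, 1, or 2) of the alternate allele.
    """
    total_alleles = 2 * len(alt_allele_counts)
    if total_alleles == 0:
        raise ValueError("cohort is empty")
    alt_frequency = sum(alt_allele_counts) / total_alleles
    # The minor allele is whichever of the two alleles is rarer.
    return min(alt_frequency, 1.0 - alt_frequency)


def count_at_or_above(mafs: Dict[str, float], threshold: float) -> int:
    """Count variants whose MAF is at least the given threshold."""
    return sum(1 for maf in mafs.values() if maf >= threshold)


# Hypothetical toy cohort: 10 asymptomatic individuals, 3 catalogued variants.
cohort = {
    "variant_A": [0, 1, 0, 0, 2, 0, 0, 1, 0, 0],  # 4 of 20 alleles
    "variant_B": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # 1 of 20 alleles
    "variant_C": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never observed
}

mafs = {name: minor_allele_frequency(g) for name, g in cohort.items()}

# Variants seen in at least one genome of the cohort.
observed = [name for name, g in cohort.items() if sum(g) > 0]

print(observed)
print(count_at_or_above(mafs, 0.01))
```

In a cohort this small, a single carrier already pushes a variant above a MAF of 0.01, which is one reason real frequency estimates depend on large reference panels.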

In the Discussion section, the authors suggest that some rare HGMD variants that have the potential to cause pathology but were not detected in the TGP genomes might occur in a larger pool of asymptomatic subjects. On the other hand, some asymptomatic individuals harboring any given HGMD variant might later develop the disease previously associated with that variant despite showing no signs at the time of the study.

The key conclusion of the authors is that, in the context of interpreting WGS data, use of compilations of pathogenic variants, even if these variants are thoroughly documented, may not yield reliable risk estimates for the corresponding diseases in any single subject in the absence of evidence for disease. Although the authors do not address the point in this study, these results do not argue against the value of WES or WGS in the effort to account for specific clinical manifestations.

A further, broader conclusion is in order, one that draws on the concept of the “incidentalome” (Kohane et al., JAMA, 2006). The term refers to unanticipated findings encountered when using technologies, such as those now employed in radiology or genomics, that provide massive amounts of information about a patient, information that potentially has no direct bearing on the clinical question originally motivating the test.

What has become clear from the results of screening with imaging (Illes et al., Science, 2006) or genomic (Berg et al., Genetics in Medicine, 2011) techniques is that the interpretation of a finding should generally be adjusted depending on the broader clinical context. From this perspective, a density in the lung disclosed by CT scan in the presence of symptoms and/or signs consistent with tuberculosis or lung carcinoma will have a different probability of being meaningful than if it is observed in a seemingly healthy individual who was imaged for an unrelated reason, such as a musculoskeletal complaint. Similarly, a given genetic variant has different probabilities of being clinically meaningful in the context of the expected clinical manifestations versus the setting of no symptoms.
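
The dependence on clinical context can be made concrete with Bayes' rule. The Python sketch below computes the probability that a finding is clinically meaningful given a prior (pre-test) probability and the finding's sensitivity and specificity. The function name and every number in the example are hypothetical, chosen only to show the direction and size of the effect; none comes from the studies cited here.

```python
def post_test_probability(prior: float, sensitivity: float, specificity: float) -> float:
    """Probability that disease is present given a positive finding (Bayes' rule)."""
    for value in (prior, sensitivity, specificity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be probabilities in [0, 1]")
    true_positives = sensitivity * prior
    false_positives = (1.0 - specificity) * (1.0 - prior)
    denominator = true_positives + false_positives
    if denominator == 0.0:
        raise ValueError("a positive finding is impossible with these inputs")
    return true_positives / denominator


# Hypothetical test characteristics, identical in both scenarios.
SENSITIVITY = 0.99
SPECIFICITY = 0.99

# Hypothetical priors: a patient with consistent symptoms versus a
# seemingly healthy person screened for an unrelated reason.
symptomatic = post_test_probability(0.30, SENSITIVITY, SPECIFICITY)
asymptomatic = post_test_probability(0.001, SENSITIVITY, SPECIFICITY)

print(f"symptomatic:  {symptomatic:.3f}")
print(f"asymptomatic: {asymptomatic:.3f}")
```

Under these assumed numbers, the same finding is very likely meaningful in the symptomatic patient yet more likely than not a false alarm in the asymptomatic one, because the low prior lets false positives outnumber true positives.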

These considerations suggest that the marketing efforts of direct-to-consumer providers of genomic testing and interpretation are highly problematic. Customers, many or possibly even most of whom are without obvious ongoing pathological processes, are encouraged to submit samples for DNA extraction and analysis. The promise offered by these for-profit enterprises is that customers will learn their precise and personalized risks for numerous conditions. As delineated above, in such a setting the greatest risk is that individuals submitting samples will receive misleading estimates of their disease risks.


Cassa CA, Tong MY, Jordan DM. Large numbers of genetic variants considered to be pathogenic are common in asymptomatic individuals. Hum Mutat. 2013 Jul 1. doi: 10.1002/humu.22375. [Epub ahead of print] PubMed PMID: 23818451.

Kohane IS, Masys DR, Altman RB. The incidentalome: a threat to genomic medicine. JAMA. 2006 Jul 12;296(2):212-5. Erratum in: JAMA. 2006 Sep 27;296(12):1466. PubMed PMID: 16835427.

Illes J, Kirschen MP, Edwards E, Stanford LR, Bandettini P, Cho MK, Ford PJ, Glover GH, Kulynych J, Macklin R, Michael DB, Wolf SM; Working Group on Incidental Findings in Brain Imaging Research. Ethics. Incidental findings in brain imaging research. Science. 2006 Feb 10;311(5762):783-4. PubMed PMID: 16469905; PubMed Central PMCID: PMC1524853.

Berg JS, Khoury MJ, Evans JP. Deploying whole genome sequencing in clinical practice and public health: meeting the challenge one bin at a time. Genet Med. 2011 Jun;13(6):499-504. doi: 10.1097/GIM.0b013e318220aaba. PubMed PMID: 21558861.