ploscollections.org/translationalbioinformatics
'Translational Bioinformatics' is a collection of PLOS Computational Biology Education articles which reads as
a "book" to be used as a reference or tutorial for a graduate level introductory course on the science of
translational bioinformatics.
Translational bioinformatics is an emerging field that addresses the current challenges of integrating
increasingly voluminous amounts of molecular and clinical data. Its aim is to provide a better understanding of
the molecular basis of disease, which in turn will inform clinical practice and ultimately improve human health.
The concept of a translational bioinformatics introductory book was originally conceived in 2009 by Jake Chen
and Maricel Kann. Each chapter was crafted by leading experts who provide a solid introduction to the topics
covered, complete with training exercises and answers. The rapid evolution of this field is expected to lead to
updates and new chapters that will be incorporated into this collection.
Collection editors: Maricel Kann, Guest Editor, and Fran Lewitter, PLOS Computational Biology Education
Editor.
Table of Contents
Chapter 7: Pharmacogenomics
Konrad J. Karczewski, Roxana Daneshjou, Russ B. Altman
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817
This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Altman RB (2012) Introduction to Translational Bioinformatics Collection. PLoS Comput Biol 8(12): e1002796. doi:10.1371/journal.pcbi.1002796
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: 2012 Russ B. Altman. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for writing this article.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: russ.altman@stanford.edu

How should we define translational bioinformatics? I had to answer this question unambiguously in March 2008 when I was asked to deliver a review of recent progress in translational bioinformatics at the American Medical Informatics Association's Summit on Translational Bioinformatics. The lecture required me to define papers in the field, and then highlight exciting progress that occurred over the previous ~12 months. I have repeated this for the last few years, and the most difficult part of the exercise is limiting my review only to those papers that are within the field.

I have never worried much about definitions within informatics fields; they tend to overlap, merge and evolve. "Informatics" seems clear: the study of how to represent, store, search, retrieve and analyze information. The adjectives in front of informatics vary but also tend to make sense: medical informatics concerns medical information, bioinformatics concerns basic biological information, clinical informatics focuses on the clinical delivery part of medical informatics, biomedical informatics merges bioinformatics and medical informatics, imaging informatics focuses on images, and so on. So what does this adjective "translational" denote?

Translational medical research has emerged as an important theme in the last decade. Starting with top-down leadership from the National Institutes of Health and its former Director, Dr. Elias Zerhouni, and moving through academic medical centers, research institutes and industrial research and development efforts, there has been interest in more effectively moving the discoveries and innovations in the laboratory to the bedside, leading to improved diagnosis, prognosis, and treatment. Translational research encompasses many activities including the creation of medical devices, molecular diagnostics, small molecule therapeutics, biological therapeutics, vaccines, and others. One of the main targets of translation, however, is the revolutionary explosion of knowledge in molecular biology, genetics, and genomics. Some believe that the tremendous progress in discovery over the last 50+ years since elucidation of the double helix structure has not translated (there's that word!) into much practical health benefit. While the accuracy of this claim can be debated, there can be no debate that our ability to measure (1) DNA sequence (including entire genomes!), (2) RNA sequence and expression, (3) protein sequence, structure, expression and modification, and (4) small molecule metabolite structure, presence, and quantity has advanced rapidly and enables us to imagine fantastic new technologies in pursuit of human health.

There are many barriers to translating our molecular understanding into technologies that impact patients. These include understanding health market size and forces, the regulatory milieu, how to harden the technology for routine use, and how to navigate an increasingly complex intellectual property landscape. But before those activities can begin, we must overcome an even more fundamental barrier: connecting the stuff of molecular biology to the clinical world. Molecular and cellular biology studies genes, DNA, RNA messengers, microRNAs, proteins, signaling molecules and their cascades, metabolites, cellular communication processes and cellular organization. These data are freely available in valuable resources such as GenBank (http://www.ncbi.nlm.nih.gov/genbank/), Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/), Protein Data Bank (http://www.wwpdb.org/), KEGG (http://www.genome.jp/kegg/), MetaCyc (http://metacyc.org/), Reactome (http://www.reactome.org), and many other resources. The clinical world studies diseases, signs, symptoms, drugs, patients, clinical laboratory measurements, and clinical images. The emergence of clinical and health information technologies has begun to make these clinical data available for research through biobanks, electronic medical records, FDA resources about drug labels and adverse events, and claims data. Therefore, a major challenge for translational medicine is to connect the molecular/cellular world with the clinical world. The published literature, available in PubMED (http://www.ncbi.nlm.nih.gov/pubmed), does this, as does the Unified Medical Language System (UMLS) that provides a lingua franca (http://www.nlm.nih.gov/research/umls/). However, it falls to translational bioinformatics to engineer the tools that link molecular/cellular entities and clinical entities. Thus, I define translational bioinformatics research as "the development and application of informatics methods that connect molecular entities to clinical entities."

In this collection, Dr. Kann and colleagues have assembled a wonderful group of authors to introduce the key threads of translational bioinformatics to those new to the field. The collection first provides a conceptual overview of the key data and concepts in the field, and then introduces some of the key methods for informatics discovery and applications. Just by examining the table of contents on the collection page (http://www.ploscollections.org/translationalbioinformatics), it is clear that many exciting and emerging health topics are squarely within the scope of translational bioinformatics: cancer, pharmacogenomics, medical genetics, small molecule drugs, and diseases of protein malfunction. There is an unmistakable
Figure 2. Ogden-Richards semiotic triad, illustrating the relationships between the three major semiotic-derived types of
meaning.
doi:10.1371/journal.pcbi.1002826.g002
Table 1. Overview of information and knowledge types incumbent to the translational sciences.

Individual and/or Population Phenotype
Description: This information type involves data elements and metadata that describe characteristics at the individual or population levels that relate to the physiologic and behavioral manifestation of healthy and disease states.
Examples:
- Demographics
- Clinical exam findings
- Qualitative characteristics
- Laboratory testing results

Individual and/or Population Bio-markers
Description: This information type involves data elements and metadata that describe characteristics at the individual or population levels that relate to the bio-molecular manifestation of healthy and disease states.
Examples:
- Genomic, proteomic and metabolomic expression profiles
- Novel bio-molecular assays capable of measuring bio-molecular structure and function

Domain Knowledge
Description: This knowledge type is comprised of community-accepted, or otherwise verified and validated [17] sources of biomedical knowledge relevant to a domain of interest. Collectively, these types of domain knowledge may be used to support multiple operations, including: 1) hypothesis development; 2) hypothesis testing; 3) comparative analyses; or 4) augmentation of experimental data sets with statistical or semantic annotations [15,17,125].
Examples:
- Literature databases
- Public or private databases containing experimental results or reference standards
- Ontologies
- Terminologies

Biological Models and Technologies
Description: This knowledge type typically consists of: 1) empirically validated system or sub-system level models that serve to define the mechanisms by which bio-molecular and phenotypic processes and their markers/indicators interact as a network [6,20,124,126]; and 2) novel technologies that enable the analysis of integrative data sets in light of such models. By their nature these tools include algorithmic or embedded knowledge sources [124,126].
Examples:
- Algorithms
- Quantitative Models
- Analytical Pipelines
- Publications

Translational Biomedical Knowledge
Description: Translational biomedical knowledge represents a sub-type of general biomedical knowledge that is concerned with a systems-level synthesis (i.e., incorporate quantitative, qualitative, and semantic annotations) of pathophysiologic or biophysical processes or functions of interest (e.g., pharmacokinetics, pharmacodynamics, bionutrition, etc.), and the markers or other indicators that can be used to instrument and evaluate such models.
Examples:
- Publications
- Guidelines
- Integrative Data Sets
- Conceptual Knowledge Collections

doi:10.1371/journal.pcbi.1002826.t001
The role of Biomedical Informatics and KE in this framework is to address the four major information management challenges enumerated earlier relative to the ability to generate Translational Biomedical Knowledge, namely: 1) the collection and management of high throughput, multi-dimensional data; 2) the generation and testing of hypotheses relative to such integrative data sets; 3) the provision of data analytic pipelines; and 4) the dissemination of knowledge collections resulting from research activities.

5.1 Design Pattern for Translational Science Knowledge Integration
Informed by the conceptual framework introduced in the preceding section and illustrated in Figure 3, we will now summarize the design and execution pattern used to address such knowledge integration requirements. This design pattern can be broadly divided into four major phases that collectively define a cyclical and iterative process (which we will refer to as a "translational research cycle"). For each phase of the pattern, practitioners must consider both the required inputs and anticipated outputs, and their interrelationships between and across phases.
methods are applied to generate and test integrative hypotheses in a high-throughput manner. Such techniques require the development and use of novel KA and KR methods and structures, as well as the design and verification/validation of knowledge-based systems targeting the aforementioned intersection point. There are several exemplary instances of investigational tools and projects targeting this space, including RiboWeb, BioCyc, and a number of initiatives concerned with the modeling and analysis of complex biological systems [6,113,114,120,124]. In addition, there are a number of large-scale conceptual knowledge collections focusing on this particular area that can be explored as part of the repositories maintained and curated by the National Center for Biomedical Ontology (NCBO). However, broadly accepted methodological approaches and knowledge collections related to this area generally remain developmental.

7. Summary
As was stated at the outset of this chapter, our goals were to review the basic theoretical frameworks that define core knowledge types and reasoning operations with particular emphasis on the applicability of such conceptual models within the biomedical domain, and to introduce a number of prototypical data integration requirements and patterns relevant to the conduct of translational bioinformatics that can be addressed via the design and use of knowledge-based systems. In doing so, we have provided:

- Definitions of the basic knowledge types and structures that can be applied to biomedical and translational research;
- An overview of the knowledge engineering cycle, and the products generated during that cycle;
- Summaries of basic methods, techniques, and design patterns that can be used to employ knowledge products in order to integrate and reason upon heterogeneous and multi-dimensional data sets; and
- An introduction to the open research questions and areas related to the ability to apply knowledge collections and knowledge-anchored reasoning processes across multiple networks or knowledge collections.

Given that translational bioinformatics is defined by the presence of complex, heterogeneous, multi-dimensional data sets, and in light of the growing volume of biomedical knowledge collections, the ability to apply such knowledge collections to biomedical data sets requires an understanding of the sources of such knowledge, and methods of applying them to reasoning applications. Ultimately,
Further Reading
- Brachman RJ, McGuinness DL (1988) Knowledge representation, connectionism and conceptual retrieval. Proceedings of the 11th annual international ACM SIGIR conference on research and development in information retrieval. Grenoble, France: ACM Press.
- Campbell KE, Oliver DE, Spackman KA, Shortliffe EH (1998) Representing thoughts, words, and things in the UMLS. J Am Med Inform Assoc 5: 421–431.
- Compton P, Jansen R (1990) A philosophical basis for knowledge acquisition. Knowledge Acquisition 2: 241–257.
- Gaines BR (1989) Social and cognitive processes in knowledge acquisition. Knowledge Acquisition 1: 39–58.
- Kelly GA (1955) The psychology of personal constructs. New York: Norton. 2 v. (1218).
- Liou YI (1990) Knowledge acquisition: issues, techniques, and methodology. Orlando, Florida, United States: ACM Press. pp. 212–236.
- McCormick R (1997) Conceptual and procedural knowledge. International Journal of Technology and Design Education 7: 141–159.
- Newell A, Simon HA (1981) Computer science as empirical inquiry: symbols and search. In: Haugeland J, editor. Mind design. Cambridge: MIT Press/Bradford Books. pp. 35–66.
- Patel VL, Arocha JF, Kaufman DR (2001) A primer on aspects of cognition for medical informatics. J Am Med Inform Assoc 8: 324–343.
- Preece A (2001) Evaluating verification and validation methods in knowledge engineering. Micro-Level Knowledge Management: 123–145.
- Zhang J (2002) Representations of health concepts: a cognitive perspective. J Biomed Inform 35: 17–24.
Glossary
- Data: factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation [127]
- Information: knowledge obtained from investigation, study, or instruction [127]
- Knowledge: the circumstance or condition of apprehending truth or fact through reasoning [127]
- Knowledge engineering: a branch of artificial intelligence that emphasizes the development and use of expert systems [128]
- Knowledge acquisition: the act of acquiring knowledge
- Knowledge representation: the symbolic formalization of knowledge
- Conceptual knowledge: knowledge that consists of atomic units of information and meaningful relationships that serve to interrelate those units.
- Strategic knowledge: knowledge used to infer procedural knowledge from conceptual knowledge.
- Procedural knowledge: knowledge that is concerned with a problem-oriented understanding of how to address a given task or activity.
- Terminology: the technical or special terms used in a business, art, science, or special subject [128]
- Ontology: a rigorous and exhaustive organization of some knowledge domain that is usually hierarchical and contains all the relevant entities and their relations [128]
- Multi-dimensional data: data spanning multiple levels or contexts of granularity or scope while maintaining one or more common linkages that span such levels
- Motif: a reproducible pattern
- Mashup: a combination of multiple, heterogeneous data or knowledge sources in order to create an aggregate collection of such elements or concepts.
- Intelligent agent: a software agent that employs a formal knowledge-base in order to replicate expert performance relative to problem solving in a targeted domain.
- Clinical phenotype: the observable physical and biochemical characteristics of an individual that serve to define clinical status (e.g., health, disease)
- Biomarker: a bio-molecular trait that can be measured to assess risk, diagnosis, status, or progression of a pathophysiologic or disease state.
References
1. Coopers PW (2008) Research rewired. 48 p.
2. Casey K, Elwell K, Friedman J, Gibbons D, Goggin M, et al. (2008) A broken pipeline? Flat funding of the NIH puts a generation of science at risk. 24 p.
3. Payne PR, Johnson SB, Starren JB, Tilson HH, Dowdy D (2005) Breaking the translational barriers: the value of integrating biomedical informatics and translational research. J Investig Med 53: 192–200.
4. Research NDsPoC (1997) NIH director's panel on clinical research report. Bethesda, MD: National Institutes of Health.
5. Sung NS, Crowley WF, Jr., Genel M, Salber P, Sandy L, et al. (2003) Central challenges facing the national clinical research enterprise. JAMA 289: 1278–1287.
6. Butte AJ (2008) Medicine. The ultimate model organism. Science 320: 325–327.
7. Chung TK, Kukafka R, Johnson SB (2006) Reengineering clinical research with informatics. J Investig Med 54: 327–333.
8. Kaiser J (2008) U.S. budget 2009. NIH hopes for more mileage from roadmap. Science 319: 716.
9. Kush RD, Helton E, Rockhold FW, Hardison CD (2008) Electronic health records, medical research, and the Tower of Babel. N Engl J Med 358: 1738–1740.
10. Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, et al. (2007) Advancing translational research with the Semantic Web. BMC Bioinformatics 8 Suppl 3: S2.
11. Fridsma DB, Evans J, Hastak S, Mead CN (2008) The BRIDG project: a technical report. J Am Med Inform Assoc 15: 130–137.
12. Maojo V, García-Remesal M, Billhardt H, Alonso-Calvo R, Perez-Rey D, et al. (2006) Designing new methodologies for integrating biomedical information in clinical trials. Methods Inf Med 45: 180–185.
13. Ash JS, Anderson NR, Tarczy-Hornoch P (2008) People and organizational issues in research systems implementation. J Am Med Inform Assoc 15: 283–289.
14. Payne PR, Mendonca EA, Johnson SB, Starren JB (2007) Conceptual knowledge acquisition in biomedicine: a methodological review. J Biomed Inform 40: 582–602.
15. Richesson RL, Krischer J (2007) Data standards in clinical research: gaps, overlaps, challenges and future directions. J Am Med Inform Assoc 14: 687–696.
16. Erickson J (2008) A decade and more of UML: an overview of UML semantic and structural issues and UML field use. Journal of Database Management 19: I-Vii.
17. van Bemmel JH, van Mulligen EM, Mons B, van Wijk M, Kors JA, et al. (2006) Databases for knowledge discovery. Examples from biomedicine and health care. Int J Med Inform 75: 257–267.
18. Oster S, Langella S, Hastings S, Ervin D, Madduri R, et al. (2008) caGrid 1.0: an enterprise grid infrastructure for biomedical research. J Am Med Inform Assoc 15: 138–149.
19. Kukafka R, Johnson SB, Linfante A, Allegrante JP (2003) Grounding a new information technology implementation framework in behavioral science: a systematic analysis of the literature on IT use. J Biomed Inform 36: 218–227.
20. Zerhouni EA (2005) Translational and clinical science - time for a new vision. N Engl J Med 353: 1621–1623.
21. Sim I (2008) Trial registration for public trust: making the case for medical devices. J Gen Intern Med 23 Suppl 1: 64–68.
22. Preece A (2001) Evaluating verification and validation methods in knowledge engineering. Micro-Level Knowledge Management: 123–145.
23. Brachman RJ, McGuinness DL (1988) Knowledge representation, connectionism and conceptual retrieval. Proceedings of the 11th annual international ACM SIGIR conference on research and development in information retrieval. Grenoble, France: ACM Press.
24. Compton P, Jansen R (1990) A philosophical basis for knowledge acquisition. Knowledge Acquisition 2: 241–257.
25. Gaines BR (1989) Social and cognitive processes in knowledge acquisition. Knowledge Acquisition 1: 39–58.
26. Gaines BR, Shaw MLG (1993) Knowledge acquisition tools based on personal construct psychology.
27. Liou YI (1990) Knowledge acquisition: issues, techniques, and methodology. Orlando, Florida, United States: ACM Press. pp. 212–236.
28. Yihwa Irene L (1990) Knowledge acquisition: issues, techniques, and methodology. Proceedings of the 1990 ACM SIGBDP conference on trends and directions in expert systems. Orlando, Florida, United States: ACM Press.
29. Glaser R (1984) Education and thinking: the role of knowledge. American Psychologist 39: 93–104.
30. Hiebert J (1986) Procedural and conceptual knowledge: the case of mathematics. London: Lawrence Erlbaum Associates.
Abstract: Modern experimental strategies often generate genome-scale measurements of human tissues or cell lines in various physiological states. Investigators often use these datasets individually to help elucidate molecular mechanisms of human diseases. Here we discuss approaches that effectively weight and integrate hundreds of heterogeneous datasets into gene-gene networks that focus on a specific process or disease. Diverse and systematic genome-scale measurements provide such approaches both a great deal of power and a number of challenges. We discuss some such challenges as well as methods to address them. We also raise important considerations for the assessment and evaluation of such approaches. When carefully applied, these integrative data-driven methods can make novel high-quality predictions that can transform our understanding of the molecular basis of human disease.

Citation: Greene CS, Troyanskaya OG (2012) Chapter 2: Data-Driven View of Disease Biology. PLoS Comput Biol 8(12): e1002816. doi:10.1371/journal.pcbi.1002816
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: 2012 Greene, Troyanskaya. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the National Science Foundation (NSF) CAREER [award DBI-0546275]; National Institutes of Health (NIH) [R01 GM071966, R01 HG005998 and T32 HG003284]; National Institute of General Medical Sciences (NIGMS) Center of Excellence [P50 GM071508]. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: ogt@genomics.princeton.edu

1. Introduction
Researchers are using genome-scale experimental methods (i.e. approaches that assay hundreds or thousands of genes at a time) to probe the molecular mechanisms of normal biological processes and disease states across systems from cell culture to human tissue samples. Data of this scale can provide a great deal of information about the process or disease of interest, the tissue of origin, and the metabolic state of the organism, among other factors. To understand biological processes on a systems level one must combine data from measurements across different molecular levels (e.g. proteomic, metabolomic, and genomic measurements) while incorporating data from diverse experiments within each individual level. An effective integrative analysis will take advantage of these data to develop a systems level understanding of diseases or tissues.

Human genome-scale experimental data include microarrays [1,2,3], genome-wide association studies [4,5], and RNA interference screens [6,7] among many other experimental designs [8]. These experiments range from those targeted towards tissue specificity [9] to those targeted towards specific diseases such as cancer [10]. The NCBI Gene Expression Omnibus (GEO) [11], a database of microarrays alone, contains over 700 human datasets collected under diverse experimental conditions encompassing more than 8000 individual arrays. The human PeptideAtlas [12], a similar resource for proteomics experiments, currently contains almost 6.7 million MS/MS spectra representing almost 84,000 non-singleton peptides across 220 samples. In addition to these high throughput experiments, there are databases of biochemical pathways [13], gene function [14], pharmacogenomics [15], and protein-protein interactions [16,17,18].

Integrating heterogeneous genome-scale experiments and databases is a challenging task. Beyond the straightforward concern of experimental noise in each individual dataset, integrative approaches also face particular challenges inherent to the process of unifying heterogeneous data types. Specifically we are concerned with biological and computational sources of heterogeneity. Biological heterogeneity among experiments emerges from the measurement of many different processes or the unique probing of biological systems. The source of biological material (e.g. whether experiments measure cells in culture or biopsied tissues) can also lead to systematic biological heterogeneity. Computational heterogeneity (e.g. some datasets have discrete value measurements while others are continuous) comes from the diversity of experimental platforms used to assay biological processes. Integrative approaches that bring together diverse data types and experiments must address the challenge of effectively combining these data for inference.

There are many strategies for combining these diverse and heterogeneous data. These include ridge regression [19,20], Bayesian inference [21,22,23,24,25], expectation maximization [26], and support vector machines [27]. This chapter focuses on the strategy of Bayesian integration, which is capable of both predicting the probability of an interaction between gene pairs and providing information on the contribution of each experiment to that prediction. Bayesian integration allows for datasets to be combined based on the strength of evidence from individual datasets, which can be either learned from the data [28] or expert annotated [29]. Intuitively the Bayesian strategy works by evaluating the accuracy and coverage of each individual dataset and the relevance of each source of data to the disease or tissue of interest and using this information to weight each dataset's impact on resulting predictions. Here we discuss Bayesian methods that infer genome-scale functional relationship networks from high throughput experimental data by building on existing gold standards. We discuss how these methods work, how to develop high quality gold standards, and how to evaluate networks of predicted functional relationships.
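One common way to make the Bayesian integration strategy described in the introduction concrete is a naive Bayes formulation. The notation here is an illustrative assumption rather than the chapter's own: FR denotes the event that a gene pair has a functional relationship, D_i is the (discretized) evidence from dataset i, and the n datasets are assumed to be conditionally independent given FR.

$$
P(FR \mid D_1, \dots, D_n) =
\frac{P(FR) \prod_{i=1}^{n} P(D_i \mid FR)}
     {P(FR) \prod_{i=1}^{n} P(D_i \mid FR) + P(\neg FR) \prod_{i=1}^{n} P(D_i \mid \neg FR)}
$$

Written as log odds, each dataset contributes one additive term, which is one way to read off the contribution of each experiment to a prediction:

$$
\log \frac{P(FR \mid D_1, \dots, D_n)}{P(\neg FR \mid D_1, \dots, D_n)} =
\log \frac{P(FR)}{P(\neg FR)} + \sum_{i=1}^{n} \log \frac{P(D_i \mid FR)}{P(D_i \mid \neg FR)}
$$

Under this sketch, a dataset whose values are distributed identically for positive and negative gold-standard examples contributes a term of zero and so does not move the prediction, while a dataset that separates the two classes well contributes a large positive or negative term.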
Figure 1. Potential distributions of experimental results obtained for datasets collected under three different conditions. The dotted
line indicates the distribution of negative examples and the solid line indicates the distribution of positive examples. In condition A the positive
examples more often occur to the right of the negative examples, in condition B both sets overlap, and in condition C the positive examples occur
more often to the left of the negative examples.
doi:10.1371/journal.pcbi.1002816.g001
Genes are discretized into values above or below the median. The numbers of positive and negative examples come from the gold standard. These values can be used
to predict the probability that a gene with unknown status is involved in the disease.
doi:10.1371/journal.pcbi.1002816.t001
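The calculation described in the table note above can be sketched in a few lines. This is a minimal illustration, assuming a naive Bayes model (datasets treated as conditionally independent given disease status); the dataset names, counts, and prior are hypothetical and are not taken from the chapter.

```python
from math import prod


def posterior_disease(observations, counts, prior):
    """Naive Bayes posterior that a gene is involved in the disease.

    observations: dict mapping dataset name -> "above" or "below" (the median)
    counts: dict mapping dataset name -> {"pos": {...}, "neg": {...}}, where each
            inner dict tallies gold-standard genes seen "above" and "below"
    prior: assumed prior probability that a gene is involved in the disease
    """
    def likelihood(label):
        # P(observed values | label), multiplied across datasets.
        # Assumption: datasets are conditionally independent given the label.
        # No smoothing is applied, so a zero count yields a zero likelihood.
        terms = []
        for dataset, value in observations.items():
            table = counts[dataset][label]
            terms.append(table[value] / (table["above"] + table["below"]))
        return prod(terms)

    pos = prior * likelihood("pos")
    neg = (1.0 - prior) * likelihood("neg")
    return pos / (pos + neg)


# Hypothetical gold-standard tallies (illustrative only).
counts = {
    "dataset_A": {"pos": {"above": 8, "below": 2},
                  "neg": {"above": 30, "below": 70}},
    "dataset_B": {"pos": {"above": 5, "below": 5},
                  "neg": {"above": 50, "below": 50}},
}
prior = 0.1  # assumed fraction of genes involved in the disease

p_both = posterior_disease({"dataset_A": "above", "dataset_B": "below"}, counts, prior)
p_a_only = posterior_disease({"dataset_A": "above"}, counts, prior)
print(round(p_both, 3), round(p_a_only, 3))
```

In this made-up example dataset_B looks the same for positive and negative genes, so adding it leaves the posterior unchanged; dataset_A, where positives fall above the median more often than negatives, is what raises the probability above the prior.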
gold standard. By building a gold standard from the biological process ontology that be calculated from diverse data using
of positive and negative relationships, it were appropriate for confirmation or refu- Bayesian inference. The process is similar
becomes possible to predict whether or not tation through laboratory experiments such to the integration process described for
a pair of genes interacts. as response to DNA damage stimulus and single-gene prediction, but there are differ-
As with all machine learning strategies, aldehyde metabolism. These terms can ences. For each dataset, appropriate scores
the gold standard determines what type of be downloaded and used to build a positive for each gene pair must be calculated.
relationship can be discovered. Here we functional relationship standard. Gene pairs Furthermore, these scores should not re-
will describe the process of building a gold where both pairs share one of these terms quire any manual intervention or adjust-
standard of functional relationships, but a different standard of only physical or only metabolic interactions could be used to develop a network with those types of connections. Here we define two genes as having a functional relationship if they work together to carry out a biological process (e.g. a KEGG pathway) that can be assayed by definitive experimental follow-up. This definition allows us to capture diverse types of relationships, while discovering relationships suitable for biological follow-up. The Gene Ontology's biological process ontology provides annotations of genes to process, but includes both very broad and very narrow processes. Two examples of broad terms would be "biological regulation" and "response to stimulus". Two examples of narrow terms would be "positive regulation of cell growth involved in cardiac muscle cell development" and "cell-matrix adhesion involved in tangential migration using cell-cell interactions". The broad terms are not specific enough to provide a meaningful gold standard, while the narrow terms have too few annotations to provide sufficient examples of known relationships.
To address this shortcoming, Myers et al. [30] used a panel of experts to select terms
can be considered to have a functional relationship. Gene pairs which do not share an annotation are of unknown status. For Bayesian inference we must also have a negative standard. One potential way to develop a negative standard would be to randomly select pairs of genes. This assumes that most pairs of genes do not interact.
It is possible to add additional high quality experimentally annotated relationships to these standards from other databases. Databases like KEGG [13], Reactome [31], and HPRD [32] have previously been used to identify additional functional relationships [33]. The positive and negative relationships from the standard determine the type of relationship that will be predicted by the Bayesian integration. Here we use functional relationships, but a gold standard built strictly from physical protein-protein interactions will infer only physical interaction relationships between genes.
4. Building a Network of Functionally Related Genes
Given a gold standard of gene-gene relationships, the probability that two genes of unknown status have a relationship can
ment that would make an analysis of hundreds or thousands of datasets time consuming. For datasets that are naturally made up of pair-wise scores such as yeast two-hybrid assays, this task is straightforward. For datasets made up of individual gene measurements, such as microarray experiments, a useful measure must be found.
One measure that can provide pair-wise scores across arrays is correlation. Correlation quantifies the amount that two genes vary together and can be a useful indicator of functional relationships. Comparing correlation across datasets in a regular manner is difficult however, because datasets may display more or less correlation based on either true biology (e.g. under some conditions more genes vary together) or experimental error (e.g. systematic biases due to hybridization conditions), and the variance of gene-wise correlations would vary based on these dataset dependent effects. Fisher's z-transform provides a means to convert these correlation coefficients (r) to z-scores by calculating z as
$$z = \frac{1}{2} \ln\left(\frac{1+r}{1-r}\right).$$
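As an illustration of the transform, the following minimal Python sketch computes a Pearson correlation between two expression profiles and converts it to a z-score with Fisher's z-transform. The function names and the two five-array profiles are hypothetical and are not taken from any dataset discussed in the chapter.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length profiles with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r):
    """Fisher's z-transform: z = 0.5 * ln((1 + r) / (1 - r)), for -1 < r < 1."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


# Hypothetical expression values for two genes across five arrays.
gene_a = [2.1, 3.4, 1.0, 4.2, 3.3]
gene_b = [1.9, 3.0, 1.2, 4.0, 2.8]

r = pearson_r(gene_a, gene_b)
z = fisher_z(r)
print(f"r = {r:.3f}, z = {z:.3f}")
```

The transform is undefined at r = 1 and r = -1, which is why the sketch rejects those values rather than returning infinity.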
Figure 3. The result of querying HEFalMp for the role of APOE across all biological processes. Red links indicate that there is a high probability of a functional relationship between the two genes.
doi:10.1371/journal.pcbi.1002816.g003
gene situation, we were interested in $P(D_i \mid E_i)$, or the probability of gene i causing disease given its evidence. Here we are interested in the probability of a functional relationship between genes i and j, $P(FR_{i,j})$, given some pair-wise evidence (e.g. correlation), $E_{i,j}$. As in the single gene situation, this can be calculated with
$$P(FR_{i,j} \mid E_{i,j}) = \frac{P(E_{i,j} \mid FR_{i,j})\, P(FR_{i,j})}{P(E_{i,j})}.$$
Like before, a contingency table is used. The difference in this situation is that the table is based on pair-wise gene measures instead of measurements for individual genes. This process, when used to calculate pair-wise probabilities of functional relationships for all of the genes in the genome of interest, results in a functional relationship network for the organism of interest.
Huttenhower et al. [33] performed Bayesian integration and prediction using human gold standards and datasets. The resulting tool, HEFalMp, allows users to query the network and also displays what datasets contribute to the relationships predicted from the integrated approach. As an example we can query HEFalMp to find out how the APOE protein relates to all genes across all biological processes as shown in Figure 2. The result is shown in Figure 3. The red links indicate that there is a high probability of a functional relationship between the two genes and green links indicate a low probability. Black links indicate a probability of approximately 0.5.
The probability of a functional relationship between any pair of genes is calculated as described previously. As such, this probability is dependent on evidence from each individual dataset. By clicking on a link, the contributions for each dataset towards that gene pair are provided as shown in Figure 4 for APOE and PLTP. This figure indicates the value of including high quality databases such as BioGRID as input data. While the microarray datasets are informative, in this case the three highest weighted datasets were non-microarray data sources.
These functional relationships can then be used to connect genes to diseases through guilt by association approaches. Guilt by association approaches work by finding genes or diseases that are highly connected to query genes. How exactly this is done depends on the underlying network, the size and type of the query sets, and whether or not the task must be done in real time. An example approach would be to consider as positives only relationships with a probability from the inference stage of greater than 0.9. A Fisher's exact
5. Evaluating Functional Relationship Networks
After performing a Bayesian integration it is appropriate to assess the quality of the inference approach. One straightforward way to evaluate the network would be to measure the concordance of the gold standard and predictions from the network. This is easily done by ordering gene pairs by their probabilities in the network from highest to lowest. For each gene pair in the gold standard, the true positive rate (TPR) to that point can be calculated as
$$\mathrm{TPR} = \frac{\text{Positive Pairs Thus Far}}{\text{Total Positives in Standard}}.$$
The false positive rate (FPR) can be calculated with the same values for negative pairs. These values can then be plotted with FPR on the horizontal axis and TPR on the vertical axis. This provides one type of receiver-operator characteristic (ROC) curve which can be used to assess the quality of predictions from the network. The area under this curve (AUC) summarizes to a single number the quality of predictions.
Unfortunately this approach to evaluation uses the same evaluation standard as the gold standard used for learning and therefore it tests the ability of the inference approach to match the gold standard, and not its ability to make new predictions. One way to avoid this circularity is to hold a group of genes out of the gold standard during the integration process. Connections between these held out genes can then be used after the networks are generated to assess the quality of predictions from the network (in this case the concordance between the predictions and the known relationship status of the held out genes is used). While the holdout approach is effective for large gold standards, when gold standards are small this can result in too few known relationships for assessment of the network. This assessment problem can be alleviated at the cost of computation time by using a cross-validation approach. With cross-validation, the gene sets are divided up into groups. Like the hold-out approach, all but one group is used to train the network
Figure 5. The diseases that are significantly connected to APOE through the guilt by association strategy used in HEFalMp. Alzheimer disease and Macular degeneration are both annotated to the disease in OMIM as noted by the gold bars to the left of the disease (http://hefalmp.princeton.edu/gene/diseases?context=0&name=APOE). The other diseases are implicated by APOE's functional relationships to genes annotated to that disease in OMIM.
doi:10.1371/journal.pcbi.1002816.g005
while the evaluation is performed on the left out group. In contrast to the hold-out approach, the process of training and evaluation is performed iteratively with each group of genes being evaluated, but like the hold-out approach, only the predictions generated on held out genes are used for evaluation.
When standards are incomplete, existing literature can also be used for evaluation. This can be incorporated in a number of ways. One way is to use a blind literature evaluation. Pairs predicted with high probability or genes highly connected to members of the standard can be selected for follow-up. These are combined with randomly selected genes to create a gene list for evaluation. Literature evidence for genes on this list can be assessed, and a comparison can be performed for genes selected from the network and genes selected randomly. If the proportion of literature based positives of genes or pairs selected from the network is substantially higher than those selected randomly, this provides evidence that the network recapitulates true biology.
Fundamentally the goal of this data driven functional genomics strategy is to create a network of predictions useful for designing biological experiments [36]. If these predictions lead to a higher success rate in molecular biology experiments, an integrative analysis can dramatically lower the cost per discovery. Hibbs et al. [37] used a data driven approach to direct experimental biology and found that computational predictions could be experimentally validated at a substantially higher rate than randomly selected genes. Furthermore, those genes that were found by computational methods were more likely to exhibit a subtle phenotype than the genes already known to be involved. This study provides evidence that computational predictions combined with experimental science can lower the cost of experimental discoveries while finding subtle phenotypes that high throughput experimental designs may miss.
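The ROC evaluation described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the scored pairs are invented, every pair carries a known positive or negative gold-standard label, and tied probabilities are not given any special treatment. For the hold-out variant, the same functions would be applied only to pairs among the held-out genes.

```python
def roc_points(scored_pairs):
    """Return (FPR, TPR) points for gene pairs ranked by predicted probability.

    scored_pairs: list of (probability, is_positive) tuples, where is_positive
    is the gold-standard status of the pair.
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    total_pos = sum(1 for _, is_pos in ranked if is_pos)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("standard needs at least one positive and one negative pair")
    true_pos = 0
    false_pos = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1   # positive pairs thus far
        else:
            false_pos += 1  # negative pairs thus far
        points.append((false_pos / total_neg, true_pos / total_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predictions: (network probability, gold-standard status).
pairs = [(0.95, True), (0.90, True), (0.80, False),
         (0.70, True), (0.40, False), (0.20, False)]

curve = roc_points(pairs)
print(f"AUC = {area_under_curve(curve):.3f}")
```

An AUC near 0.5 would indicate a ranking no better than random ordering of pairs, while values approaching 1 indicate that positive pairs are consistently ranked above negative pairs.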
6. Summary
Data driven functional genomics strategies combine methods from statistics and computer science to integrate diverse experimental data for the purpose of making novel biological predictions. By bringing diverse data together, these methods are capable of discovering patterns of biological relevance not well characterized in individual studies [38]. Furthermore, because these methods rely on existing data, they can be used to efficiently direct definitive low throughput experimental studies in a cost effective manner [37,39].
Integrative data driven approaches are often compared to publicly available databases of knowledge or experiments or to the statistical analysis of results from individual high throughput experiments, but they are distinct from both of these. Databases generated by literature curation are by their nature not well suited to the discovery of new knowledge and databases of experimental results require researchers to know a priori which datasets are relevant to the biological question of interest. Integrative data driven approaches combine high throughput experiments and databases of diverse types and in so doing can make predictions beyond those discovered using single data sources.
The flexibility of the data driven approach also gives rise to its greatest challenge. This strategy relies upon gold standards that are a representation of high quality current knowledge. When these standards are of high quality and appropriate to the biological question of interest, the resulting answers are likely to be useful. If the standards are of lower quality, the utility of the predictions will be lessened. In many cases the gold standard quality is the critical determinant of success for these algorithms. With careful use, these methods can generate predictions capable of efficiently directing experimental biology [37,40].
Figure 7. The functional relationship network discovered by a data driven integration for the YFG gene in YFO.
doi:10.1371/journal.pcbi.1002816.g007
doi:10.1371/journal.pcbi.1002816.t002
doi:10.1371/journal.pcbi.1002816.t003
7. Exercises
1. All proteins connected to the protein Your Favorite Gene (YFG) in the functional relationship network of Your Favorite Organism (YFO) are shown in Figure 7. Three of them are known to be associated with Your Favorite Disease (YFD). These genes are YFDG1, YFDG2, and YFDG3. YFD has six genes annotated to it among the 100 genes present in YFO. Using a Fisher's exact test to evaluate guilt by association, is YFG significantly associated with YFD (α < 0.05)?
2. Does the gene expression dataset described by the contingency table in Table 2 provide any information about whether or not the genes YFG and MFG are likely to have a functional relationship if they are uncorrelated in this dataset? What if they are negatively correlated?
3. Using the contingency tables from Tables 2 and 3 and the knowledge that 20% of gene-pairs in the organism of interest have a functional relationship, what is the probability that genes YFG and MFG have a functional relationship if they are positively correlated in the experiment that Table 2 is derived from and physically interacting in the database from which Table 3 is derived?
4. What is the major difference between databases and integrative data driven approaches?
Answers to the Exercises can be found in Text S1.
Supporting Information
Text S1 Answers to Exercises
(DOCX)
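Exercise 3 combines a prior with evidence from two contingency tables. The general calculation can be sketched in Python as follows. This is a minimal sketch, not the chapter's implementation: it assumes the datasets are conditionally independent given the relationship status (a naive Bayes assumption), the function names are hypothetical, and all counts are invented for illustration rather than taken from Tables 2 and 3. Only the 20% prior is borrowed from the exercise.

```python
def conditional_probability(row, observation):
    """P(observation | class), estimated from one row of a contingency table."""
    total = sum(row.values())
    if total == 0:
        raise ValueError("contingency table row has no counts")
    return row[observation] / total


def posterior_fr(prior, evidence):
    """Posterior probability of a functional relationship (FR) for a gene pair.

    prior:    P(FR) before seeing any data.
    evidence: list of (table, observation); each table maps the keys
              "related" and "unrelated" to {observation: count}.
    Datasets are ASSUMED conditionally independent given FR status.
    """
    weight_fr = prior
    weight_not_fr = 1.0 - prior
    for table, observation in evidence:
        weight_fr *= conditional_probability(table["related"], observation)
        weight_not_fr *= conditional_probability(table["unrelated"], observation)
    # P(E) is the sum over both relationship states.
    return weight_fr / (weight_fr + weight_not_fr)


# Hypothetical contingency tables (counts of gold-standard pairs).
expression = {
    "related": {"correlated": 60, "uncorrelated": 40},
    "unrelated": {"correlated": 20, "uncorrelated": 80},
}
interaction_db = {
    "related": {"interacting": 30, "not_interacting": 70},
    "unrelated": {"interacting": 5, "not_interacting": 95},
}

one_dataset = posterior_fr(0.2, [(expression, "correlated")])
two_datasets = posterior_fr(0.2, [(expression, "correlated"),
                                  (interaction_db, "interacting")])
print(f"{one_dataset:.3f} {two_datasets:.3f}")
```

With these invented counts, each additional supporting dataset raises the posterior above the prior, which is the behaviour the integration relies on.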
Further Reading
Glossary
• Functional Relationship: The type of interaction that two genes have if they participate in the same biological process.
• Gold Standard: A set of genes or gene-pairs with a known status (positive or negative) in the tissue, process, disease, or phenotype of interest.
• Hypergeometric/Fisher's Exact Test: A test of independence appropriate for categorical count data when the number of items in each cell is small.
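The hypergeometric calculation behind a one-sided Fisher's exact test for guilt by association can be sketched as below. This is a minimal illustration using only the Python standard library. The function name and the example numbers are hypothetical, and the choice of a one-sided (enrichment) test is an assumption made for the sketch.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value, P(X >= k).

    N: genes in the organism
    K: genes annotated to the disease
    n: genes connected to the query gene in the network
    k: connected genes that are annotated to the disease
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 1000 genes, 20 annotated to a disease; the query gene
# has 15 network neighbours, 4 of which carry the annotation.
p = enrichment_p_value(k=4, n=15, K=20, N=1000)
print(f"p = {p:.2e}")
```

A small p-value indicates that the query gene has more disease-annotated neighbours than would be expected if neighbours were drawn at random from the genome.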
Abstract: Big molecules such as proteins and genes still continue to capture the imagination of most biologists, biochemists and bioinformaticians. Small molecules, on the other hand, are the molecules that most biologists, biochemists and bioinformaticians prefer to ignore. However, it is becoming increasingly apparent that small molecules such as amino acids, lipids and sugars play a far more important role in all aspects of disease etiology and disease treatment than we realized. This particular chapter focuses on an emerging field of bioinformatics called chemical bioinformatics, a discipline that has evolved to help address the blended chemical and molecular biological needs of toxicogenomics, pharmacogenomics, metabolomics and systems biology. In the following pages we will cover several topics related to chemical bioinformatics. First, a brief overview of some of the most important or useful chemical bioinformatic resources will be given. Second, a more detailed overview will be given on those particular resources that allow researchers to connect small molecules to diseases. This section will focus on describing a number of recently developed databases or knowledgebases that explicitly relate small molecules either as the treatment, symptom or cause to disease. Finally a short discussion will be provided on newly emerging software tools that exploit these databases as a means to discover new biomarkers or even new treatments for disease.
This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.
Citation: Wishart DS (2012) Chapter 3: Small Molecules and Disease. PLoS Comput Biol 8(12): e1002805. doi:10.1371/journal.pcbi.1002805
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 David S. Wishart. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: Funding to develop the databases described in this article was provided by Genome Canada, Alberta Innovates, and the Canadian Institutes of Health Research. The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: david.wishart@ualberta.ca
1. Introduction
For most of the past 100 years, the fields of toxicology, pharmacology and clinical biochemistry have focused on identifying the chemicals that cause (toxins), cure (drugs) or characterize (biomarkers) most human diseases. Historically, this kind of work has been reliant on the slow, careful and sometimes tedious approaches of classical analytical chemistry and classical biochemistry. Nevertheless, it has led to important discoveries and enormous advances in our understanding of the actions of chemicals on genes, proteins and cells.
With the recent emergence of high throughput "omics" technologies, our ability to detect, identify, and characterize small molecules along with their large molecule targets has been radically changed [1,2]. Now it is possible to perform as many sequencing experiments, mass spectrometry (MS) experiments or compound identifications in a single day as used to be done in a single year. As a result, traditional fields such as toxicology, pharmacology and biochemistry have been transformed into totally new fields called toxicogenomics, pharmacogenomics and metabolomics. This transformation has changed not only the fundamentals of these disciplines, but also the fundamentals of their data. Rather than trying to manage a few samples, a few sequences or a few compounds in a paper notebook or on an Excel spreadsheet, researchers are confronted with the task of handling hundreds of samples, thousands of compounds, thousands of spectra and thousands of genes or protein sequences. This has led to the development of novel computational tools and entirely new bioinformatic disciplines to facilitate the handling of this data. This particular chapter focuses on an emerging field of bioinformatics called chemical bioinformatics, a discipline that has evolved to help address the blended chemical and molecular biological needs of toxicogenomics, pharmacogenomics, metabolomics and systems biology.
Chemical bioinformatics combines the sequence-centric tools of bioinformatics with the chemo-centric tools of cheminformatics. The term cheminformatics, which is an abbreviated form of chemical informatics, was first coined by Frank Brown nearly 15 years ago [3]. Cheminformatics (as it is known in North America) or chemoinformatics (as it is known in Europe and the rest of the world) is actually a close cousin to bioinformatics. Just as bioinformatics is a field of information technology concerned with using computers to analyze molecular biological data, cheminformatics is a field of information technology that uses computers to facilitate the collection, storage, analysis and manipulation of large quantities of chemical data.
However, there are some distinct cultural differences between bioinformatics and cheminformatics. For instance, cheminformatics software is mostly designed for use by chemists, while bioinformatics software is designed for use by molecular biologists. Consequently there is often a terminology gap that makes it difficult for biologists to use cheminformatic software and chemists to use bioinformatics software. Likewise, most cheminformatic software is structure-based or picture-driven while most bioinformatic software is sequence-based or text-driven. As a result, different search and query interfaces have evolved that are quite specific to either cheminformatic or bioinformatic software.
HumanCyc (Encyclopedia of Human Metabolic Pathways) http://humancyc.org/ -MetaCyc adapted to human metabolism
-No disease or drug pathways
KEGG (Kyoto Encyclopedia of Genes and Genomes) http://www.genome.jp/kegg/ -Best known and among the most complete metabolic pathway databases
-Covers many organisms
-A few disease and drug pathways
The Medical Biochemistry Page http://themedicalbiochemistrypage.org/ -Simple metabolic pathway diagrams with extensive explanations
-A few drug and disease pathways
MetaCyc (Encyclopedia of Metabolic Pathways) http://metacyc.org/ -Similar to KEGG in coverage, but different emphasis
-Well referenced
-No disease or drug pathways
Reactome (A Curated Knowledgebase of Pathways) http://www.reactome.org/ -Pathway database with more advanced query features
-Not as complete as KEGG or MetaCyc
Roche Applied Sciences Biochemical Pathways Chart http://www.expasy.org/cgi-bin/search-biochem-index -The old metabolism standard (on line)
-Describes most human metabolism
Small Molecule Pathway Database (SMPDB) http://www.smpdb.ca/ -Pathway database with disease, drug and metabolic pathways for humans
-Extensive search, analysis and visualization tools
Wikipathways http://www.wikipathways.org -Community annotated pathway database for 19 model organisms
-Contains 175 human pathways
-Few drug or disease pathways
doi:10.1371/journal.pcbi.1002805.t001
logical, mechanistic, medical and biochemical information on about 3100 commonly encountered (i.e. household or environmental) toxins and poisons.
Each of these databases addresses the needs of certain communities such as animal physiologists (ATDB), toxicogenomics or toxicology specialists (CTD and T3DB), environmental or industrial regulators (ACToR) or medicinal chemists interested in toxicity prediction (SuperToxic). However, with the exception of T3DB, most of these online toxin or toxic compound databases are relatively lightly annotated, with fewer than a dozen data fields per compound and essentially no physiological, disease or disease symptom information.
Clearly not all of the chemical-bioinformatic databases we have described in this section are suitable for deriving information about small molecules and disease. Likewise, many of the databases mentioned above are not exactly suitable for translational bioinformatic questions or for applications relating to medicine, medical biochemistry or clinical research. However, there is at least one database in each of the four major chemical-bioinformatic database classes that does generally meet these criteria. In particular: 1) SMPDB is a pathway database that explicitly relates small molecules to disease and disease treatment; 2) HMDB is a metabolomic database that associates metabolites to disease biomarkers or disease diagnosis; 3) DrugBank is a drug database that links drugs and drug targets to symptoms, diseases and disease treatments and 4) T3DB is a toxic substance database that associates toxins and their biological targets with symptoms, conditions, diseases and disease treatments. A more detailed description of each of these databases is provided below.
3. SMPDB: A Pathway Database for Drugs and Disease
As noted earlier, SMPDB is a pathway database specifically designed to facilitate clinical "omics" studies, with a specific emphasis on clinical biochemistry and clinical pharmacology. Currently SMPDB consists of more than 450 highly detailed, hand-drawn pathways describing small molecule metabolism or small molecule processes that are specific to humans. These pathways can be placed into four different categories: 1) metabolic pathways; 2) small molecule disease pathways; 3) small molecule drug pathways and 4) small molecule signaling pathways. An example of a typical SMPDB pathway (Phenylketonuria) is shown in Figure 1. As seen in this figure, all SMPDB pathways explicitly include the chemical structure of the major chemicals in each pathway. In addition, the cellular locations (membrane, cytoplasm, mitochondrion, nucleus, peroxisome, etc.) of all metabolites and the enzymes involved in their processing are explicitly illustrated. Likewise the quaternary structures (if known) and cofactors associated with each of the pathway proteins are also shown. If some of the metabolic processes occur primarily in one organ or in the intestinal microflora, this information is also illustrated. The inclusion of explicit chemical, cellular and physiological information is one of the more unique and useful features of SMPDB. SMPDB is also unique in its inclusion of significant numbers of metabolic disease pathways (>100) and drug pathways (>200) not found in any other pathway database. Likewise, unlike other pathway databases, SMPDB supports a number of unique database querying and viewing features. These include simplified database browsing, the generation of protein/metabolite lists for each pathway, text querying, chemical structure querying and sequence querying, as well as large-scale pathway mapping via protein, gene or chemical compound lists.
The SMPDB interface is largely modeled after the interface used for DrugBank [32], T3DB [36] and the HMDB [26], with a navigation panel for Browsing, Searching and Downloading the database.
doi:10.1371/journal.pcbi.1002805.t002
Below the navigation panel is a simple text query box that supports general text queries of the entire textual content of the database. Mousing over the Browse button allows users to choose between two browsing options, SMP-BROWSE and SMP-TOC. SMP-TOC is a scrollable hyperlinked table of contents that lists all pathways by name and category. SMP-BROWSE is a more comprehensive browsing tool that provides a tabular synopsis of SMPDB's content with thumbnail images of the pathway diagrams, textual descriptions of the pathways, as well as lists of the corresponding chemical components and enzyme/protein components. This browse view allows users to scroll through the database, select different pathway categories or re-sort its contents. Clicking on a given thumbnail image or the SMPDB pathway button brings up a full-screen image for the corresponding pathway. Once opened the pathway image may be expanded by clicking on the Zoom button located at the top and bottom of the image. An image legend link is also available beside the Zoom button.
At the top of each pathway image is a pathway synopsis contained in a yellow
doi:10.1371/journal.pcbi.1002805.t003
ACToR (Aggregated Computational Toxicology Resource) http://actor.epa.gov/actor/faces/ACToRHome.jsp -Contains aggregated data on 2,500,000 environmental chemicals
-Searchable by chemical name and structure
-Data includes chemical structure, physico-chemical values, in vitro assay data and in vivo toxicology data.
ATDB (Animal Toxin Database) http://protchem.hunnu.edu.cn/toxin/index.jsp -Database with >3800 peptide toxins
-Provides sequence data on peptide/protein toxins from venomous insects and animals
CTD (Comparative Toxicogenomic Database) http://ctd.mdibl.org/ -Data on >5000 chemicals with literature-derived information on chemical-gene interactions
SuperToxic http://bioinformatics.charite.de/supertoxic/ -Contains data on 60,000 toxic compounds and some target data
-Provides chemical and toxicity information
-Can predict the toxicity of query compounds
T3DB (Toxin, Toxin-Target Database) http://www.t3db.org/ -Searchable database of 3100 common toxins and 1400 target proteins
-Provides extensive structural, physiological, mechanistic, medical and biochemical information
doi:10.1371/journal.pcbi.1002805.t004
could be a drug target (via the SeqSearch query); 7) ascertain whether a newly identified human protein, such as an isoform or paralogue, may be a drug target or a disease indicator (through the SeqSearch query); or 8) use the pathway visualization and mapping tools to explain or teach others about metabolic diseases, basic metabolism or drug action.
compound that matches the T3DB's database of known toxic compounds. Users can also select the type of search (exact or Tanimoto score) to be performed. High scoring hits are presented in a tabular format with hyperlinks to the corresponding ToxCards (which, in turn, link to the targets). The ChemQuery tool allows users to quickly determine whether their compound of interest is a known toxin or chemically related to a known toxin and which target(s) it may act upon. In addition to these structure similarity searches, the ChemQuery utility also supports compound searches on the basis of SMILES strings (under the SMILES tab) and molecular weight ranges (under the Molecular Weight tab).
The T3DB's data extraction utility (Data Extractor) employs a simple relational database system that allows users to select one or more data fields and to search for ranges, occurrences or partial occurrences of words, strings, or numbers. The data extractor uses clickable web forms so that users may intuitively construct SQL-like queries. Using a few mouse clicks, it is relatively simple to construct complex queries ("find all toxins that target acetylcholinesterase and are pesticides") or to build a series of highly customized tables. The output from these queries is provided in HTML format with hyperlinks to all associated ToxCards.
To summarize, T3DB allows users to link toxic substances to a variety of disease conditions, including acute toxicity, long-term toxicity, birth defects, cancer and other illnesses. It also provides links between toxic substances and their targets both through descriptions of the mechanism of action and through the identification of
Abstract: Proteins do not function create macromolecular structures of various Likewise, interaction maps obtained from
in isolation; it is their interactions complexities and heterogeneities. Proteins one species can be used, with some
with one another and also with interact in pairs to form dimers (e.g. reverse limitations, to predict interaction networks
other molecules (e.g. DNA, RNA) transcriptase), multi-protein complexes (e.g. in other species. Protein interaction net-
that mediate metabolic and signal- the proteasome for molecular degradation), works can also suggest functions for
ing pathways, cellular processes, or long chains (e.g. actin filaments in muscle previously uncharacterized proteins by
and organismal systems. Due to fibers). The subunits creating the various uncovering their role in pathways or
their central role in biological complexes can be identical or heteroge- protein complexes [4]. Due to their central
function, protein interactions also neous (e.g. homodimers vs. heterodimers) role in biological function, protein inter-
control the mechanisms leading to and the duration of the interaction can be actions also control the mechanisms lead-
healthy and diseased states in transient (e.g. proteins involved in signal ing to healthy and diseased states in
organisms. Diseases are often transduction) or permanent (e.g. some organisms. Diseases are often caused by
caused by mutations affecting the ribosomal proteins). However, protein in- mutations affecting the binding interface
binding interface or leading to teractions do not always have to be physical or leading to biochemically dysfunctional
biochemically dysfunctional alloste- [1]. The term protein interaction is also allosteric changes in proteins. Therefore,
ric changes in proteins. Therefore, used to describe metabolic or genetic protein interaction networks can elucidate
protein interaction networks can correlations, and even co-localizations. the molecular basis of disease, which in
elucidate the molecular basis of
Metabolic interactions describe proteins turn can inform methods for prevention,
disease, which in turn can inform
involved in the same pathway (e.g. the diagnosis, and treatment [5,6].
methods for prevention, diagnosis,
and treatment. In this chapter, we Krebs cycle proteins), while genetically The study of human disease experi-
will describe the computational identified associations identify co-expressed enced extensive advancements once the
approaches to predict and map or co-regulated proteins (e.g. enzymes biomedical characterization of proteins
networks of protein interactions regulating the glycolytic pathway). As the shifted to studies taking into account a
and briefly review the experimental name implies, protein interactions by co- protein's network at different functional
methods to detect protein interac- localization list proteins found in the same levels (i.e. in pair-wise interactions, in
tions. We will describe the applica- cellular compartment. complexes, in pathways, and in whole
tion of protein interaction networks Whether the association is physical or genomes). For instance, consider how our
as a translational approach to the functional, protein-protein interaction understanding of Huntington's disease
study of human disease and eval- (PPI) data can be used in a larger scale (HD) has evolved from the early Mende-
uate the challenges faced by these to map networks of interactions [2,3]. In lian single-gene studies to the latest HD-
approaches. PPI network graphs, the nodes represent specific network-based analyses. HD is an
the proteins and the lines connecting them autosomal dominant neurodegenerative
represent the interactions between them disease with features recognized by Hun-
This article is part of the Transla- (Figure 1). Protein interaction networks tington in 1872 [7], and whose specific
tional Bioinformatics collection for are useful resources in the abstraction of patterns of inheritance were documented
PLOS Computational Biology. basic science knowledge and in the in 1908 [8]. After almost a century of
development of biomedical applications. genetics studies, the culprit gene in HD
By studying protein interaction networks was identified; in 1993, we learned that
1. Introduction
we can learn about the evolution of HD was caused by the repeat expansion of
Early biological experiments revealed individual proteins and about the different a CAG trinucleotide in the Huntingtin (Htt)
proteins as the main agents of biological systems in which they are involved. gene [9]. This expansion causes aggrega-
function. As such, proteins ultimately
determine the phenotype of all organ- Citation: Gonzalez MW, Kann MG (2012) Chapter 4: Protein Interactions and Disease. PLoS Comput Biol 8(12):
isms. Since the advent of molecular e1002819. doi:10.1371/journal.pcbi.1002819
biology we have learned that proteins Editor: Fran Lewitter, Whitehead Institute, United States of America
do not function in isolation; instead, it is Published December 27, 2012
their interactions with one another and
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted,
also with other molecules (e.g. DNA, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under
RNA) that mediate metabolic and signal- the Creative Commons CC0 public domain dedication.
ing pathways, cellular processes, and Funding: This research was funded by the National Institutes of Health (NIH) 1K22CA143148 to MGK (PI); ACS-
organismal systems. IRG grant to MGK (PI), and R01LM009722 to MGK (co-investigator). MWGs research was supported in part by
The concept of protein interaction is the Intramural Research Program of the NIH, National Library of Medicine. The funders had no role in the
preparation of the manuscript.
generally used to describe the physical
contact between proteins and their interact- Competing Interests: The authors have declared that no competing interests exist.
ing partners. Proteins associate physically to * E-mail: mkann@umbc.edu
incorporate a variety of biological considerations; they take advantage of the fact that interacting proteins coevolve to preserve their function (e.g. mirrortree, phylogenetic profiling [28-35]), occur in the same organisms (e.g. [36,37]), conserve gene order (e.g. gene neighbors method [38,39]) or are fused in some organisms (e.g. the Rosetta Stone method [40,41]).

3.2 Theoretical Predictions of PPIs Based on Coevolution

Below, we will expand on two methods generating theoretical PPI predictions through coevolutionary signal detection either at the residue or at the full-sequence level.

3.2.1 Coevolution at the residue level. Pairs of residues within the same protein can coevolve because of three-dimensional proximity or shared function [42]. The intramolecular correlations of interacting protein partners can be used to predict intermolecular coevolution. Residue-based coevolution methods measure the set of correlated pair mutations in each protein. A pair of proteins is assumed to interact if they show enrichment of the same correlated mutations [42].

3.2.2 Coevolution at the full-sequence level. Methods detecting coevolution at the full-sequence level are based on the idea that changes in one protein are compensated by correlated changes in its interacting partner to preserve the interaction [29,30,42-45]. Therefore, as interacting proteins coevolve, they tend to have phylogenetic trees with topologies that are more similar than expected by chance [46]. The coevolution of interacting proteins was first qualitatively observed for polypeptide growth factors, neurotransmitters, and immune system proteins with their respective receptors [47]. Several methodologies have been developed to measure coevolution at the full-sequence level, and among them, the mirrortree method is one of the most intuitive and accurate options. As shown in Figure 3, mirrortree measures coevolution for a given pair of proteins by i) identifying the orthologs of both proteins in common species, ii) creating a multiple sequence alignment (MSA) of each protein and its orthologs, iii) from the MSAs, building distance matrices, and iv) calculating the correlation coefficient between the distance matrices. The mirrortree correlation coefficient is used for measuring tree similarity, thereby allowing the evaluation of whether the proteins in question coevolved [28-35].

The mirrortree method has been successfully implemented to confirm experimental interactions in E. coli [4], S. cerevisiae [48], and H. sapiens [49]. But, the degree of similarity between the phylogenetic trees is strongly affected by the sequence divergence driven by the underlying speciation process [4,50]. Therefore, two proteins may have similar phylogenetic trees due only to common speciation events, but they may not necessarily be interacting partners. By subtracting the signal from speciation events, Pazos et al. [4] and Sato et al. [50] showed improvements for the performance of the mirrortree method. One approach creates a speciation vector from the distance matrices derived from
the ribosomal 16S sequences (for prokaryotes and 18S for eukaryotes), while the other uses the average distance of all proteins in a pair of organisms. Both methods subtract the speciation vector from the original distance matrix constructed for the given protein pair.

In principle, to characterize protein interactions at a systems level, all protein-protein and domain-domain interactions in a given organism must be catalogued. The mirrortree method is a suitable option to complement experimental detections because it is inexpensive and fast. Moreover, mirrortree only requires the protein sequences as input and thus can be used to analyze proteins for which no other information is available. Since mirrortree predictions are based on different principles than any other computational or experimental techniques, they can also uncover functional relationships eluding other methods. Still, the mirrortree approach is subject to several limitations. One limitation is the minimum number of orthologs it requires. Selecting orthologs in large families with many paralogs is also a considerable challenge for mirrortree [49]. In addition, coevolution does not necessarily take place uniformly across the sequence; different sites may coevolve at different rates based on functional constraints. Thus, coevolution signals vary when measured across the entire sequence vs. at the domain level [51].

4. Protein Networks and Disease

4.1 Studying the Genetic Basis of Disease

The majority of our current knowledge about the etiology of various diseases comes from approaches aiming to uncover their genetic basis. In the near future, the ability to generate individual genome data using next generation sequencing methods promises to change the field of translational bioinformatics even more.

Since the inception of Mendelian genetics in the 1900s, great effort has gone into cataloguing the genes associated with individual diseases. A gene can be isolated based on its position in the chromosome by a process known as positional cloning [52]. A few examples of human disease-related genes identified by positional cloning include the genes associated with cystic fibrosis [53], HD [9], and breast cancer susceptibility [54,55]. Even in simple Mendelian diseases, however, the correlation between the mutations in the patient's genome and the symptoms is often unclear [56]. Several reasons have been suggested for this apparent lack of correlation between genotype and phenotype, including pleiotropy, influence of other genes, and environmental factors.

Pleiotropy occurs when a single gene produces multiple phenotypes. Pleiotropy complicates disease elucidation because a mutation on a pleiotropic gene may have an effect on some, all, or none of its traits. Therefore, mutations in a single gene may cause multiple syndromes or only cause disease in some of the biological processes the gene mediates. Establishing which genotypes are responsible for the perturbed phenotype of interest is not straightforward.

Genes can influence one another in several ways; genes can interact synergistically (as in epistasis), or they can modify one another (e.g. the expression of one gene might affect the expression of another). Cystic fibrosis and Becker muscular dystrophy, previously considered classical examples of Mendelian patterns of inheritance, are now believed to be caused by a mutation of one gene which is modified by other genes [57,58]. Thus, even simple Mendelian diseases can lead to complex genotype-phenotype associations [59].

Environmental factors (e.g. diet, infection by bacteria) are also major determinants of disease phenotype expression, often acting in combination with other genotype-phenotype association confounders (i.e. pleiotropy and gene modifiers). In fact, most common diseases such as cancer, metabolic, psychiatric and cardiovascular disorders (e.g. diabetes, schizophrenia and hypertension) are believed to be caused by several genes (multigenic) and are affected by several environmental factors [60].

4.2 Studying the Molecular Basis of Disease

Much can be learned from documenting the genes associated with a particular
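The final steps of the mirrortree procedure described in Section 3.2.2, together with the speciation correction, can be illustrated with a minimal Python sketch. It assumes that steps i) to iii) have already produced symmetric distance matrices over the same species in the same order. The function names, the toy matrices, and the plain unscaled subtraction of the species distances are illustrative assumptions, not the published implementations of Pazos et al. or Sato et al.

```python
import math

def upper_triangle(dist):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mirrortree_score(dist_a, dist_b, speciation=None):
    """Step iv: correlate the ortholog distance matrices of two proteins.

    If a species distance matrix is given (hypothetically derived from
    16S/18S rRNA), it is subtracted from both inputs first. Any rescaling
    of the species distances is omitted here for brevity.
    """
    a = upper_triangle(dist_a)
    b = upper_triangle(dist_b)
    if speciation is not None:
        s = upper_triangle(speciation)
        a = [x - y for x, y in zip(a, s)]
        b = [x - y for x, y in zip(b, s)]
    return pearson(a, b)

# Toy data for four species: both proteins mostly track the species tree,
# but their residual (protein-specific) distances move in opposite directions.
species = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
prot_a = [[0, 1.1, 1.9, 3.1], [1.1, 0, 3.9, 5.1],
          [1.9, 3.9, 0, 5.9], [3.1, 5.1, 5.9, 0]]
prot_b = [[0, 0.9, 2.1, 2.9], [0.9, 0, 4.1, 4.9],
          [2.1, 4.1, 0, 6.1], [2.9, 4.9, 6.1, 0]]

raw = mirrortree_score(prot_a, prot_b)
corrected = mirrortree_score(prot_a, prot_b, speciation=species)
```

In this toy case the raw score is high only because both proteins follow the shared species tree; once that background is subtracted, the score drops sharply, which is the effect the speciation correction is meant to expose.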
work of proteins involved in the immune-inflammatory response. This study gave new insights into how smoke causes disease: the exogenous toxicants in smoke perturb several protein interactions in the healthy cell state, thereby depressing the immune system, while disrupting the inflammation response. The study also explained why smoking cessation has some immediate health benefits; eliminating smoke exposure reverses the alterations at the transcriptomic level and restores the majority of normal protein interactions.

Protein interaction studies play a major role in the prediction of genotype-phenotype associations while also identifying new disease genes. The identification of disease-associated interacting proteins also identifies potentially interesting disease-associated gene candidates (i.e. the genes coding for the interacting proteins are putative disease-causing genes). One of the best ways to identify novel disease genes is to study the interaction partners of known disease-associated proteins [77]. Gandhi et al. [78] found that mutations in the genes of interacting proteins lead to similar disease phenotypes, presumably because of their functional relationship. Therefore, protein interactions can be used to prioritize gene candidates in studies investigating the genetic basis of disease [79]. Others have used the properties of protein interaction networks to differentiate disease from non-disease proteins. Based on this approach, Xu et al. [80] devised a classifier based on several topological features of the human interactome to predict genes related to disease. The classifier was trained on a set of non-disease and a set of disease genes (from OMIM) and applied to a collection of over 5,000 human genes. As a result, 970 disease genes were identified, a fraction of which were experimentally validated.

New diagnostic tools can result from genotype-phenotype associations established through PPIs. The genes of interacting proteins can be studied to identify the mutation(s) leading to the interaction disruptions seen in healthy individuals or to the creation of new interactions only present in the diseased states. For example, Rossin et al. used genome-wide association studies (GWAS) to identify regions with variations that predispose to immune-mediated diseases [81]. The GWAS studies provided a list of proteins found to interact in a preferential manner. The resulting disease single-nucleotide polymorphisms identified by GWAS studies such as that by Rossin et al. can eventually be incorporated into genotyping diagnostic tools.

Identifying disease subnetworks, and in turn pathways that get activated in diseased states, can provide markers to create new prognostic tools. For instance, using a protein-network-based approach, Chuang et al. [66] identified a set of subnetwork markers that accurately classify metastatic vs. non-metastatic tumors in individual patients. Metastasis is the leading cause of death in patients with breast cancer. However, a patient's risk for metastasis cannot be accurately predicted and it is currently only estimated based on other risk factors. When metastasis is deemed likely, breast cancer patients are prescribed aggressive chemotherapy, even when it might be unnecessary. By integrating protein networks with cancer expression profiles, the authors identified relevant pathways that become activated during tumor progression, which discriminate metastasis better than markers previously suggested by studies using differential gene expression alone.

Disease networks can inform drug design by helping to suggest key nodes as potential drug targets. Drug target identification constitutes a good example of the potential of integrating structural data with high-throughput data [82]. The structural details on binding or allosteric sites can be used to design molecules to affect protein function. On the other hand, reconstruction of the different protein networks (signaling, metabolic, regulatory, etc.) in which the potential target is involved can help predict the overall impact of the disruption. If, for example, the target is a hub (a highly connected protein), its inhibition may affect many activities that are essential for the proper function of the cell and might thus be unsuitable as a drug target. On the other hand, less connected nodes (e.g. nodes affecting a single disease pathway) could constitute vulnerable points of the disease-related network, which are better candidates for drug targets. The work by Yildirim and Goh [83] illustrates the advantages of evaluating drugs within the context of cellular and disease networks. This group created a drug-target network to map the relationships between the protein targets of all drugs and all disease-gene products. The topological analysis of the human drug-target network revealed that (i) most drugs target currently known targets; (ii) only a small fraction of disease genes encodes drug-target proteins; (iii) current drugs do not target diseases equally but only address some regions of the human disease network; and (iv) most drugs are palliative: they treat the symptoms, not the cause of the disease, which largely reflects our lack of knowledge regarding the molecular basis of diseases, such that for many pathologies we can only treat the symptoms but not cure them.

5. Summary: Trends in the Translational Characterization of Human Disease

We are still quite far from understanding the etiology of most diseases. Further advances on relevant experimental technology (e.g. genetic linkage, protein interaction prediction), along with integrative computational tools to organize, visualize, and test hypotheses should provide a step forward in that direction. More than ten years after the completion of the human genome project, it is clear that our approach to human disease elucidation needs to change. The $3-billion human book of life and the $138-million effort to catalog the common gene variants relevant to disease have so far failed to deliver the wealth of biological knowledge of human diseases and the subsequent
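The hub argument in the drug-target discussion of Section 4 rests on node degree, which is straightforward to compute from a list of interactions. The sketch below is only an illustration: the protein names, the edge list, and the degree cutoff used to call a node a hub are all hypothetical, and real analyses would choose the cutoff from the degree distribution of the network at hand.

```python
from collections import defaultdict

def node_degrees(edges):
    """Count the distinct interaction partners of each node (undirected)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u == v:
            neighbors[u]  # register the node but ignore the self-interaction
            continue
        neighbors[u].add(v)
        neighbors[v].add(u)
    return {node: len(partners) for node, partners in neighbors.items()}

def split_hubs(degrees, min_degree):
    """Separate hubs (degree >= min_degree) from less connected nodes."""
    hubs = sorted(n for n, d in degrees.items() if d >= min_degree)
    others = sorted(n for n, d in degrees.items() if d < min_degree)
    return hubs, others

# Hypothetical toy network: P1 interacts with four partners.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
             ("P1", "P5"), ("P5", "P6")]

degrees = node_degrees(toy_edges)
hubs, less_connected = split_hubs(degrees, min_degree=3)  # arbitrary cutoff
```

Under the reasoning in the text, a node such as P1 would be a risky target because inhibiting it touches many interactions, whereas the less connected nodes are the more plausible candidates.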
   ii. How many self interactions does the network have?
   iii. How many pairs are not connected to the largest connected component?
   iv. Define the following topological parameters and explain how they might be used to characterize a protein-protein interaction network: node degree (or average number of neighbors), network heterogeneity, average clustering coefficient distribution, network centrality.

II. Characterize the EBV-Human interactome

Import Dataset S2 into Cytoscape to create a map of the EBV-Human interactome. Format and output the network according to steps A through D in part I.

Answer the following questions:

   i. How many unique proteins were found to interact in each organism?
   ii. How many interactions are mapped?
   iii. How many human proteins are targeted by multiple (i.e. how many individual human proteins interact with >1) EBV proteins?
   iv. How does identifying the multi-targeted human proteins help you understand the pathogenicity of the virus? Hint: Speculate about the role of the multi-targeted human proteins in the virus life cycle.
   v. How might you test the predictions you formulated above?

III. Characterize the topological properties of the human proteins that are targeted by EBV

Use the topological information provided for you in Table 2 to investigate whether the EBV-targeted Human Proteins (ET-HPs) differ from the average human protein.

Answer the following questions:

   i. Based on the degree property, what can you deduce about the connectedness of ET-HPs? What does this tell you about the kind of proteins (i.e. what type of network component) EBV targets?
   ii. What do the number and size of the largest components tell you about the inter-connectedness of the ET-HP subnetwork?
   iii. Why is distance relevant to network centrality? What is unusual about the distance of ET-HPs to other proteins and what can you deduce about the importance of these proteins in the Human-Human interactome?
   iv. Based on your conclusions from questions i-iii, explain why EBV targets the ET-HP set over the other human proteins and speculate on the advantages to virus survival the protein set might confer.

IV. Integrating knowledge from three different interactomes

Answer the following questions:

   i. The Rta protein is a transactivator that is central to viral replication in EBV. When Rta is co-expressed with the LF2 protein, replication attenuates and the virus establishes latency. Solely based on the EBV-EBV network, formulate a hypothesis to explain how LF2 may be driving EBV to latency, suggesting at least one molecular mechanism by which LF2 may inactivate Rta.
   ii. Why is establishing latency (opposed to promoting rapid replication of viral particles) an effective mechanism of virus infection?
   iii. Assign putative functions to EBV's SM and EBNA3A proteins based on the function of the human proteins with which they interact. Hint: Locate these proteins in the EBV-Human network. What clinical observation (see the introductory paragraph to section 6. Exercises) might these proteins' subnetworks explain?

Answers to the Exercises can be found in Text S1.

Supporting Information

Text S1 Answers to Exercises. (DOCX)
Dataset S1 EBV Interactome Data. (SIF)
Dataset S2 EBV-Human Interactome Data. (SIF)
Figure S1 EBV Interactome Map. (PDF)
Figure S2 EBV-Human Interactome Map. (PDF)
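The datasets listed above use the SIF (simple interaction format) that Cytoscape reads, in which each line names a source node, an interaction type, and one or more target nodes. For readers who want to cross-check counts obtained in Cytoscape, the following Python sketch parses SIF lines and reports self interactions and connected components. It is a rough aid under stated assumptions: the node names and the `pp` interaction label are made-up toy values, not the contents of Dataset S1 or S2, and duplicate or reversed edges are not merged.

```python
def parse_sif(lines):
    """Return (nodes, edges) from SIF lines: source, type, target(s)."""
    nodes, edges = set(), []
    for line in lines:
        # SIF fields are tab-delimited if any tab is present, else whitespace.
        raw = line.split("\t") if "\t" in line else line.split()
        fields = [f.strip() for f in raw if f.strip()]
        if not fields:
            continue
        source = fields[0]
        nodes.add(source)  # a single-field line is an isolated node
        for target in fields[2:]:
            nodes.add(target)
            edges.append((source, fields[1], target))
    return nodes, edges

def connected_components(nodes, edges):
    """Return the connected components as sets, largest first."""
    adjacency = {n: set() for n in nodes}
    for source, _, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy input standing in for a SIF file (hypothetical node names).
toy_sif = ["A pp B", "B pp C", "C pp C", "D pp E", "F"]

nodes, edges = parse_sif(toy_sif)
self_interactions = [e for e in edges if e[0] == e[2]]
components = connected_components(nodes, edges)
outside_largest = nodes - components[0]
```

To apply the same functions to a downloaded dataset, the list of toy lines would be replaced by the lines of the file, for example by iterating over an open file handle.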
• Chen JY, Youn E, Mooney SD (2009) Connecting protein interaction data, mutations, and disease using bioinformatics. Methods Mol Biol 541: 449-461.
• Nussinov R, Schreiber G (2009) Computational protein-protein interactions. Boca Raton: CRC Press.
• Ideker T, Sharan R (2008) Protein networks in disease. Genome Res 18: 644-652.
• Juan D, Pazos F, Valencia A (2008) Co-evolution and co-adaptation in protein networks. FEBS Lett 582: 1225-1230.
• Panchenko A, Przytycka T (2008) Protein-protein interactions and networks: identification, computer analysis, and prediction. London: Springer.
• Klussmann E, Scott J, Aandahl EM (2008) Protein-protein interactions as new drug targets. Berlin: Springer.
• Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333-346.
• Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3: e43. doi:10.1371/journal.pcbi.0030043
Glossary
Mendelian traits or diseases, named after Gregor Mendel, are the traits inherited and controlled by a single gene.
Positional cloning is a method to find the gene producing a specific phenotype in an area of interest in the genome. The first
step of positional cloning is linkage analysis, in which the gene is mapped using a group of DNA polymorphisms from families
segregating the disease phenotype.
Epistasis refers to the phenomenon in which one gene masks the phenotypic effect of another.
Angiogenesis is the physiological process leading to growth of new blood vessels. Angiogenesis is a normal and vital process
in growth, development, and wound healing; but it is also a fundamental step in the transition of tumors from a dormant to a
malignant state.
Hemangioblastomas are tumors of the central nervous system that originate from the vascular system.
References
1. De Las Rivas J, de Luis A (2004) Interactome data and databases: different types of protein interaction. Comp Funct Genomics 5: 173-178.
2. Barabasi AL, Oltvai ZN (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5: 101-113.
3. Grindrod P, Kibble M (2004) Review of uses of network and graph theory concepts within proteomics. Expert Rev Proteomics 1: 229-238.
4. Pazos F, Ranea JA, Juan D, Sternberg MJ (2005) Assessing protein co-evolution in the context of the tree of life assists in the prediction of the interactome. J Mol Biol 352: 1002-1015.
5. Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333-346.
6. Ideker T, Sharan R (2008) Protein networks in disease. Genome Res 18: 644-652.
7. Huntington G (1872) On chorea. Med Surg Rep 26: 320-321.
8. Punnett RC (1908) Mendelism in Relation to Disease. Proc R Soc Med 1: 135-168.
9. (1993) A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington's disease chromosomes. The Huntington's Disease Collaborative Research Group. Cell 72: 971-983.
10. Goehler H, Lalowski M, Stelzl U, Waelter S, Stroedicke M, et al. (2004) A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol Cell 15: 853-865.
11. Duennwald ML, Jagadish S, Giorgini F, Muchowski PJ, Lindquist S (2006) A network of protein interactions determines polyglutamine toxicity. Proc Natl Acad Sci U S A 103: 11051-11056.
12. Giorgini F, Muchowski PJ (2005) Connecting the dots in Huntington's disease with protein interaction networks. Genome Biol 6: 210.
13. Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comput Biol 3: e42. doi:10.1371/journal.pcbi.0030043
14. Costanzo M, Baryshnikova A, Bellay J, Kim Y, Spear ED, et al. (2010) The genetic landscape of a cell. Science 327: 425-431.
15. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98: 4569-4574.
16. Mrowka R, Patzak A, Herzel H (2001) Is there a bias in proteome research? Genome Res 11: 1971-1973.
17. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.
18. Shoemaker BA, Panchenko AR, Bryant SH (2006) Finding biologically relevant protein domain interactions: conserved binding mode analysis. Protein Sci 15: 352-361.
19. Sprinzak E, Margalit H (2001) Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol 311: 681-692.
20. Deng M, Mehta S, Sun F, Chen T (2002) Inferring domain-domain interactions from protein-protein interactions. Genome Res 12: 1540-1548.
21. Nye TM, Berzuini C, Gilks WR, Babu MM, Teichmann SA (2005) Statistical analysis of domains in interacting protein pairs. Bioinformatics 21: 993-1001.
22. Fraser HB, Hirsh AE, Wall DP, Eisen MB (2004) Coevolution of gene expression among interacting proteins. Proc Natl Acad Sci U S A 101: 9033-9038.
23. Kanaan SP, Huang C, Wuchty S, Chen DZ, Izaguirre JA (2009) Inferring protein-protein interactions from multiple protein domain combinations. Methods Mol Biol 541: 43-59.
24. Guimaraes KS, Przytycka TM (2008) Interrogating domain-domain interactions with parsimony based approaches. BMC Bioinformatics 9: 171.
25. Guimaraes KS, Jothi R, Zotenko E, Przytycka TM (2006) Predicting domain-domain interactions using a parsimony approach. Genome Biol 7: R104.
26. Riley R, Lee C, Sabatti C, Eisenberg D (2005) Inferring protein domain interactions from databases of interacting proteins. Genome Biol 6: R89.
27. Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, et al. (2006) TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res 34: W315-319.
28. Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, et al. (2003) Inferring protein
Abstract: Complex diseases are caused by a combination of genetic and environmental factors. Uncovering the molecular pathways through which genetic factors affect a phenotype is always difficult, but in the case of complex diseases this is further complicated since genetic factors in affected individuals might be different. In recent years, systems biology approaches and, more specifically, network based approaches emerged as powerful tools for studying complex diseases. These approaches are often built on the knowledge of physical or functional interactions between molecules which are usually represented as an interaction network. An interaction network not only reports the binary relationships between individual nodes but also encodes hidden higher level organization of cellular communication. Computational biologists were challenged with the task of uncovering this organization and utilizing it for the understanding of disease complexity, which prompted rich and diverse algorithmic approaches to be proposed. We start this chapter with a description of the general characteristics of complex diseases followed by a brief introduction to physical and functional networks. Next we will show how these networks are used to leverage genotype, gene expression, and other types of data to identify dysregulated pathways, infer the relationships between genotype and phenotype, and explain disease heterogeneity. We group the methods by common underlying principles and first provide a high level description of the principles followed by more specific examples. We hope that this chapter will give readers an appreciation for the wealth of algorithmic techniques that have been developed for the purpose of studying complex diseases as well as insight into their strengths and limitations.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Cho D-Y, Kim Y-A, Przytycka TM (2012) Chapter 5: Network Biology Approach to Complex Diseases. PLoS Comput Biol 8(12): e1002820. doi:10.1371/journal.pcbi.1002820
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Funding: This work is supported by the intramural program of National Library of Medicine, NIH. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: przytyck@ncbi.nlm.nih.gov
These authors contributed equally to this work.

1. Introduction

Complex diseases are caused, among other factors, by a combination of genetic perturbations. Thus in the case of a complex disease we do not assume that a single genetic mutation can be pinned down as a cause. Many diseases fall in this category including cancer, autism, diabetes, obesity, and coronary artery disease. Even though there are other factors involved in such diseases, this review will focus on genetic causes.

One of the fundamental difficulties in studying genetic causes of complex diseases is that different disease cases might be caused by different genetic perturbations. In addition, if a disease is caused by a combinatorial effect of many mutations, the individual effects of each mutation might be small and thus hard to discover. For example, autism is considered to be one of the most heritable complex disorders, but its underlying genetic causes are still largely unknown [1]. One of the proposed factors that contribute to this difficulty is the role of rare genetic variations in the emergence of the disease [2].

An additional difficulty in studying complex diseases relates to disease heterogeneity. Specifically, in a complex disease, disease phenotypes might vary significantly among patients. The recognition of this fact has led, for example, to renaming autism to autism spectrum disorders (ASDs), referring in this way to a group of conditions characterized by impairments in reciprocal social interaction and communication, and the presence of restricted and repetitive behaviors [1]. Similar heterogeneity is present in other complex diseases including cancer.

Given the above challenges, how can we approach the study of complex diseases? A useful clue is provided by the fact that genes, gene products, and small molecules interact with each other to form a complex interaction network. Thus a perturbation in one gene can be propagated through the interactions, and affect other genes in the network. However, the fact that we observe similar disease phenotypes despite different genetic causes suggests that these different causes are not unrelated but rather dysregulate the same component of the cellular system [3]. Therefore in studies of complex diseases researchers increasingly focus on groups of related/interconnected genes, referred to as modules or subnetworks.

2. Interactome

Biomolecules in a living organism rarely act individually. Instead, they work together in a cooperative way to provide specific functions. A variety of intermolecular interactions including protein-protein interactions, protein-DNA interactions, and RNA interactions are essential to these cooperative activities. These interactions can be conveniently represented as networks (graphs) with nodes (vertices) which denote molecules, and links (edges) which denote interactions between them. Depending on the type of interaction, the corresponding edge might be directed or
mic regions that are altered in the disease are first identified, and the genes residing in the altered regions are mapped to an interaction network. Both physical and functional interaction networks can be used, and edges might be weighted based, for example, on the likelihood of having the same phenotypes or influences between genes [54–56]. Next, modules are typically defined as subsets of genetically altered genes that are highly interconnected or within close proximity to each other in the interaction network together with non-altered genes necessary to mediate these connections. Edge weights, if given, can be used to prioritize the modules. In many cases, finding the best subnetwork is computationally expensive and search algorithms such as greedy growth heuristics or more sophisticated approximation algorithms have been proposed. Finally, rigorous statistical tests have been applied to evaluate the significance of selected modules.

Examples. The idea of finding genetically altered network modules has been utilized in various disease studies. Analyzing ovarian cancer TCGA data (The Cancer Genome Atlas), HOTNET identified subnetworks in a protein interaction network in which genes are mutated in a significant number of patients [54]. The identified networks include the NOTCH signaling pathway, which is indeed known to be significantly mutated in cancer samples [57]. The method is based on the set cover approach (see Set cover based approach section below), which is found to be effective in capturing different genetic variations across patients. In the NETBAG (NETwork-Based Analysis of Genetic associations) method, developed by Gilman et al. and applied to identify a biological subnetwork affected by rare de novo copy number variations (CNVs) in autism [58,59], the authors first constructed a gene network where edges were assigned the likelihood odds ratio for contributing to the same genetic phenotype. Subsequently a greedy growth algorithm was used to find clusters in this network. In another approach, Rossin et al. [60] considered the genomic regions found to be associated with Rheumatoid Arthritis (RA) and Crohn's disease (CD) in previous GWAS studies, and connected the genes residing in these regions based on interaction data to obtain network modules. It was also verified that those identified modules exhibited significant differences in expression level in the disease samples.

3.2 Differentially Expressed Network Modules

Another popular and successful approach to find disease associated modules is to search for subnetworks that are significantly enriched with genes whose expression levels are changed in disease samples. Building on the observation that a molecular perturbation typically affects the expression levels of genes in a whole module rather than individual genes, these approaches identify the modules which exhibit different expression patterns in disease states relative to a control. Gene expression data has been widely utilized for identifying dys-regulated modules and drug targets, inferring interactions between genes, and classifying diseases. While these approaches are based on the common idea of finding gene modules enriched with genes that have abnormal expression, several different computational techniques have been used to achieve these tasks, which we discuss briefly below. The methods are also illustrated in Figure 2.

3.2.1 Scoring based methods. Suppose that there is a subset of genes which are differentially expressed in disease samples and they are closely connected to each other in an interaction network. A subnetwork including such genes might be a good candidate for a disease associated network module (Figure 2A). Implementing this idea requires a way to score candidate modules. Various methods have been suggested for measuring the significance of the differential expression of genes in a module and their connectivity (the distances between the genes). In addition, different methods adopt different search algorithms to find high scoring candidate modules. Finally, some approaches additionally require that all genes are either up-regulated or down-regulated in the same direction.

Examples. Chuang et al. defined the activity score for a subnetwork by comparing gene expression profiles from two different types of samples (metastatic or non-metastatic in their study) [61]. More specifically, they first computed how well the expression of a gene discriminates between the two patient groups and then scored candidate subnetworks based on aggregate discriminative power over all genes in the subnetwork. Then they searched for the most discriminative networks in a greedy manner. While the method was used for disease classification
(see Section 4), it can readily be applied to leverage the difference between disease and non-disease cohorts.

3.2.2 Correlation based methods. Comparing expression patterns between genes is a basis for constructing a co-expression network, extracting modules exhibiting similar expression patterns, and further understanding molecular changes in diseases. Considering expression correlation of disease cases in the context of interactions can provide additional power in the identification of a disease associated module (Figure 2B). If the expression changes of two neighboring nodes are correlated with each other, this may suggest that the two interacting genes have related functional roles. With this in mind, some approaches look at connected components which show highly correlated and anti-correlated expression patterns. Other approaches search for loss and gain of correlation in disease states to identify dys-regulated edges.

Examples. Aiming to identify regulatory networks defining phenotypic classes of human cell lines, Muller et al. searched for Jointly Active Connected Subnetworks (connected subnetworks with high average internal expression similarity) in a human interaction network [62] and demonstrated the power of combining network and expression data.

The IDEA (Interactome Dysregulation Enrichment Analysis) method [63] focused on the identification of perturbed network edges in a combined interaction network (PPI, transcriptional, signaling, posttranslational modifications predicted by MINDy [64]), and searched for the edges connecting genes which in a disease state show loss or gain of expression correlation. The utility of the method was demonstrated in the analysis of FL lymphoma and other cancer types. In particular, they identified BCL2 as the gene adjacent to the largest number of dys-regulated edges in FL lymphoma. This analysis also identified the SMAD1 gene, which could not be detected by differential expression analysis only.

To understand the mechanism of aging, Xue et al. applied a network module approach [65,66]. They utilized a PPI network and overlaid expression data obtained from various stages of aging. Two types of edges, correlated and anti-correlated, were selected. The subnetwork that includes only those edges, called the NP (negative and positive) network, was proposed to be related to the aging mechanism. Further modularizing the network with hierarchical clustering of expression patterns, they obtained a few age related modules and found that some genes connecting different modules through PPIs are more likely to affect aging/longevity, which was also experimentally validated.

3.2.3. Set cover based methods. A group of methods employ a combinatorial approach named set cover. In a set cover, a gene is considered to cover a disease
sample if it is dys-regulated in the sample. For example, it can be decided if a gene is covering a sample or not based on the fold change of gene expression level in the sample or using a statistical test such as a z-test. The main principle of the set cover approach is that each disease case has some dys-regulated (thus covering) genes but in heterogeneous diseases, different cases will typically have different covering genes. Set cover approaches provide a strategy to select a representative set of such covering genes (Figure 2C). This is usually done by defining some optimization criterion and attempting to select a set of genes which is optimal with respect to this criterion. For example, given a set of genes and disease samples along with covering relationships, a subset of genes is selected so that each sample is covered by some minimal number of genes while the total number of selected genes is minimized.

Many observed organism-level phenotypes arise in a heterogeneous way. Diseases such as cancer are now seen as a spectrum of related disorders that manifest themselves in a similar fashion. Since different samples may be covered by different genes and those genes may be connected in an interaction network, set cover approaches can be useful to identify gene modules explaining a heterogeneous set of samples [67–69].

Examples. Aiming to detect dys-regulated pathways in complex diseases, Ulitsky et al. extended the set cover technique by integrating expression data and interaction networks [67]. Their method, named DEGAS (de novo discovery of dys-regulated pathways), searches for a smallest set of genes forming a connected subnetwork so that each disease sample is covered by a certain minimal number of genes. They applied this approach to a Parkinson's disease dataset. Chowdhury et al. [68] developed an alternative network cover based algorithm and used the identified modules for disease classification in a human colorectal cancer dataset.

Set cover approaches have also been applied to data types other than gene expression. For example, Kim et al. proposed a module cover approach to identify gene modules which collectively cover disease samples [70]. At the same time they required that each module is coherent, containing genes with similar genotype-phenotype mappings (see Section 4 for more discussion). The HotNet algorithm discussed in Section 3.1 also utilized a variant of a set cover approach to find genetically altered modules. In their case, a gene is defined to cover a sample if the gene is mutated in the sample, and they looked for a fixed size connected set of genes covering as many samples as possible. The Dendrix (De novo Driver Exclusivity) algorithm was also developed to discover mutated gene modules in cancer and, though it does not utilize interaction data, it aims to find sets of genes, domains, or nucleotides whose mutations exhibit both high coverage and high exclusivity in the disease samples [71].

3.3 Uncovering Information Propagation Modules

The approaches discussed thus far have dealt with modules of genes associated with either phenotypic or genotypic information. While both approaches are helpful for predicting dys-regulated modules, a more effective way to understand disease mechanisms might be to combine both genotypic (the putative causes of diseases) and phenotypic data (their effects). Expression quantitative trait loci (eQTL) analysis is a useful method to find the relationship between genotype and phenotype [72,73]. eQTL treats the level of gene expression as a quantitative phenotype, which is assumed to be controlled by genotypic information. Loci that putatively control the expression of a given gene are identified by determining the associations between genotype and gene expression. Given an association between a genotypic variation in a locus and expression level of a gene, the next challenge is to uncover the pathway(s) through which the genetic variation leads to the expression change. Recently, several groundbreaking pathway elucidation methods have emerged, as illustrated in Figure 3 and described below.

3.3.1 Distance based methods. A simple approach to identify a possible pathway from a genetically altered gene (putative cause) to the gene with correlated expression change (target gene) is to test if there is a path in an interaction network connecting the putative causal gene to its target gene. The shortest path connecting a causal gene and its target is often used to explain their causal relationship (Figure 3A). The intermediate nodes on such a shortest path are likely members of an affected pathway/module. Several variations of the shortest path approach have been used in extracting disease associated network modules [74–76]. For example, Carter et al. searched for the shortest directionally consistent paths in molecular interaction networks connecting
patients might be incorrect, how would you modify the optimization problem?

5. A Steiner tree connecting a set of nodes does not need to be unique. In the graph shown in Figure 4, find two different Steiner trees connecting genes C, T1, T2, T3, and T4.

Figure 4. A hypothetical interaction network to be used with Exercises 5 and 6.
doi:10.1371/journal.pcbi.1002820.g004

Supporting Information
Text S1 Answers to Exercises. (PDF)

Acknowledgments
The authors thank Mileidy W. Gonzalez (NIH\NLM\NCBI) and Pawel Przytycki (Princeton University) for their helpful comments on the manuscript.
Further Reading
• Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461(7261): 218–223.
• Barabasi AL, Gulbahce N, Loscalzo J (2011) Network medicine: a network-based approach to human disease. Nat Rev Genet 12(1): 56–68.
• Przytycka TM, Singh M, Slonim DK (2010) Toward the dynamic interactome: it's about time. Brief Bioinform 11(1): 15–29.
• Przytycka TM, Cho DY (2012) Interactome. In: Meyers RA, editor. Encyclopedia of molecular cell biology and molecular medicine. John Wiley and Sons, Inc. doi:10.1002/3527600906.mcb.201100018
• Califano A, Butte AJ, Friend S, Ideker T, Schadt E (2012) Leveraging models of cell regulation and GWAS data in integrative network-based association studies. Nat Genet 44(8): 841–847. doi:10.1038/ng.2355
• Vidal M, Cusick ME, Barabasi AL (2011) Interactome networks and human disease. Cell 144(6): 986–998.
• Kim Y, Przytycka TM (2012) Bridging the gap between genotype and phenotype via network approaches. Frontiers in Genetics special issue on mapping complex disease traits with global gene expression. Front Genet 3: 227. doi:10.3389/fgene.2012.00227
References
1. Veenstra-Vanderweele J, Christian SL, Cook EH, Jr. (2004) Autism as a paradigmatic complex genetic disorder. Annu Rev Genomics Hum Genet 5: 379–405.
2. Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, et al. (2010) Functional impact of global rare copy number variation in autism spectrum disorders. Nature 466: 368–372.
3. Schadt EE (2009) Molecular networks as sensors and drivers of common human diseases. Nature 461: 218–223.
4. Gursoy A, Keskin O, Nussinov R (2008) Topological properties of protein interaction networks from a structural perspective. Biochem Soc Trans 36: 1398–1403.
5. Albert R (2005) Scale-free networks in cell biology. J Cell Sci 118: 4947–4957.
6. Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411: 41–42.
7. Zotenko E, Mestre J, O'Leary DP, Przytycka TM (2008) Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol 4: e1000140. doi:10.1371/journal.pcbi.1000140
8. Jonsson PF, Bates PA (2006) Global topological features of cancer proteins in the human interactome. Bioinformatics 22: 2291–2297.
9. Wachi S, Yoneda K, Wu R (2005) Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics 21: 4205–4208.
10. De Las Rivas J, Fontanillo C (2010) Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol 6: e1000807. doi:10.1371/journal.pcbi.1000807
11. Berggard T, Linse S, James P (2007) Methods for the detection and analysis of protein-protein interactions. Proteomics 7: 2833–2842.
12. Yu H, Braun P, Yildirim MA, Lemmens I, Venkatesan K, et al. (2008) High-quality binary protein interaction map of the yeast interactome network. Science 322: 104–110.
13. Shoemaker BA, Panchenko AR (2007) Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 3: e43. doi:10.1371/journal.pcbi.0030043
14. Levy ED, Landry CR, Michnick SW (2009) How perfect can protein interactomes be? Sci Signal 2: pe11.
15. Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95: 14863–14868.
16. (2010) The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Res 38: D331–335.
17. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
18. Lee I, Date SV, Adai AT, Marcotte EM (2004) A probabilistic functional network of yeast genes. Science 306: 1555–1558.
19. Costello JC, Dalkilic MM, Beason SM, Gehlhausen JR, Patwardhan R, et al. (2009) Gene networks in Drosophila melanogaster: integrating experimental data to predict gene function. Genome Biol 10: R97.
20. Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, et al. (2008) A genomewide functional network for the laboratory mouse. PLoS Comput Biol 4: e1000165. doi:10.1371/journal.pcbi.1000165
21. Ramani AK, Li Z, Hart GT, Carlson MW, Boutz DR, et al. (2008) A map of human protein interactions derived from co-expression of human mRNAs and their orthologs. Mol Syst Biol 4: 180.
22. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7 Suppl 1: S7.
23. Peng J, Wang P, Zhou N, Zhu J (2009) Partial Correlation Estimation by Joint Sparse Regression Models. J Am Stat Assoc 104: 735–746.
Abstract: Differences between individual human genomes, or between human and cancer genomes, range in scale from single nucleotide variants (SNVs) through intermediate and large-scale duplications, deletions, and rearrangements of genomic segments. The latter class, called structural variants (SVs), have received considerable attention in the past several years as they are a previously under appreciated source of variation in human genomes. Much of this recent attention is the result of the availability of higher-resolution technologies for measuring these variants, including both microarray-based techniques, and more recently, high-throughput DNA sequencing. We describe the genomic technologies and computational techniques currently used to measure SVs, focusing on applications in human and cancer genomics.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Raphael BJ (2012) Chapter 6: Structural Variation and Medical Genomics. PLoS Comput Biol 8(12): e1002821. doi:10.1371/journal.pcbi.1002821

Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America

Published December 27, 2012

Copyright: © 2012 Benjamin J. Raphael. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work is supported by National Institutes of Health (R01 HG005690). BJR is supported by an National Science Foundation CAREER Award (CCF-1053753), a Career Award from the Scientific Interface from the Burroughs Wellcome Fund and an Alfred P. Sloan Research Fellowship. The funders had no role in the preparation of the manuscript.

Competing Interests: The author has declared that no competing interests exist.

* E-mail: braphael@brown.edu

1. Introduction

The decade since the assembly of the human genome has witnessed dramatic advances in understanding the genetic differences that distinguish individual humans and that are responsible for specific traits. Genome-wide association studies (GWAS) in humans have identified common germline, or inherited, DNA variants that are associated with various common human diseases, including diabetes, heart disease, etc. At the same time, cancer genome sequencing studies have cataloged numerous somatic mutations that arise during the lifetime of an individual and that drive cancer progression. These successes are ushering in the era of personalized medicine, where treatment for a disease is tailored to the genetic characteristics of the individual.

Despite this progress, significant hurdles remain in achieving a comprehensive understanding of the genetic basis of human traits and disease. The germline variants discovered by GWAS thus far explain only a small fraction of the heritability of many traits, and this "missing heritability gap" [1] is a major bottleneck for future GWAS. The somatic mutations measured in cancer genomes are very heterogeneous, with relatively few mutations that are shared by large numbers of cancer patients, even those with the same (sub)type of cancer. This mutational heterogeneity complicates efforts to distinguish functional mutations that drive cancer development from random passenger mutations [2].

Comprehensive studies of the genetic basis of disease require the measurement of all variants that distinguish individual genomes. Until recently, GWAS focused on the measurement of single nucleotide polymorphisms (SNPs), or single nucleotide differences between individual genomes. In the past few years, it has become clear that germline variants occupy a continuum of scales ranging from SNPs to larger structural variants (SVs): duplications, deletions, inversions, and translocations of large (>100 nucleotides) blocks of DNA sequence. Moreover, until recently GWAS focused attention on common SNPs, those whose frequency in the population was at least 5%. This restriction was part of the "common disease, common variant" hypothesis which posits that an appreciable fraction of susceptibility to common diseases results from germline variants that are common in the population. However, this restriction was also dictated by technological limitations, as it was not cost effective to measure all genetic variants in the large number of individual genomes that are necessary to perform a GWAS.

In the past five years, next-generation DNA sequencing technologies became commercially available from companies such as 454, Illumina, Life Technologies, and Complete Genomics. These and other sequencing technologies continue to advance at a breathtaking pace, and consequently the cost of DNA sequencing has declined by several orders of magnitude in the past decade. These technologies provide an unprecedented opportunity to measure all variants; germline and somatic; SNPs and SVs, in both normal and cancer genomes.

In this chapter, we discuss the application of these sequencing technologies in medical genomics, and specifically on the characterization of structural variation.

2. Germline and Somatic Structural Variation

Structural variants are important contributors to genome variation and consideration of these variants is necessary for disease association and cancer genetics studies. In this section, we briefly review current knowledge about structural variation in human and cancer genomes.

2.1 Germline Structural Variation

Characterizing the DNA sequence differences that distinguish individuals is a major challenge in human genetics. Until a few years ago, the primary focus was to identify single nucleotide polymorphisms (SNPs), and projects such as HapMap [3] provide catalogs of common SNPs in several human populations. Recent
below), this percentage may be an underestimate.

There are other mechanisms for the formation of SVs. The division between homology mediated and non-homologous mechanisms may not be so strict. NHEJ events sometimes have some degree of microhomology (e.g. 2–25 bp of similarity) at their breakpoints. Other mechanisms such as fork stalling and template switching (FoSTeS) have also been proposed. Some of these are reviewed in [28]. Finally, the relative contribution of each of these mechanisms in generating germline SVs versus somatic SVs remains an active area of investigation, with conflicting reports about the importance of repetitive sequences in somatic structural variants found in cancer genomes [21,22,24,25,29].

3. Technologies for Measurement of Structural Variation

Structural variants vary widely in size and complexity, ranging from insertions/deletions of hundreds of nucleotides to large scale chromosomal rearrangements. Large structural variants can be visualized directly on chromosomes, through cytogenetic techniques such as chromosome painting, spectral karyotyping (SKY), or fluorescent in situ hybridization (FISH). In fact, Sturtevant and Dobzhansky studied inversion polymorphisms in Drosophila in the 1920s, well before the modern genomics era. However, SVs that are too small to be directly observed on chromosomes are generally more difficult to detect and to characterize than single nucleotide polymorphisms (SNPs). Much of the recent excitement surrounding structural variation stems from improvements in genomics technologies that allow more complete measurements of SVs of all types. These include microarrays and, more recently, next-generation DNA sequencing technologies. In this section, we briefly describe these technologies.

3.1 Microarrays

The first genome-wide surveys of SVs in the human genome in 2004 utilized microarray-based techniques such as array comparative genomic hybridization (aCGH). In aCGH, differentially fluorescently labeled DNA from an individual, or test, genome and a reference genome are hybridized to an array of genomic probes derived from the reference genome. Measurements of the test:reference fluorescence ratio, called the copy number ratio, at each probe identify locations of the test genome that are present at higher or lower copy number than in the reference genome. Microarrays containing hundreds of thousands of probes are available, and thus one obtains copy number ratios at hundreds of thousands of locations. Since individual copy number ratios are subject to various types of experimental error, computational techniques are needed to analyze aCGH data. For further details about aCGH and aCGH analysis, see [30].

aCGH is equally applicable for measurement of germline SVs in normal genomes and somatic SVs in cancer genomes. In fact, aCGH was originally developed for cancer genomics applications. aCGH is now very affordable, making it possible to detect copy number variants in large numbers of genomes at reasonable cost. However, aCGH has two important limitations. First, because aCGH measures only differences in the number of copies of a genomic region between a test and reference genome, aCGH detects only copy number variants. Thus, aCGH is blind to copy-neutral, or balanced, variants such as inversions or reciprocal translocations. Moreover, aCGH requires that the genomic probes from the reference genome lie in non-repetitive regions, making it difficult to detect SVs with breakpoints in repetitive regions, such as NAHR events or the insertion/deletion of repetitive sequences.

3.2 Next-generation DNA Sequencing Technologies

DNA sequencing technology has advanced dramatically in recent years, and several next-generation DNA sequencing technologies from companies such as Illumina, ABI, and 454 have significantly lowered the cost of sequencing DNA. However, these technologies, and the Sanger sequencing technique they are replacing, are severely limited in the length of a DNA molecule that can be sequenced. Present sequencing technologies produce short sequences of DNA, called reads, that range from 25–1000 nucleotides, or base pairs (bp), with the upper end of this range requiring technologies (e.g. Sanger and 454) that are considerably more expensive. Much of the recent excitement in DNA sequencing has been in short read DNA sequencers (e.g. Illumina Genome Analyzer, Life Technologies SOLiD and Ion Torrent) that yield reads of only 25–150 nucleotides. These reads are much shorter than the one to two hundred million bp of a typical human chromosome. However, the large
number of reads that are produced (hundreds of millions) results in a cost per nucleotide that is several orders of magnitude lower than Sanger sequencing.

Many DNA sequencing technologies employ a paired end, or mate pair, sequencing protocol to increase the effective read length. In this protocol two reads are generated from opposite ends of a longer DNA fragment, or insert. With earlier Sanger sequencing protocols, the sizes of these DNA fragments were dictated by the cloning vector that was used. Fragment, or insert, sizes of 2 kb–150 kb could be obtained by cloning into bacterial plasmids or bacterial artificial chromosomes (BACs). With next-generation technologies, a variety of techniques have been employed to generate paired reads. At present, the most efficient and effective techniques produce paired reads from fragments of only a few hundred bp, although fragments of 2–3 kb are available. Thus, next-generation sequencing technologies have both limited read lengths and limited insert sizes compared to Sanger sequencing.

There are two approaches to detecting SVs from next-generation DNA sequencing data (Figure 2). The first is de novo assembly. In this approach, sophisticated algorithms are used to reconstruct the genome sequence from overlaps between reads. The assembled genome sequence is then compared to the reference genome, or the assembled genomes of other individuals, to identify all types of variants. If the genome sequence is successfully assembled, this approach is the best for characterization of SVs. Unfortunately, assembling a human genome de novo (i.e. with no prior information) of sufficient quality for structural variation studies remains difficult with limited read lengths. Currently, human genome assemblies are highly fragmented, consisting of tens to hundreds of thousands of contigs, intermediate sized sequences of thousands to tens of thousands of nucleotides. Moreover, the association between some structural variants and repetitive sequences implies that assemblies of finished (not draft) quality are necessary for comprehensive coverage of structural variation. Improving de novo assembly is a very active research area (see [31]), but human genome assemblies of high enough quality for SV studies remain out of reach for inexpensive short-read technologies.

The second approach to detect SVs in next-generation DNA sequencing data is a resequencing approach that leverages the extensive finishing efforts undertaken in the Human Genome Project. In a resequencing approach, one finds differences between an individual genome and a closely related reference genome whose sequence is known by aligning reads from the individual genome to the reference genome. Differences (variants) between the genomes correspond to differences between the aligned reads and the reference sequence. In the next section, we describe how to predict SVs using a resequencing approach.

3.3 New DNA Sequencing Technologies

Many of the challenges in reliable measurement of SVs described above are related to limitations in sequencing technologies. In particular, SVs with breakpoints in highly-repetitive sequences are beyond the abilities of current technologies. New third-generation and single-molecule technologies promise additional advantages for structural variation discovery. These advantages include longer read lengths, easier sample preparation, lower input DNA requirements, and higher throughput. For example, Pacific Biosciences recently released their Single-Molecule Real Time (SMRT) sequencing, a technology that measures in real time the incorporation of nucleotides by a single DNA polymerase molecule immobilized in a nanopore [32].

One application of this technology is strobe sequencing. A strobe read, or strobe, consists of multiple subreads from a single contiguous molecule of DNA. These subreads are separated by a number of "dark" nucleotides (called advances), whose identity is unknown (Figure 3). Thus far, Pacific Biosciences has demonstrated strobes of lengths up to 20 kb with 2–4 subreads each of 50–400 bp. Additional improvements are expected as technology matures. Strobes generalize the concept of paired reads by including more than two reads from a single DNA fragment. Strobes provide long-range sequence information with low input DNA requirements, a feature missing from current

promise for a dramatic shift in DNA sequencing where extremely long reads (tens of kb) are generated, making both de novo assembly and variant detection by resequencing straightforward problems.

4. Resequencing Strategies for Structural Variation

A resequencing strategy predicts SVs by alignments of sequence reads to the reference genome. There are two main steps in any resequencing strategy: (1) alignments of reads; (2) prediction of SVs from alignments. Resequencing approaches are straightforward in principle, but in practice sensitive and specific detection of structural variation in human genomes is notoriously difficult [34,35]. While some types of SVs are easy to detect with next-generation sequencing technologies, other complex SVs are refractory to detection. This is due to both technological limitations and biological features of SVs. DNA sequencing technologies produce reads with sequencing errors, have limited read lengths and insert sizes, and have other sampling biases (e.g. in GC-rich regions). Biologically, human SVs are: (i) enriched for repetitive sequences near their breakpoints [23]; (ii) may overlap, have multiple states or complex architectures; and (iii) recurrent (but not identical) variants may

specialized task of aligning millions to billions of individual short reads led to the development of new software programs tailored to this task, such as Maq, BWA, Bowtie/Bowtie2, BFAST, mrsFAST, etc. [38–43]. A key decision in read alignment for SV detection is whether to consider only reads with a single, best alignment to the reference genome, or to also include reads with multiple high-quality alignments. Some read alignment programs will output only a single alignment for each read, in some cases choosing an alignment randomly if there are multiple alignments of equal score. If one uses only reads with a unique alignment, then there is limited power to detect SVs whose breakpoints lie in repetitive regions, such as SVs resulting from NAHR. On the other hand, if one allows reads whose alignment is ambiguous, then the problem of SV prediction requires an algorithm to distinguish among the multiple possible alignments for each read. Many SV prediction algorithms analyze only unique alignments, although several recent algorithms use ambiguous alignments. A few of these are noted below.

4.2 Split Reads

A direct approach to detect structural variants from aligned reads is to identify reads whose alignments to the reference
sequencing technologies. This additional exist at the same locus [36,37]. These genome are in two parts. These so called
information is useful for detection and de properties mean that the alignment of split reads contain the breakpoint of the
novo assembly of complex SV that lie in reads to the reference genome and the structural variant (Figure 4). To reduce
highly repetitive regions, or contain mul- prediction of SVs from these alignments is false positive predictions of structural
tiple breakpoints in a small region. How- not always an easy task. Algorithms are variants, one requires the presence of
ever, the advantages of strobes are reduced required to make highly sensitive and specific multiple split reads sharing the same
by higher single-nucleotide error rates. predictions of SVs. breakpoint. Because the two parts of a
Thus, realizing the advantages of strobes In this section we review the main issues split read align independently to the
requires new algorithms that exploit infor- in predicting SVs using a resequencing reference genome, these alignments must
mation from multiple, spaced subreads to approach. We begin with read alignment. be long enough to be aligned uniquely (or
overcome high single-nucleotide error Then we describe the three major ap- with little ambiguity) to the reference.
rates [33]. proaches that are used to identify struc- Thus, split read analysis is a feasible
Sequencing technologies continues its tural variants from aligned reads: (i) split strategy only when the reads are suffi-
rapid development. Improvements in the reads; (ii) depth of coverage analysis; and ciently long. For example, if one has a
chemistry, imaging, and manufacture of (iii) paired-end mapping. 36 bp read containing the breakpoint of
existing technologies are increasing their an SV at its midpoint, one must align the
read lengths, insert lengths, and through- 4.1 Read Alignment two 18 bp halves of the read to the
put. Additional sequencing technologies Alignment of reads to a reference reference genome. Finding unique align-
are under active development. Nanopore- genome is a special case of sequence ments of an 18 bp sequence is often not
based technologies that directly read the alignment, one of the most researched possible. There are no reports of successful
nucleotides of long molecules of DNA hold problems in bioinformatics. However, the prediction of structural variants from split
reads alone using next-generation DNA sequencing reads less than 50 bp in length. Instead, split read methods have been proposed that use paired reads, and require that one read in the pair has a full-length alignment to the reference. This alignment of the read from one end of the fragment is used to anchor the search for alignments of the other, split read of the fragment [44–46].

4.3 Depth of Coverage

Depth of coverage (also called read depth) analysis detects differences in the number of reads that align to intervals in the reference genome. Assuming that reads are sampled uniformly from the genome sequence, the number of reads that contain a given nucleotide of the reference is, on average, $c = NL/G$, where N is the number of reads, L is the length of each read, and G is the length of the genome. This is the Lander-Waterman model, and the parameter c, called the coverage, is a key parameter in a sequencing experiment. For example, recent cancer sequencing projects with Illumina technology have used 30X coverage, which means that the number of reads and length of reads are chosen such that $c = 30$.

Now, if the individual genome contained a deletion of a segment of the human reference genome, the coverage of this segment would be reduced by half if the deletion was heterozygous or reduced to zero if the deletion was homozygous (Figure 4). Similarly, if an interval of the reference genome was duplicated, or amplified, in the individual genome, the coverage of this interval would increase in proportion to the number of copies. Thus, the observed coverage of an interval of the reference genome, the depth of coverage, gives an indication of the number of copies of this interval in the individual genome. Of course, there are numerous additional factors to consider beyond this simple analysis. For example, since reads are sampled at random from the genome, coverage is not constant, but rather follows a distribution with mean c. A Poisson distribution is typically used as an approximation to this distribution, although other distributions sometimes provide a better fit to the data. In addition, repetitive sequences in the reference genome and biases in sequencing (e.g. different coverage of GC-rich regions) also affect depth of coverage calculations. Nevertheless, there are several computational methods for depth of coverage analysis [47,48]. Many of these are largely similar to those used to analyze microarray copy number data.

4.4 Paired-end Sequencing and Mapping

The most common approach for resequencing SVs is paired-end mapping (PEM) (Figure 5). Paired-end mapping was used to identify somatic SVs in cancer
genomes [49,50] and the same idea has been applied to identify germline structural variants [51,52]. While the early paired-end mapping studies used older clone-based sequencing, paired-end mapping is now possible using various next-generation sequencing technologies.

In PEM, a paired-end sequencing protocol is used to obtain paired reads from opposite ends of a larger DNA fragment, or clone, from an individual genome. These paired reads are then aligned to a reference genome. Most paired reads result in concordant pairs where the distance between aligned reads is equal to the fragment length. In contrast, discordant pairs have alignments with abnormal distance or that lie on different chromosomes. These suggest the presence of an SV or a sequencing error. For example, a discordant pair whose distance between alignments is too long suggests a deletion in the individual genome (Figure 5), while a discordant pair whose alignments are on different chromosomes suggests a translocation. Other types of discordant pairs identify inversions, transpositions, or duplications that distinguish the individual genome from the reference genome. Note that in general the length of any particular sequenced fragment is not known. Rather, during the preparation of genomic DNA for sequencing, the DNA is fragmented and fragments are size-selected to an appropriate target length. It is desirable for this size selection to be as strict as possible, so that only fragments near the target length are sequenced. However, in practice the size selection procedure produces fragments whose lengths vary around the target length. Typically, the distribution of fragment lengths is obtained empirically by examining the distances between all aligned paired reads, as most fragments will correspond to a concordant pair (Figure 5).

To distinguish real SVs from sequencing errors, one looks for clusters of discordant pairs that indicate the same SV. Numerous algorithms have been developed to predict SVs by finding clusters of discordant pairs. Early algorithms used only those paired reads whose alignments to the reference genome were non-ambiguous; i.e. there was only a single best alignment [53–55]. More sophisticated algorithms use paired reads with multiple ambiguous alignments to the reference genome and use a variety of combinatorial and statistical techniques to select among these alignments [56–58]. Finally, some approaches model the fact that the human genome is diploid to avoid making inconsistent structural variant predictions [59].

All of the approaches above rely on predicting structural variants that are supported by multiple paired reads. Some, but not all, of them are careful when determining whether a group of paired reads genuinely support the same variant. We illustrate the issue here using the Geometric Analysis of Structural Variants (GASV) method of [55]. A key feature of GASV is that it records both the information that the paired reads reveal about the boundaries (breakpoints) of the structural variant and the uncertainty associated with this measurement. Most types of SV, including deletions, inversions, and translocations, have two breakpoints a and b where the reference genome is cut. The segments adjacent to these coordinates are then pasted together in a way that is particular to the type of SV. For example, a deletion is defined by coordinates a and b in the reference genome such that the nucleotide at position a is joined to the nucleotide at position b in the individual genome (Figure 6). Note that this is a simplification of the underlying biology, as there are sometimes small insertions or deletions at breakpoints, but these small changes have limited effect on the analysis of larger structural variants.

Now the discordant pairs that indicate an SV have the property that the locations of the read alignments are near the breakpoints a and b. However, a paired read does not give independent information about the breakpoint a and the breakpoint b. Rather, the breakpoints a and b are related by a linear inequality that defines a polygon in 2D genome space called the breakpoint region (Figure 6). For example, suppose that the pair of reads from a single fragment align to the same chromosome of the reference genome such that the read with lower coordinate starts at position x in the reference and the read with higher coordinate ends at position y in the reference. (For simplicity, we ignore the fact that the sequence of a read can align to either strand (forward or reverse) of the reference genome. The strand of an alignment gives additional information about the location of the breakpoint. See [55] for further details.) If the sequenced fragment has length L then the breakpoints a and b satisfy the equation

$(a - x) + (y - b) = L.$

As described above, the size of any particular fragment is typically unknown. Rather, one defines a minimum size $L_{\min}$ and maximum size $L_{\max}$ of a sequenced fragment, perhaps according to the empirical fragment length distribution. Thus, we have the inequality

$L_{\min} \le (a - x) + (y - b) \le L_{\max}.$

This inequality defines the unknown breakpoints a and b in terms of the known
coordinates x and y of the aligned reads and the length of sequenced fragments. The pairs of breakpoints (a,b) that satisfy this inequality form a polygon (specifically a trapezoid) in two-dimensional genome space. We define the breakpoint region B of discordant pair (x,y) to be the breakpoints (a,b) satisfying the above inequality.

This geometric representation provides a principled way to combine information across multiple paired reads: multiple paired reads indicate the same variant if their corresponding breakpoint regions intersect. The geometric representation also provides precise breakpoint localization by multiple paired reads; separates multiple measurements of the same variant from measurements of nearby or overlapping variants; and facilitates robust comparisons across multiple samples and measurement technologies. Finally, the approach is computationally efficient as it relies on computational geometry algorithms for polygon intersection. These scale to the millions of discordant pairs that result from next-generation sequencing platforms.

While the algorithms above consider many of the issues in prediction of structural variants, there remains room for improvement. Most notably, many algorithms still use only one of the possible signals of structural variants: read depth, split reads, or paired reads. Improvements in specificity are likely possible by integrating these multiple signals into a single prediction algorithm [60].

5. Representation of Structural Variants

Next-generation DNA sequencing technologies are dramatically reducing the cost of sequence-based surveys of structural variants, while oligonucleotide aCGH techniques are now used in studies profiling tens of thousands of genomes. Large projects like the 1000 Genomes Project and The Cancer Genome Atlas (TCGA) are performing paired-end sequencing and aCGH of many human genomes, and matched tumor and normal genomes, respectively. At the same time, smaller or single-investigator projects are using a variety of paired-end sequencing approaches and/or microarray-based techniques with different trade-offs in cost per sample vs. measurement resolution. Thus, in the near future there will be an enormous number of measurements of SVs, but using a wide range of technologies of varying resolution, sensitivity, and specificity. This diversity of approaches will likely continue for some time as investigators explore tradeoffs between the cost of measuring variants in one
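
The depth-of-coverage arithmetic of Section 4.3 and the breakpoint-region test of Section 4.4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the GASV implementation: the function and class names are invented here, the copy-number estimate assumes a diploid baseline with depth strictly proportional to copy number, and the breakpoint region is restricted to deletions under the added assumption that both reads lie outside the deleted segment ($x \le a \le b \le y$). The intersection test uses a closed-form shortcut that holds for this deletion case, rather than the general polygon-intersection algorithms mentioned in the text.

```python
from dataclasses import dataclass


def expected_coverage(num_reads: int, read_len: int, genome_len: int) -> float:
    """Lander-Waterman mean coverage c = N * L / G."""
    return num_reads * read_len / genome_len


def estimated_copies(observed_depth: float, mean_coverage: float, ploidy: int = 2) -> float:
    """Copy-number estimate for an interval.

    Assumption: depth is proportional to copy number and the genome-wide
    mean coverage corresponds to `ploidy` copies.
    """
    return ploidy * observed_depth / mean_coverage


@dataclass(frozen=True)
class DiscordantPair:
    x: int  # start of the lower-coordinate read on the reference
    y: int  # end of the higher-coordinate read on the reference


def in_breakpoint_region(pair: DiscordantPair, a: int, b: int,
                         l_min: int, l_max: int) -> bool:
    """True if deletion breakpoints (a, b) are consistent with `pair`.

    Checks L_min <= (a - x) + (y - b) <= L_max, plus the assumed
    ordering x <= a <= b <= y.
    """
    if not (pair.x <= a <= b <= pair.y):
        return False
    return l_min <= (a - pair.x) + (pair.y - b) <= l_max


def regions_intersect(p: DiscordantPair, q: DiscordantPair,
                      l_min: int, l_max: int) -> bool:
    """True if the deletion breakpoint regions of p and q share a point.

    With d = b - a, each pair requires (y - x) - L_max <= d <= (y - x) - L_min,
    and both reads of both pairs must flank the deletion, so
    d <= min(y) - max(x). The regions intersect iff some d >= 0 fits all bounds.
    """
    lo = max(0, (p.y - p.x) - l_max, (q.y - q.x) - l_max)
    hi = min((p.y - p.x) - l_min, (q.y - q.x) - l_min,
             min(p.y, q.y) - max(p.x, q.x))
    return lo <= hi


if __name__ == "__main__":
    c = expected_coverage(900_000_000, 100, 3_000_000_000)
    print("coverage:", c)
    print("copies at half depth:", estimated_copies(c / 2, c))
    p, q = DiscordantPair(1000, 2500), DiscordantPair(1050, 2540)
    print("same variant?", regions_intersect(p, q, l_min=200, l_max=300))
```

With the hypothetical numbers in the usage section, 900 million reads of 100 bp over a 3 Gb genome give 30X coverage, and an interval observed at half that depth is estimated to have one copy, as expected for a heterozygous deletion.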
References
1. Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461: 747–753.
2. Stratton MR (2011) Exploring the genomes of cancer cells: progress and promise. Science 331: 1553–1558.
3. Frazer K, Ballinger D, Cox D, Hinds D, Stuve L, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
4. Sharp AJ, Cheng Z, Eichler EE (2006) Structural variation of the human genome. Annu Rev Genomics Hum Genet 7: 407–442.
5. Iafrate A, Feuk L, Rivera M, Listewnik M, Donahoe P, et al. (2004) Detection of large-scale variation in the human genome. Nat Genet 36: 949–951.
6. Redon R, Ishikawa S, Fitch K, Feuk L, Perry G, et al. (2006) Global variation in copy number in the human genome. Nature 444: 444–454.
7. Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, et al. (2007) Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 315: 848–853.
8. Lower KM, Hughes JR, De Gobbi M, Henderson S, Viprakasit V, et al. (2009) Adventitious changes in long-range gene expression caused by polymorphic structural variation and promoter competition. Proc Natl Acad Sci USA 106: 21771–21776.
9. Marshall C, Noor A, Vincent J, Lionel A, Feuk L, et al. (2008) Structural variation of chromosomes in autism spectrum disorder. Am J Hum Genet 82: 477–488.
10. Stone JL, O'Donovan MC, Gurling H, Kirov GK, Blackwood DH, et al. (2008) Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455: 237–241.
11. Sindi SS, Raphael BJ (2009) Identification and frequency estimation of inversion polymorphisms from haplotype data. In: RECOMB. pp. 418–433.
12. Nowell PC (1976) The clonal evolution of tumor cell populations. Science 194: 23–28.
13. Merlo LM, Pepper JW, Reid BJ, Maley CC (2006) Cancer as an evolutionary and ecological process. Nat Rev Cancer 6: 924–935.
14. Albertson DG, Collins C, McCormick F, Gray JW (2003) Chromosome aberrations in solid tumors. Nat Genet 34: 369–376.
15. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, et al. (2005) Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310: 644–648.
16. Soda M, Choi Y, Enomoto M, Takada S, Yamashita Y, et al. (2007) Identification of the transforming EML4-ALK fusion gene in non-small-cell lung cancer. Nature 448: 561–566.
17. Mitelman F, Johansson B, Mertens F (2004) Fusion genes and rearranged genes as a linear function of chromosome aberrations in cancer. Nat Genet 36: 331–334.
18. Meyerson M, Gabriel S, Getz G (2010) Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet 11: 685–696.
19. Mardis ER (2012) Genome sequencing and cancer. Curr Opin Genet Dev 22: 245–250.
20. International Cancer Genome Consortium, Hudson TJ, Anderson W, Artez A, Barker AD, et al. (2010) International network of cancer genome projects. Nature 464: 993–998.
21. Bignell GR, Santarius T, Pole JCM, Butler AP, Perry J, et al. (2007) Architectures of somatic genomic rearrangement in human cancer amplicons at sequence-level resolution. Genome Res 17: 1296–1303.
22. Campbell P, Stephens P, Pleasance E, O'Meara S, Li H, et al. (2008) Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 40: 722–729.
23. Kidd J, Cooper G, Donahue W, Hayden H, Sampas N, et al. (2008) Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56–64.
24. Kolomietz E, Meyn MS, Pandita A, Squire JA (2002) The role of Alu repeat clusters as mediators of recurrent chromosomal aberrations in tumors. Genes Chromosomes Cancer 35: 97–112.
Chapter 7: Pharmacogenomics
Konrad J. Karczewski (1,2), Roxana Daneshjou (2,3), Russ B. Altman (2,3,*)
(1) Program in Biomedical Informatics, Stanford University, Stanford, California, United States of America
(2) Department of Genetics, Stanford University, Stanford, California, United States of America
(3) Department of Medicine, Stanford University, Stanford, California, United States of America
Abstract: There is great variation in drug-response phenotypes, and a "one size fits all" paradigm for drug delivery is flawed. Pharmacogenomics is the study of how human genetic information impacts drug response, and it aims to improve efficacy and reduce side effects. In this article, we provide an overview of pharmacogenetics, including pharmacokinetics (PK), pharmacodynamics (PD), gene and pathway interactions, and off-target effects. We describe methods for discovering genetic factors in drug response, including genome-wide association studies (GWAS), expression analysis, and other methods such as chemoinformatics and natural language processing (NLP). We cover the practical applications of pharmacogenomics both in the pharmaceutical industry and in a clinical setting. In drug discovery, pharmacogenomics can be used to aid lead identification, anticipate adverse events, and assist in drug repurposing efforts. Moreover, pharmacogenomic discoveries show promise as important elements of physician decision support. Finally, we consider the ethical, regulatory, and reimbursement challenges that remain for the clinical implementation of pharmacogenomics.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Karczewski KJ, Daneshjou R, Altman RB (2012) Chapter 7: Pharmacogenomics. PLoS Comput Biol 8(12): e1002817. doi:10.1371/journal.pcbi.1002817
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: 2012 Karczewski et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: KJK is supported by NIH/NLM National Library of Medicine training grant "Graduate Training in Biomedical Informatics" T15-LM007033 and the NSF Graduate Research Fellowship Program. RD is supported by Stanford Medical Scholars. RBA is supported by PharmGKB GM61374. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: russ.altman@stanford.edu

1. Introduction

A child with leukemia goes to the doctor's office to be treated. The oncologist has decided to use mercaptopurine, a drug with a narrow therapeutic range. The efficacy and toxicity of this drug lie in its ability to act as a myelosuppressant, which means it suppresses white and red blood cell production. Despite the dangers this regimen poses, the oncologist is confident in his ability to administer the drug based on his experience with prior patients. However, after the child has undergone treatment, he begins experiencing unexpected bone marrow toxicity, immunosuppression, and life-threatening infections. This type of scenario was encountered after mercaptopurine first came on the market in the 1950s. In the mid-1990s, scientists began to realize that genetics could explain a majority of the cases of life-threatening bone marrow toxicity [1]. Now, many drugs that were once noted to cause so-called "unpredictable" reactions are being re-evaluated for drug-gene interactions.

The history of medicine is full of medications with unintended consequences; the ability to understand some of the underlying causes has been a recent development. In the 1950s, succinylcholine was used by anesthesiologists as a muscle relaxant during operations. However, about 1 in 2500 individuals experienced a horrific reaction: respiratory arrest. Later research revealed that those individuals had defects in both copies of cholinesterase, the enzyme required to metabolize succinylcholine into an inactive form. During the 1980s, a drug used to treat angina, perhexiline, caused neural and liver toxicity in a subset of patients. Scientists later found that this toxicity occurred in individuals with a rare polymorphism of CYP2D6, an enzyme involved in the drug's metabolism. Genetics not only plays a role in adverse events, but also influences an individual's optimal drug dose. The anticoagulant warfarin and the anti-platelet drug clopidogrel have different therapeutic doses based on an individual's genetic makeup. Scientists are increasingly learning more about the interaction between drugs and human genetics in order to take modern medicine down a more personalized path.

Modern physicians prescribe medications based on clinical judgment or evidence from clinical trials. In order to select a drug and dosage, physicians take clinical factors such as gender, weight, or organ function into consideration. The personal variation that may affect drug selection or dosing, such as genetics, is not considered in many settings. Thus, while a daily 75 mg dose of clopidogrel for a 70 kg adult would obviously be inappropriate for a 20 kg child, it is less obvious that two adults with identical presentations and clinical backgrounds might require vastly different doses. However, for an increasing number of drugs, this appears to be the case. For instance, two patients with similar clinical presentations could be given the same dose of the anti-platelet drug clopidogrel, and one would be adequately protected against cardiovascular events while the other experiences a myocardial infarction due to inadequate therapeutic protection. What accounts for this difference? Genetics: the patient with the inadequate therapeutic protection likely has a polymorphism of CYP2C19 with decreased activity, so that this key enzyme cannot efficiently metabolize clopidogrel into its active metabolite. The interaction between drugs and genetics has been termed pharmacogenomics.

In general, pharmacogenomics can be defined as the sum of the word's parts: the study and application of genetic factors (often in a high-throughput, genomic fashion) relating to the body's response to drugs, or pharmacology (for the major questions in the field of pharmacoge-
The body's metabolism of a drug can lead to the conversion of a precursor drug into an active metabolite or the breakdown of the active form into an inactive form for excretion. As with absorption and distribution, inter-individual variation in metabolism can often be explained by genetics (specifically, changes in the proteins that interact with the drug). Perhaps the most famous drug-metabolizing proteins are members of the cytochrome P450 family (CYP genes), which are involved in the phase I metabolism of the majority of known drugs [11]. Polymorphisms in these genes have been implicated in human drug response variation, affecting up to 25% of all drug therapies (reviewed in [12]). For instance, CYP2C9 plays a major role in the metabolism of warfarin to the inactive hydroxylated forms, including 7-hydroxywarfarin ([13], reviewed in [14]). As such, CYP2C9 is the second greatest contributor to the variation in warfarin dosage discovered thus far, which has led to its inclusion in pharmacogenetic dosing equations [15].

Finally, the body constantly cycles through the gamut of small molecules that flow through it. For example, the kidney is involved in finely regulating ionic concentrations and purging out unwanted metabolites. As small molecules, drugs are not exempt from these processes and are also excreted from the body, purging what was brought in and circulated by absorption and distribution. For instance, one member of the ABC family, P-glycoprotein (P-gp or ABCB1), is a transporter protein that actively pumps drugs and other metabolites out of cells (a detailed view into the mechanism of P-gp can be found in [16]). Upregulation of P-gp causes increased efflux of small molecules, which causes multi-drug resistance. For example, resistance to statins and chemotherapeutic drugs occurs because the drugs are pumped out before achieving their therapeutic effect (reviewed in [17]). Thus, inhibition of P-gp has remained an active area of research for augmenting cancer treatment [18]. Additionally, upregulation of elimination mediators such as P-gp should be considered for pharmacogenomic dose adjustments, with the caveat that increasing a drug's dose may have other potential detrimental effects.
In a similar vein, pathway analysis can be used to select new, potentially safer drug targets. Namely, if a drug (which targets some gene) is initially discovered as effective, but found to cause adverse events, safer alternatives might be found by searching for drugs that target genes in the same pathway as the original gene.

4.2. Clinical Trial Pipeline

Once a small molecule has been biochemically identified as a lead and a lack of toxicity verified in animal models, the small molecule goes through a series of increasingly larger phases of clinical trials. Basic efficacy and relative safety are demonstrated before and during Phase II clinical trials, on the path to Phase III.
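
The pathway-based search for alternative drugs described above (just before Section 4.2) can be sketched as a simple set computation. This is a minimal illustration under stated assumptions, not a method taken from the chapter: the drug, gene, and pathway names are placeholders, the two lookup tables stand in for whatever drug-target and pathway annotations are actually available, and the choice to skip candidates that also hit the original target is an assumption made here.

```python
from typing import Dict, List, Set


def alternative_drugs(drug: str,
                      drug_targets: Dict[str, Set[str]],
                      pathways: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Find drugs that target other genes in the same pathway(s) as `drug`.

    drug_targets: drug name -> set of target genes (hypothetical annotations)
    pathways:     pathway name -> set of member genes (hypothetical annotations)
    Returns a mapping of candidate drug -> sorted list of shared-pathway genes it hits.
    """
    original = drug_targets[drug]

    # Collect every gene that shares a pathway with an original target.
    pathway_genes: Set[str] = set()
    for genes in pathways.values():
        if genes & original:
            pathway_genes |= genes
    pathway_genes -= original  # keep only genes other than the original target(s)

    candidates: Dict[str, List[str]] = {}
    for other, targets in drug_targets.items():
        if other == drug:
            continue
        # Assumption: a drug that also hits the original target is not a
        # safer alternative, so it is skipped.
        if targets & original:
            continue
        shared = targets & pathway_genes
        if shared:
            candidates[other] = sorted(shared)
    return candidates


if __name__ == "__main__":
    # Placeholder data for illustration only.
    targets = {
        "drug_A": {"GENE1"},
        "drug_B": {"GENE2"},
        "drug_C": {"GENE9"},
        "drug_D": {"GENE1", "GENE3"},
    }
    paths = {
        "pathway_P": {"GENE1", "GENE2", "GENE3"},
        "pathway_Q": {"GENE9"},
    }
    print(alternative_drugs("drug_A", targets, paths))
```

In this toy setup only `drug_B` is returned for `drug_A`: `drug_C` acts in an unrelated pathway, and `drug_D` shares the original target. Any real candidate would still need to pass through the trial phases described in Section 4.2.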
many toxic side effects, and in many cases of advanced cancers, physicians guess and test medications by prescribing them and monitoring progress. In addition, the very nature of cancer is personal, insofar as each specific cancer is caused by the unique sum of individual somatic mutations (that is, mutations that occur in the individual after birth and are not inherited or passed on). Certain signatures of cancers, or mutations that produce similar cancer phenotypes, allow for the grouping of cancers into classes, such as leukemia or lymphoma, but even these classes exhibit significant variability. Thus, the ability to sequence and study the genomes of cancer cells of an individual can help identify the driving somatic mutations and provide a tool for rational drug choice. For example, the median survival for advanced or recurrent endometrial cancer is very poor, due to the fact that physicians treat empirically with chemotherapy, which may have no therapeutic benefits. Researchers studying mutations in the pathways of endometrial cancer cell lines found that response to doxorubicin, a chemotherapy used to treat endometrial cancer, was related to mutations in the Src pathways, which are involved in cell proliferation, motility, and survival. By pinpointing mutations in this pathway, the researchers were able to rationalize supplementing the drug regimen with the addition of SU6656, a drug that competitively inhibits the Src pathway, which increased the sensitivity of some of the cell lines to doxorubicin [38]. As cancers are typically characterized by a lack of error-correction mechanisms and inhibited apoptosis, such an approach is particularly important, as the initial failure of a chemotherapeutic drug allows time for a cancer to develop further mutations and spread further. In the future, interrogating cancer genomes could allow rational drug prescribing, decreasing the amount of time spent on ineffective therapies and increasing the number of successful cures.

equivalent to that of a prior myocardial infarction in a nondiabetic individual. Presently, the physician chooses drugs based on his best clinical judgment and then monitors the outcome of the treatment. However, as the tolerance and efficacy of certain popularly prescribed drugs have been shown to be tied to genetics, such information could be used in prescription decisions. For instance, statins are a class of drugs that are inhibitors of HMG-CoA reductase, an enzyme that helps produce cholesterol in the liver. Thus, statins are given in an effort to lower cholesterol, particularly low-density lipoprotein (LDL) cholesterol, whose increased levels are a cardiac risk factor. Statins are often prescribed to patients with type II diabetes and high cholesterol in order to help them reach a healthier cholesterol range. Even though studies have suggested genetic influences on statin efficacy and tolerance, such findings are not yet widely applied in clinical medicine. One study found that in individuals with diabetes, variation in the HMG-CoA reductase gene was associated with a decreased response to statin therapy. In this study, a significantly greater percentage of individuals heterozygous for the G minor allele of rs17238540 were unable to reach target cholesterol and triglyceride goals when compared to individuals homozygous for the major allele. Additionally, these individuals had a 13% smaller reduction in total cholesterol and a 27% smaller reduction in triglycerides. This is an example of just one variation in the HMG-CoA reductase gene; other variations certainly exist and can impact how well a patient responds to statins [39]. Another gene that has been found to affect response to statins is the APOE gene, which is associated with the regulation of total cholesterol and LDL cholesterol. There are several variants in this gene, and there are differences between how type II diabetic individuals carrying these variants respond to statins. For instance, the individuals homozygous for the E2 variant were all able to reach their target LDL cholesterol; however

going to succeed or if another drug would be a better choice [40].

One major caveat of gene-based prescription decisions (as well as dosing, as discussed below) involves the applicability of a finding in one population to other populations (see above: Association Methods). While a pharmacogenomic effect may be true for a given population (with a certain "genetic background", in animal model parlance), it may not directly apply to other populations due to unknown genetic factors, especially combinatorial effects. Because there is no current standard for translating a result between ethnicities, follow-up work is required for each specific pharmacogenomic interaction before it is applied in a clinical setting.

5.2. Adverse Drug Reactions

Another factor physicians need to consider when choosing a drug is the risk of adverse events, or any detrimental, unintended consequence of administering a drug at indicated clinical doses. In a milder form, an adverse event could be an allergic rash from penicillin. These events can also be much more intense: severe adverse drug reactions (SADRs) are those that can cause significant injury or even death, and are estimated to occur in about 2 million patients a year in the United States. In fact, SADRs are the fourth leading cause of death in the United States, with about 100,000 yearly deaths. Because of the impact of SADRs, scientists and physicians hope that the application of pharmacogenomics can help predict which patients are most susceptible to experiencing an SADR to a given drug. With this knowledge in hand, a physician can either more closely monitor these patients or choose an alternative therapy [41].

For instance, statins have been associated with a rare but incredibly severe adverse reaction: myopathy and rhabdomyolysis. A study looking at the possible genetic influences of this reaction found a SNP in the SLCO1B1 gene associated with this severe
Pharmacogenomics can also play a role in 32% of individuals homozygous for the E4 adverse drug reaction, with an odds ratio of
drug decisions for prevalent conditions, variant failed to reach target LDL cholesterol. 4.5 [42]. However, there are also cases of
allowing physicians to predict when a Moreover, E2 variant homozygotes had a individuals who experience milder symptoms
commonly successful therapy may fail. For significantly greater lipid-lowering response to and develop statin intolerance. Some of these
instance, there is an arsenal of drugs doctors statins than some of the other variants. Thus, individuals experience an elevation in crea-
can use to combat the co-morbidities of type APOE is another gene that may be predictive tine kinase or alanine aminotransferase while
II diabetes. These co-morbidities are usually of statin resistance or reduced efficacy. on statins, indicating possible muscle or liver
cardiac risk factors, such as lipid abnormal- Knowledge of these genes could play a role damage. A recent study found that the
ities and high blood pressure: the cardiac risk in the future of drug prescribing, as physicians functional variants V174A and N130D in
factor conferred by type II diabetes is would be able to predict a priori if a drug was the SLCO1B1 gene, which encodes the
doi:10.1371/journal.pcbi.1002817.t001
system (Figure 6). Full adoption will require a curated, updated database with FDA or evidence-based approved drug-gene interactions that would be available for physicians to use in their medical practice. For example, PharmGKB is primarily used as a scientific tool for identifying drug-gene interactions. However, its clinical utility was shown when it was used to generate drug recommendations based on an individual's fully sequenced genome [37]. Such resources serve as the precursor to the systems that will be in place when all individuals have sequenced genomes readily available for physician use.

Finally, for pharmacogenomics to be widely applied, personal genomics needs to become ingrained into modern medicine. Physicians and patients must be educated as to the benefits of genomic medicine, in order to dispel any myths and to avoid ethical issues. Moreover, genetic testing facilities meeting the U.S. government's Clinical Laboratory Improvement Amendments (CLIA) certification requirements need to be established in order to provide patients with genomic data that is considered acceptable for clinical use. In addition, insurance companies must be on board to reimburse genetic testing. Since sequencing costs continue to fall drastically, the debates surrounding cost will soon become moot [46]. Thus, we are rapidly entering an age where every patient can have his or her genome available. With the availability of an individual's genome, a physician looking to administer a drug such as a statin can check to see whether or not the statin would be expected to work and if any possible adverse events might be expected (Figure 6).

Pharmacogenetics is a rapidly developing field; however, some challenges remain in implementing scientific findings from the bench to the bedside. Because of the continued development and work in this field, these challenges will be addressed, ushering in an age of personalized drug treatments.

6. Summary

Pharmacogenomics encompasses the interaction between human genetics and drugs, which can be affected by variation in genes involved in pharmacokinetics (PK) and pharmacodynamics (PD). Thus, a major goal of pharmacogenomics is to elucidate which genes affect drug action, using cheminformatics, expression studies, and genome-wide association studies (GWAS). Association methods can be used to discover novel associations by comparing the genetic differences between cases with a certain phenotype and controls. Expression analysis and cheminformatics can be used to expand knowledge about drug-gene interactions by comparing gene expression or interaction profiles among drugs and genes. Analysis of these studies can yield information about how these genes affect drug action. Because of differences in haplotype structure between populations, studies validated in one population may not be directly applicable to a different population. However, as knowledge accumulates about drug-gene interactions, scientists can contribute to databases, such as PharmGKB, documenting known relationships (Table 1). As the volume of knowledge grows, text mining methods may become instrumental in interrogating the literature and collecting relevant data for clinical use.

The application of pharmacogenomics in the clinic can help inform physicians in drug prescribing, drug dosing, and prediction of adverse events. Because many of the drugs undergoing pharmacogenomic study are already FDA-approved, adoption of pharmacogenomics in the clinic is mostly dependent on the availability of genome sequencing and the development of implementation infrastructure. Moreover, pharmacogenomics can also aid in drug development, providing pharmaceutical companies with an additional tool to design more successful, cheaper trials. Thus, pharmacogenomics promises to help launch medicine and drug development into the realm of personalized care.

7. Exercises

1. (A) Download a genotype and phenotype dataset of your choosing. Using PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) or a statistical program such as R (http://www.r-project.org/), calculate the association (using a Fisher's exact test) between <Trait> and each SNP. After Bonferroni correction, does any SNP reach genome-wide significance? (B) Does using a different correction method such as Benjamini or False Discovery Rate (FDR) result in any more significant SNPs?
2. (A) Use a pharmacogenomic database (such as PharmGKB) to find genes that may interact with metformin. (B) Are any of these genes known to interact
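As a possible starting point for Exercise 1, the following is a minimal sketch in plain Python of a two-sided Fisher's exact test on a 2x2 table of counts, followed by Bonferroni and Benjamini-Hochberg (FDR) adjustment. The function names and the toy counts are hypothetical and are not part of PLINK or R; a real analysis would run those tools on an actual genotype and phenotype dataset.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows could be cases/controls and columns carriers/non-carriers of an allele.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts per SNP:
# (case carriers, case non-carriers, control carriers, control non-carriers)
toy_snps = {"snp1": (30, 70, 10, 90), "snp2": (20, 80, 18, 82), "snp3": (3, 0, 0, 3)}
names = list(toy_snps)
raw = [fisher_exact_two_sided(*toy_snps[name]) for name in names]
for name, p, pb, pq in zip(names, raw, bonferroni(raw), benjamini_hochberg(raw)):
    print(f"{name}: p={p:.3g} bonferroni={pb:.3g} bh={pq:.3g}")
```

With many SNPs, the Bonferroni threshold becomes very strict, which is why part (B) of the exercise asks whether an FDR-based correction retains more SNPs.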
Further Reading
• Altman RB, Flockhart D, Goldstein DB (2012) Principles of pharmacogenetics and pharmacogenomics. Cambridge: Cambridge University Press. 400 p.
• Altman RB, Kroemer HK, McCarty CA, Ratain MJ, Roden D (2010) Pharmacogenomics: will the promise be fulfilled? Nat Rev Genet 12: 69-73.
• Altman RB (2011) Pharmacogenomics: "noninferiority" is sufficient for initial implementation. Clin Pharmacol Ther 89: 348-350.
• Klein TE, Chang JT, Cho MK, Easton KL, Fergerson R, et al. (2001) Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenetics Research Network and Knowledge Base. Pharmacogenomics J 1: 167-170.
• Roses AD (2000) Pharmacogenetics and the practice of medicine. Nature 405: 857-865.
• Roses AD (2004) Pharmacogenetics and drug development: the path to safer and more effective drugs. Nat Rev Genet 5: 645-656.
Glossary
Abstract: Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Kim JH (2012) Chapter 8: Biological Knowledge Assembly and Interpretation. PLoS Comput Biol 8(12): e1002858. doi:10.1371/journal.pcbi.1002858
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Ju Han Kim. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the basic science research program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0028631). The funders had no role in the preparation of the manuscript.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: juhan@snu.ac.kr

1. Introduction

One of the challenges in DNA microarray and RNA-Seq data analysis is to extract biological meanings from the massive amounts of transcriptome expression data. Most of the microarray and RNA-Seq data analysis methods are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. Clustering and differential expression analysis, for example, typically generate lists of significantly clustered and Differentially Expressed Genes (DEGs), respectively. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes or gene products biologically mean.

The first analytic approach for the biological interpretation of obtained gene lists was to manually collect and put down all available descriptive information concerning each gene next to it and to try to infer the collective meaning of the textual descriptors for the group of genes under the biological systems context. The assumption here is that if a certain keyword is significantly over-represented or a meaningful pattern is found among the textual descriptors for a gene group, then the keyword or the pattern can be regarded as the semantic interpretation of the gene group.

Tavazoie et al. [1] appear to have been the first to formally analyze the over-representation of functional annotations for the lists of genes with semantic interpretations. By means of partitional clustering and motif discovery, given genome-wide gene-expression clusters, they analyzed significantly over-represented regulatory motifs in the upstream sequences of clustered yeast genes for uncovering new regulons (i.e., sets of co-regulated genes) and their putative cis-regulatory elements. Here, the discovered motifs seem to be regarded as functional annotations to the corresponding genes. Many Functional Annotation Analysis (FAA) methods have been developed to test whether certain Gene Ontology (GO) terms [2] or biological pathways are significantly enriched within a particular list of genes. Many GO and biological pathway-based tools for gene expression analysis have been developed and proven to be useful [3-9].

FAA is an attempt to extract biological semantics from given lists of genes that are determined without considering any biological meaning but by a quantitative statistical analysis like clustering and DEG analysis methods. Gene Set Enrichment Analysis (GSEA) [10,11], however, takes quite the reverse way. GSEA uses pre-defined gene sets with a priori established biological meanings like biological pathways. For each pre-defined gene set, GSEA tries to determine if it shows significant expression change. Therefore, what GSEA essentially tests is if the pre-defined biological meaning assigned to the gene set shows significant change or not. It has been demonstrated that GSEA can successfully detect subtle but set-wise coordinated expression changes that cannot be detected by individual gene tests [10].

The gene-set approach greatly improves biological interpretability by using pre-defined gene sets with established biological meanings. The same strategy can be applied
[...] clustering is classified as an unsupervised method. Results from supervised methods for a variety of classification tasks can sometimes be organized into a list based on, for example, their contributions to the task. In principle any list of genes can be carefully applied to ontology and pathway-based annotation analysis.

Metabolic pathways like KEGG and MetaCyc and signaling pathways like BioCarta are very powerful resources for the understanding of shared biological processes of a group of genes. Pathways are commonly presented as directed graphs, where nodes mainly represent molecules such as proteins and compounds, and edges represent relation types between two nodes. MetaCyc is an experimentally determined non-redundant metabolic pathway database. It is the largest collection containing over 1400 metabolic pathways [15]. It is a part of the BioCyc collection of pathways and genome databases developed by SRI International. The pathway figures of MetaCyc are not static diagrams, so they can be updated and expanded, while KEGG provides static collections of pathway diagrams.

One major goal of ontology is to provide a shared understanding of a certain domain of information. GO was first created as controlled vocabularies for standardized annotation of genome databases. Genes and gene products are annotated by GO as well as free text input by curators. Directed acyclic graph (DAG) structures are imposed on the three controlled vocabularies of GO: Molecular Function (MF), Cellular Component (CC), and Biological Process (BP). To each node (or GO term), a set of genes is annotated. MIPS began as a source for data on yeast biology, and now provides an integrated source for experimental, literature and computationally-predicted protein properties for a variety of complete genomes as well. MeSH has many clinical terms including disease names. Other knowledge resources like OMIM (Online Mendelian Inheritance in Man) Morbid Map can also be used to associate genes to MeSH disease names. GO and MeSH are now parts of UMLS (Unified Medical Language System) which has a semantic network structure. In principle, any biomedical ontology can be systematically applied for improving biomedical understanding of gene expression microarray and RNA-Seq data.

Once the genes of interest are successfully associated with correct functional annotations, the next step is to examine if there are any GO terms that have a larger than expected subset of listed genes in their annotation list. For example, if 20% of the genes in a gene list are annotated with a GO term apoptosis while
only 1% of the genes in the whole human genome fall into this functional category, then the gene list can be regarded as strongly related with the functional annotation. Most statistical tests like Chi-square, binomial and hypergeometric tests can be applied. The Chi-square test cannot be used to test data of small sample size. The hypergeometric test is widely used for functional enrichment analysis of gene lists, but it is computationally more intensive.

Suppose we have a total of N genes with n genes belonging to a group of interest (cluster or DEGs). Among the N genes, M are annotated to a specific GO term, and k genes both belong to the group of interest and are annotated to that GO term. The probability of having at most k genes can be calculated by the hypergeometric distribution according to the following:

$$P(X \le k) = \sum_{y=0}^{k} h(y \mid N; M; n) = \sum_{y=0}^{k} \frac{\binom{M}{y}\binom{N-M}{n-y}}{\binom{N}{n}}$$

The hypergeometric distribution is a discrete probability distribution describing the number of successes in a sequence of draws without replacement from a finite population. It is equivalent to a one-tailed Fisher's exact test. One should consider the choice of universe (or background), which has a substantial impact on the result. All genes having at least one GO annotation, all genes ever known in genome databases, all genes on the microarray, or all transcripts of RNA-Seq data that pass non-specific filters can be candidate universes. One more problem comes from the hierarchical tree (or graphical) structure of GO categories (or pathways), while the hypergeometric test assumes independence of categories. A parent term can be rated as significant simply because of the influence of its significant children. Moreover, more general statements require stronger evidence than is required to prove more specific statements. Conditional hypergeometric testing methods [16,17] exclude GO terms if there is no evidence beyond that provided by their significant children. Because many tests are performed, p-values must be interpreted with caution.

Pathway and ontology-based analysis consists of database mapping, statistical testing, and presentation steps [18]. Mapping gene lists to GO terms or pathways requires resolving gene name ambiguities and inconsistencies (not discussed here) using a wide range of genomic resources and techniques. Visual and textual presentation helps users to understand biological semantics and contexts. A number of analysis tools with these steps have been introduced: ArrayXPath, Pathway Miner, and EASE in pathway analysis; GOFish, GOTree Machine, FatiGO, GOAL, GoMiner, and FuncAssociate in ontology analysis; and GeneMerge, MAPPFinder, DAVID, GFINDer, and Onto-Tools in both analyses [14].

3. Gene Set-Wise Differential Expression Analysis

Researchers' primary interest with DNA microarray and RNA-Seq data is to identify differentially expressed genes (DEGs). To this aim, a number of statistical methods have been introduced, evaluating statistical significance of individual genes between two conditions. The gene set-wise differential expression analysis method, however, evaluates coordinated differential expression of gene groups, the meanings of which are previously defined as those of biological pathways. The first method developed in this category is Gene Set Enrichment Analysis (GSEA), which evaluates for each a priori defined gene set the significant association with phenotypic classes in DNA microarray experiments [10].
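The hypergeometric calculation described above for functional enrichment can be sketched in a few lines of plain Python. This is a minimal illustration using the same symbols (N, M, n, k); the function names are hypothetical, and the upper-tail function is included on the assumption that over-representation is usually reported as P(X >= k) rather than the cumulative sum P(X <= k).

```python
from math import comb


def hypergeom_pmf(y, N, M, n):
    """h(y | N; M; n): probability that exactly y of the n listed genes carry the term."""
    return comb(M, y) * comb(N - M, n - y) / comb(N, n)


def hypergeom_cdf(k, N, M, n):
    """P(X <= k), the cumulative sum in the formula for at most k genes (requires k <= n)."""
    return sum(hypergeom_pmf(y, N, M, n) for y in range(0, k + 1))


def enrichment_pvalue(k, N, M, n):
    """Upper tail P(X >= k): chance of seeing k or more annotated genes in the list."""
    return sum(hypergeom_pmf(y, N, M, n) for y in range(k, min(M, n) + 1))


# Illustrative numbers echoing the 20% versus 1% example: 20 of 100 listed genes
# carry a term that annotates 200 of 20,000 genes in the chosen universe.
print(enrichment_pvalue(k=20, N=20000, M=200, n=100))
```

Changing N (the universe) while holding the other counts fixed shows directly how strongly the choice of background affects the resulting p-value.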
While FAA tries to determine over-represented GO terms or biological pathways after determining significant co-expression clusters or DEG lists (Figure 3(a) and (c)), GSEA takes the reverse-annotation or gene set-wise approach (Figure 3(b)). This gene set-wise differential expression analysis method successfully identified modest but coordinated changes in gene expression that might have been missed by conventional individual gene-wise differential expression analysis. Moreover, the gene set-wise approach provides straightforward biological interpretation because the gene sets are defined by biological knowledge. GSEA's success clearly demonstrates that many tiny expression changes can collectively create a big change that is statistically significant. Another advantage is that utilizing pre-defined and well-established gene sets rather than finding or creating novel lists of genes markedly improves semantic interpretability and computational feasibility. It is believed that functionally related genes often show a coordinated expression pattern to accomplish their functional role.

GSEA first creates a ranked list of genes according to their differential expression between experimental conditions and then determines, for each a priori defined gene set, whether members of a gene set tend to occur toward the top (or bottom) of the ranked list, in which case the gene set is correlated with the phenotypic class distinction. For a gene set of interest, S, the Enrichment Score (ES) is calculated by evaluating the fraction of genes in S ("hits"), weighted by their correlation, and the fraction of genes not in S ("misses") present up to a given position i in the ranked gene list, L, where N genes are ordered according to the correlation, r(g_j) = r_j, of their expression profiles with the phenotype of interest:

$$P_{hit}(S,i) = \sum_{g_j \in S,\; j \le i} \frac{|r_j|^p}{N_R}, \quad \text{where } N_R = \sum_{g_j \in S} |r_j|^p$$

$$P_{miss}(S,i) = \sum_{g_j \notin S,\; j \le i} \frac{1}{N - N_H}$$

where N_H indicates the number of genes in S and p is an exponent to control the weight of the step. The ES is the maximum deviation from zero of P_hit - P_miss. It corresponds to a weighted Kolmogorov-Smirnov-like statistic.

GSEA assesses the significance by permuting the class labels. Concerning the definition of the null hypothesis, methods can be classified into competitive and self-contained tests [19]. A competitive test compares differential expression of the gene set to a standard defined by the complement of that gene set. A self-contained test, in contrast, compares the gene set to a fixed standard that does not depend on the measurements of genes outside the gene set. The competitive test is more popular than the self-contained test.
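The running-sum definition of the Enrichment Score given above can be sketched as follows. This is a minimal illustration in plain Python rather than the GSEA software itself: the function name, argument names, and toy gene labels are hypothetical, the genes are assumed to be already sorted by their correlation, and the permutation-based significance step is not included.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Maximum deviation from zero of P_hit - P_miss along a ranked gene list.

    ranked_genes  -- gene identifiers, already ordered by correlation
    correlations  -- r_j for each gene, in the same order
    gene_set      -- the a priori defined set S
    p             -- exponent controlling the weight of each hit step
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_total = len(ranked_genes)
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # N_R: total weight of the genes that belong to S.
    n_r = sum(abs(r) ** p for r, hit in zip(correlations, in_set) if hit)
    if n_r == 0:
        raise ValueError("all correlations of set members are zero")

    p_hit = p_miss = 0.0
    es = 0.0
    for r, hit in zip(correlations, in_set):
        if hit:
            p_hit += abs(r) ** p / n_r
        else:
            p_miss += 1.0 / (n_total - n_hits)
        deviation = p_hit - p_miss
        if abs(deviation) > abs(es):
            es = deviation
    return es


# Toy ranked list: set members at the top give a positive ES,
# set members at the bottom give a negative ES.
genes = ["g1", "g2", "g3", "g4"]
corr = [4.0, 3.0, -1.0, -2.0]
print(enrichment_score(genes, corr, {"g1", "g2"}))
print(enrichment_score(genes, corr, {"g3", "g4"}))
```

Setting p = 0 makes every hit step equal, which reduces the statistic to an unweighted Kolmogorov-Smirnov-like running sum.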
Typical gene sets are regulatory-motif, function-related, and disease-related sets. MSigDB (Molecular Signatures Database) is one of the leading gene set databases (http://www.broadinstitute.org/gsea/msigdb), containing a total of 6769 gene sets which are classified into five different collections (positional, curated, motif, computational and GO gene sets). Several interesting extensions were proposed in terms of sample level applications. For example, researchers developed genomic signatures to identify the activation status of oncogenic pathways and predict the sensitivity to individual chemotherapeutic drugs [20,21]. Significance Analysis of Function and Expression (SAFE) [22] extends GSEA to cover multiclass, continuous and survival phenotypes. It also provides more options for the test statistic, including Wilcoxon rank sum, Kolmogorov-Smirnov and hypergeometric statistics.

4. Differential Co-Expression Analysis

Co-expression analysis determines the degree of co-expression of a group (or cluster) of genes under a certain condition. Unlike co-expression analysis, differential co-expression analysis determines the degree of co-expression difference of a gene pair or a gene cluster across different conditions, which may relate to key biological processes provoked by changes in environmental conditions [12,23-25]. Differential co-expression analysis methods can be categorized into three major types (Figure 4): (a) differential co-expression of gene cluster(s) [26], (b) gene pair-wise differential co-expression [24] and (c) differential co-expression of paired gene sets [12].

To identify differentially co-expressed gene cluster(s) between two conditions (C1 and C2 in Figure 4(a)), a method determines whether a cluster shows a significant conditional difference in the degree of co-expression. An additive model-based scoring can be used based on the mean squared residual [26]. Let conditions and genes be denoted by J and I, respectively. The mean squared residual of the model is a measurement of co-expression of genes:

$$S(I,J) = \frac{1}{(|I|-1)(|J|-1)} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{i\cdot} - a_{\cdot j} + a_{\cdot\cdot}\right)^2$$

where an entry a_{ij} is the expression level of gene i in condition j, a_{i.} is the mean expression level of gene i across conditions, a_{.j} is the mean expression level of genes in condition j, and a_{..} is the mean expression level over all genes and conditions. A group of genes with a low score S indicates highly correlated genes. Given two groups J_1 and J_2, e.g. disease and control, the method minimizes the score S(I) of a set of genes I:

$$S(I) = \frac{S(I,J_1)}{S(I,J_2)} = \frac{(|J_2|-1) \sum_{i \in I,\, j \in J_1} \left(a_{ij} - a^{(1)}_{i\cdot} - a^{(1)}_{\cdot j} + a^{(1)}_{\cdot\cdot}\right)^2}{(|J_1|-1) \sum_{i \in I,\, j \in J_2} \left(a_{ij} - a^{(2)}_{i\cdot} - a^{(2)}_{\cdot j} + a^{(2)}_{\cdot\cdot}\right)^2}$$

where the superscripts (1) and (2) indicate means taken over J_1 and J_2, respectively. A greedy downhill approach finds local minima of the score. Another approach uses a t-statistic for each cluster to evaluate the difference of the degree of co-expression between conditions, after creating gene expression clusters [27]. These methods can be viewed as an attempt to find gene clusters that are tightly co-regulated (i.e. highly co-expressed) in one condition (e.g. normal) but not in another (e.g. cancer).

To identify differentially co-expressed gene pairs in Figure 4(b), an F-statistic can be calculated as the expected conditional F-statistic (ECF), a modified F-statistic, for all pairs of genes between two conditions [24]. A meta-analytic approach can also detect gene pairs with significant differential co-expression between normal and cancer samples [25]. These methodologies can be regarded as an attempt to discover gene pairs that are, in principle, positively correlated in one condition (e.g. normal) and negatively correlated in another (e.g. cancer). Identification of differentially co-expressed gene clusters or gene pairs usually does not use a priori defined gene sets or pairs but tries to find the best ones among all possible combinations without considering prior knowledge. Thus the biological interpretation of the clusters or pairs may also be improved by ontology and pathway-based annotation analysis.

The idea of finding gene clusters that show positive correlation in one condition and negative correlation in another condition sounds very interesting. However, it seems that there is very little chance for such a cluster to exist. Similarly, one can hardly find such a set among a priori defined gene sets (i.e. biological pathways). It is even difficult to expect a biological pathway whose members are all highly positively (or negatively) co-expressed in a condition because a biological pathway is a complex functional system with interacting positive and negative feedback loops. Thus, members of a biological pathway may not be contained in a single co-expression cluster, especially when the cluster is not very big, but be split into different clusters.

The dCoxS (differential co-expression of gene sets) algorithm identifies (a priori defined or semantically enriched) gene set pairs differentially co-expressed across different conditions (Figure 4(c) and Figure 5) [12]. Biological pathways can be used as pre-defined gene sets and the differential co-expression of the biological
pathway pairs between conditions is analyzed. To measure the expression similarity between paired gene sets under the same condition, dCoxS defines the interaction score (IS) as the correlation coefficient between the sample-wise entropies. Even when the numbers of genes in different pathways are different, IS can always be obtained because it uses only sample-wise distances regardless of whether the two pathways have the same number of genes or not.

$$IS = \frac{\sum_{i<j} \left(RE^{G_1}_{ij} - \overline{RE^{G_1}}\right)\left(RE^{G_2}_{ij} - \overline{RE^{G_2}}\right)}{\sqrt{\sum_{i<j} \left(RE^{G_1}_{ij} - \overline{RE^{G_1}}\right)^2}\,\sqrt{\sum_{i<j} \left(RE^{G_2}_{ij} - \overline{RE^{G_2}}\right)^2}}$$

where RE^{G_1} and RE^{G_2} are the matrices of the Renyi relative entropy of the two gene sets, G_1 and G_2, indexed by sample pairs i < j, and the overbar denotes the mean over those pairs. When estimating the relative entropy, multivariate kernel density estimation was used to model gene-gene correlation structure.

For example, when we compute the IS of a pair of pathway expression matrices with dimensions 20 (samples) by 25 and by 15 (genes) for a condition, we calculate 190 (= (20*19)/2) sample pair-wise entropy distances for each pathway expression matrix. The IS is obtained by calculating the correlation coefficient between the two entropy vectors. Finally, the statistical significance of the difference of the Fisher's Z-transformed ISs between two conditions is tested for each pathway pair.

$$Z_f = \frac{1}{2} \times \ln\left(\frac{1 + IS}{1 - IS}\right)$$

$$P\left(Z \ge \left| \frac{Z_{f1} - Z_{f2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} \right| \right)$$

Z_{f1} and Z_{f2} are the Fisher's Z-transformed values of the IS under two different conditions, and N_1 and N_2 are the numbers of upper-diagonal elements, which are calculated by n(n-1)/2 (n = number of samples) for each condition.

For the purpose of comparison, all gene pair-wise Z_f values are calculated for each condition and the conditional difference of the Fisher's Z-transformed correlation coefficients is tested for each gene pair as follows:

$$Z_f = \frac{1}{2} \times \ln\left(\frac{1 + CC}{1 - CC}\right)$$

The p-value of the difference in the Z_f values is calculated using the standard normal distribution, as in the following equation:

$$P\left(Z \ge \left| \frac{Z_{f1} - Z_{f2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} \right| \right)$$

where CC indicates the correlation coefficient of a gene pair, Z_{fi} the Fisher's Z-transformed correlation coefficient, and N_i the number of samples in condition i. The p value for differential co-expression is obtained according to the difference
between the Z values from the normal read the massive annotation lists for a factor binding, chromosomal co-location
distribution. For each gene pair, three p large number of clusters. It is unthinkably and protein-protein interaction networks)
values are obtained, one from each hard to manually assemble the puzzle can be added to better explore the
condition and another from the difference pieces (i.e., the cluster-annotation sets) underlying structures. The representation
between the conditions. Bonferroni cor- into an executive summary (i.e., the of relationship between clusters can give
rection is applied. context of the whole experiment). Many more insight to interpret functions of
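The Fisher's Z-transformation test used above for comparing interaction scores (or gene pair correlation coefficients) between two conditions can be sketched as follows. This is a minimal illustration in plain Python, not the dCoxS implementation: the function names and the example correlations are hypothetical, the entropy estimation step is omitted, and the tail probability P(Z >= |z|) of the standard normal distribution is computed with the complementary error function.

```python
from math import erfc, log, sqrt


def fisher_z(r):
    """Fisher's Z-transformation, Zf = 0.5 * ln((1 + r) / (1 - r)), for -1 < r < 1."""
    return 0.5 * log((1.0 + r) / (1.0 - r))


def n_sample_pairs(n_samples):
    """Number of upper-diagonal elements, n(n - 1) / 2, used as N for interaction scores."""
    return n_samples * (n_samples - 1) // 2


def diff_coexpression_pvalue(r1, r2, n1, n2):
    """P(Z >= |z|) for the difference of two Z-transformed correlations (n1, n2 > 3)."""
    z = (fisher_z(r1) - fisher_z(r2)) / sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return 0.5 * erfc(abs(z) / sqrt(2.0))


# Hypothetical interaction scores of one pathway pair in two conditions,
# measured on 20 and 18 samples (190 and 153 sample pairs).
n1 = n_sample_pairs(20)
n2 = n_sample_pairs(18)
print(n1, n2)
print(diff_coexpression_pvalue(0.8, -0.2, n1, n2))
```

For gene pair-wise tests, n1 and n2 would be the numbers of samples in each condition instead of the numbers of sample pairs, and the resulting p-values would still need a multiple-testing correction such as Bonferroni.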
annotations are redundant such that many interesting genes.
5. Biological Interpretation and clusters share the same annotations in a Figure 6 demonstrates a context (or a
very complex manner. Ideally, the assem- gene expression dataset) with clusters and
Biological Semantics
bly should involve eliminating redundant annotations. Note that the relation matrix
Biological interpretation of genomic attributes and organizing the pieces in a between objects (i.e., rows or clusters) and
data requires a variety of semantic knowl- well-defined order for better biological attributes (i.e., columns or annotations)
edge. Biomedical semantics provides rich understanding and insight into the under- can be represented by a bipartite graph
descriptions for biomedical domain knowl- lying context of the experiment under (Figure 6(b)) or a concept lattice
edge. Biomedical semantics is a valuable investigation. (Figure 6(c)). A concept lattice organizes
resource not only for biological interpre- BioLattice is a mathematical framework all clusters and annotations of a relation
tation but also for multi-layered heteroge- based on concept lattice analysis to matrix into a single unified structure with
neous data integration and genotype- organize traditional clusters and associated no redundancy and no loss of informa-
phenotype association. Symbolic inference annotations into a lattice of concepts for tion. It is worth noting that the cluster
algorithms may add further values. better biological interpretation of micro- labels, C1 to C5, and the annotation labels
Although GO and pathway-based anal- array gene-expression data [13]. BioLat- appear once and only once in the lattice
ysis of co-expressed gene groups is one of tice considers gene expression clusters as diagram (Figure 6(c)). Now one can
the most powerful approaches for inter- objects and annotations as attributes and interpret the whole experimental context
preting microarray experiments, they have provides a graphical summary of the order (Figure 6(a)) by reading the ordered
limitations. The result, for example, is relations by arranging them on a concept concepts with clusters and annotations.
typically a long unordered list of annota- lattice in an order based on set inclusion Structural analyses methods like prom-
tions for tens or hundreds of gene clusters. relation. Complex relations among clusters inent sub-lattice analysis and core-periph-
Most of the analysis tools evaluate only and annotations are clarified, ordered and ery structure analysis may help further
one cluster at a time in a sequential visualized. Redundancy of annotation is understanding [13]. Figure 7 shows a
manner without considering the informa- completely removed. It also has an BioLattice for a mouse anti-GBM glomer-
tive association network of clusters and advantage that heterogeneous biological ulonephritis model [28]. Genes showing
annotations. It is very time-consuming to knowledge resources (such as transcription significant time-dose effect were clustered
Clustering: algorithm that puts similar things together and different things apart.
Gene expression profiling: the measurement of the activity (or expression) of thousands of genes at once to create a global picture of cellular function using DNA microarray technology.

Supporting Information
Text S1 Answers to Exercises (DOCX)
References
1. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM (1999) Systematic determination of genetic network architecture. Nat Genet 22(3): 281-285.
2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25(1): 25-29.
3. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31(1): 19-20.
4. Al-Shahrour F, Díaz-Uriarte R, Dopazo J (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics 20(4): 578-580.
5. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. (2004) TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20(18): 3710-3715.
6. Chung HJ, Kim M, Park CH, Kim J, Kim JH (2004) ArrayXPath: mapping and visualizing microarray gene-expression data with integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 32(Web Server issue): W460-464.
7. Zhang B, Schmoyer D, Kirov S, Snoddy J (2004) GOTree Machine (GOTM): a web-based platform for interpreting sets of interesting genes using Gene Ontology hierarchies. BMC Bioinformatics 5: 16.
8. Zhong S, Storch KF, Lipan O, Kao MC, Weitz CJ, et al. (2004) GoSurfer: a graphical interactive tool for comparative analysis of large gene sets in Gene Ontology space. Appl Bioinformatics 3(4): 261-264.
9. Chung HJ, Park CH, Han MR, Lee S, Ohn JH, et al. (2005) ArrayXPath II: mapping and visualizing microarray gene-expression data with biomedical ontologies and integrated biological pathway resources using Scalable Vector Graphics. Nucleic Acids Res 33(Web Server issue): W621-626.
10. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 34(3): 267-273.
11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102(43): 15545-15550.
12. Cho SB, Kim J, Kim JH (2009) Identifying set-wise differential co-expression in gene expression microarray data. BMC Bioinformatics 10: 109.
13. Kim J, Chung HJ, Jung Y, Kim KK, Kim JH (2008) BioLattice: a framework for the biological interpretation of microarray gene expression data using concept lattice analysis. J Biomed Inform 41(2): 232-241.
14. Yue L, Reisdorf WC (2005) Pathway and ontology analysis: emerging approaches connecting transcriptome data and clinical endpoints. Curr Mol Med 5(1): 11-21.
15. Caspi R, Altman T, Dale JM, Dreher K, Fulcher CA, et al. (2010) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res 38(Database issue): D473-479.
16. Alexa A, Rahnenfuhrer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13): 1600-1607.
17. Falcon S, Gentleman R (2007) Using GOstats to test gene lists for GO term association. Bioinformatics 23(2): 257-258.
18. Huang da W, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1): 1-13.
19. Goeman JJ, Buhlmann P (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8): 980-987.
20. Bild AH, Yao G, Chang JT, Wang Q, Potti A, et al. (2006) Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439(7074): 353-357.
21. Potti A, Dressman HK, Bild A, Riedel RF, Chan G (2006) Genomic signatures to guide the use of chemotherapeutics. Nat Med 12(11): 1294-1300.
22. Barry WT, Nobel AB, Wright FA (2005) Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics 21(9): 1943-1949.
23. Li KC (2002) Genome-wide co-expression dynamics: theory and application. Proc Natl Acad Sci U S A 99(26): 16875-16880.
24. Lai Y, Wu B, Chen L, Zhao H (2004) A statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics 20(17): 3146-55.
Figure 1. An overview of the process to calculate enrichment of GO categories. The steps usually followed are: (1) Get annotations for each
gene in reference set and the set of interest. (2) Count the occurrence (n) of each GO term in the annotations of the genes comprising the set of
interest. (3) Count the occurrence (m) of that same GO term in the annotations of the reference set. (4) Assess how surprising it is to find n, given m,
M and N.
doi:10.1371/journal.pcbi.1002827.g001
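The four steps in the caption can be sketched as a one-sided hypergeometric test. This is a minimal illustration rather than the chapter's own implementation. The caption does not define M and N, so the sketch assumes M is the number of genes in the reference set and N the number of genes in the set of interest, and that the set of interest is drawn from the reference set. The function name `enrichment_p_value` is hypothetical.

```python
from math import comb  # available in Python 3.8+


def enrichment_p_value(n: int, N: int, m: int, M: int) -> float:
    """One-sided hypergeometric tail probability P(X >= n).

    Assumed meanings (M and N are not defined in the caption):
      M: number of genes in the reference set
      m: reference genes annotated with the GO term
      N: number of genes in the set of interest
      n: genes of interest annotated with the GO term
    """
    if not (0 <= m <= M and 0 <= N <= M and 0 <= n <= N):
        raise ValueError("counts must satisfy 0 <= m <= M and 0 <= n <= N <= M")
    # Add up the number of ways to see k annotated genes, for every k >= n.
    favourable = sum(
        comb(m, k) * comb(M - m, N - k) for k in range(n, min(N, m) + 1)
    )
    return favourable / comb(M, N)


# Toy numbers: 3 of 10 reference genes carry the term, and 2 of the
# 4 genes of interest carry it: (3*21 + 1*7) / 210 = 1/3.
p = enrichment_p_value(n=2, N=4, m=3, M=10)
print(round(p, 4))  # 0.3333
```

A small p-value means that finding n annotated genes would be surprising if the set of interest were a random draw from the reference set. When many GO terms are tested at once, the resulting p-values would still need a multiple-testing correction.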
molecular function in a certain biological process at a certain cellular location is enriched than it is to know about each of the terms separately. Similarly, when using ontologies other than GO, it is more meaningful to look for enrichment of combinations such as certain adverse reactions in a given disease when treated by a particular drug. However, exhaustively examining all possible 3-term combinations of ontology terms is computationally expensive and most of the random term combinations make no biological sense. The identification of combinations that are meaningful and appear at a high enough frequency to justify their use in enrichment computations is an exciting and fruitful area of research.

2.2 DIY Disease Ontology-based Enrichment Analysis Workflow

We have seen that progress in the current state of the art in storing, accessing and using ontologies for annotation provides components that allow enrichment analysis when preexisting annotations do not exist, as in the case of disease ontologies. We now discuss a workflow to conduct enrichment analysis in domains beyond just expression analysis. A schematic of the workflow is shown in Figure 2.

A user can start with two principal types of inputs. In the first case, the user already has the elements of the dataset of interest annotated with specific ontology terms, i.e. the user already has a file associating element identifiers (gene names, patient ID numbers, etc.) with ontology term identifiers. In the second case, the user has associations of identifiers to textual descriptions instead of ontology terms. For example, a user might have a file associating gene IDs with their GeneRIF descriptions from NCBI. In this situation a user can invoke the NCBO Annotator service [25,26] to process these textual descriptions and assign ontology terms to the element identifiers (Step 0). Given the user's selection of an ontology, the Annotator processes the input text (say GeneRIFs) to identify concepts that match ontology terms (based on preferred names or synonyms). The implementation details and accuracy of the Annotator service are described in [25]. The result is a list of computationally annotated element identifiers based on the input textual description, and this output is equivalent to the first input type. Using this step, we are able to create ontology-based annotations from free-text descriptions. Thus, we are no longer reliant on the availability of exhaustive manually-curated annotations, such as those required with GO-based analyses.

Step 1 After this optional preprocessing step, for each ontology term in the input dataset one can programmatically traverse the ontology structure and retrieve the complete listing of paths from the concept to the root(s) of the ontology using Web services [27]. A traversal through each of these paths essentially recapitulates the ontology hierarchy. Each term along the path is associated as an annotation with the element identifier in the input dataset with which the starting term was associated. This procedure of tracing terms back to the graph's root performs the transitive closure of the annotations over the ontology hierarchy. In essence, for each child-parent (IS_A) relationship, we generate the complete set of implied (indirect) annotations based on child-parent relationships, by traversing and aggregating along the ontology hierarchy.

Step 2 Once the ontology terms and their aggregate frequencies in the input dataset are calculated, we arrive at the step of determining the meaning or significance of the results. Enrichment analysis with GO has benefited from the existence of a natural and easily defensible choice for a background set: all of the given organism's genes, all genes measured on the platform, etc. For most of the disease ontologies we consider, no such comprehensive distribution exists [28]; and as discussed before, for calculating statistical enrichment, we need the background term frequency to determine if the aggregate annotation counts after step 1 are surprising given the background. By leveraging existing projects and resources, there are several methods by which a user can address this problem. We discuss a couple of heuristic approaches here, and in Section 2.3 discuss a systematic process to create custom reference sets.

In the first approach, one can access a database of automatically created annotations over the entirety of MEDLINE abstracts and use this annotation source as an approximate proxy for the true background distribution frequency of a specific term. To generate the background frequency, for a given term X, we retrieve the text strings corresponding to its preferred name and all of its synonyms, and then add up the MEDLINE occurrence counts for each of these strings. We
return this number (m) as well as the total number of entries in the MEDLINE annotation database (M). The fraction m/M then represents the background frequency of the term X in the annotated corpus. Using this frequency we can compute significant comparative over- or under-representation in the input dataset.

The second approach uses NCBO's Resource Index, which is a repository of automatically-created annotations. Access to the Resource Index allows a user to make the same sort of calculations as with the MEDLINE term frequencies, but also offers information on the co-occurrence of ontological terms in textual descriptions and annotations of datasets, enabling the user to quantify the degree to which terms are independent or correlated in the annotation space.

Step 3 There are several possible output mechanisms for such an analysis workflow. The simplest is a tag cloud, which intuitively summarizes the results of the analysis (Figure 3). The sizes and colors of terms in the cloud indicate the relative frequency of the terms, offering a high-level overview. However, a tag cloud's representational ability is limited because there is no easy way to show significance relative to some expectation, or to show the elements in the input associated with some term.

The second output format is XML, which is amenable to postprocessing by the user, as needed. The result for each term contains its respective frequency information in the input data along with the counts on which the frequency is based. The results on each term can also contain the list of identifiers that mapped to that term. Each node includes information on the level in the ontology at which the term is found. Using such an output, it is straightforward to create graphical visualizations similar to those that most GO-based enrichment analysis tools provide [29]; see example in Figure 4.

2.2.1. Ensuring quality. For any such custom analysis workflow it is essential to set up tests that ensure technical accuracy before interpreting the results for scientific significance. To evaluate technical accuracy, we suggest that users create benchmark data sets similar to those of Toronen and colleagues [30], who created gene lists with a selected enrichment level and a selected number of independent, over-represented classes to compare different GO-based enrichment methods. In the case of analyses using disease ontologies, the benchmark data sets would comprise gene lists enriched for specific disease terms, clinical-trial lists enriched for a specific drug being studied, lists of research publications that are enriched for known NCIt terms, and so on. A sample benchmark list of aging related genes and their annotations is provided in Section 5. Exercises. This dataset was compiled by computationally creating disease term annotations on 261 human genes designated to be related to aging according to the GenAge database [31]. The annotations of this gene list are enriched for disorders, such as atherosclerosis, that are known to be associated with aging. Such benchmark data sets can be used to ensure accuracy of the enrichment statistics as well as to evaluate the appropriateness of different sources of reference-term frequencies for computing enrichment.

The inconsistency of abstraction levels in ontologies is an often discussed stumbling block for enrichment analysis [17]. Two terms at equal depths may not represent concepts of similar granularity, creating a bias in the reported term enrichment. By comprehensively analyzing the frequencies of terms in MEDLINE and the NCBO Resource Index, a user can perform a thorough analysis of dependencies among ontology-term annotations to make existing biases explicit as well as to define custom abstraction levels using methods developed by Alterovitz et al. [32]. The development of methods to reliably identify the appropriate level of abstraction at which to report the results of enrichment analysis is another exciting and fruitful area of research.

2.3 Creating Reference Sets for Custom Enrichment Analysis

As discussed before, a key pre-requisite for performing enrichment analysis is the availability of an appropriate reference dataset to compare against when looking for over- or under-represented terms. In this section, we describe: (i) a general method that uses hand-curated GO annotations as a starting point for creating reference datasets for enrichment analysis using other ontologies; and (ii) a gene-disease reference annotation dataset for performing disease-based enrichment.

GO annotations are unique because highly trained curators associate GO terms to gene products manually, based on literature review. We describe how, with the availability of tools for automatic ontology-based annotation with terms from disease ontologies, it is possible to create reference annotation datasets for enrichment analysis using ontologies other than the GO, for example, the Human Disease Ontology.

Unlike GO terms, which actually appear in the text with low frequency, or gene identifiers, which are ambiguous, disease terms are amenable to automated term extraction techniques. Therefore, using tools which recognize mentions of ontology terms in user submitted text, we can automatically recognize occurrences of terms from the Human Disease Ontology (DO) in a given corpus of text [28]; the key is to identify the text source that can be relied upon to recognize disease terms to associate with genes.

Unlike other natural-language techniques for finding gene-disease associations, our proposed method uses manually curated GO annotations as the starting basis to identify the text source from which to recognize disease terms. Basically, we use manually curated GO annotations to identify those publications that were the
basis for associating a GO term with a particular gene.

Figure 5 summarizes our method. First, we start with GO annotations, which provide the PubMed identifiers of papers based on which gene products are associated with specific GO terms by a curator. The annotations essentially give us a link between gene identifiers and PubMed articles, and only those PubMed articles that were deemed to be relevant for GO annotation curation. Next, we recognize terms from an ontology of interest (e.g. Human Disease) in the title and abstracts of those articles. Finally, we associate the recognized ontology terms with the gene identifiers with which the analyzed article was associated.

In order to demonstrate feasibility of the proposed workflow and to provide a sample reference annotation set for performing disease ontology based analyses in the exercises of this chapter, we downloaded GO annotation files for human gene products from geneontology.org. These files are tab-delimited text files that contain, among other things, a list of gene identifiers, associated GO terms, and the publication source (a PubMed identifier) on the basis of which that GO annotation was created. We removed all electronically inferred annotations (IEA) from the annotation file. We also removed all qualified annotations, such as negated (NOT) ones. As a result, we obtained a list of publications and the genes they describe: gene-publication tuples. In the next step, using the PubMed identifiers obtained from the GO annotation files, we fetched each article's title and abstract using the National Library of Medicine eUtils. We saved each article's title and abstract as a file and annotated it via the Annotator service using the disease ontology as the target. Once we had the publication-disease tuples, we cross-referenced them with the gene-publication tuples, resulting in gene-disease associations for 7316 human genes.

Out of 25,000 currently estimated human genes, we are able to annotate 7316 genes (29.2%) with at least one disease term from the Human Disease Ontology. Previous methods that use advanced text mining have been able to annotate 4408 genes (17.7%) [33]. A study based on OMIM associated 1777 genes (7.1%) with disease terms to create a human diseasome [34], and an automated approach using MetaMap as the concept recognizer and GeneRIFs as well as descriptions from OMIM as the input textual descriptions annotated roughly 14.9% of the human genome with disease terms [28]. Because the number of human genes known at the time of each study varies, we make the comparisons loosely.

In order to validate our background annotation set, we evaluated our gene-disease association dataset in several ways described in [35]. First, we examined a set of genes related specifically to aging from the GenAge database [31] for their coherence in terms of the assigned disease annotations. Next, we performed disease-based enrichment analysis on the same aging related gene set using our newly created reference annotation set. The results of the enrichment analysis are shown in Figure 6 and the analysis itself is offered as an exercise for the reader in Section 6. Exercises. What differentiates our suggested method from other approaches [28,36] for finding gene-disease associations is the use of GO annotations as a basis for identifying reliable gene-publication records that serve as the foundation for generating automated annotations. Furthermore, researchers can reuse our method to examine function along other dimensions. For example, researchers can use the Pathway ontology to generate gene-pathway associations.

2.3.1 Ensuring quality. When using an automated annotation process to create a reference annotation set, there are some caveats to consider. First, not all ontologies are equally suited for creating automated annotations. Second, automated annotation depends highly on the quality of the input text corpus. Third, some errors in annotation are inevitable in an automated process. We discuss these issues below.

Using other ontologies. Although we specifically focus on creating annotations with terms from the human disease ontology, the method we have devised (Figure 6) can create annotations with terms from other ontologies. In the presented workflow, to obtain a background dataset for enrichment for some ontology other than DO, researchers would simply configure a parameter for the Annotator Web service to use their ontology of choice from BioPortal. In fact, other researchers have used a similar annotation workflow to
Figure 5. Workflow for generating background annotation sets for enrichment analysis: We obtain a set of PubMed articles from
manually curated GO annotations, which we process using the NCBO Annotator service.
doi:10.1371/journal.pcbi.1002827.g005
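The filtering and cross-referencing steps of this workflow can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the record fields (`gene`, `evidence`, `qualifier`, `pmid`), the function names, and the toy identifiers are all hypothetical, and the publication-disease tuples are assumed to have already been produced by a concept recognizer such as the Annotator service.

```python
from collections import defaultdict


def gene_publication_pairs(go_annotations):
    """Keep (gene, pmid) pairs from manually curated, unqualified annotations."""
    pairs = set()
    for record in go_annotations:
        if record["evidence"] == "IEA":  # electronically inferred: removed
            continue
        if record["qualifier"]:  # qualified annotations such as NOT: removed
            continue
        pairs.add((record["gene"], record["pmid"]))
    return pairs


def gene_disease_associations(gene_pub, pub_disease):
    """Cross-reference gene-publication and publication-disease tuples."""
    diseases_by_pmid = defaultdict(set)
    for pmid, disease in pub_disease:
        diseases_by_pmid[pmid].add(disease)

    associations = defaultdict(set)
    for gene, pmid in gene_pub:
        if pmid in diseases_by_pmid:
            associations[gene] |= diseases_by_pmid[pmid]
    return dict(associations)


# Made-up toy records; the field layout is an assumption for illustration.
go_annotations = [
    {"gene": "GENE_A", "evidence": "IDA", "qualifier": "", "pmid": "PMID:1"},
    {"gene": "GENE_B", "evidence": "IEA", "qualifier": "", "pmid": "PMID:2"},
    {"gene": "GENE_C", "evidence": "IMP", "qualifier": "NOT", "pmid": "PMID:3"},
    {"gene": "GENE_A", "evidence": "IMP", "qualifier": "", "pmid": "PMID:4"},
]
pub_disease = [
    ("PMID:1", "disease X"),
    ("PMID:2", "disease Y"),
    ("PMID:4", "disease Z"),
    ("PMID:4", "disease X"),
]

associations = gene_disease_associations(
    gene_publication_pairs(go_annotations), pub_disease
)
print(associations)
```

In this toy run only GENE_A survives the filters, and it inherits the disease terms recognized in both of its supporting publications. Storing sets per gene keeps a disease term from being counted twice when several papers mention it.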
recognize morphological features in textual descriptions of fish species [37].

Not all ontologies are viable candidates for automatic annotation because not all ontology terms appear in the text of a MEDLINE abstract. For example, using term-frequency counts in MEDLINE abstracts [38], we calculated that disease terms are mentioned 46% more often than GO terms in MEDLINE abstracts. As another example, only 10% of the manually assigned GO terms can be detected directly in the paper abstract supporting that particular GO annotation. Because disease terms are mentioned significantly more often than GO terms, the automated annotation process works well for annotating genes with disease ontology terms.

Missing annotations. Out of the 261 aging-related genes in our evaluation subset, the Annotator left out 24 genes (9%), for which we have no disease terms in our gene-disease association dataset. These missed annotations provide an opportunity for refining the annotation workflows to use sources of text beyond just the papers referenced in GO annotations.

Annotation errors. Some errors in annotation are inevitable in an automated process. For example, in the reference annotation set we created, TP53 was also annotated, wrongly, to "Recruitment". Papers that were the basis of creating GO annotations for TP53 certainly mention the term "Recruitment"; however, that term is not a disease. The term "Recruitment" is in the Human Disease Ontology and is declared to be a synonym of "auditory recruitment", which does not have an asserted superclass or a place in the hierarchy, indicating a possible error in the ontology. However, because such errors will affect annotation of both the set of interest and the reference set equally, the errors will most likely cancel each other out when computing statistical enrichment (Figure 6), though that is not guaranteed. Advanced text mining can potentially provide checks against such kinds of errors by analyzing the context in which a potential disease term is mentioned.

3. Novel Use Cases Enabled

We believe that extending the current enrichment-analysis methods to ontologies beyond GO, and extending the method beyond analyzing gene and protein annotations to any set of entities for term enrichment, will enable several novel use cases. For example, a user might analyze a set of papers published in the last three years in a particular domain (say, signal transduction) and identify which pathway was mentioned most frequently. Similarly, a user could analyze descriptions of genes controlled by a particular ultra-conserved region of DNA to generate hypotheses about the region's function in
References
1. Altman RB, Raychaudhuri S (2001) Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol 11: 340-347.
2. Brazma A, Vilo J (2000) Gene expression data analysis. FEBS Lett 480: 17-24.
3. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32 Suppl: 496-501.
4. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116-5121.
5. Huttenhower C, Hibbs M, Myers C, Troyanskaya OG (2006) A scalable method for integration and functional analysis of multiple microarray datasets. Bioinformatics 22: 2890-2897.
6. Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol 4: R28.
7. Rhee SY, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nat Rev Genet 9: 509-515.
8. Shah NH, Jonquet C, Chiang AP, Butte AJ, Chen R, et al. (2009) Ontology-driven indexing of public datasets for translational bioinformatics. BMC Bioinformatics 10 Suppl 2: S1.
9. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102: 15545-15550.
10. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA (2003) Global functional profiling of gene expression. Genomics 81: 98-104.
11. (2002) Gene ontology consortium website.
12. Alexa A, Rahnenfuhrer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22: 1600-1607.
13. Grossmann S, Bauer S, Robinson PN, Vingron M (2007) Improved detection of overrepresentation of Gene-Ontology annotations with parent-child analysis. Bioinformatics 23: 3024-3031.
14. Schlicker A, Rahnenfuhrer J, Albrecht M, Lengauer T, Domingues FS (2007) GOTax: investigating biological processes and biochemical activities along the taxonomic tree. Genome Biol 8: R33.
15. Farcomeni A (2008) A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Stat Methods Med Res 17: 347-388.
16. Benjamini Y, Yekutieli D (2001) The control of the false discovery rate under dependency. Ann Stat 29: 1165-1188.
17. Khatri P, Draghici S (2005) Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 21: 3587-3595.
18. Shah NH, Fedoroff NV (2004) CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics 20: 1196-1197.
19. Ade AS, States DJ, Wright ZC (2007) Genes2Mesh. Ann Arbor, MI: National Center for Integrative Biomedical Informatics.
20. Mort M, Evani US, Krishnan VG, Kamati KK, Baenziger PH, et al. (2010) In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Human Mutation 31: 335-346.
21. Spackman KA (2004) SNOMED CT milestones: endorsements are added to already-impressive standards credentials. Healthc Inform 21: 54, 56.
22. Smith B, Ashburner M, Rosse C, Bard J, Bug W, et al. (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25: 1251-1255.
23. Noy NF, Shah NH, Whetzel PL, Dai B, Dorf M, et al. (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res 37(Web Server issue): W170-W173.
24. Myhre S, Tveit H, Mollestad T, Laegreid A (2006) Additional gene ontology structure for improved biological reasoning. Bioinformatics 22: 2020-2027.
25. Shah NH, Bhatia N, Jonquet C, Rubin D, Chiang AP, et al. (2009) Comparison of concept recognizers for building the Open Biomedical Annotator. BMC Bioinformatics 10 Suppl 9: S14.
26. Jonquet C, Shah NH, Musen MA (2009) The Open Biomedical Annotator; 2009 March 15-17; San Francisco, CA. pp. 56-60.
27. (2010) NCBO REST services.
28. Osborne JD, Flatow J, Holko M, Lin SM, Kibbe WA, et al. (2009) Annotating the human genome with Disease Ontology. BMC Genomics 10 Suppl 1: S6.
29. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, et al. (2004) TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics 20: 3710-3715.
30. Toronen P, Pehkonen P, Holm L (2009) Generation of Gene Ontology benchmark datasets with various types of positive signal. BMC Bioinformatics 10: 319.
31. de Magalhaes JP, Budovsky A, Lehmann G, Costa J, Li Y, et al. (2009) The Human Ageing Genomic Resources: online databases and tools for biogerontologists. Aging Cell 8: 65-72.
32. Alterovitz G, Xiang M, Hill DP, Lomax J, Liu J, et al. (2010) Ontology engineering. Nat Biotechnol 28: 128-130.
33. Altman RB, Bergman CM, Blake J, Blaschke C, Cohen A, et al. (2008) Text mining for biology - the way forward: opinions from leading scientists. Genome Biol 9 Suppl 2: S7.
34. Goh KI, Cusick ME, Valle D, Childs B, Vidal M, et al. (2007) The human disease network. Proc Natl Acad Sci U S A 104: 8685-8690.
35. Lependu P, Musen MA, Shah NH (2011) Enabling enrichment analysis with the Human Disease Ontology. J Biomed Inform 44 Suppl 1: S31-S38.
36. Krallinger M, Leitner F, Valencia A (2010) Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 593: 341-382.
37. Sarkar N (2010) Using biomedical ontologies to enable morphology based phylogenetics: a feasibility study for fishes; 2010; Boston, MA.
38. Xu R, Musen MA, Shah NH (2010) A comprehensive analysis of five million UMLS metathesaurus terms using eighteen million MEDLINE citations. AMIA Annual Symposium proceedings/AMIA Symposium 2010: 907-911.
39. Tirrell R, Evani U, Berman AE, Mooney SD, Musen MA, et al. (2010) An ontology-neutral framework for enrichment analysis. AMIA Annual Symposium proceedings/AMIA Symposium 2010: 797-801.
Abstract: Genome-wide association study (GWAS) aims to discover genetic factors underlying phenotypic traits. The large number of genetic factors poses both computational and statistical challenges. Various computational approaches have been developed for large scale GWAS. In this chapter, we will discuss several widely used computational approaches in GWAS. The following topics will be covered: (1) An introduction to the background of GWAS. (2) The existing computational approaches that are widely used in GWAS. This will cover single-locus, epistasis detection, and machine learning methods that have been recently developed in the biology, statistics, and computer science communities. This part will be the main focus of this chapter. (3) The limitations of current approaches and future directions.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Zhang X, Huang S, Zhang Z, Wang W (2012) Chapter 10: Mining Genome-Wide Genetic Markers. PLoS Comput Biol 8(12): e1002828. doi:10.1371/journal.pcbi.1002828
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Zhang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the following grants: NSF IIS-1162369, NSF IIS-0812464, NIH GM076468 and NIH MH090338. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: weiwang@cs.ucla.edu

1. Introduction

With the advancement of genotyping technology, genome-wide high-density single nucleotide polymorphisms (SNPs) of human and other organisms are now available [1,2]. The goal of genome-wide association studies (GWAS) is to seek strong associations between phenotype and genetic variations in a population that represent (genomically proximal) causal genetic effects. As the most abundant source of genetic variation, millions of SNPs have been genotyped across the entire genome. Analyzing such a large number of markers poses great challenges to traditional computational and statistical methods. In this chapter, we introduce the basic concept of genome-wide association study, and discuss recently developed methods for GWAS.

Genome-wide association study is an interdisciplinary problem of biology, statistics and computer science [3,4,5,6]. In this section, we will first provide a brief introduction to the necessary biological background. We will then formalize the problem and discuss both traditional and recently developed methods for genome-wide analysis of associations.

A human genome contains over 3 billion DNA base pairs. There are four possible nucleotides at each base in the DNA: adenine (A), guanine (G), thymine (T), and cytosine (C). In some locations in the genome, a genetic variation may be found which involves two or more nucleotides across different individuals. These genetic variations are known as single-nucleotide polymorphisms (SNPs), i.e., a variation of a single nucleotide in the DNA sequence. In most cases, there are two possible nucleotides for a variant. We denote the more frequent one as 0, and the less frequent one as 1. For bases on autosomal chromosomes, there are two parallel nucleotides, which leads to three possible combinations, 00, 01 and 11. These genotype combinations are known as the major homozygous site, the heterozygous site and the minor homozygous site, respectively. These genetic variations contribute to the phenotypic differences among the individuals. (A phenotype is the composite of an organism's observable characteristics or traits.) Genome-wide association study (GWAS) aims to find strong associations between SNPs and phenotypes across a set of individuals.

More formally, let $X = \{X_1, X_2, \ldots, X_N\}$ be the set of $N$ SNPs for $M$ individuals in the study, and $Y$ be the phenotype of interest. The goal of GWAS is to find SNPs (markers) in $X$ that are highly associated with $Y$. There are several challenging issues that need to be addressed when developing an analytic method for GWAS [7,8].

Scalability. Most GWAS datasets consist of a large number of SNPs. Therefore the algorithms for GWAS need to be highly scalable. For example, for a typical human GWAS, the dataset may contain up to millions of SNPs and involve thousands of individuals. Inefficient methods may consume a large amount of computational resources and time to find highly associated SNPs.

Missing markers. Even with the current dense genotyping technique, many genetic variants are still not genotyped. Current methods usually assume genetic linkage to enhance the power. Imputation, which tries to impute the unknown markers by using existing SNP databases, is another popular approach to handle missing markers. The well known related projects include the International HapMap project [9] and the 1000 Genomes Project [10].

Complex traits. One approach in GWAS is to test the association between the trait and each marker in a genome, which is successful in detecting single-gene-related diseases. However, this approach may have problems in finding markers associated with complex traits. This is because complex traits are affected by multiple genes, and each gene may only have a weak association with the phenotype. Such markers with low marginal effects are hard to detect by the single-locus methods.
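The formulation in the Introduction (N SNPs genotyped in M individuals, one phenotype Y) can be illustrated with a minimal Python sketch. It encodes each genotype as its minor-allele count (00 → 0, 01 → 1, 11 → 2) and scores every SNP against a binary phenotype, one marker at a time. The chi-square statistic on a phenotype-by-genotype table is an assumption chosen for illustration, since the text does not prescribe a particular single-locus test at this point, and the function names are hypothetical.

```python
def encode_genotype(pair):
    """Map a genotype string such as '01' to its minor-allele count (0, 1 or 2)."""
    return pair.count("1")


def single_locus_chi2(genotypes, phenotype):
    """Chi-square statistic of the 2 x 3 table: binary phenotype by genotype."""
    m = len(phenotype)
    observed = [[0, 0, 0], [0, 0, 0]]  # rows: phenotype 0/1, columns: genotype 0/1/2
    for g, p in zip(genotypes, phenotype):
        observed[p][g] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][g] + observed[1][g] for g in range(3)]
    stat = 0.0
    for p in range(2):
        for g in range(3):
            expected = row_totals[p] * col_totals[g] / m
            if expected > 0:  # cells with an empty margin contribute nothing
                stat += (observed[p][g] - expected) ** 2 / expected
    return stat


def rank_snps(X, y):
    """Return SNP indices sorted from the strongest to the weakest association."""
    scores = [single_locus_chi2(snp, y) for snp in X]
    return sorted(range(len(X)), key=lambda i: scores[i], reverse=True)


# Toy data (made up): N = 2 SNPs (rows) genotyped in M = 4 individuals (columns).
raw = [["00", "00", "11", "11"],
       ["00", "11", "00", "11"]]
X = [[encode_genotype(g) for g in snp] for snp in raw]
y = [0, 0, 1, 1]  # binary phenotype, e.g. control = 0 and case = 1
ranking = rank_snps(X, y)
```

Each SNP is scored independently, so the cost of such a scan grows linearly with N. A marker with only a weak marginal effect would rank low here, which is the limitation the text attributes to single-locus methods.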
Note that for any group of $A$, $B$, $a_1$, $a_2$, $b_1$, $b_2$, if $n_{group} = 0$, then $T_{group}^2 / n_{group}$ is defined to be 0.

Let $\{y_m \mid y_m \in A\} = \{y_{A_1}, y_{A_2}, \ldots, y_{A_{n_A}}\}$ be the phenotype values in group $A$. Without loss of generality, assume that these phenotype values are arranged in ascending order, i.e.,

$$y_{A_1} \le y_{A_2} \le \cdots \le y_{A_{n_A}}.$$

Let $\{y_m \mid y_m \in B\} = \{y_{B_1}, y_{B_2}, \ldots, y_{B_{n_B}}\}$ be the phenotype values in group $B$. Without loss of generality, assume that these phenotype values are arranged in ascending order, i.e.,

$$y_{B_1} \le y_{B_2} \le \cdots \le y_{B_{n_B}}.$$

We have the overall upper bound on $SS_B(X_iX_j, Y)$:

Theorem 1 (Upper bound of $SS_B(X_iX_j, Y)$)

$$SS_B(X_iX_j, Y) \le SS_B(X_i, Y) + R_1(X_iX_jY) + R_2(X_iX_jY).$$

The notations in the bound can be found in Table 4. The upper bound in Theorem 1 is tight. The tightness of the bound is obvious from the derivation of the upper bound, since there exists some genotype of SNP-pair $(X_iX_j)$ that makes the equality hold.

Table 4.

$$R_1(X_iX_jY) = \frac{\max\{(n_A l_{a_1} - n_{a_1} T_A)^2,\ (n_A u_{a_1} - n_{a_1} T_A)^2\}}{n_{a_1}(n_A - n_{a_1})\, n_A}$$

$$l_{b_1} = \sum_{i=1}^{n_{b_1}} y_{B_i}$$

$$u_{b_1} = \sum_{i=n_B - n_{b_1} + 1}^{n_B} y_{B_i}$$

$$R_2(X_iX_jY) = \frac{\max\{(n_B l_{b_1} - n_{b_1} T_B)^2,\ (n_B u_{b_1} - n_{b_1} T_B)^2\}}{n_{b_1}(n_B - n_{b_1})\, n_B}$$

doi:10.1371/journal.pcbi.1002828.t004

We now discuss how to apply the upper bound in Theorem 1 in detail. The set of all SNP-pairs is partitioned into non-overlapping groups such that the upper bound can be readily applied to each group. For every $X_i$ ($1 \le i \le N$), let $AP(X_i)$ be the set of SNP-pairs

$$AP(X_i) = \{(X_iX_j) \mid i+1 \le j \le N\}.$$

For all SNP-pairs in $AP(X_i)$, $n_A$, $T_A$, $n_B$, $T_B$ and $SS_B(X_i, Y)$ are constants. Moreover, $l_{a_1}$, $u_{a_1}$ are determined by $n_{a_1}$, and $l_{b_1}$, $u_{b_1}$ are determined by $n_{b_1}$. Therefore, in the upper bound, $n_{a_1}$ and $n_{b_1}$ are the only variables that depend on $X_j$ and may vary for different SNP-pairs $(X_iX_j)$ in $AP(X_i)$.

Note that $n_{a_1}$ is the number of 1s in $X_j$ when $X_i$ takes value 1, and $n_{b_1}$ is the number of 1s in $X_j$ when $X_i$ takes value 0. It is easy to prove that switching $n_{a_1}$ and $n_{a_2}$ does not change the F-statistic value and the correctness of the upper bound. This is also true if we switch $n_{b_1}$ and $n_{b_2}$. Therefore, without loss of generality, we can always assume that $n_{a_1}$ is the smaller one between the number of 1s and number of 0s in $X_j$ when $X_i$ takes value 1, and $n_{b_1}$ is the smaller one between the number of 1s and number of 0s in $X_j$ when $X_i$ takes value 0.

If there are $m$ 1s and $(M - m)$ 0s in $X_i$, then for any $(X_iX_j) \in AP(X_i)$, the possible values that $n_{a_1}$ can take are $\{0, 1, 2, \ldots, \lfloor m/2 \rfloor\}$. The possible values that $n_{b_1}$ can take are $\{0, 1, 2, \ldots, \lfloor (M-m)/2 \rfloor\}$.

To efficiently retrieve the candidates, the SNP-pairs $(X_iX_j)$ in $AP(X_i)$ are grouped by their $(n_{a_1}, n_{b_1})$ values and indexed in a 2D array, referred to as $Array(X_i)$.

Suppose that there are 32 individuals, and the genotype of $X_i$ consists of half 0s and half 1s. Thus for the SNP-pairs in $AP(X_i)$, the possible values of $n_{a_1}$ and $n_{b_1}$ are $\{0, 1, 2, \ldots, 8\}$. Figure 1 shows the $9 \times 9$ array, $Array(X_i)$, whose entries represent the possible values of $(n_{a_1}, n_{b_1})$ for the SNP-pairs $(X_iX_j) \in AP(X_i)$. The entries in the same column have the same $n_{a_1}$ value. The entries in the same row have the same $n_{b_1}$ value. The $n_{a_1}$ value of each column is noted beneath each column. The $n_{b_1}$ value of each row is noted to the left of each row. Each entry of the array is a pointer to the SNP-pairs $(X_iX_j) \in AP(X_i)$ having the corresponding $(n_{a_1}, n_{b_1})$ values.

For any SNP $X_i$, the maximum number of the entries in $Array(X_i)$ is $(\lceil M/4 \rceil + 1)^2$. The proof of this property is straightforward and omitted here. In order to find candidate SNP-pairs, we scan all entries in $Array(X_i)$ to calculate their upper bounds. Since the SNP-pairs indexed by the same entry share the same $(n_{a_1}, n_{b_1})$ value, they have the same upper bound. In this way, we can calculate the upper bound for a group of SNP-pairs together. Note that for typical genome-wide association studies, the number of individuals $M$ is much smaller than the number of SNPs $N$. Therefore, the additional cost for accessing $Array(X_i)$ is minimal compared to performing ANOVA tests for all pairs $(X_iX_j) \in AP(X_i)$.

For multiple tests, the permutation procedure is often used in genetic analysis for controlling the family-wise error rate. For genome-wide association study, permutation is less commonly used because it often entails prohibitively long computation times. Our FastANOVA algorithm makes the permutation procedure feasible in genome-wide association study.

Let $Y' = \{Y_1, Y_2, \ldots, Y_K\}$ be the $K$ permutations of the phenotype $Y$. Following the idea discussed above, the upper bound in Theorem 1 can be easily incorporated in the algorithm to handle the permutations. For every SNP $X_i$, the indexing structure $Array(X_i)$ is independent of the permuted phenotypes in $Y'$. The correctness of this property relies on the fact that, for any $(X_iX_j) \in AP(X_i)$, $n_{a_1}$ and $n_{b_1}$ only depend on the genotype of the SNP-pair and thus remain constant for different phenotype permutations. Therefore, for each $X_i$, once we build $Array(X_i)$, it can be reused in all permutations.

3.2 The FastChi Algorithm

As our initial attempt to develop scalable algorithms for genome-wide association study, FastANOVA is specifically designed for the ANOVA test on quantitative phenotypes. Another category of phenotypes is generated in case-control study, where the phenotypes are binary variables representing disease/non-disease individuals. Chi-square test is one of the most commonly used statistics in binary phenotype association
study. We can extend the principles in FastANOVA for efficient two-locus chi-square test. The general idea of FastChi is similar to that of FastANOVA, i.e., reformulating the chi-square test statistic to establish an upper bound of the two-locus chi-square test, and indexing the SNP-pairs according to their genotypes in order to effectively prune the search space and reuse redundant computations. Here we briefly introduce the FastChi algorithm.

For SNP $X_i$, we represent the chi-square test value of $X_i$ and the binary phenotype $Y$ as $\chi^2(X_i, Y)$. For any SNP-pair $X_i$ and $X_j$, we use $\chi^2(X_iX_j, Y)$ to represent the chi-square test value for the combined effect of $(X_iX_j)$ with $Y$. Let $A$, $B$, $C$, $D$ represent the following events respectively: $Y = 0 \wedge X_i = 0$; $Y = 0 \wedge X_i = 1$; $Y = 1 \wedge X_i = 0$; $Y = 1 \wedge X_i = 1$. Let $O_{event}$ denote the observed value of an event. $T_1$, $T_2$, $S_1$, $S_2$, $R_1$, and $R_2$ represent the formulas shown in Table 5. We have the upper bound of $\chi^2(X_iX_j, Y)$ stated in Theorem 2.

Theorem 2 (Upper bound of $\chi^2(X_iX_j, Y)$)

$$\chi^2(X_iX_j, Y) \le \chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2.$$

Table 5. Notations used in the derivation of the upper bound for two-locus Chi-square test.

$$T_1 = \frac{M^2}{(O_A + O_B)(O_A + O_C)(O_C + O_D)}$$

$$S_1 = \max\{O_A^2,\ O_C^2\}$$

$$R_1 = \min\left\{ \left.\frac{O_{X_j=1}}{O_{X_j=0}}\right|_{X_i=0},\ \left.\frac{O_{X_j=0}}{O_{X_j=1}}\right|_{X_i=0} \right\}$$

$$T_2 = \frac{M^2}{(O_A + O_B)(O_B + O_D)(O_C + O_D)}$$

$$S_2 = \max\{O_B^2,\ O_D^2\}$$

$$R_2 = \min\left\{ \left.\frac{O_{X_j=1}}{O_{X_j=0}}\right|_{X_i=1},\ \left.\frac{O_{X_j=0}}{O_{X_j=1}}\right|_{X_i=1} \right\}$$

doi:10.1371/journal.pcbi.1002828.t005

For given phenotype $Y$ and SNP $X_i$, $\chi^2(X_i, Y)$, $T_1$, $S_1$, $T_2$, and $S_2$ are constants. $R_1$ and $R_2$ are the only variables that depend on $X_j$ and may vary for different SNP-pairs $(X_iX_j) \in AP(X_i)$. (Recall that $AP(X_i) = \{(X_iX_j) \mid i+1 \le j \le N\}$.) Thus for a given $X_i$, we can treat the equation $\chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2 = \theta$ as a straight line in the 2-D space of $R_1$ and $R_2$. The ones whose $(R_1(X_iX_j), R_2(X_iX_j))$ values fall below the line can be pruned without any further test.

Suppose that there are 32 individuals, and $X_i$ contains half 0s and half 1s. For the SNP-pairs in $AP(X_i)$, the possible values of $R_1$ (and $R_2$) are $\{\frac{0}{16}, \frac{1}{15}, \frac{2}{14}, \frac{3}{13}, \frac{4}{12}, \frac{5}{11}, \frac{6}{10}, \frac{7}{9}, \frac{8}{8}\}$. Figure 2 shows the 2-D space of $R_1$ and $R_2$. The blue stars represent the values that $(R_1, R_2)$ can take. The line $\chi^2(X_i, Y) + T_1 S_1 R_1 + T_2 S_2 R_2 = \theta$ is plotted in the figure. Only the SNP-pairs whose $(R_1, R_2)$ values are in the shaded region are subject to the two-locus Chi-square test.

Similar to FastANOVA, in FastChi, we can index the SNP-pairs in $AP(X_i)$ according to their genotype relationships, i.e., by the values of $(R_1, R_2)$. Experimental results demonstrate that FastChi is an order of magnitude faster than the brute force alternative.

3.3 The COE Algorithm

Both FastANOVA and FastChi rework the formula of the ANOVA test and the Chi-square test to estimate an upper bound of the test value for SNP pairs. These upper bounds are used to identify candidate SNP pairs that may have a strong epistatic effect. Repetitive computation in a permutation test is also identified and performed only once; the results are stored for use by all permutations. These two strategies lead to substantial speedup, especially for large permutation tests, without compromising the accuracy of the test. These approaches are guaranteed to find the optimal solutions. However, a common drawback of these methods is that they are designed for specific tests, i.e., chi-square test and ANOVA test. The upper bounds used in these methods do not work for other statistical tests, which are
also routinely used by researchers. In addition, new statistics for epistasis detection are continually emerging in the literature. Therefore, it is desirable to develop a general model that supports a variety of statistical tests.

The COE algorithm takes advantage of convex optimization. It can be shown that a wide range of statistical tests, such as the chi-square test, the likelihood ratio test (also known as G-test), and entropy-based tests are all convex functions of observed frequencies in contingency tables. Since the maximum value of a convex function is attained at the vertices of its convex domain, by constraining on the observed frequencies in the contingency tables, we can determine the domain of the convex function and get its maximum value. This maximum value is used as the upper bound on the test statistics to filter out insignificant SNP-pairs. COE is applicable to all tests that are convex.

3.4 The TEAM Algorithm

The methods we have discussed so far provide promising alternatives for GWAS. However, there are two major drawbacks that limit their applicability. First, they are designed for relatively small sample sizes and only consider homozygous markers (i.e., each SNP can be represented as a $\{0,1\}$ binary variable). In human study, the sample size is usually large and most SNPs contain heterozygous genotypes and are coded using $\{0,1,2\}$. These make previous methods intractable. Second, although the family-wise error rate (FWER) and the false discovery rate (FDR) are both widely used for error controlling, previous methods are designed only to control the FWER. From a computational point of view, the difference between FWER and FDR controlling is that, to estimate the FWER, for each permutation, only the maximum two-locus test value is needed. To estimate the FDR, on the other hand, for each permutation, all two-locus test values must be computed.

To address these limitations, TEAM is proposed for efficient epistasis detection in human GWAS. TEAM has several advantages over previous methods. It supports both homozygous and heterozygous data. By exhaustively computing all two-locus test values in the permutation test, it enables both FWER and FDR controlling. It is applicable to all statistics based on the contingency table. Previous methods are either designed for specific tests or require the test statistics to satisfy certain properties. Experimental results demonstrate that TEAM is more efficient than existing methods for large sample studies.

TEAM incorporates the permutation test for proper error controlling. The key idea is to incrementally update the contingency tables of two-locus tests. We show that only four of the eighteen observed frequencies in the contingency table need to be updated to compute the test value. In the algorithm, we build a minimum spanning tree [19] on the SNPs. The nodes of the tree are SNPs. Each edge represents the genotype difference between the two connected SNPs. This tree structure can be utilized to speed up the updating process for the contingency tables. A majority of the individuals are pruned and only a small portion are scanned to update the contingency tables. This is advantageous in human study, which usually involves
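The remark in Section 3.4 that FWER estimation needs only the maximum two-locus test value of each permutation can be sketched in Python as follows. This is a brute-force illustration under stated assumptions, not the TEAM algorithm: it recomputes every contingency table from scratch instead of updating it incrementally along a minimum spanning tree. The chi-square statistic, the function names, the number of permutations and the random toy data are all choices made for this example.

```python
import random
from itertools import combinations


def two_locus_chi2(xi, xj, y):
    """Chi-square statistic of the 2 x 9 table: binary phenotype by joint genotype."""
    m = len(y)
    observed = [[0] * 9 for _ in range(2)]  # the eighteen observed frequencies
    for a, b, p in zip(xi, xj, y):
        observed[p][3 * a + b] += 1  # genotypes are coded 0, 1 or 2
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][c] + observed[1][c] for c in range(9)]
    stat = 0.0
    for p in range(2):
        for c in range(9):
            expected = row_totals[p] * col_totals[c] / m
            if expected > 0:  # cells with an empty margin contribute nothing
                stat += (observed[p][c] - expected) ** 2 / expected
    return stat


def fwer_threshold(snps, y, n_perm=100, alpha=0.05, seed=0):
    """Estimate a critical value for FWER control from phenotype permutations."""
    rng = random.Random(seed)
    pairs = list(combinations(range(len(snps)), 2))
    maxima = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any genotype-phenotype association
        # Only the largest two-locus statistic of each permutation is kept.
        maxima.append(max(two_locus_chi2(snps[i], snps[j], y_perm)
                          for i, j in pairs))
    maxima.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return maxima[index], maxima


# Toy data (made up): 6 SNPs coded {0, 1, 2} in 40 individuals.
rng = random.Random(1)
snps = [[rng.randint(0, 2) for _ in range(40)] for _ in range(6)]
y = [rng.randint(0, 1) for _ in range(40)]
threshold, maxima = fwer_threshold(snps, y)
```

An observed SNP-pair whose statistic exceeds `threshold` would be declared significant at the chosen FWER level. Estimating the FDR instead would require keeping every two-locus value from every permutation rather than one maximum per permutation, which is the computational difference the text describes.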
- Cantor RM, Lange K, Sinsheimer JS (2008) Prioritizing GWAS results: a review of statistical methods and recommendations for their application. Nat Rev Genet 9(11): 855–867.
- Cordell HJ (2009) Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10(6): 392–404.
- Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, et al. (2009) Finding the missing heritability of complex diseases. Nature 461(7265): 747–753.
- Moore JH, Williams SM (2009) Epistasis and its implications for personal genetics. Am J Hum Genet 85(3): 309–320.
- Phillips PC (2010) Epistasis - the essential role of gene interactions in the structure and evolution of genetic systems. Am J Hum Genet 86(1): 6–22.
- Wang K, Li M, Hakonarson H (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet 11: 843–854.
Supporting Information
Text S1 Answers to Exercises
(PDF)
References
1. Churchill GA, Airey DC, Allayee H, Angel JM, Attie AD, et al. (2004) The collaborative cross, a community resource for the genetic analysis of complex traits. Nat Genet 36: 1133–1137.
2. The International HapMap Consortium (2003) The international hapmap project. Nature 426(6968): 789–796.
3. Saxena R, Voight B, Lyssenko V, Burtt N, de Bakker P, et al. (2007) Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science 316: 1331–1336.
4. Scuteri A, Sanna S, Chen W, Uda M, Albai G, et al. (2007) Genome-wide association scan shows genetic variants in the FTO gene are associated with obesity-related traits. PLoS Genet 3(7): e115. doi:10.1371/journal.pgen.0030115
5. The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
6. Weedon M, Lettre G, Freathy R, Lindgren C, Voight B, et al. (2007) A common variant of HMGA2 is associated with adult and childhood height in the general population. Nat Genet 39: 1245–1250.
7. Hirschhorn J, Daly M (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95–108.
8. McCarthy M, Abecasis G, Cardon L, Goldstein D, Little J, et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9(5): 356–369.
9. Thorisson GA, Smith AV, Krishnan L, Stein LD (2005) The international hapmap project web site. Genome Res 15: 1592.
10. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
11. Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7(10): 781–791.
12. Samani NJ, Erdmann J, Hall AS, Hengstenberg C, Mangino M, et al. (2007) Genomewide association analysis of coronary artery disease. N Engl J Med 357: 443–453.
13. Westfall PH, Young SS (1993) Resampling-based multiple testing. Wiley: New York.
14. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 57(1): 289–300.
15. Zhang X, Zou F, Wang W (2008) FastANOVA: an efficient algorithm for genome-wide association study. KDD 2008: 821–829.
16. Zhang X, Zou F, Wang W (2009) FastChi: an efficient algorithm for analyzing gene-gene interactions. PSB 2009: 528–539.
17. Zhang X, Pan F, Xie Y, Zou F, Wang W (2010) COE: a general approach for efficient genome-wide two-locus epistatic test in disease association study. J Comput Biol 17(3): 401–415.
18. Zhang X, Huang S, Zou F, Wang W (2010) TEAM: Efficient two-locus epistasis tests in human genome-wide association study. Bioinformatics 26(12): 217–227.
19. Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to algorithms. MIT Press and McGraw-Hill.
20. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69: 138–147.
21. Cordell HJ (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum Mol Genet 11: 2463–2468.
22. Wason J, Dudbridge F (2010) Comparison of multimarker logistic regression models, with application to a genomewide scan of schizophrenia. BMC Genet 11: 80.
23. Yang C, Wan X, Yang Q, Xue H, Tang N, et al. (2011) A hidden two-locus disease association pattern in genome-wide association studies. BMC Bioinformatics 12: 156.
24. Hoh J, Ott J (2003) Mathematical multi-locus approaches to localizing complex human trait genes. Nat Rev Genet 4: 701–709.
25. Musani S, Shriner D, Liu N, Feng R, Coffey C, et al. (2007) Detection of gene × gene interactions in genome-wide association studies of human population data. Hum Hered 63(2): 67–84.
Abstract: Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Bush WS, Moore JH (2012) Chapter 11: Genome-Wide Association Studies. PLoS Comput Biol 8(12): e1002822. doi:10.1371/journal.pcbi.1002822
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Bush, Moore. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by NIH grants ROI-LM010098, ROI-LM009012, ROI-AI59694, RO1-EY022300, and RO1-LM011360. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: william.s.bush@vanderbilt.edu

1. Important Questions in Human Genetics

A central goal of human genetics is to identify genetic risk factors for common, complex diseases such as schizophrenia and type II diabetes, and for rare Mendelian diseases such as cystic fibrosis and sickle cell anemia. There are many different technologies, study designs and analytical tools for identifying genetic risk factors. We will focus here on the genome-wide association study or GWAS that measures and analyzes DNA sequence variations from across the human genome in an effort to identify genetic risk factors for diseases that are common in the population. The ultimate goal of GWAS is to use genetic risk factors to make predictions about who is at risk and to identify the biological underpinnings of disease susceptibility for developing new prevention and treatment strategies. One of the early successes of GWAS was the identification of the Complement Factor H gene as a major risk factor for age-related macular degeneration or AMD [1–3]. Not only were DNA sequence variations in this gene associated with AMD but the biological basis for the effect was demonstrated. Understanding the biological basis of genetic effects will play an important role in developing new pharmacologic therapies.

While understanding the complexity of human health and disease is an important objective, it is not the only focus of human genetics. Accordingly, one of the most successful applications of GWAS has been in the area of pharmacology. Pharmacogenetics has the goal of identifying DNA sequence variations that are associated with drug metabolism and efficacy as well as adverse effects. For example, warfarin is a blood-thinning drug that helps prevent blood clots in patients. Determining the appropriate dose for each patient is important and believed to be partly controlled by genes. A recent GWAS revealed DNA sequence variations in several genes that have a large influence on warfarin dosing [4]. These results, and more recent validation studies, have led to genetic tests for warfarin dosing that can be used in a clinical setting. This type of genetic test has given rise to a new field called personalized medicine that aims to tailor healthcare to individual patients based on their genetic background and other biological features. The widespread availability of low-cost technology for measuring an individual's genetic background has been harnessed by businesses that are now marketing genetic testing directly to the consumer. Genome-wide association studies, for better or for worse, have ushered in the exciting era of personalized medicine and personal genetic testing. The goal of this chapter is to introduce and review GWAS technology, study design and analytical strategies as an important example of translational bioinformatics. We focus here on the application of GWAS to common diseases that have a complex multifactorial etiology.

2. Concepts Underlying the Study Design

2.1 Single Nucleotide Polymorphisms

The modern unit of genetic variation is the single nucleotide polymorphism or SNP. SNPs are single base-pair changes in the DNA sequence that occur with high frequency in the human genome [5]. For the purposes of genetic studies, SNPs are typically used as markers of a genomic region, with the large majority of them having a minimal impact on biological systems. SNPs can have functional consequences, however, causing amino acid changes, changes to mRNA transcript stability, and changes to transcription factor binding affinity [6]. SNPs are by far the most abundant form of genetic variation in the human genome.

SNPs are notably a type of common genetic variation; many SNPs are present in a large proportion of human populations [7]. SNPs typically have two alleles, meaning within a population there are two commonly occurring base-pair possibilities for a SNP location. The frequency of a SNP is given in terms of the minor allele frequency, i.e., the frequency of the less common allele. For example, a SNP with a minor allele (G) frequency of 0.40 implies that the G allele accounts for 40% of the copies of that SNP in a population, versus the more common allele (the major allele), which accounts for the remaining 60%.
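The minor allele frequency described in Section 2.1 can be computed directly from genotype calls. The Python sketch below is illustrative only: it assumes each genotype is given as a two-character string such as "AG" (one character per chromosome copy), and the function names and toy genotypes are made up for this example.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the frequency of each allele among all chromosome copies sampled."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # two allele copies per individual
    return {allele: n / total for allele, n in counts.items()}


def minor_allele(genotypes):
    """Return (allele, frequency) for the less common allele of a biallelic SNP."""
    freqs = allele_frequencies(genotypes)
    if len(freqs) != 2:
        raise ValueError("expected exactly two alleles at this SNP")
    allele = min(freqs, key=freqs.get)
    return allele, freqs[allele]


# Five individuals genotyped at one SNP with alleles A and G.
genotypes = ["AA", "AG", "GG", "AA", "AG"]
allele, maf = minor_allele(genotypes)  # G is carried on 4 of the 10 copies
```

Counting allele copies rather than individuals is what makes the result an allele frequency: here three of the five individuals carry at least one G, yet the minor allele frequency is 4/10 = 0.40.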
across the genome and to characterize correlations among variants.

The International HapMap Project used a variety of sequencing techniques to discover and catalog SNPs in European descent populations, the Yoruba population of African origin, Han Chinese individuals from Beijing, and Japanese individuals from Tokyo [15,16]. The project has since been expanded to include 11 human populations, with genotypes for 1.6 million SNPs [7]. HapMap genotype data allowed the examination of linkage disequilibrium.

3.2 Linkage Disequilibrium

Linkage disequilibrium (LD) is a property of SNPs on a contiguous stretch of genomic sequence that describes the degree to which an allele of one SNP is inherited or correlated with an allele of another SNP within a population. The term linkage disequilibrium was coined by population geneticists in an attempt to mathematically describe changes in genetic variation within a population over time. It is related to the concept of chromosomal linkage, where two markers on a chromosome remain physically joined on a chromosome through generations of a family. In figure 2, two founder chromosomes are shown (one in blue and one in orange). Recombination events within a family from generation to generation break apart chromosomal segments. This effect is amplified through generations, and in a population of fixed size undergoing random mating, repeated random recombination events will break apart segments of contiguous chromosome (containing linked alleles) until eventually all alleles in the population are in linkage equilibrium or are independent. Thus, linkage between markers on a population scale is referred to as linkage disequilibrium.

The rate of LD decay is dependent on multiple factors, including the population size, the number of founding chromosomes in the population, and the number of generations for which the population has existed. As such, different human subpopulations have different degrees and patterns of LD. African-descent populations are the most ancestral and have smaller regions of LD due to the accumulation of more recombination events in that group. European-descent and Asian-descent populations were created by founder events (a sampling of chromosomes from the African population), which altered the number of founding chromosomes, the population size, and the generational age of the population. These populations on average have larger regions of LD than African-descent groups.

Many measures of LD have been proposed [17], though all are ultimately related to the difference between the observed frequency of co-occurrence for two alleles (i.e. a two-marker haplotype) and the frequency expected if the two markers are independent. The two commonly used measures of linkage disequilibrium are $D'$ and $r^2$ [15,17] shown in equations 1 and 2. In these equations, $p_{12}$ is the frequency of the ab haplotype, $p_{1\cdot}$ is
the frequency of the a allele, and $p_{2\cdot}$ is the frequency of the b allele.

$$D' = \begin{cases} \dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_b,\ p_a p_B)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} > 0 \\[2ex] \dfrac{p_{AB}\,p_{ab} - p_{Ab}\,p_{aB}}{\min(p_A p_B,\ p_a p_b)} & \text{if } p_{AB}\,p_{ab} - p_{Ab}\,p_{aB} < 0 \end{cases} \qquad (1)$$

$$r^2 = \frac{(p_{AB}\,p_{ab} - p_{Ab}\,p_{aB})^2}{p_A\, p_B\, p_a\, p_b} \qquad (2)$$

$D'$ is a population genetics measure that is related to recombination events between markers and is scaled between 0 and 1. A $D'$ value of 0 indicates complete linkage equilibrium, which implies frequent recombination between the two markers and statistical independence under principles of Hardy-Weinberg equilibrium. A $D'$ of 1 indicates complete LD, indicating no recombination between the two markers within the population. For the purposes of genetic analysis, LD is generally reported in terms of $r^2$, a statistical measure of correlation. High $r^2$ values indicate that two SNPs convey similar information, as one allele of the first SNP is often observed with one allele of the second SNP, so only one of the two SNPs needs to be genotyped to capture the allelic variation. There are dependencies between these two statistics; $r^2$ is sensitive to the allele frequencies of the two markers, and can only be high in regions of high $D'$.

One often forgotten issue associated with LD measures is that current technology does not allow direct measurement of haplotype frequencies from a sample because each SNP is genotyped independently and the phase or chromosome of origin for each allele is unknown. Many well-developed and documented methods for inferring haplotype phase and estimating the subsequent two-marker haplotype frequencies exist, and generally lead to reasonable results [18].

SNPs that are selected specifically to capture the variation at nearby sites in the genome are called tag SNPs because alleles for these SNPs tag the surrounding stretch of LD. As noted before, patterns of LD are population specific and as such, tag SNPs selected for one population may not work well for a different population. LD is exploited to optimize genetic studies, preventing genotyping SNPs that provide redundant information. Based on analysis of data from the HapMap project, >80% of commonly occurring SNPs in European descent populations can be captured using a subset of 500,000 to one million SNPs scattered across the genome [19].

3.3 Indirect Association

The presence of LD creates two possible positive outcomes from a genetic association study. In the first outcome, the SNP influencing a biological system that ultimately leads to the phenotype is directly genotyped in the study and found to be statistically associated with the trait. This is referred to as a direct association, and the genotyped SNP is sometimes referred to as the functional SNP. The second possibility is that the influential SNP is not directly typed, but instead a tag SNP in high LD with the influential SNP is typed and statistically associated with the phenotype (figure 3). This is referred to as an indirect association [10]. Because of these two possibilities, a significant SNP association from a GWAS should not be assumed to be the causal variant and may require
additional studies to map the precise location of the influential SNP.
Conceptually, the end result of GWAS under the common disease/common variant hypothesis is that a panel of 500,000 to one million markers will identify common SNPs that are associated with common phenotypes. To conduct such a study practically requires a genotyping technology that can accurately capture the alleles of 500,000 to one million SNPs for each individual in a study in a cost-effective manner.

4. Genotyping Technologies

Genome-wide association studies were made possible by the availability of chip-based microarray technology for assaying one million or more SNPs. Two primary platforms have been used for most GWAS. These include products from Illumina (San Diego, CA) and Affymetrix (Santa Clara, CA). These two competing technologies have been recently reviewed [20] and offer different approaches to measure SNP variation. For example, the Affymetrix platform prints short DNA sequences as a spot on the chip that recognizes a specific SNP allele. Alleles (i.e. nucleotides) are detected by differential hybridization of the sample DNA. Illumina on the other hand uses a bead-based technology with slightly longer DNA sequences to detect alleles. The Illumina chips are more expensive to make but provide better specificity.
Aside from the technology, another important consideration is the SNPs that each platform has selected for assay. This can be important depending on the specific human population being studied. For example, it is important to use a chip that has more SNPs with better overall genomic coverage for a study of Africans than Europeans. This is because African genomes have had more time to recombine and therefore have less LD between alleles at different SNPs. More SNPs are needed to capture the variation across the African genome.
It is important to note that the technology for measuring genomic variation is changing rapidly. Chip-based genotyping platforms such as those briefly mentioned above will likely be replaced over the next few years with inexpensive new technologies for sequencing the entire genome. These next-generation sequencing methods will provide all the DNA sequence variation in the genome. It is time now to retool for this new onslaught of data.

5. Study Design

Regardless of assumptions about the genetic model of a trait, or the technology used to assess genetic variation, no genetic study will have meaningful results without a thoughtful approach to characterize the phenotype of interest. When embarking on a genetic study, the initial focus should be on identifying precisely what quantity or trait genetic variation influences.

5.1 Case Control versus Quantitative Designs
There are two primary classes of phenotypes: categorical (often binary case/control) or quantitative. From the statistical perspective, quantitative traits are preferred because they improve power to detect a genetic effect, and often have a more interpretable outcome. For some disease traits of interest, quantitative disease risk factors have already been identified. High-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol levels are strong predictors of heart disease, and so genetic studies of heart disease outcomes can be conducted by examining these levels as a quantitative trait. Assays for HDL and LDL levels, being already useful for clinical practice, are precise and ubiquitous measurements that are easy to obtain. Genetic variants that influence these levels have a clear interpretation: for example, a unit change in LDL level per allele or by genotype class. With an easily measurable ubiquitous quantitative trait, GWAS of blood lipids have been conducted in numerous cohort studies. Their results were also easily combined to conduct an extremely well-powered massive meta-analysis, which revealed 95 loci associated with lipid traits in more than 100,000 people [21]. Here, HDL and LDL may be the primary traits of interest or can be considered intermediate quantitative traits or endophenotypes for cardiovascular disease.
Other disease traits do not have well-established quantitative measures. In these circumstances, individuals are usually classified as either affected or unaffected, a binary categorical variable. Consider the vast difference in measurement error associated with classifying individuals as either case or control versus precisely measuring a quantitative trait. For example, multiple sclerosis is a complex clinical phenotype that is often diagnosed over a long period of time by ruling out other possible conditions. However, despite the loose classification of case and control, GWAS of multiple sclerosis have been enormously successful, implicating more than 10 new genes for the disorder [22]. So while quantitative outcomes are preferred, they are not required for a successful study.

5.2 Standardized Phenotype Criteria
A major component of the success with multiple sclerosis and other well-conducted case/control studies is the definition of rigorous phenotype criteria, usually presented as a rule list based on clinical variables. Multiple sclerosis studies often use the McDonald criteria for establishing case/control status and defining clinical subtypes [23]. Standardized methods like the McDonald criteria establish a concise, evidence-based approach that can be uniformly applied by multiple diagnosing clinicians to ensure that consistent pheno-
Further Reading
• 1000 Genomes Project Consortium, Altshuler D, Durbin RM, Abecasis GR, Bentley DR, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073.
• Haines JL, Pericak-Vance MA (2006) Genetic analysis of complex disease. New York: Wiley-Liss. 512 p.
• Hartl DL, Clark AG (2006) Principles of population genetics. Sunderland (Massachusetts): Sinauer Associates, Inc. 545 p.
• NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, Boehnke M, Boerwinkle E, et al. (2007) Replicating genotype-phenotype associations. Nature 447: 655-660.
GWAS: genome-wide association study; a genetic study design that attempts to identify commonly occurring genetic variants that
contribute to disease risk
Personalized Medicine: the science of providing health care informed by individual characteristics, such as genetic variation
SNP: single nucleotide polymorphism; a single base-pair change in the DNA sequence
Linkage Analysis: the attempt to statistically relate transmission of an allele within families to inheritance of a disease
Common disease/Common variant hypothesis: The hypothesis that commonly occurring diseases in a population are caused in part
by genetic variation that is common to that population
Linkage disequilibrium: the degree to which an allele of one SNP is observed with an allele of another SNP within a population
Direct association: the statistical association of a functional or influential allele with a disease
Indirect association: the statistical association of an allele to disease that is in strong linkage disequilibrium with the allele that is
functional or influential for disease
Population stratification: the false association of an allele to disease due to both differences in population frequency of the allele and
differences in ethnic prevalence or sampling of affected individuals
False positive: from statistical hypothesis testing, the rejection of a null hypothesis when the null hypothesis is true
Genome-wide significance: a false-positive rate threshold established by empirical estimation of the independent genomic regions
present in a population
Replication: the observation of a statistical association in a second, independent dataset (often the same population as the first
association)
Generalization: the replication of a statistical association in a second population
Imputation: the estimation of unknown alleles based on the observation of nearby alleles in high linkage disequilibrium
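As a minimal sketch of the two linkage disequilibrium measures given in equations 1 and 2, the Python function below computes D′ and r² from the four two-marker haplotype frequencies. The function name `ld_measures`, its argument names, and the example frequencies are illustrative assumptions, not part of the chapter. The absolute value of D is taken so that D′ falls between 0 and 1, as the text describes. In practice, haplotype frequencies are not observed directly and would first have to be estimated from unphased genotypes [18].

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D_prime, r2) for two biallelic SNPs.

    Inputs are the frequencies of the four two-marker haplotypes
    (hypothetical argument names); they must sum to 1.
    """
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    # Marginal allele frequencies at each SNP.
    p_A = p_AB + p_Ab
    p_a = p_aB + p_ab
    p_B = p_AB + p_aB
    p_b = p_Ab + p_ab
    if min(p_A, p_a, p_B, p_b) <= 0:
        raise ValueError("both SNPs must be polymorphic")

    # Raw disequilibrium: the numerator shared by equations 1 and 2.
    d = p_AB * p_ab - p_Ab * p_aB

    # Equation 1: the normalizing term depends on the sign of D.
    if d > 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    # Assumption: report the magnitude so that D' lies in [0, 1].
    d_prime = abs(d) / d_max

    # Equation 2: squared correlation between the two SNPs.
    r2 = d * d / (p_A * p_B * p_a * p_b)
    return d, d_prime, r2


# One haplotype (Ab) is absent, so D' is 1, but the allele
# frequencies differ between the SNPs, so r2 is well below 1.
d, d_prime, r2 = ld_measures(0.5, 0.0, 0.3, 0.2)
print(round(d_prime, 3), round(r2, 3))  # prints: 1.0 0.25
```

The example illustrates the dependency noted in the text: r² is sensitive to allele frequencies and can be low even when D′ indicates complete LD.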
References
1. Haines JL, Hauser MA, Schmidt S, Scott WK, Olson LM, et al. (2005) Complement factor H variant increases the risk of age-related macular degeneration. Science 308: 419-421. doi: 10.1126/science.1110359
2. Edwards AO, Ritter R, III, Abel KJ, Manning A, Panhuysen C, et al. (2005) Complement factor H polymorphism and age-related macular degeneration. Science 308: 421-424. doi: 10.1126/science.1110189
3. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385-389. doi: 10.1126/science.1109557
4. Cooper GM, Johnson JA, Langaee TY, Feng H, Stanaway IB, et al. (2008) A genome-wide scan for common genetic variants with a large influence on warfarin maintenance dose. Blood 112: 1022-1027. doi: 10.1182/blood-2008-01-134247
5. 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061-1073. doi: 10.1038/nature09534
6. Griffith OL, Montgomery SB, Bernier B, Chu B, Kasaian K, et al. (2008) ORegAnno: an open-access community-driven resource for regulatory annotation. Nucleic Acids Res 36: D107-D113. doi: 10.1093/nar/gkm967
7. Altshuler DM, Gibbs RA, Peltonen L, Altshuler DM, Gibbs RA, et al. (2010) Integrating common and rare genetic variation in diverse human populations. Nature 467: 52-58. doi: 10.1038/nature09298
8. Kerem B, Rommens JM, Buchanan JA, Markiewicz D, et al. (1989) Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073-1080.
9. MacDonald ME, Novelletto A, Lin C, Tagle D, Barnes G, et al. (1992) The Huntington's disease candidate region exhibits many different haplotypes. Nat Genet 1: 99-103. doi: 10.1038/ng0592-99
10. Hirschhorn JN, Daly MJ (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 6: 95-108. doi: 10.1038/nrg1521
11. Corder EH, Saunders AM, Strittmatter WJ, Schmechel DE, Gaskell PC, et al. (1993) Gene dose of apolipoprotein E type 4 allele and the risk of Alzheimer's disease in late onset families. Science 261: 921-923.
12. Altshuler D, Hirschhorn JN, Klannemark M, Lindgren CM, Vohl MC, et al. (2000) The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. Nat Genet 26: 76-80. doi: 10.1038/79216
13. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502-510.
14. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362-9367. doi: 10.1073/pnas.0903103106
15. International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437: 1299-1320. doi: 10.1038/nature04226
16. Ritchie MD, Denny JC, Crawford DC, Ramirez AH, Weiner JB, et al. (2010) Robust replication of genotype-phenotype associations across multiple diseases in an electronic medical record. Am J Hum Genet 86: 560-572. doi: 10.1016/j.ajhg.2010.03.003
17. Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29: 311-322. doi: 10.1006/geno.1995.9003
18. Fallin D, Schork NJ (2000) Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 67: 947-959. doi: 10.1086/303069
19. Li M, Li C, Guan W (2008) Evaluation of coverage variation of SNP chips for genome-wide association studies. Eur J Hum Genet 16: 635-643. doi: 10.1038/sj.ejhg.5202007
20. Distefano JK, Taverna DM (2011) Technological issues and experimental design of gene association studies. Methods Mol Biol 700: 3-16. doi: 10.1007/978-1-61737-954-3_1
21. Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, et al. (2010) Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707-713. doi: 10.1038/nature09270
22. Habek M, Brinar VV, Borovecki F (2010) Genes associated with multiple sclerosis: 15 and counting. Expert Rev Mol Diagn 10: 857-861. doi: 10.1586/erm.10.77
23. Polman CH, Reingold SC, Edan G, Filippi M, Hartung HP, et al. (2005) Diagnostic criteria for multiple sclerosis: 2005 revisions to the McDonald Criteria. Ann Neurol 58: 840-846. doi: 10.1002/ana.20703
24. Chew EY, Kim J, Sperduto RD, Datiles MB, III, Coleman HR, et al. (2010) Evaluation of the age-related eye disease study clinical lens grading system AREDS report No. 31. Ophthalmology 117: 2112-2119. doi: 10.1016/j.ophtha.2010.02.033
25. Denny JC, Ritchie MD, Crawford DC, Schildcrout JS, Ramirez AH, et al. (2010) Identification of genomic predictors of atrioventricular conduction: using electronic medical records as a tool for genome science. Circulation 122: 2016-2021. doi: 10.1161/CIRCULATIONAHA.110.948828
26. Wilke RA, Berg RL, Linneman JG, Peissig P, Starren J, et al. (2010) Quantification of the clinical modifiers impacting high-density lipoprotein cholesterol in the community: Personalized Medicine Research Project. Prev Cardiol 13: 63-68. doi: 10.1111/j.1751-7141.2009.00055.x
27. Kullo IJ, Fan J, Pathak J, Savova GK, Ali Z, et al. (2010) Leveraging informatics for genetic studies: use of the electronic medical record to enable a genome-wide association study of peripheral arterial disease. J Am Med Inform Assoc 17: 568-574. doi: 10.1136/jamia.2010.004366
28. McCarty CA, Wilke RA (2010) Biobanking and pharmacogenomics. Pharmacogenomics 11: 637-641. doi: 10.2217/pgs.10.13
Abstract: Humans are essentially sterile during gestation, but during and after birth, every body surface, including the skin, mouth, and gut, becomes host to an enormous variety of microbes, bacterial, archaeal, fungal, and viral. Under normal circumstances, these microbes help us to digest our food and to maintain our immune systems, but dysfunction of the human microbiota has been linked to conditions ranging from inflammatory bowel disease to antibiotic-resistant infections. Modern high-throughput sequencing and bioinformatic tools provide a powerful means of understanding the contribution of the human microbiome to health and its potential as a target for therapeutic interventions. This chapter will first discuss the historical origins of microbiome studies and methods for determining the ecological diversity of a microbial community. Next, it will introduce shotgun sequencing technologies such as metagenomics and metatranscriptomics, the computational challenges and methods associated with these data, and how they enable microbiome analysis. Finally, it will conclude with examples of the functional genomics of the human microbiome and its influences upon health and disease.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Morgan XC, Huttenhower C (2012) Chapter 12: Human Microbiome Analysis. PLoS Comput Biol 8(12): e1002808. doi:10.1371/journal.pcbi.1002808
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Morgan, Huttenhower. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by the NIH grant 1R01HG005969-01. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: chuttenh@hsph.harvard.edu

1. Introduction

The question of what it means to be human is more often encountered in metaphysics than in bioinformatics, but it is surprisingly relevant when studying the human microbiome. We are born consisting only of our own eukaryotic human cells, but over the first several years of life, our skin surface, oral cavity, and gut are colonized by a tremendous diversity of bacteria, archaea, fungi, and viruses. The community formed by this complement of cells is called the human microbiome; it contains almost ten times as many cells as are in the rest of our bodies and accounts for several pounds of body weight and orders of magnitude more genes than are contained in the human genome [1,2]. Under normal circumstances, these microbes are commensal, helping to digest our food and to maintain our immune systems. Although the human microbiome has long been known to influence human health and disease [1], we have only recently begun to appreciate the breadth of its involvement. This is almost entirely due to the recent ability of high-throughput sequencing to provide an efficient and cost-effective tool for investigating the members of a microbial community and how they change. Thus, dysfunctions of the human microbiota are increasingly being linked to disease ranging from inflammatory bowel disease to diabetes to antibiotic-resistant infection, and the potential of the human microbiome as an early detection biomarker and target for therapeutic intervention is a vibrant area of current research.

2. A Brief History of Microbiome Studies

Historically, members of a microbial community were identified in situ by stains that targeted their physiological characteristics, such as the Gram stain [3]. These could distinguish many broad clades of bacteria but were non-specific at lower taxonomic levels. Thus, microbiology was almost entirely culture-dependent; it was necessary to grow an organism in the lab in order to study it. Specific microbial species were detected by plating samples on specialized media selective for the growth of that organism, or they were identified by features such as the morphological characteristics of colonies, their growth on different media, and metabolic production or consumption. This approach limited the range of organisms that could be detected to those that would actively grow in laboratory culture, and it led to the close study of easily-grown, now-familiar model organisms such as Escherichia coli. However, E. coli as a taxonomic unit accounts for at most 5% of the microbes occupying the typical human gut [2]. The vast majority of microbial species have never been grown in the laboratory, and options for studying and quantifying the uncultured were severely limited until the development of DNA-based culture-independent methods in the 1980s [4].
Culture-independent techniques, which analyze the DNA extracted directly from a sample rather than from individually cultured microbes, allow us to investigate several aspects of microbial communities (Figure 1). These include taxonomic diversity, such as how many of which microbes are present in a community, and functional metagenomics, which attempts to describe which biological tasks the members of a community can or do carry out. The earliest DNA-based methods probed extracted community DNA for genes of interest by hybridization, or amplified specifically-targeted genes by PCR prior to sequencing. These studies were typically able to describe diversity at
appropriately tuned level of specificity. Bioinformaticians studying 16S sequences must choose whether to analyze a collection of taxonomically-binned microbiomes as a set of abundance histograms, or as a set of binary presence/absence vectors. However, either representation can be used as input to decomposition methods such as Principal Components Analysis or Canonical Correlation Analysis [23] to determine which OTUs represent the most significant sources of population variance and/or correlate with community metadata such as temperature, pH, or clinical features [24,25].

3.3 Measuring Population Diversity
An important concept when dealing with OTUs or other taxonomic bins is that of population diversity, the number of distinct bins in a sample or in the originating population. This is of critical importance in human health, since a number of disease conditions have been shown to correlate with decreased microbiome diversity, presumably as one or a few microbes overgrow during immune or nutrient imbalance in a process not unlike an algal bloom [26]. Intriguingly, recent results have also shown that essentially no bacterial clades are widely and consistently shared among the human microbiome [2]. Many organisms are abundant in some individuals, and many organisms are prevalent among most individuals, but none are universal. Although they can vary over time and share some similarity with some individuals, our intestinal contents appear to be highly personalized when considered in terms of microbial presence, absence, and abundance.
Two mathematically well-defined questions arise when quantifying population diversity (Figure 2): given that x bins have been observed in a sample of size y from a population of size z, how many bins are expected to exist in the population; or, given that x bins exist in a population of size z, how big must the sample size y be to observe all of them at least once? In other words, "If I've sequenced some amount of diversity, how much more exists in my microbiome?" and, "How much do I need to sequence to completely characterize my microbiome?" The latter is known as the Coupon Collector's Problem, as identical questions can be asked if a cereal manufacturer has randomly hidden one of several different possible prize coupons in each box of cereal [27]. Within a community, several estimators including the Chao1 [28], Abundance-based Coverage Estimator (ACE) [29], and Jackknife [30] measures exist for calculating alpha diversity, the number (richness) and distribution (evenness) of taxa expected within a single population. These give rise to figures known as collector's or rarefaction curves, since increasing numbers of sequenced taxa allow increasingly precise estimates of total population diversity [31]. Additionally, when comparing multiple popula-
Figure 2. Ecological representations of microbial communities: collector's curves, alpha, and beta diversity. These examples describe the A) sequence counts and B) relative abundances of six taxa (A, B, C, D, E, and F) detected in three samples. C) A collector's curve, typically generated using a richness estimator such as Chao1 [28] or ACE [29], approximates the relationship between the number of sequences drawn from each sample and the number of taxa expected to be present based on detected abundances. D) Alpha diversity captures both the organismal richness of a sample and the evenness of the organisms' abundance distribution. Here, alpha diversity is defined by the Shannon index [32], $H = -\sum_{i=1}^{S} p_i \ln(p_i)$, where $p_i$ is the relative abundance of taxon $i$, although many other alpha diversity indices may be employed. E) Beta diversity represents the similarity (or difference) in organismal composition between samples. In this example, it can be simplistically defined by the equation $\beta = (n_1 - c) + (n_2 - c)$, where $n_1$ and $n_2$ are the number of taxa in samples 1 and 2, respectively, and $c$ is the number of shared taxa, but again many metrics such as Bray-Curtis [34] or UniFrac [24] are commonly employed.
doi:10.1371/journal.pcbi.1002808.g002
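The alpha and beta diversity definitions in the Figure 2 caption, together with an empirical collector's curve, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and example counts are assumptions, and the collector's curve here is obtained by simple random subsampling of reads (rarefaction) rather than by the Chao1 or ACE estimators cited in the caption.

```python
import math
import random


def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one sequence is required")
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total  # relative abundance of this taxon
            h -= p * math.log(p)
    return h


def simple_beta_diversity(taxa_1, taxa_2):
    """Beta = (n1 - c) + (n2 - c), with c the number of shared taxa."""
    set_1, set_2 = set(taxa_1), set(taxa_2)
    shared = len(set_1 & set_2)
    return (len(set_1) - shared) + (len(set_2) - shared)


def collectors_curve(counts, depths, trials=100, seed=0):
    """Mean number of distinct taxa seen when subsampling reads.

    For each depth, `trials` random subsamples are drawn without
    replacement from the pooled reads and the observed richness is averaged.
    """
    rng = random.Random(seed)
    # Expand per-taxon counts into one label per sequenced read.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    curve = []
    for depth in depths:
        if depth > len(reads):
            raise ValueError("depth exceeds the number of reads")
        richness = [len(set(rng.sample(reads, depth))) for _ in range(trials)]
        curve.append(sum(richness) / trials)
    return curve


# Hypothetical example: four equally abundant taxa give H = ln(4).
print(round(shannon_index([25, 25, 25, 25]), 3))  # prints: 1.386
# Two samples with four taxa each, sharing two (C and D).
print(simple_beta_diversity("ABCD", "CDEF"))  # prints: 4
```

Calling `collectors_curve` with increasing depths on an uneven community (for example counts of 50, 30, 15, 4, and 1) shows the curve flattening as rare taxa are eventually sampled, which is the behavior panel C describes.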
Further Reading
It is difficult to recommend comprehensive literature in an area that is changing
so rapidly, but the bioinformatics of microbial community studies are currently
best covered by the reviews in [22,56,165]. Computational tools for metagenomic
analysis include [13,19,63,75,76,77,166]. An overview of microbial ecology from a
phylogenetic perspective is provided in [167,168], and the use of the 16S subunit
as a marker gene is reviewed in [12]. Likewise, experimental and computational
functional metagenomics are discussed in [6,25,169]. The clinical relevance of the
human microbiome is far-ranging and is comprehensively reviewed in [157].
biofilm: a physically (and often temporally) structured aggregate of microorganisms, often containing multiple taxa, and often
adhered to each other and/or to a defined substrate
chimera: an artificial DNA sequence generated during amplification, consisting of a combination of two (or more) true
underlying sequences
collector's curve: a plot in which the horizontal axis represents samples (often DNA sequences) and the vertical axis represents
diversity (e.g. number of distinct taxa)
community structure: used most commonly to refer to the taxonomic composition of a microbial community; can also refer to
the spatiotemporal distribution of taxa
diversity: a measure of the taxonomic distribution within a community, either in terms of distinct taxa or in terms of their
evolutionary/phylogenetic distance
FBA: Flux Balance Analysis, a computational method for inferring the metabolic behavior of a system given prior knowledge of
the enzymatic reactions of which it is capable
functional metagenomics: computational or experimental analysis of a microbial community with respect to the biochemical
and other biomolecular activities encoded by its composite genome
gap filling: the process of imputing missing or inaccurate gene abundances in a set of pathways
gnotobiotic: a host animal containing a defined set of microorganisms, either synthetically implanted or transferred from
another host; often used to refer to model organisms with humanized microbiota
marker: a gene or other DNA sequence that can be (ideally) unambiguously assigned to a particular taxon or function
metagenomics: the study of uncultured microbial communities, typically relying on high-throughput experimental data and
bioinformatic techniques
metatranscriptome: the total transcribed RNA pool of all organisms within a community
microbiome: the total microbial community and biomolecules within a defined environment
microbiota: the total collection of microbial organisms within a community, typically used in reference to an animal host
ortholog: in strict usage, a homologous gene in two species distinguished only by a speciation event; in practice, used to
denote any gene sufficiently homologous as to represent strong evidence for conserved biological function
OTU: Operational Taxonomic Unit, a cluster of organisms similar at the sequence level beyond some threshold (e.g. 95%) used
in place of species, genus, etc.
prebiotic: a food substance metabolized by the microbiota so as to directly or indirectly benefit the host
probiotic: a live microorganism consumed by the host with direct or indirect health benefits
16S rRNA: the transcribed form of the 16S ribosomal subunit gene, the smaller RNA component of the prokaryotic ribosome,
used as the most common taxonomic marker for microbial communities
WGS: Whole-Genome Shotgun, used to describe shotgun sequencing of individual organisms and, sometimes, microbial
communities, although this is not completely accurate as no whole-genome is typically involved
WMS: Whole-Metagenome Shotgun sequencing, used in reference to undirected metagenomic sequencing to distinguish it
from sequencing directed to specific taxonomic marker genes
References
1. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, et al. (2010) A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464: 59-65.
2. (2012) Structure, function and diversity of the healthy human microbiome. Nature 486: 207-214.
3. Gram HC (1884) Über die isolierte Färbung der Schizomyceten in Schnitt- und Trockenpräparaten. Fortschritte der Medizin 2: 185-189.
4. Pace NR, Stahl DA, Lane DJ, Olsen GJ (1986) The analysis of natural microbial populations by ribosomal RNA sequences. Advances in Microbial Ecology 9: 1-55.
5. Amann RI, Ludwig W, Schleifer KH (1995) Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 59: 143-169.
6. Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68: 669-685.
7. Sanger F, Coulson AR (1975) A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. Journal of molecular biology 94: 441-448.
8. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America 74: 5463-5467.
9. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, et al. (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799-816.
10. Bocchetta M, Ceccarelli E, Creti R, Sanangelantoni AM, Tiboni O, et al. (1995) Arrangement and nucleotide sequence of the gene (fus) encoding elongation factor G (EF-G) from the hyperthermophilic bacterium Aquifex pyrophi-
14. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, et al. (2006) Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 72: 5069-5072.
15. Cole JR, Wang Q, Cardenas E, Fish J, Chai B, et al. (2009) The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37: D141-145.
16. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, et al. (2007) SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35: 7188-7196.
17. Achtman M, Wagner M (2008) Microbial diversity and the genetic nature of microbial species. Nat Rev Microbiol 6: 431-440.
18. Schloss PD (2010) The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput Biol 6: e1000844.
19. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, et al. (2009) Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75: 7537-7541.
20. Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4: 17-27.
21. Wang Q, Garrity GM, Tiedje JM, Cole JR (2007) Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol 73: 5261-5267.
22. Hamady M, Knight R (2009) Microbial com-
26. Sellner KG, Doucette GJ, Kirkpatrick GJ (2003) Harmful algal blooms: causes, impacts and detection. J Ind Microbiol Biotechnol 30: 383-406.
27. Hildebrand MV (1993) The Birthday Problem. American Mathematical Monthly 100: 643.
28. Chao A (1984) Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11: 265-270.
29. Chao A, Ma M-C, Yang MCK (1993) Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80: 193-201.
30. Heltshe JF, Forrester NE (1983) Estimating species richness using the jackknife procedure. Biometrics 39: 1-11.
31. Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Phil Trans R Soc London B 345: 101-118.
32. Shannon CE (1948) A mathematical theory of communication. Bell System Technical Journal 27: 379-423, 623-656.
33. Simpson EH (1949) Measurement of diversity. Nature 163: 688.
34. Bray JR, Curtis JT (1957) An ordination of upland forest communities of southern Wisconsin. Ecological Monographs 27: 325-349.
35. Huber T, Faulkner G, Hugenholtz P (2004) Bellerophon: a program to detect chimeric sequences in multiple sequence alignments. Bioinformatics 20: 2317-2319.
36. Brodie EL, Desantis TZ, Joyner DC, Baek SM, Larsen JT, et al. (2006) Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl Environ Microbiol 72: 6288-6298.
37. Schatz MC, Phillippy AM, Gajer P, DeSantis TZ, Andersen GL, et al. (2010) Integrated microbial survey analysis of prokaryotic communities for the PhyloChip microarray. Appl
lus: phylogenetic depth of hyperthermophilic munity profiling for human microbiome pro- Environ Microbiol 76: 56365638.
bacteria inferred from analysis of the EF-G/fus jects: Tools, techniques, and challenges. Ge- 38. Riesenfeld CS, Schloss PD, Handelsman J
sequences. J Mol Evol 41: 803812. nome Res 19: 11411152. (2004) Metagenomics: genomic analysis of
11. Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin 23. Johnson RA, Wichern DW (2007) Applied microbial communities. Annu Rev Genet 38:
ML, et al. (1985) Rapid determination of 16S Multivariate Statistical Analysis: Prentice 525552.
ribosomal RNA sequences for phylogenetic Hall. 39. Chen K, Pachter L (2005) Bioinformatics for
analyses. Proc Natl Acad Sci U S A 82: 6955 24. Lozupone C, Knight R (2005) UniFrac: a new whole-genome shotgun sequencing of microbial
6959. phylogenetic method for comparing microbial communities. PLoS Comput Biol 1: 106112.
12. Tringe SG, Hugenholtz P (2008) A renaissance communities. Appl Environ Microbiol 71: 40. Gilbert JA, Field D, Huang Y, Edwards R, Li
for the pioneering 16S rRNA gene. Curr Opin 82288235. W, et al. (2008) Detection of large numbers of
Microbiol 11: 442446. 25. Gianoulis TA, Raes J, Patel PV, Bjornson R, novel sequences in the metatranscriptomes of
13. Caporaso JG, Kuczynski J, Stombaugh J, Korbel JO, et al. (2009) Quantifying environ- complex marine microbial communities. PLoS
Bittinger K, Bushman FD, et al. (2010) QIIME mental adaptation of metabolic pathways in One 3: e3042.
allows analysis of high-throughput community metagenomics. Proc Natl Acad Sci U S A 106: 41. Booijink CC, Boekhorst J, Zoetendal EG, Smidt
sequencing data. Nat Methods 7: 335336. 13741379. H, Kleerebezem M, et al. (2010) Metatranscrip-
ICD codes | CPT codes | Laboratory Data | Medication records | Clinical Documentation
doi:10.1371/journal.pcbi.1002823.t001
Figure 1. Comparison of natural language processing (NLP) and CPT codes to detect completed colonoscopies in 200 patients. In
this study, more completed colonoscopies were found via NLP than with billing codes alone, and only one colonoscopy was found with billing codes
that was not found with NLP. NLP examples were reviewed for accuracy.
doi:10.1371/journal.pcbi.1002823.g001
constructed for each inpatient. Even without electronic medication administration records (such as bar-code systems), research has shown that CPOE-ordered medications are given with fairly high reliability [31].

Outpatient medication records are often recorded via narrative text entries within clinical documentation, patient problem lists, or communications with patients through telephone calls or patient portals. Many EHR systems have incorporated outpatient prescribing systems, which create structured medical records during generation of new prescriptions and refills. However, within many EHR systems, electronic prescribing tools are optional, not yet widely adopted, or have only been used within recent history. Thus, accurate construction of a patient's medication exposure history often requires NLP techniques. For specific algorithms, focused free-text searching for a set of medications can be efficient and effective [17]. This approach requires the researcher to generate the list of brand names, generics, combination medications, and abbreviations that would be used, but has the advantage that it can be easily accomplished using relational database queries. The downside is that this approach requires re-engineering for each medication or set of medications to be searched, and does not allow for the retrieval of other medication data, such as dose, frequency, and duration. A more general-purpose approach can be achieved with NLP, which is discussed in greater detail in Section 3 below.

3. Natural Language Processing to Support Clinical Knowledge Extraction

Although many documentation tools include structured and semi-structured elements, the vast majority of computer-based documentation (CBD) remains in natural language narrative formats [24]. Thus, to be useful for data mining, narrative data must be processed through use of text-searching (e.g., keyword searching) or NLP systems. Keyword searching can effectively identify rare physical exam findings in text [32], and extension to use of regular expression pattern matching has been used to extract blood pressure readings [33]. NLP computer algorithms scan and parse unstructured free-text documents, applying syntactic and semantic rules to extract structured representations of the information content, such as concepts recognized from a controlled terminology [34–37]. Early NLP efforts to extract medical concepts from clinical text documents focused on coding in the Systematized Nomenclature of Pathology or the ICD for financial and billing purposes [38], while more recent efforts often use complete versions of the Unified Medical Language System (UMLS) [39–41], SNOMED-CT [16], and/or domain-specific vocabularies such as RxNorm for medication extraction [42]. NLP systems utilize varying approaches to understanding text, including rule-based and statistical approaches using syntactic and/or semantic information. Natural language
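
The focused free-text medication search and the regular-expression extraction of blood pressure readings described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the cited studies: the term list `MEDICATION_TERMS`, the pattern `BP_PATTERN`, the plausibility limits, and the two helper functions.

```python
import re

# Illustrative term list (assumption): generic name -> brand names and generics
# a researcher would have to enumerate by hand for a focused search.
MEDICATION_TERMS = {
    "metformin": ["metformin", "glucophage"],
    "glipizide": ["glipizide", "glucotrol"],
}

# Hypothetical pattern: "BP", "B/P" or "blood pressure", an optional connector,
# then systolic/diastolic written as two 2-3 digit numbers around a slash.
BP_PATTERN = re.compile(
    r"\b(?:BP|B/P|blood\s+pressure)\b\s*(?:is|of|was|:|=)?\s*"
    r"(\d{2,3})\s*/\s*(\d{2,3})",
    re.IGNORECASE,
)


def find_medications(note: str) -> set:
    """Return the generic names whose listed terms appear as whole words."""
    found = set()
    for generic, names in MEDICATION_TERMS.items():
        for name in names:
            if re.search(rf"\b{re.escape(name)}\b", note, re.IGNORECASE):
                found.add(generic)
                break
    return found


def extract_blood_pressures(note: str) -> list:
    """Return (systolic, diastolic) pairs that pass a crude plausibility check."""
    readings = []
    for match in BP_PATTERN.finditer(note):
        systolic, diastolic = int(match.group(1)), int(match.group(2))
        # Assumed limits, chosen only to discard obvious non-readings.
        if 50 <= systolic <= 300 and 20 <= diastolic <= 200 and systolic > diastolic:
            readings.append((systolic, diastolic))
    return readings


note = (
    "Pt seen today. BP: 142/88, repeat blood pressure was 118 / 76. "
    "Continues Glucophage 500 mg BID."
)
print(find_medications(note))
print(extract_blood_pressures(note))
```

The sketch also shows the limitation noted in the text: the medication search reports only that a drug was mentioned, not its dose, frequency, or duration, and every new medication requires extending the term list by hand.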
lecting DNA samples in 2007, adding about 500 new samples weekly, and has over 150,000 subjects as of September 2012. Since it enrolls subjects prospectively, investigation of rare phenotypes may be possible with such systems. The major disadvantage of the opt-out approach is that it precludes recontact of the patients since their identity has been removed. However, the Synthetic Derivative is continually updated as new information is added to the EHR, such that the amount of phenotypic information for included patients grows over time.

5. Race and Ethnicity in EHR-Derived Biobanks

Given that much genetic information varies greatly within ancestral populations, accurate knowledge of genetic ancestry information is essential to allow for proper genetic study design and control of population stratification. Without it, one can see numerous spurious genetic associations due solely to race/ethnicity [62]. Single nucleotide polymorphisms (SNPs) common in one population may be rare in another. In large-scale GWA analyses, one can tolerate less accurate knowledge of ancestry a priori, since the large amount of genetic data allows one to calculate the genetic ancestry of the subject using catalogs of SNPs known to vary between races. Alternatively, one can also adjust for genetic ancestry using tools such as EIGENSTRAT [63]. However, in smaller candidate gene studies, it is important to know the ancestry beforehand.

Self-reported race/ethnicity data is often used in genetic studies. In contrast, race/ethnicity as recorded within an EHR may be entered through a variety of sources. Most commonly, administrative staff record race/ethnicity via structured data collection tools in the EHR. Often, this field can be ignored (left as "unknown"), especially in busy clinical environments, such as emergency departments. The proportion of patients with unknown race/ethnicity can range between 9% and 23% of subjects [17,18]. Among those patients for whom data is entered, ancestry inferred from genetic ancestry informative markers correlated well with EHR-reported race/ethnicities in one study [64]. In addition, a study within the Veterans Administration (VA) hospital system noted that over 95% of all EHR-derived race/ethnicity agreed with self-reported race/ethnicity using nearly one million records [65]. Thus, despite concerns over EHR-derived ancestral information, such information, when present, appears similar to self-report ancestry information.

6. Phenotype-Driven Discovery in EHRs

6.1 Measure of Phenotype Selection Logic Performance

The evaluation of phenotype selection logic can use metrics similar to information retrieval tasks. Common metrics are sensitivity (or recall), specificity, positive predictive value (PPV, also known as precision), and negative predictive value (see Box 1). If a population is assessed for case and control status, then another useful metric is comparing the receiver operating characteristic (ROC) curves. ROC curves graph the sensitivity vs. false positive rate (or, 1-specificity) given a continuous measure of the outcome of the algorithm. By calculating the area under the ROC curve (AUC), one has a single measure of the overall performance of an algorithm that can be used to compare two algorithms or selection logics. Since the scale of the graph is 0 to 1 on both axes, the performance of a perfect algorithm is 1, and random chance is 0.5.

6.2 Creation of Phenotype Selection Logic

Initial work in phenotype detection has often focused on a single modality of EHR data. A number of studies have used billing data, some comparing directly to other genres of data, such as NLP. Li et al. compared the results of ICD-9 encoded diagnoses and NLP-processed discharge summaries for clinical trial eligibility queries, finding that use of NLP provided more valuable data sources for clinical trial pre-screening than ICD-9 codes [15]. Savova et al. used cTAKES to discover peripheral arterial disease cases by looking for particular key words in radiology reports, and then aggregating the individual instances using AND-OR-NOT Boolean logic to classify cases into four categories: positive, negative, probable, and unknown [66].

Phenotype algorithms can be created multiple ways, depending on the rarity of the phenotype, the capabilities of the EHR system, and the desired sample size of the study. Generally, phenotype selection logics (algorithms) are composed of one or more of four elements: billing code data, other structured (coded) data such as laboratory values and demographic data, medication information, and NLP-derived data. Structured data can be retrieved effectively from most EHR systems. These data can be combined through simple Boolean logic
*Given the small number of multiple sclerosis cases, all possible cases were manually validated to ensure high recall.
doi:10.1371/journal.pcbi.1002823.t002
[17] or through machine learning methods such as logistic regression [18], to achieve a predefined specificity or positive predictive value. A drawback to the use of machine learning methods (such as logistic regression models) is that they may not be as portable to other EHR systems as simpler Boolean logic, depending on how the models are constructed. The application of many phenotype selection logics can be thought of as partitioning individuals into four buckets: definite cases (with sufficiently high PPV), possible cases (which can be manually reviewed if needed), controls (which do not have the disease with acceptable PPV), and individuals excluded from the analysis due to either potentially overlapping diagnoses or insufficient evidence (Figure 3).

For many algorithms, sensitivity (or recall) is not necessarily evaluated, assuming there are an adequate number of cases. A possible concern in not evaluating recall (sensitivity) of a phenotype algorithm is that there may be a systematic bias in how patients were selected. For example, consider a hypothetical algorithm to find patients with T2D whose logic was to select all patients that had at least one billing code for T2D and also required that cases receive an oral hypoglycemic medication. This algorithm may be highly specific for finding patients with T2D (instead of type 1 diabetes), but would miss those patients who had progressed in disease severity such that oral hypoglycemic agents no longer worked and who now require insulin treatment. Thus, this phenotype algorithm could miss the more severe cases of T2D. However, for a practical application, such assessments of recall can be challenging given large sample sizes of rare diseases. Certain assumptions (e.g., that a patient should have at least one billing code for the disease) are reasonable and likely do not lead to significant bias.

For other algorithms, the temporal relationships of certain elements are very important. Consider an algorithm to determine whether a certain combination of medications adversely impacted a given laboratory value, such as kidney function or glucose [67]. Such an algorithm would need to take into account the temporal sequence and time between the particular medications and laboratory tests. For example, glucose changes within minutes to hours of a single administration of insulin, but the development of glaucoma from corticosteroids (a known side effect) would not be expected to happen acutely following a single dose.

For very rare diseases or findings, one may desire to find every case, and thus the logic may simply be a union of keyword text queries and billing codes followed by manual review of all returned cases. Examples include the rare physical exam finding hippus (exaggerated pupillary oscillations occurring in the setting of altered mental status) [32], or potential drug adverse events (e.g., Stevens-Johnson syndrome), which are often very rare but severe.

Since EHRs represent longitudinal records of patient care, they are biased toward those events that are recorded as part of medical care. Thus, they are particularly useful for investigating disease-based phenotypes, but potentially less efficacious for investigating non-disease phenotypes such as hair or eye color, left vs. right handedness, cognitive attributes, biochemical measures (beyond routine labs), etc. On the other hand, they may be particularly useful for analyzing disease progression over time.

7. Examples of Genetic Discovery Using EHRs

The growth of "EHR-driven genomic research" (EDGR), that is, genomic research proceeding primarily from EHR data linked to DNA samples, is a recent phenomenon [6]. Preceding these most recent research initiatives, other studies laid the groundwork for use of EHR data to study genetic phenomena. Rzhetsky et al. used billing codes from the EHRs of 1.5 million patients to analyze disease co-occurrence in 161 conditions as a proxy for possible genetic overlap [68]. Chen et al. compared laboratory measurements and age with gene expression data to identify rates of change that correlated with genes known to be involved in aging [69]. A study at Geisinger Clinic evaluated SNPs in the 9p21 region that are known to be associated with cardiovascular disease and early myocardial infarction [70]. They found these SNPs were associated with heart disease and T2D using EHR-derived data. Several specific examples of EDGR are detailed below.

7.1 Replicating Known Genetic Associations for Five Diseases

An early replication study of known genetic associations for five diseases was performed in BioVU. The study was designed to test the hypothesis that an EHR-linked DNA biobank could be used for genetic association analyses. The goal was to use only EHR data for phenotype information. The first 10,000 samples accrued in BioVU were genotyped at 21 SNPs that are known to be associated with these five diseases (atrial fibrillation, Crohn's disease, multiple sclerosis, rheumatoid arthritis,
Institution | Biorepository Overview | Model | Size | EHR Summary | Phenotyping Methods
Group Health¹ (Seattle, WA) | GHC Biobank: Alzheimer's Disease Patient Registry and Adult Changes in Thought Study | Disease specific cohort | 4000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP
Marshfield Clinic Research Foundation¹ (Marshfield, WI) | Personalized Medicine Research Project: Marshfield Clinic, an integrated regional health system | Population based | 20,000 | Comprehensive internally developed EHR since 1985 | Structured data extraction, NLP, Intelligent Character Recognition
Mayo Clinic¹ (Rochester, MN) | Disease cohort: derived from vascular laboratory & exercise stress testing labs | Disease specific cohorts | 16,500 | Comprehensive internally developed EHR since 1995 | Structured data extraction, NLP
Northwestern University¹ (Chicago, IL) | NUgene Project: Northwestern affiliated hospitals and outpatient clinics | Population based | >10,000 | Comprehensive vendor based Inpatient and Outpatient (different systems) EHR since 2000 | Structured data extraction, text searches, NLP
Vanderbilt University¹ (Nashville, TN) | BioVU: primarily drawn from outpatient routine laboratory samples | Population based | 150,000 | Comprehensive internally developed EHR since 2000 | Structured data extraction, NLP
Geisinger Health System² (Pennsylvania) | MyCode: enrollment of health plan participants | Population based | >30,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Mount Sinai Medical Center² (New York, NY) | Institute for Personalized Medicine Biobank: outpatient enrollment | Population based | >30,000 | Comprehensive vendor-based EHR since 2004 | Structured data extraction, NLP
Cincinnati Children's Hospital³ (Cincinnati, OH) | General and disease cohorts | Population based | >3,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Children's Hospital of Philadelphia³ (Philadelphia, PA) | General and disease cohorts | Population based | >100,000 | Comprehensive vendor-based EHR | Structured data extraction, NLP
Boston Children's³ (Boston MA) | Crimson: on-demand, de-identified phenotype-driven collection | Disease based | Virtual | Comprehensive internally developed EHR | Structured data extraction, NLP
Sizes represent approximate sizes as of 2012; many sites are still actively recruiting. NLP = Natural Language Processing. Sites joined with ¹eMERGE-I in 2007, ²eMERGE-II in 2011, or as ³pediatric sites in 2012.
doi:10.1371/journal.pcbi.1002823.t003
and T2D). Reported odds ratios were 1.14–2.36 in at least two previous studies prior to the analysis. Automated phenotype identification algorithms were developed using NLP techniques (to identify key findings, medication names, and family history), billing code queries, and structured data elements (such as laboratory results) to identify cases (n = 70–698) and controls (n = 808–3818). Final algorithms achieved PPV of ≥97% for cases and 100% for controls on randomly selected cases and controls (Table 2) [17]. For each of the target diseases, the phenotype algorithms were developed iteratively, with a proposed selection logic applied to a set of EHR subjects, and random cases and controls evaluated for accuracy. The results of these reviews were used to refine the algorithms, which were then redeployed and reevaluated on a unique set of randomly selected records to provide final PPVs.

Used alone, ICD9 codes had PPVs of 56–89% compared to a gold standard represented by the final algorithm. Errors were due to coding errors (e.g., typos), misdiagnoses from non-specialists (e.g., a non-specialist diagnosed a patient as having rheumatoid arthritis followed by a rheumatologist who revised the diagnosis to psoriatic arthritis), and indeterminate diagnoses that later evolved into well-defined ones (e.g., a patient thought to have Crohn's disease was later determined to have ulcerative colitis, another type of inflammatory bowel disease). Each of the 21 tests of association yielded point estimates in the expected direction, and eight of the known associations achieved statistical significance [17].

7.2 Demonstrating Multiethnic Associations with Rheumatoid Arthritis

Using a logistic regression algorithm operating on billing data, NLP-derived features, medication records, and laboratory data, Liao et al. developed an algorithm to accurately identify rheumatoid arthritis patients [18]. Kurreeman et al. used this algorithm on EHR data to identify a population of 1,515 cases and 1,480 matched controls [71]. These researchers genotyped 29 SNPs that had been associated with RA in at least one prior study. Sixteen of these SNPs achieved statistical significance, and 26/29 had odds ratios in the same direction and with similar effect sizes. The authors also demonstrated that portions of these risk alleles were associated with rheumatoid arthritis in
East Asian, African, and Hispanic American populations.

7.3 eMERGE Network

The eMERGE network is composed of nine institutions as of 2012 (http://gwas.org; Table 3). Each site has a DNA biobank linked to robust, longitudinal EHR data. The initial goal of the eMERGE network was to investigate the feasibility of genome-wide association studies using EHR data as the primary source for phenotypic information. Each of these sites initially set out to investigate one or two primary phenotypes (Table 3). Network sites have currently created and evaluated electronic phenotype algorithms for 14 different primary and secondary phenotypes, with nearly 30 more planned. After defining phenotype algorithms, each site then performed genome-wide genotyping at one of two NIH-supported genotyping centers.

The primary goals of an algorithm are to perform with high precision (≥95%) and reasonable recall. Algorithms incorporate billing codes, laboratory and vital signs data, test and procedure results, and clinical documentation. NLP is used to both increase recall (find additional cases) and achieve greater precision (via improved specificity). These phenotype algorithms are available for download from PheKB (http://phekb.org).

Initial plans were for each site to analyze their own phenotypes independently. However, the network has realized the benefits of synergy. Central efforts across the network were involved in harmonization of the collective genetic data.

7.4 Early Genome-Wide Association Studies from the eMERGE Network

As of 2012, the eMERGE Network has published GWAS on atrioventricular conduction [72], red blood cell [23] and white blood cell [73] traits, primary hypothyroidism [74], and erythrocyte sedimentation rate [75], with others ongoing. The first two studies published by the network were single-site GWAS; later studies have realized the advantage of pooling data across multiple sites to increase the sample size available for a study. Importantly, several studies in eMERGE have explicitly evaluated the portability of the electronic phenotype algorithms by reviewing algorithms at multiple sites. Evaluation of the hypothyroidism algorithm at the five eMERGE-I sites, for instance, noted an overall weighted PPV of 92.4% and 98.5% for cases and controls, respectively [74]. Similar results have been found with T2D [76], cataracts [27], and rheumatoid arthritis [77] algorithms.

As a case study, the GWAS for atrioventricular conduction (as measured by the PR interval on the ECG), conducted entirely within samples drawn from one site, identified variants in SCN10A. SCN10A is a sodium channel expressed in autonomic nervous system tissue and is now known to be involved in cardiac regulation. The phenotype algorithm identified patients with normal ECGs who did not have evidence of prior heart disease, were not on medications that would interfere with cardiac conduction, and had normal electrolytes. The phenotype algorithm used NLP and billing code queries to search for the presence of prior heart disease and medication use [72]. Of note, the algorithm highlights the importance of using clinical note section tagging and negation to exclude only those patients with heart disease, as opposed to patients whose records contained negated heart disease concepts (e.g., "no myocardial infarction") or heart disease concepts in related individuals (e.g., "mother died of a heart attack"). Use of NLP improved recall of cases by 129% compared with simple text searching, while maintaining a positive predictive value of 97% (Figure 4) [78,72].

The study of RBC traits identified four variants associated with RBC traits. One of these, SLC17A1, had not been previously identified, and is involved in sodium-phosphate co-transport in the kidney. The latter study of RBC traits utilized patients genotyped at one site as cases and controls
for their primary phenotype of peripheral arterial disease (PAD). Thus, this represents an in silico GWAS for a new finding that did not require new genotyping, but instead leveraged the available data within the EHR. The eMERGE study of primary hypothyroidism, similarly, identified a novel association with FOXE1, a thyroid transcription factor, without any new genotyping by using samples derived from five eMERGE sites.

7.5 Phenome-Wide Association Studies (PheWAS)

Typical genetic analyses investigate many genetic loci against a single trait or disease. Such analyses cannot identify pleiotropic associations, and may miss important confounders in an analysis. Another approach, engendered by the rich phenotype record included in the EHR, is to simultaneously investigate many phenotypes associated with a given genetic locus. A phenome-wide association study (PheWAS) is, in a sense, a reverse GWAS. PheWAS investigations require large representative patient populations with definable phenotypic characteristics. Such studies only recently became feasible, facilitated by linkage of DNA biorepositories to EHR systems, which can provide a comprehensive, longitudinal record of disease.

The first PheWAS studies were performed on 6,005 patients genotyped for five SNPs with seven previously known disease associations [79]. This PheWAS used ICD9 codes linked to a code-translation table that mapped ICD9 codes to 776 disease phenotypes. In this study, PheWAS methods replicated four of seven previously known associations with p<0.011. Figure 5 shows one illustrative PheWAS plot of phenotype associations with an HLA-DRA SNP known to be associated with multiple sclerosis. Of note, this PheWAS not only demonstrates a strong association between this SNP and multiple sclerosis, but also highlights other possible associations, such as Type 1 diabetes and acquired hypothyroidism. Recent explorations into PheWAS methods using NLP have shown greater efficacy for detecting associations: with the same patients, NLP-based PheWAS replicated six of the seven known associations, generally with more significant p-values [80].

PheWAS methods may be particularly useful for highlighting pleiotropy and clinically associated diseases. For example, an early GWAS for T2D identified, among others, FTO loci as an associated variant [81]. A later GWAS demonstrated this risk association was mediated through the effect of FTO on increasing body mass index, and thus increasing risk of T2D within those individuals. Such effects may be identified through broad phenome scans made possible through PheWAS.
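
The PheWAS idea described in this section, mapping billing codes to phenotypes and then testing one variant against each phenotype, can be sketched in a few lines. This is only an illustration under stated assumptions and not the method of the cited studies: the three-entry `CODE_TO_PHENOTYPE` table stands in for a real code-translation table, the test is an uncorrected Pearson chi-square on carrier status, the odds ratio uses a 0.5 continuity correction, and the 40 patients are synthetic.

```python
import math

# Illustrative code-translation table (assumption): billing code -> phenotype.
CODE_TO_PHENOTYPE = {
    "340": "multiple sclerosis",
    "401.1": "hypertension",
    "401.9": "hypertension",
}


def phenotypes_for(codes):
    """Map one patient's billing codes to the set of phenotypes they imply."""
    return {CODE_TO_PHENOTYPE[c] for c in codes if c in CODE_TO_PHENOTYPE}


def chi2_pvalue(a, b, c, d):
    """Pearson chi-square p-value (1 df, no continuity correction), 2x2 table."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # a zero margin leaves the test undefined; report no evidence
    chi2 = n * (a * d - b * c) ** 2 / margins
    # With 1 degree of freedom the survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def phewas(carrier, codes_by_patient):
    """Test one variant against every phenotype; rows are (phenotype, OR, p)."""
    pheno = {pid: phenotypes_for(codes) for pid, codes in codes_by_patient.items()}
    results = []
    for name in sorted(set().union(*pheno.values())):
        a = b = c = d = 0
        for pid, is_carrier in carrier.items():
            affected = name in pheno.get(pid, set())
            if is_carrier and affected:
                a += 1
            elif is_carrier:
                b += 1
            elif affected:
                c += 1
            else:
                d += 1
        # 0.5 added to each cell (assumption) so empty cells do not divide by zero.
        odds_ratio = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
        results.append((name, odds_ratio, chi2_pvalue(a, b, c, d)))
    return sorted(results, key=lambda row: row[2])


# Synthetic cohort: patients 0-19 carry the variant, 20-39 do not.
carrier = {i: i < 20 for i in range(40)}
codes_by_patient = {}
for i in range(40):
    codes = []
    if i < 10 or i == 20:
        codes.append("340")
    if i % 5 < 2:
        codes.append("401.9")
    codes_by_patient[i] = codes

results = phewas(carrier, codes_by_patient)
threshold = 0.05 / len(results)  # Bonferroni correction across phenotypes tested
for name, odds_ratio, p in results:
    print(name, round(odds_ratio, 2), p, p < threshold)
```

Because every phenotype is tested, some correction for multiple comparisons is needed; the Bonferroni threshold used here is one simple choice among several.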
Further Reading
• Shortliffe EH, Cimino JJ, editors (2006) Biomedical informatics: computer applications in health care and biomedicine. 3rd edition. Springer. 1064 p. Chapters of particular relevance: Chapter 2 (Biomedical data: their acquisition, storage, and use), Chapter 8 (Natural language and text processing in biomedicine), Chapter 12 (Electronic health record systems).
• Hristidis V, editor (2009) Information discovery on electronic health records. 1st edition. Chapman and Hall/CRC. 331 p. Chapters of particular relevance: Chapter 2 (Electronic health records), Chapter 4 (Data quality and integration issues in electronic health records), Chapter 7 (Data mining and knowledge discovery on EHRs).
• Wilke RA, Xu H, Denny JC, Roden DM, Krauss RM, et al. (2011) The emerging role of electronic medical records in pharmacogenomics. Clin Pharmacol Ther 89: 379–386. doi:10.1038/clpt.2010.260.
• Roden DM, Xu H, Denny JC, Wilke RA (2012) Electronic medical records as a tool in clinical pharmacology: opportunities and challenges. Clin Pharmacol Ther. Available: http://www.ncbi.nlm.nih.gov/pubmed/22534870. Accessed 30 June 2012.
• Kohane IS (2011) Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12: 417–428. doi:10.1038/nrg2999.
Glossary
• Candidate gene study: A study of specific genetic loci in which a phenotype-genotype association may exist (e.g., hypothesis-led genotype experiment).
• Computer-based documentation (CBD): Any electronic note or report found within an EHR system. Typically, these can be dictated or typed directly into a "note writer" system (which may leverage templates) available within the EHR. Notably, CBD excludes scanned documents.
• Computerized Provider Order Entry (CPOE): A system for allowing a provider (typically a clinician or a nurse practitioner) to enter, electronically, an order for a patient. Typical examples include medication prescribing or test ordering. These systems allow for a precise electronic record of orders given and also can provide decision support to help improve care.
• Electronic Health Record (EHR): Any comprehensive electronic medical record system storing all the data about a patient's encounters with a healthcare system, including medical diagnoses, physician notes, and prescribing records. EHRs include CPOE and CBD systems (among others), and allow for easy information retrieval of clinical notes and results.
• Genome-wide association study (GWAS): A broad scale study of a number of points selected along a genome without using a prior hypothesis. Typically, these studies analyze more than 500,000 loci on the genome.
• Genotype: The specific DNA sequence at a given location.
• Natural language processing (NLP): Use of algorithms to create structured data from unstructured, narrative text documents. Examples include use of comprehensive NLP software solutions to find biomedical concepts in documents, as well as more focused applications of techniques to extract features from notes, such as blood pressure readings.
• Phenome-wide association study (PheWAS): A broad scale study of a number of phenotypes, selected without regard to a prior hypothesis as to which phenotype(s) a given genetic locus may be associated with.
• Phenotype selection logic (or algorithm): A series of Boolean rules or machine learning algorithms incorporating such information as billing codes, laboratory values, medication records, and NLP, designed to derive a case and control population from EHR data.
• Phenotype: Any observable attribute of an individual.
• Single nucleotide polymorphism (SNP): A single locus on the genome that shows variation in the human population.
• Structured data: Data that is already recorded in a system in a structured name-value pair format and can be easily queried via a database.
• Unified Medical Language System (UMLS): A comprehensive metavocabulary maintained by the National Library of Medicine which combines >100 individual standardized vocabularies. The UMLS is composed of the Metathesaurus, the Specialist Lexicon, and the Semantic Network. The largest component of the UMLS is the Metathesaurus, which contains the term strings, concept groupings of terms, and concept interrelationships.
• Unstructured data: Data contained in narrative text documents such as the clinical notes generated by physicians and certain types of text reports, such as pathology results or procedures such as echocardiograms.
References
1. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA 106: 9362-9367. doi:10.1073/pnas.0903103106.
2. Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678.
3. Dehghan A, Kottgen A, Yang Q, Hwang S-J, Kao WL, et al. (2008) Association of three genetic loci with uric acid concentration and risk of gout: a genome-wide association study. Lancet 372: 1953-1961. doi:10.1016/S0140-6736(08)61343-4.
4. Benjamin EJ, Dupuis J, Larson MG, Lunetta KL, Booth SL, et al. (2007) Genome-wide association with select biomarker traits in the Framingham Heart Study. BMC Med Genet 8 Suppl 1: S11. doi:10.1186/1471-2350-8-S1-S11.
5. Kiel DP, Demissie S, Dupuis J, Lunetta KL, Murabito JM, et al. (2007) Genome-wide association with bone mass and geometry in the Framingham Heart Study. BMC Med Genet 8 Suppl 1: S14.
6. Kohane IS (2011) Using electronic health records to drive discovery in disease genomics. Nat Rev Genet 12: 417-428. doi:10.1038/nrg2999.
7. Manolio TA (2009) Collaborative genome-wide association studies of diverse diseases: programs of the NHGRI's office of population genomics. Pharmacogenomics 10: 235-241.
8. Kaiser Permanente, UCSF Scientists Complete NIH-Funded Genomics Project Involving 100,000 People (n.d.). Available: http://www.dor.kaiser.org/external/news/press_releases/Kaiser_Permanente,_UCSF_Scientists_Complete_NIH-Funded_Genomics_Project_Involving_100,000_People/. Accessed 13 September 2011.
9. Herzig SJ, Howell MD, Ngo LH, Marcantonio ER (2009) Acid-suppressive medication use and the risk for hospital-acquired pneumonia. JAMA 301: 2120-2128.
10. Klompas M, Haney G, Church D, Lazarus R, Hou X, et al. (2008) Automated identification of acute hepatitis B using electronic medical record data to facilitate public health surveillance. PLoS ONE 3: e2626. doi:10.1371/journal.pone.0002626.
11. Kiyota Y, Schneeweiss S, Glynn RJ, Cannuscio CC, Avorn J, et al. (2004) Accuracy of Medicare claims-based diagnosis of acute myocardial infarction: estimating positive predictive value on the basis of review of hospital records. American Heart Journal 148: 99-104.
12. Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, et al. (2009) Use of Electronic Medical Records for Health Outcomes Research: A Literature Review. Med Care Res Rev. Available: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=19279318.
13. Elixhauser A, Steiner C, Harris DR, Coffey RM (1998) Comorbidity measures for use with administrative data. Medical Care 36: 8-27.
14. Charlson ME, Pompei P, Ales KL, MacKenzie CR (1987) A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. Journal of Chronic Diseases 40: 373-383.
15. Li L, Chase HS, Patel CO, Friedman C, Weng C (2008) Comparing ICD9-encoded diagnoses and NLP-processed discharge summaries for clinical trials pre-screening: a case study. AMIA Annual Symposium Proceedings: 404-408.
16. Elkin PL, Ruggieri AP, Brown SH, Buntrock J, Bauer BA, et al. (2001) A randomized controlled trial of the accuracy of clinical record retrieval using SNOMED-RT as compared with ICD9-CM. Proceedings/AMIA Annual Symposium: 159-163.
Abstract: Although there is great promise in the benefits to be obtained by analyzing cancer genomes, numerous challenges hinder different stages of the process, from the problem of sample preparation and the validation of the experimental techniques, to the interpretation of the results. This chapter specifically focuses on the technical issues associated with the bioinformatics analysis of cancer genome data. The main issues addressed are the use of database and software resources, the use of analysis workflows and the presentation of clinically relevant action items. We attempt to aid new developers in the field by describing the different stages of analysis and discussing current approaches, as well as by providing practical advice on how to access and use resources, and how to implement recommendations. Real cases from cancer genome projects are used as examples.

This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

Citation: Vazquez M, de la Torre V, Valencia A (2012) Chapter 14: Cancer Genome Analysis. PLoS Comput Biol 8(12): e1002824. doi:10.1371/journal.pcbi.1002824
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published December 27, 2012
Copyright: © 2012 Vazquez et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This article was supported in part by the grant from the Spanish Ministry of Science and Innovation BIO2007-66855 and the EU FP7 project ASSET, grant agreement 259348. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: valencia@cnio.es

1. Introduction

Cancer is commonly defined as a disease of the genes, a definition that emphasizes the importance of cataloguing and analyzing tumor-associated mutations. The recent advances in sequencing technology have underpinned the progress in several large-scale projects to systematically compile genomic information related to cancer. For example, the Cancer Genome Atlas (http://cancergenome.nih.gov/) and the projects overseen by the International Cancer Genome Consortium [1] (http://icgc.org/) have focused on identifying links between cancer and genomic variation. Unsurprisingly, the analysis of genomic mutations associated with cancer is also making its way into clinical applications [2-4].

Cancer may be favored by genetic predisposition, although it is thought to be primarily caused by mutations in specific tissues that accumulate over time. Genetic predisposition is represented by germline variants and indeed, many common germline variants have been associated with specific diseases, as well as with altered drug susceptibility and/or toxicity. The association of germline variants with clinical features and disease is mainly achieved through Genome Wide Association Studies (GWAS). GWAS use large cohorts of cases to analyze the relationship between the disease and thousands or millions of mutations across the entire genome, and they are the subject of a separate chapter in this issue.

The study of cancer genomes differs significantly from GWAS, as during the lifetime of the organism variants only accumulate in the tumor or the affected tissues, and they are not transmitted from generation to generation. These are known as somatic mutations. Mutations accumulate as the tumors progress through processes that are not completely understood and that depend on the evolution of the different cell types in the tumor, i.e., clonal versus parallel evolution [5]. Regardless of which model is more relevant, the tumor genome includes mutations that facilitate tumorigenesis or that are essential for the generation of the tumor (known as tumor drivers), and others that have accumulated during the growth of the tumor (known as passengers) [6]. Distinguishing driver from passenger mutations is crucial for the interpretation of cancer genomes [5].

Depending on the type of data and the aim of the analysis, cancer genome analysis may focus on the cancer type or on the patient. The first approach consists of examining a cohort of patients suffering from a particular type of cancer, and is used to identify biomarkers, characterize cancer subtypes with clinical or therapeutic implications, or to simply advance our understanding of the tumorigenic process. The second approach involves examining the genome of a particular cancer patient in the search for specific alterations that may be susceptible to tailored therapy. Although both approaches draw on common experimental and bioinformatics techniques, they analyze different types of information, have different goals and they require the presentation of the results in distinct ways.

The development of Next Generation Sequencing (NGS) has not only helped identify genetic variants, it also represents an important aid in the study of epigenetics (DNAseq and ChipSeq of histone methylation marks), transcriptional regulation and splicing (RNAseq). The combined power of such genomic data provides a more complete definition of cancer genomes.

To aid developers new to the field of cancer genomics, this chapter will discuss the particularities of cancer genome analysis, as well as the main scientific and technical challenges, and potential solutions.

2. Overview of Cancer Genome Analysis

The sequence of the steps in an idealized cancer genome analysis pipeline is presented in Figure 1. For each step, the biological disciplines involved, the bioinformatics techniques used and some of the most salient challenges that arise are listed.
Figure 1. Idealized cancer analysis pipeline. The column on the left shows a list of sequential steps. The columns on the right show the
bioinformatics and molecular biology disciplines involved at each step, the types of techniques employed and some of the current challenges faced.
doi:10.1371/journal.pcbi.1002824.g001
used to perform pathogenicity predictions for point mutations in coding regions (see Table 1). Prediction is far more complicated for genomic aberrations and mutations that affect non-coding regions of DNA, an area of basic research that is still in its early stages. However, the large collections of genomic information gathered by the ENCODE project [16] will doubtless play a key role in this research.

Despite their limited scope, mutations in coding regions are the most useful for cancer genome analysis. This is firstly because it is still cheaper to sequence exomes than full genomes, and also because they are closer to actionable medical items, given that most drugs target proteins. Indeed, most clinical success stories based on cancer genome analysis have involved the analysis of point mutations in proteins [3].

In particular, we have focused on the need to analyze the consequences of mutations in alternative isoforms of each gene, in addition to those in the main isoforms. Despite the potential implications of alternative splicing, this problem remains largely overlooked by current applications. A common solution is to assign the genomic mutations to just one of the several potential isoforms, without considering their possible incidence on other splice isoforms, and in most cases without knowing which isoform is actually produced in that particular tissue. The availability of RNAseq data should solve this problem by demonstrating which isoforms are specifically expressed in the cell type of interest, in which case additional software will be necessary to analyze the data generated by the new experiments.

3.3 Functional Interpretation

Some genes harbor a large number of mutations in cancer genomes, such as TP53 and KRAS, whose importance and relevance as cancer drivers have been well established. Frequently, however, genomic data reveal the presence of mutated genes that are far less prevalent, and the significance of these genes must be considered in the context of the functional units they are part of. For example, SF3B1 was mutated in only 10 out of 105 samples of chronic lymphocytic leukemia (CLL) in the study conducted by the ICGC consortium [9], and in 14 out of 96 in the study performed in the Broad Institute [17]. While these numbers are statistically significant, many other components of the RNA splicing and transport machinery are also mutated in CLL. Even if these mutations occur at lower frequencies they further emphasize the importance of this gene [18].

Functional interpretation aims to identify large biological units that correlate better with the phenotype than individual mutated genes, and as such, it can produce a more general interpretation of the acquired genomic information. The involvement of genes in specific biological, metabolic and signaling pathways is the type of functional annotation most commonly considered and thus, functional analysis is often termed pathway analysis. However, functional annotations may also include other types of biological associations such as cellular location, protein domain composition, and classes of cellular or biochemical terms, such as GO terms (Table 2 lists some useful databases along with the relevant functional annotations).

Over the last decade, multiple statistical approaches have been developed to identify functional annotations (also known as labels) that are significantly associated with lists of entities, collectively known as enrichment analysis. Indeed, the current systems for functional interpretation have been derived from the systems previously developed to analyze expression arrays, and they have been adapted to analyze lists of cancer-related genes. As this step is critical to perform functional interpretations, special care must be taken when selecting methods to be incorporated into the analysis pipeline. Cases in which the characteristics of the data challenge the assumptions of the methods are particularly delicate. For instance, a hypergeometric test might be appropriate to analyze gene lists that are differentially expressed in gene expression arrays. However, when dealing with lists of mutated genes this approach does not account for factors such as the number of mutations per gene, the size of the genes, or the presence of genes in overlapping genomic clusters (where one mutation may simultaneously
doi:10.1371/journal.pcbi.1002824.t001
Database | Entities | Properties
Ensembl | Genes, proteins, transcripts, regulatory regions, variants | Genomic positions, relationships between them, identifiers in different formats, GO terms, PFAM domains
Entrez | Genes, articles | Articles for genes, abstracts of articles, links to full text
UniProt | Proteins | PDBs, known variants
KEGG, Reactome, Biocarta, Gene Ontology | Genes | Pathways, processes, function, cell location
TFacts | Genes | Transcription regulation
Barcode | Genes | Expression by tissue
PINA, HPRD, STRING | Proteins | Interactions
PharmGKB | Drugs, proteins, variants | Drug targets, pharmacogenetics
STITCH, Matador | Drugs, proteins | Drug targets
Drug clinical trials | Investigational drugs | Diseases or conditions in which they are being tested
GEO, ArrayExpress | Genes (microarray probes) | Expression values
ICGC, TCGA | Cancer genomes | Point mutations, methylation, CNV, structural variants
dbSNP, 1000 genomes | Germline variations | Association with diseases or conditions
COSMIC | Somatic variations | Association with cancer types
doi:10.1371/journal.pcbi.1002824.t002
affect several genes). As none of these issues are accommodated by the standard approaches used for gene expression analysis, new developments are clearly required for cancer genome analysis.

To alleviate the rigidity introduced by the binary nature of set-based approaches, whereby genes are either on the list or they are not, some enrichment analysis approaches study the over-representation of annotations/labels using rank-based statistics. A common choice for rank-based approaches is to use some variation of the Kolmogorov-Smirnov non-parametric statistic, as employed in gene set enrichment analysis (GSEA) [19]. Another benefit of rank approaches is that the scores used can be designed to account for some of the features that are not well handled by set-based approaches. Accordingly, considerations of background mutation rates based on gene length, sequencing quality or heterogeneity in the initial tumor samples can be incorporated into the scoring scheme. However, rank statistics are still unable to handle other issues, such as mutations affecting clusters of genes that are functionally related (e.g., proto-cadherins), which still challenge the assumption of independence made by most statistical approaches. Note that from a bioinformatics perspective, sets of entities are often conceptually simpler to work with than ranked lists when crossing information derived from different sources. Moreover, from an application perspective, information summarized in terms of sets of entities is often more actionable than ranks or scores.

A different type of analysis considers the relationships between entities based on their connections in protein interaction networks. This approach has been used to measure the proximity of groups of cancer-related genes and other groups of genes or functions, by labeling nodes with specific characteristics (such as roles in biological pathways or functional classes) [20].

Functional interpretation can therefore be facilitated by the use of a wide array of alternative analyses. Different approaches can potentially uncover hidden functional implications in genomic data, although the integration of these results remains a key challenge.

3.4. Applicable Results: Diagnosis, Patient Stratification and Drug Therapies

For clinical applications, the results of cancer genome analysis need to be translated into practical advice for clinicians, providing potential drug therapies, better tumor classification or early diagnostic markers. While bioinformatics systems can support these decisions, it will be up to expert users to present these findings in the context of the relevant medical and clinical information available at any given time. In the case of our institution's (CNIO) personalized cancer medicine approach, we use mouse xenografts (also known as avatar models) to test the effects of drugs on tumors prior to considering their potential to treat patients [4]. In turn, the results of these xenograft studies are used as feedback into the system for future analyses.

Drug-related information and the tools with which to analyze it are essential for the analysis of personalized data (some of the key databases linking known gene variants to diseases and drugs are listed in Table 2). Accessing this information and integrating chemical informatics methodologies into bioinformatics systems presents new challenges for bioinformaticians and system developers.

4. Resources for Genome Analysis in Cancer

4.1. Databases

Although complex, the data required for genome analysis can usually be represented in a tabular format. Tab separated values (TSV) files are the de facto standard when sharing database resources. For a developer, these files have several practical advantages over other standard formats popular in computer science (namely XML): they are easier to read, write and parse with scripts; they are relatively succinct; the format is straightforward and the contents can be inferred from the first line of the file, which typically holds the names of the columns.

Some databases describe entities and their properties, such as: proteins and the drugs that target them; germline variations and the diseases with which they are associated; or genes along with the factors that regulate their transcription. Other databases are repositories of experimental data, such as the Gene Expression Omnibus and ArrayExpress, which contain data from microarray experiments on a wide range of
1. Reusable means that the code, in whole or in part, can be reused for some other purpose.
2. May be scriptable using web scraping.
3. May support some macro definitions and batch processing.
4. If the source code is provided and is easy to pick apart.
doi:10.1371/journal.pcbi.1002824.t003
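Two ideas from the sections above, annotation tables shared as TSV files (Section 4.1) and set-based enrichment with a hypergeometric test (Section 3.3), can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the gene names, pathway labels and the tiny inline table are hypothetical placeholders, and the functions `read_annotations`, `hypergeom_tail` and `enrichment` are names introduced here, not part of any tool discussed in the chapter. As the text points out, a plain hypergeometric test ignores gene length, the number of mutations per gene and overlapping gene clusters, so this sketch inherits those limitations.

```python
import csv
import io
from math import comb

# Hypothetical annotation table in TSV form; the first line names the columns.
ANNOTATIONS_TSV = (
    "gene\tpathway\n"
    "GENE_A\tsplicing\n"
    "GENE_B\tsplicing\n"
    "GENE_C\tsplicing\n"
    "GENE_D\tcell_cycle\n"
    "GENE_E\tcell_cycle\n"
    "GENE_F\tapoptosis\n"
    "GENE_G\tapoptosis\n"
    "GENE_H\tapoptosis\n"
)


def read_annotations(handle):
    """Map each pathway label to the set of genes annotated with it."""
    pathways = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        pathways.setdefault(row["pathway"], set()).add(row["gene"])
    return pathways


def hypergeom_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the label."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment(mutated, pathways):
    """Return an unadjusted p-value per label for a set of mutated genes."""
    background = set().union(*pathways.values())
    hits = mutated & background  # ignore genes without any annotation
    results = {}
    for label, members in pathways.items():
        overlap = len(hits & members)
        results[label] = hypergeom_tail(
            overlap, len(members), len(hits), len(background)
        )
    return results


if __name__ == "__main__":
    pathways = read_annotations(io.StringIO(ANNOTATIONS_TSV))
    mutated = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
    for label, p in sorted(enrichment(mutated, pathways).items(), key=lambda t: t[1]):
        print(f"{label}\t{p:.4f}")
```

In a real pipeline the p-values would also need correction for multiple testing, and the background set would be the genes actually covered by the sequencing experiment rather than the annotated genes alone.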
mutations in genomes of diseased individuals
5. Text Evidence: there is ample co-occurrence of gene and disease terms in scientific texts. Note that textual co-occurrence represents some form of biological evidence, which does not yet lend itself to explicit documentation.

3.1 Functional Evidence

3.1.1 Molecular interactions. Gene prioritization tools, from the earliest field pioneers like G2D [15,16,17] to the more recent ENDEAVOUR [18,19] and GeneWanderer [20,21], among many others, have used gene-gene (protein-protein) interaction and/or pathway information to prioritize candidate genes. Biologically this makes sense, because if diseases result from pathway breakdown then disabling any of the pathway components can produce similar phenotypes; i.e. genes responsible for similar diseases often participate in the same interaction networks [22,23]. To illustrate this point, consider the interaction partners of the melanocortin 4 receptor (MC4R) in the STRING [24,25] server-generated Figure 2. Note that not all known interactions are shown; the inclusion parameter is STRING server likelihood >0.9.

MC4R is a hypothalamic receptor with a primary function of energy homeostasis and food intake regulation. Functionally deleterious polymorphisms in this receptor are known to be associated with severe obesity [26,27,28]. Here, MC1R, MC3R, and MC5R are membrane-bound melanocortin (1,3,5) receptors that interact with MC4R via shared binding partners. Syndecan-3 (SDC3), agouti signaling protein precursor (ASIP), agouti related protein precursor (AgRP), pro-opiomelanocortin (POMC) and/or their processed derivatives directly bind MC4R for varied purposes of the MC4R signaling pathway. Finally, the reported interactions with Neuropeptide Y-precursor (NPY) and the growth hormone releasing protein (GHRL) are literature derived and may reflect indirect, but tight connectivity. By the token of same-pathway evidence, MC4R interactors, whether agonists or antagonists, may be predicted to be linked to obesity. In fact, mutations that negatively affect normal POMC production or processing have been shown to be obesity-associated [29,30] and gene association studies have linked AgRP with anorexia and bulimia nervosa behavioral traits [31], representative of food intake abnormalities. Other pathway participants have also been marked and extensively studied for obesity association.
The value ($d_x$) of neuron $x$ is the sum of inputs into $x$ from the previous layer of neurons ($Y_{i=1 \to n}$ in general; in our example: $I_{1 \to 3}$, $H_{1 \to 2}$). Each of the $n$ inputs is a product of the value of neuron $Y_i$ and the weight of the connection between $Y_i$ and $x$ ($w_{Y_i \to x}$).

$$d_x = \sum_{i=1}^{n} Y_i \, w_{Y_i \to x}$$

The value of the output ($z_x$) of a neuron $x$ based on its $d_x$ and its threshold $h_x$ is:

$$z_x = f(d_x + h_x)$$

In our case, the function ($f$) is a sigmoid, where $a$ is a real number constant (optimized for any given network, but generally initially chosen to be between 0.5 and 2).

$$f(x) = \frac{1}{1 + e^{-a x}}$$

Thus, to compute the output of every neuron in the network we need to use the formula:

$$z_x = \frac{1}{1 + e^{-a (d_x + h_x)}}$$

Note that to compute the output of the o_neuron ($z_O$; the prediction made by the network) we first have to compute the outputs of all h_neurons ($z_{H_{i=1 \to n}}$).
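The forward computation described above (weighted sum, added threshold, sigmoid) can be written as a short Python sketch for the example layout of three inputs, two hidden neurons and one output neuron. The function names and the argument layout are choices made for this illustration; the weights, thresholds and the constant `a` are whatever the caller supplies.

```python
import math


def sigmoid(x, a=1.0):
    """f(x) = 1 / (1 + e^(-a x)), with `a` the constant described in the text."""
    return 1.0 / (1.0 + math.exp(-a * x))


def neuron_output(inputs, weights, threshold, a=1.0):
    """z_x = f(d_x + h_x), where d_x is the weighted sum of the inputs."""
    d = sum(y * w for y, w in zip(inputs, weights))
    return sigmoid(d + threshold, a)


def forward(inputs, hidden_weights, hidden_thresholds,
            output_weights, output_threshold, a=1.0):
    """Compute all hidden outputs first, then the single output neuron."""
    z_hidden = [
        neuron_output(inputs, weights, threshold, a)
        for weights, threshold in zip(hidden_weights, hidden_thresholds)
    ]
    z_out = neuron_output(z_hidden, output_weights, output_threshold, a)
    return z_hidden, z_out


if __name__ == "__main__":
    # Arbitrary illustrative parameters for a 3-2-1 network.
    z_hidden, z_out = forward(
        inputs=[1.0, 0.0, 1.0],
        hidden_weights=[[0.2, -0.4, 0.1], [-0.3, 0.5, 0.6]],
        hidden_thresholds=[0.0, 0.1],
        output_weights=[0.7, -0.2],
        output_threshold=0.05,
    )
    print(z_hidden, z_out)
```

With all weights and thresholds set to zero every neuron outputs 0.5, which is a convenient sanity check for an implementation.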
In a supervised learning paradigm, experimentally established pairs of inputs and outputs are given to the network during
training (Figure 5C). After each input, the network output ($z_O$) is compared to the observed result ($R$). If the network makes a
classification error, its weights are adjusted to reflect that error. Establishing the best way to update weights and thresholds in
response to error is one of the major challenges of neural networks. Many techniques use some form of the delta rule, a gradient
descent-based optimization algorithm that makes changes to function variables proportionate to the negative of the
approximate gradient of the function at the given point. [It's OK if you didn't understand that sentence; the basic idea is to
change the weights and thresholds in the direction opposite to the direction of the error.] In our example, we use the delta rule
with back-propagation. This means that to compute the error of the hidden layer, the threshold of the output layer ($h_O$) and the
weights connecting the hidden layer to the output layer ($w_{H_1 \to O}$, $w_{H_2 \to O}$) need to be changed first.
1. Compute the error ($e_O$) of $z_O$ as compared to result $R$. Note that the difference between the expected and the observed values
defines the gradient ($g$) at the output neuron.
2. Compute the change in the threshold of the output layer ($\Delta h_O$), using a variable $l$ (the learning rate constant, a real number
often initialized to 0.1-0.2 and optimized for each network):

$$\Delta h_O = l \, e_O$$

$$\Delta W_{H_i \to O} = \Delta h_O \, H_i$$

$$g_i = e_O \, w_{H_i \to O}$$

$$\Delta h_{H_i} = l \, e_{H_i}$$
In the on-line updating mode of our example, weights and thresholds are altered after each set of input transmissions. Once the
network has seen the full set of input/output pairs (one epoch/iteration), training continues, re-using the same set until the
performance is satisfactory. Note that neural networks are sensitive to dataset imbalance; i.e., it is preferable to balance the
training data, such that instances of each class are presented a roughly equal number of times.
In testing, updating of the weights no longer takes place; i.e., the $z_O$ for any given set of inputs is constant over time. See
Exercise 8 for hands-on experience with testing. Note that there are many variations on the type and parameters of network learning
(propagation mode and direction, weight update rules, thresholds for stopping, etc.). Please consult the necessary literature for
more information, e.g. [134].
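One on-line training step with the delta rule and back-propagation can be sketched as follows. The update equations given in the box leave several details unstated, so this sketch fills them with common textbook choices that should be read as assumptions rather than as the authors' exact procedure: the sigmoid constant is fixed at 1; the output error is taken as $e_O = (R - z_O)\,z_O\,(1 - z_O)$; the hidden error as $e_{H_i} = g_i\,z_{H_i}\,(1 - z_{H_i})$; $H_i$ in the weight update is read as the hidden output $z_{H_i}$; and the input-to-hidden weights are updated analogously to the hidden-to-output weights. The function name `train_step` is introduced here for illustration.

```python
import math


def sigmoid(x):
    """Sigmoid with the constant a fixed at 1 (an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_step(inputs, target, w_hidden, h_hidden, w_out, h_out, rate=0.1):
    """One on-line update; returns the new parameters and the pre-update output."""
    # Forward pass: hidden outputs first, then the output neuron.
    z_hidden = [
        sigmoid(sum(y * w for y, w in zip(inputs, weights)) + threshold)
        for weights, threshold in zip(w_hidden, h_hidden)
    ]
    z_out = sigmoid(sum(z * w for z, w in zip(z_hidden, w_out)) + h_out)

    # Assumed output error: difference scaled by the sigmoid derivative.
    e_out = (target - z_out) * z_out * (1.0 - z_out)
    delta_h_out = rate * e_out  # change in the output threshold

    # Gradient passed back to each hidden neuron. This sketch uses the weights
    # from before the update; the text could also be read as using updated ones.
    g = [e_out * w for w in w_out]

    new_w_out = [w + delta_h_out * z for w, z in zip(w_out, z_hidden)]
    new_h_out = h_out + delta_h_out

    # Assumed hidden error and the corresponding threshold/weight changes.
    e_hidden = [gi * z * (1.0 - z) for gi, z in zip(g, z_hidden)]
    new_h_hidden = [h + rate * e for h, e in zip(h_hidden, e_hidden)]
    new_w_hidden = [
        [w + rate * e * y for w, y in zip(weights, inputs)]
        for weights, e in zip(w_hidden, e_hidden)
    ]
    return new_w_hidden, new_h_hidden, new_w_out, new_h_out, z_out


if __name__ == "__main__":
    params = ([[0.0] * 3, [0.0] * 3], [0.0, 0.0], [0.0, 0.0], 0.0)
    for epoch in range(3):
        *params, z_out = train_step([1.0, 0.0, 1.0], 1.0, *params)
        print(epoch, round(z_out, 4))
```

Repeating the step on the same input/target pair moves the output toward the target, which mirrors the epoch-by-epoch training described above; a real run would cycle over the full, balanced training set.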
5. The Processing

Gene prioritization methods use different algorithms to make sense of all the data they extract, including mathematical/statistical models/methods (e.g. GeneProspector [125]), fuzzy logic (e.g. ToppGene [126,127]), and artificial learning devices (e.g. PROSPECTR [54]), among others. Some methods use combinations of the above. Objectively, there is no one methodology that is better than the others for all data inputs. For more details on computational methods used in the various approaches please refer to relevant tool publications and method-specific computer science/mathematics literature, e.g. [128,129,130,131,132,133,134].

To illustrate the general concepts of relying on the various computational techniques for gene prioritization we will consider the use of an artificial neural network (ANN). Keep in mind that while methods and their requirements differ, the notion of identifying patterns in the data that may be indicative of disease-gene involvement remains the same throughout.

In simplest terms, a neural network is essentially a mathematical model that defines a function f: X→Y, where a distribution over X (the inputs to the network) is mapped to a distribution over Y (the outputs/classifications). The word "network" in the name "artificial neural network" refers to the set of connections between the neurons (Figure 5). The functionality of the network is defined by the transmission of signal from activated neurons in one layer to the neurons in another layer via established (and weighted) connections. Besides the choice and number of inputs and outputs, the parameters defining a given ANN are (1) interconnection patterns, (2) the process by which the weights of connections are selected/updated (learning function), and (3) the activation thresholds (functions) of any one given neuron. Training a network means optimizing these parameters using an existing set of inputs (and, possibly, outputs). Ultimately, a trained network could then relatively accurately recognize learned patterns in previously unseen data. For more details regarding the possible types and parameters of neural networks see [132,134]. For an illustration of network application see Box 2 and Figure 5.

6. Summary

The development of high throughput technologies has augmented our abilities to identify genetic deficiencies and inconsistencies that lead to the development of diseases. However, a large portion of information in the heaps of data that these methods produce is incomprehensible to the naked eye. Moreover, inferences that could potentially be made from combining
different studies and existing research results are beyond reach for anyone of human (not cyborg) descent. Gene prioritization methods (Table 1) have been developed to make sense of this data by extracting and combining the various pieces necessary to link genes to diseases. These methods rely on experimental work such as disease gene linkage analysis and genome wide studies to establish the search space of candidate genes that may possibly be involved in generating the observed phenotype. Further, they utilize mathematical and computational models of disease to filter the original set of genes based on gene and protein sequence, structure, function, interaction, expression, and tissue and cellular localization information. Data repositories that contain the necessary information are diverse in both content and format and require deep knowledge of the stored information to be properly interpreted. Moreover, the models utilizing the various sources assign different weights to the information they extract based on perceived quality and importance of each piece of data available in the context of the entire set of descriptors, a function unlikely to be reproduced in manual data interpretation. Thus, computational gene prioritization techniques serve as interpreters both of newly retrieved data and of information contained in previous studies. They also are the bridge that connects seemingly unrelated inferences, creating an easily comprehensible outlook on an important problem of disease gene annotation.

7. Exercises

1. Search the GAD (http://geneticassociationdb.nih.gov/) database for all genes reported to be associated with diabetes. Refine this set to find only the positively associated genes. How many are there? Why was the total data set reduced? Count the number of unique diabetes associated genes or explain why this is not feasible. How many SNPs associate these genes with diabetes? Is it realistically possible to experimentally evaluate individual effects of each SNP in this set?
2. Using STRING (http://string-db.org/), find all genes (hint: use limit of 50) interacting with insulin (confidence >0.99). Note that this confidence limit is extremely high; computational techniques would normally deal with lower limits and thus larger data sets. What is the insulin gene name used by STRING? How many interaction partners does your query return? Switch to STRING evidence view. Pick three genes connected to insulin via text mining, but without "insulin" in their full name, and find one reference for each in PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) suggesting that these genes are involved with diabetes. Report Gene IDs (e.g. MC4R), PubMed IDs and publication citations. Use PolySearch
Experiment, observation | Linkage, association, pedigree, relevant texts and other data | User provided | CAESAR [140], CANDID [141], ENDEAVOR [122], G2D [15,16,17], Gentrepid [142], GeneDistiller [121], PGMapper [143], PRINCE [144], Prioritizer [145], SUSPECTS [146], ToppGene [126,127]
Sequence, structure, meta-data | Sequence conservation, exon number, coding region length, known structural domains and sequence motifs, chromosomal location, protein localization, and other gene-centered information and predictions | SCOP [147], PFam [148,149], ProSite [150], UniProt, Entrez Gene [151], ENSEMBL [152], InterPro [153], LocDB [154], GeneCards [155], PredictProtein [156] | CAESAR, CANDID, ENDEAVOR, G2D, Gentrepid, GeneDistiller, GeneProspector [125], MedSim [157], MimMiner [158], PGMapper, PhenoPred [159], Prioritizer, PROSPECTR [54], SNPs3D [106], SUSPECTS, ToppGene
Pathway, protein-protein interaction, genetic linkage, expression | Disease-gene associations, pathways and gene-gene/protein-protein interactions/interaction predictions, and gene expression data | KEGG [160,161], STRING, Reactome [162,163], DIP [164], BioGRID [165], GEO [166,167], ArrayExpress [168], ReLiance [169] | CAESAR, CANDID, DiseaseNet [170], ENDEAVOR, G2D, Gentrepid, GeneDistiller, GeneWanderer [20], MaxLink [171], MedSim, PGMapper, PhenoPred, PRINCE, Prioritizer, SNPs3D, SUSPECTS, ToppGene
Non-human data | Information about related genes and phenotypes in other species | OrthoDisease [172], OrthoMCL [173], MGD [174], Pathbase [175] | CAESAR, CANDID, ENDEAVOR, GeneDistiller, GeneProspector, GeneWanderer, MedSim, Prioritizer, PROSPECTR, SNPs3D, SUSPECTS, ToppGene
Ontologies | Gene, disease, phenotype, and anatomic ontologies | GO, DO [176], MPO [177,178], HPO [179], eVOC [180] | CAESAR, ENDEAVOR, G2D, GeneDistiller, MedSim, PhenoPred, Prioritizer, SNPs3D, ToppGene
Mutation associations and effects Information about existing mutations, their dbSNP, PMD [111], GAD, CAESAR, CANDID, GeneProspector,
functional and structural effects and their DMDM, SNAP, PolyDoms, GeneWanderer, PROSPECTR, SNPs3D,
association with diseases, predictions of SNPdbe, SNPselector, RAVEN, SUSPECTS
functional or structural effects for the SNPeffect, PHD-SNP,
mutations in the gene in question Mutation@A Glance,
PromoLign, SIFT, PolyPhen,
PupaSNP finder, FASTSNP
Literature Mixed information of all types extracted PubMed, PubMed Central, CAESAR, CANDID, DiseaseNet,
from literature references (e.g. disease-gene HGMD [181], GeneRIF, OMIM ENDEAVOR, G2D, Gentrepid,
correlation and non-ontology based GeneDistiller, GeneProspector,
gene-function assignment) GeneWanderer, MedSim, MimMiner,
PGMapper, PolySearch [123], PRINCE,
Prioritizer, PROSPECTR, SNPs3D,
SUSPECTS, ToppGene
There is a wide range of data sources that can be used to infer the above-described pieces of evidence. The existing tools try to take advantage of many (if not all) of
them. This table summarizes the collections and methodologies that make current state of the art in gene prioritization possible. Note, not all resources mentioned here
are utilized by all gene prioritization tools nor are all data sources available listed. Moreover, some resources may be classified as more than one data-type. Many of the
resources reported here are available electronically through the gene prioritization portal [124].
doi:10.1371/journal.pcbi.1002902.t001
(http://wishart.biology.ualberta.ca/polysearch) gene to disease mapping with your gene IDs to do the same. Does your experience confirm that the functional molecular interaction evidence works? Why?

3. In AmiGO (GO term browser, http://www.geneontology.org), find the human insulin record (hint: use the insulin ID obtained above). What is the Swiss-Prot ID for insulin? Go to the term view. How many GO term associations does insulin have? Reduce the view to molecular function terms. How many terms are left? Create a tree view of these terms (hint: use the "Perform an action" dropdown). Which of the terms is the most exact in defining the likely molecular function of insulin (lowest term in a tree hierarchy)? Display gene products in GO:0005158: insulin receptor binding, reduce the set to human proteins, and look at the inferred tree. How many gene products are in this term? Pick a set of three gene products (report IDs) and use them to search PolySearch for diabetes associations. In question 3 we used the common pathway evidence to show the relationship of genes to diabetes. What type of predictive evidence is used here?

4. Search the Mammalian Phenotype Ontology for keyword diabetes and select increased susceptibility (MPO, http://www.informatics.jax.org/searches/MP_form.shtml). How many genotypes are returned? Display the genotypes and click on the Airetm1Mand/Aire+ genotype for further exploration. What is the affected gene? Click on gene title (Gene link in Nomenclature section) to display further information. What is an orthologue? What is the human orthologue of your mouse gene? Look up this gene in OMIM (http://www.ncbi.nlm.nih.gov/omim) for association with diabetes. Copy/paste the citation from OMIM, describing the gene relationship to diabetes in humans. Do your
• Annotation: any additional information about a genetic sequence. Annotation types are extremely varied, including functional, structural, regulatory, location-related, organism-specific, experimentally derived, predicted, etc.
• CNV, copy number variation: an alteration of the genome, which results in an individual having a non-standard number of copies of one or more DNA sections.
• Gene prioritization: the process of arranging possible disease-causing genes in order of their likelihood of disease involvement.
• GWAS, genome wide association studies: the examination of all genes in the genome to correlate their variation to phenotypic trait variation across individuals in a given population.
• Genetic linkage: tendency of certain genetic regions on the same chromosome to be inherited together more often than expected due to limited recombination between them.
• Genetic marker: a DNA sequence variant with a known location that can be used to identify specific subsets of individuals (cells, species, individual organisms, etc.).
• Homologue: a gene derived from a common ancestor with the reference gene. Generally, gene A is a homologue of gene B if both are derived from a common ancestor.
• Linkage disequilibrium: tendency of certain genetic regions (not necessarily on the same chromosome) to be inherited together more often than expected from considering their population frequencies. In reference to gene prioritization, this phenomenon may complicate establishment of causal genes due to their consistent inheritance in complex with non-causal genetic regions.
• Orthologues: homologous genes separated by a speciation event. Generally, gene A is an orthologue of gene B if A and B are homologous, but reside in different species. Orthologues often perform the same general function in different organisms.
• Paralogues: homologous genes separated by a duplication event (often followed by copy differentiation). Generally, gene A is a paralogue of gene B if A and B are homologous and reside in the same species. A and B can be functionally identical or, on the contrary, very different, but are often only slightly dissimilar.
• Pleiotropy: the influence of a single gene on a number of phenotypic traits.
Further Reading
• Alterovitz G, Ramoni M, eds. (2010) Knowledge-based bioinformatics: from analysis to interpretation. Padstow, Cornwall: John Wiley and Sons Ltd.
• Bromberg Y, Capriotti E, eds. (2012) SNP-SIG 2011: identification and annotation of SNPs in the context of structure, function and disease. Proceedings from SNP-SIG 2011 conference, Vienna, Austria. BMC Genomics 13 Suppl 4.
• Chen JY, Youn E, Mooney SD (2009) Connecting protein interaction data, mutations, and disease using bioinformatics. Methods Mol Biol 541: 449–461.
• Dalkilic MM, Costello JC, Clark WT, Radivojac P (2008) From protein-disease associations to disease informatics. Front Biosci 13: 3391–3407.
• Evans JA, Rzhetsky A (2011) Advancing science through mining libraries, ontologies, and communities. J Biol Chem 286: 23659–23666.
• Kann MG (2007) Protein interactions and disease: computational approaches to uncover the etiology of diseases. Brief Bioinform 8: 333–346.
• Krallinger M, Leitner F, Valencia A (2010) Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 593: 341–382.
• Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, et al. (2012) The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 21: 769–785.
• Maulik U, Bandyopadhyay S, Wang JTL, eds. (2010) Computational intelligence and pattern analysis in biological informatics. Hoboken, NJ: John Wiley and Sons, Inc.
• Mooney SD, Krishnan VG, Evani US (2010) Bioinformatic tools for identifying disease gene and SNP candidates. Methods Mol Biol 628: 307–319.
• Moreau Y, Tranchevent LC (2012) Computational tools for prioritizing candidate genes: boosting disease gene discovery. Nat Rev Genet 13: 523–536.
• Oti M, Brunner HG (2007) The modular nature of genetic diseases. Clin Genet 71: 1–11.
• Piro RM, Di Cunto F (2012) Computational approaches to disease-gene prediction: rationale, classification and successes. FEBS J 279: 678–696.
Abstract: Text mining for translational bioinformatics is a new field with tremendous research potential. It is a subfield of biomedical natural language processing that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa. Applications of text mining fall both into the category of T1 translational research (translating basic science results into new interventions) and T2 translational research, or translational research for public health. Potential use cases include better phenotyping of research subjects, and pharmacogenomic research. A variety of methods for evaluating text mining applications exist, including corpora, structured test suites, and post hoc judging. Two basic principles of linguistic structure are relevant for building text mining applications. One is that linguistic structure consists of multiple levels. The other is that every level of linguistic structure is characterized by ambiguity. There are two basic approaches to text mining: rule-based, also known as knowledge-based; and machine-learning-based, also known as statistical. Many systems are hybrids of the two approaches. Shared tasks have had a strong effect on the direction of the field. Like all translational bioinformatics software, text mining software for translational bioinformatics can be considered health-critical and should be subject to the strictest standards of quality assurance and software testing.

Citation: Cohen KB, Hunter LE (2013) Chapter 16: Text Mining for Translational Bioinformatics. PLoS Comput Biol 9(4): e1003044. doi:10.1371/journal.pcbi.1003044
Editors: Fran Lewitter, Whitehead Institute, United States of America and Maricel Kann, University of Maryland, Baltimore County, United States of America
Published April 25, 2013
Copyright: © 2013 Cohen, Hunter. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was funded in part by grants NIH 5 R01 LM009254-06, NIH 5 R01 LM008111-07, NIH 5 R01 GM083649-04, and NIH 5 R01 LM009254-03 to Lawrence E. Hunter. The funders had no role in the preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: kevin.cohen@gmail.com
This article is part of the Translational Bioinformatics collection for PLOS Computational Biology.

1. Introduction
Text mining for translational bioinformatics is a new field with enormous research potential. It is a subfield of biomedical natural language processing (BioNLP) that concerns itself directly with the problem of relating basic biomedical research to clinical practice, and vice versa.

1.1 Use Cases
The foundational question in text mining for translational bioinformatics is what the use cases are. It is not immediately obvious how the questions that text mining for translational bioinformatics should try to answer are different from the questions that are approached in BioNLP in general. The answer lies at least in part in the nature of the specific kinds of information that text mining should try to gather, and in the uses to which that information is intended to be put. However, these probably only scratch the surface of the domain of text mining for translational bioinformatics, and the latter has yet to be clearly defined.

One step in the direction of a definition for use cases for text mining for translational bioinformatics is to determine classes of information found in clinical text that would be useful for basic biological scientists, and classes of information found in the basic science literature that would be of use to clinicians. This in itself would be a step away from the usual task definitions of BioNLP, which tend to focus either on finding biological information for biologists, or on finding clinical information for clinicians. However, it is likely that there is no single set of data that would fit the needs of biological scientists on the one hand or clinicians on the other, and that information needs will have to be defined on a bespoke basis for any given translational bioinformatics task.

One potential application is better phenotyping. Experimental experience indicates that strict phenotyping of patients improves the ability to find disease genes. When phenotyping is too broad, the genetic association may be obscured by variability in the patient population. An example of the advantage of strict phenotyping comes from the work of [1,2]. They worked with patients with diagnoses of pulmonary fibrosis. However, having a diagnosis of pulmonary fibrosis in the medical record was not, in itself, a strict enough definition of the phenotype for their work [1]. They defined strict criteria for study inclusion and ensured that patients met the criteria through a number of methods, including manual review of the medical record. With their sharpened definition of the phenotype, they were able to identify 102 genes that were up-regulated and 89 genes that were down-regulated in the study group. This included Plunc (palate, lung and nasal epithelium associated), a gene not previously associated with pulmonary fibrosis. Automation of the step of manually reviewing medical records would potentially allow for the inclusion or exclusion of much larger populations of patients in similar studies.

Another use for text mining in translational bioinformatics is aiding in the preparation of Cochrane reviews and other meta-analyses of experimental studies. Again, text mining could be used to identify cohorts that should be included in the meta-analysis, as well as to determine P-values and other indicators of significance levels.
Informatics for Integrating Biology and the Bedside (i2b2 - https://www.i2b2.org/) | National Center for Biomedical Computing with focus on translational research that facilitates and provides data sets for clinical natural language processing research
Gene Ontology (https://www.geneontology.org) | Controlled vocabulary with relationships including partonymy and inheritance, designed for describing gene functions, broadly construed
Entrez Gene (https://www.ncbi.nlm.nih.gov/gene) | Source for gene names, symbols, and synonyms; also the source for GeneRIFs and SUMMARY fields
PubMed/MEDLINE (https://www.ncbi.nlm.nih.gov/pubmed) | The National Library of Medicine's database of abstracts of biomedical publications (MEDLINE) and search interface for accessing them (PubMed)
Unified Medical Language System (https://www.nlm.nih.gov/research/umls/) | Large lexical and conceptual resource, including the UMLS Metathesaurus, which aggregates a large number of biomedical and some genomic vocabularies
SWISSPROT (https://www.uniprot.org/) | Database of information about proteins with literature references, useful as a gold standard
PharmGKB (https://www.pharmgkb.org/) | Database of relationships between a number of clinical, genomic, and other entities with literature references, useful as a gold standard
Comparative Toxicogenomics Database (https://ctdbase.org/) | Database of relationships between genes, diseases, and chemicals, with literature references, useful as a gold standard
Various terminological resources, data sources, and gold-standard databases for biomedical natural language processing.
doi:10.1371/journal.pcbi.1003044.t001
swering typically involves determining the type of answer that is expected (a time? a location? a person?), formulating a query that will return documents containing the answer, and then finding the answer within the documents that are returned. Various types of questions have varying degrees of difficulty. The best results are achieved for so-called factoid questions, such as "where are lipid rafts located?", while "why" questions are very difficult. In the biomedical domain, definition questions have been extensively studied [34–36].

The medical domain presents some unique challenges. For example, questions beginning with "when" might require times as their answer (e.g., "when does blastocyst formation occur in humans?"), but also may require very different sorts of answers, e.g., "when should antibiotics be given for a sore throat?" [37]. A shared task in 2005 involved a variety of types of genomic questions adhering to specific templates (and thus overlapping with information extraction), such as "what is the biological impact of a mutation in the gene X?".

4.7 Summarization
Summarization is the task of taking a document or set of documents as input and returning a shorter text that conveys the information in the longer text(s). There is a great need for this capability in the biomedical domain: a search in PubMed/MEDLINE for the gene p53 returns 56,464 publications as of the date of writing.

In the medical domain, summarization has been applied to clinical notes, journal articles, and a variety of other input types. For example, one system, MITRE's MiTAP, does multi-document summarization of epidemiological reports, newswire feeds, email, online news, television news, and radio news to detect disease outbreaks.

In the genomics domain, there have been three major areas of summarization research. One has been the automatic generation of GeneRIFs. GeneRIFs are short text snippets, less than 255 characters in length, associated with specific Entrez Gene entries. Typically they are manually cut-and-pasted from article abstracts. Lu et al. developed a method for finding them automatically using a variant of the Edmundsonian paradigm, a classic approach to single-document summarization [38,39]. In the Edmundsonian paradigm, sentences in a document are given points according to a relatively simple set of features, including position in the document, presence of cue words (words that indicate that a sentence is likely to be a good summary sentence), and absence of stigma words (words that indicate that a sentence is not likely to be a good summary sentence).

Another summarization problem is finding the best sentence for asserting a protein-protein interaction. This task was made popular by the BioCreative shared task. The idea is to boil down a set of articles to the single sentence that best gives evidence that the interaction occurs. Again, simple features work well, such as looking for references to figures or tables [40].

Finally, a small body of work on the generation of SUMMARY fields has been seen. More sophisticated measures have been applied here, such as the PageRank algorithm [41].

5. Shared Tasks
The natural language processing community has a long history of evaluating applications through the shared task paradigm. Similar to CASP, a shared task involves agreeing on a task definition, a data set, and a scoring mechanism. In biomedical text mining, shared tasks have had a strong effect on the direction of the field. There have been both clinically oriented and genomically oriented shared tasks.

In the clinical domain, the 2007 NLP Challenge [42] involved assigning ICD9-CM codes to radiology reports of chest x-rays and renal procedures. Also in the clinical domain, i2b2 has sponsored a number of shared tasks, described in Section 1.1.2. (At the time of writing, the National Institute of Standards and Technology is preparing a shared task involving electronic medical records under the aegis of the annual Text Retrieval Conference. The task definition is not yet defined.)

In the genomics domain, the predominant shared tasks have been the BioCreative shared tasks and a five-year series of tasks in a special genomics track of the Text Retrieval Conference [43]. Some of the tasks were directly relevant to translational bioinformatics. The tasks varied from year to year and included information retrieval (Section 4.1), production of GeneRIFs (Section 4.7), document classification (Section 4.2), and question-answering (Section 4.6). A topic that was frequently investigated by participants was the contribution of controlled vocabularies to performance on text mining tasks. Results were equivocal; it was found that
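The Edmundsonian scoring idea from Section 4.7 (points for position, points for cue words, a penalty for stigma words) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `edmundson_scores`, the unit weights, the rule that only the first and last sentences earn a position bonus, and the word lists are all hypothetical and are not taken from Lu et al. or from Edmundson's original feature set.

```python
import re

def edmundson_scores(sentences, cue_words, stigma_words, position_bonus=1.0):
    """Score each sentence with three simple Edmundson-style features.

    Assumptions (hypothetical, for illustration only):
      - the first and last sentences receive a fixed position bonus;
      - each distinct cue word adds one point;
      - each distinct stigma word subtracts one point.
    """
    cue = set(cue_words)
    stigma = set(stigma_words)
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        # Lower-case word tokens; punctuation is discarded.
        tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        score = position_bonus if i in (0, last) else 0.0
        score += len(tokens & cue)       # presence of cue words
        score -= len(tokens & stigma)    # absence of stigma words is rewarded
        scores.append(score)
    return scores

# Toy abstract and toy word lists (invented for this example).
abstract = [
    "We studied gene X in mice.",
    "Perhaps more work is needed.",
    "In conclusion, gene X significantly regulates apoptosis.",
]
scores = edmundson_scores(abstract, {"conclusion", "significantly"}, {"perhaps"})
best_sentence = abstract[scores.index(max(scores))]
```

In a GeneRIF-style setting the highest-scoring sentence would be the candidate snippet; a real system would tune the weights and word lists on annotated data.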
Figure 1. The flowchart of bioimage informatics for drug and target discovery.
doi:10.1371/journal.pcbi.1003043.g001
[18], Icy [19], GCELLIQ [20], and PhenoRipper [21] can be used for the multicolor cell image analysis.

2.2 Live-cell Imaging-based Studies for Cell Cycle and Migration Regulator Discovery
Two hallmarks of cancer cells are uncontrolled cell proliferation and migration. These are also good phenotypes for screening drugs and targets that regulate cell cycle progression and cell migration in time-lapse images. For example, out of 22,000 human genes, about 600 were identified as related to mitosis by using live cell (time-lapse) imaging and RNAi treatment in the MitoCheck project (www.mitocheck.org) [22,23]. The project is now being expanded to study how these identified genes work together to regulate cell mitosis, in which mistakes can lead to cancer, in the MitoSys (systems biology of mitosis) project (http://www.mitosys.org/). Also, live cell imaging of HeLa cells was used to discover drugs and compounds that regulate cell mitosis in [24,25]. Moreover, the time-lapse images of live cells were used to study the dynamic behaviors of stem cells in [26,27] and predict cell fates of neural progenitor cells using their dynamic behaviors in [28]. Figure 3 shows a single frame of live HeLa cell images and the images of four cell cycle phases: interphase, prophase, metaphase, and anaphase [25]. The publicly available software packages for time-lapse image analysis include, for example, the plugins of CellProfiler [17], Fiji [18], BioimageXD [29], Icy [19], CellCognition [23], DCELLIQ [30], and TLM-Tracker [31].

2.3 Neuron Imaging-based Studies for Neurodegenerative Disease Drug and Target Discovery
Neuronal morphology is illustrative of neuronal function and can be instructive toward the dysfunctions seen in neurodegenerative diseases, such as Alzheimer's and Parkinson's disease [32,33]. For example, the 3D neuron synaptic morphological and structural changes were investigated by using super-resolution microscopy, e.g., STED microscopy, to study brain functions and disorders under different stimulations [34–36]. Also other advanced optical techniques were proposed in [37,38] to image and reconstruct the 3D structure of live neurons. Figure 4 shows an example of 2D neuron image used in [39]. In [40], neuronal degeneration was mimicked by treating mice with different dosages of Aβ peptide, which may cause the loss of neurites, and drugs that rescue the loss of neurites were identified as candidates for AD therapy. Figure 5 shows an example of neurites and nuclei images acquired in [40]. To quantitatively analyze neuron images, a number of publicly available software packages have been developed, for example, NeurphologyJ [41], NeuronJ [42], NeuriteTracer (Fiji plugin) [43], NeuriteIQ [44], NeuronMetrics [45], NeuronStudio
[46,47], NeuronIQ [39,48], and Vaa3D [49,50]. A review of software packages for neuron image analysis was also reported in [51].

2.4 Caenorhabditis elegans Imaging-based Studies for Drug and Target Discovery
Caenorhabditis elegans (C. elegans) is a common animal model for drug and target discovery. Consisting of only hundreds of cells, it is an excellent model to study cellular development and organization. For example, the invariant embryonic development of C. elegans was recorded by time-lapse imaging, and the embryonic lineages of each cell were then reconstructed by cell tracking to study the functions of genes underpinning the development process [52–54]. Moreover, an atlas of C. elegans, which quantified the nuclear locations and statistics on their spatial patterns in development, was built based on the confocal image stacks via the software, CellExplorer [55,56]. In addition, CellProfiler provides an image analysis pipeline for delineating bodies, and quantifying the expression changes of specific proteins, e.g., clec-60 and pharynx, of individual C. elegans under different treatments [57].

These examples have demonstrated diverse cellular phenotypes in different image-based studies. To quantify and analyze the complex phenotypic changes of cells and sub-cellular components from large scale image data, bioimage informatics approaches are needed.

3. Quantitative Bioimage Analysis
After image acquisition, phenotypic changes need to be quantified for characterizing functions of drugs and targets. Due to the large amounts of images generated, it is not feasible to quantify the images manually. Therefore, automated image analysis is essential for the quantification of phenotypic changes. In general, the challenges of quantitative image analysis include object detection, segmentation, tracking, and visualization. The word "object" in this context means the object captured in the bioimages, e.g., the nucleus and cell. The following sections will introduce techniques used to address these challenges.

3.1 Object Detection
Object detection is to detect the locations of individual objects. It is important, especially when the objects cluster together, to facilitate the segmentation task by providing the position and initial boundary information of individual objects. Based on the shape of objects, two
detection, a linking process is needed to connect these center line points into continuous center lines based on their direction and distance. For example, in NeuronJ, Dijkstra's shortest-path was used based on the Gaussian derivative features to detect the neuron's centerline between two given points on the neuron [42]. Figure 7 provides an example of neurite images, and Figure 8 shows the corresponding centerline detection results [44] based on the local Gaussian derivative features. In addition to the approaches based on Gaussian derivatives, there are other tubular structure detection approaches. For example, four sets of kernels (edge detectors) were designed to detect the neuron edges and centerlines [75], and super-ellipsoid modeling was designed to fit the local geometry of blood vessels [76].

Moreover, machine learning-based tubular structure detection is a widely used method. For example, blood vessel detection in retinal images is a representative tubular structure detection task with the supervised learning approaches [77,78]. In these methods, the local features, e.g., intensity and wavelet features, of an image patch containing a given pixel are calculated, and then a classifier is trained using these local features based on a set of training points [77,78]. A good survey of blood vessel (tube structure) detection approaches in retinal images was reported in [79]. For more approaches and details of tubular structure detection, readers should refer to the aforementioned neuron image analysis software packages.

In summary, blobs and tubes are the dominating structures in bioimages. The detection results provide the position and initial boundary information for the quantification and segmentation processes. In other words, the segmentation process tries to delineate boundaries of objects starting from the detected centers or centerlines of objects. Without the guidance of detection results, object segmentation would be more challenging.

3.2 Object Segmentation
The goal of object segmentation is to delineate boundaries of individual objects of interest in images. Segmentation is the basis for quantifying phenotypic changes. Although a number of image segmentation methods have been reported, this remains an open challenge due to the complexity of morphological appearances of objects. This section introduces a number of widely used segmentation methods.

Threshold segmentation [80] is the simplest method:

$$T(I) = \begin{cases} 1, & t_2 > I(x,y) > t_1 \\ 0, & \text{otherwise} \end{cases}$$

where I(x,y) is the image, and t1 and t2 are the intensity thresholds. As an extension of the thresholding method, Fuzzy-C-
Figure 8. An example of neurite centerline detection. (A) The centerline confidence image obtained by using the local Gaussian derivative
features. Higher intensity indicates higher confidence of pixels on the centerlines. (B) The neurite centerline detection result image. Different colors
indicate the disconnected branches.
doi:10.1371/journal.pcbi.1003043.g008
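The threshold segmentation rule T(I) from Section 3.2 (a pixel is labeled 1 when its intensity lies strictly between the two thresholds t1 and t2, and 0 otherwise) can be sketched with NumPy. The function name `threshold_segment` and the toy image are assumptions made for illustration; the strict inequalities follow the rule as stated in the text.

```python
import numpy as np

def threshold_segment(image, t1, t2):
    """Binary mask: 1 where t1 < I(x, y) < t2, and 0 otherwise."""
    image = np.asarray(image)
    # Element-wise comparison against both thresholds, combined with logical AND.
    return ((image > t1) & (image < t2)).astype(np.uint8)

# Toy 2 x 3 "image" with hypothetical intensities.
toy_image = np.array([[0, 50, 100],
                      [150, 200, 250]])
mask = threshold_segment(toy_image, t1=40, t2=160)
# mask is [[0, 1, 1],
#          [1, 0, 0]]
```

In practice the thresholds would be chosen per image, for example from the intensity histogram, rather than fixed by hand as in this toy case.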
information to separate the image into regions with similar intensity. Region competition-based active contour models could solve the weak boundary problem; however, they require that the intensity of touching objects is separable [87]. To implement these active contour models, level set representation is widely used [92]. Level set is an n+1 dimensional function that can easily represent any n dimensional shape without parameters. The inside regions of objects are indicated by using positive levels, and outside regions are represented using negative levels. For this implementation, the initial boundary (zero level) is required, and the signed distance function is often used to initialize the level set function [92,93]. To evolve the level set functions (grow the boundaries of objects), the following two equations are classical models. The first equation is often called geodesic active contour (GAC) [86], and the second one is often named the Chan and Vese active contour (CV) [87].

GAC level set evolution equation:

$$\frac{d\psi}{dt} = \alpha \nabla g \cdot \nabla \psi + g\kappa + c\,|\nabla \psi|$$

CV level set evolution equation:

$$\frac{d\psi}{dt} = \delta_\varepsilon(\psi)\left[\mu \cdot \kappa - \nu - \lambda_1 (I - c_1)^2 + \lambda_2 (I - c_2)^2\right]$$

where $\psi$ denotes the level-set function, $g$ indicates the gradient function, $\nabla$ is the gradient operator, and $c$, $c_1$, and $c_2$ are constant variables. $\delta_\varepsilon(x) = \frac{1}{\pi}\,\frac{\varepsilon}{\varepsilon^2 + x^2}$ is an approximation of the Dirac function (to indicate the boundary bands), which is the derivative function of the Heaviside function denoting inside/outside regions of objects:

$$H(x) = \frac{1}{2}\left(1 + \frac{2}{\pi}\arctan\left(\frac{x}{\varepsilon}\right)\right),$$

and the curvature term,

$$\kappa = \operatorname{div}\left(\frac{\nabla \psi}{|\nabla \psi|}\right) = \frac{\psi_{xx}\psi_y^2 - 2\psi_x\psi_{xy}\psi_y + \psi_x^2\psi_{yy}}{\left(\psi_x^2 + \psi_y^2\right)^{3/2}},$$

indicates the local smoothness of boundaries, and div is the divergence operation. Figure 10 demonstrates the segmentation result using GAC level set approach.

An additional segmentation method, Voronoi segmentation [94], first defines the centers of objects and then constructs the boundaries between two objects on the pixels, from which the distances are the same to the two centers. In CellProfiler, the Voronoi segmentation method was extended by considering the local intensity variations in the distance metric to achieve better segmentation results [95]. This method is fast and generates level set comparable
results. The graph cut segmentation method views the image as a graph, in which each pixel is a vertex and adjacent pixels are connected [63,96,97]. It cuts the graph into several subgraphs along the regions where adjacent pixels differ most in their properties, e.g., intensity.

Different from the aforementioned segmentation approaches, local feature and machine learning-based segmentation approaches are implemented, for example, in Fiji (trainable segmentation plugin) [18] and Ilastik [73]. Users can interactively select training pixels/voxels or small image patches, and classifiers are then automatically trained on the features of the training pixels or voxels (or patches) to predict the classes, e.g., cells or background, of the pixels or voxels (or patches) in a new image. The image patches could be circular or square neighborhoods of a given point, and could also be regions (superpixels) obtained by clustering analysis. For example, Simple Linear Iterative Clustering (SLIC) made use of the intensity and coordinate information of pixels to separate the image into uniformly sized and biologically meaningful regions [98,99], and machine learning approaches were then used to identify the regions of interest, e.g., boundary superpixels, for object segmentation [99].

3.3 Object Tracking

To study the dynamic behaviors and phenotypic changes of objects over time (e.g., cell cycle progression and migration), object tracking using time-lapse image sequences is necessary. Figure 11 shows a HeLa cell division process in four frames at different time points, and Figures 12 and 13 show examples of cell migration trajectories and cell lineages reconstructed from the time-lapse images of HeLa cells [30]. Object tracking is a challenging task due to the complex dynamic behaviors of objects over time. In general, cell tracking approaches can be classified into three categories: model evolution-based tracking, spatial-temporal volume segmentation-based tracking, and detection and segmentation-based tracking.

In the model evolution-based tracking approaches, cells or nuclei are initially detected and segmented in the first frame, and then their boundaries and positions evolve frame by frame. Some tracking techniques in this category are mean-shift [100] and parametric active contours [88,101]. However, neither mean-shift nor parametric active contours can cope well with cell division and nuclei clusters. Though the level set method enables topological change, e.g., cell division, it also allows the fusion of overlapping cells. Extending these methods to cope with
these tracking challenges is nontrivial and increases computation time [90,102–104]. For example, the coupled geometric active contours model was proposed in [105] to prevent object fusion by representing each object with an independent level set, and this was further extended to 3D cell tracking in [90]. The other approach that explicitly blocks cell merging is to introduce topology constraints, i.e., labeling object regions with different numbers or colors. For example, a region labeling map was employed in [27,106] to deal with cell merging, and planar graph vertex coloring was employed to separate neighboring contours, so that four separate level set functions suffice to handle cell merging [107], based on the four-color theorem [108,109].

For the spatial-temporal volume segmentation-based tracking, 2D image sequences were viewed as 3D volume data (2D spatial + temporal), and shape and size constrained level set segmentation approaches were applied to segment the traces of objects and reconstruct the cell lineage in [110–112].

For detection and segmentation-based tracking, objects are first detected and segmented, and then these objects are associated between two consecutive frames, based on their morphology, position, and motion [30,113–115]. These tracking approaches are usually fast, but their accuracy depends closely on the detection and segmentation results, the similarity measurements, and the association strategies. The cell center position, shape, intensity, migration distance, and spatial context information were used as similarity measurements in [113,115]. For the association approaches, an overlap region and distance based method was employed in [114], in which objects in the current frame were associated with the nearest objects in the next frame. The false matches, e.g., many-to-one or one-to-many, were then corrected through post-processing. Different from the individual object association above, all segmented objects were simultaneously associated by using integer programming optimization in [113,116]:

$$x^* = \arg\max_{x \in \{0,1\}^N} S x, \quad \text{s.t. } A x \le 1,$$

where $A x \le 1$ restricts each object to be associated with at most one object, A is an (m+n)×N matrix, the first m rows correspond to the m objects in frame t, and the last n rows denote the objects in frame t+1. N is the
number of all possible associations among objects in frame t and frame t+1. S is a 1×N similarity matrix, and $S_j = S(c^k_{t+1} \mid c^i_t)$. For the unmatched cells, e.g., newborn or newly entered cells, a linking process is usually needed to link them to their parent cells or to start a new trajectory. This optimal matching strategy was also used in [27] to link object trajectory segments, i.e., broken or newly appearing trajectories.

As an alternative to frame-by-frame association strategies, Bayesian filters, e.g., the particle filter and the Interacting Multiple Model (IMM) filter [117,118], are also used for object tracking. The goal of these filters is to recursively estimate a model of object migration in an image sequence. Generally, in the Bayesian methods, a state vector, $x_t$, is defined to indicate the characteristics of objects, e.g., position, velocity, and intensity. Then, two models are defined based on the state vector. The first is the state evolution model, $x_t = f_t(x_{t-1}) + \varepsilon_t$, where $f_t$ is the state evolution function at time point $t$ and $\varepsilon_t$ is noise, e.g., Gaussian noise; this model describes the evolution of the state. The other is the observation model, $z_t = h_t(x_t) + \gamma_t$, where $h_t$ is the map function and $\gamma_t$ is noise; this model maps the state vector into observations that are measurable in the image. Based on the two models and Bayes' rule, the posterior density of the object state is estimated as follows:

$$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1}), \qquad p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1},$$

where $p(z_t \mid x_t)$ is defined based on the observation model, and $p(x_t \mid x_{t-1})$ is defined based on the state evolution model. The basic principle of the particle filter is to approximate the posterior density by a set of stochastically drawn samples (particles), and it has been employed for object tracking in fluorescent images in [119–121]. In some biological studies, the motion dynamics of objects are complex, so a single motion model might not describe them well. The IMM filter is employed to incorporate multiple motion models, and the motion model of an object can transition from one to another in the next frame with certain probabilities. For example, the IMM filter with three motion models, i.e., random walk, first-order, and second-order linear extrapolation, was used for 3D object tracking in [118], and for 2D cell tracking in [27].

3.4 Image Visualization

Most of the aforementioned software packages provide functions to visualize 2D images and the analysis results. However,
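
The Bayesian recursion described in Section 3.3 can be made concrete with a bootstrap particle filter. The following is a minimal sketch in Python, not the implementation used in the cited tracking studies. It assumes a one-dimensional state (position only), a random-walk state evolution model, an identity observation map with Gaussian noise, and multinomial resampling. The function names `simulate` and `particle_filter` and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate(n_frames=30, q_sigma=0.5, r_sigma=0.5, seed=1):
    """Simulate a 1D random-walk trajectory and noisy position observations."""
    rng = random.Random(seed)
    x, truth, observations = 0.0, [], []
    for _ in range(n_frames):
        x += rng.gauss(0.0, q_sigma)                        # state evolution
        truth.append(x)
        observations.append(x + rng.gauss(0.0, r_sigma))    # observation
    return truth, observations


def particle_filter(observations, n_particles=1000,
                    q_sigma=0.5, r_sigma=0.5, seed=0):
    """Bootstrap particle filter for a 1D position state (illustrative only)."""
    rng = random.Random(seed)
    # Assumed initialization: particles scattered around the first observation.
    particles = [rng.gauss(observations[0], 1.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Prediction: draw from p(x_t | x_{t-1}) using the random-walk model.
        particles = [x + rng.gauss(0.0, q_sigma) for x in particles]
        # Update: weight each particle by the likelihood p(z_t | x_t).
        weights = [math.exp(-0.5 * ((z - x) / r_sigma) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Guard against numerical underflow: fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        # Posterior mean approximated by the weighted particle average.
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resampling: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


truth, observations = simulate()
estimates = particle_filter(observations)
rmse = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))
```

Here `rmse` summarizes how closely the filtered positions follow the simulated trajectory. In a real tracking setting the state vector would typically also include velocity and intensity, and the observation model would be derived from image measurements.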
phenotypic feature space. Drugs with similar d-profiles were found to have the same functional targets, and thus these profiles could be used to predict the functions of new drugs or compounds.

5.3 Factor-based Multidimensional Profiling Analysis

In the set of numerical features, some are highly correlated within groups but poorly correlated with features in other groups. One possible explanation is that the features in one group measure a common biological process, such as an increase or decrease of nuclei size. The challenge in using these numerical features directly is that the biological meanings of certain phenotypic features are often vague. It is thus difficult to explain the phenotypic changes represented by these numerical features. To remove the redundant features and make the biological meanings of the numerical features explicit, factor analysis was employed in [12]. The basic principle of factor analysis is to determine the independent common traits (factors). Mathematically, it is formulated by the following equation.

$$\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix} = X_{mn} = \mu_{mn} + L_{mk} F_{kn} + \varepsilon_{mn}$$

where $\mu_{mn}$ is the mean value of each row, $F_{kn}$ denotes the $k$ factors, i.e., the coordinates of the $n$ samples in the new $k$-dimensional space, and $L_{mk}$ is the loading matrix that relates the $m$ features to the $k$ factors. In other words, the $k$ factors are independent and are the underlying biological processes that regulate the phenotypic changes. For example, six factors representing nuclei size, DNA replication, chromosome condensation, nuclei morphology, EdU texture, and nuclei ellipticity were obtained through factor analysis in [12].

5.4 Subpopulation-based Heterogeneity Profiling Analysis

In image-based screening studies, heterogeneous phenotypes often appeared within a cell population, as shown in Figures 2 and 16, which indicated that individual cells responded to perturbations differently [142]. However, the heterogeneity information was ignored in most screening studies. To make better use of the heterogeneous phenotypic responses, a subpopulation-based approach was proposed to study the phenotypic heterogeneity for characterizing drug effects in [13], and for distinguishing cell populations with distinct drug sensitivities in [14]. The basic principle of the subpopulation-based method is to characterize the phenotypic heterogeneity with a mixture of phenotypically distinct subpopulations. This idea was implemented by fitting a Gaussian mixture model (GMM) in the numerical feature space, where each component of the GMM represents a distinct subpopulation. To profile the effects of perturbations, cells collected from perturbation conditions were first classified into
Figure 17. An illustration of drug profiling using the normal vector of the SVM hyperplane. The red and blue spots indicate the spatial distribution of cells in the numeric feature space. The yellow arrow represents the normal vector of the hyperplane (the blue plane). The top left and bottom right (MB231 cell) images are from drug-treated and control conditions, respectively.
doi:10.1371/journal.pcbi.1003043.g017
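
The profiling idea illustrated in Figure 17 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the cited studies. The toy two-feature data are invented and assumed to be already standardized. The linear SVM is trained by plain full-batch subgradient descent on the regularized hinge loss rather than by a dedicated solver. The names `train_linear_svm` and `unit_normal`, and the values of `lam`, `lr`, and `epochs`, are hypothetical.

```python
import math


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:              # point violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def unit_normal(w):
    """Normalize the hyperplane normal vector to unit length."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [wj / norm for wj in w]


# Invented toy data: each row is one cell described by two standardized features.
treated = [[1.0, 0.2], [1.2, -0.1], [0.8, 0.1], [1.0, -0.2]]
control = [[-1.0, -0.2], [-1.2, 0.1], [-0.8, -0.1], [-1.0, 0.2]]
X = control + treated
y = [-1] * len(control) + [+1] * len(treated)

w, b = train_linear_svm(X, y)
d_profile = unit_normal(w)   # direction of the treated-versus-control shift
```

In this toy example the treated and control cells differ mainly in the first feature, so the resulting `d_profile` points predominantly along that axis. Profiles obtained this way for different drugs could then be compared, for instance by the angle between them.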
doi:10.1371/journal.pcbi.1003043.t001
one of the subpopulations, and then the proportions of cells belonging to each subpopulation were calculated as features to further characterize the effects of perturbations. For more details, please refer to [13,14].

6. Publicly Available Bioimage Informatics Software Packages

A number of commercial bioimage informatics software tools, e.g., GE-InCellAnalyzer [143], Cellomics [144], Cellumen [145], MetaXpress [146], and BD Pathway [147], have been developed and are widely used in pharmaceutical companies and academic institutions. In addition to the commercially available software packages, there are a number of publicly available bioimage informatics software packages [9], which provide even more powerful functions with cutting-edge algorithms and screening-specific analysis pipelines. For convenience, these popular software packages are listed in Table 1. It is difficult to summarize all of their capabilities and functions because many of them are designed for flexible bioimage analysis with a set of diverse plugins and function modules, e.g., Fiji, CellProfiler, Icy, and BioImageXD. Selecting software for a specific application is also non-trivial, and the best way might be to check the websites and online documentation of the packages. In addition to the bioimage informatics software packages, there are other software packages, including microscope control software for image acquisition (µManager and ScanImage) and image database software (OME, Bisque, and OMERO.searcher). Also, certain cellular image simulation software packages, e.g., CellOrganizer and SimuCell, provide useful insights into the organization of proteins of interest within individual cells. These software packages represent the prevalent directions of bioimage informatics research, and thus their websites and features are worth checking.

7. Summary

With the advances of fluorescent microscopy and robotic handling, image-
Further Reading
• Taylor DL (2010) A personal perspective on high-content screening (HCS): from the beginning. J Biomol Screen 15(7): 720–725.
• Shariff A, Kangas J, Coelho LP, Quinn S, Murphy RF (2010) Automated image analysis for high-content screening and analysis. J Biomol Screen 15(7): 726–734.
• Dufour A, Shinin V, Tajbakhsh S, Guillen-Aghion N, Olivo-Marin JC, et al. (2005) Segmenting and tracking fluorescent cells in dynamic 3-D microscopy with coupled active surfaces. IEEE Trans Image Process 14: 1396–1410.
• Danuser G (2011) Computer vision in cell biology. Cell 147(5): 973–978.
• Murray JI, Bao Z, Boyle TJ, Boeck ME, Mericle BL, et al. (2008) Automated analysis of embryonic gene expression with cellular resolution in C. elegans. Nat Methods 5: 703–709.
• Rodriguez A, Ehlenberger DB, Dickstein DL, Hof PR, Wearne SL (2008) Automated three-dimensional detection and shape classification of dendritic spines from fluorescence microscopy images. PLoS ONE 3: e1997. doi:10.1371/journal.pone.0001997.
• Bakal C, Aach J, Church G, Perrimon N (2007) Quantitative morphological signatures define local signaling networks regulating cell morphology. Science 316(5832): 1753–1756.
• Neumann B, Walter T, Heriche JK, Bulkescher J, Erfle H, et al. (2010) Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes. Nature 464: 721–727.
• Allan C, Burel JM, Moore J, Blackburn C, Linkert M, et al. (2012) OMERO: flexible, model-driven data management for experimental biology. Nat Methods 9: 245–253.
• Eliceiri KW, Berthold MR, Goldberg IG, Ibanez L, Manjunath BS, et al. (2012) Biological imaging software tools. Nat Methods 9: 697–710.
Glossary
• Cellular phenotype: A cellular phenotype refers to a distinct morphological appearance or behavior of cells as observed under fluorescent, phase contrast, or bright field microscopy.
• Green fluorescent protein (GFP): GFP is used as a protein reporter by attaching it to specific proteins; it exhibits bright green fluorescence when exposed to light in the blue to ultraviolet range.
• Fluorescence microscope: A fluorescence microscope is an optical microscope that uses a higher-intensity light source to excite a fluorescent species in a sample of interest.
• High content analysis (HCA): HCA focuses on extracting and analyzing quantitative phenotypic data automatically from large amounts of cell images with automated image analysis, computer vision, and machine learning approaches.
• High content screening (HCS): Applications of HCA for screening drugs and targets are referred to as HCS, which aims to identify compounds or genes that cause desired phenotypic changes.
• RNA interference (RNAi): RNAi is a biological process in which RNA molecules inhibit gene expression, typically by causing the destruction of specific mRNA molecules.
• Automated image analysis: Automated image analysis aims to quantitatively analyze images by computer programs with minimal human intervention.
• Object detection: Object detection is to automatically detect the locations of objects of interest in images.
• Blob structure detection: Blob structure detection is to detect the positions of objects of interest that have circle- or sphere-like structures, e.g., nuclei and particles.
• Tube structure detection: Tube structure detection is to detect the centerlines of objects that have long tube-like structures, e.g., neuron dendrites and blood vessels.
• Object segmentation: Object segmentation is to automatically delineate the boundaries of objects of interest in images.
• Object tracking: Object tracking is to identify the motion traces of objects of interest in time-lapse images.
• Feature extraction: Feature extraction is to quantify the morphological appearances of segmented objects by calculating a set of numerical features.
• Phenotype classification: Phenotype classification is to assign each segmented object to a sub-group whose phenotype is distinct from those of other sub-groups.
• Cell cycle phase identification: Cell cycle phase identification is to automatically identify the cell cycle phase that a given cell is in according to its morphological appearance.
References
1. Tsien RY (1998) The green fluorescent protein. Annu Rev Biochem 67: 509–544.
2. Lichtman JW, Conchello JA (2005) Fluorescence microscopy. Nat Methods 2: 910–919.
3. Shariff A, Kangas J, Coelho LP, Quinn S, Murphy RF (2010) Automated image analysis for high-content screening and analysis. J Biomol Screen 15: 726–734.
4. Danuser G (2011) Computer vision in cell biology. Cell 147: 973–978.
5. Taylor DL (2010) A personal perspective on high-content screening (HCS): from the beginning. J Biomol Screen 15(7): 720–725.
6. Abraham VC, Taylor DL, Haskins JR (2004) High content screening applied to large-scale cell biology. Trends Biotechnol 22: 15–22.
7. Giuliano KA, DeBiasio RL, Dunlay RT, Gough A, Volosky JM, et al. (1997) High-content screening: a new approach to easing key bottlenecks in the drug discovery process. J Biomol Screen 2: 249–259.
8. Peng H (2008) Bioimage informatics: a new area of engineering biology. Bioinformatics 24: 1827–1836.
9. Eliceiri KW, Berthold MR, Goldberg IG, Ibanez L, Manjunath BS, et al. (2012) Biological imaging software tools. Nat Methods 9: 697–710.
10. Yarrow JC, Feng Y, Perlman ZE, Kirchhausen T, Mitchison TJ (2003) Phenotypic screening of small molecule libraries by high throughput cell imaging. Comb Chem High Throughput Screen 6: 279–286.
11. Perlman Z, Slack M, Feng Y, Mitchison T, Wu L, et al. (2004) Multidimensional drug profiling by automated microscopy. Science 306: 1194–1198.
12. Young DW, Bender A, Hoyt J, McWhinnie E, Chirn G-W, et al. (2008) Integrating high-content screening and ligand-target prediction to identify mechanism of action. Nat Chem Biol 4: 59–68.
13. Slack MD, Martinez ED, Wu LF, Altschuler SJ (2008) Characterizing heterogeneous cellular responses to perturbations. Proc Natl Acad Sci U S A 105: 19306–19311.
14. Singh DK, Ku C-J, Wichaidit C, Steininger RJ, Wu LF, et al. (2010) Patterns of basal signaling heterogeneity can distinguish cellular populations with different drug sensitivities. Mol Syst Biol 6: 369.
15. Bakal C, Linding R, Llense F, Heffern E, Martin-Blanco E, et al. (2008) Phosphorylation networks regulating JNK activity in diverse genetic backgrounds. Science 322: 453–456.
16. Bakal C, Aach J, Church G, Perrimon N (2007) Quantitative morphological signatures define local signaling networks regulating cell morphology. Science 316: 1753–1756.
17. Carpenter AE, Jones TR, Lamprecht MR, Clarke C, Kang IH, et al. (2006) CellProfiler: image analysis software for identifying and quantifying cell phenotypes. Genome Biol 7: R100.
18. Schindelin J, Arganda-Carreras I, Frise E, Kaynig V, Longair M, et al. (2012) Fiji: an open-source platform for biological-image analysis. Nat Methods 9: 676–682.
19. de Chaumont F, Dallongeville S, Chenouard N, Herve N, Pop S, et al. (2012) Icy: an open bioimage informatics platform for extended reproducible research. Nat Methods 9: 690–696.
20. Yin Z, Zhou X, Bakal C, Li F, Sun Y, et al. (2008) Using iterative cluster merging with improved gap statistics to perform online phenotype discovery in the context of high-throughput RNAi screens. BMC Bioinformatics 9: 264.
21. Rajaram S, Pavie B, Wu LF, Altschuler SJ (2012) PhenoRipper: software for rapidly profiling microscopy images. Nat Methods 9: 635–637.
22. Neumann B, Walter T, Heriche JK, Bulkescher J, Erfle H, et al. (2010) Phenotypic profiling of the human genome by time-lapse microscopy reveals cell division genes. Nature 464: 721–727.
23. Held M, Schmitz MH, Fischer B, Walter T, Neumann B, et al. (2010) CellCognition: time-resolved phenotype annotation in high-throughput live cell imaging. Nat Methods 7: 747–754.
24. Shi Q, King RW (2005) Chromosome nondisjunction yields tetraploid rather than aneuploid cells in human cell lines. Nature 437: 1038–1042.
25. Sigoillot FD, Huckins JF, Li F, Zhou X, Wong ST, et al. (2011) A time-series method for automated measurement of changes in mitotic and interphase duration from time-lapse movies. PLoS ONE 6: e25511. doi:10.1371/journal.pone.0025511
26. Miki T, Lehmann T, Cai H, Stolz DB, Strom SC (2005) Stem cell characteristics of amniotic epithelial cells. Stem Cells 23: 1549–1559.
27. Li K, Miller ED, Chen M, Kanade T, Weiss LE, et al. (2008) Cell population tracking and lineage construction with spatiotemporal context. Med Image Anal 12: 546–566.
28. Cohen AR, Gomes FL, Roysam B, Cayouette M (2011) Computational prediction of neural progenitor cell fates. Nat Methods 7: 213–218.
29. Kankaanpaa P, Paavolainen L, Tiitta S, Karjalainen M, Paivarinne J, et al. (2012) BioImageXD: an open, general-purpose and high-throughput image-processing platform. Nat Methods 9: 683–689.