
METHODS IN MOLECULAR BIOLOGY™

Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:


http://www.springer.com/series/7651
Computational Toxicology

Volume II

Edited by

Brad Reisfeld
Department of Chemical and Biological Engineering and School of Biomedical Engineering
Colorado State University, Fort Collins, Colorado, USA

Arthur N. Mayeno
Department of Chemical and Biological Engineering,
Colorado State University, Fort Collins, Colorado, USA
Editors
Brad Reisfeld
Department of Chemical and Biological Engineering
and School of Biomedical Engineering
Colorado State University
Fort Collins, Colorado, USA

Arthur N. Mayeno
Department of Chemical and Biological Engineering
Colorado State University
Fort Collins, Colorado, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)


ISBN 978-1-62703-058-8 ISBN 978-1-62703-059-5 (eBook)
DOI 10.1007/978-1-62703-059-5
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2012946102

© Springer Science+Business Media, LLC 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for
the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher's location, in its current version, and permission for use must always be obtained from Springer. Permissions
for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution
under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not
imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and
regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the
authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be
made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Humana Press is a brand of Springer


Springer is part of Springer Science+Business Media (www.springer.com)
Preface

Rapid advances in computer science, biology, chemistry, and other disciplines are enabling
powerful new computational tools and models for toxicology and pharmacology. These
computational tools hold tremendous promise for advancing applied and basic science,
from streamlining drug efficacy and safety testing to increasing the efficiency and effective-
ness of risk assessment for environmental chemicals. These approaches also offer the
potential to improve experimental design, reduce the overall number of experimental trials
needed, and decrease the number of animals used in experimentation.
Computational approaches are ideally suited to organize, process, and analyze the vast
libraries and databases of scientific information and to simulate complex biological phe-
nomena. For instance, they allow researchers to (1) investigate toxicological and phar-
macological phenomena across a wide range of scales of biological organization
(molecular, cellular, organism), (2) incorporate and analyze multiple biochemical and
biological interactions, (3) simulate biological processes and generate hypotheses based
on model predictions, which can be tested via targeted experimentation in vitro or in vivo,
(4) explore the consequences of inter- and intra-species differences and population varia-
bility on the toxicology and pharmacology, and (5) extrapolate biological responses across
individuals, species, and a range of dose levels.
Despite the exceptional promise of computational approaches, there are presently very
few resources that focus on providing guidance on the development and practice of these
tools to solve problems and perform analyses in this area. This volume was conceived as
part of the Methods in Molecular Biology series to meet this need and to provide both
biomedical and quantitative scientists with essential background, context, examples, useful
tips, and an overview of current developments in the field. To this end, we present a
collection of practical techniques and software in computational toxicology, illustrated with
relevant examples drawn principally from the fields of environmental and pharmaceutical
sciences. These computational techniques can be used to analyze and simulate a myriad of
multi-scale biochemical and biological phenomena occurring in humans and other animals
following exposure to environmental toxicants or dosing with drugs.
This book (the second in a two-volume set) is organized into six parts, each covering a
methodology or topic, subdivided into chapters that provide background, theory, and
illustrative examples. Each part is generally self-contained, allowing the reader to start
with any part, although some knowledge of concepts from other parts may be assumed.
The final part provides a review of relevant mathematical and statistical techniques. Part I
explores the critical area of predicting toxicological and pharmacological endpoints, such as
mutagenicity and carcinogenicity, and demonstrates the formulation and application of
quantitative structure–activity relationships (QSARs) and the use of chemical and endpoint
databases. Part II details approaches used in the analysis of gene, signaling, regulatory, and
metabolic networks, and illustrates how perturbations to these systems may be analyzed in
the context of toxicology. Part III focuses on diagnostic and prognostic molecular indica-
tors and examines the use of computational techniques to utilize and characterize these
biomarkers. Part IV looks at computational techniques and examples of modeling for
risk and safety assessment for both internal use and regulatory purposes. Part V details
approaches for integrated systems modeling, including the rapidly evolving development

of virtual organs and organisms. Part VI reviews some of the key mathematical and
statistical methods used herein, such as linear algebra, differential equations, and least-
squares analysis, and lists other resources for further information.
Although a complete picture of toxicological risk often involves an analysis of environ-
mental transport, we believe that this expansive topic is beyond the scope of this volume,
and it will not be covered here; overviews of computational techniques in this area are
contained in a variety of excellent references [1–4].
Computational techniques are increasingly allowing scientists to gain new insights into
toxicological phenomena, integrate (and interpret) the results from a wide variety of
experiments, and develop more rigorous and quantitative means of assessing chemical
safety and toxicity. Moreover, these techniques can provide valuable insights before initiat-
ing expensive laboratory experiments and into phenomena not easily amenable to experi-
mental analysis, e.g., detection of highly reactive, transient, or trace-level species in
biological milieu. We believe that the unique collection of explanatory material, software,
and illustrative examples in Computational Toxicology will allow motivated readers to
participate in this exciting field and undertake a diversity of realistic problems of interest.
We would like to express our sincere thanks to our authors whose enthusiasm and
diverse contributions have made this project possible.

Fort Collins, Colorado, USA Brad Reisfeld


Arthur N. Mayeno

References

1. Clark, M.M., Transport modeling for environmental engineers and scientists. 2nd ed. 2009, Hoboken, N.J.: Wiley.
2. Hemond, H.F. and E.J. Fechner-Levy, Chemical fate and transport in the environment. 2nd ed.
2000, San Diego: Academic Press. xi, 433 p.
3. Logan, B.E., Environmental transport processes. 1999, New York: Wiley. xiii, 654 p.
4. Nirmalakhandan, N., Modeling tools for environmental engineers and scientists. 2002, Boca Raton,
Fla.: CRC Press. xi, 312 p.
Contents

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix

PART I TOXICOLOGICAL/PHARMACOLOGICAL ENDPOINT PREDICTION


1 Methods for Building QSARs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
James Devillers
2 Accessing and Using Chemical Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Nikolai Nikolov, Todor Pavlov, Jay R. Niemelä, and Ovanes Mekenyan
3 From QSAR to QSIIR: Searching for Enhanced Computational
Toxicology Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Hao Zhu
4 Mutagenicity, Carcinogenicity, and Other Endpoints . . . . . . . . . . . . . . . . . . . . . . . . 67
Romualdo Benigni, Chiara Laura Battistelli, Cecilia Bossa,
Mauro Colafranceschi, and Olga Tcheremenskaia
5 Classification Models for Safe Drug Molecules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
A.K. Madan, Sanjay Bajaj, and Harish Dureja
6 QSAR and Metabolic Assessment Tools in the Assessment of Genotoxicity . . . . . . 125
Andrew P. Worth, Silvia Lapenna, and Rositsa Serafimova

PART II BIOLOGICAL NETWORK MODELING

7 Gene Expression Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


Reuben Thomas and Christopher J. Portier
8 Construction of Cell Type-Specific Logic Models of Signaling Networks
Using CellNOpt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Melody K. Morris, Ioannis Melas, and Julio Saez-Rodriguez
9 Regulatory Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Gilles Bernot, Jean-Paul Comet, and Christine Risso-de Faverney
10 Computational Reconstruction of Metabolic Networks from KEGG . . . . . . . . . . . 235
Tingting Zhou

PART III BIOMARKERS

11 Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Harmony Larson, Elena Chan, Sucha Sudarsanam, and Dale E. Johnson
12 Biomonitoring-based Environmental Public
Health Indicators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Andrey I. Egorov, Dafina Dalbokova, and Michal Krzyzanowski


PART IV MODELING FOR REGULATORY PURPOSES


(RISK AND SAFETY ASSESSMENT)
13 Modeling for Regulatory Purposes (Risk and Safety Assessment). . . . . . . . . . . . . . . 297
Hisham El-Masri
14 Developmental Toxicity Prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Raghuraman Venkatapathy and Nina Ching Y. Wang
15 Predictive Computational Toxicology to Support Drug Safety Assessment. . . . . . . 341
Luis G. Valerio Jr.

PART V INTEGRATED MODELING/SYSTEMS TOXICOLOGY APPROACHES

16 Developing a Practical Toxicogenomics Data Analysis System Utilizing


Open-Source Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
Takehiro Hirai and Naoki Kiyosawa
17 Systems Toxicology from Genes to Organs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
John Jack, John Wambaugh, and Imran Shah
18 Agent-Based Models of Cellular Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Nicola Cannata, Flavio Corradini, Emanuela Merelli, and Luca Tesei

PART VI MATHEMATICAL AND STATISTICAL BACKGROUND


19 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
Kenneth Kuttler
20 Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 475
Jiří Lebl
21 On the Development and Validation of QSAR Models . . . . . . . . . . . . . . . . . . . . . . . 499
Paola Gramatica
22 Principal Components Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Detlef Groth, Stefanie Hartmann, Sebastian Klie, and Joachim Selbig
23 Partial Least Squares Methods: Partial Least Squares Correlation
and Partial Least Square Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
Hervé Abdi and Lynne J. Williams
24 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581
Shuying Yang and Daniela De Angelis
25 Bayesian Inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
Frédéric Y. Bois
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637
Contributors

HERVÉ ABDI  School of Behavioral and Brain Sciences, The University of Texas
at Dallas, Richardson, TX, USA
SANJAY BAJAJ  S.V. College of Pharmacy, Patiala, India
CHIARA LAURA BATTISTELLI  Environment and Health Department, Istituto
Superiore di Sanità, Rome, Italy
ROMUALDO BENIGNI  Environment and Health Department, Istituto Superiore
di Sanità, Rome, Italy
GILLES BERNOT  I3S laboratory, UMR 6070 CNRS, University of Nice-Sophia
Antipolis, Sophia Antipolis, France
FRÉDÉRIC Y. BOIS  Technological University of Compiègne, Royallieu Research Center,
Compiègne, France; INERIS, DRC/VIVA/METO, Verneuil en Halatte, France
CECILIA BOSSA  Environment and Health Department, Istituto Superiore di Sanità,
Rome, Italy
NICOLA CANNATA  School of Science and Technology, University of Camerino,
Camerino, Italy
ELENA CHAN  Emiliem, Inc., San Francisco, CA, USA
MAURO COLAFRANCESCHI  Environment and Health Department, Istituto Superiore
di Sanità, Rome, Italy
FLAVIO CORRADINI  School of Science and Technology, University of Camerino,
Camerino, Italy
DAFINA DALBOKOVA  Consultant, Sofia, Bulgaria
DANIELA DE ANGELIS  MRC Biostatistics Unit, Institute of Public Health, University
Forvie Site, Cambridge, UK
JOHN C. DEARDEN  School of Pharmacy & Biomolecular Sciences, Liverpool John Moores
University, Liverpool, UK
JAMES DEVILLERS  CTIS, Rillieux La Pape, France
HARISH DUREJA  Department of Pharmaceutical Sciences, M. D. University,
Rohtak, India
ANDREY I. EGOROV  World Health Organization (WHO), Regional Office for Europe,
European Centre for Environment and Health (ECEH), Bonn, Germany
HISHAM EL-MASRI  Integrated Systems Toxicology Division, Systems Biology Branch,
US Environmental Protection Agency, Research Triangle Park, NC, USA
CHRISTINE RISSO-DE FAVERNEY  ECOMERS laboratory, University of Nice-Sophia
Antipolis, Nice Cedex, France
PAOLA GRAMATICA  QSAR Research Unit in Environmental Chemistry and
Ecotoxicology, Theoretical and Applied Sciences, University of Insubria,
via Dunant 3, Varese, Italy
DETLEF GROTH  AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
STEFANIE HARTMANN  AG Bioinformatics, University of Potsdam,
Potsdam-Golm, Germany

ix
x Contributors

TAKEHIRO HIRAI  Translational Medicine and Clinical Pharmacology Department,


Daiichi Sankyo Co., Ltd., Tokyo, Japan
JOHN JACK  U.S. Environmental Protection Agency, Research Triangle Park,
NC, USA
DALE E. JOHNSON  Emiliem, Inc., San Francisco, CA, USA
NAOKI KIYOSAWA  Medicinal Safety Research Laboratories, Daiichi Sankyo Co., Ltd.,
Fukuroi, Shizuoka, Japan
SEBASTIAN KLIE  AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
MICHAL KRZYZANOWSKI  World Health Organization (WHO), Regional Office
for Europe, European Centre for Environment and Health (ECEH), Bonn, Germany
KENNETH KUTTLER  Department of Math, Brigham Young University, Provo,
UT, USA
SILVIA LAPENNA  Institute for Health and Consumer Protection, European
Commission – Joint Research Centre, Ispra (VA), Italy
HARMONY LARSON  Emiliem, Inc., San Francisco, CA, USA
JIŘÍ LEBL  Department of Mathematics, University of Wisconsin-Madison, Madison,
WI, USA
A.K. MADAN  Department of Pharmaceutical Sciences, Pt. B.D. Sharma University
of Health Sciences, Rohtak, India
OVANES MEKENYAN  Laboratory of Mathematical Chemistry, University Prof. Assen
Zlatarov, Bourgas, Bulgaria
IOANNIS MELAS  European Bioinformatics Institute (EMBL-EBI), Cambridge, UK;
National Technical University of Athens, Athens, Greece
EMANUELA MERELLI  School of Science and Technology, University of Camerino,
Camerino, Italy
MELODY K. MORRIS  Center for Cell Decision Processes, Massachusetts Institute of
Technology and Harvard Medical School, Cambridge, MA, USA;
Department of Biological Engineering, Massachusetts Institute of Technology,
Cambridge, MA, USA
JAY R. NIEMELÄ  National Food Institute, Technical University of Denmark,
Soeborg, Denmark
NIKOLAI NIKOLOV  National Food Institute, Technical University of Denmark,
Soeborg, Denmark
TODOR PAVLOV  Laboratory of Mathematical Chemistry, University Prof. Assen
Zlatarov, Bourgas, Bulgaria
CHRISTOPHER J. PORTIER  National Center for Environmental Health
and Agency for Toxic Substances and Disease Registry, Centers for Disease
Control and Prevention, Atlanta, GA, USA
JULIO SAEZ-RODRIGUEZ  European Bioinformatics Institute (EMBL-EBI),
Cambridge, UK; Genome Biology Unit, European Molecular Biology Laboratory
(EMBL), Heidelberg, Germany
JOACHIM SELBIG  AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
ROSITSA SERAFIMOVA  Institute for Health and Consumer Protection, European
Commission – Joint Research Centre, Ispra (VA), Italy
IMRAN SHAH  U.S. Environmental Protection Agency, Research Triangle Park,
NC, USA
Contributors xi

SUCHA SUDARSANAM  Emiliem, Inc., San Francisco, CA, USA


OLGA TCHEREMENSKAIA  Environment and Health Department, Istituto Superiore
di Sanità, Rome, Italy
LUCA TESEI  School of Science and Technology, University of Camerino, Camerino, Italy
REUBEN THOMAS  Division of Environmental Health Sciences,
University of California, Berkeley, CA, USA
LUIS G. VALERIO JR  Office of Pharmaceutical Science, Center for Drug Evaluation
and Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
RAGHURAMAN VENKATAPATHY  Pegasus Technical Services, Inc., Cincinnati, OH, USA
JOHN WAMBAUGH  U.S. Environmental Protection Agency, Research Triangle Park,
NC, USA
NINA CHING Y. WANG  National Center for Environmental Assessment,
U. S. Environmental Protection Agency, Office of Research and Development,
Cincinnati, OH, USA
LYNNE J. WILLIAMS  Kunin-Lunenfeld Applied Research Unit, Rotman Research
Institute at Baycrest, Toronto, Canada
ANDREW P. WORTH  Institute for Health and Consumer Protection, European
Commission – Joint Research Centre, Ispra (VA), Italy
SHUYING YANG  GlaxoSmithKline Services Unlimited, Brentford, Middlesex, UK
TINGTING ZHOU  Laboratory of Molecular Immunology, Institute of Basic Medical
Sciences, Beijing, People's Republic of China
HAO ZHU  Department of Chemistry and The Rutgers Center for Computational and
Integrative Biology, Rutgers University, Camden, NJ, USA; University
of North Carolina, Chapel Hill, NC, USA
Part I

Toxicological/Pharmacological Endpoint Prediction


Chapter 1

Methods for Building QSARs


James Devillers

Abstract
Structure–activity relationship (SAR) and quantitative structure–activity relationship (QSAR) models are
increasingly used in toxicology, ecotoxicology, and pharmacology for predicting the activity of the mole-
cules from their physicochemical properties and/or their structural characteristics. However, the design of
such models has many traps for unwary practitioners. Consequently, the purpose of this chapter is to give a
practical guide for the computation of SAR and QSAR models, point out problems that may be encoun-
tered, and suggest ways of solving them. Attempts are also made to see how these models can be validated
and interpreted.

Key words: QSAR, SAR, Linear model, Nonlinear model, Validation

1. Introduction

All branches of research benefit from the use of computers, even
though no increase of memory size or processor speed will compen-
sate for lack of original ideas, analytical mind, or specialized exper-
tise. Nevertheless, the impact of computers in structure–activity
relationship (SAR) and quantitative structure–activity relationship
(QSAR) that try to relate the activity of a set of molecules to their
chemical structures has been tremendous. Only based on empirical
relationships in the second part of the nineteenth century (1–3), the
QSARs were mathematically formalized in the 1930s (4) and their
acceptance, as a discipline in its own right, was definitively acquired
in the early 1960s from the seminal works of Hansch and Fujita (5,
6). Since then a huge number of models have been designed for the
prediction of various biological activities of interest. With the
increase of computational power, it has been possible to calculate
collections of molecular descriptors to encode larger and larger
sets of molecules for which biological data were available and also, to

be not limited in the choice of a statistical method to derive linear
and nonlinear SARs. However, independently of their characteristics
and complexity, the design of SAR and QSAR models is under-
pinned by the same basic principles. In this context,
the goal of this chapter is to provide a practical guide to the design of
SAR and QSAR models and consequently, to underline the com-
mon pitfalls to avoid in the building of such models. It is expected
that this chapter will prove useful to those who are unfamiliar with
the discipline. Intended first to people interested in the design of
models in toxicology and ecotoxicology, most of the concepts and
remarks discussed in this chapter remain true regarding the design
of QSAR in pharmacology as well as the computation of structure–
property relationships, called QSPRs.
The chapter is set out in the same order in which an SAR or a
QSAR model is derived, starting with the selection of the biological
data for the endpoint of interest, description of the molecules,
computation of the model from a statistical tool, evaluation of
the model, estimation of its prediction performances, and last its
interpretation.

2. Biological Data

A well-defined endpoint is critical for the design of an accurate
QSAR model. Thus, for deriving a structure–toxicity model, it is
required to select toxicity data obtained from standardized proce-
dures. LD50 values (i.e., doses of chemicals at which 50% of the test
organisms die) have to be preferentially used for the determination
of the acute toxicity of chemicals to rodents due to their better
statistical and toxicological significance. The LD50 values are influ-
enced by the species, strain, gender, and physiological conditions of
the animals as well as the experimental conditions in which the
laboratory assay is performed. Thus, for example, Gaines (7, 8)
using adult Sherman strain rats under strictly the same experimental
conditions showed that most of the pesticides tested by the oral
route were more toxic to female than to male rats. This is illustrated
in Table 1 where the difference in sensitivity between male and
female rats is indicated by the fact that the confidence limits of the
LD50 values for each of these pesticides do not overlap between the
sexes. The greater chemical metabolizing activity of liver micro-
somes in male rat (9) probably accounts for many of the sex-related
differences in acute toxicity which are observed in this species. It is
noteworthy that from the Gaines data, a QSAR model was proposed
by Devillers (10) for predicting the acute toxicity of organophos-
phorus pesticides to rats that included the gender of the organisms
in the modeling process in addition to the molecular descriptors.

Table 1
Acute oral toxicity of some pesticides in male and female
rats (7, 8)

Name CAS RN LD50 (mg/kg) in males LD50 (mg/kg) in females
Coumaphos 56-72-4 41 (34–50)ᵃ 16 (14–17)
Endosulfan 115-29-7 43 (41–46) 18 (15–21)
Endrin 72-20-8 17.8 (14.7–21.5) 7.5 (6.8–8.3)
EPN 2104-64-5 36 (33–40) 7.7 (6.9–8.6)
Isodrin 465-73-6 15.5 (12.7–19.1) 7 (6–8.1)
Methyl parathion 298-00-0 14 (12–17) 24 (22–28)
Mevinphos 298-01-1 6.1 (5.2–7.1) 3.7 (3–4.5)
Parathion 56-38-2 13 (10–17) 3.6 (3.2–4)
Schradan 152-16-9 9.1 (8.1–10.2) 42 (32–55)
ᵃ Confidence limits

Use of his/her own experimental toxicity data for deriving a


QSAR model minimizes the biases. However, in practice, this
situation is not usual and generally the biological data are retrieved
from publications, databases, and/or the Internet (11). When it is
not possible to obtain all the desired data from one source, it is
necessary to verify their compatibility.
Very often, the biological data need to be transformed before being
used for deriving a QSAR model. Thus, for example, the LD50
values of chemicals have to be expressed on a molar basis to be
structurally comparable. In addition, they have to be converted into
a logarithmic scale to avoid statistical problems when classical sta-
tistical methods, such as regression analysis, are used to derive the
models. By convention, negative logarithms are preferred to obtain
larger values for the more active chemicals. The biological data set
after its logarithmic transformation should ideally span several
orders of magnitude to be safely used for deriving a QSAR model.
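By way of illustration, the following minimal Python sketch performs this standard pretreatment; the LD50 values and molecular weights are hypothetical, not taken from this chapter:

```python
# Standard pretreatment of LD50 data: conversion from mg/kg to a molar
# basis, then a negative log10 transform so that larger values correspond
# to more active (more toxic) chemicals. All numbers are hypothetical.
import numpy as np

ld50_mg_per_kg = np.array([41.0, 16.0, 380.0])   # oral LD50 values (mg/kg)
mol_weight = np.array([362.8, 406.9, 291.3])     # molecular weights (g/mol)

ld50_mol_per_kg = ld50_mg_per_kg / (mol_weight * 1000.0)  # mg/kg -> mol/kg
p_ld50 = -np.log10(ld50_mol_per_kg)

print(p_ld50)   # the data set should span several orders of magnitude
```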
It is worth noting that if the LD50 values have heterogeneous
or dubious origins, it is needed to transform them into categorical
data. The choice of the threshold limits and number of categories
is problem dependent. Some toxicological activities, such as carci-
nogenicity, are basically expressed in a Boolean manner (i.e.,
carcinogenic/noncarcinogenic) and are modeled as such (12–14).
From a semantic point of view, the structure–activity models
computed from categorical response data are called SAR models.
They are derived from specific statistical methods (see Subhead-
ing 4). It is important to note that the use of imbalanced data sets
affects the performances of the models (15, 16). This was recently
shown in an external validation exercise (17) aiming to estimate the
performances of an SAR model designed by Benigni et al. (18) for
predicting the mutagenicity of the α,β-unsaturated aliphatic alde-
hydes and that is included in the OECD QSAR application toolbox
1.1.02 (19) and in Toxtree 2.1.0 (toxic hazard estimation by deci-
sion tree approach) (20) that are two computational systems specifi-
cally designed for facilitating the practical use of QSAR approaches
in regulatory contexts.

3. Molecular Descriptors

For a chemical to trigger a biological activity when administered to
an organism, a number of processes must occur that depend on its
structural characteristics. The correct encoding of these character-
istics is the keystone of the design of SAR and QSAR models with
good predictive performances. There are different ways for describ-
ing a molecule depending on the endpoint of concern and the
characteristics of the set of molecules used for computing the
model. The different categories of descriptors are discussed below
focusing on their interest and limitations rather than their calcula-
tion procedures.

3.1. Indicator Variables

The indicator variables, also termed 1D descriptors, allow one to
account for structural features (i.e., atoms or functional groups)
that influence or which are responsible for the biological activity of
the molecules. They are also termed dummy variables or Boolean
descriptors when they encode the presence (= 1) or absence (= 0)
of a structural element in a molecule. These descriptors represent
the simplest way for describing a molecule. The Boolean descriptors
are particularly suited for encoding the position and/or the num-
ber of substituents on a molecule. The Free–Wilson method (21) is
rooted in the use of such descriptors. Indeed, this approach allows
one to quantify mathematically the contribution of a substituent to the
biological activity of a molecule. It is assumed that the substituents
on a parent molecule provide a constant contribution to the activity
according to a simple principle of additivity. The method operates
by the generation of a data matrix consisting of zero and one values.
Each column in this matrix corresponds to a particular substituent
at a specific position on the molecule and is treated as an indepen-
dent variable. The data table also contains a column of dependent
data (i.e., biological data). A multiple regression analysis is applied
between the dependent variable and the independent variables
using classical statistical criteria for measuring the goodness of fit.
The regression coefficients of the model represent the contribution
of the substituents of the molecule to its activity. The Free–Wilson
approach has found numerous applications in QSAR; see, e.g., refs.
(22–27). Its main advantage is that the mechanistic
interpretation of the obtained QSAR models is straightforward.
However, unlike the QSAR models derived from classical molecular
descriptors (e.g., physicochemical properties), those derived from
the Free–Wilson method cannot be used to estimate the activity of
molecules including substituents different from those found in the
set of molecules used to compute the model.
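A minimal sketch of this approach is given below; the Boolean substituent matrix and the activities are invented for the example, and ordinary least squares yields the additive substituent contributions:

```python
# A minimal Free-Wilson sketch: ordinary least squares on a Boolean
# substituent matrix. Matrix and activities are invented for illustration.
import numpy as np

# Rows = molecules; columns = presence (1) / absence (0) of a substituent
# at a given position on the parent structure (hypothetical data).
X = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
y = np.array([0.8, 0.3, 1.1, -0.2, 0.7])  # e.g., log 1/C values

# Add an intercept column (activity of the unsubstituted parent molecule).
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# coef[0] = parent contribution; coef[1:] = additive substituent contributions
print(coef)
```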
Instead of a simple Boolean description of the structure of the
molecules, it is possible to use the frequency of occurrence of
specific atoms and/or functional groups as molecular descriptors
(28, 29). However, this frequency has to be high enough to obtain
reliable coefficient values in the regression equations. Different
approaches, based on the use of multivariate methods, have been
proposed to overcome this problem (30, 31).
Last, it is noteworthy that the use of regression analysis or other
statistical methods with this kind of descriptors is not compulsory
to yield structure–activity predictions. Indeed, the indicator vari-
ables stand on their own for some specific endpoints. In that case,
they are called structural alerts. Thus, for example, more than
20 years ago, Ashby and Tennant (32) showed the interest of
structural alerts for predicting the carcinogenic potential of chemi-
cals. Since this pioneering work, it has been possible to refine these
structural alerts over time, as more experimental results have
become available and additional mechanistic insights have been
gained. To date, one of the most advanced lists of structural alerts
for evaluating the carcinogenicity of chemicals is the list of
35 structural alerts proposed by Benigni and Bossa (33). This list
is implemented as a rule-based system in the OECD QSAR appli-
cation toolbox 1.1.02 (19) and in Toxtree 2.1.0 (20). Recently
(14), the prediction results for carcinogenicity potential obtained
with both systems were compared with experimental data collected
in databases and original publications for more than 500 structur-
ally diverse chemicals. It was demonstrated that the overall perfor-
mance of the structural alerts was satisfactory but less convincing
results were obtained on specific groups such as the polycyclic
aromatic compounds (14). The same conclusion was reached with
the pesticides and biocides (34). Structural alerts have been also
proposed for other toxicological endpoints such as eye (35) and
skin (36) irritation/corrosion potential.
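In practice, a structural alert is typically encoded as a substructure pattern and matched against candidate molecules. The sketch below uses RDKit and an aromatic nitro group, a classic illustrative alert; it is not one of the specific alerts cited above:

```python
# A minimal sketch of applying a structural alert: a SMARTS pattern
# matched against candidate structures with RDKit.
from rdkit import Chem

alert = Chem.MolFromSmarts("[c][N+](=O)[O-]")      # aromatic nitro group

for smiles in ("c1ccccc1[N+](=O)[O-]", "CCO"):     # nitrobenzene, ethanol
    mol = Chem.MolFromSmiles(smiles)
    print(smiles, mol.HasSubstructMatch(alert))    # True only for the alert
```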

3.2. 2D Molecular Descriptors

The absorption, distribution, metabolism, and excretion (ADME)
of chemicals in the organisms depend on their
physicochemical properties (37). In the same way, knowing the
physicochemical properties of xenobiotics is a prerequisite to esti-
mate their bioactivity, bioavailability, transport, and distribution
between the different compartments of the biosphere (38–41).
Among them, the 1-octanol/water partition coefficient (Kow),
encoding the hydrophobicity of the molecules, is undoubtedly
the physicochemical property that is the most widely used as
molecular descriptor in QSAR (42, 43) and for encoding the
partitioning in the biota (44–46). Numerous methods are available
for the experimental measurement of log Kow (also termed log P)
(47) as well as for its estimation from contribution methods (48,
49) or from quantitative structure–property relationship (QSPR)
models (50–54). This is also the case for the other physicochemical
properties that are used in QSAR and environmental fate modeling
(55–58).
The whole structure of an organic molecule can also be depicted
as a graph without hydrogen atoms for deriving numerical descrip-
tors termed topological indices (59). Many algorithms are available
in the literature (5962) for calculating these interesting molecular
descriptors which can be easily computed for all the existing, new,
and in-development chemicals and allow a multivariate description
of the molecules when they are judiciously combined.
To date, thousands of 2D descriptors can be computed (62) but
to be safely used in the design of a QSAR model they have to be
meaningful and uncorrelated. Indeed, unfortunately, some compu-
tational descriptors are so mathematically transformed that, from a
mechanistic point of view, they become meaningless even if they
can have a good discrimination power. The molecular descriptors
should be as independent (orthogonal) from each other as possible
because when using descriptors that are too correlated there is an
increased danger of obtaining non-optimum models due to chance
correlation (63). To avoid this problem, depending on the nature
of the descriptors, a principal components analysis (PCA) (64) or a
correspondence factor analysis (CFA) (65) can be used. These two
linear multivariate analyses work by creating new variables that are
linear combinations of the original variables and are called principal
components (PCs) and factors (F), respectively. These new variables
are orthogonal to one another. They allow the reduction of the
dimensionality of the data matrix of descriptors and to graphically
represent the spaces of variables (i.e., descriptors) and objects
(i.e., chemicals) to show and explain the relationships between
them. Last, they can be used to be correlated to the response
variable (i.e., biological activity) to perform an orthogonal regres-
sion analysis or a stochastic regression analysis in the case of a PCA
or a CFA, respectively (30).
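The following minimal sketch, on random placeholder data, illustrates this decorrelation strategy: a PCA produces orthogonal scores that are then regressed against the activity:

```python
# A minimal sketch of principal components regression: PCA decorrelates
# the descriptors, and the orthogonal scores are then regressed against
# the activity. All data are random placeholders.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))       # 30 chemicals x 10 correlated descriptors
y = rng.normal(size=30)             # biological activity (placeholder)

pca = PCA(n_components=3)           # keep the first 3 orthogonal components
scores = pca.fit_transform(X)       # PCs = linear combinations of descriptors

model = LinearRegression().fit(scores, y)
print(model.coef_, model.score(scores, y))
```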

3.3. 3D Molecular Descriptors

In the classical Hansch analysis and the related QSAR approaches,
the descriptors are calculated from a flat representation of the
molecules. This is of limited interest for understanding receptor–
ligand interactions. More than 20 years ago, Richard Cramer of
Tripos, Inc. proposed to use the field properties of molecules in
3D space to derive QSAR models (66). The method, called com-
parative molecular field analysis (CoMFA), was the first true 3D
QSAR approach and remains the most widely used (67, 68). The
basic assumption in CoMFA is that a suitable sampling of steric
(van der Waals) and electrostatic (Coulombic) fields around a set
of aligned molecules yields all the information necessary to explain
their biological activities. Sampling is achieved by calculating
interaction energies between each molecule and an appropriate
probe at regularly spaced grid points surrounding the molecules
(66). Partial least squares (PLS) analysis (69) is then used to relate
the calculated field parameter properties to the studied biological
activity. The critical step in CoMFA is the initial alignment of the
molecules that can be time-consuming and which requires some
experience. It is generally performed by using the most active
molecule in the data set as template. For nonrigid molecules, the
selection of the active conformation is a major hurdle to over-
come. A systematic conformational search is therefore performed
beforehand to define the minimum energy conformation that will
be used. Different methods have been proposed to improve the
alignment step with CoMFA (70–74).
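A schematic of the PLS step used in CoMFA-type analyses is sketched below; the grid-point field energies are random placeholders standing in for the sampled steric and electrostatic fields:

```python
# A minimal PLS sketch of the kind used in CoMFA: many grid-point field
# energies (random placeholders here) related to activity by partial
# least squares.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
fields = rng.normal(size=(25, 2000))   # 25 aligned molecules x grid energies
activity = rng.normal(size=(25, 1))

pls = PLSRegression(n_components=5)    # few latent variables, many descriptors
pls.fit(fields, activity)
print(pls.score(fields, activity))     # r^2 on the training set only
```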
Comparative molecular similarity indices analysis (CoMSIA)
(75) is rooted on the same principles as those of CoMFA but
different fields are used. After alignment and embedding of the
molecules in the 3D lattice (as in CoMFA), the similarities between
the atoms of the studied molecule and a probe are evaluated for five
properties related to steric, electrostatic, hydrophobic, hydrogen
bond donor, and hydrogen bond acceptor. Use of more fields
allows a sharper analysis displaying space regions where the con-
tributions of the different fields are important for the biological
activity (75).
The common reactivity pattern (COREPA) method circum-
vents the problem of conformer alignment and specific pharmaco-
phore atom/fragment selection by analyzing the conformational
distribution of chemicals across global and local reactivity parame-
ter(s) potentially associated with the studied endpoint (76, 77).
There also exist pseudo-3D descriptors that are independent of
alignment. Among them, we can cite the eigenvalue (EVA) descrip-
tors derived from fundamental IR and Raman range molecular
vibrational frequencies (52, 78) or the weighted holistic invariant
molecular (WHIM) descriptors that are calculated from (x, y, z)-
coordinates of a molecule within different weighting schemes (79).
For more information on the different types of 3D molecular
descriptors and beyond, their calculation procedures, as well as
their advantages and limitations, the reader is invited to consult
the recent book by Doucet and Panaye (68).

4. Model Computation

There is a huge number of methods available to build models that
relate biological data to molecular descriptors. Rather than catalo-
ging all the approaches in use in the domain, it is better to give

Table 2
Nineteen chemicals with their activity (log 1/EC50 in mM)
and their electrophilic superdelocalizability for atom 10
(ESDL10), 1-octanol/water partition coefficient (P),
and melting point (MP) (adapted from refs. (80, 81))

Chemical log 1/EC50 ESDL10 log P log MP


1 0.10 0.79 4.89 2.33
2 0.23 0.34 5.35 2.36
3 0.30 0.34 5.68 2.25
4 0.32 0.41 7.37 2.16
5 0.42 0.45 7.37 2.18
6 0.48 2.77 4.65 2.23
7 0.82 0.33 6.99 2.28
8 0.82 0.33 8.47 1.83
9 0.89 0.33 8.47 1.95
10 0.92 0.41 6.11 2.32
11 1.02 0.54 6.70 2.30
12 1.03 0.58 6.70 2.25
13 1.07 0.42 7.27 2.28
14 1.13 0.41 6.21 2.39
15 1.36 0.46 6.84 2.35
16 1.36 0.43 9.30 1.91
17 1.40 0.47 6.99 2.32
18 1.55 0.42 7.87 2.29
19 1.84 0.36 6.76 2.41

some guidelines for selecting the most suited statistical method in a
given situation as well as the main pitfalls to avoid when designing
an SAR or a QSAR model.
The easiest way to start a QSAR analysis is to graph the
biological data and molecular descriptors that will be embodied in
the modeling process. Simple scatter plots can allow us to discard
the molecular descriptors that are not significant and can avoid the
design of biased models. Thus, for example, if we consider the set of
19 chemicals in Table 2 with their antimalarial activity (log 1/EC50
in mM) and three molecular descriptors that are the 1-octanol/
water partition coefficient (P), melting point (MP), and electro-
philic superdelocalizability for atom 10 in the molecules (ESDL10)
(data arranged from refs. 80, 81), it is possible to derive the follow-
ing three-parameter equation 1 that appears at first sight to be quite
reasonable:
log 1/EC50 = 0.577(±0.105) log P + 3.191(±0.678) log MP
− 0.376(±0.168) ESDL10 − 10.38(±2.172),   (1)

n = 19, r = 0.83, r² = 0.69, s = 0.30, and F = 11.1,

where n is the number of individuals (i.e., chemicals), r is the
correlation coefficient, r² is the coefficient of determination, s is
the standard error of estimate, and F is the Fisher test. The standard
errors of the regression parameters are represented in parentheses.
It is worth noting that, unfortunately, most of the QSAR
equations published in the literature, which are based on multiple
regression analysis, are characterized by fewer statistical parameters.
Nonetheless, inspection of Eq. 1 and of its statistical parameters
does not allow us to detect a problem and the mechanistic interpre-
tation of the QSAR model seems straightforward from its three
descriptors. ESDL10 negatively contributes to the activity while
log P and log MP have a positive contribution. In fact, simple scatter
plots of each molecular descriptor versus biological activity could
have shown that one of them was biased. Indeed, Fig. 1 clearly
reveals that ESDL10 acts as a Boolean descriptor and hence, no
physicochemical meaning has to be given to this descriptor.
Here, a simple inspection of Table 2 shows that the value of
ESDL10 for chemical #6 is extreme compared with the rest, thus
turning this variable into a Boolean indicator, but it is generally not
so obvious and scatter plots are particularly suited for detecting this

kind of problem. Different types of graphs are available for analyzing
the molecular descriptors before starting the design of a QSAR
model (82–86).

Fig. 1. Scatter plot of log 1/EC50 versus ESDL10 (see Table 2).
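A minimal sketch of such a diagnostic plot, using a subset of the Table 2 values, is given below; the extreme ESDL10 value of chemical #6 stands out immediately:

```python
# A minimal sketch of the recommended pre-modeling step: a scatter plot
# of a candidate descriptor against the activity (subset of Table 2).
import matplotlib.pyplot as plt
import numpy as np

log_1_ec50 = np.array([0.10, 0.23, 0.30, 0.48, 0.82])  # activity (subset)
esdl10 = np.array([0.79, 0.34, 0.34, 2.77, 0.33])      # descriptor (subset)

plt.scatter(esdl10, log_1_ec50)
plt.xlabel("ESDL10")
plt.ylabel("log 1/EC50")
plt.show()   # a single extreme ESDL10 value is obvious at a glance
```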
Nevertheless, when chemical #6 in Table 2 is deleted and
a stepwise regression analysis is performed on the remaining
18 chemicals, ESDL10 is no longer selected (Eq. 2).

log 1/EC50 = 0.566(±0.101) log P + 3.20(±0.676) log MP − 10.15(±2.104),   (2)

n = 18, r = 0.82, r² = 0.68, s = 0.30, and F = 15.8.


The r and r² values of Eq. 1 are only slightly better than those of
Eq. 2. Interestingly, the coefficient values for log P and log MP as
well as the intercepts are broadly the same in both equations.
It is worth noting that to correctly compare regression models
containing different numbers of variables and/or that are derived
from different numbers of data points, it is necessary to calculate
the adjusted r² (r²adj) values (Eq. 3):

r²adj = 1 − (1 − r²) (n − 1)/(n − p − 1),   (3)

where n is the number of data points and p is the number of
parameters in the equation. The r²adj values of Eqs. 1 and 2 are
both equal to 0.63.
Consequently, Eq. 2 is undoubtedly the best model. There
exist different statistical criteria for calculating the optimal
number of predictor variables in a regression model (87).
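Equation 3 is a one-liner; the small sketch below reproduces the comparison made above, using the unrounded r values of Eqs. 1 and 2:

```python
# Adjusted coefficient of determination (Eq. 3) for comparing models with
# different numbers of predictors (p) and data points (n).
def adjusted_r2(r2: float, n: int, p: int) -> float:
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print(round(adjusted_r2(0.83**2, 19, 3), 2))  # Eq. 1 -> 0.63
print(round(adjusted_r2(0.82**2, 18, 2), 2))  # Eq. 2 -> 0.63
```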
Pairwise scatter plots of each selected molecular descriptor
versus the biological endpoint can also orientate the selection of
the most suited model. Thus, for example, Tanii (88) examined the
in vivo anesthetic activity of monoketones in mice in relation to
their hydrophobicity encoded by log P. Male mice of ddY strain
(Japan SLC Co., Shizuoka, Japan) weighing 25–30 g were used.
The AD50, the dose required to anesthetize 50% of the animals
belonging to the treated group, was determined for each chemical
and expressed in mmol/kg (Table 3). A simple inspection of Fig. 2,
which is a scatter plot of log P versus log 1/AD50, shows that the
most suited model is a parabolic equation in the form log P, (log P)².
Such a model yields a correlation coefficient of 0.99 while with a
simple regression model in log P, the correlation coefficient is only
equal to 0.30.
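The comparison made above can be reproduced with a few lines of numpy, fitting both the parabolic and the simple linear model to the Table 3 values as printed:

```python
# Parabolic versus linear fit of anesthetic activity against log P,
# using the Table 3 values as printed in this chapter.
import numpy as np

log_p = np.array([0.48, 0.26, 0.78, 0.56, 1.19, 1.31,
                  2.03, 2.37, 3.14, 2.92, 3.73, 4.09])
ad50 = np.array([59.6, 16.0, 8.78, 8.78, 5.64, 5.26,
                 4.40, 5.05, 6.91, 5.76, 12.2, 19.2])   # mmol/kg
y = -np.log10(ad50)                                     # log 1/AD50

quadratic = np.polyfit(log_p, y, 2)   # parabolic model: log P, (log P)^2
linear = np.polyfit(log_p, y, 1)      # simple model in log P alone

for coef in (quadratic, linear):
    pred = np.polyval(coef, log_p)
    print(np.corrcoef(y, pred)[0, 1])  # compare the correlation coefficients
```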
It is obvious that pairwise scatter plots can only be used
when the number of predictor variables is not too high. Other-
wise, a variable selection procedure has first to be used. There is
quite a large variety of methods for variable selection but one
of the most powerful approaches is undoubtedly the genetic

Table 3
Anesthetic activity of monoketones in mice (88)

No. Chemical AD50ᵃ log P


1 Acetone 59.6 0.48
2 Methyl ethyl ketone 16.0 0.26
3 Methyl n-propyl ketone 8.78 0.78
4 Methyl isopropyl ketone 8.78 0.56
5 Methyl n-butyl ketone 5.64 1.19
6 Methyl isobutyl ketone 5.26 1.31
7 Methyl n-amyl ketone 4.40 2.03
8 Methyl n-hexyl ketone 5.05 2.37
9 Methyl n-heptyl ketone 6.91 3.14
10 Methyl 3-methylhexyl ketone 5.76 2.92
11 Methyl n-octyl ketone 12.2 3.73
12 Methyl n-nonyl ketone 19.2 4.09
ᵃ AD50: dose required to anesthetize 50% of the animals

Fig. 2. Scatter plot of log P versus log 1/AD50 (see Table 3).

algorithm (89). Genetic algorithms are rooted in the Darwinian
principles of natural selection with the survival of the fittest,
employing a population of individuals (i.e., descriptors) that
undergo selection in the presence of variation-inducing operators
such as mutations and crossovers. A fitness function is used to
evaluate the individuals (89, 90).
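A toy sketch of genetic-algorithm descriptor selection follows; individuals are binary masks over a random placeholder descriptor pool, fitness is the r² of the corresponding least-squares sub-model, and selection, crossover, and mutation are implemented in their simplest form:

```python
# A toy genetic algorithm for descriptor selection on placeholder data.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 12))            # 40 chemicals x 12 descriptors
y = X[:, 0] - 0.5 * X[:, 3] + rng.normal(scale=0.1, size=40)

def fitness(mask):
    if not mask.any():
        return -np.inf
    A = np.hstack([np.ones((X.shape[0], 1)), X[:, mask]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()    # r^2 of the sub-model

pop = rng.integers(0, 2, size=(20, X.shape[1]), dtype=bool)
for generation in range(50):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)][-10:]          # survival of the fittest
    children = parents[rng.integers(0, 10, size=10)].copy()
    cut = rng.integers(1, X.shape[1])                # one-point crossover
    children[:5, :cut] = parents[rng.integers(0, 10, size=5)][:, :cut]
    mutate = rng.random(children.shape) < 0.05       # random bit-flip mutations
    children[mutate] = ~children[mutate]
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print(np.flatnonzero(best))   # indices of the selected descriptors
```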
The choice of the best statistical tool for computing a structure–
activity model should be a key step in an SAR or a QSAR modeling
process but unfortunately, it is generally not the case. The reason
is that a lot of people implicitly
postulate that the relationship between the activities of the mole-
cules and their molecular descriptors can only be linear. As a
result, the linear regression analysis (91) and PLS analysis (69)
are very often used for modeling the continuous data and the
linear discriminant analysis (92) is employed for the categorical
data. This is a mistake because a lot of SARs are nonlinear and
hence, only purely nonlinear statistical methods are able to cor-
rectly encode such relationships. This has been clearly demon-
strated in numerous QSAR studies; see, e.g., refs. (10, 93–101),
where first a linear method was used and then, a nonlinear
approach, such as a three-layer perceptron (Fig. 3), which is an
artificial neural network, was tested from the same pool of descrip-
tors. However, these nonlinear statistical tools require some expe-
rience to be correctly used. They include different parameters that
have to be tuned at the correct value to produce acceptable
desired output values. Consequently, the best strategy always
consists in starting by using a linear method such as a regression
analysis or a PLS analysis and then, trying to see whether the use
of a nonlinear approach could improve the quality of the predic-
tion results.

Fig. 3. A three-layer perceptron.



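The strategy recommended above (fit a linear model first, then check whether a nonlinear tool improves on it) can be sketched as follows on synthetic data; a real application would also require careful parameter tuning and an external test set:

```python
# Linear regression versus a three-layer perceptron on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(100, 4))           # descriptors (placeholders)
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

linear = LinearRegression().fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(6,),     # one hidden layer, as in Fig. 3
                   max_iter=5000, random_state=0).fit(X, y)

print(linear.score(X, y), mlp.score(X, y))      # nonlinear fit should be better
```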
It is important to note that the hybridization of statistical
methods (102) can yield the design of more powerful SAR and
QSAR models (103, 104).

5. Model Diagnostics

Once the model has been constructed, it is important to verify
that no basic assumptions justifying the use of the selected
statistical method have been violated, to analyze the statistical
significance of its parameters, and to check whether the activities
of the chemicals from which it was computed have been correctly
predicted. The first two points can change considerably from one
statistical method to another while the last one only depends on
the type of biological data.
Regarding the SAR models aiming to predict categorical
response data such as positive versus negative responses, there are
four different possible model outcomes that are a true positive
(TP), a true negative (TN), a false positive (FP), and a false negative
(FN). A false positive is when the outcome is incorrectly classified as
active (or positive), when it is in fact inactive (or nega-
tive). A false negative is when the activity is incorrectly classified as
negative when it is in fact positive. True positives and true negatives
are obviously correct classifications. From these four types of
results, it is possible to calculate various parameters (105–107),
of which the most important are the following:
Sensitivity or true positive rate (TPR) = TP/(TP + FN).
False positive rate (FPR) = FP/(FP + TN).
Specificity or true negative rate (TNR) = TN/(FP + TN).
Accuracy = (TP + TN)/(TP + FP + TN + FN).
It is noteworthy that the TPR versus FPR plot is called a
receiver operating characteristic (ROC) curve (108, 109). The
ROC curves are particularly useful for comparing different config-
urations or classifiers (108, 109).
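These four statistics are straightforward to compute from the counts of the four outcomes, as in the following sketch (the counts are hypothetical):

```python
# Classification statistics for an SAR model from the four outcome counts.
def classification_stats(tp: int, tn: int, fp: int, fn: int) -> dict:
    return {
        "sensitivity (TPR)": tp / (tp + fn),
        "false positive rate": fp / (fp + tn),
        "specificity (TNR)": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

print(classification_stats(tp=40, tn=35, fp=15, fn=10))  # hypothetical counts
```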
Now, regarding the QSAR models, computed on continuous
biological data (e.g., log 1/LD50), the calculated activity value of
each chemical is subtracted from the corresponding experimental
activity value for that chemical. This difference is termed the residual.
A residual that is too large is called an outlier. From a statistical point of
view, an outlier among residuals is one that is far greater than the
rest in absolute value and perhaps lies three or four standard devia-
tions or further from the mean of the residuals (91).
In practice, the outliers have to be the subject of a lot of attention
because they pinpoint that there is more to understand about the
model before it can be safely used. Moreover, an understanding of the
cause of the outlier behavior can be of inestimable value to gain
insight into the underlying biochemical processes governing the
studied activity. The origin of such outlier behavior can be ascribed
to one or more of the following (110, 111):
– The outlier is the result of an incorrect design of the set of
chemicals used for computing the QSAR model.
– Some experimental biological data are inaccurate or wrong.
This can be due to differences in the endpoints, experimental
conditions, and so on. This can also be due to a simple typo
made during a data compilation process but also the result of
more pernicious events. Thus, for example, the toxicity of che-
micals is commonly expressed in ppm (i.e., parts per million)
and ppb (i.e., parts per billion); see, e.g., ref. 112. Unfortu-
nately, a billion in Europe and in North America refers to 10¹²
and 10⁹, respectively.
– The model requires additional or other descriptors to correctly
encode the studied biological activity or the values of some of
the selected descriptors are incorrect.
– The chemical detected as an outlier interacts by a different molecular
mechanism at its biochemical site of action than the other studied
chemicals.
– The outlier yields one or more metabolic or chemical transfor-
mation products acting by a different mechanism at its bio-
chemical site of action than the other studied compounds
(113, 114).
– Last, the statistical method can be deficient in finding the func-
tional relationship between the biological activity and the
selected molecular descriptors.
After a logical explanation has been found for the presence of
these outliers, the model is generally refined to increase its predic-
tive performances. It is important to stress that the elimination of a
chemical acting as outlier must be only performed when a problem
is clearly identified. Otherwise, the strategy consists in the addition
of chemicals and/or the addition of other molecular descriptors
and/or the use of another statistical engine. Refining a QSAR model
can be a time-consuming process, especially when it was derived
from nonlinear methods such as artificial neural networks (93).

6. Model Performance Estimation

After building a model of whatever type, it is necessary to assess
how well it might be expected to work. To do so, in a first step, its
performances are commonly estimated from the chemicals that
were used to compute the model. A simple method to perform
this internal validation is called the leave-one-out (LOO) cross-validation
procedure. As the name suggests, the process involves leaving one
chemical out, fitting the model from the remaining chemicals,
making a prediction for the excluded chemical, and then repeating
the process until every chemical in the set has been left out once.
A variety of statistics can be generated using this procedure such as
predictive residual sum of squares (PRESS), the cross-validated r²
called Q², and the standard deviation of errors of predictions
(SDEP) (115, 116). Other techniques such as bootstrapping, ran-
domization tests, and permutation tests, which are different types
of resampling methods, are also commonly used for internal valida-
tion (117–123). These approaches can provide a reasonable esti-
mate of the stability of congeneric QSAR models derived from
linear regression analysis or PLS analysis. This is also the case with
the SAR models derived from linear discriminant analysis when the
classes are not too unbalanced. Conversely, their interest is highly
questionable when a purely nonlinear method, such as a three-layer
perceptron (Fig. 3), is used for deriving non-congeneric SAR or
QSAR models. Indeed, with such a tool, it is important to keep in
mind that, for the same data set and parameter setting, different
solutions (i.e., models) exist which yield slightly different results.
The definitive choice of a model is done after an optimization
procedure during which the parameters of the artificial neural
network are refined and its architecture (i.e., neurons, weights) is
subject to a pruning algorithm to try reducing the number of
connections within the network. Ultimately, the choice of a
model results from a compromise between its prediction perfor-
mances obtained on the set of molecules used to derive the model
(training set) and those calculated on an external test set in order to
secure its generalization capability (93). In fact, in order to cor-
rectly estimate the prediction performances of a linear or a nonlin-
ear SAR or QSAR model, an external test set has to be used.
Unfortunately sometimes, for specific series of molecules, there
are so few compatible biological data that it is impossible to con-
sider a test set and all the data have to be used to derive the model.
For such situations, the above statistical approaches are better than
nothing. However, it is noteworthy that a leave-n-out procedure is
always better than an LOO.
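A minimal LOO sketch computing PRESS, Q², and SDEP for a linear model on placeholder data:

```python
# Leave-one-out cross validation: PRESS, Q^2, and SDEP for a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.2, size=25)

press = 0.0
for train, test in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train], y[train])
    press += (y[test][0] - model.predict(X[test])[0]) ** 2

q2 = 1.0 - press / ((y - y.mean()) ** 2).sum()
sdep = np.sqrt(press / len(y))
print(q2, sdep)
```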
Nevertheless, ideally an SAR or a QSAR model has to be derived
from a training (learning) set and then, its performances have to be
evaluated on an external test set. The selection of these sets is
problem dependent. It is generally performed by means of linear
and nonlinear multivariate methods (124–127) to secure the repre-
sentativeness of the biological activities and chemical structures in
both sets. In order to try to estimate with accuracy the simulation
performances of a structure–activity model, it can be interesting to
split the test set into an in-sample test set (ISTS) and an out-of-
sample test set (OSTS) (109, 128). The ISTS, including structures

Table 4
Log P-dependent QSAR equations for nonpolar
narcotics (130)

Species Slope Intercept r² s F n


Pimephales promelas 0.87 1.79 0.96 0.30 1,088 51
Tetrahymena pyriformis 0.74 1.86 0.96 0.21 3,341 148
Vibrio fischeri 0.94 1.46 0.76 0.77 212 69

widely represented in the training set (e.g., positional isomers), is
used for assessing the interpolation performances of the QSAR
model while the OSTS, including particular chemicals weakly repre-
sented in the training set, is useful to try to estimate the extrapola-
tion performances of the model (109, 128). It is obvious that this
strategy can be used when the availability of experimental data is not
a limiting factor but in all cases, it has to be made with care. Indeed,
it is well accepted that interpolated data will be safer and less likely to
be prone to uncertainties than extrapolated data (129).
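As one simple example of such a selection, the sketch below applies a max-min distance heuristic in descriptor space to pick a representative training set; this is only one option among the multivariate methods cited above, and the data are placeholders:

```python
# A max-min (maximum dissimilarity) selection of a training set in
# descriptor space; the remaining chemicals form the test set.
import numpy as np

def maxmin_select(X, n_train):
    # Start from the chemical farthest from the centroid.
    chosen = [int(np.argmax(np.linalg.norm(X - X.mean(0), axis=1)))]
    while len(chosen) < n_train:
        d = np.min(np.linalg.norm(X[:, None] - X[chosen], axis=2), axis=1)
        d[chosen] = -1.0                   # never re-pick a chosen chemical
        chosen.append(int(np.argmax(d)))   # farthest from the current set
    return np.array(chosen)

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))               # descriptor matrix (placeholder)
train_idx = maxmin_select(X, 20)
test_idx = np.setdiff1d(np.arange(30), train_idx)
print(train_idx.size, test_idx.size)
```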

7. Interpreting the Models

If the molecular descriptors included in a model are meaningful, the signs and values of their coefficients can be used to interpret it. This is classically done with regression models, where it is possible to see directly which parameters contribute positively or negatively to the modeled activity. Mechanistic information can also be obtained by comparing the slope and intercept of a new QSAR model with those of existing QSAR models derived for the same type of molecules and with the same molecular descriptor. This is exemplified in Table 4, which lists three log P-dependent QSAR equations for nonpolar narcotics obtained for three aquatic species (130). This approach is called comparative QSAR (131, 132), but also lateral validation, because this comparison exercise provides an indirect validation of the equation parameters (132, 133). Indeed, if the same molecular descriptor is present, with a similar contribution, in the QSAR models being compared, more confidence can be attributed to all the models.
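As a small worked example of how such equations are applied (and compared), the sketch below evaluates the three Table 4 equations, toxicity = slope × log P + intercept in the usual log(1/C) form for narcosis QSARs, for a single arbitrary log P; the chosen value of 3.0 is purely hypothetical.

# Worked example: applying the log P-dependent narcosis equations of Table 4.
EQUATIONS = {
    "Pimephales promelas":    (0.87, -1.79),
    "Tetrahymena pyriformis": (0.74, -1.86),
    "Vibrio fischeri":        (0.94, -1.46),
}

log_p = 3.0                                        # hypothetical chemical
for species, (slope, intercept) in EQUATIONS.items():
    print(f"{species}: predicted log(1/C) = {slope * log_p + intercept:.2f}")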
Another point to consider is the way in which the modeling results are interpreted. Unfortunately, unwarranted generalizations are very common. Thus, for example, in endocrine disruption modeling (134), most of the SAR and QSAR models are aimed at predicting the binding activity of chemicals to a specific endocrine receptor. Very often, from such studies, chemicals are claimed to be non-endocrine disruptors only because they gave a negative result on the modeled receptor. This is wrong, because a chemical can be inactive against one endocrine receptor but at the same time be an effective binder of another receptor and/or interact with another process related to the endocrine system (135). Consequently, any generalization from a result obtained on one target is dangerous and, as a result, some complex activities require the simultaneous consideration of different targets to be correctly modeled.

8. Concluding
Remarks
SAR and QSAR models are increasingly used to provide insights into the mechanisms of action of organic chemicals and to fill data gaps. When possible, they have to be used as surrogates for toxicity tests on vertebrates under the registration, evaluation, authorization and restriction of chemicals (REACH) regulation, the EU regulation on chemicals (136). However, to be safely used, an SAR or a QSAR model needs to be correctly designed. According to the so-called OECD principles for the validation of SAR and QSAR models, a model must have (137):
(1) A defined endpoint.
(2) An unambiguous algorithm.
(3) A defined domain of applicability.
(4) Appropriate measures of goodness of fit, robustness, and predictivity.
(5) A mechanistic interpretation, if possible.
These five conditions represent basic requirements because, in fact, there are numerous points to respect in deriving an SAR or a QSAR model that does not fail in its predictions and, hence, can be used for research or regulation.
Preferably, the biological activity data have to be obtained under the same experimental conditions (i.e., the same protocol). If this is not the case, they have to be compatible. QSAR models need high-quality activity data spanning a sufficient range of magnitude, which generally have to be log10 transformed. If there is too much uncertainty about their quality, they must be transformed into categorical data for deriving SAR models. Particular attention has to be paid to the fact that some statistical methods are very sensitive to unbalanced classes. This is the case for linear discriminant analysis (15, 16, 92, 138–140).
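A minimal sketch of this sensitivity, assuming scikit-learn's linear discriminant analysis and random placeholder data: with strongly unbalanced classes, the default fit (priors estimated from class frequencies) tends to neglect the minority class, while forcing equal priors, one possible mitigation, changes the picture.

# Sketch: effect of unbalanced classes on linear discriminant analysis.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)),      # majority class
               rng.normal(1.5, 1.0, (5, 2))])      # minority class
y = np.array([0] * 95 + [1] * 5)

lda_default = LinearDiscriminantAnalysis().fit(X, y)
lda_equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
print("minority recall, default priors:", (lda_default.predict(X[y == 1]) == 1).mean())
print("minority recall, equal priors:  ", (lda_equal.predict(X[y == 1]) == 1).mean())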
Use molecular descriptors that are informative, not redundant, and not correlated. A descriptor can be highly discriminative but totally meaningless. When an indicator variable is used, its frequency in the training set has to be sufficient to avoid statistical problems, especially with linear regression analysis; an occurrence of at least 5% should be respected. Physicochemical properties undoubtedly represent the best molecular descriptors. However, for a given property, it is dangerous to mix experimental and computed values, as well as to use data computed from different QSPR models. If a large number of descriptors are used, it is necessary to reduce their number prior to modeling by using an appropriate statistical method. Sometimes the descriptors have to be scaled prior to use; if so, an optimal scaling procedure has to be employed (109).
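The sketch below illustrates two of the points above under stated assumptions (a placeholder descriptor matrix and an arbitrary 0.95 correlation cutoff): dropping near-duplicate descriptors before modeling, and scaling the remainder to zero mean and unit variance; it is not the optimal scaling procedure of ref. 109.

# Sketch: remove highly correlated descriptors, then scale the rest.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 6))
X[:, 5] = 0.99 * X[:, 0] + rng.normal(scale=0.01, size=50)   # redundant column

corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):            # arbitrary cutoff
        keep.append(j)

X_scaled = StandardScaler().fit_transform(X[:, keep])        # zero mean, unit variance
print("kept descriptor columns:", keep)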
Very often, statistics still seems to be the poor relation of SAR and QSAR modeling. Unfortunately, many QSAR modelers focus their attention only on the biological data and the molecular descriptors, failing to pay enough attention to the statistics. This explains why it is rather common to find models of poor quality and of little interest published in the literature. Ideally, statistics should accompany all the modeling steps. Indeed, it should be used for graphing the data, for reducing the number of descriptors and selecting the best ones, for constituting the training and test sets, for computing the model, and for analyzing its prediction results. While the choice of a statistical method for deriving an SAR or a QSAR model is problem dependent, many modelers always use the same approach. This can degrade the obtained model and can lead to misinterpretations. Thus, it is crucial to use a statistical approach suitable to the problem and appropriate to the data. Starting with a linear method and then testing whether a nonlinear approach can improve the quality of the results represents the best modeling strategy. Last, it is noteworthy that the statistical tool used for deriving the SAR or QSAR model can itself be a source of mistakes. Collections of statistical tools are available on the Internet (141), but it is very often hard to separate the wheat from the chaff. The main problem lies in the fact that a large number of freeware and shareware programs have not been sufficiently debugged, their algorithms are not efficient enough, and so on. To avoid this problem, especially if the potential user has limited computing skills, commercial software designed by professionals, such as Statistica (142) or SIMCA (143), can be used.
The models have to be assessed both in terms of their goodness of fit and in terms of their predictive power. While the assessment of goodness of fit is based on statistical parameters, the predictive power of a model can only be correctly assessed by estimating the activity of chemicals not included in the training set. Although there is a consensus on the crucial importance of this modeling step, opinions diverge on who is best suited to perform the external validation exercise when the studied model is intended to be used for regulatory purposes. Indeed, some people still do not distinguish between models designed for research purposes and those developed to support regulatory decisions. They are convinced that the persons who created a model, and those who support its development and diffusion, are able on their own to fully perform the external validation of the model. In the case of models used for regulatory purposes, this is definitely not enough because, among other things, the end users of models require ever more credibility and transparency in the models they will have to use to support regulatory decisions. External validation is the cornerstone for establishing the credibility of a model. If model performance is found acceptable by fully independent investigators, this is more convincing than when the same result is found by investigators who developed the model or were involved, at different levels, in its development process (34).
Obviously, if possible, the SAR and QSAR models should be mechanistically interpretable. The analysis of the descriptors in a model should help in understanding the underlying mechanistic basis of the studied endpoint. This does not mean that the so-called pragmatic QSAR models (144), in which the descriptors are selected for their modeling performance rather than for their interpretability, are useless. If they are correctly derived from large sets of molecules, they can be used to screen large data sets to prioritize the most hazardous chemicals or to select those showing an activity of interest, depending on the modeled endpoint.
When building an SAR or a QSAR model, it is important to keep in mind that a model cannot be any better than its constitutive parameters. Moreover, if data of poor quality are used for computing a model, the resulting output will also be of poor quality. This has been popularized by the maxim "Garbage in, garbage out", which applies not only to modelers but also to the end users who have to select the values of the descriptors before running the model they want to use.

References
1. Cros AFA (1863) Action de l'alcool amylique sur l'organisme. Thesis, Strasbourg
2. Dujardin-Beaumetz D, Audigé (1875) Sur les propriétés toxiques des alcools par fermentation. CR Acad Sci Paris LXXX:192–194
3. Overton E (1901) Studien über die Narkose. Gustav Fischer, Jena
4. Lipnick RL, Filov VA (1992) Nikolai Vasilyevich Lazarev, toxicologist and pharmacologist, comes in from the cold. Trends Pharmacol Sci 13:56–60
5. Hansch C, Maloney PP, Fujita T et al (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194:178–180
6. Hansch C, Fujita T (1964) ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86:1616–1626
7. Gaines TB (1960) The acute toxicity of pesticides to rats. Toxicol Appl Pharmacol 2:88–99
8. Gaines TB (1969) Acute toxicity of pesticides. Toxicol Appl Pharmacol 14:515–534
9. Kato R (1974) Sex-related differences in drug metabolism. Drug Metab Rev 3:1–32
10. Devillers J (2004) Prediction of mammalian toxicity of organophosphorus pesticides from QSTR modeling. SAR QSAR Environ Res 15:501–510
11. Kaiser KLE (2004) Toxicity data sources. In: Cronin MTD, Livingstone D (eds) Predicting chemical toxicity and fate. CRC, Boca Raton, FL
12. Tan NX, Rao HB, Li ZR, Li XY (2009) Prediction of chemical carcinogenicity by machine learning approaches. SAR QSAR Environ Res 20:27–75
13. Fjodorova N, Vracko M, Jezierska A et al (2010) Counter propagation artificial neural network categorical models for prediction of carcinogenicity for non-congeneric chemicals. SAR QSAR Environ Res 21:57–75
14. Mombelli E, Devillers J (2010) Evaluation of the OECD (Q)SAR application toolbox and Toxtree for predicting and profiling the carcinogenic potential of chemicals. SAR QSAR Environ Res 21:731–752
15. Sanchez PM (1974) The unequal group size problem in discriminant analysis. J Acad Mark Sci 2:629–633
16. Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6:429–450
17. Devillers J, Mombelli E (2010) Evaluation of the OECD QSAR application toolbox and Toxtree for estimating the mutagenicity of chemicals. Part 2. α,β-Unsaturated aliphatic aldehydes. SAR QSAR Environ Res 21:771–783
18. Benigni R, Passerini L, Rodomonte A (2003) Structure–activity relationships for the mutagenicity and carcinogenicity of simple and α,β-unsaturated aldehydes. Environ Mol Mutagen 42:136–143
19. OECD QSAR Application Toolbox. http://www.oecd.org/document/54/0,3343,en_2649_34379_42923638_1_1_1_1,00.html
20. Toxtree. http://ecb.jrc.it/qsar/qsar-tools/index.php?c=TOXTREE
21. Free SM, Wilson JW (1964) A mathematical contribution to structure–activity studies. J Med Chem 7:395–399
22. Serebryakov EP, Epstein NA, Yasinskaya NP et al (1984) A mathematical additive model of the structure–activity relationships of gibberellins. Phytochemistry 23:1855–1863
23. Zahradnik P, Foltinova P, Halgas J (1996) QSAR study of the toxicity of benzothiazolium salts against Euglena gracilis: the Free-Wilson approach. SAR QSAR Environ Res 5:51–56
24. Fouchecourt MO, Beliveau M, Krishnan K (2001) Quantitative structure–pharmacokinetic relationship modelling. Sci Total Environ 274:125–135
25. Globisch C, Pajeva IK, Wiese M (2006) Structure–activity relationships of a series of tariquidar analogs as multidrug resistance modulators. Bioorg Med Chem 14:1588–1598
26. Alkorta I, Blanco F, Elguero J (2008) Application of Free-Wilson matrices to the analysis of the tautomerism and aromaticity of azapentalenes: a DFT study. Tetrahedron 64:3826–3836
27. Baggiani C, Baravalle P, Giovannoli C et al (2010) Molecularly imprinted polymers for corticosteroids: analysis of binding selectivity. Biosens Bioelectron 26:590–595
28. Hall LH, Kier LB, Phipps G (1984) Structure–activity relationship studies on the toxicities of benzene derivatives: I. An additivity model. Environ Toxicol Chem 3:355–365
29. Hall LH, Kier LB (1986) Structure–activity relationship studies on the toxicities of benzene derivatives: II. An analysis of benzene substituent effects on toxicity. Environ Toxicol Chem 5:333–337
30. Devillers J, Zakarya D, Chastrette M et al (1989) The stochastic regression analysis as a tool in ecotoxicological QSAR studies. Biomed Environ Sci 2:385–393
31. Duewer DL (1990) The Free-Wilson paradigm redux: significance of the Free-Wilson coefficients, insignificance of coefficient uncertainties and statistical sins. J Chemom 4:299–321
32. Ashby J, Tennant RW (1988) Chemical structure, Salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis among 222 chemicals tested in rodents by the U.S. NCI/NTP. Mutat Res 204:17–115
33. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res 659:248–261
34. Devillers J, Mombelli E, Samserà R (2011) Structural alerts for estimating the carcinogenicity of pesticides and biocides. SAR QSAR Environ Res 22:89–106
35. Tsakovska I, Gallegos Saliner A, Netzeva T et al (2007) Evaluation of SARs for the prediction of eye irritation/corrosion potential: structural inclusion rules in the BfR decision support system. SAR QSAR Environ Res 18:221–235
36. Gallegos Saliner A, Tsakovska I, Pavan M et al (2007) Evaluation of SARs for the prediction of skin irritation/corrosion potential: structural inclusion rules in the BfR decision support system. SAR QSAR Environ Res 18:331–342
37. Dearden JC (1990) Physico-chemical descriptors. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure–activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
38. Domine D, Devillers J, Chastrette M et al (1992) Multivariate structure–property relationships (MSPR) of pesticides. Pestic Sci 35:73–82
39. Samiullah Y (1990) Prediction of the environmental fate of chemicals. Elsevier, London
40. Mackay D, Di Guardo A, Hickie B et al (1997) Environmental modelling: progress and prospects. SAR QSAR Environ Res 6:1–17
41. Hemond HF, Fechner EJ (1994) Chemical fate and transport in the environment. Academic, San Diego, CA
42. Devillers J (1998) Environmental chemistry: QSAR. In: Schleyer PvR, Allinger NL, Clark T, Gasteiger J, Kollman PA, Schaefer HF, Schreiner PR (eds) The encyclopedia of computational chemistry, vol 2. Wiley, Chichester
43. Devillers J (2007) Application of QSARs in aquatic toxicology. In: Ekins S (ed) Computational toxicology. Risk assessment for pharmaceutical and environmental chemicals. Wiley, Hoboken, NJ
44. Devillers J, Domine D, Bintein S et al (1998) Fish bioconcentration modeling with log P. Toxicol Methods 8:1–10
45. Bintein S, Devillers J (1994) QSAR for organic chemical sorption in soils and sediments. Chemosphere 28:1171–1188
46. Trapp S, Rasmussen D, Samsøe-Petersen L (2003) Fruit tree model for uptake of organic compounds from soil. SAR QSAR Environ Res 14:17–26
47. Sangster J (1997) Octanol-water partition coefficients: fundamentals and physical chemistry. Wiley, Chichester
48. Rekker RF, Mannhold R (1992) Calculation of drug lipophilicity. The hydrophobic fragmental constant approach. VCH, Weinheim
49. Hansch C, Leo A (1995) Exploring QSAR. Fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
50. Devillers J, Domine D, Guillon C (1998) Autocorrelation modeling of lipophilicity with a back-propagation neural network. Eur J Med Chem 33:659–664
51. Domine D, Devillers J (1998) A computer tool for simulating lipophilicity of organic molecules. Sci Comput Autom 15:55–63
52. Devillers J (2000) EVA/PLS versus autocorrelation/neural network estimation of partition coefficients. Perspect Drug Discov Design 19:117–131
53. Yaffe D, Cohen Y, Espinosa G et al (2002) Fuzzy ARTMAP and back-propagation neural networks based quantitative structure–property relationships (QSPRs) for octanol-water partition coefficient of organic compounds. J Chem Inf Comput Sci 42:162–183
54. Tetko IV, Tanchuk VY (2002) Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. J Chem Inf Comput Sci 42:1136–1145
55. Lyman WJ, Reehl WF, Rosenblatt DH (1990) Handbook of chemical property estimation methods. American Chemical Society, Washington, DC
56. Reinhard M, Drefahl A (1999) Handbook for estimating physicochemical properties of organic compounds. Wiley, New York, NY
57. Boethling RS, Howard PH, Meylan WM (2004) Finding and estimating chemical property data for environmental assessment. Environ Toxicol Chem 23:2290–2308
58. Cronin MTD, Livingstone DJ (2004) Calculation of physicochemical properties. In: Cronin MTD, Livingstone DJ (eds) Predicting chemical toxicity and fate. CRC, Boca Raton, FL
59. Devillers J, Balaban AT (1999) Topological indices and related descriptors in QSAR and QSPR. Gordon and Breach Science Publishers, Amsterdam
60. Kier LB, Hall LH (1986) Molecular connectivity in structure–activity analysis. Wiley, Letchworth
61. Kier LB, Hall LH (1999) Molecular structure description: the electrotopological state. Academic, New York, NY
62. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references, 2nd edn. Wiley-VCH, Weinheim
63. Topliss JG, Costello RJ (1972) Chance correlations in structure–activity studies using multiple regression analysis. J Med Chem 15:1066–1068
64. Devillers J, Thioulouse J, Karcher W (1993) Chemometrical evaluation of multispecies-multichemical data by means of graphical techniques combined with multivariate analyses. Ecotoxicol Environ Saf 26:333–345
65. Devillers J, Karcher W (1990) Correspondence factor analysis as a tool in environmental SAR and QSAR studies. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure–activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
66. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967
67. Cramer RD, DePriest SA, Patterson DE et al (1993) The developing practice of comparative molecular field analysis. In: Kubinyi H (ed) 3D QSAR in drug design. Theory methods and applications. ESCOM, Leiden
68. Doucet JP, Panaye A (2010) Three dimensional QSAR: applications in pharmacology and toxicology. CRC, Boca Raton, FL
69. Geladi P, Tosato ML (1990) Multivariate latent variable projection methods: SIMCA and PLS. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure–activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
70. Kearsley SK, Smith GM (1990) An alternative method for the alignment of molecular structures: maximizing electrostatic and steric overlap. Tetrahedron Comput Method 3:615–633
71. Korhonen SP, Tuppurainen K, Laatikainen R et al (2005) Comparing the performance of FLUFF-BALL to SEAL-CoMFA with a large diverse estrogen data set: from relevant superpositions to solid predictions. J Chem Inf Model 45:1874–1883
72. Korhonen SP, Tuppurainen K, Laatikainen R et al (2003) FLUFF-BALL, a template-based grid-independent superposition and QSAR technique: validation using a benchmark steroid data set. J Chem Inf Comput Sci 43:1780–1793
73. Feher M, Schmidt JM (2000) Multiple flexible alignment with SEAL: a study of molecules acting on the colchicine binding site. J Chem Inf Comput Sci 40:495–502
74. Pastor M, Cruciani G, McLay I et al (2000) GRid-INdependent descriptors (GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J Med Chem 43:3233–3243
75. Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J Med Chem 37:4130–4146
76. Serafimova R, Walker J, Mekenyan O (2002) Androgen receptor binding affinity of pesticide active formulation ingredients. QSAR evaluation by COREPA method. SAR QSAR Environ Res 13:127–134
77. Petkov PI, Rowlands JC, Budinsky R et al (2010) Mechanism-based common reactivity pattern (COREPA) modelling of aryl hydrocarbon receptor binding affinity. SAR QSAR Environ Res 21:187–214
78. Turner DB, Willett P (2000) The EVA spectral descriptor. Eur J Med Chem 35:367–375
79. Todeschini R, Gramatica P (1997) The WHIM theory: new 3D molecular descriptors for QSAR in environmental modelling. SAR QSAR Environ Res 7:89–115
80. Selwood DL, Livingstone DJ, Comley JCW et al (1990) Structure–activity relationship of antifilarial antimycin analogues: a multivariate pattern recognition study. J Med Chem 33:136–142
81. Livingstone DJ (1995) The trouble with chemometrics. In: Sanz F, Giraldo J, Manaut F (eds) QSAR and molecular modelling: concepts, computational tools and biological applications. Prous Science, Barcelona
82. Thioulouse J, Devillers J, Chessel D et al (1991) Graphical techniques for multidimensional data analysis. In: Devillers J, Karcher W (eds) Applied multivariate analysis in SAR and environmental studies. Kluwer, Dordrecht
83. Cleveland WS (1994) The elements of graphing data. Hobart Press, Summit
84. Cook RD, Weisberg S (1994) An introduction to regression graphics. Wiley, New York, NY
85. Devillers J, Chezeau A, Thybaud E et al (2002) QSAR modeling of the adult and developmental toxicity of glycols, glycol ethers, and xylenes to Hydra attenuata. SAR QSAR Environ Res 13:555–566
86. Devillers J, Chezeau A, Thybaud E (2002) PLS-QSAR of the adult and developmental toxicity of chemicals to Hydra attenuata. SAR QSAR Environ Res 13:705–712
87. Kundu D, Murali G (1996) Model selection in linear regression. Comput Stat Data Anal 22:461–469
88. Tanii H (1996) Anesthetic activity of monoketones in mice: relationship to hydrophobicity and in vivo effects on Na+/K+-ATPase activity and membrane fluidity. Toxicol Lett 85:41–47
89. Devillers J (1996) Genetic algorithms in molecular modeling. Academic, London
90. Leardi R (2003) Nature-inspired methods in chemometrics: genetic algorithms and artificial neural networks. Elsevier, Amsterdam
91. Draper N, Smith H (1981) Applied regression analysis, 2nd edn. Wiley, New York, NY
92. Tomassone R, Danzart M, Daudin JJ et al (1988) Discrimination et classement. Masson, Paris
93. Devillers J (1996) Neural networks in QSAR and drug design. Academic, London
94. Zakarya D, Boulaamail A, Larfaoui EM et al (1997) QSARs for toxicity of DDT-type analogs using neural network. SAR QSAR Environ Res 6:183–203
95. Eldred DV, Jurs PC (1999) Prediction of acute mammalian toxicity of organophosphorus pesticide compounds from molecular structure. SAR QSAR Environ Res 10:75–99
96. Panaye A, Fan BT, Doucet JP et al (2006) Quantitative structure-toxicity relationships (QSTRs): a comparative study of various non linear methods. General regression neural network, radial basis function neural network and support vector machine in predicting toxicity of nitro- and cyano-aromatics to Tetrahymena pyriformis. SAR QSAR Environ Res 17:75–91
97. Kaiser KLE (2003) Neural networks for effect prediction in environmental and health issues using large datasets. QSAR Comb Sci 22:185–190
98. Devillers J (2008) Artificial neural network modeling in environmental toxicology. In: Livingstone D (ed) Artificial neural networks: methods and protocols. Humana, New York, NY
99. Fatemi MH, Abraham MH, Haghdadi M (2009) Prediction of biomagnification factors for some organochlorine compounds using linear free energy relationship parameters and artificial neural networks. SAR QSAR Environ Res 20:453–465
100. Devillers J (2009) Artificial neural network modeling of the environmental fate and ecotoxicity of chemicals. In: Devillers J (ed) Ecotoxicology modeling. Springer, New York, NY
101. Devillers J, Doucet JP, Panaye A et al (2009) Structure–activity modeling of a diverse set of androgen receptor ligands. In: Devillers J (ed) Endocrine disruption modeling. CRC, Boca Raton, FL
102. Goonatilake S, Khebbal S (1995) Intelligent hybrid systems. Wiley, Chichester
103. Devillers J (1996) Designing molecules with specific properties from intercommunicating hybrid systems. J Chem Inf Comput Sci 36:1061–1066
104. Devillers J (2005) A new strategy for using supervised artificial neural networks in QSAR. SAR QSAR Environ Res 16:433–442
105. Komaroff AL (1979) The variability and inaccuracy of medical data. Proc IEEE 67:1196–1207
106. Baldi P, Brunak S, Chauvin Y et al (2000) Assessing the accuracy of prediction algorithms for classifications: an overview. Bioinformatics 16:412–424
107. Carugo O (2007) Detailed estimation of bioinformatics prediction reliability through the fragmented prediction performance plots. BMC Bioinform 8:380. doi:10.1186/1471-2105-8-380
108. Sonego P, Kocsor A, Pongor S (2008) ROC analysis: applications to the classification of biological sequences and 3D structures. Brief Bioinform 9:198–209
109. Devillers J (1996) Strengths and weaknesses of the backpropagation neural network in QSAR and QSPR studies. In: Devillers J (ed) Neural networks in QSAR and drug design. Academic, London
110. Devillers J, Lipnick RL (1990) Practical applications of regression analysis in environmental QSAR studies. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure–activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer Academic Publishers, Dordrecht
111. Lipnick RL (1991) Outliers: their origin and use in the classification of molecular mechanisms of toxicity. In: Hermens JLM, Opperhuizen A (eds) QSAR in environmental toxicology-IV. Elsevier, Amsterdam
112. Frear DEH, Boyd JE (1967) Use of Daphnia magna for the microbioassay of pesticides. I. Development of standardized techniques for rearing Daphnia and preparation of dosage-mortality curves for pesticides. J Econ Entomol 60:1228–1236
113. Devillers J, Zakarya D, Chastrette M (1988) Structure–activity relationships for the toxicity of organic pollutants to Brachydanio rerio. In: Turner JE, England MW, Schultz TW et al (eds) QSAR 88, 3rd international workshop on quantitative structure–activity relationships in environmental toxicology, Knoxville
114. Devillers J, Boule P, Vasseur P et al (1990) Environmental and health risks of hydroquinone. Ecotoxicol Environ Saf 19:327–354
115. Cruciani G, Clementi S, Baroni M (1993) Variable selection in PLS analysis. In: Kubinyi H (ed) 3D QSAR in drug design. Theory, methods and applications. ESCOM, Leiden
116. Cruciani G, Baroni M, Bonelli D et al (1990) Comparison of chemometric models for QSAR. Quant Struct Act Relat 9:101–107
117. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, New York, NY
118. Gray HL, Baek J, Woodward WA et al (1996) A bootstrap generalized likelihood ratio test in discriminant analysis. Comput Stat Data Anal 22:137–158
119. Jonathan P, McCarthy WV, Roberts AMI (1996) Discriminant analysis with singular covariance matrices. A method incorporating cross-validation and efficient randomized permutation tests. J Chemom 10:189–213
120. Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22:69–77
121. Kolossov E, Stanforth R (2007) The quality of QSAR models: problems and solutions. SAR QSAR Environ Res 18:89–100
122. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701
123. Rücker C, Rücker G, Meringer M (2007) y-Randomization and its variants in QSPR/QSAR. J Chem Inf Model 47:2345–2357
124. Domine D, Devillers J, Chastrette M (1994) A nonlinear map of substituent constants for selecting test series and deriving structure–activity relationships. I. Aromatic series. J Med Chem 37:973–980
125. Domine D, Devillers J, Chastrette M (1994) A nonlinear map of substituent constants for selecting test series and deriving structure–activity relationships. II. Aliphatic series. J Med Chem 37:981–987
126. Domine D, Devillers J, Wienke D et al (1996) Test series selection from nonlinear neural mapping. Quant Struct Act Relat 15:395–402
127. Putavy C, Devillers J, Domine D (1996) Genetic selection of aromatic substituents for designing test series. In: Devillers J (ed) Genetic algorithms in molecular modeling. Academic, London
128. Devillers J, Bintein S, Domine D et al (1995) A general QSAR model for predicting the toxicity of organic chemicals to luminescent bacteria (Microtox test). SAR QSAR Environ Res 4:29–38
129. Anonymous (1998) QSARs in the assessment of the environmental fate and effects of chemicals. Technical report no. 74. ECETOC, Brussels
130. Schultz TW, Sinks GD, Bearden AP (1998) QSAR in aquatic toxicology: a mechanism of action approach comparing toxic potency to Pimephales promelas, Tetrahymena pyriformis, and Vibrio fischeri. In: Devillers J (ed) Comparative QSAR. Taylor & Francis, Washington, DC
131. Hansch C, Gao H, Hoekman D (1998) A generalized approach to comparative QSAR. In: Devillers J (ed) Comparative QSAR. Taylor & Francis, Washington, DC
132. Selassie CD, Klein TE (1998) Comparative quantitative structure activity relationships (QSAR) of the inhibition of dihydrofolate reductase. In: Devillers J (ed) Comparative QSAR. Taylor & Francis, Washington, DC
133. Kim KH (1995) Comparison of classical QSAR and comparative molecular field analysis. Toward lateral validations. In: Hansch C, Fujita T (eds) Classical and three-dimensional QSAR in agrochemistry. ACS symposium series 606, American Chemical Society, Washington, DC
134. Devillers J (2009) Endocrine disruption modeling. CRC, Boca Raton, FL
135. Devillers J, Marchand-Geneste N, Dore JC et al (2007) Endocrine disruption profile analysis of 11,416 chemicals from chemometrical tools. SAR QSAR Environ Res 18:181–193
136. Regulation (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006 concerning the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH), establishing a European Chemicals Agency, amending Directive 1999/45/EC and repealing Council Regulation (EEC) No 793/93 and Commission Regulation (EC) No 1488/94 as well as Council Directive 76/769/EEC and Commission Directives 91/155/EEC, 93/67/EEC, 93/105/EC and 2000/21/EC. Official Journal L396, 30.12.2006
137. Anonymous, The principles for establishing the status of development and validation of (quantitative) structure–activity relationships ((Q)SARs), OECD document, ENV/JM/TG(2004)27
138. Dillon WR, Goldstein M (1984) Multivariate analysis, methods and applications. Wiley, New York, NY
139. Menardi G (2009) Statistical issues emerging in modeling unbalanced data sets. In: 16th European Young Statisticians Meeting, 24–28 Aug 2009, Bucharest, Romania
140. He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21:1263–1284
141. Devillers J, Dore JC (2002) e-statistics for deriving QSAR models. SAR QSAR Environ Res 13:409–416
142. Statistica, StatSoft, http://www.statsoft.com/# (accessed 18 Jan 2011)
143. SIMCA-P, Umetrics, http://www.umetrics.com/ (accessed 18 Jan 2011)
144. Gedeck P, Kramer C, Ertl P (2010) Computational analysis of structure–activity relationships. In: Lawton G, Witty DR (eds) Progress in medicinal chemistry, vol 49. Elsevier, The Netherlands
Chapter 2

Accessing and Using Chemical Databases


Nikolai Nikolov, Todor Pavlov, Jay R. Niemela, and Ovanes Mekenyan

Abstract
Computer-based representation of chemicals makes it possible to organize data in chemical databases: collections of chemical structures and associated properties. Databases are widely used wherever efficient processing of chemical information is needed, including search, storage, retrieval, and dissemination. The structure and functionality of chemical databases are considered. The typical kinds of information found in a chemical database are considered: identification, structural, and associated data. The functionality of chemical databases is presented, with examples of search and access types. More details are included about the OASIS database and platform and the Danish (Q)SAR Database online. Various types of chemical database resources are discussed, together with a list of examples.

Key words: Chemical database, Molecular modeling, Cheminformatics

1. Introduction

The primary advantage of database organization is the possibility of rapid search and retrieval of desired subsets of data needed for a
specific purpose. This is very relevant for large data repositories as
database search can be orders of magnitude more efficient com-
pared to search in unstructured data.
Chemical databases are collections of data representing chemi-
cal structures and associated properties. A large part of the func-
tionality of chemical databases is inherited from general-purpose
databases but there are also specific functions and modes of use.
Various other kinds of chemistry-related knowledge can also be
found in databases (reaction pathways relevant to a chemical, e.g.,
metabolic pathways, pharmacological properties, synthesis proce-
dures, material safety information, etc.).
Modern scientific research makes a wide use of databases and
computer-based processing of chemical information in all areas
requiring its efficient access, storage, and manipulation: chemistry, biology, pharmacology, medicine, etc. Databases of structures and properties can be integrated into broader chemical software systems
for automating chemical information processing. Applications
include chemical data mining, prediction of physicochemical or
biological properties, virtual screening, etc.
In Subheading 2 we look at the kinds of information typically
found in a chemical database. Subheading 3 describes how to use
and build a chemical database. Subheading 4 presents examples.
Subheading 5 includes a brief introduction to the theory of rela-
tional databases as a supplement.

2. Materials

Databases, or data organized in order to achieve efficient access or processing, are widely used in all computing applications. Although
alternatives exist, the database type of choice for most applications
is the so-called relational database (see, e.g., ref. 1). These data-
bases use a well-developed mathematical theory, are capable of high
performance and versatility, and are relatively robust to application
software errors. Understanding relational database theory is not
essential for the rest of this chapter, but the interested reader can
find the basics in Subheading 5.
Chemical databases typically store the following kinds of infor-
mation.
Chemical identification data may include an identification
number, such as the Chemical Abstracts Service (CAS) registry
number, other registry or product numbers, and systematic or
trivial chemical names and synonyms.
An appropriately represented chemical structure is essential for
the representation of a chemical. While the molecular formula tells
about the composition of the chemical, it is only the complete
chemical structure that can precisely identify the typical organic
molecule. Chemical structure diagrams (representing atoms and
bonds as vertices and edges of a graph, as in Fig. 1) are useful for
the human reader but not so much for the machine representation

Fig. 1. A chemical structure diagram: 2,4,6-tris[(dimethylamino)methyl]phenol.
(a) c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1
(b) [SD file record for CAS 90-72-2: header, atom block with 3D coordinates, bond block, and the data fields <CHEMICAL> and <CAS>]

Fig. 2. (a) SMILES and (b) SD record representations of the same chemical structure.

of structures. Therefore, various computer-readable representations of chemical structure have been proposed (2). These representations
can most often encode the two-dimensional (2D) structure of a
chemical, the knowledge about the number and type of atoms and
bonds, which atoms are connected and by what type of bond, as well
as stereochemical information. Figure 2 shows two of the most
widely accepted computer-readable formats for representation of
the 2D structure of the chemical considered in Fig. 1: SMILES
notation (Fig. 2a) and a fragment of a structure definition (SD)
file record (Fig. 2b). It is easy to automatically draw chemical
structure diagrams from such representations, if necessary; it is also
convenient to do database selections of chemicals based on their 2D
structure. Chemical databases use such representations not only to
store information about chemical structures but also to exchange
data with other computer applications.
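As a concrete illustration (using the open-source RDKit toolkit, one of many possible choices and not software discussed in this chapter), the sketch below parses the SMILES string of Fig. 2a and writes an SD record similar in spirit to Fig. 2b; the output file name is arbitrary.

# Sketch: moving between SMILES and SD representations with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1"    # Fig. 2a
mol = Chem.MolFromSmiles(smiles)
mol.SetProp("CHEMICAL", "Phenol, 2,4,6-tris[(dimethylamino)methyl]-")
mol.SetProp("CAS", "90-72-2")

AllChem.Compute2DCoords(mol)                       # coordinates for the atom block
writer = Chem.SDWriter("example.sdf")              # Fig. 2b-style record
writer.write(mol)
writer.close()
print(Chem.MolToSmiles(mol))                       # canonical SMILES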
In addition to the 2D structure, some chemical databases also
keep the spatial configuration of the atoms of a chemical, or the chemical's 3D structure. To this end, the three-dimensional coordinates of every atom are stored in the database; information may also be available about how the 3D structure is obtained.

Fig. 3. A 3D structure diagram of 2,4,6-tris[(dimethylamino)methyl]phenol generated from the atom coordinates in the SD record in Fig. 2b.

As the
possible (and, in particular, the biologically active) conformations


of a chemical are not usually limited to a single 3D structure, some
chemical information systems can also generate and store multiple
representative conformations for every chemical in the database;
this is sometimes called the "4D structure" of chemicals (3). Similarly
to the case with 2D structure, three-dimensional chemical structure
diagrams can be generated from the database representation of the
3D structure of a chemical (Fig. 3 shows the 3D structure diagram
from the atom coordinates in Fig. 2b).
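A minimal sketch of the idea of storing several representative conformers per chemical, here with RDKit's distance-geometry embedding; the conformer-multiplication algorithms used by actual chemical information systems differ, and the number of conformers chosen below is arbitrary.

# Sketch: generating and keeping multiple conformers ("4D structure").
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1"))
AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=42)
AllChem.MMFFOptimizeMoleculeConfs(mol)             # crude force-field refinement
print(mol.GetNumConformers(), "conformers stored")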
A database may support additional means to represent discrete
chemical structures as well as salts, mixtures, generic structures
(e.g., R-group structures), etc. A generic structure describes a list
of chemical structures having a common substructure and is used
to improve efficiency compared to having to enumerate all the
structures from the list.
Various types of information about chemicals are stored using descriptors: usually numeric or text tags accompanying a chemical structure. They can represent physicochemical characteristics, environmental fate, and biological properties, including human health effects, animal toxicity studies, environmental toxicity, etc. In addition, all these data may be either observed (e.g., representing experimental results) or predicted (calculated using mathematical models of the corresponding chemical or biological properties). The database may also contain relevant bibliographic references or other details about an assay, such as test duration, organism, route of administration, etc. Logical or Boolean descriptors (taking either "Yes" or "No" as possible values) are also used in some databases, for example, to store the presence or absence of a categorical property associated with a chemical structure. Descriptors usually reflect a functional, or "one-to-one", dependence (one descriptor has a single value for a single structure), but "one-to-many" descriptors are also used (one descriptor takes a set of values for a single structure). An example is a text descriptor representing the set of chemical names of a structure.
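One hedged sketch of how such data might be laid out in a relational database (the table and column names are invented for illustration; real chemical database schemas are far richer): a one-to-one descriptor keyed by structure, and a one-to-many descriptor (chemical names) in its own table.

# Sketch: minimal relational layout for structures and descriptors (SQLite).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE structure (
    id     INTEGER PRIMARY KEY,
    cas    TEXT,
    smiles TEXT NOT NULL
);
CREATE TABLE descriptor_value (      -- one-to-one: one value per structure
    structure_id INTEGER REFERENCES structure(id),
    name         TEXT,
    value        REAL,
    observed     INTEGER,            -- 1 = measured, 0 = predicted
    PRIMARY KEY (structure_id, name)
);
CREATE TABLE chemical_name (         -- one-to-many: several names per structure
    structure_id INTEGER REFERENCES structure(id),
    name         TEXT
);
""")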

3. Methods

Database search, perhaps the most important database function, is the selection of all chemical structures matching a given condition. Generally, search can be performed on all data types stored in a database: identification data (individual registry numbers or lists of those, chemical names and synonyms or their parts), numeric or Boolean descriptor values (search for precise values or ranges of specified physicochemical or biological observed or calculated descriptors), or text search within bibliographical references or other text descriptors.
Substructure search, or the task of finding all structures con-
taining a given structure fragment (see ref. 4), is a special search
used in chemical databases.
Figure 4a shows a case of search for chlorobenzene as a sub-
structure. Four different chemicals containing the fragment are in
the search result.
Refining the search condition leaves fewer compounds in the
search result: Fig. 4b shows another search query where three
wildcard atoms are required to be attached to chlorobenzene at
the specified positions with single bonds. The result contains three
of the four compounds previously found.
The search condition is further refined by demanding that no
other atoms should appear in the target structure beyond the ones
in the fragment (hydrogen atoms are not relevant). The result is
now confined to two of the initial four chemical structures.
Exact structure search can be implemented as a particular case
of substructure search by using the exact matching mode: if no
wildcard atoms are specified, the search procedure will only look for
the query chemical.
Substructure search is usually implemented by means of
specialized algorithms. In these algorithms, every chemical struc-
ture is represented by a graph, a mathematical object including a set
of vertices interconnected by edges. In the representation, graph
vertices denote atoms while edges stand for bonds.

Fig. 4. Substructure search: (a) fixed fragment, substructure mode; (b) varying atoms (R stands for any atom), substructure mode; (c) varying atoms (R stands for any atom), no other heavy atoms are allowed except those found in the fragment.

Identifying the
structures having a given fragment as a substructure reduces to a


mathematical problem called finding a subgraph isomorphism
(or, depending on the task at hand, finding the maximum common
subgraph of two given ones). There exist efficient algorithms for
these tasks (57); however, it has been proved that these problems
are computationally heavy for any algorithm. The solution usually
applied in chemical software is to pre-calculate necessary raw data to
be later used as a prescreen by the search system so that the majority
of structures not matching the search condition can be detected
before computationally expensive graph-theoretical procedures are
invoked.
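The sketch below mimics this two-stage strategy with RDKit (an illustrative toolkit choice): a cheap fingerprint prescreen rejects most non-matches before the expensive subgraph-isomorphism test is invoked; the tiny in-memory "database" is a placeholder.

# Sketch: fingerprint prescreen followed by exact substructure matching.
from rdkit import Chem

database = [Chem.MolFromSmiles(s) for s in
            ["Clc1ccccc1", "Nc1ccc(Cl)cc1", "c1ccccc1", "CCCl"]]
query = Chem.MolFromSmiles("Clc1ccccc1")           # chlorobenzene fragment
query_fp = Chem.PatternFingerprint(query)

hits = []
for mol in database:
    mol_fp = Chem.PatternFingerprint(mol)
    # prescreen: every bit set for the query must be set for the molecule,
    # otherwise a substructure match is impossible
    if (mol_fp & query_fp).GetNumOnBits() == query_fp.GetNumOnBits():
        if mol.HasSubstructMatch(query):           # graph-theoretical step
            hits.append(Chem.MolToSmiles(mol))
print(hits)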
Structure similarity search (see, e.g., ref. 8) is another type
of database search specific to chemical databases. The task here is
to retrieve all database chemicals similar to a specified one,
where similarity between two chemicals is defined in terms of a
mathematical function taking the 2D or 3D structure of the
chemicals as arguments. The system then calculates the similarity between the given structure and the relevant structures from the database and selects the appropriate ones (for example, those exceeding a specified similarity threshold). The definition of similarity can be extended to cover both chemical structure and descriptors. Figure 5 shows some chemicals structurally similar to 2,4,6-tris[(dimethylamino)methyl]phenol (2D structure shown in Fig. 1), where similarity is defined with atom-centered fragments and fingerprints (9).

Fig. 5. Similarity search.
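A minimal similarity-search sketch, again with RDKit: Morgan fingerprints and a Tanimoto threshold stand in for the atom-centered fragment similarity of ref. 9, which differs in detail; the candidate structures and the 0.3 threshold are arbitrary.

# Sketch: Tanimoto similarity search over a small in-memory collection.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

target = Chem.MolFromSmiles("c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1")
candidates = ["Oc1ccccc1CN(C)C", "CCO", "Oc1c(CN(C)C)cc(CN(C)C)cc1C"]

t_fp = AllChem.GetMorganFingerprintAsBitVect(target, 2, nBits=2048)
for smi in candidates:
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(t_fp, fp)
    if sim >= 0.3:                                 # arbitrary threshold
        print(f"{smi}: Tanimoto = {sim:.2f}")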
A database storing 3D structures may provide search function-
ality involving atomic coordinates. One example is search by
Euclidean distance between specified atoms. Search is then carried
out on all conformers and a structure matches the search condition
whenever at least one of its representative conformers has atoms
that match the distance condition.
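As a sketch of such a query under stated assumptions (RDKit conformers, an invented bis-phenol test molecule, and a 10–12 Å window echoing the estrogen-receptor example in Subheading 4.2), the code below reports a match when at least one conformer places two oxygen atoms within the distance range.

# Sketch: Euclidean distance search over all stored conformers.
from itertools import combinations
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("Oc1ccc(CCCCCCc2ccc(O)cc2)cc1"))
AllChem.EmbedMultipleConfs(mol, numConfs=20, randomSeed=7)

oxygens = [a.GetIdx() for a in mol.GetAtoms() if a.GetSymbol() == "O"]
match = any(
    10.0 < (conf.GetAtomPosition(i) - conf.GetAtomPosition(j)).Length() < 12.0
    for conf in mol.GetConformers()
    for i, j in combinations(oxygens, 2)
)
print("matches distance condition:", match)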
Some databases provide a database browser, a tool to scroll
through the list of structures or search results. A database browser
may also allow for inspection, insertion, modification, and deletion of different types of data items.
Visualization of a chemical structure based on its computer-
readable representation is a widely used feature of chemical data-
base software. It generates a 2D or a 3D structure diagram from the
database record of a chemical for browsing or reporting purposes.
While the inverse task is not often addressed, there are tools to
generate computer-readable representations of 2D chemical struc-
ture from 2D structure diagrams (a brief overview is found in
ref. 10). Such tools can be useful for automating, e.g., the import
of large collections of paper documents into a chemical database.
Similarly, tools to generate a chemical name from a computer
representation of 2D structure and, conversely, to derive such a
representation from a chemical name exist (11); for systematic
names, the task is easier to solve in the general case.
Tools to create new databases or import content into existing
ones include converter facilities that understand different input
formats for storing molecular structure and convert them to the
internal representation of chemical structures adopted in the spe-
cific chemical database platform.
Database access may also offer different options in different
databases. With single input access, the database software accepts a
single chemical as a search query, or as input for processing. The
structure may be prepared in a file or submitted using a structure
editor. Batch input access provides functionality to work with a
large number of structure entries stored in a file and then processed
without human intervention.
Chemical databases can be built with the human user in mind
but they may also serve computer processing of chemical data, such
as automatic retrieval of chemical structure representation and
providing it to further calculations. With manual access, database
systems accept commands from a user interface, typically a graphical
one. With programmatic access, a chemical database package pro-
vides a protocol of functions for other computer programs to use so
that complex scientific or business logic can be programmed that
includes the functionality of the chemical database to be integrated
in a computer application.
As different software suites may have different, and often
incompatible, sets of such functions, some efforts (12, 13) aim at
standardizing such function protocols, replacing them with a more
general framework for providing functionality between computer
applications, such as Web services (14). In relation to that, approa-
ches were proposed towards a unified representation of chemical
information in terms of more flexible data description languages,
such as CMLthe Chemical Markup Language (15), helping dif-
ferent computer applications understand the same chemical infor-
mation and interpret it correctly.
The work on the Semantic Web (16), a set of technologies for developing machine-readable semantic annotations of the
resources of the Web so that software applications may be capable
of intelligent search and reasoning of, e.g., Web pages, has also
influenced the development of chemical databases. In the last
decade, projects were started (17) for building chemical or bio-
logical ontologies (18), collections of data plus formalized repre-
sentation of knowledge from chemistry and other natural sciences,
allowing computer programs to perform reasoning or answer com-
plex queries selecting not only data from a database but also taking
into account chemical notions, properties, and relations between
scientific and technological concepts.
We must note that chemical information systems evolve
towards improving the automation capabilities of database opera-
tions and modes of use. This ranges from simple scripting of
sequences of operations to complex systems of standardized archi-
tecture capable of assisting decision making on the basis of modern
artificial intelligence approaches applied to the field of computa-
tional chemistry.

4. Examples

This section presents three examples and includes a brief overview of the available types of chemical database resources online. The
OASIS Centralized database is a large collection of chemicals from regulatory databases with 2D and 3D structural information. It was one of the first databases in which all structures were conformationally multiplied and quantum-chemically optimized, using algorithms for complete coverage of the conformational space, and it offers an extensive list of pre-calculated structural, conformational, and atomic descriptors representing physicochemical parameters as well as topological and quantum-chemical descriptors. The database was built using the OASIS Database platform, a software system for extended 2D and 3D chemical data management of both pre-loaded and user data. The same system was used to power the Danish (Q)SAR Database, a large collection of environmental and human-health toxicity properties that is also available online.

4.1. The OASIS Database Platform for Creating Chemical Databases

The OASIS Database platform is a software framework for building chemical databases (9). It contains a database schema and accompanying software for management of chemical information. It is a part of the OASIS chemical software (19). The platform provides an extensive list of features, such as:
- Storage of chemical 2D and 3D structures.
- User-defined structural (2D), conformational (3D), and atomic descriptors and model (test protocol) information.
Fig. 6. Some structure similarity search options in OASIS Database Manager.

- Representation of discrete structures, defined and undefined mixtures, hydrolyzing chemicals, and polymers. The system automatically partitions the mixtures into components, and a module is designed to automatically identify hydrolyzable structures and subsequently simulate the hydrolysis.
- Import and export of structural (2D and 3D) and descriptor data from/to several known connectivity formats, such as SDF, MOL, SMILES, and InChI.
- Pre-calculation module. For every new chemical entering a database, this module checks and optimizes the molecular structure, generates 3D structures, and calculates important descriptors. The module performs 2D–3D structural migration, conformer multiplication of chemicals by a specially designed algorithm for optimum representation of the conformational space (20), quantum-chemical calculations of the obtained 3D structures, checking (filtering) of the quantum-chemically optimized conformers for incorrect geometry, and calculation of molecular descriptors.
- OASIS Database enables the construction of single, combined, or result-based search queries. A single search is defined by one search condition: search by CAS, chemical names, chemical type, chemical class, observed or calculated descriptors using ranges or values, extensive 2D and 3D fragment search including atom modifiers, R-group (wildcard or enumerated wildcard atoms and fragments) query structures, distances, conditions on atom descriptor ranges, or advanced similarity search (some options for fragment-based similarity are shown in Fig. 6). Combined search queries contain one or more queries combined with the logical operators AND, OR, or NOT. The logical operators can be applied to single as well as combined searches, and search queries of arbitrary complexity and level of nesting can be built (Fig. 7 shows an example). Result-based queries are executed over the results of a previous search. Query trees can be saved, and results can be exported to flat files or databases.

Fig. 7. A search tree in OASIS Database Manager. Q0: CAS RN between 200,000 and 1,000,000; Q1: structure found in TSCA; Q2: structure found in IUCLID; Q3: MOL_WEIGHT > 300; Q5: contains fragment c1ccccc1RX1 where the wildcard atom RX1 is one of F, Cl, Br, I; Q6: there exist two oxygen atoms at a distance of min. 10 Å and max. 12 Å; single queries are displayed in green.
- The database browser is the subsystem designed for interactive work and manipulation of all types of data items contained in an OASIS database. It can display either a whole database or a set of structures resulting from a search.
- Visual structure editor for defining and editing structures and fragments. Drag and drop operations are used to build fragments from available palettes of atom and bond types and simple fragments. Adding, editing, and cutting of atoms, bonds, and fragments are possible.
- Database statistics, descriptor distribution, and model correlation tools are included.
- Report generator with visual template editor is included in a template-based report subsystem. A Report Template Designer is provided where users can define and edit their own templates. A specification language for templates has been developed, and template files can be saved and reused.
- A software suite has been developed for implementation of the OASIS database functionality on Web servers for public or restricted access to chemical information (OASIS Web database).
Fig. 8. OASIS Centralized database statistics.

4.2. OASIS Centralized Database

A centralized 3D database was built using the OASIS software framework for building databases and managing chemical information (9). Chemicals which are under regulation by the respective
government agencies are considered as existing. The individual
databases of regulatory agencies in North America and Europe,
including IUCLID of European Chemicals Bureau (with 61,573
chemicals), Danish EPA (159,448 chemicals), TSCA (56,882 che-
micals), HPVC_EU (4,750 chemicals), HPVC_USEPA (10,546
chemicals) and pesticides active/inactive ingredients of the US
EPA (1379), DSL of Environment Canada (10,851 chemicals),
and Japanese METI (16,811) were combined in the database
(Fig. 8). The structural information for all chemicals is pre-
calculated in terms of conformer multiplication of all chemicals
and quantum-chemical optimization of each conformer, using in-
house algorithms for complete coverage of the conformational
space. The 2D3D migration, conformational multiplication, and
quantum-chemical evaluation are described in refs. 20, 21. Pres-
ently the database contains approximately 185,500 structures
and 3,700,000 conformers with hundreds of millions of descriptor
data items.
The pre-calculation of the chemicals in the Centralized 3D
database combined with the flexible searching capabilities (on 2D
and 3D level) allows testing hypotheses on the structural condi-
tioning of modeled endpoints. Thus, the search of the database for
chemicals which could elicit significant estrogen receptor binding
affinity with earlier defined 3D structural pattern (22),
2 Accessing and Using Chemical Databases 41

l C{ar}O{*}H_O{*}{acy} C{scy}{10.2 <DISTANCE <11}


l C{ar}O{*}H_O{*}{acy}(H)C{scy}{10.2 <DISTANCE <11}
l 2,
VAN_D._WAALS_SUR in the range of 284365 A
shows that 201 out of around 185,000 chemicals could be potential significant estrogen receptor ligands. In the above expression of the search query, C{ar}O{*}H_O{*}{acy}=C{scy} holds for a distance range of 10.2–11.0 Å between the O atom of a hydroxyl group attached to an aromatic carbon (C{ar}O{*}H) and an acyclic O atom bound to a cyclic C atom by a double bond (O{*}{acy}=C{scy}) or a single bond (O{*}{acy}(H)C{scy}); the asterisk indicates the atoms between which the distance is specified.
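The query syntax above is specific to the OASIS platform. A loose analogue of such a combined 2D/3D screen can be sketched with RDKit: SMARTS patterns locate the two oxygen environments, and the O–O distance is then measured on each stored conformer (the SMARTS patterns and the helper below are illustrative assumptions, not the OASIS semantics):

from rdkit import Chem

PHENOL_O = Chem.MolFromSmarts("[OX2H]c")        # hydroxyl O attached to an aromatic C
CARBONYL_O = Chem.MolFromSmarts("[OX1]=[C;R]")  # acyclic O double-bonded to a ring C

def matches_distance_pattern(mol, d_min=10.2, d_max=11.0):
    """True if any conformer places a matched O...O pair d_min-d_max angstroms apart.

    `mol` must already carry 3D conformers, e.g., from the sketch in Subheading 4.2.
    """
    phenol_os = [m[0] for m in mol.GetSubstructMatches(PHENOL_O)]
    carbonyl_os = [m[0] for m in mol.GetSubstructMatches(CARBONYL_O)]
    for conf in mol.GetConformers():
        for i in phenol_os:
            for j in carbonyl_os:
                if d_min < (conf.GetAtomPosition(i) - conf.GetAtomPosition(j)).Length() < d_max:
                    return True
    return False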

4.3. The Danish (Q)SAR Database Online

Chemical databases are often set up on the Web and coupled with the necessary tools for Web-based search/retrieval in order to provide wide access to the contained data. Many of these resources are free or contain free functionality. Below we consider in detail the Danish (Q)SAR Database as an example of a chemical database available online; the next section lists some more examples.
The Danish Environment Protection Agency (Danish EPA) has for a number of years worked with the development and use of computer models for prediction of properties of chemical substances (23). (Quantitative) Structure–Activity Relationships, (Q)SARs, are relations between the structural properties of chemical substances and some other property. The other property can be a physical–chemical property or a biological activity, including the ability to cause toxic effects. The Danish EPA, together with the Danish National Food Institute at the Technical University of Denmark, has created a database integrating predictions from more than 45 (Q)SAR models on endpoints for physicochemical properties, eco-toxicity, absorption, metabolism, and toxicity (23). More than half of all the estimates are for mammalian (human) toxicity endpoints and include commercial data sets as well as many models developed in-house.
A structure set of about 166,000 discrete organic chemicals,
including all discrete organic European Inventory of Existing
Chemical Substances (EINECS) substances whose structure was
available (57,014), has been batched through the models, and the
results are integrated in the database. When examining substances
in the database, predictions from all models for the substance are
displayed together with domain information giving a (Q)SAR
profile and allowing for interpretation of the agreements or
disagreements between models for related endpoints (Fig. 9).
Substructure comparisons can also be performed.
Fig. 9. The Danish (Q)SAR Database: A chemical profile.

A reduced version of the Danish (Q)SAR Database has been published online with support from the former European Chemicals Bureau and is available at refs. (24, 25). The implementation uses the OASIS Web database software, a software solution for Web access to OASIS databases. The package includes an interface subsystem, an engine subsystem, and a database. The engine part and the database reside on a Web server; clients connect to the system through a Web browser.
The interface subsystem collects users' requests and transfers them to the engine subsystem. A 2D structure editor provides an interface for fragment and structure drawing. The system can read structural information coded in commonly used connectivity formats. The engine subsystem performs the requested search queries on the database and generates reports. These comprise search queries beyond standard SQL processing, e.g., fragment queries or distances between fragments or atoms. For this reason, the engine subsystem includes modules of the OASIS Database platform.
The Danish (Q)SAR Database was included in the OECD (Q)
SAR Application Toolbox (26, 27), a free software application
intended to be used by governments, chemical industry, and
other stakeholders in filling gaps in (eco)toxicity data needed for
assessing the hazards of chemicals. The Toolbox incorporates infor-
mation and tools from various sources into a logical workflow.
Crucial to this workflow is grouping chemicals into chemical cate-
gories. Development of the Toolbox is part of a collaborative
project between the OECD and the European Chemicals Agency.
The (Q)SAR Database and other (Q)SAR tools are used by the
Danish EPA in its daily work to fill in data gaps when there are no
experimental test results or where (Q)SAR gives information beyond the experimental test data (23). Information from the (Q)SAR Database on environmental and health effects has been used to give input on substances undergoing evaluation in the OECD High Production Volume Chemicals program, and where relevant also on substances undergoing risk assessment in the EU, as well as in many other contexts. The database was also used to prepare the "Advisory list for self-classification of dangerous substances" published by the Danish EPA in 2001 and updated in 2009 (28, 29).

4.4. Other Examples of Chemical Database Resources

Chemical databases on the Internet are so abundant that it is well beyond the scope of this chapter to present an exhaustive list. For a list of available chemical resources see, e.g., refs. (30, 31); a list of toxicology-related chemical databases is presented in ref. 32. The examples below are just illustrative of different types of chemical database resources.
While databases can be used to store various types of chemical
information, we focus on databases of chemical structures and
properties.
Database names in italics refer to the example list of chemical
databases (Subheading 6); the list also includes the relevant litera-
ture and Web page references.
Some databases are primarily used to register the identity of
chemical structures from a specific collection, e.g., from a regulatory
list of substances or from a list of substances sharing common
features. An example is the ECHA Database of registered substances
under REACH. The identity may include a representation of the
chemical structure, chemical names, or identification numbers.
General-purpose registration databases, i.e., registration databases aiming at including very large and diverse collections of chemicals, are also known (CAS Registry database, PubChem database, etc.). Databases may introduce a registry number: an identification number or text string assigned to all chemicals within the specific database (e.g., CAS registry number, PubChem Id, EC Number).
Non-registration data collections focus on some properties of
interest besides the identity of the contained chemicals (for exam-
ple, RTECS).
The type of chemical structures contained in chemical data-
bases varies. PDB is an example of a protein database and many
databases also contain inorganic and organometallic compounds
(e.g., Reaxys); however, the majority of the resources we mention
here deal with small organic compounds.
Some chemical database systems provide solely their fixed content, while others offer data as well as software tools for building new chemical databases based on users' data (e.g., OASIS Database Platform, Accelrys, and many more).
Ordinary Database Management Systems (DBMS) do not have means to deal specifically with chemical information. However, some of them have universal mechanisms for extending the expressive power of the database query language with extra constructs; this feature may be called, depending on the DBMS in question, a "data cartridge", "datablade", etc. Using these, some software packages provide chemical database functionality to ordinary database management systems (example: DayCart).
Databases can also be part of a broader software package with functions other than just databasing, for example, chemical data analysis and modeling tools (e.g., OECD (Q)SAR Toolbox, Leadscope).
As discussed in Subheading 2, while many chemical databases
were mostly oriented towards the human user, there are now
systems allowing for both user interface control and programmatic
access to the data (for example, ChEBI).
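PubChem (see Subheading 6) likewise documents a REST-style programmatic interface, PUG REST. A minimal sketch of retrieving a few computed properties by chemical name with only the Python standard library follows; the endpoint layout reflects the public PUG REST documentation and may change over time:

import json
import urllib.parse
import urllib.request

def pubchem_properties(name):
    """Look up computed properties for a compound name via PubChem PUG REST."""
    url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           + urllib.parse.quote(name)
           + "/property/MolecularWeight,XLogP,CanonicalSMILES/JSON")
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return data["PropertyTable"]["Properties"][0]

print(pubchem_properties("dichlorophen"))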
Depending on the type of structural information in the data-
base, we can distinguish between databases without structural infor-
mation (e.g., containing only registry numbers and/or chemical
names, such as some registration databases), ones with representa-
tion of 2D chemical structures only, with 2D and 3D, and with 2D,
3D, and multiple conformers for every chemical. Some databases
and database creation tools (for example, ChemAxon) can handle
generic structures (sometimes called Markush structures) both in
storage and search queries. This feature is important for the needs of
combinatorial chemistry as well as chemical patent databases where
groups of chemicals have to be described concisely and enumeration
may be too lengthy or impossible.
Databases of supplier information (e.g., eMolecules) provide
data on the availability of the chemical substances.
Systems like ChemFinder aggregate data found in other chemical resources over the Internet, along with maintaining their own databases.

5. Relational Databases

This section covers some notions from the theory of relational databases, along with relevant examples. A more detailed introduction can be found in, e.g., ref. (33).
A relational database (34) organizes the data in the form of
tables, and all rows of a table must have the same structure
(Table 1).
The columns of the table, in contrast, may be of different data
types, such as number, text, binary flag (yes or no), date, etc. In
database terminology, a table is called a relation, a table row is a
record or a tuple, and a column is an attribute.
Database search is the selection of a set of rows from a table that match a given condition. For example, selecting rows where log P > 3 will return the second row of the table. More generally, a search may only return some of the attributes of a table: for example, a search for the CAS RN of the rows where log P > 3 will return only one value, 97-23-4, and not a whole record.

Table 1
A relational database table

CAS Registry Number | 2D structure | Molecular weight | Log P
90-72-2 | c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1 | 265.4 | 0.77
97-23-4 | C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1 | 269.13 | 4.34

Table 2
A non-normalized relational database table

CAS RN | 2D structure | Chemical name | Mol. wt. | Log P
97-23-4 | C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1 | Dichlorophen | 269.13 | 4.34
97-23-4 | C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1 | 2,2′-Methylenebis(4-chlorophenol) | 269.13 | 4.34
Any attribute or attribute combination may be indexed:
assigned auxiliary data to improve search performance. Search
involving conditions on indexed attributes is much faster, at the
price of taking extra disk space for the auxiliary data and slightly
increasing the insertion and modification time.
A column or a combination of columns that uniquely identifies a whole record is called a key; in the above table, CAS RN can be chosen as a key, since the remaining attributes may coincide by chance with those of another chemical, while the CAS RN will not.
Further, to store multiple values (e.g., chemical names) for a
single chemical, we may choose to add a table row for every value
with the disadvantage that 2D structure and descriptors have to be
repeated (Table 2).
A better solution is to introduce a second table for names (Table 3): here, CAS RN will serve as the link between the tables, the first table will still contain unique records for every chemical, and data redundancy will be avoided. The relational database theory defines formally how to avoid data redundancy in the general case. A normalized database, designed according to the principles recommended by the theory, will be easier to access and less prone to programming errors.
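This two-table design can be sketched directly with the SQLite engine shipped in the Python standard library; the table and column names below are illustrative, not those of any particular chemical DBMS:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only on request
conn.executescript("""
CREATE TABLE structures (
    cas_rn TEXT PRIMARY KEY,  -- the key: it uniquely identifies a record
    smiles TEXT,
    mol_wt REAL,
    log_p  REAL
);
CREATE TABLE names (
    cas_rn TEXT REFERENCES structures(cas_rn) ON DELETE CASCADE,
    name   TEXT
);
CREATE INDEX idx_log_p ON structures(log_p);  -- auxiliary data speeding up log P searches
""")
conn.executemany("INSERT INTO structures VALUES (?, ?, ?, ?)", [
    ("90-72-2", "c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1", 265.4, 0.77),
    ("97-23-4", "C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1", 269.13, 4.34),
])
conn.executemany("INSERT INTO names VALUES (?, ?)", [
    ("97-23-4", "Dichlorophen"),
    ("97-23-4", "2,2'-Methylenebis(4-chlorophenol)"),
])
# A search returning only the CAS RN attribute of rows matching log P > 3.
print(conn.execute("SELECT cas_rn FROM structures WHERE log_p > 3").fetchall())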
The two-table construction presumes that for every record in the second table, there is a record with the same CAS RN in the first table (defining the structure). This referential integrity constraint (relevant data is expected when a table refers to another) should be enforced during all database operations: deletion in Table 3a should invoke deletion of the related records in Table 3b, etc.
Table 3
A two-table database representing (a) CAS RN, 2D structure, and descriptors and (b) chemical names

(a)
CAS RN | 2D structure | Mol. wt. | Log P
90-72-2 | c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1 | 265.4 | 0.77
97-23-4 | C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1 | 269.13 | 4.34

(b)
CAS RN | Chemical name
90-72-2 | 2,4,6-Tris(dimethylaminomethyl)phenol
97-23-4 | Dichlorophen
97-23-4 | 2,2′-Methylenebis(4-chlorophenol)

Modern databases handle this and other related issues using the mechanism of transactions (grouped modifications). Two or more modification operations can be declared to constitute a transaction, and then they either succeed as a whole or fail as a whole: if a record is deleted from Table 3a, the database will wait for records with the same CAS number to be deleted from Table 3b and then confirm the whole group of operations at once. If anything fails, neither of the operations is confirmed and the database restores its original state. This is implemented at the level of the DBMS, the software controlling all operations on the database and providing data to the other application programs that use it.
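Continuing the illustrative SQLite sketch above, a grouped modification is expressed as a transaction; the connection used as a context manager commits the block as a whole or rolls it back on failure:

import sqlite3  # reusing `conn` from the sketch above

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything inside raises
        conn.execute("DELETE FROM names WHERE cas_rn = ?", ("97-23-4",))
        conn.execute("DELETE FROM structures WHERE cas_rn = ?", ("97-23-4",))
except sqlite3.Error:
    pass  # neither deletion is confirmed; the database keeps its original state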

6. Examples of Chemical Database Resources Online

The following list contains more information and references about some chemical databases online:
1. Accelrys Databases are a source of supplier information and contain various collections of biological and chemical data. Accelrys Registration is a software tool for creation and management of chemical databases (35).
2. Aggregated Computational Toxicology Resource (ACToR) is a
collection of databases collated or developed by the US EPA
National Center for Computational Toxicology (NCCT) (36).
3. ChEBI is a database with both manual and programmatic
access, literature references for compounds, and a chemical
ontology (37, 38).
4. CAS Registry database is the chemical substance database of
CAS, the most authoritative and comprehensive source for
chemical and scientific information, also publishing Chemical
Abstracts, with millions of documents from the chemical jour-
nal and patent literature (39). The registry database contains
more than 56 million substances.
5. ChemAxon JChem Base is a set of tools for creating chemical databases on different DBMS (40). It supports a variety of input/output formats and stores/searches generic structures.
6. ChemDB is a large database of molecules and reactions (41, 42).
7. ChemIDPlus (43) is part of the system of databases of the US
National Library of Medicine. It consists of a database of struc-
ture and nomenclature information, and a number of toxicity
databases.
8. ChemSpider is a large repository of chemical compounds
aggregated from a variety of other chemical databases on the
Web (44, 45).
9. Danish (Q)SAR Database online (discussed in Subheading 4.3)
(24, 25).
10. DayCart is an Oracle data cartridge (an extension of the Oracle
DBMS with extra functionality) for dealing with chemical data,
available from Daylight (46).
11. DrugBank (47) is a database of FDA-approved or experimental
drugs.
12. ECHA Registered substances database is a public list maintained
by the European Chemicals Agency. It contains information
about all substances which are registered under REACH (48).
13. eChemPortal is an OECD-based system for providing informa-
tion on existing chemicals, new industrial chemicals, pesticides
and biocides, and links to collections of information prepared
for government chemical review programs at national, regional,
and international levels, together with classification results
according to national/regional hazard classification schemes
or according to the Globally Harmonized System of Classifica-
tion and Labelling of Chemicals (GHS) (49).
14. eMolecules is an online service for searching chemical struc-
tures in a large database of 2D and 3D structural and supplier
information (50).
15. Leadscope is a suite of software tools for creating and using
chemical databases, and includes ready-made databases of
structure and toxicity data plus cheminformatics tools (51).
16. OASIS Database and Platform (discussed in Subheadings 4.1
and 4.2) (19).
17. OECD (Q)SAR Toolbox is a free software application for
assessing the hazards of chemicals. It incorporates numerous
collections of structural information as well as experimental and
predicted toxicity data and tools for development of new pre-
dictive models (26, 27).
18. PDB contains 2D and 3D information about large molecules
(proteins and nucleic acids) (52).
19. PubChem is a very large repository of chemical and biological data, with a searchable database of chemical structures (53, 54). Search and download are also possible for tested, active, or inactive compounds, as well as with many of the traditional search types.
20. Reaxys is a union of databases of organic compounds (Beil-
stein), inorganic and organometallic compounds (Gmelin),
and patent databases (55).
21. RTECS is a collection of experimental toxicity information for
approximately 169,000 chemicals (35).

References
1. Halpin TA, Morgan AJ (2008) Information modeling and relational databases, 2nd edn. The Morgan Kaufmann series in data management systems. Elsevier, Amsterdam
2. Dalby A, Nourse JG, Hounshell DW et al (1992) Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comput Sci 32:244–255
3. Hopfinger AJ, Wang S, Tokarski JS et al (1997) Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J Am Chem Soc 119:10509–10524
4. Miller M (2002) Chemical database techniques in drug discovery. Nat Rev Drug Discov 1:220–227. doi:10.1038/nrd745
5. Barnard JM (1993) Substructure searching methods: old and new. J Chem Inf Comput Sci 33:532–538
6. Jamil H (2011) Computing subgraph isomorphic queries using structural unification and minimum graph structures. In: Proc of the 26th ACM symposium on applied computing (SAC 2011), pp 1058–1065. doi:10.1145/1982185.1982415
7. Ullmann JR (1976) An algorithm for subgraph isomorphism. J Assoc Comput Mach 23(1):31–42. doi:10.1145/321921.321925
8. Willett P (2003) Similarity searching in chemical structure databases. In: Gasteiger J (ed) Handbook of chemoinformatics. Wiley, Weinheim
9. Nikolov N, Grancharov V, Stoyanova G et al (2006) Representation of chemical information in OASIS centralized 3D database for existing chemicals. J Chem Inf Model 46(6):2537–2551
10. Park J, Rosania GR, Shedden KA et al (2009) Automated extraction of chemical structure information from digital raster images. Chem Cent J 3(1):1–16. doi:10.1186/1752-153X-3-4
11. http://opsin.ch.cam.ac.uk/index.html
12. Hardy B, Douglas N, Helma C et al (2010) Collaborative development of predictive toxicology applications. J Cheminform 2(1):7
13. Stein LD (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9:678–688. doi:10.1038/nrg2414
14. http://www.w3.org/standards/webofservices
15. Murray-Rust P, Rzepa H, Wright M et al (2000) A universal approach to web-based chemistry using XML and CML. Chem Commun 1471–1472
16. http://www.w3.org/standards/semanticweb
17. Murray-Rust P (2008) Chemistry for everyone. Nature 451(7179):648–651. doi:10.1038/451648a
18. http://www.w3.org/standards/semanticweb/ontology
19. http://www.oasis-lmc.org
20. Mekenyan O, Pavlov T, Grancharov V et al (2005) 2D–3D migration of large chemical inventories with conformational multiplication. Application of the genetic algorithm. J Chem Inf Model 45(2):283–292
21. Mekenyan O, Dimitrov D, Nikolova N et al (1999) Conformational coverage by a genetic algorithm. J Chem Inf Comput Sci 39(6):997–1016. doi:10.1021/ci990303g
22. Mekenyan OG, Kamenska V, Serafimova R et al (2002) Development and validation of an average mammalian estrogen receptor-based (Q)SAR model. In: Mekenyan O, Schultz TW (eds) Proceedings of quantitative structure–activity relationships in environmental sciences IX. SAR QSAR Environ Res 13(6):579–595
23. http://www.mst.dk/English/Chemicals/Substances_and_materials/qsar
24. http://ecbqsar.jrc.it
25. http://qsar.food.dtu.dk
26. http://www.oecd.org/document/54/0,3343,en_2649_34373_42923638_1_1_1_1,00.html
27. http://toolbox.oasis-lmc.org
28. Niemela JR, Wedebye EB, Nikolov NG et al (2009) The advisory list for self-classification of dangerous substances. DK EPA Environmental Project No. 1303, Danish EPA environmental report, p 62. In: Danish EPA Environmental Projects; 1303
29. http://www.mst.dk/English/Chemicals/Substances_and_materials/The_advisory_list_for_selfclassification
30. http://www.chembiogrid.org/related/resources/databases.html
31. http://www.liv.ac.uk/Chemistry/Links/refdatabases.html
32. Valerio LG (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol 241:356–370
33. O'Donnell TJ (2009) Design and use of relational databases in chemistry. CRC Press, Boca Raton, FL
34. Codd EF (1970) A relational model of data for large shared data banks. Commun ACM 13(6):377–387. doi:10.1145/362384.362685
35. http://accelrys.com/products/databases/bioactivity/rtecs.html
36. http://actor.epa.gov
37. de Matos P, Alcantara R, Dekker A et al (2009) Chemical entities of biological interest: an update. Nucleic Acids Res (in press)
38. http://www.ebi.ac.uk/chebi
39. http://www.cas.org
40. http://www.chemaxon.com/products/jchem-base
41. Chen J, Swamidass SJ, Dou Y et al (2005) ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics 21:4133–4139
42. http://cdb.ics.uci.edu
43. http://chem.sis.nlm.nih.gov/chemidplus
44. Pence HE, Williams A (2010) ChemSpider: an online chemical information resource. J Chem Educ 87(11):1123–1124. doi:10.1021/ed100697w
45. http://cs.m.chemspider.com
46. http://www.daylight.com/products/daycart.html
47. Wishart DS, Knox C, Guo AC et al (2008) DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36(1):D901–D906
48. http://apps.echa.europa.eu/registered/registered-sub.aspx
49. http://www.echemportal.org
50. http://www.emolecules.com
51. http://www.leadscope.com
52. http://www.wwpdb.org
53. Bolton E, Wang Y, Thiessen PA et al (2008) PubChem: integrated platform of small molecules and biological activities. Annual reports in computational chemistry, Apr 2008
54. http://pubchem.ncbi.nlm.nih.gov
55. http://www.reaxys.com/info
Chapter 3

From QSAR to QSIIR: Searching for Enhanced Computational Toxicology Models
Hao Zhu

Abstract
Quantitative structure–activity relationship (QSAR) is the most frequently used modeling approach to explore the dependency of biological, toxicological, or other types of activities/properties of chemicals on their molecular features. In the past two decades, QSAR modeling has been used extensively in the drug discovery process. However, the predictive models resulting from QSAR studies have limited use for chemical risk assessment, especially for animal and human toxicity evaluations, due to low predictivity for new compounds. To develop enhanced toxicity models with independently validated external prediction power, novel modeling protocols have been pursued by computational toxicologists, building on the rapidly increasing body of toxicity testing data generated in recent years. This chapter reviews the recent effort in our laboratory to incorporate biological testing results as descriptors in the toxicity modeling process. This effort extended the concept of QSAR to quantitative structure in vitro–in vivo relationship (QSIIR). The QSIIR study examples provided in this chapter indicate that QSIIR models based on hybrid (biological and chemical) descriptors are indeed superior to conventional QSAR models based only on chemical descriptors for several animal toxicity endpoints. We believe that the applications introduced in this review will be of interest and value to researchers working in the field of computational drug discovery and environmental chemical risk assessment.

Key words: QSAR, QSIIR, Computational toxicology, HTS, Predictive model, Compounds, Chemical descriptors, Biological descriptors

1. Introduction

Many compounds entering clinical studies do not survive as a good pharmacological lead to a drug on the market. Chemical toxicity and safety issues have been regarded as the major reason for attrition of new drugs in the past decades (1). However, evaluation of chemical toxicity and safety in vivo at the early stage of the drug discovery process is expensive and time consuming. To find alternatives to traditional animal toxicity testing and to
understand the relevant toxicological mechanisms, many in vitro toxicity screens and computational toxicity models have been developed and implemented by academic institutes and pharmaceutical companies (2–9). In the past 15 years, innovative technologies that
enable rapid synthesis and high throughput screening (HTS) of
large libraries of compounds have been adopted in toxicity studies.
As a result, there has been a huge increase in the number of
compounds and the associated testing data in different in vitro
screens. With this data, it becomes feasible to reveal the relationship
between the high throughput in vitro toxicity testing results and
the low throughput in vivo toxicity evaluation for the same set of
compounds. Understanding these relationships could help us
delineate the mechanisms underlying animal toxicity of chemicals
as well as potentially improve our ability to predict chemical toxicity
using short term bioassays.
The unique advantage of using a computational toxicity model in risk analysis is that a chemical can be evaluated for toxicity potential even before being synthesized. Computational toxicity tools based on quantitative structure–activity relationship (QSAR) models have been used to assist in predictive toxicological profiling of pharmaceutical substances for understanding drug safety liabilities (7, 10–12) and to support regulatory decision making on chemical safety and risk of toxicity (13), and they are effectively enhancing an already rigorous US regulatory safety review of pharmaceutical substances (14). Although predictive QSAR models of toxicity are starting to be used by pharmaceutical companies and environmental agencies to evaluate toxicity potential (10, 15), most previous studies showed that currently available QSAR models did not work well for evaluating in vivo toxicity potential, especially for new compounds not present in the training data (16, 17). For this reason, novel modeling techniques are needed that improve on conventional QSAR approaches by taking advantage of the numerous in vitro toxicity screening (especially HTS) results to develop enhanced toxicity models.

2. Availability of Large Compound Collections for In Vivo and In Vitro Toxicity Evaluation

Since the 1990s, great efforts in developing toxicity testing methods have generated an extensive amount of toxicity data (18). However, most of the available toxicity databases that house these data are not suitable for developing QSAR toxicity models. All cheminformatics tools require the biological data to be associated with molecular structures, and these are not included in many existing databases. Furthermore, the testing data may not be easily accessible to modelers, or their quality may be questionable (18). Finally, existing errors in chemical structures also greatly affect the reliability
and predictivity of the predictive models based on these databases (19, 20). As a result, the "lack of data" problem is always the first issue that needs to be solved in the predictive toxicology field.
Much progress has been made through several toxicity data collection and/or sharing projects initiated in the past 5 years. These efforts resulted in many toxicity databases available publicly or commercially (5, 21–24), and most of these databases could be used to develop QSAR toxicity models. Listing all available toxicity databases is outside the scope of this review. Tables 1 and 2 show several examples of the major known

Table 1
Publicly available databases of in vivo toxicity endpoints

ToxRefDB, National Center for Computational Toxicology (25, 26): The US EPA Toxicological Reference Database (ToxRefDB) captures toxicological endpoints, critical effects, and relevant dose–response data from EPA's Office of Pesticide Programs in a relational database using a standardized data field structure and vocabulary. Chemicals included in the database represent over 800 conventional pesticide active ingredients. Data types include the following: subchronic toxicity endpoints (rodents and nonrodents), prenatal developmental toxicity (rat and rabbit), reproductive and fertility effects (two-generation studies), immunotoxicity, developmental neurotoxicity, chronic toxicity (rat, mouse, dog), and 2-year carcinogenicity bioassays (rat and mouse).

DSSTox dataset (21): Tumor target site incidence and TD50 potencies for 1,354 chemical substances tested in rats and mice, 80 chemical substances tested in hamsters, five chemicals tested in dogs, and 27 chemical substances tested in nonhuman primates; data reviewed and compiled from the literature and NTP studies.

NIEHS/NTP datasets (27): Data from more than 500 2-year, two-species toxicology and carcinogenesis studies collected by the NTP. The database also contains the results collected on approximately 300 toxicity studies from shorter duration tests and from genetic toxicity studies. In addition, test data from immunotoxicity, developmental toxicity, and reproductive toxicity studies are continually being added to this database. Some of the endpoint observations are labels (tox or nontox, increased or decreased); classification methods are applicable to this type of study. Others are quantitative data, e.g., TD50, which can be studied using regression-type methods. Both can be addressed by the Combi-QSPR framework, which includes multiple machine learning and statistical methods, quantitative regression as well as classification.

FDA adverse liver effects database (28): The database contains the following fields: generic name of each chemical; SMILES code; for module A10 (liver enzyme composite module), the overall activity category for each compound (A for active, M for marginally active, or I for inactive) based on the number of active and marginally active scores for each compound at the five individual endpoints, the number of endpoints at which each compound is marginally active (M), and the number of endpoints at which each compound is active (A); for modules A11 to A15 (alkaline phosphatase increased, SGOT increased, SGPT increased, LDH increased, and GGT increased, respectively), the overall activity category for each compound (A for active, M for marginally active, or I for inactive) based on the RI and ADR values; and the number of ADR reports for each compound, given as <4 or ≥4.
Table 2
Public databases of in vitro toxicity endpoints

NCGC qHTS cytotoxicity data (2), available through PubChem (24) (via PubChem AID#): Concentration–response profiles of 1,408 substances screened for their effects on cell viability are available through PubChem for 13 cell lines: HepG2 (human hepatoma; AID #433), H-4-II-E (rat hepatoma; AID #543), BJ (human foreskin fibroblast; AID #421), Jurkat (clone E6-1, human acute T cell leukemia; AID #426), HEK293 (transformed human embryonic kidney cell; AID #427), MRC-5 (human lung fibroblast; AID #434), SK-N-SH (human neuroblastoma; AID #435), N2a (mouse neuroblastoma; AID #540), NIH 3T3 (mouse embryonic fibroblast; AID #541), HUV-EC-C (human vascular endothelial cell; AID #542), SH-SY-5Y (human neuroblastoma, subclone of SK-N-SH; AID #544), Renal Proximal Tubule (rat kidney cell; AID #545), and Mesenchymal (human renal glomeruli cell; AID #546). Each compound was tested at 14 concentrations ranging from 0.006 to 92 µM, and the response was measured as % change in cell viability compared to vehicle control at each concentration.

ChEMBLdb (29), available through PubChem (24) (via PubChem AID#): A database of bioactive drug-like small molecules abstracted and curated from the primary scientific literature. Bioactivities are represented by binding constants, pharmacology, and ADMET data. ChEMBL assays are available through PubChem. Human toxicity-related endpoints are primarily from in vitro data, such as: cytotoxicity on SNU-354 cells (hepatoma cell line, AID #200819), antiproliferative action on L02 cells (normal hepatocytes, AID #416061), growth inhibition of SK-Hep1 cells (liver adenocarcinoma cell line, AID #201649), cytotoxicity and anticancer activity on HepG2 cells (AID #86696, 340104, 421266), etc.

ToxCast (30): Phase I (August 7, 2009 update) provided 304 unique compounds characterized in over 600 HTS endpoints. The endpoints include biochemical assays of protein function, cell-based transcriptional reporter and gene expression assays, cell line and primary cell functional assays, and developmental endpoints in zebrafish embryos and embryonic stem cells. Additionally, a mapping of these assays to 315 genes and 438 pathways was made publicly available. Phase II will complete screens of an additional 700 compounds; HTS data on nearly 10,000 chemicals will be available through the Tox21 collaboration in 2010.

ToxNET (31): A data network covering toxicology, hazardous chemicals, environmental health, and related areas. Managed by the US National Library of Medicine.

publicly available toxicity data sources in vivo and in vitro, respectively (25–31). There are tens of thousands of diverse compounds tested under various toxicity protocols and included in these databases. To study and compare the results obtained for the same compounds from different testing protocols, it is important to read across different toxicity databases for the target compounds, and the current database landscape is still disparate and fragmented for this purpose. To address this important issue, the most recent progress in toxicity data generation is highlighted by some collaborative toxicology programs, such as Tox21 (32), among universities, institutes, and government agencies.
3. QSAR and Current Challenge of Computational Toxicology

Computational toxicology modeling relies on the use of QSAR approaches to build toxicity models from available reference data. It traditionally derives the computed properties solely from the molecular structures, as defined by descriptors or fingerprints, and has been broadly used to predict the side effects that chemicals pose to human health or animals. The quantitative structure–toxicity relationship (QSTR) models developed in computational toxicity studies strongly depend on the QSAR approaches used in the modeling process. The optimization of variable selection, or of the weighting of variables, is the core component of a QSAR approach. This procedure selects only the most meaningful and statistically significant subset of available chemical descriptors in terms of correlation with biological activity. The optimum selection is achieved by combining stochastic search methods, such as generalized simulated annealing (33), genetic algorithms (34), or evolutionary algorithms (35), with correlation techniques such as multiple linear regression (MLR), partial least squares (PLS) analysis, or artificial neural networks (33–36); a toy version of this search-plus-correlation scheme is sketched below.
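As a toy illustration of this combination (and not the published genetic-algorithm or simulated-annealing implementations of refs. 33–35), the sketch below couples a random descriptor-subset search with cross-validated kNN regression serving as the correlation technique; all data are synthetic:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))                               # synthetic descriptor matrix
y = X[:, 0] - 2 * X[:, 5] + rng.normal(scale=0.1, size=100)  # synthetic activity

best_subset, best_score = None, -np.inf
for _ in range(200):  # stochastic search over descriptor subsets
    subset = rng.choice(30, size=5, replace=False)
    score = cross_val_score(KNeighborsRegressor(n_neighbors=3),
                            X[:, subset], y, cv=5, scoring="r2").mean()
    if score > best_score:
        best_subset, best_score = sorted(subset), score

print("selected descriptors:", best_subset, "cross-validated R2: %.2f" % best_score)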
Recent research has emphasized model validation as the key component of QSAR modeling (37). Tropsha (37, 38) and others (39–42) have demonstrated that the various commonly accepted statistical characteristics of QSAR models derived for a training set are insufficient to establish and estimate the predictive power of QSTR models. The only way to ensure the high predictive power of a QSAR model is to demonstrate a significant correlation between predicted and observed activities of compounds for a validation (test) set that was not employed in model development. This goal can be achieved by dividing an experimental SAR dataset into a training and a test set, which are used for model development and validation, respectively. It is believed that special approaches should be used to select a training set to ensure the highest significance and predictive power of QSAR models (43, 44). Recent reviews and publications describe several algorithms that can be employed for such a division (37, 38, 44).
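A minimal illustration of this division-based validation follows; it uses a plain random split rather than the rational selection schemes of refs. 43 and 44 and reuses the synthetic data from the previous sketch:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Split the dataset into a training set (model development) and a test set (validation).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
# Only performance on the held-out set estimates true predictive power.
print("external R2: %.2f" % model.score(X_test, y_test))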
Most previous QSAR studies showed that currently available QSTR models do not work well for evaluating in vivo toxicity potential, especially for external compounds. For this reason, several reviews were published recently challenging the feasibility and reliability of using QSAR approaches in toxicity studies (45, 46). Often, disappointing results could be linked to key aspects of the modeling procedure, many of which relate to the original data and their interpretation. Similarly, Lombardo et al. (47) noted that not much progress has been made in developing robust and
predictive models, and that the lack of accurate data, together with the use of questionable modeling end-points, has hindered real progress. Most chemical toxicity models (predictors) are either reported in the literature but not available to the research community, or available in the form of commercial software that is not universally successful, as discussed above. These examples illustrate that, although individual successes have indeed been reported, in general there remains a strong need for widely accessible and reliable computational toxicology modeling techniques and specific end-point predictors.

4. Quantitative Structure In Vitro–In Vivo Relationship

To stress the broad appeal of the conventional QSAR approach, it should be made clear that, from the statistical viewpoint, QSAR modeling is a special case of general statistical data mining and data modeling, where the data are formatted to represent objects described by multiple descriptors and a robust correlation between the descriptors and a target property (e.g., chemical toxicity in vivo) is sought. In previous computational toxicology studies, additional physicochemical properties, such as the octanol–water partition coefficient (log P) (48), water solubility (49), and melting point (50), were used successfully to augment computed chemical descriptors and improve the predictive power of QSAR models. These studies suggest that using experimental results as descriptors in QSAR modeling could prove beneficial. The currently available and still rapidly growing HTS data for large and diverse chemical libraries make it possible to extend the scope of conventional QSAR in toxicity studies by using in vitro testing results as extra biological descriptors. Therefore, in some of the most recent toxicology studies, the relationships between various in vitro and in vivo toxicity testing results were investigated (51–54). Based on these reports, we proposed a new modeling workflow called quantitative structure in vitro–in vivo relationship (QSIIR) and used it in animal toxicity modeling studies (55–57). The target properties of QSIIR modeling are still biological activities, such as different animal toxicity endpoints, but the content and interpretation of the descriptors and the resulting models vary. This focus on the prediction of the same target property from different (chemical, biological, and genomic) characteristics of environmental agents affords an opportunity to most fully explore the source-to-outcome continuum of modern experimental toxicology using cheminformatics approaches.
5. Case Studies

5.1. Using Hybrid Descriptors for QSIIR Modeling of Rodent Carcinogenicity

To explore efficient approaches for rapid evaluation of the chemical toxicity and human health risk of environmental compounds, the NTP, in collaboration with the NIH Chemical Genomics Center (NCGC), has initiated an HTS project (2, 58). The first batch of HTS results, for a set of 1,408 compounds tested in six human cell lines, was released via PubChem. We have explored these data in terms of their utility for predicting adverse health effects of environmental agents (57). Initially, the classification k nearest neighbor (kNN) QSAR modeling method was applied to the HTS data only, for a curated dataset of 384 compounds. The resulting models had prediction accuracies for the training and test sets (containing 275 compounds together) and the external validation set (109 compounds) as high as 89, 71, and 74%, respectively. We then asked if HTS results could be of value in predicting rodent carcinogenicity. We identified 383 compounds for which data were available from both the Berkeley Carcinogenic Potency Database and the NTP-HTS studies. We found that compounds classified by HTS as actives in at least one cell line were likely to be rodent carcinogens (sensitivity 77%); however, HTS inactives were far less informative (specificity 46%). Using chemical descriptors only, kNN QSAR modeling resulted in 62% overall prediction accuracy for rodent carcinogenicity applied to this data set. Importantly, the prediction accuracy of the model was significantly improved (to 73%) when chemical descriptors were augmented by the HTS data, which were regarded as biological descriptors (Fig. 1). Our studies suggested that combining HTS profiles with conventional chemical descriptors could considerably improve the predictive power of computational approaches in chemical toxicology.

Fig. 1. Comparison of the prediction power (sensitivity, specificity, and overall predictivity) of QSAR models using conventional chemical (MolConnZ) descriptors versus hybrid chemical (MolConnZ) plus biological (HTS) descriptors for carcinogenicity of external compounds.
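The hybrid-descriptor idea itself is easy to express with common tools. The sketch below, using scikit-learn with random placeholder arrays standing in for the MolConnZ chemical descriptors and NCGC screening calls used in the study (57), simply appends the biological columns to the descriptor matrix before training the same kNN classifier:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

n, n_chem, n_cell_lines = 383, 200, 6
rng = np.random.default_rng(1)
X_chem = rng.normal(size=(n, n_chem))               # stands in for MolConnZ descriptors
X_bio = rng.integers(0, 2, size=(n, n_cell_lines))  # stands in for HTS active/inactive calls
y = rng.integers(0, 2, size=n)                      # rodent carcinogenicity labels

knn = KNeighborsClassifier(n_neighbors=5)
acc_chem = cross_val_score(knn, X_chem, y, cv=5).mean()
acc_hybrid = cross_val_score(knn, np.hstack([X_chem, X_bio]), y, cv=5).mean()
print("chemical only: %.2f  hybrid: %.2f" % (acc_chem, acc_hybrid))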
5.2. Using Hybrid Descriptors for the QSIIR Modeling of Rodent Acute Toxicity

We used the cell viability qHTS data from the NCGC mentioned in the above section, for the same 1,408 compounds but in 13 cell lines (59). Besides carcinogenicity, we asked if HTS results could be of value in predicting rodent acute toxicity (56). For this purpose, we identified 690 of these compounds for which rodent acute toxicity data (i.e., toxic or nontoxic) were also available. The classification kNN QSAR modeling method was applied to these compounds using either chemical descriptors alone or a combination of chemical and qHTS biological (hybrid) descriptors as compound features. The external prediction accuracy of models built with chemical descriptors only was 76%. In contrast, the prediction accuracy was significantly improved to 85% when using hybrid descriptors. The receiver operating characteristic (ROC) curves of the conventional QSAR models and the different hybrid models are shown in Fig. 2. The sensitivity and specificity of the hybrid models are clearly better than those of the conventional QSAR model for predicting the same external compounds. Furthermore, the prediction coverage increased from 76% when using chemical descriptors only to 93% when qHTS biological descriptors were also included. Our studies suggest that combining HTS profiles, especially dose–response qHTS results, with conventional chemical descriptors could considerably improve the predictive power of computational approaches for rodent acute toxicity assessment.

Fig. 2. The ROC curves for conventional QSAR model (bold line) and different hybrid
models for the same external compounds within acute toxicity modeling.
5.3. Hierarchical QSIIR Modeling of Rodent Acute Toxicity Based on In Vitro–In Vivo Relationships

A database containing experimental in vitro IC50 cytotoxicity values and in vivo rodent LD50 values for more than 300 chemicals was compiled by the German Center for the Documentation and Validation of Alternative Methods (ZEBET). The application of conventional QSAR modeling approaches to predict mouse or rat acute LD50 from chemical descriptors of the ZEBET compounds yielded no statistically significant models (60). Furthermore, analysis of these data showed that the overall correlation between IC50 and LD50 is obscure (60). However, a linear IC50 versus LD50 correlation could be established for a fraction of the compounds. To capitalize on this observation, a novel two-step modeling approach was developed as follows. First, all chemicals are partitioned into two groups based on the relationship between IC50 and LD50 values: one group is formed by compounds with a linear IC50 versus LD50 relationship, and the other group consists of the remaining compounds. Second, conventional binary classification QSAR models are built to predict the group affiliation from chemical descriptors only. Third, continuous kNN QSAR models are developed for each subclass individually to predict LD50 from chemical descriptors. All models have been extensively validated using special protocols. We have found that this type of in vitro–in vivo correlation could be established not only between cytotoxicity and rat acute toxicity (Fig. 3a) but also between cytotoxicity and other types of rodent toxicity (Fig. 3b–d), including two types of low-dose toxicity endpoints: rat lowest observed adverse effect level (LOAEL) and no observed adverse effect level (NOAEL) (55).

Fig. 3. The identification of the baseline correlation between cytotoxicity (IC50) and various types of in vivo toxicity testing results. (a) Rat LD50. (b) Mouse LD50. (c) Rat LOAEL. (d) Rat NOAEL. C1, class 1; C2, class 2.
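The routing logic of this workflow lends itself to a compact sketch. The skeleton below uses scikit-learn kNN models and random placeholder data standing in for the ZEBET descriptors and endpoints; it mirrors only the structure of the published protocol (55), not its descriptors, validation, or applicability-domain checks:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 50))      # chemical descriptors (placeholder)
cls = rng.integers(0, 2, size=300)  # 0: linear IC50-LD50 class; 1: the remaining compounds
ld50 = rng.normal(size=300)         # continuous toxicity values (placeholder)

step1 = KNeighborsClassifier(n_neighbors=5).fit(X, cls)  # predict class affiliation
step2 = {c: KNeighborsRegressor(n_neighbors=5).fit(X[cls == c], ld50[cls == c])
         for c in (0, 1)}                                 # class-specific LD50 models

def predict_ld50(x):
    """Route a compound through the classifier, then the matching regressor."""
    c = step1.predict(x.reshape(1, -1))[0]
    return step2[c].predict(x.reshape(1, -1))[0]

print(predict_ld50(rng.normal(size=50)))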

6. Conclusion

In the past 15 years, innovative technologies that enable rapid synthesis and HTS of large libraries of compounds have advanced risk assessment to a new stage. As a result, there has been a huge increase in the number of compounds available on a routine basis to quickly screen for novel drug candidates against new targets or pathways. This growth creates new challenges for QSAR modeling, such as developing novel approaches for the analysis and visualization of large databases of screening data, novel biologically relevant chemical diversity or similarity measures, and novel tools to utilize the diverse HTS biological profiles of compounds to ensure highly predictive toxicity models. The application studies discussed in this chapter have shown that integrating biological data as descriptors in QSAR model development can benefit the resulting toxicity models, especially for some specific animal toxicity endpoints. This effort extended the traditional concept of QSAR to QSIIR and may be applied to developing widely accessible and reliable computational toxicity predictors.

Acknowledgments

The work was supported by the Society of Toxicology (grant number 11-0897). The author wants to thank Dr. Alexander Tropsha of the University of North Carolina at Chapel Hill for his help over the past five years. Most of the research projects reviewed in this chapter were completed with his help and guidance.

References
1. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3(8):711–715
2. Inglese J, Auld DS, Jadhav A, Johnson RL, Simeonov A, Yasgar A, Zheng W, Austin CP (2006) Quantitative high-throughput screening: a titration-based approach that efficiently identifies biological activities in large chemical libraries. Proc Natl Acad Sci USA 103(31):11473–11478
3. Cheeseman MA (2005) Thresholds as a unifying theme in regulatory toxicology. Food Addit Contam 22(10):900–906
4. Riley RJ, Kenna JG (2004) Cellular models for ADMET predictions and evaluation of drug-drug interactions. Curr Opin Drug Discov Devel 7(1):86–99
5. Dix DJ, Houck KA, Martin MT, Richard AM, Setzer RW, Kavlock RJ (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95(1):5–12
6. Yang C, Valerio LG Jr, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37(5):523–531
7. Valerio LG Jr (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol 241(3):356–370
8. Dash A, Inman W, Hoffmaster K, Sevidal S, Kelly J, Obach RS, Griffith LG, Tannenbaum SR (2009) Liver tissue engineering in the evaluation of drug safety. Expert Opin Drug Metab Toxicol 5(10):1159–1174
9. Park MV, Lankveld DP, van Loveren H, de Jong WH (2009) The status of in vitro toxicity studies in the risk assessment of nanomaterials. Nanomedicine (Lond) 4(6):669–685
10. Durham SK, Pearl GM (2001) Computational methods to predict drug safety liabilities. Curr Opin Drug Discov Devel 4(1):110–115
11. Jacobson-Kram D, Contrera JF (2007) Genetic toxicity assessment: employing the best science for human safety evaluation. Part I: early screening for potential human mutagens. Toxicol Sci 96(1):16–20
12. Muster W, Breidenbach A, Fischer H, Kirchner S, Muller L, Pahler A (2008) Computational toxicology in drug development. Drug Discov Today 13(7–8):303–310
13. Bailey AB, Chanderbhan R, Collazo-Braier N, Cheeseman MA, Twaroski ML (2005) The use of structure-activity relationship analysis in the food contact notification program. Regul Toxicol Pharmacol 42(2):225–235
14. Valerio L Jr (2008) Tools for evidence-based toxicology: computational-based strategies as a viable modality for decision support in chemical safety evaluation and risk assessment. Hum Exp Toxicol 27(10):757–760
15. Snyder RD (2009) An update on the genotoxicity and carcinogenicity of marketed pharmaceuticals with reference to in silico predictivity. Environ Mol Mutagen 50(6):435–450
16. Zvinavashe E, Murk AJ, Rietjens IM (2009) On the number of EINECS compounds that can be covered by (Q)SAR models for acute toxicity. Toxicol Lett 184(1):67–72
17. Zvinavashe E, Murk AJ, Rietjens IM (2008) Promises and pitfalls of quantitative structure-activity relationship approaches for predicting metabolism and toxicity. Chem Res Toxicol 21(12):2229–2236
18. Yang C, Benz RD, Cheeseman MA (2006) Landscape of current toxicity databases and database standards. Curr Opin Drug Discov Devel 9(1):124–133
19. Young DM, Martin TM, Venkatapathy R, Harten P (2008) Are the chemical structures in your QSAR correct? QSAR Comb Sci 27:1337–1345
20. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50(7):1189–1204
21. Richard AM, Williams CR (2002) Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat Res 499(1):27–52
22. Judson R, Richard A, Dix DJ, Houck K, Martin M, Kavlock R, Dellarco V, Henry T, Holderman T, Sayre P, Tan S, Carpenter T, Smith E (2009) The toxicity data landscape for environmental chemicals. Environ Health Perspect 117(5):685–695
23. Yang C, Richard AM, Cross KP (2006) The art of data mining the minefields of toxicity databases to link chemistry to biology. Curr Comput Aided Drug Des 2:135–150
24. PubChem (2008) http://pubchem.ncbi.nlm.nih.gov/
25. Knudsen TB, Martin MT, Kavlock RJ, Judson RS, Dix DJ, Singh AV (2009) Profiling developmental toxicity of 387 environmental chemicals using EPA's toxicity reference database (ToxRefDB). Birth Defects Res A Clin Mol Teratol 85(5):406
26. Martin MT, Judson RS, Reif DM, Kavlock RJ, Dix DJ (2009) Profiling chemicals based on chronic toxicity results from the US EPA ToxRef database. Environ Health Perspect 117(3):392–399
27. ToxRefDB (2010) http://actor.epa.gov/toxrefdb/faces/Home.jsp
28. FDA Liver Side Effect (2010) http://www.fda.gov/AboutFDA/CentersOffices/CDER/ucm092203.htm
29. ChEMBL (2010) http://www.ebi.ac.uk/chembldb/index.php
30. ToxCast (2010) http://www.epa.gov/comptox/toxcast/
31. Fonger GC, Stroup D, Thomas PL, Wexler P (2000) TOXNET: a computerized collection of toxicological and environmental health information. Toxicol Ind Health 16(1):4–6
32. Shukla SJ, Huang R, Austin CP, Xia M (2010) The future of toxicity testing: a focus on in vitro methods using a quantitative high-throughput screening platform. Drug Discov Today 15(23–24):997–1007
33. Rogers D, Hopfinger AJ (1994) Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships. J Chem Inf Comput Sci 34:854–866
34. Kubinyi H (1994) Variable selection in QSAR studies. I. An evolutionary algorithm. Quant Struct Act Relat 13:285–294
35. So SS, Karplus M (1996) Evolutionary optimization in quantitative structure-activity relationship: an application of genetic neural networks. J Med Chem 39(7):1521–1530
36. So SS, Karplus M (1996) Genetic neural networks for quantitative structure-activity relationships: improvements and application of benzodiazepine affinity for benzodiazepine/GABAA receptors. J Med Chem 39(26):5246–5256
37. Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. Quant Struct Act Relat Comb Sci 22:69–77
38. Golbraikh A, Tropsha A (2002) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aided Mol Des 16(5–6):357–369
39. Norinder U (1996) Single and domain mode variable selection in 3D QSAR applications. J Chemomet 10:95–105
40. Zefirov NS, Palyulin VA (2001) QSAR for boiling points of "small" sulfides. Are the "high-quality structure-property-activity regressions" the real high quality QSAR models? J Chem Inf Comput Sci 41(4):1022–1027
41. Kubinyi H, Hamprecht FA, Mietzner T (1998) Three-dimensional quantitative similarity-activity relationships (3D QSiAR) from SEAL similarity matrices. J Med Chem 41(14):2553–2564
42. Novellino E, Fattorusso C, Greco G (1995) Use of comparative molecular field analysis and cluster analysis in series design. Pharm Acta Helv 70:149–154
43. Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20(4):269–276
44. Golbraikh A, Shen M, Xiao Z, Xiao YD, Lee KH, Tropsha A (2003) Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 17(2–4):241–253
45. Stouch TR, Kenyon JR, Johnson SR, Chen XQ, Doweyko A, Li Y (2003) In silico ADME/Tox: why models fail. J Comput Aided Mol Des 17(2–4):83–92
46. Johnson SR (2008) The trouble with QSAR (or how I learned to stop worrying and embrace fallacy). J Chem Inf Model 48(1):25–26
47. Lombardo F, Gifford E, Shalaeva MY (2003) In silico ADME prediction: data, models, facts and myths. Mini Rev Med Chem 3(8):861–875
48. Klopman G, Zhu H, Ecker G, Chiba P (2003) MCASE study of the multidrug resistance reversal activity of propafenone analogs. J Comput Aided Mol Des 17(5–6):291–297
49. Stoner CL, Gifford E, Stankovic C, Lepsy CS, Brodfuehrer J, Prasad JVNV, Surendran N (2004) Implementation of an ADME enabling selection and visualization tool for drug discovery. J Pharm Sci 93(5):1131–1141
50. Mayer P, Reichenberg F (2006) Can highly hydrophobic organic substances cause aquatic baseline toxicity and can they contribute to mixture toxicity? Environ Toxicol Chem 25(10):2639–2644
51. Forsby A, Blaauboer B (2007) Integration of in vitro neurotoxicity data with biokinetic modelling for the estimation of in vivo neurotoxicity. Hum Exp Toxicol 26(4):333–338
52. Schirmer K, Tanneberger K, Kramer NI, Volker D, Scholz S, Hafner C, Lee LE, Bols NC, Hermens JL (2008) Developing a list of reference chemicals for testing alternatives to whole fish toxicity tests. Aquat Toxicol 90(2):128–137
53. Piersma AH, Janer G, Wolterink G, Bessems JG, Hakkert BC, Slob W (2008) Quantitative extrapolation of in vitro whole embryo culture embryotoxicity data to developmental toxicity in vivo using the benchmark dose approach. Toxicol Sci 101(1):91–100
54. Sjostrom M, Kolman A, Clemedson C, Clothier R (2008) Estimation of human blood LC50 values for use in modeling of in vitro-in vivo data of the ACuteTox project. Toxicol In Vitro 22(5):1405–1411
55. Zhu H, Ye L, Richard A, Golbraikh A, Wright FA, Rusyn I, Tropsha A (2009) A novel two-step hierarchical quantitative structure-activity relationship modeling work flow for predicting acute toxicity of chemicals in rodents. Environ Health Perspect 117(8):1257–1264
56. Sedykh A, Zhu H, Tang H, Zhang L, Richard A, Rusyn I, Tropsha A (2011) Use of in vitro HTS-derived concentration-response data as biological descriptors improves the accuracy of QSAR models of in vivo toxicity. Environ Health Perspect 119:364–370
57. Zhu H, Rusyn I, Richard AM, Tropsha A (2008) Use of cell viability assay data improves the prediction accuracy of conventional quantitative structure activity relationship models of animal carcinogenicity. Environ Health Perspect 116(4):506–513
58. Thomas CJ, Auld DS, Huang R, Huang W, Jadhav A, Johnson RL, Leister W, Maloney DJ, Marugan JJ, Michael S, Simeonov A, Southall N, Xia M, Zheng W, Inglese J, Austin CP (2009) The pilot phase of the NIH Chemical Genomics Center. Curr Top Med Chem 9(13):1181–1193
59. Xia M, Huang R, Witt KL, Southall N, Fostel J, Cho MH, Jadhav A, Smith CS, Inglese J, Portier CJ, Tice RR, Austin CP (2008) Compound cytotoxicity profiling using quantitative high-throughput screening. Environ Health Perspect 116(3):284–291
60. ICCVAM and NICEATM (2001) Report of the International Workshop on In Vitro Methods for Assessing Acute Systemic Toxicity. Interagency Coordinating Committee on the Validation of Alternative Methods and National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, Report 01-4499. National Institutes of Health, Bethesda, MD
Chapter 4

Mutagenicity, Carcinogenicity, and Other End points


Romualdo Benigni, Chiara Laura Battistelli, Cecilia Bossa,
Mauro Colafranceschi, and Olga Tcheremenskaia

Abstract
Aiming at understanding the structural and physical chemical basis of the biological activity of chemicals, the science of structure–activity relationships has seen dramatic progress in the last decades. Coarse-grain, qualitative approaches (e.g., the structural alerts) and fine-tuned quantitative structure–activity relationship models have been developed and used to predict the toxicological properties of untested chemicals. More recently, a number of approaches and concepts have been developed as support to, and corollary of, the structure–activity methods. These approaches (e.g., chemical relational databases, expert systems, software tools for manipulating the chemical information) have dramatically expanded the reach of the structure–activity work; at present, they are powerful and inescapable tools for computer chemists, toxicologists, and regulators. This chapter, after a general overview of traditional and well-known approaches, gives a detailed presentation of the latter more recent support tools freely available in the public domain.

Key words: QSAR, Structure–activity, Toxicology, Human health, Chemicals, Databases, Expert systems, Predictive toxicology, Structural alerts

1. Introduction

Since the birth of modern chemistry, investigators have always been eager to understand the structural and physical chemical basis of the biological activity of chemicals, one of the main aims being their domestication. A brilliant illustration of how the concepts and practice of structure–activity relationships (SARs) had a strong acceleration in toxicology in the mid-1980s is provided by Ariens (1). One approach is the qualitative one that takes into account the significance of particular groups in the molecule for particular aspects, or part processes, in the biological action. Examples are groups described as pharmacophores or toxicophores or structural alerts (SAs). This can be called a coarse-grain approach. The other approach (fine-tuned) is the formalized quantitative structure–activity relationship (QSAR) approach.


The foundation of QSAR came about 40 years ago, when Corwin Hansch found the way to bring together two areas of science which had seemed far apart for many years: physical organic chemistry and the study of Chemical–Life Interaction (2, 3).
This model worked for an enormous number of biological pro-
blems, and its success is demonstrated clearly by its widespread
diffusion. In the years subsequent to the 1960s, the need to solve
new problems, together with the contributions of many other
investigators, generated hundreds of variations of the Hansch
approach, as well as approaches that are formally completely new.
For example, new descriptors generated through direct mathematical modeling of chemical structures were introduced (4–8), and the range of mathematical models used to link chemistry and biology has expanded accordingly (9–13). However, the QSAR
science still maintains a fundamental unity, founded on the system-
atic use of mathematical models and on the multivariate point of
view. At present, the QSAR science is one of the basic tools of
modern drug and pesticide design, and has an increasing role in environmental sciences (14–20). A great aspect of the QSARs (especially when applied within individual chemical classes) is that they point to the chemical determinants of the biological activity of the compounds; thus they can contribute to the rationalization of
the biological activity mechanisms (3).
Even though the QSAR approach underwent a dramatic development in the 1980s, with an exponential increase in the methods and computerized technologies proposed, the knowledge of the SAs (the recognition and classification of the molecular substructures and reactive groups responsible for the toxic effects) still plays a primary role in the mechanistic science of toxicology and provides powerful means of intervention to domesticate the chemicals. This is even
more so in the field of carcinogenicity and mutagenicity, both for
historical reasons and because the availability of QSARs is still
limited. As a matter of fact, the knowledge on the action mechan-
isms as exemplified by the SAs is routinely used in the regulatory
context (see, for example, the mechanistically based reasoning as in
ref. 21), or in the prioritization of chemicals to be tested in the
animal bioassay (22). In addition, the SAs are at the basis of popular
commercial (e.g., DEREK, by Lhasa Ltd. (23)) and noncommercial
software systems (e.g., Oncologic (24, 25); Toxtree (26, 27)).
Regarding the QSARs, these have been generated for several
individual chemical classes of chemical carcinogens and mutagens
(including aromatic amines, nitroarenes, quinolines, triazenes,
polycyclic aromatic hydrocarbons, lactones, aldehydes) (28, 29).
The majority of these QSARs relate to in vitro mutagenicity; however, a number of QSAR models for animal carcinogenicity
exist as well.
Whereas the original aim of QSAR was to analyze congeneric classes of chemicals (i.e., chemicals structurally similar, and acting through the same mechanism of action or, better, the same rate-limiting step), because of practical reasons many QSAR applications
to toxicology consist of models developed from noncongeneric
samples of chemicals, and hopefully suitable for predicting the
activity of any class of compounds. Several popular commercial
systems are in this category (e.g., TopKat (30), Multicase
(31, 32)), as well as freely available methods (e.g., LAZAR (33)).
A wide range of very valuable reviews have been written on the
subject of the above methods and approaches (a few examples, out
of many, are refs. 20, 33–48) and they will not be repeated here. We
focus instead on some crucial issues related to the predictive ability
of the qualitative and quantitative information that can be obtained
from the SARs. Subsequently, we present in more detail a number
of approaches and concepts that have been developed recently as
support to, and corollary of, the structure–activity work. These approaches (e.g., chemical relational databases, expert systems, software tools for manipulating the chemical information) have dramatically expanded the reach of the structure–activity work,
and are powerful and inescapable tools for computer chemists,
toxicologists, and regulators.

2. Structure–Activity Relationships and the Prediction of Toxicological End Points

Even though in general it is difficult to assess the contribution that qualitative, mechanistically based SAR information has given to risk
assessment and to the domestication of chemistry, various evidence
is however available, and indicates that the contribution is impor-
tant. A brilliant case is represented by the priority setting process at
the US National Toxicology Program in selecting chemicals to be
bioassayed. The analysis was performed when around 400 chemi-
cals had been bioassayed for their carcinogenic activity. It appears
that two-thirds were selected for the bioassay because suspect,
mainly on the basis of structural considerations. One-third was
selected because of production/exposure considerations. The anal-
ysis showed that the structural criteria adopted to short-list suspect
chemicals were able to enrich the target up to ten times. In fact,
70% of the chemicals bioassayed as suspect carcinogens were carci-
nogens, whereas 7% of the chemicals bioassayed only on produc-
tion/exposure considerations were carcinogen (22).
Another evidence on the positive influence of the mechanistic
information is that the rate of drugs and pesticides with known SAs
or with positive Salmonella results put into the market in recent
times has considerably decreased, as shown by the personal experi-
ence of one of the authors (R.B.) in his regulatory work. This is
confirmed by the much lower proportion of known SAs among
70 R. Benigni et al.

pharmaceutical drugs approved by the US Food and Drug Admin-


istration, in respect to the historical database of chemicals tested for
carcinogenicity (49). This indicates that the mechanistic informa-
tion on chemical carcinogenicity has become shared knowledge
among the synthetic chemists, and allows them to synthesize safer
chemicals.
After the definition and compilation of SAs following the elec-
trophilicity theory of the Millers by Ashby (50, 51), in more recent
times the mechanistic knowledge on chemical carcinogens has been
implemented into computerized expert systems that permit a faster
and more flexible assessment of chemicals, notably Oncologic (24)
and Toxtree (26) among the systems in the public domain. The
compilation of SAs implemented into the expert system Toxtree
2.1.0 has been subjected to validation studies: it appears that the
SAs have both high sensitivity and specificity for the Ames test
(overall accuracy = 0.79), while having a lower agreement with carcinogenicity (overall accuracy = 0.70). The lower agreement with carcinogenicity depends on the fact that the available set of SAs for nongenotoxic carcinogens is still limited. On the other hand, it should be emphasized that the SAs' predictive ability for Salmonella mutagenicity (accuracy = 0.79) is of the same order of magnitude as the experimental variability of the test itself (inter-laboratory reproducibility reported to be 80–85% (52)). This implies that
the assessment of chemical mutagenicity through the Ames test
and through the SAs has similar reliability (35).
A concept particularly important is that of the proper use of the
SAs. An SA is a chemical functional group or molecular substruc-
ture that is known to be linked, mechanistically and/or statistically,
to a certain adverse effect. Thus the presence of an SA in a molecule
is a forewarning about the potential adverse effects of the molecule.
For example, the presence of an epoxide in a molecule points to a
high carcinogenic and mutagenic risk. On the contrary, the absence
of SAs is not an indication of safety. Since the list of SAs for
carcinogenicity is limited to DNA-reactivity mechanisms and to a
few nongenotoxic carcinogenicity mechanisms (27), the molecule
may be active through mechanisms not yet coded into SAs. Very
often this fundamental conceptual difference between the presence
and absence of SAs is not taken into consideration, and a list of SAs
is regarded, erroneously, as able to predict both negative (e.g.,
noncarcinogens) and positive (e.g., carcinogens) test results. On
the contrary, the most important feature of a set of SAs is their
positive predictivity, i.e., the probability that chemicals with SAs are
actually positive in toxicity tests. It should be noticed that the
prediction of both positive and negative results is typical of the
more sophisticated QSAR models, where a mathematical combina-
tion of properties or structural characteristics of the molecules
permits the calculation of a probability toxicity score ranging from
zero (negative) to one (positive). Only in very special cases a list of
4 Mutagenicity, Carcinogenicity, and Other End points 71

SAs can be considered as a predictor of both negative and positive


chemicals: one such case is the prediction of Salmonella mutagenic-
ity, because the mechanisms of Salmonella mutagenicity are well
known and are exhaustively coded into SAs.
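In computational practice, screening for SAs amounts to substructure matching. The minimal sketch below, assuming Python with the RDKit library, illustrates the idea; the alert names and SMARTS patterns are simplified stand-ins, not the compiled rulebases of Toxtree or OncoLogic.

```python
# Minimal sketch of structural-alert (SA) screening, assuming RDKit.
# The mini-rulebase below is illustrative only, NOT a validated alert list.
from rdkit import Chem

ALERTS = {
    "epoxide": "C1OC1",                      # three-membered cyclic ether
    "primary aromatic amine": "c[NX3;H2]",
    "N-nitroso": "[NX3][NX2]=O",
}

def find_alerts(smiles):
    """Return the names of the alerts whose substructure occurs in the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparsable SMILES: {smiles}")
    return [name for name, smarts in ALERTS.items()
            if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts))]

# A positive match is a forewarning of potential activity; an empty list is
# NOT evidence of safety, since mechanisms not yet coded as SAs may operate.
print(find_alerts("ClCC1CO1"))   # epichlorohydrin: flags "epoxide"
print(find_alerts("CCO"))        # ethanol: no alert, hence no conclusion
```

Consistent with the discussion above, only the positive matches carry predictive weight; the absence of a match supports no conclusion about safety.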
The predictivity of the QSARs for individual chemical classes of
congeners has been assessed as well. A survey on the QSARs for
mutagens and carcinogens in the public domain has been per-
formed in a collaborative effort between the Istituto Superiore di
Sanità and the former European Chemicals Bureau (ECB). The
details of the study are published in ref. 47, and are put into a wider
perspective in ref. 53. The predictivity of the QSARs was checked
with real external test sets. The biological activities included Salmo-
nella mutagenicity and rodent carcinogenicity, for the classes of
aromatic amines, nitroarenes, and aliphatic aldehydes. The QSARs
for potency (applicable only to toxic chemicals) generated predic-
tions 30–70% correct, whereas the QSARs for discriminating between active and inactive chemicals were 70–100% correct in their external predictions. To fully appreciate this result, the external predictivity of activity models (70–100% accuracy) should be
compared with the variability range of the experimental tests. As
reported above, the inter-laboratory repeatability of a very good
test like the Ames test has been estimated to be 80–85% (52). Thus
the level of uncertainty of good QSARs is comparable to that of
good experimental assays.
The external predictivity of the QSAR for noncongeneric
chemicals has been assessed as well in various exercises. A survey
of the different exercises has pointed out that the predictivity of the
models varies greatly according to the sample of chemicals whose
toxicity is to be predicted, and that the performance in the different
regions of the universe of chemicals is largely unpredictable (53).

3. SAR-Based Expert Systems
This section presents in detail three expert systems aimed at asses-
sing the hazard posed by chemicals, based primarily (but not only)
on structural considerations. The systems are Oncologic, Toxtree,
and the Organization for the Economic Co-operation and Devel-
opment (OECD) QSAR Toolbox. They have very different struc-
ture and extent, since they were designed in different periods of
time and have different specific aims. In particular, Oncologic aims
at predicting carcinogenicity, Toxtree has modules for a series of
toxicity end points, whereas the OECD QSAR Toolbox has a
number of tools that in combination support the users in the
process of hazard assessment.
3.1. OncoLogic Cancer Expert System

OncoLogic is a publicly available, rules-based expert approach developed by the US Environmental Protection Agency's (EPA) Structure Activity Team (SAT). It is a computerized system that
mimics the thinking and reasoning of human experts on the basis
of their knowledge regarding toxicological effects of certain classes
of compounds. The system predicts the potential carcinogenicity of
chemicals by applying rules of SAR analysis, and high-level expert
judgement that incorporates what is known about metabolism,
mechanisms of action, and human epidemiological studies. Onco-
Logic has been developed using expert system technology based on
knowledge rules which represent SAT's knowledge of SAR of
carcinogens (48). The goals of developing this system were various:
(1) to provide guidance to industries on elements of concern for
developing safer chemicals; (2) to provide a source of information on
the rationale for identifying potential cancer hazard of chemicals; (3)
to provide a forum for reaching a common understanding among
various regulatory agencies in hazard identification of chemical car-
cinogens; and (4) to stimulate research to fill knowledge gaps.
OncoLogic uses two different methods to predict potential
carcinogenicity, structural (SAR) analysis, and functional analysis.
Structural analysis makes use of mechanism-based SAR analysis,
which involves comparison with structurally related compounds
with known carcinogenic activity, identification of structural moi-
eties or fragments that may contribute to carcinogenic activity
through a perceived or postulated mechanism, and evaluation of
the modifying role of the remainder of the molecule to which the
structural moiety/fragment is attached. The structural analysis arm
consists of four modules, including the organics module, metals
module, polymers module, and fibers module. Functional analysis
integrates available mechanistic/noncancer studies on the chemical
in order to predict the potential for the chemical to be a tumor
initiator, promoter, or progressor. Results from the functional anal-
ysis can be used to provide support to the results of the structural
analysis, or can be used as an independent method of analysis. The
structural and functional analyses must be performed separately.
The OncoLogic expert system provides an interactive, user-
friendly interface that assumes that the user has a basic understanding of chemistry. The fiber, metal, and polymer subsystems require
the user to enter information about substances by selecting items
from menus or entering data through data entry screens. The
organics subsystem allows the user to draw the molecule to be
evaluated through the use of an uncomplicated, straightforward
drawing package designed specifically for this program. Necessary
information to input a chemical into OncoLogic depends on the
type of chemical. Inputs may include chemical name, CAS number,
or structure, and chemical, biological, and mechanistic information
(e.g., physicochemical properties, chemical stability, route of
exposure, bioactivation and detoxification, genotoxicity, and
other supportive data) critical to the evaluation of carcinogenic


potential. OncoLogic prompts the user when any of this informa-
tion is required for the specific chemical in question.
Output information will include a prediction of the carcino-
genic potential of the chemical, expressed semiquantitatively, and
the underlying scientific rationale. The concern levels used by the
OncoLogic system are semantic terms used by the US EPA SAT for
ranking the hazard levels of chemicals. Six concern levels are used to
allow semiquantitative ranking of relative hazard. The specific terms
in order from lowest concern to highest concern are low, marginal,
low-moderate, moderate, high-moderate, and high.
Once an evaluation has been performed, the OncoLogic system
produces reports explaining the carcinogenicity concern for the
evaluated compound. There are two types of reports produced, a
data report and a justification report. The data report contains the
information that was entered about the compound. Essentially it
becomes a reference report of the data entered upon which the
OncoLogic system based its evaluation. The justification report
contains a summary of the predicted concern level for the com-
pound and the line of reasoning used by the expert system to arrive
at the concern level. It consists of two parts: a summary of the
evaluation and a line of reasoning (justification) of how the final
conclusions were derived. Within the summary section, a final level
of concern is stated along with any other messages that merit special
attention. The line of reasoning part of the justification report
keeps track of the rules that are used to arrive at a level of concern.
It is specific to each evaluation and represents the actual rules used
for the particular compound.
The user of OncoLogic takes a compound of unknown toxicity
and determines to which structural class the compound belongs
(e.g., carbamates, ethyleneimine, haloalkylamine, nitrosamines,
etc.). Then he or she chooses the appropriate structural class and
adds substituents, heteroatoms, and other components to create an
accurate representation of the compound in question. Therefore, in
order to use OncoLogic correctly, the user must have a basic
knowledge of organic chemistry and ability to place chemicals in
the appropriate chemical class. In particular the user should know
certain characteristics of the chemical of interest, like structure
(including subunits present), physical/chemical properties (stabil-
ity, etc.), biological and mechanistic information, as well as possible
routes of exposure.
The system assigns a baseline concern level ranging from low to
high and evaluates how substituents on the chemical may affect
carcinogenicity. To add substituents or atoms when drawing a
chemical structure, the user has to select the type of substituent
or atom to add, and place the cursor just ahead of where he or she
wants the substituent to appear. When all substituents are correctly
placed, the evaluation will be performed and the justification report
displayed.
If an appropriate chemical class is not available for a specific


chemical, potential carcinogenicity cannot be evaluated using SAR
analysis by OncoLogic. For most cases, absence of an appropriate
class/structure in OncoLogic provides suggestive, but not defini-
tive, evidence of low cancer concern. If mechanistic/noncancer
studies are available, potential carcinogenicity can be evaluated
using the functional analysis arm of OncoLogic, instead of the
structural analysis arm.
A major drawback of the program is its inability to analyze
certain compounds (e.g., in a recent exercise many compounds in
the Carcinogenic Potency Database (CPDB) could not be analyzed
by OncoLogic (approximately 46%), since they did not fall into one
of OncoLogic's defined structure classes (54)). However, OncoLogic's accuracy in predicting the carcinogenic potential of the
compounds it can analyze has been clearly established and indicates
the power of an SAR analysis software program whose basis is the
decision tree/expert rules approach (e.g., refs. 54, 55).
OncoLogic version 7.0 is freely available at http://www.epa.gov/oppt/sf/pubs/oncologic.htm.

3.2. Toxtree

Toxtree version 2.1.0, developed by Ideaconsult Ltd under an ECB contract, is an open source software application available as a free download from the ECB Web site: http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/. Toxtree is capable of estimating different types
of toxic effects and modes of action by applying structural rules in a
decision tree approach. Currently, modules are available for apply-
ing the following rulebases:
1. Cramer classification scheme for threshold of toxicological
concern (TTC) estimation.
2. Extended Cramer scheme.
3. Verhaar scheme for predicting the mode of toxic action in
aquatic species.
4. Decision tree for estimating skin irritation and corrosion
potential.
5. Decision tree for estimating eye irritation and corrosion.
6. Mutagenicity and carcinogenicity rulebase.
7. ToxMic rulebase for the in vivo micronucleus assay.
8. Structural alerts for the identification of Michael acceptors.
9. START rulebase for persistence/biodegradation potential.
10. Kroes decision tree for TTC estimation.
11. Decision trees for estimating skin sensitization.
12. Cytochrome P-450 Drug Metabolism.
1. Cramer decision tree


The Cramer et al. procedure for the estimation of toxic hazard gives rise
to a classification of a specific chemical, by criteria based largely
on structure or on biochemistry and physiological chemistry
(56). The decision tree consists of 33 questions, derived by the
use of available toxicological data. Each answer leads to another
question or to final classification into one of the three classes (I,
II, and III), reflecting a presumption of low, moderate, and
serious toxicity, respectively.
Cramer et al. structural classes are widely used in TTC
approaches for risk characterization (57, 58).
2. Extended Cramer scheme
Extensions to Cramer rules were developed and implemented
by Curious-IT, the Netherlands: five new rules have been intro-
duced in the Extended Cramer tree module in addition to the
existing ones. This extension was made in order to take into
account the toxicity of several compounds, analyzed by Munro
et al. (59), and possibly misclassified in the former scheme.
3. Verhaar scheme
Verhaar scheme for aquatic toxicity is a chemistry-based classi-
fication system, which separates simple organic chemicals into
four classes, according to mechanism of action (60).
Class 1 includes inert chemicals that do not cause overall
acute effects and do not interact with specific receptors in an
organism. This class defines the baseline toxicity, characterized
by narcosis mode of action for acute aquatic toxicity. Less inert
chemicals are positioned in class 2. These chemicals are slightly
more toxic than class 1 compounds and are described to act by a
polar narcosis mechanism. Class 3 contains reactive chemicals: such chemicals are able to react unselectively with biomolecules, forming irreversible covalent bonds. Class 4 groups
specifically acting chemicals that carry out toxic action via
receptor binding mechanisms.
The assignment of a compound to a particular mechanism of action, using the Verhaar structure-based rules, is important in
the development and utilization of class-specific QSARs, and in
the formation of categories of chemicals for further assessment.
4. Skin irritation/corrosion
This module consists of a decision tree for estimating skin
irritation and corrosion potential, based on the Skin Irritation
Corrosion Rules Estimation Tool (SICRET) (61).
The procedure includes identification of chemicals with no
skin irritation or corrosion potential, by the application of
physicochemical property limits (62). The physicochemical
properties include melting point, molecular weight, octanol-
water partition coefficient, surface tension, vapor pressure,
aqueous solubility, and lipid solubility. If a chemical does not
meet the prescribed limits, then its skin irritation or corrosion
potential is estimated via structural alerts. The structural alerts
were categorized based on the mechanisms associated with skin
irritation, skin corrosion, or both irritation and corrosion (63).
5. Eye irritation/corrosion
Similarly to the skin irritation/corrosion module, this decision
tree estimates eye irritation and corrosion potential by physico-
chemical property limits and structural rules (64).
6. Mutagenicity and carcinogenicity rulebase
This module uses a structure-based approach consisting of a
refined compilation of SAs for carcinogenicity and Salmonella
typhimurium (Ames test) mutagenicity. It also includes three
mechanistically based QSAR models for congeneric classes
(aromatic amines and α,β-unsaturated aldehydes). Among 33
structural alerts included in the expert system, five structural
alerts refer to nongenotoxic mechanisms of action. Most SAs
have accompanying modulating factors. The list of SAs results
from a critical analysis of previous compilations (26) (http://
ecb.jrc.it/documents/QSAR/EUR_23241_EN.pdf).
7. ToxMic rulebase for the in vivo micronucleus assay
This rulebase provides a list of SAs for a preliminary screening of
potentially in vivo mutagens. The list was developed using the
alerts for genotoxic carcinogenicity (from the carcinogenicity/
mutagenicity module in Toxtree) as a core, and then including
additional substructures specific to the micronucleus-positive
chemicals (65).
8. Structure alerts for identification of Michael acceptors
This module contains structural alerts, able to identify Michael
acceptors (i.e., molecules reactive via Michael-type addition)
(66). Michael-type acceptors are electrophilic molecules con-
taining double bonds polarized by a neighboring electron-
withdrawing substituent. These compounds may react with
nucleophilic groups on peptides, proteins, or DNA, resulting
in covalent modifications. As a consequence, a wide range of
adverse effects may arise, including general toxicity, allergenic
reactions, mutagenicity, and carcinogenicity.
9. START
START rulebase (Molecular Networks, Germany) is a compilation of SAs for environmental persistence and biodegradability (http://ecb.jrc.ec.europa.eu/documents/QSAR/QSAR_TOOLS/Toxtree_start_manual.pdf). Thirty-two SAs are included in the decision tree; of these, 23 SAs refer to mechanisms of action of environmentally persistent chemicals, while 9 SAs refer to easily biodegradable chemicals. According to the START rulebase, substances are classified into one of three categories (a schematic sketch of this three-class logic appears after this list). Class 1, easily biodegradable chemical, is assigned if one or more SAs for biodegradability are found in the query chemical. Otherwise, the SAs for persistence are checked, and if any is found the query chemical is classified as persistent (class 2). If the query chemical is not assigned to either of the above classes, it is classified in class 3, unknown biodegradability.
10. Kroes decision tree
This Toxtree module is an implementation of the decision tree
proposed by ILSI Europe to decide whether substances can
be assessed by the TTC approach (58). The initial step is the
identification and evaluation of possible genotoxic (by the
application of rules contained in the mut/carc module)
and/or high-potency carcinogens. Following this step, non-
genotoxic substances are evaluated in a sequence of steps
related to the concerns (based on Cramer classes) that
would be associated with increasing intakes. Daily intake is
required from user input.
11. Skin sensitization alerts
This module identifies skin sensitization structural alerts (67).
It is widely accepted that skin sensitization to low-molecular-
weight compounds almost invariably involves covalent bind-
ing of the sensitizing chemical to protein in skin (68). Usually,
the sensitizer is an electrophile itself or undergoes abiotic or
biochemical transformation into an electrophile (pro-
electrophile), and reacts with nucleophilic groups in proteins.
In skin sensitization, some of the most frequently encoun-
tered reactions are Michael-type reactions, SN2 reactions,
SNAr reactions, acylation reactions, and Schiff base forma-
tion. SAs associated with these mechanisms of action are
implemented in Toxtree module.
12. Cytochrome P-450 Drug Metabolism
This Toxtree module is based on SMARTCyp program
(http://www.farma.ku.dk/index.php/SMARTCyp/7990/0/).
SMARTCyp is an in silico method that predicts the sites of
cytochrome P450-mediated metabolism of drug-like mole-
cules. Preferentially, sites labile for metabolism by isoform
3A4 are predicted (69).
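All of the rulebases above share a common computational skeleton: parse the structure, test it against a library of SMARTS-encoded alerts (sometimes gated by physicochemical property limits), and walk a short decision tree to a class label. The sketch below, assuming Python with RDKit, illustrates this skeleton with the three-class logic of the START module (item 9); the SMARTS patterns are simplified illustrations, not the actual START alerts.

```python
# Schematic sketch of a Toxtree-style decision tree, assuming RDKit.
# The SMARTS below are illustrative stand-ins, NOT the actual START alerts.
from rdkit import Chem

BIODEGRADABILITY_ALERTS = [
    "[CX3](=O)[OX2H1]",      # carboxylic acid (illustrative)
    "[CX3](=O)[OX2][CX4]",   # ester (illustrative)
]
PERSISTENCE_ALERTS = [
    "c1ccccc1Cl",            # chlorinated aromatic ring (illustrative)
    "[CX4](F)(F)F",          # trifluoromethyl group (illustrative)
]

def matches_any(mol, smarts_list):
    """True if the molecule contains any of the given substructures."""
    return any(mol.HasSubstructMatch(Chem.MolFromSmarts(s)) for s in smarts_list)

def start_like_class(smiles):
    """START-like logic: class 1 = easily biodegradable, class 2 = persistent,
    class 3 = unknown biodegradability."""
    mol = Chem.MolFromSmiles(smiles)
    if matches_any(mol, BIODEGRADABILITY_ALERTS):
        return 1
    if matches_any(mol, PERSISTENCE_ALERTS):
        return 2
    return 3

print(start_like_class("CCOC(C)=O"))     # ethyl acetate: ester alert -> class 1
print(start_like_class("Clc1ccccc1Cl"))  # dichlorobenzene: persistence -> class 2
```

The same pattern, with different alert libraries and modulating rules, underlies the mutagenicity/carcinogenicity, skin sensitization, and Michael acceptor modules.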
The carcinogenicity/mutagenicity module will be used in the
following example to illustrate Toxtree functioning (full details on
the usage of the program can be found in http://ecb.jrc.ec.europa.eu/DOCUMENTS/QSAR/QSAR_TOOLS/Toxtree_user_manual.pdf).
In the Toxtree application window, four different areas can be
distinguished: (1) the compounds properties area on the top left;
(2) the compound structure diagram area on the left bottom;
(3) the classification area on the right top; and (4) the verbose
explanation area on the bottom right.
After launching Toxtree, the first step is to provide the program


with a 2D chemical structure, which is the fundamental (and often the only) information that the software needs in order to estimate
toxic hazard. One or more chemical structures can be loaded in the
software in order to perform an evaluation (batch processing is
available in case of necessity, e.g., large datasets). This can be easily
accomplished by importing a structure file (several structure for-
mats are recognized, e.g., mol or sdf format, SMILES or InChI strings, etc.), by typing a SMILES string in the SMILES input area, or by drawing a chemical directly with the structure editor facility.
Once the query structure(s) is entered, the following step will
be the choice of the decision tree to be applied, depending on the
end point of interest. Finally, in order to apply the active decision
tree to the query chemical, the "Estimate" button should be
pressed. Results will be displayed in the classification area, and
the verbose explanation can be checked to get details on the under-
lying reasoning. The processed molecules, together with classifica-
tion data, can be saved in a file (compatible types are CSV, SDF,
and TXT).
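When many structures must be screened, the batch input file can be prepared programmatically. A small sketch, assuming Python with RDKit, that writes a set of named structures to an sdf file suitable for import; the file name, compound names, and SMILES are arbitrary examples:

```python
# Sketch: writing named structures to an SDF file for batch import,
# assuming RDKit. File and compound names are arbitrary.
from rdkit import Chem

compounds = {
    "benzidine": "Nc1ccc(cc1)-c1ccc(N)cc1",
    "ethanol": "CCO",
}

writer = Chem.SDWriter("toxtree_batch_input.sdf")
for name, smi in compounds.items():
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue                  # skip entries that fail to parse
    mol.SetProp("_Name", name)    # becomes the title line of the SDF record
    writer.write(mol)
writer.close()
```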
Through the application of the carcinogenicity/mutagenicity
module, the processing of a query chemical can give rise to different
outcomes, concerning the presence/absence of structural alerts
and the result of the application of one or two QSAR models,
when applicable. In particular, the system flags either outcome
through one or a combination of several labels.
Based on the SAs decision tree application, the following results (labels) are possible:
- Structural alert for genotoxic carcinogenicity: assigned if the system recognizes the presence of one or more SAs and specifies a genotoxic mechanism
or
- Negative for genotoxic carcinogenicity: assigned if no SAs for genotoxic carcinogenicity have been detected
and/or
- Structural alert for nongenotoxic carcinogenicity: assigned if the system recognizes the presence of one or more SAs and specifies a nongenotoxic mechanism
or
- Negative for nongenotoxic carcinogenicity: assigned if no SAs for nongenotoxic carcinogenicity have been detected
Furthermore, for molecules belonging to the applicability domain of the three QSAR models included in this Toxtree module, further labels may be flagged:
- Potential S. typhimurium TA100 mutagen based on QSAR: assigned according to the positive outcome of QSAR6 or QSAR13 (aromatic amines or α,β-unsaturated aldehydes, respectively)
or
- Unlikely to be a S. typhimurium TA100 mutagen based on QSAR: assigned according to the negative outcome of QSAR6 or QSAR13
and/or
- Potential carcinogen based on QSAR: assigned according to the positive outcome of QSAR8 (aromatic amines)
or
- Unlikely to be a carcinogen based on QSAR: assigned according to the negative outcome of QSAR8
or
- For a better assessment a QSAR calculation could be applied: assigned when one of QSAR6, QSAR8, or QSAR13 is applicable, but the user chooses not to apply a QSAR.
It is important to note that the SAs decision tree and the QSAR calculations are independent. Conflicting results that may arise are not errors; instead, they are due to the intrinsically different degrees of detail distinctive of the two methods: a coarse-grain classification for the SAs and a finer estimation for the QSARs.
In Fig. 1, the Toxtree evaluation of 2-chloro-1,4-benzenediamine by the carcinogenicity/mutagenicity rulebase is displayed. The outcomes assigned are shown in the left top panel. The red label, "structural alert for genotoxic carcinogenicity," signals the presence of a structural alert for primary aromatic amines (SA28). The green and blue labels have been generated after the QSAR6 and QSAR8 calculations, since the chemical falls into their applicability domains. The model for Salmonella mutagenicity (QSAR6) gives a positive outcome, whereas the chemical is predicted to be a noncarcinogen by the carcinogenicity model (QSAR8). As shown for this chemical, the application of the QSAR models allowed the structural alert-based predictions to be refined.

3.3. OECD QSAR Application Toolbox

The OECD QSAR Application Toolbox is a stand-alone free software application (http://www.oecd.org/document/54/0,3343,en_2649_34379_42923638_1_1_1_1,00.html) intended to be used by governments, the chemical industry, and toxicologists to fill gaps in toxicity data needed for assessing the hazards posed by chemicals.
The Toolbox contains databases with results from experimental
studies, a library of QSAR models, tools to estimate missing
experimental values by read across, i.e., extrapolating results from


tested chemicals to untested chemicals, tools to define categories of
chemicals to be regulated or considered in a similar way from the
point of view of risk, and tools to estimate missing experimental
values by trend analysis, i.e., interpolating or extrapolating from a
trend (increasing, decreasing, or constant) in results for tested
chemicals to untested chemicals within a category. The Toolbox is
able to quickly evaluate chemicals for common mechanisms or
modes of action as well as for common toxicological behavior or
consistent trends among results related to regulatory end points.
Documents on the use of the QSAR Toolbox and on the underlying rationale are available from http://www.oecd.org/document/54/0,3343,en_2649_34379_42923638_1_1_1_1,00.html#Guidance_Documents_and_Training_Materials_for_Using_the_Toolbox.
The Toolbox has six work modules which are used in the
workflow.
1. Chemical input: Provides the user with several means of enter-
ing the chemical of interest or target chemical. Since all
subsequent functions are based on chemical structure, the
goal here is to make sure that the molecular structure assigned
is the correct one.
2. Profiling: Identifies structural and mechanistic properties of the
target chemical, which are stored in the Toolbox and can
subsequently be used in the module on category definition to
group the target chemical with similar chemicals.
3. End points: Retrieving physical–chemical properties, environ-
mental fate, and toxicity data which are stored in the Toolbox.
This data gathering can be executed in a global fashion (i.e.,
collecting all data of all end points) or on a more narrowly
defined basis (e.g., collecting data for a single or a limited
number of end points).
4. Category definition: Provides the user with several means of
grouping chemicals into a (toxicologically) meaningful cate-
gory that includes the target molecule. This is the critical step
in the workflow and several options are available in the Toolbox
to assist the user in refining the category definition via subcate-
gorization.
5. Filling Data Gaps: Provides the user with three options for
making an end point-specific prediction for the untested chem-
ical (i.e., the target molecule). These options, in increasing
order of complexity, are by read across, by trend analysis, and
through the use of QSAR models.
6. Report: Provides the user with a downloadable report of the
Toolbox prediction.
Since the definition of a chemical category for data gap filling is
the critical step in the workflow of the Toolbox, the selection and
use of the profilers is an important process. The Toolbox provides
several options (i.e., profilers) to assist the user in defining and
refining the category definition. A chemical category is a group of
chemicals with physical–chemical and toxicological properties
which are likely to be similar or follow a regular pattern because
of their similar chemical structure. Using this category approach,
not every chemical needs to be tested for every end point because
data gaps may be filled by read across from a tested chemical to an
untested chemical, trend analysis (interpolation or extrapolation),
or related QSAR methods. The category approach used in the
Toolbox enables robust hazard assessment through mechanistic
comparisons without testing.
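A crude stand-in for this grouping step can be expressed as a fingerprint similarity search. The sketch below, assuming Python with RDKit, returns the database entries similar enough to the target to be considered candidate category members; the 0.5 threshold and the Morgan fingerprint are arbitrary illustrative choices, whereas the Toolbox itself relies on curated mechanistic profilers rather than raw similarity.

```python
# Sketch of category formation by structural similarity, assuming RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Morgan (circular) fingerprint as a bit vector."""
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), 2, nBits=2048)

def category_members(target_smiles, database_smiles, threshold=0.5):
    """Database entries whose Tanimoto similarity to the target exceeds the
    threshold; a crude stand-in for the Toolbox profilers."""
    target_fp = fingerprint(target_smiles)
    return [smi for smi in database_smiles
            if DataStructs.TanimotoSimilarity(target_fp, fingerprint(smi))
            >= threshold]

# 3-methylhexanal grouped against a toy database of candidate analogues
print(category_members("CCCC(C)CC=O", ["CCCCCC=O", "CC(C)CC=O", "c1ccccc1"]))
```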
The current version of the software includes several profilers


connected with genotoxicity and carcinogenicity such as the Tox-
tree carcinogenicity/mutagenicity module (26), the OASIS DNA
binding profiler developed by LMC Bourgas (70), and the Liverpool John Moores University DNA binding profiler.
The following example describes the prediction of the Ames mutagenicity for an untested compound, 3-methylhexanal (CAS Number: 19269-28-4, SMILES: CCCC(C)CC=O), which is the target chemical. The category approach is applied.
Chemical input could be done in several ways: most used are
CAS number, SMILES (simplified molecular input line entry
system) notation, and drawing the structure. After entering the
target molecule, we can move on to the second step in the Toolbox workflow, i.e., profiling. This is an electronic process of retrieving
relevant information on the target compound including likely
mechanism of action and a survey of organic functional groups,
which form the target chemical. The results of profiling automati-
cally appear as a box under the target chemical (see Fig. 2).
Fig. 2. OECD QSAR Toolbox: Profiling results for 3-methylhexanal.

The next step in the workflow is the "End point" determination, which refers to the electronic process of retrieving the
environmental, ecotoxicity, and toxicity data that are stored in the


Toolbox. Data gathering can be executed in a global way collecting
all data of all end points or on a more defined basis collecting data
for a single or a limited number of end points. In this example, we
focus our data gathering to the end point of mutagenicity and the
databases OASIS Genotox and ISSCAN. We will get an answer that
no experimental data is currently available for 3-methylhexanal. In
other words, we have identified a data gap, which we will try to fill
in the next steps. The following module is category definition, which
allows grouping chemicals into a toxicologically meaningful cate-
gory. For example, starting from a target chemical for which a
specific DNA binding mechanism is identified, analogues can be
found which can bind by the same mechanism and for which
experimental results are available. We could also decide to define
the category by using molecular similarity. In this case it is sufficient
to highlight the "organic functional groups" grouping method and then click on "defining category." For example, selecting the cri-
teria for category definition as a DNA binding mechanism by
OASIS (aldehydes class) we get a list of 34 chemicals. At this step
all available experimental data on these similar chemicals can be
collected specifying the end point of interest that in our case is
Bacterial Reverse Mutation Assay (Ames test) (Fig. 3).
During the next step in the workflow, "Filling Data Gaps," the user has three options for making a prediction for the target molecule: Read Across, Trend Analysis, and (Q)SAR models. In our case, with qualitative mutagenicity data, we can use the read across method.
The resulting plot shows the experimental results of all analogues (Y axis) against a descriptor (X axis), with the default descriptor being log Kow; the red dot represents the target chemical, while the purple dots represent the experimental results available for the analogues, which are used for the read across (Fig. 4).

Fig. 3. OECD QSAR Toolbox: Results of category definition and data collection for 3-methylhexanal.

Fig. 4. OECD QSAR Toolbox: Results of read across for 3-methylhexanal.

The user can refine the read across results, for example by deleting
some chemicals with functional groups which are not present in the
target molecule.
In our example the non-mutagenic potential could be pre-
dicted with confidence for the target chemical. During the last step, the whole history of the prediction can be printed or copied to a detailed report. It is important to always be sure that the data
used in any read across or trend analysis predictions are quality-
checked, and that unusual or outlying data within a category are
investigated before use.
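The logic of such a qualitative read across fits in a few lines. In the sketch below, assuming Python with RDKit, the analogue records are invented placeholders (not OASIS Genotox or ISSCAN data), and the calculated Crippen log P stands in for the Toolbox's default log Kow descriptor.

```python
# Sketch of a qualitative read across, assuming RDKit.
# The analogue data below are invented placeholders.
from rdkit import Chem
from rdkit.Chem import Crippen

# (SMILES, Ames result) for hypothetical tested analogues in the category
analogues = [("CCCCCC=O", "negative"),
             ("CC(C)CC=O", "negative"),
             ("CCCCC=O", "negative")]

target = "CCCC(C)CC=O"   # 3-methylhexanal, the untested target chemical

# Descriptor axis of the trend plot; calculated log P replaces log Kow here
for smi, result in analogues + [(target, "?")]:
    logp = Crippen.MolLogP(Chem.MolFromSmiles(smi))
    print(f"{smi:>12s}  logP={logp:5.2f}  Ames={result}")

# Simple read across: take the consensus call of the tested analogues
calls = [result for _, result in analogues]
print("Read-across prediction:", max(set(calls), key=calls.count))
```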

4. Support Tools for Structure–Activity Relationships Work

4.1. The OpenTox Platform

The FP7 Project OpenTox, supported by the European Union (www.opentox.org), provides an OpenSource predictive toxicology framework that allows toxicologists and risk assessors to access experimental data, (Q)SAR models, and toxicological information through an integrating platform that adheres to regulatory requirements, including OECD Guidelines for validation and reporting (71).
4 Mutagenicity, Carcinogenicity, and Other End points 85

The OpenTox prototype data infrastructure includes the European Chemicals Agency's (ECHA) list of preregistered substances, along
with experimental data from several sources in the public domain.
The two most important OpenTox applications are (a) Tox-
Predict (http://www.toxpredict.net) which predicts and reports
on toxicities for an input chemical structure and (b) ToxCreate
(http://www.toxcreate.net) which builds and validates a predic-
tive toxicity model based on an input toxicology dataset.
ToxPredict is aimed at users with little or no experience in QSAR predictions, and offers an easy-to-use user interface, allow-
ing the user to enter a chemical structure and to obtain in return a
toxicity prediction for one or more end points. If experimental
results for the molecule are found in the OpenTox database, then
they will be summarized and made available to the user. All necessary descrip-
tors are calculated, results of regression obtained, and chemical
similarity to calibration molecules evaluated.
The ToxPredict workflow can be divided into the following five
steps:
1. Enter/select a chemical compound: A user can specify the
chemical structure(s) for further estimation of toxicological
properties. Free text searching allows the user to find chemical
compounds by chemical names and identifiers, SMILES and
InChI strings, and any keywords available in the OpenTox data
infrastructure.
2. Display selected/found structures: The second step displays the
chemical compounds, selected by the previous step. The user
interface supports the selection/de-selection of structures, and
editing of the structures and associated relevant information.
3. Select models: A list of available models is displayed. Links to
training datasets, algorithms, and descriptor calculation are
provided.
4. Perform the estimation: Models, selected in step 3, are
launched. If a model relies on a set of descriptors, an automatic
calculation procedure is performed.
5. Display the results: This step displays estimation results, as well
as compound identification and other related data. Initial dem-
onstration reports in several formats can be accessed via icons
on the right-hand side of the browser display (Fig. 5).
The ToxCreate application, in contrast to ToxPredict, is aimed
at researchers working in the life sciences and toxicology, QSAR
experts, and industry and government groups supporting risk
assessment, who are interested in building predictive toxicology
models. It allows the creation of a number of models using one or
more algorithms.
Fig. 5. An example of the ToxPredict toxicity prediction (substance: benzidine, selected model: ToxTree rulebase for
carcinogenicity and mutagenicity).

The following sequence of steps explains the ToxCreate sample


session:
1. Upload dataset: Enables the user to specify a model training
dataset in CSV format, consisting of chemical structures
(SMILES) with binary class labels (e.g., active/inactive).
2. Create and display model: Displays information about the
model learned from the data submitted in the previous step.
At this point, the model is permanently stored on the server
and can be used for predictions at any time in the future.
3. Select and use model(s) for prediction: A chemical (specified via
SMILES code) can be entered in order to predict its chemical
behavior by arbitrary models existing on the server.
4. Display prediction results: Displays the predictions made by the
selected models from the previous step along with an image of
the predicted structure. Based on the selections made in the
previous screen, the expert user may predict the same structure
by a variety of algorithms for the same dataset/end point and
compare the predictions.
Together with model validation, users will be able to use
ToxCreate to select appropriate models with adjusted parameters.
By predicting a variety of related end points, instead of just one,
combined with arbitrary models at the same time, ToxCreate
enables free predictive toxicology modeling exploration along dif-
ferent dimensions.
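The CSV-of-SMILES-plus-labels input that ToxCreate consumes can be turned into a working classifier in a few lines. The sketch below, assuming Python with RDKit, pandas, and scikit-learn, is a generic stand-in for that step: the file and column names are hypothetical, and a random forest on Morgan fingerprints is one reasonable choice of algorithm, not the one used by the OpenTox services.

```python
# Sketch of ToxCreate-style model building, assuming RDKit, pandas, and
# scikit-learn. "training.csv" with columns SMILES and label is hypothetical.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def featurize(smiles):
    """Morgan fingerprint bits as a plain list of 0/1 features."""
    mol = Chem.MolFromSmiles(smiles)
    return list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))

df = pd.read_csv("training.csv")              # columns: SMILES, label
X = [featurize(s) for s in df["SMILES"]]
y = (df["label"] == "active").astype(int)     # binary class labels

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Predict a new structure, as in ToxCreate's "use model for prediction" step
print(model.predict_proba([featurize("Nc1ccc(cc1)-c1ccc(N)cc1")])[0])
```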

4.2. Databases on Chemical Toxicological Information

The definition of toxicity databases (DB) in the literature is variable; the definition proposed here is that a toxicity database is a valued set of electronic information (i.e., data) that can be related to the toxicity of substances, of which the information is accessible by computer, organized by software, and utilized for safety and risk analysis of chemicals, product discovery and development, or academic research in the biomedical and toxicological sciences (72).
The most important uses of toxicity DBs are the following:
- Data-mining information: Data mining is a technique that provides analysis of large databases to identify unsuspected relationships or hidden patterns in a dataset that can be used to predict future behavior.
- Read across of various sources (also regulators) and end points: In the read across approach, the end point information for one or a few chemicals is used to predict the same end point for another chemical, considered to be similar in some way (usually on the basis of structural similarity).
- Building a computational model (source for (Q)SAR datasets).
The OECD has recently published guidance on steps and con-
siderations for performing qualitative and quantitative read across
in order to fill gaps in hazard and/or risk assessment (73).
Many authors have noted that there is a "lack of data" problem with current databases (74–77), because of the lack or fragmentation of data (much of it does not exist in electronic form, or is held in non-exchangeable data formats), the lack of standardized dictionaries and controlled vocabularies, and problems of accessibility. Among others, the need for the development and adoption of shared public standards for toxicity databases should be emphasized. These standards should include controlled vocabulary and hierarchical data relationships or ontologies (i.e., using the same terminology to describe the same things, and incorporating the layered relationships of different terms to one another), should be derived from close working knowledge of the toxicity study domain (e.g., carcinogenicity, mutagenicity, developmental toxicity, neurotoxicity), and should be inclusive of chemical structure.
Table 1 summarizes the publicly available carcinogenicity and
genetic toxicology/mutagenicity databases, including a short
description and the main features and use. The list is representative of what is publicly available, but is not all-inclusive.
Until recently, toxicity databases were constructed as look-up tables of existing data, and did not contain chemical structures. Typically, the use of a chemical name or CAS number as identifier is nonunique and prone to errors. In the new types of databases instead, defined as Chemical Relational Databases (CRD), the main informational unit is a chemical structure, and the fields are data (e.g., toxicity) associated with that chemical structure. Therefore,
using the chemical structure as a chemical identifier has a universally
understood meaning and scientific relevance for chemical toxicity
databases: effective linkage of chemical toxicity data with chemical
structure information can facilitate and greatly enhance data
gathering and hypothesis generation.
Table 1
Databases on chemical toxicological information

ACToR (http://actor.epa.gov)
Description: Aggregated Computational Toxicology Resource (ACToR) by the US EPA National Center for Computational Toxicology (NCCT); contains data on environmental chemicals and toxicology (including chemical structure, physicochemical values, in vitro and in vivo toxicology assays) for over 2,500,000 compounds derived from more than 500 sources of data. Data searchable by name, CAS number, or structure.
Main features and use: cluster of environmental chemicals and toxicology databases; fully relational database (in future).

Benchmark Data Set (http://ml.cs.tu-berlin.de/toxbenchmark/)
Description: Berlin University of Technology; contains an Ames mutagenicity dataset for 6,500 compounds with their biological activity.
Main features and use: toxicity experimental data not from primary source.

CCRIS (http://toxnet.nlm.nih.gov/cgi-bin/sis/htmlgen?CCRIS)
Description: Chemical Carcinogenesis Research Information System (CCRIS) by the National Cancer Institute (NCI); contains over 9,000 chemical records with animal carcinogenicity, mutagenicity, tumor promotion, and tumor inhibition test results. Test results have been reviewed by experts in carcinogenesis and mutagenesis. Included in TOXNET.
Main features and use: toxicity experimental data.

CPDB (http://potency.berkeley.edu/)
Description: Carcinogenic Potency Database (CPDB) by the University of California (Berkeley); contains results of 6,400 chronic, long-term animal cancer tests on 1,547 chemicals. Data downloadable in easy-to-use formats (pdf, xls, and txt), and searchable by chemical name, CAS number, or author. Chemically indexed in the DSSTox database.
Main features and use: toxicity experimental data.

Danish (Q)SAR database (http://ecbqsar.jrc.ec.europa.eu/)
Description: Danish EPA; estimates (predictions, not experimental data) from over 70 (Q)SAR models for health effects on over 166,000 chemicals.
Main features and use: (Q)SAR database.

DSSTox (http://www.epa.gov/ncct/dsstox/index.html)
Description: Distributed Structure-Searchable Toxicity Database (DSSTox) network; downloadable, structure-searchable, standardized chemical structure files associated with toxicity data. Emphasizes quality procedures for accurate and consistent chemical structure annotation of toxicological experiments.
Main features and use: cluster of toxicity databases; (Q)SAR-ready database; Chemical Relational Database.

ESIS (http://ecb.jrc.ec.europa.eu/esis/)
Description: European Chemical Substances Information System (ESIS) organized by the European Chemicals Bureau (ECB) of the European Commission's Joint Research Centre (JRC). Provides access to a collection of European regulatory inventories, and the data associated with them, on 2,500 chemicals related to risk and safety. Data searchable by name, CAS number, or molecular formula.
Main features and use: toxicity experimental data.

GENE-TOX (http://toxnet.nlm.nih.gov/cgi-bin/sis/htmlgen?GENETOX)
Description: Genetic toxicology (GENE-TOX) test data by the US NLM; data on over 3,000 chemicals, peer-reviewed by EPA. Included in TOXNET.
Main features and use: toxicity experimental data.

IARC (http://monographs.iarc.fr/)
Description: International Agency for Research on Cancer (IARC); monographs on the evaluation of carcinogenic risks to humans of more than 900 chemicals. Data searchable by key word, CAS number, synonym, or chemical name.
Main features and use: risk assessment.

IRIS (http://cfpub.epa.gov/ncea/iris/index.cfm)
Description: EPA's Integrated Risk Information System (IRIS); electronic reports on more than 540 chemicals and their potential to cause human health (noncancer and cancer) effects. Included in TOXNET.
Main features and use: risk assessment.

ISSCAN (http://www.iss.it/ampp/dati/cont.php?id=233&lang=1&tipo=7)
Description: Istituto Superiore di Sanità (Italy) chemical carcinogens: structures and experimental data (ISSCAN); database containing information on more than 1,150 chemical compounds tested with the long-term carcinogenicity bioassay on rodents, together with mutagenicity data. Data downloadable in pdf, xls, and sdf formats; data searchable by chemical name and CAS number.
Main features and use: (Q)SAR-ready database; Chemical Relational Database.

NTP (http://ntp.niehs.nih.gov/)
Description: US NIH/NIEHS National Toxicology Program (NTP); contains the testing status of, and information on, agents of public health interest registered in the USA (more than 500 two-year studies and more than 2,000 genetic toxicity studies), accessed as technical reports; searchable by chemical name or CAS number; the reports are downloadable in pdf form; chemically indexed in the DSSTox database.
Main features and use: historical control database; primary toxicology experimental data.

PAN pesticide (http://www.pesticideinfo.org/)
Description: Pesticide Action Network (PAN) North America; contains toxicity data (human acute toxicity, human carcinogenicity, reproductive/developmental toxicity, endocrine disruption, neurotoxicity, ecotoxicity) and regulatory information on more than 6,500 pesticides.
Main features and use: toxicity experimental data.

PubChem (http://pubchem.ncbi.nlm.nih.gov/ or http://www.ncbi.nlm.nih.gov/pcassay)
Description: National Center for Biotechnology Information (NCBI); it is not an independent database, but a depositor system with standardized data from toxicological/biological database sources and from the literature. Contains chemical structure-annotated data submissions, with summary bioassay data.
Main features and use: toxicity and biological activity experimental data, providing links to other databases.

TOXNET (http://toxnet.nlm.nih.gov/)
Description: TOXicology Data NETwork by the US National Library of Medicine (NLM); databases on toxicology, hazardous chemicals, environmental health, and toxic releases. Searchable within and across the databases by chemical name, CAS number, molecular formula, classification code, locator code, and structure or substructure.
Main features and use: cluster of toxicity databases.

ToxRefDB (http://www.epa.gov/ncct/toxrefdb/)
Description: Toxicity Reference Database (ToxRefDB) by the US EPA (a component of ACToR); contains detailed information on standard toxicity studies, including acute, chronic, cancer, sub-chronic, developmental, and reproductive ones, on nearly 500 chemicals including pesticides and other potentially toxic chemicals. Downloadable in xls format without structural information.
Main features and use: toxicity experimental data.
The use of a CRD allows exploration across both the chemical and biological domains, as well as structure-based searching through the data. To be accessed with a CRD application, the information has to be stored in specialized file formats; the most widely used format for exchanging structure/data information on chemicals is the Structure Data File (SDF) format. When an SDF file is imported into a CRD application, it is possible to perform structure/text/data relational searching (data mining) across records in the database. With structural databases, all of the following operations are possible (a minimal structure-searching sketch in Python follows the list):
- Searching results using a basic substructure or functional group as the query structure.
- Classifying chemicals into chemical classes (the frequency in each class can be given).
- Formulating queries that combine structure, data, and text (chemical profiles).
- Calculating the chemical similarity between pairs of chemicals.
- Identifying one or more common structural patterns among groups of chemicals with similar characteristics or toxicity profiles; these patterns can be used as predictive models for estimating the toxicity of chemicals with similar structural patterns.
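As an illustration of the first of these operations, the following minimal Python sketch reads an SDF file and runs a substructure query. It relies on the open-source RDKit toolkit, which is an assumption of the example rather than part of the databases discussed here; the file name and the SMARTS pattern are hypothetical placeholders.

from rdkit import Chem

# Read a structure-data file; each record carries a structure plus data fields.
supplier = Chem.SDMolSupplier("toxicity_data.sdf")  # hypothetical file name

# Define a substructure query, here an aromatic amine (a classical structural alert).
query = Chem.MolFromSmarts("c1ccccc1N")

hits = []
for mol in supplier:
    if mol is None:          # skip records that failed to parse
        continue
    if mol.HasSubstructMatch(query):
        # SDF data fields (e.g., a test result) travel with the structure.
        hits.append(mol.GetProp("_Name"))

print(f"{len(hits)} records contain the aromatic amine substructure")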
A major problem is that of transforming the available databases to conform to these new standards. In this regard, a very useful tool that expands access to a wide range of toxicological databases has recently been created by the National Center for Biotechnology Information (NCBI) through the PubChem project (http://pubchem.ncbi.nlm.nih.gov/). PubChem is a public information system (integrated into NCBI's cluster of biological and literature databases) that links chemical identifiers (e.g., chemical name, CAS number, and chemical structure) of small molecules and small interfering RNA (siRNA) reagents to biological properties. PubChem is not an independent database, but rather a repository of standardized data from many sources, providing a tool to interrogate databases, including toxicological and biomedical ones, in the US public domain. The PubChem interfaces provide extensive query capabilities on textual and numeric information, as well as a comprehensive set of structure-based query methodologies. PubChem consists of three interconnected databases: the Substance, BioAssay, and Compound databases.
- The Substance database (SID) contains sample descriptions (chemical structures, synonyms, registration IDs, descriptions, related URLs, database cross-reference links to PubMed, protein 3D structures, and biological screening results) provided by depositors, covering more than 69 million records.
- The BioAssay database (AID) contains assay descriptions and biological results for more than 434,000 bioassays.
- The Compound database (CID) contains more than 27 million unique chemical structures derived from the Substance database records, allowing substance information (e.g., bioassay data) from different depositors to be viewed for unique chemical structures. Structures stored within PubChem Compound are pre-clustered and cross-referenced by identity and similarity groups.
PubChem has now expanded into a user-depositor public data repository, housing large amounts of public bioassay data, including the NLM TOXNET (http://toxnet.nlm.nih.gov) and US EPA Distributed Structure-Searchable Toxicity (DSSTox) (http://www.epa.gov/ncct/dsstox/index.html) inventories.
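For readers who want to pull such records programmatically, the following hedged sketch queries PubChem's PUG REST web service, a generic public interface that is not described in this chapter; the compound name is an arbitrary example.

import json
import urllib.request

# PUG REST: retrieve the canonical SMILES for a compound by name.
name = "benzidine"  # arbitrary example compound
url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/"
       f"compound/name/{name}/property/CanonicalSMILES/JSON")

with urllib.request.urlopen(url) as response:
    data = json.load(response)

for prop in data["PropertyTable"]["Properties"]:
    print(prop["CID"], prop["CanonicalSMILES"])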
Despite decisive progress in terms of structural search criteria, there is no quality review of the PubChem bioassay data, which come from a large number of depositors or sources with different levels of quality review: PubChem can therefore be considered a "user beware" data source. In this regard, the DSSTox Database Network, a project of the US EPA, is an example of providing the user with self-contained data files that can be readily incorporated into a CRD application and used freely. A primary objective of the DSSTox Web site is to serve as a central community forum for publishing standard-format, structure-annotated chemical toxicity data files for open-access, public use, and for use in CRD applications. DSSTox efforts include quality review of the chemical annotation and representation with respect to the toxicity information. The DSSTox Standard Chemical Fields make a clear distinction between the assigned chemical structure and its relationship to the actual substance tested, which can be a single compound, a mixture, a polymer, a macromolecule, or the active ingredient in a formulation. In addition, the entire DSSTox chemical inventory is centrally indexed by unique record ID, generic substance ID, and chemical (i.e., structure) ID, so as to be compatible with the indexing of the PubChem inventory as well as with the new Aggregated Computational Toxicology Resource (ACToR) inventory (http://www.epa.gov/ncct/actor/), which incorporates the PubChem inventory into a larger aggregated inventory. The DSSTox project provides an enhanced level of toxicity end point description and annotation to encourage alternative modeling strategies and the use of the data files in structure-searching and chemical relational database applications (77, 78). The DSSTox Web site allows structure/substructure/similarity searching through all DSSTox data file content, and it can additionally be accessed from off-site collaborators (e.g., CPDB, EPA IRIS, NTP) for either Web site searching through local content (e.g., just the content of the originator's Web site) or broader searching through the DSSTox inventory, with external links provided to PubChem. Currently, the DSSTox data file cluster includes 14 separate databases, spanning over 7,000 chemical substances. Each DSSTox database is published as a separate and distinct module that adheres to standard conventions in SDF data file format, file names, chemical structure fields, and minimum documentation requirements. Together with the SDF file, DSSTox provides an MS Excel-readable file (.xls) reporting the nonstructural data, and an Acrobat-readable file (.pdf) that displays the traditional graphical representation of the chemicals.

Accessing the DSSTox Structure-Browser, users can perform a name or CASRN search, or, by typing a SMILES string or drawing a structure, can perform exact/substructure/similarity searches across the DSSTox chemical inventory, with search results displayed by generic substance ID.
The publicly available data may not be immediately suitable for use, and data standardization can become extremely critical for some applications, such as QSAR analyses. For example, for each chemical the CCRIS, as well as the CPDB, reports all the available results, even when they are contradictory, and the database user has to employ his or her own judgement to make an activity assignment; the NTP database, while including high-level detail on animal bioassay and genetic toxicology experiments, provides neither ready access to data for the entire chemical study inventory nor relational access to a particular slice of the data or to aggregate summarizations of the data according to the requirements of QSAR modeling. To overcome these problems, the Istituto Superiore di Sanità (ISS) has built the ISSCAN chemical carcinogens: structures and experimental data database, freely available from the ISS Web site (http://www.iss.it/ampp/dati/cont.php?id=233&lang=1&tipo=7). The ISSCAN database contains information on chemical compounds tested with the long-term carcinogenicity bioassay on rodents. To ensure the quality of the data, the entries were rechecked against different sources of information: contradictions were resolved by going back to the original papers, and results based on insufficient protocols were not included. Moreover, the biological data were coded in numerical terms that can be used directly for QSAR analyses. This QSAR-ready character eliminates the intermediate step of data transformation, which is often problematic without specific toxicological expertise (77). The general structure of the database is inspired by that of the DSSTox Network of the US EPA, contributing to the free diffusion of scientific data in a standardized, easy-to-read format.
The latest initiative focused on unifying public chemical toxicity information is ACToR, by the US EPA National Center for Computational Toxicology (http://www.epa.gov/ncct/actor/), launched in December 2008. ACToR is a set of linked databases and software applications that bring together many types and sources of data on environmental chemicals and toxicology in one central location. Currently, the ACToR chemical and assay databases contain information on chemical structure, physicochemical values, and in vitro and in vivo toxicology assays for over 2,500,000 compounds derived from more than 500 sources of data (called data collections), including the US EPA, the US Food and Drug Administration (FDA), the National Institutes of Health (NIH), state agencies, corresponding government agencies in Canada, Europe, and Japan, universities, the World Health Organization (WHO), and nongovernmental organizations (NGOs); an important set of data comes from the DSSTox project. The design of ACToR has followed that of the NIH PubChem project in many respects, but has been generalized to allow for the broader types of data that are of interest to toxicologists and environmental regulators (79).
The chemicals include, but are not limited to, high- and medium-production-volume (HPV and MPV) industrial chemicals, pesticides (both active and inert ingredients), and potential ground and drinking water contaminants. In ACToR, chemicals are organized into three main classes, the first two of which are modeled closely on the corresponding PubChem data model. The three main classes are substance (a unique chemical from a single data collection), compound (which holds chemical structure information), and generic chemical (which aggregates a chemical structure plus all of the corresponding substances); the common link is that all aggregated substances share the same CAS registry number. Data entered into ACToR undergo a limited quality control process. Data are preferentially taken from high-quality sources, so quality control is limited to checking that the data are correctly transferred from the source via a reformatting and loading process into the ACToR database.
The ACToR database includes the Toxicity Reference Database (ToxRefDB; http://www.epa.gov/ncct/toxrefdb), developed in partnership with EPA's Office of Pesticide Programs (OPP) and assembled by the National Center for Computational Toxicology (NCCT). ToxRefDB is a database containing pesticide registration toxicity data, and it includes chronic, cancer, sub-chronic, developmental, and reproductive studies on hundreds of chemicals (many of them pesticide active ingredients). Each study entered uses a detailed standardized vocabulary for capturing nearly every element in the database, including the toxicological outcomes. ToxRefDB contains roughly 2,000 studies on nearly 500 chemicals. Through the Web interface, the data are provided in an accessible and computable manner.
The ACToR database also provides a connection to an EPA chemical screening project called ToxCast, a research program that uses computational chemistry, bioactivity profiling, and toxicogenomic data to predict potential toxicity and to prioritize limited testing resources. This program focuses on environmental chemicals with extensive toxicity data, which provide an interpretative context for the bioactivity profiles (80).
Both ACToR and PubChem have the objective of aggregating large sets of chemical structure and assay data (PubChem being the largest effort currently available, with information on more than 10 million unique chemical compounds). PubChem allows more generalized types of assay data to be submitted and displayed, but its query engine is not tailored to the types of custom toxicology-based queries needed for particular purposes. However, its underlying data model maps easily onto the ACToR application, which makes it straightforward to import all of the PubChem data.

ACToR is a database consisting of information on environmental chemicals from a wide range of sources. However, much of the high-quality toxicology data indexed in ACToR currently still resides in text reports and remains to be manually extracted into tabular form.
ACToR is a rapidly evolving system: future development will include the extraction of tabular data from online text documents linked to chemicals, more curated chemical structures, and more flexible query and data-export interfaces. It will be used for constructing training and validation datasets for the ToxCast program, for building computational models linking chemical structure with in vitro and in vivo assays, and as a resource for regulatory agency reviewers examining new chemicals submitted for market approval. ACToR can be considered a super-aggregator and data mining facilitator. The strength of the DSSTox project, in terms of quality structure–toxicity annotation of a growing data inventory, will be fully incorporated into ACToR.
Generally speaking, there are three general types of currently available databases in the hazard identification and risk assessment fields: (a) databases storing the results of toxicity experiments; (b) databases for use by regulatory agencies; and (c) aggregated databases to support SARs and predictive modeling. Initiatives such as ACToR and DSSTox are significant steps toward satisfying all of these requirements within a single database.

References

1. Ariens EJ (1984) Domestication of chemistry by design of safer chemicals: structure–activity relationships. Drug Metab Rev 15:425–504
2. Hansch C, Fujita T (1964) ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86:1616–1626
3. Hansch C, Hoekman D, Leo A et al (2002) Chem-bioinformatics: comparative QSAR at the interface between chemistry and biology. Chem Rev 102:783–812
4. Livingstone DJ (2000) The characterization of chemical structures using molecular properties. A survey. J Chem Inf Comput Sci 40:195–209
5. Todeschini R, Lasagni M, Marengo E (1994) New molecular descriptors for 2D and 3D structures. Theory. J Chemom 8:263–272
6. Basak SC, Mills D (2009) Predicting the vapour pressure of chemicals from structure: a comparison of graph theoretic versus quantum chemical descriptors. SAR QSAR Environ Res 20:119–132
7. Estrada E (2008) Quantum chemical foundation of the topological substructural molecular design. J Phys Chem A 112:5208–5217
8. Perez-Garrido A, Helguera AM, Lopez GC et al (2010) A topological substructural molecular design approach for predicting mutagenesis end points of α,β-unsaturated carbonyl compounds. Toxicology 268:64–77
9. Dunn WJ, Wold S (1980) Structure–activity analyzed by pattern recognition: the asymmetric case. J Med Chem 23:595–599
10. Wold S (1995) Chemometrics – what do we mean with it, and what do we want from it. Chemom Intell Lab Syst 30(1):109–115
11. Manallack DT, Ellis DD, Livingstone DJ (1994) Analysis of linear and nonlinear QSAR data using neural networks. J Med Chem 37:3758–3767
12. Carlsson L, Helgee EA, Boyer S (2009) Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49:2551–2558
13. Michielan L, Moro S (2010) Pharmaceutical perspectives of nonlinear QSAR strategies. J Chem Inf Model 50:961–978
14. Hansch C, Leo A (1995) Exploring QSAR. 1. Fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
15. Benigni R (2003) Quantitative structure–activity relationship (QSAR) models of mutagens and carcinogens. CRC, Boca Raton, FL
16. Kubinyi H (1993) 3D QSAR in drug design: theory, methods and applications. ESCOM, Leiden
17. Tute MS (1990) History and objectives of quantitative drug design. In: Hansch C (ed) Comprehensive medicinal chemistry. Pergamon, Oxford, pp 1–31
18. Franke R (1984) Theoretical drug design methods. Elsevier, Amsterdam
19. Franke R, Gruska A (2003) General introduction to QSAR. In: Benigni R (ed) Quantitative structure–activity relationship (QSAR) models of mutagens and carcinogens. CRC, Boca Raton, FL, pp 1–40
20. Benigni R, Netzeva TI, Benfenati E et al (2007) The expanding role of predictive toxicology: an update on the (Q)SAR models for mutagens and carcinogens. J Environ Sci Health C: Environ Carcinog Ecotoxicol Rev 25:53–97
21. Woo YT, Lai DY, McLain JL et al (2002) Use of mechanism-based structure–activity relationships analysis in carcinogenic potential ranking for drinking water disinfection by-products. Environ Health Perspect 110:75–87
22. Fung VA, Huff J, Weisburger EK, Hoel DG (1993) Predictive strategies for selecting 379 NCI/NTP chemicals evaluated for carcinogenic potential: scientific and public health impact. Fund Appl Toxicol 20:413–436
23. Greene N, Judson PN, Langowski JJ et al (1999) Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ Res 10:299–314
24. Woo YT, Lai DY (2005) Oncologic: a mechanism based expert system for predicting the carcinogenic potential of chemicals. In: Helma C (ed) Predictive toxicology. Taylor and Francis, Boca Raton, FL, pp 385–413
25. Woo YT, Lai DY, Argus MF et al (1998) An integrative approach of combining mechanistically complementary short-term predictive tests as a basis for assessing the carcinogenic potential of chemicals. J Environ Sci Health C: Environ Carcinog Ecotoxicol Rev C16:101–122
26. Benigni R, Bossa C, Jeliazkova NG et al (2008) The Benigni/Bossa rulebase for mutagenicity and carcinogenicity – a module of Toxtree. EUR 23241 EN. Office for the Official Publications of the European Communities, Luxembourg. EUR – Scientific and Technical Report Series
27. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res Rev 659:248–261
28. Hansch C (1991) Structure–activity relationships of chemical mutagens and carcinogens. Sci Total Environ 109(110):17–29
29. Benigni R (2005) Structure–activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches. Chem Rev 105:1767–1800
30. Enslein K, Gombar VK, Blake BW (1994) Use of SAR in computer-assisted prediction of carcinogenicity and mutagenicity of chemicals by the TOPKAT program. Mutat Res 305:47–61
31. Klopman G (1992) MultiCASE 1. A hierarchical computer automated structure evaluation program. Quant Struct Act Relat 11:176–184
32. Rosenkranz HS (2003) SAR in the assessment of carcinogenesis: the MultiCASE approach. In: Benigni R (ed) Quantitative structure–activity relationship (QSAR) models of chemical mutagens and carcinogens. CRC, Boca Raton, FL, pp 175–206
33. Helma C (2006) Lazy structure–activity relationships (lazar) for the prediction of rodent carcinogenicity and Salmonella mutagenicity. Mol Divers 10:147–158
34. Benigni R, Richard AM (1998) Quantitative structure-based modeling applied to characterization and prediction of chemical toxicity. Methods 14:264–276
35. Benigni R, Bossa C, Tcheremenskaia O et al (2010) Alternatives to the carcinogenicity bioassay: in silico methods, and the in vitro and in vivo mutagenicity assays. Exp Opin Drug Metab Toxicol 6:1–11
36. Serafimova R, Fuart Gatnik M, Worth A (2010) Review of the QSAR models and software tools for predicting genotoxicity and carcinogenicity. JRC Technical Report EUR 24427 EN. Publications Office of the European Union, Luxembourg
37. Cronin MTD, Dearden JC (1995) QSAR in toxicology. 4. Prediction of non-lethal mammalian toxicological end points, and expert systems for toxicity prediction. Quant Struct Act Relat 14:518–523
38. Greene N (2002) Computer systems for the prediction of toxicity: an update. Adv Drug Deliv Rev 54:417–431
39. Hansch C (1977) On the predictive value of QSAR. In: Keverling Buisman JA (ed) Biological activity and chemical structure. Elsevier, Amsterdam
40. Hulzebos EM, Posthumus R (2003) (Q)SARs: gatekeepers against risk on chemicals? SAR QSAR Environ Res 14:285–316
41. Martin YC (2006) What works and what does not: lessons from experience in a pharmaceutical company. QSAR Combinat Sci 25:1192–1200
42. Pearl GM, Livingstone-Carr S, Durham SK (2001) Integration of computational analysis as a sentinel tool in toxicologic assessments. Curr Top Med Chem 1:247–255
43. Richard AM (1998) Commercial toxicology prediction systems: a regulatory perspective. Toxicol Lett 102–103:611–616
44. Snyder RD, Pearl GM, Mandakas G et al (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ Mol Mutagen 43:143–158
45. Valerio LG, Arvidson KB, Chanderbhan RF et al (2007) Prediction of rodent carcinogenic potential of naturally occurring chemicals in the human diet using high-throughput QSAR predictive modeling. Toxicol Appl Pharmacol 222:1–16
46. Zeiger E, Ashby J, Bakale G et al (1996) Prediction of Salmonella mutagenicity. Mutagen 11:474–484
47. Benigni R, Bossa C, Netzeva TI et al (2007) Collection and evaluation of (Q)SAR models for mutagenicity and carcinogenicity. EUR 22772 EN. Office for the Official Publications of the European Communities, Luxembourg. EUR – Scientific and Technical Research Series. 127-2007
48. Woo YT, Lai DY, Argus MF et al (1995) Development of structure–activity relationship rules for predicting carcinogenic potential of chemicals. Toxicol Lett 79:219–228
49. Benigni R, Zito R (2003) Designing safer drugs: (Q)SAR-based identification of mutagens and carcinogens. Curr Top Med Chem 3:1289–1300
50. Ashby J (1985) Fundamental structural alerts to potential carcinogenicity or noncarcinogenicity. Environ Mutagen 7:919–921
51. Ashby J, Tennant RW (1988) Chemical structure, Salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis among 222 chemicals tested by the U.S. NCI/NTP. Mutat Res 204:17–115
52. Piegorsch WW, Zeiger E (1991) Measuring intra-assay agreement for the Ames Salmonella assay. In: Rienhoff O, Lindberg DAB (eds) Statistical methods in toxicology. Springer, Heidelberg, pp 35–41
53. Benigni R, Bossa C (2008) Predictivity of QSAR. J Chem Inf Model 48:971–980
54. Mayer J, Cheeseman MA, Twaroski ML (2008) Structure activity relationship analysis tools: validation and applicability in predicting carcinogens. Regulat Pharmacol Toxicol 50:50–58
55. Benigni R, Zito R (2004) The second National Toxicology Program comparative exercise on the prediction of rodent carcinogenicity: definitive results. Mutat Res Rev 566:49–63
56. Cramer GM, Ford RA, Hall RL (1978) Estimation of toxic hazard – a decision tree approach. Food Cosmet Toxicol 16:255–276
57. Munro IC, Renwick AG, Danielewska-Nikiel B (2008) The threshold of toxicological concern (TTC) in risk assessment. Toxicol Lett 180:151–156
58. Kroes R, Renwick AG, Cheeseman MA et al (2004) Structure-based thresholds of toxicological concern (TTC): guidance for application to substances present at low levels in the diet. Food Chem Toxicol 42:65–83
59. Munro IC, Ford RA, Kennepohl E et al (1996) Correlation of structural class with No-Observed-Effect levels: a proposal for establishing a threshold of concern. Food Chem Toxicol 34:829–867
60. Verhaar HJM, Solbe J, Speksnijder J et al (2000) Classifying environmental pollutants: Part 3. External validation of the classification system. Chemosphere 40:875–883
61. Walker JD, Gerner I, Hulzebos E et al (2005) The Skin Irritation Corrosion Rules Estimation Tool (SICRET). QSAR Combinat Sci 24:378–384
62. Gerner I, Schlegel K, Walker JD et al (2004) Use of physicochemical property limits to develop rules for identifying chemical substances with no skin irritation or corrosion potential. QSAR Combinat Sci 23:726–733
63. Hulzebos E, Walker JD, Gerner I et al (2005) Use of structural alerts to develop rules for identifying chemical substances with skin irritation or skin corrosion potential. QSAR Combinat Sci 24:332–342
64. Gerner I, Liebsch M, Spielmann H (2005) Assessment of the eye irritating properties of chemicals by applying alternatives to the Draize rabbit eye test: the use of QSARs and in vitro tests for the classification of eye irritation. Altern Lab Anim 33:215–237
65. Benigni R, Bossa C, Worth AP (2010) Structural analysis and predictive value of the rodent in vivo micronucleus assay results. Mutagen 25:335–341
66. Schultz TW (2007) Verification of the structural alerts for Michael acceptors. Chem Res Toxicol 20:1359–1363
67. Enoch SJ, Madden JC, Cronin MT (2008) Identification of mechanisms of toxic action for skin sensitisation using a SMARTS pattern based approach. SAR QSAR Environ Res 19:555–578
68. Aptula AO, Patlewicz G, Roberts DW (2005) Skin sensitization: reaction mechanistic applicability domains for structure–activity relationships. Chem Res Toxicol 18:1420–1426
69. Rydberg P, Gloriam DE, Zaretzki J et al (2010) SMARTCyp: a 2D method for prediction of cytochrome P450-mediated drug metabolism. ACS Med Chem Lett 1:96–100
70. Serafimova R, Todorov M, Pavlov T et al (2007) Identification of the structural requirements for mutagenicity, by incorporating molecular flexibility and metabolic activation of chemicals. II. General Ames mutagenicity model. Chem Res Toxicol 20:662–676
71. Hardy B, Douglas N, Helma C et al (2010) Collaborative development of predictive toxicology applications. J Cheminf 2:7
72. Valerio LG (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol 241:356–370
73. OECD (2007) Approaches to data gap filling in chemical categories. Chapter 3: Guidance on grouping of chemicals, vol 80. OECD Series on Testing and Assessment, Paris, pp 30–41
74. Richard AM, Yang C, Judson RS (2008) Toxicity data informatics: supporting a new paradigm for toxicity prediction. Toxicol Mech Meth 18:103–118
75. Yang C, Benz RD, Cheeseman MA (2006) Landscape of current toxicity databases and database standards. Curr Opin Drug Discov Dev 9:124–133
76. Yang C, Richard AM, Cross KP (2006) The art of data mining the minefields of toxicity databases to link chemistry to biology. Curr Comput Aid Drug Des 2:135–150
77. Benigni R, Bossa C, Richard AM et al (2008) A novel approach: chemical relational databases, and the role of the ISSCAN database on assessing chemical carcinogenicity. Ann Ist Super Sanità 44:48–56
78. Richard AM (2004) DSSTox web site launch: improving public access to databases for building structure-toxicity prediction models. Preclinica 2:103–108
79. Judson R, Richard AM, Dix D et al (2008) ACToR – aggregated computational toxicology resource. Toxicol Appl Pharmacol 233:7–13
80. Dix DJ, Houck KA, Martin MT et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
Chapter 5

Classification Models for Safe Drug Molecules

A.K. Madan, Sanjay Bajaj, and Harish Dureja
Abstract

Frequent failure of drug candidates during the development stages remains the major deterrent to the early introduction of new drug molecules. Drug toxicity is the major cause of expensive late-stage development failures. Early identification/optimization of the most favorable molecule will naturally save considerable cost, time, and human effort, and will minimize animal sacrifice. (Quantitative) Structure–Activity Relationships [(Q)SARs] represent statistically derived predictive models correlating the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs/toxicants/environmental pollutants) with molecular descriptors and/or properties. (Q)SAR models that categorize the available data into two or more groups/classes are known as classification models. Numerous techniques of diverse nature are presently employed for the development of classification models. Though classification models are increasingly used for the prediction of either biological activity or toxicity, the future trend will naturally be towards classification models capable of simultaneously predicting biological activity, toxicity, and pharmacokinetic parameters, so as to accelerate the development of bioavailable, safe drug molecules.

Key words: Classification models, Drug safety, Biological activity, QSAR models, Toxicity,
Bioavailability

1. Introduction

Drug development has gradually evolved from laboratory synthesis to an automated and, in some cases, virtual process wherein large numbers of structures can be simulated, synthesized, evaluated, and shortlisted. Subsequently, all effective molecules must be tried in animal models and finally in man, and these two stages constitute the bottleneck in drug development (1). However, frequent failure of drug candidates during development remains the major deterrent to the early introduction of new drug molecules. Apart from pharmacokinetics, drug toxicity is the major cause of expensive late-stage development failures and devastating market withdrawals
(2). As a consequence, early screening needs to be significantly improved so as to avoid late drop-outs. This preemptive approach is commonly known as the "fail early, fail cheap" dictum (3). Consequently, an early identification/optimization of the most favorable molecule will naturally save considerable cost, time, and human effort, and will minimize animal sacrifice (4). Therefore, in order to reduce these failures, regulatory agencies, in collaboration with pharmaceutical companies/institutions, are engaged in the development of novel toxicity prediction technologies for integration into the drug discovery process, so that toxicity can be detected at the earliest possible stage. The US Environmental Protection Agency (EPA) has also supported the use of alternative testing technologies. The recently adopted European chemical regulation REACH envisages the wide use of computer-assisted models as a substitute for in vivo and in vitro testing in the evaluation of chemical properties. Numerous models have already been developed through the European Union (EU) funded project CAESAR (computer assisted evaluation of substances according to regulation) for five chosen regulatory endpoints relevant to the REACH legislation: bioconcentration factor, skin sensitization, mutagenicity, carcinogenicity, and toxicity (5). According to REACH, the results of (Quantitative) Structure–Activity Relationships [(Q)SARs] may be utilized as a substitute for testing, subject to the fulfillment of the following conditions (6):
(a) The results should be derived from a scientifically validated (Q)SAR model.
(b) The molecule should be within the applicability domain of the (Q)SAR model.
(c) The results should be adequate for the purpose of classification, labeling, and risk assessment.
(d) Adequate and reliable documentation of the methodology should be duly furnished.
The comprehensive technical guidance document (TGD) of the EU identified four specific applications of (Q)SARs, comprising data evaluation, decisions on further testing, estimation of specific parameters, and analysis of data needs on effects of potential concern (6).

(Q)SARs represent statistically derived predictive models correlating the biological activity (including desirable therapeutic effects and undesirable side effects) of chemicals (drugs/toxicants/environmental pollutants) with molecular descriptors and/or properties (7). In silico or computational toxicology utilizes the data generated through in vitro and in vivo models by evaluating the quality of the data, developing databases, using computer algorithms, and statistically analyzing chemical and/or biological data so as to establish predictive capabilities based on (Q)SAR models or toxicity endpoint libraries (8). These in silico prediction models can be divided into two categories: classification models and correlation models (9). (Q)SAR models that categorize the available data into two or more groups/classes are known as classification models. Binary classification models classify compounds into two groups, such as active and inactive, toxic and nontoxic, or mutagenic and nonmutagenic. However, ternary, quaternary, and higher-order classification models are also possible. Binary classification models act as versatile tools in virtual screening protocols (10). There is an ever increasing need for robust and accurate classification algorithms that support the high-throughput analysis of virtual libraries of molecules (11). These models can predict either the biological activity or the toxicity of a compound. Since the modeling of critical drug safety toxicological endpoints is a slow moving field (12), there is a strong need for the development of models for the simultaneous prediction of biological activity and toxicity, so as to accelerate the development of safe drug molecules.
The term therapeutic index (TI) quantitatively reveals the selectivity of a drug when its therapeutic and untoward effects are compared (13). The TI is the ratio of LD50 to ED50 and is a statement of how selective the drug is in producing its desired therapeutic effects (14). A term similar to the TI is the selectivity index (SI), the ratio of CC50 to EC50, which is an indirect measure of the safety of compounds. A therapeutic agent should not only possess high potency (low EC50 values) but should also exhibit high safety (high SI or TI values). In general, a selectivity index of greater than 100 is considered one of the criteria for selecting a lead from among the hits (15). Such safety parameters should therefore be determined as early as possible in the initial stages of the drug discovery process, so as to avoid much costlier late-stage failures.
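As a worked illustration of these safety ratios, the short Python sketch below computes the selectivity index for a few compounds and applies the SI > 100 lead criterion; the compound names and CC50/EC50 values are invented for the example.

def selectivity_index(cc50: float, ec50: float) -> float:
    """SI = CC50 / EC50; higher values indicate a wider safety window."""
    return cc50 / ec50

# Invented (hypothetical) screening results: (compound, CC50 in uM, EC50 in uM)
hits = [("cmpd-1", 250.0, 0.8), ("cmpd-2", 90.0, 2.5), ("cmpd-3", 400.0, 12.0)]

for name, cc50, ec50 in hits:
    si = selectivity_index(cc50, ec50)
    status = "lead candidate" if si > 100 else "insufficient selectivity"
    print(f"{name}: SI = {si:.1f} ({status})")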

2. Materials and Methods

Compound classification approaches are usually based upon either clustering or partitioning methods. Clustering of compounds in chemical space involves calculating intermolecular distances; compounds that are close to each other form clusters. In partitioning, chemical space is subdivided into sections based upon ranges of descriptor values, and compounds falling in the same section are grouped together (16). These models are widely used for predicting the physicochemical properties, biological activity, and toxicity of lead structures prior to their synthesis, followed by the synthesis and subsequent development of a few highly active and nontoxic lead compounds for the chosen biological activity and toxicity endpoint. The use of such classification schemes has also been recommended in the second and sixth steps of the workflow (Fig. 1) for the use of nontesting data in chemical assessment developed under the REACH regulation (17), and it has consequently gained importance from the regulatory point of view as well.

Fig. 1. Workflow for the use of nontesting data in chemical assessment (reproduced from ref. 17 with permission from Wiley-VCH Verlag GmbH & Co. KGaA, Germany).
Classification models can be developed through a variety of statistical methods, which include discriminant analysis (linear or multiple), decision or classification trees (CT), cluster analysis (CA), k-nearest neighbor (k-NN), factor and principal component analysis, pattern recognition, moving average analysis (MAA), the Naïve Bayes method, support vector machines (SVM), and ensemble methods (18–21). Other methods include embedded cluster modeling (ECM), neural networks (NN), stepwise discriminant analysis (SDA), soft independent modeling of class analogy (SIMCA), learning vector quantization (LVQ), and genetic algorithms (22–25). Table 1 exemplifies various classification modeling techniques reported for drug-like molecules.

2.1. Classification or Decision Trees

Classification and Regression Trees (CART) represent a classification method involving the use of historical data for the construction of decision trees (DTs). The CART methodology was introduced in the 1980s by Breiman et al. (106). DTs create an iterative branching topology in which the branch taken at each intersection is governed by a rule related to a molecular descriptor. The decision tree (DT) classification model comprises a tree-like structure consisting of nodes and links, and each terminal node is assigned to a particular class (107). In a DT, the molecules at each parent node are divided, based upon an index value, into two child nodes. The tree exhibiting the lowest cross-validation error is selected as the optimal tree (108). Regression trees do not have classes; splitting in regression trees is conducted according to a squared-residuals minimization algorithm, which implies that the expected sum of variances for the two resulting nodes should be minimal. Classification trees work just like regression trees, except that they predict a discrete category/class rather than a numerical value; the splitting of the data is done by applying the Gini rule or the twoing rule (109). A random forest represents an ensemble of unpruned classification trees created through the use of bootstrap samples of the training data and random feature selection in tree induction; the prediction is made by majority vote of the individual trees (110). Another approach that combines multiple decision tree models is termed a decision forest, in which each decision tree model is developed using a unique set of molecular descriptors. When models of similar predictive quality are combined using the decision forest method, the quality compared to the individual models is consistently and significantly improved in both the training and testing steps (111). Decision trees for the discrimination of toxicity (LD50) and biological activity are illustrated in Figs. 2 and 3, respectively; a minimal tree-based sketch follows.
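The following is a minimal sketch of tree-based classification, not a reimplementation of any model in Table 1. It fits a single decision tree and a random forest to invented descriptor data using the open-source scikit-learn library (an assumption of the example; the chapter does not prescribe any software).

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Invented data: rows are compounds, columns are molecular descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))             # 60 compounds, 4 descriptors
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # 1 = "active", 0 = "inactive"

# Single decision tree: each split is a rule on one descriptor.
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Random forest: majority vote over bootstrapped, feature-subsampled trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

candidate = rng.normal(size=(1, 4))      # descriptors of a new compound
print("tree prediction:  ", tree.predict(candidate)[0])
print("forest prediction:", forest.predict(candidate)[0])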

2.2. Pattern Recognition

Pattern recognition is a term pertaining to a collection of computer-based methods employed for the detection of previously unknown relationships/patterns within a large amount of multivariate data (20).

Table 1
Examples of various classification-based QSAR/QSTR models

Columns: Methodology | Chemical class | Number of compounds | Biological activity/toxicity | References
Cluster analysis, k- 4,5α-Dihydrotestosterone 31 Anabolic (26)
NN and Diverse compounds 200 Adverse effects of drugs (27)
hierarchical Noncongeners 71 FPR agonist (28)
cluster analysis Noncongeners NA hERG blockers (29)
Noncongeners NA COX 2 inhibition (30)
Natural products 2,420 Activity against drug (31)
resistant tumor
Diverse structures 653 Tyrosinase inhibitors (32)
Noncongeners NA AChE inhibition (33)
N-oxide containing NA Anti-T. cruzi (34)
heterocycles
Diverse range 1,098 Factor Xa (FXa) inhibitors (35)
Diverse compounds 255 Skin sensitization (36)
Diverse compounds 644 Aquatic toxicity in (37)
Tetrahymena pyriformis
Curated dataset 384 Animal carcinogenicity (38)
Phenanthrene-based 52 Anticancer (39)
tylophrines
Diverse compounds 311 Estrogenic activity (40)
Wide variety of chemical NA P. aeruginosa deacetylase (41)
classes LpxC inhibition
Decision tree Broad variety of drugs 507 Adverse drug reactions (42)
Noncongeners NA AChE inhibition (33)
Noncongeners NA CyP450 aromatase (43)
inhibition
NA NA CyP450 (CYP) 1A2 and (44)
2D6 inhibition
Three diverse sets 471 P-glycoprotein substrate, (45)
induction/inhibition
Noncongeners 611 Antibacterial (46)
NA 346 5-HT1A receptor affinity (47)
Broad range structural 200 Endocrine disruption (48)
classes potential
Noncongeners NA Mutagenic toxicity (49)
Diverse range 1,098 Factor Xa (FXa) inhibitors (35)
Noncongeners 2,772 Antipsychotic (50)
Structurally diverse 232 Estrogen receptor (51)
datasets binding activity
Diverse compounds 644 Aquatic toxicity in (37)
T. pyriformis
Diverse compounds 16 Antihistaminic/LD50 (52)
50 -O-[(N-Acyl)sulfamoyl] 31 Antitubercular activity (53)
adenosines
Factor analysis/ Substituted benzenes 51 Nonspecific toxicity (54)
principal Coumarin derivatives NA Inhibition of (55)
component prothrombin and
analysis thrombin production
Cannabinoid 26 Psychoactivity (56)
Quinone compounds 25 Antitrypanocidal (57)
NA 26 Dermal and respiratory (58)
sensitizers
NA NA Phosphodiesterase 5 (59)
inhibitors
Flavones 136 HIV-1 IN inhibition (60)
Ellipticine derivatives 40 Cytotoxic activity (61)
Linear discriminant Acrylates, methacrylates, 268 Mutagenicity (62)
analysis and α,β-unsaturated
carbonyl compounds
Highly dissimilar 246 Tyrosinase-inhibitory (63)
molecules activity
Noncongeners 4,508 42 different assays (64)
Noncongeners 1,429 Antityrosinase activity (65)
Homogenous compounds 307 Bloodbrain partitioning (66)
Noncongeners 2,688 Antiparasitic against 16 (67)
species
Noncongeners 440 Antitrypanosomal (68)
Noncongeners 1,239 Antiparasitic (69)
Noncongeners 13,034 Antifungal against 90 (70)
species
Diaryl ureas 74 Against vascular (71)
endothelial GFR-
2 kinase
Noncongeners 1,660 CNS activity20 (72)
receptors
Uracil-based acyclic and NA Antimalarial (73)
deoxyuridine
Arylcarboxylic acid 23 Induction of TNF-α (28)
hydrazide production
Diverse structures 157 Permeability through (74)
cultured Caco-2 cells
Diverse steroids NA Anabolic-androgenic (75)
steroid
Naïve Bayes Peptide and β-lactams NA PEPT1 inhibition (76)
Diverse structures 102,000 11 activity classes (77)
Noncongeners 1,979 hERG blockers (78)
Support vector Nucleoside derivatives Anti-HIV (79)
machine Noncongeners (marketed 871 Inhibitory activity of 125 (80)
drugs) proteins
Noncongeners NA Ether-a-go-go related (29)
gene blockers
Diverse structures 102,000 11 activity classes (77)
Thiophene 140 Genotoxicity (81)
Noncongeners NA AChE inhibition (33)
Noncongeners NA Mutagenic toxicity (49)
Diverse range 1,098 Factor Xa (FXa) inhibitors (35)
Chemical diverse 59 Histone deacetylases (82)
compounds (HDACIs) inhibition
Ensemble Reported compounds 37 Human CyP450 2B6- (83)
substrate interactions
Selected compounds 39 hERG liability (84)
Alkylbenzene flavoring NA Carcinogenic activity (85)
agents
Diverse compounds 33 CYP2A6-substrates/ (86)
inhibitors
Diverse compounds 1,008 Human serum protein (87)
binding
Huprines 60 Acetylcholinesterase (88)
inhibition
Diverse compounds 105 Mutagenic activity (89)
Piperazyinylquinazoline 79 PDGFR inhibition (90)
analogue
3-Nitrocoumarins and 33 Antimicrobial activity (91)
related
Moving average Pyridinium azole 42 Carbonic anhydrase (92)
analysis activation
N-Aryl anthranilic acids 112 Anti-inflammatory (93)
N-phenylphenylglycines 39 CRF antagonizing activity (94)
α,γ-Diketo acids 30 Anti-HCV (95)
Benzamides/ 41 Anticonvulsant (96)
benzylamines
Pyrrolopyrimidines and 82 MRP inhibition (97)
derivatives
4-Substituted-2- 128 Antiulcer (98)
guanidino thiazoles
Propan-2-ones 44 Phospholipase A2 (99)
inhibition
Thiazolidinone 28 GSK-3β inhibition (100)
Sulphonamide 34 Carbonic anhydrase (101)
inhibition
Indole-2-ones 67 CDK-2 inhibition (102)
Arylindoles 31 h5HT2a antagonism (103)
Flavonoids 30 Telomerase inhibition (104)
Methyl hydrazines 61 Antineoplastic (105)
Fig. 2. A decision tree for distinguishing between low LD50 and high LD50 (reproduced from ref. 52 with permission from Inderscience Enterprises Limited, Switzerland).

Pattern recognition is an ensemble of techniques that utilizes artificial intelligence for predicting biological activity or chemical characteristics (112). In order to apply pattern recognition, multiple data (variables) are utilized to characterize a set of objects. The objects are initially divided into two sets: the training set and the test set. The training set may subsequently be divided into several subsets, i.e., classes of objects with inherent similarity. The training data are first employed for the development of mathematical rules, which can subsequently be used to assign new objects to one of the classes on the basis of the same type of data measured on these new objects (113). The robustness of the classifications can be tested by repeating the training procedure several times with slightly altered but randomly varied training sets (112).

Fig. 3. Topology of a decision tree distinguishing active compounds {A} from inactive compounds {B} (reproduced from ref. (53) with permission from Österreichische Apotheker-Verlagsgesellschaft m.b.H., Vienna, Austria).

2.3. Cluster Analysis

Cluster analysis (CA) groups the data using the available information that describes the objects and their relationships (114). There are numerous types of clusters, which include prototype-based, graph-based, density-based, well-separated, and shared-property or conceptual clustering (20). Cluster analysis approaches fall into two categories: hierarchical and nonhierarchical (partitioning) clustering. Hierarchical clustering sets up objects in a binary tree structure and arranges them in either a bottom-up (agglomerative) or a top-down (divisive) procedure. The most widely used nonhierarchical clustering techniques comprise k-means and k-nearest neighbor (k-NN) (114). In k-means clustering, each cluster prototype is expressed in terms of a centroid, the mean of a group of points, and the method is applied to objects in a continuous n-dimensional space. k-NN, also known as the Jarvis–Patrick algorithm, is a simple, supervised method that classifies an unknown compound based on the class membership of its k nearest neighbor compounds (30). A dendrogram illustrating hierarchical k-NN cluster analysis (k-NNCA) is presented in Fig. 4, and a minimal k-NN sketch follows the figure caption below.
Fig. 4. A dendrogram illustrating the results of the hierarchical k-NNCA of the set of 31 steroids used in the training and the prediction sets (reproduced from ref. (26) with permission from Wiley-VCH Verlag GmbH & Co. KGaA, Germany).
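A minimal k-NN classification sketch, again using scikit-learn on invented descriptor data rather than the steroid set of Fig. 4:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(40, 3))          # invented descriptors for 40 compounds
y_train = (X_train[:, 1] > 0).astype(int)   # invented activity classes

# Classify an unknown compound by majority vote of its k = 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
unknown = rng.normal(size=(1, 3))
print("predicted class:", knn.predict(unknown)[0])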

2.4. Principal Component Analysis and Factor Analysis

Principal component analysis (PCA) and factor analysis (FA) are two techniques that can reveal relatively simple patterns within a complex set of variables. These techniques seek, in particular, to discover whether the observed variables can be explained largely or entirely in terms of a much smaller number of variables called factors (115). PCA is a well-known multivariate procedure that minimizes the dimensionality of the data space while retaining as much information as possible (116). A data set obtained from a series of compounds tested for their biological activity in several test systems can be easily and successfully subjected to PCA (20). Though both PCA and FA aim to reduce the dimensionality of a set of data, the approaches for doing so differ significantly between the two techniques. Each technique provides a different insight into the data structure, with PCA dealing with the diagonal elements of the covariance matrix, while FA emphasizes the off-diagonal elements (112). A small PCA sketch follows.
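A minimal PCA sketch under the same assumptions (scikit-learn, invented data with deliberately redundant descriptors):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 8))   # invented: 50 compounds, 8 descriptors
X[:, 4:] = X[:, :4] + 0.1 * rng.normal(size=(50, 4))  # build in redundancy

# Project onto the two principal components that capture the most variance.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)
print("explained variance ratios:", pca.explained_variance_ratio_)
print("scores of the first compound:", scores[0])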

2.5. Linear Discriminant Analysis

Linear discriminant analysis (LDA) is an important classification technique that creates a linear transformation of the original features into a new space. This approach maximizes the interclass separability and minimizes the intraclass variance (108). The technique involves a linear combination of parameters, termed a linear discriminant function, that classifies the observations into their observed or assigned categories. Parameters are added or deleted so as to improve the discrimination, and the results are judged simply by the number of observations correctly classified (112). The basic objective of LDA is to classify the dependent variable by dividing an n-dimensional descriptor space into two regions separated by a hyperplane defined by a linear discriminant function (79). LDA provides an opportunity to examine the weighting and the statistical significance of the various physical properties that might distinguish active from inactive analogues, and agonists from antagonists. It acts as a useful extension to regression analysis because it allows the inclusion of inactive compounds in the analysis and facilitates the analysis of nonquantitative data of other types (20). LDA has been successfully employed in the prediction of the validity of compounds (90), blood–brain barrier permeability (117), and ecotoxicity (118). A classification model using LDA is illustrated in Fig. 5, and a minimal LDA sketch follows.
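A minimal LDA sketch under the same assumptions (scikit-learn, invented two-class descriptor data):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
# Invented two-class descriptor data with shifted class means.
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(1.5, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)   # 0 = inactive, 1 = active

lda = LinearDiscriminantAnalysis().fit(X, y)
print("discriminant coefficients:", lda.coef_)  # weighting of each descriptor
print("training accuracy:", lda.score(X, y))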

2.6. Support Vector Machine

The SVM represents a powerful tool for general (nonlinear) classification, regression, and outlier detection with intuitive model development (79). SVMs originated from the early concepts developed by Vapnik and Cortes (120–122) and are based upon the structured risk minimization principle. SVMs construct a hyperplane that separates two classes (9). An SVM is a learning system involving a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory (79). In the simplest case, compounds from different categories can easily be separated by a linear hyperplane, which is defined solely by its nearest compounds from the training set. Such compounds are referred to as support vectors and are responsible for the name of the technique. SVMs constitute a widely used classification technique for differentiating drug classes using sets of molecular descriptors, owing to their high performance in generalization, their computational efficiency, and their robustness in high dimensions (123). Figure 6 shows an example of a classification model using an SVM.
Fig. 5. Plot of canonical variable 2 versus canonical variable 1 for LDA with three groups (discriminant function FD2) (reproduced from ref. (119) with permission from Oxford University Press).

Fig. 6. Support vectors and margins in linearly separable (a) and nonseparable (b) problems. In the nonseparable case, negative margins are encountered and their magnitude is subject to optimization along with the magnitude of the positive margins (reproduced from ref. (108) with permission from Bentham Science Publishers Ltd).

In numerous cases, however, no linear separation is possible. To take account of this problem, slack variables are introduced and associated with the misclassified compounds. Misclassification of compounds influences the decision hyperplane, and the misclassified compounds also become support vectors (108). One of the key components of the SVM is the use of a so-called kernel function, which facilitates nonlinear classification and regression. A kernel function can be considered a special similarity measure with the mathematical properties of symmetry and positive definiteness (124). SVMs have been employed for the development of QSARs involving COX-2 inhibition and aquatic toxicity (125) and carcinogenic potency (126). A minimal kernel-SVM sketch follows.
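A minimal kernel-SVM sketch under the same assumptions; the radial basis function kernel stands in for the kernel functions discussed above:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 2))                        # invented 2D descriptor space
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # nonlinearly separable classes

# A radial basis function kernel lets the SVM draw a nonlinear boundary;
# C trades off margin width against misclassification (the slack variables).
svm = SVC(kernel="rbf", C=1.0).fit(X, y)
print("number of support vectors per class:", svm.n_support_)
print("training accuracy:", svm.score(X, y))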

2.7. Ensemble Methods

An ensemble method utilizes a combination of techniques to construct a single QSAR model with the aim of improving predictivity. The ensemble aggregates the results from several individual models so as to achieve a substantial improvement over the various individual models (127). In ensemble approaches, multiple models of varying types are developed individually, resulting in a different descriptor subset for each model type (128). An ensemble can be designed with diverse methodologies such as bagging, random subspace, and boosting methods (108, 127–130), as well as three-class classification models (131). An ensemble method is exemplified in Fig. 7. Recent research has revealed that, by capitalizing on the diversity of the individual models, ensemble techniques can minimize uncertainty and produce more stable, accurate, and reliable predictors in toxicology (132, 133). Despite their growing popularity among neural network practitioners, ensemble methods have not yet been widely adopted in (Q)SAR/(Q)SPR. Neural networks are reported to be inherently unstable, in that minor changes in the training set and/or training parameters can lead to major changes in their generalization performance (132). A minimal bagging sketch follows.

Fig. 7. Individual decisions of three binary classifiers and the resulting classifier ensemble with a more accurate decision boundary (reproduced from ref. (108) with permission from Bentham Science Publishers Ltd).
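A minimal bagging-ensemble sketch under the same assumptions (note that recent scikit-learn versions, 1.2 and later, name the base learner parameter "estimator"):

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))               # invented descriptors
y = (X[:, 0] - X[:, 3] > 0).astype(int)     # invented activity labels

# Bagging: train many trees on bootstrap samples and aggregate their votes,
# which stabilizes an otherwise high-variance base learner.
ensemble = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    random_state=0,
).fit(X, y)
print("training accuracy:", ensemble.score(X, y))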

2.8. Bayes Classifier

The Bayes classifier originates from Bayes' rule, which relates the posterior probability of a class to its overall probability, the probability of the observations, and the likelihood of a class with respect to the observed variables. Under Bayes' rule, the class/group maximizing the posterior probability is selected as the prediction result (108). Bayesian statistics compare the frequencies of occurrence of the features found in two or more groups to identify those features that discriminate best between the groups (76). A distinct advantage of the Naïve Bayes classifier is that it requires relatively little training data to estimate the parameters (the means and variances of the variables) needed for classification. A minimal sketch follows.
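A minimal Naïve Bayes sketch under the same assumptions; the Gaussian variant estimates exactly the per-class means and variances mentioned above:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(6)
# Invented data: Gaussian NB models each descriptor per class by mean/variance.
X = np.vstack([rng.normal(0, 1, (25, 3)), rng.normal(2, 1, (25, 3))])
y = np.array([0] * 25 + [1] * 25)

nb = GaussianNB().fit(X, y)
print("class posteriors for a new compound:", nb.predict_proba([[1.0, 1.0, 1.0]]))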

2.9. Moving Average Analysis The MAA model is conceptually a classification model and is based upon a method known as maximization of the moving average (134). In each of the models, the compounds are classified as active or inactive, toxic or nontoxic, permeable or impermeable, adductible or nonadductible. According to this method, the minimum size of a range is based on a moving average of 65% correctly predicted compounds. If, however, the moving average percentage of correct prediction lies within 50 ± 15%, the range is classified as transitional (21, 135). An active range initially bracketed by transitional ranges and subsequently bracketed by inactive ranges simply reveals a gradual change in activity and represents an ideal model. The methodology used in MAA aims at the development of suitable models for providing lead molecules through exploitation of the active ranges. MAA based models are unique and differ widely from conventional (Q)SAR models. Both systems of modeling have their advantages and limitations. The MAA modeling system has the distinct advantage of identifying narrow active ranges, which may be erroneously skipped during regression analysis in conventional QSAR (103). Since the ultimate goal of modeling is to provide lead structures, the active ranges of MAA based models can naturally play a vital role in providing lead structures. This method has been extensively used for the development of classification models for biological activities of diverse nature.
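The sketch below illustrates one plausible reading of the MAA procedure just described: compounds are sorted by a descriptor value, a moving average of the percentage of active compounds is computed over a sliding window, and descriptor ranges are labeled active (at least 65%), transitional (within 50 ± 15%), or inactive. The window size and the synthetic data are our assumptions for illustration only.

```python
# Moving average analysis (MAA) sketch: label descriptor ranges as
# active / transitional / inactive from a sliding-window percentage.
import numpy as np

def maa_ranges(descriptor, is_active, window=7):
    order = np.argsort(descriptor)
    d, a = descriptor[order], is_active[order].astype(float)
    labels = []
    for i in range(len(d) - window + 1):
        pct = 100 * a[i:i + window].mean()   # moving average of % active
        label = ("active" if pct >= 65
                 else "inactive" if pct < 35
                 else "transitional")        # 35-65% = 50 +/- 15%
        labels.append((d[i], d[i + window - 1], pct, label))
    return labels

rng = np.random.default_rng(3)
desc = rng.uniform(0, 10, 40)                # synthetic topological index
active = (desc > 4) & (desc < 7)             # synthetic narrow active range
for lo, hi, pct, lab in maa_ranges(desc, active):
    print(f"{lo:5.2f}-{hi:5.2f}  {pct:5.1f}% active -> {lab}")
```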

2.10. Moving Average Model for Safe Drug Molecules As part of the regulatory assessment for an investigational new drug (IND) application, submission of documented safety data for the molecule along with its biological efficacy is mandatory; therefore, safety and activity predicted using computational methods will naturally strengthen the claim for approval of an IND application.

Table 2
Examples of MAA based classification models for simultaneous prediction/
consideration of biological activity and safety

S.N. Chemical class                      Model index  Average EC50/IC50  Average EC50/IC50  Average SI       Average SI         References
                                                      in active range    in inactive range  in active range  in inactive range
1    1-Alkoxy-5-alkyl-6-(arylthio)       SATCI        0.49               10.32              168.8            48.3               (136)
     uracils                             WTCI         0.51               8.67               147.9            44.9
2    1-[(2-Hydroxyethoxy)methyl]-6-      MCI          0.092              48.10              8,895.80         7.54               (21)
     (phenylthio)thymines                EAI          0.062              78.35              12,545.41        34.38
3    Dihydro(alkylthio)(naphthylmethyl)  WI           4.6                159.1              64.8             35.2               (137)
     oxopyrimidines                      MCI          5.1                166.5              71.7             166.5
                                         ECI          3.4                145.9              59.0             9.1
4    Dimethylaminopyridin-2-ones         AECTCI       0.002              1.356              24,697.0         2,595              (138)
                                         MCTCI        0.002              10.191             25,785.0         6,008
                                         WTCI         0.003              3.756              23,310.0         2,873
5    5-Alkyl-2-alkylamino-6-(2,6-        H0m          0.022              45.83              7,987.29         642.29             (139)
     difluorophenylalkyl)-3,4-           H8e          0.020              30.17              6,248.33         154.00
     dihydropyrimidin-4(3H)-ones         CETCI        0.021              53.62              12,812.80        120.67
                                         GTCI         0.020              50.80              5,422.67         1,369.41
                                         CIC4         0.020              77.20              8,215.44         441.22
6    2,3-Diaryl-1,3-thiazolidin-4-ones   WI           0.06               13.87              1,750.0          105.25             (140)
                                         ECI          0.06               13.87              1,437.0          105.25
                                         ECTI         0.06               12.16              1,822.4          359.46
7    Acylthiocarbamates                  W            0.101              26.55              10,771.0         23.49              (141)
                                         MCI          0.072              29.52              13,809.0         26.49
                                         AECI         0.130              26.94              8,357.29         25.79

SATCI superadjacency topochemical index, WTCI Wiener's topochemical index, MCI molecular connectivity index, EAI
eccentric adjacency index, WI Wiener's index, ECI eccentric connectivity index, AECTCI augmented eccentric connec-
tivity topochemical index, MCTCI molecular connectivity topochemical index, H0m H autocorrelation of lag 0/
weighted by atomic masses, H8e H autocorrelation of lag 8/weighted by atomic masses, CETCI connective eccentricity
topochemical index, GTCI global topological charge index, CIC4 complementary information content (neighborhood
symmetry of 4th order), ECTI eccentric connectivity topochemical index
Note: Wherever there was more than one active and/or inactive range, the active range having the highest biological
activity and the inactive range having the lowest efficacy have been considered

MAA based models can also be utilized for the simultaneous prediction/consideration of the biological activity and safety of compounds, subject to the availability of the requisite data (Table 2). A therapeutic agent should not only possess high potency (low IC50 or EC50 values) but should also exhibit high safety, represented by high values of either the selectivity index or the therapeutic index.

Fig. 8. Average EC50 and SI values of acylthiocarbamate compounds in various ranges of models developed using Wiener's
index, molecular connectivity index, and augmented eccentric connectivity index (141). W Wiener's index, MCI molecular
connectivity index, AECI augmented eccentric connectivity index, LI lower inactive, LT lower transitional, UT upper
transitional, UI upper inactive.

Figure 8 exemplifies the average EC50 and SI values of the various ranges of an MAA model for the anti-HIV activity of acylthiocarbamate compounds (141). Similarly, Fig. 9 illustrates the average EC50 and SI values of the various ranges of an MAA model for the anti-HIV activity of 1-alkoxy-5-alkyl-6-(arylthio)uracils (136). High potency amalgamated with safety with regard to the anti-HIV activity of the active ranges simply reveals the significance of these models. The active ranges of these models can be easily exploited to provide lead structures.

Fig. 9. Average EC50 and SI values of 1-alkoxy-5-alkyl-6-(arylthio)uracils in various ranges of models developed using
Wiener's topochemical index and superadjacency topochemical index (136). AC superadjacency topochemical index,
Wc Wiener's topochemical index, LI lower inactive, LT lower transitional, A active, UT upper transitional, UI upper inactive,
UA upper active.

Commercial software packages employed for the development of classification models are exemplified in Table 3.
The prediction accuracy of these classification models should always be assessed in terms of goodness-of-fit parameters. The formulae and definitions of these goodness-of-fit parameters are listed in Table 4 (19).

Table 3
Examples of software used for development of classification models

Program             Company/author                  Website
DTREG               Phillip H. Sherrod              http://www.dtreg.com
Vanguard Studio     Vanguard Software Corporation   http://www.vanguardsw.com
CAESAR              European Union                  http://www.caesar-project.eu
TOXTREE             EU Joint Research Centre        http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=TOXTREE
DART                EU Joint Research Centre        http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?c=DART
Wessa               P Wessa                         http://www.wessa.net
MVSP                Kovach Computing Service        http://www.kovcomp.com
Minitab 16          Minitab Inc                     http://www.minitab.com
Statistica          Statsoft                        http://www.statsoft.com
SYSTAT 13           Systat                          http://www.systat.com
Unscrambler 10.0.0  CAMO Software                   http://www.camo.com
SPSS/WIN            IBM                             http://www.spss.com
Cerius2 QSAR+       The Scripps Research Institute  http://www.scripps.edu

Table 4
Definitions of goodness-of-fit parameters

Statistic                  Formula                              Definition

Concordance or accuracy    (Σg cgg′ / n) × 100                  Total fraction of objects correctly classified
(nonerror rate)                                                 cgg′ = number of objects correctly classified to each class
                                                                n = total number of objects

Error rate                 ((n − Σg cgg′) / n) × 100            Total fraction of objects misclassified
                           = 100 − concordance                  cgg′ = number of objects correctly classified to each class
                                                                n = total number of objects

NO-model error rate        ((n − nM) / n) × 100                 Error provided in the absence of a model
(NOMER%)                                                        nM = number of objects of the most represented class
                                                                n = total number of objects

Prior probability of       Pg = 1/G                             Probability that an object belongs to a class supposing
a class                                                         that every class has the same probability
                                                                (independently of the number of objects of the class)
                                                                G = number of classes

Prior proportional         Pg = ng/n                            Probability that an object belongs to a class taking into
probability of a class                                          account the number of objects of the class
                                                                ng = total number of objects belonging to class g
                                                                n = total number of objects

Sensitivity of a class     (CA / nA) × 100                      Percentage of active compounds correctly classified as
                                                                active compounds
                                                                CA = number of correctly classified active compounds
                                                                nA = total number of active compounds

Specificity of a class     (CNA / nNA) × 100                    Percentage of nonactive compounds correctly classified
                                                                as nonactive compounds
                                                                CNA = number of correctly classified nonactive compounds
                                                                nNA = total number of nonactive compounds

Misclassification risk     Σg [(Σg′ lgg′ cgg′ / ng) Pg] × 100   Risk of incorrect classification (takes into account the
                                                                number of misclassifications, and their importance)
                                                                lgg′ = element of the loss matrix
                                                                cgg′ = number of objects correctly classified to each class
                                                                ng = total number of objects belonging to class g
                                                                Pg = prior probability of class g
                                                                g = 1, ..., G (G = number of classes)

Reproduced from ref. 19 with permission from OECD Environment Health and Safety Publications Series on Testing
and Assessment, No. 69, OECD Publishing, France
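As a worked illustration of the Table 4 statistics, the sketch below computes concordance, error rate, NOMER%, sensitivity, and specificity from a two-class confusion matrix; the example counts are invented.

```python
# Goodness-of-fit statistics for a two-class (active/nonactive)
# classification model, following the definitions in Table 4.
import numpy as np

def classification_stats(cm):
    """cm[i, j] = number of class-i objects predicted as class j
    (row 0 = active, row 1 = nonactive)."""
    n = cm.sum()
    concordance = 100 * np.trace(cm) / n           # nonerror rate
    error_rate = 100 - concordance
    nomer = 100 * (n - cm.sum(axis=1).max()) / n   # no-model error rate
    sensitivity = 100 * cm[0, 0] / cm[0].sum()     # actives found
    specificity = 100 * cm[1, 1] / cm[1].sum()     # nonactives found
    return concordance, error_rate, nomer, sensitivity, specificity

cm = np.array([[40, 10],   # 40 actives correct, 10 misclassified
               [5, 45]])   # 45 nonactives correct, 5 misclassified
print(classification_stats(cm))  # (85.0, 15.0, 50.0, 80.0, 90.0)
```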

3. Conclusion

The past few decades have witnessed an increasing trend toward the development of (Q)SAR models. These models constitute a vital component of the drug discovery process. Numerous techniques of diverse nature are being used for the development of both correlation and classification models. These models normally predict either the biological activity or the toxicity of a compound. However, there is a strong need to develop models for the simultaneous prediction of biological activity and toxicity so as to accelerate the development of safe drug molecules. Classification models play a vital role in drug design. The active ranges of some classification models, such as those developed through MAA, are of particular significance because they simultaneously take into consideration both activity/potency and the selectivity index. Active ranges of such models have not only reportedly revealed high potency, as indicated by low values of EC50 or IC50, but have also exhibited safety as

indicated by very high values of the selectivity index. However, the future trend will naturally be towards the development of models which can simultaneously predict biological activity, toxicity, and pharmacokinetic parameters so as to accelerate the development of bioavailable safe drug molecules and to minimize failures during the later stages of drug development.

References
1. Sneyd JR (2001) Drug development in the 21st century. Curr Anaesth Crit Care 12:329–334
2. Kubinyi H (2003) Drug research: myths, hype and reality. Nat Rev Drug Discov 2:665–668
3. Sussman NL, Kelly JH (2003) Saving time and money in drug discovery: a pre-emptive approach. Business Briefing: Future Drug Discovery. Business Briefing Ltd., London, UK, 46–49
4. DiMasi J, Hansen R, Grabowski H (2003) The price of innovation: new estimates of drug development cost. J Health Econ 22:151–185
5. Marjana N, Marjana V (2010) QSAR model for reproductive toxicity and endocrine disruption activity. Molecules 15:1987–1999
6. Chen JW, Li XH, Yu HY et al (2008) Progress and perspectives of quantitative structure-activity relationships used for ecological risk assessment of toxic organic compounds. Sci China Ser B Chem 51:593–606
7. Ibezim EC, Duchowicz PR, Ibezim NE et al (2009) Computer-aided linear modeling employing QSAR for drug discovery. Sci Res Essays 4:1559–1564
8. Johnson DE, Wolfgang GHI (2000) Predicting human safety: screening and computational approaches. Drug Discov Today 5:445–454
9. Hou T, Wang J, Zhang W et al (2006) Recent advances in computational prediction of drug absorption and permeability in drug discovery. Curr Med Chem 13:2653–2667
10. Thai KM, Ecker GF (2008) A binary QSAR model for classification of hERG potassium channel blockers. Bioorg Med Chem 16:4107–4119
11. Eitrich T, Kless A, Druska C et al (2007) Classification of highly unbalanced cyp450 data of drugs using cost sensitive machine learning techniques. J Chem Inf Model 47:92–103
12. Ekins S, Boulanger B, Swaan PW et al (2002) Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J Comput Aided Mol Des 16:381–401
13. Harvey SC (1980) Drug absorption, action and disposition. In: Osol A (ed) Remington's pharmaceutical sciences, 16th edn. Mack, Pennsylvania
14. Mandell GL, Petri WA Jr (1996) Antimicrobial agents: penicillins, cephalosporins, and other b-lactam antibiotics. In: Hardman JG, Limbird LE (eds) Goodman and Gilman's the pharmacological basis of therapeutics, 9th edn. McGraw-Hill, New York
15. Tsantrizos YS (2010) Research and development: the discovery process. http://www.wei-c-chen.com/winter2010/chem503/04%20-%20ADME-PK%20drug%20delivery.pdf. Accessed 4 Dec 2010
16. Bajorath J (2001) Selected concepts and investigation in compound classification, molecular descriptor analysis and virtual screening. J Chem Inf Comput Sci 41:233–245
17. Bassan A, Worth AP (2008) The integrated use of models for the properties and effects of chemicals by means of a structured workflow. QSAR Comb Sci 27:6–20
18. Worth AP, Cronin MTD (2003) The use of discriminant analysis, logistic regression and classification tree analysis in the development of classification models for human health effects. J Mol Str Theochem 622:97–111
19. OECD (2007) Guidance document on the validation of (Quantitative) Structure-Activity Relationships [(Q)SAR] Models, p 62. OECD Environment Health and Safety Publications Series on Testing and Assessment, No. 69
20. Grover M, Singh B, Bakshi M, Singh S (2000) Quantitative structure-property relationships in pharmaceutical research. Part 1. Pharm Sci Technol Today 3:28–35
21. Gupta S, Singh M, Madan AK (2001) Predicting anti-HIV activity: computational approach using a novel topological descriptor. J Comput Aided Mol Des 15:671–678
22. Worth AP, Cronin MTD (2000) Embedded cluster modelling: a novel quantitative structure-activity relationship for generating elliptic models of biological activity. In: Balls M, van Zeller AM, Halder ME (eds) Progress in the reduction, refinement and replacement of animal experimentation. Elsevier Science, Amsterdam
23. Frimurer TM, Bywater R, Naerum L (2000) Improving the odds in discriminating drug-like from non drug-like compounds. J Chem Inf Comput Sci 40:1315–1324
24. Weber KC, Honório KM, Bruni AT et al (2006) The use of classification methods for modeling the antioxidant activity of flavonoid compounds. J Mol Model 12:915–920
25. Gillet VJ, Willett P, Bradshaw J (1998) Identification of biological activity profiles using substructural analysis and genetic algorithms. J Chem Inf Comput Sci 38:165–179
26. Alvarez-Ginarte YM, Crespo-Otero R, Marrero-Ponce Y et al (2006) In-silico classification of solubility using binary k-Nearest Neighbour and physicochemical descriptors. QSAR Comb Sci 25:881–894
27. Rodgers AD, Zhu H, Fourches D et al (2010) Modeling liver-related adverse effects of drugs using k nearest neighbor quantitative structure-activity relationship method. Chem Res Toxicol 23:724–732
28. Khlebnikov AI, Schepetkin IA, Kirpotina LN et al (2008) Computational structure-activity relationship analysis of non-peptide inducers of macrophage tumor necrosis factor-alpha production. Bioorg Med Chem 16:9302–9312
29. Nisius B, Goller AH, Bajorath J (2009) Combining cluster analysis, feature selection and multiple support vector machine models for the identification of human ether-a-go-go related gene channel blocking compounds. Chem Biol Drug Des 73:17–25
30. Kauffman GW, Jurs PC (2001) QSAR and k-Nearest Neighbour classification analysis of selective cyclooxygenase-2 inhibition using topologically-based numerical descriptors. J Chem Inf Comput Sci 41:1553–1560
31. Efferth T, Konkimalla VB, Wang YF et al (2008) Prediction of broad spectrum resistance of tumors towards anticancer drugs. Clin Cancer Res 14:2405–2412
32. Casanola-Martín GM, Marrero-Ponce Y, Khan MTH et al (2008) Atom- and bond-based 2D TOMOCOMD-CARDD approach and ligand-based virtual screening for the drug discovery of new tyrosinase inhibitors. J Biomol Screen 13:1014–1024
33. Lv W, Xue Y (2010) Prediction of acetylcholinesterase inhibitors and characterization of correlative molecular descriptors by machine learning methods. Eur J Med Chem 45:1167–1172
34. Boiani M, Cerecetto H, Gonzalez M et al (2008) Modeling anti-Trypanosoma cruzi activity of N-oxide containing heterocycles. J Chem Inf Model 48:213–219
35. Lin HH, Han LY, Yap CW et al (2007) Prediction of factor Xa inhibitors by machine learning methods. J Mol Graph Model 26:505–518
36. Gunturi SB, Theerthala SS, Patel NK et al (2010) Prediction of skin sensitization potential using D-optimal design and GA-kNN classification methods. SAR QSAR Environ Res 21:305–335
37. Polishchuk PG, Muratov EN, Artemenko AG et al (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49:2481–2488
38. Zhu H, Rusyn I, Richard A et al (2008) Use of cell viability assay data improves the prediction accuracy of conventional quantitative structure-activity relationship models of animal carcinogenicity. Environ Health Perspect 116:506–513
39. Zhang S, Wei L, Bastow K et al (2007) Antitumor agents 252. Application of validated QSAR models to database mining: discovery of novel tylophorine derivatives as potential anticancer agents. J Comput Aided Mol Des 21:97–112
40. Asikainen A, Kolehmainen M, Ruuskanen J et al (2006) Structure-based classification of active and inactive estrogenic compounds by decision tree, LVQ and kNN methods. Chemosphere 62:658–673
41. Kadam RU, Roy N (2006) Cluster analysis and two-dimensional quantitative structure-activity relationship (2D-QSAR) of Pseudomonas aeruginosa deacetylase LpxC inhibitors. Bioorg Med Chem Lett 16:5136–5143
42. Hammann F, Gutmann H, Baumann U et al (2009) Classification of cytochrome p(450) activities using machine learning methods. Mol Pharm 6:1920–1926
43. Petkov PI, Temelkov S, Villeneuve DL et al (2009) Mechanism-based categorization of aromatase inhibitors: a potential discovery and screening tool. SAR QSAR Environ Res 20:657–678
44. Burton J, Danloy E, Vercauteren DP (2009) Fragment-based prediction of cytochromes P450 2D6 and 1A2 inhibition by recursive partitioning. SAR QSAR Environ Res 20:185–205
45. Hammann F, Gutmann H, Jecklin U et al (2009) Development of decision tree models for substrates, inhibitors, and inducers of p-glycoprotein. Curr Drug Metab 10:339–346
46. Yang XG, Chen D, Wang M et al (2009) Prediction of antibacterial compounds by machine learning approaches. J Comput Chem 30:1202–1211
47. Kuzmin VE, Polischuk PG, Artemenko AG et al (2008) Quantitative structure-affinity relationship of 5-HT1A receptor ligands by the classification tree method. SAR QSAR Environ Res 19:213–244
48. Panaye A, Doucet JP, Devillers J et al (2008) Decision trees versus support vector machine for classification of androgen receptor ligands. SAR QSAR Environ Res 19:129–151
49. Liao Q, Yao J, Yuan S (2007) Prediction of mutagenic toxicity by combination of recursive partitioning and support vector machines. Mol Divers 11:59–72
50. Kim HJ, Choo H, Cho YS et al (2006) Classification of dopamine, serotonin, and dual antagonists by decision trees. Bioorg Med Chem 14:2763–2770
51. Hong H, Tong W, Xie Q (2005) An in silico ensemble method for lead discovery: decision forest. SAR QSAR Environ Res 16:339–347
52. Dureja H, Gupta S, Madan AK (2009) Topological models for prediction of physicochemical, pharmacokinetic and toxicological properties of antihistaminic drugs using decision tree and moving average analysis. Int J Comput Biol Drug Des 2:353–370
53. Goyal RK, Dureja H, Singh G, Madan AK (2010) Models for antitubercular activity of 5′-O-[(N-acyl)sulfamoyl]adenosines. Sci Pharm 78:791–820
54. Roy K, Sanyal I (2005) QSTR with extended topochemical atom indices. 7. QSAR of substituted benzenes to Saccharomyces cerevisiae. QSAR Comb Sci 25:359–371
55. Bhatia MS, Ingale KB, Choudhari PB et al (2008) Application quantum and physico chemical molecular descriptors utilizing principal components to study mode of anticoagulant activity of pyridyl chromen-2-one derivatives. Bioorg Med Chem 17:1654–1662
56. Honório KM, da Silva AB (2005) A study on the influence of molecular properties in the psychoactivity of cannabinoid compounds. J Mol Model 11:200–209
57. Molfetta FA, Bruni AT, Honório KM et al (2005) A structure-activity relationship study of quinone compounds with trypanocidal activity. Eur J Med Chem 40:329–338
58. Warne MA, Nicholson JK, Lindon JC et al (2009) A QSAR investigation of dermal and respiratory chemical sensitizers based on computational chemistry properties. SAR QSAR Environ Res 20:429–451
59. De O, Figueiredo LJ, Garrido FM, Kunisawa VY et al (2006) A chemometric study of phosphodiesterase 5 inhibitors. J Mol Graph Model 24:227–232
60. Lameira J, Medeiros IG, Reis M et al (2006) Structure-activity relationship study of flavones compounds with anti-HIV-1 integrase activity: a density functional theory study. Bioorg Med Chem 14:7105–7112
61. de Melo LC, Braga SF, Barone PM (2007) Pattern recognition methods investigation of ellipticines structure-activity relationships. J Mol Graph Model 25:912–920
62. Perez-Garrido A, Helguera AM, Rodriguez FG et al (2010) QSAR models to predict mutagenicity of acrylates, methacrylates and alpha, beta-unsaturated carbonyl compounds. Dent Mater 26:397–415
63. Casanola-Martín GM, Marrero-Ponce Y, Khan MT et al (2007) Dragon method for finding novel tyrosinase inhibitors: biosilico identification and experimental in vitro assays. Eur J Med Chem 42:1370–1381
64. Garcia I, Fall Y, Gaimez G et al (2010) First computational chemistry multi-target model for anti-Alzheimer, anti-parasitic, anti-fungi, and anti-bacterial activity of GSK-3 inhibitors in vitro, in vivo, and in different cellular lines. Mol Divers. doi:10.1007/s11030-010-9280-3
65. Le-Thi-Thu H, Casanola-Martín GM, Marrero-Ponce Y et al (2010) Novel coumarin-based tyrosinase inhibitors discovered by OECD principles-validated QSAR approach from an enlarged, balanced database. Mol Divers. doi:10.1007/s11030-010-9274-1
66. Vilar S, Chakrabarti M, Costanzi S (2010) Prediction of passive blood-brain partitioning: straightforward and effective classification models based on in silico derived physicochemical descriptors. J Mol Graph Model 28:899–903
67. Prado-Prado FJ, García-Mera X, González-Díaz H (2010) Multi-target spectral moment QSAR versus ANN for antiparasitic drugs against different parasite species. Bioorg Med Chem 18:2225–2231
68. Castillo-Garit JA, Vega MC, Rolon M et al (2010) Computational discovery of novel trypanosomicidal drug-like chemicals by using bond-based non-stochastic and stochastic quadratic maps and linear discriminant analysis. Eur J Pharm Sci 39:30–36
69. Prado-Prado FJ, Ubeira FM, Borges F et al (2010) Unified QSAR & network-based computational chemistry approach to antimicrobials. II. Multiple distance and triadic census analysis of antiparasitic drugs complex networks. J Comput Chem 31:164–173
70. Prado-Prado FJ, Borges F, Perez-Montoto LG et al (2009) Multi-target spectral moment: QSAR for antifungal drugs vs. different fungi species. Eur J Med Chem 44:4051–4056
71. Sun M, Chen J, Wei H et al (2009) Quantitative structure-activity relationship and classification analysis of diaryl ureas against vascular endothelial growth factor receptor-2 kinase using linear and non-linear models. Chem Biol Drug Des 73:644–654
72. Gozalbes R, Barbosa F, Nicola E et al (2009) Development and validation of a pharmacophore-based QSAR model for the prediction of CNS activity. ChemMedChem 4:204–209
73. Garcia-Domenech R, Lopez-Pena W, Sanchez-Perdomo Y et al (2008) Application of molecular topology to the prediction of the antimalarial activity of a group of uracil-based acyclic and deoxyuridine compounds. Int J Pharm 363:78–84
74. Castillo-Garit JA, Marrero-Ponce Y, Torrens F et al (2008) Estimation of ADME properties in drug discovery: predicting Caco-2 cell permeability using atom-based stochastic and non-stochastic linear indices. J Pharm Sci 97:1946–1976
75. Alvarez-Ginarte YM, Marrero-Ponce Y, Ruiz-Garcia JA (2008) Applying pattern recognition methods plus quantum and physico-chemical molecular descriptors to analyze the anabolic activity of structurally diverse steroids. J Comput Chem 29:317–333
76. Kamphorst J, Cucurull-Sanchez L, Jones B (2007) A performance evaluation of multiple classification models of human PEPT1 inhibitors and non-inhibitors. QSAR Comb Sci 26:220–226
77. Cannon EO, Amini A, Bender A et al (2007) Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds. J Comput Aided Mol Des 21:269–280
78. Sun H (2006) An accurate and interpretable Bayesian classification model for prediction of hERG liability. ChemMedChem 1:315–322
79. Wang J, Liu H, Qin S et al (2007) Study on structure-activity relationship of new anti-HIV nucleoside derivatives based on the support vector machine method. QSAR Comb Sci 26:161–172
80. Sato T, Matsuo Y, Honma T (2008) In silico functional profiling of small molecules and its applications. J Med Chem 51:7705–7716
81. Du H, Wang J, Watzl J et al (2008) Classification structure-activity relationship (CSAR) studies for prediction of genotoxicity of thiophene derivatives. Toxicol Lett 177:10–19
82. Tang H, Wang XS, Huang XP (2009) Novel inhibitors of human histone deacetylase (HDAC) identified by QSAR modeling of known inhibitors, virtual screening, and experimental validation. J Chem Inf Model 49:461–476
83. Leong MK, Chen TH (2008) Prediction of cytochrome P450 2B6-substrate interactions using pharmacophore ensemble/support vector machine (PhE/SVM) approach. Med Chem 4:396–406
84. Leong MK (2007) A novel approach using pharmacophore ensemble/support vector machine (PhE/SVM) for prediction of hERG liability. Chem Res Toxicol 20:217–226
85. Auerbach SS, Shah RR, Mav D et al (2010) Predicting the hepatocarcinogenic potential of alkenylbenzene flavoring agents using toxicogenomics and machine learning. Toxicol Appl Pharmacol 243:300–314
86. Leong MK, Chen YM, Chen HB et al (2009) Development of a new predictive model for interactions with human cytochrome P450 2A6 using pharmacophore ensemble/support vector machine (PhE/SVM) approach. Pharm Res 26:987–1000
87. Votano JR, Parham M, Hall LM (2006) QSAR modeling of human serum protein binding with several modeling techniques utilizing structure-information representation. J Med Chem 49:7169–7181
88. Fernandez M, Caballero J (2006) Ensembles of Bayesian-regularized genetic neural networks for modeling of acetylcholinesterase inhibition by huprines. Chem Biol Drug Des 68:201–212
89. Tarasov VA, Mustafaev ON, Abilev SK (2005) Use of ensemble structural descriptors for increasing the efficiency of QSAR study. Genetika 41:997–1005
90. Guha R, Jurs PC (2004) Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci 44:2179–2189
91. Debeljak Z, Skrbo A, Jasprica I et al (2007) QSAR study of antimicrobial activity of some 3-nitrocoumarins and related compounds. J Chem Inf Model 47:918–926
92. Bajaj S, Sambi SS, Madan AK (2004) Prediction of carbonic anhydrase activation of tri/tetra substituted-pyridinium-azole compounds: computational approach using novel topochemical descriptor. QSAR Comb Sci 23:506–514
93. Bajaj S, Sambi SS, Madan AK (2004) Topological models for prediction of anti-inflammatory activity of N-arylanthranilic acids. Bioorg Med Chem 12:3695–3701
94. Bajaj S, Sambi SS, Madan AK (2005) Topochemical model for prediction of corticotropin releasing factor antagonizing activity of N-phenylphenylglycines analogs. Acta Chim Slov 52:292–296
95. Bajaj S, Sambi SS, Madan AK (2008) Topochemical models for predicting the activity of α,γ-diketo acids as inhibitors of the hepatitis C virus NS5B RNA-dependent RNA polymerase. Pharmaceut Chem J 40:650–654
96. Sardana S, Madan AK (2002) Predicting anticonvulsant activity of benzamides/benzylamines: computational approach using topological descriptors. J Comput Aided Mol Des 16:545–550
97. Lather V, Madan AK (2005) Topological model for the prediction of MRP1 inhibitory activity of pyrrolopyrimidines and templates derived from pyrrolopyrimidine. Bioorg Med Chem Lett 15:4967–4972
98. Gupta S, Singh M, Madan AK (1999) Superpendentic index: a novel topological descriptor for predicting biological activity. J Chem Inf Comput Sci 39:272–277
99. Kumar V, Madan AK (2006) Application of graph theory: prediction of cytosolic phospholipase A2 inhibitory activity of propan-2-ones. J Math Chem 39:511–521
100. Kumar V, Madan AK (2005) Application of graph theory: prediction of glycogen synthase kinase-3 inhibitory activity of thiadiazolidinones as potential drugs for the treatment of Alzheimer's disease. Eur J Pharm Sci 24:213–218
101. Kumar V, Madan AK (2007) Application of graph theory: models for prediction of carbonic anhydrase inhibitory activity of sulphonamides. J Math Chem 42:925–940
102. Dureja H, Madan AK (2005) Topochemical models for prediction of cyclin-dependent kinase 2 inhibitory activity of indole-2-ones. J Mol Model 11:525–531
103. Dureja H, Madan AK (2006) Prediction of h5-HT2A receptor antagonistic activity of arylindoles: computational approach using topochemical descriptors. J Mol Graph Model 25:373–379
104. Dureja H, Madan AK (2007) Topochemical models for prediction of telomerase inhibitory activity of flavonoids. Chem Biol Drug Des 70:47–52
105. Bajaj S, Sambi SS, Madan AK (2006) Models for prediction of anti-neoplastic activity of 1,2-bis(sulfonyl)-1-methylhydrazines: computational approach using Wiener's indices. MATCH Commun Math Comput Chem 56:193–204
106. Blower PE, Cross KP (2006) Decision tree methods in pharmaceutical research. Curr Top Med Chem 6:31–39
107. Andres C, Hutter MC (2006) CNS permeability of drugs predicted by a decision tree. QSAR Comb Sci 25:305–309
108. Dudek AZ, Arodz T, Galvez J (2006) Computational methods in developing quantitative structure-activity relationship: a review. Comb Chem High Throughput Screen 9:213–228
109. Breiman L, Friedman JH, Olshen RA et al (1984) Classification and regression trees. Wadsworth International Group, Belmont
110. Svetnik V, Liaw A, Tong C et al (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci 43:1947–1958
111. Tong W, Hong H, Fang H et al (2003) Decision forest: combining the predictions of multiple independent decision tree models. J Chem Inf Comput Sci 43:525–531
112. Aboul-Kassim TAT, Simoneit BRT (2001) QSAR/QSPR and multicomponent joint toxic effect modeling of organic pollutants at aqueous-solid phase interfaces, vol 5E, The handbook of environmental chemistry. Springer, Heidelberg
113. Wold S, Dunnt WJ, Hellberg S (1985) Toxicity modeling and prediction with pattern recognition. Environ Health Perspect 61:257–268
114. http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf. Accessed 15 Aug 2010
115. http://www.geosoft.com. Accessed 28 Nov 2010
116. Jalali-Heravi M, Shahbazikhah P, Zekavat B et al (2007) Principal component analysis-ranking as a variable selection method for the simulation of 13C nuclear magnetic resonance spectra of xanthones using artificial neural networks. QSAR Comb Sci 26:764–772
117. Adenot M, Perriere N, Scherrmann JM et al (2007) Applications of a blood-brain barrier technology platform to predict CNS penetration of various chemotherapeutic agents. 1. Anti-infective drugs. Chemotherapy 53:70–72
118. Mazzatorta P, Benfenati E, Lorenzini P et al (2004) QSAR in ecotoxicity: an overview of modern classification techniques. J Chem Inf Comput Sci 44:105–112
119. Mahmoudi N, de Julián-Ortiz J, Ciceronl L et al (2006) Identification of new antimalarial drugs by linear discriminant analysis and topological virtual screening. J Antimicrob Chemother 57:489–497
120. Cortes C, Vapnik V (1995) Support vector machine. Machine Learning 20:273–293
121. Vapnik V (1995) The nature of statistical learning theory. Springer, New York
122. Vapnik V (1998) Statistical learning theory. Wiley, New York
123. Karthikeyan R, Mohamed S, Sridhar V, Nagasuma C (2009) Support vector machine classifier for predicting drug binding to p-glycoprotein. J Proteomics Bioinform 2:193–201
124. Frohlich H, Wegner JK, Sieker F et al (2006) Kernel functions for attributed molecular graphs: a new similarity-based approach to ADME prediction in classification and regression. QSAR Comb Sci 25:317–326
125. Yao XJ, Panaye A, Doucet JP (2005) Comparative classification study of toxicity mechanisms using support vector machines and radial basis function neural networks. Anal Chim Acta 535:259–273
126. Tanabe K, Lucic B, Amic D et al (2010) Prediction of carcinogenicity for diverse chemicals based on substructure grouping and SVM modeling. Mol Divers 14:789–802
127. Zhang Q, Hughes-Oliver JM, Ng RT (2009) A model-based ensembling approach for developing QSARs. J Chem Inf Model 49:1857–1865
128. Dutta D, Guha R, Wild D et al (2007) Ensemble feature selection: consistent descriptor subsets for multiple QSAR models. J Chem Inf Model 47:989–997
129. Arodz T, Yuen DA, Dudek AZ (2006) Ensemble of linear models for predicting drug properties. J Chem Inf Model 46:416–423
130. Bruce CL, Melville JL, Pickett SD et al (2007) Contemporary QSAR classifiers compared. J Chem Inf Model 47:219–227
131. Zhang H, Xiang M-L, Ma C-Y (2009) Three-class classification models of logS and logP derived by using GA-CG-SVM approach. Mol Divers 13:261–268
132. Agrafiotis DK, Cedeno W, Lobanov VS (2002) On the use of neural network ensembles in QSAR and QSPR. J Chem Inf Comput Sci 42:903–911
133. Chen JJ, Tsai CA, Young JF et al (2005) Classification ensembles for unbalanced class sizes in predictive toxicology. SAR QSAR Environ Res 16:517–529
134. Gupta S, Singh M, Madan AK (2000) Connective eccentricity index: a novel topological descriptor for predicting biological activity. J Mol Graph Model 18:18–25
135. Dureja H, Gupta S, Madan AK (2008) Topological models for prediction of pharmacokinetic parameters of cephalosporins using random forest, decision tree and moving average analysis. Sci Pharm 76:377–394
136. Bajaj S, Madan AK (2007) Topochemical models for anti-HIV activity of 1-alkoxy-5-alkyl-6-(arylthio)uracils. Chem Paper 61:127–132
137. Lather V, Madan AK (2005) Topological models for the prediction of anti-HIV activity of dihydro(alkylthio)(naphthylmethyl)oxopyrimidines. Bioorg Med Chem 13:1599–1604
138. Dureja H, Madan AK (2009) Predicting anti-HIV activity of dimethylaminopyridin-2-ones: computational approach using topochemical descriptors. Chem Biol Drug Des 73:258–270
139. Dutt R, Dureja H, Madan AK (2009) Models for prediction of anti-HIV-1 activity of 5-alkyl-2-alkylamino-6-(2,6-difluorophenylalkyl)-3,4-dihydropyrimidin-4(3H)-ones using random forest, decision tree and moving average analysis. J Comput Methods Sci Eng 9:1–18
140. Kumar V, Sardana S, Madan AK (2004) Predicting anti-HIV activity of 2,3-diaryl-1,3-thiazolidin-4-ones: computational approach using reformed eccentric connectivity index. J Mol Model 10:399–407
141. Bajaj S, Sambi SS, Madan AK (2005) Topological models for prediction of anti-HIV activity of acylthiocarbamates. Bioorg Med Chem 13:3263–3268
Chapter 6

QSAR and Metabolic Assessment Tools


in the Assessment of Genotoxicity
Andrew P. Worth, Silvia Lapenna, and Rositsa Serafimova

Abstract
In this chapter, a range of computational tools for applying QSAR and grouping/read-across methods are
described, and their integrated use in the computational assessment of genotoxicity is illustrated through
the application of selected tools to two case-study compounds: 2-amino-9H-pyrido[2,3-b]indole (AαC)
and 2-aminoacetophenone (2-AAP). The first case-study compound (AαC) is an environmental pollutant and
a food contaminant that can be formed during the cooking of protein-rich food. The second case study
compound (2-AAP) is a naturally occurring compound in certain foods and also proposed for use as a
flavoring agent. The overall aim is to describe and illustrate a possible way of combining different informa-
tion sources and software tools for genotoxicity and metabolism prediction by means of a simple stepwise
approach. The chapter is aimed at researchers and assessors who have a basic knowledge of computational
toxicology and some familiarity with the practical use of computational tools. The emphasis is on how to
evaluate the data generated by multiple tools, rather than the practical use of any specific tool.

Key words: Mutagenicity, Genotoxicity, QSAR, Expert system, Structural alert, Read-across,
Chemical category, Toxicity, Metabolism, Prediction

1. Introduction

Regulatory programs aimed at assessing and managing the risks of


chemicals require information on a wide range of chemical proper-
ties and effects, including physicochemical and environmental fate
properties, as well as effects on human health and environmental
species. In the EU, information on the properties of chemicals is
required under multiple pieces of legislation, including legislation
on industrial chemicals (1, 2), biocides (3), plant protection pro-
ducts (4), contaminants in surface waters (5), cosmetics (6, 7), and
food flavorings (8, 9).


The regulatory assessment of chemicals involves one or more of


the following procedures: (a) hazard assessment (which includes
hazard identification and dose–response characterization), possibly
leading to classification and labeling; (b) exposure assessment; (c)
risk assessment based on hazard and exposure assessments; and (d)
the identification of Persistent, Bioaccumulative, and Toxic (PBT)
and Very Persistent and Very Bioaccumulative (vPvB) chemicals. To
address one or more of these different regulatory goals, the risk
assessments often need to be performed in the face of numerous
data gaps in hazard and exposure information. For reasons of
cost-effectiveness and animal welfare, these data gaps cannot be
completely filled by relying on traditional (animal) testing
approaches. To overcome this information deficit, it is now
widely recognized that a more intelligent approach to chemical
safety assessment is needed, based as far as scientifically possible
on the use of computational toxicology techniques. The ways in
which computational methods may be used depend on the possi-
bilities foreseen by the regulatory framework (some being more
conservative than others) and the specific context in which the
method is used (10).
A major achievement in the past decade has been the increasing
availability of toxicity and metabolism databases and prediction
tools (including freely available tools) that can be used to support
the risk assessment process in a consistent and transparent manner.
However, an unfortunate consequence of the increasing availability
of these tools could be increasing confusion on how best to combine
and use the totality of information resulting from the application of
these methods. The solution is to develop structured frameworks
for integrating the data, and for performing what is often referred to
as a weight-of-evidence (WoE) (or totality-of-evidence) assessment.
The overall aim of this chapter is to describe and illustrate a
possible way of combining different information sources and soft-
ware tools by means of a simple stepwise approach. The chapter is
aimed at researchers and assessors who have a basic knowledge of
computational toxicology and some familiarity with the practical
use of computational tools. The emphasis is on how to evaluate the
data generated by multiple tools, rather than the practical use of any
specific tool. The stepwise approach is illustrated by means of case
studies on the computational assessment of genotoxicity for two
chemicals. One chemical, 2-amino-9H-pyrido[2,3-b]indole (AαC),
is an environmental pollutant present in cigarette smoke and diesel
exhaust fumes, and can also be formed during the cooking of
protein-rich food. The other chemical, 2-aminoacetophenone
(2-AAP), is a naturally occurring component of several food items,
and is also a flavoring substance.

2. Materials

A large and increasing number of software tools are available for


predicting the toxicological properties and fate of chemicals. These
tools, and the underlying models/methods incorporated, have
been reviewed extensively elsewhere (11–13). This section is therefore limited to a brief description of the tools used in the case studies described in Subheading 4. These tools were selected partly because they were available in-house, and partly because they are representative of the different types of predictive method: statistically based, expert knowledge based, and hybrid (statistically and knowledge based). The tools selected include both
commercial and freely available applications, and are suitable for
running on a standard PC.

2.1. CAESAR A series of statistically based models, developed within the EU-funded


CAESAR project (http://www.caesar-project.eu), have been imple-
mented into open-source software and made freely available for
online use via the Web. Predictions are made for five endpoints:
Ames mutagenicity, carcinogenicity, developmental toxicity, skin
sensitization, and the bioconcentration factor.

2.2. Derek for Derek for Windows (DfW) is developed and marketed by Lhasa
Windows Ltd, a nonprofit company and educational charity in the UK
(https://www.lhasalimited.org/). DfW contains over 630 alerts
covering a wide range of toxicological endpoints in humans, other
mammals, and bacteria. An alert in DfW consists of a toxicophore
(a substructure known or thought to be responsible for the toxicity)
and is associated with literature references, comments, and examples
(14). A key feature of DfW is the transparent reporting of the
reasoning underlying each prediction.
All the rules in DfW are based either on hypotheses relating to
mechanisms of action of a chemical class or on observed empirical
relationships. Information used in the development of rules
includes published data and suggestions from toxicological experts
in the industry, regulatory bodies, and academia. The toxicity pre-
dictions are the result of two processes. The program first checks
whether any alerts in the knowledge base match toxicophores in the
query structure. The reasoning engine then assesses the likelihood
of a structure being toxic. There are nine levels of confidence:
certain, probable, plausible, equivocal, doubted, improbable,
impossible, open, and contradicted. DfW can be integrated with
Lhasa's Meteor software, which makes predictions of fate, thereby
providing predictions of toxicity for both parent compounds and
their metabolites.

2.3. HazardExpert This is a module of the Pallas software developed by CompuDrug


(http://compudrug.com/). It predicts the toxicity of organic
compounds based on toxic fragments, and it also calculates bio-
availability parameters (from log P and pKa). It is a rule-based
system with an open knowledge base, allowing the user to expand
or modify the data on which the toxicity estimation relies. It covers
the following endpoints relevant to dietary toxicity assessment:
carcinogenicity, mutagenicity, teratogenicity, membrane irritation,
immunotoxicity, and neurotoxicity. A further application of the
program is prediction of the toxicity of the parent compound and
its metabolites by linking with MetabolExpert, another module of
the Pallas software.

2.4. Lazar Lazar is an open-source software program that makes predictions of


toxicological endpoints (currently, mutagenicity, human liver tox-
icity, rodent and hamster carcinogenicity, Maximum Recom-
mended Daily Dose) by analyzing structural fragments in a
training set (15, 16). It is based on the use of statistical algorithms
for classification (k-nearest neighbors and kernel models) and
regression (multi-linear regression and kernel models). In contrast
to traditional k-NN techniques, Lazar treats chemical similarities
not in absolute values, but as toxicity-dependent values, thereby
capturing only those fragments that are relevant for the toxic
endpoint under investigation. Lazar performs automatic applicabil-
ity domain estimation and provides a confidence index for each
prediction, and is usable without expert knowledge. Lazar runs
under Linux and a Web-based prototype is also freely accessible
online: http://lazar.in-silico.de/.
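Lazar's own open-source implementation should be consulted for details; as a rough illustration of the general idea described above (similarity-weighted nearest-neighbor voting with a confidence measure), a simplified sketch under our own assumptions might look as follows. The fingerprints, weighting scheme, and confidence index here are illustrative, not Lazar's actual algorithm.

```python
# Similarity-weighted k-NN classification sketch: neighbors vote on the
# query's class, with each vote weighted by Tanimoto similarity between
# binary structural-fragment fingerprints.
import numpy as np

def tanimoto(a, b):
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 0.0

def predict_weighted_knn(fp_query, fps_train, y_train, k=5):
    sims = np.array([tanimoto(fp_query, fp) for fp in fps_train])
    nearest = np.argsort(sims)[::-1][:k]             # k most similar
    score = np.sum(sims[nearest] * np.where(y_train[nearest] == 1, 1, -1))
    confidence = sims[nearest].mean()                # crude reliability index
    return (1 if score > 0 else 0), confidence

rng = np.random.default_rng(4)
fps = rng.integers(0, 2, size=(30, 64)).astype(bool)  # 30 compounds, 64-bit fps
y = rng.integers(0, 2, size=30)
print(predict_weighted_knn(fps[0], fps[1:], y[1:]))
```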

2.5. Meteor Meteor is a knowledge-based expert system designed to predict the


likely metabolic fate of a query compound from its chemical struc-
ture (17). It is developed and marketed by Lhasa Ltd (https://
www.lhasalimited.org/). The system contains a biotransformation
dictionary of metabolic reactions and a reasoning engine which
discriminates between all possible outcomes and the most likely
ones. The reasoning engine is driven by rules which attempt to
encapsulate human expert knowledge. The uncertainty terms used
in Meteor in relation to predicted biotransformations and metabo-
lites include probable (when there is at least one strong argument
that the proposition is true and there are no arguments against it),
plausible (when the weight of evidence supports the proposi-
tion), equivocal (when there is an equal weight of evidence for
and against the proposition), doubted (when the weight of
evidence opposes the proposition), and improbable (when
there is at least one strong argument that the proposition is false
and there are no arguments that it is true).

2.6. OECD QSAR Toolbox The OECD QSAR Toolbox is a stand-alone software application
for filling gaps in (eco)toxicity data needed for assessing the hazards of
chemicals. Data gaps are filled by following a flexible workflow in
which chemical categories are built and missing data are estimated
by read-across or by applying local QSARs (trends within the
category). The Toolbox also includes a range of profilers to quickly
evaluate chemicals for common mechanisms or modes of action. In
order to support read-across and trend analysis, the Toolbox con-
tains numerous databases with results from experimental studies.
The first version of the Toolbox, released in March 2008, was a
proof-of-concept version. The second version (v. 2.0) was released
in October 2010. The release of version 3.0 is planned for October
2012. The Toolbox and guidance on its use are freely available:
http://www.qsartoolbox.org/.

2.7. OncoLogic This is a freely available expert system that assesses the potential of
chemicals to cause cancer. OncoLogic was developed by the US
EPA in collaboration with LogiChem, Inc. It predicts the potential
carcinogenicity of chemicals by applying the rules of structure-activity
relationship (SAR) analysis and incorporating what is known
about the mechanisms of action and human epidemiological stud-
ies. The software reveals its line of reasoning, like human experts, to
support the predictions made. It also includes a database of toxico-
logical information relevant to carcinogenicity assessment. The
Cancer Expert System comprises four subsystems that evaluate
fibers, metals, polymers, and organic chemicals of diverse chemical
structures. Chemicals are entered one by one and the user needs a
limited knowledge of chemistry in order to select the appropriate
subsystem. OncoLogic is freely downloadable from the US EPA
website: http://www.epa.gov/oppt/sf/pubs/oncologic.htm.

2.8. TOPKAT This QSAR-based system, developed by Accelrys Inc. (http://


accelrys.com/), makes predictions of a range of toxicological end-
points, including mutagenicity, developmental toxicity, rodent car-
cinogenicity, rat chronic Lowest Observed Adverse Effect Level
(LOAEL), rat Maximum Tolerated Dose, and rat oral LD50. The
QSARs are developed by regression analysis for continuous end-
points and by discriminant analysis for categorical endpoints. TOP-
KAT models are derived by using a range of two-dimensional
molecular, electronic, and spatial descriptors. TOPKAT estimates
the confidence in the prediction by applying the patented Optimal
Predictive Space (OPS) validation method. The OPS is TOPKAT's formulation of the model applicability domain: a unique multivariate descriptor space in which a given model is considered to be
applicable. Any prediction generated for a query structure outside
of the OPS space is considered unreliable.
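The OPS algorithm itself is patented and not reproduced here. As a hedged illustration of the general concept of a multivariate applicability domain, the sketch below uses the widely documented leverage approach (with the conventional warning threshold h* = 3p′/n), which is a common surrogate and not TOPKAT's actual method.

```python
# Leverage-based applicability domain check: a query compound is flagged
# as outside the domain when its leverage h = x (X'X)^-1 x' exceeds the
# conventional warning value h* = 3p'/n (p' = descriptors + intercept).
import numpy as np

def leverage_check(X_train, x_query):
    X = np.column_stack([np.ones(len(X_train)), X_train])  # add intercept
    xq = np.concatenate([[1.0], x_query])
    h = xq @ np.linalg.inv(X.T @ X) @ xq      # leverage of the query
    h_star = 3 * X.shape[1] / X.shape[0]      # conventional threshold
    return h, h_star, h <= h_star             # True = inside the domain

rng = np.random.default_rng(5)
X_train = rng.normal(size=(50, 4))            # 50 compounds x 4 descriptors
print(leverage_check(X_train, rng.normal(size=4)))  # typical query
print(leverage_check(X_train, 10 * np.ones(4)))     # extreme query -> outside
```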

2.9. ToxBoxes ToxBoxes, now called ACD/Tox Suite, is provided by ACD/Labs


and Pharma Algorithms, and provides predictions of various toxic-
ity endpoints including hERG inhibition, genotoxicity, CYP3A4
inhibition, ER binding affinity, irritation, rodent LD50, aquatic
toxicity, and organ-specific health effects (http://www.acdlabs.
com/products/admet/tox/). The predictions are associated with
confidence intervals and probabilities, thereby providing a numeri-
cal expression of prediction reliability. The software incorporates
the ability to identify and visualize specific structural toxicophores,
giving insight as to which parts of the molecule are responsible for
the toxic effect. It also identifies analogues from its training set,
which can also increase confidence in the prediction. The algo-
rithms and datasets are not disclosed.

2.10. Toxtree Toxtree is a flexible and user-friendly open-source application that


places chemicals into categories and predicts various kinds of toxic
effects by applying decision tree approaches. It is freely available
from the Joint Research Centre (JRC) website (http://ihcp.jrc.ec.
europa.eu/our_labs/computational_toxicology/qsar_tools/toxtree)
and from Sourceforge (https://sourceforge.net/projects/toxtree/).
Toxtree has been developed by the JRC in collaboration with
various consultants, in particular Ideaconsult Ltd (Sofia, Bulgaria).
A key feature of Toxtree is the transparent reporting of the
reasoning underlying each prediction. Toxtree v 1.60 (July 2009)
includes classification schemes for systemic toxicity (Cramer
scheme and extended Cramer scheme), as well as mutagenicity
and carcinogenicity (Benigni-Bossa rulebase and the ToxMic rule-
base on the in vivo micronucleus assay). The Cramer scheme is
probably the most widely used approach for structuring chemicals
in order to make an estimation of the Threshold of Toxicological
Concern (TTC).
The current version of Toxtree (v. 2.5.0, August 2011) also
applies a TTC decision tree (18), alerts for skin sensitization (19),
and includes SMARTCyp, a two-dimensional method for predict-
ing the sites of cytochrome P450-mediated metabolism in a
molecule (20).

3. Methods

3.1. Stepwise A stepwise approach for using toxicity and metabolism prediction
Approach to Hazard tools in the context of a hazard assessment has been proposed in an
Assessment earlier work (21) and incorporated into the technical guidance for
REACH (22). The stepwise approach presented here is an adapta-
tion of the approach proposed previously.

The different steps provide a logical approach to the compilation


of a datasheet which forms the basis for an assessment of the ade-
quacy of the non-testing data as a whole. In earlier steps, information
is gained that can be used to guide the search for information in later
steps. The approach is intended to be flexible so that one or more of
the steps could be skipped, if appropriate, or the different steps could
be applied in a different sequence, if more convenient. The overall
aim of the stepwise workflow is to apply computational methods in
an integrated manner, and finally judge the adequacy of the non-
testing data. As a result of this evaluation, a decision would be taken
whether further assessment, and in particular testing, is necessary.
Considerations in the evaluation of the adequacy of non-testing data
are discussed elsewhere (10, 23).

3.2. Step 1. Existing The workflow begins by identifying the information on chemical
Data and Information properties that is needed for in-house development purposes or to
meet a given set of regulatory requirements. In the case studies
presented, an assessment of genotoxic potential is performed as a
step towards evaluating carcinogenic potential.
Information needs to be collected about the chemical compo-
sition (identity of main chemical component, other components,
purity, impurities) of the substance of interest and a specific com-
pound is selected for the study (referred to as the parent com-
pound). This is necessary because predictions from (Q)SAR
methods and grouping approaches are generated from a single
well-defined structure (represented in a machine-readable format
such as SMILES code or mol file). The purity/impurity profile can
also be useful when explaining discrepancies between experimental
and non-testing data.
It is essential to verify the structure of the parent compound, along with key characteristics such as protonation, isomerism, and tautomerism. If the compound is known only by a CAS number or by its name, it is necessary to derive its structure (e.g., in the form of a SMILES code) to be used in the prediction generation process. This can be
achieved by using an application that serves as a Structure Con-
verter tool. If the structure is known, it is important to verify that
the structural information agrees with the CAS number or with the
name. In the case of large datasets, this can be a time-consuming
exercise, in which some degree of expert intervention is necessary.
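As one possible way to carry out this verification step, the sketch below uses the open-source RDKit toolkit (our choice for illustration; it is not the Structure Converter referred to above) to parse and canonicalize SMILES codes. The SMILES strings for the two case-study compounds are our assumed encodings and should be cross-checked against the corresponding CAS registry entries.

```python
# Structure verification sketch with RDKit: parse each assumed SMILES
# and emit a canonical form suitable as machine-readable tool input.
from rdkit import Chem

candidates = {
    # Assumed SMILES for the case-study compounds (verify against CAS records)
    "AalphaC (2-amino-9H-pyrido[2,3-b]indole)": "Nc1ccc2c(n1)[nH]c1ccccc21",
    "2-AAP (2-aminoacetophenone)": "CC(=O)c1ccccc1N",
}

for name, smiles in candidates.items():
    mol = Chem.MolFromSmiles(smiles)          # returns None if unparsable
    if mol is None:
        print(name, ": invalid SMILES")
        continue
    print(name, "->", Chem.MolToSmiles(mol))  # canonical SMILES
```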
Information on the parent compound should be collected from
a range of available databases. This profiling involves the collation
of available data on the physicochemical properties (cross-reference to
Dearden chapter (24)), ecological effects, fate, and health effects.
Data can be retrieved from the scientific literature, Internet-based
resources (free and commercial resources), or electronic databases.
Some useful starting points are illustrated in the case studies in
Subheading 4.

3.3. Step 2. QSAR This step refers to the use of any (Q)SAR-based tool for predicting
Predictions the physicochemical, biological, or environmental property of
interest. The use of such tools is distinguished from the use of
those in step 3 in that the predictive algorithms are predefined,
and thus should be reproducible. These tools are typically user-
friendly and as such suitable for use by nonspecialists. Generally,
it is sufficient to enter the chemical structure of interest, by drawing
the molecule with an editor, or more easily by using a SMILES code
or a mol file. The challenge in using such tools is typically in the
interpretation of the result. For example, in the case of genotoxicity
prediction, what exactly is a positive prediction? The underlying
dataset and evaluation criteria should be checked, but these are not
always readily available or completely transparent, especially in the
case of commercial tools.

3.4. Step 3. Grouping In contrast with the use of (Q)SAR tools, the grouping and read-
and Read-Across across approach is a more ad hoc approach involving a range of
subjective choices in terms of categorization tools, similarity
metrics, datasets for the retrieval of analogues, and criteria for
analogue selection. A broad chemical and toxicological expertise
is needed to apply this approach. Consequently, the approach is
unlikely to be reproducible, unless all of the expert choices are
clearly documented and retraced. A detailed explanation of the
category and read-across approach is given by Enoch (25). Various
tools can be used to assist grouping and read-across, including the
freely available Toxmatch (http://ihcp.jrc.ec.europa.eu/our_labs/
computational_toxicology/qsar_tools/toxmatch) (26), AMBIT
(http://ambit.sourceforge.net/), and the OECD QSAR Toolbox
(http://www.qsartoolbox.org/).
For the purposes of the case studies reported here, the OECD Toolbox was used. The following paragraphs provide some considerations relevant to the use of this tool when grouping chemicals according to their potential genotoxicity (DNA-binding potential).
To build a category, the first step is to profile the chemical under study using the appropriate general mechanistic and endpoint-specific profilers available in the tool. A crucial element in this approach is the definition of similarity, i.e., the choice of similarity metric, which is to some extent subjective. In the OECD QSAR Toolbox, the following profilers are available for the categorization of potentially genotoxic chemicals:
- General mechanistic profilers:
  - "DNA binding by OASIS" (based on the Ames mutagenicity models in the TIMES software).
  - "DNA binding by OECD" (60 structural alerts, based on organic chemistry mechanisms, for the identification of DNA-binding chemicals).

- Endpoint-specific profilers:
  - "Micronucleus alerts by Benigni–Bossa" (based on the ToxMic rulebase of the Toxtree software).
  - "Mutagenicity/Carcinogenicity alerts by Benigni–Bossa" (based on the corresponding module of the Toxtree software).
Once the compound has been profiled using the relevant mechanistic and endpoint-specific profilers, one has to consider various possibilities for building a category around the profiling results obtained. The following general approach is recommended (a minimal sketch of the decision logic is given after the list):
1. If all of the profilers agree on a single mechanism, one has high
confidence that the mechanism is correct. It is therefore rea-
sonable to build a category around this mechanism.
2. If conflicting results are obtained (i.e., multiple mechanisms),
one should be more cautious about the resulting category and
look into the suggested mechanisms: if one of the suggested
mechanisms (from the mechanistic profilers) is supported by an
alert from the endpoint-specific profilers, this provides confi-
dence about the choice of mechanism for category building.
3. Finally, if the mechanistic profilers give conflicting results, indicating multiple mechanisms, and the endpoint-specific profilers cannot help to resolve which of the mechanisms is more likely, one is left with the option of building separate categories relating to the different mechanisms. For example, if a substance is both an aromatic amine and an aldehyde (e.g., an aminophenylacetaldehyde), it might interact with DNA both via nitrenium ion formation (due to the amino group) and via Schiff base formation (due to the aldehyde moiety).
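The three recommendations can be condensed into a small decision function. The sketch below is our own reading of the recommended approach, not an OECD Toolbox API; the mechanism names are illustrative.

    # Sketch: choose a category-building strategy from profiler output.
    def category_strategy(mechanistic: set, endpoint_specific: set) -> str:
        """mechanistic: mechanisms flagged by general mechanistic profilers;
        endpoint_specific: mechanisms supported by endpoint-specific alerts."""
        if len(mechanistic) == 1:                    # rule 1: full agreement
            return "one category: " + next(iter(mechanistic))
        supported = mechanistic & endpoint_specific  # rule 2: endpoint support
        if len(supported) == 1:
            return "one category (endpoint-supported): " + next(iter(supported))
        return "separate categories: " + ", ".join(sorted(mechanistic))  # rule 3

    print(category_strategy({"nitrenium ion formation", "Schiff base formation"}, set()))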

3.5. Step 4. Simulation of Metabolism

One of the shortcomings of the QSAR and grouping approaches is that metabolic and environmental fate are generally not taken into account, at least not explicitly (although this information may be implicitly encoded in the underlying data, for example Ames test results in the presence of metabolic activation [S9 mix]). Thus, it is often desirable, perhaps even essential, to perform a computational assessment of reactivity and fate, both in the environment and in biological organisms. In general, this is more difficult than toxicity prediction, but a wide range of software tools have been developed and applied, especially in the pharmaceutical sector (27).
Two of the most commonly used commercial tools are Meteor and HazardExpert, described in Subheading 2. In the public domain, the Chemical Reactivity and Fate Tool (CRAFT) and the Metabolic Information Input System (METIS) have been developed under contract to the European Commission's JRC by Molecular Networks GmbH (Erlangen, Germany). CRAFT was designed to simulate metabolic and abiotic transformation pathways in a variety of biological species and environmental media. METIS was designed to extend the functionalities of CRAFT but also to act as a stand-alone reaction editor. It allows chemical reactions and metabolic pathways to be drawn, edited, and annotated. This information can be stored, viewed, and exchanged with other software tools. CRAFT and METIS are freely downloadable from the Molecular Networks website (http://www.molecular-networks.com/products/). Also in the public domain is the metabolism simulator incorporated in the OECD QSAR Toolbox (http://www.qsartoolbox.org/).
For the purposes of the case studies presented here, Meteor and
the OECD Toolbox simulator were used.

3.6. Step 5. Overall Assessment

An evaluation of the adequacy of the data obtained at each step is carried out, and the need for further information is determined. When all relevant steps have been applied, an overall assessment is carried out, weighing all of the available information. Again, this is a rather subjective process, and there are no widely accepted guidelines on how to perform this weight-of-evidence (WoE) assessment in a consistent manner. One possible way of tabulating the evidence before the expert judgement is sketched below.
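Although no algorithm is prescribed by the chapter, in practice it can help to tabulate the outcome of each step before forming the expert judgement. The sketch below is purely illustrative; the step labels and outcomes are invented, and the final weighing remains an expert decision.

    # Sketch: tally the per-step outcomes ahead of the expert WoE judgement.
    from collections import Counter

    evidence = {
        "step 1 (existing data)": "positive",
        "step 2 (QSAR tools)":    "positive",
        "step 3 (read-across)":   "positive",
        "step 4 (metabolism)":    "equivocal",
    }
    print(Counter(evidence.values()).most_common())  # expert review still decides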

4. Case Studies on the Application of Computational Methods

4.1. Case Study 1: 2-Amino-9H-pyrido(2,3-b)indole

The amino-α-carboline, AaC, is a heterocyclic aromatic amine (HAA) formed during ordinary cooking and also present in cigarette smoke condensate and in diesel exhaust particles (28). AaC and other HAAs are mutagens and carcinogens formed as pyrolysis products during the cooking of protein-rich food such as meats and fish. The structural identifiers of AaC are summarized in Table 1.

4.1.1. Step 1. Existing Data and Information

Mechanistic Knowledge. The link between AaC and its mutagenic activity is well documented in the literature. The bioactivation of AaC is mediated primarily by cytochrome P450 1A2 in both rodents and humans, leading to N-oxidation of the exocyclic amine group to form 2-hydroxyamino-9H-pyrido[2,3-b]indole (HONH-AaC). The HONH-AaC metabolite can then undergo further activation by phase II enzymes (O-acetyltransferases and sulfotransferases) to form the penultimate ester species, which bind to DNA (29, 30). Generally, it is proposed that the aryl nitrenium ions of HAAs form adducts by covalent modification of bases in DNA. To date, the guanine adducts of several HAAs (MeAaC, AaC, Trp-P-2, Glu-P-1, MeIQ, DiMeIQx, and PhIP) have been characterized (28). The sole DNA adduct of AaC identified thus far is N-(deoxyguanosin-8-yl)-2-amino-9H-pyrido[2,3-b]indole (dG-C8-AaC) (31, 32).

Table 1
Identifiers for case study 1: AaC

Chemical name: 2-Amino-9H-pyrido(2,3-b)indole (AaC)
CAS No: 26148-68-5
SMILES: c12-c3c(cccc3)Nc1nc(N)cc2
Structural formula: (tricyclic ring system bearing an exocyclic H2N group; structure drawn in the original)

Databases. A search for existing genotoxicity data for the query compound (by CAS number) was conducted using freely available, Internet-based resources such as the online eChemPortal database (http://www.echemportal.org/) and the downloadable OECD QSAR Toolbox. The eChemPortal search resulted in the retrieval of data for the query substance from the following participant databases: HSDB (Hazardous Substances Data Bank), INCHEM (Chemical Safety Information from Intergovernmental Organizations), ACToR (US EPA Aggregated Computational Toxicology Resource), and the US EPA SRS (US EPA Substance Registry Services). Among these, only ACToR provided human genotoxicity information for AaC: the compound was positive in the sister chromatid exchange (SCE) in vitro assay (cell type unspecified; data from NLM TOXNET GENE-TOX). In addition, data in cultured Chinese hamster lung cells were retrieved from INCHEM, indicating that AaC induces mutations, chromosomal aberrations, and polyploidy.
From the Internet resources, no data were available on the
carcinogenicity of AaC to humans. However, the HSDB states
that the collected animal data are sufficient to conclude that the
compound is a possible carcinogen in humans. The carcinogenic
potency of AaC is recorded in the ISSCAN and CPDB databases
(http://potency.berkeley.edu/cpdb.html) with a mouse TD50
value equal to 49.8 mg/kg/day.
Information on human exposure is also provided in the above-
mentioned Internet resources. For example, in the HSDB it is stated
that the general population may be exposed to AaC via inhalation of
cigarette smoke or ingestion of food cooked at high temperatures.

More specific genotoxicity information could be gathered using the OECD QSAR Toolbox (version 2.0, October 2010), through the available mutagenicity/genotoxicity and carcinogenicity databases (CPDB, ISSCAN, ISSMIC, Genotoxicity OASIS, Micronucleus OASIS, Toxicity Japan MHLW) and the specified endpoint "genetic toxicity". The search allowed a number of different experimental data to be extracted, i.e., nonhuman in vitro data, which are summarized in Table 2 and show a positive outcome in all cases. No in vivo data were available in the Toolbox.
Conclusion Step 1. In summary, a search for genotoxicity data for the query compound in freely available Internet databases resulted in the collation of in vitro genotoxicity data in both bacterial and mammalian systems, but no in vivo data could be retrieved. The collected data are consistent and indicate that the genotoxicity of AaC is expressed through both mutagenicity and chromosome aberration induction.

4.1.2. Step 2. QSAR Predictions

Predictions of genotoxicity for AaC using different (Q)SAR software tools are summarized in Table 3. The majority of the software tools applied resulted in positive predictions for mutagenicity (Ames test), in addition to other genotoxicity endpoints, i.e., the rodent in vivo MN test (Toxtree), mammalian in vitro chromosome damage (DfW), and carcinogenicity (DfW, OncoLogic, and Toxtree). The predictions obtained were positive with all software except one (Lazar) and thus, overall, they were in concordance with the available experimental data.
Some of the software tools provide additional information on
the (Q)SAR analysis performed such as statistical performance of
the model, inclusion of the query compound in the model training
set, and toxicity predictions for chemicals with a similar structure to
the query compound. This information is useful when evaluating
the reliability of the prediction made for the query compound.
CAESAR. In the case of the CAESAR software, five compounds similar to AaC (considering the two-dimensional structure) were retrieved from the model database, and all were correctly predicted as mutagens (data not shown). This result increases confidence that the prediction made for AaC is also reliable.
For expert systems such as DfW and Toxtree, the applicability domain is defined by the so-called mitigating factors or restrictions given for each structural alert (SA), while the validation comments provided in the DfW software and in the Toxtree manual, respectively, document the predictive performance of each SA fired against specified internal and/or external test sets. The DfW and Toxtree expert systems search their respective rulebases for SAs matching the query chemical. When evaluating in vitro mutagenicity (Ames test) and carcinogenicity for AaC, the prediction and reasoning of these systems ascribe the genotoxic carcinogenicity of this compound to the presence of an aromatic amine that is part of a polycyclic aromatic ring system (see Table 3 for details of the alerts).

Table 2
Experimental genotoxicity data for AaC in the OECD QSAR Toolbox

Endpoint | Observed outcome | Test type/study title | Test organism/cells | Metabolic activation | Database | Author/source
Gene mutation (Ames test) | Positive | Bacterial reverse mutation | Salmonella typhimurium | (–) | ISSCAN | Benigni and Bossa (33)
Gene mutation (Ames test) | Positive | Bacterial reverse mutation | Salmonella typhimurium | (–) | Genotoxicity OASIS | (–)
Gene mutation (Ames test) | Positive | Bacterial reverse mutation | Salmonella typhimurium | (–) | Genotoxicity OASIS | Kazius et al. (42)a
Chromosome aberration | Positive | In vitro mammalian chromosome aberration | (–) | Without S9 | Genotoxicity OASIS | Kirkland et al. (43)b
Chromosome aberration | Positive | In vitro mammalian chromosome aberration | Rat | With S9 | Genotoxicity OASIS | LSIC Japan-Danish EPA Inventory
Chromosome aberration | Positive | In vitro mammalian chromosome aberration | Chinese hamster lung cells | Without S9 | Genotoxicity OASIS | LSIC Japan-Danish EPA Inventory

(–) Unspecified data
a In Kazius et al. (42), a number of toxicophores for mutagenicity prediction are identified on the basis of Ames test data obtained from the Chemical Carcinogenesis Research Information System (available through TOXNET at http://toxnet.nlm.nih.gov)
b The volume number cited in the OECD (Q)SAR Toolbox (version 2.0, 2010) is not correct; the correct citation is (43). Also reported in the literature is the type of cells (cultured Chinese hamster fibroblasts) used for the chromosomal aberration test (44)

Table 3
Predictions of genotoxicity for AaC using different (Q)SAR software tools

Software | Module | Prediction
CAESAR (http://www.caesar-project.eu/) | Mutagenicity (Ames test) | Active
DfW (Lhasa Ltd.) (http://www.lhasalimited.org) | Mutagenicity (Ames test) | Plausible (SA354)
DfW (Lhasa Ltd.) (http://www.lhasalimited.org) | Carcinogenicity (mammalian) | Plausible (SA589)
DfW (Lhasa Ltd.) (http://www.lhasalimited.org) | Chromosome damage (mammalian, in vitro) | Probable (SA519)
Lazar (http://lazar.in-silico.de) | Mutagenicity (Ames test) | Inactive (0.00963052)a
OncoLogic (http://www.epa.gov/oppt/newchems/tools/oncologic.htm) | Genotoxic carcinogenicity | Moderate
TOPKAT (Accelrys) (http://www.accelrys.com) | Mutagenicity (Ames test) | Active (DS = 20.5)
Toxtree (https://sourceforge.net/projects/toxtree/) | Benigni–Bossa carcinogenicity and mutagenicity | Active for genotoxic carcinogenicity (SA19 and SA28). No SAs triggered for nongenotoxic carcinogenicity
Toxtree (https://sourceforge.net/projects/toxtree/) | Benigni–Bossa MN (rodent, in vivo) | Active (SA19 and SA28)

Toxtree (v. 2.1.0). Rulebases: Benigni–Bossa rulebase for carcinogenicity and mutagenicity (BB_CM); Benigni–Bossa rulebase for the rodent in vivo micronucleus assay (BB_MN). Structural alert (SA) codes: SA19, heterocyclic polycyclic aromatic hydrocarbons; SA28, primary aromatic amine, hydroxyl amine, and its derived esters (with restrictions)
Derek for Windows (DfW) v. 12.0. Structural alert codes: SA354, aromatic amine or amide; SA589, aromatic amine or amide; SA519, aminocarbazole analogue. The number of SAs available in the DfW 12.0 knowledge base for each of the endpoints considered is as follows: mutagenicity, bacteria (92 SAs); chromosome damage, mammalian, in vitro (79 SAs); carcinogenicity, mammalian (66 SAs)
CAESAR QSAR model for mutagenicity, version 1.0
TOPKAT 6.2 QSAR model for Ames mutagenicity, version 3.1. DS, discriminant score
Lazar/mutagenicity, Salmonella typhimurium (CPDB)
a A confidence level <0.025 indicates that the prediction is unreliable

In particular, the Benigni–Bossa rulebase (33) in Toxtree indicates the requirement for a primary aromatic amine, while the heteroaromatic ring system of AaC triggers an additional SA. This feature is linked to the formation of a reactive nitrenium ion that can bind directly to DNA and proteins via electrophilic attack. Moreover, AaC also contains an electronegative pyridine N atom ortho to the aromatic amino group, which may stabilize the nitrenium ion, thereby promoting the mechanism of DNA binding (genotoxicity).

Thus, the genotoxic activity of AaC and related compounds [i.e., 2-amino-pyrido(2,3-b)indoles and 2-amino-pyrido(2,3-b)imidazoles] may be higher than that of other (primary) aromatic amines with(out) a polycyclic aromatic ring system.
DfW. This provides an SA for 2- and 3-aminocarbazole analogues in the in vitro chromosome aberration test, as several HAAs are positive in this test. 2-Amino-pyrido(2,3-b)indoles such as AaC, MeAaC, Trp-P-1, and Trp-P-2, and 2-amino-pyrido(2,3-b)imidazoles such as Glu-P-1, all trigger this SA.
Lazar. This generates predictions from the experimental results of compounds with similar structures (neighbors) in a specified dataset and associates a confidence value with the prediction, based on the representation of the structural features in the dataset. For AaC, the mutagenicity prediction is negative (inactive), but the associated confidence level is low (see Table 3), indicating that this outcome is unreliable owing to unknown or infrequent features of the query compound in the reference dataset.
OncoLogic. This predicted a moderate level of carcinogenicity concern for the query compound and provided reasoning similar to that of the DfW and Toxtree expert systems. In detail, the following justification was given:
In general, the level of carcinogenicity concern of an aromatic amine is
determined by considering the number of rings; the presence or absence
of heteroatoms in the rings; the number and position of amino groups;
the nature, number, and position of other nitrogen-containing amine-
generating groups; and the type, number, and position of additional
substituents.
Aromatic amines are expected to be metabolized to N-hydroxy-
lated/N-acetylated derivatives which are subject to further bioactiva-
tion, producing electrophilic reactive intermediates that are capable of
interaction with cellular nucleophiles (such as DNA) to initiate carcino-
genesis.
A large number of mutagenic heteroaromatic amines occur at low levels in cooked foods. They are the pyrolysis products (pyrolysates) of amino acids, proteins, or other components in meats and fish heated to temperatures in excess of 300 °C. Chemically, they have been identified as congeners of amino-carboline, amino-imidazo-quinoline, amino-imidazo-quinoxaline, amino-pyridine, and other amino heterocyclics. Some of these heterocyclic amines are also present in cigarette smoke condensates. The SAR of various mutagenic heteroaromatic amines has been extensively studied. All of the ten mutagenic heteroaromatic amines that have been tested for carcinogenicity gave positive results in rodents; one was also shown to be carcinogenic in nonhuman primates.
A number of other heterocyclic amines (e.g., amino-phenazines,
amino-pteridines, amino-thiophenes, and amino-thiazoles) that are
associated with pesticides, drugs, car exhaust, and other environmental
sources, have been tested for mutagenicity and/or carcinogenicity.
Like homocyclic aromatic amines, heteroaromatic amines are
mainly metabolized by cytochrome P450 isozymes to yield hydroxya-
mino derivatives that are further activated to reactive electrophiles

through O-acetylation and O-sulfation. The reactive electrophiles then interact with nuclear DNA to initiate carcinogenesis.
There is experimental evidence that the compound is carcinogenic.
The concern level derived for this compound is based on test data and
SAR consideration.

TOPKAT. When this tool was applied to AaC, the query chemical was flagged as being in the model training set. In addition, 424 compounds similar to AaC were found in the model training set (data not shown). Also for HONH-AaC, 424 similar compounds were found in the model database (including AaC itself).
Conclusion Step 2. The QSAR-based prediction of the genotoxic potential of AaC seems reliable and is in agreement with the available experimental data.

4.1.3. Step 3. Grouping and Read-Across

The OECD QSAR Toolbox was used to profile AaC in terms of the available mechanistic and endpoint-specific profilers relevant to genotoxicity, these being the mechanistic schemes "DNA binding by OASIS" and "DNA binding by OECD" and the endpoint-specific "Micronucleus alerts by Benigni–Bossa (ToxMic)" and "Mutagenicity/Carcinogenicity alerts by Benigni–Bossa".
AaC contains the aromatic amine substructure and was therefore profiled under the "aromatic amine" category of the DNA binding scheme. This category includes primary and secondary (hetero)aromatic amines, unless they bear structural features which have been identified to prevent covalent DNA binding (33). This chemical class falls within the SN1 mechanistic domain, leading to DNA adducts via nitrenium ion formation (34), in agreement with the Benigni–Bossa scheme (33). Likewise, AaC satisfies the requirements listed under the "Amines" category definition of the OECD DNA binding scheme, which is linked to the same mechanism of action. When profiled using the endpoint-specific schemes, the query compound triggered the following two structural alerts, which are common to the ToxMic and Benigni–Bossa schemes used:
1. Heterocyclic polycyclic aromatic hydrocarbons.
2. Primary aromatic amine, hydroxyl amine, and its derived esters.
Conclusion Step 3: It is possible, for example by using the QSAR Toolbox, to build chemical categories for specific mechanisms of DNA binding. For example, if the category is based on nitrenium ion formation, AaC would be predicted as genotoxic by read-across (a minimal sketch of this inference is given below). A full illustration of this process is given in the next case study.
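A minimal sketch of the read-across inference itself (ours, not Toolbox code) is a majority call over the experimental results of the category members; the example calls are illustrative only.

    # Sketch: read-across as a majority call over category members.
    def read_across(analogue_calls: list) -> str:
        pos = analogue_calls.count("positive")
        neg = analogue_calls.count("negative")
        if pos > neg:
            return "positive"
        return "negative" if neg > pos else "equivocal"

    # e.g., a nitrenium-ion category built around AaC (illustrative calls):
    print(read_across(["positive", "positive", "positive", "negative"]))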

4.1.4. Step 4. Simulation of Metabolism

OECD QSAR Toolbox metabolism simulator. No data on observed metabolism could be retrieved from the QSAR Toolbox. On the other hand, the liver metabolism simulator of the Toolbox produced nine metabolites of AaC (data not shown). One of these is the documented product of P450-mediated N-oxidation, HONH-AaC.

Fig. 1. Use of SMARTCyp (v. 1.5.3) to predict the sites of metabolism in AaC.

Table 4
SMARTCyp ranking of AaC sites liable to cytochrome P450-mediated metabolism

Rank | Atom | Score | Energy | Accessibility
1 | N.12 | 46.1 | 54.1 | 1
2 | C.4 | 61.34 | 68.2 | 0.86
3 | C.5 | 66.1 | 74.1 | 1
4 | C.7 | 67.24 | 74.1 | 0.86
5 | C.14 | 68.39 | 74.1 | 0.71
6 | C.6 | 69.2 | 77.2 | 1
7 | N.10 | 69.89 | 75.6 | 0.71
8 | C.13 | 79.44 | 86.3 | 0.86
9 | C.11 | – | – | 0.86
10 | C.2 | – | – | 0.71
11 | C.3 | – | – | 0.71
12 | C.1 | – | – | 0.57
13 | N.8 | – | – | 0.57
14 | C.9 | – | – | 0.57

Rank 1 refers to the atomic site most liable to cytochrome P450-mediated metabolism
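The scores in Table 4 are consistent with the scoring rule published for SMARTCyp, Score = Energy − 8 × Accessibility; the following check reproduces the eight ranked scores for which both terms are reported (to within rounding):

    # Check: reported Score vs. Energy - 8 * Accessibility for Table 4.
    rows = [("N.12", 46.10, 54.1, 1.00), ("C.4", 61.34, 68.2, 0.86),
            ("C.5", 66.10, 74.1, 1.00), ("C.7", 67.24, 74.1, 0.86),
            ("C.14", 68.39, 74.1, 0.71), ("C.6", 69.20, 77.2, 1.00),
            ("N.10", 69.89, 75.6, 0.71), ("C.13", 79.44, 86.3, 0.86)]
    for atom, score, energy, accessibility in rows:
        assert abs((energy - 8 * accessibility) - score) < 0.05, atom
    print("all Table 4 scores reproduced")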

The other metabolites bear one or more hydroxyl groups at various positions in the fused ring system of AaC. All the predicted metabolites were profiled in the same manner as the parent compound, and gave the same genotoxicity profile as AaC (data not shown).
SMARTCyp metabolism simulator. The sites of P450-mediated
metabolism were predicted using the SMARTCyp Web server
(freely accessible at http://www.farma.ku.dk/smartcyp/). As
shown in Fig. 1, the primary site of metabolism for the query
compound was the amino N atom (ranked 1 in Table 4), followed
by the C4 and C5 atoms in the indole ring (ranked 2 and 3,
respectively; Table 4).

Fig. 2. Simulation of metabolic transformation of AaC by Meteor.

The SMARTCyp Web server prediction is consistent with the OECD Toolbox-generated metabolites. In fact, HONH-AaC corresponds to the SMARTCyp rank-1 site of metabolism; five of the nine OECD Toolbox-generated metabolites bear an OH group at the C5 position and two of the nine bear an OH group at the C4 position, while only two of the nine involve sites outside the first three ranked sites.
Meteor. In addition to phase I metabolism, Meteor v. 12.0.0 (Lhasa Ltd, UK) was used to simulate phase II biotransformations. When used to simulate the mammalian metabolic fate of AaC, Meteor identified the metabolites of the query compound (Q) shown in Fig. 2 and Table 5. Of these, HONH-AaC (M1) and M3 are produced by phase I and phase II biotransformations, respectively, and are consistent with the metabolites observed in rodent and human cells (see above).
Finally, the active (DNA-binding) metabolite of AaC, HONH-
AaC, was subjected to the same computational toxicology analysis
as illustrated for its parent compound (data not shown). All soft-
ware tools considered, with the exception of Lazar, predicted
HONH-AaC as potentially mutagenic in the same manner as for
AaC.
Conclusion Step 4. At least some of the metabolites of AaC are
predicted to be genotoxic. This probably will not affect the overall
conclusion of the risk assessment process since the parent com-
pound is considered genotoxic and any risk management measures
will be based on this.

Table 5
Simulation of metabolic transformation of AaC by Meteor

Parent | Metabolite | Biotransformation name | Phase | Enzyme | Log P
Q | M1 | Hydroxylamines from primary aromatic amines | Phase I | CYP450 | 2.194
Q | M2 | Hydroxylation of fused benzenes | Phase I | CYP450 | 1.804
M1 | M3 | O-Sulfation of N-hydroxy compounds | Phase II | SULT | 0.37
M2 | M4 | Glucuronidation of aromatic alcohols | Phase II | UGT | 0.256
M2 | M5 | Oxidation of 5-hydroxyindole derivatives | Phase II | CYP450/peroxidase, GST, GT, peptidase, NAT | 1.07

4.1.5. General Conclusions for 2-Amino-9H-pyrido[2,3-b]indole

In summary, existing toxicological data indicate that AaC is mutagenic in vitro, including in the Ames test, the chromosomal aberration assay in cultured mammalian cells, and the SCE assay in human cells. Rodent carcinogenicity data are also available. The in vitro mutagenicity of AaC was reliably predicted by most of the (Q)SAR and expert systems applied, while the active metabolite of AaC was identified by the metabolism simulators. Thus, the totality of evidence indicates the genotoxic potential of AaC. However, further information would be needed to assess the carcinogenic potential of the compound by non-genotoxic mechanisms.

4.2. Case Study 2: 2-Aminoacetophenone

2-AAP is a substituted aromatic amine (see Table 6 for chemical identifiers). It is a naturally occurring substance in the following food items: beer, corn tortillas, sweet corn, and milk (35). Quantitative data on its natural occurrence have been reported for two of these food items (36): up to 0.15 mg/kg in corn tortillas and up to 0.4 mg/kg in sweet corn.
2-AAP is also a flavoring substance for which a risk assessment needs to be conducted prior to its possible authorization and inclusion in a positive list according to Commission Regulation (EC) No 1565/2000 (9, 37). The EU procedure for establishing the list of authorized flavoring substances is laid down by Regulation (EC) No 2232/96 of the European Parliament and the Council (8); the need to evaluate genotoxic potential is part of this procedure (38).

Table 6
Identifiers for case study 2: 2-aminoacetophenone

Structural formula: (2-aminophenyl methyl ketone; structure drawn in the original)
Chemical name (IUPAC): 1-(2-aminophenyl)ethanone
Chemical name (CA): 2-aminoacetophenone
CAS No: 551-93-9
SMILES: O=C(C)c1ccccc1N
Molecular formula: C8H9NO
Molecular mass: 135.1632 g/mol


4.2.1. Step 1. Existing Data and Information

Mechanistic Knowledge. The mechanism of action of 2-AAP is generally considered to involve N-hydroxylation, typically mediated by cytochrome P450 1A2, and subsequent O-esterification (38). The resulting esterified product may then give rise to a reactive nitrenium ion that is capable of binding covalently to cellular nucleophiles such as DNA. Formation of the nitrenium ion is facilitated by delocalization of the positive charge onto the aromatic ring, and this resonance interaction is promoted by electron-donating substituents. It has also been proposed that nitrenium ions can be stabilized by neighboring nitrogen, phosphorus, oxygen, and sulfur atoms, as their lone pairs can overlap strongly with the vacant 2p orbital on nitrogen. Larger substituents in the ortho position are likely to sterically inhibit N-hydroxylation and hence lead to a reduction in, or absence of, activity.

Databases. Table 7 shows the results of retrieving data for 2-AAP from publicly available databases. The chemical was found in only two of them: the Danish QSAR database and ESIS (the EU Existing Chemicals List, EINECS). However, there were no data in ESIS, and in the Danish database all data are predicted rather than experimental.

Table 7
Summary of the results of searching publicly available databases for data on 2-aminoacetophenone

Database (name and link) | Information
Benchmark Data Set for In Silico Prediction of Ames Mutagenicity (http://ml.cs.tu-berlin.de/toxbenchmark/) | No
Carcinogenic Potency Database (CPDB) (http://potency.berkeley.edu/cpdb.html) | No
Danish QSAR database (http://130.226.165.14/index.html) | Yes (predicted data only). In vitro mutagenicity: negative, exclusion; mouse lymphoma: positive (out of domain); HGPRT: equivocal. In vivo mutagenicity: negative, exclusion; COMET: equivocal
DSSTox (Distributed Structure-searchable Toxicity) database (www.epa.gov/ncct/dsstox) | No
European Chemical Substances Information System (ESIS), freely accessible from the JRC ex-ECB website (http://ecb.jrc.ec.europa.eu/esis/) | Chemical included, but no experimental data
Existing Chemicals Examination (EXCHEM) database (Japan) (http://dra4.nihs.go.jp/mhlw_data/jsp/SearchPageENG.jsp) | No
Istituto Superiore di Sanità database (ISSCAN) (http://www.iss.it/ampp/dati/cont.php?id=233&lang=1&tipo=7) | No
National Toxicology Program (NTP) database (http://ntp.niehs.nih.gov) | No
Toxicity Reference Database (ToxRefDB) (http://www.epa.gov/ncct/toxrefdb/) | No
TOXNET database of the National Library of Medicine (NLM), including the Chemical Carcinogenesis Research Information System (CCRIS) and the Genetic Toxicology Data Bank (GENE-TOX) (http://toxnet.nlm.nih.gov/) | No

Table 8 shows the experimental in vitro genotoxicity data collected by EFSA (35) (http://www.efsa.europa.eu/en/scdocs/doc/797.pdf). No in vivo mutagenicity/genotoxicity data were found.

Table 8
Experimental data for the in vitro genotoxicity of 2-aminoacetophenone

Test system | Test object | Concentration | Results | Reference
Reverse mutation | Salmonella typhimurium TA98, TA100, TA1535, TA1537, TA1538 | 500–5,000 µg/ml | Negative (with and without S9) | Bowden et al. (45)
Reverse mutation | Salmonella typhimurium G46, C3076, D3052, TA98, TA100, TA1535, TA1537, TA1538 | <1,000 µg/ml | Negative (with and without S9) | Thompson et al. (46)
DNA repair | Escherichia coli WP2 | <1,000 µg/ml | Negative (with and without S9) | Thompson et al. (46)
Unscheduled DNA synthesis | Rat hepatocytes | 0.07–136 µg/ml (0.5–1,000 nmol/ml) | Negative (with and without S9) | Thompson et al. (46)

Conclusion Step 1. There are no experimental data on the in vivo genotoxic potential of 2-AAP, and the in vitro genotoxicity data are limited. Based on these findings, it could be concluded that 2-AAP is probably not genotoxic (at least in vitro).

4.2.2. Step 2. QSAR Predictions

When analyzing the predictions of the eight software tools (Table 9), mixed messages are obtained. Five of the software tools predicted 2-AAP to be non-active, two predicted it to be active, and one predicted low–moderate activity. Clearly, it is difficult to draw a conclusion based only on these predictions. Thus, the next step is to analyze any additional information that the different software tools provide.
HazardExpert. This software tool does not provide additional
information.
CAESAR. This predicts the chemical as active, and also provides experimental data and mutagenicity predictions for analogues identified by the software (Table 10).
Although the chemical is predicted as active, five of the six analogues are non-mutagenic based on the experimental data. Only 4′-(hydroxyamino)acetophenone (the fourth chemical in the table) is mutagenic based on the experimental data, but this chemical is a hydroxylamine rather than an amine. All six analogues are predicted as non-active. Based on this comparative analysis, it could be concluded that the predicted mutagenicity for 2-AAP is not reliable.

Table 9
Predictions and additional information provided by software tools for the genotoxic potential of 2-aminoacetophenone

Software | Prediction | Additional information
CAESAR (http://www.caesar-project.eu/) | Active | Provides information on analogues: 5 out of 6 are not active
DfW (http://www.lhasalimited.org) | Nothing to report | There are 4 alerts covering aromatic amines and amides. A lot of additional information is provided
HazardExpert (http://www.compudrug.com) | Highly probable to be active | No additional information provided
Lazar (http://lazar.in-silico.de) | Non-active | Provides information on analogues: 6 out of 11 analogues are active. However, these can be excluded on the following grounds: 2 of them are not amines, 3 contain electron-donating groups (opposite to the query compound), and the other (2,6-diaminobenzoic acid) is a diamine
OncoLogic (http://www.epa.gov/oppt/newchems/tools/oncologic.htm) | Low–moderate | Detailed analysis provided
TOPKAT (http://www.accelrys.com) | Non-active | Compound was found in the database. Experimental result: negative
ToxBoxes | Non-active | Probability of a positive Ames test = 0.274. Provides information on analogues: 4 out of the 5 similar chemicals provided are active
Toxtree (https://sourceforge.net/projects/toxtree/) | Non-active | Provides information on the SA triggered: active based on SA28 (primary aromatic amines, hydroxyl amines, and their derived esters). A QSAR is also applied, according to which the compound is not active

ToxBoxes. In the case of the ToxBoxes prediction, the situation is almost the same as for CAESAR, but in the opposite direction (Table 11). Although the prediction is non-active (the probability of a positive Ames test result is 0.274, which is close to the "equivocal" threshold of 0.3), four out of five similar chemicals are positive. Thus the reliability of the prediction can be regarded as low.
Lazar. This software has two models for mutagenicity, based on different training sets. The analysis presented here is based on the model derived from the Kazius/Bursi database. Lazar predicts the chemical as non-active and reports this prediction as unreliable (low confidence of <0.025). Six out of the 11 analogues are active (Table 12).

Table 10
CAESAR analogues of 2-aminoacetophenone and their experimental and predicted activity

Analogue (SMILES) | Similarity | Experimental class | Predicted class
Nc1ccccc1CC | 0.892 | Non-mutagen | Non-mutagen
O=C(c1ccccc1)C | 0.884 | Non-mutagen | Non-mutagen
O=C(O)c1ccccc1N | 0.877 | Non-mutagen | Non-mutagen
O=C(c1ccc(cc1)NO)C | 0.861 | Mutagen | Non-mutagen
O=C(N)c1ccccc1C | 0.861 | Non-mutagen | Non-mutagen
O=C(OC)c1ccccc1N | 0.857 | Non-mutagen | Non-mutagen

Table 11
ToxBoxes analogues of 2-aminoacetophenone and their experimental activity

Analogue (SMILES) | Experimental class
CC1C(N)CCCC1 | Inconclusive
CC1C(N)C2C(CC1)C(O)C1C(CCCC1)C2O | Mutagen
C(C)(O)c1c(N)ccc(Cc2cc(C(C)O)c(N)cc2)c1 | Mutagen
CC1CCCC(N)C1C | Mutagen
CC1CC(C)C(N)CC1 | Mutagen

On closer inspection, it can be seen that two of these active analogues (the 7th and 11th in Table 12) are not actually amines and should therefore be disregarded. Three of the remaining active chemicals contain electron-donating groups (in contrast to the query compound, which contains an electron-withdrawing group ortho to the amino group), and the other (2,6-diaminobenzoic acid) is a diamine. Thus, after excluding these active chemicals from the category, the prediction of non-active seems reasonable, especially

Table 12
Lazar analogues of 2-aminoacetophenone and their experimental activity

Analogue (SMILES) | Similarity | Experimental class
NC1C(CCCC1)C(=O)O | 0.87 | Non-mutagen
CCC1C(CCCC1)N | 0.82 | Non-mutagen
NC1C(C(CCC1)N)C(=O)O | 0.81 | Mutagen
NC(=O)C1CCC(CC1)N | 0.80 | Non-mutagen
CCC1C(C(CCC1)CC)N | 0.79 | Mutagen
CC1C(CCCC1)N | 0.79 | Mutagen
CC1C(CCCC1NC=O)NC=O | 0.77 | Mutagen

(continued)

Table 12
(continued)

Analogue (SMILES) | Similarity | Experimental class
COC(=O)C1C(CCCC1)N | 0.76 | Non-mutagen
CC2NC1C(CCCC1)C2(C)C | 0.74 | Non-mutagen
CC1C(CC(CC1)NC=O)NC=O | 0.74 | Mutagen
CC1CC(C(C(C1)C)N)C | 0.73 | Mutagen
when the analysis is done in the light of the mechanism of interaction with DNA and the influence of the substituents on activity.
TOPKAT predicts 2-AAP as non-active. Moreover, the chemical was found in the training set of the model, where it is indicated to be non-active on the basis of experimental data. Thus, this prediction can be accepted as reliable.
DfW and OncoLogic give different predictions, but they provide almost the same explanation of the mechanism of interaction of aromatic amines and the influence of substituents. OncoLogic justifies its prediction as follows:
In general, the level of carcinogenicity concern of an aromatic amine is determined by considering the number of rings; the presence or absence of heteroatoms in the rings; the number and position of amino groups; the nature, number, and position of other N-containing amine-generating groups; and the type, number, and position of additional substituents. Aromatic amine compounds are expected to be metabolized to N-hydroxylated/N-acetylated derivatives which are subject to further bioactivation, producing electrophilic intermediates that are capable of interaction with cellular nucleophiles (such as DNA) to initiate carcinogenicity. An aromatic compound containing one benzene ring, one amino group, and one or no methyl, methoxy, ethyl, ethoxy groups, or halogens, has a carcinogenicity concern of low–moderate.
It is uncertain whether the substituent has any significant modifying effect on the carcinogenicity.

In other words, the OncoLogic prediction does not account for the influence of the carbonyl group, which may be significant.
DfW generates a "nothing to report" prediction, which means that the expert system simply does not find a known alert in the query compound. Although not strictly correct, for the purpose of this analysis the DfW prediction was interpreted as non-active, which is reasonable in view of the fact that DfW includes four structural alerts for aromatic amines and amides, and takes account of the influence of the substituent in the ortho position.
Conclusion Step 2: Based on the predictions and an analysis of the entire set of information provided by the software tools, the prediction of non-active seems reasonable, and this is consistent with the postulated mechanism of interaction with DNA.

4.2.3. Step 3. Grouping and Read-Across

As described above, the OECD Toolbox software provides several types of profilers which can help to define a chemical category: predefined, general mechanistic, endpoint-specific, and empirical. The genotoxicity endpoint is associated with (a) two general mechanistic profilers ("DNA binding by OECD" and "DNA binding by OASIS") and (b) three endpoint-specific profilers ("Micronucleus alerts by Benigni/Bossa", "Mutagenicity/Carcinogenicity alerts by Benigni/Bossa", and "OncoLogic Primary classification").
The first step is to apply the selected profilers to the query chemical. 2-AAP belongs to the "aromatic amine (primary or secondary)" and "nitrenium ion formation" categories according to the "DNA binding by OECD" profiler. It is not a binder according to the "DNA binding by OASIS" profiler. It belongs to "primary aromatic amine, hydroxyl amine, and its derived esters" according to the "Micronucleus alerts by Benigni/Bossa" and "Mutagenicity/Carcinogenicity alerts by Benigni/Bossa" profilers, and to "aromatic amine-type compounds" based on the "OncoLogic Primary classification". In other words, all profilers (except the OASIS profiler) classified the query compound into the same group: aromatic amines. This gives the user more confidence. By applying one of these profilers, for example "Mutagenicity/Carcinogenicity alerts by Benigni/Bossa", to the Toolbox databases connected with genotoxicity (CPDB, ISSCAN, OASIS genotoxicity database, ISSMIC, OASIS micronucleus database), 739 chemicals belonging to the same group were identified.

Chemicals that are not associated with genotoxicity data should be excluded during the group formation process. During the formation of a chemical group and the subsequent application of read-across, the quality and consistency of the data are also important. There are different sources of inconsistency: different endpoints, different species, and use (or not) of metabolic activation.
In this exercise, 144 of the 739 chemicals were found to be associated with Ames data (all Salmonella strains) without metabolic activation (without S9). Since this number of chemicals is too large for grouping, the next step was to refine the category (create a subcategory) by specifying the chemical space of the compounds. This is usually achieved by using profilers such as chemical elements, groups of elements, organic functional groups, or structural similarity. These secondary profilers are used to define a structural domain. Use of the organic functional group profiler, and exclusion of all chemicals containing groups different from those in the query chemical, resulted in just three analogues, all of which were inactive. On this basis, 2-AAP would be predicted as inactive.
The same result is obtained if structural similarity is used as the secondary profiler for subcategorization. For example, if the Dice distance with a threshold of more than 80% similarity is used, the number of chemicals is 14 and the predicted activity for the query chemical is non-active (a sketch of this filtering step is given below). This seems reasonable given that, of the 14 chemicals, only two are active (o-aminophenol and 2-amino-4-methylphenol), depending on the Salmonella strain.
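The subcategorization by similarity can be sketched as a simple threshold filter. The fragment below uses RDKit Morgan fingerprints as a stand-in for the Toolbox's internal descriptors, so the membership it produces will differ from the 14 chemicals reported above; the candidate set is illustrative.

    # Sketch: keep only candidates whose Dice similarity to the query is >= 0.8.
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def fp(smiles):
        return AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(smiles), 2, nBits=2048)

    query = fp("CC(=O)c1ccccc1N")  # 2-AAP
    candidates = {"o-aminophenol": "Nc1ccccc1O",
                  "2-amino-4-methylphenol": "Cc1ccc(O)c(N)c1",
                  "aniline": "Nc1ccccc1"}
    sims = {name: DataStructs.DiceSimilarity(query, fp(s))
            for name, s in candidates.items()}
    print({name: round(s, 2) for name, s in sims.items() if s >= 0.8})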
If the same procedure is repeated for all Salmonella strains
with metabolic activation (with S9), a category of 186 chemicals
with data can be formed. When subcategorized using structural
similarity, it is interesting to note that the prediction becomes
active, based on the data for 28 chemicals. This prediction can be
confirmed by studying the metabolism of the query compound (see
step 4 below).
Conclusion Step 3: Based on the results achieved by chemical category formation and read-across, a prediction of non-active is obtained in the absence of metabolic activation, which seems reasonable, but a prediction of active is obtained in the presence of metabolic activation.

4.2.4. Step 4. Simulation of Metabolism

Based on the electrophilic theory introduced by Miller and Miller (39), genotoxic carcinogens have the unifying feature that they are either electrophiles or can be activated by metabolism to electrophilic reactive intermediates (pro-electrophiles). Thus, an investigation of metabolism is an important step in the risk (hazard) assessment procedure, and might be considered essential when the prediction is inactive in the absence of metabolism.

Fig. 3. Metabolites of 2-aminoacetophenone predicted by Meteor.

Conversely, if the chemical is predicted to be positive in the absence of metabolism, it might be conservatively assumed that the chemical will also be positive in the presence of metabolism (even though, in principle, metabolism often serves to detoxify a parent compound). In general, for some compounds activity is observed in both the presence and absence of S9 mix, whereas for others activity may only be exhibited in its presence.
Meteor: As shown in Fig. 3, Meteor reported five mammalian metabolites whose likelihood is at least "plausible". Two of these metabolites (M3, M4) are phase II reaction products, which are usually assumed to be safe. The last one, M5, is in fact the parent chemical itself. Therefore, metabolites M1 and M2 are the most interesting for the next step of the analysis.
The first metabolite 1-[2-(hydroxyamino)phenyl]ethanone
(M1) is an activated form of the amino group. Aromatic hydro-
xylamines are generally considered as proximate mutagenic meta-
bolites of aromatic amines (38). Enzyme-catalyzed O-esterification
of the hydroxylamine leads to the formation of an esterified product
which has the potential to generate a reactive nitrenium ion capable
of binding to cellular nucleophiles such as DNA (Fig. 2).
Positive results in the in vitro chromosome aberration test have been reported for N-acetoxy-N-acetyl-2-aminofluorene and N-acetoxy-N-acetyl-4-aminobiphenyl in the absence of S9 mix (40, 41). The corresponding N-hydroxy compounds have also been tested. N-Hydroxy-N-acetyl-2-aminofluorene was found positive by Popescu et al. (41), but negative by Tates and Kriek (40), who also observed no chromosome aberrations with N-hydroxy-N-acetyl-4-aminobiphenyl. The authors concluded that the lack of clastogenicity for the N-hydroxy compounds was due to an inappropriate choice of fixation time in the test rather than an inherent lack of activity.
The second metabolite, 1-(2-aminophenyl)ethanol (M2), is an aromatic amine with a hydroxyalkyl group in the ortho position. The same mechanism of interaction with DNA as for the parent chemical can be expected. The hydroxyalkyl group is electron-donating, which serves to stabilize the nitrenium ion that interacts with DNA.
Toxicity predictions for these two metabolites (M1 and M2),
generated by different software tools, are given in Table 13.
Almost all the software tools (except CAESAR) predict the first metabolite (M1) as active, ranging from low–moderate concern (OncoLogic) to highly probable to be mutagenic (HazardExpert and TOPKAT). All software tools except DfW and Toxtree predict the second metabolite (M2) to be active (mutagenic); in addition, Toxtree predicts that this compound is unlikely to be a carcinogen based on its QSAR. As with the parent compound, a comprehensive analysis of the information accompanying the predictions could also be performed for the metabolites [see the corresponding analysis (Step 2) for the parent chemical].
OECD Toolbox: Metabolism can also be evaluated by using the
OECD Toolbox, which provides the option to check whether
there are any documented metabolites for the query chemical. In
the case of 2-AAP there are no documented metabolites. As shown
in Fig. 4, the liver metabolic simulator which is implemented in the
Toolbox generated 12 metabolites (using default settings).
Table 14 gives the toxicity predictions generated by different
software tools for each of these 12 metabolites.
Based on the metabolites generated by the Toolbox and their predicted activities, it is difficult to draw a conclusion. However, two observations are important: (a) the two metabolites generated by Meteor were also generated by the Toolbox, which increases the confidence in these predictions, since Meteor and the Toolbox are independent software tools; (b) many of the metabolites are predicted as mutagenic. On the basis of this concern, it might be decided to investigate the activities of some of these metabolites experimentally, starting with the two metabolites generated in common by Meteor and the Toolbox (a sketch of such a cross-check is given below).
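One way to perform such a cross-check is to compare the two metabolite sets on canonical identifiers rather than raw SMILES strings. The sketch below uses InChIKeys for the comparison; the SMILES sets are illustrative stand-ins for the Meteor and Toolbox outputs (M1 and M2 are encoded here from their names), and an RDKit build with InChI support is assumed.

    # Sketch: intersect two simulators' metabolite sets via InChIKeys.
    from rdkit import Chem

    meteor = {"CC(=O)c1ccccc1NO",     # M1: 1-[2-(hydroxyamino)phenyl]ethanone
              "CC(O)c1ccccc1N"}       # M2: 1-(2-aminophenyl)ethanol
    toolbox = {"CC(=O)c1ccccc1NO", "CC(O)c1ccccc1N",
               "CC(=O)c1ccc(O)cc1N"}  # plus a ring-hydroxylated product

    def inchikeys(smiles_set):
        return {Chem.MolToInchiKey(Chem.MolFromSmiles(s)) for s in smiles_set}

    common = inchikeys(meteor) & inchikeys(toolbox)
    print(len(common), "metabolite(s) predicted by both simulators")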
Conclusion Step 4: Based on the simulated metabolites of 2-AAP and their predicted activities, it is concluded that the parent compound is likely to undergo metabolism and to form potentially genotoxic metabolites.

Table 13
Genotoxicity predictions for two simulated metabolites of 2-aminoacetophenone

Metabolite | DfW | CAESAR | Lazar | HazardExpert | TOPKAT | Toxtree | OncoLogic
M1 | Mutagenicity, bacterium in vitro: plausible | Non-mutagen | Active (reliable) | Highly probable to be mutagenic | Computed probability of mutagenicity = 0.999 | Active based on SA28 (primary aromatic amines, hydroxyl amines, and their derived esters) | Low–moderate
M2 | Nothing to report | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Unlikely to be a carcinogen based on the QSAR | Low–moderate

Fig. 4. Metabolites of 2-aminoacetophenone predicted by the OECD QSAR Toolbox.


4.2.5. General Conclusions for 2-Aminoacetophenone

2-AAP is non-genotoxic in vitro, with and without metabolic activation (Ames test, DNA repair test, and unscheduled DNA synthesis). No data were available for in vivo genotoxicity.
2-AAP is probably non-genotoxic based on the predictions and additional information of the eight software tools. The same conclusion is obtained by using chemical category formation and read-across. However, investigation of its possible metabolism using two independent software tools indicates that 2-AAP most probably undergoes metabolism and that some of its metabolites could be genotoxic. On the basis of the totality of information, it might be concluded that, on balance, 2-AAP is probably non-genotoxic, or it might be decided that genotoxicity data for the predominant metabolites are needed to confirm the non-genotoxicity of the compound. Since many metabolites can be formed, and simulation tools have a tendency to overgenerate potential metabolites, a key issue is which metabolites are toxicologically relevant. The concept of toxicological relevance needs to take into account the amount of each metabolite as well as its potency. The combination of parent compound and toxicologically relevant metabolites is sometimes referred to as the residue definition. Guidance on how to establish a residue definition suitable for the risk assessment of pesticides is currently being developed by EFSA. In the case of the EFSA evaluation of the use of 2-AAP as a flavoring substance (35), it was concluded that genotoxic potential could not be excluded on the basis of the existing information, and thus additional genotoxicity data were requested.
Table 14
Genotoxicity predictions for 12 simulated metabolites of 2-aminoacetophenone

Metabolite | DfW | CAESAR | Lazar | HazardExpert | TOPKAT | Toxtree | OncoLogic
M1 | Mutagenicity, bacterium in vitro: plausible | Non-mutagen | Active (reliable) | Highly probable to be mutagenic | Computed probability of mutagenicity = 0.999 | Active | Low–moderate
M2 | Mutagenicity, bacterium in vitro: plausible | Mutagen | Inactive (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 0 | Active | Low–moderate
M3 | Nothing to report | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Active | Low–moderate
M4 | Nothing to report | Mutagen | Inactive (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 0 | Active | Low–moderate
M5 | Nothing to report | Mutagen | Inactive (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 0 | Non-active | Low–moderate
M6 | Mutagenicity, bacterium in vitro: plausible | Mutagen | Inactive (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 0 | Active | Low
M7 | Nothing to report | Mutagen | Inactive (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 0 | Non-active | Low
M8 | Mutagenicity, bacterium in vitro: plausible | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Active | Marginal
M9 | Nothing to report | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Active | Marginal
M10 | Nothing to report | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Non-active | Marginal
M11 | Mutagenicity, bacterium in vitro: plausible | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Active | Marginal
M12 | Mutagenicity, bacterium in vitro: plausible | Mutagen | Active (unreliable) | Probable to be mutagenic | Computed probability of mutagenicity = 1 | Active | Low


5. Concluding Remarks
In this chapter, a range of computational tools for applying QSAR and grouping/read-across have been described, and their possible use in the computational assessment of genotoxicity has been illustrated by means of the stepwise application of selected tools to two case-study compounds: AaC and 2-AAP. The first case-study compound (AaC) is an environmental pollutant and a food contaminant that can be formed during the cooking of protein-rich food. The second case-study compound (2-AAP) is a naturally occurring compound in certain foods and is also under evaluation in the EU as a flavoring substance. In the case of 2-AAP, EFSA concluded that the available experimental data were insufficient to exclude genotoxic potential and therefore requested additional information from the applicant to confirm that the flavorings are safe to use in foods (35).
In the first case study (AaC), the genotoxic potential of the
parent compound is well established and is also evident from the
computational predictions. In this case, the assessment of metabo-
lites may not be warranted, since there is probably sufficient infor-
mation to conclude the risk assessment. In contrast, the second case
study (2-AAP) is more challenging, since the available toxicity data
and computational predictions give mixed messages, especially
when potential metabolites are taken into account.
It is emphasized that the software tools applied in these case
studies are only some of the many software tools available, and their
use is intended to be illustrative of a thought process, rather than
providing definitive results and conclusions.
There are at least two outstanding challenges in the computa-
tional assessment of toxicological endpoints, such as genotoxicity.
One is the further development of reliable models and their com-
bination into automated workflows to carry out stepwise assess-
ments such as those described here. The other is to develop a
structured and transparent framework for interpreting and weigh-
ing complex datasets in such a way that (apparent) discrepancies
between data can be resolved.

Part II

Biological Network Modeling


Chapter 7

Gene Expression Networks


Reuben Thomas and Christopher J. Portier

Abstract
With the advent of microarrays and next-generation biotechnologies, the use of gene expression data has
become ubiquitous in biological research. One potential drawback of these data is that they are very rich in
features, or genes, while cost considerations allow for only relatively small sample sizes. A useful way of
arriving at biologically meaningful interpretations of the environmental or toxicological condition of
interest is to make inferences at the level of a priori defined biochemical pathways, or networks of
interacting genes or proteins, that are known to perform certain biological functions. This chapter describes
approaches taken in the literature to make such inferences at the biochemical pathway level. In addition, this
chapter describes approaches to generate hypotheses on genes playing important roles in the response to a
treatment, using organism-level gene coexpression or protein–protein interaction networks. Approaches
to reverse engineer gene networks, i.e., methods that seek to identify novel interactions between genes, are
also described. Given the relatively small sample numbers typically available, these reverse engineering
approaches are generally useful for inferring interactions only among a relatively small number of genes
(on the order of ten). Finally, given the vast amounts of publicly available gene expression data from
different sources, this chapter summarizes the important sources of these data and the characteristics of
these sources, or databases. In line with the overall aim of this book of providing practical knowledge to
researchers interested in analyzing gene expression data from a network perspective, the chapter points to
convenient, publicly accessible tools for performing the analyses described and presents three motivating
examples taken from the published literature that illustrate some of the relevant analyses.

Key words: Gene expression, Networks, Biochemical pathways, Gene Expression Omnibus,
ArrayExpress, Cytoscape, Pathway enrichment, Network identification

1. Introduction

Technologies for genome-wide expression measurement, like
DNA microarrays and newer next-generation technologies such as
RNA sequencing, have revolutionized the way biological systems
are studied. For the first time, a genome-wide gene expression
profile allows researchers to work almost at the systems level of
genes/molecules in a tissue, as opposed to the prior, more focused
approach of analyzing what happens to a select group of molecules
under given experimental conditions. Both approaches come with
inherent advantages and disadvantages. While the older approach
may miss interesting effects on the biological system that would
enrich the analysis, the newer genome-wide profile-based approach
has in general suffered from the noisy nature of the data and the
fact that experiments generally have few replicates, mainly due to
cost considerations.
It is well known that a single entity, or even multiple entities in
isolation, cannot explain how a biological system sustains itself or
protects itself against deleterious environmental conditions. A
biological system is maintained through complex, coordinated
interactions of multiple entities at different scales: molecular,
cellular, tissue, organism, and at the level of interactions between
populations of biological systems. At the molecular level there is
thus an underlying network of interactions. Genes can be regulated
at the transcriptional, translational, or posttranslational level.
Transcription factors (examples include NF-κB, the PPARs, the
STATs, and the AhR) bind to the promoter regions of certain genes
to initiate or repress the transcription of these genes, according
to whether they stabilize or block the binding of RNA polymerase
to the DNA. Histone acetyltransferases acetylate the histone
proteins, resulting in increased transcription, whereas histone
deacetylases have the opposite effect. Other controls of gene
function include repression of expression via DNA methylation,
generation of different isoforms of the same messenger RNA via
alternative splicing, posttranscriptional regulation (RNA editing,
microRNAs), translational regulation (via microRNAs), and
post-translational regulation (via phosphorylation or
ubiquitination). As can be observed from the above examples, the
direct regulatory factors are not the messenger RNAs but various
proteins and enzymes. Unfortunately, the technology is still at a
stage where system-wide measurements of gene expression, i.e.,
messenger RNA (mRNA), are the most feasible. Using the central
dogma of biology (1) as justification, the literature has used gene
expression as a proxy for both the regulators and the regulated
molecular components of a biological system. However, it is
important to keep in mind that the level of correlation between
gene expression and corresponding presumed steady-state protein
levels is high but not perfect (2). The lack of perfect correlation
could be due to one of the several posttranscriptional regulatory
mechanisms mentioned above.
To summarize, molecular networks in this chapter are viewed as
networks of interacting genes inferred or identified using gene
expression data. The nodes of the network are the various genes
(or proxies for their corresponding proteins), and edges represent
the interactions between pairs of genes in the network. Biologically,
an edge corresponds to one of the regulatory mechanisms mentioned
above or, statistically, to an association (causal or otherwise)
between the pair of gene expressions defining the edge.
This chapter describes three elements of analyses with gene
expression networks.
1. Accessing publicly available gene expression data.
2. Identifying a priori defined networks or pathways that are
affected as a result of a given toxicological condition using
gene expression data.
3. Identifying novel interactions between a set of genes using
gene expression data obtained under multiple perturbed con-
ditions.
DNA microarrays have been around for over a decade now.
Through the creation of NCBI's Gene Expression Omnibus (3)
and EMBL's ArrayExpress (4), among others, the requirement by
various scientific journals that authors provide public access to any
microarray data on which their study was based, and the creation
of the MIAME (5) standard for reporting data relevant to a
microarray experiment, the scientific community now has access to
a treasure trove of data. These data can be used to create novel
hypotheses or to justify the design of experiments aimed at
understanding specific phenomena. For example, a researcher
interested in understanding the toxicity mechanisms of benzene
exposure can use all previously performed experiments involving
benzene or its metabolites (hydroquinone, catechol,
1,2,4-benzenetriol) to determine the degree of perturbed immune
response in humans at all dose levels considered.
One approach would be to first map the expression data
(available publicly or from one's own experiments) onto a priori
defined networks or pathways of genes/proteins that are known to
participate in well-defined immune responses, like B-cell receptor
signaling, Toll-like receptor signaling, and cytokine–cytokine
receptor interactions, and then perform what is termed pathway
enrichment or gene set enrichment analysis. Pathway enrichment
methods are statistically based analyses that attempt to identify
pathways, or sets of genes, whose expression patterns are
significantly different under perturbed or exposed conditions when
compared to those obtained under normal or control conditions.
These pathways are not really well defined; for example, different
sources of human pathways will not have exactly the same set of
pathways, or even exactly the same set of genes/proteins and
interactions within the same pathway. One possible alternative is
to look at a system-wide set of interactions between genes/proteins,
like that in the STRING database (6). A researcher interested in
immune response sets of proteins could either restrict his/her view
of the overall interaction network to only those proteins that have
known immune response roles or keep a system-wide view of
interacting proteins. The objective then becomes the identification
of clusters of proteins in either network that have been
significantly affected by benzene exposure.
The set of validated interactions identified between the various
proteins is very likely incomplete, and researchers investigating
novel phenomena may have reason to believe in the existence of
novel interactions. Techniques placed under the banner of reverse
engineering genetic networks, also known as network inference,
have been developed to address this issue.

2. Materials and Methods

2.1. Common Software and Hardware Required

All suggested software to perform the various analyses described in
this chapter can be installed and run on standard desktops (32- or
64-bit, 2 GHz or better CPU) running Windows, Mac, or Linux
operating systems, with internet access and over 500 MB RAM. A
basic level of literacy in using one of these operating systems
(running desktop software applications, accessing and using a Web
browser, manipulating and working with text or Microsoft Excel
format tables, etc.) is assumed. A background in simple statistics
(random variables; mean, variance, and correlation between random
variables; standard probability distributions; hypothesis tests and
their associated p-values; two-sample t-tests; ANOVA; F-tests;
experiments; replicate samples) is also assumed. There are two
open-source analysis programs which are useful to master for
regularly performing the kinds of analyses described in this
chapter. First, there are the various Bioconductor packages (7)
installed in the R statistical environment (8), which allow one to
consolidate all analyses into one environment. The main drawbacks
of working in R are that it is not user-friendly and requires basic
programming knowledge, and that errors could creep into the
results because functions and parameters defined across multiple
packages can conflict. Crawley (9) is an introductory book to
statistical analysis using R. In addition, there are several useful
tutorials and presentations on using the basic functions in R that
can be readily accessed online. One way of detecting the phantom
errors mentioned above is to run the chosen R functions from the
installed packages on data for which you have an understanding of
the expected results.
The second open-source software, Cytoscape, is useful for
visualizing large-scale networks and even performing certain
advanced analyses (10). Biological networks that are large (~10^3
nodes and ~10^4 edges), like the organism-wide protein–protein
interaction network, will typically require >2 GB RAM when
visualized using Cytoscape.
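As a practical starting point, the sketch below shows one way to
install the R packages mentioned in this chapter; it assumes a
recent R installation with the BiocManager interface (older
installations used the biocLite() script instead), and the package
list is illustrative rather than exhaustive.

## Illustrative setup: install the Bioconductor and CRAN packages used
## in this chapter (assumes a recent R with the BiocManager interface).
install.packages("BiocManager")
BiocManager::install(c("GEOquery",      # programmatic access to GEO
                       "GEOmetadb",     # searchable snapshot of GEO metadata
                       "ArrayExpress",  # access to EMBL-EBI ArrayExpress
                       "topGO"))        # Gene Ontology enrichment analysis
install.packages("igraph")              # general-purpose network analysis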
2.2. Accessing Publicly Available Gene Expression Data

There are several publicly accessible gene expression databases;
see Table 1. The Gene Expression Omnibus (GEO) (3) and
ArrayExpress (4) represent the largest repositories of gene
expression data across a diverse set of organisms and of microarray
and next-generation technologies. The Web pages for both GEO and
ArrayExpress have advanced search tabs where you can search for
data matching criteria you specify, such as organism, tissue,
number of samples, platform technology, or experimental condition
(e.g., blood gene expression of humans exposed to benzene on
Illumina microarrays). In addition, there are packages in
Bioconductor (7) that allow programmatic access to the same search
capabilities and provide for downloading of datasets from GEO
(GEOmetadb (11) and GEOquery (12)) and from ArrayExpress (the
ArrayExpress package (13)). GEOmetadb (11) requires that the user
have a basic knowledge of MySQL (14), a database query language.
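As a sketch of what such a metadata query looks like, the snippet
below downloads the GEOmetadb SQLite snapshot and searches series
titles; the search term and the columns selected are illustrative
choices, not a recommendation.

library(GEOmetadb)

## One-time download of the SQLite snapshot of GEO metadata (a large file).
sqlfile <- getSQLiteFile()
con <- dbConnect(SQLite(), sqlfile)

## Illustrative query: find series whose titles mention benzene.
dbGetQuery(con, "SELECT gse, title FROM gse WHERE title LIKE '%benzene%'")
dbDisconnect(con)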
Both raw data (e.g., CEL files from Affymetrix platforms) and
contributor-processed, normalized data are typically available.
Annotation files for the different probes on the microarray
platform used to generate the data should also be available. It is
up to the user to judge whether the probes on a given platform
provide sufficient coverage of all genes in the organism, or cover
the genes that are of particular interest to him/her.

Table 1
Public sources of gene expression data

Database                                                         | Web site                                      | R programmatic access | Data (as of Jan 2010)
Gene Expression Omnibus (GEO)                                    | http://www.ncbi.nlm.nih.gov/geo/              | GEOquery, GEOmetadb   | 16,807 experiments and 430,926 samples
ArrayExpress (AE)                                                | http://www.ebi.ac.uk/microarray-as/ae/        | ArrayExpress          | 10,666 experiments and 298,824 samples
Center for Information Biology gene EXpression database (CiBEX)  | http://cibex.nig.ac.jp/index.jsp              | NA                    | 88 experiments and 133 samples
Chemical Effects in Biological Systems (CEBS)                    | http://tools.niehs.nih.gov/cebs3/ui           | NA                    | 24 experiments
Stanford Microarray Database (SMD)                               | http://smd.stanford.edu/                      | NA                    | 19,775 samples
Cancer Array (caArray)                                           | https://array.nci.nih.gov/caarray/home.action | NA                    | 156 experiments
Typically, coverage of expressed genes can range anywhere from
30% to above 90%. A downloaded data set will typically have data
for several samples, different subsets of which generally represent
replicates corresponding to a given biological condition (e.g.,
humans exposed to a 1 ppm level of benzene, or their matched
controls).
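As an illustration of the programmatic route, the following sketch
uses the GEOquery package to retrieve a processed series from GEO;
the accession number is a placeholder, not a real study, and the
object names are ours.

library(GEOquery)

## Download a processed GEO series; "GSE12345" is a placeholder accession.
gse  <- getGEO("GSE12345", GSEMatrix = TRUE)
eset <- gse[[1]]        # ExpressionSet for the first platform in the series

exprs(eset)[1:5, 1:3]   # expression matrix (rows = probes, cols = samples)
pData(eset)[1:3, ]      # sample annotation, e.g., exposure groups
fData(eset)[1:3, ]      # probe/gene annotation for the platform

## Raw supplementary files (e.g., Affymetrix CEL files) can be fetched
## with getGEOSuppFiles("GSE12345")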

2.3. Identifying Toxicologically Relevant Gene Expression Networks

2.3.1. Gene Set or Biochemical Pathway Enrichment

Given gene expression data like those described in the previous
section, the goal of gene set or pathway enrichment methods is to
identify significantly affected pathways (e.g., apoptosis,
arachidonic acid metabolism) or gene sets (e.g., various gene
ontologies (15), or the set of genes in a given region of a
chromosome) when the biological/environmental conditions are
altered.
There are three variations of gene set enrichment methods that
have been used in the literature: over-representation methods
(16, 17), rank-based methods (18, 19), and global methods (19, 20).
Over-representation methods require the user to pre-identify all
significantly affected genes in their experiment. Depending on the
hypotheses being tested or scientific questions asked, the user
would perform a two-sample test, a one-way ANOVA, a Cox
proportional hazards test, or a trend test to obtain gene-specific
p-values. The p-values should then be adjusted for multiple testing
at an appropriately defined threshold (5% family-wise error rate or
5% false discovery rate) to obtain a list of genes significantly
affected by the experimental condition. Over-representation methods
also require a background set of genes; these typically would be all
genes (irrespective of whether they were significantly affected or
not) on the microarray platform used. The methods work by
performing Fisher's exact test or a chi-squared approximation of it.
When working with sets of genes from the Gene Ontology, and
irrespective of which of the three classes of methods one wants
implemented, by far the most useful software seems to be the topGO
package (21) in R (8). In its analysis, topGO accounts for the
hierarchy of gene sets in GO; for instance, there is a GO gene set
corresponding to the set of all genes (~30,000 genes) known to
participate in some signal transduction process and a child GO gene
set corresponding to a set of five genes participating in
lipoprotein-mediated particle signaling. In addition, there are term
enrichment methods that one can access on the Gene Ontology Web
site (15) and the DAVID Web site (22, 23).
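To make the over-representation calculation concrete, the sketch
below runs Fisher's exact test for a single pathway using
hypothetical counts; in practice the same test is repeated for every
pathway and the resulting p-values are adjusted for multiple
testing, e.g., with p.adjust in R.

## Hypothetical counts: 40 of 400 significantly affected genes fall in a
## pathway containing 150 of the 15,000 background genes on the array.
sig.in.path <- 40;  sig.total <- 400
path.total  <- 150; bg.total  <- 15000

## 2 x 2 table: rows = in/outside the pathway, columns = significant or not
tab <- matrix(c(sig.in.path,
                sig.total - sig.in.path,
                path.total - sig.in.path,
                bg.total - sig.total - path.total + sig.in.path),
              nrow = 2)
fisher.test(tab, alternative = "greater")$p.value  # one-sided over-representation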
The two other classes of methods are a generalization of the
over-representation methods: instead of a 0–1 weighting of the
genes on the microarray platform (not affected or affected by the
experimental condition, say, as determined using one of the methods
in the previous paragraph), genes are assigned a continuous metric.
It is up to the user to choose this continuous metric; it could be
the t-statistic from a two-sample test, a SAM statistic (24), an
F-statistic from an ANOVA test, or even the logarithm of the fold
change. As the name implies, ranking methods like (18) rank these
gene-wise metrics across all the genes and look for
overrepresentation of genes of a given pathway, category, or
ontology at the top of this ranked list. Both the
over-representation and ranking methods have a competitive nature
in that they attempt to compare the set of genes of interest from a
given pathway with the remaining genes in the system. The results
from these methods provide indications of pathways whose genes are
most affected by the experimental condition. Unfortunately, these
methods assume that gene expressions are independent of each other.
This assumption is not completely justified, and methods that fail
to take into account gene–gene correlations have been shown to
produce more false-positive results.
Global methods (20) attempt to check whether the genes in a
given pathway are affected, irrespective of what happens to the
other genes in the system. Unlike the over-representation analyses,
these methods account for gene–gene correlations.
To summarize, the input requirements for gene set or pathway
enrichment methods are as follows. First, an array or matrix of
preprocessed and normalized gene expression data is needed, the
rows of which correspond to the different genes and the columns to
the various samples under the different experimental conditions in
the experiment. Second, an annotation file describing the genes
associated with the rows of the data matrix is required. Third, you
need to identify what hypotheses you are interested in testing (Are
you looking for pathways or gene sets that are differentially
perturbed from one condition to the next? Are you looking for
pathways with genes that are predictive of survival time in
patients with lung cancer? Are you looking for pathways with genes
that respond in a concentration-responsive fashion to treatment
with a given chemical?). The identified hypothesis will determine an
appropriate choice of gene-wise statistical test that should be run
to obtain either a list of significantly affected genes or a
gene-wise metric/statistic for all the genes in the system. Fourth,
you need to decide on the gene sets or pathways of interest (Gene
Ontologies, the different pathways in the KEGG pathway database
(25–27), or the sets of genes in various chromosomal regions of
interest). The most common choice is Gene Ontology sets and the
topGO package (21) within R. Another useful resource, for a
ranking-based method and also for diverse sets of genes, is the
GSEA Java application (28), which can be installed on a personal
computer. The output from each of these software packages will be
p-values for each of the selected gene sets or pathways.
Conclusions on significantly affected pathways should be drawn only
after correction for multiple testing has been performed.
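As a minimal illustration of the gene-wise testing and
multiple-testing steps just described, the sketch below computes
two-sample tests on a hypothetical expression matrix and applies a
Benjamini-Hochberg correction; the object names and the two-group
design are assumptions made for the example.

## `expr`: preprocessed, normalized matrix (rows = genes, cols = samples);
## a hypothetical design with five control and five exposed samples.
group <- factor(c(rep("control", 5), rep("exposed", 5)))

pvals <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)

## Benjamini-Hochberg adjustment; genes below a 5% false discovery rate
## form the significant list used by over-representation methods.
padj      <- p.adjust(pvals, method = "BH")
sig.genes <- rownames(expr)[padj < 0.05]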
Given that this chapter is on gene expression networks, it seems
useful to note that the authors of this chapter have also developed
pathway enrichment methods (19) for each of the three classes of
methods mentioned above. A novelty of these methods is that they
treat pathways not just as sets of genes but as networks of
interacting genes, which is what they really are. Changes in gene
expression across conditions in the experiment are given more
weight when they correspond to nuclear receptors or transcription
factors; nuclear receptors and transcription factors are typically
terminating nodes of the network representation of a pathway. In
addition, pathways where the perturbed genes are close to each
other on the network are given more weight. The code to run these
analyses is available on request from the authors.

2.3.2. Sub-Network Identification

Useful biological insight can be obtained by overlaying genome-wide
expression changes onto the nodes of an organism-level biological
network, e.g., the protein–protein interaction network or a gene
expression-based correlation network. One useful software tool for
obtaining sub-networks at the organism level is the Markov
Clustering Algorithm (29), which can be obtained as an addition to
the Cytoscape software (10).
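The Markov Clustering plugin itself runs inside Cytoscape, but the
same idea can be sketched in R; the example below overlays
hypothetical gene-wise scores on a network and uses Louvain
community detection from the igraph package as a simple stand-in
for Markov clustering. The edge list and score vector are assumed
inputs.

library(igraph)

## `edges`: hypothetical two-column data frame of interacting gene pairs;
## `score`: named vector of gene-wise scores, e.g., -log10 p-values.
g <- graph_from_data_frame(edges, directed = FALSE)
V(g)$score <- score[V(g)$name]   # overlay expression changes on the nodes

cl <- cluster_louvain(g)         # community detection (stand-in for MCL)

## Rank clusters by mean node score to flag candidate affected sub-networks.
sort(tapply(V(g)$score, membership(cl), mean), decreasing = TRUE)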

2.4. Reverse Engineering Gene Expression Networks

In cases where novel biological phenomena are studied, there is
sometimes reason to believe in the existence of gene–gene or
gene–protein interactions that have not been identified before.
Gene expression data generated from specially designed gene
knockout or perturbation experiments can theoretically be used to
identify novel interactions. Various analytic methods have been
designed over the past decade that propose solutions to this
problem of network identification. The models have differed in
whether they treat gene expression measurements as binary values
(a gene is either expressed or not) (30) or as continuous values
(31–33). They have also differed in the objective function used to
identify the network: probability-distribution-based objectives
(entropy (34), mutual information (35), or the posterior
probability of the data in a Bayesian setting (36, 37)) or
model-based least-squares fits (38, 39). Mechanistic representations
of gene regulation have taken linear forms or slightly more
complicated log-linear forms. Thomas et al. (38, 39) tried to
mechanistically model gene regulation directly through proteins.
Unfortunately, these models require estimates of the half-lives of
the different mRNAs and their associated proteins under study, and
these parameter estimates are in general not readily available.
Another difference among the methods available for network
identification is whether they can work with time-series data
(39, 40) or not. The methods designed to work with time-series data
should have some way of accounting for serial correlation in the
data.
Despite the plethora of methods developed to address this problem,
there is one serious handicap that all of them face. Network
identification is a very hard problem in the sense that, in order
to identify the network of interactions between a set of genes, you
need data from a number of independent experiments that is
exponential in the number of genes (30). For example, to fully
identify a network of 20 genes you will need data from around
2^20, i.e., roughly a million, experiments! The way the methods have
attempted to overcome this hurdle is first by admitting that their
approach can be efficiently solved only on relatively small
networks with few nodes (on the order of 10), and then by enforcing
assumptions that imply a sparseness of the set of all possible
interactions. For example, a standard assumption has been to limit
the number of interactions that a given gene is involved in to,
say, 2 or 3. This assumption is partially justified by the observed
sparseness of existing biological networks; still, it is arbitrary,
since there is no real biological reason why a given gene should be
limited to 2 or 3 interactions. See refs. 30, 38, 39 for further
comments on the number and nature of experiments required for
unique network identification.
There are stand-alone software packages that provide solutions to
this problem: information-theoretic ones (ARACNE (35), and
parmigene (41) as a package in R (8)) and Bayesian-based methods
capable of working with steady-state and time-varying data
(BANJO (36)).
In order to obtain reliable output in a reasonable amount of time,
the researcher should identify the small subset of genes they wish
to model. Even then, there may not be a sufficient number of
experiments to make the network identifiable. Hence the outputs of
all network inference methods should be subjected to rigorous
literature or experimental verification.
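As an example of the information-theoretic route, the sketch below
applies the parmigene package to a small, pre-selected gene set;
the expression matrix is a hypothetical input, and the tolerance
parameter value is illustrative.

library(parmigene)

## `expr`: matrix for a small pre-selected gene set
## (rows = genes, columns = independent perturbation experiments).
mi  <- knnmi.all(expr)           # pairwise mutual information (kNN estimator)
adj <- aracne.a(mi, eps = 0.05)  # ARACNE pruning of likely indirect edges

## Nonzero entries are the inferred (undirected) candidate interactions;
## as noted above, these still require literature or experimental follow-up.
which(adj > 0, arr.ind = TRUE)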

3. Examples

3.1. Immune Response Pathways in the Blood Perturbed at Low Doses of Benzene Exposure

Benzene is a well-known leukemogen, but the molecular mechanisms
of the response of human blood cells to low benzene exposure were
not well understood. McHale et al. (42) performed a study of the
gene expression profiles of peripheral mononuclear blood cells from
workers in China exposed to varying low levels of benzene in the
air of their workplace. The analysis of these 125 workers is
summarized in Table 2, taken from ref. 42. The microarray analyses
were performed on HumanRef-8 v2 BeadChips. The data were
preprocessed and quantile normalized, and individual messenger RNA
levels were modeled using linear mixed-effects models. The exposed
workers were divided into five groups depending on their exposure
levels to benzene in air: controls, <<1, <1, 5–10, and >10 ppm. One
of the analyses involved determining those pathways whose genes are
perturbed in samples of at least one of the exposed groups. A
16-gene signature was identified that could serve as a potential
biomarker of benzene exposure. The general trend was that of a
rapid increase in mRNA levels at the lower dose ranges, followed by
indications of saturation of these levels at the higher dose range.
Table 2
Characteristics of study subjects

Benzene exposure category (ppm) | Subjects (n) | Air benzene (ppm) (a) | Age (years)  | Sex: Male | Sex: Female | Smoking: Yes | Smoking: No
Control                         | 42           | <0.04 (b)             | 29.5 ± 8.2   | 17 (33)   | 25 (34)     | 9 (35)       | 33 (33)
Very low (<<1) (c)              | 29           | 0.3 ± 0.9             | 30.3 ± 9.2   | 8 (16)    | 21 (28)     | 6 (23)       | 23 (23)
Low (<1) (d)                    | 30           | 0.8 ± 0.8             | 27.9 ± 7.2   | 19 (37)   | 11 (15)     | 5 (19)       | 25 (25)
High (5–10)                     | 11           | 7.2 ± 1.3             | 29.7 ± 9.1   | 1 (2)     | 10 (14)     | 1 (4)        | 10 (10)
Very high (>10)                 | 13           | 24.7 ± 15.7           | 30.9 ± 10.5  | 6 (12)    | 7 (9)       | 5 (19)       | 8 (8)

Sex and smoking columns give n (%). Slight variation of material reproduced from ref. 42 with permission from the National Institute of Environmental Health Sciences. Values for air benzene and age are mean ± SD.
(a) Air benzene level in the 3 months preceding phlebotomy
(b) The limit of detection for benzene was 0.04 ppm
(c) The average level of benzene was <1 ppm at most measurements in the 3 months preceding phlebotomy and at all measurements in the prior month
(d) The average benzene level was <1 ppm (in the 3 months preceding phlebotomy) but dosimetry levels were not always <1 ppm in the previous 3 months

Gene-wise F-statistics were computed using ANOVA comparisons
between the models with and without fixed effects of exposure.
These were used in the over-representation-based methodology of the
Structurally Enhanced Pathway Enrichment Analysis (19) on all known
human pathways in the KEGG pathway database (25–27). The results
are given in Table 3. The significant p-values of the
immune-response-related pathways, like Toll-like receptor signaling
and B-cell and T-cell receptor signaling, imply an important role
for these pathways in the response to benzene exposure.
Interestingly, genes known to be involved in the development of
acute myeloid leukemia (an accepted consequence of benzene
exposure) also demonstrated significant change.

3.2. Protein Kinase, AMP-Activated, Alpha 2 (Prkaa2) Is Likely to Play a Central Role in the Pathophysiology of Murine Progressive Cardiomyopathy in C3H/HeJ Mice

Three different strains of mice (C3H/HeJ, C57BL/6J, and B6C3F1/J)
demonstrate differential susceptibility to cardiomyopathy. Auerbach
et al. (43) attempted to provide a gene expression-based
explanation for this differential susceptibility. Gene expression
profiles from the heart tissue of ten mice of each of these three
strains were measured using Affymetrix Mouse Genome 430 2.0
microarrays. Gene-wise ANOVA analyses were conducted to identify
the set of genes exhibiting significant differences in their
expression in at least one of the strains. These genes were
overlaid on a genome-wide gene correlation network (44), a network
of genes whose expressions have been found to be significantly
correlated across a diverse set of conditions and tissues.
Table 3
Top pathways associated with overall benzene exposure

Pathway (a)                           | p-Value (b)
Toll-like receptor signaling pathway  | <0.001
Apoptosis                             | <0.001
Acute myeloid leukemia                | <0.001
Oxidative phosphorylation             | <0.001
B cell receptor signaling pathway     | <0.001
T cell receptor signaling pathway     | 0.001

Slight variation of material reproduced from ref. 43 with permission from the National Institute of Environmental Health Sciences
(a) These pathways were taken from the KEGG pathway database (25–27)
(b) The p-values are based on the SEPEA_NT3 pathway enrichment method (19); p-values < 0.005 correspond to a Bonferroni adjustment for multiple testing

An analysis was then performed to identify regions of this
correlation network that are over-represented with genes
differentially expressed in the heart across the three strains.
Prkaa2 was one of the genes identified as playing a central role
(see Fig. 1).

3.3. Dose-Dependent Gene Networks of Response to Acetaminophen: Suggestions of Interactions Between Apoptotic and Oxidative Stress Genes at Higher Doses of Exposure to Acetaminophen

Acetaminophen is a well-known hepatotoxicant, but a good
understanding of its dose-dependent toxicity mechanism is still
lacking. The literature suggests a role for apoptosis and the
oxidative stress response as toxicity mechanisms, which led to the
selection of 17 genes for analysis. In Toyoshiba et al. (45), gene
expression data from the livers of groups of ten mice were
obtained. Each group was exposed to one of three one-time gavage
doses of acetaminophen: 50, 150, or 1,500 mg/kg. In rodents,
intraperitoneal administration of more than 500 mg/kg b.w. was
known to result in centrilobular necrosis that is potentially
fatal. A Bayesian-based network inference algorithm, TAO-Gen (37),
was used to infer networks at the different doses. The joint
consensus network for the two lower doses and the one for the
higher dose are shown in Fig. 2. The quantifying analysis showed
that, at the lower doses, genes related to the oxidative stress
signaling pathway did not interact with the apoptosis-related
genes. In contrast, the high-dose network demonstrated significant
interactions between the oxidative stress genes and the apoptosis
genes, and also demonstrated a different network between genes
within the oxidative stress pathway.
Fig. 1. Prkaa2 plays a central role in determining cardiomyopathy:
the Prkaa2 second-neighborhood network of differentially expressed
genes. In the coexpression network, Prkaa2 is linked to 138 genes
in the second neighborhood through the genes Asb2 and Tns1
(first-neighborhood genes). Of the 138 genes in the second
neighborhood, 39 were differentially expressed. Prkaa2
second-neighborhood genes that exhibited significantly higher
expression in C3H/HeJ mice are shown in ovals, whereas those that
exhibited significantly lower expression in C3H/HeJ are shown in
boxes (including Prkaa2). Genes in diamonds were not differentially
expressed (43). Reproduced from ref. 43 with permission from SAGE
Publications.

Disclaimer

The findings and conclusions in this report are those of the authors
and do not necessarily represent the views and positions of the
Centers for Disease Control and Prevention or the Agency for
Toxic Substances and Disease Registry.
Fig. 2. Inferred genetic networks of a set of apoptotic and
oxidative stress genes using rat acetaminophen dose data. Two
networks resulted from application of the TAO-Gen algorithm (37):
the left network was derived from all of the data for the 50 and
150 mg/kg acetaminophen dose groups, and the right consensus
network from all of the data for the 1,500 mg/kg acetaminophen dose
group (45). The orange and red colored cells refer to inferred
interactions; the two interactions identified as red cells were the
only gene–gene interactions common to the low-dose and high-dose
networks. Reproduced from ref. 45 with permission from Elsevier.

References

1. Crick F (1970) Central dogma of molecular biology. Nature 227(5258):561–563
2. Greenbaum D et al (2003) Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 4(9):117
3. Barrett T, Edgar R (2006) Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol (DNA Microarrays, Part B: Databases and Statistics) 411:352–369
4. Parkinson H et al (2009) ArrayExpress update – from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Suppl 1):D868
5. Brazma A et al (2001) Minimum information about a microarray experiment (MIAME) – toward standards for microarray data. Nat Genet 29(4):365–371
6. Von Mering C et al (2006) STRING 7 – recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35(Suppl 1):D358
7. Gentleman RC et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
8. R Development Core Team (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
9. Crawley MJ (2005) Statistics: an introduction using R. Wiley, Chichester
10. Shannon P et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498
11. Zhu Y et al (2008) GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24(23):2798
12. Davis S, Meltzer PS (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 23(14):1846
13. Kauffmann A et al (2009) Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics 25(16):2092
14. Widenius M, Axmark D, DuBois P (2002) MySQL reference manual. O'Reilly & Associates, Sebastopol, CA
15. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25
16. Al-Shahrour F, Díaz-Uriarte R, Dopazo J (2004) FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics 20(4):578
17. Beissbarth T, Speed TP (2004) GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics 20(9):1464
18. Subramanian A et al (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102(43):15545
19. Thomas R et al (2009) Choosing the right path: enhancement of biologically relevant sets of genes or proteins using pathway structure. Genome Biol 10(4):R44
20. Goeman JJ, Bühlmann P (2007) Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics 23(8):980
21. Alexa A, Rahnenführer J, Lengauer T (2006) Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22(13):1600
22. Huang DW, Sherman BT, Lempicki RA (2008) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4(1):44–57
23. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1):1
24. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98(9):5116
25. Kanehisa M et al (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36(Suppl 1):D480
26. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27
27. Kanehisa M et al (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34(Database issue):D354
28. Subramanian A et al (2007) GSEA-P: a desktop application for Gene Set Enrichment Analysis. Bioinformatics 23(23):3251
29. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575
30. Akutsu T, Miyano S, Kuhara S (2000) Inferring qualitative relations in genetic networks and metabolic pathways. Bioinformatics 16(8):727
31. Bernardo D, Gardner T, Collins JJ (2004) Robust identification of large genetic networks
32. Chen T, He HL, Church GM (1999) Modeling gene expression with differential equations. Pac Symp Biocomput 4:29–40
33. D'haeseleer P, Liang S, Somogyi R (2000) Genetic network inference: from co-expression clustering to reverse engineering. Bioinformatics 16(8):707
34. Ideker TE, Thorsson V, Karp RM (2000) Discovery of regulatory interactions through perturbation: inference and experimental design. Pac Symp Biocomput 5:302–313
35. Margolin A et al (2006) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinform 7(Suppl 1):S7
36. Hartemink AJ et al (2002) Bayesian methods for elucidating genetic regulatory networks. IEEE Intell Syst 17:37–43
37. Yamanaka T et al (2004) The TAO-Gen algorithm for identifying gene interaction networks with application to SOS repair in E. coli. Environ Health Perspect 112(16):1614
38. Thomas R et al (2004) A model-based optimization framework for the inference of gene regulatory networks from DNA array data. Bioinformatics 20(17):3221–3235
39. Thomas R et al (2007) A model-based optimization framework for the inference of regulatory interactions using time-course DNA microarray expression data. BMC Bioinform 8(1):228
40. Dasika M et al (2003) A mixed integer linear programming (MILP) framework for inferring time delay in gene regulatory networks. World Scientific Pub Co Inc.
41. Sales G, Romualdi C (2011) parmigene – a parallel R package for mutual information estimation and gene network reconstruction. Bioinformatics 27:1876–1877
42. McHale C et al (2010) Global gene expression profiling of a population exposed to a range of benzene levels. Environ Health Perspect 10
43. Auerbach SS et al (2010) Comparative phenotypic assessment of cardiac pathology, physiology, and gene expression in C3H/HeJ, C57BL/6J, and B6C3F1/J mice. Toxicol Pathol 38(6):923
44. Jupiter D, Chen H, VanBuren V (2009) STARNET 2: a web-based tool for accelerating discovery of gene regulatory networks using microarray co-expression data. BMC Bioinform 10(1):332
45. Toyoshiba H et al (2006) Gene interaction network analysis suggests differences between high and low doses of acetaminophen. Toxicol Appl Pharmacol 215(3):306–316
Chapter 8

Construction of Cell Type-Specific Logic Models of Signaling Networks Using CellNOpt

Melody K. Morris, Ioannis Melas, and Julio Saez-Rodriguez

Abstract
Mathematical models are useful tools for understanding protein signaling networks because they provide an
integrated view of pharmacological and toxicological processes at the molecular level. Here we describe an
approach previously introduced based on logic modeling to generate cell-specific, mechanistic and predic-
tive models of signal transduction. Models are derived from a network encoding prior knowledge that is
trained to signaling data, and can be either binary (based on Boolean logic) or quantitative (using a recently
developed formalism, constrained fuzzy logic). The approach is implemented in the freely available tool
CellNetOptimizer (CellNOpt). We explain the process CellNOpt uses to train a prior knowledge network to
data and illustrate its application with a toy example as well as a realistic case describing signaling networks in
the HepG2 liver cancer cell line.

Key words: Protein networks, Logic model, Boolean logic, Fuzzy logic, Cell-specificity, Signal
transduction

1. Introduction

For a cell to respond appropriately to environmental changes, a
molecular signal, such as a ligand binding to a receptor on the
cell surface, transmits information to proteins within the cell.
This signal transduction is mediated by cascades of molecular
events, including posttranslational modifications such as
phosphorylation of specific proteins. These signals typically reach
transcription factors and other elements that modify cellular
phenotype through changes in gene expression, protein translation,
and cell morphology, to name a few. Such cascades of molecular
events form pathways, and the proteins within these pathways
interact with each other in large and entangled networks (1). These
signaling networks differ depending on cell type as well as the
internal state of the cell (e.g., metabolic state) and are
essential for proper cellular response to the environment.
Signaling networks are typically studied by monitoring molec-
ular events upon treatments consisting of stimulation or perturba-
tion of key elements in the network at the protein, mRNA or gene
level. Combining measurement of these molecular signals with
phenotypic response under several treatment conditions enables
the study of these treatments on the architecture and function of
the underlying signaling networks as well as the relationship
between the network behavior and phenotypic response (2).
Recent technical developments allow for measurement of the
response of many signals and phenotypes to many treatments.
The resulting datasets are large in scale and difficult to interpret
by intuition alone.
Various computational techniques and tools can be applied to
aid in data interpretation (3, 4). Clustering methods group data
according to their similarity in order to identify functionally rele-
vant patterns (5, 6), and correlation and regression-based methods
predict species that drive a signaling or phenotypic response (7–9).
While these methods are useful for gaining a compact picture of the
data and can be used to generate hypotheses, they are limited as
tools for understanding the functionality of signaling networks.
Network-based approaches are a more natural means of gaining
this type of understanding. When prior knowledge on how protein
species interact is scarce, one can infer (i.e., reverse-engineer) the
network from the data itself (10–13). However, reverse-
engineering methods generally require a large amount of data, in
part because they ignore what is already known about the signaling
network. If there is enough prior knowledge about how species
within the signaling network of interest relate to each other, one
can build a mathematical model that is then trained to data. These
models provide insight on both experimentally measured or per-
turbed species as well as other species included in the model. In the
particular context of toxicology and pharmacology, this ability is
critical because compounds may affect species that were not directly
measured or perturbed experimentally.
The amount of detail known about the biological system of
interest and specific type of data collected should guide the choice
of the type of mathematical model to construct. A common
approach is to describe the signaling biochemical processes as a
set of differential equations, providing a natural and detailed
description of the underlying molecular events. However, this
detailed description requires a great deal of knowledge about bio-
chemical interactions between proteins and a great deal of data for
training the resultant model (14) such that these models are typi-
cally limited to a dozen or so proteins and one or two pathways
(15–17).
Less detailed descriptions based on the conjectured activating and
inhibitory relationships between proteins can model larger
networks. Logic models are one such approach that has been
successfully applied to model signaling networks consisting of
several pathways (18–21). This ability to capture multiple pathways
is crucial for the study of the effects of chemical compounds,
because crosstalk mechanisms can propagate an effect from one
pathway to another, and compounds can interact with several
proteins in different pathways (off-target effects) (22–24).
Furthermore, cytotoxicity can be controlled by multiple pathways, a
concept recently demonstrated in primary hepatocytes (25).
Models of multiple signaling pathways are typically constructed
from literature and databases, lumping together information from
different cell types and experimental conditions. However, the
toxicity of compounds can differ greatly from cell to cell, and in
the realm of pharmacology these differences are precisely what is
exploited to affect diseased but not healthy cells. Therefore, it
is critical to study compounds within cell- and context-specific
models. These specific models can be obtained by training a general
model derived from the literature to a data set gathered from a
specific cell type under defined treatments.
We have recently developed an approach based on logic modeling to
create cell- and context-specific models of large signaling
networks and implemented it in the tool CellNetOptimizer (CellNOpt;
(26, 27)). The basis of these models is prior knowledge, summarized
in graphical form into a prior knowledge network, that is then
trained to high-throughput data. The resulting Boolean or
constrained fuzzy logic models can be used to find pathways
differentially regulated in healthy and diseased cells (thereby
pointing to potential novel targets (24)), to study the effects of
drugs on the propagation of signals (23), and to uncover unknown
off-target effects of drugs (24). Here, we will first describe the
application of the approach using a simple toy example and then
proceed to a more realistic example: training a network describing
pro-growth and inflammatory signaling pathways to data that include
stimulation with extracellular ligands and perturbation with
small-molecule inhibitor drugs in the liver cell line HepG2.

2. Materials

Many tools exist for modeling signaling networks at the biochemical
level (28, 29), and several tools have been developed specifically
for the simulation and analysis of logic-based models ((30–40),
reviewed in ref. 21). Here we will describe CellNetOptimizer
(CellNOpt (26, 27)), which trains cell-specific models by linking
high-throughput experimental data to logic-based models.
Constructing a cell-specific model is of great importance for
toxicology, since drug effects can vary greatly between cell types.
MATLAB (version 2007a or above) is required to run
CellNetOptimizer. Additionally, the MATLAB Optimization Toolbox is
required for the final step of refining a constrained fuzzy logic
model, but the initial training does not require this additional
toolbox. CellNOpt itself is freely available at
http://www.ebi.ac.uk/saezrodriguez/software.html.

3. Methods

CellNOpt takes a prior knowledge network (PKN) summarizing what is
believed to be true about the network and systematically compares
it to an experimental dataset to derive a logic model that explains
the data as well as possible. The approach can model different
levels of detail depending on the available data (from specific
phosphorylation sites of proteins to macroscopic features), but
typically each node in the PKN corresponds to the activation of a
protein (or the phosphorylation of a specific site on a protein).
We will describe the logic modeling formalism, the data, and the
PKNs CellNOpt uses before describing the procedure it follows to
train models.

3.1. Logic Modeling in CellNOpt

CellNOpt trains either Boolean logic (BL) or constrained fuzzy
logic (cFL) models. In BL models, species states are modeled as
either inactive (zero) or active (one). In cFL models, species
states are quantitative in that they can take any value on the
closed interval between 0 and 1.

3.1.1. Boolean Logic In the general sense, Boolean (logic) modeling refers to a common
system for specifying logical relationships between nodes that can
occupy one of two states (0 or 1) (41). In the context of the
CellNOpt methodology, Boolean logic is most easily conceptua-
lized by understanding the determination of a downstream species
state from its input species states. In the simplest case, when one
input activates a downstream output, the output is active if the
input is active (Eq. 1 in Fig. 1a). Conversely, when one input
inhibits a downstream output, the output is inactive if the input is
active (Eq. 2, Fig. 1a).
If two inputs influence a downstream output, CellNOpt relates
the inputs to outputs with either an AND or OR gate. In the case of
an AND gate, both species must be activated for the downstream
species to be active, whereas with an OR gate, activation of either
species is sufficient to activate the downstream species. Thus, if the
logic gate is an AND gate, the output state is the minimum of the

[Fig. 1. Panel (a) lists the gate equations for Boolean logic: (1) D = A; (2) D = 1 − A; (3) A AND B: D = min(A, B); (4) A OR B: D = max(A, B); (5) D = max(min(A, B), C); and for constrained fuzzy logic: (6) D = f(A); (7) D = 1 − f(A); (8) D = min(f(A), f(B)); (9) D = max(f(A), f(B)); (10) D = max(min(f(A), f(B)), f(C)). Panel (b) plots the constrained fuzzy logic transfer function, output = g(1 + k^n)·input^n / (input^n + k^n), against input value for several combinations of g, n, and k.]

Fig. 1. Description of logic formalisms. (a) To determine an output given a one-input
activating gate, Boolean logic (BL) as formulated by CellNOpt uses the binary zero or one
value of the input as the value of the output (Eq. 1). The output of a one-input inhibitory
gate is evaluated using Eq. 2. If a gate has more than one input, CellNOpt uses the min. operator
to evaluate an AND gate and the max. operator to evaluate an OR gate. If a node has more
than two inputs, AND gates are evaluated before OR gates. In our description here, we use
minimum and maximum operators to evaluate AND and OR gates, rather than the product
and sum operators more frequently associated with BL, because the descriptions are
identical in CellNOpt's implementation, and the min./max. description is more readily
extensible to cFL, which similarly uses min. and max. operators to evaluate the AND and OR
gates. However, instead of using the value of the input directly, cFL first transforms the value using a
transfer function (Eq. 6). The output of this transfer function is a possible output value that
is then operated on with the min. and/or max. functions depending on the structure of the
logic gate. (b) The cFL transfer function has three parameters: a gain, g, which determines
the maximal value of the output; a sensitivity parameter, k, which determines the EC50; and
a Hill coefficient, n, which determines the sharpness of the output value transition.
The influence of each parameter is shown.

input species states whereas, if the logic gate is an OR gate, the


output state is the maximum of the input species states (Eqs. 3 and
4, Fig. 1a).
If a node has three or more inputs, it could be related to its
inputs by many combinations of logic gates. In CellNOpt, the
output state in these cases is determined by first evaluating AND
gates (the minimum of the input states) and then evaluating OR
gates (the maximum of the resultant gates). Thus, the logic state-
ment D = (A AND B) OR C is evaluated by first evaluating the
AND gate (minimum of A and B) and then evaluating the OR gate
(maximum of the previously evaluated gate and the value of C, Fig. 1a).
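
As a toy illustration of this evaluation order (our own MATLAB snippet, not CellNOpt code; the variable names are ours), the statement above can be computed with the min./max. operators directly:

    % Evaluate D = (A AND B) OR C with the min./max. convention.
    A = 1; B = 0; C = 1;        % Boolean node states (0 or 1)
    andGate = min(A, B);        % AND gates are evaluated first
    D = max(andGate, C)         % then OR gates; here D = 1 because C is active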

3.1.2. Constrained Fuzzy Logic

Constrained fuzzy logic (27) is an extension of the previously
described Boolean logic formalism using a simplified form of the
traditional fuzzy logic gate (42–44). Instead of describing activa-
tion with a simple one-to-one relationship, a quantitative transfer
function relating the input and output nodes is used. Thus, if one
input activates an output, the resultant output state is equal to the
evaluated transfer function (Eq. 6, Fig. 1a). If multiple inputs
activate a downstream species, the relationship could, as with
the Boolean case, be an AND gate or an OR gate. In the case
of the AND gate, the minimum possible output as evaluated by the
transfer functions is taken as the output state whereas for an OR
gate the maximum possible output is taken (Eqs. 8 and 9, Fig. 1a).
Gates with more than two inputs are evaluated in the order
described for the BL case (Eq. 10, Fig. 1a).
The transfer functions relating input and output species in cFL
can be of virtually any mathematical form. We have currently
implemented a normalized Hill function multiplied by a gain
(Fig. 1b) in CellNOpt. This transfer function can take on a variety
of shapes, including linear, sigmoidal, and step-like. The shape is
determined by three distinct parameters: a sensitivity parameter, k,
that determines the level of input that results in an output level of
0.5 (the EC50), the Hill coefficient, n, that determines the sharp-
ness of the transition between a high and low output state, and the
gain, g, that determines the maximum output obtained by a fully
activated input.
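
A minimal MATLAB sketch of this transfer function (our own implementation for illustration; the parameter values below are arbitrary):

    % Normalized Hill function multiplied by a gain:
    % f(x) = g * (1 + k^n) * x^n / (x^n + k^n), so f(0) = 0 and f(1) = g.
    f = @(x, g, n, k) g .* (1 + k.^n) .* x.^n ./ (x.^n + k.^n);
    x = 0:0.01:1;
    plot(x, f(x, 1, 3, 0.5), x, f(x, 0.5, 1.1, 0.5))  % two example shapes
    xlabel('Input Value'); ylabel('Output Value')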

3.1.3. Simulation Algorithm and Its Limitations

CellNOpt simulates networks by evaluating what each node's state
should be given its input values at the previous simulation step (synchro-
nous updating). This is carried out until node states no longer
change, that is, the system reaches the logic steady state (20). If
networks contain feedback, this might cause node activation states
to oscillate and never stabilize. For example, if A activates B but B
inhibits A, when A is active, B will be active. However, the activa-
tion of B leads to the deactivation of A, which in turn deactivates B,
allowing A to be reactivated. If node states do not stabilize within a
prespecified number of simulation steps, their value is considered

indeterminable, which is penalized during model training. This will


generally lead to removal of negative feedback in trained networks.
However, this is a reasonable approximation when modeling early
events of signal transduction (30).
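
Synchronous updating can be sketched in a few lines of MATLAB (a toy illustration of the scheme just described, not the CellNOpt implementation); the two-node feedback loop below never stabilizes, so its states would be flagged as indeterminable:

    % Toy feedback loop: A activates B, B inhibits A (synchronous updating).
    state = [1 0];                             % initial states [A B]
    maxSteps = 20;  stable = false;            % prespecified step limit
    for step = 1:maxSteps
        newState = [1 - state(2), state(1)];   % A = NOT B; B = A
        if isequal(newState, state)
            stable = true;  break              % logic steady state reached
        end
        state = newState;
    end
    if ~stable
        disp('Node states oscillate and are considered indeterminable.')
    end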
The determination of the logic steady state of networks has
implications for the models trained by CellNOpt. The networks (1)
model the increase (or decrease) in species states from their initial
state, (2) are static in that they only capture one time point relative
to time zero, and (3) cannot model negative feedback. Extensions
to handle time-resolved data are in development (see Note 11).

3.2. Experimental Design of Datasets

CellNOpt was designed to train models to experimental measure-
ments describing the response of network species to perturbations.
Perturbations are divided into two categories: stimulation (activation)
and inhibition (deactivation). Stimulation is typically accomplished
by the addition of a ligand to the cellular environment, resulting in
the activation of known receptors on the cell's membrane. Inhibi-
tion is a perturbation that leads to the deactivation of a specific node
in the network. Examples of such perturbations include the addition
of a small molecule inhibitor or knockdown at the mRNA or gene
level.
For the purpose of model training, combinations of experimen-
tal perturbations (stimulation in the presence or absence of inhibi-
tion) are required in order to access various states in the nodes of
the network of interest. For example, consider the case where two
inputs (A and B) might activate a downstream output (C). The
possible logic gates describing dependence of C on A and B are
shown in Fig. 2a. To reliably determine the appropriate logic gate,
knowledge of the behavior of C under conditions of activation of
both A and B, only A, and only B is required. In order to obtain
these conditions, it might be necessary to stimulate and inhibit the
network in several ways. For example, if the previous interaction
between A, B and C is embedded in the network shown in Fig. 2b,
activation of both A and B can be obtained by the activation of P
while activation of only B can be obtained by the activation of Q.
However, in order to activate only A, P must be activated during
the simultaneous inhibition of B. If all or most state combinations
cannot be accessed, the precise gate to use cannot be determined
and more than one possible solution exists. This situation, termed
model nonidentifiability, is common in modeling biochemical sys-
tems (45–47).
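
To make the identifiability argument concrete, the following MATLAB sketch (our illustration; the observations are hypothetical) scores the candidate gates of Fig. 2a against measurements of C; with all four input combinations accessible, a single gate is selected, whereas dropping combinations lets several gates fit equally well:

    % Candidate gates relating inputs (A,B) to output C (cf. Fig. 2a).
    gates = {@(a,b) min(a,b), @(a,b) max(a,b), ...   % A AND B; A OR B
             @(a,b) a, @(a,b) b, @(a,b) 0*a};        % only A; only B; neither
    inputs = [0 0; 0 1; 1 0; 1 1];   % all accessible input combinations
    obsC   = [0; 1; 1; 1];           % hypothetical measurements of C
    for g = 1:numel(gates)
        predC = arrayfun(@(i) gates{g}(inputs(i,1), inputs(i,2)), (1:4)');
        fprintf('gate %d consistent with data: %d\n', g, isequal(predC, obsC));
    end
    % Only the OR gate reproduces these observations; remove a row of
    % "inputs" and more than one gate fits (nonidentifiability).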

3.3. Loading the Prior Knowledge Network and Data into CellNOpt

A PKN is a graph in which nodes are biological entities (proteins,
phosphorylation sites of proteins, mRNA, etc.). The edges in a
PKN, termed interactions, are directed in that they relate an
input to an output node and signed in that they indicate whether the
effect is activating or inhibitory.

[Fig. 2. Panel (a): the five candidate gates by which A and/or B might influence C (both A AND B are necessary to activate C; A OR B can activate C; only A activates C; only B activates C; neither A nor B activates C), each defined by its truth table over the input combinations (A,B) = (0,0), (0,1), (1,0), and (1,1). Panel (b): the experiments needed to access all input states: activate nothing (A = 0, B = 0); activate P (A = 1, B = 1); activate Q (A = 0, B = 1); activate P while inhibiting B (A = 1, B = 0).]

Fig. 2. Possible logic gates underpinning a prior knowledge network. (a) If it is believed that A and/or B might influence
the activity of C, then the influence of A and/or B on C might take any of the logical formalisms indicated by the truth tables.
In order to differentiate between these possibilities, one must access all possible state combinations of the inputs. (b) If A and
B are activated by P and Q as shown, then in order to access all possible state combinations, the experiments listed
must be conducted, one of which includes inhibition of species B.

The information contained in a PKN can be obtained from many
sources: (1) informally from literature curation, (2) commercial and
academic databases (e.g., Pathway Commons: www.pathwaycommons.org,
reviewed in ref. 48), and (3) integrative tools (49, 50);
see Note 1. CellNOpt can load PKNs in two formats: a sif file
format that can be exported from Cytoscape ((51), see Table 1 and
Note 2) or a combination of two files in a format compatible with
CellNetAnalyzer (30) that can be exported from the modeling tool
ProMoT (52). Since Cytoscape is able to load networks in the stan-
dard BioPAX format (53), one can import networks from multiple sources

Table 1
Example of initial toy prior knowledge network in sif
format

Input Sign (1 if activating, −1 if inhibiting) Output


EGF 1 Ras
EGF 1 PI3K
TNFa 1 PI3K
TNFa 1 TRAF6
TRAF6 1 p38
TRAF6 1 JNK
TRAF6 1 NFkB
JNK 1 c-Jun
p38 1 Hsp27
PI3K 1 Akt
Ras 1 Raf
Raf 1 MEK
Akt 1 MEK
MEK 1 P90RSK
MEK 1 ERK
ERK 1 Hsp27
The model is depicted in Fig. 7a. If an AND gate is to be included in the PKN, a node
named and1 is included and connected to the inputs and output of the AND gate with
multiple row entries

such as the general portal Pathway Commons. The PKN must be


saved in a directory located in the Models directory in CellNOpt.
The name of the directory containing the PKN file(s) is considered
the PKN name.
The data format of CellNOpt, a CNOProject, is a MATLAB
structure consisting of fields describing the names and values of the
stimulated, inhibited, and measured species. Table 2 describes each
field in this data format. The MATLAB toolbox DataRail (54) can
generate a CNOProject from a .csv file adhering to the MIDAS format
(Fig. 3a); see Note 3. Figure 3 explains the process of loading data into
DataRail and converting it into a CNOProject. Alternatively, the
CNOProject data structure can be generated manually by uploading
the species names and data values into the MATLAB workspace; see
Note 4. Use of DataRail is recommended to avoid errors inherent in
any manual process. Furthermore, DataRail provides routines for
data normalization tailored to modeling with CellNOpt.

Table 2
CellNOpt data input format, a MATLAB structure called
CNOProject

Field name Description


namesStimuli n × 1 cell of strings, each cell containing the name of a
stimulus treatment
namesInhibitors m × 1 cell of strings, each cell containing the name of an
inhibitor treatment
namesSignals p × 1 cell of strings, each cell containing the name of a
signal measured
valueStimuli e × n matrix where a row specifies for each experiment
the scaled value of stimulus treatments, and each
column contains values for the treatment with the
corresponding index in the namesStimuli field
valueInhibitors e × m matrix where a row specifies for each experiment
the scaled fraction of knockdown achieved by each
inhibition treatment, and each column contains
values for the treatment with the corresponding index
in the namesInhibitors field
valueSignals e × p × t 3D matrix composed of t 2D matrices in
which a row specifies for each experiment the scaled
signal measurement, and each column contains values
for the treatment with the corresponding index in the
namesSignals field
timeCues Time when treatments were added^a
timeSignals 1 × t vector of the times that signals were measured^a
dataCubeStruct, Optional fields. If generated from DataRail, these fields
dataCubeNames allow reconstruction of DataRail's five-dimensional
data cube
n is the number of stimulus treatments; m is the number of inhibitor treatments; p is the
number of signals measured; and t is the number of times that signals were measured
^a CellNOpt currently assumes that all treatments were added at time zero and all measured
species were zero at time zero. Only one later time point is considered for measured
signal values
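
For orientation, a minimal CNOProject could also be assembled by hand along the following lines (a sketch based on the field definitions in Table 2; the species names and numerical values are invented for illustration, and DataRail remains the recommended route):

    % Hand-built CNOProject with n = 2 stimuli, m = 1 inhibitor, p = 1 signal,
    % e = 2 experiments, and t = 1 measured time point (see Table 2).
    CNOProject.namesStimuli    = {'EGF'; 'TNFa'};
    CNOProject.namesInhibitors = {'Raf'};
    CNOProject.namesSignals    = {'Akt'};
    CNOProject.valueStimuli    = [1 0; 0 1];   % experiments x stimuli
    CNOProject.valueInhibitors = [0; 1];       % experiments x inhibitors
    CNOProject.valueSignals    = [0.9; 0.7];   % experiments x signals (x time)
    CNOProject.timeCues        = 0;            % treatments added at time zero
    CNOProject.timeSignals     = 30;           % one later measurement time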

3.4. Training a Logic Model with CellNOpt

After the PKN is saved in the Models directory and the data have
been imported into DataRail, one can either run CellNOpt through
DataRail using GUIs or save the data in the Data folder and run
CellNOpt with dedicated scripts (Fig. 4); see Note 5. The latter
allows for more flexibility but requires greater MATLAB scripting
capabilities. Regardless of which method is used, CellNOpt has many
parameters that affect the PKN preprocessing and model training
procedure.

[Fig. 3. Panel (a): an example MIDAS spreadsheet with columns TR:Cell, TR:EGF, TR:TNFa, TR:RAFi, TR:PI3Ki, DA:Akt, and DV:Akt, one row per treatment condition. Panel (b): the DataRail main GUI with the Load Data from File and Transform Data and Add Array buttons. Panel (c): the MIDAS importer window, in which treatments are assigned as cell type (Cell), stimuli (EGF, TNFa), or inhibitors (RAFi, PI3Ki), with possible readouts Akt, Hsp27, NFkB, Erk, p90RSK, Jnk, and cJun. Panel (d): the normalization dialog. Panel (e): the CreateCNOData dialog that creates the CNOProject.]

Fig. 3. Loading data into CellNOpt. DataRail is a freely available resource for the storage, transformation, and analysis of
experimental data describing the signaling and phenotypic response of a biological system to environmental perturbations
(54). (a) Data in the MIDAS format can be loaded into DataRail. The MIDAS format lists cell type, stimuli condition, and

Table 3 describes the parameters for both BL and cFL


model training.
The main workflow of CellNOpt (Fig. 5) consists of two main
stages: (1) model preprocessing and (2) model training. The first
stage preprocesses the PKN into a logic superstructure that does
not contain species that are neither captured by the data nor neces-
sary to preserve logical consistency but does contain logic gates
relating species to one another. The second stage uses a genetic
algorithm to train the resulting logic superstructure. To fully train a
cFL model, a process of reduction and refinement occurs, as
described in ref. 27. The final part of the modeling process,
model analysis, takes place after several models have been optimized
so conclusions can be drawn from multiple results. In the following,
we explain these steps in detail; these concepts will also be discussed
in Subheading 4 in the context of specific examples.

3.4.1. Model Preprocessing

In the first preprocessing step, termed compression, species that are
either measured or perturbed are labeled as "designated" and
remain in the network. Species that are neither measured nor
perturbed, "undesignated" species, are removed from the prior
knowledge network if it is possible to remove them and ensure
that the relationships between designated species will be logically
consistent. Practically, logical consistency is preserved by not
removing an undesignated species if it is the output for more than
one interaction and the input for more than one interaction (26).
In the second preprocessing step, termed expansion, interac-
tions are converted into logic gates. OR gates are implicitly repre-
sented when a species has more than one possible input species.
However, AND gates must be explicitly added to the model. The
expansion step can create AND gates from all possible combina-
tions of inputs (26). This adds complexity to the model and can
significantly decrease computational efficiency. For example, if a
species has six possible inputs, 57 AND gates can be constructed,
22 of which have more than three inputs. In many biological
systems, it is highly unlikely that activation of four independent
pathways is necessary for activation of a downstream species.

Fig. 3. (continued) inhibitor conditions as treatments, indicated by TR in the column header. Data acquisition times
(DA) and Data Values (DV) for each measured species are also listed. Each row represents a different treatment
condition, and the values in each column indicate the treatment or measurement value for that condition. This spreadsheet
should be saved as a .csv file, which can then be loaded into DataRail. (b) After starting DataRail, the Load Data from
Local File button in the main GUI will allow the investigator to select the saved MIDAS file. (c) The importer window allows
the investigator to indicate which treatments are cell types, stimuli, and inhibitors. (d) After the data is loaded into DataRail,
a variety of normalization procedures can be selected from the Transform Data and Add Array option in the main GUI to
scale the data between zero and one. (e) Once the data processing is complete, the CreateCNOData code will create a
CNOProject through the Transform Data and Add Array option in the main GUI.
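
For concreteness, a hypothetical MIDAS .csv along these lines (the column set mirrors Fig. 3a; the values are illustrative only) might read:

    TR:Cell,TR:EGF,TR:TNFa,TR:RAFi,TR:PI3Ki,DA:Akt,DV:Akt
    1,1,0,0,0,10,0.9
    1,0,1,0,0,10,0.7
    1,1,1,1,0,10,0.9
    1,1,1,0,1,10,0.0

Each row is one treatment condition; DA:Akt gives the time at which Akt was measured and DV:Akt the measured value.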
[Fig. 4. Panel (a): flowchart for running CellNOpt. After loading the data into DataRail (normalizing and converting to CellNOpt format) and saving the model in the Models folder of CellNOpt, one can either run CellNOpt through the DataRail interface using GUIs (choose the model, edit parameters, run the optimization) or save the data in the Data folder of CellNOpt and run CellNOpt using scripts. To export the data from DataRail: (1) press Explore Data in DataRail; (2) in the MATLAB workspace, enter the command CNOProject = Compendium.data(IX).Value, where IX is the index of the CellNOpt-formatted data; (3) save CNOProject in the Data folder of CellNOpt.]

Panel (b): example script.

startCNO;  % load CellNOpt and (optionally) DataRail paths if not loaded

%% Specify CNOProject & Model
ModelName    = 'ToyPKNMMB';  % corresponds to folder containing PKN in CNO/Models
FileWithData = 'ToyDataMMB'; % should be in current directory or path

%% Set CellNOpt Parameters
Parameters = struct('ParameterName', ParameterValue);  % placeholders; see Table 3

%% Load Model and data
[CNOProject SourceArray] = CNOGetArrayFromFile(FileWithData, PGet);
Model = CNOLoadModel(ModelName);
Parameters.Model = Model;

%% Run Identification
DRArray = CNORunIdentification(CNOProject, Parameters);

Fig. 4. Running CellNOpt. (a) Once the model and data are in the proper format, CellNOpt can be run with DataRail or using
scripts. The CellNOpt button in DataRail starts the software and, after the required dataset is selected, brings up a GUI for
choosing the prior knowledge network. After selecting the appropriate PKN and pressing the Run Optimization button, a GUI
appears with parameters for the CellNOpt optimization process. (b) Example script for running CellNOpt.

Table 3
CellNOpt parameters

Parameter name Possible values Description


Model type Boolean, Fuzzy Type of model to train
Expansion parameters
IdentGates All, Nots, or None All results in expansion of input/output relationships
into AND gates when possible (i.e., when a node
has two or more possible inputs), whereas Nots
results in expansion only when one input is inhibitory
(Note 10). If one does not wish to model the
complexity of an AND gate, all gates are assumed to be
OR gates. However, CellNOpt models an inhibitory
gate so that the output node is active when the input
node is inactive. This is only the case in biology when
the output node is constitutively active. To model the
more common case where the output is only active if a
positive input is present, an AND gate is required. The
Nots option generates only those needed AND
gates. None results in no expansion step
MaxInputsPerGate Positive integer Maximum number of inputs into any one AND gate
splitANDs True/false If an AND gate is in the prior knowledge network, this
parameter indicates if it should be split into OR gates
in the logic superstructure as well (true if so, false
otherwise)
Objective function parameters
nanfac Positive real number Penalty assigned to nodes that are not calculable in the
simulation. Nodes might not be calculable because
they oscillate due to a feedback loop. Typically given a
value of 1, which is the largest possible error between
the simulation and data
sizefac Positive real number Each input to a logic gate is penalized by this amount
(i.e., a two-input AND gate is penalized by this factor
twice). Must be zero for reliable cFL training.
Typically ~0.001 for size to have weak effect on score
compared to model error for BL training
Randomization controls parameters
RandomizeModel True, false As a control, one can compare to randomized PKNs
RandType SwapTails, SwapInputs, Type of PKN randomization (26)
SwapHead,
MinimalSpecifications,
MinimalWithReadouts
RandomizeData True, false As a control, one can compare to randomized data
Optimization parameter
Optimizer CNOga For BL: Genetic algorithm utilizing Sperner systems to
reduce the search space by only considering the logic
gates combinations that are not logically redundant
CNOgabinary For BL: Genetic algorithm that directly optimizes
presence/absence of a gate
Enumerate For BL: Enumerates all possible combinations of gates
for problems with few possible gates. More than ~27
gates cannot be accommodated because of memory
constraints
CNOgadiscreteType For cFL: currently the only optimization method for
cFL described in ref. 27
Genetic algorithm parameters
StartPara Empty or (1 × r) vector, This is the starting BitString for the GA when training a
where r is the number of Boolean model with CNOgabinary as the optimizer
gates in the processed (not applicable to cFL model training). This
model parameter was used to investigate adding interactions
to the BL model in ref. 26. If this is not desired, this
parameter should be set to an empty vector ([])
MaxTime Positive real number The GA will stop after this amount of time (in seconds)
PopulationSize Positive integer Number of individuals tested during each generation of
the GA
MaxGens Positive integer The GA will stop after this many generations
StallGenLimit Positive integer The GA will stop if it has been this many generations
since the objective function improved
StallTimeLimit Positive real number The GA will stop if it has been this much time
(in seconds) since the objective function improved
ObjLimit Positive real number The GA will stop if the objective function reaches this
value
Pmutation Real number Each generation, the probability that one bit/number
in each individual is randomly changed
ShowStats Simple, full, or Amount of statistics displayed for each generation of the
none GA
Statistics True/false Whether or not to include GAstatistics in the Res
structure
FitnessAssignment Proportional or How the fitness is assigned to each individual, either
Ranking proportional to the score or ranked through a
linear or nonlinear transformation
Ranking Linear or nonlinear If fitness is assigned according to the rank, the fitness
can be computed using either linear or nonlinear
ranking. The latter allows for higher selection
pressures
SelectivePressure Real number greater If fitness is assigned according to the rank, this number
than one is used in the calculation of fitness to increase the
speed of loss of diversity and thus convergence. If
linear ranking is used, [1, 2]; if nonlinear ranking,
[1, PopulationSize − 2]
SigmaScaling True/false Mean centers and scales fitness of individuals to decrease
the variance in fitness values
SigmaScalingCoeff Positive real number Variance of Sigma Scaling
Selection Roulette, How the individuals that are permitted reproduction are
truncation, or chosen. Typically stochastic uniform sampling (SUS)
SUS
Elitism Positive integer Number of individuals retained for the successive
generation
RelTol Positive real number All solutions found by the GA within this fraction of the
best solution are returned in GAstatistics. Typically
~0.01 so that all solutions within 1 % of the best are
returned
Silent True/false If true, GA warnings are suppressed
Output options
SaveResults True/false Save results as .mat file. This should almost always be
true
CreateReport True/false If true, report of single run is compiled (BL only)
ReportFolder Directory location Directory in which report and results file are saved
Log String Tag to append to saved report and results file
Fuzzy specific parameters
Type1Funs (w × 3) matrix where w is Set of transfer functions that the GA chooses from for
the number of transfer relating nonstimuli inputs to outputs. Each row is a
functions to choose transfer function. The first column contains the gain
from (g), the second the Hill coefficient (n), and the third
the sensitivity parameter (k)
Type2Funs (w × 3) matrix Set of transfer functions that the GA chooses from for
relating stimuli inputs to outputs. Follows the same
scheme as Type1Funs. There must be the same
number of type 1 and type 2 transfer functions
ReductionThresh Vector of positive Reductions thresholds used during reduction step of
real numbers cFL model training. If the reduction threshold is too
high (greater than ~0.01), empty models may be
returned, resulting in a failure of the reduction and
refinement stages
These parameters are defined when calling CellNOpt to train a PKN to data. Most parameters have a default value. When
calling CellNOpt via the GUI, only some of these parameters can be modified

Thus, the number of inputs used to make AND gates can be limited
to a user-specified value.
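
The gate counts quoted above follow from simple combinatorics, which can be checked in MATLAB:

    % AND gates for a node with six inputs: choose any 2..6 of them.
    total = sum(arrayfun(@(k) nchoosek(6, k), 2:6))   % 57 possible AND gates
    big   = sum(arrayfun(@(k) nchoosek(6, k), 4:6))   % 22 with more than 3 inputs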

3.4.2. Model Training

The processed PKN is now a logic superstructure that contains all
allowed logic gates between species that were measured, perturbed,
or necessary to preserve logical consistency. The next step is to train
the superstructure to the data by finding the optimal combination
of gates as defined by an objective function (model score; Fig. 5).

[Fig. 5. Workflow diagram. Preprocessing: the prior knowledge network (PKN) and CNOProject enter the pipeline; unnecessary undesignated species are removed to give the compressed PKN; AND gates are inserted (and, if a cFL model is to be trained, a transfer function is added to each logic gate), yielding the logic superstructure. Training: a discrete genetic algorithm minimizes the score, where score = fit + size for Boolean models and score = fit for cFL models, with

fit = [sum over all data points of (simulated value − measured value)^2] / (number of data values)
size = (number of inputs to gates in model) / (total possible number of inputs to gates in scaffold).

If the model is cFL, reduction thresholds and parameter optimization further refine the trained model.]

Fig. 5. Workflow of CellNOpt. Prior knowledge networks are first preprocessed to remove unmeasured, unperturbed
species (i.e., undesignated species) that are not necessary to ensure logic consistency and AND gates are then added,
resulting in a logic superstructure. This superstructure is then trained to the data using a genetic algorithm which
minimizes the score: sum of mean squared error (MSE) and a size penalty if the model is Boolean or simply the MSE if the
model is cFL. If a cFL model was trained, reduction thresholds and parameter optimization can be used to further refine the
model (27).
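
In outline, the score minimized by the GA can be computed as in the sketch below (our illustration of the quantities in Fig. 5; the numbers are invented, and the weighting of the size term by sizefac is one plausible reading of Table 3):

    % Score for a candidate Boolean model (cFL models use fit alone).
    simulated = [0.9 0.1 0.8];  measured = [1 0 1];          % example values
    fit = sum((simulated - measured).^2) / numel(measured);  % mean squared error
    sizeTerm = 5 / 12;           % inputs used / total possible inputs
    sizefac = 1e-3;              % size penalty weight (see Table 3)
    score = fit + sizefac * sizeTerm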

CellNOpt uses a genetic algorithm (GA) to solve this optimization


problem (see Note 6), but other optimization methods can be
used (23).
If a Boolean logic model is to be trained, the GA operates by
randomly testing combinations of the possible logic gates. These
models are specified by bit strings (arrays of ones and zeros) where
each gate is specified by a bit. A one indicates that the
corresponding gate is present whereas a zero specifies that it is
not. The stimulation and inhibition conditions provided are used

to simulate these smaller models and the error between the


simulated and experimentally measured species states is calculated.
Models are scored based on model fit and model size, and the GA
minimizes the score (i.e., sum of model error and size, Fig. 5) by
retaining a subset of best scoring models of each generation and
randomly recombining and mutating the bit strings of these models
for the next generation of models until the algorithm converges on
a model with a low score.
If a cFL model is to be trained, the logic superstructure is
extended to include transfer functions. This cFL superstructure is
then trained to the experimental data with a genetic algorithm. For
cFL, the GA randomly tests subsets of the possible logic gates' transfer
functions that are chosen from a user-specified list of possible
transfer functions with fixed parameters (see Subheading 3.1.2).
These models are specified by strings of discrete numbers, where
each number greater than zero corresponds to a different choice of
transfer function and the number zero indicates that the interaction
is not present. The resultant models are simulated with the experi-
mental stimulation and inhibition conditions and the simulation
values compared to the data. For the cFL case, the GA should
minimize a score that does not include a size penalty because the
additional complication of transfer functions causes the GA to
converge to a local minimum more often when a size penalty is
included (Fig. 5). In a postprocessing step, interactions that can be
removed without the model error increasing above a reduction
threshold are removed. By using several reduction thresholds and
plotting the dependence of model fit on reduction threshold used,
one can find a model that fits the data well but with fewer interac-
tions (27). Results of a CellNOpt optimization run are stored in a
MATLAB results structure (Table 4).
In the applications of this methodology to biological systems of
reasonable size, it is rare that the data will constrain the logic
models to a single optimum because the data do not contain
enough combinations of activities of measured species (the non-
identifiability issue mentioned with Fig. 2). Since there are often
multiple models that fit the data similarly well, the CellNOpt opti-
mization process should be conducted several times in order to
produce a population of models that fit the data well. For Boolean
logic, it is also possible to analyze multiple models found during a
single run that fit the data similarly well. These models are stored in
the GA statistics field of the results structure, described in Table 4.

3.5. Model Analysis

Results of a CellNOpt optimization run can be analyzed in the several
ways described below. CellNOpt contains functionality to aid in the
analysis of the models' structures and fit to the data, described in
the manual that is shipped with CellNOpt. Other types of analysis
are more user-specific, but because CellNOpt is implemented in
MATLAB, they can easily be scripted.

Table 4
Results structure

Field name Description


CurrBest Best score returned by the GA
Params Vector returned by the GA (Bit string if Boolean,
discrete numbers if cFL)
CNOProjectStart Starting CNOProject. Model subfield contains the
starting PKN
CNOProject CNOProject from which any nonmodeled species have
been removed. Model subfield contains the
processed PKN. Parameters subfield contains a list
of CellNOpt parameters used
Boolean-specific fields
GAstatistics If Statistics parameter is true, contains information
about each generation of the optimization run. Also
returns the set of solutions with scores that are within
a fraction of the best solution (set by RelTol
parameter)
cFL-specific fields
Bit Bit string resulting from the discrete numbers returned
by the GA
CutBit Bit string with logical redundancy removed
UnRef Models returned by GA. The first substructure contains
the unrefined model and mean squared error (MSE)
of the model returned by the GA. The second
contains the unrefined model and MSE of the model
returned by the GA with logical redundancy removed
RefRef Reduced, refined models. The first substructure contains
the refined model and MSE of the model returned by
the GA. The second contains the refined model and
MSE of the model returned by the GA with logical
redundancy removed. The higher indices contain the
refined models and MSEs returned after reduction
with the reduction thresholds specified by the
CellNOpt Parameters
Upon a training run, CellNOpt generates a MATLAB structure containing the fields
above

3.5.1. Cross-Validation and Determination of Statistical Significance

Cross-validation can be accomplished by manually constructing a
subset of the data, either chosen randomly or specified by the user,
as a test set. This test set can be left out of the training set and
CellNOpt used to train a PKN to the remaining training data.
The trained models can then predict the test set in order to estimate
the predictive capacity of the resultant models.

To assess the statistical significance of the models, CellNOpt


can either randomize the training data or PKN by setting the
RandomizeData or RandomizeModel parameter to true.
The frequency of randomized trained models with mean squared
error less than or equal to the real data trained to the real PKN
can then be used to calculate or estimate a P-value of the trained
networks.
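
The corresponding computation is a one-liner (a sketch with invented score vectors):

    % Empirical P-value: fraction of randomized runs that score at least
    % as well (MSE at most as large) as the run on the real data and PKN.
    realMSE = 0.05;
    randMSE = [0.21 0.18 0.30 0.04 0.25 0.19 0.27 0.22 0.17 0.26];
    pValue = mean(randMSE <= realMSE)   % here 0.1: 1 of 10 randomized runs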

3.5.2. Analysis of Model Structural Features

The simplest way to analyze the structures of the trained models is
to visualize the average structure. CellNOpt contains functionality
for exporting models as either Cytoscape (www.cytoscape.org) or
graphviz (graphviz.org) structures. CellNOpt automatically gener-
ates graphviz layouts tailored to common analytical procedures
(see Note 8).
To compute the average structure of the trained models, one
simply averages the bits for each interaction. For example, when
averaging the results of ten structures, if a bit was present in every
trained model, it would have an average value of one. However, if a
bit were present in one of the ten models, it would have an average
value of 1/10 = 0.1. The resultant structure indicates which inter-
actions were present in a large fraction of trained models. Interac-
tions that are always present in the trained models are likely
necessary to fit the data. On the other hand, interactions that
are never or infrequently present are likely to be either inconsistent
with or not necessary to fit the data. These relationships can
easily be verified by examining experimental data of the removed
interactions' inputs and outputs. The interactions that are some-
times present can be investigated and experiments devised to deter-
mine which structural feature is correct in the biological system of
interest.
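
In MATLAB, this averaging is simply a column mean over the bit strings of the trained models (illustrative bit strings below):

    % Rows = trained models; columns = interactions (1 = gate present).
    bits = [1 1 0 1;
            1 0 0 1;
            1 1 0 1];
    avgStructure = mean(bits, 1)   % [1 0.67 0 1]: edges 1 and 4 are always
                                   % present, edge 3 never is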

3.5.3. Analysis of Model Fit to Training Data

CellNOpt creates a plot of the data compared to the model simula-
tion. The background darkness is indicative of error (darker for
more error). If the models are consistently unable to fit a subset of
the data, this systematic error indicates that the prior knowledge
network was inadequate for capturing the data. Additional interac-
tions can be hypothesized from literature, databases, or experimen-
tal sources and added to the prior knowledge network to explain
the data (see Note 9).

3.5.4. Prediction of Signaling States Under New Conditions

By simulating each individual trained model with combinations of
stimulation and inhibition conditions not present in the training
data, one can predict what the states of species in the network will
be under the simulated conditions. Furthermore, because each
individual model can be simulated, an average and standard devia-
tion of each prediction can be readily calculated. If the standard
deviation of the prediction is high, the models are not constrained

for the prediction and the investigator should conclude that


the models cannot predict the species state under the simulated
condition.
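
Given each trained model's simulated output for an untested condition, the consensus prediction and its spread follow directly (a sketch with invented numbers; the agreement threshold is arbitrary):

    % Rows = trained models; columns = predicted species states under a
    % new stimulation/inhibition condition.
    pred = [0.8 0.2; 0.9 0.9; 0.7 0.1];
    avgPred = mean(pred, 1);        % consensus prediction per species
    devPred = std(pred, 0, 1);      % spread across the model family
    unconstrained = devPred > 0.3   % flag predictions the models disagree on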

4. Examples

4.1. Simple Toy Example

We pose a toy example to illustrate several important considerations
when using CellNOpt and include the code for training the PKN
and analyzing the results (Fig. 6). The PKN and data files are
located in the Models/SampleModels and Data/SampleData
directories of CellNOpt, and the main directory contains the scripts
used to run all examples. For this example, the prior knowledge
network shown in Fig. 7a is compared to an in silico dataset that
describes the change in activation state of many species in the prior
knowledge network after stimulation with the TNFa and EGF
ligands (Fig. 7b). The data is bounded by zero and one, so no
rescaling is needed.

4.1.1. Model Preprocessing

In our toy example, the PKN (Fig. 7a) indicates that EGF can
activate the Ras/Raf/Mek pathway while TNFa can activate the
PI3K or several inflammatory pathways (JNK, p38, and NFkB) via
TRAF6. In our in silico data set, many species were either measured
or perturbed (designated). However, the PKN also contained sev-
eral species that were neither measured nor perturbed (undesig-
nated). Many undesignated species are removed during the first
structure processing step in CellNOpt, the compression step
(Fig. 7c). The removal of the undesignated species does not com-
promise logical consistency because they have only one input or
only one output (Fig. 7a, dashed circles). However, one undesig-
nated species, MEK, was the input of more than one interaction
involving a designated species as well as the output of more than
one interaction involving a designated species. If it was removed,
logical consistency between designated species in the trained model
and starting PKN could not be guaranteed. Thus, it is kept in the
compressed model.
In the next step, model expansion, AND gates are added to
the model (Fig. 7d). Any node with more than one input could be
related to those inputs with AND gates, OR gates, or gates that are
independent of one or more of its possible inputs. For example, in
the compressed model, Hsp27 could be activated by either ERK or
TNFa. Thus, Hsp27 could be related to ERK and TNFa by an
AND gate, an OR gate, or through only one of the inputs (analo-
gous to the case presented in Fig. 2a). The expanded model con-
tains all of these possibilities.

[Fig. 6. General scheme for training and analyzing models with CellNOpt (left column), with its application to the toy example (right column; the files ship with CellNOpt):
1. Save the CNOProject as a named .mat file in the CellNOpt directory CNO/Data. (Toy example: the data is saved in Data/SampleData.)
2. Save the sif file or CNA files in the directory CNO/Models/<user-specified PKN name>/. (Toy example: the model is saved in Models/SampleModels/ToyPKNMMB.)
3. Run the master script multiple times from the command line or on a computer cluster. (Toy example: CNOMasterScriptBool is in the main CNO directory; while in that directory, enter:
for i = 1:10
    CNOMasterScriptBool
end)
4. All .mat result files are placed in the CNO/Results directory.
5. Create a cell of result file names. (Toy example:
Contents = dir('*MMB*');
[ListNames{1:nnz(ResultsID),1}] = deal(Contents(find(ResultsID)).name); )
6. Compile the results of multiple runs using the cell created above. (A) If the models are Boolean, call the function CNOCompileMultiBoolSol. (Toy example: CNOCompileMultiBoolSol(ListNames,'RunningTutorialExample').) (B) If the models are cFL: (i) call CNOCompileMultiFuzzySol; (ii) use the resulting plot to determine the selection threshold to use; (iii) load the files tagged with allFinalMSEs and allRes saved when the previous function was called; (iv) call CNOAnalyzeFuzzyRes.]

Fig. 6. Training and analyzing a model with CellNOpt. The column on the left describes how to train and analyze a family of
models with CellNOpt, and examples of code for each step applied to the toy example are provided in the right column. In
practical applications, the data does not always constrain the models to one single optimum, and because a genetic
algorithm is used to train the models, running CellNOpt multiple times results in a family of models that fit the data well,
albeit with slightly different features. We provide here a description of how to compile and analyze the results of multiple
CellNOpt runs. see Note 7 for a description of functionality that generates a report of trained Boolean logic models resulting
from a single CellNOpt run.

4.1.2. BL Training and Analysis

The expanded model was then fit to the in silico data using the
genetic algorithm described above. For our toy example, we per-
formed the fitting ten times with two size penalties: 0 and 10⁻⁵. We
then assessed the structures of the resulting models by calculating
the average structure.
The average structures of the ten models trained with a size
penalty of 0 (Fig. 8a) indicate that several AND gates could be used
to describe the interactions between species. However, inclusion of

[Fig. 7. Panel (a): the prior knowledge network, with EGF and TNFa as stimuli feeding Ras, PI3K, and TRAF6 and, downstream, Raf, Akt, MEK, ERK, JNK, p38, p90RSK, Hsp27, NFkB, and c-Jun. Panel (b): the in silico experimental data, plotted for EGF, TNFa, and EGF + TNFa stimulation with no inhibitor, Raf inhibition, or PI3K inhibition. Panel (c): the compressed model. Panel (d): the expanded model.]

Fig. 7. Toy example and prior knowledge network processing. A prior knowledge network (PKN) consists of a signed,
directed graph that summarizes how the investigator believes the studied species will interact. We train a PKN (a)
summarizing how we think several downstream proteins depend on TNFa and EGF to an in silico dataset (b) of stimulation
with one or both of these ligands in the presence or absence of Raf or PI3K inhibition. In the PKN, nodes are light grey
rectangles for stimuli, light grey ellipses for measured, and dark grey rectangles for inhibited species. Undesignated
species are white with a dashed outline, indicating that they will be removed during compression. Species both measured and
inhibited are light grey with a black outline. Before training the model, CellNOpt first compresses the PKN (c) by removing
unmeasured, unperturbed species not necessary for logical consistency. In the case of our toy example, all unmeasured,
unperturbed species were compressed except for MEK, which is the output of more than one interaction as well as the
input of more than one interaction. CellNOpt then expands the model (d) to explicitly add AND gates when a species has
more than one possible input.

a small size penalty of 10⁻⁵ (Fig. 8b) yielded trained models with
identical fit in which the additional complexity of AND gates is not
needed to describe the data. Thus, for the remainder of the analysis,
we will focus on models trained with a size penalty of 10⁻⁵.
The models resulting from training the PKN to in silico data
indicated that, although ERK was hypothesized to activate Hsp27

[Fig. 8. Panel (a): average structure of models trained with a size penalty of zero. Panel (b): average structure of models trained with a size penalty of 10⁻⁵. Panel (c): error between the model simulation and the experimental data, shown for each measured species (Akt, Hsp27, NFkB, Erk, p90RSK, Jnk, cJun) under EGF, TNFa, and EGF + TNFa stimulation with no inhibitor, Raf, or PI3K inhibition.]

Fig. 8. Training initial PKN to in silico data with Boolean logic. The initial PKN (Fig. 7a) was trained to the in silico data
(Fig. 7b) using CellNOpt for ten independent optimization runs with a size penalty of 0 or 10⁻⁵. The bitstrings of the ten
runs were averaged for each size penalty and graphed with CNOCompileMultiBoolSol such that the darkness
of the lines indicates the frequency of the interaction in the average structure. The models resulting from training with a
modest size penalty were simpler than those with no size penalty, although the fit to the data was identical. (c) Upon
examination of the fit of the models trained with a modest size penalty to the data where background darkness indicates
error (darker means more error), one immediately notices that the models failed to fit NFkB activation under EGF
stimulation. This is because the initial PKN (Fig. 7a) lacked a path from EGF to NFkB.

in the PKN, this gate was inconsistent with the data. Thus, it was
removed during the model training. By examining the data, we find
that the experimental basis of the removal of this edge was that Erk
but not Hsp27 is activated with EGF stimulation and the converse
is true with TNFa stimulation (Fig. 7b).

An analysis of error between the model and data indicates that


the models were unable to fit some of the data. By using the trained
model to predict what species states were and comparing to the in
silico data, one can calculate the error in the model fit. Figure 8c
indicates that NFkB was not modeled accurately for EGF stimula-
tion. Although NFkB was activated by EGF stimulation, no paths
existed between EGF and NFkB, so the model training process
could not fit this activation because there is no explanation for it in
the PKN.
To address discrepancies such as these, one can add a link to the
PKN that allows EGF to activate NFkB. From the trained models,
one can immediately see that EGF activates the Ras/Raf/Mek
pathway as well as the PI3K/Akt pathways. Thus, it is possible
that EGF acted through either of these pathways or some pathway
we did not include in our PKN. In this case, we hypothesize new
links from the literature (see Note 9) by noting that either Ras can
activate MAP3K1 (55) or PI3K can mediate MAP3K1 activation
through Rac (56, 57), which can activate IkK, ultimately leading to
the activation of NFkB. Thus, we will include Ras → NFkB and
PI3K → NFkB links to the PKN (Fig. 9a) and train it to the in
silico data.
After training the extended PKN to the in silico data, the
models indicate that EGF activated NFkB through PI3K and not
Ras (Fig. 9b). The data indicate that PI3K inhibition also inhibits
EGF-mediated NFkB activation. Thus, it is reasonable that PI3K
was mediating the activation.
An analysis of error (Fig. 9c) draws attention to the fact that the
trained models are consistently in error when modeling JNK path-
way activation. In the PKN, TNFa could have activated JNK.
However, the trained BL models indicate this is inconsistent with
the data because the JNK signaling pathway is not connected to any
other node in the network. The data indicate that JNK and c-Jun
are consistently partially activated (Fig. 7b) after TNFa stimulation.
However, this activation is to a low level (0.25 for JNK and 0.40 for
c-Jun), resulting in more error if a Boolean model treats these
species as active (1 − 0.25 = 0.75 for JNK and 1 − 0.4 = 0.6 for
c-Jun, for a total absolute error of 1.35) than as inactive
(0.25 − 0 = 0.25 for JNK and 0.4 − 0 = 0.4 for c-Jun, for
a total absolute error of 0.65). Thus, the BL models did not model the
activation of these species.
The preceding case demonstrates a major limitation of BL
modeling. If the data contains intermediate values, BL might
miss the fact that a gate should be present to activate a species
because the partial activation is too low for BL modeling of the
species as active to be worth it from the standpoint of error.
To handle these cases, one can choose to explicitly model the
intermediate activation using cFL.

[Fig. 9. Panel (a): the extended PKN, now including Ras → NFkB and PI3K → NFkB links. Panel (b): average structure of models trained with a size penalty of 10⁻⁵. Panel (c): error between the model simulation and the experimental data for each measured species under EGF, TNFa, and EGF + TNFa stimulation with no inhibitor, Raf, or PI3K inhibition.]

Fig. 9. Training the extended PKN to in silico data with Boolean logic. Due to the systematic error in modeling NFkB (Fig. 8c),
links from PI3K to NFkB and from Ras to NFkB were added to the PKN (a). When this extended PKN was fit to the data using
CellNOpt, the resultant models contained only the PI3K to NFkB link (b) and were able to fit the activation of NFkB under
EGF (c).

4.1.3. cFL Training and Analysis

We thus use cFL to train the PKN to the initial dataset (Fig. 10). To
train a cFL model, there are two requirements in addition to those
for BL training: (1) specification of the subset of transfer functions
to choose from with the GA and (2) specification of reduction
thresholds to use after model training (not needed for this toy
example, so we set the parameter to be an empty vector). In our
case, we specify seven possible transfer functions for each interac-
tion (Table 3). These transfer functions include one linear transfer

[Fig. 10. Panel (a): the single network structure returned by the cFL training runs. Panel (b): fit of the trained cFL models to the data for each measured species under EGF, TNFa, and EGF + TNFa stimulation with no inhibitor, PI3K, or Raf inhibition.]

Fig. 10. Training extended PKN to in silico data with constrained fuzzy logic. Constrained
fuzzy logic was used to train the extended PKN (Fig. 9a) to the in silico data using CellNOpt
for ten independent optimization runs. The resultant models had identical structures (a),
albeit with slightly different transfer function parameters and were able to fit all aspects of
the data, even the species that were activated to an intermediate level (b).

function and six normalized sigmoidal functions with varying


EC50. We also specify seven different transfer functions for interac-
tions connecting ligand inputs (TNFa and EGF in this case) to
downstream protein signals. These transfer functions are constants
between zero and one and are necessarily different because the
ligand inputs are binary (either present or not) and their influence
must be weighted in order to model downstream species with
intermediate values.
After the cFL models are trained to the data, we find that the
models were able to fit the data very well, and they included the
TNFa → JNK → cJun interactions necessary to fit the partial acti-
vation of JNK and cJun upon TNFa stimulation (Fig. 10).

4.2. Application to Data from the Hepatocyte Cell Line HepG2

As our second example, we describe the use of CellNOpt to analyze
the published data first used to validate the CellNOpt methodology
(9, 26). This dataset describes the phosphorylation of 15 signaling
proteins in response to a battery of pro-growth and pro-
inflammatory cytokines. We will compare this data to a prior knowl-
edge network originally derived from the Ingenuity database and
altered as described in ref. 27 and Fig. 11a.

4.2.1. Model Preprocessing

As previously mentioned, CellNOpt first processes the PKN struc-
ture before fitting it to the data. In this case, CellNOpt compresses
the PKN to result in a model containing the stimulated, perturbed,
and measured species as well as a few intermediate nodes necessary
to preserve logical consistency (Fig. 11b). CellNOpt then expands
the model to contain AND gates. In our case, we have specified that
the model should expand all inputoutput relationships into AND
gates if possible, but to limit the inputs into an AND gate to two by
setting the IdentGates parameter to All and the MaxInput-
sPerGate parameter to 2 (see Table 3). This limited expansion
allows us to capture some logical complexity without overly com-
plicating the optimization process. The resulting logic superstruc-
ture to be trained is shown in Fig. 11b.

4.2.2. Model Training and Analysis

After training this superstructure to the data with a modest penalty
for model size (1 × 10⁻⁴) 100 times, we obtained a family of
solutions, some of which fit the data better than others
(Fig. 12a). These models were sorted according to how well they
fit the data, with a subset of best fitting models chosen for further
analysis ((26) describes one way to define such a subset). In this
case, there were two distinct populations of models: those with
scores greater than 0.06 and those with scores below 0.06. We thus
chose the subset of 41 models with scores below 0.06. We used the CellNOpt function CNO-
CompileMultiBoolSol to view the trained models directly
(Fig. 12b) and map the trained models to the original model to
see what pathways were kept by the optimization process
(Fig. 12c).
From the trained BL models, one can immediately deduce that
no AND gate was present in all models, indicating that they were
not necessary to fit the data. Furthermore, the models do not allow
TNFa or IL6 to activate downstream pathways through PI3K and
most interactions from a growth signal to an inflammatory signal
were removed during the training process (e.g., Akt → Cot →
IkK; PI3K → Rac → Map3k1; and Ras → Map3k1). Additionally,
by plotting the fit of these models to the original data (Fig. 12d),
one finds that most of the data was fit very well with the exception
of partially activated signals downstream of IL6 and TNFa as well as
partial activation of cJun after stimulation with TNFa and TGFa. As
in the previous case, CellNOpt can be run with cFL in order to fit
the partially activated data values (27).
[Fig. 11. Panel (a), starting model: the 82-node PKN, spanning receptors for il1a, lps, tnfa, igf1, tgfa, il6, and ifng down to readouts including hsp27, atf2, cfos, nfkbn, histh3, atf1, creb, p53, stat1n, p70s6, irs1s, stat3n, gsk3, casp9, and stat13. Panel (b), processed model: the compressed and expanded logic superstructure, with stimuli tgfa, igf1, il6, il1a, lps, and tnfa feeding ras, pi3k, and traf6 and, downstream, akt, mek12, map3k1, map3k7, ikk, mkk4, jnk12, p38, mtor, p90rsk, msk12, and the measured species gsk3, p70s6, irs1s, p53, creb, histh3, ikb, stat3, cjun, and hsp27.]
Fig. 11. Preprocessing the PKN to train to HepG2 data. (a) PKN consisting of 82 nodes and 115 interactions trained to HepG2
data. The data consisted of measurements of the phosphorylation of the 15 species depicted in light grey ellipses after
30 min of stimulation with one of the six species in light grey rectangles, in the presence or absence of inhibition of one
of the seven species in dark grey rectangles. (b) During the preprocessing step, the PKN was compressed to include only
designated species and undesignated species necessary to ensure preservation of logical consistency. All possible two-
input AND gates were then added in the expansion step, resulting in a logic superstructure with 30 nodes and 97
hyperedges.
[Fig. 12. Panel (a): histogram of scores from 100 optimization runs, with a better-fitting subset of models visible below a score of about 0.06. Panel (b): average structure of the trained models over the processed superstructure. Panel (c): average structure overlaid on the original 82-node PKN. Panel (d): fit of the trained models to the data for each measured species (AKT, GSK3, Ikb, JNK12, p38, p70S6, p90RSK, STAT3, cJUN, CREB, HistH3, HSP27, IRS1s, MEK12, p53) under no stimulation, TNFa, TGFa, LPS, IL6, IL1a, and IGF1, each combined with no inhibitor or inhibition of mTOR, MEK12, JNK12, GSK3, PI3K, p38, or IKK.]

Fig. 12. Training the PKN to HepG2 data with Boolean logic. (a) CellNOpt was used to train the PKN to the data 100 times, and
we readily identified that 41 of the optimization runs resulted in scores slightly better than the others. We focus the
remainder of our analysis on this subset, but note here that the other subset of models did not fit activation of IkB under
TNFa stimulation, resulting in smaller models that did not fit the data as well. (b) The average structure of the trained
models, where the darkness of an interaction indicates its frequency in the average structure. (c) The overlay of the average
structure onto the original model gives the user an idea of which pathways in the original PKN were kept and which were removed
during training (AND gates are not explicitly shown). (d) The fit to data can be plotted to visualize systematic error, where a
darker background indicates more error and a distorted background indicates the data was unreliable.

5. Notes

1. Most resources describe signaling networks as graphs where
nodes are proteins and edges the interactions between them,
but if more detailed knowledge (e.g., specific protein-sites) or
other types of data are available (e.g., release of cytokines
downstream of signaling pathway (27)), they can be easily
accommodated simply by naming the nodes of the model
according to the level of detail in the data. Similarly, edges
among nodes can describe a direct molecular interaction (e.g.,
phosphorylation of one protein by another), or indirect effects
(protein A leads to the activation of protein B). The flexibility
to accommodate indirect or lumped effects allows one to
include prior knowledge that is incomplete in that not all the
intermediate molecular events are known.
2. CellNOpt uses the format described in ref. 30 to specify a
model. In this format, an AND gate is specified explicitly by a
hyperedge which is graphically depicted by connecting input
species to a small dot and connecting the dot to the output. An
OR gate is specified implicitly by the connection of more than
one input to an output species.
3. During data import into DataRail, although DataRail can
technically operate with each treatment type in any dimension,
in the current implementation one should use the canonical
format for operation with CellNOpt. In the canonical format,
the cell type is entered as Dim 1, the stimuli as Dim 2, and
the inhibitors as Dim 3 in the MIDAS importer (Fig. 3c).
Even if only one cell type is used, it must be included in Dim 1.
4. Data normalization: For both the Boolean and cFL cases,
because species states are modeled between zero and one, the
data should be scaled between zero and one prior to model
training. A normalization routine we developed to scale data
(26) is implemented in DataRail (54). This normalization
routine is based on the relative increase in signal value at the
time of measurement over the signal value before treatment.
The models reflect this type of data processing by assuming that
the value of all nodes before stimulation is zero, and stimulation
is modeled as an increase in value of upstream species that result
in activation or deactivation of downstream species.
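
A minimal Python sketch may help fix ideas about this kind of scaling; it is not the DataRail routine of (26), and the function name, the Hill-type squashing transform, and its default constants are illustrative assumptions only:

    import numpy as np

    def scale_relative_increase(x_t, x_0, k=0.5, n=2.0):
        # Illustrative scaling of a measured signal into [0, 1) based on
        # the relative increase of the value at measurement time (x_t)
        # over the value before treatment (x_0).
        # k, n: half-saturation point and steepness of a Hill-type
        # squashing function (arbitrary illustrative defaults).
        x_t = np.asarray(x_t, dtype=float)
        x_0 = np.asarray(x_0, dtype=float)
        rel = np.maximum(x_t - x_0, 0.0) / np.maximum(x_0, 1e-9)
        return rel**n / (rel**n + k**n)

With such a transform, signals that do not rise above their pre-treatment value map to 0, consistent with the modeling assumption that all nodes start at zero, and large relative increases approach 1.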
5. Prior to running CellNOpt, one should check that the names in
the PKN and data are consistent. This is very important, as
most user errors occur because of inconsistent naming. Fig-
ure 13 demonstrates one way to check for consistency between
the PKN and CNOProject.

1. Run Master Script from the command line with the Optimizer parameter set to
none and SaveResults parameter to true.

2. Load the .mat result file into the MATLAB workspace.

3. Enter the following commands (graphviz is required):


filename = 'user_specified_string';
CNOVisualizeNetwork(Res.CNOProject.Model, Res.CNOProject, ...
    strcat(filename, '_ModelToTrain'));
CNOVisualizeNetwork(Res.CNOProjectStart.Model, Res.CNOProjectStart, ...
    strcat(filename, '_InitialModelAndProject'));
4. Open the .pdf or .dot files that were created.

Fig. 13. Code to check CNOProject and PKN prior to training. To check that the names are
consistent in the PKN and CNOProject, after running the commands above, the file tagged
with InitialModelAndProject is the initial model compared to the initial CNOProject.
Species that are green are stimuli, orange are inhibited, and blue are measured. Check
that all species you meant to be included in the model are actually labeled as such. This is
a very good procedure for catching naming inconsistencies. Additionally, you will see
undesignated species that are compressed (dashed outline in black) or not compressed
(solid outline in black). The file tagged with ModelToTrain is the fully processed model.
You will see the results of model compression and all of the added AND gates. This model
can also be used to check for naming inconsistencies and see the additional complexity
added by the chosen expansion procedure.

6. Because the GA is initialized randomly, models that are obtained
with different runs of the optimization procedure are indepen-
dent from each other. One way to determine the number of
dent from each other. One way to determine the number of
independent optimization procedures to complete is to com-
pare the average structure of the runs (average of presence or
absence of each interaction in the trained models). If complet-
ing more GA optimization runs does not lead to a significant
change in the average structure, the available solutions ade-
quately represent the population.
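
As a minimal sketch of this convergence check (plain Python, independent of CellNOpt; all names here are ours), each trained model can be encoded as a 0/1 vector over the hyperedges of the superstructure, and the average structures compared with and without the latest batch of runs:

    import numpy as np

    def average_structure(models):
        # models: array-like of shape (n_runs, n_hyperedges) with 0/1
        # entries marking which hyperedges each trained model kept.
        return np.mean(np.asarray(models, dtype=float), axis=0)

    def runs_sufficient(previous_runs, new_runs, tol=0.02):
        # Heuristic stopping rule: a new batch of GA runs barely changes
        # the average presence frequency of any hyperedge.
        before = average_structure(previous_runs)
        after = average_structure(list(previous_runs) + list(new_runs))
        return float(np.max(np.abs(after - before))) < tol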
7. For BL model training, CellNOpt can automatically generate a
report of one optimization run containing the most important
parameters used, the resulting model visualized with graphviz,
the fit of data, etc. These files can be generated by setting the
CreateReport parameter to true. The report, along with
multiple results files, is compiled in a subfolder within
the Results folder. This functionality requires installation of
graphviz and latex and has not yet been extended for the cFL
case, nor for the evaluation of the results of multiple runs.
Figure 6 describes the process for analyzing the results of
multiple optimization runs with BL or cFL.
8. CellNOpt uses graphviz layout abilities to generate automatic
layouts (all network figures shown here have been constructed
with this functionality). This functionality is used to analyze
either one BL result with the CreateReport option or multi-
ple BL or cFL results as described in Fig. 6. In the case of the
Cytoscape output (sif file, Table 1, and EA file with
weights of each interaction generated automatically with Cre-
ateReport or manually by calling the CNOExportCytoscape
function), the user can then use the multiple layout and visuali-
zation routines of Cytoscape to customize the figure.
9. Determining which edges to add to a PKN to address system-
atic error can be done manually through literature searches as
with our toy example or more systematically in the Boolean case
by adding many edges to the PKN, retraining, and determining
which resulted in decreased systematic error (26).
10. Sum/Product operators can be used instead of Max/Min to
evaluate OR and AND gates. For Boolean models in Cell-
NOpt, the two descriptions are identical. For cFL, they are not.
To use Sum/Product with cFL, set the Simulator parame-
ter to @CNOfuzzySimEngv23_Sum and the tfun parameter to
@CNOhillNonNorm_Prod.
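
A small sketch makes the difference concrete (the bounded sum used here for OR is one common fuzzy-logic choice and an assumption on our part; the exact operators in the CellNOpt simulator may differ):

    def and_gate(a, b, mode="minmax"):
        # AND gate under the two conventions mentioned in this note.
        return min(a, b) if mode == "minmax" else a * b

    def or_gate(a, b, mode="minmax"):
        # OR gate: max, or a sum bounded to stay within [0, 1].
        return max(a, b) if mode == "minmax" else min(a + b, 1.0)

    # For Boolean inputs (0 or 1) the two conventions coincide:
    assert all(and_gate(a, b, "minmax") == and_gate(a, b, "sumprod")
               for a in (0, 1) for b in (0, 1))
    # For intermediate cFL values they differ:
    print(and_gate(0.5, 0.5), and_gate(0.5, 0.5, "sumprod"))  # 0.5 0.25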
11. Future development: we are working on integrating CellNOpt
with other approaches such as pharmacogenomics (22, 58, 59)
and drug-screening studies based on gene expression (60–62)
to provide a comprehensive view of drug effects. We are also
currently investigating the integration of logic-based ODEs
(40) into CellNOpt to investigate the dynamics of signaling.

Acknowledgments

We thank Peter K. Sorger and Douglas A. Lauffenburger for guidance
during the development of CellNOpt. The original development
of CellNOpt was funded by NIH grants P50-GM68762 and
U54-CA112967 and the DoD Institute for Collaborative Biotechnologies
(http://www.icb.ucsb.edu/). We also thank Beatriz
Penalver, Leonidas G. Alexopoulos, Regina Samaga, Jonathan
Epperlein, and Steffen Klamt for their contributions to CellNOpt, and
Camille Terfve, David Henriques, Aidan MacNamara, and Francesco
Iorio for critically reading the manuscript.

References

1. Jorgensen C, Linding R (2010) Simplistic pathways or complex networks? Curr Opin Genet Dev 20:15–22
2. Gaudet S, Janes KA, Albeck JG, Pace EA, Lauffenburger DA, Sorger PK (2005) A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol Cell Proteomics 4:1569–1590
3. Kestler HA, Wawra C, Kracher B, Kuhl M (2008) Network modeling of signal transduction: establishing the global view. Bioessays 30:1110–1125
4. Heinrichs A, Kritikou E, Pulverer B, Raftopoulou M (eds) (2006) Systems biology: a user's guide. Nature Publishing Group, New York, NY
5. Boutros PC, Okey AB (2005) Unsupervised pattern recognition: an introduction to the whys and wherefores of clustering microarray data. Brief Bioinform 6:331–343
6. D'Haeseleer P (2005) How does gene expression clustering work? Nat Biotechnol 23:1499–1501
7. Janes KA, Albeck JG, Gaudet S, Sorger P, Lauffenburger DA, Yaffe MB (2005) A systems model of signaling identifies a molecular basis set for cytokine-induced apoptosis. Science 310:1646–1653
8. Miller-Jensen K, Janes K, Brugge J, Lauffenburger D (2007) Common effector processing mediates cell-specific responses to stimuli. Nature 448:604–608
9. Alexopoulos L, Saez-Rodriguez J, Cosgrove B, Lauffenburger D, Sorger P (2010) Networks inferred from biochemical data reveal profound differences in Toll-like receptor and inflammatory signaling between normal and transformed hepatocytes. Mol Cell Proteomics 9:1849–1865
10. Markowetz F, Spang R (2007) Inferring cellular networks: a review. BMC Bioinform 8(Suppl 6):S5
11. Bansal M, Belcastro V, Ambesi-Impiombato A, di Bernardo D (2007) How to infer gene networks from expression profiles. Mol Syst Biol 3:78
12. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G (2010) Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 5:e9202
13. Sachs K, Perez O, Pe'er D, Lauffenburger DA, Nolan GP (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308:523–529
14. Jaqaman K, Danuser G (2006) Linking data to models: data regression. Nat Rev Mol Cell Biol 7:813–819
15. Chen WW, Schoeberl B, Jasper PJ, Niepel M, Nielsen UB, Lauffenburger DA, Sorger PK (2009) Input–output behavior of ErbB signaling pathways as revealed by a mass action model trained against dynamic data. Mol Syst Biol 5:239
16. Becker V, Schilling M, Bachmann J, Baumann U, Raue A, Maiwald T, Timmer J, Klingmuller U (2010) Covering a broad dynamic range: information processing at the erythropoietin receptor. Science 328:1404–1408
17. Rangamani P, Iyengar R (2008) Modelling cellular signalling systems. Essays Biochem 45:83–94
18. Calzone L, Tournier L, Fourquet S, Thieffry D, Zhivotovsky B, Barillot E, Zinovyev A (2010) Mathematical modelling of cell-fate decision in response to death receptor engagement. PLoS Comput Biol 6:e1000702
19. Pandey S, Wang RS, Wilson L, Li S, Zhao Z, Gookin TE, Assmann SM, Albert R (2010) Boolean modeling of transcriptome data reveals novel modes of heterotrimeric G-protein action. Mol Syst Biol 6:372
20. Samaga R, Von Kamp A, Klamt S (2010) Computing combinatorial intervention strategies and failure modes in signaling networks. J Comput Biol 17:39–53
21. Morris MK, Saez-Rodriguez J, Sorger PK, Lauffenburger DA (2010) Logic-based models for the analysis of cell signaling networks. Biochemistry 49:3216–3224
22. Berger SI, Iyengar R (2009) Network analyses in systems pharmacology. Bioinformatics 25:2466–2472
23. Mitsos A, Melas IN, Siminelakis P, Chairakaki AD, Saez-Rodriguez J, Alexopoulos LG (2009) Identifying drug effects via pathway alterations using an Integer Linear Programming optimization formulation on phosphoproteomic data. PLoS Comput Biol 5:e1000591
24. Saez-Rodriguez J, Alexopoulos L, Zhang MS, Morris MK, Lauffenburger DA, Sorger PK (2011) Comparative logical models of signaling networks in normal and transformed hepatocytes. Cancer Res 71:1
25. Cosgrove B, Alexopoulos L, Hang T, Hendriks B, Sorger P, Griffith L, Lauffenburger D (2010) Cytokine-associated drug toxicity in human hepatocytes is associated with signaling network dysregulation. Mol Biosyst 6:1195–1206
26. Saez-Rodriguez J, Alexopoulos LG, Epperlein J, Samaga R, Lauffenburger DA, Klamt S, Sorger PK (2009) Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol 5:331
27. Morris MK, Saez-Rodriguez J, Clarke DC, Sorger PK, Lauffenburger DA (2011) Training signaling pathway maps to biochemical data with constrained fuzzy logic: quantitative analysis of liver cell responses to inflammatory stimuli. PLoS Comput Biol 7(3):e1001099
28. Alves R, Antunes F, Salvador A (2006) Tools for kinetic modeling of biochemical networks. Nat Biotechnol 24:667–672
29. Maly IV (ed) (2009) Systems biology, vol 500. Humana, New York, NY
30. Klamt S, Saez-Rodriguez J, Lindquist J, Simeoni L, Gilles ED (2006) A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinform 7:56
31. Klamt S, Saez-Rodriguez J, Gilles ED (2007) Structural and functional analysis of cellular networks with CellNetAnalyzer. BMC Syst Biol 1:2
32. Albert I, Thakar J, Li S, Zhang R, Albert R (2008) Boolean network simulations for life scientists. Source Code Biol Med 3:16
33. Chaouiya C, Remy E, Mosse B, Thieffry D (2003) Qualitative analysis of regulatory graphs: a computational tool based on a discrete formal framework. In: Benvenuti L, De Santis A, Farina L (eds) Lecture notes in control and information sciences, positive systems. Springer, Berlin, pp 119–126
34. Gonzalez AG, Naldi A, Sanchez L, Thieffry D, Chaouiya C (2006) GINsim: a software suite for the qualitative modelling, simulation, and analysis of regulatory networks. BioSystems 84:91–100
35. de Jong H, Geiselmann J, Hernandez C, Page M (2003) Genetic Network Analyzer: qualitative simulation of genetic regulatory networks. Bioinformatics 19:336–344
36. Helikar T, Rogers JA (2009) ChemChains: a platform for simulation and analysis of biochemical networks aimed to laboratory scientists. BMC Syst Biol 3:58
37. Ulitsky I, Gat-Viks I, Shamir R (2008) MetaReg: a platform for modeling, analysis, and visualization of biological systems using large-scale experimental data. Genome Biol 9:R1
38. Mendoza L, Xenarios I (2006) A method for the generation of standardized qualitative dynamical systems of regulatory networks. Theor Biol Med Model 3:13
39. Di Cara A, Garg A, De Micheli G, Xenarios I, Mendoza L (2007) Dynamic simulation of regulatory networks using SQUAD. BMC Bioinform 8:462
40. Wittmann D, Krumsiek J, Saez-Rodriguez J, Lauffenburger DA, Klamt S, Theis F (2009) From qualitative to quantitative modeling. BMC Syst Biol 3:98
41. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22:437–467
42. Zadeh LA (1965) Fuzzy sets. Inform Control 8:338–353
43. Tong R (1977) A control engineering review of fuzzy systems. Automatica 13:559–569
44. Aldridge B, Saez-Rodriguez J, Muhlich J, Sorger P, Lauffenburger DA (2009) Fuzzy logic analysis of kinase pathway crosstalk in TNF/EGF/Insulin-induced signaling. PLoS Comput Biol 5:e1000340
45. Chen W, Niepel M, Sorger P (2010) Classic and contemporary approaches to modeling biochemical reactions. Genes Dev 24:1861–1875
46. Kremling A, Saez-Rodriguez J (2007) Systems biology: an engineering perspective. J Biotechnol 129:329–351
47. Penny W, Stephan K, Daunizeau J, Rosa M, Friston K, Schofield T, Leff A (2010) Comparing families of dynamic causal models. PLoS Comput Biol 6:e1000709
48. Bauer-Mehren A, Furlong L, Sanz F (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Syst Biol 5:290
49. Lachmann A, Ma'ayan A (2010) Lists2Networks: integrated analysis of gene/protein lists. BMC Bioinform 11:87
50. Laakso M, Hautaniemi S (2010) Integrative platform to translate gene sets to networks. Bioinformatics 26:1802
51. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504
52. Saez-Rodriguez J, Mirschel S, Hemenway R, Klamt S, Gilles ED, Ginkel M (2006) Visual setup of logical models of signaling and regulatory networks with ProMoT. BMC Bioinform 7:506
53. Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D'Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinath G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Reubenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Whirl-Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovsky S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novere N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD (2010) The BioPAX community standard for pathway data sharing. Nat Biotechnol 28:935–942
54. Saez-Rodriguez J, Goldsipe A, Muhlich J, Alexopoulos L, Millard B, Lauffenburger D, Sorger P (2008) Flexible informatics for linking experimental data to mathematical models via DataRail. Bioinformatics 24:840–847
55. Bian D, Su S, Mahanivong C, Cheng RK, Han Q, Pan ZK, Sun P, Huang S (2004) Lysophosphatidic acid stimulates ovarian cancer cell migration via a Ras-MEK Kinase 1 pathway. Cancer Res 64:4209–4217
56. Sander EE, van Delft S, ten Klooster JP, Reid T, van der Kammen RA, Michiels F, Collard JG (1998) Matrix-dependent Tiam1/Rac signaling in epithelial cells promotes either cell-cell adhesion or cell migration and is regulated by phosphatidylinositol 3-kinase. J Cell Biol 143:1385–1398
57. Fanger GR, Johnson NL, Johnson GL (1997) MEK kinases are regulated by EGF and selectively interact with Rac/Cdc42. EMBO J 16:4961–4972
58. McDermott U, Sharma SV, Dowell L, Greninger P, Montagut C, Lamb J, Archibald H, Raudales R, Tam A, Lee D, Rothenberg SM, Supko JG, Sordella R, Ulkus LE, Iafrate AJ, Maheswaran S, Njauw CN, Tsao H, Drew L, Hanke JH, Ma XJ, Erlander MG, Gray NS, Haber DA, Settleman J (2007) Identification of genotype-correlated sensitivity to selective kinase inhibitors by using high-throughput tumor cell line profiling. Proc Natl Acad Sci U S A 104:19936–19941
59. Mori S, Chang JT, Andrechek ER, Potti A, Nevins JR (2009) Utilization of genomic signatures to identify phenotype-specific drugs. PLoS One 4:e6772
60. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, Murino L, Tagliaferri R, Brunetti-Pierri N, Isacchi A, di Bernardo D (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. Proc Natl Acad Sci U S A 107:14621–14626
61. Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN, Reich M, Hieronymus H, Wei G, Armstrong SA, Haggarty SJ, Clemons PA, Wei R, Carr SA, Lander ES, Golub TR (2006) The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935
62. Iskar M, Campillos M, Kuhn M, Jensen LJ, van Noort V, Bork P (2010) Drug-induced regulation of target expression. PLoS Comput Biol 6:e1000925
Chapter 9

Regulatory Networks
Gilles Bernot, Jean-Paul Comet, and Christine Risso-de Faverney

Abstract
The usefulness of mathematical models of biological regulatory networks relies on the predictive
capability of the models to suggest interesting hypotheses and suitable biological experiments.
All mathematical frameworks dedicated to biological regulatory networks must manage a large number of
abstract parameters, which are not directly measurable in the cell. The cornerstone of establishing predictive
models is the identification of the possible parameter values. Formal frameworks involve qualitative models
with discrete values and computer-aided logic reasoning. They can provide biologists with an automatic
identification of the parameters via a formalization of some biological knowledge into temporal logic
formulas. For pedagogical reasons, we focus on gene regulatory networks and develop a qualitative
model of the detoxification of benzo[a]pyrene in human cells to illustrate the approach.

Key words: Biological regulatory networks, Gene regulatory networks, Mathematical modeling,
Systems biology, Temporal logic, Model checking, Benzo[a]pyrene, Detoxification pathway, CYP,
Metabolizing enzymes

1. Introduction

Almost all the difficult questions that involve biological systems
with several interacting entities need mathematical models and
computer-aided reasoning in order to predict the global behavior
of the system, or to establish some characteristics of the dynamics of
the system. Domain-oriented formal frameworks are then required
in order to efficiently design the mathematical models, to perform
low-cost simulations, and to extract relevant predictions from the
models. The choice of the best suited formal framework is guided
by the biological question(s) under consideration. For instance if
the position of the entities in a two- or three-dimensional space is
important, as well as their trajectories, or the form of some com-
partments, diffusion speeds, etc., then frameworks such as cellular
automata (1), multi-agent systems (2), discrete geometry (3), and
so on may be suited. If, on the contrary, it is relevant to ignore the


3D arrangement of the biological objects under consideration in
favor of their quantities and the evolution of these quantities along
time, then frameworks dedicated to biological regulatory networks
are relevant.
For about a decade, formal frameworks for regulatory
networks have played a central role in integrative biology and in systems
biology, and they are one of the main scientific roots of
the new era of synthetic biology. Based on the simple idea that
toxicology often needs to predict the behavior of complex
biological systems, and that it consequently has a large intersection
with systems biology in general, the formal frameworks for regu-
latory networks and their associated computer tools (which are now
commonly used in systems biology laboratories) constitute an
interesting part of computational toxicology. We show in this chap-
ter how they can be used at the cellular and intracellular levels of
description but regulatory network models can also be relevant at
higher levels of description. A good model can predict for instance
the detoxification capabilities of certain pathways, and it can be
used to point out potentially dangerous configurations such as
DNA damage. Nevertheless, one should always keep in mind that
a mathematical model never proves the safety of a configuration or a
molecule: in vivo experiments constitute the only biological proof
of a biological property. A formal model can only suggest the most
promising solutions; it never establishes certified solutions. The
predictive capabilities of models can also be used to suggest inter-
esting biological experiments. They can also point out less interest-
ing experiments that may be redundant for reasons far from
obvious to human reasoning.
The predictive capability of a mathematical model is essentially
based on a good choice of the parameters that drive the dynamics
(the semantics) of the model. Contrary to most classical
models in the sciences such as physics, chemistry, or computer science,
even a very simple biological system involves a very intricate net-
work of interactions, and a small change in the relative strengths of
these interactions can deeply modify the behavior of the system
under consideration. Consequently, a formal model for biology
contains a large number of parameters controlling these interactions
and the problem of the modeling activity is to find and firmly
establish all the possible values of those parameters. This question
is known as the parameter identification problem and is particularly
difficult in biology because the experiments that could establish the
parameter values require indirect reasoning: a direct measurement of a
parameter value in vivo is rarely possible.
Formal logic and formal methods from computer science have
proved to be very efficient to assist the identification of parameters:
there are well-established computer algorithms that can perform
formal reasoning in silico, and this often leads to clever conclusions,
far from obvious at first glance. Formal methods constitute
also a powerful approach to abstract (to simplify) the mathematical
models in such a way that only the relevant features to answer the
considered biological question(s) are retained in the models. We
will see how the so-called discrete frameworks can realize qualita-
tive models dedicated to the sensible questions.
There are different kinds of biological regulatory networks and
they give rise to different mathematical frameworks (signaling net-
works (4), metabolic networks (5), gene networks (6), . . .). In this
chapter, we focus on gene regulatory networks; the overall model-
ing method (drawing of the interaction network as a graph, identi-
fication of the parameters, predictions and feedback to
experiments) does not differ notably for the other kinds of regu-
latory network, only the underlying mathematical frameworks dif-
fer. Moreover, for pedagogical reasons, we will use a simple
biological running example in order to give a good intuitive idea
of the formal modeling approach. Subheading 2 explains the
biological aspects of our example; Subheading 3 explains in detail
how gene regulatory networks can be formally modeled; Subhead-
ing 4 discusses some results obtained for our running example; and
Subheading 5 shows how formal logic and in silico reasonings
provide a systematic way to identify parameters, and consequently
predict the possible behaviors of the considered gene network.

2. Example: Detoxification Induced by Benzo[a]pyrene Exposure

Benzo[a]pyrene (BaP) is an environmental carcinogenic polycyclic
aromatic hydrocarbon (PAH) that is formed through incomplete
combustion of organic materials, and common sources are tobacco
smoke, automobile exhaust, and food (7, 8). BaP toxicity is largely
mediated through binding to the aryl hydrocarbon receptor (AhR),
a ligand-activated transcription factor, found in vertebrate species
from fish to humans (9).
The unliganded AhR is maintained in the cytoplasm in association
with a chaperone complex (Hsp90/XAP2/p23).
Depending upon BaP binding and concentration, the activated
AhR sheds the chaperone proteins and translocates into the nucleus,
where it forms a heterodimer with AhR nuclear translocator
(ARNT) already present in the nucleus (10). This complex recog-
nizes an enhancer DNA element, known as the aryl hydrocarbon
response element (AHRE), also called the xenobiotic response element
(XRE) or dioxin response element (DRE), in the promoter
region of target genes collectively known as the AhR gene
battery, which results in their transcriptional activation (11–13).
The AhR gene battery includes cytochrome P450 (e.g., Cyp 1
family) as well as non-P450 genes (e.g., a glutathione S-transferase
(Gsta1), a UDP glucuronosyltransferase (Ugt1a6), . . .) that are
Fig. 1. Detoxification pathway 1 (dp1).

coordinately induced by AhR-ligands such as BaP and encode,
respectively, phase I and II xenobiotic metabolizing enzymes
involved in the detoxification of BaP (14–16).
The coordinate regulation of phase I and phase II metabolizing
enzymes facilitates AhR-mediated detoxification (dp1 = detoxifi-
cation pathway 1) and is necessary for cellular protection against
BaP (Fig. 1). The oxidative metabolism of BaP catalyzed by cyto-
chrome P450 enzymes (e.g., CYP1A1, phase I enzymes) leads to
the formation of reactive and electrophilic BaP metabolites (BM)
that can be inactivated by phase II enzyme-catalyzed conjugation
reactions (17). Phase II reactions can aid in formation of water-
soluble metabolites that are easily excreted from the organism,
thereby reducing exposure to BaP (18, 19).
Although metabolism of BaP by CYP1A1 is important for
detoxification, the process can lead to the formation of reactive
intermediates, both reactive BaP metabolites and reactive oxyge-
nated species (ROS), that cause an oxidative stress signal (12).
The non-P450 AhR battery genes, which are transcriptionally
activated by AhR-ligand via the AHRE, are upregulated by oxida-
tive stress via antioxidant response element (ARE) (12, 16). The
ARE is a cis-acting sequence located in the promoter region of
Fig. 2. Detoxification pathway 2 (dp2).

target genes, which encode enzymes essential in protection against
oxidative stress.
Linkage between AHRE- and ARE-controlled genes strength-
ens coupling between phase I and II enzymes, and attenuates
oxidative stress due to AhR-controlled CYP1A1 induction (dp2 =
detoxification pathway 2) (20) (Fig. 2).

3. Method: Thomas Framework
3.1. Mathematical Models of Regulatory Networks

In order to design a predictive mathematical model for regulatory
networks, one has to collect two kinds of biological knowledge:
1. The sensible set of biological objects that are supposed to drive
the biological system (or the biological phenomenon) under
consideration, and the mutual influences between these
objects; this knowledge will constitute the structure of the
model.
2. A sufficiently precise evaluation of the strength of each influ-
ence between objects, under any relevant situation; this knowl-
edge, once mathematically translated into suitable parameters,
will establish the dynamics of the model.
The ideal situation from the mathematical point of view would
be when all details about the system under consideration have
been biologically elucidated, providing a unique possible structure
with known parameter values and leading to a unique model exhi-
biting a completely defined behavior. In practice the situation is far
from ideal, a majority of parameters are unknown, and even the
structural part may be subject to different possible versions.
Consequently, one has to consider a set of potential models (possi-
bly infinite), which can exhibit different possible behaviors. This
uncertainty does not imply that the modeling activity cannot be
predictive because, even under partial knowledge, all the potential
models can exhibit certain common behaviors under certain
conditions. The price to pay is that we have to manage a huge
number of unknown parameters and possible configurations:
here, computers and computer science become a cornerstone for
regulatory networks.
There are several mathematical frameworks to model regu-
latory networks and they can be classified according to the way
they handle dynamics (21):
● Probabilistic or stochastic frameworks consider that the state of
the regulatory network is defined by the number of molecules
of each sort in the biological system (the considered biomole-
cules can be for example RNA or proteins in order to define the
state of the gene that codes for them). The possible evolu-
tions of the system are then driven by the probability for each
considered object to produce new molecules, taking also into
account the probabilities of degradation; see the seminal work
of Gillespie (22). All these probabilities constitute the para-
meters of the model. Unfortunately the probabilistic models
are often too detailed to facilitate predictions, even with the
help of computers, because they require a huge number of
nondeterministic simulations in silico, so the precise evaluation
of the parameters is incredibly time consuming.
● Continuous frameworks approximate the number of molecules
by a concentration level for each considered object (23). Con-
centrations are positive real numbers, so it becomes possible to
consider the derivative of the concentrations with respect to
time, and the dynamics are then modeled by a system of differ-
ential equations with parameters. This kind of approximation,
that smooths the concentration levels, is of course only valid for
large numbers of molecules of all sorts. A drawback of this
approach is that all the trajectories become deterministic but
the advantage is that simulations are less costly and consequently
it is easier to identify the possible values of the parameters.
● Discrete or qualitative frameworks can be seen as an opposite
approximation where the concentration of molecules is discontin-
uous and is roughly counted for each considered object (e.g.,
"low," "medium," "high"), with of course suitable thresholds.
There are as many parameters as for the two previous kinds of
framework and consequently the richness of possible qualitatively
different behaviors is the same, but there are fewer possible values
for the parameters. So, computer science with the help of formal
logic becomes very efficient to identify the parameters and to
extract predictions. We will explain in details the discrete approach,
focusing on the approach defined for gene regulatory networks by
René Thomas in the 1970s (6), which is formalized in (24).
● Lastly, hybrid frameworks try to take advantage of the qualitative
approach, whilst preserving some continuous or stochastic
aspects inside each discrete state of the network. Hybrid frameworks
currently constitute a very active research area in theoretical
biology.

3.2. Structure: Regulatory Graphs

All kinds of framework represent the structure of a regulatory
network as a directed graph:
● The considered objects (such as genes, relevant external conditions
that can vary, or some technical observation points) are
represented as nodes of the graph. These nodes are called
variables because a level, which can vary, will be attached to
them (e.g., a concentration level or an expression level).
● The possible actions from one object to another object (such as
activations or inhibitions) are represented as directed edges of
the graph, from a source node to a target node.
● Some actions can require several source nodes (e.g., when a
complex of molecules is needed to act on the target) and they
can also have several targets; in such cases we often add "virtual"
nodes in the graph that make explicit the cooperation
between source nodes. We call these nodes multiplexes.
Let us consider for example the graph of Fig. 3. It provides a
simplified view of the BaP regulatory network described in Sub-
heading 2.
This regulatory graph contains three variables, which are conventionally
surrounded by circles.
● BaP represents the quantity of BaP present in the cell.
● CYP represents the product of the CYP1A genes, i.e., the
cytochrome P450 concentration level.
● BM represents the quantity of BaP metabolites in the cell, i.e.,
the capability to start an oxidative stress.

[Fig. 3 appears here: the regulatory graph with variables BaP, CYP, and BM, multiplexes dp1 (formula: CYP & not(BM)) and dp2 (formula: CYP & BaP), and threshold-labeled edges.]

Fig. 3. The simplified benzo[a]pyrene regulatory graph. Detoxification pathway 1 is the
cycle BaP-CYP-dp1-BaP on the left and detoxification pathway 2 is made of two cycles on
the right: BaP-CYP-dp2-BM-BaP and BaP-dp2-BM-BaP. DNA damage is not considered
here.

The regulatory graph contains two multiplexes, which are conventionally
drawn as rectangles, with a first line giving the name of the
multiplex (dp1 and dp2) and a second line containing a formula.
● dp1 contains the formula CYP & not(BM), which says that CYP
must be present at a sufficiently high level (see below), and, on
the contrary, BM must not reach a certain level (see below) in
order to reduce the quantity of BaP in the cell. The multiplex
dp1 characterizes the dp1 because the low level of BM reflects
the absence of a significant oxidative stress.
● dp2 contains the formula CYP & BaP, which says that CYP and
BaP must both be present at a sufficiently high level (see
below) in order to produce BM (and start dp2).
The edges whose target node is a variable can be:
● Activations, conventionally represented with arrows of the form
source → target, such as the two edges from BaP to CYP or
the edge from dp2 to BM, or
● Inhibitions, conventionally represented with arrows of the form
source ⊣ target, such as:
– The edge from dp1 to BaP, which represents the reduction
of BaP level in the cell performed via the coordinate induction
of CYP1A and non-P450 enzymes mediated through the
AhR-ligand (BaP)/AHRE pathway, in the absence of oxidative
stress (this inhibition reflects the dp1 of Fig. 1).
– The edge from BM to BaP, which represents the reduction
of BaP level in the cell performed via the induction of
CYP1A mediated through the AhR-ligand/AHRE pathway
and also the induction of non-P450 enzymes controlled by
both the AhR-ligand/AHRE and oxidative stress/ARE pathways
(this inhibition reflects the dp2 of Fig. 2).
Lastly, the edges whose source node is a variable are labeled by a
positive integer:
● There are two edges that start from CYP, with targets dp1 and
dp2. The first one is labeled by 1 and the second one is labeled
by 2. This means that the level of CYP required to participate in
dp1 is lower than the level required to participate in dp2: the
integers represent the order of triggering when we assume
that CYP is increasing, starting from its lowest possible level.
The integer label is called the threshold of the edge; it makes
more precise (and above all, edge-dependent) the notions of
"sufficiently high level" or "certain level" used before.
● The same applies to the edges starting from BaP. The quantity
of BaP can be sufficient to activate the dp1, where CYP is
activated but the oxidative stress remains low. So, there is an
edge from BaP to CYP with threshold 1. Also, there is a higher
level of BaP that increases again the production of CYP and also
starts an oxidative stress by producing more BM. This phenomenon
is represented by the two edges from BaP to CYP and
from BaP to dp2, both with threshold 2.
● Let us remark that the only relevant threshold for BM is 1
because the edge from BM to dp1 is purely virtual as it serves
to mutually exclude dp1 and dp2 via the not(BM) subformula
of dp1.
For gene regulatory graphs, René Thomas has proposed a
systematic way to properly define the expression levels of a gene
with simple integers (6). He started from the known fact that,
considering solely the action of a source gene on a target gene,
the curve that represents the quantity of the target gene product (at
equilibrium), as a function of the quantity of the source gene prod-
uct, is a sigmoid. When the source gene activates the target gene,
the sigmoid is increasing whereas the sigmoid is decreasing if the
source inhibits the target; see for example Fig. 4.

[Fig. 4 appears here: three sigmoid curves for target genes g2, g3, and g1 plotted against the level of the source gene g, with the g axis cut into intervals numbered 0 to 3.]

Fig. 4. Example of sigmoid shapes where a source gene g activates its target genes g1
and g3 and inhibits g2. Four qualitatively different intervals appear, numbered by the
number of genes on which g is acting.

A gene g of the regulatory graph being given, it is sufficient to
consider all its target genes g1, ..., gn and their corresponding sig-
moid curves; Fig. 4 shows three target genes. Once ordered
increasingly, the inflection points cut the set of possible expression
levels of g into n + 1 intervals, from 0 to n. So, the positive integer i
labeling each outgoing edge is the number of the first interval
where g acts on gi. Sometimes, gi and gi + 1 may share the same
threshold, in which case the inflection points cut the set of possible
expression levels of g into n intervals only, from 0 to n − 1 (or less if
there are several shared thresholds).
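
In code, this interval numbering amounts to counting how many distinct thresholds a concentration has passed; a minimal Python sketch (the function name is ours, not taken from any particular tool):

    import bisect

    def discrete_level(concentration, thresholds):
        # Interval number of a source gene: how many distinct
        # inflection-point thresholds of its outgoing edges lie at or
        # below the given concentration. Shared thresholds are counted
        # once, as explained in the text.
        distinct = sorted(set(thresholds))
        return bisect.bisect_right(distinct, concentration)

    # Three targets with distinct thresholds, as in Fig. 4:
    print(discrete_level(2.5, [1.0, 2.0, 3.0]))  # interval number 2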
To summarize, a regulatory graph contains variables, multi-
plexes, activation edges on a target variable, inhibition edges on a
target variable, and edges from a source variable labeled with an
integer threshold. It constitutes the structural part of the model. A
fully formal mathematical definition of regulatory graphs with
multiplexes can be found in (25).
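
To make this structural part concrete, here is one possible Python encoding of the regulatory graph of Fig. 3 (a sketch of ours, not the input format of any existing tool): each variable receives a maximum level, and each edge or multiplex acting on a variable is represented by the condition, read off the graph, under which it acts in a given state.

    # A state is a dict such as {"BaP": 1, "CYP": 1, "BM": 0}.
    MAX_LEVEL = {"BaP": 2, "CYP": 2, "BM": 1}

    # For each variable, its potential resources (incoming edges or
    # multiplexes) and the condition under which each one currently acts.
    RESOURCES = {
        "BaP": {
            "dp1": lambda s: s["CYP"] >= 1 and s["BM"] < 1,  # CYP & not(BM)
            "BM":  lambda s: s["BM"] >= 1,                   # inhibition by BM
        },
        "CYP": {
            "BaP1": lambda s: s["BaP"] >= 1,  # left-hand edge, threshold 1
            "BaP2": lambda s: s["BaP"] >= 2,  # right-hand edge, threshold 2
        },
        "BM": {
            "dp2": lambda s: s["CYP"] >= 2 and s["BaP"] >= 2,  # CYP & BaP
        },
    }

    def acting_resources(var, state):
        # Inventory of the edges/multiplexes currently acting on var.
        return frozenset(r for r, acts in RESOURCES[var].items() if acts(state))

This inventory is exactly the subscript {e1, e2, ...} of the parameters K introduced in the next subheading.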

3.3. Dynamics: State Graphs

The dynamics of a regulatory network define how the biological
system evolves autonomously by describing the successive states
that the system shall exhibit, starting from any initial state. Following
René Thomas, within a discrete model, a state is defined by the
number of the interval (as described before) associated to each
variable. Intuitively, the state of a given variable g represents the
number of edges in the graph on which g is acting (as already
pointed out, it can be lower if there are some shared thresholds).
A current state being given, it describes for each variable the
targets on which the variable is currently acting. For example,
(BaP = 1, CYP = 1, BM = 0) is a state where, according to Fig. 3:
● BaP activates CYP via the left-hand side edge with threshold 1
but not via the right-hand side edge with threshold 2, and BaP
is not acting on dp2.
● CYP is acting on dp1 but it is not acting on dp2.
● BM is not acting on dp1 (thus not(BM) is true) and BM does
not repress BaP.
Consequently, a current state being given, one can make an
inventory of the edges that are acting on a variable. For example,
within the current state (BaP = 1, CYP = 1, BM = 0):
● BaP is repressed by dp1 because CYP & not(BM) is satisfied
(because CYP passes its threshold 1 and BM does not); BaP is
not repressed by BM because BM does not pass the threshold 1.
● CYP is activated by BaP at level 1 (left-hand side edge), but not
at level 2 (right-hand side edge).
● BM is not activated by dp2 because CYP & BaP is false (because
CYP does not pass its threshold 2, and neither does BaP).
All mathematical frameworks for gene regulatory networks
consider that this inventory decides what are all the possible futures
from the current state in the dynamics. More precisely, in the
discrete framework, we consider that the state of each variable g
of the regulatory graph tries to move toward a value Kg,{e1,e2,...}
where {e1, e2, ...} is the inventory of all the edges that currently act
on g, according to the current state. The value of the parameter
Kg,{e1,e2,...} is called the focal point of g for the current state.
Consequently, to define the dynamics of a regulatory graph,
one has to identify the values of a family of parameters of the form
Kg,{e1,e2,...} where g is any variable of the regulatory graph and {e1,
e2, ...} is any subset of the edges whose target is g. For example,
according to the regulatory graph of Fig. 3, the following parameters
should a priori be considered:
● KBaP,{}, KBaP,{dp1}, KBaP,{BM}, and KBaP,{dp1,BM}, whose possible
values range from 0 to 2.
● KCYP,{}, KCYP,{BaP1}, KCYP,{BaP2}, and KCYP,{BaP1,BaP2}, whose
possible values range from 0 to 2.
● KBM,{} and KBM,{dp2}, whose possible values are 0 or 1.
In fact, the focal point KCYP,{BaP2} is useless because it is impossible
for BaP to act on CYP via the right-hand side edge without
acting also via the left-hand side edge (if BaP = 2 then it is greater
than 1, and consequently KCYP,{BaP1,BaP2} applies). Similarly,
KBaP,{dp1,BM} is useless because if BM passes the threshold 1 to repress
BaP, then not(BM) is false, and consequently dp1 cannot repress
BaP. In other words, BaP can be either repressed via the dp1
(left-hand side of Fig. 3) or repressed via the dp2 (right-hand side
of Fig. 3), and never both.
As usual, the cornerstone of the modeling activity is the parameter
identification process: it is discussed in the next section. Let us
assume for the moment that the parameter values are known and let
us show how to deduce the state graph, which defines the dynamics. A
state being defined by an integer value for each variable of the
regulatory graph, the state space can be seen as a hyperrectangle
whose dimension is the number of considered variables. So, the
state space for our example is a three-dimensional box, with three
values (from 0 to 2) in the BaP and CYP dimensions, and two values
(0 and 1) in the BM dimension (according to the numbering of
intervals mentioned previously). Consequently, it contains 18 states.
Let us first consider the 6 states where BaP = 0 so that we will be able
to draw this subspace on a flat paper easily, and study the cell behavior
without BaP. It seems reasonable in this case to assume that CYP and
BM admit their intervals numbered 0 as focal points: KCYP,{} = 0 and
KBM,{} = 0. As BaP is required for any action on CYP or BM, these
two parameters are the only ones that are useful when BaP = 0.
● Let us consider for instance the state (CYP,BM) = (0,1). Its
focal state is the target state (KCYP,{},KBM,{}) = (0,0) and the
dynamics will simply contain a transition from the state (0,1) to
the state (0,0).

[Fig. 5 appears here: three panels over the (CYP, BM) plane (CYP from 0 to 2, BM from 0 to 1), titled "transition from state (2,0)," "transitions from state (1,1)," and "state graph (for BaP = 0)."]

Fig. 5. Construction of the state graph where BaP is fixed to 0.

● Let us consider now the state (2,0). Its focal state is still (0,0), as
shown on the left part of Fig. 5 with the long, crossed-out
arrow. Obviously a direct transition from (2,0) to (0,0) would
be biologically impossible because the CYP degradation must
cross the interval numbered 1 instead of jumping from 2 to 0.
Consequently, the dynamics convert the long arrow into a
transition of length 1, as shown with the short arrow in the
same figure.
● Lastly, let us consider the state (1,1). Its focal state is still (0,0), as
shown in the middle part of Fig. 5 with the diagonal, crossed-out
arrow. A transition from (1,1) to (0,0) would mean that both
CYP and BM cross their respective sigmoidal thresholds exactly
at the same time. In fact, one of them is likely to cross the
threshold first, depending on which one is closer to its threshold
in the real current state in vivo. Consequently, the dynamics
replace the oblique arrow by two transitions, as shown with the
two perpendicular arrows, one of them modifying the CYP state
alone and the other one modifying the BM state alone.
These two principles (the length of a transition is 1 and a
transition modifies only one variable at a time) define how to
build the state graph. The right part of Fig. 5 shows all the transitions
that stay in the BaP = 0 plane with KCYP,{} = 0 and KBM,{} = 0.
This part of the state graph shows that, in the absence of
BaP, the state (CYP = 0, BM = 0) is a stable state toward which all
states converge and that CYP and BM can decrease in any order.
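
Continuing the sketch begun at the end of Subheading 3.2, the two principles translate directly into code: compute the focal point of each variable from its acting resources, and move the variable one unit toward that value (again an illustration of the semantics, not the implementation of an existing tool):

    def successors(state, K):
        # Asynchronous successors of a state: one transition per variable
        # whose focal point K[var][acting resources] differs from its
        # current level; that variable moves by exactly one unit.
        succ = []
        for var, level in state.items():
            focal = K[var][acting_resources(var, state)]
            if focal != level:
                nxt = dict(state)
                nxt[var] = level + (1 if focal > level else -1)
                succ.append(nxt)
        return succ

Enumerating successors(state, K) over all 18 states yields the complete state graph for a given parameterization K.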

4. An Example of Possible Parameter Values, Among Others

When the environment brings BaP into the cell, one has to consider
all possible values of BaP and consequently the state graph is three-dimensional,
with three planes (one for each value of BaP) whose
transitions are similarly deduced from the parameters, and there are
several transitions that jump between planes when BaP varies. Let
us consider a first case where the quantity of intracellular BaP can be
handled by the dp1. This case is modeled by KBaP,{} = 1, and,
except KCYP,{} = 0 and KBM,{} = 0, the other parameters are a
priori unknown. The next section explains how the computer can
help in finding the parameter values. For the moment, let us
arbitrarily consider the following reasonable values to complete
Fig. 3:
1. KBaP,{dp1} = 0 (meaning that the dp1 can be sufficient to
reduce BaP from 1 to 0) and KBaP,{BM} = 1 (following the
intuition that BM characterizes the dp2 by the presence of
oxidative stress, and that the role of the dp2 is to reduce BaP
from 2 to 1).
2. KCYP,{BaP1} = 1 (the expression of CYP when BaP is maintained
at level 1) and KCYP,{BaP1,BaP2} = 2 (the expression of
CYP when BaP is maintained at level 2).
3. KBM,{dp2} = 1 (BM triggers a significant oxidative stress when
both BaP and CYP are maintained at level 2).
For each of the 18 possible states, we exhaustively list the edges
that act on each variable, and this determines the parameter that
plays the role of focal point. The table on the left of Fig. 6 gives the
18 corresponding lines.
Then, by applying the two principles explained before, we get
the state graph drawn on the right of Fig. 6. Of course, the lower-level
plane, where BaP = 0, is the one obtained in Fig. 5, but we
can see that the state (BaP = 0, CYP = 0, BM = 0) is not a stable
state anymore, because the new value KBaP,{} = 1 creates the red

[Fig. 6 appears here: the three-dimensional state graph drawn as three (CYP, BM) planes, one for each of BaP = 0, BaP = 1, and BaP = 2.]

Fig. 6. A discrete model of the interleaving pathways dp1 and dp2 when the environment imposes an in-between level of BaP.
The table shows for each possible state the set of resources of each variable and the possible evolution directions (left).
The associated state graph (right).
transition (0,0,0) → (1,0,0). This transition represents the fact that
the cell environment pulls BaP to level 1. In the BaP = 1 plane,
CYP increases to level 1 via the blue transition (1,0,0) → (1,1,0).
Then BaP is reduced to level 0 via the green transition
(1,1,0) → (0,1,0). In the BaP = 0 plane, CYP is reduced to level
0 via the blue transition (0,1,0) → (0,0,0), and finally we observe
that the cycle (0,0,0) → (1,0,0) → (1,1,0) → (0,1,0) → (0,0,0)
replaces the stable state observed in Fig. 5 where KBaP,{} was equal
to 0. This cycle reflects the behavior of the dp1.
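
Plugging these parameter values into the sketch of Subheadings 3.2 and 3.3 reproduces this cycle directly (each state on the cycle happens to have a unique successor, so we can simply follow it):

    K = {
        "BaP": {frozenset(): 1, frozenset({"dp1"}): 0, frozenset({"BM"}): 1,
                frozenset({"dp1", "BM"}): 1},   # unreachable (see Subheading 3.3)
        "CYP": {frozenset(): 0, frozenset({"BaP1"}): 1,
                frozenset({"BaP1", "BaP2"}): 2,
                frozenset({"BaP2"}): 2},        # unreachable (see Subheading 3.3)
        "BM":  {frozenset(): 0, frozenset({"dp2"}): 1},
    }

    state = {"BaP": 0, "CYP": 0, "BM": 0}
    for _ in range(4):
        state = successors(state, K)[0]
        print(state)
    # visits (1,0,0), (1,1,0), (0,1,0) and returns to (0,0,0): the dp1 cycle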
Some remarks:
● Notice that the non-P450 genes are not explicitly taken into
account in this regulatory model, for pedagogical reasons only
(in order to avoid four-dimensional state graphs in this chapter).
A non-P450 variable should have been included in the
model for better biological credibility and, of course, the
example is easy to study with the help of a computer, which
has no difficulty handling a large number of dimensions.
Indeed, all the results explained here remain valid when we
hide, as we did, the non-P450 genes in the two inhibition
arrows of Fig. 3.
● Besides, notice that it is impossible to escape from the cycle
(0,0,0) → (1,0,0) → (1,1,0) → (0,1,0) → (0,0,0), which is
consequently a basin of attraction of the state graph.
● All other states of the state graph converge toward this basin of
attraction, which consequently represents the only functional
behavior according to this parameter setting.
● Notice also that the BaP = 2 plane is unreachable in normal
conditions because all vertical transitions between the two
planes BaP = 1 and BaP = 2 are going down.
This family of parameter values also seems suitable to model the
case where the cell environment brings BaP into the cell at a
sufficient level to trigger a significant oxidative stress and the dp2:
we consider the same parameter values except that KBaP,{} = 2
instead of 1; we get the state graph of Fig. 7.
We see that the cycle reflecting the dp1 is no longer a basin of
attraction. Indeed, there is a red transition (1,0,0) → (2,0,0) that
escapes from the cycle of pathway 1 due to the capability of the
environment to pull up BaP to level 2. New cycles appear; among
them, the preferentially chosen ones go through the states (2,1,0),
(2,2,0), (2,2,1), and (1,2,1). These cycles denote that a larger
amount of CYP is expressed, that BM are significantly produced,
and that a significant oxidative stress is triggered. These cycles
belong to the possible behaviors of the dp2. The bold arrows of
Fig. 7 show one of those cycles. Remember that we do not take into
account the DNA damage in this model, which would impose an
escape from dp2.

[Fig. 7 appears here: the three-dimensional state graph drawn as three (CYP, BM) planes, one for each of BaP = 0, BaP = 1, and BaP = 2, with the bold cycle through (2,1,0), (2,2,0), (2,2,1), and (1,2,1) highlighted.]

Fig. 7. A discrete model of the interleaving pathways dp1 and dp2 when the environment imposes a high level of BaP.
The table shows for each possible state the set of resources of each variable and the possible evolution directions (left).
The associated state graph (right).

5. Materials: Model Checking and SMBioNet
Since the parameters are generally not measurable in vivo, finding a
suitable class of parameters constitutes a major issue of the model-
ing activity. In fact, while available data on the connectivity between
elements of the network are more and more numerous, the kinetic
data of the associated interactions remain difficult to interpret in
order to identify the strength of the gene activations or inhibitions.
While it is rather easy to construct the interaction graph, the
determination of the dynamics of the model is quite difficult. This
parameter identification problem constitutes the cornerstone of the
modeling activities. It would therefore be interesting to automatically
exhibit, from some biologically known behaviors or some hypothetical
behaviors, parameters of the model that lead to dynamics
coherent with the set of available knowledge on the behavior of
the system. In the context of purely discrete modeling presented
before, this problem is simpler because of the finite number of
parameterizations to consider. Nevertheless this number is so enor-
mous that a computer-aided method is needed to help biologists
to go further in the comprehension of the biological system
under study. We show in this section how formal methods from
computer science are able to perform computer-aided identification
of parameters.
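
The brute-force core of such a computer-aided identification fits in a few lines (a conceptual Python sketch of ours; SMBioNet's actual machinery is considerably more elaborate): enumerate every discrete parameterization and keep those whose state graph satisfies the formalized knowledge.

    from itertools import product

    def admissible_parameterizations(domains, satisfies):
        # domains: dict mapping each parameter name to its finite set of
        # possible values. satisfies: a callback that builds the state
        # graph for one parameterization and checks the formalized
        # knowledge (temporal-logic formulas, in SMBioNet's case).
        names = list(domains)
        for values in product(*(domains[n] for n in names)):
            K = dict(zip(names, values))
            if satisfies(K):
                yield K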

5.1. Temporal Logic

Temporal logics are languages that allow us to formalize known
biological behaviors or hypothetical behaviors in such a way that
computers can automatically check if a model exhibits those beha-
viors or not. The building blocks of a temporal logic are atoms,
connectives, and temporal modalities. Let us here consider the
Computation Tree Logic (26), CTL for short, which is one of the
most common temporal logics:
1. Atoms in CTL are simple statements about the current state of
a variable of the network: equalities (e.g., (BaP = 2)) or
inequalities (e.g., (CYP < 1) or (CYP > 1)).
2. Connectives are the standard logical connectives: ¬, "not,"
stands for negation (e.g., ¬(BaP = 0) is the negation of the atom
(BaP = 0)); ∧, "and," stands for conjunction (e.g., (BaP = 0) ∧
(CYP > 1)); ∨, "or," stands for disjunction (e.g.,
(BaP = 0) ∨ (CYP > 1)); ⇒, "implies," stands for implication
(e.g., (BaP = 0) ⇒ (CYP > 1)); and so on.
3. Temporal modalities are combinations of two types of informa-
tion:
(a) Quantifiers: a formula can be checked with respect to all
possible choices of path in the asynchronous state graph (universal
quantifier, denoted by the character A), or one can
check whether there exists at least one path such that the formula is
satisfied (existential quantifier, denoted by the character E).
(b) Discrete time elapsing: A formula can be checked at the
next state (character X), in some future state which is not
necessarily the next one (character F), and in all future
states (character G). Moreover a formula can be checked
until another formula becomes satisfied in the future (char-
acter U).

In short, a CTL modality is the concatenation of two characters:

    First character                  Second character
    A: for All path choices          X: in the neXt state
    E: there Exists a path choice    F: in some Future state
                                     G: in all future states (Globally)
                                     U: Until another formula holds

To illustrate how to use CTL to express a biological property,
let us consider the formula

((BaP = 0) ∧ (CYP = 0) ∧ (BM = 0)) ⇒ EF((BaP = 2) ∧ (CYP = 2) ∧ (BM = 1) ∧ AG(BaP = 2))

This formula means that, starting from an initial state where
(BaP = 0), (CYP = 0), and (BM = 0), it is possible (character E) to
reach, in the future (character F), a state where (BaP = 2),
(CYP = 2), and (BM = 1). Moreover, from this latter state, all
trajectories (character A) will stay forever (character G) in the set
of states where (BaP = 2).

5.2. CTL to Encode Biological Properties

CTL formulas are useful to express temporal properties of biological
systems. Once such properties have been elaborated, a model of the
biological system is acceptable only if its state graph satisfies the
CTL formulas; otherwise, the model is rejected. Considering our
running example, three temporal properties seem relevant.
The first temporal property focuses on the behavior of the
system when the toxic exposure level is null (KBaP = 0). In such a
case, the system is able to reset the expression level of CYP toward
its basal level, that is, toward 0. Let us first denote by (x,y,z) the
formula ((BaP = x) ∧ (CYP = y) ∧ (BM = z)). Since (KBaP = 0) is
equivalent to the fact that, from the state (0,0,0), the increase of
BaP is not possible, this behavior is translated into CTL as

φ0 ≡ [(0,0,0) ⇒ ¬EX(1,0,0)] ∧ [(BaP = 0) ⇒ AF(AG(CYP = 0))]
The second property focuses on the behavior of the system
when the toxic exposure level is set to 1 (KBaP = 1). The pathway
dp1 is supposed to be sufficient to detoxify the cell completely.
Moreover, BaP cannot increase up to level 2. In addition, dp1 (when
the BaP level is decreasing from level 1 to 0) does not involve the
oxidative stress/ARE pathway (in other words, BM = 0). Also,
(KBaP = 1) is equivalent to the fact that, on the one hand, from
the state (0,0,0), the increase of BaP is possible, and on the other
hand, from the state (2,0,0), the decrease of BaP is possible.
Thus, these properties are translated into CTL as follows:

φ1 ≡ ([(0,0,0) ⇒ EX(1,0,0)] ∧ [(2,0,0) ⇒ EX(1,0,0)]) ∧ ((BaP > 0) ⇒ {EF(BaP = 0) ∧ AF(AG(BaP < 2)) ∧ AG[((BaP = 1) ∧ EX(BaP = 0)) ⇒ (BM = 0)]})
The third temporal property focuses on the behavior when the
toxic exposure level is set to 2 (KBaP = 2). In such a case, a path
which detoxifies the cell completely exists:

φ2 ≡ (BaP = 2) ⇒ EF(BaP = 0)

In practice, CTL formulas are sufficient to express the majority
of useful biological properties, even if in some cases the translation
of a property is tricky.

5.3. Computer-Aided Elaboration of Formal Models

To apprehend a biological system, researchers accumulate knowledge
about it. As seen in Subheading 2, this knowledge includes structural
and dynamic knowledge. CTL is used to encode the dynamic properties of
the biological system, including the response to a given stress, some
possible stationary states, known oscillations, etc. In general, this
second kind of knowledge can also be a hypothesis about the behavior
of the system.
The first question focuses on consistency: Is the dynamic knowledge
coherent with the structural knowledge? After the formalization step,
formal logic and formal models allow us to test hypotheses, to check
consistency, to elaborate more precise models incrementally, and to
suggest new and relevant biological experiments. The classical way of
testing consistency, introduced in (24), consists of the following
four steps:
● Draw all the sensible regulatory graphs according to the structural
biological knowledge, with all the sensible, possible threshold
allocations.
● Express in a formal language, CTL for example, the known behavioral
properties as well as the considered biological hypotheses.
● Then, automatically generate, for each possible regulatory graph,
all the possible values for all parameters, and, for all of them,
generate the (very numerous) corresponding state graphs.
● Check each of these models against the CTL formulas expressing the
dynamic knowledge. This step is called model checking.

If no model survives the fourth step, then reconsider the hypotheses
and perhaps extend the model schemes.
In the context of R. Thomas modeling, the software platform
SMBioNet (25) implements this way of testing consistency: it allows
one to select the models that are consistent with the regulatory
graph and with the dynamic properties expressed in CTL. For each
parameterization, SMBioNet constructs the corresponding asynchronous
state graph and checks whether the CTL temporal formula is satisfied
by this state graph. This verification step is performed by the model
checker NuSMV (27).
The total number of parameterizations to consider can be easily
computed. Let us first remark that the parameters associated with BaP
(KBaP,{}, KBaP,{dp1}, KBaP,{dp1,BM}) can take their values in {0,1,2},
that the parameters associated with CYP (KCYP,{}, KCYP,{bap1},
KCYP,{bap1,bap2}) can take their values in {0,1,2}, and that the
parameters associated with BM (KBM,{}, KBM,{dp2}) can take their
values in {0,1}, as described in Subheading 9.3.3. Thus there exist
3³ × 3³ × 2² = 2,916 different parameterizations to consider. It does
not mean that there exist 2,916 different state graphs to consider,
for two reasons:
● It is not restrictive to consider only monotonous parameterizations,
where an activator cannot decrease a parameter (if a is an activator
of a variable v and if w is a set of resources of v, then
Kv,w ≤ Kv,w∪{a}) and an inhibitor cannot increase a parameter (if i is
an inhibitor of v, then Kv,w∪{i} ≤ Kv,w), because the multiplexes make
the exceptions explicit at the structural level.
● Several parameter values can lead to the same state graph.
Consequently, the software platform SMBioNet enumerates only the
different state graphs that are associated with a monotonous
parameterization. For our BaP example, SMBioNet enumerates only 420
state graphs, submits each graph to the model checker, selects only
the state graphs that satisfy the CTL formulas φ0, φ1, or φ2, and
outputs the corresponding parameterizations.
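The arithmetic above is small enough to verify directly. The following fragment is a minimal sketch of ours, not part of SMBioNet, that simply multiplies the sizes of the value ranges of the eight parameters:

public class ParameterCount {
    public static void main(String[] args) {
        // K_BaP,{}, K_BaP,{dp1}, K_BaP,{dp1,BM}: each ranges over {0,1,2}
        // K_CYP,{}, K_CYP,{bap1}, K_CYP,{bap1,bap2}: each ranges over {0,1,2}
        // K_BM,{}, K_BM,{dp2}: each ranges over {0,1}
        long total = 3L * 3 * 3 * 3 * 3 * 3 * 2 * 2;
        System.out.println(total); // prints 2916
    }
}

Enumerating only the monotonous parameterizations proceeds in the same spirit, with nested loops over these ranges and the two ordering constraints above applied as pruning tests.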
There are 18 models (parameter valuations) which lead to a state
graph consistent with φ0, 14 models consistent with φ1, and 90 models
consistent with φ2. According to Subheading 5.2, we are interested in
the models that satisfy φ0 when KBaP = 0, φ1 when KBaP = 1, and φ2
when KBaP = 2. Interesting models are then those that share all the
values of parameters except the values of the parameters KBaP,...
because BaP is an environmental variable of the regulatory network
(external to the cell). Finally, three different parameterizations of
KCYP,... and KBM,... survive. To deepen our understanding of the
system, we have to construct the corresponding state graph for each
possible value of the parameters KBaP,... and check their biological
meaning, as partly done in Subheading 9.4 for one of these three
parameterizations.

References

1. Kier LB, Bonchev D, Buck GA (2005) Modeling biochemical networks: a cellular-automata approach. Chem Biodivers 2:233–243
2. Hoehme S, Drasdo D (2010) A cell-based simulation software for multicellular systems. Bioinformatics 26(20):2641–2642
3. Poudret M, Comet J-P, Le Gall P et al (2008) Topology-based abstraction of complex biological systems: application to the Golgi apparatus. Theory Biosci 127:79–88
4. Eungdamrong NJ, Iyengar R (2004) Modeling cell signaling networks. Biol Cell 96:355–362
5. Schuster S, Hilgetag C, Woods JH et al (2002) Elementary flux modes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism. J Math Biol 45:153–181
6. Thomas R, d'Ari R (1990) Biological feedback. CRC, Boca Raton
7. Bostrom CE, Gerde P, Hanberg A et al (2002) Cancer risk assessment, indicators, and guidelines for polycyclic aromatic hydrocarbons in the ambient air. Environ Health Perspect 110(suppl 3):451–488
8. Phillips DH (1999) Polycyclic aromatic hydrocarbons in the diet. Mutat Res 443(1–2):139–147
9. Schmidt JV, Bradfield CA (1996) Ah receptor signaling pathways. Annu Rev Cell Dev Biol 12:55–89
10. Hankinson O (1995) The aryl hydrocarbon receptor complex. Annu Rev Pharmacol Toxicol 35:307–340
11. Gu YZ, Hogenesch JB, Bradfield CA (2000) The PAS superfamily: sensors of environmental and developmental signals. Annu Rev Pharmacol Toxicol 40:519–561
12. Nebert DW, Roe AL, Dieter MZ et al (2000) Role of the aromatic hydrocarbon receptor and [Ah] gene battery in the oxidative stress response, cell cycle control, and apoptosis. Biochem Pharmacol 59:65–85
13. Nebert DW, Dalton TP, Okey AB et al (2004) Role of aryl hydrocarbon receptor-mediated induction of the CYP1 enzymes in environmental toxicity and cancer. J Biol Chem 279(23):23847–23850
14. Nebert DW, Vasiliou V (2004) Analysis of the glutathione S-transferase (GST) gene family. Hum Genomics 1(6):460–464
15. Nioi P, Hayes JD (2004) Contribution of NAD(P)H:quinone oxidoreductase 1 to protection against carcinogenesis, and regulation of its gene by the Nrf2 basic-region leucine zipper and the aryl hydrocarbon receptor basic helix-loop-helix transcription factors. Mutat Res 555(1–2):149–171
16. Yoshinari K, Okino N, Sato T et al (2006) Induction of detoxifying enzymes in rodent white adipose tissue by aryl hydrocarbon receptor agonists and antioxidants. Drug Metab Dispos 34:1081–1089
17. Wills LP, Zhu S, Willett KL et al (2009) Effect of CYP1A inhibition on the biotransformation of benzo[a]pyrene in two populations of Fundulus heteroclitus with different exposure histories. Aquat Toxicol 92(3):195–201
18. Parkinson A (1996) Biotransformation of xenobiotics. In: Klaassen CD (ed) Casarett and Doull's toxicology: the basic science of poisons. McGraw-Hill, New York
19. Yang SK (1988) Stereoselectivity of cytochrome P-450 isozymes and epoxide hydrolase in the metabolism of polycyclic aromatic hydrocarbons. Biochem Pharmacol 37:61–70
20. Kohle C, Bock KW (2007) Coordinate regulation of phase I and II xenobiotic metabolism by the Ah receptor and Nrf2. Biochem Pharmacol 73:1853–1862
21. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103
22. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361
23. Tyson JJ, Othmer HG (1978) The dynamics of feedback control circuits in biochemical pathways. Prog Theor Biol 5:1–62
24. Bernot G, Comet J-P, Richard A et al (2004) Application of formal methods to biological regulatory networks: extending Thomas' asynchronous logical approach with temporal logic. J Theor Biol 229(3):339–347
25. Khalis Z, Comet J-P, Richard A et al (2009) The SMBioNet method for discovering models of gene regulatory networks. Genes Genomes Genomics 3(special issue 1):15–22
26. Emerson EA (1990) Temporal and modal logic. In: Van Leeuwen J (ed) Handbook of theoretical computer science. MIT Press, Cambridge
27. Cimatti A, Clarke EM, Giunchiglia E et al (2002) NuSMV 2: an open source tool for symbolic model checking. In: Proceedings of the international conference on computer-aided verification (CAV 2002), pp 27–31
Chapter 10

Computational Reconstruction of Metabolic Networks from KEGG

Tingting Zhou

Abstract

The reconstruction of metabolic networks from metabolites, enzymes, and reactions is the foundation of network-based studies of metabolism. In this chapter, we describe a practical method for reconstructing metabolic networks from KEGG. This method makes use of the organism-specific pathway data in the KEGG/PATHWAY database to reconstruct metabolic networks at the pathway level, and the pathway hierarchy data in the KEGG/ORTHOLOGY database to guide the network reconstruction at higher levels. By calling upon the KEGG Web services, this method ensures that the data used in the reconstruction are correct and up-to-date. The incorporation of a local relational database allows caching of pathway data, which improves performance and speeds up network reconstruction. Some applications of the reconstructed networks to network alignment and network topology analysis are illustrated, and notes are given at the end.

Key words: Metabolic network, Network reconstruction, KEGG, Web service, Relational database,
Network alignment, Network topology analysis

1. Introduction

Through a complex biological reaction system, cellular metabolism
manages the synthesis, decomposition, and rearrangement of energy and
substances vital for the maintenance of life. As the abstract
graphical form of real metabolism, different types of metabolic
networks are often computationally reconstructed to help explain
metabolic evolution (1, 2), predict essential enzymes (3), explore
network structure (4), track compound-transforming routes (5),
identify potential drug targets (6), and link experimental data to
discover biomarkers and analyze toxicity (7).
By providing curated and well-organized genome and metabolism data,
metabolic databases such as KEGG (8–10), BioCyc (11), and PUMA2 (12)
make it easier to reconstruct metabolic networks for sequenced
species. Distinctive for its broad data range, its organism-specific
and pathway-specific data organization, and its explicit but
semistatic display of pathway maps, KEGG is considered one of the most
popular databases in research on metabolic networks (13–15). A series
of excellent software tools have been created to meet different
research requirements, such as KGML-ED (16) for dynamically
visualizing the networks, FMM (5) for navigating the metabolites,
KEGGgraph (17) for reconstructing metabolic networks using R and
Bioconductor, and KEGG spider (18) for analyzing gene lists in the
context of KEGG metabolic pathways.
Rather than going into the details of these software tools and
repeating their user guides, this chapter aims to share our experience
in reconstructing metabolic networks from KEGG, including how to
reconstruct different types of metabolic networks at different levels
and how to design a local database to accelerate the reconstruction
process. The remainder is organized as follows: Subheading 2 lists the
software needed for the computational reconstruction; Subheading 3
explains the method and process in detail; Subheading 4 illustrates
some applications of the reconstructed metabolic networks; and
Subheading 5 provides some notes.

2. Materials

● The Java runtime environment, for example, the JRE or JDK.
● Internet service. This is optional: if you are not connected online,
you can download the pathway data files once and parse them locally.
● A local relational database, for example, MySQL. This is optional in
case you do not plan to cache the retrieved data locally for future
use.
● Software for network display and analysis, such as Pajek (19) and
Cytoscape (20). These are also needed in case you are not going to
realize all your ideas by programming.

3. Methods

In different research contexts, metabolic networks are often
represented as different types of graphs, such as metabolite graphs
(21), enzyme graphs (22), reaction graphs (23), and pathway graphs
(17). In metabolite graphs, metabolites are represented as nodes, and
transformation relations between metabolites are represented as
edges. These edges can be directed or undirected according to the
irreversibility or reversibility of the reactions, where the directed

Fig. 1. Illustration of the metabolic network representation scheme. Two reversible reactions, R1 and R2, are taken as examples in (a). In R1, substrate S is transformed to product P under the catalysis of enzyme E1; in R2, substrate P, also the product of R1, is transformed to Q under the catalysis of enzyme E2. Thus, metabolites S, P, and Q are connected to one another via enzymes or reactions, forming a simple metabolite graph, as (b) illustrates. Meanwhile, enzymes E1 and E2, and reactions R1 and R2, interact with each other via the common metabolite P and form the enzyme graph (c) and the reaction graph (d), respectively. In these graphs, each undirected edge is split into two arcs with opposite directions.

edges are sometimes called arcs. Enzyme graphs and reaction graphs are
extracted in the same way; their extraction scheme is shown in Fig. 1.
Pathway graphs are extracted in a different way: nodes stand for
pathways and edges for pathway relations. Two pathways are related to
each other when they share common components.
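Before turning to KEGG itself, note that the representation scheme just described needs only a very small data structure. The sketch below is an illustration of ours, not code from any of the tools cited above; it stores a directed graph as an adjacency map and encodes each reversible transformation as two opposite arcs, as in Fig. 1:

import java.util.*;

public class MetabolicGraph {
    // Adjacency map: node -> set of successor nodes.
    private final Map<String, Set<String>> arcs = new HashMap<>();

    public void addArc(String from, String to) {
        arcs.computeIfAbsent(from, k -> new HashSet<>()).add(to);
        arcs.computeIfAbsent(to, k -> new HashSet<>()); // register the target node too
    }

    // A reversible transformation is stored as two arcs with opposite directions.
    public void addTransformation(String substrate, String product, boolean reversible) {
        addArc(substrate, product);
        if (reversible) addArc(product, substrate);
    }

    public Set<String> nodes() { return arcs.keySet(); }

    public static void main(String[] args) {
        MetabolicGraph g = new MetabolicGraph();
        g.addTransformation("S", "P", true); // R1 of Fig. 1
        g.addTransformation("P", "Q", true); // R2 of Fig. 1
        System.out.println(g.nodes());       // e.g., [P, Q, S]
    }
}

The same structure serves for enzyme, reaction, and pathway graphs; only the meaning of the node identifiers changes.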
KEGG provides data in several ways. KEGG/LIGAND (24) contains the
information on metabolites, enzymes, and reactions as individual flat
files, which is suitable for reconstructing genome-scale metabolite
networks. KEGG/PATHWAY (25) maintains the molecular interactions and
reaction relations for each organism-specific pathway, which makes it
easier to reconstruct enzyme networks, reaction networks, and pathway
networks. KEGG/ORTHOLOGY (KO) provides a three-level reference
hierarchy to organize pathway data; the levels are named the pathway
level, the subprocess level, and the process level in this chapter.
For example, pathways such as glycolysis/gluconeogenesis belong to the
pathway level; systems such as carbohydrate metabolism belong to the
subprocess level; and the whole metabolic system belongs to the
process level, which is also the highest level. This hierarchy enables
reconstructing metabolic networks on different levels selectively.
Regarding the reconstruction of metabolite networks, Ma et al. (13)
proposed a method based on parsing the data files in KEGG/LIGAND. That
method is described thoroughly in their work, so we will not expand on
it further (see Note 1). In this section, we look into the details of
how to extract enzyme relations and pathway relations for the network
reconstruction, and how to use KEGG API methods to reconstruct enzyme
and pathway networks step by step.

Fig. 2. Part of enzyme data in hsa00010.xml. The highlighted and italic parts are used in the enzyme relation extraction.

3.1. Enzyme Network Reconstruction

3.1.1. On Pathway Level

In KEGG/PATHWAY, the pathway data is encoded in KGML format. KEGG
assigns each pathway a code which consists of a 3- or 4-letter
organism prefix and a 5-digit map number (KO number) (26). For
example, hsa00010 denotes the glycolysis/gluconeogenesis pathway of
Homo sapiens (hsa). Using the pathway hsa00010 as an example, we
illustrate how to extract enzyme relations for the network
reconstruction on the pathway level.

Figure 2 displays part of hsa00010.xml, the pathway data of hsa00010
(KEGG release 50.0). The italic part means: the id=1 group of genes
encodes enzymes which catalyze the reaction R00710; the id=2 group of
genes encodes enzymes which catalyze the reaction R00235; and the
compound C00033, with id=4, is involved in both R00710 and R00235 and
links the corresponding entries together. This type of link is
recorded as ECrel. Since both R00710 and R00235 are reversible, the
direction of this link is described as undirected (some publications
also describe it as bidirectional).

Fig. 3. Extraction of enzyme relations. The dashed lines represent the relation data given in Fig. 2. The dual lines with left-right open-headed arrows indicate the extracted bidirectional enzyme relations.
However, from this pathway data alone one cannot know the
correspondence between genes and enzymes. This information is the key
to translating links between genes into relations between enzymes.
There are two ways to retrieve the gene-enzyme map data. One is to
parse the file named enzyme in KEGG/LIGAND. The other is to use
get_enzymes_by_gene(), a method of the KEGG API. Either way, we learn
that the id=1 group of genes encodes the enzyme ec:6.2.1.1, while the
id=2 group of genes is responsible for the enzymes ec:1.2.1.3 and
ec:1.2.1.5. With this knowledge, the enzyme relations
ec:6.2.1.1-ec:1.2.1.3 and ec:6.2.1.1-ec:1.2.1.5 can be extracted as
edges in the enzyme network (Fig. 3).

When associating genes to reactions and vice versa, it is important to
remember that not all genes have a one-to-one relationship with their
corresponding enzymes or metabolic reactions. Many genes may encode
subunits of a protein which catalyzes one reaction; meanwhile, the
same gene can be responsible for several isozymes (27). Due to this
many-to-many relation between genes and enzymes, the mapping from
genes to enzymes can introduce enzymes which do not belong to the
current pathway. This must be taken into account during the
programmatic reconstruction.
Using hsa00010 as an example, the following list shows how to
reconstruct enzyme networks with the help of the KEGG API. The
reconstructed network is shown in Fig. 4.

Fig. 4. The reconstructed enzyme network of hsa00010. This network consists of 27 nodes and 91 arcs, rendered in Cytoscape. Notice that this network shows strong modularity; two main modules are distinguished by different gray levels.

1. Retrieve all the entries in hsa00010 using get_elements_by_pathway().
2. Filter out those entries whose type is not gene to form the gene
list of hsa00010.
3. Use get_enzymes_by_gene() to map the genes in the gene list to
enzymes. Notice that each gene can have a list of enzymes, thus the
output should be an enzyme list.
4. Use get_enzymes_by_pathway() to filter out the enzymes retrieved in
step 3 that are not in the current pathway.
5. Combine the gene list derived from step 2 and the enzyme list
derived from step 4 to form the mapping of genes to enzymes in the
current pathway. For the sake of convenient description, we note it as
the gene-enzyme table.

6. Retrieve the enzyme list from the gene-enzyme table. This is the
list of enzymes in hsa00010.
7. Retrieve all the entry relations in hsa00010 using
get_element_relations_by_pathway().
8. Form the gene relation list by filtering out those relations whose
type is not ECrel. Notice that gene relations are directed; the
direction is from entry 1 to entry 2 in the ECrel relation.
9. Map the gene relation list to the enzyme relation list according to
the gene-enzyme table. The directions of the gene relations are kept
for the enzyme relations in the enzyme relation list (a minimal sketch
of this step is given after this list).
10. With the enzyme list derived from step 6 as the node list and the
enzyme relation list from step 9 as the arc list, complete the
reconstruction of the enzyme network for hsa00010.
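To make steps 8 and 9 concrete, the following sketch translates one ECrel gene relation into enzyme relations through the gene-enzyme table. The data are hardcoded from the example of Figs. 2 and 3; in a real run the table would be built with get_enzymes_by_gene() as described above, and this fragment is our illustration rather than code from any published tool:

import java.util.*;

public class EnzymeRelationMapper {
    public static void main(String[] args) {
        // Gene-enzyme table (entry id -> enzymes); a many-to-many mapping.
        Map<String, List<String>> geneEnzymes = new HashMap<>();
        geneEnzymes.put("1", Arrays.asList("ec:6.2.1.1"));
        geneEnzymes.put("2", Arrays.asList("ec:1.2.1.3", "ec:1.2.1.5"));

        // One ECrel gene relation, directed from entry 1 to entry 2.
        String[][] geneRelations = { {"1", "2"} };

        // Translate each gene relation into enzyme relations, keeping direction.
        Set<String> enzymeArcs = new LinkedHashSet<>();
        for (String[] rel : geneRelations)
            for (String e1 : geneEnzymes.get(rel[0]))
                for (String e2 : geneEnzymes.get(rel[1]))
                    enzymeArcs.add(e1 + " -> " + e2);

        for (String arc : enzymeArcs) System.out.println(arc);
        // ec:6.2.1.1 -> ec:1.2.1.3
        // ec:6.2.1.1 -> ec:1.2.1.5
    }
}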

3.1.2. On Higher Level

Pathways can be considered as subnetworks that have comparatively
independent functions. Therefore, the reconstruction of enzyme
networks on higher levels can be considered as the recursive union of
the involved pathway-level enzyme networks, in light of the
organism-specific KO hierarchy. Since KEGG/ORTHOLOGY provides nothing
but the reference hierarchy, the key problem is how to determine the
KO hierarchy for the specific organism.

The list below shows how to reconstruct the enzyme network of hsa on
the process level, that is, hsa01100. Because of chapter length
limitations, the reconstructed network is not shown.

1. Use get_ko_by_ko_class() to retrieve the ko list of systems on the
subprocess level of hsa01100.
2. For each system in the subprocess-level ko list, use
get_ko_by_ko_class() again to get the ko list of pathways on the
pathway level; by now the three-level KO reference hierarchy is
retrieved.
3. Use list_pathways() to retrieve all the pathways of the organism
hsa. By now we have the pathway list of the given organism.
4. Use the pathway list as a filter to prune the three-level KO
reference hierarchy. Now the hsa-specific KO hierarchy is set up.
5. Reconstruct enzyme networks for all the pathways listed in the
hsa-specific KO hierarchy.
6. Union the reconstructed enzyme networks into a whole network (a
sketch of this union step follows this list). This is the enzyme
network of hsa01100.
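Step 6 amounts to a plain graph union: when node and arc sets are merged, shared enzymes and shared relations collapse automatically. A minimal sketch of ours, with illustrative identifiers:

import java.util.*;

public class NetworkUnion {
    // Merge one pathway-level network (pwNodes, pwArcs) into the accumulated one.
    static void union(Set<String> nodes, Set<String> arcs,
                      Set<String> pwNodes, Set<String> pwArcs) {
        nodes.addAll(pwNodes); // shared enzymes collapse into a single node
        arcs.addAll(pwArcs);   // duplicated relations collapse into a single arc
    }

    public static void main(String[] args) {
        Set<String> nodes = new HashSet<>(), arcs = new HashSet<>();
        union(nodes, arcs,
              new HashSet<>(Arrays.asList("ec:6.2.1.1", "ec:1.2.1.3")),
              new HashSet<>(Arrays.asList("ec:6.2.1.1->ec:1.2.1.3")));
        union(nodes, arcs,
              new HashSet<>(Arrays.asList("ec:1.2.1.3", "ec:1.1.1.1")),
              new HashSet<>(Arrays.asList("ec:1.2.1.3->ec:1.1.1.1")));
        System.out.println(nodes.size() + " nodes, " + arcs.size() + " arcs"); // 3 nodes, 2 arcs
    }
}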

3.2. Pathway Network Reconstruction

KEGG/PATHWAY contains not only the molecular interaction data within
each pathway, but also the relation data among pathways. Figure 5
shows an example of how to extract the pathway relation data from the
pathway data file.

Fig. 5. Part of pathway data in hsa00010.xml. The highlighted and italic parts are used in the pathway relation extraction.

In Fig. 5, the pathway hsa00030, whose id=5, interacts with the
current pathway hsa00010 through their common component, the
reversible reaction R02739. The genes with id=31 encode the enzymes
which catalyze R02739 to produce the compound C00668, whose id=5. Just
as the enzyme relation is marked as ECrel, the relation between
pathways is marked as maplink. Pathway relations are here considered
to be undirected, as illustrated in Fig. 6.

With the hsa-specific KO hierarchy set up in Subheading 3.1.2, we take
hsa01101, the carbohydrate metabolism of H. sapiens, as an example to
illustrate the pathway network reconstruction.
1. Retrieve the pathway list under hsa01101 based on the hsa-
specific KO hierarchy.

Fig. 6. Extraction of pathway relations. The dashed rectangle represents the pathway hsa00010. The dash-dot rectangle represents the pathway hsa00030. Gene hsa:2821 is recorded as the entry with id=31 in the pathway data file hsa00010.xml and as the entry with id=4 in hsa00030.xml. This gene and its related enzymes and reactions are the common part of these two pathways, which is placed in the overlap of the two rectangles. It indicates how the relation between pathways hsa00010 and hsa00030 is formed as maplink.

2. For each pathway in the pathway list:
(a) Use get_elements_by_pathway() to retrieve all the entries of the
current pathway.
(b) Set a proper filter to list the map entries among the retrieved
entries. These map entries point to the pathways which share common
components with the current pathway. By now we have the pathway
relation list of the current pathway (a sketch of this filtering step
is given after this list).
3. Integrate the pathway relation lists of all the pathways in the
pathway list to reconstruct the pathway network of hsa01101.
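Step 2(b) reduces to filtering the retrieved entries by type. In the sketch below, the Entry class is a hypothetical stand-in for whatever element type your KEGG API client returns; only the type test and the collection of pathway names matter here:

import java.util.*;

public class PathwayRelationFilter {
    static class Entry {
        final String id, type, name;
        Entry(String id, String type, String name) {
            this.id = id; this.type = type; this.name = name;
        }
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("5",  "map",  "path:hsa00030"),  // as in Fig. 5
            new Entry("31", "gene", "hsa:2821"));

        List<String> related = new ArrayList<>();
        for (Entry e : entries)
            if (e.type.equals("map"))   // keep only links to other pathways
                related.add(e.name);
        System.out.println(related);    // [path:hsa00030]
    }
}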

3.3. Local Database Design

A local relational database helps manage the downloaded data. It not
only helps keep the data accurate and speeds up the process of network
reconstruction, but also allows you to reuse the data for other
purposes.

Generally speaking, reference data such as the KO reference hierarchy
do not change frequently. Neither does the metabolic data of organisms
whose genomes were sequenced very early. However, for organisms whose
genomes are undersequenced or were completely annotated only recently,
the metabolic data are updated frequently. In this case, assigning
different TTLs (time-to-live) to different types of data will reduce
the network overhead and data retrieval time caused by frequently
visiting KEGG remotely. For example, it would be helpful to set the
renewal period of the KO reference hierarchy data to 30 days and that
of the organism-specific

Fig. 7. Sequence diagram of pathway data retrieval. The CachingMetabolicDataDaoWrapper delegates the request to the other DAOs in order to get the metabolic data, caching data along the way. The JdbcMetabolicDataDao gets the cached metabolic data from the local database; the cached data needs to be refreshed from time to time by having an expiry date. The KeggMetabolicDataDao retrieves the latest metabolic data by calling the KEGG SOAP Web service.

gene-enzyme mapping data to 2 days. Please note that these time
figures are not fixed but depend on the update frequency of KEGG.

Taking the retrieval of pathway data from KEGG/PATHWAY as an example,
Fig. 7 shows a practical design of the data retrieval process: check
for the existence of the queried pathway data in the local database.
If the data is not there, retrieve it from KEGG and write it into the
local database. If the data exists, check whether it has expired. If
it has expired, refresh it from KEGG; if not, return the data
immediately from the local database.
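The decision logic of Fig. 7 comes down to a timestamp test. The sketch below is our simplification of that design: the chapter's actual implementation uses DAOs backed by a relational table, whereas this fragment uses an in-memory map, and fetchFromKegg() is a placeholder rather than a real KEGG API call:

import java.util.*;

public class TtlCache {
    static class Cached {
        final String data; final long fetchedAt;
        Cached(String data, long fetchedAt) { this.data = data; this.fetchedAt = fetchedAt; }
    }

    private final Map<String, Cached> store = new HashMap<>();
    private final long ttlMillis;

    TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    String get(String pathwayId) {
        Cached c = store.get(pathwayId);
        boolean expired = (c == null)
                || System.currentTimeMillis() - c.fetchedAt > ttlMillis;
        if (expired) {                       // miss or stale: refresh from KEGG
            String fresh = fetchFromKegg(pathwayId);
            store.put(pathwayId, new Cached(fresh, System.currentTimeMillis()));
            return fresh;
        }
        return c.data;                       // cache hit: served locally
    }

    // Placeholder for the remote call (e.g., the KEGG SOAP Web service).
    private String fetchFromKegg(String pathwayId) {
        return "<kgml for " + pathwayId + ">";
    }

    public static void main(String[] args) {
        TtlCache cache = new TtlCache(30L * 24 * 60 * 60 * 1000); // 30-day TTL
        System.out.println(cache.get("path:hsa00010")); // miss: fetched
        System.out.println(cache.get("path:hsa00010")); // hit: served locally
    }
}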

4. Applications of the Reconstructed Metabolic Networks

Reconstructing different types of metabolic networks on different
levels paves the way for network-based research on metabolism. Most of
this research focuses on, but is not limited to, the exploration of
network functionality and network evolution mechanisms by means of
network alignment or network structure analysis. In this section, we
use examples to show the application of reconstructed networks in
these two aspects.

4.1. Network Alignment

Network alignment helps locate the differences between metabolic
networks. The difference between an abnormal pathway in a certain
disease state and the healthy one indicates a potential drug target.
The difference between the same pathway in different organisms reveals
a possible evolutionary trail. In this section, we give a simple
example on the alignment of the reconstructed enzyme networks of
glycolysis/gluconeogenesis of H. sapiens and A. aeolicus (hsa00010 and
aae00010), to show the evolutionary difference from bacteria to
eukaryotes in glycolysis/gluconeogenesis.

The reconstructed enzyme network of hsa00010 consists of 27 nodes and
62 arcs, while the reconstructed enzyme network of aae00010 contains
only 13 nodes and 12 arcs. Using Cytoscape, the alignment result is
displayed in Fig. 8. In the integrated enzyme network, 15 nodes are
hsa-specific, accounting for 53.6% of the total; this figure is much
lower than 82.5%, the proportion of hsa-specific arcs in the
integrated network (the underlying set arithmetic is sketched at the
end of this subheading). Since to some extent the node count
determines the network size while the arc count measures the network
complexity, the alignment result shows that both network size and
network complexity increased as glycolysis/gluconeogenesis evolved
from simple to advanced forms, and that the increase in network
complexity is more noticeable. This possibly implies that, under
evolutionary pressure, life forms would rather make use of their own
available resources than introduce new components, which is reasonable
since adaptation always consumes less energy than generation.

However, please keep in mind that this is no more than an example of
network alignment and does not yet have statistical significance. More
statistical studies are needed to confirm this conclusion.
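The percentages above follow from plain set arithmetic on the node (or arc) sets of the two reconstructed networks. The sketch below illustrates the bookkeeping with toy identifiers rather than the real hsa/aae node lists:

import java.util.*;

public class AlignmentStats {
    public static void main(String[] args) {
        Set<String> hsa = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Set<String> aae = new HashSet<>(Arrays.asList("C", "D", "E"));

        Set<String> union = new HashSet<>(hsa);
        union.addAll(aae);                   // nodes of the integrated network
        Set<String> hsaOnly = new HashSet<>(hsa);
        hsaOnly.removeAll(aae);              // hsa-specific nodes

        double pct = 100.0 * hsaOnly.size() / union.size();
        System.out.printf("hsa-specific: %.1f%%%n", pct); // 40.0%
    }
}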

4.2. Network Topology Analysis

Most biological networks are complex networks. Properties of complex
networks such as scale-freeness, small-worldness, and modularity
reflect, from different angles, how robust a network is. In complex
network theory, the degree distribution (DD), the clustering
coefficient (CC), and the average shortest path length (ASL) are often
used to study the scale-free, modular, and small-world properties,
respectively (28). Jeong et al. reconstructed and analyzed the
metabolite networks of 43 model organisms and drew the conclusion that
metabolite networks are scale-free, small-world, and modular (29).
However, their study does not show whether this conclusion carries
over to enzyme networks. To answer this question, we reconstruct the
entire enzyme network of H. sapiens, hsa01100, from KEGG (release
50.0), and use this network as an example to study the topological
properties of genome-scale enzyme networks. The reconstructed enzyme
network has 680 nodes and 2,333 arcs.

Fig. 8. Alignment of the enzyme networks hsa00010 and aae00010. In this figure, the round nodes represent hsa00010-specific enzymes, the rectangular nodes represent aae00010-specific enzymes, and the triangular nodes represent the enzymes shared by the two pathways. These nodes are also distinguished by different gray levels. The solid lines represent hsa00010-specific enzyme relations, the sine-wave lines represent aae00010-specific enzyme relations, and the dashed lines represent the enzyme relations in which the two networks overlap.
Using RandomNetworks (RN for short), a Cytoscape plug-in, we generate
100 random networks based on the BA model. The DD of these networks
follows the power law p(k) ∼ k^(−γ). The average γ of these 100
scale-free random networks is 0.363, with a standard deviation of
0.001. Meanwhile, the DD of the enzyme network hsa01100 is computed
and found to follow the power law with γ ≈ 1.53, higher than the
average exponent of the 100 random networks (the degree histogram
underlying such a fit is sketched at the end of this subheading). From
this we can draw the conclusion that the entire enzyme network of
H. sapiens is scale-free, where some highly connected hub nodes have
the greatest influence on the completeness of the entire network.
Continuing with RN, we generate another 100 random networks by
randomly shuffling the edges of the hsa01100 enzyme network; the
number of shuffles is set to 300,000. Having computed the CC of all
these networks, we find that 0.294, the CC of the hsa01100 enzyme
network, is much larger than 0.006, the average CC of the 100 random
networks (standard deviation 0.001). From this we can draw the
conclusion that the hsa01100 enzyme network has the modularity
property.

Using RN again, we can derive that the ASL of the hsa01100 enzyme
network is about 7.029. This figure is much smaller than 680, the
network size. This implies that hsa01100, the entire enzyme network of
H. sapiens, is small-world.
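The raw input to a power-law fit such as the one above is simply the histogram of node degrees. The following sketch of ours, again over a toy arc list rather than the hsa01100 data, tabulates that distribution:

import java.util.*;

public class DegreeDistribution {
    public static void main(String[] args) {
        String[][] arcs = { {"a","b"}, {"a","c"}, {"b","c"}, {"c","a"} };

        // Count each endpoint occurrence as one degree unit.
        Map<String, Integer> degree = new HashMap<>();
        for (String[] arc : arcs)
            for (String node : arc)
                degree.merge(node, 1, Integer::sum);

        // P(k): fraction of nodes having degree k.
        Map<Integer, Long> counts = new TreeMap<>();
        degree.values().forEach(k -> counts.merge(k, 1L, Long::sum));
        counts.forEach((k, n) ->
            System.out.printf("k=%d  P(k)=%.2f%n", k, (double) n / degree.size()));
    }
}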

5. Notes

1. In KEGG/LIGAND, the data of compounds, enzymes, and reactions are
integrated into separate flat files. The file named enzyme lists the
corresponding gene data and the attributive organism data under each
enzyme entry. However, this database does not provide pathway
attributes for these metabolic components; for example, you can hardly
know which compounds, reactions, and enzymes are involved in which
pathway. Thus, the reconstruction of metabolite networks from
KEGG/LIGAND is limited to the process level; that is, you can only
reconstruct the so-called genome-scale metabolite networks by parsing
the data files in KEGG/LIGAND. In the case that you need to
reconstruct networks for a specified part of metabolism, the pathway
data in KEGG/PATHWAY should be taken into account.
2. As described in Subheading 3.1, retrieving the ECrel data and
forming the gene-enzyme table are required to obtain the relations
between enzymes, which are of great importance for the successful
reconstruction of enzyme networks. However, to our knowledge, the gene
group data (for instance, the id=1 entry shown in Fig. 2) in the
pathway files has progressively been replaced by ko data since KEGG
release 54.0. Besides, so far there is no method in the KEGG API that
can translate ko data to enzyme data directly. In this case, the
methods get_kos_by_pathway(), get_genes_by_ko(), and
get_enzymes_by_gene() should be used successively to set up the
ko-enzyme mapping that helps determine the relations between enzymes.
3. Using the methodology discussed in this chapter, we developed two
applications for metabolic networks, MetaGen (30) and MetAtlas.
MetaGen is LGPL open-sourced and available at
http://sourceforge.net/projects/bnct. You can modify its source code
to adapt it to your work or use it as a reference for your
experimentation. When you use the code, however, please notice that
the KEGG API methods related to KO classes (i.e.,
get_ko_by_ko_class(), get_genes_by_ko_class(), list_ko_classes()) have
currently stopped functioning, which disables both MetaGen and
MetAtlas. There is no solution yet, but the following workaround can
be considered: download the hierarchical information of KO from KEGG
BRITE (ftp://ftp.genome.jp/pub/kegg/brite/ko/) and parse it (writing
the KO hierarchy into the local database if you use the database
option) before you continue with the reconstruction.

References

1. Papp B, Pal C, Hurst LD (2004) Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429:661–664
2. Zhou T, Chan K, Wang Z (2008) TopEVM: using co-occurrence and topology patterns of enzymes in metabolic networks to construct phylogenetic trees. LNCS (LNBI) 5265:225–236
3. Becker SA, Palsson BO (2005) Genome-scale reconstruction of the metabolic network in Staphylococcus aureus N315: an initial draft to the two-dimensional annotation. BMC Microbiol 5:8
4. Borenstein E, Feldman MW (2009) Topological signatures of species interactions in metabolic networks. J Comput Biol 16:191–200
5. Chou CH, Chang WC, Chiu CM et al (2009) FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Res. doi:10.1093/nar/gkp264
6. Sridhar P, Kahveci T, Ranka S (2007) An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 12:88–99
7. Kell DB (2006) Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today 11:1085–1092
8. Kanehisa M, Goto S, Furumichi M et al (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38:D355–D360
9. Kanehisa M, Goto S, Hattori M et al (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34:D354–D357
10. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27–30
11. Caspi R, Karp PD (2007) Using the MetaCyc pathway database and the BioCyc database collection. Curr Protoc Bioinform 20:151
12. Maltsev N, Glass E, Sulakhe D et al (2006) PUMA2: grid-based high-throughput analysis of genomes and metabolic pathways. Nucleic Acids Res 34:D369–D372
13. Ma H, Zeng A (2003) Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19:270
14. Zhao J, Ding G-H, Tao L et al (2007) Modular co-evolution of metabolic networks. BMC Bioinform 8:311
15. Ma H, Sorokin A, Mazein A et al (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol 3:135
16. Klukas C, Schreiber F (2007) Dynamic exploration and editing of KEGG pathway diagrams. Bioinformatics 23:344–350
17. Zhang JD, Wiemann S (2009) KEGGgraph: a graph approach to KEGG PATHWAY in R and Bioconductor. Bioinformatics 25:1470–1471
18. Antonov AV, Dietmann S, Mewes HW (2008) KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol 9:R179
19. Batagelj V, Mrvar A (2002) Pajek: analysis and visualization of large networks. In: Mutzel P, Jünger M, Leipert S (eds) Graph drawing. Springer, Berlin, pp 8–11
20. Cline MS, Smoot M, Cerami E et al (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2:2366–2382
21. Faust K, Croes D, van Helden J (2009) Metabolic pathfinding using RPAIR annotation. J Mol Biol 388:390–414
22. Yamanishi Y, Vert JP, Kanehisa M (2005) Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics 21(Suppl 1):i468–i477
23. Verkhedkar KD, Raman K, Chandra NR et al (2007) Metabolome based reaction graphs of M. tuberculosis and M. leprae: a comparative network analysis. PLoS One 2:e881
24. Goto S, Nishioka T, Kanehisa M (1998) LIGAND: chemical database for enzyme reactions. Bioinformatics 14:591–599
25. Ogata H, Goto S, Fujibuchi W et al (1998) Computation with the KEGG pathway database. Biosystems 47:119–128
26. Kanehisa M, Araki M, Goto S et al (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36:D480–D484
27. Palsson B (2006) Systems biology: properties of reconstructed networks. Cambridge University Press, New York, NY
28. Albert R, Barabasi A (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
29. Jeong H, Tombor B, Albert R et al (2000) The large-scale organization of metabolic networks. Nature 407:651–654
30. Zhou T, Yung K, Chan C et al (2010) MetaGen: a promising tool for modeling metabolic networks from KEGG. Prog Biochem Biophys 37:63–68
Part III

Biomarkers
Chapter 11

Biomarkers
Harmony Larson, Elena Chan, Sucha Sudarsanam,
and Dale E. Johnson

Abstract
Biomarkers are characteristics objectively measured and evaluated as indicators of: normal biologic pro-
cesses, pathogenic processes, or pharmacologic response(s) to a therapeutic intervention. In environmental
research and risk assessment, biomarkers are frequently referred to as indicators of human or environmental
hazards. Discovering and implementing new biomarkers for toxicity caused by exposure to a chemical either
from a therapeutic intervention or accidentally through the environment continues to be pursued through
the use of animal models to predict potential human effects, from human studies (clinical or epidemiologic)
or biobanked human samples, or the combination of all such approaches. The key to discovering or
inferring biomarkers through computational means involves the identification or prediction of the molecu-
lar target(s) of the chemical(s) and the association of these targets with perturbed biological pathways. Two
examples are given in this chapter: (1) inferring potential human biomarkers from animal toxicogenomics
data, and (2) the identification of protein targets through computational means and associating these in one
example with potential drug interactions and in another case with increasing the risk of developing certain
human diseases.

Key words: Biomarkers, Adverse drug reactions, Pharmacogenomics, Systems biology, Meta-analysis,
Gene ontology, Biological pathways

1. Introduction

The term biomarker relative to human therapeutics has become


synonymous with new, innovative approaches used to make medi-
cines safer, more effective, and more relevant to individual patients.
This includes efforts to treat diseases more effectively, avoid adverse
drug effects, and verify the impact of novel drugs on targets and
specific biological pathways. The ultimate goal is to increase pre-
dictability in drug development (1). Hence discovering and imple-
menting biomarkers has become an essential part of targeted
therapies and personalized medicine (2).


The current definition of Biomarker, as it is used in drug


discovery and development, is as follows: a characteristic that is
objectively measured and evaluated as an indicator of: normal bio-
logic processes, pathogenic processes, or pharmacologic response(s)
to a therapeutic intervention. In drug research and early clinical
trials, a biomarker is generally qualified as opposed to being fully
validated. A biomarker as defined above is distinguished from a
surrogate endpoint, which is defined as a biomarker that is intended
to substitute for a clinical endpoint, and is expected to predict
clinical benefit (or lack of benefit or harm) based on epidemiologic,
therapeutic, pathophysiologic, or other scientific evidence. Vali-
dated surrogate endpoints used in both medical practice and drug
development include cholesterol levels for cardiovascular disease,
viral load (RNA) for HIV, blood pressure for hypertension, blood
glucose for diabetes, I.O.P for glaucoma, and bone density for
osteoporosis. In drug development these surrogate endpoints are
used as indicators of the effectiveness of a new drug candidate and in
many cases outcomes are compared against the standard of care
using the surrogate marker. The fact that the surrogate endpoint
must be rigorously validated in the clinical setting requires that the
marker has to be discovered and implemented in time to be vali-
dated in the registration clinical trial(s) using actual clinical end-
points as the comparator with the surrogate endpoint (1, 3).
The most important types of biomarkers, for use in drug devel-
opment, would be those that (1) target patients who are likely to be
responders to the therapeutic, such as Gleevec patients described
below, (2) allow monitoring of the desired clinical response, such as
monitoring the modulation of a target via downstream effectors,
(3) identify patients at risk for side effects, (4) guide dose selection
and/or dose tailoring for individual patients, and (5) differentiate
disease diagnoses (1). For example, Gleevec (imatinib mesylate),
which targets Abl kinase, was one of the first drugs approved that
targeted specific tyrosine kinase enzymes. Gleevec is used to treat
chronic myeloid leukemia (CML) for Philadelphia chromosome
positive (Ph+) patients. It is also used in the treatment of gastroin-
testinal stromal tumors and dermatofibrosarcoma protuberans. In
CML, the chromosomal abnormality resulting from translocation
and fusion of Abl and BCR served as a biomarker in both discovery
and clinical development stages (4). Other types of molecular and
image-based biomarkers have been reviewed by Frank and Har-
greaves (5) Biomarker discovery often involves genome-wide
profiling of relevant samples from patients with diagnosed condi-
tions where annotation from medical records confirms the medical
status of the patient. Profiling is accomplished with the use of high-
throughput technologies such as gene expression profiling,
spectrometry-based proteomics, and metabolomic technologies (6).
Biomarkers that allow researchers to target potential respon-
ders in clinical trials are used as in vitro diagnostics as part of the

patient selection for a trial. Patients can be stratified prospectively


in a clinical trial design using an enrichment strategy for efficacy,
with only the patients testing positive as potential responders to
be tested further in a comparative manner against placebo. This
approach would be the fastest way to bring an innovative drug for
an orphan indication to market, where potential responders could
be identified a priori. However, it requires that a test be made
available as a companion diagnostic since the label would indicate
the requirement to use the test prior to treatment (1). Complica-
tions in contemplating this scheme occur when results of the test
are not available at the time of clinical trial randomization, when
there is a chance that nonresponders identified by the test may in
fact respond to treatment, or when the sensitivity of the test
cannot be assumed to be very high because extensive validation
has not yet taken place in early stage clinical trials. What qualifies
the test to be used in early clinical trials versus what level of
validation would be required to bring the test to market is beyond
the scope of this chapter; however, briefly, if the test was used in
the stratification of patients to define a primary clinical endpoint
for a registration trial, the test used in the trial must be the same
as the one marketed.
In an article that specifically discusses cancer research, Pepe
et al. (7) have defined a 5-part process for biomarker development,
starting in preclinical models and progressing into human trials/
usage. In Part A, biomarkers are discovered and proposed in
exploratory preclinical studies with the goal of translating these to
human assays. In several instances these biomarkers are discovered
from biobanked tissues from patients. In Part B, proposed assays
from preclinical models are developed into clinical assays that can
detect established disease. This process is usually referred to as
translational science. At this broad stage, the level of detection is
centered on distinguishing subjects with or without disease; for
instance, subjects with a specific type of cancer from subjects
without cancer, or those with cancer but lacking a certain mutation
important for the progression of disease. In this context, a biomarker
that detects disease at an early stage is more valuable in medical
practice and clinical trials than one that detects only late-stage
disease. In Part C, retrospective longitudinal repository studies are
conducted with biobanked specimens from previous clinical trials or
from biobanks that collect tissues from subjects with and without
disease. This develops evidence regarding the capacity of a biomarker
to detect various stages of disease. In Part D, prospective screening
is applied to individual patients in order to establish the extent and
characteristics of the disease detected by the test. Hopefully this
will establish the inherent false-positive referral rate. Part E
requires disease control studies in order to establish the impact of
screening on reducing the burden of disease on the population.

1.1. Biomarkers of Adverse Drug Reactions

Adverse drug reactions (ADRs) continue to be a major cause of
morbidity and mortality worldwide, and biomarkers related to ADRs have
been and are being extensively studied. Examining the issue on a
global basis, the causes and personal/social impact of drug safety, or
the lack thereof, are clearly multifactorial, with the definition of
risk expanding beyond those normally considered in drug discovery and
development. Factors include the drug itself, the patient with or
without unique susceptibilities, prescribing and medication errors,
compliance issues particularly in the elderly, and complications with
multiple drugs caused by the complex regimens of sick individuals (8).
Viewing ADRs outside of a patient-oriented context, such as in
toxicology studies conducted in normal animals, often misses factors
with major implications for patient safety. Examples of factors often
overlooked in normal animals include cardiovascular ADRs linked to
antidiabetic treatments and to VEGF inhibitors in oncology patients.
Pepe et al. in the article cited above (7) also describe several
reasons why the biomarker program may not have an overall benefit
for the screened cancer population. These caveats seem highly
appropriate when one considers this scheme as it applies to toxicity
or adverse drug reactions. The reasons paraphrased to address tox-
icity are as follows: (1) Ineffective ways to treat or minimize the
effect in screen-detected toxicities, (2) poor compliance with the
screening program or difficulties with implementing the program
outside the clinical trial or hospital setting; such as in actual com-
munity medical practice, (3) prohibitive economic or morbidity-
associated costs of the screening program itself and of the diagnostic
workup of individuals who falsely screen positive for early develop-
ment of toxicity, and (4) the overdiagnosis of early developing
toxicity that, in the absence of a screening test and associated pro-
gram, would remain undetected and in some cases would regress
(8). Because of these complexities, one can see that the goal of
discovering and developing tests to detect and prevent toxicity in
human subjects would have a high level of validation associated with
it. In the translational science model, the initial specific application
of these tests specifically to preclinical toxicology studies requires a
second level of validation in the animal model itself prior to the use
in humans. More importantly, the validation between species
assumes that one can establish comparable analytical behavior of
an assay with homologous markers in specimens from each species.
This has been a major stumbling block for several immunological
assay preparations designed to correlate between humans and cer-
tain animal species. These difficulties have led to an increased inter-
est in using human samples from subjects who have developed
ADRs, transgenic animal models, or newer stem cell technologies.
Biomarkers used to either identify patients at risk for develop-
ing side effects and/or to guide dose selection or adjustment are
primarily metabolic markers, such as polymorphisms in metabolic

enzymes, which help to identify patient populations potentially at


risk for side effects due to altered pharmacokinetics of the thera-
peutics. These are also used to identify patients with a higher risk in
developing drugdrug interactions. Examples where this informa-
tion has been included in the drug labeling include: atomoxetine
(Strattera) and CYP 2D6 polymorphisms; thiopurines (6MP and
azathioprine) and TPMP; irinotecan (Camtosar) and UGT1A var-
iants; and warfarin and CYP2C9 and VKORC1 considerations (1).
Tests to detect patients with certain polymorphisms in drug meta-
bolizing enzymes have had a major impact on both drug discovery
and development, as well as on the actual delivery of the products
to patients in the health care system. Tests are now being used in
the practice of medicine to alter dosing regimens or suggest alter-
native therapies. In drug discovery, molecules are now being
designed to avoid this risk; removing the possibility of a compound
becoming a substrate of problematic metabolic enzymes. In many
cases researchers will design molecules away from being a substrate
of a single metabolizing enzyme and specifically avoid CYP3A4 and
CYP2D6, two metabolizing enzymes frequently seen in drugdrug
interactions. The validation of these types of tests is less stringent
than the Surrogate Endpoint discussed earlier because these tests
are linked to a specific genetic variation, not the actual clinical
endpoint. These tests have reached the marketplace as research
grade diagnostics and therefore can be used both in clinical trials
and in medical practice.

1.2. Pharmacogenomic Biomarkers for Potential Adverse Drug Reactions

The use of pharmacogenomic biomarkers to predict severe ADRs has been
reviewed by Ingelman-Sundberg (9). Examples that are included in
approved drug labels are presented below (8). These are listed by gene
or allele, with some relevant drugs affected and the potential
toxicity associated with the biomarker.
● TPMT variants; example: azathioprine; increased risk of
myelotoxicity.
● UGT1A1*28; for irinotecan: increased risk of neutropenia; for
nilotinib: increased risk of hyperbilirubinemia.
● CYP2C19 variants; example: voriconazole; altered pharmacokinetics
and potential toxicity.
● CYP2C9 variants; examples: celecoxib, warfarin; altered
pharmacokinetics and potential toxicity.
● VKORC1, for warfarin; requires lower doses to avoid side effects.
● CYP2D6 variants and mutants; examples: tricyclic antidepressants,
atomoxetine, fluoxetine; altered pharmacokinetics and potential
toxicity.
● HLA-B*5701, for abacavir; hypersensitivity reactions, lactic
acidosis, and severe hepatomegaly.
● HLA-B*1502; example: carbamazepine; serious dermatological
reactions.
● DPD deficiency; examples: capecitabine, fluorouracil; stomatitis,
diarrhea, neutropenia, neurotoxicity.
● G6PD deficiency; example: rasburicase; severe hemolysis.
● NAT variants; examples: rifampin, isoniazid, and pyrazinamide;
altered pharmacokinetics and increased toxicity.
Variants of SLCO1B1 have been shown to be strongly associated with an
increased risk of statin-induced myopathy (10). The FDA has issued a
warning that certain epilepsy drugs, including Dilantin, Phenytek, and
Cerebyx, can lead to severe skin reactions in Asian patients who test
positive for HLA-B*1502 (11). In addition, an association has been
suggested between APOC3 polymorphisms and dyslipidemia and/or
lipoatrophy in HIV-infected individuals receiving HAART therapy using
d4T and protease inhibitors (12, 13).

1.3. Computational Computational methods offer an expedited, comprehensive, and


Approaches for Safety expanded scale of approach, and therefore currently play a major
Biomarkers role in biomarker discovery. For instance, Paik et al. (14) published
an example of a computational model involving 21 genes that
predicts recurrence of breast cancer. Computational approaches
are also being used to identify potential links with ADRs and
Biomarkers. A recent paper by Wallach et al. (15), uses a
structure-based approach to map ADRs to perturbations of
biological pathways. In their methodology, the authors paired
drugs to observed ADRs, identified protein targets using protein
structure databases and in silico virtual docking, associated protein
targets with key biological pathways, and established relationships
using logistic regressions. The key in this and all other approaches is
to distinguish between on-target binding and modulation of a
desired target, which may lead to efficacy or excessive pharmacolog-
ical effects, and binding or modulation of an off-target (undesired)
which could lead to an ADR. Discovery of these off-target effects
correlated within the individual patient context, such as genetic
polymorphisms, personal medical histories, environmental expo-
sures to other chemicals, and inappropriate drug regimens can all
contribute to direct, or indirect, perturbations of biological path-
ways. Uncovering previously unknown pathway links can lead to a
biomarker hypothesis that can be tested using appropriate patient
samples. It is important to view ADRs from therapeutics, and/or the effects resulting from unexpected environmental exposures to chemicals, in the overall patient context, since several coexisting factors
can contribute to the unfavorable outcome. In the environmental
exposure example this could be an understanding of susceptible
populations, such as children, pregnant women, or those with a
higher susceptibility to certain diseases where the exposure could
increase the risk of developing or exacerbating an existing condition.
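
A minimal sketch of the logistic-regression step in such a methodology is shown below; the drug-by-pathway matrix and ADR labels are synthetic stand-ins, not data from Wallach et al. (15):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Synthetic design matrix: rows are drugs, columns are binary flags
    # for "predicted (e.g., via docking) to perturb pathway j";
    # y flags an observed ADR for each drug.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 5))            # 200 drugs, 5 pathways
    logits = X @ np.array([2.0, 0.0, 0.0, 1.0, 0.0]) - 2.0
    y = rng.random(200) < 1 / (1 + np.exp(-logits))  # simulated ADR labels

    model = LogisticRegression().fit(X, y)
    # Large positive coefficients point to pathways associated with the ADR.
    print(np.round(model.coef_, 2))

In a real analysis, the pathway flags would come from docking against protein structure databases, and the fitted coefficients would be screened for statistical significance.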
Since the essential goal of computational methods is to identify
potential protein targets and to relate these through biological
pathway analysis, several researchers have used virtual docking of
chemicals as a tool to discover potential targets (15), and a great deal
of work has been done on ligand-based similarity metrics, which
group together seemingly unrelated proteins to suggest potential
off-target effects (16).
In this chapter we will discuss the sources of samples to be used
for ADR biomarker research, review potential methods, and high-
light two examples where computational methods have been
applied to solve problems in this area. The first example shows the
methodology behind the application of animal toxicogenomics
data to infer human ADR biomarkers. The second example demon-
strates methods used to propose molecular targets for phytochem-
icals in herbs used in Traditional Chinese Medicine recipes, and
explains how a similar approach can be used to identify environmental chemicals that may increase the risk of certain human diseases.

2. Materials

2.1. Animal Study-Derived Biomarkers and Animal/Human ADME Genes

Specific initiatives for risk identification using either animal toxicology study data or advanced screening techniques include the Predictive Safety Testing Consortium (PSTC): a unique public–private partnership, led by the nonprofit Critical Path Institute (C-Path) (17), which brings together pharmaceutical companies to share and validate each other's safety testing methods. The PSTC is under
advisement of the United States FDA and its European counter-
part, the European Medicines Agency (EMA). The corporate mem-
bers of the consortium share internally developed preclinical safety
biomarkers in five workgroups: carcinogenicity, kidney, liver, mus-
cle, and vascular injury. Participants are attempting to link biomar-
kers with ADRs in vivo starting from establishing biomarkers in
animal models and linking those to effects in clinical trials. The
PSTC identified seven urinary protein biomarkers useful in detect-
ing drug-induced kidney injury in preclinical rat toxicology studies
in addition to the long established markers, serum creatinine and
Blood Urea-Nitrogen (BUN). These new biomarkers include
KIM-1, Albumin, Total Protein, β2-microglobulin, Cystatin C,
Clusterin, and Trefoil Factor-3. As of June 2010, the biomarkers
have been validated and qualified by all ICH regulatory agencies
and are being used extensively in preclinical studies.
PharmaADME (18) is an initiative of scientists from academia,
the pharmaceutical industry, and the genomic technology industry.
They have identified key genes and variants involved in the absorp-
tion, distribution, metabolism, and excretion (ADME) of medica-
tions. The purpose of this initiative is to stimulate the production
and use of genotyping assays in drug development. Currently (as of
Dec 2010) the genes are categorized (with number of genes) into
Phase I (130), Phase II (68), Transporters (77), and Modifiers
(24). These lists are also useful in establishing computational links
that may lead to an understanding of how chemical exposures can lead
to an increased risk of developing or exacerbating a specific disease.

2.2. Human Samples for ADR Research

ADRs are being studied prospectively using pharmacogenomics via partnerships and networks of scientists in healthcare systems, aca-
demia, biopharmaceutical industry, nonprofits, and regulatory
agencies. Several networks have been established to study specific
ADRs, particularly as they apply to certain specific patient popula-
tions. These include the Canadian Genotype-specific Approaches to
Therapy in Childhood Program (GATC) focused on reducing
ADRs in children (19). The network includes ten major Canadian
pediatric health centers that monitor and report the occurrence of
ADRs in children. The GATC network collects DNA and plasma
samples for its genetic discovery studies. The initiative also includes
a surveillance program which collects reports from over 2,300
pediatricians. The GATC project has a mission to influence pediat-
ric medical practice on a global scale.
The United States Drug Induced Liver Injury Network
(DILIN) (20) was established with the aim of discovering underly-
ing causes of drug-induced liver disease. The endeavor is sponsored
by the National Institute of Diabetes and Digestive and Kidney
Diseases of the US National Institutes of Health. The overall goal
of the program is to discover why some individuals develop hepa-
totoxicity and others do not even when on the same drugs and
regimens. A registry of people experiencing liver injury from one of
four drugs since 1994 has been established. A retrospective study is
monitoring patients with hepatotoxicity from a specific class of
drugs, and a prospective study is ongoing and follows patients
who have recently experienced adverse liver reactions to any drug
or herbal medicine. The challenge in this endeavor is to define the
cause and effect link of hepatotoxicity to a specific drug particularly
within a patient context that can include alcoholism, hepatitis, and
other conditions that lead to liver toxicity.
The primary objective of the European collaboration to estab-
lish a case–control DNA collection for studying the genetic basis of
adverse drug reactions (EUDRAGENE) is to advance the under-
standing of the basis of adverse drug reactions, which hopefully will
lead to the development of tests for predicting individual suscepti-
bility to ADRs. Several studies have been established and sample
banks created. The network has 11 participating centers in Europe
and Canada and initially studied myopathy from cholesterol-
lowering drugs, agranulocytosis from several different drugs, ten-
donitis and tendon rupture from fluoroquinolone antibiotics, long
QT syndrome caused by several classes of drugs, liver injury caused
by nonsteroidal anti-inflammatory drugs, and neuropsychiatric
reactions caused by mefloquine antimalarials.
The International Warfarin Pharmacogenetics Consortium (IWPC) (through PharmGKB (21)) is studying specific toxicities of warfarin therapy.
The Serious Adverse Event Consortium (SAEC) is a nonprofit orga-
nization comprised of pharmaceutical companies and academic insti-
tutions with scientific and strategic input from the U.S. FDA. The
mission of the SAEC is to help identify and validate DNA-variants
useful in predicting the risk of drug-related serious adverse events.

3. Methods

Various systems biology approaches have been used as broad meth-
odologies in biomarker discovery. The realization that studying and modeling individual components fails to predict the behavior of biological systems as a whole has given rise to the notion of systems biology, a method that models interactions between various components of a complex system (6). Systems biology has greatly influenced toxicology, and in a review, Waters and Fostel (22) persuasively argue the need for a systems approach to toxicology. Despite the few
success stories primarily involving metabolic enzymes and transpor-
ters, discovery of predictive biomarkers for ADRs still remains a
difficult problem. Although several focused patient sample collections are underway, as noted above, a lack of well-annotated, relevant clinical specimens for profiling and the fact that static sampling of biological processes fails to model dynamic events both contribute to the problem. A well-annotated sample will identify the ADR and associate it with a specific drug or combination of drugs so the researcher can understand both the drug regimen, if known, and the progression of disease. The article by Wallach et al. (15) exemplifies a continuously developing methodology. In our own work we have used methods distributed by GeneGo (23), and one of the examples below highlights an application of GeneGo.
Multiple publications utilizing targeted genes or genome-wide
expression profiling using microarrays to probe markers of toxicity
are publicly available (24, 25), and datasets are now also publicly
available. A valuable source of information and set of tools can be
found in the Comparative Toxicogenomics Database (CTD) which
we highlight in the Best Practices section below. Toxicogenomics
publications typically deal with toxic effects of specific chemicals
either known to be toxins, or drugs that cause toxicity at high
doses. Typically, a rodent or in vitro model is used for the study,
which results in the measurement of differential gene expression.
Data are a list of genes that are differentially expressed as generated
by the application of well-accepted statistical analyses to the data-
sets. This approach has primarily contributed to the knowledge on
the toxic agents' mechanisms of action, in the form of affected
genes, and has produced signatures of toxicity (6). Nonetheless,
extracting biological insight from toxicogenomics datasets remains
a challenge and new computational approaches are needed. We
show a method to overcome such a challenge in one of our exam-
ples below.
An approach used to overcome challenges seen in toxicoge-
nomics datasets is to take advantage of the fact that there have been
multiple studies addressing specific mechanisms, such as hepato-
toxicity, using different toxic agents. In this approach, called meta-
analysis, all datasets are collected and normalized, taking into
account variables such as experimental conditions, species used in
the experiment, microarray platform variability, and gene expres-
sion normalization (6). A meta-analysis for cancer gene expression
data, performed by Rhodes et al. (26), revealed common molecular
signatures of cancer and was validated on independent datasets. To
perform a meta-analysis it is important that all datasets are available
and that common experimental protocols have been followed.
Standards for describing toxicogenomics datasets are still evolving
and future publications will likely adhere to these standards. A
common difficulty with multiple years of publications includes a
distinct variability of microarray platforms and variations seen from
laboratory to laboratory using the same platform. Efforts are also
underway to integrate toxicogenomics datasets with other types
of data such as sequence homology across species, chemicals,
protein–protein interactions, and the literature (25, 27). Meta-
analysis of toxicogenomics datasets will likely increase in value as
more datasets become available and standard protocols are used.
The Comparative Toxicogenomics Database (CTD) accomplishes
these goals through rigorous annotation procedures.
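
As a simple illustration of how such evidence can be combined across studies, the sketch below applies Stouffer's z-score method to hypothetical per-study p-values for a single gene; published meta-analyses such as Rhodes et al. (26) use more elaborate normalization schemes:

    import numpy as np
    from scipy.stats import norm

    # One-sided p-values for the same gene from three independent
    # studies (hypothetical numbers for illustration).
    p_values = np.array([0.04, 0.01, 0.20])

    z = norm.isf(p_values)                  # convert p-values to z-scores
    z_combined = z.sum() / np.sqrt(len(z))  # Stouffer's combined z
    p_combined = norm.sf(z_combined)
    print(f"combined z = {z_combined:.2f}, p = {p_combined:.4f}")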
A well-defined ontology to map genes to ontology terms is
another approach to analyzing lists of significant genes, such as
those resulting from a toxicogenomics experiment. The use of
the Gene Ontology (GO) has proven useful. GO consists of three ontologies: biological process, molecular function, and cellular localization, where terms form a controlled vocabulary and are arranged hierarchically within each ontology. For example, glucose catabolism is part of the biological process ontology but is specifically a subterm of hexose catabolism. In addition, each term can have synonyms and can be associated with several ontologies. Genes and gene products have been annotated with GO terms, making it easier to associate genes with function (28).
In the gene ontology-based approach, one determines a list of
differentially expressed genes using a suitable statistical method and
the list of significant genes is then mapped to GO terms. This
is compared with GO terms for all genes in the array, revealing
GO terms that are significantly over- or underrepresented. These
differential GO terms describe biological processes that are
significantly active due to the specific experimental perturbation
under study. This approach was initially proposed for analysis of
genes differentially expressed in cancer and has been coded into
GoMiner (29). An example of the use of GO terms analysis in
toxicogenomics is presented by Currie et al. (30) who applied this
technique to elucidate the mechanism of action of the nongenotoxic carcinogen diethylhexylphthalate in rodents. Yu et al. (31) have pro-
posed a method called GO-quant which uses GO terms in a
quantitative manner in toxicogenomics analysis. Ontology-based
approaches represent a meaningful way to analyze a list of genes,
such as those obtained from toxicogenomics experiments, and
which appear in publicly available datasets or in supplemental
materials from publications. GO undergoes constant revisions and
cell, disease, and pathway ontologies are also emerging. In toxicol-
ogy, SysToxOntology has been proposed (32) for standardizing
and analyzing toxicogenomics information. The commercial soft-
ware suite from GeneGo mentioned in the Best Practices section
below has been developed to specifically handle toxicogenomics
data and elucidate affected biological processes. The freely available
CTD provides similar functionality linking chemicals, genes, and
diseases.
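
The over- or underrepresentation test underlying these ontology-based approaches is typically a hypergeometric (Fisher) test; a minimal sketch with hypothetical gene counts is shown below (tools such as GoMiner also correct for testing many GO terms simultaneously):

    from scipy.stats import hypergeom

    # Hypothetical counts for one GO term: N genes on the array, n of
    # them significant; K array genes annotated with the term, k of
    # those also on the significant list.
    N, n = 10000, 200
    K, k = 150, 12

    # P(X >= k): the chance of seeing at least k term-annotated genes
    # if the significant list were a random draw of n genes.
    p_over = hypergeom.sf(k - 1, N, K, n)
    print(f"enrichment p-value: {p_over:.2e}")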
While ontology-based methods per se have been useful in infer-
ring processes represented by lists of significant genes in toxicoge-
nomics analysis, these methods by themselves do not take into
account known functional interactions between genes described in
biological pathways. Canonical pathways such as metabolic or signal
transduction pathways represent consensus views of biological pro-
cesses and offer another method for analyzing a list of significant
genes obtained through experimental sources. In this approach, a
list of genes is projected onto canonical pathways, highlighting
genes that are under- or overexpressed in the experiment. Such an
approach has been applied by Mao et al. (33) using KEGG pathways
to annotate clusters arising from analysis of gene expression data.
Both GeneGo and CTD also provide this functionality; however,
pathway analysis of gene expression data need not be restricted to
canonical pathways alone. More generally, interactions described in
the literature between gene products, proteins, and small molecules
can be extracted from the literature and visualized as networks.
Natural language processing (NLP) tools have been applied to
process biomedical literature and extract meaningful interactions.
There are also commercially available tools such as those offered by
GeneGo, Ingenuity, PathART, and Pathway Studio, which offer
both manually curated as well as NLP based interactions between
genes. Networks of interactions resulting from text mining
approaches can be characterized as literature networks and repre-
sent overlapping pathways, including canonical pathways. However,
these results cannot be equated with canonical pathways and have
possible errors resulting from automated text mining methods.
Assuming that the literature is not unduly biased towards certain
genes (see Hoffmann and Valencia (34, 35) for genes that are over-
represented in literature), literature networks offer an automated
way of inferring pathways for a list of genes, sometimes offering
insight into overlapping pathways not detected by analyzing only
canonical pathways. Ekins (24) reviews methods that use the litera-
ture network approach to analyze toxicogenomics data and there are
applications elsewhere demonstrating its utility (36, 37).
In terms of biomarker discovery, signatures, or lists, of differ-
entially expressed genes are helpful, but a discovery of the more
complex interactions of genes and proteins in response to various
perturbations will hold the key to biomarker discovery and utiliza-
tion. This will include a more multifactorial approach in determin-
ing the most relevant biomarkers of chemically induced toxicity.

3.1. Best Practices

The Comparative Toxicogenomics Database (CTD) (http://ctd.mdibl.org) is a platform-based, freely available resource that allows
an investigator to search and probe interactions of environmental
chemicals with gene products and to gain an understanding of their
effects on human health (25). The software integrates and con-
structs chemical–gene–disease networks that highlight known and
potentially novel biomarker relationships. The CTD (as of July
2010) contains curated data from approximately 22,000 publica-
tions, 6,000 chemicals, 17,000 genes, and 4,000 diseases. Taken
together this generates approximately 1.4 million chemical–gene–disease interactions. Also included in CTD are pathway data from
Reactome (38), and KEGG (39), which generate potentially novel
chemical–gene pathways. CTD also links to gene data (21) and
DrugBank (40) and contains a suite of analytical and visualization
tools for complex navigation.
MetaCore by GeneGo is a knowledge-based platform designed for functional, or pathway, analysis of OMICs datasets, gene lists, and compounds. A user can perform analyses with a high-
fidelity annotated knowledgebase of protein interactions, pathways,
and functional ontologies. The knowledge is structured in a
computer-readable format and includes software tools for managing
experimental data, analysis, and reporting. On the content side,
MetaCore encompasses a comprehensive database of protein inter-
actions of different types, pathways, network models and 10 func-
tional ontologies covering human, mouse, and rat proteins. The last
release of the database (Jan 2011) features 45,000 human, mouse, and rat proteins, over a million protein interactions, and almost 700,000 biologically active compounds manually annotated from 140,000 experimental PubMed articles. Functional analysis in
MetaCore is essentially divided into two approaches. One approach,
known as enrichment analysis (EA), deals with molecular objects
(proteins, genes, compounds, etc.) and shows how different ontol-
ogy terms (pathways, processes, disease biomarkers, etc.) are
relatively represented by the proteomics datasets. This is a low-resolution method, as it only provides a descriptive list of functional representations with relative rankings that are not particularly unique to the dataset. The second type of analysis evaluates
protein functionality represented as silos of interactions/relation-
ships for each protein in the dataset. The core assumption is that
relative connectivity of proteins in a dataset directly reflects func-
tional importance of these proteins for the phenotype. Relative
connectivity can be calculated for the entire organism's proteome (the human proteome is defined as about 20,300 proteins with experi-
mentally determined function) using interactome tools or for sub-
sets of proteins and their interactions represented as networks.
Furthermore, networks can be built using different algorithms
with different settings that can be defined by the analyst based on
the desired proteomic functional network output.
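
A rough analogue of this relative-connectivity calculation can be sketched with the open-source networkx package; the random graph below is only a placeholder for the proprietary MetaCore interactome:

    import networkx as nx

    # Placeholder interactome (1,000 proteins, 5,000 interactions) and
    # a hypothetical dataset of 50 proteins of interest.
    interactome = nx.gnm_random_graph(1000, 5000, seed=1)
    dataset = set(range(50))

    sub = interactome.subgraph(dataset)
    for node in sorted(dataset, key=sub.degree, reverse=True)[:5]:
        # Relative connectivity: links within the dataset versus the
        # node's total connectivity in the full interactome.
        ratio = sub.degree(node) / max(interactome.degree(node), 1)
        print(node, sub.degree(node), f"{ratio:.2f}")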

3.2. Inferring Human Endpoints from Animal Data

Toxicogenomics studies are typically conducted in rodents, and the methodology outlined above can be used to infer pathway mechanisms and further functional insights. But translating infer-
ences from animal data to mechanism of action in humans remains a
difficult problem to solve (6). The availability of complete or nearly
complete genomes of human, rodents, and dog (the relevant safety
species contained in the literature, especially for drugs) enables the
assignment of gene orthology, which offers one way to transfer
functional information between species. Orthology assignment is
not trivial due to the split of paralogs before and after speciation.
Remm et al. (41) have proposed an algorithm that
takes into account the so-called in-paralogs (paralogs that arose
after the species split) and out-paralogs (paralogs that arose before
the species split). Due to the high degree of sequence conservation,
orthology assignment may be straightforward in some cases; however, an orthology assignment does not imply that the gene functions identically in the two species, and in some cases a rodent ortholog of a human gene may not exist. For example, IL8, which
is one of the major mediators of inflammatory response in humans,
has no known ortholog in rodents.
Focusing on conserved pathways between species is another
approach to inferring human functional significance from animal
data. For example, experimental evidence indicates that innate
immunity is conserved across nematodes, arthropods, and verte-
brates (Kim and Ausubel (42)) which results in conserved path-
ways. It is reasonable to infer that conserved pathways imply
conserved functions across species. Kelley et al. (43) outline a
network alignment strategy to detect pathways conserved between
yeast and bacteria. Applied in a toxicogenomics context, this could be accomplished as follows. First, a list of genes is converted into a network of interacting genes using the pathway or literature network approach described above, for human and for the other species
Table 1
Genes identified in the study along with putative human and mouse orthologs

Rat       Rat Gene ID   Mouse     Human
S100a6    85247         S100a6    S100A6
Lgals3    83781         Lgals3    LGALS3
Clu       24854         Clu       CLU
Spp1      25353         Spp1      SPP1
Gpnmb     113955        Gpnmb     GPNMB
Cd44      25406         Cd44      CD44
Cxcl10    245920        Cxcl10    CXCL10
Timp1     116510        Timp1     TIMP1

To map interactions between these genes and to elucidate pathway-level understanding, one could use a literature/text-mining tool such as STRING (http://string-db.org/), which maps known and predicted protein interactions for a given set of genes. Interactions for the rat genes from Table 1 are shown in Fig. 1.

(such as in rodents) for which the experiment was carried out;
second, the networks are aligned and a score representing degree
of conservation is obtained. Depending on the degree of network
conservation, one could transfer inference from animal to human
with a certain degree of confidence.
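
As a concrete sketch of the first step, the STRING resource mentioned in the note to Table 1 exposes a REST interface for retrieving interaction networks. The endpoint, parameters, and column names below follow the STRING API documentation at the time of writing and should be verified before use; rat is NCBI taxon 10116:

    import requests

    # Rat genes from Table 1.
    genes = ["S100a6", "Lgals3", "Clu", "Spp1",
             "Gpnmb", "Cd44", "Cxcl10", "Timp1"]

    # Assumed STRING REST endpoint; identifiers are carriage-return
    # separated (see http://string-db.org/ for the current API).
    resp = requests.get(
        "https://string-db.org/api/tsv/network",
        params={"identifiers": "\r".join(genes), "species": 10116},
        timeout=30,
    )
    resp.raise_for_status()

    rows = resp.text.splitlines()
    header = rows[0].split("\t")
    for line in rows[1:]:
        rec = dict(zip(header, line.split("\t")))
        print(rec["preferredName_A"], "--",
              rec["preferredName_B"], rec["score"])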

4. Examples

In this example, three freely available programs were used to elucidate the connectivity of genes derived from a rat toxicogenomics study and to extend the analysis to mouse and human orthologs. To elucidate a mechanism
of nephrotoxicity in rats, Kharasch et al. (44) administered a haloalkene, fluoromethyl-2,2-difluoro-1-(trifluoromethyl)vinyl
ether (FDVE), the major degradation product of the volatile anes-
thetic Sevoflurane. FDVE mediates nephrotoxicity in rats after
undergoing complex metabolism. Sevoflurane was approved for
use as an anesthetic in 1995, and does not seem to cause nephrotox-
icity in humans, although there are some lingering concerns (Duffy
and Matta (45), Bedford and Ives (46)). In their study, Kharasch
et al. used gene expression microarrays to identify 517 differentially
expressed genes after inducing nephrotoxicity in rats through FDVE administration. They further validated these genes using
RT-PCR. A subset of the genes they have identified is shown in
Table 1, along with their NCBI Entrez gene IDs (http://www.ncbi.nlm.nih.gov/gene). Also shown are putative human and mouse
Fig. 1. (a) Literature evidence for interactions between the rat genes in Table 1; two genes are linked only if at least one paper provides evidence for an interaction between them. (b) Corresponding evidence for mouse orthologs and (c) for human orthologs from the literature.

orthologs, identified using Homologene (http://www.ncbi.nlm.nih.gov/homologene). The study identified several markers of kid-
ney injury including clusterin, a possible early prognostic marker of
kidney injury (47). Note that clusterin was identified as a biomarker
of kidney toxicity by the PSTC as discussed above.
Using either literature or experimental evidence, interactions between genes are established and indicated with a line connecting
the two genes. In Fig. 1a, several genes are not connected, indicating an absence of evidence in the rat literature for interactions between these genes. It is instructive to explore the human and mouse literature for the same set of genes. Figure 1b, c shows interaction graphs for putative human and mouse orthologs.
From the above figures, it is clear that the human gene network is more connected than the mouse network, which in turn shows more connectivity than the rat network. Each instance of connectivity
indicates the presence of more information either in the form of
literature evidence or direct physical interaction between the genes,
leading to a more comprehensive understanding of the orthologs in
question. This is a simple example to illustrate the conversion or
inference of rat toxicogenomics data into human orthologs, which
then becomes the hypothesis for further testing either in clinical
trials or in biobanked human samples.
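
The species-by-species connectivity comparison itself reduces to counting interactions and degrees; a toy sketch follows, with made-up edge lists standing in for the Fig. 1 networks:

    import networkx as nx

    # Hypothetical edge lists standing in for the literature networks
    # in Fig. 1; real edges would come from STRING or manual curation.
    edges = {
        "rat":   [("Spp1", "Cd44"), ("Lgals3", "Spp1")],
        "mouse": [("Spp1", "Cd44"), ("Lgals3", "Spp1"), ("Clu", "Spp1")],
        "human": [("SPP1", "CD44"), ("LGALS3", "SPP1"), ("CLU", "SPP1"),
                  ("TIMP1", "CD44"), ("CXCL10", "CD44")],
    }

    for species, edge_list in edges.items():
        g = nx.Graph(edge_list)
        degrees = [d for _, d in g.degree()]
        print(f"{species}: {g.number_of_edges()} interactions, "
              f"mean degree {sum(degrees) / len(degrees):.2f}")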
The complexity of ortholog connectivity has been highlighted
by Carlson et al. (48) who used primary rat and human hepatocytes
to measure responses to two aryl hydrocarbon receptor (AHR)
agonists; 2,3,7,8-tetrachlorodibenzo-p-dioxin and PCB 126. The
results suggest broad species differences in specific genes that
respond to AHR agonist activity and a major species divergence in
the AHR-regulated gene repertoire. This paper challenges the assumption that orthology naturally leads to functional similarity; both literature and experimental evaluation are required to validate the results.

4.1. More Complex Problem

Traditional Chinese Medicine (TCM) often uses teas and other recipe formulations of organic and mineral ingredients to treat various maladies. Patients often combine the use of TCM preparations with Western therapeutics, frequently without a clear understanding
of the risks associated with potential drug-phytochemical interac-
tions. In this example several freely available databases (some requir-
ing registration) and a commercial software package (GeneGo) were
used to identify phytochemical ingredients of herbal preparations
and to associate/predict molecular targets of each phytochemical.
The function of each target was assessed and compared to those of
Western medications used in the same treatment regimen to predict
potential drug interactions. An example of such a recipe would be
the TCM formulation Kang Ai Pian, used to treat cervical, ovarian,
breast, nasopharyngeal, lung, liver, and gastrointestinal cancer. To
identify the molecular targets of Kang Ai Pian (KAP) (49) the
recipes primary ingredients were first identified as chen pi (Citrus
reticulata; Citri reticulatae Pericarpium), huang bo (Phellodendron
amurense; Phellodendri Cortex), huang lian (Coptis chinensis, Cop-
tis deltoidea and Coptis teeta; Coptidis Rhizoma), huang qin (Scu-
tellaria baicalensis; Scutellariae Radix), hu po (amber; Succinum),
niu huang (Bos taurus domesticus; Bovis Calculus), and san qi (Panax
notoginseng; Notoginseng Radix). Next, each ingredient's phyto-
chemicals were identified using a combination of the Comprehen-
sive Herbal Medicine Information System for Cancer (CHMIS-C)
from the University of Michigan, National University of Singapores
TCM-ID database, and Benskys Materia Medica. Approximately
201 compounds and 12 minerals were founds within Kang Ai
Pians ingredients altogether. Subsequently structural data files
(SDFs) of these 213 chemicals were generated using PubChem
and uploaded into GeneGo MetaDrug (a compound-based pathway
analysis Web-based software), where they were cross-referenced
against the softwares compound database for similar chemicals
using a Tanimoto coefficient similarity filter of 95100 %. KAP
chemicals with 95 % similarity or greater were identified via Gene-
Gos literature meta-searching and or predicted using GeneGos
QSAR modeling. Additionally, GeneCards and UniProtKB data-
bases were used to confirm the findings. Approximately 568 unique
molecular targets were found for KAP using this method. Using
these findings, detailed lists of potential synergy and antagonisms
within a recipe can be elucidated and verified by literature searching.
In addition, phytochemical–drug interactions can be proposed, if
not previously known, based on known or predicted effects (induc-
tion or inhibition) of key metabolic enzymes and transporters. In an
example of synergy, KAP phytochemicals quercetin and norwogonin
activate CASP3, as do the Western drugs cisplatin and 5-fluorouracil. For
antagonism, quercetin inhibits CYP2C9 and CYP3A4, whereas
cyclophosphamide activates CYP2C9 and paclitaxel activates
CYP3A4. Chan et al. describe 15 Western oncology drugs with
potential interactions with 12 phytochemicals in the KAP recipe.
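
The Tanimoto filtering step described above can be reproduced with open-source tools. The sketch below uses RDKit Morgan fingerprints as a stand-in for MetaDrug's proprietary descriptors; the SMILES strings are taken from public databases and should be verified before use:

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    # Quercetin versus the closely related flavonol kaempferol.
    smiles = {
        "quercetin":  "C1=CC(=C(C=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O)O",
        "kaempferol": "C1=CC(=CC=C1C2=C(C(=O)C3=C(C=C(C=C3O2)O)O)O)O",
    }
    mols = {name: Chem.MolFromSmiles(s) for name, s in smiles.items()}

    # Morgan (circular) fingerprints, radius 2, 2,048 bits.
    fps = {n: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
           for n, m in mols.items()}

    tc = DataStructs.TanimotoSimilarity(fps["quercetin"], fps["kaempferol"])
    print(f"Tanimoto coefficient: {tc:.2f}")  # keep pairs with tc >= 0.95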
A similar in silico method was used to identify and prioritize
environmental compounds of concern that may increase the risk of
human breast cancer. The California Breast Cancer & Chemicals
Policy Project (3) first identified several disease-specific endpoints that could be evaluated by data inquiry for an initial dataset of
216 compounds associated with mammary tumors in animals.
The first step in this example was to find or predict key molecular
targets and activating pathways for each chemical, which were
deduced or predicted from GeneGo software and pertinent pub-
lished information (50, 51). Metabolic activation of each com-
pound was assessed based on enzymes known to be expressed in
breast tissue (52). Known targets (genes) associated with breast
cancer were developed from data in GeneGo, and could also be
assessed using the CTD (25). Eleven compounds were predicted to
be modulators of human breast cancer-related targets/pathways
and about 25 compounds were predicted to activate key metabolic
enzymes known to play an important role in carcinogen metabo-
lism. Several compounds were shown to activate NR1I2/PXR
(Pregnane X receptor), PXR-related transport, and/or a network
of PXR upstream and downstream effectors.
PXR binds to response elements in the CYP3A4 and ABCB1/MDR1 gene promoters and activates expression in res-
ponse to a wide variety of endobiotics and xenobiotics (by structural
similarity). Activation and downstream effects suggest it may be a
xenosensor of endocrine disruptors (53). This computational anal-
ysis suggests that PXR may be a biomarker for environmental expo-
sure of chemical compounds thought to act through a hormonal
mechanism and which could possibly affect breast cancer risk.
Additionally, several compounds in the dataset tested positive
in mutagenicity assays (from literature sources) and detailed
QSAR analyses were performed using GeneGo. Structural alerts
from ToxTree (http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/)
proved to be the most sensitive predictor for genotoxicity and
mechanistically related carcinogenicity for the compounds in this
dataset: 167 of 204 chemicals in the dataset had ToxTree struc-
tural alerts for genotoxicity and carcinogenicity. Of these, the top
tier were nitro-aromatic (41), primary aromatic amine, hydroxyl
amine and esters (37), polycyclic aromatic hydrocarbons (21),
hydrazine (20), alkyl and N-nitroso groups (19), aliphatic halo-
gens (15), α,β-unsaturated carbonyls (14), and epoxides and aziridines
(10). It can be concluded from this computational analysis that in
silico interrogation of chemical interactions with breast cancer
related genes and biological pathways could help to find common
mechanistic pathways for chemicals and lead to defining biomar-
kers for future testing schemes. To complete this analysis to
classify chemicals of concern, the next steps would be to prioritize
chemicals for further testing based on endpoints already known,
or predicted for these compounds and to define their hazard out-
comes by chemical-documentation found in breast cancer data-
bases. From there, chemicals of concern would be identified and
prioritized, with those of the highest level of concern selected for
future priority screening. Next, exposure assessments of each
chemical and their targeted, susceptible populations would be
made. Following that, each compound's breast cancer risk level would be reassessed in consideration of its exposure levels in populations and its hazard outcomes documented in
breast cancer databases. Lastly, a final check would be made for
compounds with risks that cannot be adequately controlled to
be classified into an "action required" status. These would include
compounds with significant data gaps filled by computational
analyses.
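
A simplified version of such structural-alert screening can be written with RDKit SMARTS patterns. The patterns below are generic illustrations of three alert classes from the list above, not the actual ToxTree rule base, which is far more elaborate and validated:

    from rdkit import Chem

    # Generic SMARTS inspired by common genotoxicity structural alerts.
    alerts = {
        "nitro-aromatic":   "[c][N+](=O)[O-]",
        "epoxide":          "C1OC1",
        "aliphatic halide": "[CX4][Cl,Br,I]",
    }

    mol = Chem.MolFromSmiles("O=[N+]([O-])c1ccccc1")  # nitrobenzene
    for name, smarts in alerts.items():
        if mol.HasSubstructMatch(Chem.MolFromSmarts(smarts)):
            print("alert:", name)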
This last example of identifying chemical–gene relationships
and directly applying the information into problem solving meth-
odology (e.g., through TCM interactions or environmental con-
cerns for breast cancer risk) is one of the advantages of using a
computational approach for biomarker discovery and functional
connectivity.
5. Notes

Identifying safety biomarkers through computational methods has
become an acceptable new methodology due to widespread genetic
research and testing to establish genetic risk factors for ADRs.
Large databases with annotated data sets are available publicly,
which contain well-characterized cases and controls. These are
now being interrogated in industry, academia, and regulatory agen-
cies. Tissue and DNA/RNA banks, in many instances coupled with
detailed clinical annotation, are also available commercially, or
through networks as noted above, to generate specific hypotheses
that can be tested prospectively during clinical trials or retrospec-
tively from banked samples from previous clinical studies. Several
biopharmaceutical companies are also starting to bank samples
from large animal toxicology studies and technologies are available
to establish the same information in primates and dogs as in
humans. Future toxicology studies are predicted to have backup
samples available to identify whether animals with specific toxicities
have predisposing genetic variants which place them at a higher
risk. Procedures have already been established to include these data
in regulatory submissions (see Voluntary Genomics Data Submis-
sions at: www.fda.gov/CDER/genomics/VGDS/htm). In GLP
studies, samples can be collected during a prerandomization period
and these can be studied by issuing specific protocol amendments.
By using cross-species connectivity, it has been suggested that these
approaches will be used to create specific animal models that predict
relevant ADRs and risk factors for humans, which would create a
major advance in nonclinical safety testing. A similar approach has
already been initiated in developing relevant mouse models for
certain diseases, where the mouse contains the same genetic var-
iants as do humans who are highly predisposed to disease develop-
ment. A major factor in these developments will be a consistent
computational approach to identify novel biomarkers and to convert
this information into prospective studies for validation.

References

1. Johnson DE, Smith DA, Park BK (2007) Biomarkers: the pièce de résistance of innovative medicines. Curr Opin Drug Discov Devel 10:22–24
2. Daly AK (2007) Individualized drug therapy. Curr Opin Drug Discov Devel 10:29–36
3. Walker DB (2006) Serum chemical biomarkers of cardiac injury for nonclinical safety testing. Toxicol Pathol 34:94–104
4. Druker BJ, Lydon NB (2000) Lessons learned from the development of an abl tyrosine kinase inhibitor for chronic myelogenous leukemia. J Clin Invest 105:3–7
5. Frank R, Hargreaves R (2003) Clinical biomarkers in drug discovery and development. Nat Rev Drug Discov 2:566–580
6. Johnson DE, Rodgers AD, Sudarsanam S (2006) Future of computational toxicology: broad application into human disease and therapeutics. John Wiley & Sons, Inc, Hoboken, NJ
7. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y (2001) Phases of biomarker development for early detection of cancer. J Natl Cancer Inst 93:1054–1061
8. Johnson DE, Smith DA, Park BK (2009) Pharmacogenomics and adverse drug reactions; prospective screening for risk identification. Curr Opin Drug Discov Devel 12:27–30
9. Ingelman-Sundberg M (2008) Pharmacogenomic biomarkers for prediction of severe adverse drug reactions. N Engl J Med 358:637–639
10. Link E, Parish S, Armitage J, Bowman L, Heath S, Matsuda F, Gut I, Lathrop M, Collins R (2008) SLCO1B1 variants and statin-induced myopathy: a genomewide study. N Engl J Med 359:789–799
11. Office of the Commissioner. Safety alerts for human medical products: phenytoin (marketed as Dilantin, Phenytek and generics) and fosphenytoin sodium (marketed as Cerebyx and generics), http://www.fda.gov/Safety/MedWatch/SafetyInformation/SafetyAlertsforHumanMedicalProducts/ucm094919.htm
12. Chiao SK, Romero DL, Johnson DE (2009) Current HIV therapeutics: mechanistic and chemical determinants of toxicity. Curr Opin Drug Discov Devel 12:53–60
13. Bonnet E, Bernard J, Fauvel J, Massip P, Ruidavets J, Perret B (2008) Association of APOC3 polymorphisms with both dyslipidemia and lipoatrophy in HAART-receiving patients. AIDS Res Hum Retroviruses 24:169–171
14. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N (2004) A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351:2817–2826
15. Wallach I, Jaitly N, Lilien R (2010) A structure-based approach for mapping adverse drug reactions to the perturbation of underlying biological pathways. PLoS One 5:e12063
16. Milletti F, Vulpetti A (2010) Predicting polypharmacology by binding site similarity: from kinases to the protein universe. J Chem Inf Model 50:1418–1431
17. Critical Path Institute, http://www.c-path.org/
18. www.pharmaadme.org - Home, http://pharmaadme.org/joomla/
19. Genome Canada, http://www.genomecanada.ca/
20. Drug Induced Liver Injury Network (DILIN) site, https://dilin.dcri.duke.edu/
21. The pharmacogenomics knowledge base [PharmGKB], http://pharmgkb.org/
22. Waters MD, Fostel JM (2004) Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet 5:936–948
23. Thomson Reuters GeneGo, http://www.genego.com/
24. Ekins S (2006) Systems-ADME/Tox: resources and network approaches. J Pharmacol Toxicol Methods 53:38–66
25. Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ (2011) The comparative toxicogenomics database: update 2011. Nucleic Acids Res 39:D1067–D1072
26. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM (2004) Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 101:9309–9314
27. Mattingly CJ, Rosenstein MC, Davis AP, Colby GT, Forrest JN, Boyer JL (2006) The comparative toxicogenomics database: a cross-species resource for building chemical–gene interaction networks. Toxicol Sci 92:587–595
28. The gene ontology, http://www.geneontology.org/
29. GoMiner Home Page, http://discover.nci.nih.gov/gominer/index.jsp
30. Currie RA, Bombail V, Oliver JD, Moore DJ, Lim FL, Gwilliam V, Kimber I, Chipman K, Moggs JG, Orphanides G (2005) Gene ontology mapping as an unbiased method for identifying molecular pathways and processes affected by toxicant exposure: application to acute effects caused by the rodent non-genotoxic carcinogen diethylhexylphthalate. Toxicol Sci 86:453–469
31. Yu X, Griffith WC, Hanspers K, Dillman JF, Ong H, Vredevoogd MA, Faustman EM (2006) A system-based approach to interpret dose- and time-dependent microarray data: quantitative integration of gene ontology analysis for risk assessment. Toxicol Sci 92:560–577
32. Xirasagar S, Gustafson SF, Huang C, Pan Q, Fostel J, Boyer P, Merrick BA, Tomer KB, Chan DD, Yost KJ, Choi D, Xiao N, Stasiewicz S, Bushel P, Waters MD (2006) Chemical effects in biological systems (CEBS) object model for toxicology data, SysTox-OM: design and application. Bioinformatics (Oxford, England) 22:874–882
33. Mao X, Cai T, Olyarchuk JG, Wei L (2005) Automated genome annotation and pathway identification using the KEGG Orthology (KO) as a controlled vocabulary. Bioinformatics (Oxford, England) 21:3787–3793
34. Hoffmann R, Valencia A (2003) Life cycles of successful genes. Trends Genet 19:79–81
35. Hoffmann R, Valencia A (2003) Protein interaction: same network, different hubs. Trends Genet 19:681–683
36. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, Cho RJ, Chen RO, Brownstein BH, Cobb JP, Tschoeke SK, Miller-Graziano C, Moldawer LL, Mindrinos MN, Davis RW, Tompkins RG, Lowry SF (2005) A network-based analysis of systemic inflammation in humans. Nature 437:1032–1037
37. Bredel M, Bredel C, Juric D, Harsh GR, Vogel H, Recht LD, Sikic BI (2005) Functional network analysis reveals extended gliomagenesis pathway maps and three novel MYC-interacting genes in human gliomas. Cancer Res 65:8679–8689
38. Reactome, http://www.reactome.org/
39. KEGG: Kyoto encyclopedia of genes and genomes, http://www.genome.jp/kegg/
40. DrugBank: home, http://www.drugbank.ca/
41. Remm M, Storm CE, Sonnhammer EL (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314:1041–1052
42. Kim DH, Ausubel FM (2005) Evolutionary perspectives on innate immunity from the study of Caenorhabditis elegans. Curr Opin Immunol 17:4–10
43. Kelley BP, Sharan R, Karp RM, Sittler T, Root DE, Stockwell BR, Ideker T (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci USA 100:11394–11399
44. Kharasch ED, Schroeder JL, Bammler T, Beyer R, Srinouanprachanh S (2006) Gene expression profiling of nephrotoxicity from the sevoflurane degradation product fluoromethyl-2,2-difluoro-1-(trifluoromethyl)vinyl ether (compound A) in rats. Toxicol Sci 90:419–431
45. Duffy CM, Matta BF (2000) Sevoflurane and anesthesia for neurosurgery: a review. J Neurosurg Anesthesiol 12:128–140
46. Bedford RF, Ives HE (2000) The renal safety of sevoflurane. Anesth Analg 90:505–508
47. Vaidya VS, Ferguson MA, Bonventre JV (2008) Biomarkers of acute kidney injury. Annu Rev Pharmacol Toxicol 48:463–493
48. Carlson EA, McCulloch C, Koganti A, Goodwin SB, Sutter TR, Silkworth JB (2009) Divergent transcriptomic responses to aryl hydrocarbon receptor agonists between rat and human primary hepatocytes. Toxicol Sci 112:257–272
49. Chan E, Tan M, Xin J, Sudarsanam S, Johnson DE (2010) Interactions between traditional Chinese medicines and Western therapeutics. Curr Opin Drug Discov Devel 13:50–65
50. Weigelt B, Horlings HM, Kreike B, Hayes MM, Hauptmann M, Wessels LFA, de Jong D, Van de Vijver MJ, Van't Veer LJ, Peterse JL (2008) Refinement of breast cancer classification by molecular characterization of histological special types. J Pathol 216:141–150
51. Turnbull C, Rahman N (2008) Genetic predisposition to breast cancer: past, present, and future. Annu Rev Genomics Hum Genet 9:321–345
52. Williams JA, Phillips DH (2000) Mammary expression of xenobiotic metabolizing enzymes and their potential role in breast cancer. Cancer Res 60:4667–4677
53. Kretschmer XC, Baldwin WS (2005) CAR and PXR: xenosensors of endocrine disrupters? Chem Biol Interact 155:111–128
Chapter 12

Biomonitoring-based Environmental Public Health Indicators

Andrey I. Egorov, Dafina Dalbokova, and Michal Krzyzanowski

Abstract
This chapter discusses the use of biomonitoring-based indicators of exposure to environmental pollutants in
environmental health information systems. Matrices for biomonitoring, organization and standardization of
surveillance programs, the use of intake and body burden data, and the interpretation of surveillance data are
discussed. The concept of environmental public health indicators is demonstrated using the "Persistent organic pollutants in human milk" indicator implemented in the Environment and Health Information
System (ENHIS) of the WHO Regional Office for Europe. This indicator is based on the data from the
WHO-coordinated surveillance of persistent organic pollutants in human milk as well as data from selected
national studies. The WHO survey data demonstrate a steady decline in breast milk concentrations of dioxins
across Europe. The data from biomonitoring surveys in Sweden also show a steady decline of breast milk
concentrations of most persistent organic pollutants since the 1970s, with the exception of polybrominated
diphenyl ethers (PBDEs) which increased rapidly until the late 1990s and then started to decline after the
implementation of policy measures aiming at reducing exposures. The application of human biomonitoring
data in support of environmental public health policy actions requires carefully designed standardized and
sustainable surveillance, comprehensive interpretation of the data, and an effective communication strategy
based on credible information presented in the form of indicator factsheets.

Key words: Persistent organic pollutants, Dioxins, Polybrominated diphenyl ethers, Human milk,
Environment and Health Information System, Environmental public health indicators, World Health
Organization

1. Introduction

Exposure to environmental pollutants occurs through different
routes, such as inhalation, ingestion, or dermal absorption. The
amount of pollutant uptake is often termed the absorbed dose.
The body burden at a given moment in time is the result of past
exposure, distribution and tissue binding, metabolism and excre-
tion. For many pollutants, comprehensive exposure assessment
based on environmental data requires quantitation of exposures
through multiple pathways or routes, taking into account the distributions of pollutant levels and individual behavioral patterns, such
as consumption of contaminated foods.
Biomonitoring involves measurements of biomarkers, such as
levels of environmental contaminants or their metabolites, or mar-
kers of health effects, in bodily fluids, such as blood, urine, saliva,
breast milk, sweat, or other specimens, such as feces, hair, teeth, or
nails. Biomonitoring data directly reflects the total body burden or
biological effect resulting from all routes of exposure, and interin-
dividual variability in exposure levels, metabolism and excretion
rates. Such data are often the most relevant metric for health impact
assessment for bioaccumulating or persistent chemicals. Biomoni-
toring, however, usually does not reveal exposure sources and
routes. Therefore, environmental monitoring will remain crucial
for the development of targeted policy actions.
Biomarkers can be grouped according to the link in the
cause–effect chain that they characterize: exposure or health effect.
Biomarkers of health effects reflect measurable changes at the
physiological, biochemical or morphological level due to exposure
to toxicants or biological agents in the environment. For example,
gene expression patterns associated with exposure to complex
chemical mixtures can be evaluated using microarrays. The forma-
tion of MicroNuclei (MN) in human peripheral lymphocytes or
the rate of chromosomal aberrations are widely used as indicators
of chromosomal damage, for example, due to exposure to Poly-
Aromatic Hydrocarbons (PAH) or other mutagens (1). High
throughput systems for automatic scoring of micronuclei (2, 3)
enable rapid analysis of a large number of samples making the MN
analysis a promising biomonitoring tool. A separate group of
biomarkers characterizes antibody responses to pathogens reflect-
ing infections or vaccinations, or antibody responses to food and
air-borne allergens. Biomarkers of exposure to chemical pollutants,
such as concentration of lead in blood or dioxins in human milk, are
most common in environmental health surveillance. Exposure bio-
markers are described in detail in Subheadings 2 and 3.
Biomonitoring can be used in epidemiological studies to
demonstrate an association between body burden of pollutants
and health effect or to test other research hypotheses. Novel bio-
monitoring methods are usually tested and validated in research
settings. Sustained national and international surveillance programs
typically use well-established biomonitoring techniques to produce
information on environmental factors of known public health sig-
nificance. The results of biomonitoring-based surveillance can be
reflected in environmental public health indicators, which represent
simple numerical summary measures of surveillance data, such as
the average level of population exposure to a specific pollutant, or
the proportion of heavily exposed individuals. Well-designed indi-
cators based on sound surveillance programs using standardized
methods can provide valuable information to policy makers and the
public. They can guide policy action to reduce or prevent adverse
health effects in the population, and inform stakeholders outside
the professional community or the general public.
Information systems based on environmental public health
indicators have been developed at national and international levels.
For example, the US Centers for Disease Control and Prevention's
(CDC) National Environmental Public Health Tracking Program
(http://ephtracking.cdc.gov/showHome.action) integrates com-
ponents of hazard monitoring, and exposure and health effects
surveillance into a cohesive information network (4). The Environ-
ment and Health Information System (ENHIS) of the WHO
Regional Office for Europe integrates national level exposure,
health effect and policy action data. The system and the "Persistent organic pollutants in human milk" indicator are described in Sub-
heading 3.

2. Materials and Methods
2.1. Matrices Used for Biomonitoring

The type of samples to be selected for biomonitoring depends on many factors including bioaccumulation, metabolism and excretion
rates of the chemical, potential contamination, matrix interference,
required sample volume, and the ease of sampling. Biomonitoring
can provide a wealth of information on chemicals that are stored in
the body for a long period of time, such as Persistent Organic
Pollutants (POPs), lead, and cadmium. For chemicals that are
excreted rapidly, cross-sectional biomonitoring data reflect recent
exposure, while characterization of long-term exposure patterns
requires repetitive sampling.

2.1.1. Blood

Blood is the preferred matrix for many chemicals because it is in
contact and in dynamic equilibrium with all tissues (5–7). The main
disadvantage of blood is the invasive sampling procedure. Blood is a
common matrix for biomonitoring of water soluble pollutants,
such as metals. Blood lead level (BLL) is the primary biomarker
of lead exposure (8). Blood (serum) is also widely used for biomo-
nitoring of fat-soluble chemicals, such as persistent organic pollu-
tants (POPs). Since the fat content of serum varies, concentrations
of fat soluble organic compounds are typically expressed per gram
of fat. Serum samples are also used for biomonitoring of antibody
responses to pathogens, such as Human Immunodeficiency Virus
(HIV), Herpes Simplex Virus HSV-2 (9), cytomegalovirus (10),
papillomaviruses (11), Helicobacter pylori (12), and Toxoplasma
gondii (13), and common food allergens, such as peanuts (14, 15).
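
The lipid adjustment mentioned above is simple arithmetic; a minimal sketch with illustrative numbers:

    def lipid_adjusted(analyte_ng_per_ml, lipid_g_per_l):
        # Express a serum POP concentration per gram of lipid (ng/g).
        return analyte_ng_per_ml / (lipid_g_per_l / 1000.0)  # g/L -> g/mL

    # Example: 0.8 ng/mL of a PCB in serum with total lipids of 6 g/L.
    print(f"{lipid_adjusted(0.8, 6.0):.0f} ng/g lipid")  # ~133 ng/g lipid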
278 A.I. Egorov et al.

2.1.2. Urine

Urine is the second most common matrix for biomonitoring. Due to
the variable composition of urine, the results need to be adjusted for
the creatinine concentration. Another caveat is that the urine level of
a chemical or its metabolite does not directly reflect the body bur-
den. Thus, the application of toxicokinetics is necessary for inter-
preting the data. Urine is suitable for biomonitoring of water-
soluble chemicals, such as metals (8), and most useful for monitoring
rapidly metabolized organic chemicals, such as PAHs and phthalates
(5–7). Data on PAH metabolites in urine has been linked with the
biomarker of biological effect, serum Immunoglobulin E (IgE)
specific to indoor allergens, as well as allergy symptoms (16).
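
The creatinine adjustment is likewise a simple ratio; a minimal sketch with illustrative numbers:

    def creatinine_adjusted(analyte_ug_per_l, creatinine_g_per_l):
        # Express a urinary analyte per gram of creatinine to correct
        # for urine dilution (ug analyte per g creatinine).
        return analyte_ug_per_l / creatinine_g_per_l

    # Example: 4.2 ug/L of a PAH metabolite, urinary creatinine 1.4 g/L.
    print(creatinine_adjusted(4.2, 1.4))  # 3.0 ug/g creatinine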

2.1.3. Hair and Nails

Nails, especially toenails, which are less likely to be contaminated by
direct contact with environmental chemicals, are used for biomo-
nitoring of metals, such as mercury, lead, and cadmium (5, 8). Hair
is also a stable matrix that can be used in biomonitoring, most
commonly of metals. It is an especially reliable matrix for mercury.
Important issues for ensuring comparability of data are the length
of hair collected, position on the scalp, distance from the scalp, and
preparation of samples (5, 7).

2.1.4. Milk

Breast milk is commonly used as a matrix for monitoring of lipophilic
POPs (6). Emerging nonpersistent chemicals, such as phthalates, can
also be detected in milk. As the content of milk is not constant, the
results are usually adjusted to the concentration of lipids. Due to
depuration of POPs from the mother during lactation, parity and
sampling time after delivery are important for the interpretation of
results. The age of the mother is also an important factor as concen-
trations of some POPs, such as dioxins, increase with age. The
quantitation of analytically complex POPs, such as dioxins, requires the use of gas chromatography–high resolution mass spectrometry (GC-HRMS), which can detect femtogram or picogram quantities of
chemicals. This technique requires very strict laboratory conditions
and is expensive. A lower cost alternative, the Chemically Activated
Luciferase eXpression (CALUX) assay, allows the quantitation of the total activity of dioxins and similar compounds acting as aryl hydrocar-
bon receptor (AhR) agonists. It shows a strong correlation with
chemical analysis of milk samples (17).
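
Results for dioxin-like compounds are commonly summarized as toxic equivalents (TEQ), a TEF-weighted sum over congeners; a minimal sketch with two congeners follows (the TEFs shown follow the WHO 2005 scheme, and the concentrations are hypothetical):

    # WHO 2005 toxic equivalency factors for two well-known congeners
    # (verify against the full WHO TEF table before use).
    tefs = {"2,3,7,8-TCDD": 1.0, "PCB-126": 0.1}

    # Hypothetical lipid-adjusted milk concentrations, pg/g fat.
    conc = {"2,3,7,8-TCDD": 1.2, "PCB-126": 25.0}

    teq = sum(conc[c] * tefs[c] for c in conc)  # TEF-weighted sum
    print(f"{teq:.1f} pg TEQ/g fat")            # 1.2 + 2.5 = 3.7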

2.1.5. Other Matrices

Many toxic elements are preferentially excreted from the body in
sweat (18). Difficulties with collecting sweat samples, however,
limit the application of this matrix in biomonitoring. Since 90% of
lead is accumulated in bones, the level of lead in bones or deciduous
teeth is a good marker of long-term exposure and toxic effects (8).
Lead in bones is measured using noninvasive methods, such as K-
line X-ray fluorescence (19). An important disadvantage of this
method is that survey participants need to come to a specialized
laboratory to be tested. Saliva offers many advantages as an easy-to-
collect noninvasive matrix but also poses technical challenges due
to its variable composition. Salivary antibody assays have been used
for low cost biomonitoring of environmental infections, such as
hepatitis A (20). Multiplex salivary antibody assays for several
waterborne infections have also been developed at the US Environ-
mental Protection Agency (21).

2.2. Examples of Biomonitoring Programs

The largest continuous program involving biomonitoring in the USA, the National Health and Nutrition Examination Survey (NHANES), has been conducted by the US CDC's National Cen-
ter for Health Statistics (NCHS) since 1971. It became a continu-
ous survey in 1999. It is currently based on a stratified, multistage,
probability-cluster design to select a representative sample of the
civilian population (22). Each year, approximately 7,000 randomly
selected individuals are invited to participate. NHANES involves
extensive biomonitoring of metals in blood and urine samples,
POPs in blood, and antibody responses to selected antigens in
sera. Dioxins, furans, polychlorinated biphenyls (PCBs), organo-
chlorine pesticides, and polybrominated diphenyl ethers (PBDEs)
are measured in a one-third subsample of participants excluding
children (22). The individual-level data are widely used in research
projects while statistical summaries provide information for policy
makers and the public.
The Italian program on human biomonitoring commenced
in 2008 to assess blood levels of 18 metals in the general popula-
tion, to determine reference values characterizing the background
body burdens, to compare regions of the country and to monitor
trends (23). Two on-going Canadian studies, the population-based
Canadian Health Measures Survey (CHMS) and the pregnancy
cohort study, Maternal-Infant Research on Environmental Chemi-
cals (MIREC), include extensive biomonitoring efforts. Blood and
urine samples in CHMS and maternal blood, hair, and urine, cord
blood, meconium, and breast milk in MIREC are analyzed for
metals, pesticides, PCBs, and PBDEs as well as phthalate metabo-
lites, bisphenol A, and perfluorinated compounds (24). The Arctic
Monitoring and Assessment Program (AMAP) in Greenland
involves the evaluation of the combined health effects of dioxins
and PCBs using serological markers of their endocrine disrupting
potential, such as estrogen and androgen receptor reporter gene
transactivation, and AhR transactivation assays (25). The WHO-
coordinated international survey of Persistent Organic Pollutants
(POPs) in human milk is described in detail in Subheading 3.

2.3. Interpretation of Biomonitoring Data

The proliferation of biomonitoring methods and surveillance programs has the potential to reduce the use of environmental moni-
toring data in the assessment of exposure to some environmental
hazards. Future environmental monitoring efforts will likely
focus more on identifying sources of contaminants to explain
biomonitoring data (6). Biomonitoring will also allow direct and
more precise assessment of the distribution of risk in the population
incorporating individual variability in exposures and chemical excre-
tion rates. An ambitious concept, the exposome, involves charac-
terization of the totality of exposures to environmental chemicals
using prospective comprehensive biomonitoring (26).
Increasing application of human biomonitoring will drive similar
changes in toxicological studies where body burden measures will
replace uptake doses. Biomonitoring data is also increasingly used in
epidemiological studies linking the body burden with population
health effects. Examples are associations between cord blood mercury
level and adverse neurological effects in children (7), and between
maternal plasma concentrations of dioxins and modified neonatal
thyroid function (27). Accumulation of biomonitoring-based toxico-
logical and epidemiological data will enable wider application of
human biomonitoring data in health impact assessment.
Most current standards and guidelines for safe exposure, how-
ever, are still expressed in units of intake, such as Acceptable or Tolerable Daily Intake (ADI or TDI), Reference Dose (RfD), Mini-
mal Risk Level (MRL), or Derived No-Effect Level (DNEL). Human
biomonitoring data can be interpreted using the Biomonitoring
Equivalent (BE) concept which represents the concentration of a
chemical or its metabolite in biological specimens consistent with
the established reference values based on the intake data (28, 29).
Linking exposure data with body burden relies on the knowl-
edge of toxicokinetics of chemicals. Compounds that are persistent
in the human body (half-life measured in years, such as lead and
dioxins) or intermediately persistent compounds (half-life measured
in days or months, such as methylmercury) are most suitable for this
approach because short-term variations in exposure have no effects
on the biomarker data (7). Various Physiologically Based Pharmaco-
kinetic (PBPK) models have been developed for multiple pollutants.
Therefore, standardized approaches need to be developed to ensure
comparability and consistency of the modelling results (30).
While the accuracy of PBPK models depends, among other
factors, on the accuracy of elimination half-life constants for specific
chemical species, the application of animal data to derive such
constants for humans is problematic as the elimination behavior
cannot be predicted reliably using allometric scaling. Half-life esti-
mates for important Persistent Organic Pollutants (POPs) are
based on human data from occupational exposure studies or acci-
dents involving exposure to a high level of pollutant. An approach
to use general population-based biomonitoring data to provide an
accurate half-life estimate for PCBs has also been developed (31). It
is based on analysis of temporal trends in body burdens from
sequential cross-sectional surveys taking into account growth
dilution of PCB concentrations in the body during childhood,
changes in body fat with age and life history of exposure, which is
approximated using time-specific intake estimates based on food
contamination surveillance data. In this study, the estimated half-
lives for different PCB congeners varied from 2.6 to 15.5 years.
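To make the first-order logic behind such estimates concrete, the sketch below computes the apparent half-life implied by two cross-sectional survey means. All numbers are hypothetical, and the naive calculation deliberately omits the growth-dilution, body-fat and intake-history corrections that the published method (31) applies:

    import math

    # Hypothetical numbers: two cross-sectional surveys, 10 years apart,
    # report lipid-adjusted PCB means of 120 and 60 ng/g fat.
    c1, c2 = 120.0, 60.0   # ng/g fat
    dt = 10.0              # years between surveys

    # Assuming simple first-order elimination, C(t) = C0 * exp(-k*t):
    k = math.log(c1 / c2) / dt       # apparent elimination rate (1/year)
    half_life = math.log(2.0) / k    # apparent half-life (years)
    print(f"apparent half-life: {half_life:.1f} years")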
For dioxins, the elimination rates have been shown to depend
on age, fat content of the body and the initial concentration of the
pollutant (32). The Concentration- and Age-Dependent elimination Model (CADM) has been developed to reconstruct the time-dependent body burden profile of 2,3,7,8-tetrachlorodibenzo-p-dioxin
(TCDD), the most toxic dioxin congener, in occupationally
exposed individuals using biomonitoring data collected many
years after the termination of occupational exposure (33, 34).
These body burden reconstruction data have been used to estimate
the carcinogenic potency of dioxin using the total area under the
lipid-adjusted serum concentration versus time curve as exposure
metric (35). The carcinogenic effect was modelled as follows:
RR = exp(b × AUC),

where b is the Cox regression coefficient ((pg/g fat × year)^-1), AUC (pg/g fat × year) is the area under the reconstructed concentration curve for lipid-adjusted serum TCDD, and RR is the ratio between the age-specific total cancer mortality rate at a given AUC increment and the background rate.
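For orientation only, this log-linear model is easy to evaluate numerically; the values of b and AUC in the sketch below are hypothetical and are not taken from the cited analysis:

    import math

    def relative_risk(b, auc):
        """Log-linear Cox model: RR = exp(b * AUC)."""
        return math.exp(b * auc)

    # b in (pg/g fat x year)^-1 and AUC in pg/g fat x year; both
    # values below are assumed purely for illustration.
    b = 1.0e-5
    for auc in (1.0e3, 1.0e4, 1.0e5):
        print(f"AUC = {auc:9.0f} -> RR = {relative_risk(b, auc):.3f}")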

3. Examples

3.1. Environment and Health Information System of the WHO Regional Office for Europe

The European Environment and Health Information System (ENHIS) (www.euro.who.int/enhis or www.enhis.org) involves a set of indicators that address the three downstream elements of the Driving force–Pressure–State–Exposure–Effect–Action (DPSEEA) conceptual model (36, 37). Indicators of exposure are
designed to characterize priority EH issues which are amenable to
interventions and policy actions.
The information in ENHIS is presented in the form of stan-
dardized indicator factsheets providing concise yet comprehensive
information on the public health importance of the environmental
factor, policy actions, and regulatory context.
The information is presented in blocks (layers of informa-
tion) with different levels of detail to meet the needs of different
categories of users (Fig. 1). The top layer contains a key message
which summarizes the environmental public health situation and
policy framework, and rationale for implementing the indicator.
The data are displayed in easily understandable graphs or charts.
Data are interpreted in the context of the public health significance
of the problem and relevant policy actions. The assessment of the
situation focuses on potential public health effects and temporal
[Figure 1 diagram: layered presentation of information, from the key message (aimed at policy-makers), through context blocks (policy in place; public health significance) for policy advisors, and interpretation of data (assessment of the EH situation) for health professionals and researchers, down to the data themselves: presentations (graphs, charts), data tables, and metadata.]
Fig. 1. Communication and information dissemination in ENHIS: the indicator factsheet format.

trends potentially resulting from policy actions. The interpretation
and assessment layers target policy advisors and health profes-
sionals. The detailed description of data (metadata) and list of
references are provided for health professionals interested in
performing more in-depth analyses.
ENHIS has served as the main source of information for two
major WHO reports to support the European Region-wide policy
actions related to environmental health: Children's Health and
the Environment in Europe: a Baseline Assessment (38) and
Health and Environment in Europe: Progress Assessment (39).
The latter report was presented, as background material, at the 5th
Ministerial Conference on Environment and Health in Parma, Italy
(March 2010). The conference adopted the Parma Declaration on
Environment and Health (40). The Commitments to Act in the
Parma Declaration specify measures to protect children's health
from the impact of environmental factors, and call for the develop-
ment of internationally comparable environmental health indica-
tors and a consistent approach to human biomonitoring. The
Parma Declaration also explicitly supports further development of
ENHIS. It provides a framework for the development of new
environmental health indicators, which are necessary for monitor-
ing the implementation of specific commitments to reduce or
eliminate exposures to reproductive toxicants and endocrine dis-
ruptors.

3.2. ENHIS Indicator Persistent Organic Pollutants in Human Milk

The "Persistent organic pollutants in human milk" indicator is based on data from the WHO survey of POPs in human milk (Fig. 2) and selected data from national surveys in Sweden (Fig. 3). It addresses exposure to an important class of pollutants
[Figure 2: bar chart of dioxin concentrations in pooled human milk samples (pg TEQ/g fat, y-axis 0–50) by country and survey round (1988, 1993, 2002, 2007).]

Fig. 2. Dioxin levels in human milk in selected European countries, 1988–2007.

[Figure 3: line chart of POP concentrations in pooled human milk samples, 1970–2010; left axis (0–4.5) for DDE, DDT, PCBs and HCB (µg/g fat), right axis (0–120) for PCNs and PBDEs (ng/g fat) and dioxins (PCDDs/PCDFs/PCBs, pg TEQ/g fat).]

Fig. 3. POPs levels in human milk, Sweden, 1972–2007, with trend lines produced using two data points per moving average.

that are known to induce a variety of endocrine disrupting, developmental, and carcinogenic effects.

3.2.1. Sources of Data

The WHO Global Environment Monitoring System–Food
Contamination Monitoring and Assessment Programme (GEMS/
Food) involves biomonitoring of POPs in human milk using stan-
dardized recruitment, sampling, laboratory analysis, and data
presentation protocols. The international policy framework for this
program includes the United Nations Economic Commission for Europe's (UNECE) Long-Range Transboundary Air Pollution
convention. This convention has effectively enforced the surveil-
lance of POPs, including human biomonitoring. The Stockholm
Convention on Persistent Organic Pollutants is another landmark
international agreement which was ratified in 2004 to decrease
human exposure to priority POPs.
Milk, blood and adipose tissue are all relevant matrices for the
assessment of body burdens of POPs, and the lipid-adjusted results
are actually comparable for many POPs. Human milk was selected
for the WHO survey because it provides information on the cumu-
lative exposure of the mother as well as the current exposure of the
infant. The main objective of this continuous biomonitoring pro-
gram is to examine temporal trends in participating countries (41).
The first round of the study took place in 1987–1988, while the last round took place in 2008–2009 (42, 43).
National surveys are designed using the WHO guidelines (41).
In each country, at least 50 milk samples from primiparous healthy
women under 30 years of age have to be collected within 3–8 weeks after the child's birth. The women have to exclusively breastfeed a
single child (twin births excluded) and should have resided in the
same area for at least 10 years. Women with unusual exposure
history, such as living near known POP hot spots, are excluded.
Each woman provides at least 50 mL of milk. Samples are divided into
two 25 mL portions. One portion is used locally for analysis of
analytically simple POPs, such as marker PCBs and organochlorine
pesticides (44). The other portions from all samples collected in
the country are pooled and sent to the WHO reference laboratory
(currently the State Institute for Chemical and Veterinary Analysis
of Food in Freiburg, Germany) for the analysis of analytically
complex POPs, such as polychlorinated dibenzodioxins
(PCDDs), polychlorinated dibenzofurans (PCDFs) and dioxin-
like PCBs. The Gas Chromatography–High Resolution Mass
Spectrometry (GC-HRMS) technique is used to quantify the con-
centrations of individual congeners.
Other sources of data for this indicator are national biomoni-
toring surveys in Sweden which used pooled milk samples and
generally followed the WHO guidelines but had specific objectives
addressing national information needs (45–47).

3.2.2. Dioxin Levels in Human Milk

Figure 2 presents country-level summary data on breast milk levels of PCDDs, PCDFs and dioxin-like PCBs (hereafter called "dioxins"), an important class of POPs that have similar toxicological
properties. These compounds act as Aryl-hydrocarbon Receptor
(AhR) agonists and elicit profound biochemical responses in verte-
brates and altered regulation of a large number of genes (48, 49).
Endocrine disrupting effects of dioxins are likely due to a
combination of several mechanisms, such as altered steroidogene-
sis, reduced expression of receptors for sex steroids and luteinizing
hormone (LH), and induction of the cytochrome P450 1 family
of enzymes, resulting in the inactivation of steroid hormones (50).
Maternal exposure to dioxins has been linked to a long-lasting
modification of thyroid function in children (27) while exposure
during infancy has been linked to altered spermatogenesis and
hormonal status in adult men (51). Dioxins also act as nongeno-
toxic cancer promoters (35, 52).
PCDD/PCDFs have never been produced intentionally.
They are formed during waste incineration, home heating, and as
by-products of the production of organic chemicals containing
chlorine, such as organochlorine pesticides and PCBs. The latter
have been produced commercially for use as dielectric fluids in
capacitors and transformers, heat conducting and hydraulic fluids,
and as additives to other commercial products. The production
peaked in the 1970s. In the USA, production was completely
phased out in 1977 under regulatory pressure due to concerns
about health effects.
PCDD/PCDFs and PCBs emitted into the air can remain there
for a long time, travel thousands of kilometers and accumulate in
the food chain in the areas where major emission sources do not
exist, such as the Arctic (53). While potential exposure pathways
include ingestion, inhalation and dermal absorption, the main
route of exposure (more than 90%) in the general population is
through food (49, 54). In adults, the main exposure sources are
dairy products, meat and fish, with relative importance of each
source depending on the local diet. For breastfed infants, the
main source of exposure is mothers milk (55).
WHO has developed the scheme of Toxic Equivalency Factors
(TEFs) to quantify the relative potency of PCDDs/PCDFs and
dioxin-like PCBs compared to the most toxic dioxin congener,
2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). Using this scheme,
the overall toxicity of the mixture of dioxins is presented as the
Toxic Equivalency Quantity (TEQ) expressed in concentration
units (56, 57). The physical meaning of TEQ is the concentration
of TCDD which would have the same potency as the mixture
analyzed. Typically, dioxins other than TCDD account for approx-
imately 90% of toxicity of the mixture. TEFs are largely based on
the results of toxicological studies that used the ingested dose
metric and are intended to be applied to intake data (56). In
practice, the results of human biomonitoring of dioxins are also
expressed in TEQ units. These TEQ values should be interpreted
with caution because the pharmacokinetic and distribution proper-
ties of many TEQ-contributing compounds differ substantially
from those of TCDD, which could influence the biological effects
under chronic conditions (58, 59).
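A minimal sketch of the TEQ arithmetic, using a few of the WHO 2005 TEFs (57); the congener concentrations (pg/g fat) are hypothetical measurement results:

    # A few WHO 2005 TEFs (57); concentrations are assumed values.
    tef = {
        "2,3,7,8-TCDD": 1.0,
        "1,2,3,7,8-PeCDD": 1.0,
        "2,3,4,7,8-PeCDF": 0.3,
        "PCB-126": 0.1,
    }
    conc = {  # hypothetical congener concentrations, pg/g fat
        "2,3,7,8-TCDD": 0.8,
        "1,2,3,7,8-PeCDD": 1.5,
        "2,3,4,7,8-PeCDF": 4.0,
        "PCB-126": 12.0,
    }

    # TEQ = sum of (congener concentration x its TEF), in pg TEQ/g fat
    teq = sum(conc[c] * tef[c] for c in tef)
    print(f"mixture toxicity: {teq:.2f} pg TEQ/g fat")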
The long-term Tolerable Daily Intake (TDI) of dioxins estab-
lished by WHO (60) is 1 to 4 pg TEQ/kg body weight (bw), with
the upper range being the maximum tolerable intake. The stated
ultimate goal is to reduce intake to the level below 1 pg TEQ/kg
bw per day. Further assessment by the Joint Food and Agriculture
Organization (FAO)/WHO Committee set the provisional tolerable
monthly intake (PTMI) at 70 pg TEQ/kg bw (49). The tolerable
daily intake for chronic exposure set by the US Agency for Toxic
Substances and Disease Registry (ATSDR) is 1 pg TEQ/kg bw
(http://www.atsdr.cdc.gov/ToxProfiles/tp.asp?id=366&tid=63. Accessed 4 July 2012).
Although exposure to dioxins and dioxin-like PCBs continues
to decline in Europe, national studies show that intake in a substan-
tial proportion of adults still exceeds the WHO TDI of 1 pg TEQ/
kg bw (61). Body weight-adjusted daily intake levels in breastfed
infants are one to two orders of magnitude greater than in adults
(55, 62), substantially exceeding the WHO TDI.
Body burden of dioxins increases with age. In adults, higher
body burden is associated with greater fish consumption and
inversely associated with lactation history (63, 64). Taking into
account the uncertainties associated with pharmacokinetic model-
ling, the BE values corresponding to the existing intake limits range
from 15 to 74 pg TEQ/g of serum fat (29). Since the lipid-based
concentrations of dioxins in serum and milk are at equilibrium,
these values can also be applied to milk data.
Figure 2 shows the summary data for human milk concentra-
tions of PCDDs/PCDFs and dioxin-like PCBs expressed in TEQ
units. The data characterize the body burden for these chemicals in
women as well as ingestion exposure in breastfed infants. Only data
from Member States of the WHO European Region that partici-
pated in at least two rounds of surveys are included in this graph.
Since the number of congeners analyzed differed between survey
rounds (44), this summary presents only the results for the con-
geners that were measured in all rounds.
The graph shows that the concentrations of dioxins in human
milk steadily declined across the European Region. The rate of
decline was greater in the countries where the initial levels were
higher, such as Belgium, The Netherlands, and Germany. In the
Czech Republic, the concentration increased slightly from 2002
to 2007. The concentrations of dioxins in pooled samples in the
2007 survey varied from 5 to 10 pg TEQ/g of fat, which is
below the lowest BE value of 15 pg TEQ/g. Assessment of the
proportion of individuals exceeding the BE limit in each country
using the pooled sample data is problematic, since the statistical
distribution of the body burden of dioxins is poorly characterized
(41). The typical average daily intake of breast milk fat in fully
breast fed infants during the first 6 months of life ranges from 3 to
6 g/kg bw (65). Hence, the daily intake of dioxins in most
European infants still greatly exceeds the target limit of 1 pg
TEQ/kg bw. The surveillance of dioxins in breast milk needs to
be continued and expanded to monitor the implementation of the
Stockholm Convention and the Parma Declaration commitment
to protect children from exposure to endocrine disruptors.
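This conclusion can be checked directly from the figures quoted above; the short sketch below simply multiplies the reported ranges:

    # Values quoted in the text: milk fat intake of fully breastfed
    # infants is 3-6 g fat/kg bw/day (65); pooled milk levels in the
    # 2007 survey were 5-10 pg TEQ/g fat; the target is 1 pg TEQ/kg bw.
    for fat_intake in (3.0, 6.0):        # g fat/kg bw/day
        for milk_level in (5.0, 10.0):   # pg TEQ/g fat
            dose = fat_intake * milk_level
            print(f"{fat_intake:.0f} g/kg/day x {milk_level:.0f} pg TEQ/g "
                  f"= {dose:.0f} pg TEQ/kg bw/day")
    # Every combination (15-60 pg TEQ/kg bw/day) exceeds the 1 pg target.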

3.2.3. Levels of Polybrominated Diphenyl Ethers and Other POPs in Human Milk

Figure 3 presents (from the same ENHIS factsheet) the results of several national surveys that involved monitoring of PCDDs/PCDFs, PCBs, polybrominated diphenyl ethers (PBDEs), polychlorinated naphthalenes (PCNs), hexachlorobenzene (HCB),
dichlorodiphenyltrichloroethane (DDT) and its breakdown product
dichlorodiphenyldichloroethylene (DDE) in human milk in Sweden
(refs. 45–47 and Malisch et al., unpublished data). These national data
were included in the indicator factsheet to provide a detailed illustra-
tion of temporal trends of all major classes of POPs in pooled milk
samples. The graph shows that concentrations of most POPs have
been declining steadily since the 1970s. The previously published
statistical analysis of these data demonstrated that the declines followed first-order kinetics. Breast milk concentrations of DDT, total
PCBs and dioxins (expressed in TEQ units) were decreasing by 50%
every 4, 14 and 15 years, respectively (46). It should be noted that the
Fig. 3 data on dioxins are not directly comparable with the WHO data
for Sweden in Fig. 2 due to different recruitment and sampling
procedures (41, 46).
One notable exception to the steady decline of POPs in Sweden
was a rapid exponential increase in PBDEs in pooled milk samples
until the late 1990s, with concentrations doubling every 5 years.
This was followed by a rapid reduction after a voluntary ban on the
use of most toxic and bioaccumulating PBDE congeners.
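Both the steady declines and the earlier PBDE rise are simple exponential processes, so the reported half-lives and doubling time can be projected directly; in the sketch below the starting levels are normalized to 1 purely for illustration:

    def project(c0, half_life_years, t_years):
        """First-order decline: C(t) = C0 * 2**(-t / half-life)."""
        return c0 * 2.0 ** (-t_years / half_life_years)

    # Half-lives of decline reported for the Swedish milk data (46):
    for name, t_half in (("DDT", 4.0), ("total PCBs", 14.0),
                         ("dioxin TEQ", 15.0)):
        left = project(1.0, t_half, 20.0)  # fraction left after 20 years
        print(f"{name}: {left:.2f} of the initial level after 20 years")

    # The pre-ban PBDE rise (doubling every 5 years) is the same
    # expression run in reverse: C(t) = C0 * 2**(t / doubling_time).
    print(f"PBDE factor over 20 years of doubling: x{2.0 ** (20.0 / 5.0):.0f}")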
PBDEs have the general formula C12H(9–0)Br(1–10)O, in which the sum of H and Br atoms always equals ten. Congeners are
numbered in accordance with the International Union of Pure
and Applied Chemistry (IUPAC) system. PBDEs have been used
as flame retardants in plastics and upholstery since the 1970s.
These chemicals are mixed with the material but not chemically
linked to it. Therefore, they can migrate from the material. Three
commercial PBDE mixtures have been produced: pentabromodi-
phenyl ether (penta-BDE), octabromodiphenyl ether (octa-BDE),
and decabromodiphenyl ether (deca-BDE consisting primarily of
BDE-209). Most of the penta-BDE and almost half of octa- and
deca-BDEs produced in the world have been used within the USA.
As a result, body burdens in the USA are one to two orders of
magnitude greater than in western European countries (66, 67).
A major exposure route in adults is contaminated food,
especially fish, meat and poultry (47). Breastfed infants are especially
heavily exposed through breast milk due to lactational off-loading of
PBDEs (66). Inhalation of house dust is another important expo-
sure pathway, especially for higher brominated varieties (66).
Body burden in young children can be an order of magnitude higher
than in adults, whereas there is no increase in PBDE body burden
with age in adults (22, 66, 67). Lower brominated varieties have
longer half-lives in the human body: the range is from tens of days for deca-, nona-, and octa-BDEs to several years for penta-BDE (66).
The data in Fig. 3 show combined concentrations of five persistent
and bioaccumulating penta-BDE congeners (47).
The main health concerns are associated with developmental neu-
rotoxicity and endocrine disrupting effects. Different congeners
may exhibit antiandrogenic, estrogenic, and antiestrogenic effects.
Penta-BDEs are more toxic than octa-BDEs, while deca-BDE is the
least toxic. Penta-BDEs have been linked with delays in puberty,
decrease in the size of sex organs and lower sperm counts in animal
models (66, 68), and cryptorchidism in boys (68). Deca-BDE is
also developmentally neurotoxic and a possible human carcinogen
(67, 69). Adverse health effects of deca-BDE in humans can be due,
in part, to impurities (the presence of nona-BDE) and partial
debromination in the environment and biota. The penta-BDE
and octa-BDE mixtures were banned in the European Union in
2004 (70). Policy actions to phase out less toxic and bioaccumulat-
ing deca-BDEs have been contentious due to the lack of demon-
strably safer alternative flame retardants (71).
While recommended exposure limits have been specified for
PBDEs (69), linking biomonitoring data with the uptake data, and
determining biomonitoring equivalent (BE) values for PBDEs has
been problematic. This is due to the fact that debromination of
highly brominated congeners occurs in the environment and biota,
and that patterns of congeners evolve along the food chain due to
differential bioaccumulation, metabolism and degradation proper-
ties (66). Dose–response functions for PBDEs are poorly character-
ized and information on health effects at low doses is insufficient.
Therefore, the interpretation of milk data is limited to the analysis
of temporal trends and the effects of policy actions. The accumula-
tion of these chemicals in breast milk remains a concern despite
recent improvements in the situation.

4. Notes

The objectives of human biomonitoring-based environmental
health indicators include assessing temporal trends and the effects
of policy actions, characterizing geographic patterns of exposure, or
comparing different population subgroups. Presenting and inter-
preting the data in a clear and understandable language helps to
broaden the reach of the information, facilitating the use of a
common language among different sectors involved in environ-
mental public health policy making.
For all types of surveillance programs, standardized protocols
and rigorous quality control and assurance are necessary to ensure
the interpretability and comparability of data in time and
space. The WHO program of POPs monitoring in human milk
involves standardization which ensures a high level of national data
comparability and enables monitoring of temporal trends. The latter
requires the use of consistent analytical techniques and, most impor-
tantly, consistent application of the recruitment procedures.
An important issue is developing a balanced communication
strategy aimed not only at environmental health professionals but
also at the general population. The WHO guidelines for POP
monitoring in human milk stress that despite the presence of con-
taminants in milk, breastfeeding needs to be protected and encour-
aged (41) because the benefits of breastfeeding far outweigh
potential adverse health effects of exposure to POPs at the level
that is observed in the general population. Surveys of POPs in
human milk and the presentation of data in the form of EH indi-
cators will also ultimately promote breastfeeding through helping
to eliminate these chemicals from breast milk. As modern
detection techniques are capable of measuring very small quantities
of chemicals, communication of biomonitoring results needs to
include careful interpretation of the data, and potential health
effects associated with the observed levels of exposure.
For example, biomonitoring of PBDEs was effective in
demonstrating steeply increasing trends in body burden and infant
exposure through breast milk, stimulating research on health effects
of these chemicals and prompting policy actions to address the
public health concern. Due to the ongoing dispute regarding
the use of PBDEs and their potential substitutes, and peak exposure
levels in the most vulnerable subpopulation, the infants, providing
more detailed characterization of milk concentrations of PBDEs
across the WHO European Region remains a priority issue.
The levels of PBDEs in humans exhibit much greater interper-
sonal variability than levels of other POPs (67), which limits
the interpretation of the pooled sample data. Measuring these com-
pounds in individual samples is preferable but subject to limitations
due to the substantial costs and sample volume required for such
tests. A major source of individual level data on PBDE body burden in
the USA is the NHANES study (22). The lipid adjusted blood levels
from NHANES and breast milk data can be compared directly
because concentrations of PBDEs are similar in all lipid compart-
ments (66). However, different recruitment criteria and data presen-
tation formats make quantitative comparisons problematic. Despite
these limitations, the NHANES data clearly show that the body
burden in the USA remains substantially higher than in Sweden despite the phase-out of penta-BDE and octa-BDE production in 2004 (http://www.epa.gov/oppt/pbde/).
A recent review of the burden of disease due to chemicals conducted by WHO (72) noted a marked declining trend in blood lead
levels, and in the estimated burden of disease due to lead exposure, associated with the phasing out of leaded gasoline. While the declines
in exposure to dioxins and many other POPs have also been
pronounced, quantifying the burden of disease due to these com-
pounds remains a challenge due to poorly characterized dose–response relationships, diverse and nonspecific health effects, as well
as substantial time lags between exposure and measurable
health outcomes.

References

1. Muñoz B, Albores A (2010) The role of molecular biology in the biomonitoring of human exposure to chemicals. Int J Mol Sci 11(11):4511–4525
2. Decordier I, Papine A, Vande Loock K et al (2011) Automated image analysis of micronuclei by IMSTAR for biomonitoring. Mutagenesis 26(1):163–168
3. Rossnerova A, Spatova M, Schunck C et al (2011) Automated scoring of lymphocyte micronuclei by the MetaSystems Metafer image cytometry system and its application in studies of human mutagen sensitivity and biodosimetry of genotoxin exposure. Mutagenesis 26(1):169–175
4. McGeehin MA, Qualters JR, Niskar AS (2004) National environmental public health tracking program: bridging the information gap. Environ Health Perspect 112(14):1409–1413
5. Esteban M, Castaño A (2009) Non-invasive matrices in human biomonitoring: a review. Environ Int 35(2):438–449
6. Paustenbach D, Galbraith D (2006) Biomonitoring and biomarkers: exposure assessment will never be the same. Environ Health Perspect 114(8):1143–1149
7. Clewell HJ, Tan YM, Campbell JL, Andersen ME (2008) Quantitative interpretation of human biomonitoring data. Toxicol Appl Pharmacol 231(1):122–133
8. Sanders T, Liu Y, Buchner V et al (2009) Neurotoxic effects and biomarkers of lead exposure: a review. Rev Environ Health 24(1):15–45
9. Xu F, Sternberg MR, Markowitz LE (2010) Men who have sex with men in the United States: demographic and behavioral characteristics and prevalence of HIV and HSV-2 infection: results from National Health and Nutrition Examination Survey 2001–2006. Sex Transm Dis 37(6):399–405
10. Bate SL, Dollard SC, Cannon MJ (2010) Cytomegalovirus seroprevalence in the United States: the national health and nutrition examination surveys, 1988–2004. Clin Infect Dis 50(11):1439–1447
11. Markowitz LE, Sternberg M, Dunne EF et al (2009) Seroprevalence of human papillomavirus types 6, 11, 16, and 18 in the United States: National Health and Nutrition Examination Survey 2003–2004. J Infect Dis 200(7):1059–1067
12. Brenner H, Berg G, Lappus N et al (1999) Alcohol consumption and Helicobacter pylori infection: results from the German National Health and Nutrition Survey. Epidemiology 10(3):214–218
13. Jones JL, Kruszon-Moran D, Sanders-Lewis K et al (2007) Toxoplasma gondii infection in the United States, 1999–2004, decline from the prior decade. Am J Trop Med Hyg 77(3):405–410
14. Visness CM, London SJ, Daniels JL et al (2009) Association of obesity with IgE levels and allergy symptoms in children and adolescents: results from the National Health and Nutrition Examination Survey 2005–2006. J Allergy Clin Immunol 123(5):1163–1169, 1169.e1-4
15. Branum AM, Lukacs SL (2009) Food allergy among children in the United States. Pediatrics 124(6):1549–1555
16. Miller RL, Garfinkel R, Lendor C et al (2010) Polycyclic aromatic hydrocarbon metabolite levels and pediatric allergy and asthma in an inner-city cohort. Pediatr Allergy Immunol 21(2 Pt 1):260–267
17. Hui LL, Hedley AJ, Nelson EA et al (2007) Agreement between breast milk dioxin levels by CALUX bioassay and chemical analysis in a population survey in Hong Kong. Chemosphere 69(8):1287–1294
18. Genuis SJ, Birkholz D, Rodushkin I et al (2011) Blood, urine, and sweat (BUS) study: monitoring and elimination of bioaccumulated toxic elements. Arch Environ Contam Toxicol 61(2):344–357
19. Hu H, Rabinowitz M, Smith D (1998) Bone lead as a biological marker in epidemiologic studies of chronic toxicity: conceptual paradigms. Environ Health Perspect 106(1):1–8
20. Morris-Cunnington MC, Edmunds WJ, Miller E et al (2004) A population-based seroprevalence study of hepatitis A virus using oral fluid in England and Wales. Am J Epidemiol 159(8):786–794
21. Griffin SM, Chen IM, Fout GS et al (2011) Development of a multiplex microsphere immunoassay for the quantitation of salivary antibody responses to selected waterborne pathogens. J Immunol Methods 364(1–2):83–93
22. Centers for Disease Control and Prevention (2009) Fourth National Report on Human Exposure to Environmental Chemicals. Atlanta, GA, USA
23. Bocca B, Mattei D, Pino A et al (2010) Italian network for human biomonitoring of metals: preliminary results from two regions. Ann Ist Super Sanita 46(3):259–265
24. Haines DA, Arbuckle TE, Lye E et al (2011) Reporting results of human biomonitoring of environmental chemicals to study participants: a comparison of approaches followed in two Canadian studies. J Epidemiol Community Health 65(3):191–198
25. Bonefeld-Jorgensen EC (2010) Biomonitoring in Greenland: human biomarkers of exposure and effects – a short review. Rural Remote Health 10(2):1362
26. Rappaport SM (2011) Implications of the exposome for exposure science. J Expo Sci Environ Epidemiol 21(1):5–9
27. Baccarelli A, Giacomini SM, Corbetta C et al (2008) Neonatal thyroid function in Seveso 25 years after maternal exposure to dioxin. PLoS Med 5(7):e161
28. Boogaard PJ, Hays SM, Aylward LL (2011) Human biomonitoring as a pragmatic tool to support health risk management of chemicals – examples under the EU REACH programme. Regul Toxicol Pharmacol 59(1):125–132
29. Aylward LL, Lakind JS, Hays SM (2008) Derivation of biomonitoring equivalent (BE) values for 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and related compounds: a screening tool for interpretation of biomonitoring data in a risk assessment context. J Toxicol Environ Health A 71(22):1499–1508
30. Ruiz P, Fowler BA, Osterloh JD et al (2010) Physiologically based pharmacokinetic (PBPK) tool kit for environmental pollutants – metals. SAR QSAR Environ Res 21(7–8):603–618
31. Ritter R, Scheringer M, Macleod M et al (2011) Intrinsic human elimination half-lives of polychlorinated biphenyls derived from the temporal evolution of cross-sectional biomonitoring data from the UK population. Environ Health Perspect 119(2):225–231
32. Heinzl H, Mittlbock M, Edler L (2007) On the translation of uncertainty from toxicokinetic to toxicodynamic models – the TCDD example. Chemosphere 67(9):S365–S374
33. Aylward LL, Brunet RC, Carrier G et al (2005) Concentration-dependent TCDD elimination kinetics in humans: toxicokinetic modeling for moderately to highly exposed adults from Seveso, Italy, and Vienna, Austria, and impact on dose estimates for the NIOSH cohort. J Expo Anal Environ Epidemiol 15(1):51–65
34. Aylward LL, Brunet RC, Starr TB et al (2005) Exposure reconstruction for the TCDD-exposed NIOSH cohort using a concentration- and age-dependent model of elimination. Risk Anal 25(4):945–956
35. Cheng H, Aylward L, Beall C et al (2006) TCDD exposure-response analysis and risk assessment. Risk Anal 26(4):1059–1071
36. Corvalan C, Kjellstrom T, Smith KR (1999) Health, environment and sustainable development: identifying links and indicators to promote action. Epidemiology 10(5):656–660
37. Pond K, Kim R, Carroquino MJ et al (2007) Workgroup report: developing environmental health indicators for European children: World Health Organization Working Group. Environ Health Perspect 115(9):1376–1382
38. World Health Organization (2007) Children's health and the environment in Europe. A baseline assessment report. World Health Organization Regional Office for Europe, Copenhagen. http://www.euro.who.int/__data/assets/pdf_file/0009/96750/E90767.pdf. Accessed 04 Jul 2012
39. World Health Organization (2010) Health and environment in Europe: progress assessment. World Health Organization Regional Office for Europe, Copenhagen. http://www.euro.who.int/__data/assets/pdf_file/0010/96463/E93556.pdf. Accessed 04 Jul 2012
40. World Health Organization (2010) Parma declaration on environment and health. EUR/55934/5.1. World Health Organization Regional Office for Europe, Copenhagen. http://www.euro.who.int/en/home/conferences/fifth-ministerial-conference-on-environment-and-health/documentation/parma-declaration-on-environment-and-health. Accessed 5 Dec 2010
41. World Health Organization (2007) Fourth WHO-coordinated survey of human milk for persistent organic pollutants in cooperation with UNEP. Guidelines for developing a national protocol. World Health Organization, Geneva. http://www.who.int/foodsafety/chem/POPprotocol.pdf. Accessed 04 Jul 2012
42. Malisch R, Moy G (2006) Fourth round of WHO-coordinated exposure studies on levels of persistent organic pollutants in human milk. Organohalogen Compd 68:1627–1630
43. Malisch R, Kypke K, Kotz A et al (2010) WHO/UNEP-coordinated exposure study (2008–2009) on levels of persistent organic pollutants (POPs) in human milk with regard to the global monitoring plan. 30th International symposium on halogenated organic pollutants, San Antonio, Texas, USA
44. Colles A, Koppen G, Hanot V et al (2008) Fourth WHO-coordinated survey of human milk for persistent organic pollutants (POPs): Belgian results. Chemosphere 73(6):907–914
45. Lunden A, Noren K (1998) Polychlorinated naphthalenes and other organochlorine contaminants in Swedish human milk, 1972–1992. Arch Environ Contam Toxicol 34(4):414–423
46. Noren K, Meironyte D (2000) Certain organochlorine and organobromine contaminants in Swedish human milk in perspective of past 20–30 years. Chemosphere 40(9–11):1111–1123
47. Lind Y, Darnerud PO, Atuma S et al (2003) Polybrominated diphenyl ethers in breast milk from Uppsala County, Sweden. Environ Res 93(2):186–194
48. Foster WG, Maharaj-Briceño S, Cyr DG (2010) Dioxin-induced changes in epididymal sperm count and spermatogenesis. Environ Health Perspect 118(4):458–464
49. World Health Organization (2002) Evaluation of certain food additives and contaminants: fifty-seventh report of the joint FAO/WHO expert committee on food additives. WHO Technical Report Series 909. World Health Organization, Geneva. http://whqlibdoc.who.int/trs/WHO_TRS_909.pdf. Accessed 04 Jul 2012
50. Svechnikov K, Izzo G, Landreh L et al (2010) Endocrine disruptors and Leydig cell function. J Biomed Biotechnol pii:684504
51. Mocarelli P, Gerthoux PM, Patterson DG Jr et al (2008) Dioxin exposure, from infancy through puberty, produces endocrine disruption and affects human semen quality. Environ Health Perspect 116(1):70–77
52. Mates JM, Segura JA, Alonso FJ et al (2010) Roles of dioxins and heavy metals in cancer and neurological diseases using ROS-mediated mechanisms. Free Radic Biol Med 49(9):1328–1341
53. Joint WHO/Convention task force on the health aspects of air pollution (2003) Health risks of persistent organic pollutants from long-range transboundary air pollution. World Health Organization, Geneva
54. European Commission, Health and Consumer Protection Directorate-General (2000) Opinion of the Scientific Committee on animal nutrition on the dioxin contamination of feedstuffs and their contribution to the contamination of food of animal origin. http://ec.europa.eu/food/committees/scientific/out55_en.pdf. Accessed 10 March 2011
55. Patandin S, Dagnelie PC, Mulder PG et al (1999) Dietary exposure to polychlorinated biphenyls and dioxins from infancy until adulthood: a comparison between breast-feeding, toddler, and long-term exposure. Environ Health Perspect 107(1):45–51
56. Van den Berg M, Birnbaum L, Bosveld AT et al (1998) Toxic equivalency factors (TEFs) for PCBs, PCDDs, PCDFs for humans and wildlife. Environ Health Perspect 106(12):775–792
57. Van den Berg M, Birnbaum LS, Denison M et al (2006) The 2005 World Health Organization reevaluation of human and mammalian toxic equivalency factors for dioxins and dioxin-like compounds. Toxicol Sci 93(2):223–241
58. Aylward LL, Lamb JC, Lewis SC (2005) Issues in risk assessment for developmental effects of 2,3,7,8-tetrachlorodibenzo-p-dioxin and related compounds. Toxicol Sci 87(1):3–10
59. Gray MN, Aylward LL, Keenan RE (2006) Relative cancer potencies of selected dioxin-like compounds on a body-burden basis: comparison to current toxic equivalency factors (TEFs). J Toxicol Environ Health A 69(10):907–917
60. World Health Organization (2000) Assessment of the health risk of dioxin: re-evaluation of the tolerable daily intake (TDI): executive summary. Food Add Contam 17:223–240
61. De Mul A, Bakker MI, Zeilmaker MJ et al (2004) Dietary exposure to dioxins and dioxin-like PCBs in The Netherlands anno 2004. Regul Toxicol Pharmacol 51(3):278–287
62. Gies A, Neumeier G, Rappolder M et al (2007) Risk assessment of dioxins and dioxin-like PCBs in food – comments by the German Federal Environmental Agency. Chemosphere 67(9):S344–S349
63. Kiviranta H, Tuomisto JT, Tuomisto J et al (2005) Polychlorinated dibenzo-p-dioxins, dibenzofurans, and biphenyls in the general population in Finland. Chemosphere 60(7):854–869
64. Kiviranta H, Vartiainen T, Tuomisto J (2002) Polychlorinated dibenzo-p-dioxins, dibenzofurans, and biphenyls in fishermen in Finland. Environ Health Perspect 110(4):355–361
65. Dewey KG, Heinig MJ, Nommsen LA et al (1991) Adequacy of energy intake among breast-fed infants in the DARLING study: relationships to growth velocity, morbidity, and activity levels. Davis area research on lactation, infant nutrition and growth. J Pediatr 119(4):538–547
66. Birnbaum LS, Cohen Hubal EA (2006) Polybrominated diphenyl ethers: a case study for using biomonitoring data to address risk assessment questions. Environ Health Perspect 114(11):1770–1775
67. Schecter A, Pavuk M, Papke O et al (2003) Polybrominated diphenyl ethers (PBDEs) in U.S. mothers' milk. Environ Health Perspect 111(14):1723–1729
68. Main KM, Kiviranta H, Virtanen HE et al (2007) Flame retardants in placenta and breast milk and cryptorchidism in newborn boys. Environ Health Perspect 115(10):1519–1526
69. US EPA (2008) Final integrated risk information system assessment for decabromodiphenyl ether (BDE-209). http://www.epa.gov/ncea/iris/subst/0035.htm. Accessed 10 Jan 2011
70. Directive 2003/11/EC of the European Parliament and of the Council of 6 February 2003 amending for the 24th time Council Directive 76/769/EEC relating to restrictions on the marketing and use of certain dangerous substances and preparations (pentabromodiphenyl ether, octabromodiphenyl ether) (2003) Official Journal of the European Union. http://eur-lex.europa.eu/LexUriServ/site/en/oj/2003/l_042/l_04220030215en00450046.pdf. Accessed 04 Jul 2012
71. Pakalin S, Cole T, Steinkellner J et al (2007) Review on production processes of decabromodiphenyl ether (decaBDE) used in polymeric applications in electrical and electronic equipment, and assessment of the availability of potential alternatives to decaBDE. European Commission, Directorate-General Joint Research Centre, Institute of Health and Consumer Protection, European Chemicals Bureau, Ispra, Italy. EUR 22693 EN
72. Pruss-Ustun A, Vickers C, Haefliger P et al (2011) Knowns and unknowns on burden of disease due to chemicals: a systematic review. Environ Health 10(1):9
Part IV

Modeling for Regulatory Purposes (Risk and Safety Assessment)
Chapter 13

Modeling for Regulatory Purposes (Risk and Safety Assessment)
Hisham El-Masri

Abstract
Chemicals provide many key building blocks that are converted into end-use products or used in industrial
processes to make products that benefit society. Ensuring the safety of chemicals and their associated
products is a key regulatory mission. Current processes and procedures for evaluating and assessing the
impact of chemicals on human health, wildlife, and the environment were, in general, designed decades ago.
These procedures depend on generation of relevant scientific knowledge in the laboratory and interpreta-
tion of this knowledge to refine our understanding of the related potential health risks. In practice, this
often means that estimates of dose–response and time-course behaviors for apical toxic effects are needed as
a function of relevant levels of exposure. In many situations, these experimentally determined functions are
constructed using relatively high doses in experimental animals. In the absence of experimental data, the
application of computational modeling is necessary to extrapolate risk or safety guidance values for
human exposures at low but environmentally relevant levels.

Key words: Modeling, PBPK, BMD, NOAEL, RfD, RfC, Uncertainty

1. Dose/Response Relationships
A dose–response relationship is an association between dose and the incidence of a defined biological effect in an exposed population, usually expressed as a percentage. Dose is defined as the total quantity
of a substance administered to, taken up, or absorbed by an organ-
ism, organ, or tissue and can be measured with in vitro or in vivo
experiments. When measured in vivo, doses are usually expressed as
milligrams per kilogram of body weight in the case of oral exposure,
as parts per million or billion (ppm or ppb) in cases of inhalation
exposure, or milligrams per square meter in cases of dermal expo-
sure. In vitro dosing units depend on the experiment and are
usually described as concentrations (e.g., ppm or mg/L) in the
medium where the experiment is conducted.
Effect can be defined as a graduated biological change in a continuum of changes that can be quantitatively measured. The measured effect in the dose–response relationship also has to be independent of other exposures. The measured effect can quantitatively fall within one of the following:

- Dichotomous (quantal): a dichotomous effect may be reported as either the presence or absence of an effect or change.
- Continuous: a continuous effect may be reported as an actual measurement or as a contrast (absolute change from control or relative change from control).
- Categorical: for categorical data, the responses in a treatment group are often characterized in terms of the severity of effect (mild, moderate, or severe histological change).
Once a dose–response relationship is established, it can be used
to estimate safe or risk guidance levels of chemicals by estimating a
reference dose (RfD) or reference concentration (RfC). The RfD
(or RfC), if derived, provides quantitative information for use in
risk assessments for health effects known or assumed to be pro-
duced through a nonlinear (presumed threshold) mode of action.
The RfD (expressed in units of mg/kg day) is defined as an estimate
(with uncertainty spanning perhaps an order of magnitude) of a
daily exposure to the human population (including sensitive sub-
groups) that is likely to be without an appreciable risk of deleterious
effects during a lifetime. The inhalation RfC (expressed in units of
mg/m3) is analogous to the oral RfD, but provides a continuous
inhalation exposure estimate. The inhalation RfC considers toxic
effects for both the respiratory system (portal-of-entry) and for
effects peripheral to the respiratory system (extrarespiratory or
systemic effects). Reference values are generally derived for chronic
exposures (up to a lifetime), but may also be derived for acute
(≤24 h), short-term (>24 h up to 30 days), and subchronic (>30 days up to 10% of lifetime) exposure durations, all of which are
derived based on an assumption of continuous exposure through-
out the duration specified. Unless specified otherwise, the RfD and
RfC are derived for chronic exposure duration.
RfD or RfC is calculated based on the following equation:

RfD or RfC = POD[HEC] / (UF × MF),   (1)

where:
POD[HEC] = point of departure (NOAEL, LOAEL, or BMC) dosimetrically adjusted to a human equivalent concentration (HEC),
UF = uncertainty factors to account for the extrapolations associated with the POD (i.e., interspecies differences in sensitivity, intraspecies extrapolations, subchronic to chronic extrapolations, NOAEL to LOAEL extrapolation, and incompleteness of the database), and
MF = modifying factor to account for scientific uncertainties in the study chosen as the basis for RfC derivation.
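A direct transcription of Eq. 1, with a hypothetical NOAEL-based POD and a composite uncertainty factor chosen purely for illustration:

    def reference_value(pod_hec, uf, mf=1.0):
        """Eq. 1: RfD or RfC = POD[HEC] / (UF x MF)."""
        return pod_hec / (uf * mf)

    # Hypothetical derivation: POD[HEC] of 10 mg/kg-day and a composite
    # UF of 100 (10 interspecies x 10 intraspecies), with MF = 1.
    rfd = reference_value(pod_hec=10.0, uf=100.0)
    print(f"RfD = {rfd} mg/kg-day")  # 0.1 mg/kg-day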
In traditional health risk assessment, RfD/RfC estimates are
usually determined for noncancer effects. Carcinogenicity assess-
ment is based on information of the carcinogenic hazard potential
of the substance in question and the derivation of quantitative
estimates of risk from oral and inhalation exposure. The information
includes a weight-of-evidence judgment of the likelihood that the
agent is a human carcinogen and the conditions under which the
carcinogenic effects may be expressed. Quantitative risk estimates
may be derived from the application of a low-dose extrapolation
procedure using information from dose–response relationships. If
derived, the oral slope factor is a plausible upper bound on the
estimate of risk per mg/kg day of oral exposure. Similarly, an
inhalation unit risk is a plausible upper bound on the estimate of
risk per mg/m3 air breathed.

1.1. Point of Departure Estimates

The starting point for RfD/RfC calculation is the identification of the POD for the critical effect in a key study. Subsequent steps
involve the following: (1) adjustment for the difference in duration
between experimental procedure (e.g., 6 h) and expected human
exposure (24 h), (2) calculation of HEC based on dosimetric
adjustments or allometric scaling using body weights or surface
areas, and (3) application of uncertainty/modifying factors.
Ideally, the POD used in the RfD/RfC calculation process
should be the no observed adverse effect level (NOAEL), lowest observed adverse effect level (LOAEL), or benchmark dose (BMD). The NOAEL is identified as the highest tested dose at which no statistically significant effect is observed. The LOAEL is identified as the lowest tested dose with a statistically significant effect. A BMD is defined as the
statistical lower confidence limit of the dose producing a predeter-
mined level of change in adverse response compared with the
response in untreated animals (the benchmark response, BMR)
(1). BMD is determined by modeling a dose–response curve in
the region of the relationship where biologically observable data
are available. The BMR is generally set near the lower limit of
responses that can be measured directly in animal experiments of
typical size.
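As an illustration of the BMD concept (not of any particular regulatory software), the sketch below fits a hypothetical quantal dataset with a log-logistic model and inverts the fit at a benchmark response of 10% extra risk. Regulatory practice reports the BMDL, the statistical lower confidence limit on this central estimate, which would additionally require profile likelihood or bootstrap calculations:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical quantal dose-response data (fraction responding).
    dose = np.array([0.0, 10.0, 30.0, 100.0])       # mg/kg-day
    incidence = np.array([0.02, 0.10, 0.30, 0.70])

    def log_logistic(d, background, slope, ln_d50):
        """Background plus log-logistic extra risk; d50 kept positive."""
        d50 = np.exp(ln_d50)
        extra = 1.0 / (1.0 + (d50 / np.maximum(d, 1e-9)) ** slope)
        return background + (1.0 - background) * extra

    p0 = [0.02, 1.0, np.log(50.0)]
    (background, slope, ln_d50), _ = curve_fit(log_logistic, dose,
                                               incidence, p0=p0)

    # Invert the fitted model at a benchmark response of 10% extra risk.
    bmr = 0.10
    bmd = np.exp(ln_d50) * (bmr / (1.0 - bmr)) ** (1.0 / slope)
    print(f"central BMD estimate at 10% extra risk: {bmd:.1f} mg/kg-day")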

1.2. Modeling Dose–Response Data

The interrelationship among dose, time of exposure, and response is fundamental to the quantitative analysis of the dose–response relationship. Standard dose–response models are generally based on Haber's law or its generalization. Dose (or concentration) and
time of exposure are both important in determining the intensity of
response to a toxic agent. For a given response intensity, Haber's rule (c × t = K) has been proposed as a law of toxicology, but this rule is just one special case of a more general relationship, c × t^m = K, where c is the concentration, t is time of exposure, and m is a
variable used to fit data. For noncarcinogens, m generally has a
value between 0 and 1, whereas for carcinogens, m is usually
between 1 and 5. The absence of a universal value for m, or one
that is generally applicable to different classes of toxicants, means it is not yet possible to develop a Haber's-type rule with which to extrapolate successfully between exposure scenarios (2). Another disadvantage of using Haber's rule is its dependence on expo-
sure or administered dose. Using administered dose to derive risks
or safe levels bypasses many critical physiological processes, such as absorption (which may be related to bioavailability of the chemical), distribution (which could be impacted by protein binding in blood), metabolism (which may result in either bioactivation or detoxification of the parent chemical) and excretion (which may reduce the amount of chemical in the target tissue where the response is
observed).
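The generalized rule is easy to explore numerically. The sketch below anchors a hypothetical effect index at 100 ppm for 1 h and derives the 8-h equivalent concentration for two assumed values of m:

    def effect_index(c, t, m=1.0):
        """Generalized Haber's rule: K = c * t**m."""
        return c * t ** m

    def equivalent_concentration(k, t, m=1.0):
        """Concentration producing the same index K at duration t."""
        return k / t ** m

    # Hypothetical anchor: 100 ppm for 1 h; compare the classic rule
    # (m = 1) with an assumed sublinear exponent (m = 0.5).
    for m in (1.0, 0.5):
        k = effect_index(100.0, 1.0, m)
        c8 = equivalent_concentration(k, 8.0, m)
        print(f"m = {m}: 8-h equivalent concentration = {c8:.1f} ppm")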
Target tissue dose estimation improves the characterization of
dose–response relationships, which are critical components in asses-
sing environmental health risks. This improvement results from a
direct relationship between internal dosimetry and biological
response. When relevant and reliable estimates of internal dose of a
compound or a key metabolite are available, the results of toxicology
studies can often be better understood and evaluated in terms of the
internal doses. Additionally, understanding absorption, distribu-
tion, metabolism and excretion (ADME) of a chemical leads to a
more complete use of biological and toxicological data to support
route-to-route and animal-to-human extrapolation of dose–response information. Mathematical estimation of target tissue dose
is carried out using pharmacokinetic modeling. Pharmacokinetic
models are classified as compartmental and noncompartmental
(Renwick 1994) Compartmental models that include physiological
descriptions of biological tissues and processes describing ADME of
chemicals are usually described as physiologically based pharmaco-
kinetic (PBPK) models. In PBPK models, compartments corre-
spond more closely to actual anatomical structures, defined with
respect to their volumes, blood flows, chemical binding (partition-
ing) characteristics, and the ability to metabolize or excrete the
compounds of interest (Fig. 1). A PBPK model is made up of set
of mathematical equations describing in vivo ADME of the chemical
in question. Each mathematical equation is based on a form (i.e.,
first order, second order, exponential, Hills function), and quanti-
tative parameter estimation (Fig. 1). The form of the mathematical
equation is based on understanding of the biology of the chemical.
Fig. 1. Components of a PBPK model equation.

For example, a saturable Michaelis–Menten (MM) equation for metabolism is based on receptor binding mechanisms. The MM equation can be modified to allow for biological processes of inhibition or induction of metabolism. Another example is the Hill equation, which is also based on a receptor binding mechanism but includes parameters representing the extent of positive or negative biological cooperativity in the binding process. Both
equations have parameters that need to be identified quantitatively
so that the overall model can be used for simulating and predicting
data. Because the kinetic parameters of these models reflect tissue
blood flows, partitioning, and biochemical constants, these models
are more readily scaled from one animal species to another (3).
Adding a mathematical description of the toxic mechanism or mode of action of chemicals to PBPK models results in biologically based dose–response (BBDR) models. BBDR modeling integrates exposure, dose, mode of action, and apical effect data to generate predictions of dose–response and time-course behaviors for the apical
effects. These models can be valuable in determining safe or risk
guidance levels based on mode of action or target tissue, especially
when animal data is extrapolated to human exposures at low but
relevant environmental levels.
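To make the structure of such equations concrete, the sketch below integrates a deliberately minimal one-compartment analogue with saturable MM elimination; a real PBPK model would carry one such mass-balance equation per tissue compartment, and every parameter value here is assumed for illustration:

    # Assumed parameters for a single well-stirred compartment with
    # Michaelis-Menten elimination, integrated by explicit Euler.
    vmax = 5.0      # maximum metabolic rate (mg/h)
    km = 1.0        # Michaelis constant (mg/L)
    volume = 40.0   # distribution volume (L)
    dose = 100.0    # bolus dose (mg)

    dt, t_end = 0.01, 24.0          # time step and horizon (h)
    c = dose / volume               # initial concentration (mg/L)
    for _ in range(int(t_end / dt)):
        rate = vmax * c / (km + c)  # MM metabolism (mg/h)
        c -= rate / volume * dt     # mass balance on the compartment
    print(f"concentration after {t_end:.0f} h: {c:.3f} mg/L")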
The quality of predictions from mechanistically based models (PBPK or BBDR) reflects the quality of the datasets that are used as inputs to the model. In many situations, basic understanding and data describing mechanistic information on chemical toxicity are lacking or inadequate. For these situations, computational modeling in the form of quantitative structure–activity relationships (QSAR) is
usually performed to predict the toxicity of chemicals. QSAR meth-
ods rely on structural similarities between chemicals to predict
toxicity based on statistical analysis of experimental databases.
Therefore, QSAR is used to screen chemicals for toxic end points
when experimental information about them is not available.

2. Balancing and
Judging Uncertainty
With the understanding that there is no complete mechanistic
model that can accurately predict and describe all relevant ADME
and biological processes of a chemical, regulatory agencies in charge
of estimating risk or safe levels are usually faced with judging and
balancing the uncertainty inherent in every computational model.
Quantitative uncertainty analysis in the form of Bayesian analysis of
models is time-consuming and demanding of both resources and
data. In many situations, scientists address model uncertainty using
qualitative criteria based on their in-depth evaluation of the structure
and parameterization of the model. An additional step that adds
confidence in model behavior is demonstrating the model's ability to
simulate data that were not used in its calibration (parameter
determination by fitting to data).
Table 1 illustrates some examples that can add to or subtract from
the confidence of chemical managers in applying a computational
model to estimate risk and safe guidance levels.

Table 1
Information or items that may contribute to or reduce model uncertainty

Contribute to model uncertainty              Reduce model uncertainty
--------------------------------------------------------------------------
Biological basis of model development        Model is based on known biological
is questionable:                             mechanisms:
  - Flow/diffusion or active transport         - Toxic metabolites
    mechanisms could all fit data              - Route of exposure
  - Extrahepatic metabolism (other             - Physiological determinants of
    tissues or by other enzymes)                 tissue dose
  - Species differences

Parameter identifiability                    Parameters have biological relevance:
                                             they can be tested or calculated

Calibration data are not sensitive to        Calibration data are helpful in
parameters                                   identifying parameters

Model behavior outside calibration           Model is able to simulate data
data is questionable:                        outside of calibration range
  - Acute versus chronic exposures

References

1. Crump K, Viren J, Silvers A, Clewell H 3rd, Gearhart J, Shipp A (1995) Reanalysis of dose-response data from the Iraqi methylmercury poisoning episode. Risk Anal 15:523–532
2. Bunce NJ, Remillard RB (2003) Haber's rule: the search for quantitative relationships in toxicology. Hum Ecol Risk Assess 9:1547–1559
3. Dedrick RL, Forrester DD (1973) Blood flow limitations in interpreting Michaelis constants for ethanol oxidation in vivo. Biochem Pharmacol 22:1133–1140
Chapter 14

Developmental Toxicity Prediction


Raghuraman Venkatapathy and Nina Ching Y. Wang

Abstract
Developmental toxicity may be estimated using commercial and noncommercial software that is already
available in the market and/or literature, or models may be built from scratch using both commercial and
noncommercial software packages. In this chapter, commonly available software programs that can predict
the developmental toxicity of chemicals are described. In addition, a method for developing qualitative
structure–activity relationship (SAR) models to predict the developmental toxicity of chemicals qualitatively
(yes/no prediction) and quantitative structure–activity relationship (QSAR) models to predict quantitative
estimates (e.g., LOAEL) of developmental toxicants is also described in this chapter. Additional information
described in this chapter includes methods to predict physicochemical properties of chemicals that can be
used as descriptor variables in the model-building process, statistical methods that can be used to build QSAR
models, as well as methods to validate the models that are developed. Most of the methods described in this
chapter can be used to develop models for health endpoints other than developmental toxicity as well.

Key words: Developmental toxicity, Quantitative structure–activity relationships, QSAR,
Structure–activity relationships, Toxicity prediction, Modeling, Model development, Statistical methods,
Model validation, Applicability domain

1. Introduction

The chemical industry and regulatory agencies around the world


spend millions of dollars in testing and assessing the health risks
associated with exposure to chemicals. The risk assessment process
is currently conducted using experimental data. Oftentimes, such
data are unavailable in the literature, thus making it difficult for
these regulatory agencies and research organizations to make deci-
sions regarding exposure guidelines for environmental contami-
nants. In addition, many countries and/or agencies are starting to
ban the use of animals in toxicity testing (1). For example, the
European Union adopted the 7th Amendment to the Cosmetic
Directive, which bans animal testing (and sales) of cosmetic
finished products and ingredients (2). In such instances, the ability
to predict potential health hazards from chemical exposure both
accurately and quickly would save not only time but also valuable
resources, which could be invested more wisely. This can be
achieved by the development of computational or predictive
toxicological approaches such as Quantitative Structure–Activity
Relationship (QSAR) models.
Predictive toxicological approaches provide the means to esti-
mate the developmental toxicity of a wide variety of chemicals in the
absence of experimental toxicity data. Such approaches include the
following (3): (1) qualitative structure–activity relationship (SAR)
methods, (2) QSAR methods, (3) expert systems, (4) biologically
based models such as 2D or 3D receptor modeling, comparative
molecular field analysis (CoMFA), hologram QSAR (HQSAR),
binding, and ligand SAR, and (5) integrative models that incorpo-
rate or combine both chemical and biological information.
In general, these toxicological approaches describe correlations
between various physical and chemical properties of a chemical
(usually referred to as descriptors) and their observed or predicted
biological activities. Such relationships assume a common mecha-
nism behind the biological activity of a structurally/functionally
related set of chemicals (4). Hence, differences in the chemical
structures that induce the same biological effect, such as develop-
mental toxicity, and their associated descriptors can then be
mapped to changes in activity through mathematical equations.
The resulting equation can then be used to calculate the develop-
mental toxicity of new chemicals.
Very few QSAR models to predict the developmental toxicity
potential of chemicals have been developed by research groups in
the literature. Devillers et al. (5, 6) developed a QSAR model
using lipophilicity, molar refractivity, hydrogen-bonding donor
and acceptor capability as descriptors to study the developmental
toxicity potential of 30 glycols, glycol ethers, and xylenes on
Hydra attenuata. Kavlock (7) developed mechanistic QSAR mod-
els to study developmental effects on Sprague-Dawley rats due to
a single exposure to 27 substituted phenols using octanol–water
partition coefficient, molar refractivity and Hammett sigma con-
stant as descriptors. Richard et al. (8) studied the mechanism
behind mammalian embryo toxicity due to exposure to 10 haloa-
cetic acids through QSARs. Vedani et al. (9) used genetic algo-
rithms and three-dimensional-QSAR (3D-QSAR) to predict the
developmental toxicity of 76 dibenzofurans, dibenzodioxins, and
biphenyls. Matthews et al. (10–12) developed and validated a set
of QSAR models to predict reproductive and developmental tox-
icity of chemicals using MC4PC software. Recently, Cassano et al.
(13) have developed two QSAR models using different statistical
models based on a random forest algorithm and an adaptive
fuzzy partition algorithm to predict the developmental toxicity
of chemicals.

Commercial models that can predict the developmental toxicity
of chemicals are also available in the market today. These models are
either qualitative in nature or expert system based. In the former
case, the models are able to estimate whether a given chemical is
positive, negative, or indeterminate with respect to developmental
toxicity potential. In the latter case, experts in teratology review the
experimental data on a case by case basis and make judgments
regarding the developmental toxicity potential for any given chem-
ical. Commercial QSAR models that can predict the developmental
toxicity of chemicals include TOPKAT (Accelrys Inc., Burlington,
MA), a fully qualitative model; DEREK for Windows (Lhasa Ltd.,
University of Leeds, UK), an expert-system-based model; Hazard-
Expert Pro (Compudrug Ltd., Budapest, Hungary), an expert-
system-based model; CAESAR (Caesar Consortium, Milan,
Italy), a fully qualitative model; and MultiCASE (MultiCASE
Inc., Cleveland, OH), a semiqualitative and expert-system-based
model. No commercial QSAR model that predicts quantitative
estimates of developmental toxicity (in mg/kg body weight/day
units or equivalent) is available in the literature.
QSAR models that can provide qualitative or quantitative tox-
icity estimates for developmental toxicity will be used widely across
EPA, other federal agencies, and research organizations to incorpo-
rate quantitative estimates into their decision making process.
In addition, several drugs that had previously been approved by
regulatory agencies have been withdrawn in recent years because
of safety concerns. The use of in silico approaches to predicting
safety issues by the pharma industry could have provided meaning-
ful toxicity information early on in the drug discovery process to
both the pharma industry as well as regulatory agencies such as
United States Food and Drug Administration (US FDA), thus
leading to the production of drugs with minimal unintended side-
effects. For example, the US FDA has been engaged in the develop-
ment and evaluation of in silico methods in support of the goals of
the US FDA's Critical Path Initiative for evaluating the safety of a
diverse set of regulated products
(http://www.fda.gov/ScienceResearch/SpecialTopics/CriticalPathInitiative/default.htm).
The US FDA's efforts have been facilitated by agency-approved data-
sharing agreements between government and commercial software
developers, resulting in the development of toxicology-based
QSAR models and software such as MultiCASE, and other knowl-
edge databases. These models also satisfy the guidelines of the
Organisation for Economic Co-operation and Development
(OECD) principles and the list of tests required by Registration,
Evaluation, Authorisation and Restriction of Chemical substances
(REACH) for successful toxicity assessment. Merlot (14) reviews
recent advances in the field of in silico toxicology, along with a
discussion of the reasons behind this increased attention and their
use by industry and regulatory agencies.

2. Materials

2.1. Types of (Q)SAR Models

SARs for predicting the developmental toxicity of chemicals are
either qualitative in nature or expert-system based. In the former
case, there is a qualitative relationship between developmental tox-
icity and a chemical and/or its substructure (such as its one- and
two-atom fragments). A substructure associated with a biological
activity may also be referred to as a structural alert. SARs are
generally able to qualitatively estimate the toxicity of a given chem-
ical as positive, negative or indeterminate. In the case of expert
systems, experts in teratology review the experimental data on a
case-by-case basis and make judgments regarding the developmen-
tal toxicity potential for any given chemical. Expert systems may be
classified as knowledge based (rules are based on human expert
knowledge), induction rule based (based on artificial intelligence,
neural networks, machine learning, or data mining to automatically
derive the rules) or hybrid (rules are initially based on human expert
knowledge; machine can automatically learn new rules).
A QSAR is a mathematical model between a quantifiable
biological activity (such as developmental toxicant or developmen-
tal toxicity LOAEL) and one or more physicochemical properties of
the chemical (also referred to as molecular descriptors) using vari-
ous statistical methods such as regression analysis, principal com-
ponent analysis (PCA) or factor analysis. In general, such
relationships may be classified as either mechanistic or correlative.
Correlative (or statistical) models try to find associations between
molecular descriptors (physicochemical properties) of chemical
structures and developmental toxicological data by statistical
means, which are then used to predict the developmental toxicity
of a test chemical. Mechanistic models, on the other hand, are
developed using human expertise on known teratogenic mechanisms
of action or are limited to a congeneric series of chemicals with the
assumption that congeneric chemicals have similar mechanism(s)
or mode(s) of action.
Other approaches to predicting the developmental toxicities of
chemicals include read-across and chemical-analogues. Read-across
is a nonformalized approach in which developmental toxicity infor-
mation for one or more source chemicals is used to make a toxicity
prediction for another chemical based on a structural or functional
form of similarity (15). Read-across can either be qualitative or
quantitative, depending on whether the data being used
to make the prediction are qualitative or quantitative in nature.
To estimate the developmental toxicity of a given chemical,
read-across can be performed in a one-to-one manner (one analogue
is used to make a single toxicity prediction) or in a many-to-one manner
(two or more analogues are used to make a single toxicity prediction).
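A minimal Python sketch of these two modes of quantitative
read-across follows; the analogue LOAEL values are hypothetical,
and the arithmetic mean is only one possible many-to-one
aggregation rule (read-across in practice also weighs similarity
and expert judgment).

def read_across(analogue_loaels):
    # One-to-one: a single analogue supplies the prediction directly.
    # Many-to-one: here, several analogues are averaged (assumed rule).
    if len(analogue_loaels) == 1:
        return analogue_loaels[0]
    return sum(analogue_loaels) / len(analogue_loaels)

print(read_across([25.0]))        # one-to-one -> 25.0 mg/kg/day
print(read_across([25.0, 40.0]))  # many-to-one -> 32.5 mg/kg/day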

In rare cases, when existing toxicity data are inadequate for
determining a developmental toxicity value for a chemical of inter-
est, a chemical analogue for the test chemical is identified using
either known mechanisms of teratogenicity or using a SAR/QSAR-
based approach. The choice of the appropriate analogue is based on
similarity between the test chemical and the analogue. The toxicity
value of the analogue may then be used as the toxicity value for the
test chemical.

2.2. Prebuilt Commercial/Noncommercial Software for Predicting
Developmental Toxicity

Common commercial and noncommercial (Q)SAR software
programs that can predict the developmental toxicity of a noncon-
generic series of chemicals include Deductive Estimation of Risk
from Existing Knowledge (DEREK, LHASA Ltd., Leeds, UK),
HazardExpert (Compudrug Ltd., Budapest, Hungary), Discovery
Studio TOxicity Prediction by Komputer Assisted Technology (DS
TOPKAT, Accelrys Inc., Birmingham, MA, USA), Computer
Automated Structure Evaluation (CASE) and Multiple Computer
Automated Structure Evaluation (MultiCASE) (MultiCase Inc.,
Cleveland, OH, USA), and CAESAR (Caesar consortium,
Milan, Italy).
The software programs mentioned above predict the develop-
mental toxicity of chemicals based on their structure alone, and have
been used and validated by regulatory agencies including the Danish
EPA, US EPA, US FDA, Health Canada, European Chemicals
Bureau and the UK Health and Safety Executive (HSE), academia
and industry alike because of their ease of use and rapid application.
Commonly used software programs to predict the developmental
toxicity of chemicals are described in more detail below.
CAESAR. The CAESAR application is a Java-based downloadable
software and Web applet that allows user access to all toxicity
models developed within the CAESAR project (13). The project's
primary goal was to develop QSAR models that minimized false
negatives for endpoints that were relevant for REACH legislation
and make them easily accessible and usable by anyone. Currently,
CAESAR contains QSAR models for bioconcentration factor, skin
sensitization, mutagenicity, carcinogenicity and developmental tox-
icity. Two QSAR models for developmental toxicity have been
developed using different statistical/mathematical methods. The
first QSAR model makes a classification based on a random forest
algorithm, while the second is based on an adaptive fuzzy partition
algorithm.
To make a prediction for a chemical of interest for developmen-
tal toxicity, the appropriate model is first chosen in the left frame of
the software application, and the chemical is entered either by
typing the SMILES and pressing the load button, or by loading
multiple molecular structures from SDF/MOL files or a text file
containing multiple SMILES. After all chemicals of interest have
been entered/imported, the chemicals are submitted for making a
prediction by clicking the submission button on the bottom right
corner of the screen. The results screen gives the prediction along
with an assessment of the prediction, including whether the chemi-
cal is covered under the model's applicability domain and a list of
similar molecules present in the model database. The final screen
gives the user the option of storing the results as a text file, csv file,
or an SDF file.
CASE/MultiCASE. The CASE and MultiCASE methodology, and
the resulting commercial software, including MCASE, MCWeb,
MC_NET, MC4PC, CaseTox, ToxLite, and METAPC were devel-
oped by Klopman et al. (16, 17). MCASE and CASETOX contain
more than 180 modules that cover various areas of toxicology,
including acute toxicity in mammals, absorption, distribution,
metabolism and excretion (ADME), adverse effects in humans, carcino-
genicity, cytotoxicity, developmental toxicity and teratogenicity,
ecotoxicity, biodegradation and bioaccumulation, enzyme inhibi-
tion, genetic toxicity, and skin, eye irritations and allergies. These
software programs/modules can automatically identify molecular
substructures and/or fragments that have a high probability of
being responsible for an observed biological activity such as devel-
opmental toxicity for a set of tested chemicals, provided the training
dataset that was used to develop the model contains both active and
inactive chemicals (i.e., chemicals that are developmental toxicants
and developmental nontoxicants, respectively, in the case of devel-
opmental toxicity).
The CASE and MultiCASE methodology uses probability
assessment to determine whether structural fragments are asso-
ciated with developmental toxicity. To achieve this, both develop-
mental toxicants and developmental nontoxicants are split into
structural fragments up to a certain path length (2–10 atoms).
Each of these fragments is associated with a confidence level and a
probability of developmental toxicity that is derived from the dis-
tribution of these biophores (a fragment whose functionality is
associated with the largest number of developmental toxicants
and the smallest number of developmental nontoxicants in a train-
ing set that contains an equal number of developmental toxicants
and developmental nontoxicants) and biophobes (the opposite of a
biophore). The fragments are then combined to give an equation of
the following form (18):
CASE units = a × (fragment 1) + b × (fragment 2) + ... + constant.   (1)
Based upon this relationship, chemicals are designated as
developmental toxicants, marginal toxicants or chemicals with
weak developmental toxicity, or developmental nontoxicants.
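A minimal Python sketch of evaluating a fragment-based score of
the form in Eq. 1 follows; the fragment names, coefficients,
constant, and classification cutoffs are all hypothetical, not values
used by the MultiCASE software.

# Hypothetical fragment coefficients (a, b, ...) and constant of Eq. 1:
COEFFS = {"nitro_aromatic": 30.0, "haloacetic_acid": 22.0}
CONSTANT = 10.0

def case_units(fragment_counts):
    # CASE units = a*(fragment 1) + b*(fragment 2) + ... + constant
    score = CONSTANT
    for fragment, count in fragment_counts.items():
        score += COEFFS.get(fragment, 0.0) * count
    return score

score = case_units({"nitro_aromatic": 1})
# Assumed cutoffs, purely to illustrate the three designations:
label = ("developmental toxicant" if score >= 45 else
         "marginal toxicant" if score >= 25 else
         "developmental nontoxicant")
print(score, label)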
To make a prediction for a new, untested chemical, the chemi-
cal structure can be submitted to the software program either
through a structure-drawing program or as an MDL Molfile,
and an expert prediction of the potential activity of the new
molecule can be obtained. CASE- and MultiCASE-based products
provide the user with the capability of adding experimental data.
DEREK for Windows. DEREK for Windows is a knowledge- or
rule-based expert system that was originally created by Schering
Agrochemicals in the UK in 1986, and subsequently developed
and marketed by LHASA Ltd. (School of Chemistry, University
of Leeds, UK) and Harvard University (Boston, MA, USA) in
collaboration with industry, academia, and user groups. DEREK
is able to predict toxicological end points such as skin sensitization,
respiratory sensitization, irritancy, corrosivity, mutagenicity, carci-
nogenicity, teratogenicity, neurotoxicity, lachrymation, methemo-
globinemia, and anticholinesterase activity, and is based on an
analysis of the chemical structure alone. Toxicological predictions
for a given end point are generally based on structural alerts,
species, toxicity data, and physicochemical properties of chemicals
(19). New DEREK rules are established using a detailed review of
published sources of toxicological, mechanistic, and chemical data.
Reasoning rules in DEREK have the following formula (19):

If [Grounds] is [Threshold] then [Proposition] is [Force],   (2)

where [Grounds] is the evidence to be considered by the reasoning
rule, [Threshold] is the level above which the grounds must be for
the proposition to be assigned the force, [Proposition] is the out-
come of the reasoning rule, and [Force] is the likelihood of the
reasoning rule outcome.
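A minimal Python sketch of a reasoning rule of this shape is shown
below; the grounds, threshold, and force values are hypothetical
and do not correspond to actual DEREK rules.

def apply_rule(grounds, threshold, proposition, force):
    # If [Grounds] meets [Threshold], the [Proposition] is assigned [Force].
    if grounds >= threshold:
        return {"proposition": proposition, "likelihood": force}
    return None

evidence = 0.8  # e.g., strength of a structural-alert match (assumed scale)
print(apply_rule(evidence, 0.5, "teratogenicity in the rat", "plausible"))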
DEREK for Windows has a small number of alerts for repro-
ductive and developmental toxicity, and was developed using repro-
ductive and developmental toxicity data from both the US FDA's
ICSAS/CDER group (Informatics and Computational Safety
Analysis Staff/Center for Drug Evaluation and Research) and the
CFSAN group (Center for Food Safety and Applied Nutrition).
Version 11 of the software contains five new alerts while four alerts
were updated. Two additional alerts were included in version 12 of
the software. To use DEREK for Windows interactively, molecular
structures are entered into DEREK either via the built-in chemical
editor program or by importing Molfiles or SDfiles. The program
compares structural features in the test chemical with the toxico-
phores described in its rule base, highlights toxicophores, and
provides a justification for each prediction in terms of the rules
fired, the mechanistic or historical basis for the rule, and any
supporting literature used in rule development. In addition,
end-point-specific information is also included in the output. The
system also reports that no toxicophores have been identified in the
structure in case none of the rules were applied.

DS TOPKAT. Discovery Studio Toxicity Prediction by Komputer
Assisted Technology (DS TOPKAT), originally developed by
Health Designs, Inc. (Rochester, NY) and currently marketed by
Accelrys Inc. (Burlington, MA), is a PC-based modular software for
the prediction of a wide variety of health end points. QSAR models
in DS TOPKAT include NTP rodent carcinogenicity, FDA rodent
carcinogenicity, weight of evidence rodent carcinogenicity, Ames
mutagenicity, developmental toxicity potential, skin sensitization,
skin irritancy, ocular irritation, aerobic biodegradability, LOAEL,
LD50, LC50, EC50, MTD, and octanol–water partition coefficient
(20). Each DS TOPKAT module consists of a specific database of
carefully screened chemicals, and several chemical subclass-specific
cross-validated QSAR models for predicting a specific toxicity end
point. In order to assess the end-point specific toxicity for any given
chemical structure, the software employs appropriate bulk, elec-
tronic and transport attributes that are presumed to be responsible
for the biological activity of the molecule (21, 22).
The Developmental Toxicity Potential module of the DS TOP-
KAT software package comprises three QSAR submodels and the
data from which these submodels are derived. Each submodel
applies to a specific class of chemicals. These discriminant models,
derived from uniform experimental studies selected after critical
review of approximately 3,000 open literature citations, compute
the probability of a submitted chemical structure being a develop-
mental toxicant in the rat; a probability below 0.3 indicates no
potential for developmental toxicity, and probability above 0.7
signifies developmental toxicity potential. The probability range
between 0.3 and 0.7 refers to the indeterminate zone.
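This three-zone interpretation maps directly into code; a minimal
Python sketch using the 0.3 and 0.7 cutoffs from the text (the
function name is ours):

def interpret_dtp_probability(p):
    # Cutoffs from the DS TOPKAT Developmental Toxicity Potential module:
    if p < 0.3:
        return "no developmental toxicity potential"
    if p > 0.7:
        return "developmental toxicity potential"
    return "indeterminate"

for p in (0.12, 0.55, 0.91):
    print(p, "->", interpret_dtp_probability(p))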
To make a developmental toxicity prediction, the user has to
enter the chemical structure in SMILES format in DS TOPKAT
and select the developmental toxicity potential model as the pre-
diction module. The software first screens the chemical structure
against the model substructural library to determine adequate
coverage, and automatically chooses the appropriate chemical
class-specific QSAR submodel to generate the toxicity prediction.
A prediction is considered successful when the results of the
substructure analysis satisfy the validation criteria for both
the univariate and multivariate procedures that are built into the
model, and the descriptor values are within the model domain
(21, 23, 24).
HazardExpert. HazardExpert is a rule-based system developed by
Compudrug (Budapest, Hungary) using known toxicophores col-
lected from the literature or through in vivo experiments.
HazardExpert in combination with its sister product, MetabolEx-
pert (a software program for predicting metabolites), can be used
to predict the toxicity of both the parent compound and its
metabolites. The knowledge bases in HazardExpert and
MetabolExpert were developed based on the list of toxic fragments
reported by more than 20 experts in their respective fields.
HazardExpert predicts a range of toxicities including mutagenicity,
carcinogenicity, teratogenicity, oncogenicity, irritation, sensitiza-
tion, immunotoxicity, and neurotoxicity.
To use HazardExpert, a user selects a chemical from the Hazar-
dExpert database, defines the species, route of administration, dose
level and duration of exposure. If the test chemical is missing in the
database, the user must enter it into the database using the attached
chemical editing interface prior to running the model. Upon exe-
cution, substructures that exert a positive or negative effect with
respect to the chosen toxicity are identified based on known tox-
icophores stored in the toxic fragments knowledge base. Hazar-
dExpert provides the user with the means to add and/or modify
rules and their related references through the knowledge mainte-
nance module. Toxicity estimation results including the toxicity
classification are displayed to the user as histograms.
Leadscope Developmental Toxicity Suite. The models in the Develop-
mental Toxicity Suite were developed under a cooperative research
and development agreement (CRADA) with the US FDA and are
intended to support regulatory decision-making processes. The Lead-
scope QSAR models for developmental toxicity include dysmor-
phogenesis (structural and visceral birth defects), developmental
toxicity (fetal growth retardation and weight decrease), and fetal
survival (fetal death, postimplantation loss, and preimplantation
loss) for rodent composites, rats, mice, and rabbits. A QSAR
model is not available for predicting visceral dysmorphogenesis in
rabbit fetuses. The applicability domain is defined by nearest neigh-
bor analysis, which performs an explicit comparison of structural
feature representations and coverage of test sets vs. the model.
The Leadscope developmental toxicity suite of models is a stand-
alone suite with read-only models. To make a prediction for a test
chemical, its structure is entered either as a SMILES or in mol or
SDF formats. Minimum requirements for running the model
include Windows XP or Vista with at least a Pentium 4, 1.0 GHz
processor (or equivalent) with a minimum of 1.0 GB of RAM and 1 GB
of disk space.
Toxicity Estimation Software Tool (T.E.S.T.). T.E.S.T. is a Java-based
software suite developed by the US Environmental Protection
Agency (US EPA) that will enable users to easily estimate various
toxicity endpoints using QSAR methodologies such as hierarchical
method, FDA method, single-model method, group contribution
method, nearest neighbor method, consensus method, and ran-
dom forest method. The software is based on the Chemistry Devel-
opment Kit, an open-source Java library for computational
chemistry. Currently available QSAR models in this software tool
include 96-h fathead minnow LC50, 48-h Daphnia magna LC50,
Tetrahymena pyriformis IGC50, oral rat LD50, bioaccumulation
factor, developmental toxicity, and mutagenicity. In addition, the
software tool can also calculate descriptors for use in other QSAR
model-building programs.
To make a prediction for a new chemical, the user has to input
the chemical to be evaluated by drawing it in a two dimensional
(2D) chemical sketcher window, entering a SMILES string, enter-
ing a CAS number, or entering a MOL file. The user will also select
the desired endpoint to be predicted and the method to use to
make a prediction. Once the user presses the "Calculate" button,
the tool will generate three dimensional (3D) coordinates for the
molecule (if necessary) and then calculate hundreds of chemical and
physical descriptors. In the previous phase of the project, a hierar-
chical clustering based approach was used to divide the toxicity
dataset (for a given end point) into a series of clusters based on
these descriptor values. The predicted toxicity value and confidence
interval along with descriptor values and similar chemicals will then
be displayed in a Web browser. Finally if an experimental toxicity
value is available, it will also be displayed.

2.3. Developing (Q)SAR Models to Predict Developmental Toxicity

Sometimes, the commercial and noncommercial software available
may not be suitable for a particular user's application, either because
of the expense involved in acquiring the software or because the
software is not suitable for chemicals that are of particular interest
to the user. In such a case, a user may be able to build (Q)SAR or
other mathematical models to predict the developmental toxicity of
chemicals using a database of training chemicals that is relevant for
their purpose.
A large number of mathematical methods are available to
express the mathematical function in a generic (Q)SAR equation.
The most frequently used methods are multiple linear regression
(MLR), principal component and factor analysis, principal compo-
nent regression analysis, partial least squares, discriminant analysis
and other classification analysis and neural networks. All these
methods relate the biological activity (developmental toxicity) to
a number of physicochemical descriptors that characterize a
chemical structure. In general, the physicochemical properties/
descriptors may be categorized as experimental quantities, spectro-
scopic data, substituent constants (electronic, hydrophobic, steric),
parameters derived from molecular modeling and quantum chemi-
cal computations, graph theoretical indices, and variables that
describe the presence or absence of certain substructures or frag-
ments. The (Q)SAR equation(s) is of the general form:

log(1/T) = f(V) + C,   (3)

where T (dependent variable) is a measure of developmental
toxicity, f is a mathematical function, V is a set of independent
variables (physicochemical properties of chemicals) and C is a con-
stant. For qualitative SAR models where the prediction is whether a
given chemical is a developmental toxicant or nontoxicant, the
dependent variable can be expressed as "yes" or "true" for toxicant
(some software packages require the use of numerical variables, in
which case the variable is represented as "1") and "no" or "false"
for nontoxicant ("0"). For quantitative SAR (QSAR) models, the
dependent variable is generally lowest observed adverse effect level
(LOAEL) for developmental effects expressed in moles/kg/d units.
The LOAEL is generally divided by the molecular weight of a chemi-
cal as dependent variables in molar units seem to yield better correla-
tions than weight-based units. The function, f, can be any
mathematical relation that relates the independent variables to the
dependent variable. The function may be obtained using logistical
regression, MLR, artificial neural networks or classification and
regression trees for qualitative models while for quantitative models,
MLR is usually the method of choice.
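As an illustration of fitting a quantitative model of the Eq. 3 form,
the Python sketch below converts hypothetical LOAELs to molar
units, as recommended above, and fits an MLR model with
scikit-learn; all descriptor values and LOAELs are fabricated for
illustration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical descriptors (e.g., logP, molar refractivity) for 5 chemicals:
V = np.array([[1.2, 10.5], [2.3, 14.1], [0.7, 9.8], [3.1, 20.2], [1.9, 12.4]])
loael_mg = np.array([120.0, 45.0, 210.0, 15.0, 80.0])  # mg/kg/day (assumed)
mol_wt = np.array([94.1, 128.2, 60.1, 178.2, 106.1])   # g/mol (assumed)

# Convert to molar units and take log(1/T), as in Eq. 3:
T = (loael_mg / 1000.0) / mol_wt  # mol/kg/day
y = np.log10(1.0 / T)

model = LinearRegression().fit(V, y)
print(model.coef_, model.intercept_)           # f(V) coefficients and C
print(model.predict(np.array([[1.5, 11.0]])))  # prediction for a new chemical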

2.3.1. Software Suites for Developing QSARs

Commercial and noncommercial applications that include descrip-
tor generator and statistical modules are available in the market to
build a SAR or QSAR model. Examples of such software include
ADAPT, Cerius2, Discovery Studio, MDL QSAR, and Sybyl.
Commonly used software programs to predict the developmental
toxicity of chemicals are described in more detail below.
ADAPT. ADAPT (Automated Data Analysis using Pattern Recog-
nition Toolkit; Dr. Peter Jurs, University Park, PA; http://research.
chem.psu.edu/pcjgroup/adapt.html) is a software system designed
to allow the user to develop qualitative or quantitative SARs.
ADAPT provides facilities for graphical entry and storage of molec-
ular structures and their associated data, generation of 3D molecu-
lar models, molecular descriptor calculation and analysis of the
descriptors using multivariate statistical, pattern recognition, or
neural network methods to build predictive equations. In addition,
it can import structures as molfiles and 3D coordinates. ADAPT
has a large selection of molecular descriptor generation routines
(topological, geometrical, electronic, and physicochemical) and
the ability to generate hybrid descriptions that combine features.
Statistical approaches supported include MLR, clustering, discrim-
inant analysis and neural networks. ADAPT runs on Sun work-
stations under the UNIX operating system, and has been ported
to Linux using the Intel Fortran compiler. A number of scripts to
automate various tasks are also available. Approximately 120 MB of
storage are required for source code, libraries, executables, and
documentation.

ADMET Modeler. ADMET Modeler (Absorption, Distribution,
Metabolism, Elimination, and Toxicity; Simulations Plus, Inc., Lan-
caster, CA; http://www.simulations-plus.com/Products.aspx?
pID=13&mID=14) is an integrated module of ADMET Predic-
tor that automates the process of building (Q)SARs from sets of
experimental data. It works seamlessly with ADMET Predictor struc-
tural descriptors as its inputs, and appends the selected final model
back to ADMET Predictor as an additional predicted property.
ADMET Modeler includes Kohonen self-organizing maps, artificial
neural network ensembles for regression and binary classification,
vector machine ensembles for regression and binary classification,
kernel partial least squares (PLS) and ordinary PLS for regression
and MLR. ADMET Predictor is designed to run on Windows NT,
XP, 2000, or Windows 98 PCs or in Windows Vista in XP compati-
bility mode.
ADMEWorks ModelBuilder. ADMEWORKS ModelBuilder
(Fujitsu BioSciences Group, Sunnyvale, CA) is a tool for building
mathematical models that can later be used for predicting various
chemical and biological properties of compounds. The models are
based on values of physicochemical, topological, geometrical, and
electronic properties derived from the molecular structure.
Descriptor generator programs in ADMEWORKS ModelBuilder
come mostly from ADAPT (see description above) and MOPAC,
which is a general purpose semiempirical molecular orbital pack-
age developed by Professor James Stewart. The software includes
several statistical methods for generating predictive models
including perceptron (a type of artificial neural network model),
iterative least squares method, MLR, stepwise regression, leap-
and-bounds regression, genetic algorithm, fuzzy adaptive least
squares, k-nearest neighbors, AdaBoost, and support vector
machine. Suggested hardware and software requirements include
an Intel Pentium 4 2.0 GHz processor or higher with 512 MB of
RAM, and Windows 2000/XP.
BioEpisteme. BioEpisteme (Prous Institute for Biomedical
Research, Barcelona, Spain; http://www.prousresearch.com/
spage/technology/testpage/pageid-79/epage/BioEpisteme.aspx)
is a (Q)SAR program developed at Prous Institute based on well-
defined 2D and 3D molecular descriptors. The model-building
module of the program facilitates the development of mathematical
models that relate chemical structures to biological responses. These
models are interrelated, each having been designed in the context of
the others. BioEpisteme is organized into two main modules: a data
prediction module and a model building module. The predictive
models are integrated in the in silico Drug Discovery System (i-
DDS). Molecular descriptor-based BioEpisteme models under
development at Prous Institute include structure/molecular
mechanism of action relationships (SMAR), structure/therapeutic
activity relationships (STAR), structure/toxic effect relationships
(STER), structure/ADME relationships (SADMER), structure/
biological process relationships (SBPR) and structure/drug inter-
action relationships (SDIR).
CAChe/Scigress Explorer. CAChe (Computer Aided Chemistry/Sci-
gress Explorer; Fujitsu BioSciences Group, Sunnyvale, CA; http://
www.simulations-plus.com/Products.aspx?pID=13&mID=14) is
a unique desktop molecular modeling software package that can
apply a wide range of computational models to all types of molecular
systems from small organic molecules to inorganics, polymers,
materials systems, and whole proteins. Scigress Explorer speeds
time-to-discovery by providing powerful computing and analytical
tools designed for experimental scientists. The software suite provides
an intuitive interface for powerful 3D visualization of molecules and
model development using the Projectleader interface. CAChe is
designed to work the way experimental chemists work. Users identify
properties of interest using the standard menu interface and CAChe
computes the properties using its extensive built-in library of compu-
tational chemistry tools. Every tool can be calibrated to experimental
data to improve the accuracy of predicted results. While Scigress
Explorer works efficiently on desktop computers, large-scale compu-
tational tasks become accessible to the desktop user by means of
Linux GroupServer, offering accelerated execution of computation-
ally intense chemical and life sciences applications.
Cerius2/Materials Studio. Cerius2 (Accelrys, Inc., San Diego, CA;
http://accelrys.com/products/materials-studio/index.html) is a
user-friendly graphical molecular modeling program developed by
Accelrys Inc. (formerly MSI) that incorporates a wide variety of
popular molecular modeling codes in materials sciences, life
sciences and drug design. Cerius2 is now legacy software and is no
longer supported by Accelrys. Most of the Cerius2 licenses have
been converted to Materials Studio, the program developed by
Accelrys to replace Cerius2. Materials Studio is a comprehensive
suite of modeling and simulation solutions for studying chemicals
and materials, including crystal structure and crystallization pro-
cesses, polymer properties, catalysis, and SARs. Materials Studio
runs on Windows, but many of the internal programs run on IRIX,
Linux x86, Linux IA64, and Tru64.
Discovery Studio QSAR. Discovery Studio (Accelrys, Inc., San
Diego, CA; http://accelrys.com/products/discovery-studio/) is
a user-friendly graphical molecular modeling program developed
by Accelrys Inc. that incorporates a variety of useful molecular
modeling codes specifically designed for biological systems, includ-
ing metal containing systems. DS QSAR provides easy access to the
hundreds of molecular descriptors, proven in biological systems to
correlate with activity. The streamlined Discovery Studio (DS)
interface presents these descriptors and advanced modeling and
visualization methods in an easy-to-use environment. DS Library
Design applies these capabilities together with similarity and diver-
sity methods specifically tailored for chemical library design to
guide optimal library design. From one easy-to-use graphical envi-
ronment, the user is able to access QSAR, fingerprint, and quantum
mechanistic based descriptors, advanced data modeling tools such
as Bayesian models, MLR, PLS, genetic functional analysis and
neural networks, library design and optimization tools such as
Pareto optimization, diversity and similarity analysis. The current
release of Discovery Studio is version 2.1 and runs on Windows
and Linux.
MOE QuaSAR. Molecular Operating Environment QuaSAR
(Chemical Computing Group, Montreal, Canada; http://www.
chemcomp.com/journal/qsar.htm) is a complete QSAR/QSPR
analysis toolset. The first step in such analysis is to compute mole-
cule properties. QuaSAR's graphical interface is a gateway into the
QuaSAR system, and allows the user to select which descriptors to
calculate. The interface scans MOE and automatically detects Qua-
SAR modules, including those introduced by the user. Users can
quickly and easily build descriptors of their own custom design and
add them into the system; the source code of the built-in descrip-
tors is distributed with QuaSAR, and can be used as templates.
After descriptors are calculated, they are sent to the suite of MOE
QuaSAR model-building functions for model fitting and evalua-
tion. (Q)SAR models may be built using linear, probabilistic and
decision-tree methodologies. The built-in binary QSAR method-
ology is ideal for building pass/fail models from high error content
data. Linear models include PCR and PLS methodologies and can
support biological activity or ADME assessments. In the model
refinement stage, further descriptor calculations may be required.
This cycle continues until a satisfactory model is found, or until the
current model is rejected and the process begun anew. MOE runs
on a wide variety of operating systems and computers including
multiprocessor clusters, high-performance scientific workstations,
and laptops. MOE can run either interactively or in batch mode.
In either mode, the built-in MOE/SMP capability permits execu-
tion to be distributed across multiple processors and/or computers
even in a heterogeneous computing environment. Supported
operating systems include Windows 2000/XP/Vista/Windows 7,
Linux, Mac OS X, Sun Solaris, Silicon Graphics, and Iris.
Molecular Analysis Pro, Molecular Modeling Pro, Molecular Model-
ing Pro Plus. The Molecular Modeling suite of programs developed
by Norgwyn Montgomery Software Inc. (North Wales, PA; http://
www.norgwyn.com) are designed as analysis tools for chemists
doing practical research in the relationship between properties of
molecules (QSAR, QSPR, etc.). Molecular Analysis Pro has three
main features that aid in the discovery of these relationships:
calculating molecular properties from structure, statistical and
graphing tools for analyzing data, and database storage, retrieval
and manipulation capacity including the ability to interact with
common databases such as Microsoft Access, SQL, and Oracle.
The software calculates over 100 molecular properties from struc-
ture, 40 3D geometry descriptors for similarity searching, and over
140 substructural features. Statistical methods included in the soft-
ware include MLR with brute force and stepwise methods, partial
least squares regression with cross-validation and PCA. The software
is available only for the Windows platform.
OpenMolGRID. OpenMolGRID (Open Computing GRID for
Molecular Science and Engineering; http://www.openmolgrid.
org) uses UNICORE to integrate the various applications needed
to build QSAR models from databases to molecular engineering
and prediction modules. Data mining techniques within Open-
MolGRID are used for the development of predictive models that
can be used for estimating various chemical properties and
biological activities. The OpenMolGRID system provides a flexi-
ble infrastructure for automating the workflow for the model
building process, which involves the preparation of a training
set, the generation of 3D structures for compounds in the training
set, quantum chemical calculations, the calculation of molecular
descriptors, and finally the QSPR/QSAR model building. Com-
plicated workflows can be represented in XML format, easily
shared between colleagues, customized and extended. The Open-
MolGRID system has Grid adapters for several existing software
packages that are required for carrying out tasks in the QSAR/
QSPR model development workflows. The details about sup-
ported tasks are available under available applications, which
include applications for 2D to 3D conversion, applications for
semiempirical chemical calculations, applications for descriptor
calculation and applications for model development. The MDA
module from the CODESSA PRO software has been adapted
for the QSAR/QSPR model development. Multiple statistical
methods are available for the development of predictive models,
including MLR and PLS. Several selection algorithms are available
for descriptor selection in the effective search of the best (most
informative) multiparameter correlations in the large space of the
natural descriptors. The prediction capability of the model is
judged by statistical parameters calculated for the model, various
cross-validation techniques, internal and external validation sets.
Visualization tools are available for plotting actual vs. predicted
activities/properties and residuals.

Optibrium StarDrop. The StarDrop software platform (Optibrium
Ltd., Cambridge, UK; http://www.optibrium.com/stardrop)
guides compound selection and design decisions to obtain an opti-
mal balance of properties at all stages of drug discovery. Sitting at
the heart of the drug discovery process, StarDrop provides an
intuitive environment to integrate data from experimental data-
bases or predictive models, to make project decisions with confi-
dence, design and select balanced compounds with a high chance of
success, focus resources on the most appropriate chemistries, and
provide project scientists with direct access to in-house models and
databases through StarDrop's intuitive, user-friendly interface.
StarDrop provides a comprehensive range of features to support
design and prioritization of high quality compounds, including
probabilistic scoring, chemical space visualization and compound
selection tools, ADME QSAR models to predict a full range of
ADME properties, P450 metabolism models, and automatic model
building to develop and deploy models of your own data. The Auto
Modeler module in StarDrop can automatically calculate molecular
descriptors, generate the QSAR models using PLS, radial basis
functions, Gaussian processes, and decision trees, and train, test,
and validate the developed model.
PRECLAV. PRECLAV (PRoperty Evaluation by CLAss Variables)
is a program for QSAR calculations from Tarko Laszlo (Center of
Organic Chemistry, Bucharest, Romania). The program can work
with a maximum of 500 chemicals in a dataset, and can calculate
over 1,100 global, local, and grid/field descriptors. The final
QSAR is built using r2 and r2-cv cross-validating functions. The
minimum number of descriptors in the developed model is 2, while
the maximum number of descriptors is 20. The program can be
adapted to work with databases created by the user.
Prochemist. Prochemist (Cadcom; http://pro.chemist.online.fr) is a
molecular modeling toolkit that can calculate the physicochemical
properties of organic molecules, design new structures using a
graphical structure editor and predict their behavior. The QSAR
module in Prochemist computes the correlation between a known
value (activity, toxicity, etc.) and variables (molecular descriptors)
for an array of molecules. Molecular descriptors calculated by the
software include energy descriptors, lipophilicity, steric descriptors,
classification descriptors and topological indices. Qualitative mod-
els in Prochemist are built using PCA, correspondence factor anal-
ysis, ascending hierarchical classification, minimum spanning tree
and self-organizing neural networks. Quantitative SAR models are
built using MLR analysis, nonlinear (neuronal) analysis, back prop-
agation algorithm, Boltzmann algorithm, basis functions, Fourier
transforms, cascade correlations, and genetic algorithms. In addi-
tion to the descriptors calculated by the software, the user is also
able to provide external descriptors to the model building modules.

QSARpro. QSARpro (VLife Technologies, Pune, India; http://
www.vlifesciences.com/products/QSARPro/Product_QSARpro.
php) is a complete software package for modeling molecular activity
or property with structural parameters, analyzing such models and
predicting activity of new molecules. QSARpro evaluates more than
1,000 molecular descriptors including those based on physico-
chemical, topological, and electrotopological properties, informa-
tion theory, quantum mechanics, electrostatic and hydrophobic
properties and MMFF atom types, among others. QSARpro
implements novel methodologies such as k-nearest neighbor
molecular field analysis (kNN-MFA) for improved modeling cap-
abilities. The software offers the unique flexibility of coupling any
variable selection method to any model building method to build
the final QSAR equation. QSARpro also incorporates group based
QSAR (GQSAR™) methodology that simplifies interpretation of
results by considerations of molecular fragment contributions and
interaction between fragments. The software runs on Windows
NT/2000/XP/Vista operating systems.
Sarchitect. Sarchitect (Strand Life Sciences, San Francisco, CA;
http://www.strandls.com/Sarchitect) is a modeling tool for calcu-
lating descriptors, developing QSARs and modeling ADME/Tox
properties. Sarchitect allows model builders to build best possible
models of their data, and equally importantly, monitor and improve
the models over a period of usage. It allows users of the models to
optimize complex multidimensional ADME/Tox properties of
molecules while retaining their biological activity. The Sarchitect
platform empowers computational and medicinal chemists, mode-
lers, DMPK scientists and other users with a QSAR modeling and
deployment platform that provides powerful algorithms, interactive
views, single click model building, intuitive workflows for chemists,
and customized scripting. The Sarchitect Designer module in
Sarchitect helps automatically build QSAR models by assessing
the possibilities and limitations of data, extracting the right features
for models, selecting the best suited algorithms and parameters,
rigorous validation and quality testing, model packaging and
deployment, and most importantly, the postmodeling activities
that actually make the models valuable, monitoring the usage of
models, and continuously improving the models. The only tools a
modeler will need to build intuitive models are the chemical struc-
tures and experimental data.
SYBYL-X. SYBYL-X (Tripos International, St. Louis, MO;
http://www.tripos.com/index.php?family=modules,SimplePage,
comp_informatics) is a modular suite of molecular modeling pro-
grams for building, editing, and analyzing small molecule organics,
biopolymers, polymers and materials. Modules include Base
SYBYL, Biopolymer, QSAR, HQSAR, CoMFA, Advanced Comp
(for conformation searching and interfaces to QCPE programs).

The QSAR with CoMFA module in SYBYL builds statistical and
graphical models that relate the properties of molecules (including
biological activity) to their chemical structures. These models are
then used to predict the properties or activity of novel compounds.
Tripos' patented CoMFA has been used as the method of choice in
hundreds of published QSAR studies. A wide variety of structural
descriptors can be calculated, including EVA and the molecular
fields of CoMSIA. QSAR with CoMFA organizes structures
and their associated data into Molecular Spreadsheets, calculates
molecular descriptors, and performs sophisticated statistical
analyses that reveal patterns in structure–activity data.
The basic tools needed to build powerful, predictive models of
biological activity (or any other property) from molecular structure
are provided in SYBYL's QSAR module. These include molecular
field generation tools, least-squares (PLS, PCA, and SIMCA) and
nonlinear (hierarchical clustering) analysis tools. The most power-
ful of these techniques can be extended, and their application
automated, using the Advanced CoMFA module. The Topomer
CoMFA module in SYBYL is a 3D QSAR tool that automates the
creation of models for predicting the biological activity or proper-
ties of compounds. Topomer CoMFA models can be created in
minutes, and can easily be used by both QSAR experts and QSAR
nonexperts, with the results being often comparable to traditional
CoMFA analysis. The Hologram QSAR (HQSAR) module in
SYBYL uses molecular holograms and PLS to generate fragment-
based SARs. Unlike other 3D-QSAR methods, HQSAR does not
require alignment of molecules, allowing automated analysis of very
large data sets. Validation studies have shown that HQSAR has
predictive capabilities comparable to those of much more compli-
cated 3D-QSAR techniques. SYBYL is available for Windows,
Linux, and Mac OS platforms.

2.3.2. Unbundled Software for Developing QSARs

Stand-alone applications may also be used to build a SAR or QSAR
model. In this case, separate software programs will have to be
used to calculate molecular descriptors and to perform statistical
analysis to build the final model. Todeschini and Consonni (25)
provide a comprehensive list of molecular descriptors that can be
used to develop QSAR models. Apart from the applications
mentioned in the previous section, other common software appli-
cations that can be used to calculate molecular descriptors include
ACD/Labs PhysChem and ADME (http://www.acdlabs.com/
products/pc_admet/), ADRIANA. Code (http://www.molecular-
networks.com/products/adrianacode), Almond (http://www.
moldiscovery.com/soft_almond.php), Cranium (http://www.mol
know.com/Cranium/cranium.htm), CODESSA PRO (http://
www.codessa-pro.com/index.htm), EPI Suite (http://www.epa.
gov/oppt/exposure/pubs/episuite.htm), Dragon (http://www.
talete.mi.it/products/dragon_description.htm), HYBOT-PLUS
14 Developmental Toxicity Prediction 323

(http://www.timtec.net/software/hybot-plus.htm), molinspira
tion (http://www.molinspiration.com/cgi-bin/properties), and
Molconn-Z. (http://www.edusoft-lc.com/molconn/). Some quan
tum mechanical (QM) descriptors mentioned by Todeschini (25)
cannot be computed by the software programs mentioned previ
ously. QM descriptors are generally dependent on computational
chemistry/molecular modeling software such as ADF (http://
www.scm.com/Products/Overview/ADFinfo.html), AMPAC
(http:// www.semichem.com/ampac/), Gaussian (http://www.
gaussian.com/), GAMESS(http://www.msg.ameslab.gov/
GAMESS/GAMESS.html), Molpro (http://www.molpro.net/),
NWChem (http://www.nwchem-sw.org/index.php/Main_
Page), pDynamo(http://www.pdynamo.org/mainpages/), and
Q-Chem (http://www.nwchem-sw.org/index.php/Main_
Page). Once molecular descriptors for the chemical structures
have been calculated, statistical programs such as CoStat/CoPlot
(http://www.cohort.com/costat.html), DataFit (http://www.
curvefitting.com/datafit.htm), IgorPro (http://www.wavemetrics.
com/ index.html), JMP (http://www.jmp.com/software/jmp9/),
Minitab (http://www.minitab.com/en-US/products/minitab/
default.aspx), Mathematica (http://www.wolfram.com/products/
mathematica/index.html), Maple (http://www.maplesoft.com/pro
ducts/ maple/), MATLAB (http://www.mathworks.com/pro
ducts/matlab/), Miner3D (http://miner3d.com/), NLREG
(http://www.nlreg.com/index.htm), Origin (http://www.originlab.
com/index.aspx?goProducts/Origin), R (http://www.r-project.
org/), SAS (http://www.sas.com/software/sas9/), SigmaStat,
SigmaPlot, and Systat (http://www.systat.com/SystatProducts.aspx),
S-Plus (http://spotfire.tibco.com/products/s-plus/statistical-analysis-
software.aspx), Statistica (http://www.statsoft.com/), and WEKA
(http://www.cs.waikato.ac.nz/~ml/weka/index.html) may be used
to develop the final (Q)SAR model. Most of the software mentioned
in this section runs on Windows, though some packages also run on
Linux, UNIX, or Mac OS platforms.

3. Methods

3.1. QSAR Analysis

The general procedure for developing a QSAR model for predicting
the developmental toxicity of a wide variety of chemicals is outlined
in the following steps:
Choosing the training set. The values of the dependent variables for
the chemicals in the model training set, such as the lowest dose that
causes developmental effects (for quantitative models) or whether
exposure to a chemical causes developmental effects (for qualitative
models), should be chosen from the literature. Since a good QSAR
model is highly dependent on the quality of the data used in the
model construction process, the data should be chosen from a well-
documented and vetted source such as the databases from the ILSI,
ToxNet, or the US EPA.
Generate most-stable conformers. The activity of a given chemical is
dependent on how the chemical associates with the target tissue in a
biological system. In general, this can be determined by analyzing
the chemical for its most stable conformer or global minimum in
vacuum. Thus, as part of the model development process, the
structures of the chemicals considered in this study could be ana-
lyzed for their most stable conformers using software such as
Computer Aided Chemistry (CAChe), Hyperchem, or Gaussian.
This step needs to be conducted only if descriptors that depend on
3D coordinates of a chemical are calculated. If only descriptors that
depend on 2D coordinates of a chemical such as topological, geo-
metrical, WHIM, and constitutional descriptors are calculated,
then this step can be safely ignored.
Calculate descriptors. Quantum chemical descriptors that describe
the electronic states responsible for any given endpoint may be
calculated using descriptor-generator programs such as Molconn-Z,
CODESSA, E-Calc, EPIWIN, Dragon, or any of the other
descriptor-generator programs mentioned in Subheading 14.2.
Explore the data. The descriptors calculated in the previous step
might be highly intercorrelated with each other. In such a case,
interrelationships between the descriptors calculated in the previ-
ous step should be analyzed to choose a descriptor that is theoreti-
cally most significant among highly intercorrelated descriptors.
Minimizing the number of highly intercorrelated descriptors will
also help reduce the uncertainties in the coefficients of descriptors
in the final QSAR equation. This step can be carried out using any
of the QSAR model building programs or statistical programs
mentioned in Subheading 14.2.
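As a minimal illustration of this step (not part of the original protocol), the following R sketch flags highly intercorrelated descriptor pairs; the data frame name desc and the 0.9 cutoff are assumptions for this example:

# Sketch: flag descriptor pairs with |r| > 0.9, assuming a numeric data
# frame "desc" with one column per calculated descriptor (hypothetical)
cc <- cor(desc)                     # pairwise Pearson correlations
cc[upper.tri(cc, diag=TRUE)] <- NA  # keep each pair only once
high <- which(abs(cc) > 0.9, arr.ind=TRUE)
# list the intercorrelated pairs; one member of each pair is a
# candidate for removal on theoretical grounds
data.frame(desc1=rownames(cc)[high[,1]], desc2=colnames(cc)[high[,2]])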
Develop the final QSAR equations. Once a list of potential descriptors
has been calculated in the previous steps, several statistical
methods can be used for generating the final QSAR equation.
These may include MLR, PLS, simple linear regression, stepwise
MLR, and principal components regression among others. This
step can be carried out using either the QSAR model-building
programs or the statistical packages mentioned in Subheading 14.2.

Validate the equations. The final QSAR equations should be validated
internally and externally to identify outlier chemicals. Statistical
validation criteria that can be used to validate QSAR models
include leave-many-out (LMO), which tests model reliability;
leave-one-out (LOO), which tests model stability; and external
validation, which tests for balanced test sets. The validation process
involves developing the QSAR equation for a series of subsets of
chemicals from the initial chemical dataset and then comparing the
following: the coefficients of the descriptors for the two resultant
QSAR equations (initial and subset), the Pearson correlation
coefficients, and the predictive ability of the subset equation for
chemicals that were not present in the subset. External validation involves
using the QSAR models developed in the previous step to predict
the toxicities of chemicals that are not part of the original dataset.

3.2. Independent Variables

Various independent variables can be used to predict the developmental
toxicity of a wide variety of chemicals. For example, the
descriptors in Subheading 14.4 were calculated using the commercial
descriptor-generator programs Dragon, CAChe, AMPAC/
CODESSA, and Molecular Connectivity (Molconn-Z). The methods
used in deriving the independent variables for each descriptor-
generator program are described in the following paragraphs.
CAChe Descriptors. The chemicals considered in this study were first
optimized to their most stable conformation using the CONFLEX
module in CAChe WorkSystem Pro (version 6.1.12.33, Fujitsu,
Beaverton, OR). The independent variables (descriptors) from
CAChe were then calculated using the Project Leader interface in
CAChe.
Charge-based descriptors such as the partial charge, HOMO
and LUMO were calculated by first optimizing the chemicals in
vacuum using MOPAC with PM5 parameters followed by a popu-
lation analysis on the molecular orbitals. The following keywords
were used for the MOPAC job: GEO-OK PM5 NOMM ENPART
ESR LOCALIZE MULLIK PI POLAR BONDS DENSITY FOCK
GRADIENTS SPIN VECTORS PRECISE LARGE XYZ NODIIS
GRAPH ITRY=2000 T=10D GNORM=0.0 SCFCRT=1.D-15
DMAX=0.1.
Solvent-based descriptors such as the solvent accessible surface
area, energy of solvation and dipole moment were calculated by first
optimizing the chemicals in water using MOPAC with PM5 para-
meters and the Conductor-like Screening Model (COSMO).
In addition to the keywords used for optimizations in vacuum,
the following two keywords were added: EPS=78.4 RSOLV=1.00.
Reactivity-based descriptors such as the nucleophilic, electro-
philic and radical superdelocalizability were calculated by optimiz-
ing the chemicals in vacuum using MOPAC with PM5 parameters
followed by running the Tabulator interface in CAChe to get the
isodensity values. In the Tabulator, default reagent energy values of
8 eV, 2 eV, and 5 eV were assigned for calculations of electrophilic,
nucleophilic, and radical superdelocalizability, respectively.
Infrared descriptors were calculated by optimizing the chemi-
cals in vacuum using MOPAC with PM5 parameters followed by a
vibrational spectra calculation using the FORCE and THERMO
keywords. Ultraviolet descriptors were determined by ZINDO
using INDO/S parameters at the PM5 geometry. The following
keywords were used in the control section of ZINDO:
UNITS=ANGS ENTTYP=COORD IELEC=0 NAT=35 NEL=110
DYNAL(1)=0 14 21 0 0 65 63 IPRINT=-1 RUNTYP=CI
SCFTYP=RHF SCFTOL=0.00001 INTTYP=1 INTFA(1)=1.0
1.2675 0.585 1.0 1.0 1.0 MULT=1 IAPX=3 ITMAX=200.
Descriptors such as the bond energy, bond order and bond
strain were calculated using Extended Hückel Theory at a molecu-
lar mechanics minimum energy geometry determined by optimiza-
tion using Augmented MM3. To perform the optimizations in
vacuum, the keyword MM3 was used instead of PM5.
LogP and molar refractivity were calculated using the atom
typing scheme of Ghose and Crippen (26), while the shape index
was calculated using the Kier and Hall indices (27, 28).
CODESSA Descriptors: Descriptors by CODESSA were calculated
by first optimizing the chemicals considered in this study using
AMPAC (a program developed by Semichem Inc., but based on
MOPAC). The following keywords were used for the optimization:
PM3 RHF DOUBLET GNORM=0.0001 TIGHT BONDS ESP
CODESSA T=100D TRUSTG FORCE THERMO ESR GRAD
MULLIK POLAR ENPART PI VECTORS ALLVEC DERINU
LOCALIZE MPG SPIN SCFCRT=0 KPOLAR. The output from
the optimization was fed into CODESSA to calculate a series of
descriptors that are based on the two-dimensional and the three-
dimensional structure of the chemical. While three-dimensional
structure-based descriptors are dependent on the three-
dimensional orientation of a chemical and hence require an output
file from AMPAC, two-dimensional structure-based descriptors
may be calculated independently from the two-dimensional struc-
ture of the molecule alone.
Dragon Descriptors. Dragon is a software program that calculates
several sets of two-dimensional descriptors, such as topologic,
geometric, WHIM, 3D-MoRSE, and constitutional descriptors,
from a chemical's molecular geometry (25). Since the descriptors
are based on the two-dimensional structure, the software cannot
differentiate between structures of stereoisomers. In addition, Dragon cannot identify
structures of stereoisomers. In addition, Dragon cannot identify
certain Group I and Group II metals such as sodium, potassium and
calcium. However, Dragon can calculate descriptors for transition
metals. The descriptors for the chemicals considered in this study
were calculated by first converting the molecular structures of the
chemicals into the MDL mol format, reading the structures into
Dragon, and selecting the "Generate Descriptors" button in
Dragon.
Molconn-Z Descriptors. Molconn-Z descriptors were calculated
using the molecular descriptor generator program Molconn-Z,
which calculates general two-dimensional molecular descriptors
(e.g., atom counts, bond counts, and molecular weight); Chi
descriptors that describe the molecular connectivity of a molecule
and incorporate features such as size, branching, and saturation;
and kappa descriptors that describe the molecular shape of the
molecule (25). Molconn-Z also calculates topologic state descriptors
that arise from the interactions of atoms with each other, E-state
descriptors that arise from the electronic environment of each
atom due to its intrinsic electronic properties, and general descriptors
such as polarity and graph diameter (25). Descriptors were
calculated using the following keywords in the control file:
ALLRECORDS, HEADERS top, INPUTFORMAT oelibsmiles,
WARNINGS off, ERRORS skip, and GO.

3.3. Datasets

Commercial and noncommercial databases are available to obtain
qualitative developmental toxicity data (generally categorized as: A
(controlled studies show no risk), B (no evidence of risk in humans),
C (risk cannot be ruled out), D (positive evidence of risk), or X
(contraindicated in pregnancy)) or quantitative toxicity data in the
form of no-observed-adverse-effect levels (NOAELs) and/or
lowest-observed-adverse-effect levels (LOAELs). Commonly available
sources of data for developmental toxicity include the Developmental
and Reproductive Toxicology/Environmental Teratology
Information Center (DART/ETIC), TOXNET, the European Chemicals
Bureau, the US FDA, the International Life Sciences Institute, DSSTox,
ToxRefDB, ReproTox, and ChemSW.
The DART/ETIC database (http://www.nlm.nih.gov/pubs/
factsheets/dartfs.html) is a bibliographic database on the National
Library of Medicine's (NLM) Toxicology Data Network (TOXNET).
It covers teratology and other aspects of developmental
and reproductive toxicology. It contains over 200,000 references to
literature published since 1965. DART/ETIC is funded by the US
EPA, the National Institute of Environmental Health Sciences, the
National Center for Toxicological Research of the FDA, and the
NLM.
The human/animal developmental toxicity database (http://
www.chemsw.com/14240.htm) and the industrial solvents devel-
opmental toxicity database (http://www.chemsw.com/14239.
htm) are commercial databases that are maintained by ChemSW.
The former database contains toxicities and molecular properties
for 157 pharmaceutical, agricultural, and industrial organic chemicals,
while the latter contains toxicities and molecular properties of
94 individual solvents. Each database entry lists the name, CAS
registry number, and observed overall developmental toxicity (pos-
itive, negative, equivocal, unknown) within humans and the indi-
vidual animal species (mouse, rat, hamster, and rabbit). Data in the
two databases are provided in Molecular Modeling Pro (csv) or
SDF formats.
REPROTOX (http://www.reprotox.org/Default.aspx) is an
information system developed by the Reproductive Toxicology
Center for its members. REPROTOX contains summaries on the
effects of medications, chemicals, infections, and physical agents on
pregnancy, reproduction, and development. The REPROTOX
system was developed as an adjunct information source for clini-
cians, scientists, and government agencies.
The Distributed Structure-Searchable Toxicity (DSSTox;
http://www.epa.gov/ncct/dsstox/) and Toxicity Reference Data-
base (ToxRefDB; http://www.epa.gov/ncct/toxrefdb/) are maintained
by the US EPA's National Center for Computational
Toxicology (NCCT). Results from 383 rat and 368 rabbit prenatal
studies on 387 chemicals, mostly pesticides, are present in the
ToxRefDB database.
US FDA Informatics and Computational Safety Analysis Staff
(ICSAS) maintains a genetic toxicity, reproductive and develop-
mental toxicity and carcinogenicity database (http://www.fda.
gov/AboutFDA/CentersOffices/CDER/ucm092217.htm). The
ICSAS reproductive and developmental toxicity database contains
data records from FDA segment I (reproductive toxicity in male
and female animals), segment II (teratology, organ toxicity, and
nonspecific toxicity to the fetus), and segment III (behavioral
toxicity in newborn pups) studies in Glires (primarily rats, mice,
rabbits, and hamsters) and other animals. The data were acquired
from publicly available sources, such as Shepard's Catalog of
Teratogenic Agents, TERIS, REPROTOX, and RTECS, as well as
studies reported in drug labeling, and other reproductive toxicity
studies obtained from the EPA Toxdata-1g database.
The International Life Sciences Institute (http://www.ilsi.org/
Pages/ViewActivityDetail.aspx?ID=114&ListName=Activities&
diaPage/Pages/RiskAssessmentDiagram.aspx) is developing a
prototype database of developmental toxicity data; once populated,
the database will facilitate systematic consideration of alternative
approaches for utilizing toxicity data in SAR modeling efforts.

3.4. Statistical Analysis

Statistical analyses can be performed using any commonly available
dedicated statistical software such as JMP (SAS Institute Inc., Cary,
NC), Microsoft Excel (Microsoft Corporation, Redmond, WA,
US), R (R-Project, http://www.r-project.org), SAS (SAS Institute
Inc., Cary, NC), SigmaPlot (Systat Software Inc., Chicago, IL), and
SYSTAT (Systat Software Inc., Chicago, IL). Alternatively, statisti-
cal analysis can be performed in any of the QSAR model building
software mentioned in Subheading 14.2. Common methods used
to develop QSAR models include stepwise MLR, genetic algorithms,
logistic regression, and classification and regression trees,
among others. In general, SAR models for predicting developmental
toxicity of chemicals qualitatively (yes/no) are developed using
logistic regression, while stepwise MLR is used to develop models
that predict developmental toxicity quantitatively in the form of a
LOAEL. To prevent multicollinearity between the descriptor vari-
ables, the minimum tolerance can be set to a low value such as 0.02,
while the probability-to-enter and probability-to-leave values can
be set at 1.99 and 0.02, respectively. For each of the QSAR models,
the cross-validated r² (internal q²) can be calculated using the
leave-multiple-out (LMO) procedure, wherein the model is
built at least ten times using a random selection of 70–80% of the
chemicals from the original model data set, and the developmental
toxicities of the remaining chemicals are predicted using the
resulting model.
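A minimal R sketch of this LMO procedure (assuming the d0 data frame and dev_endpt response used later in Subheading 14.4; the 70% fraction and ten repeats follow the text above) is:

# Sketch: leave-multiple-out (LMO) cross-validation
set.seed(1)  # arbitrary seed for reproducibility
q2 <- replicate(10, {
  train <- sample(nrow(d0), size=round(0.7*nrow(d0)))
  m <- lm(dev_endpt ~ ., data=d0[train, ])
  pred <- predict(m, newdata=d0[-train, ])
  obs <- d0$dev_endpt[-train]
  1 - sum((obs - pred)^2)/sum((obs - mean(obs))^2)
})
mean(q2)  # average predictive q2 over the ten random splits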

3.5. Domain of Applicability

The predictive ability of the QSARs also depends on their prediction
domain, which is in turn dependent on the range of descriptor
values in the QSAR equation. One way of defining the prediction
domain is according to the principles of the domain of applicability
as defined by Gombar and Enslein (24). However, the procedure
for calculating their prediction domain is proprietary and hence
may not be available for implementation in QSAR models.
A second method of determining the QSAR prediction domain
is through a procedure based on Hotelling's T² statistic (29).
Briefly, the procedure involves performing a PCA on a training set
of chemicals and using the resultant loading matrix to calculate the
PCA scores for the external set. Using the scores and a given
confidence interval, a, a chemical i of the external set is considered
to belong in the prediction domain if its Hotelling score, Ti²,
satisfies the following criterion:

$$T_i^2 < \frac{A(N^2 - 1)}{N(N - A)}\, F(p = a), \qquad (4)$$

where F(p = a) is the tabulated value of the F-distribution at
confidence level a, A is the number of principal components
used to build the Hotelling test, and N is the number of chemicals
in the training set. The Hotelling score itself is

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2}, \qquad (5)$$

where sa² is the variance explained by principal component a,
and tia is the score of chemical i for principal component a. The
confidence in predictions is high for chemicals with small values of
T2, while the opposite is true for chemicals with large T2 values.
If T2 is greater than the expression on the right side of the equation,
the chemical is outside the confidence region, and either the pre-
diction should be rejected or the predicted value should be used
with great caution.
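For illustration only, a sketch of Eqs. 4 and 5 in R might look as follows; X (an autoscaled training descriptor matrix), xnew (an external chemical scaled the same way), and the choice A = 2 are assumptions for this example:

# Sketch: Hotelling T2 check of the prediction domain (Eqs. 4 and 5)
pc <- prcomp(X, center=FALSE, scale.=FALSE)       # X assumed autoscaled
A <- 2                                            # number of PCs (example)
t_new <- as.numeric(xnew %*% pc$rotation[, 1:A])  # scores from loading matrix
s2 <- pc$sdev[1:A]^2                              # variance per component
T2 <- sum(t_new^2/s2)                             # Eq. 5
N <- nrow(X)
T2crit <- A*(N^2 - 1)/(N*(N - A))*qf(0.95, A, N - A)  # Eq. 4, a = 0.95
T2 < T2crit  # TRUE if the chemical is inside the prediction domain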
A distance-based measure using the concept of leverage as
proposed by Eriksson et al. (30) can also be used to determine
the domain of applicability for QSAR models. The leverage
approach allows a QSAR model to determine whether a new chem-
ical will lie within the structural model domain or outside the
domain of the QSAR model. In addition, as suggested by Grama-
tica (31), a prediction is considered to be unreliable or outside the
domain of applicability for a new chemical when (1) its leverage (h)
is greater than the critical value (h* = 3p'/n, where p' is the
number of model variables plus one, and n is the number of
chemicals in the training set), and (2) its cross-validated standardized
residuals are greater than three standard deviation units (>3s). The
leverage of a chemical in the original variable space can be calcu-
lated using the following formula (32):
$$h_i = \mathbf{x}_i^T (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{x}_i, \qquad (6)$$

where i = 1, . . ., n; n is the number of chemicals in the training set,
xi is the descriptor vector of the chemical in question, and X is the
model matrix derived from the training set descriptor values. The
defined applicability domain can be visualized via a Williams plot,
which is a plot of standardized cross-validated residuals versus
leverage values (h), to detect both the response outliers (standardized
residual >3s) and structurally influential chemicals in a model
(h > h*).

3.6. Performance Evaluation of a QSAR Model

There are two methods commonly used to determine the
predictive capability of a (Q)SAR model (33). The first is
the use of cross-validation, which includes LOO, LMO, and k-fold
cross-validation. In LOO, a compound is left out of the training set
and the remaining compounds are used to train the machine
learning method. The derived (Q)SAR model is then used to
predict the activity of the left-out compound. This process is
repeated until every compound in the training set has been left
out once. In k-fold cross-validation, the training set is randomly
divided into k mutually exclusive subsets of approximately equal
size. k − 1 of the subsets are combined to form a modeling
training set for developing a QSAR model. The remaining subset is
then used as a modeling testing set to assess the predictive capability
of the QSAR model. This process is repeated until all k subsets
have been tested at least once.
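A compact R sketch of k-fold cross-validation for the regression case (assuming the d0 data frame of Subheading 14.4 and k = 5) is:

# Sketch: 5-fold cross-validation on the d0 data frame
set.seed(1)
k <- 5
fold <- sample(rep(1:k, length.out=nrow(d0)))  # random fold assignment
pred <- numeric(nrow(d0))
for (i in 1:k) {
  m <- lm(dev_endpt ~ ., data=d0[fold != i, ])     # train on k - 1 folds
  pred[fold == i] <- predict(m, newdata=d0[fold == i, ])
}
cor(pred, d0$dev_endpt)^2  # cross-validated squared correlation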
Cross-validation methods may sometimes not be a good indicator
of the prediction capability of a (Q)SAR model (34–37). Moreover,
cross-validation methods have a tendency to underestimate
the prediction capability of a QSAR model, especially if important
molecular features are present in only a minority of the compounds in
the training set (38, 39). Thus a model having low cross-validation
results can still be quite predictive (38). Some studies have suggested
that an independent validation set may provide a more reliable esti-
mate of the prediction capability of a QSAR model (33, 34). Despite
these disadvantages, cross-validation methods are still useful for asses-
sing QSAR models during optimization of parameters of machine
learning methods and during descriptor selection.

3.6.1. Methods for Measuring Predictive Ability of SAR Models

Common statistics used to determine the predictive capability of an
SAR model include sensitivity (SE), specificity (SP), overall accuracy
(Q), and the Matthews correlation coefficient (MCC) (40)
(http://voyagememoirs.com/pharmine/):

$$SE = \frac{TP}{TP + FN} \times 100\%, \qquad (7)$$

$$SP = \frac{TN}{TN + FP} \times 100\%, \qquad (8)$$

$$Q = \frac{TP + TN}{TP + FN + TN + FP} \times 100\%, \qquad (9)$$

$$MCC = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}, \qquad (10)$$

where TP is the number of true positives (experimental positives
predicted to be positive), TN is the number of true negatives
(experimental negatives predicted to be negative), FP is the number
of false positives (experimental negatives predicted to be positive),
and FN is the number of false negatives (experimental positives
predicted to be negative); SE is the accuracy of classification for
positives, SP is the accuracy of classification for negatives, and Q is
the overall accuracy of classification for the model. A perfect SAR
model will predict no false positives or false negatives, and will have
an overall accuracy of 100% and an MCC value of 1.
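As a sketch, Eqs. 7–10 translate directly into the following R helper; the four counts supplied in the usage line are hypothetical:

# Sketch: classification statistics of Eqs. 7-10 from confusion counts
classStats <- function(TP, TN, FP, FN) {
  SE  <- TP/(TP + FN)*100                   # Eq. 7, sensitivity
  SP  <- TN/(TN + FP)*100                   # Eq. 8, specificity
  Q   <- (TP + TN)/(TP + FN + TN + FP)*100  # Eq. 9, overall accuracy
  MCC <- (TP*TN - FN*FP)/
    sqrt((TP + FN)*(TP + FP)*(TN + FN)*(TN + FP))  # Eq. 10
  c(SE=SE, SP=SP, Q=Q, MCC=MCC)
}
classStats(TP=40, TN=45, FP=5, FN=10)  # hypothetical counts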

3.6.2. Methods for Measuring Predictive Capability of QSAR Models

Common statistics used to determine the predictive capability of a
QSAR model include the Pearson correlation coefficient (r²), external
validation r² (q²), mean square error, mean absolute error, fold
error, and average fold error (see http://voyagememoirs.com/phar
mine/2008/03/09/methods-for-measuring-predictive-capability-
of-qsar-models-2/ for an explanation of the variables). The correlation
coefficients measure the explained variance between the predicted
and experimental values for test and external validation
datasets, while the fold errors measure the degree of overprediction
or underprediction of the QSAR models (41).
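For the regression statistics, a minimal sketch of the error calculations (assuming positive-valued vectors pred and obs of predicted and experimental values on the original scale) might be:

# Sketch: error statistics for a QSAR model, assuming positive-valued
# vectors pred (predicted) and obs (experimental)
mse <- mean((pred - obs)^2)        # mean square error
mae <- mean(abs(pred - obs))       # mean absolute error
fe  <- 10^abs(log10(pred/obs))     # per-chemical fold error (always >= 1)
afe <- 10^mean(log10(pred/obs))    # average fold error (keeps bias direction)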
4. Examples

4.1. Development of a Multiple-Linear-Regression-Based QSAR Model with a Mammalian Developmental Toxicity Endpoint Using R

In this section, a simulated developmental toxicity dataset is used
to illustrate how one can develop a QSAR model using R.
Table 1 contains the data for a particular developmental endpoint
(i.e., dev_endpt; e.g., a LOAEL based on a dose-related increase in
the incidence of rudimentary 14th ribs in rat fetuses) for 33
chemicals with 6 relevant descriptors, v1–v6. This dataset has
already been preprocessed and cleaned for model generation. Typically,
one would need to convert a weight-based unit (mg/kg-day) to
a mol-based unit (mmol/kg-day) and transform the converted values
to a log or natural log scale. Descriptors v1–v6 can be experimental
or theoretical physicochemical (or quantum mechanical) properties
that may account for the observed toxicity (the response variable, or
the developmental endpoint in this case). The simulated dataset is
called DevelopmentalQSAR.csv (csv is a comma-separated-values
text file) and is presented in Table 1.
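For reference, the conversion mentioned above can be done along these lines in R; loael_mg and MW (molecular weight, g/mol) are hypothetical inputs:

# Sketch: convert a LOAEL from mg/kg-day to mmol/kg-day, then take
# the natural log to obtain the response variable
loael_mmol <- loael_mg/MW      # mg/kg-day divided by g/mol gives mmol/kg-day
dev_endpt  <- log(loael_mmol)  # natural log scale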
Before one can run or write R scripts, one will need to install R
first (see http://www.r-project.org/ for instructions on download
and installation). After installing R on a PC (the method is valid for
MacOS X or Linux as well), one can start writing scripts to generate
a regression (QSAR) model. Second, one will need to download
and install the "faraway" package or any other appropriate packages
(e.g., for other statistical functionality) from the R Web site (e.g., if
you prefer a principal-component-based regression model, you
would need to download the "pls" package).
Now, one can use the following R script (MLR_AD_Dev.R) to
generate an MLR-based QSAR model. The code is the following:
## November 15, 2010
## Nina Ching Y. Wang
## Generate a QSAR model using multiple linear regression (MLR) in R
## A MLR-based QSAR model with a developmental endpoint and its
## associated lowest-observed-adverse-effect-level (LOAEL) as the
## response variable
## LOAEL (response endpt, second column) has already been transformed
## into mmol-based from weight-based units and is in natural log scale
library(faraway)
## Make sure you download faraway package from the R
website first; some functionality may not work with
the base package only
## Some of the codes were co-developed with Dr. David
Farrar. Please do not distribute the R code without
prior permission from the authors/developers
# Cross validation--Leave one out
Table 1
Data to be placed in the DevelopmentalQSAR.csv file

Chemical dev_endpt v1 v2 v3 v4 v5 v6
1 2.759712 10.441 0.02 0.068041 0.1156 0.0203 26.1079
2 4.587676 10.871 0.018 0 0 0 12.8101
3 9.701267 9.134 0.189 0.096225 0 0 3.7703
4 9.502622 7.977 0.279 1.626968 0.0678 2.24E-05 5.4028
5 3.542805 6.989 0.269 0.602671 0 0 1.4531
6 6.612875 9.471 0.058 0.117851 0 0 0.4964
7 5.693281 10.747 0 0.146385 0.0379 3.04E-05 17.078
8 10.37322 9.289 0.003 4.373589 0 0.00E+00 0.00E+00
9 0.793055 9.966 0.015 0 0 0.00E+00 25.3052
10 6.332414 8.153 0.306 0.66066 0 0.00E+00 0.8482
11 6.991609 10.713 0.008 0 0 0.00E+00 0.00E+00
12 7.704999 10.243 0.123 0 0 0.00E+00 4.9914
13 8.988404 10.826 0.044 0 0.1805 0.0248 10.1478
14 5.439386 8.68 0.72 0.166667 0 0.00E+00 1.1007
15 9.1427 10.941 0.762 0.117851 0.1748 1.03E-04 24.8754
16 9.033307 8.64 0.207 0.51953 0.106 7.42E-08 7.6008
17 5.181727 7.811 0.094 0.85042 0 0.00E+00 0.9331
18 8.114655 8.702 0.529 0.92633 0 0.00E+00 0.8272
19 11.88782 9.646 0.175 1.494564 0.1245 7.86E-03 13.1668
20 8.617923 9.405 0.133 0.831987 0 0.00E+00 0.6074
21 1.387535 9.579 0.067 0.141421 0.07 0.0266 21.9634
22 2.923085 9.631 0.116 0.1 0.0698 0.0271 15.2603
23 3.86922 9.485 0.122 0.1 0.0686 1.32E-03 10.7721
24 0.38433 9.635 0.086 0.1 0.0699 0.0302 19.4625
25 0.373352 9.591 0.039 0.2 0.0701 0.0165 28.6147
26 10.60792 8.375 1.004 0.199071 0.0677 8.37E-03 10.1463
27 3.864432 9.575 0.065 0.1 0.0697 0.0168 22.1775
28 5.488927 12.092 0.744 0.166667 0.1804 0.0554 25.7257
29 3.762447 8.677 0.096 0.147883 0 0.00E+00 3.6157
30 9.795335 8.62 0.241 0.572211 0.1597 0.01 6.1058
31 10.6815 6.918 1.012 0.595928 0.0712 5.12E-04 1.5184
32 8.932089 8.198 0.579 0.45111 0.0722 2.40E-04 4.7366
33 4.46365 10.691 0.002 0 0 0.00E+00 6.9086
Qsquare <- function(m) {
# Q-square for a model generated with lm()
PRESS <- sum((m$residuals/(1-hatvalues(m)))^2)
SSTotC <- sum(anova(m)[["Sum Sq"]])
return(1 - PRESS/SSTotC)
}
# Assume one has a csv file of the developmental toxicity
# dataset named DevelopmentalQSAR.csv saved under C:\R
d0 <- read.csv("C:\\R\\DevelopmentalQSAR.csv", row.names=1, header=T)
# row = number of chemicals/observations
row <- nrow(d0)
# P = number of descriptors/variables
P <- length(names(d0)) - 1
# dev_endpt is the response variable or Y; m1 = MLR model
m1 <- lm(dev_endpt ~ ., data=d0)
print(summary(m1))
print("LOO Q2:")
print(Qsquare(m1))

The following procedures are customized for the PC platform
(with slight deviations for other platforms). Run the script from the R
window by typing source("C:\\R\\MLR_AD_Dev.R") at the
command prompt, assuming you have saved the script under C:\R
(the exact path to the script should be typed). After running the
code, you should get the following result:
> source("C:\crR\crMLR_AD_Dev.R")
Call:
lm(formula dev_endpt ~ ., data d0)
Residuals:
Min 1Q Median 3Q Max
-2.6674 -0.5541 -0.1300 0.3952 3.9573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.73607 2.88942 -0.947 0.352394
v1 -0.93612 0.30945 -3.025 0.005537 **
v2 3.62202 1.10227 3.286 0.002909 **
v3 1.21729 0.37178 3.274 0.002995 **
v4 -36.01499 6.54213 -5.505 8.91e-06 ***
v5 -112.63005 30.17727 -3.732 0.000936 ***
v6 -0.23134 0.04432 -5.219 1.89e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.57 on 26 degrees of freedom
Multiple R-squared: 0.8188, Adjusted R-squared: 0.777
F-statistic: 19.59 on 6 and 26 DF, p-value: 1.646e-08
[1] "LOO Q2:"
[1] 0.6465491

Both r² and q² should have been calculated by the code:
r² = 0.8188 and leave-one-out (LOO) q² = 0.6465491. One can
also code for leave-multiple-out (LMO) or k-fold validation by
splitting the dataset into five parts and leaving one part out for
validation. However, for this demonstration, the coding for LMO
is not provided here. Now you have a decent QSAR model
with reasonably good r² and q², in which all six variables (descriptors)
are statistically significant (p < 0.01). However, to define a
domain of applicability or applicability domain (AD) in order to
detect overextrapolation and to test model predictability, one will
need to define an AD of a QSAR model mathematically or statisti-
cally. To derive an AD using the leverage and plotting the Williams
plot, one can append the following code to the previous R code or
simply type in the following commands:
## Domain of Applicability (AD): Williams Plot (Residuals vs. Leverage)
# h* warning leverage = 3*P/N, where P = number of descriptors,
# N = number of chemicals
h1 <- 3*P/row
print("h* warning: ")
print(h1)
# Plot the Williams Plot (domain of applicability, or AD)
plot(m1, which=5, add.smooth=FALSE, id.n=row,
labels.id=row.names(d0), cex.id=0.75, main="", caption="",
sub.caption="")
abline(h=3.0, lty=2); abline(h=-3.0, lty=2); abline(v=h1, lty=2)
## standardized residuals should be -3 < x < 3
# New chemicals that have yet to be tested for developmental
# adverse effects
A <- c(-9.873, 0.785, 2.557013, -0.1499, 0.0107, 0.3035)
B <- c(-11.125, 0.017, 0.423858, -0.0892, 0.0010803, 24.3729)
C <- c(0, 26.603, 1.603, 0, 0, 16.9298)
# Calculate leverage for new chemicals
model <- as.matrix(m1$model[,-1])
xtxi <- solve(t(model) %*% model)
print("leverages for new chemicals A,B,C are")
levA <- t(A) %*% xtxi %*% A
print(levA)
levB <- t(B) %*% xtxi %*% B
print(levB)
levC <- t(C) %*% xtxi %*% C
print(levC)
Fig. 1. Williams Plot for the QSAR model generated using the data in Table 1.

If you have appended the additional code to the previous R
script, you can save and rerun the new R script; one should
get Fig. 1 followed by the following result:
[1] "h* warning: "
[1] 0.5454545
[1] "leverages for new chemicals A,B,C are"
[,1]
[1,] 0.4285405
[,1]
[1,] 0.2266694
[,1]
[1,] 288.0573

From the result, one can see that the majority of the training set is
indeed inside the AD as defined by the rectangular box bounded
by h* and the residuals (−3 < residuals < 3); only two chemicals (#28
and #8) are outside the AD. At this point, one may consider
removing these two chemicals one by one and redeveloping a
new QSAR model, or simply keeping the current model (it already
has reasonable predictability). Lastly, one of the great utilities of
AD is that it allows one to test the predictability and uncertainty of
a QSAR model for a new or untested chemical that is not part of the
training set (i.e., external validation). Here, the respective descriptors
for three new hypothetical chemicals, A, B, and C, have already
been generated (the vectors A, B, and C defined in the code). As one
can see, A and B have leverages less than the h*
(0.5454545), while C has a very high leverage (h = 288.0573).
This result suggests that one should not use the current QSAR
model to predict Chemical C due to high extrapolation, and there-
fore, very high uncertainty. For A and B, predictions for these two
chemicals by the model may be reasonably reliable for screening
and prioritization. However, further testing (e.g., in vitro and/or
in vivo assays) should be conducted if these two chemicals are of
high concern, especially in a regulatory setting with policy implications.
One should always provide the predicted toxicity values with the
associated leverages and underlying model limitations.
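For completeness, a sketch of how the model predictions for the two in-domain chemicals could then be obtained (using the vectors A and B and the model m1 defined above) is:

# Sketch: predict dev_endpt for in-domain chemicals A and B with m1
newd <- as.data.frame(rbind(A, B))
names(newd) <- c("v1", "v2", "v3", "v4", "v5", "v6")  # match training names
predict(m1, newdata=newd, interval="prediction")      # with intervals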

5. Notes

Advantages/Disadvantages of the Different Types of Models. As
with statistical models, there are both advantages and disadvantages
with mechanistic and expert-system-based models. The correlative
models use unbiased assessment of the data to generate relation-
ships and make toxicity predictions; however, these models provide
no insight on the mechanistic nature of chemicals and are capable of
generating relationships that have little biological plausibility. In
addition, the predictive capability of the model is dependent upon
the quality of the data used in model development. For these
reasons, correlative models have to be carefully validated prior to
use. The advantage of a mechanistic model is that it is based on an
understanding of mechanism of action to determine the activity of a
given chemical, which is generally done by using human input.
However, such systems are restricted to human knowledge and
are not capable of providing insight into new relationships or
discovering new insights and/or novel associations between struc-
ture and toxicity. As a result, such models are biased towards
current ideas on mechanism of action. The most serious drawback
of these systems is that the rules, sometimes in thousands, relate
only to positivity, and assume that all chemicals that are not
governed by any of the rules are negative. This generally leads
to a large number of false positive predictions.
Overfitting. A (Q)SAR model may be overfitted if it contains more
descriptors than necessary. One method that may be used to check
whether a (Q)SAR model is overfitted is to compare the predictive
ability of internal and external validation datasets (42). A model
containing the optimal number of descriptors would have similar
predictive abilities for both internal and external validation datasets.
References

1. Anderson ME, Al-Zoughool M, Croteau M et al (2010) The future of toxicity testing. J Toxicol Environ Health B Crit Rev 13(2–4):163–196
2. European Union Directive (2003) Directive 2003/89/EC of the European Parliament and of the Council of 10 November 2003 amending Directive 2000/13/EC as regards indication of the ingredients present in foodstuffs. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:308:0015:0018:EN:PDF. Accessed 23 Oct 2010
3. Doull J, Borzelleca JF, Becker R et al (2007) Framework for use of toxicity screening tools in context-based decision-making. Food Chem Toxicol 45(5):759–796
4. Basak SC, Bertelsen S, Grunwald GD (1995) Use of graph theoretic parameters in risk assessment of chemicals. Toxicol Lett 79:239–250
5. Devillers J, Chezeau A, Thybaud E et al (2002) QSAR modeling of the adult and developmental toxicity of glycols, glycol ethers and xylenes to Hydra attenuata. SAR QSAR Environ Res 13(5):555–566
6. Devillers J, Chezeau A, Thybaud E (2002) PLS-QSAR of the adult and developmental toxicity of chemicals to Hydra attenuata. SAR QSAR Environ Res 13(7–8):705–712
7. Kavlock RJ (1990) Structure-activity relationships in the developmental toxicity of substituted phenols: in vivo effects. Teratology 41(1):43–59
8. Richard AM, Hunter ES III (1996) Quantitative structure-activity relationships for the developmental toxicity of haloacetic acids in mammalian whole embryo culture. Teratology 53(6):352–360
9. Vedani A, McMasters DR, Dobler M (1991) Genetic algorithms in 3D-QSAR: predicting the toxicity of dibenzodioxins, dibenzofurans and biphenyls. ALTEX 16(1):9–14
10. Matthews EJ, Kruhlak NL, Cimino MC et al (2006) An analysis of genetic toxicity, reproductive and developmental toxicity, and carcinogenicity data: II. Identification of genotoxicants, reprotoxicants, and carcinogens using in silico methods. Regul Toxicol Pharmacol 44(2):97–110
11. Matthews EJ, Kruhlak NL, Benz DR (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: I. Development of a weight of evidence QSAR database. Regul Toxicol Pharmacol 47(2):115–135
12. Matthews EJ, Kruhlak NL, Benz RD et al (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: II. Construction of QSAR models to predict activities of untested chemicals. Regul Toxicol Pharmacol 47(2):136–155
13. Cassano A, Manganaro A, Martin T et al (2010) CAESAR models for developmental toxicity. Chem Cent J 4(Suppl 1):S4
14. Merlot C (2008) In silico methods for early toxicity assessment. Curr Opin Drug Discov Devel 11(1):80–85
15. Worth AP, Bassan A, de Brujin J et al (2007) The role of the European Chemicals Bureau in promoting the regulatory use of (Q)SAR methods. SAR QSAR Environ Res 18:111–125
16. Klopman G (1984) Artificial intelligence approach to structure-activity studies. Computer automated structure evaluation of biological activity of organic molecules. J Am Chem Soc 106:7315–7321
17. Klopman G (1992) MULTICASE: a hierarchical computer automated structure evaluation program. Quant Struct Act Relat 11:176–184
18. Klopman G, Rosenkranz HS (1994) Approaches to SAR in carcinogenesis and mutagenesis. Prediction of carcinogenicity/mutagenicity using Multi-CASE. Mutat Res 305:33–46
19. Maslankiewicz L, Hulzebos EM, Vermeire TG et al (2005) Can chemical structure predict reproductive toxicity? RIVM, Bilthoven. www.rivm.nl/bibliotheek/rapporten/601200005.html. Accessed 24 Oct 2010
20. Accelrys (2001) TOPKAT 6.1: User Guide. Oxford Molecular Ltd, Burlington, MA
21. Gombar VK (1998) Quantitative structure-activity relationships in toxicology: from fundamentals to application. In: Reiss C, Parvez S, Labbe G et al (eds) Advances in molecular toxicology. VSP, Utrecht
22. Purcell WP, Bass GE, Clayton JM (1973) Strategy of drug design: a guide to biological activity. Wiley, New York
23. Moudgal CJ, Venkatapathy R, Choudhury H et al (2003) Application of QSTRs in the selection of a surrogate toxicity value for a chemical of concern. Environ Sci Technol 37:5228–5235
24. Gombar VK, Enslein K (1996) Assessment of n-octanol/water partition coefficient: when is the assessment reliable? J Chem Inf Comput Sci 36:1127–1134
25. Todeschini R, Consonni V (2000) Handbook of molecular descriptors, vol 11, Methods and principles in medicinal chemistry. Wiley-VCH Verlag GmbH, Weinheim
26. Ghose AK, Crippen GM (1987) Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions. J Chem Inf Comput Sci 27(1):21–35
27. Kier LB (1986) Shape indexes of orders one and three from molecular graphs. Quant Struct Act Relat 5:1–7
28. Kier LB, Hall LH (1999) The Kappa indices for modeling molecular shape and flexibility. In: Devillers J, Balaban AT (eds) Topological indices and related descriptors in QSAR and QSPR. Gordon and Breach, Reading
29. Eriksson L, Johansson E, Kettaneh-Wold N et al (1999) Introduction to multi- and megavariate data analysis using projection methods (PCA & PLS). Umetrics, Inc., Kinnelon
30. Eriksson L, Jaworska JS, Worth AP et al (2003) Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ Health Perspect 111:1361–1375
31. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701
32. Atkinson AC (1985) Plots, transformations and regression. Clarendon, Oxford
33. Wold S, Eriksson L (1995) Statistical validation of QSAR results. In: van de Waterbeemd H (ed) Chemometric methods in molecular design. VCH, New York
34. Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20(4):269–276
35. Kozak A, Kozak R (2003) Does cross validation provide additional information in the evaluation of regression models? Can J Forest Res 33(6):976–987
36. Reunanen J (2003) Overfitting in making comparisons between variable selection methods. J Mach Learn Res 3:1371–1382
37. Olsson I-M, Gottfries J, Wold S (2004) D-optimal onion designs in statistical molecular design. Chemometr Intell Lab Syst 73(1):37–46
38. Mosier PD, Jurs PC (2002) QSAR/QSPR studies using probabilistic neural networks and generalized regression neural networks. J Chem Inf Comput Sci 42(6):1460–1470
39. Hawkins DM, Basak SC, Mills D (2004) Assessing model fit by cross-validation. J Chem Inf Comput Sci 43(2):579–586
40. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405(2):442–451
41. Obach RS, Baxter JG, Liston TE et al (1997) The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. J Pharmacol Exp Ther 283(1):46–58
42. Hawkins DM (2004) The problem of overfitting. J Chem Inf Comput Sci 44(1):1–12

Further Reading
http://www.scripps.edu/rc/softwaredocs/msi/cerius45/qsar/T_qsar1.html
http://voyagememoirs.com/pharmine/archives
Arena VC, Sussman NB, Mazumdar S et al (2004) Selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR QSAR Environ Res 15(1):1–18
Cronin MT, Jaworska JS, Walker JD et al (2003) Use of QSARs in international decision-making frameworks to predict health effects of chemical substances. Environ Health Perspect 111(10):1391–1401
Cunningham AR, Carrasquer CA, Mattison DR (2009) A categorical structure-activity relationship analysis of the developmental toxicity of antithyroid drugs. Int J Pediatr Endocrinol. doi:10.1155/2009/936154
Cunningham AR, Consoer DM, Iype SA et al (2009) A structure-activity relationship (SAR) analysis for the identification of environmental estrogens: the categorical-SAR (Cat-SAR) approach. In: Devillers J (ed) Endocrine disruption modeling. CRC, New York
Hewitt M, Ellison CM, Enoch SJ et al (2010) Integrating (Q)SAR models, expert systems and read-across approaches for the prediction of developmental toxicity. Reprod Toxicol 30(1):147–160
Knudsen TB, Martin MT, Kavlock RJ et al (2009) Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA's ToxRefDB. Reprod Toxicol 28(2):209–219
Martin MT, Mendez E, Corum DG et al (2009) Profiling the reproductive toxicity of chemicals from multigeneration studies in the toxicity reference database (ToxRefDB). Toxicol Sci 110(1):181–190
OECD (1995) Test No. 421: Reproduction/Developmental Toxicity Screening Test, OECD guidelines for the testing of chemicals, section 4: health effects, OECD Publishing. doi:10.1787/9789264070967-en
Saiakhov RD, Klopman G (2008) MultiCASE expert systems and the REACH initiative. Toxicol Mech Methods 18(2–3):159–175
Sakamoto K, Yamauchi A, Sasaki M et al (2007) A structural similarity evaluation by SimScore in a teratogenicity information sharing system. J Comput Chem Jpn 6(2):117–122
Todeschini R, Consonni V (2000) Handbook of molecular descriptors, vol 11, Methods and principles in medicinal chemistry. Wiley-VCH Verlag GmbH, Weinheim
Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22(1):69–77
US EPA (1991) Guidelines for developmental toxicity risk assessment. Office of Research and Development, Washington, DC, EPA/600/FR-91/001
Yang CH, Valerio LG, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37(5):523–531
Chapter 15

Predictive Computational Toxicology to Support


Drug Safety Assessment
Luis G. Valerio Jr.

Abstract
Use of predictive technologies is an important aspect of many efforts in today's research, development, and
regulatory landscapes. The use of computational methods as predictive tools for supporting drug safety
assessments is of widespread interest, as the field of in silico assessments changes rapidly with emerging
technologies and the large amount of existing data available for modeling. There are challenges associated
with the application of in silico analyses for drug toxicity predictions and a need for strategies and
harmonization to enable an acceptable in silico evaluation for prediction of specific toxicity assay outcomes.
This chapter will provide an overview focused on computational tools using structure–activity relationships
and will highlight initiatives for use of computational assessments and realistic applications for predictive
modeling in evaluating potential toxicities of drug-related molecules.

Key words: Computational toxicology, Drug safety, Safety assessment, Drug toxicity, QSAR, SAR,
In silico toxicology, Predictive toxicology, Chemoinformatics, Molecular screening

1. Introduction

There are many recognized potential drug safety issues that can
arise during the course of drug development through post-
marketing. Among the most recognized are safety-based issues
centered on cardiovascular and hepatic adverse drug reactions (1).
In addition, some data indicate that the average success rate for new
molecular entities (NMEs) in aggregate of all therapeutic areas
starting from first-in-human studies to registration during
1991–2000 was approximately 11% (2). In 2003, the US Food
and Drug Administration (FDA) approved 21 NMEs (2), but that
has decreased recently with only 15 NMEs approved in 2010 (3).
Although lack of efficacy is a major contributor, toxicity can also be
a major cause of failure in drug development, with estimates of
approximately 30% of root causes of attrition (4). In an effort to

Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_15, # Springer Science+Business Media, LLC 2013

improve the success rate while ensuring the rapid development of
safe and efficacious drugs, several initiatives have been undertaken
globally and in the United States with the FDA leading the way in
part by the Critical Path Initiative launched in 2004. The overarch-
ing strategy of the FDA Critical Path Initiative is innovation, and it
supports the development of new approaches targeted at moder-
nizing scientific processes through which a potential human drug is
discovered and transformed into a safe and efficacious medical
product (5, 6). Computational tools are one such innovative
approach and are frequently applied in early drug discovery to
predict drug toxicity and prioritize compounds with low safety
liabilities for further advancement in drug development processes.
However, regulatory agencies are active in investigating computa-
tional tools as decision support for internal risk assessments of
drug-related molecules under specific scenarios and toxicological
endpoints identified as high priority (7–12). In addition, the tools
offer a potential line of evidence for regulators to consider when
reviewing data on toxicological endpoints that cannot be tested in
humans. These toxicity endpoints include carcinogenicity, geno-
toxic potential, and teratogenicity. These types of tools and
approaches have shown promise as well as limitations in regulatory
assessments and their application for drug-related molecules will be
covered in this chapter.
The most common way computational toxicology tools at the
FDA/CDER have been addressing ongoing and emerging drug
safety issues is through the generation of predictive information
using fundamental concepts in medicinal chemistry guided by
advanced machine learning, high quality toxicology study data,
and computer-based high throughput generation of calculated
drug molecular space (9, 13). Within this context, the computa-
tional tools that have been implemented at FDA/CDER have
focused primarily on predictive structure–activity relationship
(SAR) analysis and quantitative structure–activity relationship
(QSAR) analysis for prediction and modeling of drug toxicity.
Data mining approaches to support preclinical safety assessment
of drug-related molecules have also been an integral part of the
applied use of computational tools at the FDA (14). Currently,
genotoxicity prediction of drug impurities is the toxicity endpoint
of most interest for realistic application for predictive computa-
tional modeling of preclinical safety data at FDA/CDER. In sup-
port of this necessity, current FDA draft guidance on the genotoxic
and carcinogenic potential of impurities in drug products states that
"If an impurity that is present at levels below the ICH qualification
thresholds is identified the impurity should be evaluated for
genotoxicity and carcinogenicity based on SAR assessments (i.e.,
whether there is a structural alert). This evaluation can be conducted
either via a review of the available literature or through a
computational toxicology assessment" (15). The International
15 Predictive Computational Toxicology to Support Drug Safety Assessment 343

Conference on Harmonization (ICH) reports that it plans to dis-


cuss the acceptability and nature of SAR analysis for addressing
genotoxic impurities in pharmaceuticals at its June 2011 meeting
(16). It is anticipated that ICH will develop an M7 guidance that
would include for the first time computational toxicology assess-
ment as part of the international guidelines for human pharmaceu-
ticals on evaluating genotoxic impurities. The proposal is that
if a compound is predicted to be negative via expert evaluation
and computational assessment, then it can be assumed to be
non-mutagenic without the conventional gentox assay (e.g.,
Ames test). Other rationales that support initiatives to
use computational tools for predicting genotoxicity include the
large amount of data (public and private) that are available for
modeling genotoxicity. Moreover, the fact that many well charac-
terized genotoxins induce damaging effects with DNA (e.g., alkyl-
ation) through chemical reactivity mechanisms involving
substructural features present on the compound suggests it is
feasible to elucidate a structure-based explanation for genetic tox-
icity. It is likely then that searching for SAR patterns is useful in
model construction and the prediction of drug-related molecules.
Thus, one potential application is to perform a computational
toxicology assessment of a drug impurity that has been structurally
identified and found to be present in the drug product batch
proposed for use in a clinical trial, prior to the start of the trial,
when existing safety data on the genotoxic potential of the molecule
are inadequate or ambiguous.
The overall approach of analyzing molecular structural features
for chemicals and drugs originates from the human expert struc-
tural alert (aka toxiciphores, or traditional alerts) classification
schemas and decision trees that are well known over the years
(1722). These human expert structural alerts such as the Ashby-
Tennant alerts have played a fundamental role and source of knowl-
edge that have ultimately led to development of numerous
computer-based expert systems and predictive computational
tools designed to analyze chemical molecular space descriptors
and substructural features associated with the genotoxic potential
of a chemical. SAR and QSAR are versatile computational tools
with examples of its use in certain regulatory safety assessment
settings (7, 15, 23), and as an early drug discovery technology.
Broadly, SAR and QSAR are recognized techniques in a tool box
for de novo design of drugs (24), combinatorial chemistry (25),
lead identification and optimization (26), high throughput virtual
screening, and genotoxicity evaluation of theoretical drug metabo-
lites and theoretical and known drug impurities (15, 27, 28).
344 L.G. Valerio Jr.

(Q)SAR Computational Tools

Leadscope Model Applier. (Q)SAR algorithm: Partial Logistic Regression; structure interpretation: Molecular Features/Scaffolds; molecular descriptors: 2D (n ~ 27,000 features); applicability domain: Similarity to training set + Features in Domain; operating system: Windows; application architecture: Desktop.

SciQSAR. (Q)SAR algorithm: Discriminant Analysis; structure interpretation: Molecular Connectivity Indices (2D Descriptors); molecular descriptors: 2D (n ~ 200, Kier and Hall); applicability domain: Descriptor-based Membership in Class; operating system: Windows; application architecture: Desktop.

Symmetry. (Q)SAR algorithm: Combined Logistic Regression & Similarity (own development), other options; structure interpretation: Fingerprint Descriptors, 2D (3D in the future); molecular descriptors: 2D (n ~ 800 descriptors); applicability domain: Descriptor-based Membership in Class; operating system: Windows in the server, any in the clients; application architecture: Web Application.

Toxtree. (Q)SAR algorithm: Decision tree classification, Human Expert Knowledge; structure interpretation: Structural Rules (Alerts); molecular descriptors: 2D; applicability domain: None; operating system: Windows; application architecture: Desktop.

Derek Nexus. (Q)SAR algorithm: Human Expert reasoning based on Knowledge Base; structure interpretation: Structural Alert (Molecular Fragment); molecular descriptors: 2D; applicability domain: Similarity to knowledge base; operating system: Windows; application architecture: Desktop.

Fig. 1. List of common computational toxicology software platforms. Information shows a wide range of approaches.
All software can be enabled to predict the genetic toxicity of chemicals using various modules (e.g., bacterial mutagenicity).
No endorsement intended or implied.

2. Materials
and Methods

Figure 1 lists a sample of computational toxicology software
programs and illustrates the differences between these approaches.
Most of these computational programs have been used for research
purposes at the FDA/CDER through agency-approved
Cooperative Research and Development Agreements or Research
Collaboration Agreements with the software developer. The
approach of all these software programs is either SAR or QSAR.
It is important to note that the FDA does not endorse any of these
products even if the agency has reported research and use of a
system for internal use or model building. The validation statistics
Table 1
Summary of validation statistics reported in the literature for some computational
toxicology software that have QSAR models or SAR methods designed to predict
mutagenicity

(Q)SAR predictivity values for mutagenicity

Computational tool(a)    Sensitivity (%)  Specificity (%)  Accuracy (%)  References  Remarks
Leadscope model applier  81               73               77            (31, 63)    Internal cross-validation, and external validation based on 2,368 xenobiotics
SciQSAR                  81               76               78            (30)        Internal and external (1,444 chemicals) validation. 26% of the training set are drugs
Symmetry                 78               82               80            (64)        External validation based on 2,265 xenobiotics
Toxtree                  85               70               77            (23)        External validation for genotoxicity based on 1,290 DSSTox chemicals data set
Derek                    83               81               82            (23)        External validation for genotoxicity based on 1,290 DSSTox chemicals data set

All are predictivities of QSAR models for the Ames Salmonella assay except Toxtree and Derek, which are SAR approaches for genotoxicity
(a) No endorsement implied or inferred

for predicting Salmonella mutagenicity by the programs listed in
Fig. 1 are summarized in Table 1. A prominent characteristic of the
predictive performance of some of these models is the high speci-
ficity and low sensitivity. This explains why independent validation
tests with the models have reported low sensitivity and is probably
not related to the software algorithms (27, 29). Still other nonsta-
tistical based software such as Toxtree have shown high sensitivity
for classifying the mutagenic potential of compounds. Overall, the
many validation studies that do exist are centered on testing accu-
racy of environmental agents or other nondrug molecules. There
are few studies that report validation of QSAR mutagenicity models
using human pharmaceuticals as the test set. However, one study is
that of Snyder (27) who tested two computational toxicology
software (MC4PC and Derek) using a data set of 545 marketed
pharmaceuticals with the Physicians Desk Reference (PDR) as a
data source. Results showed low sensitivity of 45% for MC4PC, and
62% for Derek, but high specificity 97% for MC4PC and 88% for
Derek in the prediction of the Ames assay (27). The other factor to
consider is that the training sets and knowledge bases of the computational
software programs used to build the models have
very low drug content (<25%) (30). Thus, the following question
can arise: can nondrug molecules present in the training set of a
model predict the mutagenicity of drugs, and moreover, can
these public training sets deal with proprietary drug molecular
space? To help answer this question, a recent examination of a
public Ames QSAR training set using over 1,000 drug impurities
(public and private) as a test set found that only 14% of the drug
impurities were not in the applicability domain space of the model
(31, 63). In other words, useful Ames predictions could be made
for 86% of the drug impurities that were screened with the model.
Thus, it seems that, at least in this example of a QSAR training set of public structures that are predominantly not drugs, there is an acceptable level of molecular coverage and overlapping chemical space with a test set of drug impurities. The next logical step will be to test drug impurities with known Ames assay outcomes and validate the available computational models for accuracy, sensitivity, and specificity. There are various arguments that can be made
regarding a preference for a model to perform at high sensitivity
or specificity, which have been described recently (7, 9, 26, 28). In a
perfect world we would have 100% sensitivity and 100% specificity
in a single model. Since that is not possible, one usually has to settle for one of them (e.g., higher sensitivity over specificity). However, it is not impossible to construct a model with somewhat balanced
sensitivity and specificity (32). As an alternative solution, construc-
tion of different models with optimized levels of predictive perfor-
mance for sensitivity or specificity is becoming more common in
order to customize models to the different use cases.
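For reference, the sensitivity, specificity, and accuracy values discussed above and reported in Table 1 follow the standard definitions in terms of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP):

$$\mathrm{sensitivity}=\frac{TP}{TP+FN},\qquad \mathrm{specificity}=\frac{TN}{TN+FP},\qquad \mathrm{accuracy}=\frac{TP+TN}{TP+FP+TN+FN}$$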
The software programs illustrated in Fig. 1 do not have special
software requirements since all these systems can operate in a
standard Windows environment. Other computational toxicology
software use Linux as the operating system (33) or require Oracle databases (34), so depending upon the user's Information Technology (IT) infrastructure there are considerations to be made. The computational software programs most commonly used for predicting drug safety endpoints with SAR and QSAR approaches are confidential in-house systems (35, 36), freely available systems (8, 37, 38), and several commercially available ones (12, 13, 39, 40, 67). The confidential in-house systems have the advantage of being able to leverage knowledge from private study data into a model and to design algorithms that can be customized to better align with business goals, work flow, and scientific processes. Moreover, these systems are thought to be innovative, as there are fewer restrictions on the design of the tools, and they can clearly take advantage of greater quantity and quality of training set data compared with the public data sets that commercial systems often tend to offer.

Fig. 2. Hierarchical classification tree of a QSAR training set used to develop a public model for the Ames Salmonella
mutagenicity assay using Leadscope technology (Enterprise version 3.0). Tree shows the positive structural features
contained in the QSAR training set that are important components of the model. Negative features except for negative
parents of positive features were filtered out. Frequency is the number of compounds in the training set with that feature.
The precision of each feature indicates how well an alert candidate would work against the training set. Features with
significant z-scores, large total (+) loadings, and zero (or near-zero) total (−) loadings explain activity by themselves and
are candidates for use as alerts.

Commercial systems often have turnkey approaches. Moreover, the algorithms they use for the prediction are proprietary. This may not offer an interested user an understanding of the basis of the prediction or in-depth knowledge of how the model was constructed. There are exceptions, however, with some computational tools that are oriented towards transparency and enable model construction with chemoinformatic analysis of structural features, so as to better understand the basis of the model and the predictive information (13). This is illustrated in Fig. 2, where a view of the hierarchical tree representing positive structural features important in a public gentox training set can be quickly identified and subsequently loaded into the modeling tool to help improve predictive
performance. The method classifies compounds for structural fea-
tures on the basis of chemical scaffolds that differentiate genotoxic
from nongenotoxic compounds. The hierarchical relationships are
thus used to create a constrained tree positioning potentially

promiscuous structural features (i.e., alerts) in proximity to each other in a parent-child relationship. The technique thus can be used to discover nontraditional structural alerts (63). Some commercial computational toxicology software offer prediction of organ-specific toxicity endpoints using annotated compound-toxicity associations from histological data used to build the models (67). These models do offer more insight into the types of pathology potentially affected by a compound (e.g., necrosis, lipid accumulation, organ tissue weight gain, etc.).
Another observation regarding the computational toxicology software is that few commercial systems are OECD compliant for regulatory use, because they do not adhere to the Setubal principles for QSARs (41, 42). According to the Setubal principles, QSARs for regulatory application should (1) be associated with a defined endpoint of regulatory importance, (2) take the form of an unambiguous algorithm, (3) have a defined domain of applicability, (4) be associated with appropriate measures of goodness of fit, robustness, and predictivity, and (5) have a mechanistic basis. Notably, some widely used computational toxicology software such as Derek for Windows (40) and Toxtree (available from the European Commission Joint Research Centre) are OECD compliant (38). Still, there are stark differences between programs, so clarifying how the output of the many different software programs is used in different integrated assessments could be a path forward towards better computational assessments and harmonization.

3. Discussion

With interest in applications for the pharmaceutical sciences, batteries of computer-based models using SARs, QSARs, decision trees, and toxicity databases have been developed and tested for accuracy in predicting discrete endpoints for preclinical safety, including genetic toxicity (12, 40, 43–45), rodent carcinogenicity (11, 38, 46–48), and teratogenicity (14, 49–52).
Predicting the genotoxic potential of drug-related molecules appears to be the most realistic application of (Q)SAR-based computational techniques for regulators of drugs in the USA (65). Regardless of whether the application is to support drug safety assessment or not, high-quality databases need to be compiled, well characterized, and rigorously tested. There are several examples of genetic toxicity databases used in computational toxicology modeling. In one recent study, collaboration between industry, computational software developers, and regulatory researchers led to the development of a genetic toxicity database of more than 3,200 chemicals (43). A private genetic toxicity warning system developed and used by AstraZeneca reported a database of 12,500 data points (53). Thus, there is a

plethora of private and public data resources available for developing


genetic toxicity models. A recent review summarized available pub-
lic toxicity databases (9). Establishing sustainable data management
and storage is therefore a critical process in computational toxicol-
ogy. The FDA is taking steps towards this end with development of a
project called the Chemical Evaluation and Risk Estimation System
(CERES) designed by the Center for Food Safety and Applied
Nutrition (54). CERES will provide a single unified data repository
and computational tools to identify potential safety concerns of
chemical substances the Center may encounter. According to its
developers, CERES will provide mode of action-driven QSAR prediction models as well as read-across tools in a strategy designed to
deal with a wide variety of information including regulatory, human
exposure data, physical/chemical properties of food ingredients,
and biological and chemical toxicity data (54).
For pharmaceuticals, outside of the aforementioned preclinical use cases, newer approaches addressing the prediction of human adverse effects of drugs using structure as an input variable are also underway, through data mining, integration, and computational analysis of post-market surveillance information (55–60). However, prospective validation and testing of these modeling techniques are needed in order to learn whether relevant safety signals can be detected with these approaches. Much of the validation being reported for statistical-based computational models suffers from reliance on internal cross-validation techniques without plans for external validation. These types of model building exercises can lead to overinterpretation of the generated predictions, and thus users should be cautious with respect to applying the predictive information in safety assessments. Improved validation will clarify how accurate these types of models really are, and will support the work of modelers in optimizing and updating models according to the needs of the risk assessor. It will also aid in understanding for which safety and risk assessment scenarios the predictive models might be useful to regulators and industry drug development scientists.
The majority of SAR and QSAR predictive computational software generate qualitative predictions in a dichotomous way (e.g., yes/no, positive/negative, active/inactive outputs). A major challenge is that many safety endpoints for pharmaceuticals are extremely complex, and the qualitative binary outputs from the computational tools may frankly not be informative enough, for example in evaluating human cancer risk from rodent carcinogenicity studies, or at the cellular-system level of the cardiac ion channels that play an important role in understanding drug cardiac safety. Thus, training a model and using predictive output stating that a drug is "positive" for a toxicological endpoint may not make sense given the complexity of the biological system, the countless decisions that go into establishing a safety concern, and how a drug may potentially

Fig. 3. The importance of research and innovation of computational tools to support safety assessments during the drug
development process.

perturb numerous biological pathways leading to a significant adverse effect. Thus, in order to prevent overinterpretation of computational predictions, end-user evaluation of the system and a whole-process understanding of how the predictive model was constructed, coupled with tempered expectations based on the suitability of the predictive data, are critically important.
An understanding of how the pharmaceutical industry uses computational predictive methods to support the numerous safety evaluations of drugs would aid regulatory risk assessors in better defining the strengths and weaknesses of the available technology, its limitations, and what the expert systems provide in terms of predictive information that may have potential utility for safety assessment in certain regulatory settings.
Recognizing these challenges, predictive computational tools in the pharmaceutical sciences are widely reported to lend support to safety assessments in multipurpose ways and at various stages of drug development, but innovation is key (Fig. 3) (33). One major characteristic is that these tools appear to be highly versatile or multipurpose. For example, many computational QSAR models have been built on training sets of chemical structures of different origins, including environmental agents, drugs, food ingredients, and numerous types of industrial chemicals (11). As a consequence, models constructed from these sources

do not hinge on drug information or adhere to the same study protocol standards (e.g., non-GLP). The study data and chemical structure information are harnessed to create an intended "global"-type data landscape. This philosophy of "global" modeling does offer greater chemical diversity and a higher opportunity for being able to predict a new compound or detect substructural features useful in the machine learning. However, one disadvantage is that the mechanism of action for toxicity of a wide spectrum of chemicals entered into toxicology training sets is likely unknown or unspecified, or the pharmacology is uninformed. This is evident because often the chemicals in the "global" training sets are not reviewed in terms of the mechanism for inducing the toxicity endpoint that is modeled (32, 47), and this is a drawback. It is difficult to collect large data sets of drugs and chemicals with a proven mechanism of action in the first place. Nonetheless, how the "global" modeling approach influences predictivity for toxicity remains to be seen. Other approaches have linked the chemicals in the model by the same mechanism of toxicity; this characteristically yields a QSAR training set with a congeneric series of structures and usually fine-tuned predictive performance, but at a cost of coverage of chemical space (32, 47, 61). The latest advances are in integrating experimental test data with computational predictions to arrive at semi-empirical assessments, which could be considered more evidence-based (65, 66).

4. Summary

An array of predictive technologies exists that offers an opportunity to detect drug safety liability endpoints. These approaches could eventually be used as part of an integrated testing strategy as outlined by the US National Academies in their report "Toxicity Testing in the 21st Century" (62), and aid innovative designs for addressing complex drug safety issues as explained in the FDA's landmark report on the Critical Path Initiative (5). Through rigorous research and development efforts with the numerous computational platforms and statistical modeling methods that are available, these tools have been applied to help resolve a wide spectrum of problems in toxicology and metabolism. Several new approaches have emerged and others are still evolving, as was covered in this chapter. Recognizing their limitations, all seem to be revealing how in silico modeling of drug properties and molecular structures can help address data gaps and lend scientific decision support to safety assessments for human pharmaceuticals, depending upon the use case scenario and the appropriateness of the modeling method.
References

1. Redfern WS, Carlsson L, Davis AS, Lynch WG, MacKenzie I, Palethorpe S, Siegl PKS, Strang I, Sullivan AT, Wallis R, Camm AJ, Hammond TG (2003) Relationships between preclinical cardiac electrophysiology, clinical QT interval prolongation and torsade de pointes for a broad range of drugs: evidence for a provisional safety margin in drug development. Cardiovasc Res 58:32–45
2. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3:711–715
3. Mullard A (2011) 2010 FDA drug approvals. Nat Rev Drug Discov 10:82–85
4. Kola I (2008) The state of innovation in drug development. Clin Pharmacol Ther 83:227–230
5. FDA (2004) Challenge and opportunity on the critical path to new medical products. US Department of Health and Human Services, FDA, Rockville. http://wcms.fda.gov/FDAgov/ScienceResearch/SpecialTopics/CriticalPathInitiative/CriticalPathOpportunitiesReports/ucm077262.htm
6. FDA (2011) FDA Critical Path Initiative. In: Critical path website, US Food and Drug Administration, Silver Spring. http://www.fda.gov/ScienceResearch/SpecialTopics/CriticalPathInitiative/default.htm
7. Arvidson KB, Chanderbhan R, Muldoon-Jacobs K, Mayer J, Ogungbesan A (2010) Regulatory use of computational toxicology tools and databases at the United States Food and Drug Administration's Office of Food Additive Safety. Expert Opin Drug Metab Toxicol 6:793–796
8. Yang C, Valerio LG Jr, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37:523–531
9. Valerio LG Jr (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol 241:356–370
10. Arvidson KB, Valerio LG, Diaz M, Chanderbhan RF (2008) In silico toxicological screening of natural products. Toxicol Mech Methods 18:229–242
11. Matthews EJ, Kruhlak NL, Benz RD, Contrera JF, Marchant CA, Yang C (2008) Combined use of MC4PC, MDL-QSAR, BioEpisteme, Leadscope PDM, and Derek for Windows software to achieve high-performance, high-confidence, mode of action-based predictions of chemical carcinogenesis in rodents. Toxicol Mech Methods 18:189–206
12. Contrera JF, Matthews EJ, Kruhlak NL, Benz RD (2008) In silico screening of chemicals for genetic toxicity using MDL-QSAR, nonparametric discriminant analysis, E-state, connectivity, and molecular property descriptors. Toxicol Mech Methods 18:207–216
13. Valerio LG, Yang C, Arvidson KB, Kruhlak NL (2010) A structural feature-based computational approach for toxicology predictions. Expert Opin Drug Metab Toxicol 6:505–518
14. Arvidson KB (2008) FDA toxicity databases and real-time data entry. Toxicol Appl Pharmacol 233:17–19
15. FDA (2008) Draft guidance for industry: genotoxic and carcinogenic impurities in drug substances and products: recommended approaches. U.S. FDA/CDER, Silver Spring
16. ICH (2010) Final concept paper. M7: genotoxic impurities. In: International conference on harmonisation of technical requirements for registration of pharmaceuticals for human use. International Conference on Harmonisation, Geneva
17. Ashby J, Lefevre PA, Styles JA, Charlesworth J, Paton D (1982) Comparisons between carcinogenic potency and mutagenic potency to Salmonella in a series of derivatives of 4-dimethylaminoazobenzene (DAB). Mutat Res 93:67–81
18. Kazius J, McGuire R, Bursi R (2005) Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem 48:312–320
19. Bailey AB, Chanderbhan R, Collazo-Braier N, Cheeseman MA, Twaroski ML (2005) The use of structure-activity relationship analysis in the food contact notification program. Regul Toxicol Pharmacol 42:225–235
20. Munro IC, Ford RA, Kennepohl E, Sprenger JG (1996) Correlation of structural class with no-observed-effect levels: a proposal for establishing a threshold of concern. Food Chem Toxicol 34:829–867
21. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res 659:248–261
22. Cramer GM, Ford RA, Hall RL (1978) Estimation of toxic hazard – a decision tree approach. Food Cosmet Toxicol 16:255–276
23. Worth A, Lapenna S, Lo Piparo E, Mostrag-Szlichtyng A, Serafimova R (2010) The applicability of software tools for genotoxicity and carcinogenicity prediction: case studies relevant to the assessment of pesticides. In: JRC scientific and technical reports. EC Joint Research Centre Institute for Health and Consumer Protection, Ispra, pp 18–19
24. Gasteiger J (2007) De novo design and synthetic accessibility. J Comput Aided Mol Des 21:307–309
25. Gasteiger J (2003) Physicochemical effects in the representation of molecular structures for drug designing. Mini Rev Med Chem 3:789–796
26. Naven RT, Louise-May S, Greene N (2010) The computational prediction of genotoxicity. Expert Opin Drug Metab Toxicol 6:797–807
27. Snyder RD (2009) An update on the genotoxicity and carcinogenicity of marketed pharmaceuticals with reference to in silico predictivity. Environ Mol Mutagen 50:435–450
28. Durham SK, Pearl GM (2001) Computational methods to predict drug safety liabilities. Curr Opin Drug Discov Devel 4:110–115
29. Snyder RD, Ewing DE, Hendry LB (2004) Evaluation of DNA intercalation potential of pharmaceuticals and other chemicals by cell-based and three-dimensional computational approaches. Environ Mol Mutagen 44:163–173
30. Contrera JF, Matthews EJ, Kruhlak NL, Benz RD (2005) In silico screening of chemicals for bacterial mutagenicity using electrotopological E-state indices and MDL QSAR software. Regul Toxicol Pharmacol 43:313–323
31. Myatt G, Cross KP, Valerio LG (2011) Supporting safety assessment of drug impurities through examination of an Ames assay QSAR model. In: SOT (ed) The toxicologist. Society of Toxicology, Washington, DC, p 1812
32. Benigni R, Bossa C (2008) Predictivity of QSAR. J Chem Inf Model 48:971–980
33. Boyer S (2009) The use of computer models in pharmaceutical safety evaluation. Altern Lab Anim 37:467–475
34. Ekins S, Andreyev S, Ryabov A, Kirillov E, Rakhmatulin EA, Sorokina S, Bugrim A, Nikolskaya T (2006) A combined approach to drug metabolism and toxicity assessment. Drug Metab Dispos 34:495–503
35. Bercu JP, Morton SM, Deahl JT, Gombar VK, Callis CM, van Lier RB (2010) In silico approaches to predicting cancer potency for risk assessment of genotoxic impurities in drug substances. Regul Toxicol Pharmacol 57(2–3):300–306
36. Boyer S (2010) The use of computer models in pharmaceutical safety evaluation. Altern Lab Anim 37:467–475
37. Richard AM, Yang C, Judson RS (2008) Toxicity data informatics: supporting a new paradigm for toxicity prediction. Toxicol Mech Methods 18:103–118
38. Mostrag-Szlichtyng A, Zaldivar Comenges JM, Worth AP (2010) Computational toxicology at the European commission's joint research centre. Expert Opin Drug Metab Toxicol 6:785–792
39. Saiakhov RD, Klopman G (2008) MultiCASE expert systems and the REACH initiative. Toxicol Mech Methods 18:159–175
40. Marchant CA, Briggs KA, Long A (2008) In silico tools for sharing data and knowledge on toxicity and metabolism: Derek for Windows, Meteor, and Vitic. Toxicol Mech Methods 18:177–187
41. Jaworska JS, Comber M, Auer C, Van Leeuwen CJ (2003) Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environ Health Perspect 111:1358–1360
42. Fjodorova N, Novich M, Vrachko M, Smirnov V, Kharchevnikova N, Zholdakova Z, Novikov S, Skvortsova N, Filimonov D, Poroikov V, Benfenati E (2008) Directions in QSAR modeling for regulatory uses in OECD member countries, EU and in Russia. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev 26:201–236
43. Yang C, Hasselgren CH, Boyer S, Arvidson K, Aveston S, Dierkes P, Benigni R, Benz RD, Contrera J, Kruhlak NL, Matthews EJ, Han X, Jaworska J, Kemper RA, Rathman JF, Richard AM (2008) Understanding genetic toxicity through data mining: the process of building knowledge by integrating multiple genetic toxicity databases. Toxicol Mech Methods 18:277–295
44. Benigni R, Zito R (2003) Designing safer drugs: (Q)SAR-based identification of mutagens and carcinogens. Curr Top Med Chem 3:1289–1300
45. Matthews EJ, Kruhlak NL, Cimino MC, Benz RD, Contrera JF (2006) An analysis of genetic toxicity, reproductive and developmental toxicity, and carcinogenicity data: I. Identification of carcinogens using surrogate endpoints. Regul Toxicol Pharmacol 44:83–96
46. Contrera JF, Kruhlak NL, Matthews EJ, Benz RD (2007) Comparison of MC4PC and MDL-QSAR rodent carcinogenicity predictions and the enhancement of predictive performance by combining QSAR models. Regul Toxicol Pharmacol 49:172–182
47. Benigni R, Bossa C (2008) Predictivity and reliability of QSAR models: the case of mutagens and carcinogens. Toxicol Mech Methods 18:137–147
48. Benfenati E, Benigni R, Demarini DM, Helma C, Kirkland D, Martin TM, Mazzatorta P, Ouedraogo-Arras G, Richard AM, Schilter B, Schoonen WG, Snyder RD, Yang C (2009) Predictive models for carcinogenicity and mutagenicity: frameworks, state-of-the-art, and perspectives. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev 27:57–90
49. Di Carlo FJ (1990) Structure-activity relationships (SAR) and structure-metabolism relationships (SMR) affecting the teratogenicity of carboxylic acids. Drug Metab Rev 22:411–449
50. Pearl GM, Livingston-Carr S, Durham SK (2001) Integration of computational analysis as a sentinel tool in toxicological assessments. Curr Top Med Chem 1:247–255
51. Matthews EJ, Kruhlak NL, Daniel Benz R, Ivanov J, Klopman G, Contrera JF (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: II. Construction of QSAR models to predict activities of untested chemicals. Regul Toxicol Pharmacol 47:136–155
52. Pery AR, Desmots S, Mombelli E (2010) Substance-tailored testing strategies in toxicology: an in silico methodology based on QSAR modeling of toxicological thresholds and Monte Carlo simulations of toxicological testing. Regul Toxicol Pharmacol 56:82–92
53. Glowienke S, Hasselgren C (2011) Use of structure activity relationship (SAR) evaluation as a critical tool in the evaluation of the genotoxic potential of impurities. In: Teasdale A (ed) Genotoxic impurities: strategies for identification and control. Wiley, Hoboken, pp 97–120
54. Arvidson K, McCarthy A, Yang C, Hristozov D (2011) Design and development of an institutional knowledgebase at FDA's Center for Food Safety and Applied Nutrition. In: SOT (ed) The toxicologist. Society of Toxicology, Washington, DC, p 155
55. Matthews EJ, Ursem CJ, Kruhlak NL, Daniel Benz R, Sabate DA, Yang C, Klopman G, Contrera JF (2009) Identification of structure-activity relationships for adverse effects of pharmaceuticals in humans: B. Use of (Q)SAR systems for early detection of drug-induced hepatobiliary and urinary tract toxicities. Regul Toxicol Pharmacol 54(1):23–42
56. Ursem CJ, Kruhlak NL, Contrera JF, MacLaughlin PM, Benz RD, Matthews EJ (2009) Identification of structure-activity relationships for adverse effects of pharmaceuticals in humans. Part A: use of FDA post-market reports to create a database of hepatobiliary and urinary tract toxicities. Regul Toxicol Pharmacol 54:1–22
57. Matthews EJ, Frid AA (2010) Prediction of drug-related cardiac adverse effects in humans – A: creation of a database of effects and identification of factors affecting their occurrence. Regul Toxicol Pharmacol 56:247–275
58. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB, Whaley R, Glennon RA, Hert J, Thomas KLH, Edwards DD, Shoichet BK, Roth BL (2009) Predicting new molecular targets for known drugs. Nature 462:175–181
59. Berger SI, Maayan A, Iyengar R (2010) Systems pharmacology of arrhythmias. Sci Signal 3:ra30
60. Rodgers AD, Zhu H, Fourches D, Rusyn I, Tropsha A (2010) Modeling liver-related adverse effects of drugs using k-nearest neighbor quantitative structure-activity relationship method. Chem Res Toxicol 23:724–732
61. Franke R, Gruska A, Bossa C, Benigni R (2010) QSARs of aromatic amines: identification of potent carcinogens. Mutat Res 691:27–40
62. NRC (2007) Toxicity testing in the 21st century: a vision and a strategy. National Research Council of the National Academies, Washington, DC
63. Valerio LG Jr, Cross KP (2012) Characterization and validation of an in silico toxicology model to predict the mutagenic potential of drug impurities. Toxicol Appl Pharmacol 260:209–221
64. Valerio LG Jr, Dixit A, Moghaddam S, Mora O, Prous J, Valencia A (2012) QSAR modeling for the mutagenic potential of drug impurities with Symmetry. Suppl Toxicol Sci 2906, 437
65. Nigsch F, Lounkine E, McCarren P, Cornett B, Glick M, Azzaoui K, Urban L, Marc P, Muller A, Hahne F, Heard DJ, Jenkins JL (2011) Computational methods for early predictive safety assessment from biological and chemical data. Expert Opin Drug Metab Toxicol 7:1497–1511
66. Valerio LG Jr (2012) Application of advanced in silico methods for predictive modeling and information integration. Expert Opin Drug Metab Toxicol 8:395–398
67. Myshkin E, Brennan R, Khasanova T, Sitnik T, Serebriyskaya T, Litvinova E, Guryanov A, Nikolsky Y, Nikolskaya T, Bureeva S (2012) Prediction of organ toxicity endpoints by QSAR modeling based on precise chemical-histopathology annotations. doi:10.1111/j.1747-0285.2012.01411.x
Part V

Integrated Modeling/Systems Toxicology Approaches


Chapter 16

Developing a Practical Toxicogenomics Data Analysis


System Utilizing Open-Source Software
Takehiro Hirai and Naoki Kiyosawa

Abstract
Comprehensive gene expression analysis has been applied to investigate the molecular mechanisms of toxicity, an approach generally known as toxicogenomics (TGx). When analyzing large-scale gene expression data obtained by microarray analysis, typical multivariate data analysis methods performed with commercial software, such as hierarchical clustering or principal component analysis, usually do not provide conclusive outputs by themselves. To best utilize TGx data for toxicity evaluation in the drug development process, fit-for-purpose customization of the analytical algorithm, with a user-friendly interface and intuitive outputs, is required to practically address toxicologists' demands. However, commercial software is usually not very flexible in the customization of its functions or outputs. Owing to the recent advancement and accumulation of open-source software contributed by bioinformaticians all over the world, it has become easier to develop practical and fit-for-purpose analytical software in-house with fairly low cost and effort. The aim of this article is to present an example of developing an automated TGx data processing system (ATP system), which implements gene set-level toxicogenomic profiling by the D-score method and generates straightforward output that makes it easy to interpret the biological and toxicological significance of the TGx data. Our example will provide basic clues for readers to develop and customize their own TGx data analysis system to complement the functions of existing commercial software.

Key words: Toxicogenomics, Bioinformatics, KNIME, R, Bioconductor, Open-source software

1. Introduction

Toxicity is one of the major causes of drug development attrition (1), which can have a considerable impact on R&D cost and time. Historical "gold standards" for the evaluation of drug-induced toxicity in the preclinical drug development phase include observation of the general condition of animals, histopathological examination, and measurement of well-established biomarkers, such as the plasma alanine aminotransferase level as an index of liver injury. In addition to the observation of these conventional

Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_16, © Springer Science+Business Media, LLC 2013

toxicological endpoints, comprehensive gene expression analysis using the microarray technique for the investigation of the molecular mechanisms of drug-induced toxicities, generally called toxicogenomics (TGx), has been widely utilized in pharmaceutical companies in the last decade. TGx data add value to such conventional toxicology data by providing detailed information on comprehensive molecular dynamics at the gene expression level. One of the major difficulties in utilizing TGx data is how to efficiently handle the huge gene expression data set and extract toxicologically relevant information. A number of commercial software packages provide tools to perform typical multivariate analysis methods such as hierarchical clustering or principal component analysis. However, commercial software is usually expensive, and sometimes requires a relatively high-performance computer to run the programs' user-friendly interfaces and sophisticated multivariate algorithms. In addition, multivariate data analysis by itself does not usually generate intuitive and conclusive outputs, and in many cases users are forced to spend considerable time on the differentially expressed gene list to carefully interpret the biological/toxicological significance of the TGx data. Sometimes commercial pathway analysis software such as Ingenuity Pathway Analysis (IPA, http://www.ingenuity.com/), MetaCore (http://www.genego.com/metacore.php), or ToxWiz (http://www.camcellnet.com/) is helpful for toxicologists in interpreting the biological significance of the differentially expressed gene list. However, the generated pathway is usually too complicated for interpreting the toxicological significance of the results as well, because the amount of information included in the generated network (i.e., edges and nodes) is too large for ordinary toxicologists, who are not necessarily experts in genomics. To address these difficulties in TGx data analysis, we previously introduced a gene set-level analytical method called the "differentially regulated gene score," or D-score (2), which represents the general expression change level of certain gene sets. The D-score method can reduce the amount of data (i.e., from information on >30,000 genes down to a number of gene sets that users can arbitrarily define), as presented in Fig. 1.
Since the algorithm of the D-score calculation is quite simple, we do not need to use a high-performance computer, and it can be implemented by utilizing open-source software, including the statistical software R (3). Since the D-score method can substantially reduce the data size, it is much easier for users to interpret the biological/toxicological significance. In addition, it is expected that gene set-level or biological pathway-level analysis would provide more robust outcomes compared with individual gene-level analysis when observing the molecular dynamics elicited by chemical exposure. Information on gene sets whose expression levels are closely associated with certain toxicities has been accumulated (4), which is

[Fig. 1 diagram: a GeneChip data matrix (Affymetrix probe IDs, gene titles, gene symbols, Entrez Gene IDs, and per-group expression ratios for >30,000 probe sets) is partitioned into gene sets #1 to #N, and D-score calculation reduces the data to N scores: GeneChip data (>30,000 probe sets) → N gene sets → N scores.]

Fig. 1. Concept of D-score analysis. Affymetrix Rat 230 2.0 GeneChip array contains >30,000 probe sets. Users are
required to define gene sets whose expression levels are associated with certain biological or toxicological pathways
(in this figure a total of N gene sets are defined). Gene expression data contained in the predefined gene sets were used for
the calculation of D-score, which represents the general tendency of expression changes of certain gene sets (2). During
this process, the dimension of the data was reduced from >30,000 to N, and therefore it is easier for toxicologists to
understand the overall profile of TGx data.

sometimes known as a "gene expression signature" or "TGx biomarker." Thus, it is becoming easier to obtain such gene set information from the published literature.
Although the output of D-score analysis is straightforward and its biological/toxicological significance is easy to interpret, the calculated scores can be considerably affected by fluctuating levels of gene expression changes that are sometimes not toxicologically significant. In such cases, the calculated score may be misleading. To avoid making inappropriate conclusions, users must check the expression changes of the individual genes included in the gene set and verify that meaningful expression changes of the toxicity-relevant genes were observed. It is desirable to develop software that lets users execute this process easily. Owing to the continuous devotion of bioinformaticians, a myriad of open-source software is now available on the Web, and we can save valuable time and money when constructing a data analysis environment to address our demands.
The aim of this article is to present an example of developing a simple, practical, and cost-effective bioinformatic system to facilitate TGx data interpretation using combinations of open-source software. We will not go into the details of conventional statistical analysis to generate differentially regulated gene lists, since a number of analytical algorithms have already been reported and can easily be implemented in both public and commercial software (reviewed in ref. 5). Instead, we focus on the utilization of the D-score analysis method, which adds value to conventional multivariate data analysis by presenting simple and intuitive outputs that will help ordinary toxicologists, who are not always experts in genomics, to interpret the toxicological significance of TGx data efficiently.

2. Open-Source Software

The analytical environment described in the present article was developed and implemented on the Windows XP operating system. We developed a computational workflow using KNIME software, by which the procedures of TGx data analysis were automated, including data normalization, statistical analysis, QC analysis, and D-score calculation, all of which were coded with the statistical software R. Although this approach may compromise computational performance and take time to complete the analysis, the KNIME workflow makes it easy for users who are not expert in computational coding to understand the logical flow of TGx data processing. Therefore, users can easily modify the nodes in the workflow when they want to try an alternative algorithm for the analysis. The following open-source software and resources were utilized in our system:
• R (http://cran.r-project.org/) is one of the most widely used statistical software packages in the molecular biology community (3). Part of the reason we chose R is the existence of a good community of users; a number of books and online instruction tools are available, all of which are very helpful for solving the problems users encounter; and the software is updated very frequently. We used R-2.11.1 to develop this software.
• The Comprehensive R Archive Network (CRAN) (http://cran.r-project.org/) archives a vast number of high-quality, freely available user-contributed packages, including graphics packages. The following R packages from the CRAN archives were used in the present article; users can download them and install them from within R with a call to the install.packages function (see the installation sketch after this list).
ggplot2 (http://cran.r-project.org/web/packages/ggplot2/index.html) is a plotting system for R. It provides a powerful model of graphics that makes it easy to produce complex multilayered graphics (e.g., heat maps).
plotrix (http://cran.r-project.org/web/packages/plotrix/index.html) provides various plotting functions. We use its radial.plot command to generate the radar chart.
• Bioconductor (http://www.bioconductor.org/) is a project that develops and archives R-based bioinformatic software for analyzing transcriptome, proteome, and metabolome data obtained by microarray, LC/GC-mass spectrometry, and other techniques (6), which can be implemented on Linux, Windows, and Mac OS. The following R packages from the Bioconductor archives were used in the present article; users can download them and install them from within R with a call to the biocLite function (see the installation sketch after this list).
rat2302.db (http://www.bioconductor.org/help/biocviews/2.7/data/annotation/html/rat2302.db.html) is Affymetrix Rat Genome 230 2.0 Array annotation data (chip rat2302) assembled using data from public repositories.
Rgraphviz (http://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html) is a tool for plotting R graph objects. It interfaces R with the AT&T Graphviz library for plotting R graph objects from the graph package. Users on all platforms must install Graphviz.
• KNIME (http://www.knime.org/) enables users to visually create a data analysis pipeline, selectively execute some of the analysis steps, and investigate the results through interactive views of the data and models (7). Scripts written in R can be incorporated into the analysis flow by installing an appropriate add-in module program. We used KNIME 2.3.1 to develop the software. In order to use R within KNIME, we downloaded the R plug-in for KNIME and overwrote the path of the R binaries in the R plug-in within KNIME.
• Graphviz (http://www.graphviz.org/) is open-source graph visualization software with several main graph layout programs (8). The program automatically places nodes and edges according to the user-selected layout (e.g., hierarchical, radial, circular, etc.), with a very simple and intuitive text language that describes the relationships among nodes. The functions of Graphviz can be used in an R environment via the Rgraphviz software available on the Bioconductor Web site (9). We used Rgraphviz to draw the gene set-level network.
• Perl is a scripting language widely used for system administration and programming (e.g., very accomplished text manipulation) on the World Wide Web. ActivePerl (http://www.activestate.com/activeperl) is a quality-assured binary distribution of Perl for popular UNIX platforms and Windows, freely available under the ActivePerl Community License Agreement. In the present article, we utilized Perl to produce an HTML page that visualizes the D-scores colored according to a user-defined threshold.
• JavaScript is a client-side programming language that runs within the Web browser to make Web pages dynamic and interactive. Using JavaScript and DHTML, the sorting operation in the D-score HTML page is achieved with Table sorter (http://neil.fraser.name/software/tablesort/).
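As noted in the CRAN and Bioconductor entries above, a minimal sketch of the installation calls for the packages used in this chapter (biocLite was the Bioconductor installer current for the R-2.11.1/Bioconductor 2.x releases used here):

install.packages(c("ggplot2", "plotrix"))       # CRAN packages
source("http://bioconductor.org/biocLite.R")    # load the Bioconductor installer
biocLite(c("rat2302.db", "Rgraphviz"))          # Bioconductor packages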

3. Methods

3.1. Gene Expression Data Sets

Affymetrix GeneChip microarrays were used for obtaining TGx data sets. Numerical gene expression data were generated from the scanned GeneChip data using the MAS5 algorithm (10), implemented in GCOS software (Affymetrix). The MAS5 algorithm generates a Signal (a numerical value that stands for the gene expression level) and a Detection Call (Presence, Marginal, or Absence Call, which stands for the reliability of the Signal value) for each probe set, and both the Signal and the Detection Call are required for the D-score calculation. Although we use GeneChip data as an example in the present article, TGx data obtained from any other microarray platform may also be utilized, provided that the platform generates appropriate indices for the gene expression level (i.e., Signal in GeneChip) and the reliability of the data (i.e., Detection Call in GeneChip). A minimal import sketch follows.
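As an illustration, a minimal R sketch for importing such data, assuming a hypothetical tab-delimited MAS5 export (the file name and column labels are ours; the actual export format depends on the GCOS settings):

# read a MAS5 export with probe set IDs in the first column
mas5 <- read.delim("mas5_export.txt", row.names = 1, check.names = FALSE)
# split the table into Signal and Detection Call matrices by column label
signal <- as.matrix(mas5[, grep("Signal", colnames(mas5))])
calls  <- as.matrix(mas5[, grep("Detection", colnames(mas5))])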

3.2. Preparation of Gene Sets

To perform the D-score analysis, users must predefine the gene sets whose expression levels are closely associated with certain biological pathways or toxicity endpoints. Such gene sets can be prepared either from the published literature or from experiment-based information (4). In addition, gene set information can be obtained from public databases such as Gene Ontology (11), KEGG (12), BioCarta (http://www.biocarta.com/genes/allPathways.asp), or GenMAPP (13) biological pathway information. Notably, the authors found that the Molecular Signatures Database (MSigDB) available on the Gene Set Enrichment Analysis (GSEA) Web site (http://www.broadinstitute.org/gsea/index.jsp) is an excellent resource for obtaining gene sets, such as those for biological pathways or targets of transcription factors, which are either derived from public pathway databases or computationally determined by the GSEA algorithm (14). Although the gene set information obtained from public databases is useful as it is, the directions of the gene expression changes (i.e., upregulation or downregulation) induced by chemical treatment are usually not uniform within a gene set, which will compromise the performance of the D-score calculation. To obtain the best results from D-score analysis, users are required to carefully classify the genes into upregulated and downregulated ones for each biological/toxicological pathway.

3.3. Development and Implementation of the Automated TGx Data Processing System

1. Setting up an analytical workflow using KNIME: The overall analytical flow of the gene set-level TGx profiling system is summarized in Fig. 2, and we developed the automated TGx data processing system (ATP system) to achieve this analytical flow. Individual computational modules were written with R and were integrated into a KNIME workflow (Fig. 3). Although the ATP system includes basic functions for microarray data analysis such as

[Fig. 2 flowchart: import TGx data → data pre-processing (normalization, signal log ratio, Presence Call ratio, etc.) → calculation of D-scores for all the gene sets (using prepared gene sets whose expression levels are associated with certain biological/toxicological pathways) → summarize the result by HTML (heat map of individual genes, radar chart, gene set-level network with a user-defined network structure) → identification of responsible genes and affected biological/toxicological pathways → interpretation of the TGx data by toxicologists.]

Fig. 2. Overview of the automated toxicogenomics data processing system (ATP system). Users can customize the contents of the gene sets based on their own research interests. The ATP system assists users in organizing the gene sets, implementing the D-score calculations after importing the TGx data, and summarizing the results in an HTML format for efficient interpretation of the results' biological and toxicological significance.

normalization, quality control (QC), and statistical analysis to generate the list of differentially regulated genes, we will focus only on the D-score analysis in the present paper.
2. Preparation of the data set object: The following three files are required to implement D-score analysis in the ATP system: (1) MAS5-analyzed GeneChip data, (2) a group definition file, and (3) a comparison definition file. The group definition file assigns the individual samples to the corresponding groups. The comparison definition file determines the appropriate pair of groups to be compared to calculate the gene expression ratio between the chemical-treated and corresponding control groups, and is used for the D-score calculation (Fig. 4). In the ATP system, we applied the global mean normalization method for normalizing the TGx data; a minimal sketch is given below.
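A minimal sketch of global mean normalization, under the assumption that each array is scaled so that its mean signal equals the grand mean over all arrays (the exact scaling used by the ATP system may differ):

# signal: numeric matrix of MAS5 Signals (probe sets x samples)
normalize.global.mean <- function(signal) {
  array.means <- colMeans(signal)              # mean signal per array
  target <- mean(array.means)                  # grand mean over all arrays
  sweep(signal, 2, target / array.means, "*")  # rescale each column
}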
3. D-score calculation: The algorithm of the D-score calculation is described in the previous publication (2). Briefly, the signal log ratio (SLR) was calculated as the base-2 logarithm of the mean signal value of the chemical-treated group divided by that of the corresponding controls. The Presence Call ratio (PR) was determined by dividing the number of Presence Calls given for the chemical- and vehicle-treated samples combined by the total number of chemical-treated and control samples. For example, if three of the chemical-treated animals gave Presence Calls and three of the vehicle-treated animals gave Absence Calls, the PR is calculated as (3 + 0)/(3 + 3) = 0.5.

Fig. 3. KNIME workflow. Each node represents a program written with the statistical software R and incorporated into a workflow designed with KNIME software. The ATP system includes functions for data normalization, QC, and statistical analysis to generate a differentially expressed gene list, in addition to the D-score calculation. However, we focus on examples of D-score analysis in the present article.

PR is a useful factor for eliminating the noise effects of low-quality signal data (2).
Assuming that a gene set X consists of N probe sets $(x_1, x_2, x_3, \ldots, x_{N-1}, x_N)$, the calculated SLRs are given as $(\mathrm{SLR}_1, \mathrm{SLR}_2, \mathrm{SLR}_3, \ldots, \mathrm{SLR}_{N-1}, \mathrm{SLR}_N)$ and the PRs as $(\mathrm{PR}_1, \mathrm{PR}_2, \mathrm{PR}_3, \ldots, \mathrm{PR}_{N-1}, \mathrm{PR}_N)$. Using these parameters, the D-score is calculated as:

$$\mathrm{Index~1} = \sum_{i=1}^{N} (\mathrm{SLR}_i \times \mathrm{PR}_i)\,/\,N$$


Fig. 4. Preparation of data set object. The following three files are required for implementing the ATP system: (a) MAS5-
analyzed GeneChip data, (b) group definition file to define which column of the GeneChip data is assigned to specific
groups, and (c) comparison definition file to determine the combination of groups to be compared in the D-score
calculation.

$$\mathrm{Index~2} = \sum_{i=1}^{N} (\mathrm{SLR}_i \times \mathrm{PR}_i)^2\,/\,N$$

$$\text{D-score} = \mathrm{Index~1} \times \mathrm{Index~2} \times 100~(\text{scaling factor})$$


where Index 1 stands for the overall direction of the expression change per probe set, and Index 2 stands for the overall magnitude of the expression change per probe set of gene set X. The sign of Index 1 is positive when the overall gene expression is upregulated and negative when it is downregulated; the value of Index 1 is expected to approach zero when the directions of the expression changes are divergent. The sign of Index 2 is always positive, and its value becomes greater as the expression changes of the genes in gene set X become larger. Accordingly, the D-score will be higher when the genes included in gene set X show uniform upregulation with higher

[Fig. 5 panels: for the gene probe sets of a gene set, heat maps of the gene expression change, Presence/Absence Call, and statistics (P-value) at 1, 2, 6, 12, and 24 h after BB treatment.]

Fig. 5. Summarization of the calculated D-scores. The calculated D-score for each gene set is presented in an HTML format. The D-scores are linked to heat maps of the corresponding expression changes, Presence/Absence Calls, and statistics (P-values) of the individual genes. In addition, the HTML has links to the radar chart (Fig. 6) and the gene set-level network presentation (Fig. 7).

expression change levels. The calculation algorithm is quite simple and straightforward, and can easily be implemented in any programming language, including R and Perl; a minimal sketch in R is given below.
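A minimal sketch of the calculation in R, following the Index 1, Index 2, and D-score formulas above (the function and argument names are ours):

# slr, pr: numeric vectors of SLR and PR values for the N probe sets
# of one gene set
dscore <- function(slr, pr) {
  index1 <- sum(slr * pr) / length(slr)       # overall direction of change
  index2 <- sum((slr * pr)^2) / length(slr)   # overall magnitude of change
  index1 * index2 * 100                       # 100 = scaling factor
}

# Presence Call ratio for one probe set: fraction of Presence ("P")
# calls among the treated and control samples combined
presence.ratio <- function(calls.treated, calls.control) {
  calls <- c(calls.treated, calls.control)
  sum(calls == "P") / length(calls)
}

presence.ratio(c("P", "P", "P"), c("A", "A", "A"))   # 0.5, as in the example above
dscore(slr = c(1.2, 0.8, 1.5), pr = c(1.0, 0.5, 1.0))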
4. Summarizing the calculated D-scores: The final D-score calculated for each gene set is presented in an HTML format (Fig. 5). Every D-score in the table is hyperlinked to a heat map of the individual genes included in the gene set, where the levels of gene expression change, the Presence/Absence Calls, and the statistical significance of the gene expression changes between the compared groups are presented. This information helps users to efficiently understand which genes were differentially expressed between the groups and which gene expression changes affected the value of the calculated D-score. The JavaScript software Table sorter allows dynamic and interactive usability: the gene sets can be sorted by their D-scores, which allows users to easily identify which biological/toxicological pathways

a
> library(plotrix)
> dscore <- c(-25, 70, 5, 10)
> dscore.names <- c("Set1", "Set2", "Set3", "Set4")
> radial.plot(dscore, labels = dscore.names, rp.type = "p",
+   line.col = "blue", radial.lim = c(-100, 100))

b
[Panel b: radar chart of D-scores at 2 h after BB treatment; the highest score is for the DNA damage gene set (D-score = 66.9).]

Fig. 6. Radar chart of calculated D-scores. (a) Sample R code. (b) The calculated D-scores are presented in a radar chart. The solid line indicates the D-scores for 60 gene sets.

were affected by the chemical treatment. In addition, the HTML page has hyperlinks to a radar chart (Fig. 6) and the gene set-level network (Fig. 7).
5. Radar chart: The radar chart presentation is convenient for users to capture the overall profile of the calculated D-scores (Fig. 6). Users may define a threshold based on their experience when evaluating the biological/toxicological significance of the D-scores.
6. Gene set-level network presentation: Graphviz enables users to design a gene set-level network (Fig. 7a). Once a gene set-level network that addresses the user's research interest has been defined, the functional associations of the individual D-scores can be interpreted more intuitively and appropriately by users who are not experts in genomics. As shown in Fig. 7b, a gene set-level network colored with D-score values is a convenient way to efficiently capture the molecular dynamics elicited by chemical treatment (a sketch of deriving node colors from D-scores is given after Fig. 7). An example of a gene set-level network application can be found in our previous publication (15).

a
> library("Rgraphviz")
> rEG <- new("graphNEL", nodes = c("AhR", "Cyp1a1",
+   "Oxidative_stress"), edgemode = "directed")
> rEG <- addEdge("AhR", "Cyp1a1", rEG, 1)
> rEG <- addEdge("Cyp1a1", "Oxidative_stress", rEG, 1)
> nAttrs <- list(fillcolor = c(Cyp1a1 = "red"))  # color can be changed by D-score
> plot(rEG, nodeAttrs = nAttrs)

b
[Panel b: the resulting directed network, AhR → Cyp1a1 → Oxidative stress.]

Fig. 7. Gene set-level network. Rgraphviz was utilized to draw the gene set-level network. (a) Sample code for Rgraphviz. Such a network structure must be predefined by users. (b) Network structure for TGx profiling in the rat liver. The color of each node represents its D-score (i.e., upregulation: orange and red; downregulation: blue), which helps users to understand which biological/toxicological pathways were affected by the chemical treatment.
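The comment in the Fig. 7a code notes that the node color can be changed according to the D-score. A hypothetical sketch of one such mapping is given below; the thresholds and colors are illustrative only and are not taken from the original publication:

# map a D-score to a node fill color (illustrative thresholds)
dscore.to.color <- function(d) {
  if (d > 50) "red" else if (d > 20) "orange"
  else if (d < -20) "blue" else "white"
}
node.scores <- c(AhR = 10, Cyp1a1 = 66.9, Oxidative_stress = 30)
nAttrs <- list(fillcolor = sapply(node.scores, dscore.to.color))
# plot(rEG, nodeAttrs = nAttrs)  # rEG as constructed in Fig. 7a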

4. Practical Examples for the Evaluation of Drug-Induced Toxicity

Although the concept of the ATP system is quite simple, its eventual output is considerably enriched with toxicity-relevant information and is helpful for interpreting the biological/toxicological significance of the TGx data, given that the contents of the gene sets and the structure of the gene set-level network carry the essential information. In this section, we present two practical examples of how D-score analysis can be applied to the preclinical evaluation of drug-induced toxicity using TGx data. The ATP system was used for D-score calculation, and the resulting output was analyzed either with the ATP system or with commercial software. For presentation purposes, we used TIBCO Spotfire software (http://spotfire.tibco.com/) to perform the hierarchical clustering in Figs. 8 and 9.

4.1. Example 1: Identification of Genes and Pathways Associated with Drug-Induced Toxicity

The first example shows how to identify the genes and biological pathways that were affected in rat livers following treatment with 300 mg/kg bromobenzene (BBz), a representative hepatotoxicant that causes oxidative stress following hepatic glutathione depletion (16). The liver was harvested at 1, 2, 6, 12, and 24 h after a single BBz treatment, and microarray analysis was performed on the obtained rat liver samples using Affymetrix Rat 230 2.0 GeneChips. A total of 217 gene sets associated with specific biological pathways registered in the BioCarta
database were obtained from MSigDB. Since the downloaded path-
way information consisted of gene symbol information, we needed to
transform the gene symbols to the corresponding Affymetrix GeneChip
probe IDs to perform the D-score calculation with the ATP system.

Fig. 8. Identification of genes and biological pathways affected by chemical treatment.

Fig. 9. Application of D-score for compound screening. Rows of the heat map correspond to gene sets
(biological/toxicological pathways): hematotoxicity, Cyp1a2, LXR targets, Cyp4a, Cyp2c, Cyp3a, Cyp2b,
carcinogenesis, aldo-keto reductase, ABC transporter, proteasome, DNA damage, UGT, oxidative stress, Gst,
Cyp1a1, and glutathione homeostasis. Columns correspond to compound profiles (#1-#9, with low- and
high-dose profiles shown separately for #1 and #3), ordered by hierarchical clustering; the color scale
encodes the D-score (-80 to 80).
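
As a concrete illustration of the symbol-to-probe transformation step described above, the following is a minimal R sketch. It assumes the Bioconductor annotation package rat2302.db for the Rat Genome 230 2.0 array (the mapping procedure actually used in the ATP system is not detailed here), and the example gene symbols are arbitrary:

library(AnnotationDbi)
library(rat2302.db)   # annotation package for the Rat Genome 230 2.0 array

symbols <- c("Maff", "Fos", "Mafk", "Jun")   # example gene set members
map <- AnnotationDbi::select(rat2302.db,
                             keys    = symbols,
                             keytype = "SYMBOL",
                             columns = c("SYMBOL", "PROBEID"))
## a single gene symbol may map to several probe sets; retain all of them
probe.ids <- unique(map$PROBEID)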
Figure 8 is a screenshot of the result obtained with the ATP
system, where gene sets were listed in a decreasing order of D-score
at 12 h. The D-score was highest for the ARENRF2 pathway gene set, which
represents oxidative stress-induced gene expression via the Nrf2 tran-
scription factor (http://www.biocarta.com/genes/index.asp). This
result is reasonable considering that hepatotoxicity elicited by BBz
is mediated via induction of oxidative stress. The next step is to
identify which genes actually contributed to the high value of the calcu-
lated D-score. The ATP system provides a hyperlink to the heat
map of expression data of individual genes included in the gene set,
and it is easy for users to verify the expression profile of individual
genes. As shown in Fig. 8, increased expression levels of Maff, Fos,
Mafk, and Jun genes were evident at 12 h after the BBz treatment.
These expression changes were considered to be responsible
for the high value of the calculated D-score. Users then continue
this process for other gene sets whose D-scores showed a remarkable
increase or decrease after the chemical treatment, so that the genes
and pathways closely associated with drug-induced tox-
icity can be identified. This is a typical analytical procedure to
efficiently identify the genes and pathways associated with drug-
induced toxicities. The ATP system also provides hyperlinks to the radar
chart and gene set-level network presentation. Notably, a user-
defined gene set-level network structure is quite helpful for users
to interpret the toxicological significance of the TGx data effi-
ciently, as presented in a previous publication (15).

4.2. Example 2: Application of D-Score for Compound Screening

The second example shows how the D-score can be applied to
screening for less toxic compounds in the preclinical drug develop-
ment process. Nine compounds that have the same pharmacological
target were administered to rats, and the liver TGx data were sub-
jected to D-score calculation using a total of 12 gene sets, the contents
of which are the same as previously published (17). The calculated
D-scores were then subjected to hierarchical clustering (Fig. 9). The
heat map of D-scores indicates the tendency for these compounds to
induce oxidative stress and affect glutathione homeostasis in the
rat liver. Downregulation of Cyp1a2 gene expression was also
noted, and hematotoxicity-associated gene expression change was
observed as well. Figure 9 demonstrates that the magnitude of these
molecular perturbations tends to be greater on the left side of the
heat map, indicating that the potential to induce these molecular
perturbations would be higher for compounds located on the left
side such as #3, #4, and #8; and those located on the right side such
as #1 and #3 are considered to be less toxic. Interestingly, the
D-score profile of compound #1 showed unique characteristics
compared with the others: induction of Cyp2b genes was
observed, but little effect on Cyp1a2 and oxidative stress-related
gene sets was seen. These results demonstrate that compound
#1 elicits relatively unique molecular dynamics compared with
others. As such, the D-score analysis information substantially
enriches toxicity-relevant information based on TGx data, which
adds mechanistic insights into conventional toxicological endpoints
such as histopathology or blood chemistry, and assists toxicologists
to select less toxic compounds for further development as drug
candidates. Furthermore, we may use gene sets not only for those
associated with toxicity but also for pharmacological effects, by
which we can obtain information for estimating a possible therapeu-
tic window or safety margin. This is critical for selecting compounds
for further drug development process.

5. Conclusion

Although commercial software provides a user-friendly interface
and sophisticated statistical algorithms, the final outputs do not
always meet the practical needs of toxicologists because of the
complexity of the algorithms and ambiguity of the results. Com-
mercial software also does not always allow users to flexibly custom-
ize its functions. In contrast, software developed in-house
possesses advantages in that users can select fit-for-purpose algo-
rithms to address their own interests. Owing to the continuous
efforts and generous contributions of programmers and bioinfor-
maticians around the world, we can utilize myriad open-source
software, which sometimes even outperforms commercial software
because of its up-to-date maintenance of code in response to users'
demands and requests. By utilizing the ATP system presented in the
present article, members of the authors' groups can now better
utilize gene set information accumulated in-house to address their
own issues encountered in the process of TGx data interpretations.
In addition, compared to the previous workflow without the ATP
system, in which users had to run each fragmented program of the
software and manually transfer data from one program to another,
human errors that occurred in the data analysis process have been
dramatically reduced by utilizing the ATP system. Furthermore,
since the logical flow of the ATP system is very straightforward,
users can incorporate any idea that may possibly facilitate their
interpretation of the TGx data. To successfully develop in-
house software like the ATP system, it is crucial for bioinformaticians
and toxicologists to have a mutual understanding of each
other's disciplines.

6. Notes

The following are notes to be considered when developing a prac-
tical and useful TGx data analysis system similar to the ATP system
for ordinary toxicologists who are not computational experts.
1. The usability and final output should be intuitive and simple:
Sometimes usability of the software (interface, computational
performance, or simplicity of the output) can be the highest
priority for users who are not experts in computational opera-
tions. In fact, usability can be an even more important factor
for such users than implementing highly sophisticated statisti-
cal algorithm whose outputs are not easy to interpret their
biological/toxicological significance. Data handling, computa-
tional algorithm, and final outputs should be as intuitive and
simple as possible to encourage toxicologists to utilize the
software.
2. The analytical algorithms should meet the users' practical demands:
In the case of the ATP system, there was no commercial software
available that implemented D-score analysis for TGx data. The ATP
system was developed according to the users' demand to
automate and integrate every step of the D-score-based
analysis, by which users can obtain enriched information relevant
to drug-induced toxicity from TGx data. Any analytical algorithm
that represents the general level of gene expression change of a
gene set will work within the same framework as the ATP system.
Users should apply their own analytical algorithms that
address their own research interests.
3. Tuning up the gene set contents is crucial in D-score analysis: In
the case of the ATP system, the performance of the calculated D-score
is highly dependent on the content of the gene sets. The
content of the gene lists should be customized to address users'
own research interests, such as toxicity evaluations for the liver,
kidney, heart, or testis. One of the drawbacks of D-score analy-
sis is that we discard the information for genes that are not
included in the gene set. Users should include as many genes as
possible to best utilize the TGx data, especially when the sensi-
tivity of detecting toxicity is the highest priority. On the other
hand, users may strictly select a small number of genes if their
priority is to reduce the false-positive results. It depends on
what type of outputs they want to obtain (e.g., toxicity screen-
ing, toxicity profiling, comparative analysis, etc.).
4. A reference database will help appropriate interpretation of the
output: To interpret the significance of gene expression changes
appropriately, it is crucial to utilize a reference database that
includes a vast amount of high-quality TGx data sets. It should
be noted that the reference gene expression data should be
compatible with the in-house data and fit for the purpose of the
toxicity targets of interest (i.e., microarray platform, species,
study design, data quality, etc.).

Acknowledgments

The authors would like to acknowledge the people who have con-
tributed to developing the open-source software and publicly avail-
able databases used herein. The authors are grateful to Dr. Kazumi
Ito, Dr. Takashi Yamoto, Kyoko Watanabe, Noriyo Niino, and
Miyuki Kanbori of Medicinal Safety Research Laboratories for
their devotion to the TGx research activity in Daiichi Sankyo.
The authors also thank Drs. Masatoshi Nishimura, Koichi Tazaki, and Kazuhiko
Mori for their productive discussions and advice, and Drs. Atsushi San-
buissho, Yuichi Kubo, Hideyuki Haruyama, and Sunao Manabe for
their continuous support of the toxicoinformatic research activity
in Daiichi Sankyo.

References
1. Bass AS, Cartwright ME, Mahon C, Morrison R, Snyder R, McNamara P, Bradley P, Zhou YY, Hunter J (2009) Exploratory drug safety: a discovery strategy to reduce attrition in development. J Pharmacol Toxicol Methods 60:69-78
2. Kiyosawa N, Ando Y, Watanabe K, Niino N, Manabe S, Yamoto T (2009) Scoring multiple toxicological endpoints using a toxicogenomic database. Toxicol Lett 188:91-97
3. R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org
4. Kiyosawa N, Ando Y, Manabe S, Yamoto T (2009) Toxicogenomic biomarkers for liver toxicity. J Toxicol Pathol 22:35-52
5. Grewal A, Lambert P, Stockton J (2007) Analysis of expression data: an overview. Curr Protoc Bioinformatics, Chapter 7, Unit 7.1
6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80
7. Berthold MR, Borgelt C, Hoppner F, Klawonn F (2010) KNIME. In: Gries D, Schneider FB (eds) Guide to intelligent data analysis. Springer, London, pp 375-382
8. Gansner ER, North SC (1999) An open graph visualization system and its applications to software engineering. Softw Pract Exper 0:1-5
9. Carey VJ, Gentry J, Whalen E, Gentleman R (2005) Network structures and algorithms in Bioconductor. Bioinformatics 21:135-136
10. Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression analysis. Bioinformatics 18:1585-1592
11. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25-29
12. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (1999) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27:29-34
13. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR (2002) GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat Genet 31:19-20
14. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 102:15545-15550
15. Kiyosawa N, Manabe S, Yamoto T, Sanbuissho A (2010) Practical application of toxicogenomics for profiling toxicant-induced biological perturbations. Int J Mol Sci 11:3397-3412
16. Lau SS, Monks TJ (1988) The contribution of bromobenzene to our current understanding of chemically-induced toxicities. Life Sci 42:1259-1269
17. Kiyosawa N, Manabe S, Sanbuissho A, Yamoto T (2010) Gene set-level network analysis using a toxicogenomics database. Genomics 96:39-49
Chapter 17

Systems Toxicology from Genes to Organs


John Jack, John Wambaugh, and Imran Shah

Abstract
This unique overview of systems toxicology methods and techniques begins with a brief account of systems
thinking in biology over the last century. We discuss how systems biology and toxicology continue to
leverage advances in computational modeling, informatics, large-scale computing, and biotechnology.
Next, we chart the genesis of systems toxicology from previous work in physiologically based models,
models of early development, and more recently, molecular systems biology. For readers interested in
further details this background provides useful linkages to the relevant literature. It also lays the foundations
for new ideas in systems toxicology that could translate laboratory measurements of molecular responses
from xenobiotic perturbations to adverse organ level effects in humans. By providing innovative solutions
across disciplinary boundaries and highlighting key scientific gaps, we believe this chapter provides useful
information about the current state, and valuable insight about future directions in systems toxicity.

Key words: Systems toxicology, Cellular systems biology, Biological network inference, Agent-based
modeling, Virtual tissues, Dose-response modeling, In vitro to in vivo extrapolation

1. Introduction

The grand challenge of toxicology is to predict the risk of chemical-
induced injury in humans. Though advances in biotechnology
have produced a vast amount of data on the molecular effects of
chemicals in living systems, it remains difficult to extrapolate this
information to adverse clinical outcomes. Here we will discuss
two main reasons why this is so difficult and how computational
methods from diverse fields of research could address these
problems. The first, and foremost, issue is that knowledge about
the cause-and-effect relationships between chemicals and their
toxicity is poorly defined. Indeed, it may even be impossible to
completely elucidate the detailed sequence of events from chemical
exposure to organ injury. The second important challenge is to
combine knowledge of these pathways with information on
human exposure in order to quantitatively estimate the risk


Fig. 1. The progression from chemical stimulus to adverse clinical outcome transcends multiple levels of biological
resolution. Computational systems-based approaches can help to unravel the mode of action or pathways to toxicity,
bringing information on genetic and molecular perturbations to the scale of whole organ and organism. Reproduced from
ref. 64 with permission from Taylor & Francis.

of specific health effects. We present a systems view of toxicology
that considers practical solutions for mapping functional linkages
across levels of biological organization, and analyzing their physio-
logical role in maintaining homeostasis. The success of systems
toxicology, we believe, depends on bridging the disparate levels of
biological organization. Figure 1 represents a global (albeit mod-
est) view of the progression from chemical exposure to clinical
outcomes. Each column portrays a specific level of biological reso-
lution. Chemical-induced molecular responses and perturbations
of signaling pathways lead to alterations in cellular phenotypes,
ultimately leading to adverse histologic outcomes. For this chapter
we describe some of the key computational methods required for
mapping chemical-induced changes in gene regulation and protein
signaling to adverse organ-level effects.
The scope of this discussion on systems toxicology is defined by
the practical needs of chemical toxicity testing. This can require the
evaluation of thousands of chemicals by exposure and hazard, mode
of action analysis for human relevance, and quantitative risk estima-
tion across doses, life stages, genders and populations. Due to many
limitations of animal testing, it has been recognized that efficient
and effective alternatives must be developed. This challenge is
17 Systems Toxicology from Genes to Organs 377

being addressed by efforts such as Tox21 and the ToxCast


program, where in vitro endpoints are being systematically evalu-
ated as surrogates for in vivo effects (1). Systems toxicology is
playing an important role in providing in silico adjuncts to the
twenty-first century toxicity testing paradigm including: computa-
tional elucidation of toxicity pathways, and computer simulation
frameworks to provide an in vivo context to in vitro observations.
The traditional view of toxicology, as a descriptive science
focused on the study of poisons and their effects, is gradually
shifting to include predictive elements. For instance, computational
and statistical models are routinely used to predict the toxic
potential of chemicals (e.g., by classification or regression based
on molecular structure). Systems toxicology expands this realm of
predictive models to include methods based on models of physio-
logical processes in an attempt to quantitatively relate human expo-
sure to adverse effects. These include simulation frameworks that
integrate physiologically based pharmacokinetic (PBPK) and phar-
macodynamic (PBPD) models.
In building physiological models of organs, there is a greater
emphasis on representing the relevant mechanisms across biological
scales including: molecular interaction networks, their organization
in cells, and further aggregation into tissues. Additionally, there is a
critical need in systems toxicology for knowledge-based tools that
can efficiently and meaningfully synthesize large amounts of infor-
mation from the literature and public databases in order to recon-
struct plausible toxicity pathways, which can be systematically and
efficiently evaluated using dynamic simulation.

1.1. Background

The foundations of systems-level thinking are not unique to toxi-
cology and date back almost 50 years (2, 3). The trends in techno-
logical advancements over the last few decades may provide
valuable lessons about the evolution of systems approaches in the
future. Although systems biology is generally considered a fruit
of the post-genomic revolution (4-6), complex systems in nature
have been studied for decades. Early cell-based models of morpho-
genesis (7), genetic regulatory networks (8), and metabolic reac-
tions (9, 10) predate genome sequencing (11), and some even predate
knowledge of the double-helical structure of DNA. These
systems models involve some of the same mathematical methods
in use today. However, due to the dearth of molecular data they
were more phenomenological and provided limited insight into
molecular mechanisms. Post-genomic systems biology was driven
by key technological advancements in molecular profiling, data
storage and computational capacity. Hence, it focused on recon-
structing large-scale molecular interaction networks, investigating
their static and dynamic behaviors and translating these molecular
models to higher order cellular events (i.e., changes in cell
state and fate).

Over the course of the last 50 years, the resolution and
magnitude of biological assays have grown from just a few measures
of organ morphology to the expression of tens of thousands of genes.
In systems biology this spurred the development of very detailed
molecular interaction maps (also called interactomes) to capture
the instructions encoded in the genome (12). Toxicology has used
these tools to gain insight into chemical-induced molecular
mechanisms, which have practical applications for screening envi-
ronmental chemicals (in vitro), and for screening clinical samples to
monitor potential off-target effects of drugs. The complex and
dynamic nature of physiological processes, however, can make it
difficult to relate early chemical-induced molecular perturba-
tions to clinically relevant effects (e.g., histopathology).

2. Systems Toxicology: Challenges

Bridging the molecular effects with adverse chemical-induced out-
comes is an important challenge for systems toxicology. While this
does require the ability to identify key molecular mechanisms, it is
equally important to recognize their role in a broader physiological
context. Systems biology research has produced valuable tools to
analyze molecular networks and we will discuss the additional cap-
abilities required for toxicology applications. First, it is important to
estimate the adverse effects of chemical exposure across different
doses and durations. Second, it is important to evaluate the contri-
bution of life-stages, genotypes and health states to individual out-
comes. Third, there are practical issues in using in vitro data to make
in vivo predictions, and to infer human effects from animal models.
This will require a new in silico and in vitro toolbox for uncovering
key linkages across biological organization from chemical exposure
to effects, and for estimating their dose dependence. Lastly, living
systems are homeostatic, which means that they have the capacity
to maintain their internal state through physiological programs that
operate at different levels of organization. A recurring theme in
systems toxicology research is to provide insight into the design
principles underlying the homeostatic behavior of living systems,
and factors that lead to disease progression.

2.1. Defining the System

Defining the spatial and temporal bounds of the physiological
systems relevant for analyzing toxicity is important for designing
experiments and computational models. Based on our knowledge
of pathways that lead to toxicity, the early steps are generally molec-
ular events while the downstream events are usually organ level
injury. The complex series of events between these two extremes
depend on a range of factors that can be very difficult to elucidate.
While chemicals interact with proteins on the order of seconds,
adverse histopathological outcomes are highly dose- and time-
dependent and manifest on a much larger timescale: weeks,
months, or years. Traditionally, organ systems toxicology consid-
ered the effects of chemicals on functional units comprised of
different tissues. Examples of functional units of organs include:
the hepatic lobule, the renal corpuscle, the intestinal crypt, the
pulmonary alveolus, the thyroid follicle, etc.
Though relatively small, these units of organs are highly hetero-
geneous structures comprised of different tissues with specialized
functions. In the mammalian liver, for instance, the vasculature from
the portal vein and hepatic artery branches until terminating in
hundreds of thousands of functional units called lobules (13). Each
hepatic lobule receives blood from up to six portal triads, each typically
consisting of a hepatic arteriole and a portal venule in addition to a bile
ductule (14). Blood flows through intervening spaces between the
cells, called sinusoids (15), and drains into the central vein. Figure 2
illustrates a sinusoid and provides context for the flow characteristics
from the portal triad to the central vein. Hepatocytes, considered the
primary workhorse of the liver, are arranged in plates one to two cells
thick, organized radially around the central vein. While hepatocytes
make up most of the volume of the lobule, there are at least four
additional cell types, spatially distributed across the lobule. The com-
plicated intercellular signaling which can occur between the different
cell types has been best classified in models of liver regeneration after
partial hepatectomy (16, 17); however, it remains unclear how the
regenerative processes factor into human disease (18).
Based on the heterogeneous cellular functions and phenotypes,
under normal conditions and following treatment with chemicals,
the lobule is divided into roughly three zones. While in vitro studies
typically average over the response of many hepatocytes within a
well, they do not capture important zonal variations of in vivo
hepatocyte functions (19, 20). Hence, there is a need for tools to
translate in vitro observations to in vivo physiological outcomes.

2.2. Computational Systems for Toxicology

Since toxicology investigates chemical-induced processes that span
multiple scales, it requires tools for identifying key components of
the biological machinery and their functional linkages. Piecing
together evidence for developing a physiological model is resource
intensive and requires considerable domain expertise. We describe a
number of innovative tools that can alleviate some of the burden of
processing large amounts of information, and assist on reconstruct-
ing physiological pathways to toxicity. Pathways are valuable for
explaining mechanistic relationships between events; however, a
different set of computational methods is required for analyzing
the dose and time-dependent response of homeostatic systems to
chemicals. Computational tools for system reconstruction and
dynamic simulation depend on two technical dimensions.
The first dimension is concerned with assay resolution: the
measurement of biological structures and functions at different
biological scales. The second dimension represents computing
capacity: the technology for storing, managing, and analyzing
information from such experiments. There have been exponential
advancements in both dimensions to date, and this trend is
expected to continue.

Fig. 2. The liver lobule is heterogeneous with respect to blood flow, blood composition,
and cellular phenotypes. Zonated cellular function both maintains and emerges from this
complicated mix of gradients. Reproduced from ref. 65 with permission from John Wiley
and Sons.

2.3. Probing Systems: Multiresolution Data

Models are evaluated with regard to their ability to provide insight
while reproducing the phenomena being modeled. For this reason,
cell- and tissue-scale microscopy (long the tools of pathology) are
needed both for development and validation of models to simulate
in vivo context for the results of in vitro toxicity assays. Histopa-
thology images have long been used to obtain information on
microanatomic regions, vasculature, individual cells, cell types,
and cell phenotypes from two- and three-dimensional images.
Though traditionally time-intensive, advances in automated extrac-
tion of information from histopathology images are making it
possible to analyze these images at single-cell resolution
(21-23). Additionally, it is possible to extract information about
the functional state of cells using cytomorphologic features or
molecular markers (24). In vivo tissues are three-dimensional;
therefore, the ability to quantify cellular phenotypes as well as tissue
organization (e.g., ref. 13) is crucial for simulating in vivo context
for in vitro data.
In toxicology, the amount of chemical that is distributed to a
tissue or part of a tissue, i.e., tissue dosimetry, is often estimated
using physiologically based pharmacokinetic (PBPK) models. A
PBPK model consists of a system of ordinary differential equations
for the concentration of a compound (or compounds) in different
tissues. Typically some key tissues are treated as separate compart-
ments for which a homogeneous, well-mixed tissue-specific con-
centration is calculated, while other tissues are modeled in
aggregate as a single compartment (e.g., rapidly perfused tissues).
More complicated dynamics within a tissue, such as diffusion or
membrane transport, are often modeled with additional subcom-
partments with each subcompartment being well mixed. PBPK
models relate the concentration of compounds inhaled or ingested
from the environment to internal tissue doses (25-27).
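
To make the structure of such models concrete, the following is a minimal sketch (not any published model) of a flow-limited PBPK system with a liver compartment and a lumped "rest of body" compartment, written with the CRAN package deSolve; all parameter values and the dosing scheme are illustrative:

library(deSolve)

pbpk <- function(t, y, p) {
  with(as.list(c(y, p)), {
    Cliv  <- Aliv / Vliv     # liver concentration (amount/volume)
    Cbody <- Abody / Vbody   # lumped rest-of-body concentration
    ## mixed venous blood returned to the tissues (lung omitted for brevity)
    Cart <- (Qliv * Cliv / Pliv + Qbody * Cbody / Pbody) / (Qliv + Qbody)
    dAliv  <- Qliv  * (Cart - Cliv / Pliv) - CLint * Cliv / Pliv  # flow + metabolism
    dAbody <- Qbody * (Cart - Cbody / Pbody)
    list(c(dAliv, dAbody), Cliver = Cliv)
  })
}

p  <- c(Vliv = 1.8, Vbody = 70, Qliv = 90, Qbody = 300,
        Pliv = 2, Pbody = 1, CLint = 20)   # illustrative values only
y0 <- c(Aliv = 100, Abody = 0)             # 100 mg delivered to the liver
out <- ode(y = y0, times = seq(0, 24, by = 0.1), func = pbpk, parms = p)

Here Pliv and Pbody are tissue:blood partition coefficients and CLint is a hepatic clearance term; elimination occurs only in the liver compartment, so total mass is otherwise conserved.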

3. Methods for Dynamic Systems Modeling
As we progress through the chapter, we start with some higher-level
modeling approaches such as PBPK models and cell-based models
of morphogenesis. Then we discuss some of the mechanism-driven
modeling approaches, which have become more feasible with the
advent of new assaying technologies developed from the latter half of
the twentieth century into the present. These techniques employ a
better understanding of the underlying processes of cell state and
function, with molecular-level detail, for tracing the effects of xeno-
biotic perturbations and the potential for tissue-level outcomes.
Finally, we show how modeling approaches can be combined to
explore toxicity, for instance by coupling a PBPK model with an
agent-based liver simulation.
382 J. Jack et al.

3.1. Compartmental Models

For pharmacokinetic (PK) models, the available data are typically
chemical concentrations in tissues over time. If only serum data are
available, then often only an empirical (e.g., one or two compart-
ment) model can be built. If additional tissues are available, a PBPK
model may be built where the actual tissue volumes and blood flows
are used along with the tissue concentrations to determine how
compounds partition into specific tissues. Tissue partitioning
reflects that the concentration of a chemical within the tissue may
be much higher or lower than the concentration of chemical in the
blood, perhaps due to sequestering within cells or tissue-specific
binding. Given heterogeneous tissue, chemical distribution within
a tissue is expected to be heterogeneous. Thus, in order to deter-
mine the average concentration within a tissue using a minimal
number of samples, the entire tissue (e.g., the many deposits of
fat throughout the body or the many lobes of the liver) is typically
homogenized before analysis. In other words, pharmacokinetic
tissue samples are homogenized in recognition of heterogeneous
tissue distribution and not because the tissues are thought to be
homogenous. While this assumption facilitates accounting for the
total amount of compound within the body, it obscures precise
estimation of the localization of the compound within the tissue.
For this reason tissues within PBPK models are typically described
by a few well-mixed (i.e., homogenous) compartments. Given the
available data, PK models can do a good job of describing the
absorption, distribution, metabolism, and excretion (ADME) of a
compound, but without additional data further predictive capacity
on biological effects is unattainable.
Biologically based doseresponse (BBDR) models are often
used in conjunction with an empirical PK or PBPK model to predict
a biological, and often toxicological, response to a compound
within a tissue as a function of chemical exposure. The BBDR
models are only able to make additional predictions because they
have been built using additional data, either chemical-specific
experiments or prior knowledge about a biological process. If
chemical-specific information is available, an empirical model may
be calibrated to reproduce the results seen in experiments. Without
this information, a greater mechanistic understanding of the
homeostatic function of the biological processes that are respond-
ing to chemical concentration is needed in order to predict
dose-response.
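
As a minimal sketch of the kind of empirical dose-response relation a BBDR model might be calibrated against, the following uses a Hill function; the parameters (maximal effect, half-maximal dose, and steepness) are purely illustrative:

hill <- function(dose, emax = 1, ed50 = 10, n = 2) {
  emax * dose^n / (ed50^n + dose^n)   # fraction of the maximal response
}
curve(hill(x), from = 0, to = 100,
      xlab = "Internal dose", ylab = "Response")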
The data needed to build a mechanistic understanding of
homeostasis for an arbitrary biological process sufficient to predict
the effects of a chemical perturbation on that process may surpass
all relevant biological knowledge. For this reason, systems-based
models in which simpler rules at a lower level produce the complex
phenomena (such as homeostasis) are attractive. If these simpler
rules can be validated, then predictions might be possible. Cell-
based models may offer deeper insight into mechanism.

3.2. Cell-Based Models

A diverse collection of models is described in the literature on
simulating dynamic cellular behavior. Cell-based models, recapitu-
lating temporal changes in cellular phenotypes, have been used to
investigate emergent phenomena: complex patterns/properties of
a system which cannot be derived simply from intuition and mech-
anism, but are discovered through simulation. For example, in the
last 50 years, impressive strides have been made in cell-based mod-
els of morphogenesis. These models were described mathematically
as Cellular Automata (CA) or Partial Differential Equations (PDE).
Indeed, the seminal work in the field of morphogenesis modeling
was encoded as a system of PDE (7).
There are important factors to consider when selecting a
cell-based modeling approach. While one could argue that PDE
have greater precision, they also require detailed information on
kinetics which is often unavailable. A recurring theme in systems
toxicology (and, more generally, systems biology) is the lack of
sufficient data to fit mechanistic models based on partial or ordinary
differential equations. Additionally, PDE are computationally more
intensive than CA. While both techniques offer their own unique
contributions to the field, we will discuss the advent and develop-
ment of models involving CA.
The early pioneering efforts in CA include the following: John
von Neumann's self-replicating robot (28), Norbert Wiener and
Arturo Rosenblueth's work on conduction in cardiac systems (29),
John Conway's Game of Life; and the work of Stephen Wolfram
(30). The beauty in these models lies in the emergence of complex
patterns from sets of simple rules. Indeed, a wide range of phenom-
ena have been modeled using CA. The models we highlight illus-
trate the use of CA in simulating developmental cues relevant to
systems toxicology and virtual tissues applications, since they
encompass mechanistic information at the molecular level which
may be useful in understanding chemical perturbations.
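
To give a flavor of how complex dynamics emerge from simple local rules, the following is a minimal R sketch of one synchronous update of Conway's Game of Life on a toroidal grid; it is a toy CA for illustration, not one of the biological models cited above:

life.step <- function(g) {
  n <- nrow(g); m <- ncol(g)
  ## count live neighbors by summing eight toroidally shifted copies of the grid
  shift <- function(dr, dc) g[(seq_len(n) + dr - 1) %% n + 1,
                              (seq_len(m) + dc - 1) %% m + 1]
  nb <- Reduce(`+`, lapply(list(c(-1, -1), c(-1, 0), c(-1, 1), c(0, -1),
                                c(0, 1), c(1, -1), c(1, 0), c(1, 1)),
                           function(d) shift(d[1], d[2])))
  (nb == 3) | (g == 1 & nb == 2)   # birth with 3 neighbors; survival with 2 or 3
}

g <- matrix(0, 9, 9); g[5, 4:6] <- 1   # a "blinker" oscillator
g <- life.step(g) * 1                  # after one step, the blinker is vertical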
In ref. 31, a CA model of cell differentiation has been
described. Intracellular signaling is encoded as a genetic regulatory
network with binary activity of gene states, which was based on an
earlier CA model for single gene networks (32). Intracellular
dynamics, cell division, and cell-cell interactions are juxtaposed in
the model to show main features of biological morphogenesis. An
important feature of this model is the use of temporal evolution of
genetic networks, which highlights the mechanism underlying the
higher order cell processes. Another model (33) incorporates a
genetic switch for differentiation of tissue into substrate-depleting
vessels as well as the lateral inhibition of an autocatalytic morpho-
gen. The model is able to recapitulate interesting features of mor-
phogenesis including: dichotomous and lateral branching, blind
vessel ends, and closed loops from anastomosis. The model is
general enough to describe a variety of phenomena including leaf
veins, insect trachea, and neovascularization.
384 J. Jack et al.

Combining the approaches of CA and PDE has also been met


with success. In ref. 34, a hybrid approach involving a three-
dimensional CA/PDE model was presented. Using simple local
interactions to describe three processes (i.e., production of and
chemotaxis to cAMP, and cellular adhesion), the model showed
basic morphogenesis. The model involves three characteristics,
i.e., stream formation, cell sorting, and slug migration, and recapi-
tulates the spatial self-organization of amoebae into complex
behavior. The work is an extension of the CA model described in
ref. 35.
The recapitulation of key morphogenetic processes in cell-
based models and the complex phenomena which can arise from
their simulation may provide a potential platform for simulating the
effects of chemicals on tissues. There are several challenges for
using these models to test molecular responses to chemical pertur-
bations. First, determining the sufficient level of cell resolution: cell
models should be complicated enough to incorporate chemical-
induced molecular responses in the context of molecular networks,
but remain a tractable computational problem when scaled to large
populations of cells (for instance, in a virtual tissue). Second, it is
increasingly apparent that cell signaling pathways should not be
viewed as isolated systems: crosstalk between signaling pathways is
important. Probing deeper into the biochemistry of the cell and its
phenotypic changes, molecular systems models may provide insight
into the effects of chemical perturbation.

3.3. Molecular Systems Models

Molecular systems have been generally abstracted as graphical net-
works in which the nodes represent molecular entities (e.g., genes
or proteins) and edges are interactions. A molecular interaction
network represents a hypothesis about the causal structure of the
underlying biological system, which is derived from available evi-
dence. Different computational methods can be used to simulate
the dynamic behavior of networks in response to perturbations,
such as chemical-induced receptor activity. These methods can be
classified along a number of dimensions. For instance, the values of
interacting nodes in the network can be represented as discrete or
continuous variables. The procedures for updating the state of the
nodes over time can be deterministic or nondeterministic (or sto-
chastic). We will describe two distinct modeling techniques which
illustrate these contrary modeling assumptions: metabolic pathways
and genetic regulatory networks.
A metabolic pathway can be described as a network in which
individual metabolites are the nodes and enzymatic reactions are
the edges. The relationship between the continuous concentrations
of metabolites can be expressed using the law of mass action as a
system of deterministic ODE. In many cases ODE must be solved
numerically, but under certain equilibrium assumptions, the
Michaelis-Menten (10) closed-form solution (for a single
enzyme-catalyzed transformation) can be used to calculate fluxes
very easily. Systems of ODE have been extensively used in modeling
the biochemistry of cells. Moreover, given sufficient data, an exact
solution to the chemical master equation has been described
(36-38). However, ODE models often require the estimation of a
large number of free parameters. Additionally, it is not straightfor-
ward how to investigate chemical perturbations to a system of ODE
given these large numbers of free parameters.
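
As a concrete instance of the closed-form solution mentioned above, the Michaelis-Menten rate law for a single enzyme-catalyzed transformation can be evaluated directly; the Vmax and Km values below are illustrative:

mm.flux <- function(S, Vmax = 10, Km = 2) {
  Vmax * S / (Km + S)   # flux through a single enzyme-catalyzed reaction
}
curve(mm.flux(x), from = 0, to = 20,
      xlab = "Substrate concentration [S]", ylab = "Flux v")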
Unlike concentrations of metabolites that vary continuously in
biological specimens, genes are often described as exhibiting switch-
like behavior: active/inactive, upregulated/downregulated, or
ON/OFF. Such a discrete representation of gene activity is used
in Boolean Networks (BN) to describe complex genetic regulatory
processes (8). A BN relates the expression of a single gene to its regu-
lators using deterministic Boolean functions (AND, OR, NOT). In
order to account for variability in biological observations, the
deterministic update scheme can be augmented with a probabilistic
approach (39, 40). BN molecular systems models are a promising
approach for quantitative estimation of cellular responses to molec-
ular cues for several reasons. First, BN models typically involve fewer
free parameters than ODE models. Second, the BN formalism
allows fast, efficient modeling of large populations of networks
(or cells). Third, the behavior of some signaling molecules has
been witnessed as a binary phenomenon (41).
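
The following is a minimal sketch of a synchronous BN update in base R (a toy three-node network with arbitrary AND/OR/NOT rules, not the models of the references cited below): all node states are updated in lockstep by deterministic Boolean functions.

update.rules <- list(
  A = function(s) s[["C"]],              # A is activated by C
  B = function(s) s[["A"]] && !s[["C"]], # B requires A and the absence of C
  C = function(s) s[["A"]] || s[["B"]]   # C is activated by A or B
)

state <- c(A = TRUE, B = FALSE, C = FALSE)
for (t in 1:6) {
  ## synchronous update: every node's rule is evaluated on the old state
  state <- vapply(update.rules, function(f) f(state), logical(1))
  cat(sprintf("t=%d: %s\n", t,
              paste(names(state), as.integer(state), sep = "=", collapse = " ")))
}

Iterating such updates from an initial state drives the network to a fixed point or a limit cycle, and ensembles of such networks can be run cheaply, which underlies the population-level simulations discussed next.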
Saez-Rodriguez et al. (42) provide an elegant method for
building a BN model (43) out of a molecular interaction network
and calibrating the model to experimental data. Training the model
to a large dataset for HepG2 cells, they were able to identify new
interactions which were confirmed via literature review. A different
model by Jack et al. (44) investigated growth factor and cytokine
signaling in hepatocyte populations using ensembles of BN models.
Simulation results from these BN ensembles were aggregated and
calibrated to experimental concentration-dependent response data.
The ability to predict doseresponse behavior with BN, and to
estimate changes in cellular phenotypes from simulated molecular
interaction networks may prove useful to understanding the
response of cells to chemicals.
The knowledge of mechanisms for building molecular systems
models evolved rapidly in the wake of whole genome sequencing
and large-scale molecular profiling technologies. In the prege-
nomic era, the complexity of models was limited by mechanistic
insight (e.g., on protein signaling and gene regulation), and
computational capacity. Large-scale genomic sequencing and gene
expression profiling techniques brought about a shift in the devel-
opment of molecular mechanistic models. Using measurements on
thousands of genes, for instance, it became possible to use the data
to reverse engineer putative molecular interaction networks. This
made it possible to investigate the relationship between network
topology and dynamic behavior. Two of the challenges in using
large-scale molecular profiles in dynamic modeling are that they measure
the aggregate molecular state across tissue samples containing
millions of diverse cells, and that it may not be cost-effective to generate
such data with a high temporal resolution.
Several standards have been developed over the last decade to
encode molecular systems, which simplifies sharing and reuse across
the community. Among these, the Systems Biology Markup Lan-
guage (SBML) (45, 46) and Cell Markup Language (CellML) (47,
48) are gaining acceptance, and tools, such as the Systems Biology
Workbench (SBW) (49) and Copasi (50) can import such models
to run dynamic simulations. Besides BN and ODE, additional
mathematical formalisms have also been used to describe
molecular systems models: e.g., P systems (51-54) and pi-calculus
(55, 56).

3.4. Organ Modeling

By definition physiology varies from organ to organ, but in some
respects the approach to modeling any organ is standardized.
While the heart is perhaps the most thoroughly modeled (57, 58)
the liver draws more attention with respect to toxicology (59-64).
The liver is often the site of initial exposure to hazardous com-
pounds and their metabolites due to first-pass metabolism of blood
from the gastrointestinal tract via the hepatic portal vein. Although
mechanisms of chronic chemical-induced injury are not completely
understood, it is believed to involve multiscale molecular and cel-
lular interactions that culminate in tissue damage.
In a review of liver tissue simulation approaches, Ierapetritou
et al. (65) identified approaches ranging from systems of ODE
describing spatially homogeneous tissues (e.g., ref. 66) to high
dimensional models including fluid dynamics approaches based
upon approximations of the Navier-Stokes partial differential equa-
tions (67). The more complicated the approach, the greater the
data and computational needs, especially given the convoluted and
dynamic cellular boundary of the sinusoidal spaces.

3.5. Morphology Generation

Naturally occurring structures (e.g., a tree, a shell, or a lung) are
often characterized by complex forms that can be understood
as a combination of stochasticity and fixed rules for generating
fractal structures. Fractals are self-similar: spatial patterns repeat-
ing themselves on different scales. Although complicated in struc-
ture, the underlying rules are often quite simple. For instance, by
considering the need for cells to be sustained by the transport of
materials through networks that branch to all parts of a tissue (and
organism), West et al. (68) argued that, to leading order, the con-
tribution of blood flow from the aorta and arteries must scale as the
3/4 power of body weight, the empirical scaling value often used in
physiologic models (69).
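
The practical upshot of this scaling argument is routinely used when extrapolating physiological flows across body weights; a minimal sketch follows, with purely illustrative reference values (not taken from refs. 68 or 69):

## scale a physiological flow across body weights with a 3/4-power law
scale.flow <- function(bw.kg, q.ref = 5.0, bw.ref = 70) {
  q.ref * (bw.kg / bw.ref)^0.75   # e.g., cardiac output in L/min
}
scale.flow(c(0.25, 70))   # ~0.07 L/min for a 250-g rat vs. 5 L/min for a human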

By allowing the description of some irregular objects, fractals
allow the quantification of pathological structures such as tumors,
as well as insight into the underlying biology such as tumor growth
and angiogenesis (70). Euclidean geometry (typically, integer
dimensions, e.g., a point, a square, or a cube) can be confused
by fractal (literally, fractional-dimensional) objects to the extent that
repeated measures of similar objects, e.g., tumors, can produce
inconsistent results (71). Analysis using fractal geometry allows
for reproducible quantification of complicated structures (70, 71).
In addition to the quantification of the space-filling properties of
tumors and blood vessels, fractal analysis has been used to detect the
coding regions in DNA and in models of epithelial cell growth,
blood vessel growth, periodontal disease, and viral infections (72).

3.6. Multiscale Simulation

The challenges of multiscale simulation are easily seen in terms of
the liver: a mixed-media (e.g., air, food, water) whole-body expo-
sure to a compound leads to a concentration in the blood (possibly
via the gut) that flows into the liver where it filters past hepatocytes,
potentially causing toxicity (including rearrangement of the vascular
flows) or induction of enzymes and transporters within the hepa-
tocytes, which in turn changes the rate that the chemical is
eliminated from the blood which in turn changes the amount
of whole body exposure that would be needed to achieve a
similar effect the next time. The different scales interact and
changes on one scale may manifest themselves as hysteresis on a
second scale; that is, memory of the earlier exposure may impact
the outcome of future exposures.
Many numerical models have been developed to assess the
impact of zonation on the metabolism of a compound. One of
the earliest models investigated differential distributions of
enzymes both competing for the same compound (73, 74).
PBPK model subcompartments have been used in order to
model inhomogeneous induction of metabolizing enzymes after
chemical exposure: e.g., CYP1A1/A2 by dioxin (75). That
model, which is illustrated in Fig. 3, describes the uneven P450
distribution; there were no cells, but the concentrations of com-
pounds in different, discrete zones of a lobule were modeled con-
tinuously and could therefore be coupled to a PBPK model. Other
models have been developed with similarly subcompartmentalized
livers coupled to PBPK structures for modeling zonal heterogene-
ity due not only to enzymes but also to transporters (76, 77).
Since there are many interacting in-flows and out-flows, solv-
ing for blood flow through the sinusoids is nontrivial (67). This is
particularly true if one wishes to allow for changes to the geome-
try, whether due to regeneration, tumor growth, or chemical
insult. Analytic mathematical descriptions obtain solutions via
regular, hexagonal approximations to the lobule structure in
order to obtain chemical disposition and effects (59, 75).

Fig. 3. The typically homogenous compartments of a PBPK model were subdivided for the liver into five zones differing in
proximity from the central vein/portal triad. This allowed for zonal differences in induction of P450 enzymes as a function of
concentration in each subcompartment. Reproduced from ref. 75 with permission from Elsevier.

For higher dimensional models with more complicated structures
where geometry is allowed to be asymmetric, exact solutions are
generally not possible (67). Fully three-dimensional fluid dynam-
ics approaches based upon approximations of the Navier-Stokes
partial differential equations have yielded solutions for carefully
developed finite-element meshes describing a given lobule (67)
and the hepatic vasculature specific to an individual (78). These
approaches have revealed the importance of pressureparticularly
with respect to the low-pressure portal venules and high pressure
hepatic arterioles (67). Such approaches are data- and computa-
tionally intensive and the need to consider ensembles of many
lobules, albeit in simpler terms (e.g., hemodynamics) has led to
the development of so-called microdosimetry PBPK models in
which Ohms law-like relations relate the pressure and resistance
of a particular sinusoidal segment to determine a mean flow
throughout that segment (63).
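
A minimal sketch of such an Ohm's-law-like relation follows; the pressures and resistances are arbitrary values for illustration, not those of ref. 63. The mean flow through a segment follows from the pressure drop across it and its resistance, and resistances of segments in series simply add:

segment.flow <- function(dP, R) dP / R   # flow = pressure drop / resistance

## three sinusoidal segments in series between a portal venule and the
## central vein (arbitrary units)
dP <- 8 - 2
R  <- c(1.2, 0.8, 1.5)
segment.flow(dP, sum(R))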
As a more cell-centered approach, Ohno et al. (61) coupled
independent realizations of a model for cellular dynamics into a
linear array to allow some instances of the model to be close to the
source of nutrients and foreign compounds while others were
further removed. These approaches describe the flow through
the vasculature with a continuum (i.e., fluid dynamic) model while
maintaining a discrete, cell-based model for the hepatocytes. This
hybrid alternative makes cell-based models suitable for establishing
dose-response relationships (61, 63).

Stepping away from chemical dosimetry, Hoehme et al. (60)
have developed a cellular model of the hepatic lobule that considers
the biomechanical forces between hepatocytes to allow dividing cells
to push into necrotic areas of a lobule, thus simulating recovery
following acute chemical toxicity (e.g., centrilobular necrosis
induced by CCl4).

4. Methods for System Reconstruction

Dynamic systems models in biology are usually constructed based
on domain knowledge of the underlying mechanisms. The advent
of large-scale -omic profiling has produced an unprecedented
amount of data on the molecular state of living systems under
normal conditions, following chemical treatment, and for different
phenotypic outcomes. A variety of computational methods have
been developed to generate hypotheses about the putative molecu-
lar mechanisms of chemical-induced injury. In toxicology, useful
methods for system reconstruction should generate hypotheses
about the pathways relating molecular perturbations (e.g., from
-omic assays) to phenotypic effects (e.g., histological lesions). This
is relevant for understanding the mode of action (MOA) for toxic-
ity, and for quantitatively evaluating the role of key events in
dynamic systems models.
A number of bottom-up approaches have been used to recon-
struct maps of the metabolic, signaling and genetic regulatory
relationships in complex molecular interaction networks (referred
to as the interactome) in eukaryotes (79). Such methods have
also been applied to identify functionally relevant subnetworks
(called modules) from molecular profiles (80), and to link genes
to human diseases (81). It has been difficult to systematically
reverse-engineer the molecular network modules in cells, which
can explain their functional responses, using large-scale molecular
assays and publicly available data. In toxicology (and other areas of
disease research), this is motivating a top-down approach in which
hypothesis-driven reasoning is used to relate phenotypic responses,
such as cellular outcomes and histologic effects, to the underlying
molecular pathways. Knowledge-based systems could offer a prac-
tical solution that encodes information on tissue-level effects, cel-
lular phenotypes, and molecular mechanisms, in order to support
hypothesis generation by top-down and bottom-up inference. The
mode of action (MOA) framework (82), for example, outlines the
main issues and describes an approach for synthesizing knowledge
by extensive evaluation of prior evidence about the events involved
in chemical-induced toxicity. This is a resource-intensive process
that results in a theory (hypothesis) about causal relationships
between molecular, cellular, and tissue-level effects.

Knowledge-based systems (83) can synthesize large amounts of
disparate experimental evidence in order to generate hypotheses as
a putative reconstruction of events leading to toxicity. Unlike
human-readable knowledge bases, which primarily contain docu-
ments in free text, machine-readable knowledge bases use a physio-
logically relevant abstraction (called an ontology) to encode
domain concepts that are human readable, and also amenable to
computational analysis. An ontology formally describes the differ-
ent types of entities and their relationships (84) in a specific
domain. Knowledge representation schemes have been used to
build semantic models of biological pathways (85) and a number
of public resources, such as the Open Biomedical Ontologies
(OBO) effort, are available for application to toxicology.
Some of the many advantages of KB systems are: transparency
and flexibility, which are useful for describing experimental evi-
dence and mechanistic hypotheses, and for sharing this information
with end-users in a clear and consistent manner. Such approaches
appear to be useful for studying complex systems behavior (86-88).
The main challenges in KB modeling in toxicology are: standardi-
zation of biological concepts so that they can be manipulated
computationally, and curation of these concepts from the biomedi-
cal literature. While it has been very difficult to achieve consensus
on disparate evidence, inferences, and hypotheses about biological
processes, a growing number of concepts are being formalized in
biology (89), and powerful text-mining approaches can extract
these concepts from articles (90).

5. Agent-Based Models of Tissues

Cell-oriented agent-based modeling (ABM) of tissues offers a
number of unique advantages (91, 92). First, since cells are the
functional units of tissues, the ABM has more physiologic relevance
than a continuum model. Second, the agent responses can be
calibrated and verified through comparison with actual cells
in vitro (or ex vivo). Third, spatial outcomes from the ABM can
be translated to histopathologic effects such as acute lesions and
tumor formation. While the agent-based strategy is suitable for
modeling tissue responses, the approaches to the liver taken so far
have not provided a framework for estimating tissue dosimetry.
One of the first cell-based approaches to understanding
liver-specific injury was made by Hunt et al. (59), where individual
hepatocytes were represented by independent agents wherein
metabolism can occur. In this and other models the cells are under-
stood in terms of energy: they generally act to minimize energy
expenditure while stochastically selecting some less favorable
energy options with a probability that decreases the more energy
is required (59, 92). In energy-based models the stochastic element
is necessary to prevent a cell from becoming trapped in a state that is
a local energy minimum when other, lower-energy states are available.
The environment of the agents in the Hunt et al. (59) model was
determined using a hybrid graph and grid approach in which com-
pounds are represented by objects moving through the lobule.
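
A minimal sketch of the stochastic selection rule just described follows: a Metropolis-like acceptance criterion with an illustrative "temperature" parameter, offered as a generic instance of this idea rather than the exact rule of refs. 59 or 92:

accept.move <- function(dE, temperature = 1) {
  ## favorable moves (dE <= 0) are always accepted; unfavorable moves are
  ## accepted with a probability that decays with the extra energy required
  dE <= 0 || runif(1) < exp(-dE / temperature)
}

mean(replicate(1000, accept.move(2)))   # roughly exp(-2), i.e., ~0.14

The stochastic term is what lets an agent escape local energy minima, as noted above.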

5.1. EPA Virtual Liver (v-Liver)

To provide simulated in vivo context for a suite of high-throughput
assays conducted on thousands of chemicals (i.e., the ToxCast (1)
and Tox21 projects), the US Environmental Protection Agency has
been developing a model to relate whole-body chemical exposures
to cell-scale concentrations within the hepatic lobule, and then
simulate the response of individual hepatocytes to chemical pertur-
bations. The goal of this Virtual Liver projects is to provide a
framework for simulating a canonical lobule for extended periods of
time ranging from hours to months.

5.2. ABM Construction
If we understand the hepatocyte as an integrator, capable of exhibiting spatially heterogeneous responses to zonal gradients in endogenous and exogenous chemical inputs, then cell-focused models become an obvious choice (64). Cell-oriented models [including agent-based models (ABM)] of tissues offer a number of advantages (92). Since cells are the functional units of tissues, these models would appear to have more physiologic relevance than a continuum model (63). Cell responses can then be calibrated and verified through comparison with actual cells in vitro (or ex vivo) (Hoehme et al. 2010). Finally, spatial outcomes (i.e., zonal specificity) can be translated to histopathologic effects such as acute lesions and tumor formation (62).

5.3. Cellular and Microvascular Architecture
In the hybrid model of Ohno et al. (61) cells are represented with separate, but interacting, ODE models arrayed in a sequence from periportal to centrilobular. This hybrid approach (cells as agents and blood as a continuum) was expanded by Wambaugh and Shah (63) to allow the impact of sinusoidal geometry to be examined, as in Hunt et al. (59). Figure 4 illustrates such a hybrid model (63): flow is simulated through sinusoidal segments in a given geometry, allowing the response of individual hepatocytes (agents) to depend on the microenvironment (chemical or nutrient concentrations) around each cell.
In order to quantitatively describe the gradient of any compound within the sinusoids, several quantities must be known: the amount flowing into the liver (requires a PBPK model, preferably calibrated from time-course data, but potentially predicted from partitioning experiments or pure chemical descriptors (93)); the fraction of the compound not bound to protein (requires an in vitro experiment (94)); the rate of hepatic metabolism per enzyme (requires additional in vitro experiments); and the distribution/induction of the metabolizing enzymes. Without in vivo PK data, conservative assumptions can be made to use in vitro data to discriminate between the behaviors of compounds. Metabolism rates predicted from in vitro measures appear to be on the order of actual in vivo rates (95).
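As a minimal sketch of this bookkeeping, assume a single sinusoid discretized into well-mixed subcompartments in series (in the spirit of the subcompartment treatment shown later in Fig. 4), with linear clearance limited by the unbound fraction. At steady state each segment satisfies Q·C[i-1] = Q·C[i] + fu·CLint·C[i], so C[i] = C[i-1]·Q/(Q + fu·CLint). All parameter values below are hypothetical:

```java
// Minimal sketch: steady-state concentration gradient along one sinusoid
// modeled as N well-mixed subcompartments in series. Each segment obeys
// Q*C[i-1] = Q*C[i] + fu*CLint*C[i]. All parameter values are hypothetical.
public class SinusoidGradient {
    public static void main(String[] args) {
        int nSegments = 10;          // periportal (0) to centrilobular (9)
        double q = 1.0;              // sinusoidal blood flow (volume/time)
        double fu = 0.3;             // fraction unbound in blood
        double clIntPerCell = 0.5;   // intrinsic clearance per hepatocyte
        double cIn = 10.0;           // inflow concentration, e.g., from a PBPK model

        double c = cIn;
        for (int i = 0; i < nSegments; i++) {
            c = c * q / (q + fu * clIntPerCell); // per-segment mass balance
            System.out.printf("segment %d: C = %.3f%n", i, c);
        }
        // The monotone drop from periportal to centrilobular segments is the
        // zonal exposure gradient each hepatocyte agent would respond to.
    }
}
```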
Using their abstract, albeit philosophically fundamental, "hepatocytes as energy minimizers" approach, Sheikh-Bahaei et al. (62) were able to use a two-substrate model for enzyme expression to produce zonal heterogeneity. By mapping the number of simulated concentration objects onto experimental doses, and the average expression of enzyme (e.g., predicted Cyp1A2 expression in Fig. 5) onto the actual mRNA expression, a degree of agreement was achieved between in silico predictions and experimental results (96): one of the first linkages of a whole-tissue liver simulation to experimental chemical toxicity data.

Fig. 4. Flow through a spatially heterogeneous vascular network was simulated using
mass-balanced hemodynamics equations in order to allow the discrete agents (hepato-
cytes) to be coupled into a continuous PBPK model. Inset: The spatial complexity was
reduced by treating small segments of sinusoidal space as well-mixed subcompartments,
similar to the Andersen et al. (75) approach but with a complex, graph-based structure.
Reproduced from ref. 63 with permission from the Public Library of Science (PLoS).

Fig. 5. Chemically induced expression of P450 metabolizing enzymes was compared between hepatocytes harvested from rats (a) and predicted using an agent-based model (b). Perivenous (solid) and periportal (open) hepatocytes differed in expression levels (i.e., displayed zonation). Three different concentrations (low: squares; medium: triangles; high: circles) were used. Both axes were manually converted from the semiarbitrary units of the ABM to the lab units. Reproduced from ref. 62 with permission from Elsevier.

6. Conclusion

Effectively using this information to evaluate the potential for


adverse health outcomes continues to be difficult because of two
main issues. First, the relevant part of the system has not been
elucidated. Second, the dynamic concentration-dependent behav-
ior of relevant system components and their role in maintaining
homeostasis is poorly defined. Systems toxicology attempts to address these challenges through integrative and cross-disciplinary research that enables toxicity testing that is less dependent on animals, more reliant on in vitro and in silico models (97), and more relevant to humans.

Disclaimer

The United States Environmental Protection Agency, through its Office of Research and Development, funded and managed the research described in this paper. It has been subjected to Agency review and approved for publication. Reference to commercial products or services does not constitute endorsement.

References

1. Dix DJ, Houck KA, Martin MT et al (2007) The toxcast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
2. Bertalanffy L (1957) Life, language, law: essays in honor of Arthur F Bentley. Antioch, Yellow Springs, OH
3. Bertalanffy L (1968) General systems theory: foundations, development, applications. George Braziller, New York, NY
4. Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2:343–372
5. Kitano H (2002) Systems biology: a brief overview. Science 295:1662–1664
6. Kitano H (2002) Computational systems biology. Nature 420:206–210
7. Turing AM (1952) The chemical basis of morphogenesis. Phil Trans Royal Soc Lond 237:37–72
8. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22:437–467
9. Hill (1910) Proceedings of the physiological society: Jan 22 1910. J Physiol 40:i–vii
10. Michaelis L, Menten M (1913) Die kinetik der invertinwirkung. Biochem Z 49:333–369
11. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351
12. Costanzo M, Baryshnikova A, Bellay J et al (2010) The genetic landscape of a cell. Science 327:425–431
13. Teutsch HF, Schuerfeld D, Groezinger E (1999) Three-dimensional reconstruction of parenchymal units in the liver of the rat. Hepatology 29:494–505
14. Crawford AR, Lin X-Z, Crawford JM (1998) The normal adult human liver biopsy: a quantitative reference standard. Hepatology 28:323–331
15. Motta P, Porter KR (1974) Structure of rat liver sinusoids and associated tissue spaces as revealed by scanning electron microscopy. Cell Tissue Res 148:111–125
16. Taub R (2004) Liver regeneration: from myth to mechanism. Nat Rev Mol Cell Biol 5:836–847
17. Fausto N, Campbell JS, Riehle KJ (2006) Liver regeneration. Hepatology 43:S45–S53
18. Michalopoulos GK (2010) Liver regeneration after partial hepatectomy: critical analysis of mechanistic dilemmas. Am J Pathol 176:2–13
19. Katz NR (1992) Metabolic heterogeneity of hepatocytes across the liver acinus. J Nutr 122:843–849
20. Gumucio JJ (1989) Hepatocyte heterogeneity: the coming of age from the description of a biological curiosity to a partial understanding of its physiological meaning and regulation. Hepatology 9:154–160
21. Athelogou M, Schmidt G, Schape A et al (2007) Cognition network technology: a novel multimodal image analysis technique for automatic identification and quantification of biological image contents. In: Shorte SL, Frischknecht F (eds) Imaging cellular and molecular biological functions. Springer, Berlin, Heidelberg, pp 407–422
22. Roysam B, Ancin H, Bhattacharjya AK et al (1994) Algorithms for automated characterization of cell populations in thick specimens from 3-d confocal fluorescence microscopy data. J Microsc 173:115–126
23. Turner JN, Ancin H, Becker DE et al (1997) Automated image analysis technologies for biological 3d light microscopy. Int J Imag Sys Technol 8:240–254
24. Karacali B, Vamvakidou A, Tozeren A (2007) Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers. BMC Med Imaging 7:7
25. Andersen ME, Clewell HJ, Frederick CB (1995) Applying simulation modeling to problems in toxicology and risk assessment: a short perspective. Toxicol Appl Pharmacol 133:181–187
26. Clark LH, Woodrow Setzer R, Barton HA (2004) Framework for evaluation of physiologically-based pharmacokinetic models for use in safety or risk assessment. Risk Anal 24:1697–1717
27. Clewell HJ III, Andersen ME, Barton HA (2002) A consistent approach for the application of pharmacokinetic modeling in cancer and noncancer risk assessment. Environ Health Perspect 110(1):85–93
28. von Neuman J (1966) Theory of self-reproducing automata. University of Illinois Press, Champaign, IL
29. Wiener N, Rosenblueth A (1946) The mathematical formulation of the problem of conduction of impulses in a network of connected excitable elements, specifically in cardiac muscle. Arch Inst Cardiol Mex 16:205–265
30. Wolfram S, Gad-el-Hak M (2003) A new kind of science. Appl Mech Rev 56:B18–B19
31. Silva HS, Martins ML (2003) A cellular automata model for cell differentiation. Physica A 322:555–566
32. de Sales JA, Martins ML, Stariolo DA (1997) Cellular automata model for gene networks. Phys Rev E 55:3262
33. Markus M, Bohm D, Schmick M (1999) Simulation of vessel morphogenesis using cellular automata. Math Biosci 156:191–206
34. Savill NJ, Hogeweg P (1997) Modelling morphogenesis: from single cells to crawling slugs. J Theor Biol 184:229–235
35. Glazier JA, Graner F (1993) Simulation of the differential adhesion driven rearrangement of biological cells. Phys Rev E 47:2128
36. Gillespie D (1976) A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys 22:403–434
37. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361
38. Gibson MA, Bruck J (2000) Efficient exact stochastic simulation of chemical systems with many species and many channels. J Phys Chem A 104:1876–1889
39. Shmulevich I, Dougherty ER, Zhang W (2002) Gene perturbation and intervention in probabilistic boolean networks. Bioinformatics 18:1319–1331
40. Shmulevich I, Dougherty ER, Kim S et al (2002) Probabilistic boolean networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 18:261–274
41. Biggar SR, Crabtree GR (2001) Cell signaling can direct either binary or graded transcriptional responses. EMBO J 20:3167–3176
42. Saez-Rodriguez J, Alexopoulos LG, Epperlein J et al (2009) Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol 5:331
43. Klamt S, Saez-Rodriguez J, Lindquist J et al (2006) A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinformatics 7:56
44. Jack J, Wambaugh J, Shah I (2011) Simulating quantitative cellular responses using asynchronous threshold boolean network ensembles. BMC Syst Biol (accepted)
45. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (sbml): a medium for representation and exchange of biochemical network models. Bioinformatics 19:524–531
46. Hucka M, Bergmann F, Hoops S et al (2010) The systems biology markup language (sbml): language specification for level 3 version 1 core (release 1 candidate). Nature Precedings
47. Cuellar A, Lloyd C, Nielsen P et al (2003) An overview of cellml 1.1, a biological model description language. Simulation 79:740–747
48. Lloyd C, Halstead M, Nielsen P (2004) Cellml: its future, present and past. Model Cell Tissue Funct 85:433–450
49. Bergmann FT, Sauro HM (2006) Sbw: a modular framework for systems biology. In: Proceedings of the 38th conference on winter simulation. Winter Simulation Conference, Monterey, California, pp 1637–1645
50. Hoops S, Sahle S, Gauges R et al (2006) Copasi: a complex pathway simulator. Bioinformatics 22:3067–3074
51. Paun A, Perez-Jimenez M, Romero-Campero F (2006) Modeling signal transduction using p systems. In: Hoogeboom H, Paun G, Rozenberg G, Salomaa A (eds) Membrane computing. Springer, Berlin, Heidelberg, pp 100–122
52. Manca V (2008) The metabolic algorithm for p systems: principles and applications. Theor Comput Sci 404:142–155
53. Jack J, Paun A (2009) Discrete modeling of biochemical signaling with memory enhancement. In: Priami C, Back RJ, Petre I (eds) Transactions on computational systems biology xi. Springer, Berlin, Heidelberg, pp 200–215
54. Jack J, Paun A, Rodríguez-Paton A (2010) A review of the nondeterministic waiting time algorithm. Nat Comput 1–11
55. Priami C, Regev A, Shapiro E et al (2001) Application of a stochastic name-passing calculus to representation and simulation of molecular processes. Inf Process Lett 80:25–31
56. Curti M, Degano P, Priami C et al (2004) Modelling biochemical pathways through enhanced [pi]-calculus. Theor Comput Sci 325:111–140
57. Nickerson DP, Hunter PJ (2005) The noble cardiac ventricular electrophysiology models in cellml. Prog Biophys Mol Biol 90:346–359
58. Bassingthwaighte J, Hunter P, Noble D (2009) The cardiac physiome: perspectives for the future. Exp Physiol 94:597–605
59. Hunt CA, Yan L, Ropella G et al (2007) The multiscale in silico liver. J Crit Care 22:348–349
60. Hohme S, Hengstler JG, Brulport M et al (2007) Mathematical modelling of liver regeneration after intoxication with ccl4. Chem Biol Interact 168:74–93
61. Ohno H, Naito Y, Nakajima H et al (2008) Construction of a biological tissue model based on a single-cell model: a computer simulation of metabolic heterogeneity in the liver lobule. Artif Life 14:3–28
62. Sheikh-Bahaei S, Maher JJ, Anthony Hunt C (2010) Computational experiments reveal plausible mechanisms for changing patterns of hepatic zonation of xenobiotic clearance and hepatotoxicity. J Theor Biol 265:718–733
63. Wambaugh J, Shah I (2010) Simulating microdosimetry in a virtual hepatic lobule. PLoS Comput Biol 6:e1000756
64. Shah I, Wambaugh J (2010) Virtual tissues in toxicology. J Toxicol Environ Health 13:314–328
65. Lerapetritou MG, Georgopoulos PG, Roth CM et al (2009) Tissue-level modeling of xenobiotic metabolism in liver: an emerging tool for enabling clinical translational research. Clin Transl Sci 2:228–237
66. Rowland M, Benet LZ, Graham GG (1973) Clearance concepts in pharmacokinetics. J Pharmacokinet Pharmacodyn 1:123–136
67. Rani HP, Sheu TWH, Chang TM et al (2006) Numerical investigation of non-newtonian microcirculatory blood flow in hepatic lobule. J Biomech 39:551–563
68. West GB, Brown JH, Enquist BJ (1997) A general model for the origin of allometric scaling laws in biology. Science 276:122–126
69. West GB, Brown JH, Enquist BJ (1999) The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science 284:1677–1679
70. Baish JW, Jain RK (2000) Fractals and cancer. Cancer Res 60:3683–3688
71. Di Ieva A, Grizzi F, Gaetani P et al (2008) Euclidean and fractal geometry of microvascular networks in normal and neoplastic pituitary tissue. Neurosurg Rev 31:271–281
72. Cross SS (1997) Fractals in pathology. J Pathol 182:1–8
73. Pang KS (1983) The effect of intercellular distribution of drug-metabolizing enzymes on the kinetics of stable metabolite formation and elimination by liver: first-pass effects. Drug Metab Rev 14:61–76
74. Pang KS, Stillwell RN (1983) An understanding of the role of enzyme localization of the liver on metabolite kinetics: a computer simulation. J Pharmacokinet Pharmacodyn 11:451–468
75. Andersen ME, Eklund CR, Mills JJ et al (1997) A multicompartment geometric model of the liver in relation to regional induction of cytochrome p450s. Toxicol Appl Pharmacol 144:135–144
76. Abu-Zahra TN, Pang KS (2000) Effect of zonal transport and metabolism on hepatic removal: enalapril hydrolysis in zonal, isolated rat hepatocytes in vitro and correlation with perfusion data. Drug Metab Dispos 28:807–813
77. Liu L, Pang KS (2006) An integrated approach to model hepatic drug clearance. Eur J Pharm Sci 29:215–230
78. Basciano C, Kleinstreuer C, Kennedy A et al (2010) Computer modeling of controlled microsphere release and targeting in a representative hepatic artery system. Ann Biomed Eng 38:1862–1879
79. Li S, Armstrong CM, Bertin N et al (2004) A map of the interactome network of the metazoan C. elegans. Science 303:540–543
80. Bruggeman FJ, Westerhoff HV (2007) The nature of systems biology. Trends Microbiol 15:45–50
81. Goh K-I, Cusick ME, Valle D et al (2007) The human disease network. Proc Natl Acad Sci 104:8685–8690
82. Meek ME, Bucher JR, Cohen SM et al (2003) A framework for human relevance analysis of information on carcinogenic modes of action. Crit Rev Toxicol 33:591–653
83. Karp PD (2001) Pathway databases: a case study in computational symbolic theories. Science 293:2040–2044
84. Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing. Int J Hum Comput Stud 43:907–928
85. Demir E, Cary MP, Paley S et al (2010) The biopax community standard for pathway data sharing. Nat Biotechnol 28:935–942
86. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9:67–103
87. Aldridge BB, Saez-Rodriguez J, Muhlich JL et al (2009) Fuzzy logic analysis of kinase pathway crosstalk in tnf/egf/insulin-induced signaling. PLoS Comput Biol 5:e1000340
88. Sachs K, Perez O, Peer D et al (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science 308:523–529
89. Noy NF, Shah NH, Whetzel PL et al (2009) Bioportal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res 37:W170–W173
90. Kim J-D, Ohta T, Tateisi Y et al (2003) Genia corpus: a semantically annotated corpus for bio-textmining. Bioinformatics 19:i180–i182
91. Noble D (2006) The music of life: biology beyond the genome. Oxford University Press, Oxford
92. Merks RMH, Glazier JA (2005) A cell-centered approach to developmental biology. Physica A 352:113–130
93. Poulin P, Theil F-P (2002) Prediction of pharmacokinetics prior to in vivo studies. Ii. Generic physiologically based pharmacokinetic models of drug disposition. J Pharm Sci 91:1358–1370
94. Brian Houston J, Carlile DJ (1997) Prediction of hepatic clearance from microsomes, hepatocytes, and liver slices. Drug Metab Rev 29:891–922
95. Naritomi Y, Terashita S, Kagayama A et al (2003) Utility of hepatocytes in predicting drug metabolism: comparison of hepatic intrinsic clearance in rats and humans in vivo and in vitro. Drug Metab Dispos 31:580–588
96. Santostefano MJ, Richardson VM, Walker NJ et al (1999) Dose-dependent localization of tcdd in isolated centrilobular and periportal hepatocytes. Toxicol Sci 52:9–19
97. Collins FS, Gray GM, Bucher JR (2008) Transforming environmental health protection. Science 319:906–907
Chapter 18

Agent-Based Models of Cellular Systems


Nicola Cannata, Flavio Corradini, Emanuela Merelli, and Luca Tesei

Abstract
Software agents are particularly suitable for engineering models and simulations of cellular systems. In a very natural and intuitive manner, individual software components are therein delegated to reproduce in silico the behavior of individual components of living systems at a given level of resolution. Individuals' actions and interactions among individuals allow complex collective behavior to emerge. In this chapter we first introduce the readers to software agents and multi-agent systems, reviewing the evolution of agent-based modeling of biomolecular systems in the last decade. We then describe the main tools, platforms, and methodologies available for programming societies of agents, possibly profiting also from toolkits that do not require advanced programming skills.

Key words: Agent-based modeling and simulation, Agent-oriented software engineering, Behavioral
models of biological systems, Multi-agent systems

1. Introduction

A biological system can be simulated in a natural way by a society of


software agents. Actually, the goal of systems biology is that of
providing mathematical and computational models of life, ranging
from the molecular level to those of molecular complexes, cells,
tissues, organs, organisms, communities of organisms, and ecosys-
tems. Models permit analysis and simulation, aiming at reprodu-
cing in silico peculiar aspects of life. Among the computational
models, behavioral models are intuitively closer to the reality than
pure mathematical models. In fact, they rely on software entities
whose characteristics resemble those of individual components of
alive systems at some chosen resolution. The simulated behavior
and interactions of those individuals at some given level of abstrac-
tion should let emerge at higher level the properties of the
collective behavior of the individuals population. For instance,
building a model with interacting enzymes and metabolites
(i.e., the individuals) should let emerge biochemical kinetic laws

Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_18, # Springer Science+Business Media, LLC 2013

399
400 N. Cannata et al.

of a well-mixed solution (i.e., the population) of enzymes and


metabolites. Modeling the behavior and the interactions of the
individuals in a community of organisms should reproduce their
social attitudes and rules. An effective model should allow to
zoom-in and zoom-out, letting emerge at a higher level of
abstraction the structure and properties determined by lower levels.

1.1. Software Agents and Multi-agent Systems
Complex software systems can be designed around autonomous interacting components. Robin Milner suggested calling each of the several parts of a software system an "agent," with its own identity, which persists through time (1). The name "agent" derives from the Latin agere, to act. A software agent is definitively a piece of software that acts for a user or for another program in a relationship of agency. We can metaphorically think of some kind of agreement to act on one's behalf. Such "action on behalf of" implies that the agent has the authority to decide which action (if any) is appropriate. Wooldridge defined an agent as "a computer system situated in some environment and capable of autonomous, flexible action in that environment in order to meet its design objectives" (2).
The autonomy property (autonomous agent) implies that the agent has control over its internal state and over its own behavior. A comparison between agent-based programming and object-oriented programming (OOP) can better explain this property. Both methodologies permit defining an abstraction of an internal state. In the case of OOP, the state encapsulated in an object can be controlled by the software entity that controls the object and invokes its methods. In the case of an agent, instead, the agent itself has full control over the state it encapsulates.
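A minimal sketch of this distinction follows (the class names and the refusal rule are hypothetical): an object's state is driven from the outside, while an agent evaluates a request against its own internal state and may refuse it.

```java
// Minimal sketch of the autonomy distinction (all names hypothetical):
// an object's state is set from outside; an agent decides for itself
// whether to honor a request, based on its own internal state.

class ThermostatObject {                 // OOP: the caller controls the state
    private double setPoint;
    public void setSetPoint(double v) { setPoint = v; }  // always obeyed
    public double getSetPoint() { return setPoint; }
}

class ThermostatAgent {                  // agent: request, do not command
    private double setPoint = 20.0;
    private final double maxSafe = 25.0; // part of the agent's own state

    // The agent evaluates the request against its goals and may refuse it.
    public boolean requestSetPoint(double v) {
        if (v > maxSafe) return false;   // autonomous refusal
        setPoint = v;
        return true;
    }
}

public class AutonomyDemo {
    public static void main(String[] args) {
        ThermostatObject obj = new ThermostatObject();
        obj.setSetPoint(40.0);           // the object cannot refuse
        ThermostatAgent agent = new ThermostatAgent();
        System.out.println(agent.requestSetPoint(40.0)); // false: refused
    }
}
```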
The situatedness property (situated agent) implies that the agent perceives its environment through some "sensors" and is able to act in the environment through some "actuators." Usually sensors and actuators are software ones, but we can also think of robotic agents in which they are actually physical devices. The environment in which the agent is situated is typically dynamic, likely open, unpredictable, and populated by other agents (i.e., multi-agent).
The flexibility property (flexible agent) can be articulated in different ways. A reactive agent responds in a timely fashion to environmental change. An adaptive agent responds to environmental change according to its internal state. A proactive agent acts in anticipation of future goals.
An additional property of agents is mobility. A mobile agent is able to move from one distributed environment to another.
A multi-agent system (MAS) consists of a number of agents
interacting with each other in a dynamic environment (3). Agents
of a MAS will be acting with their different goals and motivations.
Furthermore, agents will require the ability to communicate, coop-
erate, and perceive other agents. A well-designed MAS is one that achieves a global task through the tasks of the single agents and their interactions. The first design step consists in decomposing a global task into distributable subcomponents, yielding tractable tasks for each agent. Once the agents responsible for all the different tasks have been identified, communication channels among them must be established. In this way sufficient information can be provided to each agent in order for it to achieve its task. Finally, the agents must be coordinated so that they cooperate on the global task or, at the very least, are prevented from pursuing conflicting strategies in trying to achieve their tasks. The fundamental problem in analyzing or designing a MAS is determining how the combined actions of a large number of agents lead to coordinated behavior on the global task.

1.2. Agent-Based Modeling
Biological systems are complex ones, i.e., many of their components are coupled in a nonlinear fashion (4). They are characterized by variables having complicated, discontinuous behaviors over time. Furthermore, they exhibit the emergence property, i.e., complex patterns emerge from simpler interaction rules. The global behavior of such a system can be determined by defining the lower-level interaction rules among its components. Developing software for agent-based systems can profit from modern software engineering
techniques, including decomposition, abstraction, and organiza-
tion. A problem can be divided into smaller, manageable subpro-
blems. Some details of a problem can be chosen to be modeled,
while others can be ignored. The relationships among the various
system components can be identified and managed. Software
agents are situated in space and time and have some properties
and some sets of local interaction rules. Though "intelligent,"
they cannot by themselves deduce the global behavior resulting
from their dynamic interactions. An agent-based system usually
evolves from the microlevel to the macrolevel. Usually agent-
based modeling (ABM) adopts a bottom-up design strategy rather
than a top-down one. Agents are commonly assumed to have well-
defined bounds and interfaces, as well as spatial and temporal
properties, including such dynamic properties as movement, veloc-
ity, acceleration, and collision. Agent-based systems also allow for
easy modification of interaction rules or behavior, as well as for
viewing agents or groups of agents at different levels of abstraction.
Modeling a system with a bottom-up approach requires that every individual agent's behavior be described. The greater the number
of details that go into describing the behavior of the system, the
greater is the computational power that is required to simulate the
behaviors of all constituent agents. This is a limitation in modeling
large systems using ABM (5). A reasonable approach is to provide
several levels of abstraction and granularity, which can be chosen
depending on the level of detail needed and the computational
resources available.
402 N. Cannata et al.

In general, ABMs are a class of computational models for


simulating actions and interactions of multiple entities in an
attempt to re-create and predict the appearance of complex phe-
nomena. Modelers can observe the emergence of phenomena from the lower levels of systems to higher levels. The result can sometimes be an equilibrium or an emergent pattern (e.g., cycles). Simple
behavioral and interaction rules are able to generate complex
behavior. Individual agents are usually characterized as rational,
acting in what they perceive as their interest, e.g., economic benefit
or social status, using heuristic or simple decision-making rules.
They can also experience learning.
ABM has been used since the 1990s, for instance, to analyze financial and business issues, to describe social phenomena, to model the evolution of populations and ecosystems, wars, the spreading of epidemics, and in general to solve technological problems. Examples of applications include human organizations, consumer behavior, logistics, distribution, supply chains, and stock markets. Other
examples concern crowd behavior, e.g., in public places and in
emergency situations, vehicle movement and traffic congestion,
growth and decline of ancient civilizations, migrations, and social
network effects.

1.3. Review of ABM of Cellular Systems
Computational models of cellular systems began to appear even before the rapid establishment of systems biology. This recent multidisciplinary research area (its manifesto (6) was published in 2001) builds on the unprecedented availability of biomolecular data (characterized by the suffix "-omes"). The high-throughput technologies changed
the focus from the identification and analysis of a single molecule at a time (e.g., gene, protein, metabolite) to the systematic simultaneous characterization of whole populations of molecules (e.g.,
genome, proteome, metabolome). The systemic approach can be applied at different levels of abstraction, thereby considering the nature, properties, and interactions of the entities at different levels of the structural hierarchy of life, from molecules to ecosystems.
For instance, molecular systems biology concentrates its attention
at the molecular level taking into account metabolic networks,
signal transduction networks, and genetic regulatory networks.
Several computational approaches have been proposed to
model cellular signaling pathways (e.g., Boolean networks, Petri
nets, Artificial Neural Networks, Cellular Automata) (7). In Cellular
Automata (CA) (8) information is inherently processed in parallel.
With CA the interactions between cells or molecules can be mod-
eled in a matrix, where the state of an element of the matrix depends
on the states of neighboring elements.
Agents, too, permit modeling a cell as a society of autonomous agents acting in parallel. Agents communicate among themselves through messages and have the cognitive capabilities to interact with the surrounding environment. In Cellulat each agent communicates with the others through the creation or modification of signals on a shared data structure named "blackboard" (7). The
blackboard permits a very abstract representation of cellular com-
partments related to the signaling pathways, whereas the different
objects created on the blackboard represent signal molecules, acti-
vation or inhibition signals, or other elements belonging to the
intracellular medium.
Cellular systems can be generally simulated at three different scales of resolution: the nanoscale (10^-10 m), the mesoscale (10^-8 m), and the continuum or macroscale (10^-3 m) (9). At the atomic level
(nanoscale), molecular dynamics (MD) and Brownian dynamics
(BD) are typically used to model the behavior of a limited number
of atoms over relatively short periods of time and space. MD is fully
deterministic and remarkably accurate over the short temporal and
spatial scales that are normally simulated. Because of their accuracy,
MD techniques are well suited to simulate state or conformational
changes, predict binding affinities, investigate single molecule tra-
jectory, and model stochastic or diffusive interaction between small
numbers of macromolecules. In order to model molecular events
involving large numbers of molecules or macromolecules over
extended periods of time and space, continuum approaches are
usually adopted. At the macroscale, molecules essentially lose
their discreteness and become infinitely small and infinitely numer-
ous. The system of interest can be described with ordinary (ODE)
or partial (PDE) differential equations. However, not all sets of differential equations are solvable, nor are all systems suitable to be described by differential equations. Due to their continuum nature, the solutions to differential equations always generate smooth curves or surfaces that fail to capture the true granularity or stochasticity of living systems. Discontinuities, state changes,
irregular geometries, or discreteness with low number of molecules
are not easily described by differential equations. Cellular systems can be effectively and efficiently modeled at the mesoscale (10^-8 to 10^-7 m). At that level, macromolecules can still be treated as discrete objects occupying a defined space or volume. The possibility of representing single macromolecules allows mesoscale models to display the stochasticity or granularity found in real molecular systems. At this scale Brownian motion dominates over the other forces, and therefore significant dynamic simplifications are possible, allowing very long time scales and very large numbers of entities or reactions to be modeled. Wishart et al. (9) propose to perform mesoscale simulation by means of dynamic cellular automata (DCA), a hybrid between the classical CA and agent systems.
DCA rely on a 2D grid to roughly resemble cell compartments.
Simulating Brownian motion and using simple pairwise interaction
rules, DCA can be used to model spatial and temporal phenomena
that include macromolecular diffusion, viscous drag, enzyme rate processes, metabolic reactions, and genetic circuits. SimCell (9) was


an easy-to-use graphical simulator able to model a set of molecular
components and a collection of interaction rules. The five repre-
sented components are small molecules (metabolites, ligands),
membrane proteins, soluble proteins or RNA molecules, DNA
molecules (non-mobile), and membrane (non-mobile). Mem-
branes describe boxes or borders and may be permeable or imper-
meable to certain molecules. They may be used to define cell
compartments, including the nucleus or other organelles.
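The flavor of this kind of mesoscale, rule-based simulation can be conveyed by a minimal sketch (this is not SimCell itself; the grid size, particle counts, and reaction probability are hypothetical): particles random-walk on a 2D lattice, and when an A and a B particle land on the same site they react to form C with some probability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal DCA-style sketch (not SimCell): A and B particles random-walk on
// a 2D lattice; a co-located A + B pair may react to C. Parameters are
// hypothetical.
public class DcaSketch {
    static final int SIZE = 50;
    static final double P_REACT = 0.5;     // hypothetical reaction probability
    static final Random RNG = new Random(42);

    static class Particle {
        int x, y; char type;
        Particle(int x, int y, char t) { this.x = x; this.y = y; this.type = t; }
    }

    public static void main(String[] args) {
        List<Particle> particles = new ArrayList<>();
        for (int i = 0; i < 200; i++)      // seed 100 A and 100 B at random sites
            particles.add(new Particle(RNG.nextInt(SIZE), RNG.nextInt(SIZE),
                                       i < 100 ? 'A' : 'B'));
        for (int step = 0; step < 1000; step++) {
            for (Particle p : particles) { // Brownian motion: one lattice hop
                p.x = Math.floorMod(p.x + RNG.nextInt(3) - 1, SIZE);
                p.y = Math.floorMod(p.y + RNG.nextInt(3) - 1, SIZE);
            }
            // pairwise rule: co-located A + B -> C with probability P_REACT
            for (Particle a : particles) {
                if (a.type != 'A') continue;
                for (Particle b : particles) {
                    if (b.type == 'B' && a.x == b.x && a.y == b.y
                            && RNG.nextDouble() < P_REACT) {
                        a.type = 'C';
                        b.type = ' ';      // mark consumed; pruned below
                        break;
                    }
                }
            }
            particles.removeIf(p -> p.type == ' ');
        }
        long c = particles.stream().filter(p -> p.type == 'C').count();
        System.out.println("C produced: " + c);
    }
}
```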
Troisi et al. (10) define an agent as "a computer system that decides for itself." After sensing the environment, it takes decisions
based on some rules. ABM was applied to the theoretical problem
of molecular self-assembly. In the problem, a system evolves from a
separated to an aggregated state following a combination of sto-
chastic, deterministic, and adaptive rules. Agents are identified with
a molecule or a group of molecules. Molecules have rigid shapes
formed by four contiguous cells of a 2D square lattice. Cells can be
of three types (neutral, positive, negative) and their interactions are
nearest-neighbor only. The interactions mimic van der Waals attraction between all cell types and Coulomb repulsion/attraction
between similarly/different charged cells. The published results
show that it is possible to devise a combination of stochastic,
deterministic, and adaptive rules that lead a disordered system to
organize itself in an ordered low-energy configuration.
Using agent-based technology, Emonet et al. developed Agent-
Cell (11), a model to study the relationships between stochastic
intracellular processes and behavior of individual cells. Korobkova
et al. showed that behavioral variability of an individual cell could
be the result of the stochastic nature of molecular interactions in
molecular signaling pathways (12). Consequently, even genetically
identical cells can exhibit different behaviors. Stochastic molecular
events in signaling pathways play a significant role in single-cell
behavior. AgentCell studies how molecular noise influences the
behavior of a swimming cell in a 3D environment. The model is
able to reproduce experimental data for bacterial chemotaxis, one of the best characterized biological systems. In the model, each
bacterium is an agent equipped with its own chemotaxis network,
motors, and flagella. A piece of software (scheduler) is responsible
for stepping the system through time. Scheduling consists of
keeping a global clock, updating the clock to the next event, and
maintaining a sorted list of events. Each agent inserts its own future events inside the scheduler's list of events.
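A minimal sketch of such an event-scheduling core (hypothetical classes, not AgentCell's actual code) could look as follows: a priority queue keeps events sorted by time, the global clock jumps to the next event, and handling an event may schedule new future events.

```java
import java.util.PriorityQueue;

// Minimal discrete-event scheduler sketch (hypothetical, not AgentCell code):
// events are kept sorted by time; the global clock jumps from event to event,
// and executing an event may insert new future events into the queue.
public class EventScheduler {
    record Event(double time, String agent, Runnable action) {}

    private final PriorityQueue<Event> queue =
            new PriorityQueue<>((a, b) -> Double.compare(a.time(), b.time()));
    private double clock = 0.0;

    void schedule(double time, String agent, Runnable action) {
        queue.add(new Event(time, agent, action));
    }

    void run(double tEnd) {
        while (!queue.isEmpty() && queue.peek().time() <= tEnd) {
            Event e = queue.poll();
            clock = e.time();          // advance the clock to the next event
            e.action().run();          // may schedule further future events
        }
    }

    public static void main(String[] args) {
        EventScheduler s = new EventScheduler();
        // a toy "tumble" agent that reschedules itself every 0.5 time units
        Runnable[] tumble = new Runnable[1];
        tumble[0] = () -> {
            System.out.printf("t=%.1f cell tumbles%n", s.clock);
            s.schedule(s.clock + 0.5, "cell-1", tumble[0]);
        };
        s.schedule(0.5, "cell-1", tumble[0]);
        s.run(2.0);
    }
}
```

Because the clock jumps directly between events, agents with very different characteristic timescales can coexist without forcing a single uniform time step on the whole system.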
From this flashback we may observe how the cellular ABM community was already trying to answer the needs that were then arising in the systems biology community. ABMs, establishing
a correspondence between a population of autonomous, interact-
ing, more or less intelligent software components and a popula-
tion of different species of biomolecules, permit more realistic
simulations than models based on differential equations. Phenomena dictated by low numbers of molecules or arising from spatial effects can be easily represented.
The necessity of representing space emerges as the "final frontier" (13) in systems biology. Lemerle et al. underline that only at that time did the technological and theoretical advances begin to allow
the simulation of detailed kinetic models of biological systems
reflecting the stochastic movement and reactivity of individual
molecules within cellular compartments. At that time some model
specification languages like CellML (14) and SBML (15), based on the XML markup language, began to become established. SBML rapidly
became a de facto standard, permitting the interchange of models
and reproducibility of simulation on compatible simulation envir-
onments and tools. Lemerle et al. (13) also consider a number of
issues relevant to cellular simulation. The first of all is obviously space representation. Is continuum space supported, e.g., as in systems based on differential equations? Is space discretized, i.e., represented as a 2D regular lattice or as 3D voxels ("voxelizations")? Another fundamental issue is the representation of molecular entities as individuals. Are species represented as concentrations, as in ODE systems, or as populations (i.e., numbers), or taking into account each individual (as with agents)? Another issue con-
cerns the enabling conditions for biomolecular reactions to happen
in the simulated cellular environment. Depending on the spatial
approach adopted, the reaction can be simulated when a collision
between two suitable species is detected or when the two species are
in the same voxel (or in neighboring ones). Important issues are
also the geometry of the model and the movement itself.
Takahashi et al. (16) also highlight the importance of a spatial representation in systems biology. In "particle space," molecules are represented as individual particles with positions in a continuum
space. Particles are usually given motions according to some kind
of force equations that are numerically integrated to advance
time. Reactions are represented as collisions between particles.
Discrete space representation (mesoscopic) discretizes the space either by subvolumes (voxels) of an identical shape (typically cubic) or by means of a regular lattice. Some methods allow at most one particle to occupy a lattice site; others allow multiple particles to reside in a single lattice site. Detailed space representation permits dealing with very important issues that otherwise cannot be taken into account, e.g., molecular crowding. Extremely high protein density in the intracellular space can actually alter protein activities and break down classical reaction kinetics.
A goal in biomodeling research is to understand the linkage
from molecular level events to the emerging behavior of the system
(17). This task requires having plausible, adequately detailed design
plans for how components (single and composite) at various system
levels are thought to fit and function together. Experimentation is

then used to reconcile different design plan hypotheses. To actually


demonstrate that a design plan is functionally plausible, it is how-
ever necessary to assemble individual components according to a
design, and then show that the constructed device exhibits beha-
viors that match those observed in the original experiment. Tang
et al. (17) apply this methodology, presenting a multilevel, agent-based model that represents the dynamics of rolling, activation,
and adhesion of individual leukocytes in vitro.
Vallurupalli and Purdy present an agent-based 3D model of
phage lambda, with its two characteristic phases: lysogenic and lytic
(4). This widely studied gene regulatory system has been modeled
using the unified modeling language (UML) and simulated using
the breve (18) 3D visualization engine. In their article it is stressed
how complex behavior emerges from local agent interactions. The agent approach also allows one to study how individual parameters affect overall system behavior.
Differently from Emonet et al., the model of bacterial chemo-
taxis proposed by Guo et al. (19) is a hybrid one. Biological cells are
modeled as individuals (agents) while molecules are represented by
quantities. This hybridization in entity representation entails a
combined modeling strategy with agent-based behavioral rules
and differential equations, thereby balancing the requirements of
extendible model granularity with computational tractability. In their assay of 10^3 cells and about 10^6 molecules they are able to produce cell migration patterns that are comparable to laboratory observations.
A valuable issue raised by Guo and Tay (20) is that of event
scheduling. The update scheme of a MAS model refers to the
frequency of agent state updates and how these are related in
temporal order. In contrast to verifiable agent behavioral rules at
the individual level, the update scheme is a design decision made by
the model developer at the systems level that is subject to realism
and computational efficiency issues that directly affect the credibil-
ity and the usefulness of the simulation results. Usually simulations adopt a uniform time-step update scheme. Guo and Tay, in modeling immunological phenomena characterized by multiple timescales, suggest adopting an event-scheduling-based asynchronous update scheme. The scheme allows arbitrarily smaller timescales for realism
and avoids unnecessary execution and delays to achieve efficiency.
In their article the application of the event-scheduling update
scheme to realistically model the B cell life cycle is presented. The
simulation results show a significantly reduced execution time (40
times faster) and also reveal the conditions where the event-
scheduling update scheme is superior.
The research activity on ABM of cellular systems at the University of Sheffield has indeed been remarkable. A proof-of-concept was the Epitheliome (21), an ABM in which there is a one-to-one correspondence between biological cells and software agents. The model is able to predict the emergent behavior resulting from the


interaction of cells in epithelial tissue. There is no fixed scaffold
that determines the structure achieved during tissue morphogene-
sis. The organization into complex tissues and organs is an emer-
gent property of the constituent cells of an organism and of the
genes that determine the behavior of those cells. The Epitheliome
model addresses explicitly the concept of structure as an emergent
property of the interaction or social behavior of a large number (10^6–10^7) of individual cells. All software development has been
carried out using object-oriented code in Mathworks MATLAB
(22). The model adopts rule-based modeling to describe cell cycle
progression and the modeled cell types are stem cells (which can
undergo growth and division, bonding, spreading, lateral migra-
tion, and apoptosis), transit amplifying cells, mitotic cells, post-
mitotic cells, and dead cells.
ABM has then been generalized to model intracellular chemical interactions (23). It is essential that an ABM is able to deal with individual interactions of molecule agents with the same accuracy as reaction kinetics. The authors argue that any reasonably random movement within an agent's confines is sufficient for the model to operate properly. Agents must at least move around enough to
regularly collide. The model must of course agree with the
corresponding reaction kinetics model in the circumstance where
reaction kinetics can reasonably be applied (i.e., with large numbers
of molecules of well-mixed chemicals).
The review from Walker and Southgate (24) examines
individual-based models (cellular automata or agent-based meth-
odologies) of cellular systems to explore multi-scale phenomena in
biology. Such models, where individual cells are represented as
equivalent virtual entities governed by simple rules, are inherently
extensible and can be integrated with other modeling modalities
(e.g., partial or ordinary differential equations) to model multi-
scale phenomena. Alternatively, hierarchical agent models may be
used to explore the functions of biological systems across temporal
and spatial scales.
Recently, Adra et al. (25) developed a 3D multi-scale compu-
tational model of the human epidermis which is composed of three
interacting and integrated layers: (1) an ABM which captures the
biological rules governing the cells in the human epidermis at the
cellular level and includes the rules for injury induced emergent
behaviors, (2) a COmplex PAthway SImulator (COPASI) (26)
ODE model which simulates the expression and signaling of the
transforming growth factor (TGF-β1) at the subcellular level, and
(3) a mechanical layer embodied by a numerical physical solver
responsible for resolving the forces exerted between cells at the
multicellular level.
Other recently published ABMs of cellular systems concern parasitology (evolution of Chagas disease (27)) and immunology (competition between lung metastases and the immune system (28)). In the field of immunology, too, an interesting recent review on ABM related to host–pathogen interaction and disease dynamics is that from Bauer et al. (29). The authors emphasize the well-known feature of generating surprisingly complex and emergent behavior from very simple rules, including periodic behaviors or intricate spatial and temporal patterns. Nonlinearities and time delays are not difficult to treat empirically since they can be incorporated into the agents' rules, or they may even emerge naturally as a consequence of the system's collective dynamics. Another
noted advantage of ABMs is that their computational structure is
inherently parallel and therefore can be implemented on parallel
computers very efficiently.
The recent advent of programming for graphics processing units (GPUs), together with the introduction of multi-core CPUs, has enabled an easier, cheaper, and resolute parallelization of cellular simulation algorithms. In particular, GPUs do not have to
perform many of the generalized tasks that a CPU must perform
and therefore they have become highly optimized to perform
tightly coupled data-parallel processing with hundreds of indepen-
dent processor units and specialized memory addressing (30). In
the past few years GPUs have already been used for tasks such as
sequence analysis and molecular dynamics. Software toolkits like
CUDA and OpenCL have greatly eased the complexity of GPU
programming. ABM has numerous implementation challenges on
the GPU to handle dynamic agents and their interactions. The
methodology article of Christley et al. presents a pedagogical
approach to describing how methods for multicellular modeling
are efficiently implemented on the GPU. Aspects like memory
layout of data structures and functional decomposition are dis-
cussed. The authors also deal with various programmatic issues and provide a set of design guidelines for GPU programming that are instructive for avoiding common pitfalls as well as for extracting performance from the GPU architecture.
The review from Demattè and Prandi (31) takes stock of some
recent efforts in exploiting the processing power of GPUs for the
simulation of biological systems. General purpose scientific com-
puting on graphics processing units (GPGPU) actually offers the
computational power of a small computer cluster at a cost of a few
hundred dollars. However, computing with a GPU requires the
development of specific algorithms, since the programming para-
digm substantially differs from traditional CPU-based computing.
Other computational infrastructures, like the Grid, developed for the seamless sharing of computational power and other resources like memory, data, and knowledge among virtual organizations, enable
ambitious collaborative projects to take off. The ImmunoGrid
project, for instance, aims at developing agent-based simulations
of the human immune system at a natural scale (32). ImmunoGrid

has the task of modeling one of the most challenging components


of the virtual physiological human (VPH). The VPH (33) is cur-
rently developed through several initiatives that are expected to
enable an integrative and analytical approach to the study of medi-
cine and physiology and to drive the paradigm shift in health care.
The key benefits that the VPH aims to deliver are a holistic
approach to medicine, personalized care solutions, a reduced need
for animal experiments, and a preventive approach to the treatment
of disease.
Virtual tissues (VT) are clearly of paramount importance for
toxicology. Predicting the risk of chemical-induced human injury is
a major challenge in toxicology (34). The translational issues in
elucidating the complex sequence of events from chemical-induced
molecular changes to adverse tissue level outcomes, and estimating
their risk in humans, provide a practical context for developing VT.
They aim to predict histopathological outcomes from alterations of
cellular phenotypes that are controlled by chemical-induced per-
turbations in molecular pathways. As has already emerged from our excursus through the state of the art, the behaviors of thousands of
heterogeneous cells in tissues can be naturally simulated with
ABM. Further, Shah and Wambaugh state that to extrapolate tox-
icity across species, chemicals, and doses, VT require four main
components: (a) organization of prior knowledge on physiologic
events to define the mechanistic rules for agent behavior, (b)
knowledge on key chemical-induced molecular effects, including
activation of stress sensors and changes in molecular pathways that
alter the cellular phenotype, (c) multiresolution quantitative and
qualitative analysis of histologic data to characterize and measure
chemical-, dose-, and time-dependent physiologic events, and (d)
multi-scale, spatiotemporal simulation frameworks to effectively
calibrate and evaluate VT using experimental data.
We close our introduction to ABM of cellular systems by refer-
ring the readers also to the recent review on Internet resources for
ABM by Devillers et al. (35).

2. Materials

Nowadays, ABMs can be designed on dedicated simulation platforms or coded with specialized programming tools and frameworks. Gilbert and Bankes depicted ABM in a symbiotic
relationship with computing technology (36). Modeling with
agents became feasible only with the advent of personal worksta-
tions. Following the technological development, the scale and
sophistication of the software available for modelers have greatly
increased. Very sophisticated modeling is now viable thanks to
complex algorithms, toolkits, and libraries. The earliest models

were developed on mainframe computers. From the early 1990s


most models have been developed in conventional programming
languages such as C++, JAVA, and SMALLTALK. But the disad-
vantages of using a general-purpose language were evident: basic algorithms must always be re-implemented, graphics libraries could hardly be coupled with dynamic modeling, and the resulting code was easily accessible only to those familiar with the language and the compiler needed to run it. The first development was repre-
sented by the emergence of several standardized libraries easily
includable in developed ABM programs. REPAST (37), originally
only a set of JAVA libraries, allowed programmers to build simula-
tion environments (e.g., regular lattices), create agents in social
networks, collect data from simulations automatically, and build
user interfaces easily. Its features and design owe a lot to SWARM
(38), one of the first ABM libraries (36). These libraries required
nevertheless a good working knowledge of the programming lan-
guage that they are aimed at, usually JAVA.
In the meanwhile, agent-oriented programming languages began to become established. JADE (39), in development since at least 2001 and adopted by a wide international community, is a very good example. Being developed over JAVA, it is consequently portable to all the operating systems supporting JAVA. JADE includes both
the libraries (i.e., the JAVA classes) required to develop application
agents and the run-time environment that provides the basic ser-
vices and that must be active on the device before agents can be
executed. Each instance of the JADE run-time is called a "container" (since it contains agents). The set of all containers is called the "platform"; it provides a homogeneous layer that hides from agents (and from application developers too) the complexity and the diversity of the underlying tiers (hardware, operating systems, types of network, JAVA Virtual Machine) (40). In this way, it realizes a middleware, lying in the middle between application software that may be working on different operating systems on different computers. The purpose of a middleware is interoperability, i.e., a set of services is provided that allows multiple processes (e.g., agents) running on one or more machines to interact. In MAS, interoperability
should also be granted between heterogeneous agent-based sys-
tems. The issue of agent systems standardization has been tackled
by FIPA (41). FIPA is an IEEE Computer Society standards orga-
nization that promotes agent-based technology and the interoper-
ability of its standards with other technologies. Originally formed
in 1996 as a Swiss-based not-for-profit organization to produce
software standards specifications for heterogeneous and interacting
agents and agent-based systems, in 2005 it was officially accepted
by the IEEE (42) as its 11th standards committee.
JADE is compliant with the FIPA specifications. As a conse-
quence, JADE agents can interoperate with other agents, provided
that they comply with the same standard. Other FIPA-compliant

agent systems include Java Intelligent Agent Componentware


(JIAC) (43), Smart Python multi-Agent Development Environ-
ment (SPADE) (44), which is written in the Python programming
language, and JACK Intelligent Agents (45). JACK is a mature,
cross-platform environment for building, running, and integrating
commercial-grade MASs. It is built on the sound BDI (Beliefs/
Desires/Intentions) logical foundation. BDI is an intuitive and
powerful abstraction that allows developers to manage the com-
plexity of the problem. In JACK, agents are defined in terms of
their beliefs (what they know and what they know how to do), their
desires (what goals they would like to achieve), and their intentions
(the goals they are currently committed to achieving).
Returning to JADE, we can observe that from the functional point of view it provides the basic services necessary for distributed peer-to-peer applications in fixed and mobile environments (40).
Each agent can dynamically discover other agents and communi-
cate with them according to the peer-to-peer paradigm. From the
application point of view, each agent is identified by a unique name
and provides a set of services. It can register and modify its services and/or search for agents providing given services; it can control its life cycle and, in particular, communicate with all other peers.
Agents communicate by exchanging asynchronous messages, a
communication model almost universally accepted for distributed
and loosely coupled communications, i.e., between heterogeneous
entities that do not know anything about each other. In order to
communicate, an agent just sends a message to a destination.
Agents are identified by a name (there is no need for the destination
object reference to send a message) and, as a consequence, there is
no temporal dependency between communicating agents. The
sender and the receiver need not be available at the same time. The
receiver may not even exist (or not yet exist) or might not be directly
known by the sender, which can specify a property (e.g., all agents
interested in football) as a destination. Because agents identify each
other by name, hot changes of their object references are transparent
to applications.
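As a minimal illustration of this messaging model, the following Java sketch shows a JADE agent that sends one INFORM message to a peer identified purely by a local name; the receiver's existence is never checked, since delivery is asynchronous. The receiver name "peer" is hypothetical, and the fragment is an illustrative sketch, not code taken from the chapter:

    import jade.core.AID;
    import jade.core.Agent;
    import jade.lang.acl.ACLMessage;

    // Minimal JADE agent that sends a single asynchronous message.
    // "peer" is a hypothetical local agent name used for illustration.
    public class GreeterAgent extends Agent {
        protected void setup() {
            ACLMessage msg = new ACLMessage(ACLMessage.INFORM);
            msg.addReceiver(new AID("peer", AID.ISLOCALNAME));
            msg.setContent("hello");
            send(msg); // returns immediately; the peer need not be up yet
        }
    }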
The platform includes a naming service (ensuring each agent
has a unique name) and a yellow pages service that can be
distributed across multiple hosts. Another very important feature
is the availability of a rich suite of graphical tools supporting
both the debugging and the management/monitoring phases of the
application life cycle. By means of these tools, it is possible to
remotely control agents, even if they are already deployed and running:
agent conversations can be emulated, exchanged messages can be
sniffed, tasks can be monitored, and agent life cycles can be controlled.
Gilbert and Bankes also recognize that, similarly to what happened
with statistical computing, the real breakthrough in ABM
was the development of packages: collections of routines assembled
with a common standardized user interface (36). Differently
from agent-oriented programming languages, these packages or
platforms do not require programming skills and permit direct
manipulation or visual programming of the models. They provide
total immersion in an environment in which building blocks
can be assembled. The recently developed platforms for ABM also
offer facilities for other phases of a model's life cycle, such as model
evaluation and model maintenance.
A list of ABM software, including simulation platforms and
programming systems, highlighting their primary domain of appli-
cation, the programming language (if any), the compliance with
FIPA, and other capabilities (e.g., GIS, 3D) is kept constantly
updated on Wikipedia (46).
Even very basic and widely used software tools can support
ABM. For instance, ref. 47 shows how to realize a simulation
of a shopper agent model using a spreadsheet like Microsoft Excel.
In recent years many specialized agent-based platforms have been
developed. Their use, after an initial effort to learn the particular
concepts and characteristics of the platform, can dramatically
speed up not only the process of defining and running the simulations
but also the ways in which the results are handled and analyzed,
by connecting with widely used data analysis and
visualization software tools and/or data formats. Another important
aspect to take into consideration is the scalability of the software.
The number of agents that have to be simulated to obtain
significant results can be taken as a rough measure of the needed
computational power. While this number is on the order of 10^2
or 10^3, the simulation can usually be carried out on a desktop
machine. If the number is higher, more powerful computing
architectures for parallel and/or distributed computation are needed.
Some platforms give native support for these architectures
(48). Recently, GPUs with the nVIDIA CUDA programming
environment (49) have also started to be exploited by some platforms
(50). In the latter case, high-performance computing power
can be obtained at a lower cost than with classical parallel
architectures.
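To see why the agent count is a reasonable proxy for computational cost, consider the bare skeleton of any such simulation, sketched below in Java with hypothetical names (it is not taken from any of the cited platforms): each tick touches every agent once, so the work per tick grows linearly with the population and quickly outgrows a single desktop machine.

    import java.util.ArrayList;
    import java.util.List;

    // Bare-bones ABM driver, for illustration only.
    public class TinySimulation {
        interface Agent { void step(); } // perceive, decide, act

        public static void main(String[] args) {
            List<Agent> agents = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {    // ~10^3 agents: desktop scale
                agents.add(() -> { /* agent behavior goes here */ });
            }
            for (int tick = 0; tick < 100; tick++) {
                for (Agent a : agents) {
                    a.step();                   // cost per tick is O(number of agents)
                }
            }
        }
    }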
Nikolai and Madey (51) give a survey of tools, classifying them
using different characteristics: the programming languages on
which the tool is based, the type of license, the operating systems
for which the tool is available, the domain (including distributed
simulation) for which the platform is specialized (or whether it is
general purpose), and the type of support offered to the final user.
Other useful reviews, organized from different viewpoints, can be found
in refs. 52 and 53. We now briefly describe those tools that historically
have been most used and discussed. For a complete list we refer to the
surveys and Web pages given above.
Swarm (38), developed by the Swarm Development Group,
was one of the first general purpose ABM systems. It uses the
Objective-C programming language, but by using a middleware it
is also possible to use Java. A lot of documentation and tutorial
material is available, and the platform can be used on top of the
most widespread operating systems.
The Recursive Porous Agent Simulation Toolkit (Repast) (37)
was derived from Swarm, but it incorporates many facilities to
interact with other software for databases, data analysis, scientific
computing, and data visualization. It also supports distributed
computation, and its specification of agents is based on Java, Groovy
(54), and flowchart diagrams.
The Multi-Agent Simulator Of Neighborhoods. . . or
Networks. . . or something. . . (Mason) (55) is a fully Java-programmable
library for agent-based simulations that can be integrated into
larger applications. It can be easily connected with (possibly 2D or
3D) visualization components.
Graphical and accessible programs for creating ABMs are
AgentSheets (56), NetLogo (57), and SeSAm (58).
NetLogo was originally designed for simple educational purposes,
but it is now widely used for research as well. Many colleges
have used it as a tool to teach their students about ABM. A similar
program, StarLogo (59), has also been released with similar
functionality.
The SeSAm simulator, with its graphical modeling interface,
optimizing model compiler, and plug-in system, is used in research
projects and educational environments. Many plug-ins are available
(Evolution, Event-Based Simulation, Communication, GIS,
Graph, FIPA, Import/Export, etc.), and the availability of the
Java source code and plug-in infrastructure allows for further
customization.
AnyLogic (60) is a commercial ABM tool. In AnyLogic the user
can combine ABM with discrete-event (process-centric) modeling
and system dynamics. Visual languages such as statecharts, action
charts, process flowcharts, and stock-and-flow diagrams are used to
define the behavior of agents.
Concerning ABM platforms specifically addressed to the
biological domain, we mention SimBioSys (61), a C++-based
framework that allows programming evolutionary simulations.

3. Methods

Being ultimately a software product, the design and implementa-
tion of a MAS requires software engineering methodologies to
guide the whole production process. The use of a methodology
helps in correctly and effectively realizing the system of interest in
order to use it for the intended objectives. A methodology should
give guidelines for all the typical phases of the software life cycle:
requirement analysis, design, development, testing, and/or
validation, deployment, and maintenance. Modern software engineering
techniques are mainly focused on software methodologies
for the object-oriented paradigm. Standard formal and semi-formal
graphical specification languages exist for this paradigm (e.g., UML
(62)) and several standardized software engineering processes have
been defined (e.g., Rational Unified Process (63)).
Concerning agent-based systems, agent-oriented software
engineering (AOSE) (64) and agent-based modeling and simula-
tion (ABMS) (65) are recognized as very interesting emerging
paradigms that will have a major impact on the quality of science
and society over the next years. On the one hand, AOSE is
concerned with finding the best way to design and implement an
agent-based system, in order to create a MAS application that is
able to manage and/or resolve a complex problem. On the other
hand, ABMS has been defined as a third way of doing science, in
addition to traditional deductive and inductive reasoning (66). The
novelty is that the scientist can approach the complexity of natural
phenomena by using the notion of agent to represent simple
components of the real world and by programming them on the basis
of hypotheses about their (simple) behaviors. Then, putting it all
together and running the simulation, the global system can be
observed and, if the hypotheses were correct, a (known or
unknown) emergent behavior should appear that is more than
the sum of the simpler individual behaviors of the components.
In recent years there has been an increasing number of initiatives
to develop methodologies for the development of agent-based
software systems. We mention GAIA (67, 68), Tropos (69),
Prometheus (70), INGENIAS (71), PASSI (72), ADELFE (73),
and PASSIM (74). Each of these methodologies defines a different
meta-model, that is to say, a set of general concepts, and relations
among them, that are considered appropriate to model a MAS.
There are several reviews and comparisons of these different
methodologies; see, for instance, refs. 64 and 75. Unfortunately,
only a few of the methodologies are complete, in the sense that
they cover all the phases of the life cycle of an agent-based software
system. All methodologies start from the requirements engineering
phase, and most of them stop after the design phase, when a detailed
model of the system is available. Some of them continue towards
the phases of implementation, deployment, and testing, but none of
them goes beyond, to the maintenance phase. An attempt to unify
all the existing different methodologies under a general meta-model
has been made by FIPA (41), with the objective of defining
a standard AUML (Agent UML) language (76). However, as the
AUML Web site admits, the process is currently on hold because
the new versions of UML seem to already incorporate
notations that could be suitable for designing agent-based systems.
Recently, a unified graphical notation for AOSE (77) has been
proposed.
Let us briefly describe the meta-models of some of these methodologies.
In GAIA, the basic building blocks of a MAS are agents, roles,
activities, and protocols. An agent can have several roles, and for each
role it exhibits a different behavior. Roles are defined in terms of
permissions, responsibilities, and activities (i.e., the procedural abilities
of the agents), as well as of interactions with other roles. Services
are given by the agents when playing the associated roles, and protocols
are thought of as general stubs for permitting communication
between agents. Moreover, the meta-model contains social aspects
of the MAS that are useful to model open agent systems. Agents
constitute organizations, aggregated into an organizational structure,
and both agents and roles observe organizational rules.
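One possible, purely illustrative way to read this meta-model in code is sketched below; GAIA itself is a design methodology and prescribes no implementation language, so all type names here are hypothetical:

    import java.util.List;

    // Hypothetical Java rendering of GAIA building blocks.
    interface Activity { void perform(); }          // a procedural ability of an agent

    interface Role {
        List<String> permissions();                 // what the role may read or change
        List<Activity> activities();                // the role's procedural abilities
    }

    interface Protocol {                            // stub for inter-role communication
        void exchange(Role initiator, Role responder);
    }

    abstract class GaiaAgent {
        protected List<Role> roles;                 // an agent can have several roles

        void play(Role role) {                      // behavior differs per role played
            for (Activity a : role.activities()) {
                a.perform();
            }
        }
    }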
Tropos is an AOSE methodology that is based on three key
ideas: a notion of agent with cognitive skills (goal, plan, and belief)
used throughout all the software development phases; an approach
called requirements-driven (a sort of goal-oriented analysis) that is
likewise used in all the phases; and the construction of a conceptual
model in subsequent steps of refinement until the level of code is
reached. In Tropos the elements of the basic ontology are the
following: actors, goals, plans, resources, dependencies, capabilities,
and beliefs. The actor is a strategic entity that exhibits a will.
By analyzing these actors and their dependencies with respect to other
actors, using techniques such as means-end analysis and/or
decomposition, it is possible to specify the agents and their
capabilities in order to model the system and solve the problem.
ADELFE is mainly concerned with the development of Adaptive
Multi-Agent Systems. Much attention is given to cooperation.
Agents are defined with a limited (local) view of the environment,
and they have beliefs about the environment, about the other agents,
and about themselves, on which they rely to decide their behaviors.
Every agent can update its beliefs and can share them with other agents.
The ADELFE agent is essentially cooperative; it always tries to solve
a local goal and keep cooperative relationships with others. Its basic
cycle is of the perceive-decide-act type. When a Non-Cooperative
Situation is detected by an agent (for instance, it received a message
it could not understand), the agent tries to solve it and to stay cooperative
with the others (for instance, it sends the incomprehensible message
to other agents that it believes could understand it).
INGENIAS is the result of trying to integrate the best parts of
other methodologies into one, and it also comes with a set of tools
to develop MAS (i.e., the INGENIAS Development Kit (78)). The
meta-model is composed of the following concepts:
- Organization, an autonomous entity that has its own goal. It
can be structured in groups and contain workflows (procedural
information to organize tasks, resources, and the participants in
the procedure). The groups can be made of roles, agents,
resources, or applications.
- Agent, which is, again, an autonomous entity. It can play different
roles and pursue different goals. The agent has a mental state
made of mental entities (goals, facts, beliefs) that is managed by
a mental state manager (to create, modify, and delete mental states)
and by a mental state processor (to determine how mental
states evolve and to select the action that the agent should
execute at any moment).
- Interactions, which can be initiated by an agent and which may
involve more than two agents.
- Tasks and goals, which are related. Every task is assigned an input
and an output, and it is specified how they affect the environment
or an agent's mental state.
- Environment, which defines what the perception of every agent is
and also identifies the resources of the system and who is
responsible for managing them.
PASSI (Process for Agent Societies Specification and Implementation)
is an iterative and incremental process for developing
a MAS. It also supports implementation and testing with the
development toolkit PTK, the PASSI Toolkit (79). The PASSI
meta-model is organized in three different domains:
- Problem Domain. It encompasses scenarios, requirements,
ontology, and resources. A scenario is a sequence of interactions
that may happen among actors and the system. Requirements
are usually described using a UML use case diagram. The
ontology defines a set of concepts (categories of the domain),
actions (executable in the domain and able to affect the status
of concepts), and predicates (relating some domain elements).
The resources are those that can be accessed by the agents.
- Agency Domain. The Agent is defined in this domain. Every
agent is responsible for satisfying a set of requirements coming
from the Problem Domain, using its capabilities. An agent can
play different roles along its life. These roles are parts of an
agent's behavior that, depending on the specific role, can be
goal driven or can consist of services that the agent provides,
possibly accessing available resources. There is a service
component that represents the service provided by a role in
terms of functionalities, associated with pre- and post-conditions
and possibly other details, so that it can be called properly by other
agents that may require it for their goals. Agents can use tasks or
communication to realize the aims of a role. A task is a portion
of behavior that can be considered atomic. Communication
consists of agents exchanging one or more messages, each of
them expressed according to an Agent Interaction Protocol
that is used to give a certain level of predefined semantics to
the message content.
- Solution Domain. This is where the implemented system will be
deployed. It describes the code structure in the FIPA-compliant
implementation platform that has been chosen for the
implementation.
Several academic and industrial experiences have already shown
that the use of MAS offers advantages in many different areas, such
as manufacturing processes, e-Commerce, and network management
(80). Since MAS in such contexts need to be tested before
their deployment and execution in real operating environments,
methodologies that support system validation through simulation
(e.g., discrete-event simulation, agent-based simulation, etc.) are
highly desirable. Verification involves debugging the model to
ensure it works correctly; validation ensures that you have built
the right model. In fact, simulation of a MAS can not only demonstrate
that the MAS behaves correctly according to its specifications but
can also support the analysis of emergent properties of the MAS
under test (74).
PASSIM makes a step in this direction. PASSIM is an
agent-oriented software development methodology that uses simulation
at two different stages: at an early stage, to prototype
the MAS being developed, and at a late stage, to validate the
requirements of the developed MAS. It is based on parts coming
from PASSI and from the distilled state charts (DSC)-based simulation
methodology (81). The life cycle proposed by PASSIM is
iterative and incremental and is based on the following steps:
requirement specification, design, simulation, coding, and deployment.
After the simulation step the designers can either proceed to
the real implementation or use the results of the simulation as
feedback on the design and requirement specification phases.
All the phases but simulation are supported by PTK, the PASSI
Toolkit (79). For the simulation phase the DSC Visual Toolset (82)
can be used.

4. Examples

Here we report on the lessons we have learnt in programming
ABMs of cellular systems. We find this experience very instructive and,
at the same time, we realize that it parallels very well the
evolution of this research area.
We proposed a conceptual framework for engineering an agent
society to simulate the behavior of a biological system (83). The
framework is intended to support life scientists in building models
and verifying experimental hypotheses by simulation. We believe
that the use of an agent-based computational platform and of agent
coordination infrastructures, along with the adoption of formal
methods, will make it possible to harness the complexity of the
biological domain by delegating software agents to simulate bio-
entities. Relevant issues are those of information management, as
well as the phases of model construction, analysis, and validation
(84). The proposed conceptual framework takes into account the
four steps suggested by Kitano (6): (1) system structure identifica-
tion, (2) system behavior analysis, (3) system control, (4) system
design. For each step, our framework exploits agent-oriented meta-
phors, models, and infrastructures to provide systems biologists
with the suitable methodologies and technologies.
As a recurrent benchmark in our experiments with ABM we
have chosen the very well-known and studied process of carbohy-
drate oxidation (CO). Our first focus was mainly on the model
engineering issues. In refs. 85 and 86 we adopted the PASSI AOSE
methodology (72). A simulator was designed with the help of the
PASSI Toolkit (79) and implemented on the Hermes agent mid-
dleware platform (87). The PASSI Toolkit provides the system's
specification by UML diagrams, very helpful for the implementation
(e.g., for the automatic generation of stubs of code) and for
documentation purposes, at different design levels. PASSI,
proceeding in a top-down way, naturally leads to the identification
of the structure and the behavior of the cell and its components.
Following a macrodescription of CO, we first identified the main
functions and then each cell compartment (or subcompartment)
involved in the process. Each identified compartment was modeled
as an active autonomous entity, and each function as one of its specific
roles. The resulting multi-agent model consisted of three agents:
Cytoplasm, Inner Mitochondrial Membrane, and Mitochondrial
Matrix, as depicted in the agent identification diagram shown in
Fig. 1. Two other auxiliary agents were introduced to support the
interface with the user (Interface Service Agent) and to simulate the
execution environment in which all the reactions take place (Envi-
ronment Service Agent). The environment represents a centralized
coordinator, keeping track of the available quantities of molecules.
The diagram also shows the roles played by each agent in the
identified functions.
Software engineering distinguishes between verification ("did
we build the system right?") and validation ("did we build the
right system?") (88). Merelli and Young considered the problem of
model validation for simulation models whose structure as well as
behavior mimics the modeled biological systems (89). Intentional
insertion of faults is a well-known software testing technique
(mutation analysis). The authors proposed to introduce some
modifications into the model of CO in order to mimic some known or
plausible mutations in the subject system. When the modified
model is executed, a behavior is expected that corresponds to
that of the natural system with the same mutation. If the modification
does not correspond to a known natural mutation, we expect a
biologically plausible change in the behavior.

Fig. 1. The PASSI agent identification UML diagram for our CO example. The diagram shows the Interface Service Agent (a user assistant agent embodying the interface between the user and the cell simulation system), the Environment Service Agent (simulating the execution environment in which any cell reaction takes place), and the three cellular agents: Agent Citosol (simulating the role of the Cytoplasm in the Glycolysis, Lactic Fermentation, and Alcoholic Fermentation functions), Agent Inner Mitochondrial Membrane (simulating the role of the IMM in the Transport and Electron Respiration Chain functions), and Agent Mitochondrial Matrix (simulating the role of the MM in the Partial Oxidation of Pyruvate and Krebs Cycle functions).

As an example, the authors considered a mutation of the gene that produces one of the
enzymes involved in the partial oxidation of pyruvate, thus allowing
its passage from the cytoplasm to the mitochondrion. The previ-
ously developed model was not designed with such a modification in
mind. It did not permit zooming into the cellular compartments
to see the enzymes at work. Therefore, it was not possible,
for instance, to disable a specific enzyme. On the same model the
mutation analysis technique exposed other weaknesses in the structural
and behavioral correspondence with the modeled system. This
analysis led to a re-implementation of the model to permit a finer
grain of detail, zooming into the cytoplasm. The new
implementation was reorganized to address the structural correspondence,
making it easy to model a class of mutations that delete
or suppress the activity of particular enzymes. We experimentally
applied a change corresponding to a pyruvate kinase deficiency and
observed resulting changes in the processing of glucose and glycogen,
which could be assessed for plausibility by biologists.
The previously described ABM of CO (87) was built with
a top-down methodology, analyzing the cell at a high level of
abstraction and considering CO as a function at this level. In this
vision we correctly identified the cellular compartments responsible
for it and therefore modeled the cytoplasm and the mitochondria as
interacting agents. Such a model does not permit zooming to a
deeper level of detail, where one could observe how the
cellular behavior emerges from the behavior of the molecular species
hidden at the higher level. The agent society notion can be used
here for defining an ensemble of cellular agents and the coordination
artifacts involved in the cellular task characterizing the cellular
agent society. The notion of agent society can also be suitably adopted
for scaling with complexity, identifying different levels of
description of the same system. What can be described at one
level as an individual agent can be described as a society of agents
(zooming in) at a more detailed level, as an ensemble of agents
plus their mediating artifacts, and vice versa (zooming out).
Consequently, the model of CO was refined (90), zooming into the
cytoplasm and taking into account the pool of enzymes. In the
implementation, a basic class enzyme was defined to represent a
reactive entity. Then the enzyme was specialized for each of the
enzymes involved in the CO and acting in that compartment. In
particular, for each enzyme subclass the set of affinities to some
metabolites was introduced. In this way, glycolysis can be seen as
a function performed by a society of interacting cellular agents
(the enzymes), whereas the metabolites are considered products of the
environment.
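A minimal sketch of this design is given below; the class names and affinity values are hypothetical simplifications for illustration, not the actual code of the refined model (90):

    import java.util.HashMap;
    import java.util.Map;

    // Base class for a reactive entity; each concrete enzyme subclass
    // declares its affinities to metabolites (illustrative values only).
    abstract class EnzymeAgent {
        protected final Map<String, Double> affinities = new HashMap<>();

        boolean canBind(String metabolite) {
            return affinities.containsKey(metabolite);
        }

        double affinityFor(String metabolite) {
            return affinities.getOrDefault(metabolite, 0.0);
        }
    }

    // Hexokinase catalyzes the first step of glycolysis and binds
    // glucose and ATP; the numeric affinities are made up for the sketch.
    class HexokinaseAgent extends EnzymeAgent {
        HexokinaseAgent() {
            affinities.put("glucose", 0.9);
            affinities.put("ATP", 0.8);
        }
    }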
The next step was to provide physical characteristics (shape,
weight, size, position) to agents (both enzymes and metabolites),
to place them in the space and to allow them to autonomously
move and perceive the spatial neighborhoods, reacting accordingly.
We modeled the cellular spatial environment as a finite continuous
space filled by a hierarchy of entities (bioagents) of different
volumes. Enzymes and complexes are much bigger than metabo-
lites and therefore they move much more slowly, but they have
adaptive and reactive capabilities to recognize, in the dynamic
environment of the cytoplasm, their metabolic counterparts in
biochemical reactions. According to kinetic laws and metabolic
maps, they will be able to transform substrate bioagents into the
corresponding products of the executable reaction. To get an idea
of the numbers of agents involved, we estimated that in a small
portion (10^-15 L) of the cytoplasm we need to simulate the
movement and the interactions of several million autonomous
entities. Reasoning at a suitable level of detail (at the mesoscale,
as suggested by Wishart et al. (9)), we modeled every biomolecule
involved in a metabolic process as an agent and every biological
compartment as a coordination environment where molecular
bioagents are situated, move, and react. We developed a prototype
MAS to test the practicability of our approach, focusing on glycolysis
(91). The cytoplasm is characterized by a three-dimensional occupancy
and a number of physical properties like temperature, pH, fluid
viscosity, and the concentration (which we suppose uniform) of ions.
The cytoplasm keeps track of the position of all the molecular
bioagents wandering inside it. In our model, the cytoplasm is
intended as a coordination environment, and all the enzymes,
metabolites, and their intermediate complexes are represented as moving
agents. All the moving agents are characterized by a shape, which at
the mesoscale can be considered spherical, and by a molecular
weight, which can be derived from biochemical databases (e.g.,
KEGG LIGAND (92)). The weight of the intermediate complexes
is intuitively calculated by adding the weights of the molecular
components forming the complex. The radius of the spheres can be easily
calculated from their molecular weight. All the molecular bioagents
(Fig. 2) have a three-coordinate position in the cytoplasmic space
and move according to Brownian motion, which is the predominant
law at this scale. From the Stokes and Smoluchowski laws and the
Einstein equation on Brownian motion, we can assume that the
diffusion of spherical particles in a viscous fluid depends on
their radius and on the temperature and viscosity of the fluid. The
latter also takes into account the local concentration of molecules
around the moving agent. We divided the molecular bioagents into
active and passive ones. The former (i.e., enzymes and complexes of
enzymes and metabolites), besides the capability to move and perceive
their environment, can also perform biochemical reactions.
The latter (i.e., metabolites) have only the capability to move and to
be manipulated, as in reality, by active bioagents.
The metabolite substrates (i.e., inputs) of an enzymatic reaction undergo
specific chemical transformations, which turn them into
different chemical entities, the products (i.e., outputs) of the reaction.
The enzyme instead remains unaltered by the completed reaction
and can therefore try to capture new instances of the
substrates into its active site, thus starting a new reaction. The net
result of a completed reaction will be the disappearance of the
instances of substrate bioagents previously captured by the enzyme
and the appearance of new instances of the product bioagents, exactly
as described by the chemical formula.

Fig. 2. A snapshot of our metabolic reaction simulation, showing enzyme, metabolite, and complex agents moving and acting in a portion of cytoplasm.
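The diffusion law just described can be made concrete with a short sketch (a hypothetical class, a simplification of the prototype in ref. 91, not its actual code): the Stokes-Einstein relation D = kB*T / (6*pi*eta*r) gives the diffusion coefficient of a spherical bioagent from its radius r, the temperature T, and the fluid viscosity eta, and one Brownian step then displaces each coordinate by a Gaussian of variance 2*D*dt.

    import java.util.Random;

    // Spherical bioagent moving by Brownian motion; illustrative sketch.
    public class SphericalBioagent {
        static final double KB = 1.380649e-23; // Boltzmann constant (J/K)
        static final Random RNG = new Random();

        double x, y, z;  // position in the cytoplasmic space (m)
        double radius;   // sphere radius derived from the molecular weight (m)

        // Stokes-Einstein diffusion coefficient (m^2/s) at temperature T (K)
        // in a fluid of viscosity eta (Pa*s).
        double diffusion(double temperature, double viscosity) {
            return KB * temperature / (6.0 * Math.PI * viscosity * radius);
        }

        // One Brownian step over dt seconds: Gaussian displacement per axis
        // with variance 2*D*dt.
        void brownianStep(double temperature, double viscosity, double dt) {
            double sigma = Math.sqrt(2.0 * diffusion(temperature, viscosity) * dt);
            x += RNG.nextGaussian() * sigma;
            y += RNG.nextGaussian() * sigma;
            z += RNG.nextGaussian() * sigma;
        }
    }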

5. Notes

We have given a broad view of the existing methodologies and tools
for realizing an agent-based system. It might be difficult for a
researcher, possibly not coming from computer science or a
related field, to start facing the task of realizing a MAS for his/her
purposes. For this reason we suggest what is, in our opinion, a useful
starting place. Inside the wiki pages of Swarm, there is a page
presenting software templates (93). On this page, templates for the
so-called StupidModel are available for use in Swarm,
MASON, Repast, and NetLogo. They tackle 16 typical situations of
agent-based programming, each of which is equipped with sample
code realizing the relative function in the four tools. Situations
range from creating a grid world in which agents can move, to
making agents be born or die, to obtaining statistical information
about the MAS evolution as charts or output files. The page also
contains other material more specific to Swarm and to other tools.
We think that these templates are a good way to start becoming familiar
with the tools and the concepts of ABMS. The code can be copied,
run as it is on the platforms, and then modified and extended
according to one's particular needs.

References

1. Milner R (1989) Communication and concurrency. Prentice-Hall, New York, London
2. Wooldridge M (1997) Agent-based software engineering. IEE proceedings on software engineering: 26–37
3. Wooldridge M (2002) An introduction to MultiAgent systems. Wiley, West Sussex, UK
4. Vallurupalli V, Purdy C (2007) Agent-based modeling and simulation of biomolecular reactions. Scalable Computing: Practice and Experience 8(2):185–196
5. Bonabeau E (2002) Agent-based modeling: methods and techniques for simulating human systems. Proc Natl Acad Sci USA 99(Suppl 3):7280–7287
6. Kitano H (2001) Foundations of systems biology. MIT Press, Cambridge, MA
7. Gonzalez PP, Cardenas M, Camacho D et al (2003) Cellulat: an agent-based intracellular signalling model. Biosystems 68(2–3):171–185
8. Wolfram S (1984) Cellular automata as models of complexity. Nature 311:419–424
9. Wishart DS, Yang R, Arndt D et al (2005) Dynamic cellular automata: an alternative approach to cellular simulation. In Silico Biol 5(2):139–161
10. Troisi A, Wong V, Ratner MA (2005) An agent-based approach for modeling molecular self-organization. Proc Natl Acad Sci U S A 102(2):255–260
11. Emonet T, Macal CM, North MJ et al (2005) AgentCell: a digital single-cell assay for bacterial chemotaxis. Bioinformatics 21(11):2714–2721
12. Korobkova E, Emonet T, Vilar JM et al (2004) From molecular noise to behavioural variability in a single bacterium. Nature 428(6982):574–578
13. Lemerle C, Di Ventura B, Serrano L (2005) Space as the final frontier in stochastic simulations of biological systems. FEBS Lett 579(8):1789–1794
14. Lloyd CM, Halstead MD, Nielsen PF (2004) CellML: its future, present and past. Prog Biophys Mol Biol 85(2–3):433–450
15. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524–531
16. Takahashi K, Arjunan SN, Tomita M (2005) Space in systems biology of signaling pathways: towards intracellular molecular crowding in silico. FEBS Lett 579(8):1783–1788
17. Tang J, Ley KF, Hunt CA (2007) Dynamics of in silico leukocyte rolling, activation, and adhesion. BMC Syst Biol 1:14
18. The breve simulation environment. http://www.spiderland.org/. Accessed 19 Nov 2010
19. Guo Z, Sloot PM, Tay JC (2008) A hybrid agent-based approach for modeling microbiological systems. J Theor Biol 255(2):163–175
20. Guo Z, Tay JC (2008) Multi-timescale event-scheduling in multi-agent immune simulation models. Biosystems 91(1):126–145
21. Walker DC, Southgate J, Hill G et al (2004) The epitheliome: agent-based modelling of the social behaviour of cells. Biosystems 76(1–3):89–100
22. MATLAB: The language of technical computing. http://www.mathworks.com/products/matlab/. Accessed 19 Nov 2010
23. Pogson M, Smallwood R, Qwarnstrom E et al (2006) Formal agent-based modelling of intracellular chemical interactions. Biosystems 85(1):37–45
24. Walker DC, Southgate J (2009) The virtual cell: a candidate co-ordinator for middle-out modelling of biological systems. Brief Bioinform 10(4):450–461
25. Adra S, Sun T, MacNeil S et al (2010) Development of a three dimensional multiscale computational model of the human epidermis. PLoS One 5(1):e8511
26. Hoops S, Sahle S, Gauges R et al (2006) COPASI: a COmplex PAthway SImulator. Bioinformatics 22(24):3067–3074
27. Galvao V, Miranda JG (2010) A three-dimensional multi-agent-based model for the evolution of Chagas disease. Biosystems 100(3):225–230
28. Pennisi M, Pappalardo F, Palladini A (2010) Modeling the competition between lung metastases and the immune system using agents. BMC Bioinformatics 11(Suppl 7):S13
29. Bauer AL, Beauchemin CA, Perelson AS (2009) Agent-based modeling of host-pathogen systems: the successes and challenges. Inf Sci (Ny) 179(10):1379–1389
30. Christley S, Lee B, Dai X et al (2010) Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms. BMC Syst Biol 4:107
31. Dematte L, Prandi D (2010) GPU computing for systems biology. Brief Bioinform 11(3):323–333
32. Halling-Brown M, Pappalardo F, Rapin N et al (2010) ImmunoGrid: towards agent-based simulations of the human immune system at a natural scale. Philos Transact A Math Phys Eng Sci 368(1920):2799–2815
33. Viceconti M, Clapworthy G, Van Sint JS (2008) The virtual physiological human: a European initiative for in silico human modeling. J Physiol Sci 58(7):441–446
34. Shah I, Wambaugh J (2010) Virtual tissues in toxicology. J Toxicol Environ Health B Crit Rev 13(2–4):314–328
35. Devillers J, Devillers H, Decourtye A (2010) Internet resources for agent-based modelling. SAR QSAR Environ Res 21(3–4):337–350
36. Gilbert N, Bankes S (2002) Platforms and methods for agent-based modeling. Proc Natl Acad Sci USA 99(Suppl 3):7197–7198
37. REPAST: Recursive porous agent simulation toolkit. http://repast.sourceforge.net/. Accessed 19 Nov 2010
38. A resource for agent- and individual-based modelers and the home of Swarm. http://www.swarm.org/index.php/Main_Page. Accessed 19 Nov 2010
39. JADE: Java Agent DEvelopment Framework. http://jade.tilab.com/. Accessed 19 Nov 2010
40. JADE: A White Paper. http://jade.tilab.com/papers/2003/WhitePaperJADEEXP.pdf. Accessed 19 Nov 2010
41. IEEE Foundation for Intelligent Physical Agents. http://www.fipa.org/. Accessed 19 Nov 2010
42. IEEE: the world's largest professional association for the advancement of technology. http://www.ieee.org/index.html. Accessed 19 Nov 2010
43. Java-based Intelligent Agent Componentware. http://jiac.de/. Accessed 19 Nov 2010
44. SPADE: Smart Python multi-Agent Development Environment. http://code.google.com/p/spade2/. Accessed 19 Nov 2010
45. JACK autonomous software. http://aosgrp.com/products/jack/index.html. Accessed 19 Nov 2010
46. Comparison of agent-based modeling software. http://en.wikipedia.org/wiki/Comparison_of_agent-based_modeling_software. Accessed 19 Nov 2010
47. Macal CM, North MJ (2008) Agent-based modeling and simulation: desktop ABMS. In: Proc of winter simulation conference 2007, pp 95–106
48. Sanchez D, Isern D, Rodriguez-Rozas A et al (2010) Agent-based platform to support the execution of parallel tasks. Expert Syst Appl. doi:10.1016/j.eswa.2010.11.073
49. CUDA Zone: Official Website. http://www.nvidia.com/object/cuda_home_new.html. Accessed 19 Nov 2010
50. Allan R (2009) Survey of agent based modelling and simulation tools. Technical report, Computational Science and Engineering Department, STFC Daresbury Laboratory, Daresbury, Warrington. http://epubs.cclrc.ac.uk/work-details?w=50398. Accessed 19 Nov 2010
51. Nikolai C, Madey G (2009) Tools of the trade: a survey of various agent based modeling platforms. J Artif Soc Soc Simulat 12(2):2
52. Railsback SF, Lytinen SL, Jackson SK (2006) Agent-based simulation platforms: review and development recommendations. Simulation 82:609–623
53. Tobias R, Hofmann C (2004) Evaluation of free Java-libraries for social-scientific agent based simulation. J Artif Soc Soc Simulat 7(1):6
54. Groovy: An agile dynamic language for the Java platform. http://groovy.codehaus.org/. Accessed 19 Nov 2010
55. Mason multiagent simulation toolkit. http://cs.gmu.edu/~eclab/projects/mason/. Accessed 19 Nov 2010
56. AgentSheets. http://www.agentsheets.com/. Accessed 19 Nov 2010
57. NetLogo Home Page. http://ccl.northwestern.edu/netlogo/. Accessed 19 Nov 2010
58. SeSAm: Integrated environment for multi-agent simulation. http://www.simsesam.de/. Accessed 19 Nov 2010
59. StarLogo on the Web. http://education.mit.edu/starlogo/. Accessed 19 Nov 2010
60. Why AnyLogic simulation software? http://www.xjtek.com/anylogic/why_anylogic/. Accessed 19 Nov 2010
61. SimBioSys class framework. http://www.lucifer.com/~david/SimBioSys/. Accessed 19 Nov 2010
62. Unified modeling language, resource page. http://www.uml.org/. Accessed 19 Nov 2010
63. Rational unified process: best practices for software development teams whitepaper. http://www.augustana.ab.ca/~mohrj/courses/2000.winter/csc220/papers/rup_best_practices/rup_bestpractices.html. Accessed 19 Nov 2010
64. Bernon C, Cossentino M, Pavon J (2005) Agent-oriented software engineering. Knowl Eng Rev 20(2):99–116
65. Macal CM, North MJ (2008) Agent-based modeling and simulation: desktop ABMS. In: Proc of winter simulation conference 2007, pp 95–106
66. Axelrod R (1997) Advancing the art of simulation in social sciences. In: Conte R, Hegselmann R, Terna P (eds) Simulating social phenomena. Springer-Verlag, Berlin
67. Wooldridge M, Jennings NR, Kinny D (2000) The Gaia methodology for agent-oriented analysis and design. J Auton Agent Multi-Agent Syst 3(3):285–312
68. Zambonelli F, Jennings N, Kinny D (2003) Developing multiagent systems: the Gaia methodology. ACM Trans Softw Eng Methodol 12(3):417–470
69. Bresciani P, Giorgini P, Giunchiglia F et al (2004) Tropos: an agent-oriented software development methodology. J Auton Agent Multi-Agent Syst 8:203–236
70. Padgham L, Winikoff M (2002) Prometheus: a methodology for developing intelligent agents. In: Proc of 3rd international conference on agent-oriented software engineering III:174–185
71. Pavon J, Gomez-Sanz J, Fuentes R (2005) The INGENIAS methodology and tools. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
72. Cossentino M (2005) From requirements to code with the PASSI methodology. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
73. Bernon C, Camps V, Gleizes M-P et al (2005) Engineering adaptive multi-agent systems: the ADELFE methodology. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
74. Cossentino M, Fortino G, Garro A et al (2008) PASSIM: a simulation-based process for the development of multi-agent systems. Int J Agent-Oriented Softw Eng 2(2):132–170
75. Henderson-Sellers B, Giorgini P (2005) Agent-oriented methodologies. Idea Group, London
76. The FIPA agent UML Web site. http://www.auml.org/. Accessed 19 Nov 2010
77. Padgham L, Winikoff M, Deloach S, Cossentino M (2009) A unified graphical notation for AOSE. In: Proc agent-oriented software engineering IX. doi:10.1007/978-3-642-01338-6_9
78. Garcia-Magarino I, Gutierrez C, Fuentes-Fernandez R (2009) The INGENIAS development kit: a practical application for crisis-management. In: Bio-inspired systems: computational and ambient intelligence. Lect Notes Comput Sci 5517:537–544
79. PASSI toolkit. http://sourceforge.net/projects/ptk/. Accessed 19 Nov 2010
80. Cossentino M, Fortino G, Gleizes MP et al (2010) Simulation-based design and evaluation of multi-agent systems. Simul Modell Pract Theory 18(10):1425–1427
81. Fortino G, Garro A, Russo W (2005) An integrated approach for the development and validation of multi-agent systems. Comput Syst Sci Eng 20(4):94–107
82. Fortino G, Garro A, Mascillaro S et al (2007) ELDATool: a statechart-based tool for prototyping multi-agent systems. In: Proc of workshop on objects and agents (WOA 07)
83. Cannata N, Corradini F, Merelli E et al (2005) An agent-oriented conceptual framework for systems biology. Trans Comput Syst Biol, Lect Notes Comput Sci (3737):105–122
84. Finkelstein A, Hetherington J, Li L et al (2004) Computational challenges of systems biology. Computer 37(5):26–33
85. Corradini F, Merelli E, Vita M (2005) A multi-agent system for modelling carbohydrate oxidation in cell. Lect Notes Comput Sci 3481:1264–1273
86. Cannata N, Corradini F, Merelli E (2008) Multiagent modelling and simulation of carbohydrate oxidation in cell. Int J Modell Ident Control 3(1):17–28
87. Corradini F, Merelli E (2005) Hermes: agent-based middleware for mobile computing. Lect Notes Comput Sci 3465:234–270
88. Boehm B (1981) Software engineering economics. Prentice Hall, New York, London
89. Merelli E, Young M (2007) Validating MAS with mutation. Int J Multiagent Grid Syst 3(2):225–243
90. Berluti E, Corradini F, Leli S et al (2006) Glycolysis and fermentation in cellular energy production. Internal report unicam-cs01-2006, University of Camerino, Department of Computer Science
91. Cannata N, Corradini F, Merelli E et al (2008) A spatial simulator for metabolic pathways. In: International workshop on multi agents systems and bioinformatics (MAS & BIO 2008). http://www.cs.unicam.it/tesei/updown/CCMT08b.pdf. Accessed 19 Nov 2010
92. KEGG LIGAND Database. http://www.genome.jp/ligand/. Accessed 19 Nov 2010
93. Software Templates: SwarmWiki. http://www.swarm.org/index.php/Software_templates. Accessed 19 Nov 2010
Part VI

Mathematical and Statistical Background

Chapter 19

Linear Algebra

Kenneth Kuttler

Abstract

This chapter is a short review of linear algebra leading to a discussion of the singular value decomposition.

Key words: Linear algebra, Inner product space, Singular value decomposition
1. Introduction

This chapter is a review of beginning linear algebra leading to the
techniques of least squares, the singular value decomposition, and
other related topics having to do with the interaction between
linear algebra and inner product spaces. There are several good
references which contain more details on the topics in this chapter;
these are listed in the reference section at the end. The online
book found at http://www.math.byu.edu/~klkuttle/Linearalgebra.pdf
also has all the generalizations and proofs of what is in this
chapter.

2. The Inner Product and Dot Product

The inner product is defined for vectors in $\mathbb{C}^n$ or $\mathbb{R}^n$. To avoid
making a distinction I will use the symbol $\mathbb{F}$. Vectors are denoted by
boldface letters. Thus $\mathbf{a}$ will denote the vector
$$\mathbf{a} = (a_1, \ldots, a_n).$$
Scalars are denoted by letters which are not in boldface.

Definition 2.1: Let $\mathbf{a}, \mathbf{b} \in \mathbb{F}^n$. Define $(\mathbf{a}, \mathbf{b})$ as
$$(\mathbf{a}, \mathbf{b}) \equiv \sum_{k=1}^{n} a_k \overline{b_k}.$$
This is called the inner product.


With this definition, there are several important properties
satisfied by the inner product. In the statement of these properties,
$\alpha$ and $\beta$ will denote scalars and $\mathbf{a}, \mathbf{b}, \mathbf{c}$ will denote vectors or, in other
words, points in $\mathbb{F}^n$. The following proposition comes directly from
the definition of the inner product.

Proposition 2.2: The inner product satisfies the following properties:
$$(\mathbf{a}, \mathbf{b}) = \overline{(\mathbf{b}, \mathbf{a})}; \quad (1)$$
$$(\mathbf{a}, \mathbf{a}) \geq 0 \text{ and equals zero if and only if } \mathbf{a} = \mathbf{0}; \quad (2)$$
$$(\alpha\mathbf{a} + \beta\mathbf{b}, \mathbf{c}) = \alpha(\mathbf{a}, \mathbf{c}) + \beta(\mathbf{b}, \mathbf{c}); \quad (3)$$
$$(\mathbf{c}, \alpha\mathbf{a} + \beta\mathbf{b}) = \overline{\alpha}(\mathbf{c}, \mathbf{a}) + \overline{\beta}(\mathbf{c}, \mathbf{b}); \quad (4)$$
$$|\mathbf{a}|^2 = (\mathbf{a}, \mathbf{a}). \quad (5)$$

Example 2.3: Find $((1, 2, 0, 1), (0, i, 2, 3))$.
This equals $0 - 2i + 0 + 3 = 3 - 2i$.
For any inner product there is always the Cauchy-Schwarz
inequality.

Theorem 2.4: The inner product satisfies the inequality
$$|(\mathbf{a}, \mathbf{b})| \leq |\mathbf{a}||\mathbf{b}|. \quad (6)$$
Furthermore, equality is obtained if and only if one of $\mathbf{a}$ or $\mathbf{b}$ is a
scalar multiple of the other.

Proof: First define $\theta \in \mathbb{C}$ such that
$$\overline{\theta}(\mathbf{a}, \mathbf{b}) = |(\mathbf{a}, \mathbf{b})|, \quad |\theta| = 1,$$
and define a function of $t \in \mathbb{R}$,
$$f(t) = (\mathbf{a} + t\theta\mathbf{b},\ \mathbf{a} + t\theta\mathbf{b}).$$
Then by Eq. 2, $f(t) \geq 0$ for all $t \in \mathbb{R}$. Also from Eqs. 1 and 3-5,
$$\begin{aligned} f(t) &= (\mathbf{a},\ \mathbf{a} + t\theta\mathbf{b}) + t\theta(\mathbf{b},\ \mathbf{a} + t\theta\mathbf{b}) \\ &= (\mathbf{a}, \mathbf{a}) + t\overline{\theta}(\mathbf{a}, \mathbf{b}) + t\theta(\mathbf{b}, \mathbf{a}) + t^2|\theta|^2(\mathbf{b}, \mathbf{b}) \\ &= |\mathbf{a}|^2 + 2t\,\mathrm{Re}\,\overline{\theta}(\mathbf{a}, \mathbf{b}) + |\mathbf{b}|^2 t^2 \\ &= |\mathbf{a}|^2 + 2t|(\mathbf{a}, \mathbf{b})| + |\mathbf{b}|^2 t^2. \end{aligned}$$
Now if $|\mathbf{b}|^2 = 0$, it must be the case that $(\mathbf{a}, \mathbf{b}) = 0$ because otherwise
you could pick large negative values of $t$ and violate $f(t) \geq 0$.
Therefore, in this case, the Cauchy-Schwarz inequality holds.
In the case that $|\mathbf{b}| \neq 0$, $y = f(t)$ is a polynomial which opens up and
therefore, if it is always nonnegative, the quadratic formula requires that
the discriminant satisfy
$$4|(\mathbf{a}, \mathbf{b})|^2 - 4|\mathbf{a}|^2|\mathbf{b}|^2 \leq 0,$$
since otherwise the function $f(t)$ would have two real zeros and
would necessarily have a graph which dips below the $t$ axis. This
proves Eq. 6.
It is clear from the axioms of the inner product that equality holds in
Eq. 6 whenever one of the vectors is a scalar multiple of the other. It only
remains to verify that this is the only way equality can occur. If either vector
equals zero, then equality is obtained in Eq. 6, so it can be assumed both
vectors are nonzero. Then if equality is achieved, it follows that $f(t)$ has exactly
one real zero because the discriminant vanishes. Therefore, for some value
of $t$, $\mathbf{a} + t\theta\mathbf{b} = \mathbf{0}$, showing that $\mathbf{a}$ is a multiple of $\mathbf{b}$.
You should note that the entire argument in the above theorem
was based only on the properties of the inner product listed in
Eqs. 1-5. This means that whenever something satisfies these properties,
the Cauchy-Schwarz inequality holds. There are many other
instances of these properties besides vectors in $\mathbb{F}^n$.
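For instance (a small worked application added here for illustration), taking $\mathbf{b} = (1, \ldots, 1)$ in Eq. 6 bounds a sum of scalars by the Euclidean length of $\mathbf{a}$:
$$\left|\sum_{k=1}^{n} a_k\right| = |(\mathbf{a}, \mathbf{b})| \leq |\mathbf{a}||\mathbf{b}| = \sqrt{n}\left(\sum_{k=1}^{n} |a_k|^2\right)^{1/2}.$$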
The Cauchy-Schwarz inequality allows a proof of the triangle
inequality for distances in $\mathbb{F}^n$ in much the same way as the triangle
inequality for the absolute value.

Theorem 2.5: (Triangle inequality) For $\mathbf{a}, \mathbf{b} \in \mathbb{F}^n$,
$$|\mathbf{a} + \mathbf{b}| \leq |\mathbf{a}| + |\mathbf{b}|, \quad (7)$$
and equality holds if and only if one of the vectors is a nonnegative
scalar multiple of the other. Also
$$||\mathbf{a}| - |\mathbf{b}|| \leq |\mathbf{a} - \mathbf{b}|. \quad (8)$$

Proof: By properties of the inner product and the Cauchy-Schwarz
inequality,
$$\begin{aligned} |\mathbf{a} + \mathbf{b}|^2 &= (\mathbf{a} + \mathbf{b},\ \mathbf{a} + \mathbf{b}) \\ &= (\mathbf{a}, \mathbf{a}) + (\mathbf{a}, \mathbf{b}) + (\mathbf{b}, \mathbf{a}) + (\mathbf{b}, \mathbf{b}) \\ &= |\mathbf{a}|^2 + 2\,\mathrm{Re}(\mathbf{a}, \mathbf{b}) + |\mathbf{b}|^2 \\ &\leq |\mathbf{a}|^2 + 2|(\mathbf{a}, \mathbf{b})| + |\mathbf{b}|^2 \\ &\leq |\mathbf{a}|^2 + 2|\mathbf{a}||\mathbf{b}| + |\mathbf{b}|^2 \\ &= (|\mathbf{a}| + |\mathbf{b}|)^2. \end{aligned}$$
Taking square roots of both sides, you obtain Eq. 7.
It remains to consider when equality occurs. If either vector equals
zero, then that vector equals zero times the other vector and the claim
about when equality occurs is verified. Therefore, it can be assumed both
vectors are nonzero. To get equality in the second inequality above,
Theorem 2.4 implies one of the vectors must be a multiple of the other.
Say $\mathbf{b} = \alpha\mathbf{a}$. Also, to get equality in the first inequality, $(\mathbf{a}, \mathbf{b})$ must be a
nonnegative real number. Thus
$$0 \leq (\mathbf{a}, \mathbf{b}) = (\mathbf{a}, \alpha\mathbf{a}) = \overline{\alpha}|\mathbf{a}|^2.$$
Therefore, $\alpha$ must be a real number which is nonnegative.
To get the other form of the triangle inequality, write
$$\mathbf{a} = \mathbf{a} - \mathbf{b} + \mathbf{b},$$
so
$$|\mathbf{a}| = |\mathbf{a} - \mathbf{b} + \mathbf{b}| \leq |\mathbf{a} - \mathbf{b}| + |\mathbf{b}|.$$
Therefore,
$$|\mathbf{a}| - |\mathbf{b}| \leq |\mathbf{a} - \mathbf{b}|. \quad (9)$$
Similarly,
$$|\mathbf{b}| - |\mathbf{a}| \leq |\mathbf{b} - \mathbf{a}| = |\mathbf{a} - \mathbf{b}|. \quad (10)$$
It follows from Eqs. 9 and 10 that Eq. 8 holds. This is because
$||\mathbf{a}| - |\mathbf{b}||$ equals the left side of either Eq. 9 or Eq. 10 and,
either way, $||\mathbf{a}| - |\mathbf{b}|| \leq |\mathbf{a} - \mathbf{b}|$.

3. The Dot Product

If you forget about the conjugate, the resulting product will be
referred to as the dot product. Thus $\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$. This dot
product satisfies the following properties:
$$\mathbf{a} \cdot \mathbf{b} = \mathbf{b} \cdot \mathbf{a}; \quad (11)$$
$$(\alpha\mathbf{a} + \beta\mathbf{b}) \cdot \mathbf{c} = \alpha(\mathbf{a} \cdot \mathbf{c}) + \beta(\mathbf{b} \cdot \mathbf{c}); \quad (12)$$
$$\mathbf{c} \cdot (\alpha\mathbf{a} + \beta\mathbf{b}) = \alpha(\mathbf{c} \cdot \mathbf{a}) + \beta(\mathbf{c} \cdot \mathbf{b}). \quad (13)$$
Note that it is the same thing as the inner product if the vectors are
in $\mathbb{R}^n$. However, in case the vectors are in $\mathbb{C}^n$, this dot product will
not satisfy $\mathbf{a} \cdot \mathbf{a} \geq 0$. For example, $(1, 2i) \cdot (1, 2i) = 1 - 4 = -3$.
However, $((1, 2i), (1, 2i)) = 1 + 4 = 5$.
For those who remember the geometric description of vector
addition, the Cauchy-Schwarz inequality presented above may be
illustrated geometrically in the context of $\mathbb{R}^n$. First note that from
the Pythagorean theorem, $|\mathbf{a}|$ reduces to the length of the vector $\mathbf{a}$.
Now consider the picture of a triangle whose sides are the vectors
$\mathbf{a}$ and $\mathbf{b}$, with the angle $\theta$ between them and the third side
given by $\mathbf{a} - \mathbf{b}$. By the law of cosines,
$$|\mathbf{a} - \mathbf{b}|^2 = |\mathbf{a}|^2 + |\mathbf{b}|^2 - 2|\mathbf{a}||\mathbf{b}|\cos\theta.$$
Also from the properties of the dot product,
$$|\mathbf{a} - \mathbf{b}|^2 = (\mathbf{a} - \mathbf{b}) \cdot (\mathbf{a} - \mathbf{b}) = |\mathbf{a}|^2 + |\mathbf{b}|^2 - 2\,\mathbf{a} \cdot \mathbf{b},$$
and so, comparing the above two formulas,
$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}||\mathbf{b}|\cos\theta. \quad (14)$$
Since $|\cos\theta| \leq 1$,
$$|\mathbf{a} \cdot \mathbf{b}| \leq |\mathbf{a}||\mathbf{b}|.$$
Also note that Eq. 14 provides an easy way to find the angle
between two vectors and also a method to tell whether two vectors
are perpendicular.
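For example (a worked illustration added here), with $\mathbf{a} = (1, 0, 1)$ and $\mathbf{b} = (0, 1, 1)$ in $\mathbb{R}^3$, Eq. 14 gives
$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{|\mathbf{a}||\mathbf{b}|} = \frac{1}{\sqrt{2}\sqrt{2}} = \frac{1}{2}, \qquad \theta = \frac{\pi}{3},$$
while a dot product of zero would signal that the vectors are perpendicular.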

4. Addition and Scalar Multiplication

A matrix is a rectangular array of numbers. Several of them are
referred to as matrices. For example, here is a matrix:
$$\begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 2 & 8 & 7 \\ 6 & -9 & 1 & 2 \end{pmatrix}.$$
This matrix is a $3 \times 4$ matrix because there are three rows and four
columns. The first row is $(1\ 2\ 3\ 4)$, the second row is $(5\ 2\ 8\ 7)$,
and so forth. The first column is
$$\begin{pmatrix} 1 \\ 5 \\ 6 \end{pmatrix}.$$
The convention in dealing
with matrices is to always list the rows first and then the columns. Also,
you can remember that the columns are like columns in a Greek temple:
they stand upright, while the rows just lie there like rows made by a
tractor in a plowed field. Elements of the matrix are identified according
to position in the matrix. For example, 8 is in position 2,3 because
it is in the second row and the third column. You might remember
that you always list the rows before the columns by using the phrase
"Rowman Catholic." The symbol $(a_{ij})$ refers to a matrix in which the $i$
denotes the row and the $j$ denotes the column. Using this notation on
the above matrix, $a_{23} = 8$, $a_{32} = -9$, $a_{12} = 2$, etc.
There are various operations which are done on matrices. They
can sometimes be added, multiplied by a scalar and sometimes
multiplied. To illustrate scalar multiplication, consider the follow-
ing example:
$$3\begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 2 & 8 & 7 \\ 6 & -9 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 6 & 9 & 12 \\ 15 & 6 & 24 & 21 \\ 18 & -27 & 3 & 6 \end{pmatrix}.$$
The new matrix is obtained by multiplying every entry of the
original matrix by the given scalar. If $A$ is an $m \times n$ matrix, $-A$ is
defined to equal $(-1)A$.
Two matrices which are of the same size can be added. When
this is done, the result is the matrix which is obtained by adding
corresponding entries. Thus
$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 2 \end{pmatrix} + \begin{pmatrix} -1 & 4 \\ 2 & 8 \\ 6 & -4 \end{pmatrix} = \begin{pmatrix} 0 & 6 \\ 5 & 12 \\ 11 & -2 \end{pmatrix}.$$
Two matrices are equal exactly when they are the same size and the
corresponding entries are identical. Thus
$$\begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$$
because they are different sizes. As noted above, you write $(c_{ij})$ for the
matrix $C$ whose $ij$th entry is $c_{ij}$. In doing arithmetic with matrices
you must define what happens in terms of the $c_{ij}$, sometimes called
the entries of the matrix or the components of the matrix.
The above discussion stated for general matrices is given in the
following definition.
Definition 4.1: Let $A = (a_{ij})$ and $B = (b_{ij})$ be two $m \times n$ matrices. Then
$A + B = C$ where
$$C = (c_{ij})$$
for $c_{ij} = a_{ij} + b_{ij}$. Also if $x$ is a scalar,
$$xA = (c_{ij}),$$
where $c_{ij} = xa_{ij}$. The number $A_{ij}$ will typically refer to the $ij$th entry
of the matrix $A$. The zero matrix, denoted by $0$, will be the matrix
consisting of all zeros.
Note there are $2 \times 3$ zero matrices, $3 \times 4$ zero matrices, etc. In
fact for every size there is a zero matrix.
With this definition it is easy to verify all of the following
properties, valid for $A$, $B$, and $C$ $m \times n$ matrices and $0$ an $m \times n$
zero matrix:
$$A + B = B + A, \quad (15)$$
the commutative law of addition;
$$(A + B) + C = A + (B + C), \quad (16)$$
the associative law for addition;
$$A + 0 = A, \quad (17)$$
the existence of an additive identity;
$$A + (-A) = 0, \quad (18)$$
the existence of an additive inverse. Also, for $\alpha$, $\beta$ scalars, the
following also hold:
$$\alpha(A + B) = \alpha A + \alpha B, \quad (19)$$
$$(\alpha + \beta)A = \alpha A + \beta A, \quad (20)$$
$$\alpha(\beta A) = (\alpha\beta)A, \quad (21)$$
$$1A = A. \quad (22)$$
The above properties, Eqs. 15–22, are known as the vector
space axioms, and the fact that the $m \times n$ matrices satisfy these
axioms is what is meant by saying that this set of matrices forms a
vector space. Note that, in particular, these axioms apply to points
in $\mathbb{F}^n$ written as $n \times 1$ or $1 \times n$ matrices, which we call vectors and
usually write in boldface.
Definition 4.2: Matrices which are $n \times 1$ or $1 \times n$ are called vectors and are
often denoted by a bold letter. Thus
$$\mathbf{x} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix}$$
is an $n \times 1$ matrix, also called a column vector, while a $1 \times n$ matrix of
the form
$$(x_1 \ \cdots \ x_n) = \mathbf{x}^T$$
is referred to as a row vector.

5. Matrix Multiplication

All the above is fine, but the real reason for considering matrices is
that they can be multiplied. This is where things quit being banal.
The rule for matrix multiplication is this: you can multiply an $m \times n$
matrix times an $n \times p$ matrix. The following may be helpful:
$$(m \times n)\,(n \times p) = m \times p.$$
The two middle numbers must match!
This is the way it must be to do the multiplication. As indicated, the
result will be an $m \times p$ matrix.
The $ij$th entry of $AB$ is the dot product of the $i$th row of $A$
with the $j$th column of $B$, and the dot product must make sense!
In symbols,
$$(AB)_{ij} = \sum_k A_{ik}B_{kj}. \quad (23)$$

Consider the special case where $A$ is $m \times n$ and $B = \mathbf{x} \in \mathbb{F}^n$.
In this case $A\mathbf{x}$ is a column vector. Then by the above definition,
this column is
$$\begin{pmatrix} \sum_k A_{1k}x_k \\ \vdots \\ \sum_k A_{mk}x_k \end{pmatrix} = \sum_k x_k \begin{pmatrix} A_{1k} \\ \vdots \\ A_{mk} \end{pmatrix}.$$
Thus, to find this product you take the entries of $\mathbf{x}$ in order from
top to bottom. You take the $i$th entry of $\mathbf{x}$ from the top and
multiply it by the $i$th column of $A$. Then you add these. For example,
$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}\begin{pmatrix} 7 \\ 8 \\ 9 \end{pmatrix} = 7\begin{pmatrix} 1 \\ 4 \end{pmatrix} + 8\begin{pmatrix} 2 \\ 5 \end{pmatrix} + 9\begin{pmatrix} 3 \\ 6 \end{pmatrix} = \begin{pmatrix} 50 \\ 122 \end{pmatrix}.$$
In general, if
$$A = (\mathbf{a}_1 \ \cdots \ \mathbf{a}_n),$$
where the $\mathbf{a}_i$ are the columns, then
$$A\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = x_1\mathbf{a}_1 + \cdots + x_n\mathbf{a}_n.$$
What about $AB$ where $B = (\mathbf{b}_1 \ \cdots \ \mathbf{b}_p)$, the $\mathbf{b}_i$ being the
columns of $B$? From the definition, the $j$th column of $AB$ is
$$\begin{pmatrix} \sum_k A_{1k}B_{kj} \\ \sum_k A_{2k}B_{kj} \\ \vdots \\ \sum_k A_{mk}B_{kj} \end{pmatrix} = A\mathbf{b}_j.$$
Thus $AB$ is of the form
$$(A\mathbf{b}_1 \ \cdots \ A\mathbf{b}_p).$$
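For example (an added illustration reusing the vector product computed above), appending a second column $\mathbf{b}_2 = (0, 1, 0)^T$ to the earlier $\mathbf{b}_1 = (7, 8, 9)^T$ gives
$$\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}\begin{pmatrix} 7 & 0 \\ 8 & 1 \\ 9 & 0 \end{pmatrix} = \begin{pmatrix} A\mathbf{b}_1 & A\mathbf{b}_2 \end{pmatrix} = \begin{pmatrix} 50 & 2 \\ 122 & 5 \end{pmatrix}.$$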
6. Systems of Linear Equations and Matrices

A system of linear equations is something of the form
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ &\;\;\vdots \\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m. \end{aligned}$$
The variables to solve for are $x_1, \ldots, x_n$. You notice there are $m$
equations for the $n$ variables. As just explained, another way to
write the above system of equations in terms of matrix multiplication
is in the form
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix},$$
or $A\mathbf{x} = \mathbf{b}$ for short. Thus a solution to the system of equations will be
$\mathbf{x} \in \mathbb{R}^n$ such that $A\mathbf{x} = \mathbf{b}$. Another way to say the same thing is to
find $\mathbf{x}$ such that
$$\begin{pmatrix} A & \mathbf{b} \end{pmatrix}\begin{pmatrix} \mathbf{x} \\ -1 \end{pmatrix} = \mathbf{0}.$$
Here the matrix on the left is obtained by starting with $A$ and
adding a column at the end which consists of $\mathbf{b}$. The vector which
follows is obtained by starting with $\mathbf{x}$ and adding an entry at the
bottom consisting of $-1$.
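For example (a small illustration added here), the system $x + 2y = 5$, $3x - y = 1$ becomes
$$\begin{pmatrix} 1 & 2 & 5 \\ 3 & -1 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \\ -1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$
since the first row reads $x + 2y - 5 = 0$ and the second reads $3x - y - 1 = 0$.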

7. Row Operations

When one deals with matrices there are operations called row
operations. These are as follows:
Definition 7.1: The row operations consist of the following:
1. Switch two rows.
2. Multiply a row by a nonzero number.
3. Replace a row by a multiple of another row added to it.
The major result for row operations is the following.
Lemma 7.2: Let $A$ be an $m \times n$ matrix and let $\mathbf{x}$ be a vector in $\mathbb{F}^n$ such that
$A\mathbf{x} = \mathbf{0}$. Suppose $B$ is an $m \times n$ matrix which results from $A$ by doing a row
operation to $A$. Then $B\mathbf{x} = \mathbf{0}$ also.
Proof: The vector $\mathbf{x}$ is just an $n \times 1$ matrix. Thus, to say that $A\mathbf{x} = \mathbf{0}$ is to
say that the dot product of each row of $A$ with $\mathbf{x}$ equals 0. Clearly this does
not change if you switch rows. If you multiply $\mathbf{a}$, a row of $A$, by a nonzero
scalar $\alpha$, then from the properties of the dot product listed above,
$(\alpha\mathbf{a}) \cdot \mathbf{x} = \alpha(\mathbf{a} \cdot \mathbf{x}) = \alpha 0 = 0$. If $\mathbf{a}, \mathbf{b}$ are two rows of $A$, then from these
properties again,
$$(\mathbf{a} + \alpha\mathbf{b}) \cdot \mathbf{x} = \mathbf{a} \cdot \mathbf{x} + \alpha(\mathbf{b} \cdot \mathbf{x}) = 0 + \alpha 0 = 0.$$
The dot product of the unchanged rows with $\mathbf{x}$ still gives 0. This
covers all three of the row operations.
This lemma gives the justification for the usual method of
finding solutions to a system of equations. Recall, it is desired to
find the solution x to the system of equations Ax = b. This is the
same as finding x such that

$$\begin{pmatrix} A & \mathbf{b} \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ -1 \end{pmatrix} = 0. \tag{24}$$

By the above lemma, you can do row operations on the augmented
matrix (A | b), yielding (A′ | b′), and the vector (x; −1) will be a
solution to the system

$$\begin{pmatrix} A' & \mathbf{b}' \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ -1 \end{pmatrix} = 0$$

if and only if this vector is a solution to Eq. 24. The idea is to find
(A′ | b′) in such a way that it will be easy to see what this column
vector should be. The easiest form for spotting the solution is called
the reduced row echelon form.
Example 7.3: Solve the system of equations

$$\begin{pmatrix} 1 & 2 & 3 \\ 3 & 2 & 4 \\ 5 & 6 & 2 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ 4 \\ 3 \end{pmatrix}.$$

As explained above, you form the augmented matrix

$$\begin{pmatrix} 1 & 2 & 3 & 1 \\ 3 & 2 & 4 & 4 \\ 5 & 6 & 2 & 3 \end{pmatrix}$$

and then do row operations on it till you get a much easier one.
After doing row operations, one can obtain the following reduced
row-echelon matrix:

$$\begin{pmatrix} 1 & 0 & 0 & \tfrac{21}{16} \\[2pt] 0 & 1 & 0 & -\tfrac{23}{32} \\[2pt] 0 & 0 & 1 & \tfrac{3}{8} \end{pmatrix}.$$

It is now easy to identify a vector (x; −1) which the above matrix
sends to 0. It is to let

$$\mathbf{x} = (x, y, z) = \left( \tfrac{21}{16},\; -\tfrac{23}{32},\; \tfrac{3}{8} \right).$$

This x is the solution to the original system of equations. Note the
solution to the original system is just the column vector on the right
in the reduced row echelon form.
More can be said here but this chapter is not oriented in this
direction. It suffices to see that this is always a reasonable way to
find solutions to a system of equations. Later, least squares solutions
will be discussed. These are important when there is no
solution to the kind of system of equations just discussed. To accomplish
the process of row reduction just described, you can use a computer
algebra system. Here are directions using Maple 12 (a short Python
alternative is sketched after this list).
1. First open the application. On the left you will see a column of
gray boxes. Click on the one which is labeled "matrix" to obtain a
display which says "rows," and right below it, "columns." Enter the
appropriate numbers for the augmented matrix.
2. Next left click on "insert matrix." This will give you a template in
which you can fill in the appropriate numbers. It is convenient
to use the tab key on your keyboard to advance from one slot to
the next when entering the entries of the matrix.
3. When you have entered the entries of the matrix, right click on
it. Then click on "solvers and forms," place the cursor on "row
echelon form," and click on "reduced." This gives the reduced
row-echelon form from which it will be maximally easy to spot
the solution to the original system.
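As a minimal sketch of the same computation in Python (an addition to the original text; it assumes the sympy package is available):

    from sympy import Matrix

    # Augmented matrix for the system in Example 7.3.
    aug = Matrix([[1, 2, 3, 1],
                  [3, 2, 4, 4],
                  [5, 6, 2, 3]])

    # rref() returns the reduced row-echelon form and the pivot columns.
    rref_matrix, pivots = aug.rref()
    print(rref_matrix)   # the last column holds the solution (x, y, z)

Exact rational arithmetic is the point of using a symbolic package here: the fractions in the reduced row echelon form come out exactly rather than as rounded decimals.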
8. Properties of Matrix Multiplication

First note the following example.

Example 8.1: Multiply the following in two different orders:

$$\begin{pmatrix} 1 & 2 & 0 \\ 0 & 3 & 1 \\ -2 & 1 & 1 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix}.$$

Note that it is not possible to multiply these matrices in the given
order because the first is 3 × 3 and the second is 2 × 3. However, you
can multiply them in the other order. You should check that

$$\begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 & 0 \\ 0 & 3 & 1 \\ -2 & 1 & 1 \end{pmatrix} = \begin{pmatrix} -1 & 9 & 3 \\ -2 & 7 & 3 \end{pmatrix}.$$

Order Matters!
Matrix multiplication is not commutative. This is very different
from multiplication of numbers!
As pointed out above, sometimes it is possible to multiply
matrices in one order but not in the other order. What if it makes
sense to multiply them in either order? Will the respective products
be equal then?

Example 8.2: Compare $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.

The first product is

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 4 & 3 \end{pmatrix};$$

the second product is

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} = \begin{pmatrix} 3 & 4 \\ 1 & 2 \end{pmatrix};$$

and you see these are not equal. Therefore, you cannot conclude
that AB = BA for matrix multiplication even if both orders make
sense. However, there are some properties which do hold.

Proposition 8.3: If all multiplications and additions make sense, the
following hold for matrices A, B, C and scalars a, b:

$$A(aB + bC) = a(AB) + b(AC), \tag{25}$$
$$(B + C)A = BA + CA, \tag{26}$$
$$A(BC) = (AB)C. \tag{27}$$
Proof: Using the above definition of matrix multiplication, addition, and
scalar multiplication,

$$\begin{aligned} (A(aB + bC))_{ij} &= \sum_k A_{ik}(aB + bC)_{kj} = \sum_k A_{ik}\left( aB_{kj} + bC_{kj} \right) \\ &= a\sum_k A_{ik}B_{kj} + b\sum_k A_{ik}C_{kj} = a(AB)_{ij} + b(AC)_{ij} = (a(AB) + b(AC))_{ij}, \end{aligned}$$

showing that A(aB + bC) = a(AB) + b(AC) as claimed. Formula 26 is
entirely similar.
Consider Eq. 27, the associative law of multiplication. Before reading
this, review the definition of matrix multiplication in terms of entries of the
matrices:

$$\begin{aligned} (A(BC))_{ij} &= \sum_k A_{ik}(BC)_{kj} = \sum_k A_{ik} \sum_l B_{kl}C_{lj} \\ &= \sum_l \left( \sum_k A_{ik}B_{kl} \right) C_{lj} = \sum_l (AB)_{il}C_{lj} = ((AB)C)_{ij}. \end{aligned}$$

This proves Eq. 27.
Another important operation on matrices is that of taking the
transpose. The following example shows what is meant by this
operation, denoted by placing a T as an exponent on the matrix:

$$\begin{pmatrix} 1 & 1 + 2i \\ 3 & 1 \\ 2 & 6 \end{pmatrix}^T = \begin{pmatrix} 1 & 3 & 2 \\ 1 + 2i & 1 & 6 \end{pmatrix}.$$

What happened? The first column became the first row and the
second column became the second row. Thus the 3 × 2 matrix
became a 2 × 3 matrix. The number 3 was in the second row and
the first column and it ended up in the first row and second column.
This motivates the following definition of the transpose of a matrix.
Definition 8.4: Let A be an m × n matrix. Then A^T denotes the n × m
matrix which is defined as follows:

$$\left( A^T \right)_{ij} = A_{ji}.$$

Related to the transpose is the adjoint,

$$\left( A^* \right)_{ij} = \overline{A_{ji}},$$

where the line on the top denotes the complex conjugate. Thus

$$\overline{a + ib} = a - ib.$$

In this chapter, we are mainly interested in the case where all
matrices are real. Thus A^T and A^* will be the same. The transpose
and adjoint of a matrix have the following important properties.

Lemma 8.5: Let A be an m × n matrix and let B be an n × p matrix. Then

$$(AB)^T = B^T A^T \tag{28}$$

and if a and b are scalars,

$$(aA + bB)^T = aA^T + bB^T. \tag{29}$$

In the case of the adjoint,

$$(AB)^* = B^* A^* \tag{30}$$

and if a, b are scalars,

$$(aA + bB)^* = \bar{a}A^* + \bar{b}B^*. \tag{31}$$

In addition, (A^T)^T = A and (A^*)^* = A.
Proof: From the definition,

$$\left( (AB)^T \right)_{ij} = (AB)_{ji} = \sum_k A_{jk}B_{ki} = \sum_k \left( B^T \right)_{ik}\left( A^T \right)_{kj} = \left( B^T A^T \right)_{ij}.$$

Equation 29 is left as an exercise. The claims in Eqs. 30 and 31 also
follow right away using the easily established facts that the conjugate
of a product equals the product of the conjugates and the
conjugate of a sum equals the sum of the conjugates. The last
claim is obvious.
The real significance of the adjoint is in the following proposition
related to the inner product.

Proposition 8.6: Let A be an m × n matrix, let x ∈ F^n, and y ∈ F^m. Then

$$(A\mathbf{x}, \mathbf{y}) = (\mathbf{x}, A^*\mathbf{y}).$$

Proof: From the definition and the fact that the complex conjugate of a
product equals the product of the conjugates,

$$(A\mathbf{x}, \mathbf{y}) = \sum_{j=1}^m \sum_{i=1}^n A_{ji} x_i \overline{y_j} = \sum_{i=1}^n x_i \overline{\sum_{j=1}^m \overline{A_{ji}}\, y_j} = \sum_{i=1}^n x_i \overline{(A^*\mathbf{y})_i} = (\mathbf{x}, A^*\mathbf{y}). \quad\blacksquare$$

Definition 8.7: An n × n matrix A is said to be symmetric if A = A^T. It is
said to be skew symmetric if A^T = −A. It is Hermitian if A = A^*.
Example 8.8: Let

$$A = \begin{pmatrix} 2 & 1 & 3 \\ 1 & 5 & 3 \\ 3 & 3 & 7 \end{pmatrix}.$$

Then A is symmetric.

Example 8.9: Let

$$A = \begin{pmatrix} 0 & 1 & 3 \\ -1 & 0 & 2 \\ -3 & -2 & 0 \end{pmatrix}.$$

Then A is skew symmetric.
There is a special matrix called I and defined by

$$I_{ij} = \delta_{ij},$$

where δ_ij is the Kronecker symbol defined by

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

It is called the identity matrix because it is a multiplicative identity
in the following sense.

Lemma 8.10: Suppose A is an m × n matrix and I_n is the n × n identity
matrix. Then AI_n = A. If I_m is the m × m identity matrix, it also follows
that I_m A = A.

Proof:

$$(AI_n)_{ij} = \sum_k A_{ik}\delta_{kj} = A_{ij},$$

and so AI_n = A. The other case is left as an exercise for you.

Another useful concept is the trace of a matrix. For A an n × n
matrix,

$$\operatorname{trace}(A) = \sum_{i=1}^n A_{ii},$$

the sum of the entries down the main diagonal. Then here is a nice
theorem about the trace of a product.

Theorem 8.11: Let A be an m × n matrix and let B be an n × m matrix.
Then

$$\operatorname{trace}(AB) = \operatorname{trace}(BA).$$

Proof:

$$\operatorname{trace}(AB) = \sum_i \sum_k A_{ik}B_{ki} = \sum_k \left( \sum_i B_{ki}A_{ik} \right) = \operatorname{trace}(BA). \quad\blacksquare$$
9. Inverses

Here is the definition of the inverse of a matrix.

Definition 9.1: An n × n matrix A has an inverse A^{-1} if and only if
AA^{-1} = A^{-1}A = I, where I = (δ_ij) for

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$

Such a matrix is called invertible.

So how do you find the inverse of an n × n matrix when it
exists? First, it is important to realize that just because a matrix is
nonzero does not mean it has an inverse.
Example 9.2: Let $A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$. Does A have an inverse?

One might think A would have an inverse because it does not
equal zero. However,

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

and if A^{-1} existed, this could not happen because you could write

$$\begin{pmatrix} -1 \\ 1 \end{pmatrix} = \left( A^{-1}A \right)\begin{pmatrix} -1 \\ 1 \end{pmatrix} = A^{-1}\left( A\begin{pmatrix} -1 \\ 1 \end{pmatrix} \right) = A^{-1}\begin{pmatrix} 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

a contradiction. Thus the answer is that A does not have an inverse.
Consider the problem of finding the right inverse. This is a
matrix B such that AB = I. Later, in the section on the determinant,
it is shown that this will be the inverse (see Corollary 16.2). Denoting
this right inverse by B where

$$B = (\mathbf{b}_1 \; \mathbf{b}_2 \; \cdots \; \mathbf{b}_n),$$

it is required to have Ab_i = e_i, where e_i is the vector which has a 1 in
the ith slot and a zero everywhere else. This is what is needed to
have AB = I, the ith column of I being e_i. Thus, as discussed above,
you need to row reduce the augmented matrix

$$\begin{pmatrix} A & \mathbf{e}_i \end{pmatrix}$$

such that $\begin{pmatrix} \mathbf{b}_i \\ -1 \end{pmatrix}$ is a solution to the system

$$\begin{pmatrix} A & \mathbf{e}_i \end{pmatrix}\begin{pmatrix} \mathbf{b}_i \\ -1 \end{pmatrix} = 0.$$

This involves doing row operations till you obtain

$$\begin{pmatrix} I & \mathbf{b}_i \end{pmatrix}$$

and then observing that this b_i works.
Each time this is done, the same row operations can be used to
get A → I. Therefore, a shortcut to doing this is to simply write the
n × 2n matrix

$$\begin{pmatrix} A & I \end{pmatrix}$$

and do row operations on it till the left half equals I; then what
sits on the right side will be the inverse of A.
Example 9.3: Find the inverse of $A = \begin{pmatrix} 1 & 3 & 4 \\ 5 & -2 & 3 \\ 2 & 1 & 5 \end{pmatrix}$.

You write

$$\begin{pmatrix} 1 & 3 & 4 & 1 & 0 & 0 \\ 5 & -2 & 3 & 0 & 1 & 0 \\ 2 & 1 & 5 & 0 & 0 & 1 \end{pmatrix}.$$

Then you do row operations on this until the left half equals I, and
it follows that what is left on the right side will be the inverse:

$$\begin{pmatrix} 1 & 0 & 0 & \tfrac{13}{34} & \tfrac{11}{34} & -\tfrac{1}{2} \\[2pt] 0 & 1 & 0 & \tfrac{19}{34} & \tfrac{3}{34} & -\tfrac{1}{2} \\[2pt] 0 & 0 & 1 & -\tfrac{9}{34} & -\tfrac{5}{34} & \tfrac{1}{2} \end{pmatrix}.$$

Thus the inverse of the given matrix is

$$A^{-1} = \begin{pmatrix} \tfrac{13}{34} & \tfrac{11}{34} & -\tfrac{1}{2} \\[2pt] \tfrac{19}{34} & \tfrac{3}{34} & -\tfrac{1}{2} \\[2pt] -\tfrac{9}{34} & -\tfrac{5}{34} & \tfrac{1}{2} \end{pmatrix}.$$
It is a good idea to check your work when you do one of these by
hand. From the above observation that the left and right inverses
are the same for a square matrix, it suffices to multiply on only one
side. In this case,

$$\begin{pmatrix} \tfrac{13}{34} & \tfrac{11}{34} & -\tfrac{1}{2} \\[2pt] \tfrac{19}{34} & \tfrac{3}{34} & -\tfrac{1}{2} \\[2pt] -\tfrac{9}{34} & -\tfrac{5}{34} & \tfrac{1}{2} \end{pmatrix} \begin{pmatrix} 1 & 3 & 4 \\ 5 & -2 & 3 \\ 2 & 1 & 5 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Of course you don't want to do this sort of busy work. It is
easier to use a computer algebra system. Here is how you would
find the inverse of a matrix using Maple 12.
1. First open the application. On the left you will see a column of
gray boxes. Click on the one which is labeled "matrix" to obtain a
display which says "rows," and right below it, "columns." Enter the
appropriate numbers.
2. Next left click on "insert matrix." This will give you a template
in which you can fill in the appropriate numbers. It is convenient
to use the tab key on your keyboard to advance from one
slot to the next when entering the entries of the matrix.
3. When you have entered the entries of the matrix, right click on
it. Then click on "inverse" in "standard operations." (When
you place the mouse on "standard operations" it will give you a
choice and one will be the inverse.)
It is that easy. If you don't want your answer to be in terms of
fractions, simply write one number in the display for the matrix
with a decimal point and it will give the answer in terms of decimals.
For example, write 5.0 instead of 5.
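If a numerical environment is more convenient than a computer algebra system, the same check takes a few lines of Python with NumPy (a sketch added here for illustration, not part of the original chapter):

    import numpy as np

    A = np.array([[1.0, 3.0, 4.0],
                  [5.0, -2.0, 3.0],
                  [2.0, 1.0, 5.0]])

    A_inv = np.linalg.inv(A)                  # numerical inverse
    print(A_inv)                              # the fractions above, in decimals
    print(np.allclose(A_inv @ A, np.eye(3)))  # True: A^{-1} A = I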
The next section will be on the determinant and many of its
applications. Determinants are extremely unpleasant, but they pro-
vide the shortest proofs of many important theorems. This is why I
am presenting certain things in terms of determinants. They also
occur in the description of the multivariate normal distribution
which will be discussed later.
10. The Sign Function

To see more details in the proofs of the following, see the appendix
on the mathematical theory of the determinant in Calculus: Theory
and Applications by Kuttler, published by World Scientific. You can
also go to the site http://math.byu.edu/~klkuttle/ and scroll
down to either the calculus book or linear algebra book and consult
the material on the theory of the determinant. Other books which
give two different discussions of the determinant are (1, 2).
Let i_1, . . ., i_n be an ordered list of numbers from 1, . . ., n. This
means the order is important, so 1,2,3 and 2,1,3 are different.
The following lemma will be essential in the definition of the
determinant.

Lemma 10.1: There exists a unique function sgn_n which maps each list
of numbers from 1, . . ., n to one of the three numbers 0, 1, or −1, which also
has the following properties:

$$\operatorname{sgn}_n(1, \ldots, n) = 1, \tag{32}$$

$$\operatorname{sgn}_n(i_1, \ldots, p, \ldots, q, \ldots, i_n) = -\operatorname{sgn}_n(i_1, \ldots, q, \ldots, p, \ldots, i_n). \tag{33}$$

In words, the second property states that if two of the numbers are
switched, the value of the function is multiplied by −1. Also, in the case
where n > 1 and {i_1, . . ., i_n} = {1, . . ., n}, so that every number from 1, . . ., n
appears in the ordered list i_1, . . ., i_n,

$$\operatorname{sgn}_n\left( i_1, \ldots, i_{\theta-1}, n, i_{\theta+1}, \ldots, i_n \right) = (-1)^{n-\theta}\operatorname{sgn}_{n-1}\left( i_1, \ldots, i_{\theta-1}, i_{\theta+1}, \ldots, i_n \right), \tag{34}$$

where n = i_θ in the ordered list i_1, . . ., i_n.

To see this carefully proved, you can go to the above book or
site. Here is how this function can be defined. You look at the list of
numbers and count the number of inversions, pairs of numbers
which are out of order. For example, there is one inversion for 2,1,3
because 2 is bigger than 1 but comes before it in the list. 3,1,2 has
2 inversions because 3 is larger than both 1 and 2. When you have
found the number of inversions k, the value of sgn_n is defined as (−1)^k
if there are no repeats and 0 if there are numbers repeated.
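The inversion-count description translates directly into code. Here is a small Python sketch (my own addition, not in the original chapter) that computes sgn_n exactly as just described:

    def sgn(lst):
        """Sign of an ordered list: 0 on repeats, else (-1)**(number of inversions)."""
        if len(set(lst)) != len(lst):   # repeated numbers give sign 0
            return 0
        inversions = sum(1 for i in range(len(lst))
                           for j in range(i + 1, len(lst))
                           if lst[i] > lst[j])
        return (-1) ** inversions

    print(sgn([1, 2, 3]))  # 1
    print(sgn([2, 1, 3]))  # -1 (one inversion)
    print(sgn([3, 1, 2]))  # 1  (two inversions)
    print(sgn([1, 1, 2]))  # 0  (repeat)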
11. The Determinant

In what follows, sgn will often be used rather than sgn_n because the
context supplies the appropriate n.

Definition 11.1: Let f be a real-valued function which has the set of
ordered lists of numbers from {1, . . ., n} as its domain. Define

$$\sum_{(k_1, \ldots, k_n)} f(k_1, \ldots, k_n)$$

to be the sum of all the f(k_1, . . ., k_n) for all possible choices of ordered
lists (k_1, . . ., k_n) of numbers of 1, . . ., n. For example, in the case
where n = 2,

$$\sum_{(k_1, k_2)} f(k_1, k_2) = f(1,2) + f(2,1) + f(1,1) + f(2,2).$$

Definition 11.2: Let (a_ij) = A denote an n × n matrix. The determinant of
A, denoted by det(A), is defined by

$$\det(A) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(k_1, \ldots, k_n)\, a_{1k_1} \cdots a_{nk_n},$$

where the sum is taken over all ordered lists of numbers from 1, . . ., n.
Note it suffices to take the sum over only those ordered lists in which
there are no repeats, because if there are, sgn(k_1, . . ., k_n) = 0 and so that
term contributes 0 to the sum.

Let A be an n × n matrix, A = (a_ij), and let (r_1, . . ., r_n) denote an
ordered list of n numbers from 1, . . ., n. Let A(r_1, . . ., r_n) denote the
matrix whose kth row is the r_k row of the matrix A. Thus

$$\det(A(r_1, \ldots, r_n)) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n} \tag{35}$$

and

$$A(1, \ldots, n) = A.$$
12. Permuting Rows or Columns

The next proposition tells how the determinant is affected when
you switch rows. A careful proof is in the above reference. However,
you can see this should be so by beginning with

$$\det(A(1, \ldots, n)) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(k_1, \ldots, k_n)\, a_{1k_1} \cdots a_{nk_n}$$

and then showing that when you switch i, j on both sides, the
sign of both sides changes. Thus a sequence of switches sufficient to
obtain A(r_1, . . ., r_n) on the left will yield sgn(r_1, . . ., r_n) det(A) on the
right.

Proposition 12.1: Let (r_1, . . ., r_n) be an ordered list of numbers from
1, . . ., n. Then

$$\operatorname{sgn}(r_1, \ldots, r_n)\det(A) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n} \tag{36}$$

$$= \det(A(r_1, \ldots, r_n)). \tag{37}$$

Observation 12.2: There are n! ordered lists of distinct numbers from
1, . . ., n.

To see this, consider n slots placed in order. There are n choices
for the first slot. For each of these choices, there are n − 1 choices for
the second. Thus there are n(n − 1) ways to fill the first two slots.
Then for each of these ways there are n − 2 choices left for the third
slot. Continuing this way, there are n! ordered lists of distinct
numbers from (1, . . ., n), as stated in the observation.
13. A Symmetric Definition

With the above, it is possible to give a more symmetric description
of the determinant from which it will follow that det(A) = det(A^T).

Corollary 13.1: The following formula for det(A) is valid:

$$\det(A) = \frac{1}{n!} \sum_{(r_1, \ldots, r_n)} \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(r_1, \ldots, r_n)\operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}. \tag{38}$$

Also, det(A^T) = det(A), where A^T is the transpose of A. (Recall that
for A^T = (a^T_{ij}), a^T_{ij} = a_{ji}.)

Proof: From Proposition 12.1, if the r_i are distinct,

$$\operatorname{sgn}(r_1, \ldots, r_n)\det(A) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n},$$

so that, multiplying both sides by sgn(r_1, . . ., r_n),

$$\det(A) = \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(r_1, \ldots, r_n)\operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}.$$

Summing over all ordered lists (r_1, . . ., r_n) where the r_i are distinct
(if the r_i are not distinct, sgn(r_1, . . ., r_n) = 0 and so there is no
contribution to the sum),

$$n!\det(A) = \sum_{(r_1, \ldots, r_n)} \sum_{(k_1, \ldots, k_n)} \operatorname{sgn}(r_1, \ldots, r_n)\operatorname{sgn}(k_1, \ldots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}.$$

This proves the corollary, since the formula gives the same number
for A as it does for A^T.
14. Properties of Determinants

The following corollary says that the determinant is linear in each
row (column). The meaning of this term is described in the statement
of the corollary. The proof of this corollary is immediate from
the above definition of the determinant.

Corollary 14.1: If two rows or two columns in an n × n matrix A are
switched, the determinant of the resulting matrix equals −1 times the
determinant of the original matrix. If A is an n × n matrix in which two rows are
equal or two columns are equal, then det(A) = 0. Suppose the ith row of A
equals (xa_1 + yb_1, . . ., xa_n + yb_n). Then

$$\det(A) = x\det(A_1) + y\det(A_2),$$

where the ith row of A_1 is (a_1, . . ., a_n) and the ith row of A_2 is (b_1, . . ., b_n),
all other rows of A_1 and A_2 coinciding with those of A. In other words,
det is a linear function of each row of A. The same is true with the word
"row" replaced with the word "column."

Definition 14.2: A vector w is a linear combination of the vectors v_1, . . ., v_r
if there exist scalars c_1, . . ., c_r such that $\mathbf{w} = \sum_{k=1}^r c_k \mathbf{v}_k$. This is the same as
saying w ∈ span(v_1, . . ., v_r).

The following corollary is also of great use and follows from the
above corollary. When a row (column) is linearly dependent on
other rows (columns), the determinant of the matrix is 0.

Corollary 14.3: Suppose A is an n × n matrix and some column (row) is a
linear combination of r other columns (rows). Then det(A) = 0.

Recall the following definition of matrix multiplication.

Definition 14.4: If A and B are n × n matrices, A = (a_ij) and B = (b_ij), then
AB = (c_ij), where $c_{ij} = \sum_{k=1}^n a_{ik} b_{kj}$.

One of the most important rules about determinants is that the
determinant of a product equals the product of the determinants.
This theorem can be proved directly from the definition.

Theorem 14.5: Let A and B be n × n matrices. Then

$$\det(AB) = \det(A)\det(B).$$
15. Cofactor Expansions

The most important theoretical property of determinants is the
method of Laplace expansion. In fact, this method is sometimes
used as the basis for the definition of the determinant. Virtually all
the major applications of determinants depend on it. Nevertheless,
it is interesting that it yields an impractical way to actually compute
the determinant.

Definition 15.1: Let A = (a_ij) be an n × n matrix. Then a new matrix called
the cofactor matrix, cof(A), is defined by cof(A) = (c_ij), where to obtain c_ij you
delete the ith row and the jth column of A, take the determinant of the
(n − 1) × (n − 1) matrix which results (this is called the ijth minor of A),
and then multiply this number by (−1)^{i+j}. To make the formulas easier to
remember, cof(A)_ij will denote the ijth entry of the cofactor matrix.

Now here is the method of cofactor expansions.

Theorem 15.2: Let A be an n × n matrix where n ≥ 2. Then

$$\det(A) = \sum_{j=1}^n a_{ij}\operatorname{cof}(A)_{ij} = \sum_{i=1}^n a_{ij}\operatorname{cof}(A)_{ij}.$$

The first formula consists of expanding the determinant along the ith
row and the second expands the determinant along the jth column.

Note that this gives an easy way to write a formula for the
inverse of an n × n matrix.
16. Formula for the Inverse

Theorem 16.1: A^{-1} exists if and only if det(A) ≠ 0. If det(A) ≠ 0,
then A^{-1} = (a^{-1}_{ij}), where

$$a^{-1}_{ij} = \det(A)^{-1}\operatorname{cof}(A)_{ji}$$

for cof(A)_ij the ijth cofactor of A.

Proof: By Theorem 15.2, and letting (a_ir) = A, if det(A) ≠ 0,

$$\sum_{i=1}^n a_{ir}\operatorname{cof}(A)_{ir}\det(A)^{-1} = \det(A)\det(A)^{-1} = 1.$$
Now consider

$$\sum_{i=1}^n a_{ir}\operatorname{cof}(A)_{ik}\det(A)^{-1}$$

when k ≠ r. Replace the kth column with the rth column to obtain a
matrix B_k whose determinant equals zero by Corollary 14.1. However,
expanding this matrix along the kth column yields

$$0 = \det(B_k)\det(A)^{-1} = \sum_{i=1}^n a_{ir}\operatorname{cof}(A)_{ik}\det(A)^{-1}.$$

Summarizing,

$$\sum_{i=1}^n a_{ir}\operatorname{cof}(A)_{ik}\det(A)^{-1} = \delta_{rk}.$$

Using the other formula in Theorem 15.2, and similar reasoning,

$$\sum_{j=1}^n a_{rj}\operatorname{cof}(A)_{kj}\det(A)^{-1} = \delta_{rk}.$$

This proves that if det(A) ≠ 0, then A^{-1} exists with A^{-1} = (a^{-1}_{ij}), where

$$a^{-1}_{ij} = \operatorname{cof}(A)_{ji}\det(A)^{-1}.$$

Now suppose A^{-1} exists. Then by Theorem 14.5,

$$1 = \det(I) = \det\left( AA^{-1} \right) = \det(A)\det\left( A^{-1} \right),$$

so det(A) ≠ 0.
The next corollary points out that if an n × n matrix A has a
right or a left inverse, then it has an inverse.

Corollary 16.2: Let A be an n × n matrix and suppose there exists an
n × n matrix B such that BA = I. Then A^{-1} exists and A^{-1} = B. Also, if
there exists C, an n × n matrix such that AC = I, then A^{-1} exists and
A^{-1} = C.

Proof: Since BA = I, Theorem 14.5 implies

$$\det(B)\det(A) = 1,$$

and so det(A) ≠ 0. Therefore from Theorem 16.1, A^{-1} exists. Therefore,

$$B = B\left( AA^{-1} \right) = (BA)A^{-1} = IA^{-1} = A^{-1}.$$

The case where AC = I is handled similarly.

The conclusion of this corollary is that left inverses, right
inverses, and inverses are all the same in the context of n × n
matrices.
Theorem 16.1 says that to find the inverse, take the transpose of
the cofactor matrix and divide by the determinant. The transpose of
the cofactor matrix is called the adjugate or sometimes the classical
adjoint of the matrix A. In words, A^{-1} is equal to one divided by
the determinant of A times the adjugate matrix of A.
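The adjugate formula is easy to state in code. The sketch below is my own illustration (not part of the original chapter): it computes the inverse of a small integer matrix via the cofactor matrix, using Fraction arithmetic so the entries come out exact.

    from fractions import Fraction

    def det(M):
        """Determinant by cofactor expansion along the first row (fine for small n)."""
        n = len(M)
        if n == 1:
            return M[0][0]
        total = 0
        for j in range(n):
            minor = [row[:j] + row[j+1:] for row in M[1:]]
            total += (-1) ** j * M[0][j] * det(minor)
        return total

    def inverse(M):
        """A^{-1} = adjugate(A) / det(A); entry (i, j) is cof(A)_{ji} / det(A)."""
        n = len(M)
        d = det(M)
        # adj[i][j] below is cof(A)_{ji}: delete row j and column i, then sign (-1)^{i+j}
        adj = [[(-1) ** (i + j) * det([r[:i] + r[i+1:] for k, r in enumerate(M) if k != j])
                for j in range(n)] for i in range(n)]
        return [[Fraction(adj[i][j], d) for j in range(n)] for i in range(n)]

    A = [[1, 3, 4], [5, -2, 3], [2, 1, 5]]
    print(inverse(A))  # reproduces the fractions of Example 9.3

As the chapter notes, this is an impractical way to compute for large n (the recursion costs n! operations), but it mirrors the theory exactly.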
17. Cramer's Rule

In case you are solving a system of equations Ax = y for x, it
follows that if A^{-1} exists,

$$\mathbf{x} = \left( A^{-1}A \right)\mathbf{x} = A^{-1}(A\mathbf{x}) = A^{-1}\mathbf{y},$$

thus solving the system. Now in the case that A^{-1} exists, there is a
formula for A^{-1} given above. Using this formula,

$$x_i = \sum_{j=1}^n a^{-1}_{ij} y_j = \sum_{j=1}^n \frac{1}{\det(A)}\operatorname{cof}(A)_{ji}\, y_j.$$

By the formula for the expansion of a determinant along a column,

$$x_i = \frac{1}{\det(A)}\det\begin{pmatrix} \ast & \cdots & y_1 & \cdots & \ast \\ \vdots & & \vdots & & \vdots \\ \ast & \cdots & y_n & \cdots & \ast \end{pmatrix},$$

where here the ith column of A is replaced with the column vector
(y_1, . . ., y_n)^T, and the determinant of this modified matrix is taken and
divided by det(A). This formula is known as Cramer's rule.
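As a quick numerical illustration of the rule (a sketch I am adding, assuming NumPy is available):

    import numpy as np

    def cramer_solve(A, y):
        """Solve Ax = y by Cramer's rule: x_i = det(A with column i replaced by y) / det(A)."""
        A = np.asarray(A, dtype=float)
        y = np.asarray(y, dtype=float)
        d = np.linalg.det(A)
        x = np.empty(len(y))
        for i in range(len(y)):
            Ai = A.copy()
            Ai[:, i] = y            # replace the ith column with y
            x[i] = np.linalg.det(Ai) / d
        return x

    A = [[1, 2, 3], [3, 2, 4], [5, 6, 2]]
    y = [1, 4, 3]
    print(cramer_solve(A, y))       # [ 1.3125 -0.71875  0.375 ], i.e., (21/16, -23/32, 3/8)

This reproduces the solution of Example 7.3, which is a useful cross-check of both methods.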
18. Upper Triangular Matrices

Definition 18.1: A matrix M is upper triangular if M_ij = 0 whenever i > j.
Thus such a matrix equals zero below the main diagonal, the entries of the
form M_ii, as shown:

$$\begin{pmatrix} \ast & \ast & \cdots & \ast \\ 0 & \ast & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ast \\ 0 & \cdots & 0 & \ast \end{pmatrix}.$$

A lower triangular matrix is defined similarly, as a matrix for
which all entries above the main diagonal are equal to zero.
With this definition, here is a simple corollary of Theorem 15.2. It
is easily seen by expanding repeatedly along the first column.

Corollary 18.2: Let M be an upper (lower) triangular matrix. Then det(M)
is obtained by taking the product of the entries on the main diagonal.
19. The Rank of a Matrix

There are three ways to define the rank of a matrix, but it turns out that
they all give the same number. The first of these, discussed below, is
called determinant rank.

Definition 19.1: A submatrix of a matrix A is the rectangular array of
numbers obtained by deleting some rows and columns of A. Let A be an
m × n matrix. The determinant rank, also called the rank of the matrix,
equals r, where r is the largest number such that some r × r submatrix of A
has a nonzero determinant.

With this definition, a major theorem is the following.

Theorem 19.2: If A, an m × n matrix, has determinant rank r, then there
exist r rows (columns) of the matrix such that every other row (column) is
a linear combination of these r rows (columns).

The following theorem is of fundamental importance and ties
together many of the ideas presented above.

Theorem 19.3: Let A be an n × n matrix. Then the following are
equivalent:

1. det(A) = 0.
2. A, A^T are not one to one. (For B = A or A^T, there exists x ≠ 0 such
that Bx = 0.)
3. A is not onto.

An equivalent formulation of the above theorem is the following
corollary.

Corollary 19.4: Let A be an n × n matrix. Then the following are
equivalent:

1. det(A) ≠ 0.
2. A and A^T are one to one.
3. A is onto.

Proof: This follows immediately from the above theorem.
20. Schur's Theorem

Consider the following system of equations for x_1, x_2, . . ., x_n:

$$\sum_{j=1}^n a_{ij} x_j = 0, \quad i = 1, 2, \ldots, m, \tag{39}$$

where m < n. Then the following theorem is a fundamental observation.

Theorem 20.1: Let the system of equations be as just described in
Eq. 39, where m < n. Then, letting

$$\mathbf{x}^T = (x_1, x_2, \ldots, x_n) \in F^n,$$

there exists x ≠ 0 such that the components satisfy each of the equations
of Eq. 39. Here F denotes numbers, either the real numbers R
or the complex numbers C.

Proof: The above system is of the form

$$A\mathbf{x} = 0,$$

where A is an m × n matrix with m < n. Therefore, if you form the
matrix $\begin{pmatrix} A \\ 0 \end{pmatrix}$, an n × n matrix having n − m rows of zeros on the
bottom, it follows this matrix has determinant equal to 0. Therefore,
from Theorem 19.3, there exists x ≠ 0 such that Ax = 0.

Definition 20.2: A set of vectors x_1, . . ., x_k in F^n, F = R or C, is called an
orthonormal set of vectors if

$$\left( \mathbf{x}_i, \mathbf{x}_j \right) = \mathbf{x}_i \cdot \mathbf{x}_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$
Theorem 20.3: Let v_1 be a unit vector (|v_1| = 1) in F^n, n > 1. Then there
exist vectors v_2, . . ., v_n such that {v_1, . . ., v_n} is an orthonormal set of vectors.

Proof: The equation for x, v_1 · x = 0, has a nonzero solution x by Theorem
20.1. Pick such a solution and divide by its magnitude to get v_2, a unit vector
such that v_1 · v_2 = 0. Now suppose v_1, . . ., v_k have been chosen such that
{v_1, . . ., v_k} is an orthonormal set of vectors. Then consider the equations

$$\mathbf{v}_j \cdot \mathbf{x} = 0, \quad j = 1, 2, \ldots, k.$$

This amounts to the situation of Theorem 20.1 in which there are
more variables than equations. Therefore, by this theorem, there
exists a nonzero x solving all these equations. Divide by its magnitude
and this gives v_{k+1}.

Definition 20.4: If U is an n × n matrix whose columns form an orthonormal
set of vectors, then U is called an orthogonal matrix if it is real and a
unitary matrix if it is complex. Note that from the way we multiply
matrices,

$$U^T U = UU^T = I.$$

Thus U^{-1} = U^T. If U is unitary, then from the dot product in
C^n, we replace the above with

$$U^* U = UU^* = I,$$

where the * indicates taking the conjugate of the transpose.
Note that the product of orthogonal or unitary matrices is orthogonal
or unitary because

$$(U_1 U_2)^T (U_1 U_2) = U_2^T U_1^T U_1 U_2 = I,$$
$$(U_1 U_2)^* (U_1 U_2) = U_2^* U_1^* U_1 U_2 = I.$$
Definition 20.5: Let A be a complex n × n matrix. The eigenvalues of A
are the complex numbers λ which are solutions of the characteristic
equation defined by

$$\det(\lambda I - A) = 0.$$

Proposition 20.6: Let λ be an eigenvalue of an n × n matrix A. Then
there exists x ≠ 0 such that

$$(A - \lambda I)\mathbf{x} = 0.$$

Conversely, if x ≠ 0 and (A − λI)x = 0, then det(λI − A) = 0.

Proof: First suppose (A − λI)x = 0 for x ≠ 0. Then (A − λI)^{-1} cannot
exist, because if it did, you could multiply on the left by it and conclude
x = 0. Consequently, it follows from Theorem 16.1 that det(A − λI) =
det(λI − A) = 0.
Now suppose det(λI − A) = 0. Then from Theorem 19.3, λI − A is not
one to one. Thus there exists x ≠ 0 such that (λI − A)x = 0.

Two matrices A and B are similar if there is some invertible
matrix S such that A = S^{-1}BS. Note that similar matrices have the
same characteristic equation because, by Theorem 14.5, which says the
determinant of a product is the product of the determinants,

$$\begin{aligned} \det(\lambda I - A) &= \det\left( \lambda I - S^{-1}BS \right) = \det\left( S^{-1}(\lambda I - B)S \right) \\ &= \det\left( S^{-1} \right)\det(\lambda I - B)\det(S) \\ &= \det\left( S^{-1}S \right)\det(\lambda I - B) = \det(\lambda I - B). \end{aligned}$$
With this preparation, here is Schur's theorem.

Theorem 20.7: Let A be a real or complex n × n matrix. Then there exists a
unitary matrix U such that

$$U^* A U = T, \tag{40}$$

where T is an upper triangular matrix having the eigenvalues of A on
the main diagonal, listed according to multiplicity as zeros of the
characteristic equation. If A has all real entries and eigenvalues,
then U can be chosen to be orthogonal.

Proof: The theorem is clearly true if A is a 1 × 1 matrix. Just let U = (1), the
1 × 1 matrix which has 1 down the main diagonal and zeros elsewhere.
Suppose it is true for (n − 1) × (n − 1) matrices and let A be an n × n
matrix. Then let v_1 be a unit eigenvector for A. There exists λ_1 such that

$$A\mathbf{v}_1 = \lambda_1\mathbf{v}_1, \quad |\mathbf{v}_1| = 1.$$

By Theorem 20.3, there exists an orthonormal set v_1, . . ., v_n in F^n. Let
U_0 be a matrix whose ith column is v_i. Then from the above, it
follows U_0 is unitary. Then from the way you multiply matrices,
U_0^* A U_0 is of the form

$$\begin{pmatrix} \lambda_1 & \ast \\ 0 & A_1 \end{pmatrix},$$

where A_1 is an (n − 1) × (n − 1) matrix. The above matrix is similar to
A, so it has the same eigenvalues and indeed the same characteristic
equation. Now by induction there exists an (n − 1) × (n − 1) unitary
matrix $\widetilde{U}_1$ such that

$$\widetilde{U}_1^* A_1 \widetilde{U}_1 = T_{n-1},$$

an upper triangular matrix. Consider

$$U_1 = \begin{pmatrix} 1 & 0 \\ 0 & \widetilde{U}_1 \end{pmatrix}.$$

From the way we multiply matrices, this is a unitary matrix, and also

$$U_1^*\left( U_0^* A U_0 \right)U_1 = \begin{pmatrix} 1 & 0 \\ 0 & \widetilde{U}_1^* \end{pmatrix} \begin{pmatrix} \lambda_1 & \ast \\ 0 & A_1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \widetilde{U}_1 \end{pmatrix} = \begin{pmatrix} \lambda_1 & \ast \\ 0 & T_{n-1} \end{pmatrix} = T,$$

where T is upper triangular. Let U = U_0 U_1. Then U^* A U = T. If A
is real having real eigenvalues, all of the above can be accomplished
using the real dot product and using real eigenvectors. Thus U can
be orthogonal.
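Numerically, the Schur factorization is available off the shelf. Here is a brief SciPy sketch (my own addition; it assumes scipy is installed):

    import numpy as np
    from scipy.linalg import schur

    A = np.array([[1.0, 2.0, 0.0],
                  [0.0, 3.0, 1.0],
                  [-2.0, 1.0, 1.0]])

    # complex Schur form: T upper triangular, U unitary, A = U T U^*
    T, U = schur(A, output='complex')
    print(np.allclose(A, U @ T @ U.conj().T))   # True
    print(np.diag(T))                           # eigenvalues of A on the diagonal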
21. Symmetric and Hermitian Matrices

A real matrix A is symmetric if A = A^T. It is Hermitian if A = A^*.
Thus a real Hermitian matrix is symmetric.

Theorem 21.1: Let A be a Hermitian matrix. Then there exist a diagonal
matrix D, consisting of the eigenvalues of A down the main diagonal, and a
unitary matrix U such that

$$U^* A U = D.$$

If A is real, then U can be a real orthogonal matrix.

Proof: It follows from Theorem 20.7 that there exists a unitary matrix U
such that

$$U^* A U = T,$$

where T is upper triangular. Now

$$T^* = (U^* A U)^* = U^* A^* U = U^* A U = T,$$

and so in fact T is a diagonal matrix having the eigenvalues of A, and
they are all real. If A is a real Hermitian matrix, then all eigenvalues
are real as well, and by Schur's theorem, U in the above can be
orthogonal.

Theorem 21.2: Let A be a Hermitian matrix which has all positive
eigenvalues 0 < λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n. Then

$$(A\mathbf{x}, \mathbf{x}) \geq \lambda_1 |\mathbf{x}|^2.$$

If (Ax, x) ≥ 0, then all the eigenvalues of A are nonnegative.

Proof: Let U be the orthogonal matrix of Theorem 21.1. Then

$$\begin{aligned} A\mathbf{x} \cdot \mathbf{x} = \mathbf{x}^T A\mathbf{x} &= \mathbf{x}^T\left( UDU^T \right)\mathbf{x} = \left( U^T\mathbf{x} \right)^T D\left( U^T\mathbf{x} \right) = \sum_i \lambda_i \left| \left( U^T\mathbf{x} \right)_i \right|^2 \\ &\geq \lambda_1 \sum_i \left| \left( U^T\mathbf{x} \right)_i \right|^2 = \lambda_1\left( U^T\mathbf{x} \right)^T\left( U^T\mathbf{x} \right) = \lambda_1\mathbf{x}^T UU^T\mathbf{x} = \lambda_1\mathbf{x}^T I\mathbf{x} = \lambda_1|\mathbf{x}|^2. \end{aligned}$$

For the last part, if Ax = λx, then

$$0 \leq (A\mathbf{x}, \mathbf{x}) = \lambda|\mathbf{x}|^2,$$

so λ ≥ 0.
22. The Square Root

Let A be a Hermitian matrix which has all nonnegative eigenvalues.
Then we have the following theorem.

Theorem 22.1: Let A be a Hermitian matrix with nonnegative eigenvalues.
Then there exists a Hermitian matrix A^{1/2} such that (A^{1/2})^2 = A. If A
is real, then A^{1/2} can also be real. The columns of U are eigenvectors and are
an orthonormal basis for F^n.

Proof: By Theorem 21.1, there exists a unitary matrix U such that

$$U^* A U = D, \tag{41}$$

where D is a diagonal matrix with the eigenvalues of A down the
main diagonal. Hence A = UDU^*. Define A^{1/2} = UD^{1/2}U^*. Here
D^{1/2} comes from replacing each diagonal entry of D with its square
root. This is clearly Hermitian and

$$\left( A^{1/2} \right)^2 = UD^{1/2}U^*\, UD^{1/2}U^* = UD^{1/2}D^{1/2}U^* = UDU^* = A.$$

If A is real, then from Theorem 21.1, U can be orthogonal and real.
Hence A^{1/2} is also real.
From Eq. 41,

$$AU = UD,$$

and so, considering the ith column of both sides,

$$A\mathbf{u}_i = \lambda_i\mathbf{u}_i.$$

The columns of U are orthonormal because

$$\left( \mathbf{u}_i, \mathbf{u}_j \right) = \mathbf{u}_i^*\mathbf{u}_j = \delta_{ij},$$

by the fact that U^*U = I.
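Here is a brief numerical sketch of the construction just given (my own illustration, assuming NumPy): diagonalize with numpy.linalg.eigh, take square roots of the eigenvalues, and reassemble.

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])          # symmetric, eigenvalues 1 and 3

    # eigh returns eigenvalues (ascending) and orthonormal eigenvectors for Hermitian A
    lam, U = np.linalg.eigh(A)
    A_half = U @ np.diag(np.sqrt(lam)) @ U.T

    print(np.allclose(A_half @ A_half, A))   # True: (A^{1/2})^2 = A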
23. An Application to Statistics

A random vector is a function X : Ω → R^p, where Ω is a probability
space. A probability space involves also a σ-algebra of measurable
sets F and a probability measure P : F → [0, 1]. In practice, people
often don't worry too much about the underlying probability
space and instead pay more attention to the distribution measure
of the random variable. For E a suitable subset of R^p, this measure
gives the probability that X has values in E. There are often excellent
reasons for believing that a random vector is normally
distributed (central limit theorem). This means that the probability
that X has values in a set E is given by

$$\int_E \frac{1}{(2\pi)^{p/2}\det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2}(\mathbf{x} - \mathbf{m})^*\Sigma^{-1}(\mathbf{x} - \mathbf{m}) \right) d\mathbf{x}.$$

The expression in the integral is called the normal probability
density function. There are two parameters, m and Σ, where m is
called the mean and Σ is called the covariance matrix. It is a
symmetric matrix which has all real eigenvalues which are all positive.
While it may be reasonable to assume this is the distribution, in
general you won't know m and Σ, and in order to use this formula
to predict anything, you would need to know these quantities.
What people do to estimate these is to take n independent
observations x_1, . . ., x_n and try to predict what m and Σ should
be based on these observations. One criterion used for making this
determination is the method of maximum likelihood. In this
method, you seek to choose the two parameters Σ, m in such a
way as to maximize the likelihood, which is given as

$$\prod_{i=1}^n \frac{1}{\det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2}(\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1}(\mathbf{x}_i - \mathbf{m}) \right).$$

For convenience the term (2π)^{p/2} was ignored. This leads to the
estimate for m:

$$\mathbf{m} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i \equiv \bar{\mathbf{x}}.$$

This part follows fairly easily from taking the ln and then setting
partial derivatives equal to 0. The estimation of Σ is harder. However,
it is not too hard using the theorems presented above. I am
following a nice discussion given in Wikipedia. It will make use of
Theorem 8.11 on the trace as well as the theorem about the square
root of a linear transformation given above. First note that the trace
of a 1 × 1 matrix is the single entry of the matrix. Therefore, by
Theorem 8.11,

$$(\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1}(\mathbf{x}_i - \mathbf{m}) = \operatorname{trace}\left( (\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1}(\mathbf{x}_i - \mathbf{m}) \right) = \operatorname{trace}\left( (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1} \right).$$
Therefore, the thing to maximize is

$$\begin{aligned} \prod_{i=1}^n \frac{1}{\det(\Sigma)^{1/2}} &\exp\left( -\frac{1}{2}\operatorname{trace}\left( (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1} \right) \right) \\ &= \det\left( \Sigma^{-1} \right)^{n/2} \exp\left( -\frac{1}{2}\sum_{i=1}^n \operatorname{trace}\left( (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*\Sigma^{-1} \right) \right) \\ &= \det\left( \Sigma^{-1} \right)^{n/2} \exp\left( -\frac{1}{2}\operatorname{trace}\left( \underbrace{\sum_{i=1}^n (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*}_{S}\; \Sigma^{-1} \right) \right) \\ &= \det\left( \Sigma^{-1} \right)^{n/2} \exp\left( -\frac{1}{2}\operatorname{trace}\left( S\Sigma^{-1} \right) \right), \end{aligned}$$

where S is the p × p matrix indicated above and m is defined as x̄. Now S is
symmetric and has eigenvalues which are all nonnegative, because

$$(S\mathbf{y}, \mathbf{y}) = \sum_{i=1}^n \left( (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*\mathbf{y}, \mathbf{y} \right) = \sum_{i=1}^n \left| (\mathbf{x}_i - \mathbf{m})^*\mathbf{y} \right|^2 \geq 0.$$

Therefore, by Theorem 22.1, S has a self-adjoint square root. Using
Theorem 8.11 again, the above equals

$$\det\left( \Sigma^{-1} \right)^{n/2} \exp\left( -\frac{1}{2}\operatorname{trace}\left( S^{1/2}\Sigma^{-1}S^{1/2} \right) \right).$$

Let B = S^{1/2}Σ^{-1}S^{1/2} and assume det(S) ≠ 0. Then Σ^{-1} = S^{-1/2}BS^{-1/2}.
The above equals

$$\det\left( S^{-1} \right)^{n/2}\det(B)^{n/2} \exp\left( -\frac{1}{2}\operatorname{trace}(B) \right).$$
Of course the thing to estimate is only found in B. Therefore,
det(S^{-1}) can be discarded in trying to maximize things. Since B is
symmetric, it is similar to a diagonal matrix D which has λ_1, . . ., λ_p
down the diagonal. Thus it is desired to maximize

$$\left( \prod_{i=1}^p \lambda_i \right)^{n/2} \exp\left( -\frac{1}{2}\sum_{i=1}^p \lambda_i \right).$$

Taking ln, it follows that it suffices to maximize

$$\frac{n}{2}\sum_{i=1}^p \ln\lambda_i - \frac{1}{2}\sum_{i=1}^p \lambda_i.$$

Taking the derivative with respect to λ_i,

$$\frac{n}{2\lambda_i} - \frac{1}{2} = 0,$$

and so λ_i = n.
It follows from the above that Σ = S^{1/2}B^{-1}S^{1/2}, where B^{-1} has only
the eigenvalue 1/n. It follows B^{-1} must equal the diagonal matrix
which has 1/n down the diagonal. The reason for this is that B is
similar to a diagonal matrix because it is symmetric. Thus

$$B^{-1} = P^{-1}\frac{1}{n}IP = \frac{1}{n}I,$$

because the identity commutes with every matrix. But now it follows
that Σ = (1/n)S. Of course this is just an estimate and so we write Σ̂
instead of Σ.
This has shown that the maximum likelihood estimate for Σ is
the following, for m = x̄:

$$\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^*.$$
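In code, the two maximum likelihood estimates are one line each. Here is a NumPy sketch (my own addition; the random data and names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))        # n = 200 observations of a p = 3 vector

    m_hat = X.mean(axis=0)               # sample mean, the MLE for m
    centered = X - m_hat
    Sigma_hat = centered.T @ centered / len(X)   # (1/n) * sum (x_i - m)(x_i - m)^T

    # numpy.cov divides by n - 1 by default; bias=True gives the 1/n MLE version
    print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))   # True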
24. The Singular Value Decomposition

In this section, A will be an m × n matrix. To begin with, here is a
simple lemma.

Lemma 24.1: Let A be an m × n matrix. Then A^*A is self-adjoint, and all
its eigenvalues are nonnegative.

Proof: It is obvious that A^*A is self-adjoint. Suppose A^*Ax = λx. Then

$$\lambda|\mathbf{x}|^2 = (\lambda\mathbf{x}, \mathbf{x}) = (A^*A\mathbf{x}, \mathbf{x}) = (A\mathbf{x}, A\mathbf{x}) \geq 0.$$

Definition 24.2: Let A be an m × n matrix. The singular values of A are
the square roots of the positive eigenvalues of A^*A.

With this definition and lemma, here is the main theorem on
the singular value decomposition. In all that follows, I will write the
following partitioned matrix:

$$\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},$$

where σ denotes a k × k diagonal matrix of the form

$$\begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_k \end{pmatrix},$$

and the bottom row of zero matrices in the partitioned matrix, as
well as the right columns of zero matrices, are each of the right size.
Either could vanish completely. However, I will write it in the above
form. It is easy to make the necessary adjustments in the other two
cases.
Theorem 24.3: Let A be an m × n matrix. Then there exist unitary
matrices U and V of the appropriate size such that

$$U^* A V = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},$$

where σ is of the form

$$\sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_k \end{pmatrix}$$

for σ_i the singular values of A, arranged in order of decreasing size.

Proof: By the above lemma and Theorem 21.1, there exists an orthonormal
basis {v_i}_{i=1}^n such that A^*Av_i = σ_i^2 v_i, where σ_i > 0 for i = 1, . . ., k,
and σ_i^2 = 0 if i > k. Just order the columns of V, where V^*A^*AV =
D, according to decreasing eigenvalues. Thus for i > k, Av_i = 0, because

$$(A\mathbf{v}_i, A\mathbf{v}_i) = (A^*A\mathbf{v}_i, \mathbf{v}_i) = (0, \mathbf{v}_i) = 0.$$

For i = 1, . . ., k, define u_i ∈ F^m by

$$\mathbf{u}_i \equiv \sigma_i^{-1}A\mathbf{v}_i.$$

Thus Av_i = σ_i u_i. Now

$$\left( \mathbf{u}_i, \mathbf{u}_j \right) = \left( \sigma_i^{-1}A\mathbf{v}_i, \sigma_j^{-1}A\mathbf{v}_j \right) = \left( \sigma_i^{-1}\mathbf{v}_i, \sigma_j^{-1}A^*A\mathbf{v}_j \right) = \left( \sigma_i^{-1}\mathbf{v}_i, \sigma_j^{-1}\sigma_j^2\mathbf{v}_j \right) = \frac{\sigma_j}{\sigma_i}\left( \mathbf{v}_i, \mathbf{v}_j \right) = \delta_{ij}.$$

Thus {u_i}_{i=1}^k is an orthonormal set of vectors in F^m. Also,

$$AA^*\mathbf{u}_i = AA^*\sigma_i^{-1}A\mathbf{v}_i = \sigma_i^{-1}A(A^*A)\mathbf{v}_i = \sigma_i^{-1}A\sigma_i^2\mathbf{v}_i = \sigma_i^2\mathbf{u}_i.$$

Now extend {u_i}_{i=1}^k to an orthonormal basis for all of F^m, {u_i}_{i=1}^m,
and let

$$U \equiv (\mathbf{u}_1 \; \cdots \; \mathbf{u}_m)$$

while

$$V \equiv (\mathbf{v}_1 \; \cdots \; \mathbf{v}_n).$$

Thus U is the matrix which has the u_i as columns and V is defined as
the matrix which has the v_i as columns. Then

$$U^* A V = \begin{pmatrix} \mathbf{u}_1^* \\ \vdots \\ \mathbf{u}_k^* \\ \vdots \\ \mathbf{u}_m^* \end{pmatrix} A\,(\mathbf{v}_1 \; \cdots \; \mathbf{v}_n) = \begin{pmatrix} \mathbf{u}_1^* \\ \vdots \\ \mathbf{u}_k^* \\ \vdots \\ \mathbf{u}_m^* \end{pmatrix}(\sigma_1\mathbf{u}_1 \; \cdots \; \sigma_k\mathbf{u}_k \; 0 \; \cdots \; 0) = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},$$

where σ is given in the statement of the theorem.

There is some interesting terminology connected to the singular
value decomposition. The vectors

$$\mathbf{y}_i = A\mathbf{v}_i = \sigma_i\mathbf{u}_i, \quad i \leq k,$$

obtained in the construction of the singular value decomposition
are called the principal component vectors. As pointed out in the
above derivation, these principal component vectors are orthogonal.
The singular value decomposition has as an immediate corollary
the following interesting result.

Corollary 24.4: Let A be an m × n matrix. Then the rank of A and the
rank of A^* both equal the number of singular values of A.

Proof: Since V and U are unitary, they are each one to one and onto, and so
it follows that

$$\operatorname{rank}(A) = \operatorname{rank}(U^*AV) = \operatorname{rank}\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} = \text{number of singular values}.$$

Also, since U, V are unitary,

$$\operatorname{rank}(A^*) = \operatorname{rank}(V^*A^*U) = \operatorname{rank}\left( (U^*AV)^* \right) = \operatorname{rank}\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^* = \text{number of singular values}. \quad\blacksquare$$
25. Approximation in the Frobenius Norm

A norm is a measure of magnitude which satisfies the axioms of a
norm, which are:

$$\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|,$$
$$\|\alpha\mathbf{u}\| = |\alpha|\,\|\mathbf{u}\|,$$
$$\|\mathbf{u}\| \geq 0 \text{ and equals } 0 \text{ only if } \mathbf{u} = 0.$$

The Frobenius norm is one of many norms for a matrix. It is
arguably the most obvious. Here is its definition.

Definition 25.1: Let A be a complex m × n matrix. Then

$$\|A\|_F \equiv \left( \operatorname{trace}(AA^*) \right)^{1/2}.$$

Also this norm comes from the inner product

$$(A, B)_F \equiv \operatorname{trace}(AB^*).$$

Thus ‖A‖_F^2 is easily seen to equal Σ_{ij} |a_{ij}|^2, so essentially, it treats the
matrix as a vector in F^{mn}.

Lemma 25.2: Let A be an m × n complex matrix with singular matrix

$$\Sigma = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},$$

with σ as defined above. Then

$$\|\Sigma\|_F^2 = \|A\|_F^2, \tag{42}$$

and the following hold for the Frobenius norm: if U, V are unitary
and of the right size,

$$\|UA\|_F = \|A\|_F, \quad \|UAV\|_F = \|A\|_F. \tag{43}$$
Proof: From the definition, and letting U, V be unitary and of the right
size,

$$\|UA\|_F^2 \equiv \operatorname{trace}(UAA^*U^*) = \operatorname{trace}(AA^*) = \|A\|_F^2.$$

Also,

$$\|AV\|_F^2 \equiv \operatorname{trace}(AVV^*A^*) = \operatorname{trace}(AA^*) = \|A\|_F^2.$$

It follows that

$$\|UAV\|_F^2 = \|AV\|_F^2 = \|A\|_F^2.$$

Now consider Eq. 42. From what was just shown,

$$\|A\|_F^2 = \|U\Sigma V^*\|_F^2 = \|\Sigma\|_F^2.$$

Of course, this also shows that

$$\|A\|_F^2 = \sum_i \sigma_i^2,$$

the sum of the squares of the singular values of A.
Why is the singular value decomposition important? It implies

$$A = U\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}V^*,$$

where σ is the diagonal matrix having the singular values down the
diagonal. Now sometimes A is a huge matrix, 1,000 × 2,000 or
something like that. This happens in applications to situations
where the entries of A describe a picture. What also happens is
that most of the singular values are very small. What if you deleted
those which were very small, say for all i > l, and got a new matrix

$$A' \equiv U\begin{pmatrix} \sigma' & 0 \\ 0 & 0 \end{pmatrix}V^*?$$

Then the entries of A′ would end up being close to the entries of A,
but there is much less information to keep track of. This turns out
to be very useful. More precisely, letting

$$\sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_r \end{pmatrix}, \quad U^*AV = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},$$

$$\|A - A'\|_F^2 = \left\| U\begin{pmatrix} \sigma - \sigma' & 0 \\ 0 & 0 \end{pmatrix}V^* \right\|_F^2 = \sum_{k=l+1}^r \sigma_k^2.$$

Thus A is approximated by A′, where A′ has rank l < r. In fact, it is
also true that out of all matrices of rank l, this A′ is the one which is
closest to A in the Frobenius norm. Here is why.
Let B be a matrix which has rank l. Then from Lemma 25.2,

$$\|A - B\|_F^2 = \|U^*(A - B)V\|_F^2 = \|U^*AV - U^*BV\|_F^2 = \left\| \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} - U^*BV \right\|_F^2,$$

and since the singular values of A decrease from the upper left to
the lower right, it follows that for B to be as close as possible to A in
the Frobenius norm,

$$U^*BV = \begin{pmatrix} \sigma' & 0 \\ 0 & 0 \end{pmatrix},$$

which implies B = A′ above.
This is obvious if you look at a simple example. Say

$$\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},$$

for example. Then what rank 1 matrix would be closest to this one
in the Frobenius norm? Obviously

$$\begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}.$$
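The truncation step is a few lines of NumPy. The following sketch of the idea just described is my own addition (random data for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(50, 80))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    l = 10                                   # keep the l largest singular values
    A_l = U[:, :l] @ np.diag(s[:l]) @ Vt[:l, :]

    # Frobenius error equals the square root of the sum of discarded sigma_k^2
    err = np.linalg.norm(A - A_l, 'fro')
    print(np.isclose(err, np.sqrt(np.sum(s[l:] ** 2))))   # True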
26. Finding the Singular Value Decomposition

From the construction, you could find the singular value decomposition
if you can find the eigenvectors and eigenvalues of the
Hermitian matrix A^*A. There are various ways to do this numerically,
some based on the QR algorithm. However, it is not the
purpose of this chapter to go into detail on these methods. Computer
algebra systems based on these methods provide a way to
obtain the singular value decomposition of a matrix. I will describe
how to use Maple to do this.
1. First open Maple. Then select "file" followed by "new" followed by
"worksheet mode." If you want to see things in the usual math
notation, be sure you have selected "math" on the top left corner
of the worksheet. It doesn't really matter, but it will look
different if you don't do this. On later versions of Maple, this
is the default anyway, so you don't have to do anything. Now
you will have something which looks like >.
2. Type the following next to it: with(LinearAlgebra); A:=
Matrix([[1,2,3],[4,5,6]]): To get the matrix between those
parentheses, just place the cursor there and under the gray box
labeled "matrix," select the size of your matrix, and then click on
"Insert Matrix" on the left side of the screen. This will give a
template and you replace the symbols with numbers. Press
return. This will place a blue copy of the matrix in the middle
of the screen. You have just defined the matrix A.
3. On the following line, after >, you type SingularValues(A,
output=['U','S','Vt']) and then press enter. It will give you
the left unitary matrix, followed by a list of the singular values,
followed by the right unitary matrix, such that A = U S Vt, where
S is the matrix $\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}$ for σ the diagonal matrix which
consists of the singular values listed in the middle of the
above output. The little primes in the output are obtained
from pressing the key which has quotes on the top and a single
small vertical mark on the bottom.
If you just want the singular values, enter the matrix in the
usual way. Then right click on the matrix and select "eigenvalues,
etc." then select "singular values." If you want these to be in terms of
decimals instead of complicated radicals, make sure at least one
entry in the matrix has a decimal point. For example, type 5.0 rather
than 5.
Example 26.1: Find the singular value decomposition for the matrix

$$\begin{pmatrix} 1 & 2 & 1 & 2 \\ 3 & 0 & -4 & 3 \\ 2 & 1 & 1 & 1 \end{pmatrix}.$$

First I open Maple. Then click "file" and then "worksheet mode."
Next I need to define the matrix, so I enter it as described above and
press return, which gives a blue copy of the matrix in the middle of
the screen. Next I go to > right below this, type according to
the above instructions, and then press return. This yields the
singular values

$$6.03322, \quad 3.64391, \quad 1.14983.$$

It also yields the two unitary transformations U and V such that
$A = U\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}V^T$.
There are also easy-to-use computer algebra systems which
make this even easier. If you have Scientific Notebook or Scientific
Workplace, enter a matrix and, with the cursor in the matrix, go to
the tool bar and click on the correct things, "matrices," and then
"SVD." It will produce it for you with no effort on your part.
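The same computation in Python (a sketch I am adding; numpy.linalg.svd does the work):

    import numpy as np

    A = np.array([[1.0, 2.0, 1.0, 2.0],
                  [3.0, 0.0, -4.0, 3.0],
                  [2.0, 1.0, 1.0, 1.0]])

    U, s, Vt = np.linalg.svd(A)
    print(s)                                   # [6.03322... 3.64391... 1.14983...]

    # reassemble A = U S Vt with S of shape (3, 4)
    S = np.zeros_like(A)
    S[:3, :3] = np.diag(s)
    print(np.allclose(A, U @ S @ Vt))          # True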
27. Least Squares and Singular Value Decomposition

A least squares solution to the problem

$$A\mathbf{x} = \mathbf{y},$$

where A is an m × n matrix, is a vector x ∈ F^n such that Ax − y is as
small as possible. That is, |Ax − y| ≤ |Az − y| for all z ∈ F^n.
(The picture accompanying this passage in the original shows y
together with the subspace A(F^n); the least squares solution x is the
one for which A(x) is the point of A(F^n) closest to y.) This gives the
idea of the following proposition.

Proposition 27.1: The vector x is a least squares solution to Ax = y if and
only if A^*A\mathbf{x} = A^*\mathbf{y}.

Proof: Let z ∈ F^n and let θ ∈ C, |θ| = 1, be such that

$$\theta(\mathbf{y} - A\mathbf{x}, A\mathbf{x} - A\mathbf{z}) = |(\mathbf{y} - A\mathbf{x}, A\mathbf{x} - A\mathbf{z})|.$$

The properties of the inner product apply, and it follows that for t ∈ R,

$$\begin{aligned} |\mathbf{y} - A(\mathbf{x} + t\theta(\mathbf{x} - \mathbf{z}))|^2 &= (\mathbf{y} - A\mathbf{x} - t\theta(A\mathbf{x} - A\mathbf{z}),\; \mathbf{y} - A\mathbf{x} - t\theta(A\mathbf{x} - A\mathbf{z})) \\ &= |\mathbf{y} - A\mathbf{x}|^2 + t^2|A\mathbf{x} - A\mathbf{z}|^2 - 2t|(A^*\mathbf{y} - A^*A\mathbf{x}, \mathbf{x} - \mathbf{z})|. \end{aligned} \tag{44}$$

If x is a least squares solution, then this is no smaller than |y − Ax|^2,
which requires

$$t^2|A\mathbf{x} - A\mathbf{z}|^2 - 2t|(A^*\mathbf{y} - A^*A\mathbf{x}, \mathbf{x} - \mathbf{z})| \geq 0.$$

If |(A^*y − A^*Ax, x − z)| > 0, then the above could not be valid for
all t, as seen by taking t very small and positive. It follows that

$$(A^*\mathbf{y} - A^*A\mathbf{x}, \mathbf{x} - \mathbf{z}) = 0$$

for all z ∈ F^n. Since z is arbitrary, this requires that

$$A^*\mathbf{y} - A^*A\mathbf{x} = 0.$$

If the condition A^*Ax = A^*y holds, then for arbitrary z,

$$\begin{aligned} |\mathbf{y} - A\mathbf{z}|^2 &= |\mathbf{y} - A\mathbf{x} + A(\mathbf{x} - \mathbf{z})|^2 \\ &= |\mathbf{y} - A\mathbf{x}|^2 + |A(\mathbf{x} - \mathbf{z})|^2 + 2\operatorname{Re}(\mathbf{y} - A\mathbf{x}, A(\mathbf{x} - \mathbf{z})) \\ &= |\mathbf{y} - A\mathbf{x}|^2 + |A(\mathbf{x} - \mathbf{z})|^2 + 2\operatorname{Re}(A^*\mathbf{y} - A^*A\mathbf{x}, \mathbf{x} - \mathbf{z}) \\ &= |\mathbf{y} - A\mathbf{x}|^2 + |A(\mathbf{x} - \mathbf{z})|^2 \geq |\mathbf{y} - A\mathbf{x}|^2, \end{aligned}$$

which yields that x is a least squares solution.
Do least squares solutions exist? Yes, they always do, and there is
an interesting connection with the singular value decomposition. It
is desired to find a solution to the equation A^*Ax = A^*y. Consider
this equation in terms of the singular value decomposition:

$$\underbrace{V\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}U^*}_{A^*}\; \underbrace{U\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}V^*}_{A}\; \mathbf{x} = \underbrace{V\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}U^*}_{A^*}\; \mathbf{y}.$$

Therefore, this yields the following, using block multiplication and
multiplying on the left by V^*:

$$\begin{pmatrix} \sigma^2 & 0 \\ 0 & 0 \end{pmatrix}V^*\mathbf{x} = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}U^*\mathbf{y}. \tag{45}$$

One solution to this equation which is very easy to spot is

$$\mathbf{x} = V\begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*\mathbf{y}. \tag{46}$$

In all of these manipulations, the zero blocks are adjusted so
that all matrix multiplication makes sense, as described earlier.
28. Linear Regressions

An important application of least squares is to the problem of
finding a straight line approximating some data. Thus you are
given data points (t_i, x_i), i = 1, . . ., n, and you would like to find m
and b such that the line x = mt + b has the property that
x_i = mt_i + b. Of course this will be impossible to do, and this is
why you look for a least squares solution to the system

$$\begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_n \end{pmatrix}\begin{pmatrix} b \\ m \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$

Thus you want to find a solution to

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}\begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_n \end{pmatrix}\begin{pmatrix} b \\ m \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.$$

Simplifying this, you need

$$\begin{pmatrix} n & \sum_i t_i \\ \sum_i t_i & \sum_i t_i^2 \end{pmatrix}\begin{pmatrix} b \\ m \end{pmatrix} = \begin{pmatrix} \sum_i x_i \\ \sum_i t_i x_i \end{pmatrix}.$$

Solving this for b and m yields the formulas for the linear regression
line:

$$b = \frac{\left( \sum_i x_i \right)\left( \sum_i t_i^2 \right) - \left( \sum_i t_i \right)\left( \sum_i t_i x_i \right)}{n\sum_i t_i^2 - \left( \sum_i t_i \right)^2},$$

$$m = \frac{n\sum_i t_i x_i - \left( \sum_i x_i \right)\left( \sum_i t_i \right)}{n\sum_i t_i^2 - \left( \sum_i t_i \right)^2}.$$
People used to plug in and compute these things and then draw
the graphs on graph paper. Fortunately you don't have to work so
hard anymore. Suppose you wanted to find the regression line
which goes with the data (1, 3), (1, 4), (1, 5), (2, 4), (2, 7), (3, 8),
(4, 11), and (5, 7). Here is how you can do it using Maple 12.
1. Open Maple. Then click on "file" and then click on "worksheet
mode." At > type the following: with(Statistics) and press
return. It will display all sorts of stuff which you can ignore.
2. At the next > you type X:=Vector([1,1,1,2,2,3,4,5]); Y:=
Vector([3,4,5,4,7,8,11,7]) (The X comes from the first
components in the data and the Y comes from the second
components.) Press return. This has now defined X and Y and they
will be displayed in blue on the screen.
3. At the next > you type the following: Fit(a*x+b,X,Y,x) and
then press return. It will then give you the desired equation. In
this case it is

$$1.29921259842519632x + 3.03937007874015785.$$

Now suppose you want to graph this along with the data
points. You do the following.
4. At the next > you type the following: with(plots); A:=plot(X,
Y,style=point, symbol=asterisk, symbolsize=15, color=red);
B:=plot(1.29921259842519632*x+3.03937007874015785,
x=0..6, color=black, tickmarks=[4, 4], font=[TIMES,
Bold,24]); display(A,B); and then press return. This will graph
the regression line along with the data points, which will be
plotted in this case as asterisks.
You can adjust the various parameters if you like. If you want to
do another one, just go back and redefine your vectors, press return,
and so on. It does all the work for you. (In the given example, the
returned plot shows the eight data points together with the fitted
line; the figure itself is not reproduced here.)
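The same fit in Python (a sketch I am adding; numpy.polyfit returns the slope and intercept of the least squares line):

    import numpy as np

    t = np.array([1, 1, 1, 2, 2, 3, 4, 5], dtype=float)
    x = np.array([3, 4, 5, 4, 7, 8, 11, 7], dtype=float)

    m, b = np.polyfit(t, x, 1)       # degree-1 least squares fit
    print(m, b)                      # 1.2992125984..., 3.0393700787... (= 165/127, 386/127)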
29. The Moore-Penrose Inverse

The particular solution of the least squares problem given in Eq. 46
is important enough that it motivates the following definition.

Definition 29.1: Let A be an m × n matrix. Then the Moore-Penrose
inverse of A, denoted by A⁺, is defined as

$$A^+ \equiv V\begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*.$$

Here

$$U^* A V = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}$$

as above.

Thus A⁺y is a solution to the minimization problem of finding x
which minimizes |Ax − y|. In fact, one can say more about this.
(The picture accompanying this passage in the original shows M_y,
the set of least squares solutions x satisfying A^*Ax = A^*y, drawn as
an affine set parallel to ker(A^*A); the point A⁺(y) is the element of
M_y closest to the origin.)
Proposition 29.2: A⁺y is the solution to the problem of minimizing
|Ax − y| over all x which has smallest norm. Thus

$$\left| AA^+\mathbf{y} - \mathbf{y} \right| \leq |A\mathbf{x} - \mathbf{y}| \text{ for all } \mathbf{x},$$

and if x_1 satisfies |Ax_1 − y| ≤ |Ax − y| for all x, with |x_1| smallest among
all such minimizers, then A⁺y = x_1.

Proof: Consider x satisfying Eq. 45, equivalently A^*Ax = A^*y. Then

$$\begin{pmatrix} \sigma^2 & 0 \\ 0 & 0 \end{pmatrix}V^*\mathbf{x} = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}U^*\mathbf{y},$$

and you want the solution x which has smallest norm. This is equivalent
to making |V^*x| as small as possible, because V^* is unitary and so it
preserves norms. For z a vector, denote by z_k the vector in F^k which
consists of the first k entries of z. Then if x is a solution to Eq. 45,

$$\begin{pmatrix} \sigma^2 (V^*\mathbf{x})_k \\ 0 \end{pmatrix} = \begin{pmatrix} \sigma (U^*\mathbf{y})_k \\ 0 \end{pmatrix},$$

and so (V^*x)_k = σ^{-1}(U^*y)_k. Thus the first k entries of V^*x are
determined. In order to make |V^*x| as small as possible, the remaining
n − k entries should equal zero. Therefore,

$$V^*\mathbf{x} = \begin{pmatrix} (V^*\mathbf{x})_k \\ 0 \end{pmatrix} = \begin{pmatrix} \sigma^{-1}(U^*\mathbf{y})_k \\ 0 \end{pmatrix} = \begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*\mathbf{y},$$

and so

$$\mathbf{x} = V\begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*\mathbf{y} \equiv A^+\mathbf{y}.$$
Lemma 29.3: The matrix A⁺ satisfies the following conditions:

$$AA^+A = A, \quad A^+AA^+ = A^+, \quad A^+A \text{ and } AA^+ \text{ are Hermitian}. \tag{47}$$

Proof: This is routine. Recall

$$A = U\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}V^*$$

and

$$A^+ = V\begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*,$$

so you just plug in and verify it works.
A much more interesting observation is that A⁺ is characterized
as being the unique matrix which satisfies Eq. 47. This is the
content of the following theorem. A proof is given in books
which discuss this subject. A web source may be found in the linear
algebra book on the web page http://math.byu.edu/~klkuttle/

Theorem 29.4: Let A be an m × n matrix. Then a matrix A₀ is the Moore-
Penrose inverse of A if and only if A₀ satisfies

$$AA_0A = A, \quad A_0AA_0 = A_0, \quad A_0A \text{ and } AA_0 \text{ are Hermitian}. \tag{48}$$

The theorem is significant because there is no mention of eigenvalues
or eigenvectors in the characterization of the Moore-Penrose
inverse given in Eq. 48. It also shows immediately that the Moore-
Penrose inverse is a generalization of the usual inverse. This is because
the usual inverse satisfies the Penrose conditions given above.
The Moore-Penrose inverse can be computed in terms of
standard inverses, as explained in the following proposition.

Proposition 29.5: The following limit is valid:

$$\lim_{\delta \to 0+} (A^*A + \delta I)^{-1}A^* = A^+.$$
Proof: This is an application of the singular value decomposition. Recall

$$A = U\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}V^*, \quad A^* = V\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^*.$$

Therefore,

$$A^*A + \delta I = V\begin{pmatrix} \sigma^2 + \delta I & 0 \\ 0 & \delta I \end{pmatrix}V^*,$$

and so it clearly has an inverse. Now also recall that A⁺ is given by
$A^+ = V\begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix}U^*$, while

$$\begin{aligned} (A^*A + \delta I)^{-1}A^* &= V\begin{pmatrix} \left( \sigma^2 + \delta I \right)^{-1} & 0 \\ 0 & \delta^{-1}I \end{pmatrix}V^*\, V\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^* \\ &= V\begin{pmatrix} \left( \sigma^2 + \delta I \right)^{-1} & 0 \\ 0 & \delta^{-1}I \end{pmatrix}\begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^* \\ &= V\begin{pmatrix} \left( \sigma^2 + \delta I \right)^{-1}\sigma & 0 \\ 0 & 0 \end{pmatrix}U^*, \end{aligned}$$

after adjusting the size of the zero blocks as described above. Now a
short computation verifies that

$$\left( \sigma^2 + \delta I \right)^{-1}\sigma - \sigma^{-1} = -\delta\left( \sigma^2 + \delta I \right)^{-1}\sigma^{-1}.$$

Now recall the formula for the inverse of a matrix given in Theorem
16.1. Using this formula, it is clear that lim_{δ→0} −δ(σ^2 + δI)^{-1}σ^{-1} = 0,
in the sense that each entry of the matrices on the left converges
to 0.
Note that from the formula, you could get an idea of approximately
how close it is to the desired solution, and it is not a very
good way to approximate if the entries of (σ^2 + δI)^{-1}σ^{-1} are large.
This would happen, for example, if some singular values are small.
Still, it is an interesting formula.
Example 29.6: Find an approximation for the Moore-Penrose inverse A⁺ if

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix}.$$

To do it right, you would get the singular value decomposition
and use it. However, you can also pick a small δ > 0 and follow the
above procedure:

$$\left( \begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix}^T\begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix} + 0.0001\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right)^{-1}\begin{pmatrix} 1.0 & 4.0 \\ 2.0 & -2.0 \\ 3.0 & 3.0 \end{pmatrix} \approx \begin{pmatrix} -0.02152 & 0.14463 \\ 0.23385 & -0.14153 \\ 0.18462 & 0.04617 \end{pmatrix}.$$

It is hoped this would be close. You could try another δ and see if it
varies by much. You could also find it directly from the above
formula for A⁺ involving the singular value decomposition. If you
do it this way, you find that it should be

$$A^+ = \begin{pmatrix} -2.15384616 \times 10^{-2} & 0.144615385 \\ 0.233846154 & -0.141538462 \\ 0.184615385 & 4.61538464 \times 10^{-2} \end{pmatrix}.$$
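Both routes are available in NumPy. Here is a sketch (my addition) comparing numpy.linalg.pinv with the δ-regularized formula:

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, -2.0, 3.0]])

    A_pinv = np.linalg.pinv(A)       # Moore-Penrose inverse via the SVD
    print(A_pinv)

    # the regularized approximation (A*A + delta I)^{-1} A* for small delta
    delta = 1e-4
    approx = np.linalg.solve(A.T @ A + delta * np.eye(3), A.T)
    print(np.abs(approx - A_pinv).max())   # small, and it shrinks as delta -> 0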
References

1. Baker R (2001) Linear algebra. Rinton Press, Princeton, NJ
2. Friedberg S, Insel A, Spence L (2003) Linear algebra. Prentice Hall, Upper Saddle River, NJ
3. Golub G, Van Loan C (1996) Matrix computations. Johns Hopkins University Press, Baltimore, MD
4. Hofman K, Kunze R (1971) Linear algebra. Prentice Hall, Upper Saddle River, NJ
5. Householder A (1975) The theory of matrices in numerical analysis. Dover, New York
6. Horn R, Johnson C (1985) Matrix analysis. Cambridge University Press, Cambridge, UK
7. Marcus M, Minc H (1964) A survey of matrix theory and matrix inequalities. Allyn and Bacon, Boston
8. Nobel B, Daniel J (1977) Applied linear algebra. Prentice Hall, Upper Saddle River, NJ
9. Strang G (1980) Linear algebra and its applications. Harcourt Brace Jovanovich, San Diego, CA
Chapter 20

Ordinary Differential Equations

Jiří Lebl

Abstract

In this chapter we provide an overview of the basic theory of ordinary differential equations (ODE).
We give the basics of analytical methods for their solutions and also review numerical methods. The chapter
should serve as a primer for the basic application of ODEs and systems of ODEs in practice. As an example,
we work out the equations arising in Michaelis-Menten kinetics and give a short introduction to using
Matlab for their numerical solution.

Key words: Ordinary differential equations, ODE, Systems of ODE, Numerical methods, Matlab,
Michaelis-Menten kinetics
1. Introduction
to Differential
Equations
A differential equation is simply an equation that depends not only
on the value of a variable but also on its derivatives. For example,
1.1. Differential Newtons law of cooling with a varying ambient temperature yields
Equations the equation
dx
x 2 cos t: (1)
dt
Here x is the dependent variable (temperature) and t is the indepen-
dent variable (time). A solution to (1) is simply a function x(t) that
satisfies the equation. All solutions to (1) can be written as
xt cos t sin t Ce t ;
for an arbitrary constant C. If we write an expression for all solu-
tions of a differential equation we often call it the general solution.
Suppose the temperature at a given time is known (an initial
condition). For example, x(0) x0 for some constant x0. We solve
for C to obtain C x 0  1. We obtain a particular solution that

Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_20, # Springer Science+Business Media, LLC 2013

475
476 J. Lebl

0 1 2 3 4 5
2 2

1 1

0 0

1 1

2 2
0 1 2 3 4 5

Fig. 1. Slope field and some solutions of x0 x 2 cos t.

satisfies our initial condition. In Fig. 1 we plot the solutions for three different initial conditions, x₀ = 1.5, x₀ = 0, and x₀ = 2. The figure also gives the so-called slope field of the equation. On a grid of points (t, x) we draw a short line with the slope dx/dt as given by the equation. The slope field can give us a very good general idea of the behavior of the equation without actually finding specific solutions.
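A slope field like the one in Fig. 1 is easy to produce in Matlab or Octave with the quiver function. The following is a minimal sketch (the grid spacing and plotting ranges are our own arbitrary choices, not the settings used to make the figure):

[t, x] = meshgrid(0:0.25:5, -2:0.25:2);  % grid of points (t, x)
s = -x + 2*cos(t);                       % slope dx/dt from equation (1)
quiver(t, x, ones(size(s)), s, 0.5);     % short segments with slope s
xlabel('t'); ylabel('x');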
An equation such as (1) is said to be linear as the equation
depends on x and its derivatives linearly. That is, only first powers of
x and dx/dt appear in (1). The equation is said to be first order as only
the first derivatives of x appear.
Finally equation (1) is said to be an ordinary differential equa-
tion, or ODE, as there is only one independent variable. When
there is more than one independent variable and the equation
depends on partial derivatives the equation is called a partial differ-
ential equation, or PDE. For example the heat equation
$$\frac{\partial^2 u}{\partial x^2} = \frac{\partial u}{\partial t}$$
is a partial differential equation. Ordinary differential equations are
well understood and far easier to handle than PDEs. Fortunately,
ODEs come up often in practice.
A general kth order ODE would be written as
$$F\big(x^{(k)}, x^{(k-1)}, \ldots, x', x, t\big) = 0,$$

for some function F of k + 2 variables. A kth-order linear ODE is an equation of the form

$$F_k(t)x^{(k)} + F_{k-1}(t)x^{(k-1)} + \cdots + F_1(t)x' + F_0(t)x = G(t).$$

When G(t) = 0 for all t, the equation is said to be homogeneous. Commonly the left-hand side models the behavior of the system, and G(t) is external input. If the functions Fⱼ(t) on the left-hand side are constants, the equation is said to be a constant coefficient equation (even if G(t) is not a constant).

1.2. Systems of Differential Equations

When there is more than one dependent variable and more than one equation we will talk of a system of ODEs. Let us write the dependent variables as a vector

$$\vec{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}; \qquad \text{then we write} \qquad \frac{d\vec{x}}{dt} = \begin{pmatrix} dx_1/dt \\ dx_2/dt \\ \vdots \\ dx_n/dt \end{pmatrix}.$$

A first-order system of ODEs can be written as a vector equation

$$\vec{F}\!\left(\frac{d\vec{x}}{dt}, \vec{x}, t\right) = \vec{0},$$

for a vector-valued function F, where 0 represents the vector of all zeros. A linear first-order system of ODEs can be written as

$$A(t)\,\frac{d\vec{x}}{dt} = B(t)\vec{x} + \vec{G}(t),$$

where A(t) and B(t) are matrix-valued functions; when A(t) is invertible then we often assume that A(t) is the identity.
We have deliberately only mentioned first-order systems. The reason is that any higher-order equation (or system) can always be turned into a first-order system in more variables. The process is best understood by example. Suppose that we have a third-order ODE

$$x''' + 2x'' + 3x' + 4x = t. \qquad (2)$$

We define new variables y₁, y₂, y₃. We let y₁ = x, y₂ = x', and y₃ = x''. We obtain two new equations, y₁' = y₂ and y₂' = y₃. Thus we can write down a system

$$\begin{aligned} y_1' - y_2 &= 0,\\ y_2' - y_3 &= 0,\\ y_3' + 2y_3 + 3y_2 + 4y_1 &= t. \end{aligned}$$

Hence if we can solve (analytically or numerically) first-order systems, we can solve any ODE or a system of ODEs.

1.3. Analytical Versus Numerical Solutions

Some ODEs can be solved explicitly by analytical methods. In general, we will not be able to find an explicit formula for the solution, and we will have to be satisfied with a numerical approximation. When analytical methods do apply, such as for linear constant coefficient ODEs, the form of the solution often tells us much more about the behavior of the system than simply knowing its values approximately at certain points. Furthermore, a complicated system can be linearized and the linearization can be solved explicitly to obtain a local approximation.
Analytical methods should also not be shunned for their com-
plexity. Symbolic computation can greatly simplify the effort
required. If we can find an explicit solution symbolically, it is almost
always preferable, both in computational time and gained theoreti-
cal insight, to applying a purely numerical method.
Numerical methods have an advantage of always working.
On the other hand, they only give us an approximation of the
solution, and it is up to us to find out how good the approximation
is. Furthermore, applying numerical methods blindly might lead to
nonsense solutions. We will touch on the topics of error estimation
and stability of algorithms later.

1.4. Further Reading

For further details on differential equations see the author's online book (7) or the many other available books (1, 2, 4, 5, 6). For a reference of ODEs with applications in biology and related sciences see especially part II of (3).

2. Analytical Methods

2.1. First-Order Equations

2.1.1. The Exponential

Perhaps the most important differential equation to understand is the equation of exponential growth or decay:

$$\frac{dx}{dt} = kx, \qquad (3)$$

for some constant k. When k is positive, the equation gives exponential growth; when k is negative we get decay. The equation comes up often, whenever the rate of growth (or decay) of a quantity is proportional to the quantity itself. For example, the equation gives the simplest model of population growth. Without much difficulty we can guess the solution to be the exponential

$$x = Ce^{kt}.$$

Here C is an arbitrary constant. In fact, one way to define and compute the exponential function is precisely as the solution to (3) (with k = 1 and C = 1). It turns out that the majority of interesting

differential equations are solved by using the exponential in various forms.

2.1.2. Picard's Theorem

A first-order differential equation with an initial condition is an equation of the form

$$\frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0. \qquad (4)$$

One of the reasons why ODEs are so well understood is the following theorem. It should be noted that Picard's theorem also holds for systems, but let us state it for a single equation for simplicity.

Theorem 1 (Picard's theorem on existence and uniqueness): If near (t₀, x₀) the function f(t, x) is continuous (as a function of two variables) and ∂f/∂x exists and is continuous, then a solution to (4) exists (at least for some small interval of t's) and is unique.

What the theorem says is that given a reasonably behaved ODE, there is always a unique solution. Perhaps the main subtlety we should point out is that the theorem does not guarantee that the solution exists for all t. It only guarantees that a solution exists for some small interval around t₀.

For example, the equation dx/dt = x² with the initial condition x(0) = 1 has the solution x = 1/(1 − t). While the equation itself is continuous and well behaved for all t and x, the solution blows up when we approach t = 1.

Let us now go over a few analytical methods of obtaining solutions to ODEs.

2.1.3. Integration

Some simple equations such as dx/dt = f(t) or dx/dt = f(x) can be solved by elementary calculus. For example, if our equation is dx/dt = f(t), then of course a solution is given by an antiderivative of f(t), or in other words, x = ∫ f(t) dt. Perhaps a better way of writing the solution is to use definite integration, which could then be solved numerically if necessary. That is, the solution to dx/dt = f(t), x(t₀) = x₀, is

$$x(t) = \int_{t_0}^{t} f(s)\, ds + x_0.$$

When the equation is of the form dx/dt = f(x), x(t₀) = x₀, we have to first apply the inverse function theorem and integrate in terms of x. It may be easier to write the equation in Leibniz notation:

$$\frac{dx}{dt} = f(x) \qquad \text{or} \qquad \frac{dx}{f(x)} = dt.$$

Now we integrate to obtain

$$\int \frac{dx}{f(x)} = t + C.$$

Finally we have to solve for x.

Let us illustrate on dx/dt = x², x(0) = A, for A ≠ 0. First write dx/x² = dt. Then we integrate to obtain

$$\int \frac{dx}{x^2} = t + C.$$

In other words, −1/x = t + C. We solve for x to get x = −1/(C + t). Solving for C we obtain C = −1/A.

2.1.4. Linear Equations

Linear equations appear commonly in applications. They have many nice properties and of all differential equations are perhaps the best understood. Let us first focus on first-order linear equations. A first-order equation is linear if we can put it into the following form:

$$\frac{dx}{dt} + p(t)x = f(t). \qquad (5)$$

Solutions of linear equations have nice properties. For example, the solution exists wherever p(t) and f(t) are defined, and has the same regularity (that is, has the same number of derivatives). But most importantly for us right now, there is a method for solving linear first-order equations.

First we find a function r(t) such that

$$r(t)\frac{dx}{dt} + r(t)p(t)x = \frac{d}{dt}\Big[r(t)x\Big].$$

Then we can multiply both sides of (5) by r(t) to obtain

$$\frac{d}{dt}\Big[r(t)x\Big] = r(t)f(t). \qquad (6)$$

Now we integrate both sides. The right-hand side does not depend on x and the left-hand side is written as a derivative of a function. Afterwards, we can solve for x. The function r(t) is called the integrating factor and the method is called the integrating factor method.

In particular, we let

$$r(t) = e^{\int p(t)\,dt}. \qquad (7)$$

We substitute (7) into (6) and compute:

$$\frac{d}{dt}\Big[e^{\int p(t)\,dt}\, x\Big] = e^{\int p(t)\,dt} f(t),$$

$$e^{\int p(t)\,dt}\, x = \int e^{\int p(t)\,dt} f(t)\, dt + C,$$

$$x = e^{-\int p(t)\,dt}\left(\int e^{\int p(t)\,dt} f(t)\, dt + C\right).$$

Let us illustrate the method on an example. Suppose we wish to solve

$$\frac{dx}{dt} + 2tx = e^{t-t^2}, \qquad x(0) = -1.$$

We note that p(t) = 2t and f(t) = e^{t−t²}. The integrating factor is r(t) = e^{∫p(t)dt} = e^{t²}. We compute

$$e^{t^2}\frac{dx}{dt} + 2te^{t^2}x = e^{t-t^2}e^{t^2},$$

$$\frac{d}{dt}\Big[e^{t^2}x\Big] = e^{t}.$$

We integrate:

$$e^{t^2}x = e^{t} + C,$$

$$x = e^{t-t^2} + Ce^{-t^2}.$$

Finally, we must solve for the initial condition: −1 = x(0) = 1 + C. Therefore C = −2, and the solution is

$$x = e^{t-t^2} - 2e^{-t^2}.$$

Note that we do not care which antiderivative we take when computing ∫p(t)dt. We can always add a constant of integration, but those constants will cancel out in the end.
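A quick numerical sanity check of this worked example, using the ode45 solver discussed later in this chapter, might look as follows (a minimal sketch; the time span [0, 3] is an arbitrary choice):

f = @(t,x) -2*t.*x + exp(t - t.^2);     % dx/dt = -2tx + e^(t - t^2)
[t, x] = ode45(f, [0, 3], -1);          % initial condition x(0) = -1
xexact = exp(t - t.^2) - 2*exp(-t.^2);  % the closed-form solution found above
max(abs(x - xexact))                    % small, on the order of the solver tolerance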
Since we cannot always evaluate the integrals in closed form, it
is useful to write down the indefinite integrals as definite integrals.
Suppose we are given
$$\frac{dx}{dt} + p(t)x = f(t), \qquad x(t_0) = x_0.$$
We write the solution in terms of definite integrals. If we pick the
basepoint for the integral correctly it is easy to find the constant of
integration.
$$x(t) = e^{-\int_{t_0}^{t} p(s)\,ds}\left(\int_{t_0}^{t} e^{\int_{t_0}^{u} p(s)\,ds} f(u)\, du + x_0\right). \qquad (8)$$

2.1.5. Autonomous Equations

Let us consider problems of the form

$$\frac{dx}{dt} = f(x), \qquad x(t_0) = x_0,$$

where the derivative of solutions depends only on x (the dependent variable). These types of equations are called autonomous equations. If we think of t as time, the naming comes from the fact that the equation is independent of time.

Fig. 2. Slope field and some solutions of dx/dt = 0.1 x (5 − x).

Let us consider the logistic equation

$$\frac{dx}{dt} = kx(M - x), \qquad x(0) = x_0,$$

for some positive k and M. This equation is commonly used to model population if we know the limiting population M. That is, M is the maximum sustainable population. The logistic equation leads to less catastrophic predictions on world population than the simple exponential model dx/dt = kx. See Fig. 2 for an example with k = 0.1 and M = 5. Several solutions are drawn for different initial conditions.
Note the solutions x = 0 and x = M (x = 5). We call these types of solutions the equilibrium solutions. The points on the x-axis where f(x) = 0 are called critical points. Each critical point corresponds to an equilibrium solution. Note also, by looking at the graph, that the solution x = M is "stable" in that small perturbations in x do not lead to substantially different solutions as t grows. If we change the initial condition a little bit, then as t → ∞ we get x → M. We call such critical points stable. A critical point such as x = 0 is called an unstable critical point. Small perturbations of the initial condition lead to different behavior as t → ∞. Simply from looking at the graph we can see that

$$\lim_{t\to\infty} x(t) = \begin{cases} 5 & \text{if } x(0) > 0,\\ 0 & \text{if } x(0) = 0,\\ \text{DNE or } -\infty & \text{if } x(0) < 0, \end{cases}$$

where DNE means "does not exist."

Fig. 3. Phase diagram for the logistic equation (critical points x = M and x = 0).

It is not really necessary to find the exact solutions to talk about the long-term behavior of the solutions of autonomous equations.
To decide exactly what happens when x(0) < 0 we would need to
solve the equation explicitly. In fact it turns out that for the logistic
equation the limit of x(t) as t → ∞ does not exist if x(0) < 0. The
reason is that the solution only exists for a finite length of time.
Often we are interested only in the long-term behavior of the
solution and we would be doing unnecessary work if we solved the
equation exactly. It is easier to just look at the phase diagram or phase
portrait, which is a simple way to visualize the behavior of autono-
mous equations. In this case there is one dependent variable x. We
draw the x-axis, we mark all the critical points, and then we draw
arrows in between. If f(x) > 0, we draw an up arrow. If f(x) < 0, we
draw a down arrow. See Fig. 3 for the phase diagram of the logistic
equation.
Once we draw the phase diagram, we can easily classify critical
points as stable or unstable. Critical points that have a phase dia-
gram like x M in the logistic equation are stable. All others are
unstable. Since any mathematical model we cook up will only be an
approximation to the real world, unstable points are generally bad
news.

2.2. Systems of ODEs

Often we do not have just one dependent variable and one equation. We write

$$\vec{F}\!\left(\frac{d\vec{x}}{dt}, \vec{x}, t\right) = \vec{0}$$

for a first-order system of ODEs.
As we mentioned in the introduction, we can write any nth-
order equation as a first-order system of n equations. A similar
process can be followed for a system of higher-order differential
equations. For example, a system of k differential equations in k
unknowns, all of order n, can be transformed into a first-order system of n·k equations in n·k unknowns.

2.2.1. Linear First-Order Systems and Linearization

An example of a system with a rather simple and elegant solution is a constant coefficient first-order system. That is, suppose that we have a constant k × k matrix A and we wish to solve the system

$$\vec{x}\,' = A\vec{x} + \vec{F}(t),$$

for some k-vector x. We will mostly assume that F(t) = 0. That is, we will consider the homogeneous system x' = Ax.
Many of the equations and systems that come up in applications
are neither constant coefficient nor linear, and to solve them we
must satisfy ourselves with a numerical approximation. In this case,
however, we could linearize the equations just as we do in calculus.
We will obtain a constant coefficient linear system, which we can
then analyze easily. This technique can be useful if we wish to study
approximately the behavior for a short period of time. For example,
in the next section we will see that simply studying the eigenvalues
and eigenvectors of the resulting matrix can tell us much about the
behavior of the system. Some of these qualitative observations may
be hard to see if we simply look at numerical solutions. For exam-
ple, we can analyze the stability of equilibrium solutions of an
autonomous system.
For example, we start with a nonlinear autonomous system

$$\frac{d\vec{x}}{dt} = \vec{F}(\vec{x}).$$

We wish to study the system near some point x₀ and for some small period of time. We could instead study the linear constant coefficient system

$$\frac{d\vec{x}}{dt} = J_{\vec{F}}(\vec{x}_0)\,(\vec{x} - \vec{x}_0),$$

where J_F(x₀) is the Jacobian matrix of F evaluated at the point x₀. That is,

$$J_{\vec{F}}(\vec{x}_0) = \begin{pmatrix} \dfrac{\partial F_1}{\partial x_1}\Big|_{\vec{x}_0} & \cdots & \dfrac{\partial F_1}{\partial x_k}\Big|_{\vec{x}_0} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial F_k}{\partial x_1}\Big|_{\vec{x}_0} & \cdots & \dfrac{\partial F_k}{\partial x_k}\Big|_{\vec{x}_0} \end{pmatrix}.$$

Therefore, let us suppose that we have the system

$$\vec{x}\,' = A\vec{x}. \qquad (9)$$

What we are looking for is a so-called fundamental solution. That is, a matrix-valued function X(t) such that any solution to (9) can be written as

$$\vec{x} = X(t)\vec{c},$$

for some constant vector c. In fact, we can always assume that X(0) = I (the identity matrix), and therefore x = X(t)c is the solution satisfying the initial condition x(0) = c.

If A was a number (a 1 × 1 matrix), then e^{tA} would be the fundamental solution. This reasoning can be extrapolated for systems by considering the matrix exponential. Simply define

$$e^{A} \overset{\text{def}}{=} I + A + \frac{1}{2}A^2 + \frac{1}{6}A^3 + \cdots + \frac{1}{k!}A^k + \cdots$$

A note about the matrix exponential that deserves mention is that

$$e^{A+B} \neq e^{A}e^{B}$$

in general. If AB = BA then we do obtain e^{A+B} = e^{A}e^{B}, which can be proved by a formal power series computation.

Also by a formal power series computation we find

$$\frac{d}{dt}\big(e^{tA}\big) = Ae^{tA}.$$

Thus e^{tA}c solves x' = Ax, and e^{tA} is our sought-after fundamental solution.

One could compute the matrix exponential by the Taylor series, but that may not be very efficient. If we can diagonalize a matrix, then computation of the matrix exponential is rather simple. First, consider a diagonal matrix

$$D = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}.$$

As (tD)^k is simply the diagonal matrix whose diagonal elements are (tλⱼ)^k, we find that

$$e^{tD} = \begin{pmatrix} e^{t\lambda_1} & 0 & \cdots & 0 \\ 0 & e^{t\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{t\lambda_n} \end{pmatrix}.$$

Finally it follows from the power series expansion that if A can be diagonalized as A = EDE⁻¹ then

$$e^{tA} = e^{tEDE^{-1}} = E\, e^{tD} E^{-1}. \qquad (10)$$
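In Matlab or Octave, formula (10) can be checked against the built-in matrix exponential expm; the following minimal sketch uses an arbitrarily chosen diagonalizable matrix:

A = [1 1; 0 2];                     % a diagonalizable matrix (eigenvalues 1 and 2)
t = 0.7;                            % an arbitrary time
[E, D] = eig(A);                    % columns of E are eigenvectors of A
Et = E * diag(exp(t*diag(D))) / E;  % E e^(tD) E^(-1), as in (10)
norm(Et - expm(t*A))                % essentially zero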

For the nondiagonalizable Jordan blocks of the matrix A, such as

$$\begin{pmatrix} \lambda & 1 \\ 0 & \lambda \end{pmatrix},$$

we can do the following computation. Write the block as λI + B, where B^k = 0 for some k. For example,

$$\begin{pmatrix} \lambda & 1 \\ 0 & \lambda \end{pmatrix} = \lambda I + B, \qquad \text{where } B = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \text{ and } B^2 = 0.$$

The matrices λI and B commute, and so

$$e^{\lambda I + B} = e^{\lambda I} e^{B}.$$

We know how to compute e^{λI}. As B^k = 0, the power series for e^B is finite.

2.2.2. Two-Dimensional Homogeneous Constant Coefficient Systems

Let us consider a very simple system of two equations. The geometric intuition gained from this simple example can be easily generalized to more complex systems, not necessarily constant coefficient nor homogeneous. Suppose we have a real diagonalizable 2 × 2 matrix A and the system

Fig. 4. Example source vector field with eigenvectors and solutions.

$$\begin{pmatrix} x \\ y \end{pmatrix}' = A \begin{pmatrix} x \\ y \end{pmatrix}. \qquad (11)$$

We will be able to visually tell how the vector field looks once we find the eigenvalues and eigenvectors of the matrix A. When we diagonalize the matrix, A = EDE⁻¹, the columns of E are the eigenvectors of A. As E⁻¹ is just a constant matrix, we can see that we can write any solution to (11) as

$$\begin{pmatrix} x \\ y \end{pmatrix} = c_1 \vec{v}_1 e^{\lambda_1 t} + c_2 \vec{v}_2 e^{\lambda_2 t},$$

where λ₁ and λ₂ are the two eigenvalues of A, v₁ and v₂ are the corresponding eigenvectors, and c₁ and c₂ are constants.
Case 1. Suppose that the eigenvalues are real and positive. If (x, y) is on the line determined by an eigenvector v corresponding to a real positive eigenvalue λ, then our solution is simply a multiple of ve^{λt}. That is, our solution moves along a line away from the origin along the vector v.

If we start at a general point in the plane we simply move along a certain curve away from the origin. See Fig. 4 for an example where

$$A = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}.$$

The eigenvalues are 1 and 2 and the corresponding eigenvectors are $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$. The eigenvectors are shown in the figure by the two large arrows. Several solutions are also shown. This type of behavior is called a source or an unstable node.

Fig. 5. Example saddle vector field with eigenvectors and solutions.

Case 2. Suppose both eigenvalues are negative. We will get an analogous behavior to the previous case, but as e^{λ₁t} and e^{λ₂t} will now decay rather than grow for both eigenvalues, we simply reverse the direction of the solutions, and the solutions will move toward the origin. This type of behavior is called a sink or a stable node.

Case 3. Suppose one eigenvalue is positive and one is negative. In this case e^{λ₁t} would grow and e^{λ₂t} would decay as t grew. We obtain a picture as in Fig. 5. This type of behavior is called a saddle point.
The eigenvalues of A could also be complex. As A is real valued, we can take the complex conjugate of both sides of equation (11) to note that if we have a complex-valued solution, then its complex conjugate is also a solution to (11). As complex eigenvalues and eigenvectors of a real matrix come in pairs, we know that the eigenvalues of A are λ and its conjugate λ̄, and the eigenvectors are v and its conjugate v̄. As the real part of any vector can be written in terms of the vector and its conjugate, and as the equation is linear, we note that any solution to (11) can be written as

$$\begin{pmatrix} x \\ y \end{pmatrix} = c_1 \operatorname{Re}\!\big(\vec{v}\,e^{\lambda t}\big) + c_2 \operatorname{Im}\!\big(\vec{v}\,e^{\lambda t}\big),$$

where Re and Im mean the real and imaginary parts. To write real and imaginary parts of vectors as above, it will be useful in the sequel to use Euler's formula e^{iθ} = cos θ + i sin θ.
Case 4. Suppose the eigenvalues are purely imaginary. That is, suppose the eigenvalues are ±ib. For example, let

$$A = \begin{pmatrix} 0 & 1 \\ -4 & 0 \end{pmatrix}.$$

Fig. 6. Example center vector field.



The eigenvalues turn out to be ±2i and the eigenvectors are $\begin{pmatrix} 1 \\ 2i \end{pmatrix}$ and $\begin{pmatrix} 1 \\ -2i \end{pmatrix}$. We take the eigenvalue 2i and its eigenvector and note that the real and imaginary parts of ve^{i2t} are

$$\operatorname{Re}\begin{pmatrix} 1 \\ 2i \end{pmatrix} e^{i2t} = \begin{pmatrix} \cos 2t \\ -2\sin 2t \end{pmatrix}, \qquad \operatorname{Im}\begin{pmatrix} 1 \\ 2i \end{pmatrix} e^{i2t} = \begin{pmatrix} \sin 2t \\ 2\cos 2t \end{pmatrix}.$$

In particular, our solution will simply go around in ellipses around the origin. This type of behavior is called a center. See Fig. 6.
Case 5. Suppose the complex eigenvalues have positive real part. That is, suppose the eigenvalues are a ± ib for some a > 0. Then the real and imaginary parts of e^{(a+ib)t} will contain a factor of e^{at}. For example, let

$$A = \begin{pmatrix} 1 & 1 \\ -4 & 1 \end{pmatrix}.$$

The eigenvalues turn out to be 1 ± 2i and the eigenvectors are $\begin{pmatrix} 1 \\ 2i \end{pmatrix}$ and $\begin{pmatrix} 1 \\ -2i \end{pmatrix}$. We take 1 + 2i and its eigenvector $\begin{pmatrix} 1 \\ 2i \end{pmatrix}$ and find that the real and imaginary parts of ve^{(1+2i)t} are

$$\operatorname{Re}\begin{pmatrix} 1 \\ 2i \end{pmatrix} e^{(1+2i)t} = e^{t}\begin{pmatrix} \cos 2t \\ -2\sin 2t \end{pmatrix}, \qquad \operatorname{Im}\begin{pmatrix} 1 \\ 2i \end{pmatrix} e^{(1+2i)t} = e^{t}\begin{pmatrix} \sin 2t \\ 2\cos 2t \end{pmatrix}.$$

Fig. 7. Example spiral source vector field.

Therefore, the solutions will grow as t grows. We still get a rotation around the origin coming from the imaginary part of the eigenvalue. We obtain a behavior referred to as a spiral source. See Fig. 7.

Case 6. Finally, the complex eigenvalues could have a negative real part. That is, the eigenvalues could be a ± ib for some a < 0. The real and imaginary parts of the complex exponentials would again have a factor of e^{at}, which in this case decays. Our solutions would therefore spiral toward the origin in a behavior called a spiral sink.

2.2.3. Nonhomogeneous Linear Systems

So far we have only talked about homogeneous systems x' = Ax. Let us see what to do with a nonhomogeneous equation

$$\vec{x}\,' = A\vec{x} + \vec{F}(t). \qquad (12)$$

For many problems the homogeneous part of an equation tells us what the system wants to do on its own, while F(t) will be some external input.

The way the system is solved is to simply find any one particular solution x_p to (12). Then we find the general solution x_h to the homogeneous equation x_h' = Ax_h. Finally, we use the linearity of the equation to note that

$$\vec{x} = \vec{x}_h + \vec{x}_p$$

is the general solution to (12), and we can solve for the initial conditions.

Therefore, the trick is to find a solution x_p (any solution) to (12), ignoring any initial conditions we have. There exist analytic

methods for finding such a solution, but they are beyond the scope
of this chapter. Often one simply makes an educated guess to find
the solution.

3. Numerical Methods

3.1. Euler's Method

As we said before, unless f(t, x) is of a special form, it is generally very hard if not impossible to get a nice formula for the solution of the problem

$$\frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0.$$

What if we want to find the value of the solution at some particular t? Or perhaps we want to produce a graph of the solution to inspect the behavior. In this section we will learn about the basics of numerical approximation of solutions.

The simplest method for approximating a solution is Euler's method. There are much better numerical methods, but Euler's method best illustrates the ideas.

It works as follows: We take t₀ and compute the slope k = f(t₀, x₀). The slope is the change in x per unit change in t. We follow the line of this slope for an interval of length h on the t axis. Hence if x = x₀ at t₀, then we will say that x₁ (the approximate value of x at t₁ = t₀ + h) will be x₁ = x₀ + hk. We repeat the procedure to compute t₂ and x₂ in the same way using t₁ and x₁. More abstractly, for any i = 1, 2, 3, ..., we compute

$$t_{i+1} = t_i + h, \qquad x_{i+1} = x_i + h\, f(t_i, x_i).$$
The line segments we get are an approximate graph of the solution. See Fig. 8 for the plot of the real solution and the first two steps of the approximation with h = 1.

Let us see what happens with the equation dx/dt = x²/3, x(0) = 1. Let us try to approximate x(2) using Euler's method. In Fig. 8 we have essentially graphically approximated x(2) with step size 1. With step size 1 we have x(2) ≈ 1.926. The real answer is 3. The difference between the actual solution and the approximate solution we will call the error. We will usually talk about just the size of the error (absolute value), and we do not care much about its sign. The main point is that we usually do not know the real solution, so we only have a vague understanding of the error. So using h = 1 we are approximately 1.074 off. Let us halve the step size. Computing x₄ with h = 0.5, we find that x(2) ≈ 2.209, so an error of about 0.791. Table 1 gives the values computed for various parameters.
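Euler's method is only a few lines of code. The following sketch reproduces the h = 1 entry of Table 1 and can be rerun with smaller h to reproduce the other rows (this is illustrative code, not the code used to generate the table):

f = @(t, x) x.^2/3;      % right-hand side of the equation
h = 1; t = 0; x = 1;     % step size and initial condition
while t < 2
    x = x + h*f(t, x);   % follow the line of slope f(t, x) for length h
    t = t + h;
end
x                        % 1.92593 for h = 1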
We notice that except for the first few times, every time we
halved the interval the error approximately halved. This halving of
the error is a general feature of Euler's method as it is a first-order

Fig. 8. Two steps of Euler's method (step size 1) and the exact solution for the equation dx/dt = x²/3 with initial condition x(0) = 1.

Table 1
Euler's method approximation of x(2), where dx/dt = x²/3, x(0) = 1

h          Approximate x(2)   Error     Error/Previous error
1          1.92593            1.07407
0.5        2.20861            0.79139   0.73681
0.25       2.47250            0.52751   0.66656
0.125      2.68034            0.31966   0.60599
0.0625     2.82040            0.17960   0.56184
0.03125    2.90412            0.09588   0.53385
0.015625   2.95035            0.04965   0.51779
0.0078125  2.97472            0.02528   0.50913

method. In general, for an nth-order method, the error will be 1/2ⁿ of the previous error every time we halve the interval.

Note that to get the error to be within 0.1 of the answer we had to already do 64 steps. To get it to within 0.01 we would have to halve another three or four times, meaning doing 512 to 1,024 steps. That is quite a bit to do by hand. A second-order method would quarter the error every time we halve the interval, so we would have to approximately do half as many halvings to get the

Table 2
Attempts to use Euler's method to approximate x(3), where dx/dt = x²/3, x(0) = 1

h          Approximate x(3)
1          3.16232
0.5        4.54329
0.25       6.86079
0.125      10.80321
0.0625     17.59893
0.03125    29.46004
0.015625   50.40121
0.0078125  87.75769

same error. This reduction can be a big deal. With ten halvings (starting at h = 1) we have 1,024 steps, whereas with five halvings we only have to do 32 steps, assuming that the error was comparable to start with. So a higher-order method should reduce the computations necessary drastically. We will give a fourth-order method (Runge-Kutta) below, which happens to be essentially the simplest practical method.

Note that we do not know the error! How do we know what is the right step size? We can keep halving the interval, and we expect that we can estimate the error from a few of these calculations and the assumption that the error goes down by a factor of one half each time (if we are using Euler's method). That is, we compute the approximation for h and h/2 and we compute their difference. This difference we can expect to be about half the error of the approximation with step size h.

Let us talk a little bit more about the example dx/dt = x²/3, x(0) = 1. Suppose that instead of the value x(2) we wish to find x(3). The results of this effort are listed in Table 2 for successive halvings of h.

What is going on here? If we solve the equation exactly we will notice that the solution does not exist at t = 3. In fact, the solution goes to infinity when we approach t = 3.
Further problems might arise when the solution oscillates
wildly near some point.

3.2. Fourth-Order Runge-Kutta

In real applications we would not use a simple method such as Euler's. The simplest method that would likely be used in a real application is the standard fourth-order Runge-Kutta method.

That is a fourth-order method, meaning that if we halve the interval, the error generally goes down by a factor of 16.

The method is very similar in the way it works to Euler's method. The difference is that it does several approximations and then takes a weighted average of them. Let us simply describe the method without analyzing it in detail.

We again start with dx/dt = f(t, x), an initial condition x(t₀) = x₀, and a step size h. We now compute x₁, x₂, ... in the following manner. Suppose we know tᵢ and xᵢ. We wish to compute tᵢ₊₁ and xᵢ₊₁.

$$\begin{aligned}
k_1 &= f(t_i, x_i),\\
k_2 &= f\!\left(t_i + \tfrac{h}{2},\; x_i + \tfrac{h}{2}k_1\right),\\
k_3 &= f\!\left(t_i + \tfrac{h}{2},\; x_i + \tfrac{h}{2}k_2\right),\\
k_4 &= f(t_i + h,\; x_i + k_3 h),\\
t_{i+1} &= t_i + h,\\
x_{i+1} &= x_i + \frac{k_1 + 2k_2 + 2k_3 + k_4}{6}\, h.
\end{aligned}$$

That is, k₁ is just the slope at the starting point as in Euler's method. However, we then only go half the distance and compute the slope k₂ at this new approximation. We repeat this step using k₂ to compute k₃. Finally we use k₃ and a full step to obtain the slope k₄. We then use a weighted average of the slopes to compute the next xᵢ₊₁.
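One possible implementation of a fixed-step fourth-order Runge-Kutta loop for the test equation used above is sketched below (real codes such as ode45 use variable step sizes and error control):

f = @(t, x) x.^2/3;
h = 0.5; t = 0; x = 1;                    % deliberately coarse step size
while t < 2 - h/2                         % guard the loop bound against roundoff
    k1 = f(t, x);
    k2 = f(t + h/2, x + (h/2)*k1);
    k3 = f(t + h/2, x + (h/2)*k2);
    k4 = f(t + h, x + h*k3);
    x = x + h*(k1 + 2*k2 + 2*k3 + k4)/6;  % weighted average of the four slopes
    t = t + h;
end
x                                         % close to the exact value x(2) = 3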
Choosing the right method to use and the right step size can be
very tricky. There are several competing factors to consider.
l Computational time: Each step takes computer time. Even if
the function f is simple to compute, we do it many times over.
Large step size means faster computation, but perhaps not the
right precision.
l Roundoff errors: Computers only compute with a certain num-
ber of significant digits. Errors introduced by rounding numbers
off during our computations become noticeable when the step
size becomes too small relative to the quantities we are working
with. So reducing step size may in fact make errors worse.
l Stability: Certain equations may be numerically unstable. Small
errors may lead to large errors down the line. What may also
happen is that the numbers may never stabilize no matter how
many times we halve the interval. In the worst case the numeri-
cal computations might be giving us bogus numbers that look
like a correct answer.
We have seen just the beginnings of the challenges that appear
in real applications. Numerical approximation of solutions to

differential equations is still a very active research area. For example, the general purpose method used for the ODE solver (the ode45 function) in Matlab (8) and Octave (9) (as of this writing) is a method that appeared in the literature only in the 1980s.

3.3. Systems and Higher-Order Equations

The numerical methods that we have described can be easily applied to systems of ODEs. What is simply done is to consider the dependent variable a vector. That is, suppose that we have the system

$$\vec{x}\,' = \vec{f}(t, \vec{x}), \qquad \vec{x}(t_0) = \vec{x}_0.$$

Let us describe Euler's method. We pick a step size h as before, and as before we compute tᵢ₊₁ and xᵢ₊₁ from tᵢ and xᵢ:

$$t_{i+1} = t_i + h, \qquad \vec{x}_{i+1} = \vec{x}_i + h\, \vec{f}(t_i, \vec{x}_i).$$
Runge-Kutta and other methods are converted for systems of differential equations in the same exact way.
As we have said in the introduction, we can convert any equa-
tion of any order (or any system of any order) into a first-order
system. Thus, having numerical methods for first-order systems of
ODEs is sufficient for solving any single ODE or system of ODEs.
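As an illustration, here is Euler's method applied to the first-order system obtained from the third-order equation (2) in the introduction; the initial conditions are arbitrary choices made for this sketch, with y = [x; x'; x'']:

f = @(t, y) [y(2); y(3); t - 2*y(3) - 3*y(2) - 4*y(1)];
h = 0.01; t = 0; y = [1; 0; 0];   % arbitrary initial conditions
while t < 1
    y = y + h*f(t, y);            % the same Euler update, now on vectors
    t = t + h;
end
y(1)                              % approximation of x(1)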

3.3.1. Example: Michaelis-Menten Kinetics

An example of a system of ODEs that must be solved with numerical methods is given by the equations arising in enzyme kinetics. For more details on these derivations see for example (3).

We construct a model that allows us to understand how fast a certain reaction occurs and what is the concentration of product and substrate at any particular point in time. We will obtain a nonlinear system of ODEs that we will solve numerically.

Let S be the concentration of substrate and let P be the concentration of the product. Let E_S denote the concentration of substrate-bound enzyme and E_F the concentration of free enzyme. All of S, P, E_S, and E_F are functions of the time t. The reactions occurring are as follows:

$$E_F + S \;\underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}}\; E_S \;\overset{k_2}{\longrightarrow}\; E_F + P,$$

where k₁, k₋₁, and k₂ are the reaction rate constants.

Now the amount of possible interactions between S and E_F is proportional to the concentration S times the concentration E_F. The number of successful interactions is then proportional to this number SE_F. Hence, the rate of the reaction taking E_F and S into E_S must be k₁SE_F. The reverse reaction rate, at which E_S becomes S and E_F again, is proportional to E_S, hence k₋₁E_S. From these considerations we obtain an equation for the rate of change of S:

$$\frac{dS}{dt} = k_{-1}E_S - k_1 S E_F.$$

The rate of production of P is proportional to the concentration of E_S, therefore

$$\frac{dP}{dt} = k_2 E_S.$$

We also need to understand how E_S and E_F change. E_F is being produced by the first reaction at the rate k₋₁E_S − k₁SE_F and by the second reaction at the rate of k₂E_S. That is,

$$\frac{dE_F}{dt} = k_{-1}E_S - k_1 S E_F + k_2 E_S.$$

Assuming that E = E_S + E_F, the total concentration of enzyme, is constant, we obtain

$$\frac{dE_S}{dt} = -k_{-1}E_S + k_1 S E_F - k_2 E_S.$$

We have obtained an autonomous system of four equations in four unknowns. We can still do some simplification. First we note that only one equation depends on P. As soon as we know E_S, we also know dP/dt. So we might forget about the equation dP/dt = k₂E_S until we find a solution. Furthermore, as we are keeping E = E_F + E_S constant, we need only solve for E_S. Thus, the two equations we are interested in are

$$\frac{dS}{dt} = k_{-1}E_S - k_1 S (E - E_S) = (k_{-1} + k_1 S)E_S - k_1 S E,$$

$$\frac{dE_S}{dt} = -k_{-1}E_S + k_1 S (E - E_S) - k_2 E_S = -(k_{-1} + k_1 S + k_2)E_S + k_1 S E.$$
We now have two equations and two unknowns. In the next
section we will use the Matlab software to find and plot numerical
solutions.
Let us now derive the Michaelis-Menten equation, which is really an approximation for the above system under the assumption that we are in a state where the amount of substrate-bound enzyme is essentially constant. This is essentially true except for some initial interval. Let us make this assumption:

$$\frac{dE_S}{dt} = 0 = -(k_{-1} + k_1 S + k_2)E_S + k_1 S E.$$

Solving for E_S we obtain

$$E_S = \frac{k_1 S E}{k_{-1} + k_1 S + k_2}.$$

Letting K_M = (k₋₁ + k₂)/k₁ we obtain

$$E_S = \frac{S E}{K_M + S}.$$
Now the rate of product formation is described in terms of E_S, so

$$\frac{dP}{dt} = k_2 E_S = \frac{V_{\max} S}{K_M + S},$$

where we let V_max = k₂E. The value V_max is the maximum reaction rate. This simplified equation makes it easier to determine the constants K_M and V_max from the measured data.

3.3.2. Using Matlab or Octave

Let us give a short overview of how to use Matlab (8) to find and plot numerical solutions to ODEs. All examples also work in the free software clone Octave (9), which has essentially equivalent syntax. In particular we will use the system of equations we derived in the last example.

Let us pick some initial values for the parameters. Of course, in practice obtaining these numbers can be very hard. Let the reaction rate constants be k₋₁ = 3, k₁ = 12, and k₂ = 5. Let the total concentration of enzyme be E = 0.004. Let the initial concentration of the substrate-bound enzyme be E_S = 0. Finally we set the initial concentration of substrate S to be 1 and the initial concentration of the product P to be 0.
We must define the system for Matlab. We define a vector x to store the variables S and E_S, that is, x₁ = S and x₂ = E_S. We are also interested in P, therefore we will let x₃ = P. Our system becomes

$$\frac{dx_1}{dt} = (k_{-1} + k_1 x_1)x_2 - k_1 x_1 E,$$

$$\frac{dx_2}{dt} = -(k_{-1} + k_1 x_1 + k_2)x_2 + k_1 x_1 E,$$

$$\frac{dx_3}{dt} = k_2 x_2.$$

Adding in the last variable is not strictly necessary, but if we are interested in P, then we might as well let Matlab solve for all variables at once. We define new Matlab variables E, km1, k1, and k2 to hold the parameters E, k₋₁, k₁, and k₂. We also define a vector xinit to hold the initial values for S, E_S, and P.

E = 0.004;
km1 = 3;
k1 = 12;
k2 = 5;
xinit = [1, 0, 0];

Matlab will want a column vector of three functions depending on the vector x. We refer to specific components of x as x(1) and x(2). We must make it a function of time t and x, so we use the anonymous function notation (Matlab version 7 or later):

dxdt = @(t,x) [(km1+k1*x(1))*x(2)-k1*x(1)*E; ...
               -(km1+k1*x(1)+k2)*x(2)+k1*x(1)*E; ...
               k2*x(2)];

Fig. 9. Product and substrate plots for the example system.

The three dots allow us to split the input over several lines. Matlab now knows about the function called dxdt that is a function of t and the vector x.

To solve the ODE, we use the function ode45. This will use a variable step method from the Runge-Kutta family of methods and return two vectors: first a vector of times, and second a column vector of values (which are in turn 3-vectors, so an n by 3 matrix). Let us solve for time from t = 0 to t = 100.

[t,x] = ode45(dxdt, [0,100], xinit);

The way Matlab handles graphing is to take a vector of values for the independent variable and a vector of values for the dependent variable and simply connect the dots. We can therefore give it the vector t and one of the columns of x. For example, to plot S we issue the command:

plot(t,x(:,1));

To plot E_S we replace x(:,1) with x(:,2), and to plot P we use x(:,3). For S and P we obtain plots as in Fig. 9.
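To show both curves in one figure with a legend, one can plot two columns at once; a small sketch on top of the commands above (the labels are our own choices):

plot(t, x(:,1), t, x(:,3));       % substrate S and product P together
legend('Substrate', 'Product');
xlabel('t');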

References

1. Berg PW, McGregor JL (1966) Elementary partial differential equations. Holden-Day, San Francisco, CA
2. Boyce WE, DiPrima RC (2008) Elementary differential equations and boundary value problems, 9th edn. Wiley, New York
3. Edelstein-Keshet L (1988) Mathematical models in biology. Random House, New York
4. Edwards CH, Penney DE (2008) Differential equations and boundary value problems: computing and modeling, 4th edn. Prentice Hall, Englewood Cliffs, NJ
5. Farlow SJ (1994) An introduction to differential equations and their applications. McGraw-Hill, Princeton, NJ
6. Ince EL (1956) Ordinary differential equations. Dover Publications, New York
7. Lebl J (2012) Notes on diffy Qs: differential equations for engineers. http://www.jirka.org/diffyqs/
8. Matlab (2012) The MathWorks Inc., Massachusetts
9. Octave (2012) http://www.gnu.org/software/octave/
Chapter 21

On the Development and Validation of QSAR Models


Paola Gramatica

Abstract
The fundamental and more critical steps that are necessary for the development and validation of QSAR
models are presented in this chapter as best practices in the field. These procedures are discussed in the
context of predictive QSAR modelling that is focused on achieving models of the highest statistical quality
and with external predictive power. The most important and most used statistical parameters needed to
verify the real performances of QSAR models (of both linear regression and classification) are presented.
Special emphasis is placed on the validation of models, both internally and externally, as well as on the need
to define model applicability domains, which should be done when models are employed for the prediction
of new external compounds.

Key words: Statistical methods, Regression, Classification, Validation, Applicability domain, Molecu-
lar descriptors, OECD principles

1. Introduction

In recent years more and more studies have been carried out in
which computational methods (so-called in silico) have been used
to predict the physicochemical properties and biological activities
(acute toxicity, carcinogenicity, mutagenicity, etc.) of chemical
compounds. The applications of these studies are historically older and more frequent for pharmaceuticals in drug design, but environmental applications have recently increased greatly, mainly in the context of the new European legislation
on chemicals REACH (Registration, Evaluation, Authorisation and
restriction of Chemicals) (1). The OECD principles for QSAR
model validation have been defined (2) to establish recognized
rules for the use of QSAR predictions in regulation: these principles
are an excellent summary of the most important aspects of QSAR
modelling, which are commented below.
Quantitative structure-activity relationship (QSAR) modelling
is based on the fundamental assumption that the structure of a


molecule (i.e., its geometric, steric and electronic properties) contains the features responsible for its physical, chemical, and
biological properties. By employing QSAR models, the biological
activity (or property, reactivity, etc.) of a new or untested chemical
can be inferred from the molecular structure of similar com-
pounds whose activities (properties, reactivities, etc.) have already
been experimentally assessed. The QSPR (Quantitative Structure
Property relationship) acronym is used when a property is mod-
elled. A structureactivity relationship (SAR) is a qualitative
relationship (i.e., an association) between a molecular (sub)struc-
ture and the presence or absence of a biological activity. A substruc-
ture associated with the presence of a biological activity is
sometimes called a biophore, whereas a substructure associated
with the absence of a biological activity is called a biophobe.
QSAR models exist at the crossroads of chemistry, statistics and
biology, in toxicological studies. The development of a new QSAR
model requires these three components: (1) a dataset providing
experimental measures of a biological activity or property for a
group of already tested chemicals (the training set); (2) molecular
structure and/or property data (i.e., the descriptors, or predictors)
for this group of chemicals; and (3) statistical methods (often called
chemometric methods), to find and validate the relationship
between these two sets.
Corwin Hansch (1964) is considered the father of modern QSAR; he demonstrated the existence of a quantitative relationship between biological activity (a dependent response Y) and structure or properties (the independent variables X) of a chemical. The classical equation of Hansch is:

$$\text{Biological activity } (Y) = a + b\,\log P + c\,E + d\,S + \ldots$$

where logP (or logKow) is the partition coefficient between octanol and water, the hydrophobicity term encoding the hydrophobic interaction of the chemical with the cell membrane. This term represents the probability of a chemical to reach the target site by crossing the cell membrane. E and S are structural parameters: E is the electronic term encoding electronic interactions, and S is the steric parameter, related to bulk and shape. These terms represent the possibility of the chemical to interact with the target and to be biologically active.

2. Important Concepts

2.1. Input Experimental Data: The Dependent Variable Y

Input experimental data (often called endpoints) are any kind of response (physicochemical properties, toxicity or ecotoxicity endpoints, environmental parameters) that could be modelled by QSAR or QSPR as the dependent variable (Y). The number of input data must be reasonably as high as possible, in order to have a wide

range of information and a better possibility of finding any


underlying relationship between the chemical structure and the
studied end-point. Due to the limitation in existing experimental
data, this number can often be a serious drawback for efficient QSAR modelling ("no data, no model"). Obviously, it is imperative that the input data be both accurate and precise to develop a meaningful model. The sentence "garbage in, garbage out" is
famous in QSAR. In fact, it must be realized that any resulting
QSAR model is only as statistically valid as the data that led to its
development. Therefore, there is a pressing need to develop and
systematically employ standard chemical record curation protocols
that would be helpful in the preprocessing of any chemical dataset
(3, 4). Molecular modellers typically analyze data obtained from a
literature search or provided by a collaboration. Consequently, the
experimental data quality depends on the data providers. One
limiting factor in the development of QSARs is the availability of
high quality experimental data, homogeneously determined
according to harmonized protocols. This implies that there should
be not only a clear definition of which end point is being modelled, but also of how it is experimentally measured (the "defined endpoint" of OECD Principle 1). Although toxicity end points are often well-defined for the various datasets, there may be marked differences in the experimental conditions and test organisms within the single datasets. Combining data produced
by different protocols in training sets may introduce unnecessary
experimental variability, resulting in deviations from QSAR correla-
tions and should thus be avoided.
Data used in QSAR evaluations could consist of congeneric series of chemicals (yielding local QSAR models) or could ensure structural diversity even within a chemical class (yielding general or global QSAR models). This diversity allows the generalization of
more robust QSARs that are more extensively applicable, having a
wider applicability domain (AD) (see below). It is important to
remember that a structure-activity model is defined and limited by
the nature and quality of the data used in the training set for model
development and should be applied only within the models
applicability domain, which must be defined.
The ideal QSAR model should: (1) consider an adequate num-
ber of training molecules for sufficient structural diversity, (2) have a
wide range of quantified endpoint potency (i.e., several orders of
magnitude) for regression models or adequate distribution of mole-
cules in each class (i.e., active and inactive) for classification models,
(3) be applicable for obtaining reliable predictions of new untested
chemicals (thus there is a need for validation and applicability domain
tools) and (4) if possible, allow to obtain mechanistic information on
the modelled end-point. This last point, which is a fundamental
aspect of descriptive QSAR modelling (mechanistic approach), is of
minor relevance in predictive QSAR modelling (statistical approach).

2.2. Molecular Descriptors: The Independent Variables X

The second crucial point for QSAR modelling is the need to translate the molecular structure of the studied chemicals into numbers (molecular descriptors). A molecular descriptor is either the result of some standardized experiment (a physicochemical property) or the final result of a logical and mathematical procedure that transforms the chemical information, encoded within a symbolic representation of a molecule, into a useful number. In a QSAR model, molecular descriptors, which characterize a specific aspect of a molecule, are the independent variables X and are predictors of a dependent variable Y (the modelled response).
Molecular descriptors include empirical, quantum chemical,
and other theoretical parameters. Empirical descriptors can be
measured or estimated and include physicochemical properties
(such as for instance descriptors for the hydrophobic properties
(logP) as well as solubility, ionization constants, etc.). Quantum
chemical properties include charge and energy values. Nonempiri-
cal or theoretical descriptors can be based on individual atoms,
substituents, or the whole molecule. They are typically structural
features (i.e., spatial disposition like conformation, geometry, shape
and molecular volume). They can be based on topology or graph
theory and, as such, they are developed from the knowledge of the
2D structure, or they can be calculated from the 3D structural
conformations of a molecule. Binding properties involve biological
macromolecules and are important in receptor-mediated responses
(i.e., in 3D-QSAR as Comparative Molecular Field Analysis
(CoMFA)).
In modern QSAR approaches, it is becoming quite common to
calculate, as input variables, a wide set of molecular descriptors of
different kinds able to capture all the structural aspects of a chemical
to translate its molecular structure into numbers. In fact, different
descriptors are different ways or perspectives to view a molecule,
taking into account the various features of its chemical structure.
A lot of software is available to calculate wide sets of different
theoretical descriptors, from SMILES, 2D-topological graphs or
3D-x,y,z-coordinates. Some of the more used are mentioned here:
ADAPT (5) OASIS (6), CODESSA (7), MolConnZ (8), and
DRAGON (9). Freely available software, such as MOPAC (10), is
able to perform the calculations of a variety of molecular orbital
properties (electronic properties, including the energies of the
highest occupied and lowest unoccupied molecular orbitals
(EHOMO and ELUMO respectively), atomic charges and superdelo-
calizabilities, dipole moment, and electrostatic potential), which are
estimated by applying quantum chemical calculations (semiempiri-
cal or ab initio methods) to molecular structures. Several thousand
molecular descriptors are now available and most of them have been
summarized and explained (11). The great advantage of theoretical
descriptors is that they can be calculated homogeneously and in a

reproducible way by defined software for any chemical, even those not yet synthesized, the only requirement being a hypothesized chemical structure.
A crucial problem related to the molecular descriptor calcula-
tion is the input chemical structures, which must be carefully
verified to avoid wrong results from erroneous inputs (3, 4). This
is a fundamental starting point for the preparation of a good dataset
for subsequent QSAR calculations.
The simplest way to input chemical structures, and one that has
become a standard method for denoting structures in databases and
for inserting chemical structures into models for property calcula-
tion, is the Simplified Molecular Line Entry System (SMILES).
SMILES is a 2D representation of a chemical structure. It is in
the form of a string, written by following a small number of rules
that are simple to learn and to use. Briefly, in the SMILES string
each nonhydrogen atom (hydrogen is only explicitly included in special circumstances) is denoted by its symbol; double and triple bonds are shown by the = and # symbols, respectively; branches are shown in parentheses; and rings are opened and closed by the use of numbers. Alternatively, the chemical structures can be
drawn and conformationally minimized by specific software (i.e.,
Hyperchem (12)) to derive the (x,y,z) atomic coordinates: in this
way molecular descriptors that also take into account the three-
dimensionality (3D) of the structure can be calculated.
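To illustrate the SMILES rules mentioned above, a few standard examples (our own, not taken from a particular database): ethanol is written CCO; acetic acid is CC(=O)O, with the branch in parentheses and the double bond written as =; acrylonitrile is C=CC#N, with the triple bond written as #; and benzene is c1ccccc1 (or, in its Kekulé form, C1=CC=CC=C1), where the digit 1 opens and closes the ring.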

2.3. Data Exploration by Principal Component Analysis

Before modelling, the chemicals under study should be analyzed by explorative methods to highlight their distribution and trends in chemical space and to identify potential strong outliers due to their peculiar structure.
Probably the most widely known and used explorative multivar-
iate method is Principal Component Analysis (PCA) (13). In PCA,
linear combinations of studied variables (here molecular descrip-
tors) are created, and these combinations explain, to the greatest
possible degree, the variation in the original data. The first principal
component (PC1) accounts for the maximum amount of possible
data variance in a single variable, while subsequent PCs account for
successively smaller quantities of the original variance. Principal
components are derived in such a way that they are orthogonal.
Indeed, it is good practice, especially when the original variables
have different ranges of scales, to derive the principal components
from the standardized data (mean of 0 and standard deviation of 1),
i.e., via the correlation matrix. In this way all the variables are treated
as if they are of equal importance, regardless of their scale of mea-
surement. To be useful, it is desirable that the first two PCs account
for a substantial proportion of the variance in the original data, thus
they can be considered sufficiently representative of the main infor-
mation included in the data, while the remaining PCs condense
irrelevant information and noise. It is quite common for a PCA

Fig. 1. Biplot of the first two principal components of a PCA. The points are the scores (the studied objects), the lines are the
loadings (the used variables/descriptors).

to be represented by a score plot, loading plot or biplot (Fig. 1), defined as the joint representation of the rows and columns of a data
matrix: points (scores) represent the chemicals and vectors or lines
the variables (loadings). The lengths of the vectors indicate the
information associated with the variable, while the cosine of the
angle between the vectors reflects their correlation.
PCA is also used for variable selection in PLS methods (see
below).
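In practice, PCA on a descriptor matrix can be computed from the singular value decomposition of the standardized data. The following Matlab/Octave sketch uses a random matrix as a stand-in for a real chemical-by-descriptor table:

X = randn(50, 8);                           % hypothetical 50 chemicals x 8 descriptors
Z = (X - mean(X)) ./ std(X);                % standardize: mean 0, standard deviation 1
[U, S, V] = svd(Z, 'econ');                 % columns of V are the loadings
scores = U*S;                               % principal component scores
explained = diag(S).^2 / sum(diag(S).^2)    % fraction of variance per component
plot(scores(:,1), scores(:,2), 'o');        % score plot of PC1 vs PC2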
Cluster Analysis is a multivariate method for the assignment of
a set of structures into subsets (called clusters) so that structures in
the same cluster are similar in some sense.

2.4. Statistical Methods: Regression and Relative Statistical Parameters

The core of QSAR modelling lies in the statistical methods applied to relate the response (Y) to the molecular descriptors (X). This relationship between the molecular descriptors of the chemical structure and the modelled endpoint in a QSAR model is called the algorithm of the model (the "unambiguous algorithm" of OECD Principle 2).

A brief overview of the simplest algorithms, namely those for linear models commonly used in QSAR, is presented here.
Regression analysis is the use of statistical methods for the
modelling of a dependent variable Y in terms of predictors X
(independent variables or molecular descriptors). Univariate
regression (ULR) involves only one dependent response variable
(Y) and one independent variable (X) that models a simple

Fig. 2. Regression line of a univariate model.

relationship between a molecular descriptor and an endpoint. ULR assumes that the relationship being modelled is a straight line. The
normal method of determining the regression coefficients is to
minimize the sum of the squares of the residuals using the least
squares method.
The regression line (Fig. 2) expresses the best prediction of
the dependent variable (Y), given the independent variable (X).
However, nature is rarely (if ever) perfectly predictable, and usually
there is substantial variation of the observed points around the
fitted regression line. The deviation of a particular point from the
regression line (its predicted value) is called the residual value.
When the endpoint needs to be modelled using more than one
descriptor (selected by different approaches) then multivariate
techniques are applied.
There are many different multivariate methods for regression
analysis, applied in QSAR studies: Multiple Linear Regression
(MLR), Principal Component Regression (PCR), Partial Least
Squares (PLS), Artificial Neural Nets (ANN), K-nearest Neighbor,
are among the more common approaches for regression modeling.
The technique of multiple linear regression (MLR), particularly
OLS (Ordinary Least Squares), is the most popular regression
method for deriving QSAR models, and is based on completely
transparent and easily reproducible mathematical equations.
The MLR model equation is, in matrix notation:

$$y = Xb + e,$$

where y and e are n-vectors and X is an n × p matrix of predictors (n = number of chemicals, p = number of predictors); e are unobserved scalar random variables (errors) that account for the discrepancy between the actually observed responses yᵢ and the predicted outcomes.

OLS produces a transparent and easily reproducible algorithm relating the dependent variable Y (i.e., a biological activity) to a number of independent variables Xⱼ (molecular descriptors or predictors) by using the linear prediction equation:

$$Y = b_0 + \sum_{j=1}^{P} b_j X_j = b_0 + b_1 X_1 + b_2 X_2 + \ldots + b_P X_P.$$

In this equation, the regression coefficients (or bⱼ coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable.
The regression coefficients bj in a MLR model can be estimated
using the least squares procedure by minimizing the sum of the
squared residuals. The aim of this procedure is to give the smallest
possible sum of squared differences between the true dependent
variable values and the values calculated by the regression model.
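In Matlab or Octave, the OLS coefficients can be obtained with the backslash operator; a minimal sketch with synthetic data standing in for measured endpoints and descriptors:

n = 30; p = 3;
X = randn(n, p);                              % descriptor matrix (n chemicals, p descriptors)
y = 2 + X*[0.5; -1; 0.2] + 0.1*randn(n, 1);   % synthetic response with noise
b = [ones(n,1), X] \ y;                       % least squares: b(1) = b0, b(2:end) = bj
yhat = [ones(n,1), X]*b;                      % responses calculated by the model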
The number of chemicals (data points in the training set)
should also always be reported.
The training set is the set of chemicals used to derive a QSAR.
The data in a training set are typically organized in the form of a
two-dimensional matrix in which different rows (n) correspond to
different chemicals, and different columns (p) to different attri-
butes of the chemicals (e.g., SMILES codes, molecular descrip-
tors, properties). A homogeneous training set is a set of chemicals
that belong to a common chemical class, that share a common
chemical functionality or that have a common skeleton. A hetero-
geneous training set is a set of chemicals that belong to multiple
chemical classes, or that do not share a common chemical
functionality.
It is important to remember that there is a rule of thumb for
the number of descriptors that can be used in a QSAR model: at
least five data points (chemicals) per descriptor must exist in the
model. Otherwise there is a strong possibility of finding chance
correlations or of arriving at a fitted model with poor or no
prediction ability.
For assessing the relative importance of descriptors, the stan-
dardized regression coefficients (i.e., the coefficients of the indepen-
dent variables, the predictors in a regression model divided by the
standard deviation of the corresponding predictor) must be used.
The advantage of standardized coefficients (with a mean of zero
and standard deviation of one), as compared to regression coeffi-
cients that are not standardized, is that their magnitude allows the
comparison of the relative contribution of each independent vari-
able in the prediction of the dependent variable. Thus, independent
variables with a higher absolute value of their standardized coefficients
explain a greater part of the variance of the dependent
variable. If a b coefficient is positive, then the relationship of this
variable with the dependent variable is positive; if the b coefficient is
negative, then the relationship is negative.

Fig. 3. Plot of experimental vs. predicted values in a regression model. The training and prediction chemicals are labeled differently.
In this method it is assumed that the relationship between the
X and Y variables is linear and also that the residuals (predicted
minus observed values) are distributed normally (i.e., follow nor-
mal distribution). In practice these assumptions can virtually never
be confirmed; fortunately, multiple regression procedures are not
greatly affected by minor deviations from these assumptions.
For multivariate regression analysis it is normal to plot the
experimental values vs. the values predicted by the model (Fig. 3).
For both simple and multiple linear regression analyses, a num-
ber of measures of statistical fit are commonly applied.
To assess the goodness-of-fit of a QSAR model the coefficient
of multiple determination (R2) is used (Eq. 3, Table 1).
$R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}$
where MSS is the Model Sum of Squares, RSS is the sum of the
squares of the differences (residuals) between the experimental and
estimated responses when predictions are made for objects in the
training set and TSS is the total sum of the squares of the differ-
ences between the experimental responses and the mean values.
The total variation in any dataset is made up of two parts, the
part that can be explained by the regression equation and the part
that cannot be explained by the regression equation. R2 estimates
the proportion of the variation of Y explained by the regression (the
explained variance of the model). The stronger the relationship
between the dependent and the independent variables the more
R2 approaches 1; conversely, the weaker the relationship, the
more R2 approaches zero. It equals the square of the correlation
coefficient r between the experimental response (the dependent
variable y) and the predictors (the independent variables x).
It is most important to avoid overfitting. In fact, the value of R2
can generally be increased by adding additional predictor variables
to the model, even if the added variable does not contribute to
reduce the unexplained variance of the dependent variable. This
inconvenience can be avoided by using another statistical
parameter, the so-called adjusted R2 (R2adj) (Eq. 4, Table 1).
Radj2 is obtained by dividing the residual sum of squares and total
sum of squares by their respective number of degrees of freedom.
The value of Radj2 decreases if an added variable to the equation
does not reduce the unexplained variance.
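Both parameters are reported by base R for a fitted lm model; a short sketch continuing the hypothetical fit above:

# R2 and adjusted R2 for the fitted MLR model
s <- summary(fit)
s$r.squared          # R2 = 1 - RSS/TSS
s$adj.r.squared      # adjusted R2
# The same quantities computed from the residuals directly:
RSS <- sum(resid(fit)^2)
TSS <- sum((qsar$y - mean(qsar$y))^2)
n <- nrow(qsar); p <- length(coef(fit)) - 1
1 - RSS / TSS                                # R2
1 - (RSS / (n - p - 1)) / (TSS / (n - 1))    # adjusted R2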

Table 1
Statistical parameters for fitting

No. 1. MAE_CALC (mean absolute error in fitting, calculated on the training set):
$MAE_{CALC} = \frac{\sum_i |y_i - \hat{y}_i|}{n}$
($y_i$ = observed dependent variable; $\hat{y}_i$ = calculated dependent variable)

No. 2. RMSE_CALC = SDEC (root mean square error in fitting; standard deviation in calculation):
$RMSE_{CALC} = SDEC = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n}}$

No. 3. R2 (coefficient of multiple determination, or correlation):
$R^2 = \frac{MSS}{TSS} = 1 - \frac{RSS}{TSS}$
($MSS = \sum_i (\hat{y}_i - \bar{y})^2$; $TSS = \sum_i (y_i - \bar{y})^2$; $RSS = \sum_i (y_i - \hat{y}_i)^2$; $\bar{y}$ = mean value of the dependent variable)

No. 4. R2adj (adjusted R2):
$R^2_{adj} = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$
(n = number of objects; p = number of predictor variables)

No. 5. s (standard error of estimate):
$s = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n - p - 1}}$

No. 6. F (F-value):
$F = \frac{MSS/p}{RSS/(n - p - 1)}$
Table 2
Statistical parameters for cross-validation and K multivariate correlation index

No. 1. PRESS_CV (predictive residual sum of squares, cross-validation):
$PRESS_{CV} = \sum_i (y_i - \hat{y}_{i/i})^2$

No. 2. MAE_CV (mean absolute error in cross-validation, calculated on predictions in CV):
$MAE_{CV} = \frac{\sum_i |y_i - \hat{y}_{i/i}|}{n}$
($y_i$ = observed response for the ith object; $\hat{y}_{i/i}$ = response of the ith object estimated by using a model obtained without using the ith object; n = number of objects)

No. 3. RMSE_CV = SDEP (root mean square error in CV predictions; standard deviation in prediction):
$RMSE_{CV} = SDEP = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2}{n}}$
(symbols as above)

No. 4. Q2 (explained variance in prediction):
$Q^2 = 1 - \frac{PRESS_{CV}}{TSS}$

No. 5. K (multivariate correlation index):
$K = \frac{\sum_l \left| \lambda_l / \sum_l \lambda_l \; - \; 1/p \right|}{2(p - 1)/p} \cdot 100$
($\lambda_l$ = eigenvalues obtained from the correlation matrix of the dataset X(n, p); n = number of objects; p = number of variables)

The standard error of estimate s (Eq. 5, Table 1) measures the
dispersion of the observed values from the regression line. The
smaller the value of s the higher the reliability of the prediction.
However it is not recommended to have a standard error of esti-
mate smaller than the experimental error of the biological data, as
this indicates an overfitted model.
The root mean square error (RMSE), or standard deviation error
in calculation (SDEC, Eq. 2, Table 1), is similar to the standard
error of the estimate. They summarize the overall error of the
model: they are calculated as the square root of the sum of squared
errors in calculation divided by the total number of chemicals.
These parameters are calculated and compared both on training
and prediction chemicals (SDEP) (see below) (Eq. 3 in Tables 2
and 3). The more similar these compared values are, the more
general the applicability of the model.
Some authors use the simpler MAE, mean absolute error (Eq. 1,
Table 1).
The statistical significance of the regression model can be
assessed by means of the Fisher statistic (F), (Eq. 6, Table 1). The
F-value or variance ratio is the ratio between explained and
Table 3
Statistical parameters for external validation

No. 1. PRESS_EXT (predictive residual sum of squares, external validation):
$PRESS_{EXT} = \sum_i (y_i - \hat{y}_i)^2$
($y_i$ = external response observed for the ith object; $\hat{y}_i$ = external response predicted using the model)

No. 2. MAE_EXT (mean absolute error in external prediction):
$MAE_{EXT} = \frac{\sum_i |y_i - \hat{y}_i|}{n_{EXT}}$
(symbols as above)

No. 3. RMSE_EXT (root mean square error in external prediction):
$RMSE_{EXT} = \sqrt{\frac{\sum_i (y_i - \hat{y}_i)^2}{n_{EXT}}}$
(symbols as above)

No. 4. R0^2 (25) (determination coefficient forcing the origin):
$R_0^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i^{r0})^2}{TSS}$, with $\hat{y}_i^{r0} = k y_i$ and slope $k = \frac{\sum_i \hat{y}_i y_i}{\sum_i y_i^2}$
Closeness between the R2 and R0^2 determination coefficients: $\frac{R^2 - R_0^2}{R^2}$

No. 5. Q2_F1 (32) (variance explained in external prediction):
$Q^2_{F1} = 1 - \frac{PRESS_{EXT}}{SS_{EXT}(\bar{y}_{TR})}$, with $SS_{EXT}(\bar{y}_{TR}) = \sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{TR})^2$
($\bar{y}_{TR}$ = average of training observed responses)

No. 6. Q2_F2 (33) (variance explained in external prediction):
$Q^2_{F2} = 1 - \frac{PRESS_{EXT}}{SS_{EXT}(\bar{y}_{EXT})}$, with $SS_{EXT}(\bar{y}_{EXT}) = \sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{EXT})^2$
($\bar{y}_{EXT}$ = average of external observed responses)

No. 7. r2_m (34) (closeness between the R2 and R0^2 determination coefficients):
$r^2_m = R^2 \left(1 - \sqrt{R^2 - R_0^2}\right)$

No. 8. Q2_F3 (35, 36) (variance explained in external prediction):
$Q^2_{F3} = 1 - \frac{PRESS_{EXT}/n_{EXT}}{TSS/n_{TR}}$
($n_{EXT}$ = number of external objects; $n_{TR}$ = number of training objects)

No. 9. CCC (37) (concordance correlation coefficient):
$CCC = \frac{2 \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{x} - \bar{y})^2}$
($x_i$ = external response observed for the ith object; $y_i$ = external response predicted using the model; $\bar{x}$ = average of observed responses; $\bar{y}$ = average of responses predicted by the model. Note: observed and predicted responses can be interchanged)
unexplained variance for a given number of degrees of freedom,
respectively p and (n − p − 1), where n is the number of chemicals and p
the number of model descriptors. The regression equation is considered to be
statistically significant if the observed F-value is greater than a
tabulated value for the chosen level of significance (typically, the
95% level) and the corresponding degrees of freedom of F. Signifi-
cance of the equation at the 95% level means that there is only a 5%
probability that the dependence found was obtained due to chance
correlations between the variables.
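Continuing the hypothetical R sketch from above, the F-value and its significance can be extracted from the model summary:

# F-test of overall model significance for the fitted MLR model
s <- summary(fit)
s$fstatistic                               # F-value with df p and n-p-1
f <- s$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)   # probability of such an F by chance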

2.4.1. Variable Selection

The number of independent variables in a MLR model must be as
low as possible, and the ratio of observations to variables must be as
high as possible (a ratio of 5:1 is considered an absolute minimum).
Care must also be taken to ensure that all the variables in a MLR
analysis are significant (this can be assessed by the standardized
regression coefficient) and preferably that no combinations of inde-
pendent variables are collinear (in fact collinearity of variables
should be avoided in MLR, unless an appropriate method has
been applied to control the collinearity (see below) (14)).
For a large number of independent variables (i.e., physico-
chemical and/or structural properties) variable selection techni-
ques are commonly applied. This selection may be an empirical
process of the model developer, i.e., the selection of properties
known or thought to be important. Alternatively, variable selection
may employ stepwise selection techniques (forward or backward),
best subsets selection, or the use of genetic algorithms.
Genetic algorithms (GA) (15) are optimization methods based
on evolutionary principles (16). In GA terminology, a chromosome
is a p-dimensional vector (a string of bits) where each position
(a gene) corresponds to a variable (1 if included in the model,
0 otherwise). Each chromosome or individual in the population
represents a model with a subset of variables. A population of
models is obtained that evolves, according to genetic algorithm
rules, in order to maximize the predictive power of the models
(for instance, the explained variance in prediction Q2 (see below)).
In the first generation, the variables are chosen randomly. In
the next step reproduction takes place, so that each new individual
contains characteristics of both its parents. The next steps are crossovers
and mutations, which allow better variable combinations to
be found. This reproduction-crossover-mutation process is
repeated during the evolution of the population until a desired
target fitness score is reached. Only the models producing the
highest predictive power are finally retained and further analyzed.
GAs are used in QSAR analysis as a strategy for variable subset
selection (VSS) in multivariate situations where a large number of
molecular descriptors are potential X-variables (17-22). There are
different types of GA analysis, which perform reproduction, cross-
over and mutation in different ways. An important characteristic of
the GA-VSS method is that the result is usually a population of
acceptable models.
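The following sketch illustrates the GA-VSS idea in base R on simulated data; it is a deliberately minimal toy (fitness here is the adjusted R2 of an OLS model, whereas a real application would maximize a cross-validated Q2), not the algorithm of any particular QSAR package:

# Minimal genetic-algorithm sketch for variable subset selection (VSS)
set.seed(42)
n <- 50; p <- 12
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 1.5 * X[, 1] - 2 * X[, 4] + rnorm(n)

fitness <- function(chrom) {
  if (sum(chrom) == 0) return(-Inf)                 # empty model: invalid
  d <- data.frame(y = y, X[, chrom == 1, drop = FALSE])
  summary(lm(y ~ ., data = d))$adj.r.squared
}

pop_size <- 20; generations <- 30; mut_rate <- 0.05
pop <- matrix(rbinom(pop_size * p, 1, 0.3), pop_size, p)  # random 1st generation

for (g in 1:generations) {
  fit_vals <- apply(pop, 1, fitness)
  parents <- pop[order(fit_vals, decreasing = TRUE)[1:(pop_size / 2)], ]
  children <- t(apply(parents, 1, function(mom) {
    dad <- parents[sample(nrow(parents), 1), ]
    cut <- sample(p - 1, 1)                          # one-point crossover
    child <- c(mom[1:cut], dad[(cut + 1):p])
    flip <- runif(p) < mut_rate                      # random mutations
    child[flip] <- 1 - child[flip]
    child
  }))
  pop <- rbind(parents, children)                    # next generation
}

best <- pop[which.max(apply(pop, 1, fitness)), ]
colnames(X)[best == 1]                               # selected descriptors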
An alternative linear regression method, in which the variables are
derived by Principal Component Analysis of the original variables
and which is therefore free from the problem of variable collinearity,
is Partial Least Squares (PLS) (23). PLS identifies the fundamental
relations between two matrices (X and Y), where X are orthogonal
latent variables, derived from linear combination of the original
variables by PCA. This approach models the covariance in these
two spaces, trying to find the multidimensional direction in the X
space that explains the maximum multidimensional variance direc-
tion in the Y space. PLS-regression is particularly suited when the
matrix of predictors has more variables than observations, and
when there is multicollinearity among X values.
$X = T P^T + E$
$Y = T Q^T + F$

where X is an n × m matrix of predictors, Y is an n × p matrix of
responses, T is an n × l matrix (the score, component or factor
matrix), P and Q are, respectively, m × l and p × l loading matrices,
and the matrices E and F are the error terms.
Similarly, principal components regression (PCR) is the application
of regression analysis to a dataset in which the descriptors are
principal components derived from linear combinations of the more
fundamental descriptors.
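Both techniques are available, for instance, in the pls package for R (assumed installed); a sketch on simulated, collinearity-prone data:

# PLS and PCR sketch with the 'pls' package (assumed installed)
library(pls)
set.seed(7)
n <- 40; m <- 20
X <- matrix(rnorm(n * m), n, m)       # many, partly redundant descriptors
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
d <- data.frame(y = y, X = I(X))
pls_fit <- plsr(y ~ X, ncomp = 5, data = d, validation = "LOO")
pcr_fit <- pcr(y ~ X, ncomp = 5, data = d, validation = "LOO")
summary(pls_fit)    # explained variance per latent variable
RMSEP(pls_fit)      # cross-validated error per number of components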

2.5. Validation of QSAR Models (OECD Principle 4)

A necessary condition for the validity of a regression model is that
the multiple correlation coefficient R2 is as close as possible to one
and the standard error of the estimate s is small. However, this
condition, which measures how well the model is able to mathe-
matically reproduce the end point data of the training set (fitting
ability), is an insufficient condition for model robustness and valid-
ity, as it does not express the ability of the model to make reliable
predictions on new data. However, it is important to highlight that
the utility of predictive QSAR modelling lies mainly in obtaining
predicted data, in filling data gaps, and in virtual
screening for prioritizing chemicals.
apply various cross-validations (24). Cross-validation refers to
the use of one or more statistical techniques for internal validation
in which different proportions of chemicals are omitted from the
training set (e.g., Leave-one-out (LOO), Leave-More-out (LMO),
bootstrapping) and iteratively put in a test (or validation) set. The QSAR
is developed on the basis of the data of the remaining chemicals,
and then used to make predictions for the chemicals that were
omitted (test chemicals). This procedure is repeated a number of
times, so that a number of statistics can be derived from the com-
parison of predicted data with the known data. Cross-validation
techniques allow the assessment of the internal prediction power
and of the robustness of the model (stability of QSAR model
parameters). But nothing is known regarding the power to predict
new external chemicals (never included in the training set for model
development) (4, 25-27).
The cross-validated explained variance or cross-validated corre-
lation coefficient (Q2, Eq. 4, Table 2) is used as a measure of the
goodness of the internal predictive power. It is calculated by the
formula $Q^2 = 1 - PRESS/TSS$, where PRESS is the Predictive Error
Sum of Squares, that is the sum of the squares of the differences
(residuals) between the experimental and predicted responses when
predictions are made for objects left out of the training sets and TSS
is the Total Sum of Squares (Eq. 1, Table 2).
In contrast to the fitting parameter R2, which is always
increased by adding more descriptors arriving at a perfectly fitted
model, the value of Q2 increases when useful predictors are added,
but decreases otherwise. The differing trends of R2 and Q2 with an
increasing number of predictor variables are illustrated in Fig. 4.

Fig. 4. Comparison of the explained variance in fitting with the explained variance in cross-validation prediction (Y axis: explained variance in %; X axis: number of components, 1-10; upper curve: R2, lower curve: Q2).
Cross-validation by LOO employs n sub-training sets, extracted
from the original training set, in which, at every step, one different
chemical has been excluded. A number of models are developed by
using, each time, the n − 1 remaining chemicals in the training set and the one
excluded chemical as the test set. For each model, the excluded chemical is
predicted and, finally, Q2 is computed.
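LOO cross-validation takes only a few lines of base R; a sketch reusing the hypothetical qsar data from above:

# Leave-one-out cross-validation: PRESS and Q2
n <- nrow(qsar)
pred_loo <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(y ~ logP + HOMO + MW, data = qsar[-i, ])   # model without chemical i
  pred_loo[i] <- predict(fit_i, newdata = qsar[i, ])     # predict the excluded chemical
}
PRESS <- sum((qsar$y - pred_loo)^2)
TSS <- sum((qsar$y - mean(qsar$y))^2)
1 - PRESS / TSS    # Q2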
Cross-validation by LMO employs smaller training sets than the
LOO procedure and can be repeated many more times because of
the possibility of having a larger number of combinations and of
leaving many compounds out from the training set (up to 50%).
If a QSAR model has a high average Q2 in LOO and LMO
validations, it is generally concluded that the obtained model is
robust and internally predictive.
Bootstrap resampling (or bootstrapping) (28) is another
approach to internal validation, alternative to LMO. The basic
premise of bootstrap resampling is that the dataset should be
representative of the population from which it was drawn. Since
there is only one dataset, bootstrapping simulates what would
happen if the samples were selected randomly. In a typical bootstrap
validation, K groups of the size n are generated by a repeated
random selection (typically >1,000 times) of n objects from the
original dataset. It is possible for some objects to be included in the
same random sample several times, while other objects may never
be selected. The model obtained from the dataset of n randomly
selected objects is used to predict the target properties for the
excluded objects. As in the case of LMO validation, a high average
Q2 in the bootstrap validation is a demonstration of model
robustness.
Y-scrambling or response permutation/randomization testing
is another widely used technique to check the robustness of a
QSAR model, and to identify models based on chance correlation,
i.e., models where the independent variables are correlated by
chance to the response variables (23, 26, 27). In this test, the
dependent variable vector, the Y-vector, is randomly shuffled and
a new QSAR model is developed using the original independent
variable matrix, calculating the quality of the model (usually R2 or,
better, Q2). The procedure is repeated several hundred times. It
is expected that the resulting QSAR models should generally have
low R2 and low Q2 LOO values. If the new models developed from
the dataset with randomized responses have R2 and Q2 significantly
lower than the original model, then there is strong evidence that
the proposed model is well founded, and not just the result of
chance correlation.
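A Y-scrambling test is equally short in base R; a sketch, again on the hypothetical model:

# Y-scrambling: refit the model on randomly permuted responses
set.seed(99)
r2_scrambled <- replicate(300, {
  y_perm <- sample(qsar$y)    # shuffle the Y-vector
  summary(lm(y_perm ~ logP + HOMO + MW, data = qsar))$r.squared
})
summary(r2_scrambled)    # should stay far below the R2 of the real model
mean(r2_scrambled >= summary(fit)$r.squared)    # fraction of chance models as good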
Models based on chance correlation can be detected using the
QUIK rule (14), a simple criterion that allows the rejection of
models with high predictor collinearity, which could lead to chance
correlation. The QUIK rule is based on the K multivariate correla-
tion index (Eq. 5, Table 2) that measures the total correlation of
a set of variables. The rule is derived from the assumption that
the total correlation in the set given by the model predictors X
plus the response Y (KXY) should always be greater than the
one measured only in the set of predictors (KXX). Therefore,
according to the QUIK rule, only models with the KXY correlation
among the [X + Y] variables greater than the KXX correlation
among the [X] variables can be accepted. The QUIK rule has
been demonstrated to be very effective in avoiding models affected by
multicollinearity and lacking prediction power (17-22).

2.5.1. External Validation

Validation on chemicals not used in model development, the so-called
external validation, is especially important in the context of
using QSAR models for prediction of new data in virtual screening
(4, 25-27). For this aim a prediction set of new chemicals is
needed.
Ideally, such a prediction set should consist of new data taken
from external sources and obtained after building the model, but in
everyday practice data are usually scarce. This is why, for validation
purposes, it is suggested to use only one part of the experimentally
available data (70-90% of the compounds) for model building,
reserving the other 10-30% of the compounds (supposed
unknown) until the model is built, at which time the model is
validated for its ability to predict this small subset of chemicals
not used for model development. By
convention, these two data subsets are called the training set (TSET)
and the prediction set (PSET), respectively.
The composition of the two sets is of crucial importance. At
this point the underlying goal is to ensure that both the training
and the prediction sets span, separately, the whole descriptor space
occupied by the entire dataset, and that the chemical domain in
the two datasets is not too dissimilar. The best splitting must
guarantee that the training and prediction sets are scattered over
the whole area occupied by representative points in the descriptor
space (representativity) and that the training set is distributed over
the entire area occupied by representative points for the whole
dataset (diversity). The most widely applied splitting methodolo-
gies are based on similarity analysis (for instance, D-optimal dis-
tance (17, 19, 29), sphere exclusion algorithm (30), Kohonen
Map-Artificial Neural Network (K-ANN) or Self-Organizing Map
(SOM) (17-22, 31)) or on random selection through activity
sampling (20, 22). Random splitting, while useful if applied itera-
tively in splitting for CV internal validation and more similar to real
situations, gives very variable results when applied to statistical
external validation, depending greatly on set dimension and repre-
sentativity. In addition, there is a greater probability of having
chemicals outside the model AD in the prediction set. Two differ-
ent splitting methods are often used by the author in many papers
(17, 19-22). The QSAR model developed by using the training set
chemicals is applied to the prediction set chemicals in order to verify
the real predictive ability of the model on compounds never used in
model development. When the same common set of molecular
descriptors is selected from the independent modeling of each
training set, and is verified as predictive for both prediction sets,
then this combination of descriptors is considered to encompass the
modeled response for the studied compounds, independently of
the splitting criteria, thus unbiased by structure and response.
Different formulas for the calculation of the external predictivity
for new chemicals have been proposed (25, 32-37) (Eqs. 4-9 in
Table 3) and various proponents have highlighted the quality of
their proposed parameter. These criteria are not always in agreement;
thus, in the author's opinion, the best approach is to verify
more than one criterion each time, or to prefer the author's new proposal in
QSAR modelling, the concordance correlation coefficient CCC
(Eq. 9), which is the most precautionary in accepting QSAR models
as externally predictive (22, 37). All these validation criteria are
implemented in QSARINS, the new software for QSAR MLR
model development and validation developed by the author's
group, which will be made freely available on the Web (38).
The standard deviation error in prediction (SDEP) is similar to
SDEC, but the residuals are calculated by using the predicted value
of the dependent variable when an observation is left out from the
training set and put in the test set (CV), or when it is calculated on
the chemicals in the prediction set (EXT).
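Under the same hypothetical setup, the external-validation parameters of Table 3 can be computed directly; a sketch with a single random 80/20 split:

# External validation: Q2_F1, Q2_F2, Q2_F3 and CCC for a prediction set
set.seed(3)
idx <- sample(nrow(qsar), round(0.8 * nrow(qsar)))
train <- qsar[idx, ]; pset <- qsar[-idx, ]
m <- lm(y ~ logP + HOMO + MW, data = train)
obs <- pset$y
pred <- predict(m, newdata = pset)
PRESS_ext <- sum((obs - pred)^2)
Q2_F1 <- 1 - PRESS_ext / sum((obs - mean(train$y))^2)
Q2_F2 <- 1 - PRESS_ext / sum((obs - mean(obs))^2)
Q2_F3 <- 1 - (PRESS_ext / length(obs)) /
             (sum((train$y - mean(train$y))^2) / nrow(train))
# Concordance correlation coefficient (Eq. 9, Table 3)
CCC <- 2 * sum((obs - mean(obs)) * (pred - mean(pred))) /
       (sum((obs - mean(obs))^2) + sum((pred - mean(pred))^2) +
        length(obs) * (mean(obs) - mean(pred))^2)
c(Q2_F1 = Q2_F1, Q2_F2 = Q2_F2, Q2_F3 = Q2_F3, CCC = CCC)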
Finally, when the developed QSAR model has been verified for
its predictive ability by one or more different splittings the set of
combined descriptors is used to derive a full model based on all the
chemicals: this is the final proposal, in order to have the maximum
information possible from the available experimental data (20-22).
In Fig. 5 a scheme of the complete procedure in predictive QSAR
modelling is presented.

Fig. 5. A scheme of the QSAR modelling procedure for the predictive approach.

2.6. Statistical Methods: Classification and Relative Statistical Parameters

QSAR models can also be developed on qualitative categorical
responses; in these cases classification methods must be applied,
always with the validation procedure. In the simplest case, chemicals
are categorized into one of two groups depending, for instance,
on their biological activity: active/inactive or toxic/non-toxic.
Classification or discrimination analysis is the assignment of
objects to one of two or more existing classes based on a classification
rule. Classification is also called supervised pattern recognition, as
opposed to unsupervised pattern recognition, which refers, for
instance, to cluster analysis.
The goal is to calculate class models and a classification rule (by
selection of the predictor variables) based on a training set
with objects of known classes and then to apply this rule to a test
set with objects of unknown classes. A class or category is a distinct
subspace of the whole measurement space. The classes are defined a
priori by groups of objects in the training set. The objects of a class
have one or more characteristics in common, indicated by the same
value of a categorical variable (for instance, biodegradable/not
biodegradable). There is a wide range of classification methods:
DA (Discriminant Analysis (a Bayesian method): Linear DA, Qua-
dratic DA, Regularized DA), SIMCA (Soft Independent Modelling
of Class Analogy), k-NN (k-Nearest Neighbors), CART (Classifi-
cation And Regression Tree), CP-ANN (Counterpropagation-
Artificial Neural Networks), etc. To present one of the most used:
CART (39) is a nonparametric unbiased classification strategy to
classify chemicals with automatic stepwise variable selection. As the
final output, CART displays a binary, immediately applicable, classification
tree (Fig. 6): each nonterminal node corresponds to a
discriminant variable (with the threshold value of that molecular
descriptor), and each terminal node corresponds to a single class.
To classify a chemical, at each binary node, the tree branch matching
the values of the chemical on the corresponding splitting
descriptor must be followed.

Fig. 6. A CART decision tree. At each node the value of the specified variable is applied for class discrimination down to the terminal nodes, which carry the assigned classes (here 4 classes).
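A classification tree of this kind can be grown, for instance, with the rpart package (assumed installed); a sketch on simulated two-class data with hypothetical descriptors:

# CART sketch with the 'rpart' package (assumed installed)
library(rpart)
set.seed(5)
d <- data.frame(logP = rnorm(100), MW = rnorm(100, 300, 50))
d$class <- factor(ifelse(d$logP + 0.01 * d$MW + rnorm(100, sd = 0.5) > 3,
                         "toxic", "non-toxic"))
tree <- rpart(class ~ logP + MW, data = d, method = "class")
print(tree)    # splitting descriptors with their threshold values
predict(tree, newdata = d[1:5, ], type = "class")    # class assignments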
The k-NN method is a nonparametric unbiased classification
method that searches for the k-nearest neighbors of each chemical
in a dataset (40, 41). The compound under study is classified by
considering the majority class among those to which the k nearest chemicals
belong. k-NN is applied to autoscaled data with a priori probability
proportional to the size of the classes; the predictive power of the
model is checked for k nearest neighbors between 1 and 10.
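The class package for R (assumed installed) provides a standard k-NN classifier; a sketch reusing the simulated data d from the CART example:

# k-NN classification sketch with the 'class' package (assumed installed)
library(class)
Xs <- scale(d[, c("logP", "MW")])    # autoscaled descriptors
train_idx <- 1:80
pred <- knn(train = Xs[train_idx, ], test = Xs[-train_idx, ],
            cl = d$class[train_idx], k = 3)
table(predicted = pred, observed = d$class[-train_idx])    # confusion matrix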
Counterpropagation Artificial Neural Networks (CP-ANN),
particularly Kohonen Maps (31), are supervised classification meth-
ods. Input variables (descriptors) calculated for the studied chemi-
cals provide the input for the net or the Kohonen layer. The
architecture of the net is N × N × p, where p is the number of
input variables and each p-dimensional vector is a neuron (N).
Thus, the neurons are vectors of weights, corresponding to the
input variables. During the learning, n chemicals are presented to
the net, one at a time, a fixed number of times (epochs); each
chemical is then assigned to the cell for which the distance between
the chemical vector and the neuron is minimum. The target values
(i.e., the classes to be modelled) are given to the output layer (the
top-map: a two-dimensional plane of response), which has the
same topological arrangement of neurons as the Kohonen layer.
The position of the chemicals is projected to the output layer and
the weights are corrected in such a way that they fit the output
values (classes) of corresponding chemicals. The Kohonen-ANN
automatically adapts itself in such a way that similar input objects
are associated with topologically close neurons in the top map.
Similar objects are located in topologically close neurons and simi-
larity decreases with increasing topological distance.
The trained network can be used for predictions: a new object
in the Kohonen layer will lie on the neuron with the most similar
chemicals. This position is then projected to the top-map, which
provides a predicted output value. It is important to remember that


the Kohonen top-map has a toroidal geometry: each neuron has the
same number of neighbors, including the neurons on the borders
of the top map.
The output of a classification model is the class assignment and
the confusion (or contingency) matrix (Fig. 7), calculated without
and with cross-validation, which shows how well the classes are
separated. The main diagonal (Cgg) represents the cases where
the true class coincides with the assigned class, that is, the number
of objects correctly classified in each class, while the nondiagonal
cells represent the misclassifications. Overpredictions are to the
right and above the diagonal, whereas underpredictions are to the
left and below the diagonal.

Fig. 7. The confusion matrix of a classification model. On the diagonal: the number of chemicals correctly assigned to each of the G classes; off the diagonal: the number of misclassified chemicals with their erroneous class assignments.
The internal predictive performances of classification models
are verified by the cross-validated error rate or risk, in comparison
with the No-Model error rate or risk, the error obtained in the absence
of a model: $NOMER\% = \frac{n - n_M}{n} \cdot 100$, where n is the total
number of objects and $n_M$ is the number of objects of the most
represented class.
The goodness-of-fit of a classification QSAR model can be
assessed in terms of its Cooper statistics, based on a Bayesian
approach (23). Bayesian-based methods can also be used to com-
bine results from different cases, so that judgments are rarely based
only on the results of a single study, but they rather synthesize
evidence from multiple sources. These methods can be developed
in an iterative manner, so that they allow successive updating of
battery interpretation.
The goodness-of-fit of a two-group QSAR can be represented
in the form of a 2 × 2 contingency table:

                           Predicted class
                     Active    Inactive   Marginal totals
Known class
  Active             a         b          a + b
  Inactive           c         d          c + d
Marginal totals      a + c     b + d      a + b + c + d

Cooper statistics summarize particular aspects of the contingency
table, and are defined as follows:

Sensitivity: the proportion (or percentage) of the active chemicals (chemicals that give positive results experimentally) that are predicted to be active: a/(a + b).
Specificity: the proportion (or percentage) of the inactive chemicals (chemicals that give negative results experimentally) that are predicted to be inactive: d/(c + d).
Concordance or accuracy: the proportion (or percentage) of the chemicals that are classified correctly: (a + d)/(a + b + c + d).
Positive predictivity: the proportion (or percentage) of the chemicals predicted to be active that give positive results experimentally: a/(a + c).
Negative predictivity: the proportion (or percentage) of the chemicals predicted to be inactive that give negative results experimentally: d/(b + d).
False positive rate (overclassification): the proportion (or percentage) of the inactive chemicals that are falsely predicted to be active: c/(c + d) = 1 - specificity.
False negative rate (underclassification): the proportion (or percentage) of the active chemicals that are falsely predicted to be inactive: b/(a + b) = 1 - sensitivity.

The statistics sensitivity, specificity and concordance provide
measures of a two-group QSAR to detect known active (toxic) che-
micals (sensitivity), inactive (nontoxic) chemicals (specificity), and all
chemicals (accuracy or concordance). The false positive and false
negative rates can be calculated from the specificity and sensitivity.
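All seven Cooper statistics follow directly from the four cell counts; a short base-R sketch with hypothetical counts:

# Cooper statistics from the cells a, b, c, d of a 2 x 2 contingency table
cooper <- function(a, b, c, d) {
  c(sensitivity = a / (a + b),
    specificity = d / (c + d),
    concordance = (a + d) / (a + b + c + d),
    pos_predictivity = a / (a + c),
    neg_predictivity = d / (b + d),
    false_pos_rate = c / (c + d),    # 1 - specificity
    false_neg_rate = b / (a + b))    # 1 - sensitivity
}
cooper(a = 40, b = 10, c = 5, d = 45)    # hypothetical counts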
Fig. 8. The ROC plot for comparison of the performances in sensitivity and specificity of different classification models (here kNN, Lazy, RF, and a consensus model; X axis: 1 − specificity; Y axis: sensitivity).

The other two statistics, positive and negative predictivities, can
be thought of as conditional probabilities: if a chemical is predicted
to be active (toxic), the positive predictivity gives the probability
that it really is active (toxic); similarly, if a chemical is predicted to
be inactive (nontoxic), the negative predictivity gives the probabil-
ity that it really is inactive (nontoxic).
A receiver operating characteristic (ROC) plot (Fig. 8) can be
employed to present model behavior in a visual way. In
the ROC graph (41), the X-axis is 1 − specificity (false positive
rate) and the Y-axis is the sensitivity (true positive rate). The best
possible classification model would yield a point located in the
upper left corner of the ROC space, i.e., high true positive rate
and low false positive rate. On the contrary, a classification model
with no discriminating power would give a straight line at an angle
of 45° from the horizontal, i.e., equal rates of true and false positives.
An index of the goodness of the classification is the area under
the curve: a perfect model has an area of 1.0, whilst if the area
equals 0.5, the classifier has no discriminative power at all.
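ROC coordinates and the area under the curve can be computed from predicted scores in base R (no dedicated ROC package assumed); a sketch on simulated scores:

# ROC sketch: sweep a decision threshold over predicted scores
set.seed(11)
score <- c(rnorm(50, 1), rnorm(50, 0))    # scores for actives and inactives
truth <- rep(c(1, 0), each = 50)          # 1 = active, 0 = inactive
th <- sort(unique(score), decreasing = TRUE)
tpr <- sapply(th, function(t) mean(score[truth == 1] >= t))    # sensitivity
fpr <- sapply(th, function(t) mean(score[truth == 0] >= t))    # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)    # the no-discrimination diagonal
sum(diff(c(0, fpr)) * (c(0, head(tpr, -1)) + tpr) / 2)    # AUC, trapezoid rule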

2.7. Other Statistical Modelling Methods

For modelling nonlinear relationships, both in regression and
classification, Artificial Neural Networks (ANN) are particularly
accurate. ANN are nonlinear computational models that make
predictions by simulating the functioning of human neurons.
They can be used for pattern recognition problems and for QSAR
modelling. However, they are often considered black boxes, as
the algorithms are not evident and the interpretability is lower in
comparison to other methods.
More accurate predictions can be achieved by using different
tools in order to apply a consensus approach. Combining predictions
obtained by different statistical methods and/or molecular descriptors
is a powerful tool, which gives greater confidence in predictions
and enlarges the applicability domain (17, 20, 42).

2.8. Applicability Domain (OECD Principle 3)

Not even a robust, significant, and validated QSAR model can be
expected to reliably predict the modelled property for the entire
universe of chemicals. In fact, only predictions for new chemicals
falling within the model domain can be considered reliable, like
those for training chemicals, and not extrapolations of the model.
The applicability domain is a theoretical region in physicochemical
space (the response and chemical structure space) for which a
QSAR model should make predictions with a given reliability.
This region is defined by the nature of the chemicals in the training
set, and can be characterized in various ways (43).
While the range of the independent variable X (descriptor) is
useful to define the chemical domain of a univariate QSAR (based
on one descriptor), in multivariate models the descriptor ranges are
too limited to highlight those chemicals lying outside the domain.
In Multiple Linear Regression, where the data and residual
distribution is generally normal, the predicted data $\hat{y}$ are obtained
from the experimental data y through the Hat matrix of influence:
$\hat{y} = X (X^T X)^{-1} X^T y = H y$
where X is the descriptor matrix.
A simple measure of a chemical being too far from the applica-
bility domain of the model is its leverage in the original variable
space, hii, which is defined (44) as:
$h_{ii} = x_i^T (X^T X)^{-1} x_i \qquad (i = 1, \dots, n)$

where $x_i$ is the descriptor row-vector of the query compound, and
X is the n × p matrix of p model descriptor values for n training set
compounds. The superscript T refers to the transpose of the
matrix/vector. The ii main diagonal entry of the Hat matrix
(H, (hii)) provides a measure of how far observation i is from the
center of the X data (leverage).
The warning leverage h* is generally fixed at 3k/n, where n is
the number of training compounds, and k the number of model
parameters plus one (k = p + 1). A chemical with high leverage in the
training set greatly influences the regression: the fitted regression line
is forced near the observed value and the residuals are small, thus
the chemical in the training set is not an outlier for the response
fitting. On the contrary, a chemical in a test set with a Hat value
greater than the warning leverage h* means that the predicted
response is the result of substantial extrapolation of the model and,
therefore, may not be reliable, so the predicted value must be used
with great care. To visualize the complete AD of a QSAR model
(both for structure and response), the plot of standardized cross-validated
residuals vs. leverage (Hat diagonal) values (h)
(the Williams plot, Fig. 9) can be used for an immediate and
simple graphical detection of both the response outliers (i.e., compounds
with cross-validated standardized residuals greater than
2.5-3 standard deviation units) and the structurally influential
chemicals in a model (h > h*).

Fig. 9. The Williams plot for the graphical visualization of outliers for the response (on the Y axis: standardized residuals >2.5s) or for the structure (on the X axis: leverage values beyond the h* cut-off line) in regression models.
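Leverages, the warning leverage, and a rudimentary Williams plot are easy to obtain in base R; a sketch continuing the hypothetical fit from above:

# Leverage (Hat) values and warning leverage h* for the fitted MLR model
h <- hatvalues(fit)            # diagonal elements of the Hat matrix
k <- length(coef(fit))         # number of model parameters (p + 1)
h_star <- 3 * k / nrow(qsar)   # warning leverage h* = 3k/n
std_res <- rstandard(fit)      # standardized residuals
plot(h, std_res, xlab = "Leverage (h)", ylab = "Standardized residuals")
abline(v = h_star, lty = 2)    # structural outlier cut-off
abline(h = c(-2.5, 2.5), lty = 3)    # response outlier cut-offs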

3. Examples

In conclusion, and for the sake of space, interesting examples of
QSAR models and more details on the summarized modelling
procedures discussed here can be found in the cited papers of the
author (17-22, 41, 42). Additionally, it could be useful for newcomers
to list here some of the more commonly used computational tools,
which can be easily applied, taking into
account all the crucial points explained above for QSAR model
reliability, particularly external validation and applicability domain.
Freely available computational tools:
EPI Suite: http://www.epa.gov/opptintr/exposure/pubs/episuite.htm
Caesar models: http://www.caesar-project.eu/
Toxtree: http://toxtree.sourceforge.net/download.html and http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?cTOXTREE
OECD QSAR toolbox: http://www.oecd.org/document/54/0,3746,en_2649_34379_42923638_1_1_1_1,00.html
OpenTox models: http://www.opentox.org/
CADASTER models: http://www.cadaster.eu/

Commercial computational tools:
ACD Labs (Advanced Chemistry Development): http://acdlabs.com/home/
MultiCASE: http://www.multicase.com/
PASS: http://195.178.207.233/PASS/Ref.html
Leadscope: http://www.leadscope.com/model_appliers/
Derek: http://www.lhasalimited.org
Topkat: http://accelrys.com/products/discovery-studio/predictive-toxicology.html

Acknowledgments

I wish to thank Dr. Nicola Chirico for his collaboration in preparing
the Tables and Figures and for the implementation of the QSARINS
software.

References

1. REACH (2007) http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm
2. OECD Guidelines (2004) http://www.oecd.org/dataoecd/33/37/37849783.pdf
3. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in chemoinformatics and QSAR modelling research. J Chem Inf Model 50:1189-1204
4. Tropsha A (2010) Best practices for QSAR model development, validation, and exploitation. Mol Inform 29:476-488
5. http://www.netsci.org/Resources/Software/Modeling/CADD/adapt.html
6. http://oasis-lmc.org
7. Katritzky AR, Karelson M, Petrukhin R (2001-2005) CODESSA PRO, University of Florida. http://www.codessa-pro.com/
8. MolConnZ (2003) Ver. 4.05, Hall Ass. Consult., Quincy, MA. http://www.edusoft-lc.com/molconn/
9. DRAGON: Software for the calculation of molecular descriptors. Talete srl, Milan, Italy. http://www.talete.mi.it/products/dragon_description.htm
10. http://openmopac.net/
11. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics. Wiley-VCH, Weinheim
12. (2002) HyperChem 7.03. Hypercube, Inc., Florida, USA. www.hyper.com
13. Jackson JE (1991) A user's guide to principal components. Wiley, New York
14. Todeschini R, Consonni V, Maiocchi A (1999) The K correlation index: theory development and its application in chemometrics. Chemom Int Lab Syst 46:13-29
15. Leardi R, Boggia R, Terrile M (1992) Genetic algorithms as a strategy for feature selection. J Chemom 6:267-281
16. Kubinyi H (1996) Evolutionary variable selection in regression and PLS analyses. J Chemom 10:119-133
17. Gramatica P, Pilutti P, Papa E (2004) Validated QSAR prediction of OH tropospheric degradability: splitting into training-test set and consensus modelling. J Chem Inf Comp Sci 44:1794-1802
18. Papa E, Villa F, Gramatica P (2005) Statistically validated QSARs and theoretical descriptors for the modelling of the aquatic toxicity of organic chemicals in Pimephales promelas (Fathead Minnow). J Chem Inf Model 45:1256-1266
19. Liu H, Papa E, Gramatica P (2006) QSAR prediction of estrogen activity for a large set of diverse chemicals under the guidance of OECD principles. Chem Res Toxicol 19:1540-1548
20. Gramatica P, Giani E, Papa E (2007) Statistical external validation and consensus modeling: a QSPR case study for Koc prediction. J Mol Graph Model 25:755-766
21. Gramatica P (2009) Chemometric methods and theoretical molecular descriptors in predictive QSAR modeling of the environmental behaviour of organic pollutants. In: Puzyn T, Leszczynski J, Cronin MTD (eds) Recent advances in QSAR studies. Springer, New York
22. Bhhatarai B, Gramatica P (2010) Per- and poly-fluoro toxicity (LC50 inhalation) study in rat and mouse using QSAR modeling. Chem Res Toxicol 23:528-539
23. Eriksson L, Jaworska J, Worth A et al (2003) Methods for reliability, uncertainty assessment, and applicability evaluations of regression based and classification QSARs. Environ Health Perspect 111:1361-1375
24. Hawkins DM (2004) The problem of overfitting. J Chem Inf Comput Sci 44:1-12
25. Golbraikh A, Tropsha A (2002) Beware of q2. J Mol Graph Model 20:269-276
26. Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22:69-77
27. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694-701
28. Efron B (1979) Bootstrap methods, another look at the jackknife. Ann Stat 7:1-26
29. Marengo E, Todeschini R (1992) A new algorithm for optimal distance-based experimental design. Chemom Int Lab Syst 16:37-44
30. Golbraikh A, Tropsha A (2002) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aid Mol Des 16:357-369
31. Gasteiger J, Zupan J (1993) Neural networks in chemistry. Angew Chem Int Ed Engl 32:503-527
32. Shi LM, Fang H, Tong W et al (2001) QSAR models using a large diverse set of estrogens. J Chem Inf Comput Sci 41:186-195
33. Schuurmann G, Ebert RU, Chen J et al (2008) External validation and prediction employing the predictive squared correlation coefficients: test set activity mean vs training set activity mean. J Chem Inf Model 48:2140-2145
34. Roy PP, Somnath P, Indrani M et al (2009) On two novel parameters for validation of predictive QSAR models. Molecules 14:1660-1701
35. Consonni V, Ballabio D, Todeschini R (2009) Comments on the definition of the Q2 parameter for QSAR validation. J Chem Inf Model 49:1669-1678
36. Consonni V, Ballabio D, Todeschini R (2010) Evaluation of model predictive ability by external validation techniques. J Chemom 24:194-201
37. Chirico N, Gramatica P (2011) Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J Chem Inf Model 51(9):2320-2335
38. Chirico N, Papa E, Kovarich S, Cassani S, Gramatica P (2011) QSARINS, software for QSAR MLR model calculation and validation, 2008-2012. University of Insubria, Varese, Italy. http://www.qsar.it
39. Breiman L, Friedman JH, Olshen RA et al (1998) Classification and regression trees. Chapman & Hall, Boca Raton
40. Sharaf MA, Illman DL, Kowalski BR (1986) Chemometrics. Wiley Interscience, New York
41. Li J, Gramatica P (2010) Classification and identification of androgen receptor antagonists with various methods and consensus approach. J Chem Inf Mod 50:861-874
42. Zhu H, Tropsha A, Fourches D et al (2008) Combinational QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. J Chem Inf Model 48:766-784
43. Netzeva TI, Worth AP, Aldenberg T et al (2005) Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. ATLA 33:155-173
44. Atkinson AC (1985) Plots, transformations and regression. Clarendon, Oxford
Chapter 22

Principal Components Analysis


Detlef Groth, Stefanie Hartmann, Sebastian Klie, and Joachim Selbig

Abstract

Principal components analysis (PCA) is a standard tool in multivariate data analysis to reduce the number of
dimensions, while retaining as much as possible of the data's variation. Instead of investigating thousands of
original variables, the first few components containing the majority of the data's variation are explored. The
visualization and statistical analysis of these new variables, the principal components, can help to find
similarities and differences between samples. Important original variables that are the major contributors
to the first few components can be discovered as well.
This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description.
We describe how PCA can be used to analyze different datasets, and we include practical code examples.
Possible shortcomings of the methodology and ways to overcome these problems are also discussed.

Key words: Principal components analysis, Multivariate data analysis, Metabolite profiling, Codon
usage, Dimensionality reduction

1. Introduction

Modern data analysis is challenged by the enormous number of
possible variables that can be measured simultaneously. Examples
include microarrays that measure nucleotide or protein levels, next
generation sequencers that measure RNA levels, or GC/MS and
LC/MS that measure metabolite levels. The simultaneous analysis
of genomic, transcriptomic, proteomic, and metabolomic data fur-
ther increases the number of variables investigated in parallel.
A typical problem illustrating this issue is the statistical evalua-
tion of clinical data, for instance investigating the differences
between healthy and diseased patients in cancer research. Having
measured thousands of gene expression levels, an obvious question
is which expression levels contribute the most to the differences
between the individuals, and which genotypic and phenotypic
properties, e.g., sex or age, are also important for the differences.
Visualization and exploration of just two variables is an easy task,
whereas the exploration of multidimensional data sets requires data
decomposition and dimension reduction techniques such as principal
components analysis (PCA) (1). PCA can deliver an overview
of the most important variables, those that contribute the most to the
differences and similarities between samples. In many cases, these
might be the variables that are of biological interest. Those compo-
nents can be used to visualize and to describe the dataset in a
concise manner. Visualizing sample differences and similarities can
also give information about the amount of biological and technical
variation in the datasets.
Mathematically, PCA uses the symmetrical covariance matrix
between the variables. For square matrices of size N × N, N eigenvectors
with N eigenvalues can be determined. The components
are the eigenvectors of this square matrix and the eigenvector with
the largest eigenvalue is the first principal component. If we assume
that the most-varying components are the most important ones,
we can plot the first components against each other to visualize the
distances and differences between the samples in the datasets.
By exploring the variables that contribute most to important com-
ponents, it is possible to get insights into biological key processes.
It is not uncommon that the dataset has ten or even fewer principal
components containing more than 90% of the total variance in the
dataset, as opposed to thousands of original variables.
It is important to note that PCA is an unsupervised method,
i.e., a possible class membership of the samples is not taken into
account by this method. Although grouping of samples or variables
might be apparent in a low dimensional representation, PCA is not
a clustering tool as no distances are considered and no cluster labels
are assigned.

2. Important Concepts
The -omics technologies mentioned above require careful data
preprocessing and normalization before the actual data analysis can
be performed. The methods used for this purpose have recently
been discussed for microarrays (2) and metabolite data (3). We here
only briefly outline important concepts specific to these data types
and give a general introduction using different examples.

2.1. Data Normalization and Transformation

Data normalization aims to remove the technical variability and
to impute missing values that result from experimental issues.
Data transformation, in contrast, aims to move the data distribution
into a Gaussian one, and ensures that more powerful parametric
statistical methods can be used later on. The basic steps for
data preparation are shown in Fig. 1 and are detailed below.
Fig. 1. Steps in data preparation for PCA.

Data normalization consists mainly of background subtraction
and missing value imputation. Many higher data analysis methods
assume a complete data matrix without missing values. If we simply
omit rows and columns with missing values for large data matrices,
even if they are few, too much information will be lost from the
dataset. For instance, if only 1% of data are missing, with 1,000
rows and 50 columns, we would have 500 missing values, and
almost no data would be left in the matrix if the missing values
are distributed uniformly in the matrix. For missing values, simple
but often-used methods like the replacement of a missing value
with the row or column mean or median, or for log-transformed
data the replacement of missing values with zeros, are not feasible.
They do not take into account the correlative relations within the
data. Better-suited are methods using only relevant similar rows or
columns for the mean or median determination. For instance the
K-nearest neighbor (KNN) algorithm, which uses the k most simi-
lar rows or columns for the mean or median calculation, can be
used (4). Other examples of methods for missing value estimations
are based on least squares methods (5) or PCA approaches (6). For
an experimental comparison of different methods, the interested
reader should consult the articles of refs. 7 and 8.
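The idea behind a k-nearest-neighbor imputation can be sketched in a few lines of base R (a toy version only; real analyses would rely on a dedicated implementation such as the one in pcaMethods):

# Toy k-NN imputation: replace NAs by the mean of the k most similar rows
knn_impute <- function(M, k = 3) {
  for (i in which(rowSums(is.na(M)) > 0)) {
    ok <- complete.cases(M)          # rows without missing values
    obs <- !is.na(M[i, ])            # variables observed in row i
    d <- apply(M[ok, obs, drop = FALSE], 1,
               function(r) sqrt(sum((r - M[i, obs])^2)))
    nb <- M[ok, , drop = FALSE][order(d)[1:k], ]    # k nearest complete rows
    M[i, !obs] <- colMeans(nb)[!obs]                # impute the missing cells
  }
  M
}
M <- matrix(rnorm(50), 10, 5)
M[2, 3] <- NA                    # one missing value
knn_impute(M)[2, 3]              # its imputed replacement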
PCA is heavily influenced by outliers, and therefore the next
step after data normalization and missing value imputation should
be the removal of outliers. An easy and frequently used method is
the removal of all values that are more than three times the standard
deviation from the sample mean. This should be done in an iterative
manner, because the outlier itself influences both the sample mean
and the standard deviation. After removal of outliers, the values for
the outliers should be imputed again as described above. Even if an
outlier is not the result of a technical variation, it is a good idea for
the PCA to remove it. This is because the PCA result
would otherwise be influenced mostly by noisy variables that contain
outliers. The R-package pcaMethods can be used for outlier removal
and missing value imputation (6).
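The iterative three-sigma rule described above can be written directly in base R; a sketch for a single variable:

# Iterative 3-sigma outlier removal: flagged values become NA for re-imputation
remove_outliers <- function(x) {
  repeat {
    out <- abs(x - mean(x, na.rm = TRUE)) > 3 * sd(x, na.rm = TRUE)
    out[is.na(out)] <- FALSE
    if (!any(out)) break    # stop once no further outliers are found
    x[out] <- NA
  }
  x
}
x <- c(rnorm(100), 25)              # one gross outlier
sum(is.na(remove_outliers(x)))      # the outlier has been flagged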
After normalizing, some kind of data transformation is needed,
because variables within a dataset often differ in their values by
orders of magnitude. If, for example, the height for humans is
recorded in meters instead of centimeters, the variance for this
variable will be much lower than the variance of the weight
recorded in kilograms for the same people. In order to give both
variables equal chance of contributing to the analysis, they need to
be standardized. The technique mostly used for this purpose is
called scaling to unit variance: to determine the individual
unit variance value, the so-called z-score ($z_i$), from each original
value ($x_i$) the mean for this variable ($\mu$) is subtracted and
the difference is then divided by the standard deviation ($s_x$):
$z_i = (x_i - \mu)/s_x$.
After scaling all variables to have a variance of one, the covari-
ance matrix equals the correlation matrix. One disadvantage of this
approach is that low level values, for instance slightly larger than
background values, get a high impact on the resulting PCA due
to their large scatter.
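In R, scaling to unit variance is a one-liner via scale; a small sketch with made-up measurements:

# Scaling to unit variance: center each column, divide by its standard deviation
m <- matrix(c(170, 65, 2, 180, 80, 1, 160, 55, 3), nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("cm", "kg", "grade")))
z <- scale(m)        # z-scores: (x - mean) / sd, column-wise
apply(z, 2, var)     # every variance is now 1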
Other data transformations that might replace or precede the
scaling procedure are log-transformations. In case zeroes exist in
the dataset, a positive constant (e.g., 1) must be added to all
elements of the data matrix, and then log-transformation can be
performed. If there are negative values, the asinh-transformation is
an option. Log-transformation of the data can bring a nonnormal
data distribution closer to being normally distributed. This allows
the usage of more powerful parametric methods for such data.
Often the individual values are transformed to fold-changes by
dividing them by the mean or median for this variable. Using this
approach the data also needs to be log-transformed to ensure a
normal distribution and a centering of the data at zero.
Scaling and log-transformation are illustrated using measurements
for 26 students: their height in centimeters, their weight in
kilograms, and their statistics course grade as a numerical value
between 1 and 5. The influence of scaling and log-transformation
on the original data distribution for each variable is shown in Fig. 2.
In Fig. 3, the data for each individual in our example dataset before
and after scaling are visualized using a pairs plot.
A step often required for later analysis, e.g., testing for differ-
entially expressed genes, is filtering to exclude variables that are not
altered between samples or that generally have low abundance.
PCA is well suited to ignore nonrelevant variables in the datasets.
For better compatibility with subsequent analyses like inferential
tests, a filtering procedure may also be applied before performing
the PCA analysis. The problem that, after scaling, low level values,
which are sometimes noisy, get an equal impact on the PCA can be
diminished if a filtering step is introduced. In our examples we use
the standard scaling procedure without outlier removal and without
any further data transformation or filtering steps.

Fig. 2. Comparison of original, log-transformed, and scaled data (panels: original data; log2 original data; unit variance (uv) data; uv and centered data).

2.2. Principal Components Analysis

We next demonstrate the PCA using the example of the students
dataset, with the variables cm, kg, and grade. The students
are the samples in this case. As can be seen in Table 1, the height,
measured in cm, and the weight, measured in kg, have a higher
variance in comparison to the grade, which ranges from 1 to
5. After scaling to unit variance, it can be seen in Table 2 that all
variables have a variance of one. The weight and the height of our
sample students have a larger covariance than that between the grade
and both weight and height. Remember that the covariance
between variables indicates the degree of correlation between two
Fig. 3. Pairs plot of unscaled (upper triangle) and scaled (lower triangle) data. Individuals can be identified by their letter codes.

Table 1
Covariance matrix for the original data. The diagonal contains the variances

cm kg Grade
cm 47.54 36.90 0.11
kg 36.90 123.18 0.84
Grade 0.11 0.84 0.62

Remember that, when variables are scaled to have unit variance, the covariance between two variables indicates their degree of correlation: high positive values denote a high degree of positive correlation, large negative values indicate a high degree of negative correlation, and a covariance near zero means no correlation between the variables. In the example, weight and height are correlated, but the grades are, of course, not correlated to the weight and height of the students.

Table 2
Covariance/correlation matrix for scaled data

cm kg Grade
cm 1.00 0.48 0.02
kg 0.48 1.00 0.10
Grade 0.02 0.10 1.00

Table 3
Covariance matrix for the principal components. The matrix is diagonal; all nondiagonal values are zero, which means that the principal components are uncorrelated with each other

PC1 PC2 PC3


PC1 1.49 0.00 0.00
PC2 0.00 1.01 0.00
PC3 0.00 0.00 0.50


2.2.1. Mathematical Background

Mathematically, PCA uses the symmetric covariance matrix to determine the principal components (PCs). The PCs are the eigenvectors of the covariance matrix, and the eigenvector with the largest eigenvalue is the first principal component. The eigenvector with the second largest eigenvalue is the second principal component, and so on. Principal components are uncorrelated with each other, as the covariance matrix in Table 3 shows. The following code reads the data from a Web site and performs the calculation of the eigenvectors and their eigenvalues directly in R.
> students <- read.table("http://cdn.bitbucket.org/mittelmark/r-code/downloads/survey.min.tab", header=TRUE)
> eigen.res <- eigen(cov(scale(students)))
> eigen.res
$values
[1] 1.4880792 1.0077034 0.5042174
$vectors
[,1] [,2] [,3]
[1,] 0.6961980 0.19381501 0.6911904
[2,] 0.7093489 -0.03800063 -0.7038324
[3,] -0.1101476 0.98030184 -0.1639384
> eigen.res$values/sum(eigen.res$values)
[1] 0.4960264 0.3359011 0.1680725

After scaling, each variable contributes equally to the overall
variance; when there are three variables, they each contribute one-
third of the total variation in the dataset. However, the largest
eigenvalue of the covariance matrix is around 1.5, which means
that the first principal component contains around 50% of the
overall variation in the dataset, and the second component still
around 34% of the total variation. As shown in the last code
example line, the exact proportion values can be obtained by divid-
ing the eigenvalues by the sum of all eigenvalues.
The second component of the result of the eigen calculation in R contains the loading vectors: the columns hold the values for each principal component, and the rows hold the values for the variables belonging to the eigenvectors (cm, kg, and grade in this case). A large absolute loading value means that the variable contributes much to that principal component. The variables cm (first row) and kg (second row) contribute mostly to components PC1 (first column) and PC3 (third column), whereas the variable grade (third row) contributes mostly to PC2 (second column).
Using the prcomp function of R, the same calculations can be done in a more straightforward manner. We create an object called pcx from the scaled data. This object contains the variable loadings in a table called rotation, and the coordinates of the individuals inside the new coordinate system of principal components in a table x; the latter are also called scores. The summary command for the pcx object shows the contribution of the most important components to the total variance.
> pcx <- prcomp(scale(students))
> summary(pcx)
Importance of components:
PC1 PC2 PC3
Standard deviation 1.220 1.004 0.710
Proportion of Variance 0.496 0.336 0.168
Cumulative Proportion 0.496 0.832 1.000
> pcx$rotation
PC1 PC2 PC3
cm 0.6961980 -0.19381501 -0.6911904
kg 0.7093489 0.03800063 0.7038324
grade -0.1101476 -0.98030184 0.1639384
> head(pcx$x, n = 4)
            PC1        PC2        PC3
A -1.482874036  0.9159609  0.2510519
B -0.005895547 -0.8607660  0.1385797
C -0.714299682  0.3420358 -0.4925597
D -1.280122071  0.3014947  0.2079020

Fig. 4. Common plots for PCA visualization. (a) Screeplot for the first few components using a bar plot; (b) screeplot using lines; (c) biplot showing the most relevant loading vectors; (d) correlation plot.

The variances (eigenvalues) of the first few components are often plotted in a so-called screeplot to show how the variance of the principal components decreases with additional components, as shown in Fig. 4a, b. Often, even when a dataset consists of thousands of variables, the majority of the variance is in the first few components. To investigate how the variables contribute to the loading vectors, a biplot and a correlation plot can be used. The biplot shows both the position of the samples in the new coordinate space and the loading vectors of the original variables in the new coordinate system (Fig. 4c). Often the number of variables shown is limited to a few, mostly restricted to those correlating best with the main principal components. In this way, biplots can uncover the correspondence between variables and samples and identify samples with similar variable values. A correlation plot can show how well
a given variable correlates with a principal component (Fig. 4d).
In the students example dataset we can see that the variables kg
and cm correlate well with the first component, whereas the
grade, while not correlated with the other variables, correlates
perfectly with the second component. Here, PC1 represents some-
thing like the general size, i.e., weight and height, whereas PC2
perfectly represents the course grade. To illustrate this we can
examine the individual values for some students. On the right side
of Fig. 4c are large students on the upper side are students with a
high grade.
If we compare the biplot with the original data in Fig. 2, we see that students with higher (Z, Y) and lower grades (U, W), and with larger (H, J, F, E) and smaller sizes (S, N, G), are nicely separated in the 2D space of the biplot. As PCA performs dimension reduction, there are some guidelines on how many components should be investigated. For example, the components that cumulatively account for at least 90% of the total variance can be used. Alternatively, components that account for less variance than the original variables (unit variance) can be disregarded.
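Both guidelines can be checked directly from a prcomp result; a minimal sketch reusing the pcx object created above:

prop <- pcx$sdev^2 / sum(pcx$sdev^2)  # proportion of variance per component
which(cumsum(prop) >= 0.90)[1]  # components needed to reach 90% of variance
# components whose variance exceeds that of one scaled original variable
sum(pcx$sdev^2 > 1)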

2.2.2. Geometrical Illustration

An intuitive geometrical explanation of PCA is as a rotation of the original data space. For three variables, imagine finding a point outside of a three-dimensional visualization of the data points that maximizes the spread of their projection onto a two-dimensional surface. The angles chosen for this point represent the new component system (Fig. 5). We illustrate this with just two variables, the weight and height of our students. The data are shown in Fig. 6a in an xy-plot. If we project the 2D space into a new coordinate system, we draw a line onto the xy-plot which shows the vector of our first component. The second component is always orthogonal to the first. We can see that after projecting the data into the new coordinate system, the first component contains much more variance than the second component (Fig. 6b). Using just the columns for the weight and height, we see that the first component now contains
around 66% of the variance, whereas the second component now contains around 34% of the variance. This is nicely visualized in the screeplot shown in Fig. 6c.

Fig. 5. Illustrative projection of a three-dimensional data point cloud onto a two-dimensional surface.

Fig. 6. PCA plots using students' height and weight data. (a) Scaled data; (b) data projected into the new coordinate system of principal components; (c) screeplot of the two resulting PCs; (d) biplot showing the loading vectors of the original variables.
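This two-variable projection is easy to reproduce; a minimal sketch, assuming the students data from above with columns named cm and kg:

# prcomp rotates the scaled (cm, kg) coordinates into (PC1, PC2)
hw  <- scale(students[, c("cm", "kg")])
pc2 <- prcomp(hw)
head(pc2$x)                   # student coordinates in the rotated system
pc2$sdev^2 / sum(pc2$sdev^2)  # roughly 0.66 and 0.34, as in Fig. 6c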
The principal components in this example have a certain meaning. The first component represents the general size of the students, where size is not restricted to the height but also includes the width, or weight. The second component could be explained as the body mass index (BMI): people with a high value for PC2 have a larger BMI than people with a low PC2 value. The biplot in Fig. 6d also visualizes the loadings of the original variables in the new component space. We can now also determine some important properties of the different subjects regarding their BMI. For instance, we can assume that students F and E are large but neither too light nor too heavy for their height. In contrast, student H is quite heavy for his/her height. On the left are smaller students (for example Z) who do not vary greatly in their BMI; student Y, in contrast, has a low BMI. The original and the scaled data for all students can be seen in Fig. 3.
PCA can be performed either on the variables or on the samples; generally only one type of PCA has to be performed. PCA on variables focuses on the correlations between the variables, whereas PCA on samples examines the correlations between the samples. In our examples we perform a PCA on the variables. To switch between the two modes, the data matrix just has to be transposed. If performing a serial experiment in which just one or two parameters are changed, for example time and concentration, you will perform a PCA on the variables, whereas if you have a lot of replicates for a few conditions you will do a PCA on the samples. For a PCA on samples with many variables, the eigenvalues often do not contain much information, as there are too many of them. In this case it is advisable to try to group related variables together.
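Switching between the two modes is a single transposition in R; a minimal sketch:

# PCA on the variables (columns), as used throughout our examples
pca.vars <- prcomp(scale(students))
# PCA on the samples: transpose first, so the former rows (samples)
# become the columns whose correlations are analyzed
pca.samples <- prcomp(scale(t(students)))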

3. Biological Examples
PCA was first applied to microarray data in 2000 (9, 10) and has
been reviewed in this context (11). We decided to choose other
types of data to illustrate PCA. First we use a dataset which deals
with the transformation of qualitative data into numerical data:
codon frequencies from various taxa will be used to evaluate their
main codon differences with a PCA. Next, we use data from a
recent study about principal changes in the Escherichia coli
(E. coli) metabolism after applying stress conditions (12). We extend the data analysis with visualizations, enabled by PCA, that help to better understand the E. coli primary stress metabolism.

3.1. Sequence Data Analysis

Here we use PCA to demonstrate the well-known fact that codon usage differs between the protein-coding genes of different organisms. Genome sequences for many taxa are freely available from different online resources. We take just five genomes for this example, although we could easily have taken 50 or 500: Arabidopsis thaliana (a higher plant), Caenorhabditis elegans (a nematode), Drosophila melanogaster (the fruit fly), Canis familiaris (the dog), and Saccharomyces cerevisiae (yeast). For each of these genomes, we only use protein-coding genes and end up with about 33,000 (plant), 24,000 (nematode), 15,000 (dog), 18,000 (fruit fly), and 6,000 (yeast) genes.
For one species at a time, we then record for each of these gene sequences how many times each of the 64 possible codons is used. The data to be analyzed for interesting patterns is therefore a 5 × 64 matrix. It describes the combined codon usage of all protein-coding genes for each of the five taxa. An abbreviated and transposed version of this matrix is shown below; note that absolute frequencies were recorded.

> codons <- read.table("http://bitbucket.org/mittelmark/r-code/downloads/codonsTaxa.txt", header=TRUE, row.names=1)
> head(codons[,c(1:3,62:64)])
AAA AAC AAG TTC TTG TTT
ATHA 419472 275641 432728 269345 285985 299032
CELE 394707 192068 272375 249129 210524 240164
CFAM 243989 187183 314228 205704 134896 186681
DMEL 185136 280003 417111 227142 173632 141944
SSCE 124665 72230 88924 52059 77227 76198

The visualization of such a matrix in a pairs plot, as we did before, is generally impractical due to the large number of variables.
Instead, we perform a PCA, again using the software package R and
the R-script ma.pca.r. After importing the matrix into R and
loading ma.pca.r, a new ma.pca object can be created with the
command ma.pca$new(codons) (assuming the data is stored in a
variable codons). This automatically performs a PCA on the data,
and the internal object created by the prcomp function of R can be
used for plotting and analysis. As can be seen in Fig. 7, the first four
components carry almost all the variance of the dataset. The com-
mand ma.pca$biplot(), for example, generates the biplot
shown in Fig. 7b, which here displays the five most important
codons that differentiate between the five taxa in the first two

Fig. 7. Codon data. (a) Screeplot of the PC variances; (b) biplot for the first two PCs and the most important loadings; (c) correlation plot for all variables.

principal components. The command to get a list of these codons is shown below:
> source("http://bitbucket.org/mittelmark/r-code/downloads/ma.pca.r")
> codons <- t(scale(t(codons))) # transpose and scale
> ma.pca$new(codons)
> ma.pca$screeplot() # Fig 7A
> ma.pca$biplot(top=5) # Fig 7B
> ma.pca$corplot() # Fig 7C
> ma.pca$getMainLoadings("PC1")[1:5]
[1] "CAG" "CAC" "ACT" "GCC" "CAT"

Figure 7b shows that the frequencies of the set of codons in grey (CAT, GCC, ACT, CAC, CAG) correlate best with the first principal component, and they are therefore especially useful for distinguishing the plant, the yeast, and the nematode from the dog and the fruit fly. Similarly, the codons in black font (CCT, GGG, TCG, ACG, TGG) are responsible for the separation of the five taxa along the second principal component.
In addition to the biplot, a correlation plot can be generated.
The command ma.pca$corplot() will produce the plot shown in Fig. 7c, displaying the correlation of each individual variable with the principal components. Finally, a summary of the PCA can be
printed to the R-terminal. This shows the amount of variance in the
individual components:
> ma.pca$summary()
Importance of components:
PC1 PC2 PC3 PC4 PC5
Standard
deviation 6.128 3.497 3.259 1.8980 2.13e-15
Proportion of
Variance 0.587 0.191 0.166 0.0563 0.00e+00
Cumulative
Proportion 0.587 0.778 0.944 1.0000 1.00e+00

Almost 60% of the total variance is in the first component, which agrees nicely with the correlation plot in Fig. 7c, showing that most codons correlate either positively or negatively with PC1.

3.2. Metabolite Data Analysis

In this example we employ PCA to analyze the system-level stress adjustments in the response of E. coli to five different perturbations. We make use of time-resolved metabolite measurements to get a detailed understanding of the successive events following heat- and cold-shock, oxidative stress, lactose diauxie, and stationary phase. A previous analysis of the metabolite data, together with transcript data measured under the exact same perturbations and time-points, was able to show that the response of E. coli on the metabolic level has a higher degree of specificity compared with the general response observed on the transcript level. Furthermore, this specificity is even more prominent during the early stress adaptation phase (12).
The lactose diauxie experiment describes two distinct growth phases of E. coli. These two growth phases are characterized by the exclusive use of either of two carbon sources: first glucose and then, upon depletion of the former in the media, lactose. Stationary phase describes the timeframe in which E. coli stops growing because nutrient concentrations become limiting. Furthermore, because of the increased cell density, stationary phase is characterized by hypoxia. The dataset considered in this example consists of metabolite concentrations measured with gas chromatography-mass spectrometry (GC-MS). The samples were obtained for each experimental condition at time points 10-50 min post-perturbation, plus an additional control time-series. Each experimental condition was independently repeated three times, and the measurements reported consist of the median of those three measurements per condition and time-point. An analysis of the obtained spectra led to the identification of 188 metabolites, of which 95 could be positively identified (58 metabolic profiles could be chemically classified and 35 remain of unknown structure). A detailed treatment of the extraction and data normalization procedures can be found in ref. 12.
Out of the 95 experimentally identified metabolites, we select 11 metabolites from the primary metabolism of E. coli for the PCA (Fig. 8). The reasoning for this selection is the following: the response of the metabolism following a perturbation is characterized by the general energy conservation strategy of E. coli, which is expected to be reflected by a rapid decrease of central carbon metabolism intermediates. From the literature (13) we know that on the genome level this energy conservation coincides with a down-regulation of genes related to cell growth.
We create a data frame metabolites in which each row repre-
sents a measurement for a certain experimental condition and time-
point. This amounts to 37 conditions:
> metabolites <- read.table("http://bitbucket.org/mittelmark/r-code/downloads/primary-metabolism-ecoli.tab", header=TRUE)
> metabolites[c(1:3,35:38),1:5]
X2KeGuAc SuAc FuAc MaAc X6PGAc
cold_0 0.0055 0.0200 0.4507 0.0936 0.0086
cold_1 0.0038 0.0219 0.3794 0.0559 0.0109
cold_2 0.0053 0.0285 0.3311 0.0619 0.0105
stat_2 0.0307 0.1680 1.5829 0.2729 0.0374
stat_3 0.0997 0.2824 2.7050 0.3279 0.0217
stat_4 0.0495 0.1085 1.4768 0.2568 0.0141
stat_5 0.0850 0.1086 1.2772 0.1875 0.0126
> dim(metabolites)
[1] 38 11
Fig. 8. Overview of the primary metabolism of E. coli (glycolysis, the pentose phosphate pathway, and the TCA cycle). Metabolites for which concentrations were measured are denoted in bold.

Here, for example, cold_2 denotes the measurement for the second time-point of the cold-stress treatment (20 min after application of the cold-stress). Each such condition is characterized by 11 entries or observations (the columns of our data frame), which are given by the 11 measured metabolite concentrations.
Figure 9 shows a biplot of all 37 different conditions and their respective measurement time-points. We project the conditions onto the axes defined by the first and second principal components, which together capture 79% of the total variation in the dataset. It is directly visible that those two components are enough to discriminate the form of the experimental treatment, as well as the time within a condition:
Lactose diauxie and stationary phase both show a greater distance from the origin than any of the other stresses. Clearly, both conditions are characterized by either depletion (stationary phase) or change (lactose diauxie) of the primary carbon source. Naturally, we would expect this to have a huge impact on the primary metabolism of E. coli as a result of changes in the corresponding metabolite levels. Of the three stress conditions, the cold-shock measurements are the closest to the control time-points. Again, this relates to the fact
that cold-shock is the physiologically mildest stress compared to heat-shock and oxidative stress.

Fig. 9. Biplot of experimental conditions and their respective time-points.
In the origin of the PCA plot we find the early time-points from control, heat, cold, and oxidative stress. Most likely this can be attributed to the fact that stress adaptation is often not instantaneous and thus not immediately reflected on the metabolic level. Notable exceptions are the 10 min measurements for lactose diauxie and stationary phase. For heat stress, one can observe that the further time progresses, the greater the distance of the measurements from the origin. However, this trend is reversed for the late stationary phase and lactose diauxic shift measurements (stat_5 and lac_4), as those time-points move back closer to the origin. One possible explanation is that E. coli has (partially) adapted to the new nutrient conditions, and the metabolic profile is again closer to the control condition.
Finally, the metabolite levels that are important for the discrimination of the time-points are examined: the arrows in the biplot indicate which metabolites have a dominant effect in finding the two principal components. Since the direction of the arrows points towards the time-points from stationary phase, we can assume an increase of the metabolites associated with these arrows.

Fig. 10. Metabolite concentrations of conditions and different time-points. Within each time-series, each metabolite
concentration is normalized to preperturbation levels.

Indeed, an investigation of the metabolite concentrations (Fig. 10) shows a general decrease of the primary metabolites in the cold-, heat-, and oxidative-stress conditions, a strong increase for stationary phase, and a medium increase for the lactose shift. Decreased levels of, for example, phosphoenolpyruvic acid (PEP) and glyceric acid-3-phosphate (GlAc3Ph) from glycolysis are dominant effects of stress application. This finding is in accordance with the previously mentioned energy conservation strategy.
The pronounced and counter-intuitive increase for the TCA-
cycle intermediates 2-ketoglutaric acid (2KeGlu), succinic acid
(SuAc) and malic acid (MaAc) can be explained by the previously
mentioned increase in bacterial culture density under stationary
phase that results in a shift from aerobic to micro-aerobic (hypoxia)
conditions. The lack of oxygen triggers a number of adjustments of
the activity of TCA-cycle enzymes with the aim of providing an
alternative electron acceptor for cellular respiration. Briefly, this increase of TCA-cycle intermediates arises from a repression of the enzyme 2-ketoglutarate dehydrogenase, which normally converts 2-ketoglutaric acid to succinyl-CoA, with the result of an accumulation of 2-ketoglutaric acid. A subsequent replacement of succinate dehydrogenase activity by fumarate reductase allows usage of fumarate (FuAc) as an alternative electron acceptor.
This in turn leads to an accumulation of succinic acid which cannot
be metabolized further and is excreted from the cell. Finally, accumulation of malic acid can be interpreted as an effect of change
in metabolic flux towards the malate, fumarate and succinate
branch of the TCA cycle, forced by increased demands of fumarate
production for use as an electron acceptor.

4. PCA Improvements and Alternatives
PCA is an excellent method for finding orthogonal directions that correspond to maximum variance. Datasets can, of course, contain other types of structure that PCA is not designed to detect. For example, the largest variations might not be of the greatest biological importance. This is a problem which cannot easily be solved, as it requires knowledge of the biology behind the data. In such cases it may be important to remove outliers to minimize the effect of single values on the overall outcome. Outlier-insensitive PCA algorithms such as robust PCA (14) or weighted PCA (15) are available, as is an R package, rrcov (16), which can be used to apply some of these advanced PCA methods to a data set. The rrcov package provides the function PcaCov, which calls robust estimators of covariance.
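A minimal sketch of how such a robust PCA might be run (assuming the rrcov package is installed and reusing the students data; the accessor call follows the rrcov naming conventions):

library(rrcov)
# robust PCA based on a robust covariance estimate; outliers have less
# influence on the components than in classical PCA
pc.rob <- PcaCov(scale(students))
summary(pc.rob)      # variances of the robust components
getLoadings(pc.rob)  # robust loadings, comparable to pcx$rotation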
In datasets with many variables it is sometimes difficult to obtain a general description of a certain component. For this purpose, e.g., in microarray analysis, the enrichment of certain ontology terms among the variables contributing most to a component is often used to get an impression of what the component actually represents (17).
A problem with PCA is sometimes that the components, although uncorrelated and orthogonal to each other, may still be statistically dependent. Independent component analysis (ICA) does not have this shortcoming. Some authors have found that ICA outperforms PCA (18); other authors have found the opposite (19, 20). Which method is best in practice depends on the actual data structure, and ICA is in some cases a possible alternative to PCA. The fastICA algorithm can be used for this purpose (21, 22). Because ICA does not reduce the number of variables as PCA does, ICA can be used in conjunction with PCA to get a decreased number of variables to consider. For instance, it has been shown that ICA, when performed on the first few principal components, i.e., on the results of a preceding PCA, can improve the sample differentiation (23).
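A sketch of this PCA-then-ICA combination with the fastICA package (22); the number of retained components is an arbitrary illustrative choice:

library(fastICA)
# reduce the data to the first two principal components, then search for
# statistically independent directions within that subspace
pcx    <- prcomp(scale(students))
scores <- pcx$x[, 1:2]
ica    <- fastICA(scores, n.comp = 2)
head(ica$S)  # estimated independent components for the samples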
Higher-order dependencies, for instance data scattered in a ring-like manner around a certain point, are sometimes difficult to resolve with standard PCA, and a nonlinear approach may be required that first transforms the data into a new coordinate system. This approach is sometimes called kernel PCA (24, 25). To obtain deeper insights into the relevant variables required to differentiate between the samples, factor analysis might be a better choice.
Where PCA tries to find a projection of one set of points into a lower-dimensional space, canonical correlation analysis (CCA, (26)) extends PCA in that it tries to find a projection of two sets of corresponding points. An example where CCA could be applied is a data set consisting of one data matrix carrying gene expression data and another carrying metabolite data. There exists an R package which can be used to perform simple correspondence analysis as well as CCA (27).

5. Availability of R-Code
The example data and the R-code required to create the graphics of this article are available at the webpage: http://bitbucket.org/mittelmark/r-code/wiki/Home.
The script file ma.pca.r contains some functions which can be
used to simplify data analysis using R. The data and functions of the
ma.pca object can be investigated by typing the ls(ma.pca)
command. Some of the most important functions and objects are:
- ma.pca$new(data): performs a new PCA analysis on data; needs to be called first
- ma.pca$summary(): returns a summary with the variances of the most important components
- ma.pca$scores: the positions of the data points in the new coordinate system
- ma.pca$loadings: numerical values describing the amount each variable contributes to a certain component
- ma.pca$plot(): a pairs plot for the most important components, with the % of variance in the diagonal
- ma.pca$biplot(): produces a biplot for the samples and for the most important variables
- ma.pca$corplot(): produces a correlation plot for all variables on selected components
- ma.pca$screeplot(): produces an improved screeplot for the PCA
These functions have different parameters; for example, components other than the first two can be chosen for plotting with the pcs argument. For instance, ma.pca$corplot(pcs=c("PC2","PC3"), cex=1.2) would plot the second versus the third component and slightly enlarge the text labels. To get comfortable with the functions, users should study the material on the project website and the R source code.

Acknowledgments

We thank Kristin Feher for carefully reviewing our manuscript.

References

1. Hotelling H (1933) Analysis of complex statistical variables into principal components. J Educ Psychol 24:417-441 and 498-520
2. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32(Suppl):496-501
3. Steinfath M, Groth D, Lisec J, Selbig J (2008) Metabolite profile analysis: from raw data to regression and classification. Physiol Plant 132:150-161
4. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21-27
5. Bo TM, Dysvik B, Jonassen I (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 32:e34
6. Stacklies W, Redestig H, Scholz M et al (2007) pcaMethods: a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23:1164-1167
7. Troyanskaya O, Cantor M, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520-525
8. Celton M, Malpertuy A, Lelandais G, de Brevern AG (2010) Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 11:15
9. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97:10101-10106
10. Alter O, Brown PO, Botstein D (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 100:3351-3356
11. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418-427
12. Jozefczuk S, Klie S, Catchpole G et al (2010) Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 6:364
13. Gasch AP, Spellman PT, Kao CM et al (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:4241-4257
14. Hubert M, Engelen S (2004) Robust PCA and classification in biosciences. Bioinformatics 20:1728-1736
15. Kriegel HP, Kröger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Ludäscher B, Mamoulis N (eds) Scientific and statistical database management. Springer, Berlin
16. Todorov V, Filzmoser P (2009) An object-oriented framework for robust multivariate analysis. J Stat Softw 32:1-47
17. Ma S, Kosorok MR (2009) Identification of differential gene pathways with principal component analysis. Bioinformatics 25:882-889
18. Draper BA, Baek K, Bartlett MS, Beveridge JR (2003) Recognizing faces with PCA and ICA. Comput Vis Image Understand 91:115-137
19. Virtanen J, Noponen T, Meriläinen P (2009) Comparison of principal and independent component analysis in removing extracerebral interference from near-infrared spectroscopy signals. J Biomed Opt 14:054032
20. Baek K, Draper BA, Beveridge JR, She K (2002) PCA vs. ICA: a comparison on the FERET data set. In: Proc of the 4th Intern Conf on Computer Vision, pp 824-827
21. Hyvärinen A (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10:626-634
22. Marchini JL, Heaton C, Ripley BD (2009) fastICA: FastICA algorithms to perform ICA and projection pursuit. http://cran.r-project.org/web/packages/fastICA
23. Scholz M, Selbig J (2007) Visualization and analysis of molecular data. Methods Mol Biol 358:87-104
24. Scholz M, Kaplan F, Guy CL et al (2005) Non-linear PCA: a missing data approach. Bioinformatics 21:3887-3895
25. Schölkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299-1319
26. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321-377
27. de Leeuw J, Mair P (2009) Simple and canonical correspondence analysis using the R package anacor. J Stat Softw 31:1-18
Chapter 23

Partial Least Squares Methods: Partial Least Squares Correlation and Partial Least Square Regression

Hervé Abdi and Lynne J. Williams

Abstract
Partial least square (PLS) methods (also sometimes called projection to latent structures) relate the information
present in two data tables that collect measurements on the same set of observations. PLS methods proceed by
deriving latent variables which are (optimal) linear combinations of the variables of a data table. When the goal
is to find the shared information between two tables, the approach is equivalent to a correlation problem and
the technique is then called partial least square correlation (PLSC) (also sometimes called PLS-SVD). In this
case there are two sets of latent variables (one set per table), and these latent variables are required to have
maximal covariance. When the goal is to predict one data table from the other one, the technique is then called
partial least square regression. In this case there is one set of latent variables (derived from the predictor table)
and these latent variables are required to give the best possible prediction. In this paper we present and
illustrate PLSC and PLSR and show how these descriptive multivariate analysis techniques can be extended to
deal with inferential questions by using cross-validation techniques such as the bootstrap and permutation
tests.

Key words: Partial least square, Projection to latent structure, PLS correlation, PLS-SVD,
PLS-regression, Latent variable, Singular value decomposition, NIPALS method, Tucker inter-battery
analysis

1. Introduction

Partial least square (PLS) methods (also sometimes called projection to latent structures) relate the information present in two data tables
that collect measurements on the same set of observations. These
methods were first developed in the late 1960s to the 1980s by the
economist Herman Wold (55, 56, 57), but their main early areas of development were chemometrics (initiated by Herman's son Svante (59)) and sensory evaluation (34, 35). The original
approach of Herman Wold was to develop a least square algorithm
(called NIPALS (56)) for estimating parameters in path analysis models (instead of the maximum likelihood approach used for structural equation modeling such as, e.g., LISREL).

Fig. 1. The PLS family.

This first
approach gave rise to partial least square path modeling (PLS-PM)
which is still active today (see, e.g., (26, 48)) and can be seen as a
least square alternative for structural equation modeling (which
uses, in general, a maximum likelihood estimation approach).
From a multivariate descriptive analysis point of view, however,
most of the early developments of PLS were concerned with defining
a latent variable approach to the analysis of two data tables describ-
ing one set of observations. Latent variables are new variables
obtained as linear combinations of the original variables. When the
goal is to find the shared information between these two tables, the
approach is equivalent to a correlation problem and the technique is
then called partial least square correlation (PLSC) (also sometimes
called PLS-SVD (31)). In this case there are two sets of latent vari-
ables (one set per table), and these latent variables are required to
have maximal covariance. When the goal is to predict one data table from the other one, the technique is then called partial least square
regression (PLSR, see (4, 16, 20, 42)). In this case there is one set
of latent variables (derived from the predictor table) and these latent
variables are computed to give the best possible prediction. The
latent variables and associated parameters are often called dimen-
sion. So, for example, for PLSC the first set of latent variables is
called the first dimension of the analysis.
In this chapter we will present PLSC and PLSR and illustrate
them with an example. PLS-methods and their main goals are
described in Fig. 1.

2. Notations

Data are stored in matrices which are denoted by upper case bold
letters (e.g., X). The identity matrix is denoted I. Column vectors

are denoted by lower case bold letters (e.g., x). Matrix or vector
transposition is denoted by an uppercase superscript T (e.g., XT).
Two bold letters placed next to each other imply matrix or vector
multiplication unless otherwise mentioned. The number of rows, columns, or sub-matrices is denoted by an uppercase italic letter (e.g., I), and a given row, column, or sub-matrix is denoted by a lowercase italic letter (e.g., i).
PLS methods analyze the information common to two matrices.
The first matrix is an I by J matrix denoted X whose generic element
is xi,j and where the rows are observations and the columns are
variables. For PLSR the X matrix contains the predictor variables
(i.e., independent variables). The second matrix is an I by K matrix,
denoted Y, whose generic element is yi,k. For PLSR, the Y matrix
contains the variables to be predicted (i.e., dependent variables). In
general, matrices X and Y are statistically preprocessed in order to
make the variables comparable. Most of the time, the columns of X
and Y will be rescaled such that the mean of each column is zero and
its norm (i.e., the square root of the sum of its squared elements) is
one. When we need to mark the difference between the original data
and the preprocessed data, the original data matrices will be denoted
X and Y and the rescaled data matrices will be denoted ZX and ZY.

3. The Main Tool: The Singular Value Decomposition
The main analytical tool for PLS is the singular value decomposition
(SVD) of a matrix (see (3, 21, 30, 47), for details and tutorials).
Recall that the SVD of a given J × K matrix Z decomposes it into three matrices as:

Z = UDV^T = Σℓ dℓ uℓ vℓ^T    (1)

where U is the J by L matrix of the normalized left singular vectors (with L being the rank of Z), V the K by L matrix of the normalized right singular vectors, and D the L by L diagonal matrix of the L singular values. Also, dℓ, uℓ, and vℓ are, respectively, the ℓth singular value and the ℓth left and right singular vectors. Matrices U and V are orthonormal matrices (i.e., U^T U = V^T V = I).
The SVD is closely related to and generalizes the well-known
eigen-decomposition because U is also the matrix of the normalized
eigenvectors of ZZT, V is the matrix of the normalized eigenvectors
of ZTZ, and the singular values are the square root of the
eigenvalues of ZZT and ZTZ (these two matrices have the same
eigenvalues). Key property: the SVD provides the best reconstitution
(in a least squares sense) of the original matrix by a matrix with a lower
rank (for more details, see, e.g., (13, 47)).
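This key property can be verified numerically; a small R sketch with an arbitrary random matrix:

# best rank-2 reconstruction of Z from its first two singular triplets
Z  <- matrix(rnorm(5 * 9), nrow = 5)
s  <- svd(Z)
Z2 <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])
sum((Z - Z2)^2)  # residual; minimal among all rank-2 approximations of Z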

4. Partial Least Squares Correlation
PLSC generalizes the idea of correlation between two variables to
two tables. It was originally developed by Tucker (51), and refined
by Bookstein (14, 15, 46). This technique is particularly popular in
brain imaging because it can handle the very large data sets gener-
ated by these techniques and can easily be adapted to handle
sophisticated experimental designs (31, 38-41). For PLSC, both
tables play a similar role (i.e., both are dependent variables) and the
goal is to analyze the information common to these two tables. This
is obtained by deriving two new sets of variables (one for each table)
called latent variables that are obtained as linear combinations of
the original variables. These latent variables, which describe the
observations, are required to explain the largest portion of the
covariance between the two tables. The original variables are
described by their saliences.
For each latent variable, the X or Y variables whose saliences have a large magnitude have large weights in the computation of that latent variable. Therefore, these variables have contributed a large amount to creating the latent variable and should be used to interpret it (i.e., the latent variable is mostly made from these high contributing variables).
analysis (see, e.g., (13)), the latent variables are akin to factor scores
and the saliences are akin to loadings.

4.1. Correlation Between the Two Tables

Formally, the pattern of relationships between the columns of X and Y is stored in a K × J cross-product matrix, denoted R (that is usually a correlation matrix in that we compute it with ZX and ZY instead of X and Y). R is computed as:

R = ZY^T ZX.    (2)

The SVD (see Eq. 1) of R decomposes it into three matrices:

R = UDV^T.    (3)
In the PLSC vocabulary, the singular vectors are called saliences: U is the matrix of Y-saliences and V is the matrix of X-saliences. Because they are singular vectors, the norm of the saliences for a given dimension is equal to one. Some authors (e.g., (31)) prefer to normalize the saliences by their singular values (i.e., the delta-normed Y-saliences will be equal to UD instead of U) because the plots of the saliences will then be interpretable in the same way as factor score plots for PCA. We will follow this approach here because it makes the interpretation of the saliences easier.

4.1.1. Common Inertia

The quantity of common information between the two tables can be directly quantified as the inertia common to the two tables. This quantity, denoted ITotal, is defined as

ITotal = Σℓ dℓ,    (4)

where dℓ denotes the ℓth singular value from Eq. 3 (i.e., the ℓth diagonal element of D) and the sum runs over the L nonzero singular values of R.

4.2. Latent Variables

The latent variables are obtained by projecting the original matrices onto their respective saliences. So, a latent variable is a linear combination of the original variables, and the weights of this linear combination are the saliences. Specifically, we obtain the latent variables for X as:

LX = ZX V,    (5)

and for Y as:

LY = ZY U.    (6)

(NB: some authors compute the latent variables with Y and X rather than ZY and ZX; this difference is only a matter of normalization, but using ZY and ZX has the advantage of directly relating the latent variables to the maximization criterion used.) The latent variables combine the measurements from one table in order to find the common information between the two tables.
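In R, Eqs. 2-6 translate directly into a few matrix operations. The following minimal sketch uses simulated stand-ins for ZX and ZY (the real wine data appear later in this chapter):

ZX <- scale(matrix(rnorm(36 * 5), nrow = 36))  # stand-in for the X table
ZY <- scale(matrix(rnorm(36 * 9), nrow = 36))  # stand-in for the Y table
R  <- t(ZY) %*% ZX   # Eq. 2: cross-product (correlation) matrix
s  <- svd(R)         # Eq. 3: saliences U = s$u and V = s$v
LX <- ZX %*% s$v     # Eq. 5: latent variables for X
LY <- ZY %*% s$u     # Eq. 6: latent variables for Y
diag(t(LX) %*% LY)   # their covariances equal the singular values s$d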

4.3. What Does PLSC Optimize?

The goal of PLSC is to find pairs of latent vectors lX,ℓ and lY,ℓ with maximal covariance and with the additional constraints that (1) the pairs of latent vectors made from two different indices are uncorrelated and (2) the coefficients used to compute the latent variables are normalized (see (48, 51) for proofs). Formally, we want to find

lX,ℓ = ZX vℓ and lY,ℓ = ZY uℓ

such that

cov(lX,ℓ, lY,ℓ) ∝ lX,ℓ^T lY,ℓ = max    (7)

[where cov(lX,ℓ, lY,ℓ) denotes the covariance between lX,ℓ and lY,ℓ] under the constraints that

lX,ℓ^T lY,ℓ′ = 0 when ℓ ≠ ℓ′    (8)

(note that lX,ℓ^T lX,ℓ′ and lY,ℓ^T lY,ℓ′ are not required to be null) and

uℓ^T uℓ = vℓ^T vℓ = 1.    (9)
It follows from the properties of the SVD (see, e.g., (13, 21, 30, 47)) that uℓ and vℓ are singular vectors of R. In addition, from Eqs. 3, 5, and 6, the covariance of a pair of latent variables lX,ℓ and lY,ℓ is equal to the corresponding singular value:

lX,ℓ^T lY,ℓ = dℓ.    (10)
So, when ℓ = 1, we have the largest possible covariance between the pair of latent variables. When ℓ = 2, we have the largest possible covariance for the latent variables under the constraint that the latent variables are uncorrelated with the first pair of latent variables (as stated in Eq. 8, e.g., lX,1 and lY,2 are uncorrelated), and so on for larger values of ℓ.
So, in brief, for each dimension, PLSC provides two sets of saliences (one for X, one for Y) and two sets of latent variables. The saliences are the weights of the linear combinations used to compute the latent variables, which are ordered by the amount of covariance they explain. By analogy with principal component analysis, saliences are akin to loadings and latent variables are akin to factor scores (see, e.g., (13)).

4.4. Significance

PLSC is originally a descriptive multivariate technique. As with all such techniques, an additional inferential step is often needed to assess if the results can be considered reliable or significant. Tucker (51) suggested some possible analytical inferential approaches, which were too complex and made too many assumptions to be routinely used. Currently, statistical significance is assessed by computational cross-validation methods. Specifically, the significance of the global model and of the dimensions can be assessed with permutation tests (29), whereas the significance of specific saliences or latent variables can be assessed via the bootstrap (23).

4.4.1. Permutation Test for Omnibus Tests and Dimensions

The permutation test, originally developed by Student and Fisher (37), provides a nonparametric estimation of the sampling distribution of the indices computed and allows for null hypothesis testing. For a permutation test, the rows of X and Y are randomly permuted (in practice only one of the matrices needs to be permuted) so that any relationship between the two matrices is now replaced by a random configuration. The matrix Rperm is computed from the permuted matrices (this matrix reflects only random associations of the original data because of the permutations) and the analysis of Rperm is performed: the singular value decomposition of Rperm is computed. This gives a set of singular values, from which the overall index of effect ITotal (i.e., the common inertia) is computed. The process is repeated a large number of times (e.g., 10,000 times). Then, the distribution of the overall index and the distribution of the singular values are used to estimate the probability distribution of ITotal and of the singular values, respectively. If the common inertia computed for the sample is rare enough (e.g., less than 5%) then this index is considered statistically

significant. This test corresponds to an omnibus test (i.e., it tests an overall effect) but does not indicate which dimensions are significant. The significant dimensions are obtained from the sampling distribution of the singular values of the same order. Dimensions with a rare singular value (e.g., less than 5%) are considered significant (e.g., the first singular value is considered significant if it is rarer than 5% of the first singular values obtained from the Rperm matrices). Recall that the singular values are ordered from the largest to the smallest. In general, when a singular value is considered nonsignificant, all the smaller singular values are considered to be nonsignificant.

4.4.2. What are the Important Variables for a Dimension

The bootstrap (23, 24) can be used to derive confidence intervals and bootstrap ratios (5, 6, 9, 40), which are also sometimes called test-values (32). Confidence intervals give lower and higher values, which together comprise a given proportion (e.g., often 95%) of the values of the saliences. If the zero value is not in the confidence interval of the saliences of a variable, this variable is considered relevant (i.e., significant). Bootstrap ratios are computed by dividing the mean of the bootstrapped distribution of a variable by its standard deviation. The bootstrap ratio is akin to a Student t criterion, and so if a ratio is large enough (say 2.00, because this roughly corresponds to an α = .05 critical value for a t-test) then the variable is considered important for the dimension. The bootstrap estimates the sampling distribution of a statistic by computing multiple instances of this statistic from bootstrapped samples obtained by sampling with replacement from the original sample.
For example, in order to evaluate the saliences of Y, the first step is to select with replacement a sample of the rows. This sample is then used to create Yboot and Xboot, which are transformed into ZYboot and ZXboot, which are in turn used to compute Rboot as:

Rboot = ZYboot^T ZXboot.    (11)

The bootstrap values for Y, denoted Uboot, are then computed as

Uboot = Rboot V D^(-1).    (12)

The values from a large set (e.g., 10,000) of bootstrapped samples are then used to compute confidence intervals and bootstrap ratios.
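Both resampling schemes are straightforward to sketch in R; a minimal illustration (not the authors' code) building on the ZX, ZY, and s objects from the sketch above:

set.seed(1)
inertia <- function(ZX, ZY) sum(svd(t(ZY) %*% ZX)$d)  # common inertia, Eq. 4
obs <- inertia(ZX, ZY)
# permutation test: destroy the row correspondence between the two tables
perm <- replicate(1000, inertia(ZX, ZY[sample(nrow(ZY)), ]))
mean(perm >= obs)  # permutation p-value for the omnibus test
# bootstrap: resample rows with replacement and recompute the Y-saliences
idx   <- sample(nrow(ZX), replace = TRUE)
Rboot <- t(scale(ZY[idx, ])) %*% scale(ZX[idx, ])  # Eq. 11
Uboot <- Rboot %*% s$v %*% diag(1 / s$d)           # Eq. 12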

4.5. PLSC: Example

We will illustrate PLSC with an example in which I = 36 wines are described by a matrix X which contains J = 5 objective measurements (price, total acidity, alcohol, sugar, and tannin) and by a matrix Y which contains K = 9 sensory measurements (fruity, floral, vegetal, spicy, woody, sweet, astringent, acidic, hedonic) provided (on a 9-point rating scale) by a panel of trained wine assessors (the ratings given were the median rating for the group of assessors). Table 1 gives the raw data (note that columns two to four, which
describe the varietal, origin, and color of the wine, are not used in the analysis but can help interpret the results).

Table 1
Physical and chemical descriptions (matrix X: Price through Tannin) and assessor sensory evaluations (matrix Y: Fruity through Hedonic) of 36 wines

Wine Varietal Origin Color Price TotalAcidity Alcohol Sugar Tannin Fruity Floral Vegetal Spicy Woody Sweet Astringent Acidic Hedonic
1 Merlot Chile Red 13 5.33 13.8 2.75 559 6 2 1 4 5 3 5 4 2
2 Cabernet Chile Red 9 5.14 13.9 2.41 672 5 3 2 3 4 2 6 3 2
3 Shiraz Chile Red 11 5.16 14.3 2.20 455 7 1 2 6 5 3 4 2 2
4 Pinot Chile Red 17 4.37 13.5 3.00 348 5 3 2 2 4 1 3 4 4
5 Chardonnay Chile White 15 4.34 13.3 2.61 46 5 4 1 3 4 2 1 4 6
6 Sauvignon Chile White 11 6.60 13.3 3.17 54 7 5 6 1 1 4 1 5 8
7 Riesling Chile White 12 7.70 12.3 2.15 42 6 7 2 2 2 3 1 6 9
8 Gewurztraminer Chile White 13 6.70 12.5 2.51 51 5 8 2 1 1 4 1 4 9
9 Malbec Chile Rose 9 6.50 13.0 7.24 84 8 4 3 2 2 6 2 3 8
10 Cabernet Chile Rose 8 4.39 12.0 4.50 90 6 3 2 1 1 5 2 3 8
11 Pinot Chile Rose 10 4.89 12.0 6.37 76 7 2 1 1 1 4 1 4 9
12 Syrah Chile Rose 9 5.90 13.5 4.20 80 8 4 1 3 2 5 2 3 7
13 Merlot Canada Red 20 7.42 14.9 2.10 483 5 3 2 3 4 3 4 4 3
14 Cabernet Canada Red 16 7.35 14.5 1.90 698 6 3 2 2 5 2 5 4 2
15 Shiraz Canada Red 20 7.50 14.5 1.50 413 6 2 3 4 3 3 5 1 2
16 Pinot Canada Red 23 5.70 13.3 1.70 320 4 2 3 1 3 2 4 4 4
17 Chardonnay Canada White 20 6.00 13.5 3.00 35 4 3 2 1 3 2 2 3 5
18 Sauvignon Canada White 16 7.50 12.0 3.50 40 8 4 3 2 1 3 1 4 8
19 Riesling Canada White 16 7.00 11.9 3.40 48 7 5 1 1 3 3 1 7 8
20 Gewurztraminer Canada White 18 6.30 13.9 2.80 39 6 5 2 2 2 3 2 5 6
21 Malbec Canada Rose 11 5.90 12.0 5.50 90 6 3 3 3 2 4 2 4 8
22 Cabernet Canada Rose 10 5.60 1.25 4.00 85 5 4 1 3 2 4 2 4 7
23 Pinot Canada Rose 12 6.20 13.0 6.00 75 5 3 2 1 2 3 2 3 7
24 Syrah Canada Rose 12 5.80 13.0 3.50 83 7 3 2 3 3 4 1 4 7
25 Merlot USA Red 23 6.00 13.6 3.50 578 7 2 2 5 6 3 4 3 2
26 Cabernet USA Red 16 6.50 14.6 3.50 710 8 3 1 4 5 3 5 3 2
27 Shiraz USA Red 23 5.30 13.9 1.99 610 8 2 3 7 6 4 5 3 1
28 Pinot USA Red 25 6.10 14.0 0.00 340 6 3 2 2 5 2 4 4 2
29 Chardonnay USA White 16 7.20 13.3 1.10 41 6 4 2 3 6 3 2 4 5
30 Sauvignon USA White 11 7.20 13.5 1.00 50 6 5 5 1 2 4 2 4 7
31 Riesling USA White 13 8.60 12.0 1.65 47 5 5 3 2 2 4 2 5 8
32 Gewurztraminer USA White 20 9.60 12.0 0.00 45 6 6 3 2 2 4 2 3 8
33 Malbec USA Rose 8 6.20 12.5 4.00 84 8 2 1 4 3 5 2 4 7
34 Cabernet USA Rose 9 5.71 12.5 4.30 93 8 3 3 3 2 6 2 3 8
35 Pinot USA Rose 11 5.40 13.0 3.10 79 6 1 1 2 3 4 1 3 6
36 Syrah USA Rose 10 6.50 13.5 3.00 89 9 3 2 5 4 3 2 3 5

4.5.1. Centering and Normalization

Because X and Y measure variables with very different scales, each column of these matrices is centered (i.e., its mean is zero) and rescaled so that its norm (i.e., the square root of the sum of its squared elements) is equal to one. This gives two new matrices called ZX and ZY, which are given in Table 2.
The 5 by 9 matrix of correlations R is then computed from ZX and ZY as

R = ZY^T ZX =
[ 0.278 0.083 0.068 0.115 0.481 0.560 0.407 0.020 0.540 ]
[ 0.029 0.531 0.348 0.168 0.162 0.084 0.098 0.202 0.202 ]
[ 0.044 0.387 0.016 0.431 0.661 0.445 0.730 0.399 0.850 ]
[ 0.305 0.187 0.198 0.118 0.400 0.469 0.326 0.054 0.418 ]
[ 0.008 0.479 0.132 0.525 0.713 0.408 0.936 0.336 0.884 ]    (13)

The R matrix contains the correlation between each variable in X and each variable in Y.

4.5.2. SVD of R

The SVD (cf. Eqs. 1 and 3) of R is computed as R = UDV^T, with

U =
[ 0.366 0.423 0.498 0.078 0.658 ]
[ 0.180 0.564 0.746 0.021 0.304 ]
[ 0.584 0.112 0.206 0.777 0.005 ]
[ 0.272 0.652 0.145 0.077 0.689 ]
[ 0.647 0.255 0.364 0.620 0.006 ]

D = diag(2.629, 0.881, 0.390, 0.141, 0.077)

and V (entering the product as V^T) =
[ 0.080 0.338 0.508 0.044 0.472 ]
[ 0.232 0.627 0.401 0.005 0.291 ]
[ 0.030 0.442 0.373 0.399 0.173 ]
[ 0.265 0.171 0.206 0.089 0.719 ]
[ 0.442 0.133 0.057 0.004 0.092 ]
[ 0.332 0.388 0.435 0.084 0.265 ]
[ 0.490 0.011 0.433 0.508 0.198 ]
[ 0.183 0.307 0.134 0.712 0.139 ]
[ 0.539 0.076 0.043 0.243 0.088 ]    (14)

4.5.3. From Salience to Factor Score

The saliences can be plotted as a PCA-like map (one per table), but here we preferred to plot the delta-normed saliences FX and FY, which are also called factor scores. These graphs give the same information as the salience plots, but their normalization makes the interpretation of a plot of several saliences easier.
Table 2
The matrices ZX and ZY (corresponding to X and Y)

ZX: centered and normalized version of X (physical/chemical description); ZY: centered and normalized version of Y (assessors' evaluation)

Wine Name Varietal Origin Color Price TotalAcidity Alcohol Sugar Tannin Fruity Floral Vegetal Spicy Woody Sweet Astringent Acidic Hedonic

1 Merlot Chile Red  0.046  0.137 0.120  0.030 0.252  0.041  0.162  0.185 0.154 0.211  0.062 0.272 0.044  0.235

2 Cabernet Chile Red  0.185  0.165 0.140  0.066 0.335  0.175  0.052  0.030 0.041 0.101  0.212 0.385  0.115  0.235

3 Shiraz Chile Red  0.116  0.162 0.219  0.088 0.176 0.093  0.271  0.030 0.380 0.211  0.062 0.160  0.275  0.235

4 Pinot Chile Red 0.093  0.278 0.061  0.003 0.098  0.175  0.052  0.030  0.072 0.101  0.361 0.047 0.044  0.105

5 Chardonnay Chile White 0.023  0.283 0.022  0.045  0.124  0.175 0.058  0.185 0.041 0.101  0.212  0.178 0.044 0.025

6 Sauvignon Chile White  0.116 0.049 0.022 0.015  0.118 0.093 0.168 0.590  0.185  0.229 0.087  0.178 0.204 0.155

7 Riesling Chile White  0.081 0.210  0.175  0.093  0.127  0.041 0.387  0.030  0.072  0.119  0.062  0.178 0.364 0.220

8 Gewurztraminer Chile White  0.046 0.064  0.136  0.055  0.120  0.175 0.497  0.030  0.185  0.229 0.087  0.178 0.044 0.220

9 Malbec Chile Rose  0.185 0.034  0.037 0.444  0.096 0.227 0.058 0.125  0.072  0.119 0.386  0.066  0.115 0.155

10 Cabernet Chile Rose  0.220  0.275  0.234 0.155  0.091  0.041  0.052  0.030  0.185  0.229 0.237  0.066  0.115 0.155

11 Pinot Chile Rose  0.150  0.202  0.234 0.352  0.102 0.093  0.162  0.185  0.185  0.229 0.087  0.178 0.044 0.220

12 Syrah Chile Rose  0.185  0.054 0.061 0.123  0.099 0.227 0.058  0.185 0.041  0.119 0.237  0.066  0.115 0.090

13 Merlot Canada Red 0.197 0.169 0.337  0.098 0.197  0.175  0.052  0.030 0.041 0.101  0.062 0.160 0.044  0.170

14 Cabernet Canada Red 0.058 0.159 0.258  0.119 0.354  0.041  0.052  0.030  0.072 0.211  0.212 0.272 0.044  0.235

15 Shiraz Canada Red 0.197 0.181 0.258  0.162 0.145  0.041  0.162 0.125 0.154  0.009  0.062 0.272  0.435  0.235

16 Pinot Canada Red 0.301  0.083 0.022  0.141 0.077  0.309  0.162 0.125  0.185  0.009  0.212 0.160 0.044  0.105

17 Chardonnay Canada White 0.197  0.039 0.061  0.003  0.132  0.309  0.052  0.030  0.185  0.009  0.212  0.066  0.115  0.040

18 Sauvignon Canada White 0.058 0.181  0.234 0.049  0.128 0.227 0.058 0.125  0.072  0.229  0.062  0.178 0.044 0.155

19 Riesling Canada White 0.058 0.108  0.254 0.039  0.122 0.093 0.168  0.185  0.185  0.009  0.062  0.178 0.523 0.155

20 Gewurztraminer Canada White 0.127 0.005 0.140  0.024  0.129  0.041 0.168  0.030  0.072  0.119  0.062  0.066 0.204 0.025

(continued)
Table 2
(continued)
ZX: Centered and normalized version of X: Physical/Chemical
Wine descriptors description ZY: Centered and normalized version of Y: Assessors evaluation

Wine Name Varietal Origin Color Price Total acidity Alcohol Sugar Tannin Fruity Floral Vegetal Spicy Woody Sweet Astringent Acidic Hedonic

21 Malbec Canada Rose  0.116  0.054  0.234 0.261  0.091  0.041  0.052 0.125 0.041  0.119 0.087  0.066 0.044 0.155

22 Cabernet Canada Rose  0.150  0.098  0.136 0.102  0.095  0.175 0.058  0.185 0.041  0.119 0.087  0.066 0.044 0.090

23 Pinot Canada Rose  0.081  0.010  0.037 0.313  0.102  0.175  0.052  0.030  0.185  0.119  0.062  0.066  0.115 0.090

24 Syrah Canada Rose  0.081  0.068  0.037 0.049  0.097 0.093  0.052  0.030 0.041  0.009 0.087  0.178 0.044 0.090

25 Merlot USA Red 0.301  0.039 0.081 0.049 0.266 0.093  0.162  0.030 0.267 0.321  0.062 0.160  0.115  0.235

26 Cabernet USA Red 0.058 0.034 0.278 0.049 0.363 0.227  0.052  0.185 0.154 0.211  0.062 0.272  0.115  0.235

27 Shiraz USA Red 0.301  0.142 0.140  0.110 0.290 0.227  0.162 0.125 0.493 0.321 0.087 0.272  0.115  0.300

28 Pinot USA Red 0.370  0.024 0.160  0.320 0.092  0.041  0.052  0.030  0.072 0.211  0.212 0.160 0.044  0.235

29 Chardonnay USA White 0.058 0.137 0.022  0.204  0.127  0.041 0.058  0.030 0.041 0.321  0.062  0.066 0.044  0.040

30 Sauvignon USA White  0.116 0.137 0.061  0.214  0.121  0.041 0.168 0.435  0.185  0.119 0.087  0.066 0.044 0.090

31 Riesling USA White  0.046 0.342  0.234  0.146  0.123  0.175 0.168 0.125  0.072  0.119 0.087  0.066 0.204 0.155

32 Gewurztraminer USA White 0.197 0.489  0.234  0.320  0.124  0.041 0.278 0.125  0.072  0.119 0.087  0.066  0.115 0.155

33 Malbec USA Rose  0.220  0.010  0.136 0.102  0.096 0.227  0.162  0.185 0.154  0.009 0.237  0.066 0.044 0.090

34 Cabernet USA Rose  0.185  0.082  0.136 0.134  0.089 0.227  0.052 0.125 0.041  0.119 0.386  0.066  0.115 0.155

35 Pinot USA Rose  0.116  0.127  0.037 0.007  0.100  0.041  0.271  0.185  0.072  0.009 0.087  0.178  0.115 0.025

36 Syrah USA Rose  0.150 0.034 0.061  0.003  0.092 0.361  0.052  0.030 0.267 0.101  0.062  0.066  0.115  0.040

Each column has a mean of zero and a sum of squares of one


23 Partial Least Squares Methods: Partial Least Squares. . . 561

Fig. 2. The saliences (normalized to their eigenvalues) for the physical attributes of the wines.
Specifically, each salience is multiplied by its singular value; then, when a plot is made with the saliences corresponding to two different dimensions, the distances on the graph directly reflect the amount of explained covariance of R. The matrices FX and FY are computed as
\[
F_X = U D =
\begin{bmatrix}
0.962 & 0.373 & 0.194 & 0.011 & 0.051\\
0.473 & 0.497 & 0.291 & 0.003 & 0.024\\
1.536 & 0.098 & 0.080 & 0.109 & 0.000\\
0.714 & 0.574 & 0.057 & 0.011 & 0.053\\
1.700 & 0.225 & 0.142 & 0.087 & 0.000
\end{bmatrix}
\tag{15}
\]

\[
F_Y = V D =
\begin{bmatrix}
0.210 & 0.297 & 0.198 & 0.006 & 0.037\\
0.611 & 0.552 & 0.156 & 0.001 & 0.023\\
0.079 & 0.389 & 0.145 & 0.056 & 0.013\\
0.696 & 0.151 & 0.080 & 0.013 & 0.056\\
1.161 & 0.117 & 0.022 & 0.001 & 0.007\\
0.871 & 0.342 & 0.169 & 0.012 & 0.021\\
1.287 & 0.009 & 0.169 & 0.072 & 0.015\\
0.480 & 0.271 & 0.052 & 0.100 & 0.011\\
1.417 & 0.067 & 0.017 & 0.034 & 0.007
\end{bmatrix}
\tag{16}
\]
Figures 2 and 3 show the X and Y plots of the saliences for Dimensions 1 and 2.

4.5.4. Latent Variables
The latent variables for X and Y are computed according to Eqs. 5 and 6. These latent variables are shown in Tables 3 and 4. The corresponding plots for Dimensions 1 and 2 are given in Figures 4 and 5.
Fig. 3. The saliences (normalized to their eigenvalues) for the sensory evaluation of the attributes of the wines.
Table 3
PLSC. The X latent variables, LX = ZX U (rows are wines 1-36 in the order of Table 2).

Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
 0.249   0.156   0.033   0.065  -0.092
 0.278   0.230   0.110   0.093  -0.216
 0.252   0.153   0.033  -0.060  -0.186
 0.184   0.147  -0.206   0.026  -0.026
 0.004   0.092  -0.269  -0.083  -0.102
-0.119   0.003   0.058  -0.101  -0.052
-0.226  -0.197   0.102   0.054  -0.053
-0.170  -0.098  -0.009   0.030  -0.049
-0.278   0.320   0.140  -0.080   0.194
-0.269   0.300  -0.155   0.102  -0.121
-0.317   0.355  -0.110   0.084   0.083
-0.120   0.171   0.047  -0.132  -0.054
 0.392  -0.155   0.155  -0.120   0.113
 0.405  -0.073   0.255   0.030   0.005
 0.328  -0.225   0.120  -0.086   0.073
 0.226  -0.150  -0.200   0.067   0.076
 0.030  -0.090  -0.163  -0.113   0.114
-0.244  -0.153   0.019   0.099   0.128
-0.236  -0.119  -0.040   0.121   0.098
 0.051  -0.090  -0.081  -0.177   0.067
-0.299   0.200  -0.026   0.097   0.088
-0.206   0.146  -0.046   0.029  -0.058
-0.201   0.214   0.034  -0.065   0.159
-0.115   0.076  -0.046  -0.040  -0.040
 0.323   0.004  -0.058   0.123   0.221
 0.399   0.112   0.193   0.009   0.083
 0.435  -0.029  -0.137   0.106   0.080
 0.379  -0.310  -0.183  -0.013   0.016
-0.018  -0.265   0.002  -0.079  -0.062
-0.051  -0.192   0.097  -0.118  -0.183
-0.255  -0.326   0.164   0.106  -0.026
-0.146  -0.626   0.127   0.134   0.058
-0.248   0.126   0.054   0.021  -0.077
-0.226   0.174  -0.010   0.027  -0.054
-0.108   0.096  -0.080  -0.040  -0.110
-0.084   0.025   0.079  -0.117  -0.092
Table 4
PLSC. The Y latent variables, LY = ZY V (rows are wines 1-36 in the order of Table 2).

Dim 1   Dim 2   Dim 3   Dim 4   Dim 5
 0.453   0.109  -0.040   0.197  -0.037
 0.489  -0.088  -0.018   0.062   0.025
 0.526   0.293   0.083  -0.135  -0.145
 0.243  -0.201  -0.280   0.013   0.090
 0.022  -0.112  -0.308   0.015  -0.145
-0.452  -0.351   0.236  -0.157   0.208
-0.409  -0.357  -0.047   0.225  -0.062
-0.494  -0.320   0.019   0.006  -0.150
-0.330   0.186   0.325  -0.112   0.030
-0.307   0.170   0.005  -0.062   0.040
-0.358   0.252  -0.167   0.053   0.142
-0.206   0.280   0.171  -0.006  -0.060
 0.264  -0.072  -0.075   0.090  -0.042
 0.412  -0.125  -0.050   0.103   0.160
 0.434   0.149   0.152  -0.268  -0.030
 0.202  -0.194  -0.237   0.016   0.160
 0.065  -0.138  -0.330  -0.134   0.029
-0.314  -0.021   0.066  -0.094   0.159
-0.340  -0.194  -0.173   0.368   0.138
-0.169  -0.186  -0.057   0.120   0.019
-0.183   0.019   0.017  -0.002  -0.045
-0.154   0.037  -0.120   0.112  -0.188
-0.114  -0.010  -0.196  -0.096   0.051
-0.161   0.114  -0.025  -0.019  -0.035
 0.490   0.141   0.076  -0.031  -0.083
 0.435   0.180   0.162   0.072   0.035
 0.575   0.208   0.365  -0.024  -0.167
 0.357  -0.124  -0.098   0.046   0.137
 0.145  -0.113  -0.078   0.002  -0.087
-0.268  -0.299   0.177  -0.161   0.114
-0.283  -0.232  -0.008   0.109  -0.068
-0.260  -0.158   0.147  -0.124  -0.081
-0.106   0.373   0.078   0.117  -0.065
-0.275   0.275   0.305  -0.102  -0.019
-0.060   0.300  -0.238  -0.091   0.004
 0.130   0.209   0.162  -0.110  -0.030
These plots show clearly that wine color is a major determinant of the wines from both the physical and the sensory points of view.
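Continuing the sketch given after Eq. 14 (same hypothetical variables, and assuming the conventions used here, with U holding the X saliences and V the Y saliences), the factor scores of Eqs. 15-16 and the latent variables of Tables 3-4 are one line each:

```python
FX, FY = U * d, V * d      # Eqs. 15-16: saliences scaled by the singular values
LX, LY = ZX @ U, ZY @ V    # Tables 3-4: one row per wine, one column per dimension
```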
Fig. 4. Plot of the wines: the X latent variables for Dimensions 1 and 2.
Fig. 5. Plot of the wines: the Y latent variables for Dimensions 1 and 2.
4.5.5. Permutation Test
In order to evaluate whether the overall analysis extracts relevant information, we computed the total inertia extracted by the PLSC. Using Eq. 4, we found that the inertia common to the two tables was equal to 7.8626. To evaluate its significance, we generated 10,000 R matrices by permuting the rows of X. The distribution of the values of the inertia is given in Fig. 6.
Fig. 6. Permutation test for the inertia explained by the PLSC of the wine data: number of samples (out of 10,000) as a function of the inertia of the permuted sample. The observed value was never obtained in the 10,000 permutations; therefore we conclude that PLSC extracted a significant amount of common variance between these two tables (p < .0001).
The observed value of 7.8626 was never obtained in this sample. Therefore we conclude that the probability of finding such a value by chance alone is smaller than 1/10,000 (i.e., we can say that p < .0001).
The same approach can be used to evaluate the significance of the dimensions extracted by PLSC. The permutation test found that only the first two dimensions could be considered significant at the alpha = .05 level: for Dimension 1, p < .0001, and for Dimension 2, p = .0043. Therefore, we decided to keep only these first two dimensions for further analysis.
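The permutation test itself takes only a few lines. The sketch below (same hypothetical variables as in the earlier snippets) permutes the rows of ZX, recomputes the inertia, and counts how often the permuted inertia reaches the observed one; the +1 correction is an assumption we add to keep the estimated p value away from exactly zero:

```python
obs = (np.linalg.svd(ZX.T @ ZY, compute_uv=False) ** 2).sum()
count, n_perm = 0, 10_000
for _ in range(n_perm):
    perm = rng.permutation(ZX.shape[0])              # permute the rows of X
    d_perm = np.linalg.svd(ZX[perm].T @ ZY, compute_uv=False)
    count += (d_perm ** 2).sum() >= obs
p_value = (count + 1) / (n_perm + 1)                 # p < .0001 for the wines
```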

4.5.6. Bootstrap
Bootstrap ratios and 95% confidence intervals for X and Y are given for Dimensions 1 and 2 in Table 5. As is often the case, bootstrap ratios and confidence intervals concur in indicating the relevant variables for a dimension. For example, for Dimension 1, the important variables (i.e., variables with a bootstrap ratio > 2 or whose confidence interval excludes zero) for X are Tannin, Alcohol, Price, and Sugar, whereas for Y they are Hedonic, Astringent, Woody, Sweet, Floral, Spicy, and Acidic.
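A bootstrap ratio is simply the observed salience divided by the standard deviation of its bootstrapped values. A minimal sketch for the Y saliences follows (the sign-alignment step is an implementation detail we assume, because resampled singular vectors are defined only up to a sign):

```python
n_boot, I = 1000, ZX.shape[0]
boot = np.empty((n_boot,) + V.shape)
for b in range(n_boot):
    idx = rng.integers(0, I, I)                      # resample wines with replacement
    _, _, Vbt = np.linalg.svd(ZX[idx].T @ ZY[idx], full_matrices=False)
    Vb = Vbt.T
    Vb *= np.sign((Vb * V).sum(0))                   # align signs with the original V
    boot[b] = Vb
bootstrap_ratios = V / boot.std(0)                   # cf. Table 5
```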
Table 5
PLSC. Bootstrap ratios and 95% confidence intervals for X and Y.

              Dimension 1                         Dimension 2
              Bootstrap   Lower     Upper         Bootstrap   Lower     Upper
              ratio       95% CI    95% CI        ratio       95% CI    95% CI
X
Price          3.6879     0.1937    0.5126        -2.1720    -0.7845   -0.1111
Acidity       -1.6344    -0.3441    0.0172         3.3340    -0.8325   -0.2985
Alcohol       13.7384     0.5070    0.6420         0.5328    -0.2373    0.3845
Sugar         -2.9555    -0.4063   -0.1158         4.7251     0.4302    0.8901
Tannin        16.8438     0.5809    0.7036         1.4694    -0.0303    0.5066
Y
Fruity        -0.9502    -0.2188    0.0648         2.0144     0.0516    0.5817
Floral         3.9264    -0.3233   -0.1314         3.4383    -0.9287   -0.3229
Vegetal       -0.3944    -0.1390    0.0971         2.6552    -0.7603   -0.1950
Spicy          3.2506     0.1153    0.3709         1.0825    -0.0922    0.4711
Woody          9.1335     0.3525    0.5118        -0.6104    -0.4609    0.2165
Sweet          6.9786    -0.4080   -0.2498         1.9499     0.0430    0.6993
Astringent    16.6911     0.4390    0.5316        -0.0688    -0.3099    0.2910
Acidic         2.5518    -0.2778   -0.0529         1.4430    -0.6968    0.0500
Hedonic       22.7344    -0.5741   -0.4968         0.3581    -0.2850    0.4341
5. Partial Least Squares Regression

Partial least squares regression (PLSR) is used when the goal of the analysis is to predict a set of variables (denoted Y) from a set of predictors (called X). As a regression technique, PLSR is used to predict a whole table of data (by contrast with standard regression, which predicts only one variable), and it can also handle the case of multicollinear predictors (i.e., when the predictors are not linearly independent). These features make PLSR a very versatile tool because it can be used with very large data sets for which standard regression methods fail.
In order to predict a table of variables, PLSR finds latent variables, denoted T (in matrix notation), that model X and simultaneously predict Y. Formally, this is expressed as a double decomposition of X and the predicted \(\hat{Y}\):

\[
X = T P^{T} \quad \text{and} \quad \hat{Y} = T B C^{T},
\tag{17}
\]
where P and C are called (respectively) the X and Y loadings (or weights) and B is a diagonal matrix. These latent variables are ordered according to the amount of variance of \(\hat{Y}\) that they explain. Rewriting Eq. 17 shows that \(\hat{Y}\) can also be expressed as a regression model as

\[
\hat{Y} = T B C^{T} = X B_{PLS}
\tag{18}
\]

with

\[
B_{PLS} = P^{T+} B C^{T}
\tag{19}
\]

(where \(P^{T+}\) is the Moore-Penrose pseudoinverse of \(P^{T}\); see, e.g., (12) for definitions). The matrix \(B_{PLS}\) has J rows and K columns and is equivalent to the regression weights of multiple regression. (Note that matrix B is diagonal, but matrix \(B_{PLS}\) is, in general, not diagonal.)

5.1. Iterative Computation of the Latent Variables in PLSR
In PLSR, the latent variables are computed by iterative applications of the SVD. Each run of the SVD produces orthogonal latent variables for X and Y and corresponding regression weights (see, e.g., (4) for more details and alternative algorithms).

5.1.1. Step One
To simplify the notation, we will assume that X and Y are mean-centered and normalized such that the mean of each column is zero and its sum of squares is one. At step one, X and Y are stored (respectively) in matrices X0 and Y0. The matrix of correlations (or covariances) between X0 and Y0 is computed as

\[
R_1 = X_0^{T} Y_0.
\tag{20}
\]

The SVD is then performed on R1 and produces two sets of orthogonal singular vectors, W1 and C1, and the corresponding singular values D1 (compare with Eq. 1):

\[
R_1 = W_1 D_1 C_1^{T}.
\tag{21}
\]

The first pair of singular vectors (i.e., the first columns of W1 and C1) are denoted w1 and c1, and the first singular value (i.e., the first diagonal entry of D1) is denoted d1. The singular value represents the maximum covariance between the singular vectors. The first latent variable of X is given by (compare with Eq. 5 defining LX):

\[
t_1 = X_0 w_1,
\tag{22}
\]

where t1 is normalized such that \(t_1^{T} t_1 = 1\). The loadings of X0 on t1 (i.e., the projection of X0 onto the space of t1) are given by

\[
p_1 = X_0^{T} t_1.
\tag{23}
\]
The least squares estimate of X from the first latent variable is given by

\[
\hat{X}_1 = t_1 p_1^{T}.
\tag{24}
\]

As an intermediate step, we derive a first pseudo latent variable for Y, denoted u1 and obtained as

\[
u_1 = Y_0 c_1.
\tag{25}
\]

Reconstituting Y from its pseudo latent variable as

\[
\hat{Y}_1 = u_1 c_1^{T},
\tag{26}
\]

and then rewriting Eq. 26, we obtain the prediction of Y from the X latent variable as

\[
\hat{Y}_1 = t_1 b_1 c_1^{T}
\tag{27}
\]

with

\[
b_1 = t_1^{T} u_1.
\tag{28}
\]

The scalar b1 is the slope of the regression of \(\hat{Y}_1\) on t1. Matrices \(\hat{X}_1\) and \(\hat{Y}_1\) are then subtracted from the original X0 and the original Y0, respectively, to give the deflated X1 and Y1:

\[
X_1 = X_0 - \hat{X}_1 \quad \text{and} \quad Y_1 = Y_0 - \hat{Y}_1.
\tag{29}
\]

5.1.2. Last Step
The iterative process continues until X is completely decomposed into L components (where L is the rank of X). When this is done, the weights (i.e., all the w's) for X are stored in the J by L matrix W (whose lth column is wl). The latent variables of X are stored in the I by L matrix T. The weights for Y are stored in the K by L matrix C. The pseudo latent variables of Y are stored in the I by L matrix U. The loadings for X are stored in the J by L matrix P. The regression weights are stored in a diagonal matrix B. These regression weights are used to predict Y from X; therefore, there is one b for every pair of t and u, and so B is an L by L diagonal matrix. The predicted Y scores are now given by

\[
\hat{Y} = T B C^{T} = X B_{PLS},
\tag{30}
\]

where \(B_{PLS} = P^{T+} B C^{T}\) (with \(P^{T+}\) the Moore-Penrose pseudoinverse of \(P^{T}\)). \(B_{PLS}\) has J rows and K columns.
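The iterative algorithm of Eqs. 20-30 translates almost line for line into code. Below is a minimal Python/NumPy sketch (our own illustration, not the authors' implementation); it assumes X and Y are already centered and normalized:

```python
import numpy as np

def plsr(X, Y, L):
    """Sketch of the iterative PLSR of Subheading 5.1 (Eqs. 20-30)."""
    X0, Y0 = X.astype(float).copy(), Y.astype(float).copy()
    C, P, T, b = [], [], [], []
    for _ in range(L):
        w, _, ct = np.linalg.svd(X0.T @ Y0, full_matrices=False)  # Eqs. 20-21
        w, c = w[:, 0], ct[0, :]            # first pair of singular vectors
        t = X0 @ w                          # Eq. 22 ...
        t /= np.linalg.norm(t)              # ... normalized so that t't = 1
        p = X0.T @ t                        # Eq. 23: loadings of X on t
        u = Y0 @ c                          # Eq. 25: pseudo latent variable
        b.append(t @ u)                     # Eq. 28: regression slope
        X0 -= np.outer(t, p)                # Eq. 29: deflation of X ...
        Y0 -= b[-1] * np.outer(t, c)        # ... and of Y
        C.append(c); P.append(p); T.append(t)
    T, P, C = map(np.column_stack, (T, P, C))
    B_pls = np.linalg.pinv(P.T) @ np.diag(b) @ C.T   # Eq. 19
    return T, B_pls                          # Y_hat = X @ B_pls (Eq. 30)
```

Under these assumptions, a call such as plsr(ZX, ZY, 2) should yield regression weights comparable to those reported in Eq. 37 below.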

5.2. What Does PLSR Optimize?
PLSR finds a series of L latent variables tl such that the covariance between t1 and Y is maximal and such that t1 is uncorrelated with t2, which has maximal covariance with Y, and so on for all L latent variables (see, e.g., (4, 17, 19, 26, 48, 49) for proofs and developments). Formally, we seek a set of L linear transformations of X that satisfies (compare with Eq. 7):

\[
t_\ell = X w_\ell \quad \text{such that} \quad \mathrm{cov}(t_\ell, Y) = \max
\tag{31}
\]

(where \(w_\ell\) is the vector of the coefficients of the lth linear transformation and cov is the covariance computed between \(t_\ell\) and each column of Y) under the constraints that

\[
t_\ell^{T} t_{\ell'} = 0 \quad \text{when} \quad \ell \neq \ell'
\tag{32}
\]

and

\[
t_\ell^{T} t_\ell = 1.
\tag{33}
\]

5.3. How Good Is the Prediction?
5.3.1. Fixed Effect Model
A common measure of the quality of prediction of observations within the sample is the Residual Estimated Sum of Squares (RESS), which is given by (4)

\[
\mathrm{RESS} = \| Y - \hat{Y} \|^{2},
\tag{34}
\]

where \(\|\cdot\|^{2}\) is the square of the norm of a matrix (i.e., the sum of squares of all the elements of this matrix). The smaller the value of RESS, the better the quality of prediction (4, 13).

5.3.2. Random Effect Model
The quality of prediction generalized to observations outside of the sample is measured in a way similar to RESS and is called the Predicted Residual Estimated Sum of Squares (PRESS). Formally, PRESS is obtained as (4):

\[
\mathrm{PRESS} = \| Y - \tilde{Y} \|^{2}.
\tag{35}
\]

The smaller PRESS is, the better the prediction.

5.3.3. How Many Latent Variables?
By contrast with the fixed effect model, the quality of prediction for a random model does not always increase with the number of latent variables used in the model. Typically, the quality first increases and then decreases. If the quality of the prediction decreases when the number of latent variables increases, this indicates that the model is overfitting the data (i.e., the information useful to fit the observations from the learning set is not useful to fit new observations). Therefore, for a random model, it is critical to determine the optimal number of latent variables to keep for building the model. A straightforward approach is to stop adding latent variables as soon as the PRESS decreases. A more elaborate approach (see, e.g., (48)) starts by computing, for the lth latent variable, the ratio

\[
Q_\ell^{2} = 1 - \frac{\mathrm{PRESS}_\ell}{\mathrm{RESS}_{\ell - 1}},
\tag{36}
\]

with \(\mathrm{PRESS}_\ell\) (resp. \(\mathrm{RESS}_{\ell-1}\)) being the value of PRESS (resp. RESS) for the lth (resp. l-1th) latent variable [where \(\mathrm{RESS}_0 = K(I - 1)\)]. A latent variable is kept if its value of \(Q_\ell^{2}\) is larger than some arbitrary value, generally set equal to \(1 - .95^{2} = .0975\) (an alternative set of values sets the threshold to .05 when I is at most 100 and to 0 when I > 100; see, e.g., (48, 58) for more details). Obviously, the choice of the threshold is important from a theoretical point of view but, from a practical point of view, the values indicated above seem satisfactory.
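PRESS is typically estimated by leave-one-out prediction: each observation is predicted from a model fitted without it. The sketch below (our illustration, reusing the hypothetical plsr() function given earlier and the RESS0 = K(I - 1) convention of the text) computes PRESS and the Q2 ratio:

```python
def press(X, Y, L):
    out = 0.0
    for i in range(X.shape[0]):                      # leave one observation out
        keep = np.arange(X.shape[0]) != i
        _, B = plsr(X[keep], Y[keep], L)
        out += ((Y[i] - X[i] @ B) ** 2).sum()        # Eq. 35
    return out

def q2(X, Y, l):
    if l == 1:
        ress_prev = Y.shape[1] * (Y.shape[0] - 1)    # RESS_0 = K(I - 1)
    else:
        _, B = plsr(X, Y, l - 1)
        ress_prev = ((Y - X @ B) ** 2).sum()         # Eq. 34
    return 1 - press(X, Y, l) / ress_prev            # keep l if Q2 > .0975
```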

5.3.4. Bootstrap Confidence Intervals for the Dependent Variables
When the number of latent variables of the model has been decided, confidence intervals for the predicted values can be derived using the bootstrap. Here, each bootstrapped sample provides a value of \(B_{PLS}\), which is used to estimate the values of the observations in the testing set. The distribution of the values of these observations is then used to estimate the sampling distribution and to derive bootstrap ratios and confidence intervals.

5.4. PLSR: Example
We will use the same example as for PLSC (see data in Tables 1 and 2). Here we used the physical measurements stored in matrix X to predict the sensory evaluation data stored in matrix Y. In order to facilitate the comparison between PLSC and PLSR, we decided to keep two latent variables for the analysis. However, had we used the Q2 criterion of Eq. 36, with values of 1.3027 for Dimension 1 and -0.2870 for Dimension 2, we should have kept only one latent variable for further analysis.
Table 6 gives the values of the latent variables (T), the reconstituted values of X (\(\hat{X}\)), and the predicted values of Y (\(\hat{Y}\)). The value of \(B_{PLS}\) computed with two latent variables is equal to

\[
B_{PLS} =
\begin{bmatrix}
0.0981 & 0.0558 & 0.0859 & 0.0533 & 0.1785 & 0.1951 & 0.1692 & 0.0025 & 0.2000\\
0.0877 & 0.3127 & 0.1713 & 0.1615 & 0.1204 & 0.0114 & 0.1813 & 0.1770 & 0.1766\\
0.0276 & 0.2337 & 0.0655 & 0.2135 & 0.3160 & 0.2097 & 0.3633 & 0.1650 & 0.3936\\
0.1253 & 0.1728 & 0.1463 & 0.0127 & 0.1199 & 0.1863 & 0.0877 & 0.0707 & 0.1182\\
0.0009 & 0.3373 & 0.1219 & 0.2675 & 0.3573 & 0.2072 & 0.4247 & 0.2239 & 0.4536
\end{bmatrix}
\tag{37}
\]
The values of W, which play the role of loadings for X, are equal to

\[
W =
\begin{bmatrix}
0.3660 & 0.4267\\
0.1801 & 0.5896\\
0.5844 & 0.0771\\
0.2715 & 0.6256\\
0.6468 & 0.2703
\end{bmatrix}.
\tag{38}
\]
A plot of the first two dimensions of W, given in Fig. 7, shows that X is structured around two main dimensions. The first dimension opposes the wines rich in alcohol and tannin (which are the red wines) to the wines that are sweet or acidic. The second dimension opposes sweet wines to acidic wines (which are also more expensive) (Figs. 8 and 9).
Table 6
PLSR: prediction of the sensory data (matrix Y) from the physical measurements (matrix X): matrices T, U, X-hat, and Y-hat.
Columns: Wine | T (Dim 1, Dim 2) | U (Dim 1, Dim 2) | X-hat (Price, Total acidity, Alcohol, Sugar, Tannin) | Y-hat (Fruity, Floral, Vegetal, Spicy, Woody, Sweet, Astringent, Acidic, Hedonic).

1  | -0.16837 0.16041 | 2.6776 0.97544 | 15.113 5.239 14.048 3.3113 471.17 | 6.3784 2.0725 1.7955 3.6348 4.3573 2.9321 4.0971 3.1042 2.8373
2  | -0.18798 0.22655 | 2.8907 -0.089524 | 14.509 4.8526 14.178 3.6701 517.12 | 6.4612 1.6826 1.6481 3.8384 4.5273 2.9283 4.337 2.9505 2.4263
3  | -0.17043 0.15673 | 3.1102 2.1179 | 15.205 5.2581 14.055 3.2759 472.39 | 6.3705 2.0824 1.8026 3.6365 4.3703 2.9199 4.108 3.1063 2.8139
4  | -0.12413 0.14737 | 1.4404 1.0106 | 14.482 5.3454 13.841 3.438 413.67 | 6.4048 2.3011 1.8384 3.4268 4.0358 3.0917 3.7384 3.2164 3.5122
5  | -0.0028577 0.07931 | -0.13304 -0.5399 | 13.226 5.8188 13.252 3.5632 245.11 | 6.4267 3.0822 2.0248 2.7972 3.14 3.4934 2.7119 3.5798 5.422
6  | 0.080038 -0.015175 | 2.6712 2.671 | 13.069 6.4119 12.82 3.319 113.62 | 6.3665 3.8455 2.2542 2.2783 2.5057 3.7148 1.9458 3.9108 6.8164
7  | 0.15284 -0.18654 | 2.4224 2.2504 | 14.224 7.4296 12.383 2.4971 31.847 | 6.1754 4.9385 2.6436 1.6593 1.9082 3.8112 1.1543 4.3538 8.205
8  | 0.11498 -0.09827 | 2.9223 2.4331 | 13.636 6.9051 12.61 2.9187 43.514 | 6.2735 4.3742 2.4429 1.9796 2.2185 3.7601 1.5647 4.1249 7.4844
9  | 0.18784 0.21492 | 1.952 0.98895 | 7.6991 5.1995 12.482 5.4279 62.365 | 6.8409 3.1503 1.8016 2.2553 1.8423 4.3943 1.4234 3.7333 7.975
10 | 0.18149 0.21809 | 1.8177 0.95013 | 7.7708 5.1769 12.513 5.4187 71.068 | 6.8391 3.1112 1.7927 2.2876 1.8891 4.3728 1.4767 3.7149 7.8756
11 | 0.21392 0.25088 | 2.1158 1.4184 | 6.6886 5.017 12.388 5.8026 43.283 | 6.9247 3.0763 1.7341 2.2134 1.6728 4.5368 1.2706 3.7243 8.2918
12 | 0.080776 0.11954 | 1.2197 1.3413 | 11.084 5.6554 12.902 4.2487 158.44 | 6.5782 3.2041 1.9678 2.524 2.5621 3.8671 2.121 3.6803 6.5773
13 | -0.26477 -0.085879 | 1.5647 -0.88629 | 20.508 6.5509 14.323 1.1469 503.24 | 5.8908 2.8881 2.2864 3.5804 4.9319 2.2795 4.5097 3.3327 1.8765
14 | -0.27335 -0.012467 | 2.4386 -0.84706 | 19.593 6.1319 14.409 1.6096 538.44 | 5.9966 2.5048 2.1273 3.7516 5.0267 2.3272 4.6744 3.1889 1.6141
15 | -0.22148 -0.14773 | 2.5658 0.88267 | 20.609 6.931 14.089 0.93334 430.34 | 5.8398 3.3465 2.4328 3.2863 4.595 2.3812 4.0928 3.5271 2.6278
16 | -0.15251 -0.089213 | 1.1964 1.2017 | 18.471 6.6538 13.817 1.6729 367.45 | 6.0044 3.3258 2.332 3.1078 4.1299 2.7176 3.6395 3.5663 3.5337
17 | -0.020577 -0.072286 | -0.3852 -0.65881 | 15.773 6.6575 13.235 2.4344 214.94 | 6.1706 3.7406 2.3412 2.5909 3.197 3.2555 2.6449 3.805 5.4426
18 | 0.16503 -0.15453 | 1.8587 -0.17362 | 13.53 7.2588 12.349 2.7767 35.61 | 6.2384 4.8313 2.5797 1.6678 1.8359 3.8946 1.1032 4.3234 8.3249
19 | 0.15938 -0.12373 | 2.0114 1.163 | 13.184 7.0815 12.394 2.9608 18.379 | 6.2806 4.6627 2.5123 1.7481 1.8903 3.9066 1.1882 4.2588 8.1846
20 | -0.034285 -0.071934 | 0.99958 1.5624 | 16.023 6.6453 13.297 2.3698 231.5 | 6.1567 3.6874 2.3357 2.6485 3.2949 3.202 2.7511 3.7765 5.2403
21 | 0.20205 0.12592 | 1.0834 0.53399 | 8.7377 5.7103 12.362 4.8856 15.13 | 6.7166 3.6292 1.9958 2.0319 1.7003 4.3514 1.1944 3.9154 8.3491
22 | 0.13903 0.095646 | 0.90872 0.44113 | 10.351 5.8333 12.626 4.3693 80.458 | 6.6025 3.5372 2.0386 2.2379 2.1358 4.0699 1.6397 3.8397 7.4783
23 | 0.13566 0.14176 | 0.67329 0.26414 | 9.7392 5.5716 12.67 4.6698 100.15 | 6.6711 3.3041 1.9394 2.3371 2.1809 4.1077 1.7276 3.7534 7.3432
24 | 0.077587 0.048002 | 0.95125 0.55919 | 12.19 6.055 12.871 3.7413 137.99 | 6.4628 3.5342 2.1189 2.4052 2.5521 3.7752 2.0495 3.797 6.6631
25 | -0.21821 0.043304 | 2.897 1.0065 | 17.752 5.8598 14.197 2.2626 491.22 | 6.1423 2.4453 2.0275 3.6255 4.659 2.6061 4.3241 3.2047 2.3216
26 | -0.26916 0.13515 | 2.5723 0.85536 | 17.355 5.3054 14.484 2.6448 583.5 | 6.2322 1.8146 1.8147 4.0069 5.0644 2.5074 4.8404 2.9431 1.4018
27 | -0.29345 0.034272 | 3.4006 1.2348 | 19.282 5.8542 14.529 1.8326 578.41 | 6.0485 2.2058 2.021 3.9215 5.1914 2.3 4.8922 3.0675 1.2318
28 | -0.25617 -0.20133 | 2.1121 -0.9005 | 22.038 7.2062 14.211 0.39528 453.76 | 5.7192 3.4724 2.5349 3.3314 4.8178 2.1853 4.2883 3.549 2.2172
29 | 0.011979 -0.21759 | -0.85732 -0.18988 | 17.295 7.4986 12.996 1.5947 126.59 | 5.9776 4.5577 2.6614 2.1872 2.8984 3.2224 2.1988 4.1213 6.1909
30 | 0.034508 -0.16317 | 1.5868 2.1363 | 16.08 7.2096 12.93 2.079 118.03 | 6.0867 4.3821 2.5534 2.1941 2.7626 3.3714 2.0981 4.0733 6.4213
31 | 0.17235 -0.29489 | 1.6713 1.187 | 15.448 8.0531 12.226 1.8476 92 | 6.0264 5.5299 2.8808 1.3781 1.7195 3.7677 0.85837 4.58 8.6928
32 | 0.098879 -0.52412 | 1.5407 1.0685 | 20.167 9.2864 12.41 -0.087465 81.643 | 5.5898 6.35 3.3433 1.26 2.1385 3.2243 1.1171 4.8257 8.0376
33 | 0.1672 0.072228 | 0.62606 2.4774 | 10.171 5.986 12.484 4.3461 38.717 | 6.5956 3.7551 2.0981 2.0776 1.9242 4.1547 1.391 3.9372 7.9361
34 | 0.15281 0.11474 | 1.6241 1.4483 | 9.816 5.7363 12.576 4.5679 70.404 | 6.6469 3.4977 2.0027 2.2159 2.0463 4.1453 1.5591 3.8348 7.6456
35 | 0.072566 0.066931 | 0.35548 1.7924 | 12.006 5.9449 12.906 3.8469 150.44 | 6.4872 3.4248 2.0769 2.461 2.5965 3.7765 2.1136 3.7542 6.5542
36 | 0.056807 0.0071035 | -0.76977 1.6816 | 13.174 6.2693 12.938 3.3586 149.05 | 6.3768 3.6517 2.1988 2.416 2.6815 3.6481 2.1548 3.8253 6.4334
Fig. 7. The X-loadings for Dimensions 1 and 2.

Fig. 8. The circle of correlations between the Y variables and the latent variables for Dimensions 1 and 2.
Fig. 9. PLSR. Plot of the latent variables (wines) for Dimensions 1 and 2.

6. Software

PLS methods necessitate sophisticated computations and therefore critically depend on the availability of software.
PLSC is used intensively in neuroimaging, and most of the analyses in this domain are performed with a special MATLAB toolbox (written by McIntosh, Chau, Lobaugh, and Chen). The programs and a tutorial are freely available from www.rotman-baycrest.on.ca:8080. These programs (which are the standard for neuroimaging) can be adapted for types of data other than neuroimaging (as long as the data are formatted in a compatible format). The computations reported in this paper were performed with MATLAB and can be downloaded from the home page of the first author (www.utdallas.edu/~herve).
For PLSR there are several available choices. The computations reported in this paper were performed with MATLAB and can be downloaded from the home page of the first author (www.utdallas.edu/~herve). A public domain set of MATLAB programs is also available from the home page of the N-Way project (www.models.kvl.dk/source/nwaytoolbox/) along with tutorials and examples. The statistics toolbox from MATLAB includes a function to perform PLSR. The public domain program R implements PLSR through the package pls (43). The general purpose statistical packages SAS, SPSS, and XLSTAT (which has, by far, the most extensive implementation of PLS methods) can also be used to perform PLSR. In chemistry and sensory evaluation, two main programs are used: the first one, called SIMCA-P, was developed originally by Wold (who also pioneered PLSR); the second one, called the UNSCRAMBLER, was first developed by Martens, who was another pioneer in the field. Finally, a commercial MATLAB toolbox has also been developed by EIGENRESEARCH.

7. Related Methods

A complete review of the connections between PLS and other statistical methods is clearly out of the scope of an introductory paper (see, however, (17, 48, 49, 26) for an overview), but some directions are worth mentioning. PLSC uses the SVD in order to analyze the information common to two or more tables, and this makes it closely related to several other SVD (or eigen-decomposition) techniques with similar goals. The closest technique is obviously inter-battery analysis (51), which uses the same SVD as PLSC, but on non-structured matrices. Canonical correlation analysis (also called simply canonical analysis, or canonical variate analysis; see (28, 33) for reviews) is also a related technique; it seeks latent variables with the largest correlation instead of PLSC's criterion of largest covariance. Under the assumption of normality, analytical statistical tests are available for canonical correlation analysis, but cross-validation procedures analogous to those of PLSC could also be used.
In addition, several multi-way techniques encompass data sets with two tables as a particular case. The oldest and most well-known technique is multiple factor analysis, which integrates different tables into a common PCA by normalizing each table with its first singular value (7, 25). A more recent set of techniques is the STATIS family, which uses a more sophisticated normalizing scheme whose goal is to extract the common part of the data (see (1, 8-11) for an introduction). Closely related techniques comprise common component analysis (36), which seeks a set of factors common to a set of data tables, and co-inertia analysis, which could be seen as a generalization of Tucker's (1958) (51) inter-battery analysis (see, e.g., (18, 22, 50, 54) for recent developments).
PLSR is strongly related to regression-like techniques that have been developed to cope with the multicollinearity problem. These include principal component regression, ridge regression, redundancy analysis (also known as PCA on instrumental variables; (44, 52, 53)), and continuum regression (45), which provides a general framework for these techniques.

8. Conclusion

Partial Least Squares (PLS) methods analyze data from multiple modalities collected on the same observations. We have reviewed two particular PLS methods: Partial Least Squares Correlation (PLSC) and Partial Least Squares Regression (PLSR). PLSC analyzes the shared information between two or more sets of variables. In contrast, PLSR is directional and predicts a set of dependent variables from a set of independent variables or predictors. The relationship between PLSC and PLSR is also explored in (17) and, recently, (27) proposed to integrate these two approaches into a new predictive approach called BRIDGE-PLS. In practice, the two techniques are likely to give similar conclusions because the criteria they optimize are quite similar.

References

1. Abdi H (2001) Linear algebra for neural networks. In: Smelser N, Baltes P (eds) International encyclopedia of the social and behavioral sciences. Elsevier, Oxford, UK
2. Abdi H (2007a) Eigen-decomposition: eigenvalues and eigenvectors. In: Salkind N (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, CA
3. Abdi H (2007b) Singular value decomposition (SVD) and generalized singular value decomposition (GSVD). In: Salkind N (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, CA
4. Abdi H (2010) Partial least square regression, projection on latent structure regression, PLS-regression. Wiley Interdiscipl Rev Comput Stat 2:97-106
5. Abdi H, Dunlop JP, Williams LJ (2009) How to compute reliability estimates and display confidence and tolerance intervals for pattern classifiers using the Bootstrap and 3-way multidimensional scaling (DISTATIS). NeuroImage 45:89-95
6. Abdi H, Edelman B, Valentin D, Dowling WJ (2009b) Experimental design and analysis for psychology. Oxford University Press, Oxford
7. Abdi H, Valentin D (2007a) Multiple factor analysis (MFA). In: Salkind N (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, CA
8. Abdi H, Valentin D (2007b) STATIS. In: Salkind N (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, CA
9. Abdi H, Valentin D, O'Toole AJ, Edelman B (2005) DISTATIS: the analysis of multiple distance matrices. In: Proceedings of the IEEE computer society: international conference on computer vision and pattern recognition, pp 42-47
10. Abdi H, Williams LJ (2010a) Barycentric discriminant analysis. In: Salkind N (ed) Encyclopedia of research design. Sage, Thousand Oaks, CA
11. Abdi H, Williams LJ (2010b) The jackknife. In: Salkind N (ed) Encyclopedia of research design. Sage, Thousand Oaks, CA
12. Abdi H, Williams LJ (2010c) Matrix algebra. In: Salkind N (ed) Encyclopedia of research design. Sage, Thousand Oaks, CA
13. Abdi H, Williams LJ (2010d) Principal components analysis. Wiley Interdiscipl Rev Comput Stat 2:433-459
14. Bookstein F (1982) The geometric meaning of soft modeling with some generalizations. In: Joreskog K, Wold H (eds) Systems under indirect observation, vol 2. North-Holland, Amsterdam
15. Bookstein FL (1994) Partial least squares: a dose-response model for measurement in the behavioral and brain sciences. Psycoloquy 5
16. Boulesteix AL, Strimmer K (2006) Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 8:32-44
17. Burnham A, Viveros R, MacGregor J (1996) Frameworks for latent variable multivariate regression. J Chemometr 10:31-45
18. Chessel D, Hanafi M (1996) Analyse de la co-inertie de k nuages de points. Revue de Statistique Appliquee 44:35-60
19. de Jong S (1993) SIMPLS: an alternative approach to partial least squares regression. Chemometr Intell Lab Syst 18:251-263
20. de Jong S, Phatak A (1997) Partial least squares regression. In: Proceedings of the second international workshop on recent advances in total least squares techniques and error-in-variables modeling. Society for Industrial and Applied Mathematics
21. de Leeuw J (2007) Derivatives of generalized eigen-systems with applications. Department of Statistics Papers, 128
22. Dray S, Chessel D, Thioulouse J (2003) Co-inertia analysis and the linking of ecological data tables. Ecology 84:3078-3089
23. Efron B, Tibshirani RJ (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat Sci 1:54-77
24. Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Chapman & Hall, New York
25. Escofier B, Pages J (1990) Multiple factor analysis. Comput Stat Data Anal 18:120-140
26. Esposito-Vinzi V, Chin WW, Henseler J, Wang H (eds) (2010) Handbook of partial least squares: concepts, methods and applications. Springer, New York
27. Gidskehaug L, Stodkilde-Jorgensen H, Martens M, Martens H (2004) Bridge-PLS regression: two-block bilinear regression without deflation. J Chemometr 18:208-215
28. Gittins R (1985) Canonical analysis. Springer, New York
29. Good P (2005) Permutation, parametric and bootstrap tests of hypotheses. Springer, New York
30. Greenacre M (1984) Theory and applications of correspondence analysis. Academic, London
31. Krishnan A, Williams LJ, McIntosh AR, Abdi H (2011) Partial least squares (PLS) methods for neuroimaging: a tutorial and review. NeuroImage 56:455-475
32. Lebart L, Piron M, Morineau A (2007) Statistiques exploratoires multidimensionelles. Dunod, Paris
33. Mardia KV, Kent JT, Bibby JM (1979) Multivariate analysis. Academic, London
34. Martens H, Martens M (2001) Multivariate analysis of quality: an introduction. Wiley, London
35. Martens H, Naes T (1989) Multivariate calibration. Wiley, London
36. Mazerolles G, Hanafi M, Dufour E, Bertrand D, Qannari ME (2006) Common components and specific weights analysis: a chemometric method for dealing with complexity of food products. Chemometr Intell Lab Syst 81:41-49
37. McCloskey DN, Ziliak J (2008) The cult of statistical significance: how the standard error costs us jobs, justice, and lives. University of Michigan Press, Michigan
38. McIntosh AR, Gonzalez-Lima F (1991) Structural modeling of functional neural pathways mapped with 2-deoxyglucose: effects of acoustic startle habituation on the auditory system. Brain Res 547:295-302
39. McIntosh AR, Lobaugh NJ (2004) Partial least squares analysis of neuroimaging data: applications and advances. NeuroImage 23:S250-S263
40. McIntosh AR, Chau W, Protzner A (2004) Spatiotemporal analysis of event-related fMRI data using partial least squares. NeuroImage 23:764-775
41. McIntosh AR, Bookstein F, Haxby J, Grady C (1996) Spatial pattern analysis of functional brain images using partial least squares. NeuroImage 3:143-157
42. McIntosh AR, Nyberg L, Bookstein FL, Tulving E (1997) Differential functional connectivity of prefrontal and medial temporal cortices during episodic memory retrieval. Hum Brain Mapp 5:323-327
43. Mevik B-H, Wehrens R (2007) The pls package: principal component and partial least squares regression in R. J Stat Software 18:1-24
44. Rao C (1964) The use and interpretation of principal component analysis in applied research. Sankhya 26:329-359
45. Stone M, Brooks RJ (1990) Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. J Roy Stat Soc B 52:237-269
46. Streissguth A, Bookstein F, Sampson P, Barr H (1993) Methods of latent variable modeling by partial least squares. In: The enduring effects of prenatal alcohol exposure on child development. University of Michigan Press
47. Takane Y (2002) Relationships among various kinds of eigenvalue and singular value decompositions. In: Yanai H, Okada A, Shigemasu K, Kano Y, Meulman J (eds) New developments in psychometrics. Springer, Tokyo
48. Tenenhaus M (1998) La regression PLS. Technip, Paris
49. Tenenhaus M, Tenenhaus A (in press) Regularized generalized canonical correlation analysis. Psychometrika
50. Thioulouse J, Simier M, Chessel D (2003) Simultaneous analysis of a sequence of paired ecological tables. Ecology 20:2197-2208
51. Tucker L (1958) An inter-battery method of factor analysis. Psychometrika 23:111-136
52. Tyler DE (1982) On the optimality of the simultaneous redundancy transformations. Psychometrika 47:77-86
53. van den Wollenberg A (1977) Redundancy analysis: an alternative to canonical correlation. Psychometrika 42:207-219
54. Williams LJ, Abdi H, French R, Orange JB (2010) A tutorial on Multi-Block Discriminant Correspondence Analysis (MUDICA): a new method for analyzing discourse data from clinical populations. J Speech Lang Hear Res 53:1372-1393
55. Wold H (1966) Estimation of principal component and related methods by iterative least squares. In: Krishnaiah PR (ed) Multivariate analysis. Academic Press, New York
56. Wold H (1973) Nonlinear Iterative Partial Least Squares (NIPALS) modeling: some current developments. In: Krishnaiah PR (ed) Multivariate analysis. Academic Press, New York
57. Wold H (1982) Soft modelling, the basic design and some extensions. In: Wold H, Joreskog K-G (eds) Systems under indirect observation: causality-structure-prediction, Part II. North-Holland, Amsterdam
58. Wold S (1995) PLS for multivariate linear modelling. In: van de Waterbeemd H (ed) QSAR: chemometric methods in molecular design, methods and principles in medicinal chemistry, vol 2. Verlag Chemie, Weinheim, Germany
59. Wold S, Sjostrom M, Eriksson L (2001) PLS-regression: a basic tool of chemometrics. Chemometr Intell Lab Syst 58:109-130
Chapter 24

Maximum Likelihood
Shuying Yang and Daniela De Angelis

Abstract
The maximum likelihood method is a popular statistical inferential procedure widely used in many areas to
obtain the estimates of the unknown parameters of a population of interest. This chapter gives a brief
description of the important concepts underlying the maximum likelihood method, the definition of the
key components, the basic theory of the method, and the properties of the resulting estimates. Confidence intervals and the likelihood ratio test are also introduced. Finally, a few examples of applications are given to
illustrate how to derive maximum likelihood estimates in practice. A list of references to relevant papers and
software for a further understanding of the method and its implementation is provided.

Key words: Likelihood, Maximum likelihood estimation, Censored data, Confidence interval,
Likelihood ratio test, Logistic regression, Linear regression, Dose response

1. Introduction

The maximum likelihood method is, like the least squares method,
a statistical inferential technique to obtain estimates of the
unknown parameters of a population using the information from
an observed sample. It was primarily introduced by RA Fisher
between 1912 and 1920, though the idea has been traced back to
the late nineteenth century (1, 2).
The principle of the maximum likelihood method is to find the value of the population parameter, the maximum likelihood estimate (MLE), that maximizes the probability of observing the given data. The maximum likelihood method, by motivation, is different from the least squares method, but the MLEs coincide with the least squares estimates (LSEs) under certain assumptions, e.g., that the residual errors follow a normal distribution.
While the maximum likelihood theory has as its basis the point estimation of unknown parameters in a population described by a
certain distribution (e.g., the examples in Subheading 2 and Example 1 in Subheading 3), its application extends far beyond the simple distributional forms to situations where the distribution of the random quantities or variables of interest (y) is determined by some other variables (x). This is the case in linear or nonlinear regression models and compartmental pharmacokinetic models such as those described in the previous chapters. In such situations,
mathematical models are utilized to describe the relationship
between y and x given some unknown parameters, which are
referred to as model parameters. The most frequent use of the
maximum likelihood method is to obtain the point estimates of
these model parameters.
The maximum likelihood method has been widely applied for statistical estimation in various models as well as for model selection (3-9).
The likelihood and log-likelihood functions are the foundation
of the maximum likelihood method. Definitions of the likelihood
and log-likelihood are given in the next sections. The idea of
likelihood is also at the basis of the Bayesian inferential approach,
which will be explained in the next chapter in more detail.
The aim of this chapter is to introduce the concept of the maximum likelihood method, to explain how maximum likelihood estimates are obtained, and to provide some examples of the application of the maximum likelihood method in the estimation of population and model parameters. The practical examples are provided with details so that readers will gain a thorough understanding of the maximum likelihood method and be able to apply it at the same time.

2. Important Concepts

2.1. Likelihood and Log-Likelihood Function
Suppose we have a sample y = (y1,. . .,yn), where each yi is independently drawn from a population characterized by a distribution f(y; theta). Here f(y; theta) denotes the probability density function (PDF) (for continuous y) or the probability distribution function (for discrete y) of the population, and theta are the unknown parameters. Depending on the distribution, theta can be a single scalar parameter or a parameter vector.

2.1.1. Likelihood Function
If theta were specified, the probability of observing yi, given the population parameter theta, could be written as f(yi; theta), which is the probability density function or the probability function evaluated at yi. The joint probability of observing (y1,. . .,yn) is then \(\prod_i f(y_i; \theta)\). This is the likelihood function. Throughout this chapter, we use interchangeably the notation L(theta) and L(theta; y) to describe the likelihood function, where y = (y1,. . .,yn). In practice, theta is unknown and it is our
objective to infer the value of theta from the observed data, in order to describe the population of interest.
The likelihood function appears to be defined the same as the probability or probability density function. However, the likelihood function is a function of theta. Specifically, a probability or probability density function is a function of the data given a particular set of population parameter values, whereas a likelihood is a function of the parameters assuming the observed data are fixed. It measures the relative possibility of different theta values representing the true population parameter value. For simplicity, L(theta) has been expressed as a function of a single parameter theta. In more general terms, however, the likelihood is a multidimensional function. For many commonly encountered problems, likelihood functions are unimodal; however, they can have multiple modes, particularly in complex models. In addition, a likelihood function may be analytically intractable. In that case, it may be difficult to express it in a simple mathematical form, and some form of simplification or linearization of the likelihood may be required (8-10).

2.1.2. Log-Likelihood Function
The log-likelihood is defined as the natural logarithm of the likelihood. It is denoted as

\[
LL(\theta) = LL(\theta; y) = \ln L(\theta; y) = \sum_i \ln f(y_i; \theta),
\]

where ln indicates the natural logarithm. The log-likelihood is a monotonic transformation of the likelihood function, so they both reach the maximum at the same value of theta. In addition, for the frequently used distributions, LL(theta) is a simpler function than L(theta) itself.
For the case where y follows a normal distribution,

\[
f(y; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}, \qquad
\ln f(y; \theta) = -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(y-\mu)^2}{2\sigma^2},
\]

where \(\theta = (\mu, \sigma^2)\), and \(\mu\) and \(\sigma^2\) represent the population mean and variance, respectively. The likelihood function based on data y1,. . .,yn is then

\[
L(\theta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i-\mu)^2}{2\sigma^2}},
\]

and the log-likelihood is

\[
LL(\theta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - \mu)^2.
\]

For illustration purposes, Fig. 1 shows the likelihood (left panel) and log-likelihood (right panel) functions based on a set of data (n = 1,000) randomly drawn from a standard normal distribution.

Fig. 1. The likelihood (left) and log-likelihood (right) functions based on n = 1,000 samples randomly selected from a standard normal distribution [mu denotes the mean, sigma2 indicates \(\sigma^2\)].
Suppose y is a discrete variable taking two values, for example, success (1) or failure (0), or presence of skin lesions (1) or no skin lesions (0). In statistical terms, y is known to follow a Bernoulli distribution with probability P(y = 1) = p and P(y = 0) = 1 - p, where 0 <= p <= 1. The probability function of the Bernoulli random variable is \(f(y; \theta) = p^y (1-p)^{1-y}\). Note that here theta = p.
Let y1,. . .,yn be n observations from a Bernoulli distribution, where yi = 1 or 0, i = 1,2,. . .,n. Of the n observations, k is the number of 1s and n - k is the number of 0s. The likelihood corresponding to these data is:

\[
L(\theta; y) = p^{y_1}(1-p)^{1-y_1}\, p^{y_2}(1-p)^{1-y_2} \cdots p^{y_n}(1-p)^{1-y_n}
= p^{\sum_i y_i}(1-p)^{\sum_i (1-y_i)}
= p^{k}(1-p)^{n-k},
\]

and the log-likelihood is:

\[
LL(\theta; y) = k \ln p + (n-k)\ln(1-p).
\]

In Fig. 2, the top panel shows the likelihood and log-likelihood functions of this example for n = 10 and k = 2.
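This example is easy to verify numerically. A small sketch (ours, not part of the chapter) evaluates the Bernoulli log-likelihood on a grid and recovers the MLE p = k/n = 0.2:

```python
import numpy as np

n, k = 10, 2
p = np.linspace(0.001, 0.999, 999)
ll = k * np.log(p) + (n - k) * np.log(1 - p)   # LL(p) = k ln p + (n-k) ln(1-p)
print(p[np.argmax(ll)])                        # ~0.2, i.e., k/n
```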

2.1.3. Likelihood Function of Censored Data
There are cases where a subset yk+1,. . .,yn of the data y1,. . .,yn may not be precisely observed, but the values are known to be either below or above a certain threshold. For example, many laboratory-based measurements are censored due to the assay accuracy limit, usually referred to as the lower limit of quantification (LLQ). This happens when the bioanalysis system cannot accurately distinguish the level of the component of interest from the system noise. For such cases, the exact value for yi, i = k + 1,. . .,n, is not available. However, it is known that the value is equal to or below the LLQ. Such data are referred to as left censored data.
In other cases, the data are ascertained to be above a certain threshold, with no specific value assigned. For example, in animal experiments, the animals are examined every day to monitor the appearance of particular features, e.g., skin lesions. The time to the appearance of lesions is then recorded and analyzed. For animals with no skin lesions by the end of the study (2 weeks for example),
the time to lesions will be recorded as > 2 weeks. These data are referred to as right censored data. As it is only known that an animal has no lesions at 2 weeks after the treatment, whether and when the animal will have skin lesions is not known.

Fig. 2. Likelihood and log-likelihood functions with respect to p and alpha (the logit of p). Note: the red solid square points mark the maximum of L(p) and LL(p); the values above the x-axis indicate the MLE of p (0.2) or of alpha (-1.386).

However, suppose examinations were not carried out between day 7 and day 11, and at day 11 an animal was found to have lesions. Then the time to lesions will be between 7 and 11 days, although the exact time of lesion appearance is not known. In this case, the time to lesion for this animal will be interval censored, i.e., it is longer than 7 days but shorter than 11 days.
When such cases arise in practice, ignoring the characteristics of the data in the analysis may cause biases (see ref. 11 and the references cited therein), so appropriate adjustments must be applied.
Let y1,. . .,yk represent the observed data, and yk+1,. . .,yn those not precisely observed but known to be left censored (assumed to lie within the interval (-infinity, LLQ] or [0, LLQ] for laboratory measurements that must be greater than or equal to 0). Assume that
the observed or unobserved y follow the same normal distribution N(mu, sigma^2). Then the likelihood for the precisely observed data y1,. . .,yk is

\[
L(\theta) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i-\mu)^2}{2\sigma^2}},
\]

but the contribution of a censored yi (i = k + 1,. . .,n) needs to be written as

\[
L_i(\theta) = \int_{-\infty}^{LLQ} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy
\]

(replace \(-\infty\) with 0 if yi must be greater than or equal to 0), which is the cumulative probability up to LLQ of the normal distribution. Note, \(\theta = (\mu, \sigma^2)\).
The full likelihood function of all data y1,. . .,yk, yk+1,. . .,yn is, therefore:

\[
L(\theta; y) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i-\mu)^2}{2\sigma^2}}
\prod_{i=k+1}^{n} \int_{-\infty}^{LLQ} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}\, dy.
\tag{1}
\]

The likelihood functions for interval censored and right censored data can be written in exactly the same way: instead of integrating over \((-\infty, LLQ]\) as for left censoring, integrate over \([LOW, +\infty)\) for right censored data and over \([LOW, UPP]\) for interval censored data, where LOW and UPP are the thresholds for the lower and upper limit of the observation, respectively.
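Coding this likelihood only requires the normal log-density and log-CDF. The sketch below is a minimal illustration for left censored data (the function and variable names are ours, not from the chapter):

```python
import numpy as np
from scipy import stats

def loglik_left_censored(theta, y_obs, n_cens, llq):
    """Log of the full likelihood of Eq. 1: observed part plus n_cens left
    censored contributions, each equal to the normal CDF up to the LLQ."""
    mu, sigma = theta
    ll = stats.norm.logpdf(y_obs, loc=mu, scale=sigma).sum()
    ll += n_cens * stats.norm.logcdf(llq, loc=mu, scale=sigma)
    return ll
```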

2.2. Maximum Likelihood Estimation
The identification of the maximum likelihood estimate (MLE) is achieved by searching the parameter space (one- or multidimensional) to find the parameter values that give the maximum of the likelihood function. We show below how this is carried out in the case where theta is a scalar parameter.
Maximization process: From mathematical theory, the maximum of any function is achieved at a point where the first derivative (if it exists) is equal to zero. As defined above, \(LL(\theta) = \sum_i \ln f(y_i; \theta)\), so the MLE of theta satisfies the following equation:

\[
LL'(\theta) = \frac{d\,LL(\theta)}{d\theta} = \sum_i \frac{d f(y_i;\theta)/d\theta}{f(y_i;\theta)} = 0,
\]

where \(d\,LL(\theta)/d\theta\) (or \(LL'(\theta)\)) and \(d f(y_i;\theta)/d\theta\) indicate the first derivatives of \(LL(\theta)\) and \(f(y_i;\theta)\) with respect to (w.r.t.) the parameter theta. Let \(\hat\theta\) be the solution of the above equation.
It is known that the first derivative is zero at any minimum point as well. In order to get truly the maximum, the second derivative (if it exists) evaluated at \(\hat\theta\) must be negative, i.e.,

\[
\frac{d^2 LL(\hat\theta)}{d\theta^2} = \sum_i \left( \frac{d^2 f(y_i;\hat\theta)/d\theta^2}{f(y_i;\hat\theta)} - \frac{\bigl(d f(y_i;\hat\theta)/d\theta\bigr)^2}{f^2(y_i;\hat\theta)} \right) < 0,
\]
with \(d^2 LL(\hat\theta)/d\theta^2\) and \(d^2 f(y_i;\hat\theta)/d\theta^2\) denoting the second derivatives of \(LL(\theta)\) and \(f(y_i;\theta)\) w.r.t. theta, evaluated at \(\hat\theta\).
For cases where more than one parameter is involved in the log-likelihood function, the partial derivatives of \(LL(\theta)\) with respect to each of the parameters will be used (see Example 3.3). It is not difficult to imagine that, if the log-likelihood has a very complicated form, the equations involving the derivatives or the partial derivatives may not be easily solved analytically. Fortunately, many algorithms have been developed to solve these equations, mostly iteratively, that is, by repeating a sequence of calculations until the resulting values from subsequent iterations are similar. The Newton-Raphson optimization algorithm is the one most commonly used.
Starting from a plausible arbitrary point \(\theta_0\) in the parameter space, the iterative procedure searches through the parameter space based on certain rules, which are proved mathematically to ensure that the process will identify the maximum of the log-likelihood function. For example, with Newton's method, at iteration n + 1,

\[
\theta_{n+1} = \theta_n - \frac{LL'(\theta_n)}{LL''(\theta_n)},
\]

where \(\theta_n\) is the parameter value at the nth step, and \(LL'(\theta_n)\) and \(LL''(\theta_n)\) are the first and second derivatives of the log-likelihood w.r.t. theta, evaluated at \(\theta = \theta_n\). The algorithm stops when the process converges, which means when \(\theta_{n+1}\) is close enough (meets a predefined criterion, usually a small number, say 10E-8) to the value \(\theta_n\) at the previous step (12).
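For a scalar parameter, Newton's iteration is a few lines of code. The sketch below (our illustration) applies it to an exponential sample, for which the MLE is known to be the sample mean (this is derived in Subheading 3.1), so the answer can be checked:

```python
import numpy as np

def newton_mle(ll1, ll2, theta, tol=1e-8, max_iter=100):
    """theta_{n+1} = theta_n - LL'(theta_n) / LL''(theta_n)."""
    for _ in range(max_iter):
        step = ll1(theta) / ll2(theta)
        theta -= step
        if abs(step) < tol:
            return theta
    raise RuntimeError("no convergence; try another starting point")

y = np.array([1.2, 0.4, 2.2, 0.7, 1.9])        # a toy exponential sample
n, s = len(y), y.sum()
theta_hat = newton_mle(lambda t: -n / t + s / t**2,
                       lambda t: n / t**2 - 2 * s / t**3, 1.0)
print(theta_hat, y.mean())                     # both equal 1.28
```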
Difficulties: The maximization process can encounter many problems. For example, when the log-likelihood function is flat with respect to some parameters, the algorithms may have difficulty finding the maximum point. This could indicate a lack of information in the data to identify specific parameter values, and more data may be needed. There are also times where the algorithm seems to find a maximum point, but choosing a different initial point \(\theta_0\) may result in a different maximum. This happens when the log-likelihood function is multimodal. In practice, the general suggestion for such situations is to repeat the algorithm from several diverse starting points and/or modify the search steps (tolerance criteria) to ensure that similar results are achieved.

2.3. Properties of MLE
The MLE has several important properties making the maximum likelihood approach an attractive method for statistical inference. MLEs are, for example, asymptotically normal, consistent, efficient, and parameterization invariant (13). What follows focuses on the properties most used in practice.

2.3.1. Asymptotically, the MLE Follows a Normal or Multinormal Distribution
When n (the number of observations) tends to infinity, the MLE of theta, \(\hat\theta_{MLE}\), follows a normal distribution with mean the true parameter theta and variance (or variance-covariance matrix if theta is a vector) the inverse of the Fisher information:

\[
\hat\theta_{MLE} \sim N\bigl(\theta,\, I(\theta)^{-1}\bigr), \qquad
I(\theta) = -E_\theta\!\left[\frac{d^2 LL(\theta)}{d\theta^2}\right].
\]

I(theta) is called the Fisher information matrix. Inferences can be made on the basis of this asymptotic distribution.
It should be noted that, given the asymptotic nature of these properties, they are not guaranteed for small samples. For example, MLEs obtained from small samples can be biased.

2.3.2. Parameterization Invariance
This property states that if \(\hat\theta_{MLE}\) is the MLE for the parameter theta, then the MLE of any function of theta, say g(theta), is \(g(\hat\theta_{MLE})\), that is, the value of the function g evaluated at \(\theta = \hat\theta_{MLE}\).
This parameterization invariance property of MLEs allows flexibility in choosing model parameterizations, which is important in cases where a parameter transformation can make the maximization step simpler and easier. Example 3.2 illustrates how parameters can be transformed in practice in order to obtain appropriate estimates of the unknown parameters of interest.

2.4. Confidence Interval from MLE

It is important to understand and characterize the uncertainty of the
MLE when only one experiment is done and the MLE of the population
or model parameters is obtained from the data generated by
that particular experiment. This is usually achieved by specifying
a confidence interval for the unknown $\theta$ around the MLE.

From the previous section, $\hat{\theta}_{MLE} \sim N(\theta, I(\theta)^{-1})$. In fact, $I(\theta)$
is not known, as $\theta$ is unknown. In practice, $I(\theta)$ is usually approximated
by plugging in the estimated value $\hat{\theta}_{MLE}$, then obtaining

$I(\hat{\theta}_{MLE}) = -\left.\frac{d^2 LL(\theta)}{d\theta^2}\right|_{\theta = \hat{\theta}_{MLE}}$.

This is used to construct a confidence interval around the MLE.
The approximate $100(1 - 2\alpha)\%$ confidence interval for the
unknown $\theta$ around the corresponding MLE is:

$\hat{\theta}_{MLE} \pm z_\alpha \left[I(\hat{\theta}_{MLE})\right]^{-1/2}$

where $z_\alpha$ is the critical value of the standard normal distribution
corresponding to the chosen $\alpha$ level; for example, $z_\alpha = 1.96$ for
$\alpha = 0.025$, and 1.64 for $\alpha = 0.05$.

As illustrated in Subheading 3.2, users should be cautious when
using this method to calculate the confidence interval for a parameter of interest.
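As an illustration, a minimal R sketch of this Wald-type interval (hypothetical data, continuing the exponential example above; not from the original text):

# Approximate 95% confidence interval from the observed information.
y <- c(1.2, 0.7, 3.1, 2.4, 0.9)
theta_hat <- mean(y)                                           # the MLE
info <- -(length(y) / theta_hat^2 - 2 * sum(y) / theta_hat^3)  # I(theta_hat)
se <- sqrt(1 / info)                                           # standard error
theta_hat + c(-1.96, 1.96) * se                                # z = 1.96, alpha = 0.025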

2.5. Likelihood Ratio Test

Often several different models can be found to describe a population
from which the data are selected. The problem is then to
identify which model is more appropriate to explain the data and
represent the population. It is possible to use likelihood theory to
test the appropriateness of a given model in comparison to a model
from the same family but with a different number of parameters
(nested models). For example, assume $L_A$ is the maximum of the
likelihood function corresponding to a particular model (model A)
and $L_B$ is the maximum for the same model but with a smaller number
of parameters (model B). Let k be the difference in the number of
parameters between the two models. Then the likelihood ratio R is:

$R = -2\ln\left(\frac{L_B}{L_A}\right) = -2(\ln L_B - \ln L_A) = 2\ln L_A - 2\ln L_B$

R has approximately a Chi-squared distribution with k degrees
of freedom if model B is the true model. The calculated value of R
should then be consistent with such a Chi-squared distribution. A large
value of R (with a small p value) constitutes evidence against the
model with fewer parameters, B, in favor of model A.

Note that many statistical software packages minimize $-2$ times
the log-likelihood instead of maximizing the log-likelihood.
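A minimal R sketch of this test (the data are simulated here purely for illustration; the chapter's actual dose-response data are analyzed in Subheading 3.4):

# Likelihood ratio test between two nested logistic models.
set.seed(1)
d <- data.frame(dose = rep(c(0, 10, 100, 300), each = 10))
d$y <- rbinom(40, 1, plogis(-5 + 1.3 * log(d$dose + 1)))    # simulated outcomes
fitB <- glm(y ~ 1, family = binomial, data = d)             # fewer parameters
fitA <- glm(y ~ log(dose + 1), family = binomial, data = d) # full model
R <- deviance(fitB) - deviance(fitA)    # equals 2 ln L_A - 2 ln L_B
pchisq(R, df = 1, lower.tail = FALSE)   # p value on k = 1 degree of freedom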

2.6. Further Reading and Statistical Software

Readers who are interested in the general theory and application of
the maximum likelihood method are referred to refs. 13-18.

A number of statistical packages are available for obtaining
maximum likelihood estimates of model parameters in both
linear and nonlinear modelling. Packages commonly used in
academia and the pharmaceutical industry include SAS (19), Stata
(20), and R (21). In addition, a few other packages are dedicated
to the analysis of nonlinear mixed-effects models, like NONMEM
(10) and Monolix (22). R and Monolix are freely downloadable
online.

3. Examples

3.1. Maximum Likelihood Estimation of the Exponential Distribution

Suppose the random variable y represents the time to an event, e.g., a
certain toxicity in an experiment, and it is assumed that y follows
an exponential distribution with density function $f(y; \theta) = \frac{1}{\theta}e^{-y/\theta}$,
where $\theta > 0$.

Let $y_1, \ldots, y_n$ be a random sample from this exponential distribution;
then the likelihood and log-likelihood functions are given by:

$L(\theta) = \frac{1}{\theta}e^{-y_1/\theta} \cdot \frac{1}{\theta}e^{-y_2/\theta} \cdots \frac{1}{\theta}e^{-y_n/\theta} = \frac{1}{\theta^n}e^{-\sum_i y_i/\theta}$, and

$LL(\theta) = -n\ln\theta - \frac{1}{\theta}\sum_i y_i$.

The solution of the equation $\frac{dLL(\theta)}{d\theta} = -\frac{n}{\theta} + \frac{\sum_i y_i}{\theta^2} = 0$ is
$\hat{\theta} = \frac{1}{n}\sum_i y_i$, which is the mean of the sample $y_1, \ldots, y_n$, denoted by $\bar{y}$.

It is noted that $\frac{d^2LL(\theta)}{d\theta^2} = \frac{n}{\theta^2} - \frac{2\sum_i y_i}{\theta^3}$, which evaluated
at $\bar{y}$ gives $\frac{n}{\bar{y}^2} - \frac{2n\bar{y}}{\bar{y}^3} = -\frac{n}{\bar{y}^2} < 0$. Therefore $LL(\theta)$ has a maximum at $\bar{y}$,
and $\hat{\theta}_{MLE} = \frac{1}{n}\sum_i y_i = \bar{y}$.

Note: $\frac{dLL(\theta)}{d\theta}$ represents the derivative of $LL(\theta)$ with respect
to $\theta$, and $\frac{d^2LL(\theta)}{d\theta^2}$ is the derivative of $\frac{dLL(\theta)}{d\theta}$ with respect to
$\theta$, i.e., the second-order derivative of $LL(\theta)$ with respect to $\theta$.

An alternative way of proving that $LL(\theta)$ has a maximum at $\bar{y}$ is
through the analysis of the behavior of $\frac{dLL(\theta)}{d\theta}$;
rewrite $\frac{dLL(\theta)}{d\theta}$ as:

$\frac{dLL(\theta)}{d\theta} = -\frac{n}{\theta} + \frac{n\bar{y}}{\theta^2} = \frac{-n\theta + n\bar{y}}{\theta^2} = \frac{n(\bar{y} - \theta)}{\theta^2}$

So if $\theta < \bar{y}$, then $\frac{dLL(\theta)}{d\theta} > 0$, indicating that $LL(\theta)$ is increasing; and if
$\theta > \bar{y}$, then $\frac{dLL(\theta)}{d\theta} < 0$, indicating that $LL(\theta)$ is decreasing. Therefore
$LL(\theta)$ has its maximum at $\bar{y}$.
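A quick numerical check of this result in R (hypothetical data; not from the original text):

# The exponential MLE equals the sample mean; optimize() confirms it.
y <- c(1.2, 0.7, 3.1, 2.4, 0.9)
ll <- function(theta) -length(y) * log(theta) - sum(y) / theta
optimize(ll, interval = c(0.01, 100), maximum = TRUE)$maximum  # ~1.66
mean(y)                                                        # 1.66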

3.2. Probability of Toxicity

In an animal toxicity study, ten cynomolgus monkeys were administered
100 mg of compound X intravenously every week for
8 weeks. During the study, two out of the ten monkeys developed
skin lesions. What is the probability of having skin lesions in the
entire population?

Solution:
Let y = 1 indicate the event of having skin lesions, and y = 0
indicate no skin lesions, where y follows a Bernoulli distribution
with probability p of having skin lesions.

The data obtained from the monkey study were:
$(y_1, \ldots, y_{10}) = (0,0,0,0,1,0,0,0,0,1)$.

Then the likelihood and log-likelihood of these data can be
written as:

$L(p) = p^k(1-p)^{n-k}$ and $LL(p) = k\ln p + (n-k)\ln(1-p)$,

where p is the unknown parameter, k = 2 and n = 10 (see Fig. 2),
so $LL(p) = 2\ln p + 8\ln(1-p)$.

Solving the equation $\frac{dLL(p)}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0$ gives $\hat{p} = \frac{k}{n} = \frac{2}{10} = 0.2$.

To confirm that this is the maximum of the log-likelihood
function, the second derivative of LL(p) is calculated:

$\frac{d^2LL(p)}{dp^2} = -\frac{k}{p^2} - \frac{n-k}{(1-p)^2}$; substituting $\hat{p} = 0.2$ gives $\frac{d^2LL(\hat{p})}{dp^2} < 0$.

Therefore, $\hat{p}$ is the maximum likelihood estimate of the population
parameter p.
As illustrated in Subheading 2.4 above, the confidence interval
of p around $\hat{p}$ can be calculated using $I(\hat{p}) = -\frac{d^2LL(\hat{p})}{dp^2} = \frac{n^3}{k(n-k)} = 62.5$.
The standard error of $\hat{p}$ is then 0.126; therefore, the 90%
confidence interval of parameter p around $\hat{p}$ would be
(-0.01, 0.41), assuming p has an asymptotically normal distribution.

However, it is known that a probability is between 0 and 1. In
order to maintain this assumption throughout the calculation, a
logit function is used to transform the probability into a variable
that can take values between $-\infty$ and $+\infty$, where $\text{logit}(p) = \ln\frac{p}{1-p}$.

# y = (y1,...,yn), n = 10; two of the yi's are 1, eight of them are 0.

> res1 <- glm(y ~ 1, family = binomial, data = d)

> summary(res1)
Call:
glm(formula = y ~ 1, family = binomial)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -0.668  -0.668  -0.668  -0.668   1.794
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.3863     0.7906  -1.754   0.0795 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 10.008 on 9 degrees of freedom
Residual deviance: 10.008 on 9 degrees of freedom
AIC: 12.008
Number of Fisher Scoring iterations: 4

Fig. 3. R-code and outputs to obtain MLE of p.

Using this transformation, let $a = \text{logit}(p)$; then $p = \frac{e^a}{1+e^a}$.
Replacing p with a in the above functions, we then have:

$LL(p) = k\ln p + (n-k)\ln(1-p) = ka - n\ln(1+e^a) = LL(a)$

Solving the equation $\frac{dLL(a)}{da} = k - n\frac{e^a}{1+e^a} = 0$, we have $\hat{a} = \ln\frac{k}{n-k}$.

As $I(\hat{a}) = -\frac{d^2LL(\hat{a})}{da^2} = \frac{k(n-k)}{n} = 1.6$, the standard error of $\hat{a}$ is
approximately 0.79. The 90% confidence interval of a is then
(-2.68, -0.09).

According to the parameterization invariance property of the
MLE, the MLE of parameter p can be calculated by back-transforming
the logit function, thus $\hat{p} = \frac{e^{\hat{a}}}{1+e^{\hat{a}}} = \frac{k}{n} = 0.2$, and its 90% confidence
interval is (0.06, 0.48). Note: $\frac{e^{-2.68}}{1+e^{-2.68}} = 0.06$, and
$\frac{e^{-0.09}}{1+e^{-0.09}} = 0.48$. Figure 3 gives the R code to obtain the
MLEs of a and p.
Readers are referred to the R manual (21) for details on
how to set up models in R and on the interpretation of the parameter
estimates. Specifically, for this example, the logit of the probability p,
i.e., $a = \text{logit}(p) = \ln\frac{p}{1-p}$, is estimated, and denoted as (Intercept) in
the R output.

Therefore $\hat{a} = -1.3863$ and its standard error is 0.7906.
These values are similar to those calculated manually above. The
90% confidence interval of parameter a around its MLE is
(-2.6829, -0.0897). Back-transforming the logit function, we have
$\hat{p} = \frac{e^{\hat{a}}}{1+e^{\hat{a}}} = 0.2$, and the 90% confidence interval of p is (0.064,
0.4776).
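The back-transformation can also be done directly in R; a minimal sketch (reusing the data above; plogis() is R's inverse-logit function):

y <- c(0,0,0,0,1,0,0,0,0,1)
res1 <- glm(y ~ 1, family = binomial)
est <- coef(summary(res1))["(Intercept)", c("Estimate", "Std. Error")]
ci_a <- est[1] + c(-1.64, 1.64) * est[2]  # 90% Wald CI for a = logit(p)
plogis(ci_a)                              # back-transformed CI for p: ~(0.06, 0.48)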

3.3. Linear Regression

Assume that $y_i = b_0 + bx_i + e_i$, where i = 1, 2, ..., n, the $y_i$'s are
independently drawn from a population, the $x_i$'s are independent
variables, and $e_i \sim N(0, \sigma^2)$ is the residual error.

The likelihood of $y = (y_1, \ldots, y_n)$ is $L(\theta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y_i - b_0 - bx_i)^2}$, where $\theta = (b_0, b, \sigma^2)$,

$LL(\theta) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i - b_0 - bx_i)^2$

or

$-2LL(\theta) = n\left[\ln(2\pi) + \ln\sigma^2\right] + \frac{1}{\sigma^2}\sum_i (y_i - b_0 - bx_i)^2$

The MLEs of $b_0$, $b$, and $\sigma^2$ are obtained by maximizing
$LL(\theta)$ or, equivalently, minimizing $-2LL(\theta)$. The minimization of $-2LL(\theta)$ is
illustrated below. It is achieved by solving the
following equations simultaneously:

$\frac{\partial(-2LL(\theta))}{\partial b_0} = -\frac{2}{\sigma^2}\sum_i (y_i - b_0 - bx_i) = 0$  (1)

$\frac{\partial(-2LL(\theta))}{\partial b} = -\frac{2}{\sigma^2}\sum_i x_i(y_i - b_0 - bx_i) = 0$  (2)

$\frac{\partial(-2LL(\theta))}{\partial \sigma^2} = \frac{n}{\sigma^2} - \frac{1}{\sigma^4}\sum_i (y_i - b_0 - bx_i)^2 = 0$  (3)

where $\frac{\partial(-2LL(\theta))}{\partial b_0}$, $\frac{\partial(-2LL(\theta))}{\partial b}$, and $\frac{\partial(-2LL(\theta))}{\partial \sigma^2}$ represent the first-order
partial derivatives of $-2LL(\theta)$ with respect to $b_0$, $b$, and $\sigma^2$, respectively.

Solving the above three equations, we have:

$\hat{b}_{MLE} = \frac{\sum_i x_i y_i - \frac{1}{n}\sum_i x_i \sum_i y_i}{\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2}$, $\hat{b}_{0,MLE} = \frac{1}{n}\sum_i (y_i - \hat{b}x_i)$, and

$\hat{\sigma}^2_{MLE} = \frac{\sum_i (y_i - \hat{b}_0 - \hat{b}x_i)^2}{n}$

It is noted that the MLEs of $b$ and $b_0$ are equivalent to their
corresponding least squares estimates.
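A minimal R check of this equivalence (simulated data; not from the original text):

# The closed-form MLEs above match lm()'s least-squares estimates.
set.seed(2)
n <- 20; x <- 1:n
y <- 3 + 0.5 * x + rnorm(n)
b_hat  <- (sum(x * y) - sum(x) * sum(y) / n) / (sum(x^2) - sum(x)^2 / n)
b0_hat <- mean(y - b_hat * x)
c(b0_hat, b_hat)
coef(lm(y ~ x))                      # same values
sum((y - b0_hat - b_hat * x)^2) / n  # MLE of sigma^2 (divides by n, not n - 2)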

3.4. Dose Response Model

In early drug development, compound X was tested in monkeys
to assess its toxicological effects. Three doses (10, 100, 300 mg) of
compound X and placebo were given to 40 monkeys, 10 monkeys
in each dose group, every week for 8 weeks. During the 8 weeks of
the study, the appearance of skin lesions was observed and recorded.
The question is whether the probability of skin lesion is associated
with the dosage given. The number of monkeys with skin lesions in each dose
group was 0, 1, 5, and 9 for the placebo, 10 mg, 100 mg, and 300 mg
groups, respectively.

Solution:
Let y = 1 indicate the presence of skin lesion, and y = 0 indicate no
skin lesion. The question can then be rephrased by asking whether

$p = P(y = 1)$ is related to the dosage. Logistic regression is
a technique to analyze this type of data, where a dichotomous
dependent variable is specified in terms of several independent
variables.

The basis of logistic regression is to model the logit transformation
of the probability p as a linear function of the independent
variables. For this example, p is the probability of skin lesion, and
the independent variable is the dose.

Assume D represents the dose administered. The logistic regression
can be written as $\text{logit}(p) = \ln\frac{p}{1-p} = a + b\ln(D + 1)$, where
a and b are the parameters of the model.
For the i-th animal, the corresponding probability of having
skin lesions is described as $p_i$, and $\text{logit}(p_i) = a + b\ln(D_i + 1)$.
Then the likelihood of observing the data as described above is:

$L(\theta) = \prod_i p_i^{y_i}(1-p_i)^{1-y_i}$ and $LL(\theta) = \sum_i \left[y_i \ln p_i + (1-y_i)\ln(1-p_i)\right]$.

Given that $p_i$ is a function of the unknown parameters a and b,
$LL(\theta)$ is a function of a and b, with $\theta = (a, b)$.

The partial derivatives of $LL(\theta)$ with respect to a and b are:

$\frac{\partial LL(\theta)}{\partial a} = \sum_i \left[\frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\right]$ and $\frac{\partial LL(\theta)}{\partial b} = \sum_i \left[\frac{y_i}{p_i} - \frac{1-y_i}{1-p_i}\right]\ln(D_i + 1)$

The MLEs of a and b can be obtained by solving the equations
$\frac{\partial LL(\theta)}{\partial a} = 0$ and $\frac{\partial LL(\theta)}{\partial b} = 0$. Although these equations do not look
complicated, solving them analytically is not easy and a numerical
solution is required. In the following (Fig. 4), the results using glm
in R (21) are presented. The MLEs of a (denoted as (Intercept)) and b
(denoted as log(dose + 1)) are -5.7448 and 1.3118, with standard
errors of 2.0339 and 0.4324, respectively.
To test whether increasing the dose statistically significantly increases the
probability of skin lesions, we can use the likelihood
ratio test described in Subheading 2.5. In the R output (Fig. 4),
the Null deviance and Residual deviance are given, where the Null
deviance is the deviance of a null model in which only the intercept is
fitted, and the Residual deviance is the deviance of the specified
model. Note: the deviance in this case is defined as minus
twice the maximized log-likelihood evaluated at the MLE of $\theta$ (i.e.,
$-2LL(\hat{\theta})$). The likelihood ratio is R = 25.4, with one degree of
freedom. On the basis of a Chi-squared distribution with one degree
of freedom, this corresponds to a p-value of 4.62e-07, indicative of
evidence against the null model. The conclusion is that the probability
of skin lesions is statistically significantly related to the dosage
and increases with increasing dose. Figure 5 depicts the model-predicted
probability of having skin lesions, and its 95% confidence
interval, versus the amount of drug administered.
For any given dose of the compound, the probability of skin
lesions can be calculated by back-transforming the logit function, i.e.,

# y = (y1,...,yn), and dose is a vector of the doses given to each of the n animals

> res.logit <- glm(y ~ log(dose + 1), family = binomial, data = d)

> summary(res.logit)

Call:
glm(formula = y ~ log(dose + 1), family = "binomial", data = d)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.95101  -0.37867  -0.07993   0.56824   2.31126

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    -5.7448     2.0339  -2.825  0.00473 **
log(dose + 1)   1.3118     0.4324   3.034  0.00242 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 52.925 on 39 degrees of freedom
Residual deviance: 27.510 on 38 degrees of freedom
AIC: 31.51

Number of Fisher Scoring iterations: 6

Fig. 4. The R-code and outputs to obtain the MLE of a and b.


[Figure 5 appears here: probability of skin lesion (0 to 1) plotted against dose (0 to 400 mg), showing the observed points, the model prediction, and the 95% confidence interval; see the caption below.]

Fig. 5. Observed and model-predicted probability of skin lesion (black dots are the
observed proportions of monkeys having skin lesions in the corresponding dose groups;
the solid line is the model-predicted probability of skin lesion; dashed lines are the 95%
confidence interval of the probability p; Dose = 1 represents placebo).

$p = \frac{\exp(a + b\ln(D+1))}{1 + \exp(a + b\ln(D+1))}$. For example, when no drug is given, i.e.,
D = 0, the probability of having skin lesions is $\frac{e^{\hat{a}}}{1+e^{\hat{a}}} = 0.003$ and
the 95% confidence interval is (0, 0.15). When D = 200 mg, the
probability of skin lesions is 0.77, with a 95% confidence interval of
(0.52, 0.91). Note that the confidence intervals are calculated using
the formula described in Subheading 2.4.
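These predictions can be reproduced in R from the group counts given above; a minimal sketch (reconstructing the individual 0/1 outcomes from the counts):

# Rebuild the 40 observations, refit the Fig. 4 model, predict at D = 200 mg.
d <- data.frame(dose = rep(c(0, 10, 100, 300), each = 10),
                y = unlist(lapply(c(0, 1, 5, 9), function(k) rep(1:0, c(k, 10 - k)))))
res.logit <- glm(y ~ log(dose + 1), family = binomial, data = d)
pr <- predict(res.logit, newdata = data.frame(dose = 200),
              type = "link", se.fit = TRUE)
plogis(pr$fit + c(0, -1.96, 1.96) * pr$se.fit)  # ~0.77 and its ~(0.52, 0.91) CI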

References
1. Hald A (1999) On the history of maximum likelihood in relation to inverse probability and least squares. Statist Sci 14(2):214-222
2. Aldrich J (1997) R.A. Fisher and the making of maximum likelihood 1912-1922. Statist Sci 12(3):162-176
3. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrox BN, Caski F (eds) Second international symposium on information theory. Akademiai Kiado, Budapest, pp 267-281
4. Schwarz G (1978) Estimating the dimension of a model. Ann Statist 6:461-464
5. McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, New York
6. Cox DR (1970) The analysis of binary data. Chapman and Hall, London
7. Cox DR (1972) Regression models and life tables. J Roy Statist Soc 34:187-220
8. Lindsey JK (2001) Nonlinear models in medical statistics. Oxford University Press, Oxford, UK
9. Wu L (2010) Mixed effects models for complex data. Chapman and Hall, London
10. Beal SL, Sheiner LB, Boeckmann AJ (eds) (1989-2009) NONMEM users guides. Icon Development Solutions, Ellicott City
11. Yang S, Roger J (2010) Evaluations of Bayesian and maximum likelihood methods in PK models with below-quantification-limit data. Pharm Stat 9(4):313-330
12. Fletcher R (1987) Practical methods of optimization, 2nd edn. Wiley, New York
13. Young GA, Smith RL (2005) Essentials of statistical inference, chapter 8. Cambridge University Press, Cambridge, UK
14. Bickel PJ, Doksum KA (1977) Mathematical statistics. Holden-Day, Inc., Oakland, CA
15. Casella G, Berger RL (2002) Statistical inference, 2nd edn. Duxbury, Pacific Grove, CA
16. DeGroot MH, Schervish MJ (2002) Probability and statistics, 3rd edn. Addison-Wesley, Boston, MA
17. Spanos A (1999) Probability theory and statistical inference. Cambridge University Press, Cambridge, UK
18. Pawitan Y (2001) In all likelihood: statistical modelling and inference using likelihood. Cambridge University Press, Cambridge, UK
19. SAS Institute Inc. (2009) SAS manuals. http://support.sas.com/documentation/index.html
20. STATA Data analysis and statistical software. http://www.stata.com/
21. The R project for statistical computing. http://www.r-project.org/
22. The Monolix software. http://www.monolix.org/

Chapter 25
Chapter 25

Bayesian Inference
Frederic Y. Bois

Abstract
This chapter provides an overview of the Bayesian approach to data analysis, modeling, and statistical
decision making. The topics covered go from basic concepts and definitions (random variables, Bayes rule,
prior distributions) to various models of general use in biology (hierarchical models, in particular) and ways
to calibrate and use them (MCMC methods, model checking, inference, and decision). The second half of
this Bayesian primer develops an example of model setup, calibration, and inference for a physiologically
based analysis of 1,3-butadiene toxicokinetics in humans.

Key words: Bayes rule, Bayesian statistics, Posterior distribution, Prior distribution, Markov chain
Monte Carlo simulations

1. Introduction

Bayesian statistics are essentially a straightforward probabilistic-calculus
approach to data analysis and modeling. They require summarizing,
in the form of probability distributions, the state of
knowledge before seeing the data. As we will see, specifying such
prior knowledge may require some care. Yet their results do not
rely on the assumption of an infinite number of observations,
and their methods are exact (they are never asymptotic, and the
confidence intervals they provide are exact, up to numerical
error, even for small sample sizes). They are relatively transparent
and easy to understand, even for complex models. That is a particular
advantage in biology, where complexity is usually the name of
the game. Hence the favor they have found over the last 20 years in that
field, a success much aided by advances in numerical computing
(the famous and remarkably powerful Markov chain Monte Carlo
methods) applied to probabilistic calculus.


Bayesian analysis proceeds in essence by inferring about


(hidden) causes on the basis of (observed) effects, i.e., on the
basis of data. The difference between frequentist and Bayesian
treatments of the problem is that the frequentist assumes that the
data should be treated as random variables, while the Bayesian
considers that the data are given and that the model parameters
are random variables. The difference may seem subtle but results in
quite a different mathematical treatment of the same problem. The
causes dealt with are either the structure or the parameter values of
probabilistic models of the actual phenomena supposed to generate
the observations. Such probabilistic models can range from fairly
simple to very sophisticated. For example, the analyst can posit that
the observations are generated by a purely random Gaussian pro-
cess (the model); That being assumed, her aims can be, for
example, to infer about reasonable values of the mean and standard
deviation (SD) of the model. She may then want to check that the
model is approximately correct and to predict how many samples it
will take for a future observation to be above a certain threshold
value. Being probabilistic in essence, Bayesian analysis derives its
inferences in the form of probability distributions for the variables it
seeks to identify (the mean, SD, or number of future samples in the
previous example). Such distributions, called posterior because
they are the final result of the analysis, summarize what is known
about those variables, including the uncertainty remaining about
them. For a general idea of the method, readers can stop here; but
useful details follow for those interested in using Bayesian methods.
For further exploration of the topic, please refer to (17). A journal
(Bayesian Analysis, http://ba.stat.cmu.edu), edited by the Interna-
tional Society for Bayesian Analysis (ISBA, http://www.bayesian.
org/), is specifically devoted to research in the area.

2. Important
Concepts
2.1. Random Variables Everywhere

A fundamental tenet of the Bayesian approach is that probabilities
are formal representations of, i.e., tools to quantify, degrees of belief.
First, the model posited to generate the data is probabilistic and
therefore gives the probability of occurrence of any particular data
value. For example, if we assume that the data values are normally
distributed with mean μ and SD σ, we would bet that data values
exceeding μ + 3σ will be quite rare. Actually, any model output can
be considered a random variable. Second, the model parameters, if
their values are not precisely known to the analyst, are also
described by probability distributions and are treated as random
variables. This can also apply to the model's boundary conditions,
e.g., the initial concentrations in a reactor. Following our simple

Fig. 1. A simple graphical model. The data Y are supposed to be distributed normally
around mean μ with SD σ. The conditional dependence of Y on μ and σ is indicated by
arrows.

example, we may know that the mean μ we seek to identify is
somewhere between 1 and 100, with no particular idea of where it
sits in that interval: a simple choice of distribution for μ would then
be uniform between those two bounds. Such a parameter distribution
is called a "prior" distribution, because it can (and should)
be defined before seeing the data to analyze, and only on
the basis of the knowledge we have of the process studied (the case
of complete ignorance about the possible values of the model
parameters will be examined below).

2.2. Graphical Models

A statistical model can be usefully represented by a graph, more
precisely by a directed acyclic graph (DAG) (8, 9). In such graphs
(see Fig. 1) the vertices (nodes) represent model-related random
variables (inputs or design parameters, observable quantities, etc.),
and directed edges (arrows) connect one vertex to another to
indicate a stochastic dependence between them. Figure 1 represents
graphically the above Gaussian example: the data values Y are
supposed to be distributed around mean μ with SD σ; the dependence
of Y on μ and σ is indicated by arrows. To be coherent, these
graphs need to be acyclic, which simply means that there is no way
to start at a vertex and follow a sequence of edges that eventually
loops back to the start. Besides the clarity of a graphical view, DAGs
also offer simple algorithmic ways to automatically decompose a
complex model into smaller, more tractable subunits. Such simplifications
can be achieved through rules to determine conditional
independence between nodes, as explained below.

2.3. Multilevel (Hierarchical) Models

A very useful and common extension of the simple model of Fig. 1
is to build a hierarchy of dependencies (10). Take the example of
inter- and intraindividual variability. Interindividual variability means that
individuals differ in their characteristics, or more precisely
in the measures of their characteristics. For example, body mass
differs between subjects. This difference can be modeled deterministically
(mass increases nonlinearly with age, etc.), but such
modeling usually leaves out some unexplained variability. People
of the same age, sex, ethnicity, birth cohort, etc. still have different
body masses. In the face of such residual uncertainty (or very early
on, if we have no mechanistic clue) we can resort to probabilistic

Fig. 2. A hierarchical (multilevel) graphical model. The data Y are supposed to be
distributed normally around mean θij with SD σ. The parameter θij is in turn normally
distributed around θi with SD Δ. Finally, θi is distributed around mean μ with SD Σ.
Prior distributions P are placed at the open ends of the hierarchy.

modeling: the residual variability is "random." To extend the
model of Fig. 1, assume that the body masses of n subjects are
measured twice, on two different days. The measured value, Y, for a
given person differs by a random amount from the "true" value, θij,
of her body mass, because scales are imprecise. We note by σ the SD
of measurements with the scale at hand. But body mass values vary
notoriously from hour to hour and day to day, depending on what we
eat, drink, eliminate, etc. A person's θij on a given measurement day
j will differ from that on another day. That difference, again, could
be due to growth during puberty, but let us assume again that it is
random (imagine we did not keep track of calorie intake during the
holidays). A simple way to model such intraindividual variability is
to assume that the "instantaneous" θij is distributed normally
around a subject-specific mean θi with SD Δ (in Fig. 2 you have to
imagine that there are two θij per subject). We weighed several
subjects, and i goes from 1 to n: there is no reason to assume that
individual body masses θi are the same. If there are no obvious
differences in age, sex, etc. to stratify upon or model with a growth
curve, we can again resort to stochastic modeling and imagine that
they are distributed randomly around a grand population mean μ
with a population SD Σ (again, imagine the θi node in Fig. 2 as a
collection of n nodes, all linked to μ and Σ). If we do not have
several measurements per subject, trying to recover information
about intraindividual variability is difficult, but the framework
exposed in (11), in which a strong informative prior is placed on
Δ, might help.

Why go to such lengths in describing a simple body mass measurement?
First, biological sciences have evolved dramatically during
the last 50 years. It was customary before to think in terms of a
"reference man," to average measurements and trim outliers.
Understanding differences and susceptibility, moving toward
personalized medicine, is at the front line of our current thinking.

In that context, correct estimates of θi, Δ, and Σ are potentially
important. The model also separates the estimation of uncertainty (σ is
an uncertainty estimate) from the assessment of variability (Δ and Σ are
variability parameters). A note on σ is in order here: if we are sure
that our model is correct (Gaussian errors, etc.), then σ clearly
represents measurement error. However, if the model is even a bit
wrong (biased scale, lognormal errors, etc.), σ will in fact represent
an omnibus aggregate of modeling and measurement error. This
has clear implications for its posterior interpretation, but also for
setting its prior: our prior on σ should be vaguer than one for
pure measurement error if we are not totally confident about our
model. This can also be checked a posteriori: if our posterior
estimate of σ is much larger than an informative prior based on
quality-assurance data on measurement precision, we should suspect
that something is wrong with the model. The above model,
purely stochastic, could be augmented with deterministic links,
introducing various known or suspected covariates like age, sex,
genetic variants, etc. The existence and strength of those links can
be tested and better estimated with such models, leading to better
testing of mechanistic hypotheses and more predictive science.
Note that this type of model is not specifically Bayesian and can
be treated using frequentist approximations (12). A specifically Bayesian
addition is the use of prior distributions placed on the population
parameters μ, Σ, Δ, and σ. Priors are designated by P in square
nodes in Fig. 2. Enabling a better use of prior information is
actually a second, specifically Bayesian, advantage of hierarchical
models. Most of the time, currently (this may change in the future),
what we already know about biological systems is "on average."
We may know the average height of adult male Caucasian Americans,
for example. Placing that information at the right level in the
model hierarchy is cleaner, more precise, more transparent, and
leads to better inference. Another specificity of the Bayesian
approach is the treatment of the data as given and the derivation of
parameter distributions conditional on them, using Bayes' rule as
described below.

Note that multilevel models extend far beyond interindividual
variability. The units they consider can be, for example, ethnic
groups, geographic areas, or published study results (in meta-analysis).
In fact, they cover the whole area of nonlinear mixed-effect
models and latent variable models, or about any problem in
which various homogeneous groups, possibly nested, can be
defined.

2.4. Mixture Models

Mixture models can be seen as a special class of hierarchical model,
in which the intermediate nodes are discrete indicator variables
(sorts of labels). For example, assume that we are given measurements
of the body mass of a group of male and female rats. We just
know that animals of both sexes were used, but we do not know
the sex of any particular animal; we just have one body mass for it.
We also know that males and females have different average body
masses, and we are in fact also interested in recovering (guessing)
the sex of each animal. This is a particular case of a very
general classification problem, and we could easily imagine more
than two classes. The corresponding model can be viewed hierarchically,
with a set of indicator variables, δ, above the individual
data, conditioning the values of the individual parameters. The
variables δ are themselves given a probabilistic specification, usually
in the guise of a binomial model if there are two classes, or a multinomial
one beyond that. Further details on this class of models are given,
for example, in (5).

2.5. Nonparametric Models

The flexibility of hierarchical and mixture modeling is often sufficient
to yield useful predictive models and data analyses. However,
flexibility can be further increased by calling upon infinite-dimensional
models (paradoxically called "nonparametric" models)
of functions. These nonparametric models can be used to replace
fixed functional forms like the linear model, or even a fixed distribution
like the lognormal. Relaxing the assumption of a fixed functional
link leads to nonparametric regression models, for which
Gaussian processes, spline functions or mixtures, or Dirichlet processes
have been well investigated (13, 14). For example, the
assumption of a simple unimodal distribution of body masses
within a population is probably incorrect if the population is not
homogeneous. This may not prevent posterior individual estimates
from forming clumps which can be identified, but estimation may
be unstable, convergence may be difficult to obtain, and in any
case the population variance will mean very little. A mixture of Dirichlet
processes can be used to put a flexible model on the distribution
of the individual parameters in the population. Similarly, with
enough daily data, we may want to model flexibly the uneven
evolution of body mass during pregnancy, rather than resorting to
the simplistic variability model of Fig. 2 (15).

2.6. Conditioning on the Data to Update Prior Knowledge: Bayes' Rule

The Bayesian approach also considers that the data, once they have been
observed (i.e., have become "actual," as opposed to "imagined"),
cease to be random and have to be treated as fixed values.
This is similar to the quantum physics view of our world, in which
particles are described as density functions (waves) before they are
observed, and collapse into actual particles with precise characteristics
and position only after observation (16). Thomas Bayes' idea
was then to simply apply the definition of conditional probabilities
to reverse the propagation of uncertainty.

By definition, the conditional probability of an event A, given
an event B, is:

$P(A|B) = \frac{P(A, B)}{P(B)},$  (1)

where P(A, B) denotes the joint probability that both A and B
occur, and P(B) the probability that B occurs, regardless of what
happens to A. That definition applies to probabilities, but also,
more generally, to probability distributions, be they discrete or
continuous density functions. By convention, in the following, we
will write [x] for the probability distribution of the random variable x.
With that notation, Eq. 1 reads $[A|B] = [A, B]/[B]$.
A posteriori inference. After having observed data, if those are different
from what we expected (shifted, over-concentrated, or over-dispersed),
we usually want to infer about the parameter values susceptible of having
led to such observations.¹ That requires computing $[\theta|y]$, the posterior
distribution of all model parameters, θ, given the data y (i.e., posterior
to collecting y). Applying Eq. 1, we simply obtain:

$[\theta|y][y] = [\theta, y] = [y|\theta][\theta] \;\Rightarrow\; [\theta|y] = \frac{[\theta][y|\theta]}{[y]}.$  (2)
This is the celebrated Bayes' rule, which states that the probability
distribution of the unknowns, given the data at hand, is
proportional to the prior distribution $[\theta]$ of those unknowns
times the data likelihood, $[y|\theta]$, which depends on the model. In
some cases, for complex or peculiar models, it may be difficult to
compute $[y|\theta]$, but the principle remains the same. The term
[y] is called the prior predictive probability of the data. It can be
obtained by marginalizing $[y, \theta]$, i.e., integrating it over the parameters θ:

$[y] = \int_\theta [y|\theta][\theta]\, d\theta.$  (3)

Since the data are considered fixed numerical values, [y] can be
treated as a normalization constant. It can be calculated
precisely or numerically, or even remain unspecified, as when using
MCMC sampling, which can sample values of θ from $[\theta|y]$ regardless
of the value of [y].

The posterior parameter distribution summarizes what is
known about θ after collecting the data y, including the remaining
uncertainty about it. It is obtained by updating the prior $[\theta]$ with the
data likelihood (Eq. 2), and this updating is a simple multiplication.
In the usual case where several parameters have to be estimated, the
posterior distribution is a joint (multivariate) distribution, which
can be quite complex (see Subheading 2.8).
Posterior predictive probability. The probability distribution for new
(future) data, once we have updated the parameter distribution,
should also reflect that updating. Hopefully, having analyzed data

¹ Note that if the data were exactly what we expected a priori, there would not be much need to improve the
model.

will lead us to make more precise predictions of the future. In fact,


all the above distributions can be sequentially updated as new data
are observed. This makes the development of sequential tests par-
ticularly easy in Bayesian statistics. Using the square bracket nota-
tion, the posterior predictive probability for a new data value z,
given that some data y has already been observed, is obtained by
linking the past and future data through the parameter posterior [u|
y], integrating over all possible parameter values:

z jy z jyyjy dy: (4)
y

Conditional independence. The probabilistic approach taken in
Bayesian statistics therefore boils down to finding the joint distribution
of the model unknowns, given the data and the priors. This
could be a difficult task, even for moderately complex models. But
in fact we can usually simplify the problem. Take back the example
of the hierarchical model of Fig. 2. Its unknowns are the parameters μ,
θi, θij, Σ, Δ, and σ. In all generality we are therefore looking for the
posterior [μ, θi, θij, Σ, Δ, σ|Y]. By Eq. 1 it is equal to [μ, θi, θij, Σ, Δ,
σ, Y]/[Y], which seems only mildly useful and definitely intimidating.
This is where conditional independence arguments come into
play. In Fig. 2, if we were given a value for θi, for example, the likely
values of μ and Σ would depend only on it and on their prior; the
values of all other parameters and data would not matter. Parameters
μ and Σ are therefore conditionally independent of θij, Δ, Y,
and σ, given θi. Let us split [μ, θi, θij, Σ, Δ, σ, Y] into the product [μ,
Σ|θi, θij, Δ, σ, Y] [θi, θij, Δ, σ, Y]. The above independence
argument implies that [μ, Σ|θi, θij, Δ, σ, Y] reduces to [μ, Σ|θi],
which by Bayes' theorem (Eq. 2) is equal to [μ, Σ] [θi|μ, Σ]/[θi].
Similar reasoning can be used to reduce [θi, θij, Δ, σ, Y]. We find:

$[\mu, \theta_i, \theta_{ij}, \Sigma, \Delta, \sigma, Y] = \frac{[\mu, \Sigma][\theta_i|\mu, \Sigma]}{[\theta_i]} \, \frac{[\theta_i, \Delta][\theta_{ij}|\theta_i, \Delta]}{[\theta_{ij}]} \, [Y|\theta_{ij}, \sigma]\,[\theta_{ij}, \sigma].$  (5)

But [θi, Δ] can be factored into [θi][Δ], and [θij, σ] into [θij][σ],
because our priors say nothing about θi or θij. Similarly, [μ, Σ] can
usually be factored, because we tend to assign independent prior
distributions to the various parameters. That is not mandatory: if
we had enough information about the covariance between μ and Σ, for
example, we could specify a bivariate prior distribution for them; in
that case, the term [μ, Σ] would remain. All factorizations done, by
independence arguments, we get:

$[\mu, \theta_i, \theta_{ij}, \Sigma, \Delta, \sigma, Y] = [\mu][\Sigma][\theta_i|\mu, \Sigma][\Delta][\theta_{ij}|\theta_i, \Delta][Y|\theta_{ij}, \sigma][\sigma].$  (6)

The posterior in this case is proportional to a simple product of
known distributions, if we have assigned some tractable form to
each prior. The three remaining conditional distribution terms are
simply Gaussian distributions, as specified by our model.

This is about all there is to Bayes' theorem. However, as we will
see below, there are some practicalities in defining the priors,
getting at the posterior, and making decisions on that basis.

2.7. Quantifying Prior Knowledge

Informative priors. The existence of prior information about a
model's parameter values is quite common in biology. For example,
we can define reasonable bounds for almost any parameter having
a "natural" meaning. That is even more the case when the
models are detailed and mechanistically based. PBPK models
(see the corresponding chapter in this volume) are good examples
of those. The scientific literature, nowadays often abstracted in
databases, gives us values or ranges for organ volumes, blood
flows, etc. Such data can be used directly to inform prior parameter
distributions, without analysis through the model. Such priors
may also come from Bayesian analyses using simpler models. In that
case, we often proceed by analogy, assuming that the prior information
comes from individuals or species similar to those for which
we have the data to analyze. The fact that interindividual variability
is often present may cast doubts on the validity of such analogies,
but hierarchical models (see above) can be used to protect against
unwarranted similarity assumptions. Another way to obtain informative
priors is to elicit them from expert judgment (2, 4, 17).
Various techniques can be used for such an elicitation. They usually
boil down to having field experts summarize their knowledge about
a parameter in the form of percentiles, and then fitting a distribution
function to such data. The distribution functions used can be
of minimal entropy, reference, or conjugate forms, etc.
(see below). In any case, we should in general strive to use
carefully chosen informative priors, as that makes efficient use of the
knowledge already painstakingly acquired.
Caveat. The prior should not be constructed using the data entering
the likelihood (i.e., the data to be analyzed). This would be a double
use of the same information and clearly violates Bayes' rule.

Vague (noninformative) priors. In some cases (ad hoc parameters,
symmetry of the problem, very poor information a priori, overwhelming
data which will surely dominate the inference, etc.) a
vague prior can be preferred. In that case it is first assumed that
the parameters are a priori independent (i.e., what we know about one
does not have a bearing on what we know of another). Second, all
values of the parameter (or of its logarithm, if it is a variance
parameter) are considered equiprobable. An example of symmetry
reasoning is to use a priori the same probability of occurrence for
each of the six sides of an ordinary dice. A certain number of
mathematical criteria have been proposed to derive vague priors,
such as Jeffreys, maximum entropy, hierarchical, or reference
priors (2, 4, 6, 18).

Improper priors. Noninformative priors are often improper, or
degenerate, in the sense that they do not have a defined integral
over their range of values. Proper probability density functions have
an integral equal to 1. When multiplied by a proper likelihood
function, an improper density may lead to a proper posterior, but that
is not always the case, in particular in complex hierarchical models.
Improper posteriors are a real problem, as they lead to nonsensical
inferences and decisions. Improper priors should be used very carefully
and can usefully be replaced by vague (i.e., large variance) but
proper priors. The sensitivity of the results to vague assumptions
about the priors can (and probably should) be checked a posteriori
(see below the section on robustness of Bayesian analyses).

Conjugate priors. When using such priors, the analyst chooses a
prior distribution shape that matches the data likelihood, in the
sense that it leads to an easy, or at least a closed, analytical form for
the posterior distribution. For example, if the data model is Normal(μ, σ)
(and so is the data likelihood), a Normal prior distribution for
μ leads to a still Normal posterior distribution for μ, if σ is known.
Conjugate priors can be informative or vague, and they are simply
convenient for analytical calculus; that is their only justification.
Most numerical sampling algorithms, nowadays, do not require a
closed analytical form for the posterior; conjugate priors can then
be dispensed with, in favor of better informative priors or more
flexible nonparametric forms.

2.8. Getting at the Posterior

Analytical solutions. Simple problems, or problems for which conjugate
or flat priors can be used, may admit closed-form solutions.
That is usually the case with distributions from the exponential
family (3-5). For example, for binomially distributed data, and
hence a binomial model, the parameter to estimate is usually the
sampling probability p. Imagine a device drawing
random series of 0s and 1s. You assume that each time it draws either
1, with fixed probability p, or 0, with probability 1 - p. The probability
of drawing x ones in a series of n draws is then:

$[x|p, n] \propto p^x(1-p)^{n-x}.$  (7)

The conjugate prior for parameter p can be shown to be the
beta distribution, with hyper-parameters α and β (4-6):

$[p|\alpha, \beta] \propto p^{\alpha-1}(1-p)^{\beta-1}.$  (8)

Symbols α and β are called hyper-parameters to differentiate
them from the parameter of interest, in this case p (see Subheading 2.2).
With a beta prior and a binomial likelihood, the posterior
of p is still beta, but with updated parameters x + α and n - x + β:

$[p|x, n, \alpha, \beta] = \text{Beta}(p|x + \alpha, n - x + \beta).$  (9)

In such a case, posterior inference is automatic: the full
distribution of p is known. For example, its most probable value,
the mode of the distribution, is $(x + \alpha - 1)/(n + \alpha + \beta - 2)$. Similarly,
the predictive posterior distribution for a future draw is beta-binomial,
etc. (6).

Analytical solutions are also available for some multivariate
posterior distributions of the exponential family (3, 5). When
there are many data, and the prior is rather vague, the posterior
distribution will follow the data likelihood closely. In that case, if
the likelihood is analytically tractable, it can be used as an analytical
approximation of the posterior.
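For the beta-binomial case of Eqs. 7-9, the update is a one-liner; a minimal R sketch (the flat Beta(1,1) prior and the counts, borrowed from the skin-lesion example of the previous chapter, are assumptions made here for illustration):

a <- 1; b <- 1                             # hypothetical Beta(1,1) prior
x_obs <- 2; n <- 10                        # 2 "successes" in 10 draws
a_post <- a + x_obs; b_post <- b + n - x_obs
(a_post - 1) / (a_post + b_post - 2)       # posterior mode: (x+a-1)/(n+a+b-2) = 0.2
qbeta(c(0.05, 0.95), a_post, b_post)       # a 90% equal-tail credible interval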
Numerical solutions. Over the last 20 years, numerical algorithms
able to draw samples from a posterior distribution have become
mainstream. Samples can be obtained even if the posterior is
defined only up to a constant, in particular leaving undefined the
prior predictive distribution of the data ([y] in Eq. 2). A large
number of such algorithms exist, and for details the reader is
referred to the abundant literature on the subject (19, 20). Two
common methods are Markov chain Monte Carlo sampling and
Gibbs sampling.
Markov chain Monte Carlo (MCMC) sampling methods can generate
a sequence of random draws from any distribution for which the
density can be computed up to a constant. The first example of
those is the Metropolis-Hastings (MH) algorithm (21, 22). Let θ be
a set {θ1, ..., θn} of parameters of interest. According to Eq. 2, its
joint posterior distribution is proportional to the product of its
prior distribution by the data likelihood:

$[\theta|y] \propto [\theta][y|\theta].$  (10)
To simplify notation, let us write f(θ) for the product $[\theta][y|\theta]$.
At step zero, the MH algorithm starts from an arbitrary point
θ⁰ (the exponent is used here for indexing). It then samples a
candidate point θ′ using a so-called instrumental (i.e., arbitrary
but judiciously chosen) conditional distribution J(θ′|·). The conditioning
is in general on the previously sampled value of θ (in this
case θ⁰), hence the appellation "Markov chain," because the new
value depends partly on the previous one. For example, J(θ′|·) is very
often a multivariate Normal distribution centered on the previous
draw of θ. The candidate θ′ is selected or not according to a rather
simple procedure:

1. Compute the ratio

$r_i = \frac{f(\theta')\,J(\theta^i|\theta')}{f(\theta^i)\,J(\theta'|\theta^i)}$, with i = 0.  (11)

2. If r ≥ 1, accept θ′; otherwise, accept it only with probability r
(to that effect it is enough to sample a uniform random number
u between zero and one and accept θ′ if u ≤ r).
608 F.Y. Bois

3. If θ′ is accepted, keep it and call it θ¹; otherwise discard it and
set θ¹ = θ⁰.

The algorithm continues to sample proposed values and accept
or reject them, according to the value of ri, for as long as you wish. It
can be shown that after an infinite number of draws, the retained
values of θ form a random sample from the desired posterior
distribution. In practice, it is enough to run several chains from
different starting points θ⁰ and check that they approximately
converge to the same distribution (23). Note that if J(θ′|·) is symmetrical
and centered on the previous value, the ratio J(θⁱ|θ′)/J(θ′|θⁱ) in Eq. 11
is always equal to 1 and does not need to be
evaluated. Note also that the above assumes that f(θ) can be evaluated
numerically.
Instead of sampling the whole parameter vector θ at once, it is common
practice to split it into components of possibly differing dimensions, as
we have done in Eq. 5 using conditional independence arguments,
and then update these components one by one. In the simplest case,
each component is a scalar and its proposal distribution J is univariate.
The sampling scheme of Eq. 11 is used, but the posterior f is then the
distribution of a given component given all the others, its prior, and
the data likelihood. Oftentimes, conditional independence arguments
(see Subheading 2.2) simplify f dramatically.
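As an illustration, here is a minimal random-walk Metropolis sketch in R (not from the original text; the model and data are hypothetical: a Normal mean μ with known SD 1 and a vague Normal prior on μ):

# log f(theta) = log prior + log likelihood (up to a constant)
y <- c(4.1, 5.3, 4.7, 5.9, 5.2)
log_f <- function(mu) dnorm(mu, 0, 10, log = TRUE) + sum(dnorm(y, mu, 1, log = TRUE))
n_iter <- 5000
theta <- numeric(n_iter); theta[1] <- 0        # arbitrary starting point
for (i in 2:n_iter) {
  cand <- rnorm(1, theta[i - 1], 0.5)          # symmetric proposal J
  log_r <- log_f(cand) - log_f(theta[i - 1])   # J ratio cancels by symmetry
  theta[i] <- if (log(runif(1)) <= log_r) cand else theta[i - 1]
}
mean(theta[-(1:1000)])                         # posterior mean after burn-in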
Gibbs sampling. A special case of the component-by-component
Metropolis-Hastings algorithm is Gibbs sampling (5, 24). For a
number of general models, and in particular with conjugate distributions,
the conditional posterior distribution of some or all the
components of θ can be sampled from directly. In that case, the
proposed value is always accepted and the algorithm can be quite
fast. The hybrid Metropolis-Gibbs sampling, often used, consists in
using Gibbs sampling for the components of θ for which direct sampling
can be done, and a Metropolis step for the others.
Particle algorithms. These methods were mainly developed for
applications to sequential estimation problems (25) or difficult
problems with multiple posterior modes; see ref. 26 for a recent
review of the topic. To simulate a sample from the target distribution
f, N samples are generated and tracked in parallel. Each of
these samples is called a "particle," and the N particles together form
an interacting-particle system. The basis of these particle algorithms
is in general a sampling importance resampling (SIR)
scheme (27, 28), but other selection methods are possible. SIR
proceeds as follows:

- Draw a set of N values θi ("particles") from the prior $[\theta]$.
- Assign to each θi a normalized weight equal to $[y|\theta_i] / \sum_j [y|\theta_j]$.
- Resample M values (particles), with replacement, from the
previous sample, using the weight of each θi as its sampling
probability.

The new sample obtained can be considered as drawn from the
posterior, proportional to $[\theta][y|\theta]$. If the weights are very different, the resampling
will tend to favor a few of the original particles, leading to a
"degeneracy" problem, with many particles having the same value.
To avoid that, an independent Metropolis-Hastings step may be
added for each particle, to move the particles away from each other. The
particles "interact" with each other because their weights are not
independent, and so, therefore, is their sampling.
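A minimal SIR sketch in R, for the same hypothetical Normal-mean problem as the Metropolis sketch above:

y <- c(4.1, 5.3, 4.7, 5.9, 5.2)
N <- 10000
theta <- rnorm(N, 0, 10)                            # particles from the prior
logw <- sapply(theta, function(m) sum(dnorm(y, m, 1, log = TRUE)))
w <- exp(logw - max(logw)); w <- w / sum(w)         # normalized weights
post <- sample(theta, N, replace = TRUE, prob = w)  # weighted resampling
mean(post)                                          # ~ posterior mean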
Software. The major software package we recommend is R (http://
www.R-project.org), which is free and very well maintained and
developed by the R Development Core Team (29). More specifically
Bayesian are BUGS (http://www.mrc-bsu.cam.ac.uk/bugs)
and GNU MCSim (http://www.gnu.org/software/mcsim) (30,
31). Many other commercial and specialized software packages are available.
For further information, multiple resources are available on the
Web, such as the International Society for Bayesian Analysis
(ISBA, http://www.bayesian.org), the Bayesian Inference for the
Physical Sciences project (BIPS, http://www.astro.cornell.edu/
staff/loredo/bayes), the ASA Section on Bayesian Statistical
Sciences (http://www.amstat.org/sections/SBSS/), etc.

2.9. Checking the Results

The above models and analyses can be quite complex. Common
sense dictates that we should check them thoroughly before jumping
to conclusions. Several aspects of the model can be checked:

Posterior distribution consistency. The joint posterior distribution of
the model unknowns summarizes all our knowledge about them,
from priors and data. Hopefully, the data have been sufficient to
modify the prior substantially, so that prior and posterior differ, the
latter being still reasonable. For example, the posterior density
should not be concentrated on a boundary, if a bounded prior has
been used. That would be a sure sign that data and prior conflict
about the value of some parameters. Such conflicts can also lead to
multi-peaked posterior distributions, hard to estimate and sample
from. In that event, the data, the prior, and the model itself should
be carefully questioned. If the data are not informative, we fall back
on the prior, which is not a problem as long as informative prior
distributions have been used. Actually, there are ways to estimate
retrospectively or prospectively the information gain brought by
specific experiments (32, 33), but that goes beyond the scope of this
chapter. If neither the data nor the prior (or even part of the prior)
is informative, the posterior is left vague, and numerical algorithms
usually have problems converging in such cases. So, beware of
noninformative priors on any parameter if you are not sure that
the data will be informative about it. The problem may also arise in
higher dimensions than the usual univariate marginals, and it may
happen that only combinations of parameters are identifiable given

the data. For example, if two parameters a and b only appear as
the product ab in the model formulation, chances are that the data
likelihood only constrains that product, and that any combination of
a and b will be acceptable if vague (even if proper) priors are used.
This can be diagnosed by examining the correlations between pairs
of parameters, and such a check should be routinely done. Higher-dimensional
correlations can indeed happen, but they are harder to
check, and they tend to translate into 2D correlations anyway.
Data fit. Let us imagine that we have a well-formed estimate of the
posterior distribution, or a large sample of parameter vectors drawn
from it. The next, obvious, step is to check whether the data
analyzed are well modeled. That is relatively straightforward to do
if the graphs to construct (e.g., observed vs. predicted data values)
are easy to obtain. If the posterior predictive distribution of the data
is analytical, it should be used to assess systematic deviations in
scale or location between the actual data and their estimated distribution.
If a posterior parameter sample was obtained by numerical
methods, the model can usually be run for each parameter vector
sampled, to simply simulate data values. For each data point, a
histogram of predicted values can be constructed and the probability
of the data under the model can be checked (confidence bands
can, for example, be formed for the data).
Cross validation. Whenever possible, it is worth keeping part of the
data unused for model calibration and reserving it for the predictive
data check described above (rather than using the data used for
calibration). If the cross-validation data are reasonably well modeled,
they can always be reintroduced into the calibration data set for
an increased accuracy of the parameter estimates.

2.10. Inference and Decision

Summarizing the results. The results of your analysis could be given
(published) in the form of a large sample from the posterior distribution,
but it is usual and probably useful to summarize that
information in the form of point estimates, credibility intervals,
etc. In the Bayesian framework all these can easily be cast as a
decision problem, involving possible valuation of the consequences
of errors (an overestimation error may not have the same importance
or cost as an underestimation). It can be shown, for example,
that with a loss proportional to the square of the errors, the optimal
marginal estimate for a parameter is its posterior mean. Similarly,
under absolute error loss, the optimal Bayes estimate is the posterior
median, and under zero-one loss the Bayes estimate is the
posterior mode. However, most people to whom we communicate
results are unable to justify a preference for a particular loss function,
and offering them several estimates just confuses them. In that
case, I tend to prefer reporting the mode, particularly with many-parameter
models, because it is the best compromise between prior
and data. While point estimates are useful summaries, many people

have also come to understand and appreciate a level of uncer-


tainty attached to them. Again, that is very natural from a Bayesian
point of view, which considers distributions, even predictive, as
measures of willingness to bet on any particular value. Easy to
compute and present measures of uncertainty (at least in the case of
univariate, e.g., marginal, distributions) are standard deviation,
coefficient of variation, and percentiles: there is only x% chance
that a parameter y exceeds its (1  x)th percentile. Another,
slightly different, way to assess uncertainty is via credible
regions.
Highest posterior density credible regions. These are Bayesian analogs
of frequentist confidence intervals. To construct a 100α% credible
set for a parameter θ, you choose a set A such that P(θ ∈ A) ≥ α (i.e.,
the probability that θ lies in A is at least α). Such a credible set has the
advantage of containing the most likely values of θ. It is sort of an
extension of the mode and can be easily applied to multidimensional
distributions. If you have a random sample from the posterior,
generated by an MCMC algorithm, it is easy to have an estimate of
the posterior density (up to a constant) output at the same time. If
you have a sample of 100 posterior parameter vectors, you would
just keep the 95 vectors having the highest posterior density (just
sort them on that criterion) to get the parameter vectors from the
95% highest posterior density region. It is then just a matter of finding the
sample's boundaries and plotting contours or histograms.
Posterior predictive simulations. The same measures of location and
uncertainty can be used for any prediction the model can make. It is
just a matter of running simulations using the corresponding
posterior parameter vectors as input.
Hypothesis testing. Bayesian analysts tend to prefer posterior inference to hypothesis testing because, as defined classically, hypotheses are usually sharp alternatives. Posterior distributions, on the contrary, are often smooth, and hypotheses tend to introduce arbitrary decision elements into the problem. In any case, if deemed useful, tests can be performed in Bayesian analysis in quite a natural way. Consider, for example, a first hypothesis H0 that parameter θ belongs to a set ω0, versus the alternative H1 that θ ∈ ω1. The first step, as usual, is to assign prior probabilities to those two hypotheses, say [H0] and [H1]. With symmetric loss (equal loss when choosing either hypothesis if it is false), it can be shown that the optimal decision is to choose the hypothesis which has the highest posterior probability. If you have a random posterior sample of parameter values, it suffices to count the samples which fulfill H0 versus those which fulfill H1 and choose the hypothesis with the highest count. You just need to make sure that your prior parameter distribution correctly reflects your prior beliefs about H0 and H1, as it should.
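As a toy illustration of this counting rule, the sketch below tests two hypothetical sharp alternatives about a rate constant against a stand-in posterior sample; the 0.25 /min cutoff is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
kmet = rng.lognormal(np.log(0.2), 0.3, size=15000)  # stand-in posterior sample

# hypothetical alternatives: H0: kmet <= 0.25 /min versus H1: kmet > 0.25 /min
p_h0 = np.mean(kmet <= 0.25)
print(f"P(H0|y) ~ {p_h0:.3f}, P(H1|y) ~ {1 - p_h0:.3f}")
# under symmetric loss, choose H0 if p_h0 > 0.5, else H1
```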
To compare two hypotheses (or two models), if you prefer to avoid expressing a prior preference for them, you can use the Bayes factor (BF) (34). BF is the ratio of the posterior odds in favor of H0 over the prior odds:

$$\mathrm{BF} = \frac{[H_0 \mid y] / [H_1 \mid y]}{[H_0] / [H_1]} = \frac{[y \mid H_0]}{[y \mid H_1]}, \qquad (12)$$

where [y|Hi] is the prior predictive distribution of the data defined in Eq. 3, with the conditioning on the model made explicit:

$$[y \mid H_i] = \int_{\theta_i} [y \mid \theta_i, H_i]\,[\theta_i \mid H_i]\, d\theta_i. \qquad (13)$$

Often the two models have parameters in common, but that is not necessary, hence the notation θi to differentiate the two parameter sets. BF measures whether the data y increase or decrease the odds of H0 relative to H1. A BF higher than one means that, as far as the data are concerned, H0 is more plausible than H1. Note that for BF to be meaningful, the prior predictive distributions of the data must be proper, and hence also the parameter priors in Eq. 13.

Model choice is a decision problem that can be cast in the above hypothesis-testing framework. The optimal decision is to choose the model with the highest posterior probability, and if our prior beliefs are indifferent, we can use the Bayes factor to compare various models. Note that we tend to favor parsimony and follow Ockham's principle that our explanations of the world should be as simple as possible. The Bayes factor automatically penalizes for parameter number (the predictive density of the data is necessarily smaller for higher parameter numbers, because the same integral has to spread over a larger space). One problem with automatic penalization is that the Bayes factor may give too much weight to parsimonious models in the case of vague priors. The question of model choice is in fact still an open one, but Bayesian covariate selection is well developed.
Sampling model structure. Models with different structures can also be randomly sampled by MCMC sampling. In that case, their parameters also need to be sampled at the same time, since the posterior is a mixture of parameter distributions indexed by the set of models considered. Priors, preferably proper even if vague, should be placed on the models and their parameters. Care must then be taken in MCMC sampling, because the number of parameters may not be the same for the various models. Green (35) proposes for that particular case a reversible jump MCMC algorithm which solves the problem. The output of such an algorithm is a series of model indices and associated parameter vectors. For a given model index (i.e., a given model) the parameter values sampled can be analyzed exactly as above for a single model.
Model averaging. As opposed to model choice, in which just one model is selected as the best even if it is only marginally better, model averaging uses a panel of models to perform predictive simulations. If models and parameters have been jointly sampled as explained just above, it makes sense to use that joint posterior distribution to perform predictive simulations. This usually improves the predictive performance and robustness of the results.
Robustness of Bayesian analyses. The fact that a Bayesian analysis requires the definition of prior distributions is not in itself a particular problem if those priors can be justified on substantive grounds. When that is not the case, it is useful or even necessary to check the sensitivity of the results with respect to changes in the priors. The simplest approach is to rerun the analysis with different prior assumptions and check the stability of the results. If an important result is influenced by the shape of a vague prior placed on a particular parameter, it may be time to consider acquiring additional data on that parameter.

3. An Example of Application: Butadiene Population PBPK Modeling

The following example is taken from a series of clinical studies on 1,3-butadiene metabolism in humans. We were primarily interested in identifying the determinants of human variability in butadiene metabolism (36–42). I will present here an application of Bayesian population modeling to that question, along the lines of Gelman et al. (43). The deterministic link between the parameters of interest and the data will be a so-called physiologically based pharmacokinetic (PBPK) model (44–48).

3.1. The Data

The data we will analyze in this example were obtained from 11 healthy adults (aged between 19 and 65 years) who gave informed consent to the study. They were exposed on two occasions for 20 min to 2 ppm butadiene in the air. The second test took place 4–8 weeks after the first. Timed measurements of butadiene were made in their exhaled breath during exposure and for 40 min after it ended. For each subject, on each occasion, the pulmonary flow rate, Fpul, was monitored (with a coefficient of variation, CV, of 10%) and a blood sample was taken to determine the butadiene blood over air partition coefficient, Pa (estimated with a CV of 17%). In addition, the sex, age, body mass, and height of the subjects were recorded upon enrollment. For details see ref. 41.

3.2. Statistical Models

Visual inspection of the data (Fig. 3) strongly suggests that intraindividual variability is present, in addition to interindividual variability. To test the statistical significance of this observation, we will compare two multilevel population models using Bayes factors.
Fig. 3. Nine subjects' data (randomly taken out of 11) on 1,3-butadiene concentration in exhaled air after a 20-min inhalation exposure to 2 ppm 1,3-butadiene in the air. Both inter- and intraindividual variability are apparent. (Panels a–i, one per subject; axes: concentration in exhaled air (ppm), log scale, versus time (min).)

The first model, A, explicitly includes interindividual variability only and lumps together intraindividual variability, measurement error, and modeling error (Fig. 4). Model B has one more level, to account separately for the intraindividual variance (Fig. 5).

Models A and B are variants of the hierarchical model presented in Fig. 2. The graphs of Figs. 4 and 5 are fancier than the bare-bones DAG of Fig. 2. Known variables are in square nodes, unknown ones in round nodes, and we summarize the hierarchy by a piled deck of cards. The triangle f represents a deterministic function with parameters θk (k = 1 … n). Function f is in our case a physiologically based pharmacokinetic (PBPK) model with 14 parameters, which has to be evaluated numerically (see Fig. 6 and the next section). The complete corresponding standard DAG would have about 250 nodes and be very hard to read. That is actually why such graphs can be unwieldy and some practitioners shy away from them (A. Gelman, personal communication).
Fig. 4. A hierarchical interindividual variability model for 1,3-butadiene data analysis. For subject i, the measured exhaled air concentrations Yi are supposed to be distributed log-normally around the predictions of a PBPK model, f, with geometric SD σ1. Pulmonary flow rate measurements, Yθ1, and blood over air partition coefficient measurements, Yθ2, are assumed to be distributed around the corresponding parameters (Fpul and Pa, members of the subject-specific parameter set θi) with geometric SDs σ2 and σ3, respectively. Parameters θi, together with the inhaled air concentration C, measurement times t, and covariates φi, condition the predictions of f. At the population level, the θi are supposed to be log-normally distributed around a geometric mean μ with geometric SD Σ. Prior distributions P are placed on σ1, μ, and Σ. Known quantities are placed in square nodes, estimands in circular nodes.

In model A (Fig. 4), the individuals (i = 1 … 11) are each assumed to have a set of n unknown physiological or pharmacokinetic characteristics, measured by parameters θik, relevant in the context of our analysis. For a given characteristic k, the individual values θik are assumed to be log-normally distributed around a population mean μk, with a population (or interindividual) SD Σk:

$$[\log \theta_{ik} \mid \mu_k, \Sigma_k] = N(\log \theta_{ik} \mid \log \mu_k, \log \Sigma_k). \qquad (14)$$

Those population means and SDs are supposed to be known only partially, as specified by the prior distributions P(μ) and P(Σ) we will place on them.

For a given individual i, the observed exhaled air concentration values, Yi, are supposed to be log-normally distributed around a geometric mean given by the PBPK model f, with a geometric SD σ1. Model f, as described below, takes as input a series of measurement times, t, the exposure concentration, C, the individual parameters θi, and the covariates φi.
Fig. 5. Model B, describing inter- and intraindividual variability, for 1,3-butadiene data analysis. The symbols are identical to those of Fig. 4, with an extension to describe the pair of occasions j at which each subject i is observed. The data Y are now doubly subscripted and parameters θij describe the state of subject i on occasion j. They are assumed to be log-normally distributed around the subject-specific parameters θi with geometric SD Δ.

In our case t and C were the same for all subjects and are not subscripted. In model A, the data from the two occasions j for each subject are pooled together and considered as repeated measurements made at the same times. The direct measurements, Yθ1 and Yθ2, of the pulmonary flow rate, Fpul, and the butadiene blood over air partition coefficient, Pa, are not considered as known covariates, but modeled as data (which in effect they actually are), log-normally distributed around their true values (θ1 and θ2) with geometric SDs σ2 and σ3, respectively. The data likelihood is then:

$$[\log Y_i, \log Y_{\theta_1 i}, \log Y_{\theta_2 i} \mid f(\theta_i, \varphi_i, t, C), \sigma_1, \sigma_2, \sigma_3] = N(\log Y_i \mid \log f(\theta_i, \varphi_i, t, C), \log \sigma_1) \times N(\log Y_{\theta_1 i} \mid \log \theta_{i1}, \log \sigma_2) \times N(\log Y_{\theta_2 i} \mid \log \theta_{i2}, \log \sigma_3). \qquad (15)$$
Model B (Fig. 5) differs only slightly from A: an intermediate layer of occasion-specific parameters is added.

Fig. 6. Representation of the PBPK model used for 1,3-butadiene. This model corresponds to the function f in Figs. 4 and 5. Its parameters are listed in Table 1 and the equations are given in the text. (Compartments: lungs, fat, poorly perfused, well perfused.)

We now differentiate between two occasion-specific values, θij (j = 1 or 2), for a given individual and the average (over time) parameter values θi. At the population level, the k parameters of individual i, θik, are still distributed around μk, with SD Σk, as in Eq. 14. The new set of parameters, θijk, are assumed to be log-normally distributed around θik with an inter-occasion (or intraindividual) SD Δk:

$$[\log \theta_{ijk} \mid \theta_{ik}, \Delta_k] = N(\log \theta_{ijk} \mid \log \theta_{ik}, \log \Delta_k). \qquad (16)$$

Note that we assume that the inter-occasion SD is the same for all individuals, and also that individuals vary randomly in time. In some cases (pregnancy, aging, etc.) it is probably better to model explicitly the time evolution of the model parameters, rather than assigning their variation to chance.

The data likelihood does not change its form and is similar to Eq. 15, but (and that may be where the model improvement will lie) θi is replaced by θij:

$$[\log Y_{ij}, \log Y_{\theta_1 ij}, \log Y_{\theta_2 ij} \mid f(\theta_{ij}, \varphi_i, t, C), \sigma_1, \sigma_2, \sigma_3] = N(\log Y_{ij} \mid \log f(\theta_{ij}, \varphi_i, t, C), \log \sigma_1) \times N(\log Y_{\theta_1 ij} \mid \log \theta_{ij1}, \log \sigma_2) \times N(\log Y_{\theta_2 ij} \mid \log \theta_{ij2}, \log \sigma_3). \qquad (17)$$

3.3. Embedded PBPK Model

The same PBPK model f is embedded in models A and B. It is a minimal description of butadiene distribution and metabolism in the body after inhalation (Fig. 6). Three compartments lump together tissues with similar perfusion rates (blood flow per unit of tissue mass): the well-perfused compartment regroups the liver, brain, lungs, kidneys, and other viscera; the poorly perfused compartment lumps together muscles and skin; the third is fat tissue. Butadiene is transported to each of these compartments via arterial blood. At the organ exit, venous blood is assumed to be in equilibrium with the compartment tissues. Butadiene can also be metabolized into an epoxide by the liver, kidneys, and lungs, which are part of the well-perfused compartment. The kinetics of butadiene in each of the three compartments can therefore be described classically by the following set of differential equations:

$$\frac{dQ_{pp}}{dt} = F_{pp}\left(C_{art} - \frac{Q_{pp}}{P_{pp} V_{pp}}\right),$$
$$\frac{dQ_{fat}}{dt} = F_{fat}\left(C_{art} - \frac{Q_{fat}}{P_{fat} V_{fat}}\right),$$
$$\frac{dQ_{wp}}{dt} = F_{wp}\left(C_{art} - \frac{Q_{wp}}{P_{wp} V_{wp}}\right) - k_{met}\, Q_{wp}, \qquad (18)$$

where Qx is the quantity of butadiene in each compartment (x = pp for poorly perfused, fat for fat, or wp for well-perfused). Fx and Vx are the corresponding blood flow rate and volume, respectively. Cart is the butadiene arterial blood concentration. The partition coefficients Px are equilibrium constants between the butadiene concentration in compartment x and its concentration in venous blood. The first-order rate constant for metabolism is noted kmet.
The arterial blood concentration Cart is computed as follows, assuming instantaneous equilibrium between blood and air in the lung:

$$C_{art} = \frac{F_{pul}\,(1 - r_{ds})\,C_{inh} + F_{total}\,C_{ven}}{F_{pul}\,(1 - r_{ds})/P_a + F_{total}}, \qquad (19)$$

where Ftotal is the blood flow to the lung, Fpul the pulmonary ventilation rate, rds the fraction of dead space (volume unavailable for blood–air exchange) in the lung, and Pa the blood over air partition coefficient. In our experiments, dead space is artificially increased by the use of a face mask. Cven is the concentration of butadiene in venous blood and is simply obtained as the sum of the butadiene concentrations in venous blood at the organ exits, weighted by the corresponding blood flows:

$$C_{ven} = \frac{\sum_{x \in \{pp,\,fat,\,wp\}} F_x\, Q_x / (P_x V_x)}{F_{total}}, \qquad (20)$$

with

$$F_{total} = F_{pp} + F_{fat} + F_{wp}. \qquad (21)$$
Finally, the butadiene concentration in exhaled air, Cexh, can be obtained as:

$$C_{exh} = (1 - r_{ds})\,\frac{C_{art}}{P_a} + r_{ds}\, C_{inh}. \qquad (22)$$

Remember that Cexh, Fpul, and Pa have been measured on the exposed subjects. The model values for those form, together with the data values, the basis of the computation of the data likelihood (Eqs. 15 and 17). In Eqs. 15 and 17, Fpul was noted θ1 and Pa was noted θ2 for convenience.
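For concreteness, here is a minimal sketch of the three-compartment model of Eqs. 18–22, using SciPy's ODE solver. All names are mine, the parameter values are illustrative (roughly consistent with Table 1 and Eqs. 21 and 23), and units are handled loosely; this is not the chapter's GNU MCSim implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp

def c_art(Q, p, c_inh):
    # arterial concentration from compartment quantities (Eqs. 19 and 20)
    Q_pp, Q_fat, Q_wp = Q
    c_ven = (p["F_pp"] * Q_pp / (p["P_pp"] * p["V_pp"])
             + p["F_fat"] * Q_fat / (p["P_fat"] * p["V_fat"])
             + p["F_wp"] * Q_wp / (p["P_wp"] * p["V_wp"])) / p["F_total"]
    vent = p["F_pul"] * (1.0 - p["r_ds"])
    return (vent * c_inh + p["F_total"] * c_ven) / (vent / p["P_a"] + p["F_total"])

def rhs(t, Q, p, exposure):
    # mass balances of Eq. 18
    ca = c_art(Q, p, exposure(t))
    Q_pp, Q_fat, Q_wp = Q
    return [p["F_pp"] * (ca - Q_pp / (p["P_pp"] * p["V_pp"])),
            p["F_fat"] * (ca - Q_fat / (p["P_fat"] * p["V_fat"])),
            p["F_wp"] * (ca - Q_wp / (p["P_wp"] * p["V_wp"])) - p["k_met"] * Q_wp]

exposure = lambda t: 2.0 if t <= 20.0 else 0.0   # 20-min exposure to 2 ppm
p = dict(F_pul=7.0, r_ds=0.4, P_a=1.3, k_met=0.2,       # illustrative values
         F_total=3.7, F_pp=1.0, F_fat=0.3, F_wp=2.4,    # consistent with Eq. 21
         V_pp=30.0, V_fat=15.0, V_wp=10.0,
         P_pp=0.7, P_fat=22.0, P_wp=0.7)

sol = solve_ivp(rhs, (0.0, 60.0), [0.0, 0.0, 0.0], args=(p, exposure),
                t_eval=np.arange(0.0, 61.0, 5.0), max_step=0.1)
c_exh = [(1 - p["r_ds"]) * c_art(Q, p, exposure(t)) / p["P_a"]
         + p["r_ds"] * exposure(t) for t, Q in zip(sol.t, sol.y.T)]  # Eq. 22
print(np.round(c_exh, 3))
```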
Model parameter scaling. You may have noticed that the inter- and intraindividual variances Σ and Δ are just vectors, rather than full variance–covariance matrices, as is customary in population models (49, 50). We have not modeled covariances between parameter values for a given individual. The reason is that, by model construction, those covariances are modeled deterministically by scaling functions, which render the actual parameters (scaling coefficients) independent from each other. That approach is heavily used in purely predictive PBPK models (51). It is well known, for example, that total blood flow (cardiac output) is correlated with the alveolar ventilation rate, which depends in turn on the pulmonary ventilation rate and the fraction of dead space, defined above (52). We model this dependency as:

$$F_{total} = \frac{F_{pul}\,(1 - r_{ds})}{1.14}. \qquad (23)$$

The coefficient 1.14 corresponds to the value of the so-called ventilation over perfusion ratio at rest (e.g., while sitting, as our subjects were during the controlled exposure experiments).
In turn, the blood flow rates to the various tissues and organs depend on cardiac output; at least, their sum must equal cardiac output. Those relationships were modeled by the following algebraic equations:

$$F_{fat} = f_{F_{fat}} \cdot F_{total}, \qquad (24)$$
$$F_{pp} = f_{F_{pp}} \cdot F_{total}, \qquad (25)$$
$$F_{wp} = F_{total}\,(1 - f_{F_{fat}} - f_{F_{pp}}). \qquad (26)$$

The choice to condition Fwp upon the others is quite arbitrary and is dictated by the pragmatic consideration that the fractional flows fFfat and fFpp are rather small and, even if sampled independently, will not add up to more than one. For a more balanced alternative see ref. 43.
Tissue volume scaling is a bit more sophisticated and uses the subject's age (A, in years), sex (S, coded as 1 for males and 2 for females), height (Bh, in m), and mass (Bm, in kg) (53):

$$V_{fat} = B_m \left(0.012\,\frac{B_m}{B_h^2} - 0.108\,(2 - S) + 0.0023\,A - 0.054\right). \qquad (27)$$
Table 1
List of the PBPK model parameters used for 1,3-butadiene (see Fig. 6)

Parameter (or scaling coefficient)      Symbol   Unit     Source or prior distribution(a)
Body mass                               Bm       kg       Measured on individuals
Body height                             Bh       m        Measured on individuals
Age                                     A        year     Collected from individuals
Sex                                     S        –        Collected from individuals
Fraction of lean mass well perfused     fVwp     L/kg     LN(0.2, 1.2) [0.1, 0.35]
Pulmonary ventilation                   Fpul     L/min    LN(7, 1.2) [4.0, 12.0]
Fraction of dead space                  rds      –        LN(0.4, 1.2) [0.23, 0.45]
Fractional blood flows
  Poorly perfused                       fFpp     –        LN(0.15, 1.2) [0.06, 0.26]
  Fat                                   fFfat    –        LN(0.05, 1.2) [0.03, 0.09]
Partition coefficients
  Blood to air                          Pa       –        LN(1.3, 1.2) [0.75, 2.25]
  Well-perfused tissue to blood         Pwp      –        LN(0.7, 1.2) [0.4, 1.2]
  Poorly perfused tissue to blood       Ppp      –        LN(0.7, 1.2) [0.4, 1.2]
  Fat tissue to blood                   Pfat     –        Set to the value 22, based on (54)
Metabolic rate constant                 kmet     min−1    U(0.01, 0.6)

(a) LN(geometric mean, geometric SD) [truncation bounds]: lognormal distribution; U(truncation bounds): uniform distribution

The volume of the well-perfused compartment is scaled to lean body mass, through a fractional volume coefficient, and that of the poorly perfused compartment is computed to respect a constraint on total body volume (10% bones, etc., taken into account):

$$V_{wp} = f_{V_{wp}}\,(B_m - V_{fat}), \qquad (28)$$
$$V_{pp} = 0.9\,B_m - V_{wp} - V_{fat}. \qquad (29)$$

Given this re-parametrization, the actual model parameters are the scaling coefficients appearing in the above equations, or those parameters left unscaled, such as the partition coefficients or kmet (Table 1).
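A sketch of the scaling functions of Eqs. 23–29 follows, with Eq. 27 written as reconstructed above from the Deurenberg formula (53): sampled scaling coefficients and covariates go in, actual flows and volumes come out.

```python
def scale_parameters(F_pul, r_ds, fF_pp, fF_fat, fV_wp, B_m, B_h, A, S):
    # flows (Eqs. 23-26)
    F_total = F_pul * (1.0 - r_ds) / 1.14
    F_fat = fF_fat * F_total
    F_pp = fF_pp * F_total
    F_wp = F_total * (1.0 - fF_fat - fF_pp)
    # volumes (Eqs. 27-29); Eq. 27 constants as reconstructed in the text
    bmi = B_m / B_h ** 2
    V_fat = B_m * (0.012 * bmi - 0.108 * (2 - S) + 0.0023 * A - 0.054)
    V_wp = fV_wp * (B_m - V_fat)
    V_pp = 0.9 * B_m - V_wp - V_fat
    return dict(F_total=F_total, F_fat=F_fat, F_pp=F_pp, F_wp=F_wp,
                V_fat=V_fat, V_wp=V_wp, V_pp=V_pp)

# illustrative call: a 40-year-old male, 70 kg, 1.75 m
print(scale_parameters(7.0, 0.4, 0.15, 0.05, 0.2, 70.0, 1.75, 40, 1))
```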
3.4. Choosing the Priors

As can be seen in Table 1, nine parameters will be sampled during model calibration. The others are either measured with sufficient precision (Bm, Bh) or determined without ambiguity (S, A). The case of Pfat is special: knowledge of the model behavior (see Eq. 18) indicates that it determines the rate of butadiene exit from the fat, which in turn is rate limiting for the terminal half-life of butadiene in blood. However, with a follow-up of only 60 min (see Fig. 3), too short to observe that terminal elimination phase, we have no hope of getting information about that parameter from the data. We therefore set it to a published value (54).

The prior knowledge we have about the values of the other parameters is rather general and concerns in fact population averages. For most of the population mean parameters, μ, we use lognormal distributions, which constrain them to be positive, with a geometric SD of 1.2 (corresponding approximately to a CV of 20%), further truncated to stay within physiological bounds (38). Those parameters are rather well known and those priors are very informative. We make an exception for the major focus of our study: the metabolic rate constant kmet. We have a general idea of its value (0.24 min−1 ± 20%) (36), but we chose for its population mean a uniform prior spanning a factor of 60, in order to let the data speak about it.
For the population standard deviations, Σk, which measure interindividual variability, we know from previous analyses (36, 38) that they correspond to CVs of the order of 10–100%. Remember (Eq. 14) that we defined the population distribution to be lognormal (i.e., normal after logarithmic transformation). We will use MCMC sampling to sample from the posterior distribution, so we do not need to use a conjugate prior distribution (which in this case would be an inverse-gamma). For flexibility and simplicity, we use a half-normal prior (with an SD of 0.3) for the logarithm of each population variance Σk² (55). This is quite informative: its mean is 0.24, so, as expected, our prior population distributions will have a CV around 0.5 (square root of 0.24), with values concentrated between 0 and 1.

The above population priors will be used for both model A and model B. The latter requires in addition the specification of the intraindividual SDs Δk. We do not have much knowledge about them, but we do not expect them to be much larger than the interindividual variabilities, given the small time span separating the two observations. So, again, we will use the same half-normal prior (with SD 0.3) for the logarithm of each intraindividual variance Δk².
We are left with defining priors for the geometric SDs of the measurement errors (Eqs. 15 and 17): σ1, σ2, and σ3. We expect σ1 to be well identified because we have 110 data points altogether, and therefore as many differences between model and data, to estimate the analytical error on butadiene concentrations in exhaled air. So we can use a vague prior appropriate for SDs: a log-uniform distribution with bounds 1 and 1.3 (we do not expect the errors' CV to be higher than 30%, given our past experience with the analytical techniques and our model). Since we know the precision of the pulmonary ventilation and partition coefficient measurements (see Subheading 3.1), we simply set their geometric SDs σ2 and σ3 to 1.10 (≈ e^0.1) and 1.185 (≈ e^0.17), respectively.
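The sketch below shows one way of drawing from such truncated lognormal priors, parameterized as LN(geometric mean, geometric SD) with rejection at the truncation bounds, as in Table 1; this is plain Python, not the GNU MCSim prior specification syntax.

```python
import numpy as np

rng = np.random.default_rng(3)

def trunc_lognormal(gm, gsd, lo, hi, n):
    # LN(geometric mean, geometric SD), rejecting draws outside [lo, hi]
    out = np.empty(0)
    while out.size < n:
        draw = rng.lognormal(np.log(gm), np.log(gsd), size=n)
        out = np.concatenate([out, draw[(draw >= lo) & (draw <= hi)]])
    return out[:n]

F_pul = trunc_lognormal(7.0, 1.2, 4.0, 12.0, 10000)  # pulmonary ventilation prior
k_met = rng.uniform(0.01, 0.6, size=10000)           # vague prior for k_met
print(F_pul.mean(), F_pul.std() / F_pul.mean())      # CV close to 0.2, as in the text
```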

3.5. Computing the Posterior

Using GNU MCSim, it is enough to specify the prior distributions and dependencies for the various parameters and the data likelihoods. The hierarchical links are made automatically and the posteriors sampled numerically by a standard Metropolis–Hastings algorithm (31) (see also the user's manual online at http://www.gnu.org/software/mcsim). It is best, in our experience, to run at least three Markov chains, starting from different pseudo-random number seeds. They should all converge, in distribution, to the posterior, but there is no general rule about the rate of convergence for complex problems like ours. A good diagnostic for convergence is provided in (23) and that is the one we will use (a code sketch of it is given after the list below). A first step is to run a single chain for about 10–50 iterations to evaluate the time needed for an iteration. That will give you an idea of the number of chains and iterations you can run given your time constraints and available hardware. The more iterations, the better. It is usually recommended to run chains as long as possible, to keep the second half of the samples generated, and to check convergence on that set. You can in fact try to keep the last 3/4 of the iterations if they appear to have converged by the time the first quarter is done. There is no strict rule about that. A general strategy, if computations seem to require more than an hour, is to run batches of 3–10 chains for a day or so. Then check visually that they progress toward convergence (checking at least the trajectories of the population parameters, the data likelihood, and the posterior density); run the convergence diagnostic tool and focus on monitoring problem parameters, slow to converge. GNU MCSim allows you to restart a chain at the point where you stopped it (that is probably a must for any useful MCMC software). You can run batches until convergence is achieved. Storage space can be saved by only recording the samples generated during one iteration in every five, or ten, etc., and forgetting about the others. You still probably need a few thousand samples to form good approximations of the posterior distribution, confidence regions, etc. For nice problems, even as complex as our current example, convergence is usually achieved within a few thousand to a hundred thousand iterations. It depends on the weight of the data compared to the prior, the distance between prior and likelihood, the quality of the model calibrated, and the identifiability of its parameters:
25 Bayesian Inference 623

Few data will not move the joint posterior very far from the
priors (unless vague prior are used); The posterior will be rather
flat and easy to sample from. You may learn little from them, but
you will get the result quickly. Note that a MCMC sampling
strategy has been proposed to take advantage of this feature and
of the consistent updating properties of Bayesian approach: In
essence, data are gradually introduced in the problem to
smoothly reach the posterior distribution (56).
If the data weigh a lot and conflict with the priors, the posterior
might be multimodal and difficult to sample from. The sampler
can get stuck in local modes and convergence maybe never
reached. Also, a lot of data will usually tell a rich story and will
require a detailed model to explain them; Good detailed models
are harder to come by.
There seem to be many incompatible ways to fit a bad model to
data, and that translate into a multimodal posterior distribution,
with none of the modes giving a satisfying fit. The problem may
lie in the deterministic portion of the model or in its probabilis-
tic part (for example, your data comes from a mixture of dis-
tributions, while you assume unimodality). In my experience
bad models are very hard to work with and hardly converge.
Your model, even if good, may be over-parameterized. For
example, your model may include a sum, a product, or a ratio
of parameters, while the data only constrain the result of that
operation. Think about fitting data nicely aligned along a
straight line with a model like y abx + c. Chances are that an
infinity of couples (a, b) will fit as well and have equal posterior
density. This translates in very high correlations between para-
meters and a very hard posterior to sample from. In the simple
case evoked, that problem could be diagnosed in advance and
corrected from the start, but that is much harder to do with a
complex nonlinear model. In theory, if you have placed infor-
mative priors on the parameters (here, on a and b) you should be
safe. However, it is difficult to say how informative they have to
be compared to the data. You can try to understand the cause of
the problem by examining the correlations between parameters
and the autocorrelation within chains (high autocorrelations are
bad news). The solution is usually to simplify the model or to
reparameterize it.
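Here is a compact version of the potential scale reduction check of Gelman and Rubin (23), assuming `chains` holds one parameter's draws with shape (number of chains, iterations kept), already restricted to the second half of each run.

```python
import numpy as np

def gelman_rubin(chains):
    # potential scale reduction factor for one parameter
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)               # values near 1 indicate convergence

rng = np.random.default_rng(4)
chains = rng.normal(size=(5, 1500))           # five toy chains, already thinned
print(gelman_rubin(chains))                   # close to 1 here
```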
For our butadiene example, which runs quickly, 20,000 iterations per chain are enough to reach convergence for model A and 30,000 for model B. We run five MCMC chains of 50,000 iterations for A and 60,000 iterations for B, keeping 1 in 10 of the last 30,000 iterations. That leaves us with a total of 15,000 samples (vectors) from the joint posterior distribution for each model, with the data log-likelihood and posterior log-density for each of them.
Table 2
Summary (mode, mean ± SD [2.5th percentile, 97.5th percentile]) of the posterior distributions of the population geometric means, μ, and population (or interindividual) geometric SDs, Σ, for model A parameters

Parameter   μ                                      Σ
fVwp        0.24, 0.22 ± 0.032 [0.16, 0.29]        1.02, 1.5 ± 0.31 [1.02, 2.2]
Fpul        7.7, 7.4 ± 0.77 [5.9, 9.1]             1.26, 1.4 ± 0.26 [1.14, 2.1]
rds         0.39, 0.38 ± 0.039 [0.30, 0.45]        1.17, 1.5 ± 0.32 [1.08, 2.3]
fFpp        0.17, 0.17 ± 0.020 [0.13, 0.21]        1.26, 1.5 ± 0.24 [1.15, 2.1]
fFfat       0.049, 0.053 ± 0.0086 [0.038, 0.072]   1.14, 1.6 ± 0.32 [1.08, 2.3]
Pa          1.5, 1.5 ± 0.15 [1.2, 1.8]             1.25, 1.4 ± 0.26 [1.10, 2.1]
Pwp         0.78, 0.74 ± 0.11 [0.54, 0.99]         1.20, 1.5 ± 0.32 [1.08, 2.2]
Ppp         0.58, 0.63 ± 0.097 [0.47, 0.86]        1.75, 1.5 ± 0.32 [1.05, 2.2]
kmet        0.19, 0.20 ± 0.044 [0.14, 0.31]        1.30, 1.6 ± 0.26 [1.15, 2.2]

The Gelman and Rubin R diagnostic (23) is at most 1.04 for any parameter (at most a 4% reduction in the marginal interchain variance is expected to be achievable if we were to run the chains further). For practical purposes this is close enough to convergence and we can start to analyze the results of our model calibrations. Note that it would be unwise to examine the posterior if some parameters had not yet converged, because it is a joint posterior distribution: we can try to avoid very large correlations to speed up convergence, but all posterior parameter values are still correlated to a certain degree, and that is definitely true through the hierarchy.

Table 2 gives summary statistics of the posterior distributions for the population means, μ, and SDs, Σ, of model A. Table 3 summarizes the posteriors for μ, Σ, and the intraindividual SDs Δ for model B. With both models the population means are quite close to the (mostly informative) priors we used. A rather vague prior was used for kmet, the metabolic rate constant of interest, but its posterior mean is around 0.2 min−1, quite similar to the value of 0.24 min−1 ± 20% found previously with another population (36). Figure 7 shows the trajectory of the MCMC sampler for the population mean of kmet in model A. The five chains converged very quickly to its posterior distribution. Both models tell the same story about population averages, but what about variances? Interestingly, the interindividual variability estimates, Σ, are quite similar in the two models and around 50% (95% confidence interval from about 10% to a factor of 2). Figure 8 shows the sampler trajectory and a Gaussian kernel estimate of the posterior density of the logarithm of the kmet population variance in model A.
Table 3
Summary (mode, mean ± SD [2.5th percentile, 97.5th percentile]) of the posterior distributions of the population geometric means, μ, population (or interindividual) geometric SDs, Σ, and intraindividual SDs, Δ, for model B parameters

θ       μ                                    Σ                               Δ
fVwp    0.21, 0.23 ± 0.032 [0.17, 0.30]      1.2, 1.5 ± 0.30 [1.1, 2.2]      1.2, 1.5 ± 0.30 [1.1, 2.2]
Fpul    7.6, 7.4 ± 0.77 [6.0, 9.2]           1.3, 1.4 ± 0.25 [1.1, 2.0]      1.002, 1.1 ± 0.06 [1.01, 1.2]
rds     0.36, 0.38 ± 0.04 [0.29, 0.44]       1.3, 1.5 ± 0.34 [1.02, 2.3]     1.1, 1.5 ± 0.33 [1.05, 2.2]
fFpp    0.15, 0.19 ± 0.022 [0.13, 0.22]      1.3, 1.4 ± 0.27 [1.04, 2.1]     1.4, 1.6 ± 0.22 [1.3, 2.2]
fFfat   0.048, 0.054 ± 0.008 [0.04, 0.071]   1.3, 1.5 ± 0.33 [1.06, 2.2]     1.4, 1.6 ± 0.31 [1.09, 2.2]
Pa      1.3, 1.4 ± 0.15 [1.1, 1.8]           1.2, 1.3 ± 0.28 [1.03, 2.1]     1.1, 1.2 ± 0.16 [1.03, 1.6]
Pwp     0.80, 0.75 ± 0.11 [0.55, 0.99]       1.9, 1.5 ± 0.32 [1.07, 2.2]     1.3, 1.6 ± 0.31 [1.13, 2.3]
Ppp     0.61, 0.68 ± 0.11 [0.49, 0.93]       1.2, 1.5 ± 0.32 [1.1, 2.3]      1.9, 1.7 ± 0.31 [1.2, 2.3]
kmet    0.15, 0.21 ± 0.062 [0.14, 0.39]      1.2, 1.6 ± 0.29 [1.14, 2.2]     1.6, 1.7 ± 0.27 [1.2, 2.2]

In model B, the intraindividual variability estimates, Δ, similarly hover around 50%, and well above 1 (which would correspond to no intraindividual variability), except for Fpul, for which intraindividual variability seems very low. The posterior estimates of the residual error for exhaled air concentrations, σ1, are very different between the two models: for model A it is a whopping factor of 3.1 ± 0.013; with model B we go down to a factor of 1.16 ± 0.01, around 15%, quite congruent with usual analytical errors. This translates into a markedly better fit of the model to the data (Figs. 9 and 10).

At the individual level the model parameters are reasonably well estimated. Figure 11 shows box-plots of the posterior estimates of kmet for the population and for each subject, estimated using model B. Even in that small population sample, individual kmet values can differ by a factor of 2 (see subjects D and J), and intraindividual variability can be about as high for some subjects (see subject F). Our assumption that all subjects have the same intraindividual variability could be wrong, though, because several subjects are fairly stable (B, G, H, etc.). Note the shrinkage effect of the population model: the individual averages (the first box for each subject) do not fall between the two occasion boxes, as would be expected if they were simple averages. They are pulled toward the overall population mean. This feature stabilizes estimation: if only a few data were obtained on a subject, her parameter estimates would be pulled toward the population average, rather than wandering into impossible values for lack of identifiability.
Fig. 7. Trajectory of the MCMC sampler for the population mean of kmet (the 1,3-butadiene metabolic rate constant, in 1/min) in model A, with interindividual variability only. The five chains converged quickly to the posterior distribution. The last 30,000 iterations were kept to form the posterior sample used for further analyses. On that basis, a smoothed (Gaussian kernel) estimate of the posterior density for kmet is shown on the right.

The parameter values are sampled from their joint (multivariate) posterior distribution. They may therefore be correlated, and it is useful to look at their correlations. Strong correlations tend to slow down Gibbs-type MCMC samplers. If that is a problem, it may be possible to reparameterize the model to break those correlations. For example, if a and b are very correlated, you may want to first sample a, then the ratio r = b/a, and compute b as a × r. Poor convergence is not a problem in our case, but it is still useful to understand how the parameters influence each other in the calibration process. Figure 12 shows a gray-scale coded representation of the correlation matrix between the parameters for subject A. The strongest correlations (0.65) are between kmet and rds at the occasion level (the closest to the data). You can also observe that the parameters controlling inputs and outputs are the most correlated (they affect the data predictions the most).
Fig. 8. Trajectory of the MCMC sampler for the logarithm of the kmet population variance in model A, with interindividual variability only. On the right, a smoothed (Gaussian kernel) estimate of the last 30,000 samples, in the dashed rectangle, gives an estimate of the posterior density. It is truncated at zero because variances cannot be negative.

The subject-average parameters are much less correlated, and that is true also of the population parameters (not shown). You can also see (series of diagonals beside the main, trivial, one) that the occasion-level parameters influence the subjects' averages and that the occasions influence each other.

3.6. Checking the Models

We have in fact already started checking the models when assessing the fits to the data (Figs. 9 and 10). The fit of model B is clearly better and its residual variance, σ1, much lower. We have also seen that the posterior parameter estimates are reasonable. In the case of a parametric population model, it is also useful to check the distributional assumptions of the hierarchy. We have assumed lognormal distributions throughout. Was that reasonable? Figure 11 shows that, at least for kmet, individual values seem reasonably spread around their mean. Figure 13 may be clearer. It shows a simple Gaussian kernel density estimate of the posterior distribution of the subject-specific (θi) values for kmet, using model B.
Fig. 9. Observed versus predicted concentrations of 1,3-butadiene in exhaled air, all data together. For a perfect fit, all points would fall on the diagonal. The fit of model B (with inter- and intraindividual variability) is markedly better than that of model A (interindividual variability only). (Two panels, a and b; axes: observed concentration (ppm) versus predicted exhaled air concentration (ppm), log scales.)

Fig. 10. Model predictions (lines) and observed concentrations, on two occasions (open and closed circles), of 1,3-butadiene in exhaled air, for the same nine volunteers as in Fig. 3. Model A predictions are indicated by dashed lines; model B predictions for the two occasions are indicated by solid lines. (Panels a–i; axes: concentration in exhaled air (ppm) versus time (min).)
Fig. 11. Box-plot of the posterior samples of kmet (1/min) for the population (noted μ) and for each subject (A through K), estimated using model B, with inter- and intraindividual variability. For each subject the first box corresponds to the individual average, θi, and the other two to the occasion-specific values θi1 and θi2.

The kernel, obtained from the 11 × 15,000 samples of those θi, is moderately skewed to the right. The average estimates (over 15,000 samples) of the subject-specific kmet (one per subject) are shown as individual points and are grouped under the kernel (more spread out, but these are averages). Finally, we also have, as shown in Fig. 13, a random sample of ten lognormal densities obtained using posterior samples of (μ, Σ) pairs for kmet. If the model is correct, any of these should resemble the kernel estimate. That seems to be the case, even if the kernel represents a sort of average too. The lognormal assumption does not seem obviously wrong here. A truncated normal would probably also have passed the test, but with only 11 subjects it is difficult and vain to go beyond this simple check. The figure also illustrates a feature of multilevel inference: the population distribution estimates (thin lines) are much wider (in a sense, more robust) than the small sample of subjects we have. All the shapes we see here have reasonable support, given the data. Note, however, that the possibility of long tails to the right is somewhat encouraged by our lognormal model.

3.7. Inference on Model Structure

As we have seen above, model B, with both inter- and intraindividual variability, really seems to be a better model than model A. In our case, intraindividual variability could be as high as interindividual variability.
Fig. 12. Graphical representation of the correlation matrix between posterior parameter values for subject A, using model B (gray scale from −1.0 to 1.0). The first group of parameters are the subject-level averages, θ1, the second and third groups are the occasion-specific parameters θ11 and θ12 (parameters fVwp, rds, fFfat, fFpp, Fpul, Pa, Ppp, Pwp, and kmet in each group). Correlations are stronger at the occasion level.

Can we quantify the support the data bring to that hypothesis? Posterior parameter distributions were obtained for both models via numerical Bayesian calibration. The Bayes factor (Eq. 12) for model B against model A gives a measure of their relative likelihood (34). For simplicity, we estimate the Bayes factor as the ratio of the harmonic mean of the data likelihoods for model B over that for model A. The logarithm of the data likelihood is given for every sampled parameter vector in the output of GNU MCSim. In our case the Bayes factor is 3 × 10⁵⁰. A trimmed harmonic mean estimator gives a very similar number, so our estimate is stable. Such a high value of the Bayes factor is a decisive argument in favor of model B. We can be all but certain that intraindividual variability is present in those data and should be taken into account.
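The harmonic-mean estimate used above can be computed from each model's posterior sample of data log-likelihoods. In this sketch, `loglik_A` and `loglik_B` are toy stand-ins for the values read from the sampler's output, and a log-sum-exp is used for numerical stability.

```python
import numpy as np

def log_marginal_harmonic(loglik):
    # log of the harmonic mean of the likelihoods:
    # log p(y) ~ -(logsumexp(-loglik) - log N)
    neg = -loglik
    lse = neg.max() + np.log(np.exp(neg - neg.max()).sum())
    return -(lse - np.log(loglik.size))

rng = np.random.default_rng(5)
loglik_A = rng.normal(-250.0, 2.0, size=15000)   # toy values, not the real runs
loglik_B = rng.normal(-135.0, 2.0, size=15000)
log_BF = log_marginal_harmonic(loglik_B) - log_marginal_harmonic(loglik_A)
print("log10(BF), B versus A:", log_BF / np.log(10.0))
```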
Fig. 13. A posterior check of the distributional assumptions of model B: kernel density estimate of the posterior distribution of the subject-specific (θi) values for kmet (thick line); averages (over 15,000 samples) of the subject-specific kmet for the 11 subjects (circles); and 10 lognormal densities obtained using random posterior samples of (μ, Σ) pairs for kmet (thin lines).

3.8. Making Predictions

We are interested in assessing the capacity of different subjects to metabolize 1,3-butadiene. Figure 11 summarizes the posterior distribution of a key parameter involved, but that may not be the whole story. The health risks due to butadiene exposure may rather be linked to the quantity of butadiene metabolized, Qmet. That quantity is certainly a function of kmet, but it also depends on the quantity of butadiene inhaled, which in turn depends on Fpul, etc. It is easy, using the PBPK model and the posterior parameter samples we have, to compute Qmet for each subject, within 60 min, on each occasion, after their 20-min exposure to 2 ppm butadiene in air (Fig. 14). We can see that the subjects' ranking using Qmet is different from the one using kmet. When computing those predictions, it is important to use the entire sampled parameter vectors one by one, even if that is done for only a subset of the samples. That is because the sampled parameter values are correlated (see Fig. 12) and must stay so. Usually, MCMC samplers output one parameter vector at a time, one per line of the output file, and you just need to do one computation per line, using all the parameter values of that line.
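A sketch of that bookkeeping, with a toy two-column "posterior" standing in for the sampler's output file: each line is used whole, so the sampled correlation between the two parameters carries through to the predictions; the prediction function is a stand-in for a full PBPK run.

```python
import numpy as np

rng = np.random.default_rng(6)
# toy correlated "posterior" of (k_met, F_pul), one parameter vector per line
cov = [[0.002, 0.01], [0.01, 0.6]]
posterior = rng.multivariate_normal([0.2, 7.0], cov, size=15000)

def toy_prediction(theta):
    k_met, F_pul = theta
    return k_met * F_pul           # stand-in for a full PBPK run computing Q_met

q_met = np.array([toy_prediction(line) for line in posterior])
print(np.percentile(q_met, [2.5, 50.0, 97.5]))
```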
Purely predictive simulation for new random subjects requires a bit more care. In the case of model B, we have a posterior sample of 15,000 × 9 (μk, Σk) pairs, because we updated the distributions of k = 9 model parameters. We can use each of the nine pairs in a given sample to draw nine random parameter values.
Fig. 14. Box-plots of the predicted amount of 1,3-butadiene metabolized (mg) by a hypothetical subject, X, randomly drawn using model B posterior parameter distributions, and of the estimates for the 11 subjects of the calibration dataset, on the two occasions they were observed.

Such values define a random subject (random, but resembling the 11 subjects studied). It is preferable to sample one random subject from each of the 15,000 sets of (μk, Σk) pairs, rather than 15,000 subjects from just one set of pairs (even if that is the set having the maximum posterior probability); otherwise you ignore the uncertainty in the μ and Σ estimates. It is better to balance the uncertainty in μ and Σ and the population variability around them. How do we actually sample from a given (μk, Σk) pair? If the model is correct, it is proper to use the distribution specified by it. In our case, we assumed that the θik were log-normally distributed with parameters μk and Σk. We should therefore use that distribution to sample the parameter values for random subjects. Indeed, this assumes that our model is right, an assumption we checked above. We could also have taken a nonparametric approach to estimate the shape of the population distribution itself (see Subheading 2.5), but that seemed a bit fancy with just 11 subjects. In any case, it is important to use all the (μk, Σk) pairs on an output line to sample a random subject, because they are correlated and we should not break their correlation carelessly.
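A sketch of the procedure, with toy arrays `mu` and `sigma_pop` (hypothetical names) standing in for the 15,000 × 9 posterior draws of the population geometric means and SDs: one random subject is drawn per posterior line, using the lognormal population model of Eq. 14.

```python
import numpy as np

rng = np.random.default_rng(7)
n_draws, n_params = 15000, 9
# toy stand-ins for the posterior draws of population geometric means and SDs
mu = rng.lognormal(np.log(0.5), 0.1, size=(n_draws, n_params))
sigma_pop = rng.lognormal(np.log(1.5), 0.05, size=(n_draws, n_params))

# one subject per posterior line, never mixing lines (Eq. 14):
subjects = rng.lognormal(np.log(mu), np.log(sigma_pop))
print(subjects.shape)   # (15000, 9): 15,000 random subjects
```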
But in our case, sampling random subjects is probably not enough. We have seen that intraindividual variability is important, and it may not make sense to ignore it. With the above procedure, we have generated average random subjects. If we want to correctly simulate observations of real subjects, we should add intraindividual variability. That is not particularly difficult: we just need to randomly sample θijk values from lognormals with parameters θik and Δk. That is actually what I have done to simulate the random subject X in Fig. 14. I have simulated 15,000 random subjects observed on random occasions. That better reflects the uncertainty and the levels of variability we expect in a real population. Note that I had to define additional distributions for the covariates body mass, height, and age, because fixing them would have been restrictive. I just used a multivariate normal distribution, derived from the covariate values of the 11 subjects, to sample 15,000 random values for them. I also set sex to 1, but I checked beforehand that sex is not correlated with the other covariates and posterior parameter estimates (at least in our 11 subjects).
One thing is still missing in the above simulations: posterior parameter values for a subject on a given occasion are correlated (Fig. 12), and we have not modeled that. We defined a single SD, Δk, for each parameter rather than a full covariance matrix. Using such a matrix we could have captured these correlations and reproduced them when simulating random subjects on random occasions. A first line of defense is that those correlations were not too large (the highest was at 0.65) and that we have scaled the model parameters (to body mass, etc.) precisely to take care of the strongest correlations via, in fact, a deterministic correlation model. The second, and probably desperate, line of defense is that GNU MCSim does not allow you to use a covariance matrix. Using the diagonal variances implies that we neglected the covariance in our sampling and produced over-dispersed random variates. That is not so bad if we want to be conservative about uncertainty, but it could be unrealistic. The results for the Qmet predictions should be checked for sanity: their CV, as shown in Fig. 14, falls between 20 and 40%, and reaches 50% for subject X. We do not expect much greater precision for a non-measured quantity, given the data we had.
Sampling as described above mixes variability and uncertainty about the population parameter estimates. That is fine to create random subjects. If we want to have an idea of the interindividual variability for a given parameter, we just need to look at the posterior distribution of the corresponding Σ. Its average, for example, is a reasonable estimate of the variance of that parameter in a population. But what if we want to estimate only the interindividual variability of a model prediction, e.g., Qmet, in the population? A solution is to compute stratified estimates of variance: for each sampled vector (μ, Σ), generate M (e.g., 1,000) subjects, run the PBPK model, and compute the variance of the Qmet predictions obtained. This variance reflects pure interindividual variability. Do the same for each of the N (μ, Σ) vectors we have sampled. The N variances obtained reflect interindividual variability only. Note, however, that we cannot do away with the uncertainty about the interindividual variance Σ, except by averaging over the N variances. In our case, the above computations indicate that about 55% of the total variance of the Qmet predictions can be attributed to interindividual variability and the rest to intraindividual variability.
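A sketch of that stratified computation, with a stand-in prediction function in place of the full PBPK run and toy arrays for the posterior draws:

```python
import numpy as np

rng = np.random.default_rng(8)
N, M = 200, 1000                     # posterior draws used, subjects per draw
mu = rng.normal(0.2, 0.03, size=N)           # toy posterior of one population mean
sigma_pop = rng.normal(0.5, 0.05, size=N)    # toy posterior of the log-scale SD

def predict(k_met):
    return 0.9 * k_met               # stand-in for the PBPK-based Q_met

inter_var = np.empty(N)
for i in range(N):
    subjects = rng.lognormal(np.log(mu[i]), sigma_pop[i], size=M)
    inter_var[i] = predict(subjects).var(ddof=1)   # pure interindividual variance
print("average interindividual variance:", inter_var.mean())
```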

3.9. Conclusion of the Example

The above example does not exhaust the topic of Bayesian inference and is not even complete by itself. The distributional assumptions made for the variances Σ and Δ, in particular, should be checked, as well as the sensitivity of the results to the bounds imposed on the prior distributions. The conclusion that intraindividual variability is important was probably obvious from the start (Fig. 3). This may be quite general, but the data are seldom there to check it, and most population pharmacokinetic analyses do not investigate intraindividual variability. Note, however, that when it is omitted, the interindividual variability estimates are still quite reasonable (Tables 2 and 3), at least in this case. There are plenty of aspects of Bayesian inference that we have just mentioned in passing (nonparametrics, for example) or not at all (e.g., optimal design, see refs. 36, 57, and the vast area of clinical trial design and analysis). But their principles remain the same. The interested reader can refer to the literature indicated at the end of the introduction of this chapter to go deeper and beyond what we have surveyed here.

References

1. Albert J (2007) Bayesian computation with R. Springer, New York
2. Berger JO (1985) Statistical decision theory and Bayesian analysis, 2nd edn. Springer, New York
3. Box GEP, Tiao GC (1973) Bayesian inference in statistical analysis. Wiley, New York
4. O'Hagan A (1994) Kendall's advanced theory of statistics, volume 2B: Bayesian inference. Edward Arnold, London
5. Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman & Hall, London
6. Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York
7. Press SJ (1989) Bayesian statistics: principles, models, and applications. Wiley, New York
8. Whittaker J (1990) Graphical models in applied multivariate statistics. Wiley, Chichester
9. Shafer G, Pearl J (1990) Readings in uncertain reasoning. Morgan Kaufmann, San Mateo, CA
10. Gelman A (2006) Multilevel (hierarchical) modeling: what it can and cannot do. Technometrics 48:432–435
11. Chiu WA, Bois F (2007) An approximate method for population toxicokinetic analysis with aggregated data. J Agr Biol Environ Stat 12:346–363
12. Pillai G, Mentre F, Steimer JL (2005) Non-linear mixed effects modeling – from methodology and software development to driving implementation in drug development science. J Pharmacokinet Pharmacodyn 32:161–183
13. Dunson DB (2009) Bayesian nonparametric hierarchical modeling. Biom J 51:273–284
14. Ghosh JK, Ramamoorthi RV (2003) Bayesian nonparametrics. Springer, New York
15. Bigelow JL, Dunson DB (2007) Bayesian adaptive regression splines for hierarchical data. Biometrics 63:724–732
16. Oppenheim J, Wehner S (2010) The uncertainty principle determines the nonlocality of quantum mechanics. Science 330:1072–1074
17. Garthwaite PH, Kadane JB, O'Hagan A (2005) Statistical methods for eliciting probability distributions. J Am Stat Assoc 100:680–700
18. Jaynes ET (2003) Probability theory: the logic of science. Cambridge University Press, Cambridge
19. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov Chain Monte Carlo in practice. Chapman & Hall, London
20. Liu JS (2001) Monte Carlo strategies in scientific computing. Springer, New York
21. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculation by fast computing machines. J Chem Phys 21:1087–1092
22. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
23. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences (with discussion). Stat Sci 7:457–511
24. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
25. Doucet A, de Freitas N, Gordon N (2001) Sequential Monte Carlo methods in practice. Springer, New York
26. Andrieu C, Doucet A, Holenstein R (2010) Particle Markov chain Monte Carlo methods. J R Stat Soc B 72:269–342
27. Smith A, Gelfand A (1992) Bayesian statistics without tears: a sampling–resampling perspective. Am Stat 46:84–88
28. Rubin DB (1988) Using the SIR algorithm to simulate posterior distributions. In: Bernardo JM, De Groot MH, Lindley DV, Smith AFM (eds) Bayesian statistics 3. Oxford University Press, Oxford, pp 395–402
29. R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
30. Bois FY, Maszle D (1997) MCSim: a simulation program. J Stat Software 2(9). http://www.jstatsoft.org/v02/i09
31. Bois FY (2009) GNU MCSim: Bayesian statistical inference for SBML-coded systems biology models. Bioinformatics 25:1453–1454
32. Hammitt JK, Shlyakhter AI (1999) The expected value of information and the probability of surprise. Risk Anal 19:135–152
33. Yokota F, Gray G, Hammitt JK, Thompson KM (2004) Tiered chemical testing: a value of information approach. Risk Anal 24:1625–1639
34. Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795
35. Green PJ (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732
36. Bois FY, Smith T, Gelman A, Chang H-Y, Smith A (1999) Optimal design for a study of butadiene toxicokinetics in humans. Toxicol Sci 49:213–224
37. Brochot C, Smith TJ, Bois FY (2007) Development of a physiologically based toxicokinetic model for butadiene and four major metabolites in humans: global sensitivity analysis for experimental design issues. Chem Biol Interact 167:168–183
38. Mezzetti M, Ibrahim JG, Bois FY, Ryan LM, Ngo L, Smith TJ (2003) A Bayesian compartmental model for the evaluation of 1,3-butadiene metabolism. J R Stat Soc C 52:291–305
39. Micallef S, Smith TJ, Bois FY (2002) Modelling of intra-individual and inter-individual variability in 1,3-butadiene metabolism. In: PAGE 11 – annual meeting of the Population Approach Group in Europe. Population Approach Group in Europe, Paris. ISSN 1871-6032
40. Ngo L, Ryan LM, Mezzetti M, Bois FY, Smith TJ (2011) Estimating metabolic rate for butadiene at steady state using a Bayesian physiologically-based pharmacokinetic model. J Environ Ecol Stat 18:131–146
41. Smith T, Bois FY, Lin Y-S, Brochot C, Micallef S, Kim D, Kelsey KT (2008) Quantifying heterogeneity in exposure-risk relationships using exhaled breath biomarkers for 1,3-butadiene exposures. J Breath Res 2:037018
42. Smith T, Lin Y-S, Mezzetti L, Bois FY, Kelsey K, Ibrahim J (2001) Genetic and dietary factors affecting human metabolism of 1,3-butadiene. Chem Biol Interact 135–136:407–428
43. Gelman A, Bois FY, Jiang J (1996) Physiological pharmacokinetic analysis using population modeling and informative prior distributions. J Am Stat Assoc 91:1400–1412
44. Bischoff KB, Dedrick RL, Zaharko DS, Longstreth JA (1971) Methotrexate pharmacokinetics. J Pharm Sci 60:1128–1133
45. Bois FY, Zeise L, Tozer TN (1990) Precision and sensitivity analysis of pharmacokinetic models for cancer risk assessment: tetrachloroethylene in mice, rats and humans. Toxicol Appl Pharmacol 102:300–315
46. Droz PO, Guillemin MP (1983) Human styrene exposure – V. Development of a model for biological monitoring. Int Arch Occup Environ Health 53:19–36
47. Gerlowski LE, Jain RK (1983) Physiologically based pharmacokinetic modeling: principles and applications. J Pharm Sci 72:1103–1127
48. Reddy M, Yang RS, Andersen ME, Clewell HJ III (2005) Physiologically based pharmacokinetic modeling: science and applications. Wiley, Hoboken, New Jersey
49. Racine-Poon A, Wakefield J (1998) Statistical methods for population pharmacokinetic modelling. Stat Methods Med Res 7:63–84
50. Lunn DJ, Best N, Thomas A, Wakefield J, Spiegelhalter D (2002) Bayesian analysis of population PK/PD models: general concepts and software. J Pharmacokinet Biopharm 29:271–307
51. Bois F, Jamei M, Clewell HJ (2010) PBPK modelling of inter-individual variability in the pharmacokinetics of environmental chemicals. Toxicology 278:256–267
52. Fiserova-Bergerova V (1983) Physiological models for pulmonary administration and elimination of inert vapors and gases. In: Fiserova-Bergerova F (ed) Modeling of inhalation exposure to vapors: uptake, distribution, and elimination. CRC Press, Boca Raton, FL, pp 73–100
53. Deurenberg P, Weststrate JA, Seidell JC (1991) Body mass index as a measure of body fatness: age- and sex-specific prediction formulas. Br J Nutr 65:105–114
54. Filser JG, Johanson G, Kessler W, Kreuzer PE, Stei P, Baur C, Csanady GA (1993) A pharmacokinetic model to describe toxicokinetic interactions between 1,3-butadiene and styrene in rats: predictions for human exposure. IARC Scientific Publication No. 127. In: Sorsa M, Peltonen K, Vainio H, Hemminki K (eds) Butadiene and styrene: assessment of health hazards. International Agency for Research on Cancer, Lyon, France
55. Gelman A (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal 1:515–534
56. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J Am Stat Assoc 82:528–550
57. Amzal B, Bois FY, Parent E, Robert CP (2006) Bayesian optimal design via interacting MCMC. J Am Stat Assoc 101:773–785
INDEX

A

Absorption
  brain,
  drug, 7
  oral,
  passive,
  rate,
ACD/Labs, 130
Acetaminophen, 175, 177
Acetylcholine,
Acetylcholinesterase, 106
Acetyltransferases, 134, 166
AcslX,
Adipose, 284
ADME
  evaluation, 300
  parameters,
  pharmacokinetic, 259
  profiling,
  suite,
ADMET
  evaluation,
  parameters,
  pharmacokinetic, 56
  prediction, 316
  profiling,
  suite,
ADMEWORKS, 316
Agent
  based, 390–393
  oriented, 414, 417, 418
AhR. See Aryl hydrocarbon receptor (AhR)
Albumin, 259
Alcohol, 143, 555, 556, 559, 560, 566, 567, 571, 572
Alignment
  methods, 9
  molecules, 9
  network, 245, 265
Allometric, 280, 299
Alprazolam,
Amber, 268
Aminoacetophenone, 126, 143–159
AMPAC, 323, 325, 326
Androgen receptor, 279
Anova, 168, 170, 173, 174
Antiandrogens, 288
Antibacterial, 104
Antibiotics, 260
Antibody, 276, 277, 279
Anticancer, 56, 104
Antifungal, 105
Antihistaminic, 104
Antimalarial, 10, 105, 261
Antimicrobial, 106
Antioxidant, 218
Antipsychotic, 104
Antitubercular, 104
Apoptosis, 170, 175, 407
Aquatic toxicity, 75, 104, 112
Artificial
  intelligence, 37, 107, 308
  neural networks, 14, 16, 17, 315, 316, 402, 515, 516, 518, 521
Aryl hydrocarbon receptor (AhR), 217, 268, 278, 284
Aspirin,
Assessment
  exposure, 126, 270, 275
  risk, 46, 62, 69, 85, 87, 89, 95, 100, 126, 142, 143, 157, 159, 298, 299, 305, 328, 342, 349
Autocorrelation, 114, 623
AutoDock, 258, 259
Automata, 215, 383, 402, 403, 407

B

Bayesian
  classification, 122
  inference, 597–634
  model, 318
  statistics, 113, 597, 604
Benchmark dose, 299
Benzene, 152, 167–170, 173–175
Benzo[a]pyrene, 217, 222
Berkeley Madonna,
Biliary excretion,
Binary
  classification, 61, 101, 316
  nearest neighbor, 518
  qsar, 318
Binding
  affinity, 40, 130
  domain,
  energy,
  ligand, 179
  model,
  plasma,
  site, 24, 272
  tissue, 275
Bioaccumulation, 277, 288, 310, 314
Bioavailability, 7, 300
BioCarta, 362, 368, 370
Bioconductor, 168, 169, 236, 360–361
BioEpisteme, 316–317
Biomarker, 235, 251–290, 357, 359
BioSolveIT,
Bond
  acceptor, 9
  breaking,
  contribution, 9
  contribution method,
  donor, 9
Bone, 254, 278, 614, 620
Boolean, 5–7, 11, 33, 181, 182–184, 192–197, 200, 202–204, 208, 209, 211, 385, 402
Bossa
  predictivity, 7
  rulebase, 130, 138
  structure, 7
Brain
  penetration,
  permeability, 110

C

Caco
  cells, 105
  permeability, 105
Cadmium, 277, 278
Caffeine,
Calcium, 326
Cancer
  risk, 270, 349
  risk assessment, 233, 635
Capillary,
Carbamazepine,
Carcinogenic
  activity, 69, 72, 106
  effects, 283, 299
  potency, 59, 74, 88, 112, 145, 281
  potential, 7, 72–74, 131, 143, 342
Cardiovascular, 254, 256, 341
Cell
  cycle, 407
  growth, 387, 541
  membrane, 500
  receptor, 167, 174, 175
  signaling, 384
CellML, 386, 405
CellNetOptimizer, 179–211
Cellular
  automata, 383, 402, 403, 407
  compartments, 403, 405, 420
  systems, 399–423
Cephalosporins, 119, 124
CHARMM,
ChEBI
  identifier, 47
  ontology, 49
  web, 52
Chemaxon, 47, 50
Chembl, 56
Cheminformatics, 50, 58
ChemOffice,
Chemometrics, 500, 549
ChemProp, 100, 306, 319, 502
ChemSpider, 50
Classification
  models, 99–119, 501, 519, 521
  molecular, 119
  qsar, 61, 519
  tree, 103, 347, 516
Clearance
  drug, 396
  metabolite,
  model,
  process,
  rate,
ClogP,
Clonal growth,
Cluster
  analysis, 103, 107–109, 504, 516
  pbs,
Cmax,
CODESSA, 319, 322, 325, 326, 502
Combinatorial chemistry, 47, 343
Comparative Molecular Field Analysis (CoMFA), 8, 9, 306, 321, 322, 502
Comparative molecular similarity indices analysis (CoMSIA), 9, 322
Compartmental
  absorption, 382
  analysis, 382
  model, 382
  systems, 382
COmplex PAthway SImulator (COPASI), 386, 407
Comprehensive R Archive Network (CRAN), 360
CoMSIA. See Comparative molecular similarity indices analysis (CoMSIA)
Conformational
  dynamics, 403
  energetic, 9
  search, 9
  space, 37, 38, 40
Connectivity
  index, 114, 115
  topochemical, 114
Consensus
  predictions, 313
  score,
COPASI. See COmplex PAthway SImulator (COPASI)
COSMOS, 325
Covariance, 512, 531–533, 545, 550, 552–554, 561, 568–570, 576, 604, 633
  matrix, 110, 458, 528, 530, 532–534, 587, 633
CRAN. See Comprehensive R Archive Network (CRAN)
Crystal structure, 317
Cytochrome
  metabolism, 74, 77, 130, 141
  substrate,
Cytokine, 167, 206, 209, 385
Cytoscape, 168, 172, 186, 198, 211, 236, 240, 245, 246
Cytotoxicity, 56, 61, 181, 310

D

Database
  chemical, 29–51
  KEGG, 171, 174, 175, 236
  network, 92
  search, 29, 33, 34, 47
  software, 36, 45
DDT (dichlorodiphenyltrichloroethane), 287
Decision
  forest, 103
  processes,
  support, 342, 351
  tree, 6, 74–79, 103, 104, 107, 108, 130, 318, 320, 343, 344, 348, 518
Derek, 68, 127, 138, 307, 309, 311, 345, 346, 348, 524
Dermal, 105, 275, 285, 297
Descriptors
  chemical, 37, 58–61, 324, 391
  models, 511, 522
  molecular, 3, 4, 6–12, 14, 16, 18–20, 38, 100, 103, 110, 308, 315–317, 319–323, 326, 327, 344, 502–506, 511, 515, 517, 522
  physicochemical, 314
  prediction, 7
  properties, 314
  QSAR, 8
  QSPR, 8
Developmental
  effects, 303, 315, 323
  toxicants, 308, 310, 312, 315
  toxicity, 55, 87, 90, 127, 129, 305–337
Diazepam,
Dibenzofurans, 306
Dietary
  exposure assessment, 128
  intake, 280
Differential equations, 180, 381, 383, 388, 403, 405–407, 475–497, 618
Dioxin, 217, 268, 281, 283–287, 387
Discriminant analysis, 14, 17, 19, 103, 105, 110, 129, 314, 315, 516
DNA
  adducts, 134
  binding, 82, 83, 132, 138, 140, 142, 152
Docking
  methods, 258, 259
  molecular,
  scoring,
  simulations,
  tools,
Dose
  administered, 300, 593
  extrapolation, 299
  metric, 285
  pharmacokinetic, 257
  reference, 280, 298
  response, 55, 126, 288, 290, 297–301, 382, 385, 388, 592–595
Dosimetry, 174, 300, 381, 389, 390
3D QSAR, 8, 306, 322, 502
Dragon, 322, 324–326, 502
Dragon descriptors, 326
Drug
  binding, 258
  databases, 100
  development, 99, 119, 253, 254, 260, 341, 342, 349, 350, 357, 370, 371, 592
  distribution, 593
  drug interactions, 257, 268, 269, 317
  impurities, 342, 343, 346
  induced toxicity, 357, 368–372
  metabolism, 74, 77
  plasma,
  receptor, 104, 105
  resistance,
  safety, 101, 256, 341–351
  solubility,
  targets, 235, 245
DSSTox, 55, 88, 89, 92, 93, 95, 145, 327, 328, 345
Dynamic systems, 389

E

Ecotoxicity, 41, 83, 90, 110, 310, 500
Ecotoxicology, 4
Effectors, 254, 269, 276, 298
Electrotopological, 321
Elimination
  chemical, 16
  drug,
  model, 16
  process, 16
  rate, 281
Emax model,
Endocrine
  disruptors, 270, 282, 287
  system, 19
Ensemble methods, 103, 112–113
Enterocytes,
Environmental
  agents, 58, 345, 350
  contaminants, 276, 305
  fate, 8, 32, 81, 125, 133
  indicators, 275–290
  pollutants, 100, 275
  protection, 72, 100, 279, 313, 391, 394
  toxicity, 32
Environmental public health indicators, 275–290
Enzyme
  complex,
  cytochrome, 285
  metabolites, 422
  networks, 237–241, 245–247
  receptors, 269
  substrates, 257
  transporters, 261
EPISuite, 322, 524
Epoxide, 70, 270, 618
Estrogenic, 288
Ethanol, 155
Excel, 92, 168, 328, 412
Expert systems, 69, 70–84, 128, 129, 136, 139, 143, 152, 306–308, 311, 337, 343, 350
Exposure
  assessment, 126, 270, 275
  dose, 126
  hazard, 126
  level, 231, 270, 276, 277, 289
  model, 173
  population, 276
  response, 126
  route, 287
  scenario, 300

F

Factor analysis, 8, 105, 109–110, 308, 314, 320, 545, 576
FastICA, 545
Fat
  compartment, 382, 617, 618, 620
  tissue, 300, 381, 382, 618, 620
Fate and transport,
Fingerprint, 35, 57, 318, 344
Food
  additives, 285
  consumption data, 276, 286
  intake, 281, 285, 286
  safety, 70, 126, 135, 307, 311, 349, 350
Force field,
Forest, 103, 306, 309, 313
  decision, 103, 313
  decision tree, 103
  method, 130, 313
Formaldehyde,
Fortran, 315
Fractal, 386, 387
Fragment based, 38
Functional
  analysis, 72, 74, 264, 318
  groups, 6, 7, 82–84, 153
  theory, 153, 245
  units, 379, 390, 391
Fuzzy
  adaptive, 306, 309, 316
  logic, 181–184, 205

G

Gastrointestinal, 254, 268, 386
GastroPlus,
Genechip, 359, 362, 363, 365, 368, 370
Gene, genetic
  algorithms, 12, 13, 103, 306, 320, 328, 511
  expression networks, 165–177
  function, 318
  networks, 168, 177, 383
  neural networks, 103, 318
  omnibus, 601
  ontology, 170, 171, 262, 362
  profiling, 362, 377, 385, 389
  regulatory networks, 217, 221, 224, 377, 384, 402
  regulatory systems, 406
Genotoxicity, 72, 82, 106, 125–159, 270, 342, 343, 345
Glomerular, 56
Glucuronidation, 143
Glutathione, 217, 368, 370
Graph
  model, 322, 413, 599, 600
  theory, 502
GraphViz, 198, 210, 361, 367, 368

H

Hazard
  assessment, 71, 81, 126, 130–131
  characterization, 75, 126, 277
  identification, 72, 95, 126
HazardExpert, 128, 133, 146, 147, 155, 156, 158, 307, 309, 312, 313
Hepatic
  lobule, 379, 389, 391
  metabolism, 302, 391
Hepatitis, 260, 279
Hepatotoxic, 175, 260, 262, 268, 370
HERG
  blockers, 104, 105
  channel,
Hierarchical
  clustering, 108, 314, 322, 358, 368, 370
  models, 599–601, 605, 606
HIV, 105, 115, 254, 258, 277
Homeostasis, 370, 376, 382, 393
Homology, 262
  models,
Hormone
  binding, 285
  receptor, 285
HPXR
  activation,
  agonists,
  antagonists,
HQSAR, 306, 321, 322

I

Immune
  cells, 173–174
  response, 167, 168, 173–174
Immunotoxicity, 55, 128, 313
InChI, 38, 78, 85
Information
  systems, 32, 37, 277
  theory, 321
Ingestion, 135, 275, 285, 286
Inhalation, 135, 275, 285, 287, 297–299, 614, 617
Interaction
  energy, 9, 404
  fields, 9
  model, 68
  network, 168, 172, 217, 377, 385, 389
  rules, 401–404
Interindividual variability, 276, 599, 601, 605, 613–616, 621, 624–627, 629, 633, 634
Interspecies
  differences, 298
  extrapolation, 298
Intestinal
  absorption,
  permeability,
  tract, 386
IRIS (Integrated Risk Information System), 89, 92

J

Java, 171, 236, 309, 313, 361, 366, 410–413
JSim,

K

KEGG
  ligand, 237, 239, 247, 421
  pathway, 171, 174, 175, 237, 238, 241, 244, 247, 263
Ketoconazole,
Kidney
  cortex, 268
  injury, 259, 260, 267
K-nearest neighbor, 103, 108, 128, 316, 321, 505, 516, 517, 529
KNIME, 360–362, 364
Kow (octanol-water partition coefficient), 7, 8, 75, 83

L

Langevin,
Leadscope, 47, 50, 313, 344, 345, 347, 524
Least squares, 9, 172, 314, 316, 319, 322, 429, 439, 466–468, 470, 505, 506, 512, 529, 549–577, 581, 592
Ligand
  binding, 179, 217, 306
  complex, 217, 421
  interactions, 8, 205, 404, 421
  library,
  receptor, 8, 41, 179, 185, 217, 306
  screening, 259
Likelihood
  functions, 582–585, 589, 603, 605, 606, 619
  method, 113, 127, 458, 581, 582, 589
  ratio, 588–589, 593, 607, 608, 619, 623, 630
Linear algebra, 429–473
Linear discriminant analysis, 14, 17, 19, 103, 105, 110
Lipinski,
Lipophilic, 278, 306, 320
Liver
  enzyme, 55, 391, 392
  injury, 259, 260, 357, 386, 390
  microsomes, 4
  regeneration, 379, 387
  tissue, 381, 382, 386, 390, 392, 409, 618
  toxicity, 260
LOAEL (lowest observed adverse effect level), 61, 62, 129, 298, 299, 308, 312, 315, 327, 329, 332
Logistic
  growth, 402
  regression, 315, 328, 329, 344, 593
Logit, 585, 590, 591, 593, 594
  function, 590, 591, 593
Lognormal, 601, 602, 615–617, 620, 621, 627, 629, 631, 632, 633
LogP, 8, 10–13, 18, 47–49, 58, 128, 143, 326, 500, 502
Lungs, 618

M

Madonna (Berkeley-Madonna),
Malarial, 10, 105, 261
Mammillary, 269
Markov chain Monte Carlo, 597, 607
Markup language, 36, 386, 405
Maternal, 279, 280, 285
Mathematica, 3, 6, 8, 30, 32–34, 68, 70, 107, 112, 180, 184, 215–217, 219–221, 224, 300, 301, 306, 308, 309, 314–316, 323, 335, 377, 383, 386, 387, 399, 446, 483, 502, 505, 512, 528, 533–536, 582, 586, 587, 598, 605
Matlab, 182, 187, 188, 196, 197, 210, 323, 407, 494–497, 575, 576
Maximum likelihood estimation (MLE), 550, 581, 585–594
MCSim, 609, 622, 630, 633
Mercury, 278, 280
Meta-analysis, 262, 601
MetabolExpert, 128, 312, 313
Metabolic Network Reconstruction, 244–247
Metabolism
  (bio)activation, 72, 300
  drug, 74, 77
  liver, 140, 259, 382, 386, 387, 390–392, 618
  prediction, 7, 41, 126, 130, 133–134, 140–143, 153–157, 310, 316
  rate, 170, 276, 391, 618
Metabolomics/metabonomics,
Metacore, 264, 358
Metadrug, 269
Metal, 72, 317
Metapc, 310
Metasite, 408
Meteor, 127, 128, 133, 134, 142, 143, 154, 155
Methanol,
Methemoglobin, 311
Methotrexate,
Metyrapone,
MexAlert,
Michaelis-Menten equation, 300, 301, 384, 494–496
Microglobulinuria, 259
Micronuclei alerts, 276
Microsomes, 4
Milk, 143, 276–279, 282–289
Minitab, 117, 323
Missing data, 129
MLE. See Maximum likelihood estimation (MLE)
Model
  checking, 92, 229–233, 622, 627–629
  development, 57, 62, 110, 302, 317, 319, 324, 337, 501, 513–516
  error, 117, 192, 196, 519
  evaluation, 4, 19, 40, 54, 59, 78, 79, 100, 132, 159, 210, 220, 307, 309, 318, 320, 330–331, 342, 343, 350, 412, 499, 501, 576, 610
  fitting, 17, 200, 206, 302, 318, 506, 508, 512, 513, 522, 605, 623
COMPUTATIONAL TOXICOLOGY
Index 643
identification............................................76, 100, 113, property ................................................ 4, 7, 8, 20, 75,
132, 172, 173, 191, 216, 225, 229, 76, 316, 320322, 500, 502503, 522
259, 343, 418 shape ..................................................... 326, 327, 404,
prediction....................................................... 100, 349, 405, 420, 421, 500, 502503
628, 633 similarity ............................................... 9, 38, 83, 180,
refinement......................................190, 194, 318, 415 259, 269, 308, 309, 318, 319, 344, 515
selection .....................................................4, 9, 12, 17, targets ....................19, 253, 254, 259, 268, 269, 362
57, 81, 85, 103, 132, 193, 194, 255, 256, 315, Molfile................................ 131, 132, 309, 311, 314, 315
319321, 329, 331, 504, 511, 512, 514516, Molpro........................................................................... 323
582, 608, 612 Monte Carlo simulation ............................................... 597
structure..................................................612, 629631 Mopac ..................................................316, 325, 326, 502
uncertainty.................................................19, 71, 112, Morphogenesis ...........................313, 381, 383, 384, 407
128, 299, 302, 336, 337, 588, 598, 599, 601, Multi-Agent Systems ..........................215, 400401, 415
602, 603, 611, 632, 633 Multidimensional drug discovery,
validation ........................................... 57, 86, 418, 499 Multidrug resistance,
Modeling Multiscale.............................................386389, 407, 409
homology................................................................. 262 Multivariate
molecular ............................................... 318, 413, 502 analysis ..................................................................... 358
in vitro .................................................. 54, 57, 58, 61, regression................................................................. 507
68, 88, 93, 95, 100, 136, 138, 261, 337, Mutagenicity
377379, 381, 390393, 406 alerts............................................................................. 7
Models ames test,
animal..............................................32, 54, 58, 62, 68, prediction,
88, 99, 104, 256, 259, 271, 280, 288, 299, 300, Myelosuppression,
301, 378, 409, 600, 601 MySQL ................................................................. 169, 236
antitubercular .......................................................... 104
biological activity......................................... 6, 8, 9, 11, N
19, 41, 57, 68, 88, 100, 101, 103, 104,
NAMD,
114, 118, 119, 306, 308, 312, 314, Nanoparticles,
318, 321, 322, 500 Nasal/pulmonary,
bone ................................................................ 254, 255 Nearest neighbor.......................................... 59, 103, 108,
carcinogenicity...........................................5, 7, 55, 59,
128, 313, 316, 321, 404, 505, 516,
60, 68, 70, 71, 76, 78, 79, 86, 88, 517, 518, 529
100, 104, 127129, 259, 299, 309, Neoplastic ...................................................................... 106
312, 313, 328, 348, 349
Nephrotoxicity .............................................................. 266
developmental .........................................8, 21, 41, 50, Nervous system,
55, 57, 62, 68, 71, 95, 100, 112 Network
intestinal .................................................................. 379
gene....................................... 175176, 217, 268, 383
myths, KEGG ...................................................................... 237
predict binding ........................................................ 403 metabolic ....................................... 217, 235248, 402
reproductive..............................................55, 306, 328
neural .................................14, 16, 17, 103, 113, 308,
MoKa, 314316, 318, 320, 402, 515, 516, 518, 521
Molecular Neurotoxicity ........................................... 55, 87, 90, 128,
descriptor ........................................ 3, 4, 612, 14, 16, 258, 288, 311, 313
1820, 38, 100, 103, 110, 308, 315317,
Newborn........................................................................ 328
319323, 326, 327, 344, 502506, 511, 515, Newton method ................................................... 475, 587
517, 522 NHANES ............................................................. 279, 289
docking .................................................................... 259
Nicotine,
dynamics ............................... 358, 367, 371, 403, 408 Nitrenium ion............................................. 134, 138, 140,
fragments ................................ 38, 308, 310, 314, 321 144, 152, 154, 155
geometry..................................................38, 215, 319,
NOAEL .......................................... 61, 62, 298, 299, 327
326, 405, 502503 Non
mechanics ....................................................... 321, 326 bonded interactions,
networks ........................ 76, 133, 134, 166, 378, 384 congeneric ..................... 17, 68, 69, 71, 76, 309, 351
644 C OMPUTATIONAL TOXICOLOGY
Index
Non (cont.) Partition coefficient.................................... 7, 10, 75, 306,
genotoxic ......................................70, 76, 77, 78, 131, 312, 500, 613, 615, 616, 618, 620, 622
138, 143, 157, 263, 285, 347 Passi toolkit .......................................................... 416418
mutagenic ........................................84, 101, 146, 343 Pathway
Noncancer risk assessment........................ 72, 74, 89, 299 analysis .................................. 259, 263, 264, 269, 358
Non-compartmental analysis ........................................ 300 maps ......................................................................... 236
NONMEM.................................................................... 589 Pattern recognition .................... 103107, 315, 516, 521
Nonspecific binding .................................... 105, 290, 328 Pediatric ......................................................................... 260
Nuclear receptor............................................................ 172 Perchlorate,
Nucleic acids.................................................................... 50 Perfusion............................................................... 618, 619
Nucleophiles ........................................139, 144, 152, 154 Permeability
Numerical brain barrier ............................................................. 110
integration ...................................................... 479480 drug ....................................................... 105, 110, 113
methods ......................................... 478, 490497, 610 intestinal,
in vitro,
O Persistent organic pollutants (POPs) ..........................275,
Oasis database ....................................... 3740, 45, 46, 50 277, 279, 280, 282289
Pesticide .................................. 4, 5, 7, 40, 50, 55, 68, 69,
Objective function .......................................172, 192194
Occams razor, 90, 94, 139, 157, 279, 284, 285, 328
Occupational safety, Pharmacogenomics ..................................... 211, 253, 260
Ocular ............................................................................ 312 Pharmacophore ...........................................................9, 67
Physiome
OECD
guidelines.......................................................... 84, 307 jsim models,
qsar toolbox........................................ 71, 8284, 129, project,
Phytochemical ............................................. 259, 268, 269
132, 134137, 140, 157, 524
Omics.................................................................... 264, 528 Pitfalls ................................................................. 4, 10, 408
OpenMolGRID............................................................. 319 PKa................................................................................. 128
Plasma
Open MPI,
OpenTox Framework................................................8486 concentration .......................................................... 280
Optimization protein binding,
dosage ...................................................................... 592 Pollution ........................................................................ 284
Polybrominated diphenyl ethers
methods .......................................................... 195, 511
pre clinical, (PBDEs)...................................275, 279, 287288
Oral Polychlorinated biphenyls
(PCBs)......................................268, 279287, 312
absorption................................................................ 314
dose .......................................................................... 129 Polycyclic aromatic hydrocarbons
Organochlorine ........................................... 279, 284, 285 (PAHs) ......................................68, 138, 140, 217,
270, 276, 278
Orthologs ............................................................. 266268
Outlier ........................................... 15, 16, 110, 324, 330, Polymorphism ...................................................... 256258
503, 522, 523, 529531, 545, 600 Pooled data.................................................. 286, 289, 616
Poorly perfused tissues ................................................. 620
Overfitting ................................................... 337, 508, 570
Oxidative stress........................................... 175177, 218, Population based model ...................................... 279, 280
219, 221223, 227, 228, 231, 368, 370, 371, Portal vein ............................................................ 379, 388
540, 543, 544 Posterior distribution.........................597, 603, 606611,
613, 621626, 631, 633
P Predict
absorption................................... 7, 41, 259, 316, 382
Paracetamol, ADME parameters .................................................. 316
Paralogs.......................................................................... 265 aqueous solubility ..................................................... 75
Parameter binding............................................18, 40, 56, 75, 77,
estimation ................................................................ 300 82, 83, 104, 106, 130, 132, 138, 140, 142, 144,
scaling ...................................................................... 619 152, 154, 179, 258, 301, 306, 403
Paraoxon, biological activity.................................. 6, 8, 9, 11, 16,
Partial least squares (PLS) .................................. 9, 14, 17, 19, 41, 57, 67, 68, 88, 90, 99, 100, 101, 103,
57, 314, 316, 318, 319, 320, 322, 324, 504, 505, 104107, 110, 114, 118, 119, 306, 308, 310,
512, 549577 312, 314, 318, 321, 322, 500, 506, 516
COMPUTATIONAL TOXICOLOGY
Index 645
boiling point, interaction......................................172, 262, 264, 266
carcinogenicity...........................................5, 7, 55, 59, ligand,
60, 6795, 100, 104, 127130, 133, 135, 136, structure................................................................... 258
138140, 143, 151, 152, 259, 270, 309312, targets .................................................... 253, 258, 259
342, 348, 349, 499 Proteomics............................................................ 254, 265
clearance .........................................11, 14, 16, 58, 60, Prothrombin.................................................................. 105
68, 74, 92, 132, 146, 268, 288, 409, 601, 627 Pulmonary .................................................. 379, 613, 615,
CNS permeability, 616, 618620, 622
cytochrome P450 ......................................74, 77, 130, Pyrene ..................................................215, 217219, 222
134, 139, 141, 144
developmental toxicity ................................ 55, 87, 90, Q
127, 129, 305337
QikProp,
fate ................................................8, 32, 81, 125, 127, QSARPro ....................................................................... 321
128, 131, 133, 142 QSIIR ........................................................................5362
genotoxicity ...............................................72, 82, 106,
Quantum chemical descriptors.............................. 37, 324
125133, 136159, 270, 342, 343, 345 Quinone......................................................................... 105
Henry constant,
melting point................................................ 10, 58, 75 R
metabolism ....................................... 7, 41, 72, 74, 77,
125133, 136159, 235248, 259, 269, 301, R (statistical software)................................ 328, 358, 360,
302, 305, 310, 316, 320, 351, 382, 391, 392, 613 364, 589
mutagenicity ...............................7, 6774, 77, 7988, Random
91, 94, 95, 100, 125, 127130, 132, 133, effects ....................................................................... 570
136140, 145147, 152, 156, 158, 272, 309, forest ..............................................103, 306, 309, 313
311313, 344347, 499 Ranking.................................................73, 141, 170, 171,
pharmacokinetic parameters ............................ 99, 119 193, 265, 631
physicochemical properties......................... 3, 7, 8, 20, Reabsorption,
41, 58, 72, 75, 131, 305, 308, 311, 314, 315, Reactive intermediates ......................................... 139, 153
499, 502 Receptor
safety ........................................ 29, 53, 54, 70, 86, 89, agonists ........................................................... 278, 284
99, 101, 113, 114, 115, 118, 126, 135, 216, AhR ...................................... 166, 217219, 222, 223,
258, 259, 271, 301, 302, 305337, 341351 268, 278, 279, 284, 368
toxicity ..................... 85, 86, 100, 136, 155, 305337 binding affinity .......................................................... 40
PredictPlus, mediated toxicity..................................................... 502
Pregnancy ...................................279, 327, 328, 602, 617 Recirculation,
Pregnane Xenobiotic receptors .................................... 269 Reconstructed enzyme network................. 240, 245, 246
Prenatal ................................................................... 55, 328 Reference concentration (RfC) ........................... 298, 299
Prior distribution ....................................... 597, 599, 600, Reference dose (RfD) ................................. 280, 298, 299
601, 603, 604, 606, 607, 609, 613, 615, 620, Relational databases .....................................30, 4749, 69
621, 622, 634 Renal clearance,
Prioritization ..................................................68, 320, 337 Reproductive toxicity ............................................. 55, 328
toxicity testing, Reprotox ............................................................... 327, 328
Procheck, Rescaling........................................................................ 199
Pro Chemist .................................................................. 320 Residual errors...................................................... 592, 625
Progesterone, Respiratory
Project Caesar............................................. 100, 117, 127, system ...................................................................... 298
138, 147, 309 tract,
Propranolol, RetroMex,
ProSA, Reverse engineering .....................................172173, 180
Protein Richly perfused tissues,
binding................................................... 106, 300, 359 Risk
databank (PDB) ..................................................46, 50 analysis .................................................................54, 86
docking ........................................................... 258, 259 characterisation.......................................................... 75
folding estimation ................................................................ 349
646 C OMPUTATIONAL TOXICOLOGY
Index
Risk (cont.) Singular value decomposition ................... 429, 460462,
Integrated risk information system 464, 465468, 472, 473, 549, 551, 554
(IRIS)...................................................89, 92, 318 Sinusoids ...................................................... 379, 387, 391
management ............................................................ 142 Size penalty................................. 195, 196, 200202, 204
Risk/safety assessment Skin
chemical ................................................................... 126 lesion ....................................583, 584, 585, 590, 592,
pharmaceutical......................................................... 300 593595
screening .................................................................. 301 sensitization ...................................... 74, 77, 100, 104,
testing ...................................................................... 299 130, 309, 311, 312
Robustness............................................19, 107, 110, 348, SMARTCyp ...........................................77, 130, 141, 142
512514, 606, 613 SMBioNet............................................................. 229233
SMILES ................................................31, 38, 55, 78, 82,
S 85, 86, 131, 132, 135, 144, 148151, 309,
Saccharomyces cerevisiae<! 538 312314, 327, 502, 503, 506
Smoking............................. 126, 134, 135, 139, 174, 217
Salicylic acid
Sample size .................................................................... 597 Sodium.................................................................. 326, 359
SBML.................................................................... 386, 405 Solubility prediction........................................................ 58
Source-to-effect continuum ........................................... 58
Scalability ....................................................................... 412
Scaling...................................................20, 193, 280, 299, SPARC,
365, 386, 420, 530, 531, 534, 619, 620 SPARQL,
factor ........................................................................ 365 Sparse data,
Species
procedure................................................................. 530
SCOP, differences....................................................... 268, 302
Screening extrapolation............................................................ 299
scaling ............................................................. 280, 420
drug,
drug discovery ...............................100, 101, 256, 343 SPSS ...................................................................... 117, 576
environmental chemicals.................................. 94, 259 Statistica ..........................................................20, 117, 323
Stereoisomers ................................................................ 326
hts............................................................................... 54
methods, Steroid................................................................... 105, 285
protocols .................................................................. 101 Stomach,
Screeplot .....................................535, 537, 539, 540, 546 Storage compartment,
Stress response............................................................... 175
Searchable toxicity database .................................. 88, 145
Secondary structure prediction, Structural
Selectivity index..........................101, 115, 116, 118, 119 alert ........................................ 7, 67, 7679, 125, 136,
138, 152, 270, 308, 311, 342344, 348
Self-organizing maps............................................ 316, 515
Sensitivity similarity ..........................................87, 153, 270, 301
analysis, Structure-activity relationship (SAR)
analysis ..............................72, 74, 129, 136, 342, 343
coefficient,
Sensitization ...........................................74, 77, 100, 104, model ................................6, 100, 323, 330, 331, 337
127, 130, 309, 311313 Sub
cellular...................................................................... 407
Sequence
alignment, compartments................................................. 387, 392
homology................................................................. 262 Substituted benzenes .................................................... 105
Serum albumin, Substrate
active site ................................................................. 422
Sevoflurane .................................................................... 266
Shellfish, binding............................................................ 104, 106
Signal transduction ....................170, 179, 185, 263, 402 inducers.................................................................... 121
inhibitor ................................................................... 106
SIMCA.......................................... 20, 103, 322, 516, 576
Simcyp, Substructure
Similarity searching ..............................................................33, 34
similarity ..............................................................92, 93
analysis ........................................................9, 318, 515
indices .......................................................................... 9 Sugar ........................................................... 555, 556, 559,
search ...................................... 34, 35, 38, 92, 93, 319 560, 566, 567, 572, 574
Simulated annealing ........................................................ 57 Sulfamoyl adenosines .................................................... 104
COMPUTATIONAL TOXICOLOGY
Index 647
Sulfur dioxide,
Supersaturation,
Support vector machine (SVM) .................. 36, 103, 105, 110–112
Surrogate endpoint ............................................. 254, 257
Surveillance programs .............................. 260, 276, 279, 289
Switch ............................. 383, 385, 437, 438, 446–449, 538
SYBYL .................................................... 315, 321, 322
SYSTAT ................................................... 117, 323, 328
Systems
  biology ......................... 215, 216, 253, 261, 375, 377, 378,
      383, 386, 399, 402, 404, 405
  toxicology ..................................................... 375–393

T

Tanimoto coefficient .................................................. 269
TEFs. See Toxic equivalency factors (TEFs)
Teratogenicity ............................ 128, 309–311, 313, 342, 348
Tetrachlorodibenzo dioxin (TCDD) ............................. 281, 285
Tetrahymena pyriformis ....................................18, 104, 314
Theophylline,
Therapeutic
  doses,
  index ............................................................... 115
Thermodynamic properties,
Thiazoles ........................................................ 106, 139
Threshold value .................................................. 517, 598
Thyroid ..................................................... 280, 285, 379
Tissue
  dosimetry ...................................................... 381, 390
  grouping,
  partition coefficient ......................................... 618, 620
  volumes ........................................................ 382, 619
Tmax,
TNF stimulation ....................................................... 208
Tolerable
  daily intake ................................................... 280, 286
  weekly intake,
Tolerance ........................................................ 329, 587
TopKat ........ 69, 129, 138, 140, 147, 151, 155, 156, 158, 307, 309, 312
Topliss tree,
Topological
  index ............................................................... 114
Total clearance,
ToxCast program ........................................................ 95
ToxCreate ......................................................... 85, 86
Toxic equivalency factors (TEFs) ..................................... 285
Toxicity/toxicological
  chemical ......... 53, 54, 58, 71, 86–95, 301, 349, 376, 389, 392, 520
  database ............................ 54–56, 86, 87, 327, 328, 348, 349
  drug ........................... 101, 341, 342, 357, 358, 368–372
  DSSTox ............................................................. 328
  endocrine disruption ......................................... 90, 104
  endpoint ............... 7, 56, 62, 69–71, 92, 100, 101, 127, 128,
      159, 301, 311, 312, 313, 342, 348, 349, 351, 358, 362, 371
  environmental ....................................... 32, 83, 88, 93
  estimates .......................................................... 307
  mechanism .......................................54, 167, 175, 301
  organic,
  pathways .................... 359, 362, 363, 366, 368, 377, 379
  potential ........................... 90, 257, 306–308, 312, 342
  prediction .......................... 69–71, 100, 155, 305–337
  rodent carcinogenicity ..................................... 59, 312
  screening .......................................................... 372
  testing ..................... 53, 54, 61, 305, 351, 376, 377, 393
Toxicogenomics ...................... 94, 259, 261–266, 268, 357–373
Toxicoinformatics ..................................................... 373
Toxicologically equivalent,
Toxicophore ............................. 67, 127, 130, 137, 311–313
ToxMic ..................................... 74, 76, 130, 133, 140
TOXNET ............... 56, 88, 89, 90, 92, 93, 135, 137, 145, 324, 327
ToxPredict ........................................................ 85, 86
ToxRefDB .................................. 55, 90, 94, 145, 327, 328
Toxtree ............... 6, 7, 68, 70, 71, 74–80, 82, 86, 117, 130, 133,
    136, 138, 139, 147, 155, 156, 158, 270, 344, 345, 348, 524
TPSA,
Tracers,
Training sets ............... 107, 147, 323, 346, 350, 351, 501, 513
Transcription factor ....................... 172, 179, 217, 362, 370
Transcriptome ......................................................... 360
Transduction ............................ 170, 179, 185, 263, 402
Transit compartment,
Transport
  mechanisms ......................................................... 302
  models .............................................................. 381
  proteins (transporters) .................. 260, 261, 269, 369, 387
Tree ................ 6, 39, 74–79, 82, 103, 104, 107, 108, 130, 136,
    230, 318, 320, 344, 347, 386, 516–518
  self organizing .................................................... 320
Trichloroacetic acid (TCA) .................................... 544, 545
TTC decision tree ..................................................... 130
Tumor .................. 55, 72, 88, 104, 254, 269, 359, 387, 390, 391
Turnover,
Tyrosinase inhibitors ........................................... 104, 105
Tyrosine ............................................................... 254

U

UML ...................................... 406, 414, 416, 418, 419
Uracils .......................................................... 114–116
Urinary cadmium concentration ........................................ 278
Urine ....................................................... 276, 278, 279

V

Valacyclovir,
Validation
  external ....... 6, 21, 59, 319, 325, 337, 345, 349, 510, 514–516, 524
  internal ........................................ 17, 512, 514, 515
  loo .................................................... 17, 330, 513
  methods ..................................... 129, 330, 331, 554
  qsar ..................................... 19, 325, 345, 499–524
  techniques ........................................ 319, 349, 512
Van der Waals ........................................................... 9
Vapor pressure ......................................................... 75
Variability ................ 70, 71, 262, 276, 280, 289, 385, 404, 501,
    528, 599–602, 604, 613–616, 621, 624–630, 632–634
Variable selection ...................... 12, 321, 504, 511–512, 516
Vascular endothelial ............................................ 56, 105
VCCLAB,
Venlafaxine,
Vinyl chloride,
Virtual
  high throughput screening (vHTS),
  libraries ........................................................... 101
  screening ............................... 30, 101, 343, 512, 514
  tissue ....................................... 375, 383, 384, 409
VolSurf,
Volume of distribution,

W

Warfarin ........................................................ 257, 261
WinBUGS,
WinNonLin,
WSKOWWIN,

X

XPPAUT,