Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Volume II
Edited by
Brad Reisfeld
Department of Chemical and Biological Engineering and School of Biomedical Engineering
Colorado State University, Fort Collins, Colorado, USA
Arthur N. Mayeno
Department of Chemical and Biological Engineering,
Colorado State University, Fort Collins, Colorado, USA
Rapid advances in computer science, biology, chemistry, and other disciplines are enabling
powerful new computational tools and models for toxicology and pharmacology. These
computational tools hold tremendous promise for advancing applied and basic science,
from streamlining drug efficacy and safety testing to increasing the efficiency and
effectiveness of risk assessment for environmental chemicals. These approaches also offer the
potential to improve experimental design, reduce the overall number of experimental trials
needed, and decrease the number of animals used in experimentation.
Computational approaches are ideally suited to organize, process, and analyze the vast
libraries and databases of scientific information and to simulate complex biological
phenomena. For instance, they allow researchers to (1) investigate toxicological and
pharmacological phenomena across a wide range of scales of biological organization
(molecular → cellular → organism), (2) incorporate and analyze multiple biochemical and
biological interactions, (3) simulate biological processes and generate hypotheses based
on model predictions, which can be tested via targeted experimentation in vitro or in vivo,
(4) explore the consequences of inter- and intra-species differences and population
variability on the toxicology and pharmacology, and (5) extrapolate biological responses across
individuals, species, and a range of dose levels.
Despite the exceptional promise of computational approaches, there are presently very
few resources that focus on providing guidance on the development and practice of these
tools to solve problems and perform analyses in this area. This volume was conceived as
part of the Methods in Molecular Biology series to meet this need and to provide both
biomedical and quantitative scientists with essential background, context, examples, useful
tips, and an overview of current developments in the field. To this end, we present a
collection of practical techniques and software in computational toxicology, illustrated with
relevant examples drawn principally from the fields of environmental and pharmaceutical
sciences. These computational techniques can be used to analyze and simulate a myriad of
multi-scale biochemical and biological phenomena occurring in humans and other animals
following exposure to environmental toxicants or dosing with drugs.
This book (the second in a two-volume set) is organized into six parts, each covering a
methodology or topic, subdivided into chapters that provide background, theory, and
illustrative examples. Each part is generally self-contained, allowing the reader to start
with any part, although some knowledge of concepts from other parts may be assumed.
The final part provides a review of relevant mathematical and statistical techniques. Part I
explores the critical area of predicting toxicological and pharmacological endpoints, such as
mutagenicity and carcinogenicity, and demonstrates the formulation and application of
quantitative structure-activity relationships (QSARs) and the use of chemical and endpoint
databases. Part II details approaches used in the analysis of gene, signaling, regulatory, and
metabolic networks, and illustrates how perturbations to these systems may be analyzed in
the context of toxicology. Part III focuses on diagnostic and prognostic molecular
indicators and examines the use of computational techniques to utilize and characterize these
biomarkers. Part IV looks at computational techniques and examples of modeling for
risk and safety assessment for both internal use and regulatory purposes. Part V details
approaches for integrated systems modeling, including the rapidly evolving development
of virtual organs and organisms. Part VI reviews some of the key mathematical and
statistical methods used herein, such as linear algebra, differential equations, and least-
squares analysis, and lists other resources for further information.
Although a complete picture of toxicological risk often involves an analysis of
environmental transport, we believe that this expansive topic is beyond the scope of this volume,
and it will not be covered here; overviews of computational techniques in this area are
contained in a variety of excellent references [1–4].
Computational techniques are increasingly allowing scientists to gain new insights into
toxicological phenomena, integrate (and interpret) the results from a wide variety of
experiments, and develop more rigorous and quantitative means of assessing chemical
safety and toxicity. Moreover, these techniques can provide valuable insights before
initiating expensive laboratory experiments and into phenomena not easily amenable to
experimental analysis, e.g., detection of highly reactive, transient, or trace-level species in
biological milieu. We believe that the unique collection of explanatory material, software,
and illustrative examples in Computational Toxicology will allow motivated readers to
participate in this exciting field and undertake a diversity of realistic problems of interest.
We would like to express our sincere thanks to our authors whose enthusiasm and
diverse contributions have made this project possible.
References
1. Clark, M.M., Transport modeling for environmental engineers and scientists. 2nd ed. 2009,
Hoboken, N.J.: Wiley.
2. Hemond, H.F. and E.J. Fechner-Levy, Chemical fate and transport in the environment. 2nd ed.
2000, San Diego: Academic Press. xi, 433 p.
3. Logan, B.E., Environmental transport processes. 1999, New York: Wiley. xiii, 654 p.
4. Nirmalakhandan, N., Modeling tools for environmental engineers and scientists. 2002, Boca Raton,
Fla.: CRC Press. xi, 312 p.
Contents
Preface
Contributors
11 Biomarkers
Harmony Larson, Elena Chan, Sucha Sudarsanam, and Dale E. Johnson
12 Biomonitoring-based Environmental Public Health Indicators
Andrey I. Egorov, Dafina Dalbokova, and Michal Krzyzanowski
HERVE ABDI School of Behavioral and Brain Sciences, The University of Texas
at Dallas, Richardson, TX, USA
SANJAY BAJAJ S.V. College of Pharmacy, Patiala, India
CHIARA LAURA BATTISTELLI Environment and Health Department, Istituto
Superiore di Sanità, Rome, Italy
ROMUALDO BENIGNI Environment and Health Department, Istituto Superiore
di Sanità, Rome, Italy
GILLES BERNOT I3S laboratory, UMR 6070 CNRS, University of Nice-Sophia
Antipolis, Sophia Antipolis, France
FREDERIC Y. BOIS Technological University of Compiegne, Royallieu Research Center,
Compiegne, France; INERIS, DRC/VIVA/METO, Verneuil en Halatte, France
CECILIA BOSSA Environment and Health Department, Istituto Superiore di Sanità,
Rome, Italy
NICOLA CANNATA School of Science and Technology, University of Camerino,
Camerino, Italy
ELENA CHAN Emiliem, Inc., San Francisco, CA, USA
MAURO COLAFRANCESCHI Environment and Health Department, Istituto Superiore
di Sanità, Rome, Italy
FLAVIO CORRADINI School of Science and Technology, University of Camerino,
Camerino, Italy
DAFINA DALBOKOVA Consultant, Sofia, Bulgaria
DANIELA DE ANGELIS MRC Biostatistics Unit, Institute of Public Health, University
Forvie Site, Cambridge, UK
JOHN C. DEARDEN School of Pharmacy & Biomolecular Sciences, Liverpool John Moores
University, Liverpool, UK
JAMES DEVILLERS CTIS, Rillieux La Pape, France
HARISH DUREJA Department of Pharmaceutical Sciences, M. D. University,
Rohtak, India
ANDREY I. EGOROV World Health Organization (WHO), Regional Office for Europe,
European Centre for Environment and Health (ECEH), Bonn, Germany
HISHAM EL-MASRI Integrated Systems Toxicology Division, Systems Biology Branch,
US Environmental Protection Agency, Research Triangle Park, NC, USA
CHRISTINE RISSO-DE FAVERNEY ECOMERS laboratory, University of Nice-Sophia
Antipolis, Nice Cedex, France
PAOLA GRAMATICA QSAR Research Unit in Environmental Chemistry and
Ecotoxicology, Theoretical and Applied Sciences, University of Insubria,
via Dunant 3, Varese, Italy
DETLEF GROTH AG Bioinformatics, University of Potsdam, Potsdam-Golm, Germany
STEFANIE HARTMANN AG Bioinformatics, University of Potsdam,
Potsdam-Golm, Germany
Abstract
Structure-activity relationship (SAR) and quantitative structure-activity relationship (QSAR) models are
increasingly used in toxicology, ecotoxicology, and pharmacology for predicting the activity of the mole-
cules from their physicochemical properties and/or their structural characteristics. However, the design of
such models has many traps for unwary practitioners. Consequently, the purpose of this chapter is to give a
practical guide for the computation of SAR and QSAR models, point out problems that may be encoun-
tered, and suggest ways of solving them. Attempts are also made to see how these models can be validated
and interpreted.
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_1, © Springer Science+Business Media, LLC 2013
4 J. Devillers
2. Biological Data
Table 1
Acute oral toxicity of some pesticides in male and female
rats (7, 8)
3. Molecular Descriptors
For a chemical to trigger a biological activity when administered to
an organism, a number of processes must occur that depend on its
structural characteristics. The correct encoding of these characteristics
is the keystone of the design of SAR and QSAR models with
good predictive performance. A molecule can be described in different
ways, depending on the endpoint of concern and the
characteristics of the set of molecules used for computing the
model. The different categories of descriptors are discussed below,
focusing on their value and limitations rather than on their
calculation procedures.
3.1. Indicator Variables
The indicator variables, also termed 1D descriptors, account for
structural features (i.e., atoms or functional groups) that influence,
or are responsible for, the biological activity of the molecules.
They are also termed dummy variables or Boolean
descriptors when they encode the presence (= 1) or absence (= 0)
of a structural element in a molecule. These descriptors represent
the simplest way of describing a molecule. The Boolean descriptors
are particularly suited for encoding the position and/or the number
of substituents on a molecule. The Free-Wilson method (21) is
rooted in the use of such descriptors. Indeed, this approach makes it
possible to quantify mathematically the contribution of a substituent to the
biological activity of a molecule. It is assumed that the substituents
on a parent molecule provide a constant contribution to the activity
according to a simple principle of additivity. The method operates
by generating a data matrix consisting of zero and one values.
Each column in this matrix corresponds to a particular substituent
at a specific position on the molecule and is treated as an independent
variable. The data table also contains a column of dependent
data (i.e., biological data). A multiple regression analysis is applied
between the dependent variable and the independent variables,
using classical statistical criteria for measuring the goodness of fit.
The regression coefficients of the model represent the contributions
of the substituents of the molecule to its activity. The Free-Wilson
approach has found numerous applications in QSAR; see, e.g., refs.
(22–27). Its main advantage relies on the fact that the mechanistic
1 Methods for Building QSARs 7
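The Free-Wilson procedure described in Section 3.1 can be illustrated with a minimal sketch. It assumes a hypothetical series with two substitution positions and codes each substituent relative to the unsubstituted parent (the reference-compound coding often called the Fujita-Ban variant), which avoids collinear columns. The substituents, activities, and the function name `free_wilson_fit` are invented for illustration and do not come from the chapter.

```python
import numpy as np

def free_wilson_fit(indicator_matrix, activity):
    """Fit activity = mu + sum_j a_j * x_j by ordinary least squares.

    indicator_matrix: rows are molecules, columns are 0/1 flags for
    "substituent S at position P" (hypothetical coding).
    Returns the parent contribution mu and the substituent contributions a_j.
    """
    x = np.asarray(indicator_matrix, dtype=float)
    y = np.asarray(activity, dtype=float)
    design = np.column_stack([np.ones(len(y)), x])  # intercept = parent molecule
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Hypothetical columns: [R1 = Cl, R1 = CH3, R2 = OH]; an all-zero row is the parent.
X = np.array([
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
])
# Synthetic activities built from known additive contributions (assumed values).
true_mu = 2.0
true_contrib = np.array([0.5, -0.3, 0.8])
y = true_mu + X @ true_contrib

mu, contrib = free_wilson_fit(X, y)
print("parent contribution:", round(float(mu), 3))
print("substituent contributions:", np.round(contrib, 3))
```

Because the synthetic activities are strictly additive, the fit returns the contributions used to build them; with real data the goodness-of-fit statistics would also have to be examined.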
3.3. 3D Molecular Descriptors
In the classical Hansch analysis and the related QSAR approaches,
the descriptors are calculated from a flat representation of the
molecules. This is of limited value for understanding receptor-ligand
interactions. More than 20 years ago, Richard Cramer of
Tripos, Inc. proposed using the field properties of molecules in
3D space to derive QSAR models (66). The method, called comparative
molecular field analysis (CoMFA), was the first true 3D
QSAR approach and remains the most widely used (67, 68). The
basic assumption in CoMFA is that a suitable sampling of steric
4. Model Computation
A huge number of methods are available to build models that
relate biological data to molecular descriptors. Rather than cataloging
all the approaches in use in the domain, it is better to give
Table 2
Nineteen chemicals with their activity (log 1/EC50 in mM)
and their electrophilic superdelocalizability for atom 10
(ESDL10), 1-octanol/water partition coefficient (P),
and melting point (MP) (adapted from refs. (80, 81))
(data arranged from refs. 80, 81), it is possible to derive the following
three-parameter equation 1 that appears at first sight to be quite
reasonable:
log 1/EC50 = 0.577 (±0.105) log P   3.191 (±0.678) log MP   0.376 (±0.168) ESDL10   10.38 (±2.172)   (1)
n = 19, r = 0.83, r² = 0.69, s = 0.30, and F = 11.1,
Fig. 1. Scatter plot of log 1/EC50 versus ESDL10 (see Table 2).
Table 3
Anesthetic activity of monoketones in mice (88)
Fig. 2. Scatter plot of log P versus log 1/AD50 (see Table 3).
5. Model Diagnostics
Once the model has been constructed, it is important to verify
that no basic assumption justifying the use of the selected
statistical method has been violated, to analyze the statistical
significance of its parameters, and to check whether the activities
of the chemicals from which it was computed have been correctly
predicted. The first two points can vary considerably from one
statistical method to another, while the last one depends only on
the type of biological data.
Regarding the SAR models that aim to predict categorical
response data, such as positive versus negative responses, there are
four possible model outcomes: a true positive
(TP), a true negative (TN), a false positive (FP), and a false negative
(FN). A false positive occurs when a chemical is incorrectly classified as
active (or positive) when it is in fact inactive (or negative).
A false negative occurs when a chemical is incorrectly classified as
negative when it is in fact positive. True positives and true negatives
are correct classifications. From these four types of
results, it is possible to calculate various parameters (105–107),
the most important of which are the following:
Sensitivity or true positive rate (TPR) = TP/(TP + FN).
False positive rate (FPR) = FP/(FP + TN).
Specificity or true negative rate (TNR) = TN/(FP + TN).
Accuracy = (TP + TN)/(TP + FP + TN + FN).
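As a minimal sketch, the four parameters listed above can be computed directly from the counts of a confusion matrix. The function name `classification_metrics` and the example counts are hypothetical and serve only to illustrate the arithmetic.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the four parameters defined in the text from TP, TN, FP, FN counts."""
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate (TPR)
        "false_positive_rate": fp / (fp + tn),    # FPR
        "specificity": tn / (fp + tn),            # true negative rate (TNR)
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a model evaluated on 100 chemicals.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that specificity and the false positive rate always sum to 1, so reporting one of them is sufficient.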
The plot of TPR versus FPR is called a
receiver operating characteristic (ROC) curve. ROC curves
are particularly useful for comparing different configurations
or classifiers (108, 109).
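To make the construction of a ROC curve concrete, the following sketch sweeps a decision threshold over hypothetical model scores and records the (FPR, TPR) pair at each threshold. The scores, labels, and function names are illustrative assumptions; the area under the curve is computed with the trapezoidal rule as one common way of summarizing the plot.

```python
import numpy as np

def roc_points(scores, labels):
    """Return FPR and TPR arrays obtained by lowering the threshold step by step."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing predicted active
    for threshold in np.sort(np.unique(scores))[::-1]:
        predicted = scores >= threshold
        tpr.append((predicted & labels).sum() / n_pos)
        fpr.append((predicted & ~labels).sum() / n_neg)
    return np.array(fpr), np.array(tpr)

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the (FPR, TPR) points."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Hypothetical scores (higher = more likely active) and true classes (1 = active).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_points(scores, labels)
auc = area_under_curve(fpr, tpr)
print("FPR:", np.round(fpr, 3))
print("TPR:", np.round(tpr, 3))
print("AUC:", round(auc, 3))
```

Two classifiers can then be compared by overlaying their curves or by comparing their areas, with an area of 0.5 corresponding to random assignment.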
Now, regarding the QSAR models, computed on continuous
biological data (e.g., log 1/LD50), the calculated activity value of
each chemical is subtracted from the corresponding experimental
activity value for that chemical. This difference is termed the residual.
A chemical with too large a residual is called an outlier. From a statistical point of
view, an outlier among residuals is one that is far greater than the
rest in absolute value and perhaps lies three or four standard deviations
or further from the mean of the residuals (91).
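The residual-based definition above can be sketched as follows. The example assumes a cutoff of three standard deviations from the mean residual, which is one of the values mentioned in the text rather than a fixed rule, and uses invented activity values; `flag_outliers` is a hypothetical helper name.

```python
import numpy as np

def flag_outliers(observed, calculated, n_sd=3.0):
    """Return residuals (observed - calculated) and indices of flagged outliers."""
    observed = np.asarray(observed, dtype=float)
    calculated = np.asarray(calculated, dtype=float)
    residuals = observed - calculated
    centered = residuals - residuals.mean()
    sd = residuals.std(ddof=1)  # sample standard deviation of the residuals
    flagged = np.flatnonzero(np.abs(centered) > n_sd * sd)
    return residuals, flagged

# Hypothetical data: 20 chemicals, one of which is badly predicted.
calculated = np.linspace(1.0, 3.0, 20)
noise = np.tile([0.1, -0.1], 10)
noise[7] = 3.0  # deliberately large residual for chemical number 7
observed = calculated + noise

residuals, flagged = flag_outliers(observed, calculated)
print("flagged chemicals:", flagged.tolist())
```

With small data sets a single extreme value inflates the standard deviation itself, so a fixed cutoff should be treated as a screening aid and not as a proof that a chemical is or is not an outlier.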
In practice, outliers deserve a lot of attention
because they indicate that there is more to understand about the
model before it can be used safely. Moreover, an understanding of the
6. Model Performance Estimation
After building a model of whatever type, it is necessary to assess
how well it might be expected to work. To do so, as a first step, its
performance is commonly estimated from the chemicals that
were used to compute the model. A simple method to perform
Table 4
Log P-dependent QSAR equations for nonpolar
narcotics (130)
7. Interpreting the Models
If the molecular descriptors included in a model are meaningful, the
signs and the values of their coefficients can be used to interpret it.
This is classically done with regression models, where it is possible to
see directly the parameters that contribute positively or negatively
to the modeled activity. Mechanistic information is also obtained
by comparing the slope and intercept of a new QSAR
model with those of existing QSAR models designed on the same
type of molecules and with the same molecular descriptor. This is
exemplified in Table 4, which lists three log P-dependent QSAR equations
for nonpolar narcotics obtained on three aquatic species (130).
This approach is called comparative QSAR (131, 132), but also
lateral validation, because this comparison exercise provides an indirect
validation of the equation parameters (132, 133).
Indeed, if the same molecular descriptor is present, with a similar
contribution, in the QSAR models being compared, more confidence
can be attributed to all the models.
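The comparison of slopes and intercepts described above can be sketched in a few lines. The data below are entirely hypothetical (they are not the equations of Table 4), and `fit_line` and `compare_models` are illustrative helper names; the sketch only shows the mechanics of fitting two log P-dependent equations and comparing their parameters.

```python
import numpy as np

def fit_line(log_p, activity):
    """Fit activity = slope * log P + intercept by least squares."""
    slope, intercept = np.polyfit(log_p, activity, deg=1)
    return float(slope), float(intercept)

def compare_models(model_a, model_b):
    """Return absolute differences in slope and intercept between two models."""
    return abs(model_a[0] - model_b[0]), abs(model_a[1] - model_b[1])

# Hypothetical toxicity data for the same chemicals tested on two species.
log_p = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
species_a = 0.85 * log_p + 1.20  # assumed relationship for species A
species_b = 0.90 * log_p + 1.00  # assumed relationship for species B

model_a = fit_line(log_p, species_a)
model_b = fit_line(log_p, species_b)
slope_gap, intercept_gap = compare_models(model_a, model_b)
print("species A (slope, intercept):", np.round(model_a, 2))
print("species B (slope, intercept):", np.round(model_b, 2))
print("slope gap:", round(slope_gap, 2), "intercept gap:", round(intercept_gap, 2))
```

Similar slopes suggest that the descriptor plays the same role in both models, whereas the intercepts mainly reflect differences in sensitivity between the test systems.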
Another point to consider is the way in which the modeling
results are interpreted. Unfortunately, unwarranted generalizations are
very often made. Thus, for example, in endocrine disruption modeling
(134), most of the SAR and QSAR models are aimed at
predicting the binding activity of chemicals on a specific endocrine
receptor. Very often, from such studies, chemicals are claimed as
8. Concluding Remarks
SAR and QSAR models are increasingly used to provide insights
into the mechanisms of action of organic chemicals and to fill
data gaps. When possible, they have to be used as surrogates for the
toxicity tests on vertebrates in Registration, Evaluation, Authorisation
and Restriction of Chemicals (REACH), the EU regulation on
chemicals (136). However, to be safely used, an SAR or a QSAR
model needs to be correctly designed. According to the so-called
OECD principles for validation of the SAR and QSAR models, the
models must present (137):
(1) A defined endpoint.
(2) An unambiguous algorithm.
(3) A defined domain of applicability.
(4) Appropriate measures of goodness of fit, robustness, and predictivity.
(5) A mechanistic interpretation, if possible.
These five conditions represent basic requirements because, in
fact, there are numerous points to respect for deriving an SAR or a
QSAR model that does not fail in its predictions and, hence, that
can be used for research or regulation.
Preferably, the biological activity data have to be obtained under
the same experimental conditions (i.e., same protocol). If this is
not the case, they have to be compatible. QSAR models need
activity data of good quality that span a sufficient range and that
generally have to be log10 transformed. If there is too much
uncertainty about their quality, they must be transformed into categorical
data for deriving SAR models. Particular attention has to
be paid to the fact that some statistical methods are very
sensitive to unbalanced classes. This is the case for linear
discriminant analysis (15, 16, 92, 138–140).
Use molecular descriptors that are informative, not redundant,
and not correlated. A descriptor can be highly discriminative
References
1. Cros AFA (1863) Action de l'alcool amylique sur l'organisme. Thesis, Strasbourg
2. Dujardin-Beaumetz D, Audigé (1875) Sur les propriétés toxiques des alcools par fermentation. CR Acad Sci Paris LXXX:192–194
3. Overton E (1901) Studien über die Narkose. Gustav Fischer, Jena
4. Lipnick RL, Filov VA (1992) Nikolai Vasilyevich Lazarev, toxicologist and pharmacologist, comes in from the cold. Trends Pharmacol Sci 13:56–60
5. Hansch C, Maloney PP, Fujita T et al (1962) Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature 194:178–180
6. Hansch C, Fujita T (1964) ρ-σ-π analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86:1616–1626
7. Gaines TB (1960) The acute toxicity of pesticides to rats. Toxicol Appl Pharmacol 2:88–99
8. Gaines TB (1969) Acute toxicity of pesticides. Toxicol Appl Pharmacol 14:515–534
9. Kato R (1974) Sex-related differences in drug metabolism. Drug Metab Rev 3:1–32
10. Devillers J (2004) Prediction of mammalian toxicity of organophosphorus pesticides from QSTR modeling. SAR QSAR Environ Res 15:501–510
11. Kaiser KLE (2004) Toxicity data sources. In: Cronin MTD, Livingstone D (eds) Predicting chemical toxicity and fate. CRC, Boca Raton, FL
12. Tan NX, Rao HB, Li ZR, Li XY (2009) Prediction of chemical carcinogenicity by machine learning approaches. SAR QSAR Environ Res 20:27–75
13. Fjodorova N, Vracko M, Jezierska A et al (2010) Counter propagation artificial neural network categorical models for prediction of carcinogenicity for non-congeneric chemicals. SAR QSAR Environ Res 21:57–75
14. Mombelli E, Devillers J (2010) Evaluation of the OECD (Q)SAR application toolbox and Toxtree for predicting and profiling the carcinogenic potential of chemicals. SAR QSAR Environ Res 21:731–752
15. Sanchez PM (1974) The unequal group size problem in discriminant analysis. J Acad Mark Sci 2:629–633
16. Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6:429–450
17. Devillers J, Mombelli E (2010) Evaluation of the OECD QSAR application toolbox and Toxtree for estimating the mutagenicity of chemicals. Part 2. α-β Unsaturated aliphatic aldehydes. SAR QSAR Environ Res 21:771–783
18. Benigni R, Passerini L, Rodomonte A (2003) Structure-activity relationships for the mutagenicity and carcinogenicity of simple and α-β unsaturated aldehydes. Environ Mol Mutagen 42:136–143
19. OECD QSAR Application Toolbox. http://www.oecd.org/document/54/0,3343,en_2649_34379_42923638_1_1_1_1,00.html
20. Toxtree. http://ecb.jrc.it/qsar/qsar-tools/index.php?c=TOXTREE
21. Free SM, Wilson JW (1964) A mathematical contribution to structure-activity studies. J Med Chem 7:395–399
22. Serebryakov EP, Epstein NA, Yasinskaya NP et al (1984) A mathematical additive model of the structure-activity relationships of gibberellins. Phytochemistry 23:1855–1863
23. Zahradnik P, Foltinova P, Halgas J (1996) QSAR study of the toxicity of benzothiazolium salts against Euglena gracilis: the Free-Wilson approach. SAR QSAR Environ Res 5:51–56
24. Fouchecourt MO, Beliveau M, Krishnan K (2001) Quantitative structure-pharmacokinetic relationship modelling. Sci Total Environ 274:125–135
25. Globisch C, Pajeva IK, Wiese M (2006) Structure-activity relationships of a series of tariquidar analogs as multidrug resistance modulators. Bioorg Med Chem 14:1588–1598
26. Alkorta I, Blanco F, Elguero J (2008) Application of Free-Wilson matrices to the analysis of the tautomerism and aromaticity of azapentalenes: a DFT study. Tetrahedron 64:3826–3836
27. Baggiani C, Baravalle P, Giovannoli C et al (2010) Molecularly imprinted polymers for corticosteroids: analysis of binding selectivity. Biosens Bioelectron 26:590–595
28. Hall LH, Kier LB, Phipps G (1984) Structure-activity relationship studies on the toxicities of benzene derivatives: I. An additivity model. Environ Toxicol Chem 3:355–365
29. Hall LH, Kier LB (1986) Structure-activity relationship studies on the toxicities of benzene derivatives: II. An analysis of benzene substituent effects on toxicity. Environ Toxicol Chem 5:333–337
30. Devillers J, Zakarya D, Chastrette M et al (1989) The stochastic regression analysis as a tool in ecotoxicological QSAR studies. Biomed Environ Sci 2:385–393
31. Duewer DL (1990) The Free-Wilson paradigm redux: significance of the Free-Wilson coefficients, insignificance of coefficient uncertainties and statistical sins. J Chemom 4:299–321
32. Ashby J, Tennant RW (1988) Chemical structure, Salmonella mutagenicity and extent of carcinogenicity as indicators of genotoxic carcinogenesis among 222 chemicals tested in rodents by the U.S. NCI/NTP. Mutat Res 204:17–115
33. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res 659:248–261
34. Devillers J, Mombelli E, Samserà R (2011) Structural alerts for estimating the carcinogenicity of pesticides and biocides. SAR QSAR Environ Res 22:89–106
35. Tsakovska I, Gallegos Saliner A, Netzeva T et al (2007) Evaluation of SARs for the prediction of eye irritation/corrosion potential: structural inclusion rules in the BfR decision support system. SAR QSAR Environ Res 18:221–235
36. Gallegos Saliner A, Tsakovska I, Pavan M et al (2007) Evaluation of SARs for the prediction of skin irritation/corrosion potential: structural inclusion rules in the BfR decision support system. SAR QSAR Environ Res 18:331–342
37. Dearden JC (1990) Physico-chemical descriptors. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
38. Domine D, Devillers J, Chastrette M et al (1992) Multivariate structure-property relationships (MSPR) of pesticides. Pestic Sci 35:73–82
39. Samiullah Y (1990) Prediction of the environmental fate of chemicals. Elsevier, London
40. Mackay D, Di Guardo A, Hickie B et al (1997) Environmental modelling: progress and prospects. SAR QSAR Environ Res 6:1–17
41. Hemond HF, Fechner EJ (1994) Chemical fate and transport in the environment. Academic, San Diego, CA
42. Devillers J (1998) Environmental chemistry: QSAR. In: Schleyer PvR, Allinger NL, Clark T, Gasteiger J, Kollman PA, Schaefer HF, Schreiner PR (eds) The encyclopedia of computational chemistry, vol 2. Wiley, Chichester
43. Devillers J (2007) Application of QSARs in aquatic toxicology. In: Ekins S (ed) Computational toxicology. Risk assessment for pharmaceutical and environmental chemicals. Wiley, Hoboken, NJ
44. Devillers J, Domine D, Bintein S et al (1998) Fish bioconcentration modeling with log P. Toxicol Methods 8:1–10
45. Bintein S, Devillers J (1994) QSAR for organic chemical sorption in soils and sediments. Chemosphere 28:1171–1188
46. Trapp S, Rasmussen D, Samsøe-Petersen L (2003) Fruit tree model for uptake of organic compounds from soil. SAR QSAR Environ Res 14:17–26
47. Sangster J (1997) Octanol-water partition coefficients: fundamentals and physical chemistry. Wiley, Chichester
48. Rekker RF, Mannhold R (1992) Calculation of drug lipophilicity. The hydrophobic fragmental constant approach. VCH, Weinheim
49. Hansch C, Leo A (1995) Exploring QSAR. Fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
50. Devillers J, Domine D, Guillon C (1998) Autocorrelation modeling of lipophilicity with a back-propagation neural network. Eur J Med Chem 33:659–664
51. Domine D, Devillers J (1998) A computer tool for simulating lipophilicity of organic molecules. Sci Comput Autom 15:55–63
52. Devillers J (2000) EVA/PLS versus autocorrelation/neural network estimation of partition coefficients. Perspect Drug Discov Design 19:117–131
53. Yaffe D, Cohen Y, Espinosa G et al (2002) Fuzzy ARTMAP and back-propagation neural networks based quantitative structure-property relationships (QSPRs) for octanol-water partition coefficient of organic compounds. J Chem Inf Comput Sci 42:162–183
54. Tetko IV, Tanchuk VY (2002) Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program. J Chem Inf Comput Sci 42:1136–1145
55. Lyman WJ, Reehl WF, Rosenblatt DH (1990) Handbook of chemical property estimation methods. American Chemical Society, Washington, DC
56. Reinhard M, Drefahl A (1999) Handbook for estimating physicochemical properties of organic compounds. Wiley, New York, NY
57. Boethling RS, Howard PH, Meylan WM (2004) Finding and estimating chemical property data for environmental assessment. Environ Toxicol Chem 23:2290–2308
58. Cronin MTD, Livingstone DJ (2004) Calculation of physicochemical properties. In: Cronin MTD, Livingstone DJ (eds) Predicting chemical toxicity and fate. CRC, Boca Raton, FL
59. Devillers J, Balaban AT (1999) Topological indices and related descriptors in QSAR and QSPR. Gordon and Breach Science Publishers, Amsterdam
60. Kier LB, Hall LH (1986) Molecular connectivity in structure-activity analysis. Wiley, Letchworth
61. Kier LB, Hall LH (1999) Molecular structure description: the electrotopological state. Academic, New York, NY
62. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics: volume I: alphabetical listing/volume II: appendices, references, 2nd edn. Wiley-VCH, Weinheim
63. Topliss JG, Costello RJ (1972) Chance correlations in structure-activity studies using multiple regression analysis. J Med Chem 15:1066–1068
64. Devillers J, Thioulouse J, Karcher W (1993) Chemometrical evaluation of multispecies-multichemical data by means of graphical techniques combined with multivariate analyses. Ecotoxicol Environ Saf 26:333–345
65. Devillers J, Karcher W (1990) Correspondence factor analysis as a tool in environmental SAR and QSAR studies. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
66. Cramer RD, Patterson DE, Bunce JD (1988) Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J Am Chem Soc 110:5959–5967
67. Cramer RD, DePriest SA, Patterson DE et al (1993) The developing practice of comparative molecular field analysis. In: Kubinyi H (ed) 3D QSAR in drug design. Theory methods and applications. ESCOM, Leiden
68. Doucet JP, Panaye A (2010) Three dimensional QSAR: applications in pharmacology and toxicology. CRC, Boca Raton, FL
69. Geladi P, Tosato ML (1990) Multivariate latent variable projection methods: SIMCA and PLS. In: Karcher W, Devillers J (eds) Practical applications of quantitative structure-activity relationships (QSAR) in environmental chemistry and toxicology. Kluwer, Dordrecht
70. Kearsley SK, Smith GM (1990) An alternative method for the alignment of molecular structures: maximizing electrostatic and steric overlap. Tetrahedron Comput Method 3:615–633
71. Korhonen SP, Tuppurainen K, Laatikainen R et al (2005) Comparing the performance of FLUFF-BALL to SEAL-CoMFA with a large diverse estrogen data set: from relevant superpositions to solid predictions. J Chem Inf Model 45:1874–1883
73. Feher M, Schmidt JM (2000) Multiple flexible alignment with SEAL: a study of molecules acting on the colchicine binding site. J Chem Inf Comput Sci 40:495–502
74. Pastor M, Cruciani G, McLay I et al (2000) GRid-INdependent descriptors (GRIND): a novel class of alignment-independent three-dimensional molecular descriptors. J Med Chem 43:3233–3243
75. Klebe G, Abraham U, Mietzner T (1994) Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. J Med Chem 37:4130–4146
76. Serafimova R, Walker J, Mekenyan O (2002) Androgen receptor binding affinity of pesticide active formulation ingredients. QSAR evaluation by COREPA method. SAR QSAR Environ Res 13:127–134
77. Petkov PI, Rowlands JC, Budinsky R et al (2010) Mechanism-based common reactivity pattern (COREPA) modelling of aryl hydrocarbon receptor binding affinity. SAR QSAR Environ Res 21:187–214
78. Turner DB, Willett P (2000) The EVA spectral descriptor. Eur J Med Chem 35:367–375
79. Todeschini R, Gramatica P (1997) The WHIM theory: new 3D molecular descriptors for QSAR in environmental modelling. SAR QSAR Environ Res 7:89–115
80. Selwood DL, Livingstone DJ, Comley JCW et al (1990) Structure-activity relationship of antifilarial antimycin analogues: a multivariate pattern recognition study. J Med Chem 33:136–142
81. Livingstone DJ (1995) The trouble with chemometrics. In: Sanz F, Giraldo J, Manaut F (eds) QSAR and molecular modelling: concepts, computational tools and biological applications. Prous Science, Barcelona
82. Thioulouse J, Devillers J, Chessel D et al (1991) Graphical techniques for multidimensional data analysis. In: Devillers J, Karcher W (eds) Applied multivariate analysis in SAR and environmental studies. Kluwer, Dordrecht
83. Cleveland WS (1994) The elements of graphing data. Hobart Press, Summit
84. Cook RD, Weisberg S (1994) An introduction to regression graphics. Wiley, New York, NY
85. Devillers J, Chezeau A, Thybaud E et al (2002) QSAR modeling of the adult and developmental toxicity of glycols, glycol ethers, and xylenes to Hydra attenuata. SAR QSAR Environ Res 13:555–566
72. Korhonen SP, Tuppurainen K, Laatikainen R et al (2003) FLUFF-BALL, a template-based grid-independent superposition and QSAR technique: validation using a benchmark ste-
roid data set. J Chem Inf Comput Sci 43: 86. Devillers J, Chezeau A, Thybaud E (2002)
17801793 PLS-QSAR of the adult and developmental
1 Methods for Building QSARs 25
of organic pollutants to Brachydanio rerio. In: for designing test series. In: Devillers J (ed)
Turner JE, England MW, Schultz TW et al Genetic algorithms in molecular modeling.
(eds) QSAR88, 3rd international workshop Academic, London
on quantitative structureactivity relationships 128. Devillers J, Bintein S, Domine D et al (1995)
in environmental toxicology, Knoxville A general QSAR model for predicting the
114. Devillers J, Boule P, Vasseur P et al (1990) toxicity of organic chemicals to luminescent
Environmental and health risks of hydroqui- bacteria (Microtox test). SAR QSAR Envi-
none. Ecotoxicol Environ Saf 19:327354 ron Res 4:2938
115. Cruciani G, Clementi S, Baroni M (1993) 129. Anonymous (1998) QSARs in the assessment
Variable selection in PLS analysis. In: Kubinyi of the environmental fate and effects of che-
H (ed) 3D QSAR in drug design. Theory, micals. Technical report no. 74. ECETOC,
methods and applications. ESCOM, Leiden Brussels
116. Cruciani G, Baroni M, Bonelli D et al (1990) 130. Schultz TW, Sinks GD, Bearden AP (1998)
Comparison of chemometric models for QSAR in aquatic toxicology: a mechanism of
QSAR. Quant Struct Act Relat 9:101107 action approach comparing toxic potency to
117. Efron B, Tibshirani RJ (1993) An introduc- Pimephales promelas, Tetrahymena pyriformis,
tion to the bootstrap. Chapman & Hall, and Vibrio fischeri. In: Devillers J (ed) Com-
New York, NY parative QSAR. Taylor and Francis, Washing-
118. Gray HL, Baek J, Woodward WA et al (1996) ton, DC
A bootstrap generalized likelihood ratio test 131. Hansch C, Gao H, Hoekman D (1998) A
in discriminant analysis. Comput Stat Data generalized approach to comparative QSAR.
Anal 22:137158 In: Devillers J (ed) Comparative QSAR. Tay-
119. Jonathan P, McCarthy WV, Roberts AMI lor & Francis, Washington, DC
(1996) Discriminant analysis with singular 132. Selassie CD, Klein TE (1998) Comparative
covariance matrices. A method incorporating quantitative structure activity relationships
cross-validation and efficient randomized per- (QSAR) of the inhibition of dihydrofolate
mutation tests. J Chemom 10:189213 reductase. In: Devillers J (ed) Comparative
120. Tropsha A, Gramatica P, Gombar VK (2003) QSAR. Taylor & Francis, Washington, DC
The importance of being earnest: validation is 133. Kim KH (1995) Comparison of classical
the absolute essential for successful applica- QSAR and comparative molecular field analy-
tion and interpretation of QSPR models. sis. Toward lateral validations. In: Hansch C,
QSAR Comb Sci 22:6977 Fujita T (eds) Classical and three-dimensional
121. Kolossov E, Stanforth R (2007) The quality QSAR in agrochemistry. ACS symposium
of QSAR models: problems and solutions. series 606, American Chemical Society,
SAR QSAR Environ Res 18:89100 Washington, DC
122. Gramatica P (2007) Principles of QSAR mod- 134. Devillers J (2009) Endocrine disruption
els validation: internal and external. QSAR modeling. CRC, Boca Raton, FL
Comb Sci 26:694701 135. Devillers J, Marchand-Geneste N, Dore JC
123. Rucker C, R ucker G, Meringer M (2007) et al (2007) Endocrine disruption profile
y-Randomization and its variants in QSPR/ analysis of 11,416 chemicals from chemome-
QSAR. J Chem Inf Model 47:23452357 trical tools. SAR QSAR Environ Res 18:181
193
124. Domine D, Devillers J, Chastrette M (1994)
A nonlinear map of substituent constants for 136. Regulation (EC) no 1907/2006 of the Euro-
selecting test series and deriving structure pean parliament and of the Council of 18
activity relationships. I. Aromatic series. J December 2006 concerning the Registration,
Med Chem 37:973980 Evaluation, Authorisation and Restriction of
Chemicals (REACH), establishing a Euro-
125. Domine D, Devillers J, Chastrette M (1994) pean Chemicals Agency, amending Directive
A nonlinear map of substituent constants for 1999/45/EC and repealing Council Regula-
selecting test series and deriving structure tion (EEC) No 793/93 and Commission
activity relationships. II. Aliphatic series. J Regulation (EC) No 1488/94 as well as
Med Chem 37:981987 Council Directive 76/769/EEC and Com-
126. Domine D, Devillers J, Wienke D et al (1996) mission Directives 91/155/EEC, 93/67/
Test series selection from nonlinear neural EEC, 93/105/EC and 2000/21/EC. Jour-
mapping. Quant Struct Act Relat 15:395 nal L396 30.12.2006
402 137. Anonymous, The principles for establishing
127. Putavy C, Devillers J, Domine D (1996) the status of development and validation of
Genetic selection of aromatic substituents
1 Methods for Building QSARs 27
Abstract
Computer-based representation of chemicals makes it possible to organize data in chemical databases:
collections of chemical structures and associated properties. Databases are widely used wherever efficient
processing of chemical information is needed, including search, storage, retrieval, and dissemination.
This chapter considers the structure and functionality of chemical databases. The typical kinds of
information found in a chemical database are considered: identification, structural, and associated data.
The functionality of chemical databases is presented, with examples of search and access types. More details
are included about the OASIS database and platform and the Danish (Q)SAR Database online. Various types of
chemical database resources are discussed, together with a list of examples.
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_2, © Springer Science+Business Media, LLC 2013
30 N. Nikolov et al.
2. Materials
(a)
c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1

(b)
000090-72-2
426 414343D 000090-72-2
Database Manager 4.3 v.1.5
46 46 1 2 V2000
-0.7465 -1.1911 -0.7348 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.4785 -2.4503 -1.1019 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.5861 -3.5982 -1.1975 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.6256 -4.2608 -2.4772 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.6593 -4.4890 -0.0684 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0143 -0.4830 -1.7121 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0489 -0.9976 -2.9860 O 0 0 0 0 0 0 0 0 0 0 0 0
0.6557 0.7087 -1.3793 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3872 1.5197 -2.4066 C 0 0 0 0 0 0 0 0 0 0 0 0
2.6248 0.9153 -2.8926 N 0 0 0 0 0 0 0 0 0 0 0 0
3.2472 1.7297 -3.9152 C 0 0 0 0 0 0 0 0 0 0 0 0
3.5459 0.5461 -1.8408 C 0 0 0 0 0 0 0 0 0 0 0 0
0.5952 1.1731 -0.0620 C 0 0 0 0 0 0 0 0 0 0 0 0
...
1 2 1 0 0 0 0
1 6 2 0 0 0 0
1 19 1 0 0 0 0
2 3 1 0 0 0 0
2 20 1 0 0 0 0
2 21 1 0 0 0 0
3 4 1 0 0 0 0
3 5 1 0 0 0 0
4 22 1 0 0 0 0
4 23 1 0 0 0 0
4 24 1 0 0 0 0
5 25 1 0 0 0 0
...
M END
> <CHEMICAL>
Phenol, 2,4,6-tris(dimethylamino)methyl-
> <CAS>
90722
$$$$

Fig. 2. (a) SMILES and (b) SD record representations of the same chemical structure.
3. Methods
Fig. 4. Substructure search: (a) Fixed fragment, substructure mode, (b) varying atoms (R stands for any atom),
substructure mode, (c) varying atoms (R stands for any atom), no other heavy atoms are allowed except those found
in the fragment.
4. Examples
4.1. The OASIS Database Platform for Creating Chemical Databases
The OASIS Database platform is a software framework for building
chemical databases (9). It contains a database schema and accompanying
software for management of chemical information. It is a
part of the OASIS chemical software (19). The platform provides
an extensive list of features, such as:
• Storage of chemical 2D and 3D structures.
• User-defined structural (2D), conformational (3D), and atomic
  descriptors and model (test protocol) information.
Fig. 7. A search tree in OASIS Database Manager. Q0: CAS RN between 200,000 and
1,000,000; Q1: structure found in TSCA; Q2: structure found in IUCLID; Q3: MOL_WEIGHT
> 300; Q5: contains fragment c1ccccc1RX1 where the wildcard atom RX1 is one of F, Cl,
Br, I; Q6: there exist two oxygen atoms at a distance of min. 10 Å and max. 12 Å; single
queries are displayed green.
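A 3D distance query such as Q6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OASIS implementation: the conformer coordinates are hypothetical, and the rule that a structure matches when at least one of its stored conformers satisfies the constraint is an assumption made for the example.

```python
import math
from itertools import combinations


def has_atom_pair_in_range(atoms, element, d_min, d_max):
    """Return True if two atoms of `element` lie between d_min and d_max apart.

    `atoms` is a list of (element symbol, (x, y, z)) tuples, coordinates in angstroms.
    """
    coords = [xyz for symbol, xyz in atoms if symbol == element]
    return any(d_min <= math.dist(a, b) <= d_max
               for a, b in combinations(coords, 2))


def structure_matches(conformers):
    # Assumption: a structure is a hit if any stored conformer satisfies Q6.
    return any(has_atom_pair_in_range(c, "O", 10.0, 12.0) for c in conformers)


# Hypothetical conformers (not taken from any real database record).
extended = [("O", (0.0, 0.0, 0.0)), ("C", (1.4, 0.0, 0.0)), ("O", (11.0, 0.0, 0.0))]
folded = [("O", (0.0, 0.0, 0.0)), ("C", (1.4, 0.0, 0.0)), ("O", (2.3, 0.0, 0.0))]

print(structure_matches([folded, extended]))
print(structure_matches([folded]))
```

Because the constraint is evaluated per conformer, the result depends on how completely the stored conformers cover the conformational space.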
4.2. OASIS Centralized Database
A centralized 3D database was built using the OASIS software
framework for building databases and managing chemical information (9).
Chemicals which are under regulation by the respective
government agencies are considered as existing. The individual
databases of regulatory agencies in North America and Europe,
including IUCLID of European Chemicals Bureau (with 61,573
chemicals), Danish EPA (159,448 chemicals), TSCA (56,882 chemicals),
HPVC_EU (4,750 chemicals), HPVC_USEPA (10,546
chemicals) and pesticides active/inactive ingredients of the US
EPA (1,379), DSL of Environment Canada (10,851 chemicals),
and Japanese METI (16,811) were combined in the database
(Fig. 8). The structural information for all chemicals is pre-calculated
in terms of conformer multiplication of all chemicals
and quantum-chemical optimization of each conformer, using in-house
algorithms for complete coverage of the conformational
space. The 2D-3D migration, conformational multiplication, and
quantum-chemical evaluation are described in refs. 20, 21. Presently
the database contains approximately 185,500 structures
and 3,700,000 conformers with hundreds of millions of descriptor
data items.
The pre-calculation of the chemicals in the Centralized 3D
database, combined with the flexible searching capabilities (at the 2D
and 3D levels), allows testing hypotheses on the structural conditioning
of modeled endpoints. Thus, the search of the database for
chemicals which could elicit significant estrogen receptor binding
affinity with earlier defined 3D structural pattern (22),
2 Accessing and Using Chemical Databases 41
4.3. The Danish (Q)SAR Database Online
Chemical databases are often set up on the Web and coupled with
the necessary tools for Web-based search/retrieval in order to
provide wide access to the contained data. Many of these resources
are free or contain free functionality. Below we consider in detail the
Danish (Q)SAR Database as an example of a chemical database
available online; the next section lists some more examples.
The Danish Environment Protection Agency (Danish EPA) has
for a number of years worked with the development and use of
computer models for prediction of properties of chemical substances (23).
(Quantitative) Structure-Activity Relationships, or
(Q)SARs, are relations between structural properties of chemical
substances and some other property. The other property can be a
physical-chemical property or a biological activity, including the
ability to cause toxic effects. The Danish EPA together with the
Danish National Food Institute at the Technical University of Denmark
has created a database integrating predictions from more than
45 (Q)SAR models on endpoints for physicochemical properties,
eco-toxicity, absorption, metabolism, and toxicity (23). More than
half of all the estimates are for mammalian (human) toxicity endpoints
and include commercial data sets as well as many models
developed in-house.
A structure set of about 166,000 discrete organic chemicals,
including all discrete organic European Inventory of Existing
Chemical Substances (EINECS) substances whose structure was
available (57,014), has been batched through the models, and the
results are integrated in the database. When examining substances
in the database, predictions from all models for the substance are
displayed together with domain information giving a (Q)SAR
profile and allowing for interpretation of the agreements or
disagreements between models for related endpoints (Fig. 9).
Substructure comparisons can also be performed.
A reduced version of the Danish (Q)SAR Database has been
published online with support from the former European Chemi-
cals Bureau and is available at refs. (24, 25). The implementation
Fig. 9. (continued)
4.4. Other Examples of Chemical Database Resources
Chemical databases on the Internet are so abundant that it is well
beyond the scope of this chapter to present an exhaustive list. For a
list of available chemical resources see, e.g., refs. (30, 31); a list of
toxicology-related chemical databases is presented in ref. 32. The
examples below are just illustrative of different types of chemical
database resources.
While databases can be used to store various types of chemical
information, we focus on databases of chemical structures and
properties.
Database names in italics refer to the example list of chemical
databases (Subheading 6); the list also includes the relevant litera-
ture and Web page references.
Some databases are primarily used to register the identity of
chemical structures from a specific collection, e.g., from a regulatory
list of substances or from a list of substances sharing common
features. An example is the ECHA Database of registered substances
under REACH. The identity may include a representation of the
chemical structure, chemical names, or identification numbers.
General-purpose registration databases, i.e., registration databases
aiming to include a very large and diverse collection of chemicals,
are also known (CAS Registry database, PubChem database,
etc.). Databases may introduce a registry number: an identification
number or text string assigned to all chemicals within the specific
database (e.g., CAS registry number, PubChem ID, EC Number).
Non-registration data collections focus on some properties of
interest besides the identity of the contained chemicals (for example,
RTECS).
The type of chemical structures contained in chemical data-
bases varies. PDB is an example of a protein database and many
databases also contain inorganic and organometallic compounds
(e.g., Reaxys); however, the majority of the resources we mention
here deal with small organic compounds.
Some chemical database systems provide solely their fixed content,
while others offer data as well as software tools for building
new chemical databases based on users' data (e.g., OASIS Database
Platform, Accelrys, and many more).
Ordinary Database Management Systems (DBMS) do not have
means to deal specifically with chemical information. However,
5. Relational Databases
This section covers some notions from the theory of relational
databases, along with relevant examples. A more detailed introduc-
tion can be found in, e.g., ref. (33).
A relational database (34) organizes the data in the form of
tables, and all rows of a table must have the same structure
(Table 1).
The columns of the table, in contrast, may be of different data
types, such as number, text, binary flag (yes or no), date, etc. In
database terminology, a table is called a relation, a table row is a
record or a tuple, and a column is an attribute.
Database search is the selection of a set of rows from a table
that match a given condition. For example, selecting rows where
log P > 3 will return the second row of the table. More generally,
a search may only return some of the attributes of a table: for
example, a search for CAS RN of the rows where log P > 3 will
only return one value, 97-23-4, and not a whole record.
Table 1
A relational database table
Table 2
A non-normalized relational database table
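The two kinds of search described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the software discussed in this chapter: the table and column names are hypothetical, and the two rows are those listed in Table 3a.

```python
import sqlite3

# In-memory database with one relation; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chemicals (cas_rn TEXT, smiles TEXT, mol_wt REAL, log_p REAL)"
)
conn.executemany(
    "INSERT INTO chemicals VALUES (?, ?, ?, ?)",
    [
        ("90-72-2", "c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1", 265.4, 0.77),
        ("97-23-4", "C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1", 269.13, 4.34),
    ],
)

# Selection: whole records (tuples) that satisfy the condition.
full_rows = conn.execute("SELECT * FROM chemicals WHERE log_p > 3").fetchall()

# Selection plus projection: only the CAS RN attribute of the matching rows.
cas_only = conn.execute("SELECT cas_rn FROM chemicals WHERE log_p > 3").fetchall()

print(full_rows)
print(cas_only)
```

The first query returns the complete second record; the second returns only its CAS RN.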
Any attribute or attribute combination may be indexed:
assigned auxiliary data to improve search performance. Search
involving conditions on indexed attributes is much faster, at the
price of taking extra disk space for the auxiliary data and slightly
increasing the insertion and modification time.
A column or a combination of columns that identifies uniquely
a whole record is called a key; in the above table, CAS RN can be
chosen as a key, since the remaining attributes may coincide by
chance with those for another chemical while CAS will not.
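Keys and indexes can be illustrated with the same kind of sketch, again using sqlite3 and hypothetical table, column, and index names. Declaring CAS RN as the primary key makes the database reject a second record with the same value, and an index on log P gives the query planner auxiliary data it may use for searches on that attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# cas_rn is declared as the key: it must identify each record uniquely.
conn.execute(
    "CREATE TABLE chemicals ("
    "cas_rn TEXT PRIMARY KEY, smiles TEXT, mol_wt REAL, log_p REAL)"
)
# Auxiliary index on log_p: faster searches, slightly slower inserts and updates.
conn.execute("CREATE INDEX idx_log_p ON chemicals (log_p)")

conn.execute(
    "INSERT INTO chemicals VALUES (?, ?, ?, ?)",
    ("97-23-4", "C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1", 269.13, 4.34),
)

# A second record with the same key value violates the uniqueness constraint.
try:
    conn.execute(
        "INSERT INTO chemicals VALUES (?, ?, ?, ?)",
        ("97-23-4", "C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1", 269.13, 4.34),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# The query plan reports how SQLite intends to run the search; the exact
# wording depends on the SQLite version, so it is printed rather than parsed.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT cas_rn FROM chemicals WHERE log_p > 3"
).fetchall()

print(duplicate_rejected)
print(plan)
```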
Further, to store multiple values (e.g., chemical names) for a
single chemical, we may choose to add a table row for every value
with the disadvantage that 2D structure and descriptors have to be
repeated (Table 2).
A better solution is to introduce a second table for names
(Table 3): here, CAS RN will serve as the link between the tables,
the first table will still contain unique records for every chemical,
and data redundancy will be avoided. The relational database theory
defines formally how to avoid data redundancy in the general
case. A normalized database, i.e., one designed according to the principles
recommended by the theory, will be easier to access and less prone
to programming errors.
The two-table construction presumes that for every record in
the second table, there is a record with the same CAS RN in the first
table (defining the structure). This referential integrity constraint
(relevant data is expected when a table refers to another) should be
enforced during all database operations: deletion in Table 3a
should invoke deletion of the related records in Table 3b, etc.
Table 3
A two-table database representing (a) CAS RN, 2D structure,
and descriptors and (b) chemical names

(a)
CAS RN    2D structure                           Mol. wt.   Log P
90-72-2   c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1    265.4      0.77
97-23-4   C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1      269.13     4.34

(b)
CAS RN    Chemical name
90-72-2   2,4,6-Tris(dimethylamino)methylphenol
97-23-4   Dichlorophen
97-23-4   2,2'-Methylenebis 4-chlorophenol
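The two-table design of Table 3 and its referential integrity constraint can be sketched as follows. This is an illustration with sqlite3 under stated assumptions: the table and column names are hypothetical, the cascade-on-delete rule is one possible way to enforce the constraint, and the CAS RN 50-00-0 (formaldehyde) is used only as an example of a value absent from the first table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Table 3a: one record per chemical.
conn.execute(
    "CREATE TABLE structures ("
    "cas_rn TEXT PRIMARY KEY, smiles TEXT, mol_wt REAL, log_p REAL)"
)
# Table 3b: any number of names per chemical, linked through cas_rn.
conn.execute(
    "CREATE TABLE names ("
    "cas_rn TEXT NOT NULL REFERENCES structures(cas_rn) ON DELETE CASCADE, "
    "name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO structures VALUES (?, ?, ?, ?)",
    [
        ("90-72-2", "c1(CN(C)C)c(O)c(CN(C)C)cc(CN(C)C)c1", 265.4, 0.77),
        ("97-23-4", "C1(O)c(Cc2c(O)ccc(Cl)c2)cc(Cl)cc1", 269.13, 4.34),
    ],
)
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [
        ("90-72-2", "2,4,6-Tris(dimethylamino)methylphenol"),
        ("97-23-4", "Dichlorophen"),
        ("97-23-4", "2,2'-Methylenebis 4-chlorophenol"),
    ],
)
conn.commit()

# Names of all chemicals with log P > 3, obtained by joining the two tables.
lipophilic_names = conn.execute(
    "SELECT n.name FROM names n JOIN structures s ON s.cas_rn = n.cas_rn "
    "WHERE s.log_p > 3 ORDER BY n.name"
).fetchall()

# A name that refers to a structure missing from the first table is rejected.
try:
    conn.execute("INSERT INTO names VALUES (?, ?)", ("50-00-0", "Formaldehyde"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting a structure also deletes its names (ON DELETE CASCADE).
conn.execute("DELETE FROM structures WHERE cas_rn = ?", ("97-23-4",))
names_left = conn.execute(
    "SELECT COUNT(*) FROM names WHERE cas_rn = ?", ("97-23-4",)
).fetchone()[0]

print(lipophilic_names, orphan_rejected, names_left)
```

Here the structure and descriptors of each chemical are stored once, however many names it has.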
Modern databases handle this and other related issues using the
mechanism of transactions (grouped modifications). Two or more
modification operations can be declared to constitute a transaction,
and then they either succeed as a whole or fail as a whole: if a
record is deleted from Table 3a, the database will wait for records
with the same CAS number to be deleted from Table 3b and then
confirm the whole group of operations at once. If anything fails,
none of the operations is confirmed and the database restores its
original state. This is implemented at the level of the DBMS, the
software that controls all operations on the database and provides
data to the other application programs that use it.
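The all-or-nothing behavior of a transaction can be sketched with sqlite3, whose connection object commits a group of operations on success and rolls them back if an exception escapes. The table names and the simulated failure are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE structures (cas_rn TEXT PRIMARY KEY, log_p REAL)")
conn.execute("CREATE TABLE names (cas_rn TEXT, name TEXT)")
conn.execute("INSERT INTO structures VALUES ('97-23-4', 4.34)")
conn.execute("INSERT INTO names VALUES ('97-23-4', 'Dichlorophen')")
conn.commit()


def delete_chemical(conn, cas_rn, fail_midway=False):
    """Delete a chemical from both tables as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM structures WHERE cas_rn = ?", (cas_rn,))
            if fail_midway:
                raise RuntimeError("simulated failure between the two deletions")
            conn.execute("DELETE FROM names WHERE cas_rn = ?", (cas_rn,))
        return True
    except RuntimeError:
        return False


def record_counts(conn):
    n_structures = conn.execute("SELECT COUNT(*) FROM structures").fetchone()[0]
    n_names = conn.execute("SELECT COUNT(*) FROM names").fetchone()[0]
    return n_structures, n_names


# First attempt fails midway: neither deletion is confirmed.
first_ok = delete_chemical(conn, "97-23-4", fail_midway=True)
after_failure = record_counts(conn)

# Second attempt succeeds: both deletions are confirmed together.
second_ok = delete_chemical(conn, "97-23-4")
after_success = record_counts(conn)

print(first_ok, after_failure, second_ok, after_success)
```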
6. Examples of Chemical Database Resources Online
The following list contains more information and references about
some chemical databases online:
1. Accelrys Databases provide supplier information as well as
various collections of biological and chemical data.
Accelrys Registration is a software tool for creation and management
of chemical databases (35).
2. Aggregated Computational Toxicology Resource (ACToR) is a
collection of databases collated or developed by the US EPA
National Center for Computational Toxicology (NCCT) (36).
3. ChEBI is a database with both manual and programmatic
access, literature references for compounds, and a chemical
ontology (37, 38).
4. CAS Registry database is the chemical substance database of
CAS, the most authoritative and comprehensive source for
chemical and scientific information, also publishing Chemical
Abstracts, with millions of documents from the chemical jour-
nal and patent literature (39). The registry database contains
more than 56 million substances.
References
1. Halpin TA, Morgan AJ (2008) Information modeling and relational databases, 2nd edn. In: The Morgan Kaufmann series in data management systems. Elsevier, Amsterdam
2. Dalby A, Nourse JG, Hounshell DW et al (1992) Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inf Comput Sci 32:244–255
3. Hopfinger AJ, Wang S, Tokarski JS et al (1997) Construction of 3D-QSAR models using the 4D-QSAR analysis formalism. J Am Chem Soc 119:10509–10524
4. Miller M (2002) Chemical database techniques in drug discovery. Nat Rev Drug Discov 1:220–227. doi:10.1038/nrd745
5. Barnard JM (1993) Substructure searching methods: old and new. J Chem Inf Comput Sci 33:532–538
6. Jamil H (2011) Computing subgraph isomorphic queries using structural unification and minimum graph structures. Proc. of the 26th ACM symposium on applied computing SAC 2011, pp 1058–1065. doi:10.1145/1982185.1982415
7. Ullmann JR (1976) An algorithm for subgraph isomorphism. J Assoc Mach 23(1):31–42. doi:10.1145/321921.321925
8. Willett P (2003) Similarity searching in chemical structure databases. In: Gasteiger J (ed) Handbook of chemoinformatics. Wiley, Weinheim
9. Nikolov N, Grancharov V, Stoyanova G et al (2006) Representation of chemical information in OASIS Centralized 3D database for existing chemicals. J Chem Inf Model 46(6):2537–2551
10. Park J, Rosania GR, Shedden KA et al (2009) Automated extraction of chemical structure information from digital raster images. J Chem Cent 3(1):1–16. doi:10.1186/1752-153X-3-4
11. http://opsin.ch.cam.ac.uk/index.html
12. Hardy B, Douglas N, Helma C et al (2010) Collaborative development of predictive toxicology applications. J Cheminform 2(1):7
13. Stein LD (2008) Towards a cyberinfrastructure for the biological sciences: progress, visions and challenges. Nat Rev Genet 9:678–688. doi:10.1038/nrg2414
14. http://www.w3.org/standards/webofservices
15. Murray-Rust P, Rzepa H, Wright M et al (2000) A universal approach to web-based chemistry using XML and CML. Chem Commun 1471–1472
16. Nature Vol. 451 (7179):648–651. http://www.w3.org/standards/semanticweb
17. Murray-Rust P (2008) Chemistry for everyone. Nature 451:648–651. doi:10.1038/451648a
18. http://www.w3.org/standards/semanticweb/ontology
19. http://www.oasis-lmc.org
20. Mekenyan O, Pavlov T, Grancharov V et al (2005) 2D-3D migration of large chemical inventories with conformational multiplication. Application of the genetic algorithm. J Chem Inf Model 45(2):283–292
21. Mekenyan O, Dimitrov D, Nikolova N et al (1999) Conformational coverage by a genetic algorithm. J Chem Inf Comput Sci 39(6):997–1016. doi:10.1021/ci990303g
22. Mekenyan OG, Kamenska V, Serafimova R et al (2002) Development and validation of an average mammalian estrogen receptor-based (Q)SAR model. In: Mekenyan O, Schultz TW (eds) Proceedings of quantitative structure-activity relationships in environmental sciences-IX. SAR QSAR Environ Res 13(6):579–595
23. http://www.mst.dk/English/Chemicals/Substances_and_materials/qsar
24. http://ecbqsar.jrc.it
Abstract
Quantitative structure-activity relationship (QSAR) modeling is the most frequently used approach to
explore the dependency of biological, toxicological, or other types of activities/properties of chemicals on
their molecular features. In the past two decades, QSAR modeling has been used extensively in the drug
discovery process. However, the predictive models resulting from QSAR studies have limited use for
chemical risk assessment, especially for animal and human toxicity evaluations, due to their low predictivity
for new compounds. To develop enhanced toxicity models with independently validated external prediction
power, novel modeling protocols have been pursued by computational toxicologists based on the rapidly
increasing toxicity testing data of recent years. This chapter reviews the recent effort in our laboratory to
incorporate biological testing results as descriptors in the toxicity modeling process. This effort extended the
concept of QSAR to the quantitative structure in vitro-in vivo relationship (QSIIR). The QSIIR study
examples provided in this chapter indicate that QSIIR models based on hybrid (biological
and chemical) descriptors are indeed superior to conventional QSAR models based only on
chemical descriptors for several animal toxicity endpoints. We believe that the applications introduced in
this review will be of interest and value to researchers working in the field of computational drug discovery
and environmental chemical risk assessment.
Key words: QSAR, QSIIR, Computational toxicology, HTS, Predictive model, Compounds, Chemical
descriptors, Biological descriptors
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_3, © Springer Science+Business Media, LLC 2013
54 H. Zhu
2. Availability of Large Compound Collections for In Vivo and In Vitro Toxicity Evaluation
Since the 1990s, great efforts to develop toxicity testing methods
have generated an extensive amount of toxicity data (18). However,
most of the available toxicity databases that house these data are not
suitable for developing QSAR toxicity models. All cheminformatics
tools require the biological data to be associated with molecular
structures, and these are not included in many existing
databases. Furthermore, the testing data may not be easily accessible
to modelers, or their quality may be questionable (18). In the end, the
existing errors of chemical structures also greatly affect the reliability
3 From QSAR to QSIIR: Searching for Enhanced Computational Toxicology Models 55
Table 1
Publicly available databases of in vivo toxicity endpoints
Table 2
Public databases of in vitro toxicity endpoints

Database: NCGC qHTS cytotoxicity data (2), available through PubChem (24) (via PubChem AID#)
Description: Concentration-response profiles of 1,408 substances screened for their effects
on cell viability are available through PubChem for 13 cell lines: HepG2
(human hepatoma; AID #433), H-4-II-E (rat hepatoma; AID #543), BJ
(human foreskin fibroblast; AID #421), Jurkat (clone E6-1, human acute T
cell leukemia; AID #426), HEK293 (transformed human embryonic kidney
cell; AID #427), MRC-5 (human lung fibroblast; AID #434), SK-N-SH
(human neuroblastoma; AID #435), N2a (mouse neuroblastoma; AID
#540), NIH 3T3 (mouse embryonic fibroblast; AID #541), HUV-EC-C
(human vascular endothelial cell; AID #542), SH-SY-5Y (human
neuroblastoma, subclone of SK-N-SH; AID #544), Renal Proximal Tubule
(rat kidney cell; AID #545) and Mesenchymal (human renal glomeruli cell;
AID #546). Each compound was tested at 14 concentrations ranging from
0.006 to 92 μM and the response was measured as % change in cell viability as
compared to vehicle control at each concentration

Database: ChEMBLdb (29), available through PubChem (24) (via PubChem AID#)
Description: A database of bioactive drug-like small molecules abstracted and curated from
the primary scientific literature. Bioactivities are represented by binding
constants, pharmacology and ADMET data. ChEMBL assays are available
through PubChem. Human toxicity related endpoints are primarily from
in vitro data, such as: cytotoxicity on SNU-354 cells (hepatoma cell line, AID
#200819), antiproliferative action on L02 cells (normal hepatocytes, AID
#416061), growth inhibition of SK-Hep1 cells (liver adenocarcinoma cell
line, AID #201649), cytotoxicity and anticancer activity on HepG2 cells (AID
#86696, 340104, 421266), etc.

Database: ToxCast (30)
Description: Phase I (August 7 2009 update) provided 304 unique compounds characterized
in over 600 HTS endpoints. The endpoints include biochemical assays of
protein function, cell-based transcriptional reporter and gene expression, cell
line and primary cell functional, and developmental endpoints in zebrafish
embryos and embryonic stem cells. Additionally, mapping of these assays to
315 genes and 438 pathways was made publicly available. Phase II will
complete screens of an additional 700 compounds; HTS data on nearly 10,000
chemicals will be available through the Tox21 collaboration in 2010

Database: ToxNET (31)
Description: A data network covering toxicology, hazardous chemicals, environmental health
and related areas. Managed by US National Library of Medicine
3. QSAR and Current Challenge of Computational Toxicology
Computational toxicology modeling relies on QSAR
approaches to build toxicity models from available reference
data. It traditionally derives the computed properties solely from
the molecular structures, as defined by descriptors or fingerprints,
and has been broadly used to predict the side effects that chemicals
pose to human or animal health. The quantitative structure-toxicity
relationship (QSTR) models developed in computational toxicity
studies depend strongly on the QSAR approaches used in the
modeling process. Optimizing the selection or weighting of
variables is the core component of a QSAR approach. This
procedure selects only the most meaningful and statistically
significant subset of available chemical descriptors in terms of
correlation with biological activity. The optimum selection is
achieved by combining stochastic search methods such as
generalized simulated annealing (33), genetic algorithms (34), or
evolutionary algorithms (35) with correlation techniques such as
MLR, PLS analysis, or artificial neural networks (33–36).
Recent research has emphasized model validation as the key
component of QSAR modeling (37). Tropsha (37, 38) and
others (39–42) have demonstrated that various commonly
accepted statistical characteristics of QSAR models derived for
a training set are insufficient to establish and estimate the predictive
power of QSTR models. The only way to ensure the high
predictive power of a QSAR model is to demonstrate a significant
correlation between predicted and observed activities of
compounds for a validation (test) set that was not employed
in model development. This goal can be achieved by dividing
an experimental SAR dataset into training and test sets,
which are used for model development and validation, respectively.
It is believed that special approaches should be used to
select a training set to ensure the highest significance and predictive
power of QSAR models (43, 44). Recent reviews and
publications describe several algorithms that can be employed
for such division (37, 38, 44).
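The external validation workflow described above can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the data are synthetic, the model is a one-descriptor least-squares line standing in for a real QSAR method, and the split is a simple random division rather than one of the rational division algorithms cited above.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Synthetic data set: (descriptor value, observed activity) pairs.
rng = random.Random(0)
data = [(0.5 * i, 0.8 * (0.5 * i) + 1.0 + rng.gauss(0, 0.3)) for i in range(20)]

# Random division into training and external test sets (75% / 25%).
rng.shuffle(data)
train, test = data[:15], data[15:]

# The model is developed on the training set only.
train_x, train_y = zip(*train)
slope, intercept = fit_line(train_x, train_y)

# Predictive power is judged on compounds never used in model development.
predicted = [slope * x + intercept for x, _ in test]
observed = [y for _, y in test]
r_external = pearson_r(predicted, observed)

print(f"external r = {r_external:.3f}")
```

The fitting step never sees the test compounds, so the correlation on the test set estimates external predictivity rather than goodness of fit.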
Most previous QSAR studies have shown that the currently available
QSTR models do not perform well in evaluating in vivo toxicity
potential, especially for external compounds. For this reason, several
recently published reviews have challenged the feasibility and
reliability of using QSAR approaches in toxicity studies (45, 46).
Disappointing results could often be linked to key aspects of
the modeling procedure, many of which relate to the original data
and their interpretation. Similarly, Lombardo et al. (47) noted
that not much progress has been made in developing robust and
58 H. Zhu
predictive models, and that the lack of accurate data, together with
the use of questionable modeling end-points, has hindered real
progress. Most chemical toxicity models (predictors) are either
reported in the literature but not available to the research community,
or available as commercial software that is not universally
successful, as discussed above. These examples illustrate that,
although individual successes have indeed been reported, there
remains a strong need to develop widely accessible and reliable
computational toxicology modeling techniques and specific
end-point predictors.
4. Quantitative Structure In Vitro–In Vivo Relationship
To stress the broad appeal of the conventional QSAR approach, it
should be made clear that, from a statistical viewpoint, QSAR
modeling is a special case of general statistical data mining and
data modeling in which the data are formatted to represent objects
described by multiple descriptors, and a robust correlation
between the descriptors and a target property (e.g., chemical toxicity
in vivo) is sought. In previous computational toxicology studies,
additional physicochemical properties, such as the octanol–water
partition coefficient (logP) (48), water solubility (49), and melting
point (50), were used successfully to augment computed chemical
descriptors and improve the predictive power of QSAR models. These
studies suggest that using experimental results as descriptors in QSAR
modeling could prove beneficial. The currently available and still
rapidly growing HTS data for large and diverse chemical libraries
make it possible to extend the scope of conventional QSAR in
toxicity studies by using in vitro testing results as extra biological
descriptors. Accordingly, some of the most recent toxicology studies
have examined the relationships between various in vitro and in vivo
toxicity testing results (51–54). Based on these reports, we
proposed a new modeling workflow called quantitative structure
in vitro–in vivo relationship (QSIIR) and used it in animal toxicity
modeling studies (55–57). The target properties of QSIIR modeling
are still biological activities, such as different animal toxicity
endpoints, but the content and interpretation of the descriptors and
of the resulting models vary. This focus on predicting the
same target property from different (chemical, biological, and
genomic) characteristics of environmental agents affords an
opportunity to explore most fully the source-to-outcome continuum of
modern experimental toxicology using cheminformatics
approaches.
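In data terms, the QSIIR idea amounts to appending in vitro results to the computed chemical descriptors of each compound. The sketch below shows one way to assemble such hybrid descriptor vectors, assuming that every column is range-scaled to [0, 1] so that chemical and biological descriptors contribute on a comparable scale. The scaling choice and the function names are assumptions for illustration and are not taken from refs. 55–57.

```python
def range_scale(matrix):
    """Scale every column of `matrix` to the interval [0, 1]."""
    n_cols = len(matrix[0])
    lows = [min(row[j] for row in matrix) for j in range(n_cols)]
    highs = [max(row[j] for row in matrix) for j in range(n_cols)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in matrix
    ]


def hybrid_descriptors(chemical, biological):
    """Concatenate chemical and biological descriptors per compound, then scale."""
    if len(chemical) != len(biological):
        raise ValueError("both blocks must describe the same compounds")
    combined = [list(c) + list(b) for c, b in zip(chemical, biological)]
    return range_scale(combined)


# Three compounds: two chemical descriptors plus one in vitro activity call.
chem = [[1.0, 10.0], [3.0, 20.0], [2.0, 15.0]]
bio = [[0.0], [1.0], [1.0]]
hybrid = hybrid_descriptors(chem, bio)
```

The resulting matrix can be passed to any modeling method that accepts a descriptor table, so the target property and the learning algorithm remain unchanged while the feature space is extended.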
3 From QSAR to QSIIR: Searching for Enhanced Computational Toxicology Models 59
5. Case Studies
5.1. Using Hybrid Descriptors for QSIIR Modeling of Rodent Carcinogenicity
To explore efficient approaches for rapid evaluation of the chemical
toxicity and human health risk of environmental compounds, the NTP,
in collaboration with the NIH Chemical Genomics Center
(NCGC), has initiated an HTS project (2, 58). The first batch of
HTS results, for a set of 1,408 compounds tested in six human cell
lines, was released via PubChem. We explored these data in terms
of their utility for predicting adverse health effects of
environmental agents (57). Initially, the classification k nearest neighbor
(kNN) QSAR modeling method was applied to the HTS data only,
using a curated dataset of 384 compounds. The resulting models
had prediction accuracies for the training, test (together containing
275 compounds), and external validation (109 compounds) sets as
high as 89, 71, and 74%, respectively. We then asked whether HTS
results could be of value in predicting rodent carcinogenicity. We
identified 383 compounds for which data were available from both the
Berkeley Carcinogenic Potency Database and the NTP-HTS studies.
We found that compounds classified by HTS as active in at least
one cell line were likely to be rodent carcinogens (sensitivity 77%);
however, HTS inactives were far less informative (specificity 46%).
Using chemical descriptors only, kNN QSAR modeling resulted in
62% overall prediction accuracy for rodent carcinogenicity on
this data set. Importantly, the prediction accuracy of the model
improved significantly (to 73%) when the chemical descriptors were
augmented by the HTS data, which were regarded as biological
descriptors (Fig. 1). Our studies suggest that combining HTS
profiles with conventional chemical descriptors could considerably
improve the predictive power of computational approaches in
chemical toxicology.
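The statistics quoted above (sensitivity, specificity, and overall accuracy) and the nearest-neighbor principle can be illustrated with a short sketch. It uses a plain majority-vote kNN classifier with Euclidean distance, which is a simplification: the kNN QSAR method of the cited studies also includes descriptor selection, which is not reproduced here. Function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training compounds nearest to `query`."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def classification_stats(observed, predicted):
    """Sensitivity, specificity, and accuracy for binary labels (1 = active)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy hybrid descriptor vectors: two well-separated groups of compounds.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
call_a = knn_predict(train_X, train_y, [0.2, 0.2])
call_b = knn_predict(train_X, train_y, [5.5, 5.5])
stats = classification_stats([1, 1, 0, 0], [1, 0, 0, 0])
```

Reporting sensitivity and specificity separately matters here because, as the HTS results above show, a predictor can be informative for actives while being nearly uninformative for inactives.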
Fig. 1. Comparison of the prediction power of QSAR models using conventional and hybrid descriptors for carcinogenicity of
external compounds. Bar chart of sensitivity, specificity, and overall predictivity (vertical axis from 0 to 0.9) for models built with
chemical (MolConnZ) descriptors alone versus chemical (MolConnZ) plus biological (HTS) descriptors.
5.2. Using Hybrid Descriptors for the QSIIR Modeling of Rodent Acute Toxicity
We used the cell viability qHTS data from the NCGC mentioned in
the preceding section for the same 1,408 compounds, but in 13 cell
lines (59). Beyond carcinogenicity, we asked whether HTS results
could be of value in predicting rodent acute toxicity (56). For this
purpose, we identified 690 of these compounds for which
rodent acute toxicity data (i.e., toxic or nontoxic) were also available.
The classification kNN QSAR modeling method was applied to
these compounds using either chemical descriptors alone or a
combination of chemical and qHTS biological (hybrid) descriptors
as compound features. The external prediction accuracy of models
built with chemical descriptors only was 76%. In contrast, the
prediction accuracy improved significantly, to 85%, when hybrid
descriptors were used. The receiver operating characteristic (ROC)
curves of the conventional QSAR model and different hybrid models
are shown in Fig. 2. The sensitivity and specificity of the hybrid
models are clearly better than those of the conventional QSAR model
for the same external compounds. Furthermore, the prediction
coverage increased from 76% when using chemical descriptors only to
93% when qHTS biological descriptors were also included. Our
studies suggest that combining HTS profiles, especially the
dose–response qHTS results, with conventional chemical descriptors
could considerably improve the predictive power of computational
approaches for rodent acute toxicity assessment.
Fig. 2. The ROC curves for the conventional QSAR model (bold line) and different hybrid
models for the same external compounds in acute toxicity modeling.
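For readers unfamiliar with how an ROC curve such as those in Fig. 2 is constructed, the following sketch derives the curve from per-compound prediction scores by sweeping a decision threshold, and integrates it by the trapezoidal rule. It is a generic illustration, assuming binary labels (1 = toxic) and a continuous score per compound; it is not the code used to produce the figure.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Compounds with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


curve = roc_points([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
area = auc(curve)
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation of toxic from nontoxic compounds, which gives a single number for comparing conventional and hybrid models on the same external set.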
5.3. Hierarchical QSIIR Modeling of Rodent Acute Toxicity Based on In Vitro–In Vivo Relationships
A database containing experimental in vitro IC50 cytotoxicity values
and in vivo rodent LD50 values for more than 300 chemicals was
compiled by the German Center for the Documentation and
Validation of Alternative Methods (ZEBET). The application of
conventional QSAR modeling approaches to predict mouse or rat acute
LD50 from the chemical descriptors of the ZEBET compounds yielded no
statistically significant models (60). Furthermore, analysis of these
data showed that the overall correlation between IC50 and LD50 is
unclear (60). However, a linear IC50 versus LD50 correlation could be
established for a fraction of the compounds. To capitalize on this
observation, a novel two-step modeling approach was developed.
As a preliminary, all chemicals are partitioned into two groups based
on the relationship between their IC50 and LD50 values: one group is
formed by compounds with a linear IC50 versus LD50 relationship,
and the other consists of the remaining compounds. In the first
modeling step, conventional binary classification QSAR models are
built to predict group affiliation from chemical descriptors only. In
the second step, kNN continuous QSAR models are developed for each
subclass individually to predict LD50 from chemical descriptors. All
models have been extensively validated using special protocols. We have
found that this type of in vitro–in vivo correlation could be
established not only between cytotoxicity and rat acute toxicity (Fig. 3a)
but also between cytotoxicity and other types of rodent toxicity
(Fig. 3b–d), including two types of low-dose toxicity endpoints: rat
lowest observed adverse effect level (LOAEL) and no observed
adverse effect level (NOAEL) (55).
Fig. 3. The identification of the baseline correlation between cytotoxicity (IC50) and various types of in vivo toxicity testing
results. (a) Rat LD50. (b) Mouse LD50. (c) Rat LOAEL. (d) Rat NOAEL. C1, class 1; C2, class 2.
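The partitioning that precedes the two modeling steps can be sketched as follows: a straight line is fitted to LD50 against IC50 (both on a log scale), and compounds lying within a tolerance of that line form class 1 while the rest form class 2. The iterative refitting, the tolerance value, and the function names are assumptions made for this example; the exact criterion used in ref. 55 is not given in the text above.

```python
def fit_line(x, y):
    """Least-squares slope and intercept of y against x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def partition_by_baseline(log_ic50, log_ld50, tol=0.5, n_iter=5):
    """Assign class 1 to compounds near the fitted baseline, class 2 otherwise."""
    n = len(log_ic50)
    members = list(range(n))
    slope = intercept = 0.0
    for _ in range(n_iter):
        slope, intercept = fit_line(
            [log_ic50[i] for i in members], [log_ld50[i] for i in members]
        )
        close = [
            i for i in range(n)
            if abs(log_ld50[i] - (slope * log_ic50[i] + intercept)) <= tol
        ]
        if close == members or len(close) < 2:
            break  # converged, or too few compounds left to fit a line
        members = close
    in_class1 = set(members)
    classes = [1 if i in in_class1 else 2 for i in range(n)]
    return classes, (slope, intercept)


# Four compounds on a common baseline and one clear outlier.
classes, (slope, intercept) = partition_by_baseline(
    [0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0, 10.0], tol=1.5
)
```

The class labels produced this way become the target of the binary classification models, and the LD50 values within each class become the targets of the continuous models.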
6. Conclusion
Acknowledgments
References
1. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3(8):711–715
2. Inglese J, Auld DS, Jadhav A, Johnson RL, Simeonov A, Yasgar A, Zheng W, Austin CP (2006) Quantitative high-throughput screening: a titration-based approach that efficiently identifies biological activities in large chemical libraries. Proc Natl Acad Sci USA 103(31):11473–11478
3. Cheeseman MA (2005) Thresholds as a unifying theme in regulatory toxicology. Food Addit Contam 22(10):900–906
4. Riley RJ, Kenna JG (2004) Cellular models for ADMET predictions and evaluation of drug-drug interactions. Curr Opin Drug Discov Devel 7(1):86–99
5. Dix DJ, Houck KA, Martin MT, Richard AM, Setzer RW, Kavlock RJ (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95(1):5–12
6. Yang C, Valerio LG Jr, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37(5):523–531
35. So SS, Karplus M (1996) Evolutionary optimization in quantitative structure-activity relationship: an application of genetic neural networks. J Med Chem 39(7):1521–1530
36. So SS, Karplus M (1996) Genetic neural networks for quantitative structure-activity relationships: improvements and application of benzodiazepine affinity for benzodiazepine/GABAA receptors. J Med Chem 39(26):5246–5256
37. Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. Quant Struct Act Relat Comb Sci 22:69–77
38. Golbraikh A, Tropsha A (2002) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aided Mol Des 16(5–6):357–369
39. Norinder U (1996) Single and domain made variable selection in 3D QSAR applications. J Chemomet 10:95–105
40. Zefirov NS, Palyulin VA (2001) QSAR for boiling points of small sulfides. Are the high-quality structure-property-activity regressions the real high quality QSAR models? J Chem Inf Comput Sci 41(4):1022–1027
41. Kubinyi H, Hamprecht FA, Mietzner T (1998) Three-dimensional quantitative similarity-activity relationships (3D QSiAR) from SEAL similarity matrices. J Med Chem 41(14):2553–2564
42. Novellino E, Fattorusso C, Greco G (1995) Use of comparative molecular field analysis and cluster analysis in series design. Pharm Acta Helv 70:149–154
43. Golbraikh A, Tropsha A (2002) Beware of q2! J Mol Graph Model 20(4):269–276
44. Golbraikh A, Shen M, Xiao Z, Xiao YD, Lee KH, Tropsha A (2003) Rational selection of training and test sets for the development of validated QSAR models. J Comput Aided Mol Des 17(2–4):241–253
45. Stouch TR, Kenyon JR, Johnson SR, Chen XQ, Doweyko A, Li Y (2003) In silico ADME/Tox: why models fail. J Comput Aided Mol Des 17(2–4):83–92
46. Johnson SR (2008) The trouble with QSAR (or how I learned to stop worrying and embrace fallacy). J Chem Inf Model 48(1):25–26
47. Lombardo F, Gifford E, Shalaeva MY (2003) In silico ADME prediction: data, models, facts and myths. Mini Rev Med Chem 3(8):861–875
48. Klopman G, Zhu H, Ecker G, Chiba P (2003) MCASE study of the multidrug resistance reversal activity of propafenone analogs. J Comput Aided Mol Des 17(5–6):291–297
49. Stoner CL, Gifford E, Stankovic C, Lepsy CS, Brodfuehrer J, Prasad JVNV, Surendran N (2004) Implementation of an ADME enabling selection and visualization tool for drug discovery. J Pharm Sci 93(5):1131–1141
50. Mayer P, Reichenberg F (2006) Can highly hydrophobic organic substances cause aquatic baseline toxicity and can they contribute to mixture toxicity? Environ Toxicol Chem 25(10):2639–2644
51. Forsby A, Blaauboer B (2007) Integration of in vitro neurotoxicity data with biokinetic modelling for the estimation of in vivo neurotoxicity. Hum Exp Toxicol 26(4):333–338
52. Schirmer K, Tanneberger K, Kramer NI, Volker D, Scholz S, Hafner C, Lee LE, Bols NC, Hermens JL (2008) Developing a list of reference chemicals for testing alternatives to whole fish toxicity tests. Aquat Toxicol 90(2):128–137
53. Piersma AH, Janer G, Wolterink G, Bessems JG, Hakkert BC, Slob W (2008) Quantitative extrapolation of in vitro whole embryo culture embryotoxicity data to developmental toxicity in vivo using the benchmark dose approach. Toxicol Sci 101(1):91–100
54. Sjostrom M, Kolman A, Clemedson C, Clothier R (2008) Estimation of human blood LC50 values for use in modeling of in vitro-in vivo data of the ACuteTox project. Toxicol In Vitro 22(5):1405–1411
55. Zhu H, Ye L, Richard A, Golbraikh A, Wright FA, Rusyn I, Tropsha A (2009) A novel two-step hierarchical quantitative structure-activity relationship modeling work flow for predicting acute toxicity of chemicals in rodents. Environ Health Perspect 117(8):1257–1264
56. Sedykh A, Zhu H, Tang H, Zhang L, Richard A, Rusyn I, Tropsha A (2011) Use of in vitro HTS-derived concentration-response data as biological descriptors improves the accuracy of QSAR models of in vivo toxicity. Environ Health Perspect 119:364–370
57. Zhu H, Rusyn I, Richard AM, Tropsha A (2008) Use of cell viability assay data improves the prediction accuracy of conventional quantitative structure activity relationship models of animal carcinogenicity. Environ Health Perspect 116(4):506–513
58. Thomas CJ, Auld DS, Huang R, Huang W, Jadhav A, Johnson RL, Leister W, Maloney DJ, Marugan JJ, Michael S, Simeonov A, Southall N, Xia M, Zheng W, Inglese J, Austin CP (2009) The pilot phase of the NIH Chemical Genomics Center. Curr Top Med Chem 9(13):1181–1193
59. Xia M, Huang R, Witt KL, Southall N, Fostel J, Cho MH, Jadhav A, Smith CS, Inglese J, Portier CJ, Tice RR, Austin CP (2008) Compound cytotoxicity profiling using quantitative high-throughput screening. Environ Health Perspect 116(3):284–291
60. ICCVAM and NICEATM (2001) Report of the International Workshop on In Vitro Methods for Assessing Acute Systemic Toxicity. Interagency Coordinating Committee on the Validation of Alternative Methods and National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods. Report 01-4499. National Institutes of Health, Bethesda, MD
Chapter 4
Abstract
Aiming at understanding the structural and physicochemical basis of the biological activity of chemicals,
the science of structure–activity relationships has seen dramatic progress in the last decades. Coarse-grain,
qualitative approaches (e.g., structural alerts) and fine-tuned quantitative structure–activity relationship
models have been developed and used to predict the toxicological properties of untested chemicals.
More recently, a number of approaches and concepts have been developed as support to, and corollary of,
the structure–activity methods. These approaches (e.g., chemical relational databases, expert systems,
software tools for manipulating chemical information) have dramatically expanded the reach of
structure–activity work; at present, they are powerful and indispensable tools for computational chemists,
toxicologists, and regulators. This chapter, after a general overview of traditional and well-known approaches,
gives a detailed presentation of the latter, more recent support tools that are freely available in the public domain.
Key words: QSAR, Structure–activity, Toxicology, Human health, Chemicals, Databases, Expert
systems, Predictive toxicology, Structural alerts
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_4, © Springer Science+Business Media, LLC 2013
68 R. Benigni et al.
2. Structure–Activity Relationships and the Prediction of Toxicological End points
Even though it is generally difficult to assess the contribution that
qualitative, mechanistically based SAR information has made to risk
assessment and to the domestication of chemistry, various evidence
is available and indicates that the contribution is important.
A striking case is the priority-setting process used by
the US National Toxicology Program to select chemicals for
bioassay. The analysis was performed when around 400 chemicals
had been bioassayed for carcinogenic activity. It appears
that two-thirds were selected for bioassay because they were suspect,
mainly on the basis of structural considerations. One-third were
selected because of production/exposure considerations. The
analysis showed that the structural criteria adopted to short-list suspect
chemicals were able to enrich the target up to ten times. In fact,
70% of the chemicals bioassayed as suspect carcinogens were
carcinogens, whereas 7% of the chemicals bioassayed only on
production/exposure considerations were carcinogens (22).
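The tenfold figure follows directly from the two hit rates quoted above. As a worked ratio (the symbol E for the enrichment factor is introduced here only for illustration):

```latex
E = \frac{\text{carcinogen rate among structurally suspect chemicals}}
         {\text{carcinogen rate among production/exposure selections}}
  = \frac{0.70}{0.07} = 10
```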
Further evidence of the positive influence of mechanistic
information is that the proportion of drugs and pesticides with known
SAs or with positive Salmonella results put on the market in recent
times has decreased considerably, as shown by the personal
experience of one of the authors (R.B.) in his regulatory work. This is
confirmed by the much lower proportion of known SAs among
3. SAR-Based Expert Systems
This section presents in detail three expert systems aimed at
assessing the hazard posed by chemicals, based primarily (but not only)
on structural considerations. The systems are Oncologic, Toxtree,
and the Organisation for Economic Co-operation and
Development (OECD) QSAR Toolbox. They differ considerably in
structure and scope, since they were designed at different times
and with different specific aims. In particular, Oncologic aims
at predicting carcinogenicity, Toxtree has modules for a series of
toxicity end points, whereas the OECD QSAR Toolbox has a
number of tools that in combination support users in the
process of hazard assessment.
3.2. Toxtree
Toxtree version 2.1.0, developed by Ideaconsult Ltd under an ECB
contract, is an open source software application available as a free
download from the ECB Web site: http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/.
Toxtree is capable of estimating different types
of toxic effects and modes of action by applying structural rules in a
decision tree approach. Currently, modules are available for
applying the following rulebases:
1. Cramer classification scheme for threshold of toxicological
concern (TTC) estimation.
2. Extended Cramer scheme.
3. Verhaar scheme for predicting the mode of toxic action in
aquatic species.
4. Decision tree for estimating skin irritation and corrosion
potential.
5. Decision tree for estimating eye irritation and corrosion.
6. Mutagenicity and carcinogenicity rulebase.
7. ToxMic rulebase for the in vivo micronucleus assay.
8. Structural alerts for the identification of Michael acceptors.
9. START rulebase for persistence/biodegradation potential.
10. Kroes decision tree for TTC estimation.
11. Decision trees for estimating skin sensitization.
12. Cytochrome P-450 Drug Metabolism.
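The decision tree approach that underlies such rulebases can be sketched generically: each node asks a yes/no structural question, and each answer leads either to another question or to a final class. The sketch below is not Toxtree code and does not reproduce any of the rulebases listed above; the questions, the precomputed feature flags, and the class labels are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Rule:
    """One yes/no question of a decision tree."""
    question: str
    test: Callable[[dict], bool]
    if_yes: Union["Rule", str]  # next rule, or a final class label
    if_no: Union["Rule", str]


def run_tree(root, chemical):
    """Walk the tree for one chemical; return (label, list of answers)."""
    node, path = root, []
    while isinstance(node, Rule):
        answer = bool(node.test(chemical))
        path.append((node.question, answer))
        node = node.if_yes if answer else node.if_no
    return node, path


# Toy rulebase (hypothetical questions and labels, for illustration only).
TOY_TREE = Rule(
    "Q1: contains a structural alert?",
    lambda c: c["has_alert"],
    "high concern",
    Rule(
        "Q2: simple, readily metabolized structure?",
        lambda c: c["is_simple"],
        "low concern",
        "intermediate concern",
    ),
)

label, path = run_tree(TOY_TREE, {"has_alert": False, "is_simple": True})
```

Returning the sequence of answers together with the final label keeps the reasoning behind each classification visible, which is one practical attraction of rule-based systems over purely statistical models.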
4 Mutagenicity, Carcinogenicity, and Other End points 75
3.3. OECD QSAR Application Toolbox
The OECD QSAR Application Toolbox is a stand-alone, free software
application (http://www.oecd.org/document/54/0,3343,en_2649_34379_42923638_1_1_1_1,00.html)
intended to be used by governments, the chemical industry, and
toxicologists to fill gaps in the toxicity data needed for assessing
the hazards posed by chemicals.
The Toolbox contains databases with results from experimental
studies, a library of QSAR models, tools to estimate missing
The Toolbox has six work modules which are used in the
workflow.
1. Chemical input: Provides the user with several means of enter-
ing the chemical of interest or target chemical. Since all
subsequent functions are based on chemical structure, the
goal here is to make sure that the molecular structure assigned
is the correct one.
2. Profiling: Identifies structural and mechanistic properties of the
target chemical, which are stored in the Toolbox and can
subsequently be used in the module on category definition to
group the target chemical with similar chemicals.
3. End points: Retrieves physical–chemical properties, environmental
fate, and toxicity data that are stored in the Toolbox.
This data gathering can be executed in a global fashion (i.e.,
collecting all data for all end points) or on a more narrowly
defined basis (e.g., collecting data for a single end point or a
limited number of end points).
4. Category definition: Provides the user with several means of
grouping chemicals into a (toxicologically) meaningful cate-
gory that includes the target molecule. This is the critical step
in the workflow and several options are available in the Toolbox
to assist the user in refining the category definition via subcate-
gorization.
5. Filling Data Gaps: Provides the user with three options for
making an end point-specific prediction for the untested chem-
ical (i.e., the target molecule). These options, in increasing
order of complexity, are by read across, by trend analysis, and
through the use of QSAR models.
6. Report: Provides the user with a downloadable report of the
Toolbox prediction.
Since the definition of a chemical category for data gap filling is
the critical step in the workflow of the Toolbox, the selection and
use of the profilers is an important process. The Toolbox provides
several options (i.e., profilers) to assist the user in defining and
refining the category definition. A chemical category is a group of
chemicals whose physical–chemical and toxicological properties
are likely to be similar or to follow a regular pattern because
of their similar chemical structure. Using this category approach,
not every chemical needs to be tested for every end point, because
data gaps may be filled by read across from a tested chemical to an
untested chemical, by trend analysis (interpolation or extrapolation),
or by related QSAR methods. The category approach used in the
Toolbox enables robust hazard assessment through mechanistic
comparisons without testing.
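The two simplest gap-filling options can be sketched in a few lines. The code below is a generic illustration, not the Toolbox's own algorithm: read across is reduced to a majority call among the tested analogues in the category, and trend analysis to a least-squares line evaluated at the target's property value. Function names, the tie-handling rule, and the example values are assumptions.

```python
from collections import Counter


def read_across(analogue_calls):
    """Qualitative read across: the majority call among tested analogues.

    Returns None when the two most frequent calls are tied (equivocal).
    """
    if not analogue_calls:
        raise ValueError("no analogue data in the category")
    ranked = Counter(analogue_calls).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def trend_analysis(x_values, y_values, x_target):
    """Quantitative trend: fit y = a*x + b over the category, evaluate at target."""
    n = len(x_values)
    mx, my = sum(x_values) / n, sum(y_values) / n
    sxx = sum((x - mx) ** 2 for x in x_values)
    if sxx == 0:
        raise ValueError("category members share a single x value")
    slope = sum((x - mx) * (y - my) for x, y in zip(x_values, y_values)) / sxx
    return my + slope * (x_target - mx)


call = read_across(["negative", "negative", "positive"])
estimate = trend_analysis([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 4.0)
```

In both cases the quality of the prediction rests on the category itself, which is why the definition and refinement of the category is described above as the critical step.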
Fig. 3. OECD QSAR Toolbox: Results of category definition and data collection for 3-methylhexanal.
while the purple dots the experimental results available for the
analogues, which are used for the read across (Fig. 4). The user can
refine the read-across results, for example by deleting
chemicals with functional groups that are not present in the
target molecule.
In our example, the non-mutagenic potential could be
predicted with confidence for the target chemical. In the last
step, the full history of the prediction can be printed or copied to a
detailed report. It is important always to make sure that the data
used in any read-across or trend analysis prediction are
quality-checked, and that unusual or outlying data within a category are
investigated before use.
Fig. 5. An example of the ToxPredict toxicity prediction (substance: benzidine, selected model: ToxTree rulebase for
carcinogenicity and mutagenicity).
4.2. Databases on Chemical Toxicological Information
The definition of toxicity databases (DB) in the literature is variable; the
definition proposed here is that a toxicity database is a valued set of
electronic information (i.e., data) that can be related to the toxicity of
substances, in which the information is accessible by computer,
organized by software, and utilized for safety and risk analysis of
Table 1
Databases on chemical toxicological information
References
1. Ariens EJ (1984) Domestication of chemistry by design of safer chemicals: structure–activity relationships. Drug Metab Rev 15:425–504
2. Hansch C, Fujita T (1964) ρ-σ-π Analysis. A method for the correlation of biological activity and chemical structure. J Am Chem Soc 86:1616–1626
3. Hansch C, Hoekman D, Leo A et al (2002) Chem-bioinformatics: comparative QSAR at the interface between chemistry and biology. Chem Rev 102:783–812
4. Livingstone DJ (2000) The characterization of chemical structures using molecular properties. A survey. J Chem Inf Comput Sci 40:195–209
5. Todeschini R, Lasagni M, Marengo E (1994) New molecular descriptors for 2D and 3D structures. Theory. J Chemom 8:263–272
6. Basak SC, Mills D (2009) Predicting the vapour pressure of chemicals from structure: a comparison of graph theoretic versus quantum chemical descriptors. SAR QSAR Environ Res 20:119–132
7. Estrada E (2008) Quantum chemical foundation of the topological substructural molecular design. J Phys Chem A 112:5208–5217
8. Perez-Garrido A, Helguera AM, Lopez GC et al (2010) A topological substructural molecular design approach for predicting mutagenesis end points of α,β-unsaturated carbonyl compounds. Toxicology 268:64–77
9. Dunn WJ, Wold S (1980) Structure–activity analyzed by pattern recognition: the asymmetric case. J Med Chem 23:595–599
10. Wold S (1995) Chemometrics: what do we mean with it, and what do we want from it. Chemom Intell Lab Syst 30(1):109–115
11. Manallack DT, Ellis DD, Livingstone DJ (1994) Analysis of linear and nonlinear QSAR data using neural networks. J Med Chem 37:3758–3767
12. Carlsson L, Helgee EA, Boyer S (2009) Interpretation of nonlinear QSAR models applied to Ames mutagenicity data. J Chem Inf Model 49:2551–2558
13. Michielan L, Moro S (2010) Pharmaceutical perspectives of nonlinear QSAR strategies. J Chem Inf Model 50:961–978
14. Hansch C, Leo A (1995) Exploring QSAR. 1. Fundamentals and applications in chemistry and biology. American Chemical Society, Washington, DC
15. Benigni R (2003) Quantitative structure–activity relationship (QSAR) models of mutagens and carcinogens. CRC, Boca Raton, FL
16. Kubinyi H (1993) 3D QSAR in drug design: theory, methods and applications. ESCOM, Leiden
17. Tute MS (1990) History and objectives of quantitative drug design. In: Hansch C (ed) Comprehensive medicinal chemistry. Pergamon, Oxford, pp 1–31
18. Franke R (1984) Theoretical drug design methods. Elsevier, Amsterdam
19. Franke R, Gruska A (2003) General introduction to QSAR. In: Benigni R (ed) Quantitative structure-activity relationship (QSAR) models of mutagens and carcinogens. CRC, Boca Raton, FL, pp 1–40
20. Benigni R, Netzeva TI, Benfenati E et al (2007) The expanding role of predictive toxicology: an update on the (Q)SAR models for mutagens and carcinogens. J Environ Sci Health C: Environ Carcinog Ecotoxicol Rev 25:53–97
21. Woo YT, Lai DY, McLain JL et al (2002) Use of mechanism-based structure–activity relationships analysis in carcinogenic potential ranking for drinking water disinfection by-products. Environ Health Perspect 110:75–87
22. Fung VA, Huff J, Weisburger EK, Hoel DG (1993) Predictive strategies for selecting 379 NCI/NTP chemicals evaluated for carcinogenic potential: scientific and public health impact. Fund Appl Toxicol 20:413–436
23. Greene N, Judson PN, Langowski JJ et al (1999) Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ Res 10:299–314
24. Woo YT, Lai DY (2005) Oncologic: a mechanism based expert system for predicting the carcinogenic potential of chemicals. In: Helma C (ed) Predictive toxicology. Taylor and Francis, Boca Raton, FL, pp 385–413
25. Woo YT, Lai DY, Argus MF et al (1998) An integrative approach of combining mechanistically complementary short-term predictive tests as a basis for assessing the carcinogenic potential of chemicals. J Environ Sci Health C: Environ Carcinog Ecotoxicol Rev C16:101–122
26. Benigni R, Bossa C, Jeliazkova NG et al (2008) The Benigni/Bossa rulebase for mutagenicity and carcinogenicity: a module of Toxtree. EUR 23241 EN. Office for the Official Publications of the European Communities, Luxembourg. EUR Scientific and Technical Report Series
27. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res Rev 659:248–261
28. Hansch C (1991) Structure–activity relationships of chemical mutagens and carcinogens. Sci Total Environ 109(110):17–29
29. Benigni R (2005) Structure–activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches. Chem Rev 105:1767–1800
30. Enslein K, Gombar VK, Blake BW (1994) Use of SAR in computer-assisted prediction of carcinogenicity and mutagenicity of chemicals by the TOPKAT program. Mutat Res 305:47–61
31. Klopman G (1992) Multicase 1. A hierarchical computer automated structure evaluation program. Quant Struct Act Relat 11:176–184
32. Rosenkranz HS (2003) SAR in the assessment of carcinogenesis: the MultiCASE approach. In: Benigni R (ed) Quantitative structure–activity relationship (QSAR) models of chemical mutagens and carcinogens. CRC, Boca Raton, FL, pp 175–206
33. Helma C (2006) Lazy structure–activity relationships (lazar) for the prediction of rodent carcinogenicity and Salmonella mutagenicity. Mol Divers 10:147–158
34. Benigni R, Richard AM (1998) Quantitative structure-based modeling applied to characterization and prediction of chemical toxicity. Methods 14:264–276
35. Benigni R, Bossa C, Tcheremenskaia O et al (2010) Alternatives to the carcinogenicity bioassay: in silico methods, and the in vitro and in vivo mutagenicity assays. Exp Opin Drug Metab Toxicol 6:1–11
36. Serafimova R, Fuart Gatnik M, Worth A (2010) Review of the QSAR models and software tools for predicting genotoxicity and carcinogenicity. JRC Technical Report EUR 24427 EN. Publications Office of the European Union, Luxembourg
37. Cronin MTD, Dearden JC (1995) QSAR in toxicology. 4. Prediction of non-lethal mammalian toxicological end points, and expert systems for toxicity prediction. Quant Struct Act Relat 14:518–523
38. Greene N (2002) Computer systems for the prediction of toxicity: an update. Adv Drug Deliv Rev 54:417–431
39. Hansch C (1977) On the predictive value of QSAR. In: Keverling Buisman JA (ed) Biological activity and chemical structure. Elsevier, Amsterdam
40. Hulzebos EM, Posthumus R (2003) (Q)SARs: gatekeepers against risk on chemicals? SAR QSAR Environ Res 14:285–316
41. Martin YC (2006) What works and what does not: lessons from experience in a pharmaceutical company. QSAR Combinat Sci 25:1192–1200
42. Pearl GM, Livingstone-Carr S, Durham SK (2001) Integration of computational analysis as a sentinel tool in toxicologic assessments. Curr Top Med Chem 1:247–255
43. Richard AM (1998) Commercial toxicology prediction systems: a regulatory perspective. Toxicol Lett 102–103:611–616
44. Snyder RD, Pearl GM, Mandakas G et al (2004) Assessment of the sensitivity of the computational programs DEREK, TOPKAT, and MCASE in the prediction of the genotoxicity of pharmaceutical molecules. Environ Mol Mutagen 43:143–158
45. Valerio LG, Arvidson KB, Chanderbhan RF et al (2007) Prediction of rodent carcinogenic potential of naturally occurring chemicals in the human diet using high-throughput QSAR predictive modeling. Toxicol Appl Pharmacol 222:1–16
46. Zeiger E, Ashby J, Bakale G et al (1996) Prediction of Salmonella mutagenicity. Mutagen 11:474–484
47. Benigni R, Bossa C, Netzeva TI et al (2007) Collection and evaluation of (Q)SAR models for mutagenicity and carcinogenicity. EUR 22772 EN. Office for the Official Publications of the European Communities, Luxembourg. EUR Scientific and Technical Research
idation and applicability in predicting carcinogens. Regulat Pharmacol Toxicol 50:50–58
55. Benigni R, Zito R (2004) The second National Toxicology Program comparative exercise on the prediction of rodent carcinogenicity: definitive results. Mutat Res Rev 566:49–63
56. Cramer GM, Ford RA, Hall RL (1978) Estimation of toxic hazard: a decision tree approach. Food Cosmet Toxicol 16:255–276
57. Munro IC, Renwick AG, Danielewska-Nikiel B (2008) The threshold of toxicological concern (TTC) in risk assessment. Toxicol Lett 180:151–156
58. Kroes R, Renwick AG, Cheeseman MA et al (2004) Structure-based thresholds of toxicological concern (TTC): guidance for application to substances present at low levels in the diet. Food Chem Toxicol 42:65–83
59. Munro IC, Ford RA, Kennepohl E et al (1996) Correlation of structural class with No-Observed-Effect levels: a proposal for establishing a threshold of concern. Food Chem Toxicol 34:829–867
60. Verhaar HJM, Solbe J, Speksnijder J et al (2000) Classifying environmental pollutants: Part 3. External validation of the classification system. Chemosphere 40:875–883
61. Walker JD, Gerner I, Hulzebos E et al (2005) The Skin Irritation Corrosion Rules Estimation Tool (SICRET). QSAR Combinat Sci 24:378–384
Series. 127-2007. Ref Type: Report 62. Gerner I, Schlegel K, Walker JD et al (2004)
48. Woo YT, Lai DY, Argus MF et al (1995) Devel- Use of physicochemical property limits to
opment of structureactivity relationship rules develop rules for identifying chemical sub-
for predicting carcinogenic potential of chemi- stances with no skin irritation or corrosion
cals. Toxicol Lett 79:219228 potential. QSAR Combinat Sci 23:726733
49. Benigni R, Zito R (2003) Designing safer 63. Hulzebos E, Walker JD, Gerner I et al (2005)
drugs: (Q)SAR-based identification of muta- Use of structural alerts to develop rules for
gens and carcinogens. Curr Top Med Chem identifying chemical substances with skin irrita-
3:12891300 tion or skin corrosion potential. QSAR Com-
50. Ashby J (1985) Fundamental structural alerts binat Sci 24:332342
to potential carcinogenicity or noncarcinogeni- 64. Gerner I, Liebsch M, Spielmann H (2005)
city. Environ Mutagen 7:919921 Assessment of the eye irritating properties of
51. Ashby J, Tennant RW (1988) Chemical struc- chemicals by applying alternatives to the Draize
ture, Salmonella mutagenicity and extent of rabbit eye test: the use of QSARs and in vitro
carcinogenicity as indicators of genotoxic tests for the classification of eye irritation.
carcinogenesis among 222 chemicals tested by Altern Lab Anim 33:215237
the U.S.NCI/NTP. Mutat Res 204:17115 65. Benigni R, Bossa C, Worth AP (2010) Struc-
52. Piegorsch WW, Zeiger E (1991) Measuring tural analysis and predictive value of the rodent
intra-assay agreement for the Ames Salmonella in vivo micronucleus assay results. Mutagen
assay. In: Rienhoff O, Lindberg DAB (eds) 25:335341
Statistical methods in toxicology. Springer, 66. Schultz TW (2007) Verification of the struc-
Heidelberg, pp 3541 tural alerts for Michael acceptors. Chem Res
53. Benigni R, Bossa C (2008) Predictivity of Toxicol 20:13591363
QSAR. J Chem Inf Model 48:971980 67. Enoch SJ, Madden JC, Cronin MT (2008)
54. Mayer J, Cheeseman MA, Twaroski ML (2008) Identification of mechanisms of toxic action
Structure activity relationship analysis tools: val- for skin sensitisation using a SMARTS pattern
98 R. Benigni et al.
based approach. SAR QSAR Environ Res digm for toxicity prediction. Toxicol Mech
19:555578 Meth 18:103118
68. Aptula AO, Patlewicz G, Roberts DW (2005) 75. Yang C, Benz RD, Cheeseman MA (2006)
Skin sensitization: reaction mechanistic appli- Landscape of current toxicity databases and
cability domains for structureactivity relation- database standards. Curr Opin Drug Discov
ships. Chem Res Toxicol 18:14201426 Dev 9:124133
69. Rydberg P, Gloriam DE, Zaretzki J et al (2010) 76. Yang C, Richard AM, Cross KP (2006) The art
SMARTCyp: a 2D method for prediction of of data mining the minefields of toxicity data-
cytochrome P450-mediated drug metabolism. bases to link chemistry to biology. Curr Com-
ACS Med Chem Lett 1:96100 put Aid Drug Des 2:135150
70. Serafimova R, Todorov M, Pavlov T et al 77. Benigni R, Bossa C, Richard AM et al (2008) A
(2007) Identification of the structural require- novel approach: chemical relational databases,
ments for mutagencitiy, by incorporating and the role of the ISSCAN database on asses-
molecular flexibility and metabolic activation sing chemical carcinogenicity. Ann Ist Super
of chemicals. II. General Ames mutagenicity Sanita` 44:4856
model. Chem Res Toxicol 20:662676 78. Richard AM (2004) DSSTox web site launch:
71. Hardy B, Douglas N, Helma C et al (2010) improving public access to databases for build-
Collaborative development of predictive toxi- ing structure-toxicity prediction models. Pre-
cology applications. J Cheminf 2:7 clinica 2:103108
72. Valerio LG (2009) In silico toxicology for the 79. Judson R, Richard AM, Dix D et al (2008)
pharmaceutical sciences. Toxicol Appl Pharma- ACToRaggregated computational toxicol-
col 241:356370 ogy resource. Toxicol Appl Pharmacol 233:
73. OECD (2007) Approaches to data gap filling 713
in chemical categories. Chapter 3: Guidance on 80. Dix DJ, Houck KA, Martin MT et al (2007)
grouping of chemicals, vol 80. OECD Series The ToxCast program for prioritizing toxicity
on Testing and Assessment, Paris, pp 3041 testing of environmental chemicals. Toxicol Sci
74. Richard AM, Yang C, Judson RS (2008) Tox- 95:512
icity data informatics: supporting a new para-
Chapter 5
Abstract
Frequent failure of drug candidates during development remains the major deterrent to the early
introduction of new drug molecules, and drug toxicity is the major cause of expensive late-stage
failures. Early identification and optimization of the most favorable molecule therefore saves
considerable cost, time, and human effort and minimizes animal sacrifice. (Quantitative) Structure-Activity
Relationships [(Q)SARs] are statistically derived predictive models that correlate the biological activity
(including desirable therapeutic effects and undesirable side effects) of chemicals (drugs, toxicants,
environmental pollutants) with molecular descriptors and/or properties. (Q)SAR models that categorize the
available data into two or more groups or classes are known as classification models. Numerous techniques of
diverse nature are presently employed to develop classification models. Although classification models are
increasingly used to predict either biological activity or toxicity, the future trend will be towards
classification models capable of simultaneously predicting biological activity, toxicity, and
pharmacokinetic parameters, so as to accelerate the development of bioavailable, safe drug molecules.
Key words: Classification models, Drug safety, Biological activity, QSAR models, Toxicity,
Bioavailability
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_5, # Springer Science+Business Media, LLC 2013
100 A.K. Madan et al.
2. Materials and Methods
Compound classification approaches are usually based on either
clustering or partitioning methods. Clustering of compounds in
chemical space involves calculating intermolecular distances and
grouping compounds that are close to each other into clusters.
In partitioning, chemical space is subdivided into sections
based on ranges of descriptor values, and compounds falling in
the same section are combined (16). These models are widely used
to predict the physicochemical properties, biological activity,
and toxicity of lead structures prior to their synthesis, so that
only a few highly active and nontoxic lead compounds need to be
synthesized and developed for the chosen biological activity and
toxicity endpoint. The use of such classification schemes has also been
recommended in the second and sixth steps of the workflow (Fig. 1) for
use of nontesting data in chemical assessment developed as per
Fig. 1. Workflow for the use of nontesting data in chemical assessment (reproduced from ref. 17 with permission from
Wiley-VCH Verlag GmbH & Co. KGaA, Germany).
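The contrast between clustering and partitioning drawn at the start of this section can be sketched in a few lines of Python. This is only an illustrative sketch: the compound names, the two scaled descriptor values, the distance cutoff, and the number of bins are all invented, and single-linkage grouping is just one of many possible clustering schemes.

```python
from math import dist

# Hypothetical compounds described by two descriptors scaled to [0, 1);
# the numbers are invented purely for illustration.
compounds = {
    "c1": (0.10, 0.20),
    "c2": (0.15, 0.25),
    "c3": (0.80, 0.90),
    "c4": (0.85, 0.80),
    "c5": (0.20, 0.85),
}

def cluster_by_distance(points, cutoff):
    """Clustering: compounds closer than `cutoff` end up in the same cluster."""
    clusters = []
    for name, xy in points.items():
        # Every existing cluster that contains a near neighbour is merged.
        linked = [c for c in clusters
                  if any(dist(xy, points[m]) < cutoff for m in c)]
        merged = {name}.union(*linked)
        clusters = [c for c in clusters if c not in linked] + [merged]
    return sorted(sorted(c) for c in clusters)

def partition_by_ranges(points, n_bins):
    """Partitioning: each descriptor range is cut into equal bins (cells)."""
    cells = {}
    for name, xy in points.items():
        cell = tuple(min(int(v * n_bins), n_bins - 1) for v in xy)
        cells.setdefault(cell, []).append(name)
    return cells

clusters = cluster_by_distance(compounds, cutoff=0.2)
cells = partition_by_ranges(compounds, n_bins=2)
print(clusters)
print(cells)
```

Clustering depends on the compounds actually present (distances between them), whereas partitioning fixes the cells in advance from descriptor ranges, so a new compound can be assigned to a cell without recomputing anything.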
5 Classification Models for Safe Drug Molecules 103
Table 1
Examples of various classification based QSAR/QSTR models
Fig. 2. A decision tree for distinguishing between low LD50 and high LD50 (reproduced
from ref. 52 with permission from Inderscience Enterprises Limited, Switzerland).
2.3. Cluster Analysis
Cluster analysis (CA) groups data objects using the available information
that describes the objects and their relationships (114). There are
numerous types of clusters, including prototype-based, graph-based,
density-based, well-separated, and shared-property or
Fig. 3. Topology of decision tree distinguishing active compounds {A} from inactive
compounds {B} (reproduced from ref. (53) with permission from Osterreichische
Apotheker-Verlagsgesellschaft m. b. H., Vienna, Austria).
Fig. 4. A dendrogram illustrating the results of the hierarchical k-NNCA of the set of 31
steroids used in the training and the prediction sets (reproduced from ref. (26) with
permission from Wiley-VCH Verlag GmbH & Co. KGaA, Germany).
2.4. Principal Component Analysis and Factor Analysis
Principal component analysis (PCA) and factor analysis (FA) are
two techniques that can reveal relatively simple patterns within a
complex set of variables. In particular, they seek to discover
whether the observed variables can be explained largely or entirely
in terms of a much smaller number of variables called factors (115).
PCA is a well-known multivariate procedure that reduces the
dimensionality of the data space while retaining as much information
as possible (116). A data set obtained from a series of compounds
tested for their biological activity in several test systems can be
readily subjected to PCA (20). Although both PCA and FA aim to
reduce the dimensionality of a data set, their approaches differ
significantly. Each technique provides a different insight into the
data structure: PCA deals with the diagonal elements of the
covariance matrix, whereas FA emphasizes the off-diagonal elements (112).
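As a minimal sketch of how PCA reduces dimensionality, the following Python fragment builds the covariance matrix of a small, invented activity table (five compounds in three test systems) and extracts the first principal component by power iteration. The data, the function names, and the choice of power iteration are assumptions made for illustration; statistical packages normally use a full eigen- or singular value decomposition.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of `rows` (one row per compound)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centered

def first_principal_component(cov, n_iter=200):
    """Leading eigenvector and eigenvalue of `cov` by power iteration."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue

# Invented activities of five compounds in three correlated test systems.
activities = [
    [1.0, 2.1, 0.9],
    [2.0, 3.9, 2.1],
    [3.0, 6.2, 2.9],
    [4.0, 7.8, 4.2],
    [5.0, 10.0, 4.9],
]

cov, centered = covariance_matrix(activities)
pc1, var1 = first_principal_component(cov)
# Total variance is the trace, i.e. the sum of the diagonal elements.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = var1 / total_variance
# One score per compound replaces the three original variables.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
print(round(explained, 4), [round(s, 2) for s in scores])
```

Because the three invented test systems are strongly correlated, a single component carries almost all of the total variance, which is the situation in which replacing several variables by one score loses little information.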
2.6. Support Vector Machine
The support vector machine (SVM) is a powerful tool for general
(nonlinear) classification, regression, and outlier detection with
intuitive model development (79). SVMs originated from the early
concepts developed by Vapnik and Cortes (120–122) and are based on
the structural risk minimization principle. An SVM constructs a
hyperplane that separates two classes (9). It is a learning system
that uses a hypothesis space of linear functions in a high-dimensional
feature space, trained with a learning algorithm from optimization
theory (79). In the simplest case, compounds from different
categories can be separated by a linear hyperplane that is
defined solely by the nearest compounds from the training set.
Such compounds are referred to as support vectors and give the
technique its name. SVMs are widely used for assigning drugs to
their classes from sets of molecular descriptors because of their
high generalization performance, computational efficiency, and
robustness in high dimensions (123). Figure 6 shows an example
of a classification model using SVM.
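The role of the support vectors in the linearly separable case can be illustrated with a small sketch. The six two-descriptor "compounds" and the parameter values below are invented, and the training routine is plain batch sub-gradient descent on the soft-margin objective, chosen here only because it fits in a few lines; production SVM software solves the underlying quadratic program with dedicated optimizers.

```python
def decision_value(w, b, x):
    """Signed distance-like score w . x + b of compound x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Batch sub-gradient descent on
    lam/2 * |w|^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        # Only compounds inside the margin (or misclassified) drive the update.
        violators = [i for i in range(n)
                     if y[i] * decision_value(w, b, X[i]) < 1.0]
        grad_w = [lam * w[j] - sum(y[i] * X[i][j] for i in violators) / n
                  for j in range(p)]
        grad_b = -sum(y[i] for i in violators) / n
        w = [w[j] - lr * grad_w[j] for j in range(p)]
        b -= lr * grad_b
    return w, b

# Invented, linearly separable training set: +1 = class A, -1 = class B.
X = [(1, 1), (2, 1), (1, 2), (-1, -1), (-2, -1), (-1, -2)]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
predictions = [1 if decision_value(w, b, x) >= 0 else -1 for x in X]
margins = [yi * decision_value(w, b, x) for x, yi in zip(X, y)]
# Compounds lying on (or very near) the margin act as support vectors;
# the 1.2 tolerance is an arbitrary choice for this sketch.
support_vectors = [x for x, m in zip(X, margins) if m < 1.2]
print(w, b, support_vectors)
```

In this toy set the hyperplane settles close to x1 + x2 = 0, and only the two compounds nearest to it, (1, 1) and (-1, -1), end up on the margin; removing any of the other four compounds would leave the boundary unchanged.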
Fig. 5. Plot of canonical variable 2 versus canonical variable 1 for LDA with three groups
(discriminant function FD2) (reproduced from ref. (119) with permission from Oxford
University Press).
Fig. 6. Support vectors and margins in linearly separable (a) and nonseparable (b)
problems. In nonseparable case, negative margins are encountered and their magnitude
is subject to optimization along with the magnitude of the positive margins (reproduced
from ref. (108) with permission from Bentham Science Publishings Ltd).
2.7. Ensemble Methods
An ensemble method combines several techniques to construct a
single QSAR model with the aim of improving predictive ability.
The ensemble aggregates the results of several individual models
so as to achieve a substantial improvement over any individual
model (127). In ensemble approaches, multiple models of varying
types are developed individually, resulting in a different
descriptor subset for each model type (128). An ensemble can be
built by diverse methodologies such as bagging, random subspace,
and boosting methods (108, 127–130) and three-class classification
models (131). An ensemble method is exemplified in Fig. 7. Recent
research has revealed that by capitalizing on the diversity of the
individual models, ensemble techniques can minimize uncertainty and produce more
Fig. 7. Individual decisions of three binary classifiers and the resulting classifier ensemble
with more accurate decision boundary (reproduced from ref. (108) with permission from
Bentham Science Publishings Ltd).
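The idea shown in Fig. 7, three imperfect binary classifiers combined into a more accurate ensemble, can be sketched with simple majority voting. The three base models, their descriptor names and cut-offs, and the five labeled compounds below are all hypothetical; they were chosen so that each base model errs on a different compound.

```python
from collections import Counter

# Three hypothetical base classifiers, each a one-descriptor threshold rule.
def model_a(c):
    return "active" if c["logp"] < 3.0 else "inactive"

def model_b(c):
    return "active" if c["mw"] < 400 else "inactive"

def model_c(c):
    return "active" if c["hbd"] <= 2 else "inactive"

def majority_vote(models, compound):
    """Ensemble prediction: the class predicted by most base models."""
    votes = Counter(m(compound) for m in models)
    return votes.most_common(1)[0][0]

def accuracy(predict, data):
    """Fraction of compounds whose predicted class equals the known class."""
    return sum(predict(c) == c["label"] for c in data) / len(data)

# Invented compounds with known classes.
data = [
    {"logp": 2.0, "mw": 350, "hbd": 1, "label": "active"},
    {"logp": 3.5, "mw": 380, "hbd": 2, "label": "active"},
    {"logp": 4.0, "mw": 390, "hbd": 3, "label": "inactive"},
    {"logp": 3.8, "mw": 420, "hbd": 2, "label": "inactive"},
    {"logp": 4.5, "mw": 500, "hbd": 5, "label": "inactive"},
]

models = [model_a, model_b, model_c]
individual = [accuracy(m, data) for m in models]
ensemble = accuracy(lambda c: majority_vote(models, c), data)
print(individual, ensemble)
```

The gain comes entirely from the diversity of the base models: because their errors fall on different compounds, each error is outvoted by the other two models. If all three models made the same mistakes, voting would bring no improvement.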
2.8. Bayes Classifier
The Bayes classifier originates from Bayes' rule, which relates the
posterior probability of a class to its overall (prior) probability,
the probability of the observations, and the likelihood of the class
with respect to the observed variables. In the Bayes classifier, the
class/group maximizing the posterior probability is selected as the
prediction result (108). Bayesian statistics compare the frequency
of occurrence of the features found in two or more groups to identify
the features that discriminate best between these groups (76). A
distinct advantage of the naive Bayes classifier is that it requires
a relatively small amount of training data to estimate the parameters
(means and variances of the variables) needed for classification.
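A minimal sketch of a Gaussian naive Bayes classifier shows how few parameters are involved: one prior, and one mean and one variance per variable and class. The six training compounds, their two descriptor values, and the class names are invented, and the small constant added to each variance is an assumption made here only to avoid division by zero.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate the class prior and per-class means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_posterior(model, x):
    """Unnormalized log posterior: log prior + sum of Gaussian log likelihoods."""
    scores = {}
    for label, (prior, means, variances) in model.items():
        log_lik = sum(-0.5 * math.log(2 * math.pi * var)
                      - (v - m) ** 2 / (2 * var)
                      for v, m, var in zip(x, means, variances))
        scores[label] = math.log(prior) + log_lik
    return scores

def predict(model, x):
    """Select the class with the highest posterior probability."""
    scores = log_posterior(model, x)
    return max(scores, key=scores.get)

# Invented training data: two descriptors per compound.
X = [(1.0, 5.0), (1.2, 5.5), (0.8, 4.5), (3.0, 1.0), (3.2, 1.5), (2.8, 0.5)]
y = ["toxic", "toxic", "toxic", "nontoxic", "nontoxic", "nontoxic"]

model = fit_gaussian_nb(X, y)
print(predict(model, (1.1, 5.1)), predict(model, (2.9, 0.9)))
```

The "naive" part is the sum inside `log_posterior`: the variables are treated as independent within each class, so their log likelihoods simply add and no covariances have to be estimated.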
2.9. Moving Average Analysis
The moving average analysis (MAA) model is conceptually a
classification model and is based on a method known as maximization
of the moving average (134). In each model, the compounds are
classified as either active or inactive, toxic or nontoxic,
permeable or impermeable, or adductible or nonadductible. According
to this method, the minimum size of a range is based on a moving
average of 65% correctly predicted compounds. However, if the
moving average percentage of correct predictions lies within
50 ± 15%, the range is classified as transitional (21, 135). An
active range bracketed first by transitional ranges and then by
inactive ranges reveals a gradual change in activity and represents
an ideal model. The MAA methodology aims at developing models that
provide lead molecules through exploitation of the active ranges.
MAA-based models are unique and differ widely from conventional
(Q)SAR models; both systems of modeling have their advantages and
limitations. MAA modeling has the distinct advantage of identifying
narrow active ranges, which may be erroneously skipped during
regression analysis in conventional QSAR (103). Since the ultimate
goal of modeling is to provide lead structures, the active ranges of
MAA-based models can play a vital role. This method has been used
extensively to develop classification models for biological
activities of diverse nature.
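The way active, transitional, and inactive ranges emerge along a descriptor axis can be sketched as follows. This is one possible reading of the 65% and 50 ± 15% criteria quoted above, not the published MAA procedure: compounds are sorted by index value, a window of fixed width (an assumed parameter) slides over them, and each window is labeled from the fraction of active compounds it contains. The index values and activity labels are invented.

```python
def moving_average_ranges(values, labels, window=4):
    """Label each window of `window` consecutive compounds, sorted by index value.

    Assumed rule for this sketch: at least 65% active -> "active",
    at most 35% active -> "inactive", anything in between -> "transitional".
    """
    ordered = sorted(zip(values, labels))
    ranges = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        frac_active = sum(1 for _, lab in chunk if lab == "active") / window
        if frac_active >= 0.65:
            nature = "active"
        elif frac_active <= 0.35:
            nature = "inactive"
        else:
            nature = "transitional"
        ranges.append((chunk[0][0], chunk[-1][0], nature))
    return ranges

# Invented topological index values and observed activity classes.
values = [10, 12, 15, 18, 20, 23, 25, 28, 30, 33, 35, 38]
labels = ["inactive", "inactive", "inactive", "active", "inactive", "active",
          "active", "active", "active", "inactive", "inactive", "inactive"]

ranges = moving_average_ranges(values, labels)
for low, high, nature in ranges:
    print(f"{low:>3} to {high:<3} {nature}")
```

For this invented data set the windows run inactive, transitional, active, transitional, inactive along the index axis, which is the pattern described above as an ideal model: an active range bracketed by transitional ranges and then by inactive ranges.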
2.10. Moving Average Model for Safe Drug Molecules
As part of the regulatory assessment for an investigational new drug
(IND) application, submission of documented safety data for the
molecule along with its biological efficacy is mandatory; therefore,
safety and activity predicted using computational methods will
Table 2
Examples of MAA based classification models for simultaneous prediction/
consideration of biological activity and safety
[Figure 8: two panels plotting Average EC50 and Average Selectivity Index against Nature of Range (LI, LT, Active, UT, UI) for the W, MCI, and AECI models.]
Fig. 8. Average EC50 and SI values of acylthiocarbamate compounds in various ranges of models developed using Wiener's
index, molecular connectivity index and augmented eccentric connectivity index (141). W Wiener's index, MCI molecular
connectivity index, AECI augmented eccentric connectivity index, LI lower inactive, LT lower transitional, UT upper
transitional, UI upper inactive.
[Figure 9: two panels plotting Average EC50 and Average SI against Nature of Range (LI, LT, A, UT, UI, UA) for the AC and Wc models.]
Fig. 9. Average EC50 and SI values of 1-alkoxy-5-alkyl-6-(arylthio)uracils in various ranges of models developed using
Wiener's topochemical index and superadjacency topochemical index (136). AC superadjacency topochemical index,
Wc Wiener's topochemical index, LI lower inactive, LT lower transitional, A active, UT upper transitional, UI upper inactive,
UA upper active.
Table 3
Examples of software used for development of classification models
Table 4
Definitions of goodness of fit parameters
3. Conclusion
The past few decades have witnessed an increasing trend toward the
development of (Q)SAR models, which constitute a vital component
of the drug discovery process. Numerous techniques of diverse
nature are being used to develop both correlation and
classification models. These models normally predict either the
biological activity or the toxicity of a compound. However, there is
a strong need for models that predict biological activity and
toxicity simultaneously, so as to accelerate the development of
safe drug molecules. Classification models play a vital role
in drug design. The active ranges of some classification models,
such as those developed through MAA, are of particular significance
because they take into consideration both the activity/potency
and the selectivity index. Active ranges of such models have
reportedly not only revealed high potency, as indicated by low
values of EC50 or IC50, but have also exhibited safety as
References
1. Sneyd JR (2001) Drug development in 21st century. Curr Anaesth Crit Care 12:329–334
2. Kubinyi H (2003) Drug research: myths, hype and reality. Nat Rev Drug Discov 2:665–668
3. Sussman NL, Kelly JH (2003) Saving time and money in drug discovery: a pre-emptive approach. Business Briefing. Future Drug Discov. Business Briefing Ltd., London, UK, 46–49
4. DiMasi J, Hansen R, Grabowski H (2003) The price of innovation: new estimates of drug development cost. J Health Econ 22:151–185
5. Marjana N, Marjana V (2010) QSAR model for reproductive toxicity and endocrine disruption activity. Molecules 15:1987–1999
6. Chen JW, Li XH, Yu HY et al (2008) Progress and perspectives of quantitative structure-activity relationships used for ecological risk assessment of toxic organic compounds. Sci China Ser B Chem 51:593–606
7. Ibezim EC, Duchowicz PR, Ibezim NE et al (2009) Computer-aided linear modeling employing QSAR for drug discovery. Sci Res Essays 4:1559–1564
8. Johnson DE, Wolfgang GHI (2000) Predicting human safety: screening and computational approaches. Drug Discov Today 5:445–454
9. Hou T, Wang J, Zhang W et al (2006) Recent advances in computational prediction of drug absorption and permeability in drug discovery. Curr Med Chem 13:2653–2667
10. Thai KM, Ecker GF (2008) A binary QSAR model for classification of hERG potassium channel blockers. Bioorg Med Chem 16:4107–4119
11. Eitrich T, Kless A, Druska C et al (2007) Classification of highly unbalanced cyp450 data of drugs using cost sensitive machine learning techniques. J Chem Inf Model 47:92–103
12. Ekins S, Boulanger B, Swaan PW et al (2002) Towards a new age of virtual ADME/TOX and multidimensional drug discovery. J Comput Aided Mol Design 16:381–401
13. Harvey SC (1980) Drug absorption, action and disposition. In: Osol A (ed) Remington's pharmaceutical sciences, 16th edn. Mack, Pennsylvania
14. Mandell GL, Petri WA Jr (1996) Antimicrobial agents: penicillins, cephalosporins, and other β-lactam antibiotics. In: Hardman JG, Limbird LE (eds) Goodman and Gilman's the pharmacological basis of therapeutics, 9th edn. McGraw-Hill, New York
15. Tsantrizos YS (2010) Research and development: the discovery process. http://www.wei-c-chen.com/winter2010/chem503/04%20-%20ADME-PK%20drug%20delivery.pdf. Accessed 4 Dec 2010
16. Bajorath J (2001) Selected concepts and investigation in compound classification, molecular descriptor analysis and virtual screening. J Chem Inf Comput Sci 41:233–245
17. Bassan A, Worth AP (2008) The integrated use of models for the properties and effects of chemicals by means of a structured workflow. QSAR Comb Sci 27:6–20
18. Worth AP, Cronin MTD (2003) The use of discriminant analysis, logistic regression and classification tree analysis in the development of classification models for human health effects. J Mol Str Theochem 622:97–111
19. OECD (2007) Guidance document on the validation of (Quantitative) Structure-Activity Relationships [(Q)SAR] Models, p. 62, OECD Environment Health and Safety Publications Series on Testing and Assessment, No. 69
20. Grover M, Singh B, Bakshi M, Singh S (2000) Quantitative structure-property relationships in pharmaceutical research. Part 1. Pharm Sci Technol Today 3:28–35
21. Gupta S, Singh M, Madan AK (2001) Predicting anti-HIV activity: computational approach using a novel topological descriptor. J Comput Aided Mol Des 15:671–678
22. Worth AP, Cronin MTD (2000) Embedded cluster modelling: a novel quantitative structure-activity relationship for generating elliptic models of biological activity. In: Balls M, van Zeller AM, Halder ME (eds) Progress in the reduction, refinement and replacement of animal experimentation. Elsevier Science, Amsterdam
23. Frimurer TM, Bywater R, Naerum L (2000) Improving the odds in discriminating drug-like from non drug-like compounds. J Chem Inf Comput Sci 40:1315–1324
24. Weber KC, Honório KM, Bruni AT et al (2006) The use of classification methods for modeling the antioxidant activity of flavonoid compounds. J Mol Model 12:915–920
25. Gillet VJ, Willett P, Bradshaw J (1998) Identification of biological activity profiles using substructural analysis and genetic algorithms. J Chem Inf Comput Sci 38:165–179
26. Alvarez-Ginarte YM, Crespo-Otero R, Marrero-Ponce Y et al (2006) In-silico classification of solubility using binary k-Nearest Neighbour and physicochemical descriptors. QSAR Comb Sci 25:881–894
27. Rodgers AD, Zhu H, Fourches D et al (2010) Modeling liver-related adverse effects of drugs using k nearest neighbor quantitative structure-activity relationship method. Chem Res Toxicol 23:724–732
28. Khlebnikov AI, Schepetkin IA, Kirpotina LN et al (2008) Computational structure-activity relationship analysis of non-peptide inducers of macrophage tumor necrosis factor-alpha production. Bioorg Med Chem 16:9302–9312
29. Nisius B, Goller AH, Bajorath J (2009) Combining cluster analysis, feature selection and multiple support vector machine models for the identification of human ether-a-go-go related gene channel blocking compounds. Chem Biol Drug Des 73:17–25
30. Kauffman GW, Jurs PC (2001) QSAR and k-Nearest Neighbour Classification Analysis of selective cyclooxygenase-2 inhibition using topologically-based numerical descriptors. J Chem Inf Comput Sci 41:1553–1560
31. Efferth T, Konkimalla VB, Wang YF et al (2008) Prediction of broad spectrum resistance of tumors towards anticancer drugs. Clin Cancer Res 14:2405–2412
32. Casanola-Martin GM, Marrero-Ponce Y, Khan MTH et al (2008) Atom- and bond-based 2D TOMOCOMD-CARDD approach and ligand-based virtual screening for the drug discovery of new tyrosinase inhibitors. J Biomol Screen 13:1014–1024
33. Lv W, Xue Y (2010) Prediction of acetylcholinesterase inhibitors and characterization of correlative molecular descriptors by machine learning methods. Eur J Med Chem 45:1167–1172
34. Boiani M, Cerecetto H, Gonzalez M et al (2008) Modeling anti-Trypanosoma cruzi activity of N-oxide containing heterocycles. J Chem Inf Model 48:213–219
35. Lin HH, Han LY, Yap CW et al (2007) Prediction of factor Xa inhibitors by machine learning methods. J Mol Graph Model 26:505–518
36. Gunturi SB, Theerthala SS, Patel NK et al (2010) Prediction of skin sensitization potential using D-optimal design and GA-kNN classification methods. SAR QSAR Environ Res 21:305–335
37. Polishchuk PG, Muratov EN, Artemenko AG et al (2009) Application of random forest approach to QSAR prediction of aquatic toxicity. J Chem Inf Model 49:2481–2488
38. Zhu H, Rusyn I, Richard A et al (2008) Use of cell viability assay data improves the prediction accuracy of conventional quantitative structure-activity relationship models of animal carcinogenicity. Environ Health Perspect 116:506–513
39. Zhang S, Wei L, Bastow K et al (2007) Antitumor agents 252. Application of validated QSAR models to database mining: discovery of novel tylophorine derivatives as potential anticancer agents. J Comput Aided Mol Des 21:97–112
40. Asikainen A, Kolehmainen M, Ruuskanen J et al (2006) Structure-based classification of active and inactive estrogenic compounds by decision tree, LVQ and kNN methods. Chemosphere 62:658–673
41. Kadam RU, Roy N (2006) Cluster analysis and two-dimensional quantitative structure-activity relationship (2D-QSAR) of Pseudomonas aeruginosa deacetylase LpxC inhibitors. Bioorg Med Chem Lett 16:5136–5143
42. Hammann F, Gutmann H, Baumann U et al (2009) Classification of cytochrome p(450) activities using machine learning methods. Mol Pharm 6:1920–1926
43. Petkov PI, Temelkov S, Villeneuve DL et al (2009) Mechanism-based categorization of aromatase inhibitors: a potential discovery and screening tool. SAR QSAR Environ Res 20:657–678
44. Burton J, Danloy E, Vercauteren DP (2009) Fragment-based prediction of cytochromes P450 2D6 and 1A2 inhibition by recursive partitioning. SAR QSAR Environ Res 20:185–205
computational chemistry approach to antimicrobials. II. Multiple distance and triadic census analysis of antiparasitic drugs complex networks. J Comput Chem 31:164–173
70. Prado-Prado FJ, Borges F, Perez-Montoto LG et al (2009) Multi-target spectral moment: QSAR for antifungal drugs vs. different fungi species. Eur J Med Chem 44:4051–4056
71. Sun M, Chen J, Wei H et al (2009) Quantitative structure-activity relationship and classification analysis of diaryl ureas against vascular endothelial growth factor receptor-2 kinase using linear and non-linear models. Chem Biol Drug Des 73:644–654
72. Gozalbes R, Barbosa F, Nicola E et al (2009) Development and validation of a pharmacophore-based QSAR model for the prediction of CNS activity. ChemMedChem 4:204–209
73. Garcia-Domenech R, Lopez-Pena W, Sanchez-Perdomo Y et al (2008) Application of molecular topology to the prediction of the antimalarial activity of a group of uracil-based acyclic and deoxyuridine compounds. Int J Pharm 363:78–84
74. Castillo-Garit JA, Marrero-Ponce Y, Torrens F et al (2008) Estimation of ADME properties in drug discovery: predicting Caco-2 cell permeability using atom-based stochastic and non-stochastic linear indices. J Pharm Sci 97:1946–1976
75. Alvarez-Ginarte YM, Marrero-Ponce Y, Ruiz-Garcia JA (2008) Applying pattern recognition methods plus quantum and physico-chemical molecular descriptors to analyze the anabolic activity of structurally diverse steroids. J Comput Chem 29:317–333
76. Kamphorst J, Cucurull-Sanchez L, Jones B (2007) A performance evaluation of multiple classification models of human PEPT1 inhibitors and non-inhibitors. QSAR Comb Sci 26:220–226
77. Cannon EO, Amini A, Bender A et al (2007) Support vector inductive logic programming outperforms the naive Bayes classifier and inductive logic programming for the classification of bioactive chemical compounds. J Comput Aided Mol Des 21:269–280
78. Sun H (2006) An accurate and interpretable bayesian classification model for prediction of HERG liability. ChemMedChem 1:315–322
79. Wang J, Liu H, Qin S et al (2007) Study on structure-activity relationship of new anti-HIV nucleoside derivatives based on the support vector machine method. QSAR Comb Sci 26:161–172
80. Sato T, Matsuo Y, Honma T (2008) In silico functional profiling of small molecules and its applications. J Med Chem 51:7705–7716
81. Du H, Wang J, Watzl J et al (2008) Classification structure-activity relationship (CSAR) studies for prediction of genotoxicity of thiophene derivatives. Toxicol Lett 177:10–19
82. Tang H, Wang XS, Huang XP (2009) Novel inhibitors of human histone deacetylase (HDAC) identified by QSAR modeling of known inhibitors, virtual screening, and experimental validation. J Chem Inf Model 49:461–476
83. Leong MK, Chen TH (2008) Prediction of cytochrome P450 2B6-substrate interactions using pharmacophore ensemble/support vector machine (PhE/SVM) approach. Med Chem 4:396–406
84. Leong MK (2007) A novel approach using pharmacophore ensemble/support vector machine (PhE/SVM) for prediction of hERG liability. Chem Res Toxicol 20:217–226
85. Auerbach SS, Shah RR, Mav D et al (2010) Predicting the hepatocarcinogenic potential of alkenylbenzene flavoring agents using toxicogenomics and machine learning. Toxicol Appl Pharmacol 243:300–314
86. Leong MK, Chen YM, Chen HB et al (2009) Development of a new predictive model for interactions with human cytochrome P450 2A6 using pharmacophore ensemble/support vector machine (PhE/SVM) approach. Pharm Res 26:987–1000
87. Votano JR, Parham M, Hall LM (2006) QSAR modeling of human serum protein binding with several modeling techniques utilizing structure-information representation. J Med Chem 49:7169–7181
88. Fernandez M, Caballero J (2006) Ensembles of Bayesian-regularized genetic neural networks for modeling of acetylcholinesterase inhibition by huprines. Chem Biol Drug Des 68:201–212
89. Tarasov VA, Mustafaev ON, Abilev SK (2005) Use of ensemble structural descriptors for increasing the efficiency of QSAR study. Genetika 41:997–1005
90. Guha R, Jurs PC (2004) Development of linear, ensemble, and nonlinear models for the prediction and interpretation of the biological activity of a set of PDGFR inhibitors. J Chem Inf Comput Sci 44:2179–2189
91. Debeljak Z, Skrbo A, Jasprica I et al (2007) QSAR study of antimicrobial activity of some 3-nitrocoumarins and related compounds. J Chem Inf Model 47:918–926
92. Bajaj S, Sambi SS, Madan AK (2004) Prediction 104. Dureja H, Madan AK (2007) Topochemical
of carbonic anhydrase activation of tri/tetra models for prediction of telomerase inhibi-
substituted-pyridinium-azole compounds- tory activity of flavonoids. Chem Biol Drug
Computational approach using novel topo- Des 70:4752
chemical descriptor. QSAR Comb Sci 105. Bajaj S, Sambi SS, Madan AK (2006) Models
23:506514 for prediction of anti-neoplastic activity of
93. Bajaj S, Sambi SS, Madan AK (2004) Topo- 1,2-Bis(sulfonyl)-1-methylhydrazines:
logical models for prediction of anti- Computational approach using Wieners
inflammatory activity of N-arylanthranilic indices. MATCH Commun Math Comput
acids. Bioorg Med Chem 12:36953701 Chem 56:193204
94. Bajaj S, Sambi SS, Madan AK (2005) Topo- 106. Blower PE, Cross KP (2006) Decision tree
chemical model for prediction of corticotro- methods in pharmaceutical research. Curr
pin releasing factor antagonizing activity of Top Med Chem 6:3139
N-phenylphenylglycines analogs. Acta Chim 107. Andres C, Hutter MC (2006) CNS perme-
Slov 52:292296 ability of drugs predicted by a decision tree.
95. Bajaj S, Sambi SS, Madan AK (2008) Topo- QSAR Comb Sci 25:305309
chemical models for predicting the activity of 108. Dudek AZ, Arodz T, Galvez J (2006)
a, g-diketo acids as inhibitors of the hepatitis Computational methods in developing quan-
C virus NS5B RNA-dependent RNA poly- titative structure-activity relationship: a
merase. Pharmaceut Chem J 40:650654 review. Comb Chem High Throughput
96. Sardana S, Madan AK (2002) Predicting anti- Screen 9:213228
convulsant activity of benzamides/benzyla- 109. Breiman L, Friedman JH, Olshen RA et al
mines: computational approach using (1984) Classification and regression trees.
topological descriptors. J Comput Aided Mol Wadsworth International Group, Belmont
Des 16:545550 110. Svetnik V, Liaw A, Tong C et al (2003) Ran-
97. Lather V, Madan AK (2005) Topological dom forest: a classification and regression tool
model for the prediction of MRP1 inhibitory for compound classification and QSAR mod-
activity of pyrrolopyrimidines and templates eling. J Chem Inf Comput Sci 43:19471958
derived from pyrrolopyrimidine. Bioorg Med 111. Tong W, Hong H, Fang H et al (2003) Deci-
Chem Lett 15:49674972 sion forest: combining the predictions of mul-
98. Gupta S, Singh M, Madan AK (1999) Super- tiple independent decision tree models. J
pendentic index: a novel topological descrip- Chem Inf Comput Sci 43:525531
tor for predicting biological activity. J Chem 112. Aboul-Kassim TAT, Simoneit BRT (2001)
Inf Comput Sci 39:272277 QSAR/QSPR and multicomponent joint
99. Kumar V, Madan AK (2006) Application of toxic effect modeling of organic pollutants at
graph theory: prediction of cytosolic phos- aqueous-solid phase interfaces, vol 5 E, The
pholipase A2 inhibitory activity of propan-2- handbook of environmental chemistry.
ones. J Math Chem 39:511521 Springer, Heidelberg
100. Kumar V, Madan AK (2005) Application of 113. Wold S, Dunnt WJ, Hellberg S (1985) Toxicity
graph theory: prediction of glycogen synthase modeling and prediction with pattern recogni-
kinase-3 inhibitory activity of thiadiazolidi- tion. Environ Health Perspect 61:257268
nones as potential drugs for the treatment of 114. http://www-users.cs.umn.edu/~kumar/
Alzheimers disease. Eur J Pharm Sci dmbook/ch8.pdf. Accessed 15 Aug 2010
24:213218
115. http://www.geosoft.com. Accessed 28 Nov
101. Kumar V, Madan AK (2007) Application of 2010
graph theory: models for prediction of car-
bonic anhydrase inhibitory activity of sulpho- 116. Jalali-Heravi M, Shahbazikhah P, Zekavat B
namides. J Math Chem 42:925940 et al (2007) Principal component analysis-
ranking as a variable selection method for the
102. Dureja H, Madan AK (2005) Topochemical simulation of 13C nuclear magnetic resonance
models for prediction of cyclin-dependent spectra of xanthones using artificial neural net-
kinase 2 inhibitory activity of indole-2-ones. works. QSAR Comb Sci 26:764772
J Mol Model 11:525531
117. Adenot M, Perriere N, Scherrmann JM et al
103. Dureja H, Madan AK (2006) Prediction of (2007) Applications of a blood-brain barrier
h5-HT2A receptor antagonistic activity of technology platform to predict CNS penetra-
arylindoles: computational approach using tion of various chemotherapeutic agents. 1.
topochemical descriptors. J Mol Graph Anti-infective drugs. Chemotherapy 53:7072
Model 25:373379
124 A.K. Madan et al.
Abstract
In this chapter, a range of computational tools for applying QSAR and grouping/read-across methods are
described, and their integrated use in the computational assessment of genotoxicity is illustrated through
the application of selected tools to two case-study compounds: 2-amino-9H-pyrido[2,3-b]indole (AaC)
and 2-aminoacetophenone (2-AAP). The first case-study compound (AaC) is an environmental pollutant and
a food contaminant that can be formed during the cooking of protein-rich food. The second case-study
compound (2-AAP) occurs naturally in certain foods and has also been proposed for use as a
flavoring agent. The overall aim is to describe and illustrate a possible way of combining different informa-
tion sources and software tools for genotoxicity and metabolism prediction by means of a simple stepwise
approach. The chapter is aimed at researchers and assessors who have a basic knowledge of computational
toxicology and some familiarity with the practical use of computational tools. The emphasis is on how to
evaluate the data generated by multiple tools, rather than the practical use of any specific tool.
Key words: Mutagenicity, Genotoxicity, QSAR, Expert system, Structural alert, Read-across,
Chemical category, Toxicity, Metabolism, Prediction
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_6, © Springer Science+Business Media, LLC 2013
126 A.P. Worth et al.
2. Materials
2.2. Derek for Windows
Derek for Windows (DfW) is developed and marketed by Lhasa Ltd, a nonprofit
company and educational charity in the UK (https://www.lhasalimited.org/).
DfW contains over 630 alerts covering a wide range of toxicological endpoints
in humans, other mammals, and bacteria. An alert in DfW consists of a
toxicophore (a substructure known or thought to be responsible for the
toxicity) and is associated with literature references, comments, and
examples (14). A key feature of DfW is the transparent reporting of the
reasoning underlying each prediction.
All the rules in DfW are based either on hypotheses relating to mechanisms
of action of a chemical class or on observed empirical relationships.
Information used in the development of rules includes published data and
suggestions from toxicological experts in industry, regulatory bodies, and
academia. The toxicity predictions are the result of two processes. The
program first checks whether any alerts in the knowledge base match
toxicophores in the query structure. The reasoning engine then assesses the
likelihood of a structure being toxic. There are nine levels of confidence:
certain, probable, plausible, equivocal, doubted, improbable, impossible,
open, and contradicted. DfW can be integrated with Lhasa's Meteor software,
which makes predictions of metabolic fate, thereby providing predictions of
toxicity for both parent compounds and their metabolites.
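The nine confidence levels can be held in a small enumeration so that outputs from different runs are recorded with a fixed vocabulary. The sketch below is illustrative only: the level names come from the text, whereas the grouping into levels that support or oppose toxicity is an assumption made here for demonstration and is not taken from the DfW documentation.

```python
from enum import Enum


class Likelihood(Enum):
    """The nine confidence levels listed in the text."""
    CERTAIN = "certain"
    PROBABLE = "probable"
    PLAUSIBLE = "plausible"
    EQUIVOCAL = "equivocal"
    DOUBTED = "doubted"
    IMPROBABLE = "improbable"
    IMPOSSIBLE = "impossible"
    OPEN = "open"
    CONTRADICTED = "contradicted"


# Assumed grouping, for illustration only (not Lhasa's definition).
SUPPORTS = {Likelihood.CERTAIN, Likelihood.PROBABLE, Likelihood.PLAUSIBLE}
OPPOSES = {Likelihood.DOUBTED, Likelihood.IMPROBABLE, Likelihood.IMPOSSIBLE}


def summarize(level: Likelihood) -> str:
    """Collapse a confidence level into a coarse three-way summary."""
    if level in SUPPORTS:
        return "supports toxicity"
    if level in OPPOSES:
        return "opposes toxicity"
    return "undetermined"


# Levels can be parsed from the strings reported by the tool.
print(summarize(Likelihood("plausible")))
```

Parsing through `Likelihood(...)` raises a `ValueError` for any string outside the nine levels, which guards against transcription errors when results are collated by hand.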
2.6. OECD QSAR Toolbox
The OECD QSAR Toolbox is a stand-alone software application for filling gaps
in (eco)toxicity data needed for assessing the hazards of chemicals. Data
gaps are filled by following a flexible workflow in which chemical
categories are built and missing data are estimated by read-across or by
applying local QSARs (trends within the category). The Toolbox also includes
a range of profilers to quickly evaluate chemicals for common mechanisms or
modes of action. In order to support read-across and trend analysis, the
Toolbox contains numerous databases with results from experimental studies.
The first version of the Toolbox, released in March 2008, was a
proof-of-concept version. The second version (v. 2.0) was released in
October 2010. The release of version 3.0 is planned for October 2012. The
Toolbox and guidance on its use are freely available:
http://www.qsartoolbox.org/.
2.7. OncoLogic
OncoLogic is a freely available expert system that assesses the potential of
chemicals to cause cancer. It was developed by the US EPA in collaboration
with LogiChem, Inc. It predicts the potential carcinogenicity of chemicals
by applying the rules of structure-activity relationship (SAR) analysis and
incorporating what is known about the mechanisms of action and human
epidemiological studies. The software reveals its line of reasoning, as a
human expert would, to support the predictions made. It also includes a
database of toxicological information relevant to carcinogenicity
assessment. The Cancer Expert System comprises four subsystems that evaluate
fibers, metals, polymers, and organic chemicals of diverse chemical
structures. Chemicals are entered one by one, and the user needs only a
limited knowledge of chemistry in order to select the appropriate subsystem.
OncoLogic is freely downloadable from the US EPA website:
http://www.epa.gov/oppt/sf/pubs/oncologic.htm.
3. Methods
3.1. Stepwise Approach to Hazard Assessment
A stepwise approach for using toxicity and metabolism prediction
tools in the context of a hazard assessment has been proposed in an
earlier work (21) and incorporated into the technical guidance for
REACH (22). The stepwise approach presented here is an adapta-
tion of the approach proposed previously.
6 QSAR and Metabolic Assessment Tools in the Assessment of Genotoxicity 131
3.2. Step 1. Existing Data and Information
The workflow begins by identifying the information on chemical properties
that is needed for in-house development purposes or to meet a given set of
regulatory requirements. In the case studies presented, an assessment of
genotoxic potential is performed as a step towards evaluating carcinogenic
potential.
Information needs to be collected about the chemical composition (identity
of main chemical component, other components, purity, impurities) of the
substance of interest, and a specific compound is selected for the study
(referred to as the "parent compound"). This is necessary because
predictions from (Q)SAR methods and grouping approaches are generated from a
single well-defined structure (represented in a machine-readable format such
as SMILES code or mol file). The purity/impurity profile can also be useful
when explaining discrepancies between experimental and non-testing data.
It is essential to verify the structure of the parent compound, along with
key characteristics such as protonation, isomerism, and tautomerism. If the
compound is known only by a CAS number or by its name, it is necessary to
derive its structure (e.g., in the form of a SMILES code) to be used in the
prediction generation process. This can be achieved by using an application
that serves as a "Structure Converter" tool. If the structure is known, it
is important to verify that the structural information agrees with the CAS
number or with the name. In the case of large datasets, this can be a
time-consuming exercise, in which some degree of expert intervention is
necessary.
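For large datasets, one inexpensive screen that can precede expert checking is the check-digit test built into CAS Registry Numbers: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below is a minimal illustration; the function name is hypothetical, and the test only catches transcription errors. It cannot confirm that a number belongs to a given structure or name.

```python
import re


def cas_check_digit_ok(cas: str) -> bool:
    """Return True if the string is a well-formed CAS number with a valid check digit."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check


# Water (7732-18-5) is used as a well-known example.
print(cas_check_digit_ok("7732-18-5"))
print(cas_check_digit_ok("7732-18-4"))  # last digit altered
```

Records that fail this screen can be routed directly to manual review, leaving expert time for the harder question of whether identifier and structure actually agree.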
Information on the parent compound should be collected from a range of
available databases. This profiling involves the collation of available data
on the physicochemical properties (see the chapter by Dearden (24)),
ecological effects, fate, and health effects. Data can be retrieved from the
scientific literature, Internet-based resources (free and commercial
resources), or electronic databases. Some useful starting points are
illustrated in the case studies in Subheading 4.
3.3. Step 2. QSAR Predictions
This step refers to the use of any (Q)SAR-based tool for predicting
the physicochemical, biological, or environmental property of
interest. The use of such tools is distinguished from the use of
those in step 3 in that the predictive algorithms are predefined,
and thus should be reproducible. These tools are typically user-
friendly and as such suitable for use by nonspecialists. Generally,
it is sufficient to enter the chemical structure of interest, by drawing
the molecule with an editor, or more easily by using a SMILES code
or a mol file. The challenge in using such tools is typically in the
interpretation of the result. For example, in the case of genotoxicity
prediction, what exactly is a positive prediction? The underlying
dataset and evaluation criteria should be checked, but these are not
always readily available or completely transparent, especially in the
case of commercial tools.
3.4. Step 3. Grouping and Read-Across
In contrast with the use of (Q)SAR tools, the grouping and read-across
approach is a more ad hoc approach involving a range of subjective choices
in terms of categorization tools, similarity metrics, datasets for the
retrieval of analogues, and criteria for analogue selection. A broad
chemical and toxicological expertise is needed to apply this approach.
Consequently, the approach is unlikely to be reproducible, unless all of the
expert choices are clearly documented and can be retraced. A detailed
explanation of the category and read-across approach is given by Enoch (25).
Various tools can be used to assist grouping and read-across, including the
freely available Toxmatch (http://ihcp.jrc.ec.europa.eu/our_labs/
computational_toxicology/qsar_tools/toxmatch) (26), AMBIT
(http://ambit.sourceforge.net/), and the OECD QSAR Toolbox
(http://www.qsartoolbox.org/).
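One simple way to make the expert choices retraceable is to store them in a structured record alongside the results. The sketch below is only a suggestion: the class and field names are hypothetical, chosen to mirror the choices listed above (categorization tool, similarity metric, analogue datasets, selection criteria), and the bracketed strings are placeholders to be filled in by the assessor.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ReadAcrossRecord:
    """Hypothetical record of the subjective choices behind one read-across."""
    query_compound: str
    categorization_tool: str
    profilers: List[str]
    similarity_metric: str
    analogue_sources: List[str]
    selection_criteria: str
    notes: str = ""


record = ReadAcrossRecord(
    query_compound="2-aminoacetophenone",
    categorization_tool="OECD QSAR Toolbox",
    profilers=["DNA binding by OECD"],
    similarity_metric="(state the metric and any threshold used)",
    analogue_sources=["(list the databases searched)"],
    selection_criteria="(describe why analogues were kept or rejected)",
)

# Serialize so the record can be archived with the assessment.
record_json = json.dumps(asdict(record), indent=2)
print(record_json)
```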
For the purposes of the case studies reported here, the OECD
Toolbox was used. The following paragraphs provide some con-
siderations relevant to the use of this tool when grouping chemicals
according to their potential genotoxicity (DNA-binding potential).
To build a category, the first step is to profile the chemical under study
using the appropriate general mechanistic and endpoint-specific profilers
available in the tool. A crucial element in this approach is the definition
of similarity, i.e., the choice of similarity metric, which is to some
extent subjective. In the OECD QSAR Toolbox, the following profilers are
available for the categorization of potentially genotoxic chemicals:
• General mechanistic profilers:
  - DNA binding by OASIS (based on the Ames Mutagenicity models in the
    TIMES software).
  - DNA binding by OECD (this includes 60 structural alerts based on
    organic chemistry mechanisms for the identification of DNA-binding
    chemicals).
• Endpoint-specific profilers:
  - Micronucleus alerts by Benigni-Bossa (based on the ToxMic rulebase of
    the Toxtree software).
  - Mutagenicity/Carcinogenicity alerts by Benigni-Bossa (based on the
    corresponding module of the Toxtree software).
Once the compound has been profiled using the relevant mech-
anistic and endpoint-specific profilers, one has to consider various
possibilities to build a category around the profilers obtained. The
following general approach is recommended:
1. If all of the profilers agree on a single mechanism, one has high
confidence that the mechanism is correct. It is therefore rea-
sonable to build a category around this mechanism.
2. If conflicting results are obtained (i.e., multiple mechanisms),
one should be more cautious about the resulting category and
look into the suggested mechanisms: if one of the suggested
mechanisms (from the mechanistic profilers) is supported by an
alert from the endpoint-specific profilers, this provides confi-
dence about the choice of mechanism for category building.
3. Finally, if the mechanistic profilers give conflicting results, indi-
cating multiple mechanisms, and the endpoint-specific profilers
cannot help to resolve which of the mechanisms is more likely,
one is left with the option of building separate categories
relating to the different mechanisms. For example, if a substance
is both an aromatic amine and an aldehyde (e.g., an
aminophenylacetaldehyde), it might interact with DNA both
via nitrenium formation (due to the amino group) and Schiff
base formation (due to the aldehyde moiety).
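The three rules above can be written as a short decision function. This is a sketch under stated assumptions, not part of the Toolbox: the function name and return labels are hypothetical, the inputs are the set of mechanisms suggested by the mechanistic profilers and the subset of those mechanisms backed by an endpoint-specific alert, and the handling of an empty input is an addition not covered by the text.

```python
from __future__ import annotations


def category_strategy(
    mechanistic: set[str], endpoint_supported: set[str]
) -> tuple[str, list[str]]:
    """Suggest how to build categories from profiler results.

    mechanistic: mechanisms suggested by the general mechanistic profilers.
    endpoint_supported: mechanisms backed by an endpoint-specific alert.
    """
    if not mechanistic:
        # Not covered by the three rules; added as an assumption.
        return ("no mechanism identified", [])
    if len(mechanistic) == 1:
        # Rule 1: the profilers agree on a single mechanism.
        return ("single category (profilers agree)", sorted(mechanistic))
    supported = mechanistic & endpoint_supported
    if len(supported) == 1:
        # Rule 2: an endpoint-specific alert singles out one mechanism.
        return ("single category (endpoint-supported)", sorted(supported))
    # Rule 3: conflict unresolved, so build one category per mechanism.
    return ("separate categories", sorted(mechanistic))


# The aromatic amine / aldehyde example from rule 3.
print(category_strategy({"nitrenium ion formation", "Schiff base formation"}, set()))
```

Returning the rule label together with the mechanisms keeps a trace of why a category was chosen, which helps when the grouping has to be justified later.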
3.5. Step 4. Simulation of Metabolism
One of the shortcomings of the QSAR and grouping approaches is that
metabolic and environmental fate are generally not taken into account, at
least explicitly (although this information may be implicitly encoded in the
underlying data, for example Ames test results in the presence of metabolic
activation [S9 mix]). Thus, it is often desirable, perhaps even essential,
to perform a computational assessment of reactivity and fate both in the
environment and in biological organisms. In general, this is more difficult
than toxicity prediction, but a wide range of software tools have been
developed and applied, especially in the pharmaceutical sector (27).
Two of the most commonly used commercial tools are Meteor and HazardExpert,
described in Subheading 2. In the public domain, the Chemical Reactivity and
Fate Tool (CRAFT) and Metabolic Information Input System (METIS) have been
developed under contract to the European Commission's JRC by Molecular
Networks GmbH (Erlangen, Germany). CRAFT was
3.6. Step 5. Overall Assessment
An evaluation of the adequacy of data obtained at each step is carried out,
and the need for further information is determined. When all relevant steps
have been applied, an overall assessment is carried out, weighing all of the
available information. Again, this is a rather subjective process, and there
are no widely accepted guidelines on how to perform this weight-of-evidence
(WoE) assessment in a consistent manner.
4. Case Studies on the Application of Computational Methods
4.1. Case Study 1: 2-Amino-9H-pyrido[2,3-b]indole
The amino-α-carboline, AaC, is a heterocyclic aromatic amine (HAA) formed
during ordinary cooking and also present in cigarette smoke condensate and
in diesel exhaust particles (28). AaC and other HAAs are mutagens and
carcinogens formed as pyrolysis products during the cooking of protein-rich
food such as meats and fish. The structural identifiers of AaC are
summarized in Table 1.
4.1.1. Step 1. Existing Data and Information
Mechanistic Knowledge
The link between AaC and its mutagenic activity is well documented in the
literature. The bioactivation of AaC is mediated primarily by cytochrome
P450 1A2 in both rodents and humans, leading to N-oxidation of the exocyclic
amine group to form 2-hydroxyamino-9H-pyrido[2,3-b]indole (HONH-AaC). The
HONH-AaC metabolite can then undergo further activation by phase II enzymes
(O-acetyltransferases and sulfotransferases) to form the penultimate ester
species, which bind to DNA (29, 30). Generally, it is proposed that the aryl
nitrenium ions of HAAs form adducts by covalent modification of bases in
DNA. To date, the guanine adducts of several HAAs (MeAaC, AaC, Trp-P-2,
Glu-P-1, MeIQ, DiMeIQx, and PhIP) have been characterized (28). The sole DNA
adduct of AaC identified thus far is
N-(deoxyguanosin-8-yl)-2-amino-9H-pyrido[2,3-b]indole (dG-C8-AaC) (31, 32).
Table 1
Identifiers for case study 1: AaC
Databases
A search for existing genotoxicity data for the query compound (by CAS
number) was conducted using freely available, Internet-based resources such
as the online eChemPortal database (http://www.echemportal.org/) and the
downloadable OECD QSAR Toolbox. The eChemPortal search retrieved data for
the query substance from the following participant databases: HSDB
(Hazardous Substance Data Bank), INCHEM (Chemical Safety Information from
Intergovernmental Organizations), ACToR (US EPA Aggregated Computational
Toxicology Resource), and the US EPA SRS (US EPA Substance Registry
Services). Among these, only ACToR provided human genotoxicity information
for AaC: the compound was positive in the in vitro sister chromatid exchange
(SCE) assay (cell type unspecified; data from NLM TOXNET GENETOX). In
addition, data in cultured Chinese hamster lung cells were retrieved from
INCHEM, indicating that AaC induces mutations, chromosomal aberrations, and
polyploidy.
From the Internet resources, no data were available on the
carcinogenicity of AaC to humans. However, the HSDB states
that the collected animal data are sufficient to conclude that the
compound is a possible carcinogen in humans. The carcinogenic
potency of AaC is recorded in the ISSCAN and CPDB databases
(http://potency.berkeley.edu/cpdb.html) with a mouse TD50
value equal to 49.8 mg/kg/day.
Information on human exposure is also provided in the above-
mentioned Internet resources. For example, in the HSDB it is stated
that the general population may be exposed to AaC via inhalation of
cigarette smoke or ingestion of food cooked at high temperatures.
4.1.2. Step 2. QSAR Predictions
Predictions of genotoxicity for AaC using different (Q)SAR software tools
are summarized in Table 2. Most of the software tools applied gave positive
predictions for mutagenicity (Ames test) as well as for other endpoints,
i.e., the rodent in vivo micronucleus (MN) test (Toxtree), mammalian in
vitro chromosome damage (DfW), and carcinogenicity (DfW, OncoLogic, and
Toxtree). The predictions obtained were positive with all software except
one (Lazar) and thus, overall, they were in concordance with the
experimental data available.
Some of the software tools provide additional information on
the (Q)SAR analysis performed such as statistical performance of
the model, inclusion of the query compound in the model training
set, and toxicity predictions for chemicals with a similar structure to
the query compound. This information is useful when evaluating
the reliability of the prediction made for the query compound.
CAESAR. In the case of the CAESAR software, five compounds similar to AaC
(considering the two-dimensional structure) were retrieved from the model
database and correctly predicted as mutagens in all cases (data not shown).
This result helps to increase confidence that the prediction made for AaC is
also reliable.
For expert systems such as DfW and Toxtree, the applicability domain is
defined by the so-called mitigating factors or restrictions given for each
structural alert (SA), while the validation comments provided in the DfW
software and in the Toxtree manual, respectively, demonstrate the predictive
performance of each of the SAs fired against specified internal and/or
external test sets. The DfW and Toxtree expert systems search their
respective rulebases for SAs matching the query chemical. When evaluating in
vitro mutagenicity (Ames test) and carcinogenicity for AaC, the prediction
and reasoning of these systems ascribe the genotoxic carcinogenicity of this
compound to the presence of the aromatic amine which is part of a polycyclic
aromatic ring system (see Table 3 for details of the alerts;
Table 2
Experimental genotoxicity data for AaC in the OECD QSAR Toolbox
Table 3
Predictions of genotoxicity for AaC using different (Q)SAR software tools
TOPKAT. When this tool was applied to AaC, the query chemical was flagged as
being in the model training set. In addition, 424 compounds similar to AaC
were found in the model training set (data not shown). For HONH-AaC, 424
similar compounds were likewise found in the model database (including AaC).
Conclusion Step 2. The QSAR-based prediction of genotoxic potential for AaC
seems reliable and in agreement with available experimental data.
4.1.3. Step 3. Grouping and Read-Across
The OECD QSAR Toolbox was used to profile AaC in terms of the available
mechanistic and endpoint-specific profilers relevant to genotoxicity, these
being the mechanistic schemes "DNA binding by OASIS" and "DNA binding by
OECD" and the endpoint-specific "Micronucleus alerts by Benigni-Bossa"
(ToxMic) and "Mutagenicity/Carcinogenicity alerts by Benigni-Bossa."
AaC contains the aromatic amine substructure and was therefore profiled
under the aromatic amine category of the DNA binding scheme. This includes
primary and secondary (hetero)aromatic amines, unless they bear structural
features which have been identified to prevent covalent DNA binding (33).
This chemical class falls within the SN1 mechanistic domain leading to DNA
adducts via nitrenium ion formation (34), in agreement with the
Benigni-Bossa scheme (33). Likewise, AaC satisfies the requirements listed
under the "Amines" category definition of the OECD DNA binding scheme, which
is linked to the same mechanism of action. When profiled using the
endpoint-specific schemes, the query compound triggered the following two
structural alerts, which are common to the ToxMic and Benigni-Bossa schemes
used:
1. Heterocyclic polycyclic aromatic hydrocarbons.
2. Primary aromatic amine, hydroxyl amine, and its derived esters.
Conclusion Step 3: It is possible, for example by using the QSAR
Toolbox, to build chemical categories for specific mechanisms of
DNA binding. For example, if the category is based on nitrenium
ion formation, AaC would be predicted as genotoxic by read-across.
A full illustration of this process is given in the next case study.
4.1.4. Step 4. Simulation of Metabolism
OECD QSAR Toolbox metabolism simulator. No data on observed metabolism could
be retrieved from the QSAR Toolbox. On the other hand, the liver metabolism
simulator of the Toolbox produced nine metabolites of AaC (data not shown).
One of these is the documented product of P450-mediated N-oxidation,
HONH-AaC. The other metabolites bear one or more hydroxyl groups at various
positions in the fused ring system of AaC. All the predicted metabolites
were profiled in the same manner as the parent compound, and gave the same
genotoxicity profile as AaC (data not shown).
SMARTCyp metabolism simulator. The sites of P450-mediated metabolism were
predicted using the SMARTCyp Web server (freely accessible at
http://www.farma.ku.dk/smartcyp/). As shown in Fig. 1, the primary site of
metabolism for the query compound was the amino N atom (ranked 1 in
Table 4), followed by the C4 and C5 atoms in the indole ring (ranked 2 and
3, respectively; Table 4).
Fig. 1. Use of SMARTCyp (v. 1.5.3) to predict the sites of metabolism in AaC.
Table 4
SMARTCyp ranking of AaC sites liable to cytochrome P450-mediated metabolism
Table 5
Simulation of metabolic transformation of AaC by Meteor

Parent  Metabolite  Biotransformation name                        Phase     Enzyme                                       Log P
Q       M1          Hydroxylamines from primary aromatic amines   Phase I   CYP450                                       2.194
Q       M2          Hydroxylation of fused benzenes               Phase I   CYP450                                       1.804
M1      M3          O-Sulfation of N-hydroxy compounds            Phase II  SULT                                         0.37
M2      M4          Glucuronidation of aromatic alcohols          Phase II  UGT                                          0.256
M2      M5          Oxidation of 5-hydroxyindole derivatives      Phase II  CYP450/peroxidase, GST, GT, peptidase, NAT   1.07
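The parent and metabolite columns of Table 5 define a small tree, and walking it backwards from a metabolite recovers the sequence of biotransformations that produces it. The sketch below encodes the five rows of the table; it assumes that "Q" denotes the query compound (AaC), and the function name is hypothetical.

```python
# (parent, metabolite, biotransformation name), transcribed from Table 5.
TABLE_5 = [
    ("Q", "M1", "Hydroxylamines from primary aromatic amines"),
    ("Q", "M2", "Hydroxylation of fused benzenes"),
    ("M1", "M3", "O-Sulfation of N-hydroxy compounds"),
    ("M2", "M4", "Glucuronidation of aromatic alcohols"),
    ("M2", "M5", "Oxidation of 5-hydroxyindole derivatives"),
]

# Each metabolite has exactly one parent in this table.
PARENT_OF = {metabolite: (parent, name) for parent, metabolite, name in TABLE_5}


def pathway_to(metabolite: str, root: str = "Q") -> list:
    """Return the ordered (parent, metabolite, step name) tuples from root."""
    steps = []
    current = metabolite
    while current != root:
        parent, name = PARENT_OF[current]  # KeyError for labels not in the table
        steps.append((parent, current, name))
        current = parent
    return list(reversed(steps))


for parent, child, name in pathway_to("M3"):
    print(f"{parent} -> {child}: {name}")
```

For M3, the walk returns N-hydroxylation of the amine followed by O-sulfation, which matches the bioactivation route described under Mechanistic Knowledge.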
4.1.5. General Conclusions for 2-Amino-9H-pyrido[2,3-b]indole
In summary, existing toxicological data indicate that AaC is mutagenic in
vitro, based on the Ames test, chromosomal aberrations in cultured mammalian
cells, and the SCE assay in human cells. Rodent carcinogenicity data are
also available. The in vitro mutagenicity of AaC was reliably predicted by
most of the (Q)SAR and expert systems applied, while the AaC active
metabolite was identified by metabolism simulators. Thus, the totality of
evidence indicates the genotoxic potential of AaC. However, further
information would be needed to assess the carcinogenic potential of the
compound by non-genotoxic mechanisms.
4.2. Case Study 2: 2-Aminoacetophenone
2-AAP is a substituted aromatic amine (see Table 6 for chemical
identifiers). It is a naturally occurring substance in the following
food items: beer, corn tortillas, sweet corn, and milk (35). Quanti-
tative data on the natural occurrence in food have been reported for
two of these food items (36), up to 0.15 mg/kg in corn tortillas
and up to 0.4 mg/kg in sweet corn.
2-AAP is also a flavoring substance for which a risk assessment
needs to be conducted prior to its possible authorization and inclu-
sion in a positive list according to Commission Regulation (EC) No
1565/2000 (9, 37). The EU procedure for establishing the list of
authorized flavoring substances is laid down by Regulation (EC)
Table 6
Identifiers for case study 2: 2-aminoacetophenone
Structural formula
Databases
Table 7 shows the results of retrieving data for 2-AAP from nine publicly
available databases. The chemical was found in only two databases: the
Danish QSAR database and ESIS (EU Existing Chemicals List, EINECS). However,
ESIS contained no data, and all data in the Danish database are predicted
rather than experimental.
Table 7
Summary of the results of searching publicly available databases for data
on 2-aminoacetophenone
Table 8
Experimental data for in vitro genotoxicity of 2-aminoacetophenone
4.2.2. Step 2. QSAR Predictions
When analyzing the predictions of the eight software tools (Table 9), mixed
messages are obtained. Five of the software tools predicted 2-AAP to be
non-active, two predicted it as active, and one as low-moderate active.
Clearly, it is difficult to draw a conclusion based only on these
predictions. Thus, the next step is to analyze any additional information
which the different software tools provide.
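The spread of calls can be made explicit with a simple tally. The sketch below uses only the counts stated in the text (five non-active, two active, one low-moderate active); the function name is hypothetical, and the majority share it reports is a descriptive summary, not a validated consensus model.

```python
from collections import Counter

# Counts taken from the text; the assignment of calls to individual tools
# is not reproduced here.
calls = ["non-active"] * 5 + ["active"] * 2 + ["low-moderate active"]


def tally(predictions):
    """Return the per-class counts, the most frequent call, and its share."""
    counts = Counter(predictions)
    top_call, top_count = counts.most_common(1)[0]
    return counts, top_call, top_count / len(predictions)


counts, top_call, share = tally(calls)
print(dict(counts))
print(f"most frequent call: {top_call} ({share:.1%} of tools)")
```

A majority of this size leaves three of eight tools dissenting, which is why the additional information supplied by each tool is examined next.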
HazardExpert. This software tool does not provide additional
information.
CAESAR. This predicts the chemical as active, and also provides experimental
data and mutagenicity predictions for analogues identified by the software
(Table 10).
Although the chemical is predicted as active, five out of six analogues are
non-mutagenic based on the experimental data. Only acetophenone,
4'-(hydroxyamino) (the fourth chemical in the table), is mutagenic based on
the experimental data, but this chemical is a hydroxylamine rather than an
amine. All six analogues are predicted as non-active. Based on this
comparative analysis, it could be concluded that the predicted mutagenicity
for 2-AAP is not reliable.
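The same comparison can be tabulated programmatically. The sketch below uses the five analogue records that can be read from Table 10 as reproduced in this text (similarity, experimental class, predicted class); the variable names are hypothetical. It counts how many analogues are experimentally mutagenic and how many were predicted correctly.

```python
# (similarity, experimental class, predicted class), as listed in Table 10.
analogues = [
    (0.884, "Non-mutagen", "Non-mutagen"),
    (0.877, "Non-mutagen", "Non-mutagen"),
    (0.861, "Mutagen", "Non-mutagen"),      # the hydroxylamine analogue
    (0.861, "Non-mutagen", "Non-mutagen"),
    (0.857, "Non-mutagen", "Non-mutagen"),
]

n_total = len(analogues)
n_exp_mutagen = sum(1 for _, exp, _ in analogues if exp == "Mutagen")
n_pred_mutagen = sum(1 for _, _, pred in analogues if pred == "Mutagen")
n_correct = sum(1 for _, exp, pred in analogues if exp == pred)

print(f"experimentally mutagenic: {n_exp_mutagen}/{n_total}")
print(f"predicted mutagenic:      {n_pred_mutagen}/{n_total}")
print(f"correctly predicted:      {n_correct}/{n_total}")
```

None of the close analogues is predicted mutagenic and only one is mutagenic experimentally, so the active call for the query compound stands apart from its own neighborhood, which is the basis for doubting it.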
Table 9
Predictions and additional information provided by software tools for genotoxic
potential of 2-aminoacetophenone
Table 10
CAESAR analogues of 2-aminoacetophenone and their
experimental and predicted activity
SMILES: OC(c1ccccc1)C
Similarity: 0.884
Experimental class: Non-mutagen
Predicted class: Non-mutagen
SMILES: OC(O)c1ccccc1(N)
Similarity: 0.877
Experimental class: Non-mutagen
Predicted class: Non-mutagen
SMILES: OC(c1ccc(cc1)NO)C
Similarity: 0.861
Experimental class: Mutagen
Predicted class: Non-mutagen
SMILES: OC(N)c1ccccc1C
Similarity: 0.861
Experimental class: Non-mutagen
Predicted class: Non-mutagen
SMILES: OC(OC)c1ccccc1(N)
Similarity: 0.857
Experimental class: Non-mutagen
Predicted class: Non-mutagen
Table 11
ToxBoxes analogues of 2-aminoacetophenone and their
experimental activity
SMILES: CC1C(N)C2C(CC1)C(O)C1C(CCCC1)C2O
Experimental class: Mutagen
SMILES: C(C)(O)c1c(N)ccc(Cc2cc(C(C)O)c(N)cc2)c1
Experimental class: Mutagen
SMILES: CC1CCCC(N)C1C
Experimental class: Mutagen
SMILES: CC1CC(C)C(N)CC1
Experimental class: Mutagen
Table 12
Lazar analogues of 2-aminoacetophenone and their
experimental activity
SMILES: CCC1C(CCCC1)N
Similarity: 0.82
Experimental class: Non-mutagen
SMILES: NC1C(C(CCC1)N)C(O)O
Similarity: 0.81
Experimental class: Mutagen
SMILES: NC(O)C1CCC(CC1)N
Similarity: 0.80
Experimental class: Non-mutagen
SMILES: CCC1C(C(CCC1)CC)N
Similarity: 0.79
Experimental class: Mutagen
SMILES: CC1C(CCCC1)N
Similarity: 0.79
Experimental class: Mutagen
SMILES: CC1C(CCCC1NCO)NCO
Similarity: 0.77
Experimental class: Mutagen
SMILES: CC2NC1C(CCCC1)C2
(C)C
Similarity: 0.74
Experimental class: Non-mutagen
SMILES: CC1C(CC(CC1)NCO)
NCO
Similarity: 0.74
Experimental class: Mutagen
SMILES: CC1CC(C(C(C1)C)N)C
Similarity: 0.73
Experimental class: Mutagen
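As a rough illustration of how an analogue list such as Table 12 can be summarized, the sketch below computes a similarity-weighted fraction of experimentally mutagenic analogues. The function names are hypothetical, the weighting scheme is an assumption made for illustration, and this is not the algorithm implemented in Lazar or any other tool discussed here; the resulting number is a summary of the table, not a prediction for 2-AAP.

```python
# Similarity and experimental class of the nine analogues listed in Table 12.
ANALOGUES = [
    (0.82, "Non-mutagen"),
    (0.81, "Mutagen"),
    (0.80, "Non-mutagen"),
    (0.79, "Mutagen"),
    (0.79, "Mutagen"),
    (0.77, "Mutagen"),
    (0.74, "Non-mutagen"),
    (0.74, "Mutagen"),
    (0.73, "Mutagen"),
]


def weighted_mutagen_fraction(analogues):
    """Similarity-weighted fraction of analogues that are experimental mutagens."""
    total_weight = sum(similarity for similarity, _ in analogues)
    if total_weight == 0:
        raise ValueError("analogues carry no weight")
    mutagen_weight = sum(
        similarity for similarity, label in analogues if label == "Mutagen"
    )
    return mutagen_weight / total_weight


def unweighted_mutagen_fraction(analogues):
    """Plain fraction of analogues that are experimental mutagens."""
    return sum(label == "Mutagen" for _, label in analogues) / len(analogues)


weighted = weighted_mutagen_fraction(ANALOGUES)
unweighted = unweighted_mutagen_fraction(ANALOGUES)
print(f"weighted: {weighted:.3f}, unweighted: {unweighted:.3f}")
```

Because the similarities in Table 12 span a narrow range (0.73 to 0.82), the weighted and unweighted fractions are close; weighting matters more when the analogues differ widely in similarity to the query.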
4.2.3. Step 3. Grouping and Read-Across
As described above, the OECD Toolbox software provides several types of profilers which can help to define a chemical category: predefined, general mechanistic, endpoint specific, and empirical. The genotoxicity endpoint is associated with (a) two general mechanistic profilers (DNA binding by OECD and DNA binding by OASIS) and (b) three endpoint-specific profilers (Micronucleus alerts by Benigni/Bossa, Mutagenicity/Carcinogenicity alerts by Benigni/Bossa, and Oncologic Primary classification).
The first step is to apply the selected profilers to the query chemical. According to the DNA binding by OECD profiler, 2-AAP belongs to the categories aromatic amine (primary or secondary) and nitrenium ion formation. It is not a binder according to the DNA binding by OASIS profiler. It belongs to the category primary aromatic amine, hydroxyl amine, and its derived esters according to both Micronucleus alerts by Benigni/Bossa and Mutagenicity/Carcinogenicity alerts by Benigni/Bossa, and to Aromatic amine-type compounds based on the Oncologic Primary classification. In other words, all profilers (except the OASIS profiler) classified the query compound into the same group: aromatic amines. This agreement gives the user more confidence in the grouping. By applying one of these profilers, for example Mutagenicity/Carcinogenicity alerts by Benigni/Bossa, to the Toolbox databases connected with genotoxicity (CPDB, ISSCAN, OASIS
4.2.4. Step 4. Simulation of Metabolism
Based on the electrophilic theory introduced by Miller and Miller (39), genotoxic carcinogens have the unifying feature that they are either electrophiles or can be activated by metabolism to electrophilic reactive intermediates (pro-electrophiles). Thus, an investigation of metabolism is an important step in the risk (hazard) assessment procedure, and might be considered essential when the prediction is inactive in the absence of metabolism. Conversely,
154 A.P. Worth et al.
Table 13
Genotoxicity predictions for two simulated metabolites of 2-aminoacetophenone
4.2.5. General Conclusions for 2-Aminoacetophenone
2-AAP is non-genotoxic in vitro with and without metabolic activation (Ames test, DNA repair test, and unscheduled DNA synthesis). No data were available for in vivo genotoxicity.
Based on the predictions and additional information from the eight software tools, 2-AAP is probably non-genotoxic. The same conclusion is reached by chemical category formation and read-across. However, investigation of the possible metabolism using two independent software tools indicates that 2-AAP most probably undergoes metabolism and that some of its metabolites could be genotoxic. On the basis of the totality of information, it might be concluded that, on balance, 2-AAP is probably non-genotoxic, or it might be decided that genotoxicity data for the predominant metabolites are needed to confirm the non-genotoxicity of the compound. Since many metabolites can be formed, and simulation tools have a tendency to overgenerate potential metabolites, a key issue is which metabolites are toxicologically relevant. The concept of toxicological relevance needs to take into account the amount of each metabolite as well as its potency. The combination of parent compounds and toxicologically relevant metabolites is sometimes referred to as the residue definition. Guidance on how to establish a residue definition suitable for the risk assessment of pesticides is currently being developed by EFSA. In the case of the EFSA evaluation on the use of 2-AAP as a flavoring substance (35), it
Table 14
5. Concluding Remarks
In this chapter, a range of computational tools for applying QSAR and grouping/read-across is described, and their possible use in the computational assessment of genotoxicity is illustrated by means of the stepwise application of selected tools to two case-study compounds: AαC and 2-AAP. The first case study compound (AαC) is an environmental pollutant and a food contaminant that can be formed during the cooking of protein-rich food. The second case study compound (2-AAP) is a naturally occurring compound in certain foods and is also under evaluation in the EU as a flavoring substance. In the case of 2-AAP, EFSA concluded that available experimental data were insufficient to exclude the genotoxic potential and therefore requested additional information from the applicant to confirm that the flavorings are safe to use in foods (35).
In the first case study (AαC), the genotoxic potential of the parent compound is well established and is also evident from the computational predictions. In this case, the assessment of metabolites may not be warranted, since there is probably sufficient information to conclude the risk assessment. In contrast, the second case study (2-AAP) is more challenging, since the available toxicity data and computational predictions give mixed messages, especially when potential metabolites are taken into account.
It is emphasized that the software tools applied in these case studies are only some of the many software tools available, and their use is intended to be illustrative of a thought process, rather than providing definitive results and conclusions.
There are at least two outstanding challenges in the computational assessment of toxicological endpoints, such as genotoxicity. One is the further development of reliable models and their combination into automated workflows to carry out stepwise assessments such as those described here. The other is to develop a structured and transparent framework for interpreting and weighing complex datasets in such a way that (apparent) discrepancies between data can be resolved.
References
16. Maunz A, Helma C (2008) Prediction of chemical toxicity with local support vector regression and activity-specific kernels. SAR QSAR Environ Res 19(5-6):413-431
17. Button WG, Judson PN, Long A et al (2003) Using absolute and relative reasoning in the prediction of the potential metabolism of xenobiotics. J Chem Inf Comput Sci 43:1371-1377
18. Kroes R, Renwick AG, Cheeseman M et al (2004) Structure-based thresholds of toxicological concern (TTC): guidance for application to substances present at low levels in the diet. Food Chem Toxicol 42:65-83
19. Enoch SJ, Madden JC, Cronin MT (2008) Identification of mechanisms of toxic action for skin sensitisation using a SMARTS pattern based approach. SAR QSAR Environ Res 19(5-6):555-578
20. Rydberg P, Gloriam DE, Olsen L (2010) The SMARTCyp cytochrome P450 metabolism prediction server. Bioinformatics 26:2988-2989
21. Bassan A, Worth AP (2008) The integrated use of models for the properties and effects of chemicals by means of a structured workflow. QSAR Comb Sci 27:6-20
22. ECHA (2008) Guidance on information requirements and chemical safety assessment. European Chemicals Agency, Helsinki, Finland. http://guidance.echa.europa.eu/docs/guidance_document/information_requirements_en.htm
23. Worth A, Lapenna S, Lo Piparo E, Mostrag-Szlichtyng A, Serafimova R (2011) A Framework for assessing in silico toxicity predictions: case studies with selected pesticides. JRC Technical Report EUR 24705 EN. Publications Office of the European Union, Luxembourg. http://ihcp.jrc.ec.europa.eu/our_labs/computational_toxicology/publications/
24. Dearden J (2011) Prediction of physicochemical properties. In: Reisfeld B, Mayeno AN (eds) Computational toxicology, Methods in molecular biology. Springer Science+Business Media, New York
25. Enoch SJ (2010) Chemical category formation and read-across for the prediction of toxicity. In: Puzyn T, Leszczynski J, Cronin MTD (eds) Recent advances in QSAR studies: methods and applications. Springer, Heidelberg, pp 209-219
26. Patlewicz G, Jeliazkova N, Gallegos Saliner A et al (2008) Toxmatch: a new software tool to aid in the development and evaluation of chemically similar groups. SAR QSAR Environ Res 19:397-412
27. Franklin RB (2009) In silico studies in ADME/Tox: caveat emptor. Current Computer-Aided Drug Design 5:128-138
28. Frederiksen H (2005) Two food-borne heterocyclic amines: metabolism and DNA adduct formation of amino-α-carbolines. Mol Nutr Food Res 49:263-273
29. King RS, Teitel CH, Kadlubar FF (2000) In vitro bioactivation of N-hydroxy-2-amino-α-carboline. Carcinogenesis 21:1347-1354
30. Schut HAJ, Snyderwine EG (1999) DNA adducts of heterocyclic amine food mutagens: implications for mutagenesis and carcinogenesis. Carcinogenesis 20:353-368
31. Pfau W, Schulze C, Shirai T et al (1997) Identification of the major hepatic DNA adduct formed by the food mutagen 2-amino-9H-pyrido[2,3-b]indole (AαC). Chem Res Toxicol 10:1192-1197
32. Frederiksen H, Frandsen H, Pfau W (2004) Syntheses of DNA adducts of two heterocyclic amines, 2-amino-3-methyl-9H-pyrido[2,3-b]indole (MeAαC) and 2-amino-9H-pyrido[2,3-b]indole (AαC) and identification of DNA adducts in organs from rats dosed with MeAαC. Carcinogenesis 25:1525-1533
33. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res 659:248-261
34. Enoch SJ, Cronin MTD (2010) A review of the electrophilic reaction chemistry involved in covalent DNA binding. Crit Rev Toxicol 40:728-748
35. EFSA (2008) Scientific Opinion of the Panel on Food Additives, Flavourings, Processing Aids and Materials in Contact with Food on a request from Commission on Flavouring group evaluation 48: Aminoacetophenone. EFSA J 797:1-25, http://www.efsa.europa.eu/en/scdocs/doc/797.pdf
36. TNO (2000) Volatile compounds in food. VCF Database. TNO Nutrition and Food Research Institute, Boelens Aroma Chemical Information Service BACIS, Zeist
37. EFSA (2010) EFSA Panel on Food Contact Materials, Enzymes, Flavourings and Processing Aids. Draft Guidance on the data required for the risk assessment of flavourings. EFSA J 8(6):1623, http://www.efsa.europa.eu/en/scdocs/scdoc/1623.htm
38. Colvin M, Hatch F, Felton J (1998) Chemical and biological factors affecting mutagen potency. Mutat Res 400:479-492
39. Miller E, Miller J (1981) Searches for ultimate chemical carcinogens and their reactions with cellular macromolecules. Cancer 47:2327-2345
40. Tates A, Kriek E (1981) Induction of chromosomal aberrations and sister-chromatid exchanges in Chinese hamster cells in vitro by some proximate and ultimate carcinogenic arylamide derivatives. Mutat Res 88:397-410
41. Popescu N, Turnbull D, DiPaolo J (1977) Sister chromatid exchange and chromosome aberration analysis with the use of several carcinogens and noncarcinogens: brief communication. J Natl Cancer Inst 59:289-293
42. Kazius J, McGuire R, Bursi R (2004) Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem 48:312-320
43. Kirkland D, Aardema M, Henderson L et al (2005) Evaluation of the ability of a battery of three in vitro genotoxicity tests to discriminate rodent carcinogens and non-carcinogens: I. Sensitivity, specificity and relative predictivity. Mutat Res 584:1-256
44. Aeschbacher H-U, Turesky RJ (1991) Mammalian cell mutagenicity and metabolism of heterocyclic aromatic amines. Mutat Res 259:235-250
45. Bowden J, Chung K, Andrews A (1976) Mutagenic activity of tryptophan metabolites produced by rat intestinal microflora. J Natl Cancer Inst 57:921-924
46. Thompson C, Hill L, Epp J et al (1983) The induction of bacterial mutation and hepatocyte unscheduled DNA synthesis by monosubstituted anilines. Environ Mutagen 5:803-811
Part II
Abstract
With the advent of microarrays and next-generation biotechnologies, the use of gene expression data has become ubiquitous in biological research. One potential drawback of these data is that they are very rich in features (genes), while cost considerations allow for only relatively small sample sizes. A useful way of arriving at biologically meaningful interpretations of the environmental or toxicological condition of interest is to make inferences at the level of a priori defined biochemical pathways or networks of interacting genes or proteins that are known to perform certain biological functions. This chapter describes approaches taken in the literature to make such inferences at the biochemical pathway level. In addition, this chapter describes approaches to create hypotheses on genes playing important roles in response to a treatment, using organism-level gene coexpression or protein-protein interaction networks. Approaches to reverse engineer gene networks, that is, methods that seek to identify novel interactions between genes, are also described. Given the relatively small sample numbers typically available, these reverse engineering approaches are generally useful in inferring interactions only among a relatively small number of genes (on the order of 10). Finally, given the vast amounts of publicly available gene expression data from different sources, this chapter summarizes the important sources of these data and the characteristics of these sources or databases. In line with the overall aim of this book of providing practical knowledge to a researcher interested in analyzing gene expression data from a network perspective, the chapter points to convenient, publicly accessible tools for performing the analyses described and, in addition, describes three motivating examples taken from the published literature that illustrate some of the relevant analyses.
Key words: Gene expression, Networks, Biochemical pathways, Gene Expression Omnibus,
ArrayExpress, Cytoscape, Pathway enrichment, Network identification
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_7, © Springer Science+Business Media, LLC 2013
166 R. Thomas and C.J. Portier
2. Materials and Methods
2.1. Common Software and Hardware Required
All suggested software to perform the various analyses described in this chapter can be installed and run on standard desktops running 32- or 64-bit Windows, Mac, or Linux operating systems, with a 2 GHz or better CPU, internet access, and over 500 MB RAM. A basic level of literacy in using one of these operating systems (running desktop software applications, accessing and using a Web browser, manipulating and working with text or Microsoft Excel format tables, etc.) is assumed. A background in simple statistics (random variables, mean, variance and correlation between random variables, standard probability distributions, hypothesis tests and their associated p-values, two-sample t-tests, ANOVA, F-tests, experiments, replicate samples) is also assumed. There are two open-source analysis programs which are useful to master for regularly performing the kinds of analyses described in this chapter. First, there are the various Bioconductor packages (7) installed in the R statistical environment (8), which allow one to consolidate all analyses into one environment. The main drawbacks of working in R are that it is not user-friendly and requires basic programming knowledge; errors could also creep into the results because functions and parameters defined across multiple packages can conflict. Crawley (9) is an introductory book on statistical analysis using R. In addition, there are several useful tutorials and presentations on using the basic functions in R that can be readily accessed online. One way of detecting the phantom errors mentioned above is to run the chosen R functions from the installed packages on data for which you understand the expected results.
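The idea of checking an analysis function against data with a known answer can be sketched as follows. The chapter's workflow is in R; this minimal illustration is written in Python with only the standard library, and the function names, seed, and thresholds are assumptions chosen for the example. It checks a Welch two-sample t-statistic first on a hand-computable case and then on simulated data with a deliberately large group difference.

```python
import math
import random
import statistics


def welch_t(sample_a, sample_b):
    """Welch two-sample t-statistic for mean(sample_b) - mean(sample_a)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_b - mean_a) / standard_error


def simulate_group(rng, mean, n):
    """Draw n values from a normal distribution with the given mean and SD 1."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


# Check 1: a case that can be verified by hand.
# Both groups have variance 2.5 and n = 5, so the standard error is 1
# and the statistic equals the difference in means, 2.
t_known = welch_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7])

# Check 2: simulated data with a built-in shift of 3 SD between groups;
# a correct implementation should return a large positive statistic.
rng = random.Random(2013)
control = simulate_group(rng, 0.0, 50)
treated = simulate_group(rng, 3.0, 50)
t_shifted = welch_t(control, treated)

print(f"known-answer t = {t_known:.3f}, simulated-shift t = {t_shifted:.1f}")
```

The same pattern applies to any packaged function: feed it an input whose result is known in advance and confirm the output before trusting it on real expression data.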
The second open-source program, Cytoscape, is useful for visualizing large-scale networks and even performing certain advanced analyses (10). Biological networks that are large (~10^3 nodes and ~10^4 edges), like the organism-wide protein-protein interaction network, will typically require >2 GB RAM when visualized using Cytoscape.
7 Gene Expression Networks 169
2.2. Accessing Publicly Available Gene Expression Data
There are several publicly accessible gene expression databases (see Table 1). The Gene Expression Omnibus (GEO) (3) and ArrayExpress (4) represent the largest repositories of gene expression data across a diverse set of organisms and microarray and next-generation technologies. The Web pages for both GEO and ArrayExpress have Advanced Search tabs where you can search for data matching criteria you specify, such as organism, tissue, number of samples, platform technology, and experimental condition (e.g., blood gene expression of humans exposed to benzene on Illumina microarrays). In addition, there are packages in Bioconductor (7) that allow programmatic access to the same search capabilities and provide for downloading of datasets: GEOmetadb (11) and GEOquery (12) for GEO, and the ArrayExpress package (13) for ArrayExpress. GEOmetadb (11) requires that the user have a basic knowledge of SQL, a database query language (see, e.g., the MySQL reference manual (14)).
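To give a feel for the kind of SQL query such a metadata search involves, the sketch below builds a tiny in-memory database and filters it by organism, tissue, and number of samples. The table name, column names, and accession strings are hypothetical and are not the actual GEOmetadb schema; Python's built-in sqlite3 module is used purely for illustration.

```python
import sqlite3

# Hypothetical metadata table; NOT the real GEOmetadb schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE series ("
    "accession TEXT, organism TEXT, tissue TEXT, platform TEXT, n_samples INTEGER)"
)
conn.executemany(
    "INSERT INTO series VALUES (?, ?, ?, ?, ?)",
    [
        ("TOY001", "Homo sapiens", "blood", "Illumina", 125),
        ("TOY002", "Homo sapiens", "liver", "Affymetrix", 12),
        ("TOY003", "Mus musculus", "blood", "Illumina", 40),
        ("TOY004", "Homo sapiens", "blood", "Affymetrix", 8),
    ],
)


def find_series(connection, organism, tissue, min_samples):
    """Return accessions matching organism and tissue with enough samples."""
    rows = connection.execute(
        "SELECT accession FROM series "
        "WHERE organism = ? AND tissue = ? AND n_samples >= ? "
        "ORDER BY accession",
        (organism, tissue, min_samples),
    ).fetchall()
    return [accession for (accession,) in rows]


hits = find_series(conn, "Homo sapiens", "blood", 20)
print(hits)
```

Parameterized queries (the `?` placeholders) keep the search criteria separate from the SQL text, which makes it easy to rerun the same search with different organisms or sample-size cutoffs.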
Both raw data (e.g., as CEL files from Affymetrix platforms) and contributor-processed and normalized data are typically available. Annotation files for the different probes on the microarray platform used to generate the data should also be available. It is up to the user to judge whether the probes on the given platform provide sufficient coverage of all genes in the organism or cover the genes that are of particular interest. Typically coverage of expressed genes can range anywhere from
Table 1
Public sources of gene expression data
Database | Web site | R programmatic access | Data (as of Jan 2010)
Gene Expression Omnibus (GEO) | http://www.ncbi.nlm.nih.gov/geo/ | GEOquery, GEOmetadb | 16,807 experiments and 430,926 samples
ArrayExpress (AE) | http://www.ebi.ac.uk/microarray-as/ae/ | ArrayExpress | 10,666 experiments and 298,824 samples
Center for Information Biology gene EXpression database (CiBEX) | http://cibex.nig.ac.jp/index.jsp | NA | 88 experiments and 133 samples
Chemical Effects in Biological Systems (CEBS) | http://tools.niehs.nih.gov/cebs3/ui | NA | 24 experiments
Stanford Microarray Database (SMD) | http://smd.stanford.edu/ | NA | 19,775 samples
Cancer Array (caArray) | https://array.nci.nih.gov/caarray/home.action | NA | 156 experiments
2.3. Identifying Toxicologically Relevant Gene Expression Networks
2.3.1. Gene Set or Biochemical Pathway Enrichment
Given gene expression data like those described in the previous section, the goal of gene set or pathway enrichment methods is to identify significantly affected pathways (e.g., apoptosis, arachidonic acid metabolism) or gene sets (e.g., various gene ontologies (15), the set of genes in a given region of a chromosome) when the biological/environmental conditions are altered.
There are three variations of gene set enrichment methods that have been used in the literature: over-representation methods (16, 17), rank-based methods (18, 19), and global methods (19, 20). Over-representation methods require the user to pre-identify all significantly affected genes in their experiment. Depending on the hypotheses being tested or scientific questions asked, the user would perform a two-sample test, a one-way ANOVA, a Cox proportional hazards test, or a trend test to obtain gene-specific p-values. The p-values should then be adjusted for multiple testing at an appropriately defined threshold (5% family-wise error rate or 5% false discovery rate) to obtain a list of genes significantly affected by the experimental condition. The over-representation methods also require a background set of genes; these would typically be all genes (irrespective of whether they were significantly affected or not) on the microarray platform used. The methods work by performing Fisher's exact test, or a chi-squared approximation of it, to ask whether a gene set contains more significantly affected genes than expected by chance given the background. When working with sets of genes from the Gene Ontology, and irrespective of which of the three classes of methods one wants implemented, by far the most useful software seems to be the topGO package (21) in R (8). In its analysis, topGO accounts for the hierarchy of gene sets in GO: for example, there is a GO gene set corresponding to the set of all genes (~30,000 genes) known to participate in some signal transduction process and a child GO gene set corresponding to a set of five genes participating in lipoprotein-mediated particle signaling. In addition, there are term enrichment methods that one can access on the Gene Ontology Web site (15) and the DAVID Web site (22, 23).
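The over-representation calculation described above can be sketched in a few lines. This is a minimal illustration, not the implementation used by topGO, DAVID, or any other tool named here: the one-sided Fisher's exact test is written as a hypergeometric upper tail, and the Benjamini-Hochberg procedure is used as one common way to control the false discovery rate (the chapter does not name a specific procedure). All function names and the example counts are hypothetical.

```python
from math import comb


def overrepresentation_p(n_background, n_in_set, n_significant, n_overlap):
    """One-sided Fisher's exact test p-value (hypergeometric upper tail).

    Probability of finding at least n_overlap gene-set members among
    n_significant genes drawn without replacement from the background.
    """
    upper = min(n_in_set, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_significant)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 20 background genes, a 5-gene pathway, 5 significant genes,
# 3 of which fall in the pathway.
p_pathway = overrepresentation_p(20, 5, 5, 3)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(f"pathway p = {p_pathway:.4f}; BH-adjusted = {adjusted}")
```

In a real analysis the background size would be the number of genes on the platform, and the adjustment would be applied across all gene sets tested rather than four toy p-values.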
The two other classes of methods are a generalization of the over-representation methods in which, instead of a 0-1 weighting of the genes on the microarray platform (not affected or affected by the experimental condition, determined, say, using one of the methods in the previous paragraph), genes are assigned a continuous metric. It is up to the user to choose this continuous metric: it could be the t-statistic from a two-sample test, a SAM statistic (24), an F-statistic from an ANOVA test, or even the logarithm of the fold change. As the name implies, ranking methods like (18) rank these gene-wise metrics across all the genes and look for
2.4. Reverse Engineering Gene Expression Networks
In cases where novel biological phenomena are studied, there is sometimes reason to believe in the existence of gene-gene/protein interactions that have not been identified before. Gene expression data generated from specially designed gene knockout or perturbation experiments can theoretically be used to identify novel interactions. Various analytic methods have been designed over the past decade that propose solutions to this problem of network identification. The models have differed in whether they treat gene expression measurements as binary values (gene is either expressed or not) (30) or as continuous values (31-33). They have differed in the objective function used to identify the network: probability distribution based (entropy (34), mutual information (35), posterior probability of data in a Bayesian setting (36, 37)) or model-based least squares fits (38, 39). Mechanistic representations of gene regulation have taken linear forms or slightly more complicated log-linear forms. Thomas et al. (38, 39) have tried to model gene regulation mechanistically, directly through proteins. Unfortunately, these models require estimates of the half-lives of the different mRNAs and their associated proteins that are under study, and these parameter estimates are in general not readily available. Another difference in the methods available for network identification is whether they can work with time-series data (39, 40) or not. The methods designed to work with time-series data should have some way of accounting for serial correlation in the data.
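As a small illustration of one of the objective functions listed above, the sketch below computes the mutual information between binarized (expressed or not expressed) profiles of two genes across a set of experiments. The function name and the toy profiles are hypothetical, and this is only the pairwise score, not a complete network identification method from any of the cited papers.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    # Sum of p(a, b) * log2( p(a, b) / (p(a) * p(b)) ) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


# Hypothetical binarized expression (1 = expressed) across four experiments.
gene_a = [0, 0, 1, 1]
gene_b = [0, 0, 1, 1]  # always matches gene_a
gene_c = [0, 1, 0, 1]  # statistically independent of gene_a in this toy data

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
print(f"MI(A, B) = {mi_ab:.2f} bits, MI(A, C) = {mi_ac:.2f} bits")
```

A high score flags a candidate interaction; with the small sample sizes typical of expression studies, such estimates are noisy and would need a significance assessment before being interpreted.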
Despite a plethora of methods being developed to address this problem, there is one serious handicap that all of them face. Network identification is a very hard problem in the sense that, in order to identify the network of interactions between a set of genes, you need data from a number of independent experiments that is exponential in the number of genes (30). For example, to fully identify a network of 20 genes you will need to get data from
3. Examples
3.1. Immune Response Pathways in the Blood Perturbed at Low Doses of Benzene Exposure
Benzene is a well-known leukemogen. The molecular mechanisms of the human response in blood cells to low benzene exposure were not well understood. McHale et al. (42) performed a study of the gene expression profile of peripheral blood mononuclear cells from workers in China exposed to varying low levels of benzene in air in their workplace. The characteristics of these 125 workers are summarized in Table 2, taken from ref. 42. The microarray analyses were performed on Human Ref8-V2 BeadChips. The data were preprocessed and quantile normalized. Individual messenger RNA levels were modeled using linear mixed effects models. The workers were divided into five groups depending on their exposure levels to benzene in air: controls, <<1, <1, 5-10, and >10 ppm. One of the analyses involved determining those pathways whose genes are perturbed in samples of at least one of the exposed groups. A 16-gene signature was identified that could serve as a potential biomarker of benzene exposure. The general trend was that of a rapid increase in mRNA levels at the lower dose ranges followed by an indication of saturation of these levels at the higher dose range. Gene-wise F-statistics were computed using ANOVA comparisons
Table 2
Characteristics of study subjects
Benzene exposure category (ppm) | Subjects (n) | Air benzene (ppm) (a) | Age (years) | Sex: male, n (%) | Sex: female, n (%) | Currently smoking: yes, n (%) | Currently smoking: no, n (%)
Control | 42 | <0.04 (b) | 29.5 ± 8.2 | 17 (33) | 25 (34) | 9 (35) | 33 (33)
Very low (<<1) (c) | 29 | 0.3 ± 0.9 | 30.3 ± 9.2 | 8 (16) | 21 (28) | 6 (23) | 23 (23)
Low (<1) (d) | 30 | 0.8 ± 0.8 | 27.9 ± 7.2 | 19 (37) | 11 (15) | 5 (19) | 25 (25)
High (5-10) | 11 | 7.2 ± 1.3 | 29.7 ± 9.1 | 1 (2) | 10 (14) | 1 (4) | 10 (10)
Very high (>10) | 13 | 24.7 ± 15.7 | 30.9 ± 10.5 | 6 (12) | 7 (9) | 5 (19) | 8 (8)
Slight variation of material reproduced from ref. 42 with permission from National Institute of Environmental Health Sciences
Values for air benzene and age are mean ± SD
(a) Air benzene level 3 months preceding phlebotomy
(b) The limit of detection for benzene was 0.04 ppm
(c) The average level of benzene was <1 ppm at most measurements in the 3 months preceding phlebotomy and at all measurements in the prior month
(d) The average benzene level was <1 ppm (in the 3 months preceding phlebotomy) but dosimetry levels were not always <1 ppm in the previous 3 months
Table 3
Top pathways associated with overall benzene exposure
Pathway (a) | p-Value (b)
Toll-like receptor signaling pathway | <0.001
Apoptosis | <0.001
Acute myeloid leukemia | <0.001
Oxidative phosphorylation | <0.001
B cell receptor signaling pathway | <0.001
T cell receptor signaling pathway | 0.001
Slight variation of material reproduced from ref. 43 with permission from National Institute of Environmental Health Sciences
(a) These pathways were taken from the KEGG pathway database (25-27)
(b) The p-values are based on the SEPEA_NT3 pathway enrichment (19); p-values < 0.005 correspond to a Bonferroni adjustment for multiple testing
Fig. 1. The Prkaa2 gene plays a central role in determining cardiomyopathy. Prkaa2 2nd neighborhood network of differentially expressed genes. In the coexpression network, Prkaa2 is linked to 138 genes in the second neighborhood through the genes Asb2 and Tns1 (1st neighborhood genes). Of the 138 genes in the second neighborhood, 39 were differentially expressed. Prkaa2 2nd neighborhood genes that exhibited significantly higher expression in C3H/HeJ mice are shown in the ovals, whereas those that exhibited significantly lower expression in C3H/HeJ are shown in boxes (including Prkaa2). Genes in the diamonds were not differentially expressed (43). Reproduced from ref. 43 with permission from SAGE publications.
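The notion of first and second neighborhoods used in Fig. 1 can be sketched with a breadth-first search over a coexpression network. In the toy graph below, only the links from Prkaa2 to Asb2 and Tns1 come from the figure caption; the genes G1 to G4, their edges, and the set of differentially expressed genes are hypothetical placeholders, and the function name is an assumption for illustration.

```python
from collections import deque


def neighborhood_layers(adjacency, source, max_depth):
    """Breadth-first search returning {depth: set of nodes} up to max_depth."""
    depth_of = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue  # do not expand beyond the requested neighborhood
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth_of:
                depth_of[neighbor] = depth_of[node] + 1
                queue.append(neighbor)
    layers = {}
    for node, depth in depth_of.items():
        layers.setdefault(depth, set()).add(node)
    return layers


# Toy undirected coexpression network (G1-G4 are placeholder gene names).
EDGES = [
    ("Prkaa2", "Asb2"), ("Prkaa2", "Tns1"),
    ("Asb2", "G1"), ("Asb2", "G2"),
    ("Tns1", "G2"), ("Tns1", "G3"),
    ("G3", "G4"),
]
adjacency = {}
for u, v in EDGES:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

layers = neighborhood_layers(adjacency, "Prkaa2", max_depth=2)
differentially_expressed = {"G1", "G3", "G4"}  # hypothetical gene list
second_neighborhood_hits = layers[2] & differentially_expressed
print(sorted(layers[1]), sorted(layers[2]), sorted(second_neighborhood_hits))
```

Intersecting the second neighborhood with a list of differentially expressed genes mirrors the caption's count of 39 differentially expressed genes among 138 second-neighborhood genes.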
Disclaimer
The findings and conclusions in this report are those of the authors
and do not necessarily represent the views and positions of the
Centers for Disease Control and Prevention or the Agency for
Toxic Substances and Disease Registry.
Fig. 2. Inferred genetic networks of a set of apoptotic and oxidative stress genes using rat acetaminophen dose data. Two consensus networks resulting from application of the TAO-GEN algorithm (37): the left network was inferred from all of the data for the 50 and 150 mg/kg acetaminophen dose groups, and the right network from all of the data for the 1,500 mg/kg acetaminophen dose group (45). The orange and red colored cells refer to inferred interactions. The two interactions identified as red cells were the only two gene-gene interactions common to the low-dose and the high-dose identified networks. Reproduced from ref. 45 with permission from Elsevier publications.
References
1. Crick F (1970) Central dogma of molecular biology. Nature 227(5258):561-563
2. Greenbaum D et al (2003) Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol 4(9):117
3. Barrett T, Edgar R (2006) Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. DNA Microarrays, Part B: Databases Stat 411:352-369
4. Parkinson H et al (2009) ArrayExpress update: from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res 37(Suppl 1):D868
5. Brazma A et al (2001) Minimum information about a microarray experiment (MIAME): toward standards for microarray data. Nat Genet 29(4):365-371
6. Von Mering C et al (2006) STRING 7: recent developments in the integration and prediction of protein interactions. Nucleic Acids Res 35(Suppl 1):D358
7. Gentleman RC et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5(10):R80
8. Team RDC (2009) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
9. Crawley MJ (2005) Statistics: an introduction using R. Wiley, Chichester
10. Shannon P et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498
11. Zhu Y et al (2008) GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24(23):2798
12. Davis S, Meltzer PS (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor. Bioinformatics 23(14):1846
13. Kauffmann A et al (2009) Importing ArrayExpress datasets into R/Bioconductor. Bioinformatics 25(16):2092
14. Widenius M, Axmark D, DuBois P (2002) MySQL reference manual. O'Reilly & Associates, Inc., Sebastopol, CA
15. Ashburner M et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25(1):25
16. Al-Shahrour F, Díaz-Uriarte R, Dopazo J (2004) FatiGO: a web tool for finding significant associations of gene ontology terms with groups of genes. Bioinformatics 20(4):578
17. Beißbarth T, Speed TP (2004) GOstat: find statistically overrepresented gene ontologies
Abstract
Mathematical models are useful tools for understanding protein signaling networks because they provide an
integrated view of pharmacological and toxicological processes at the molecular level. Here we describe an
approach previously introduced based on logic modeling to generate cell-specific, mechanistic and predic-
tive models of signal transduction. Models are derived from a network encoding prior knowledge that is
trained to signaling data, and can be either binary (based on Boolean logic) or quantitative (using a recently
developed formalism, constrained fuzzy logic). The approach is implemented in the freely available tool
CellNetOptimizer (CellNOpt). We explain the process CellNOpt uses to train a prior knowledge network to
data and illustrate its application with a toy example as well as a realistic case describing signaling networks in
the HepG2 liver cancer cell line.
Key words: Protein networks, Logic model, Boolean logic, Fuzzy logic, Cell-specificity, Signal
transduction
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_8, © Springer Science+Business Media, LLC 2013
180 M.K. Morris et al.
the cell (e.g., metabolic state) and are essential for proper cellular
response to the environment.
Signaling networks are typically studied by monitoring molec-
ular events upon treatments consisting of stimulation or perturba-
tion of key elements in the network at the protein, mRNA or gene
level. Combining measurement of these molecular signals with
phenotypic response under several treatment conditions enables
the study of these treatments on the architecture and function of
the underlying signaling networks as well as the relationship
between the network behavior and phenotypic response (2).
Recent technical developments allow for measurement of the
response of many signals and phenotypes to many treatments.
The resulting datasets are large in scale and difficult to interpret
by intuition alone.
Various computational techniques and tools can be applied to
aid in data interpretation (3, 4). Clustering methods group data
according to their similarity in order to identify functionally rele-
vant patterns (5, 6), and correlation and regression-based methods
predict species that drive a signaling or phenotypic response (7–9).
While these methods are useful for gaining a compact picture of the
data and can be used to generate hypotheses, they are limited as
tools for understanding the functionality of signaling networks.
Network-based approaches are a more natural means of gaining
this type of understanding. When prior knowledge on how protein
species interact is scarce, one can infer (i.e., reverse-engineer) the
network from the data itself (10–13). However, reverse-engineering
methods generally require a large amount of data, in
part because they ignore what is already known about the signaling
network. If there is enough prior knowledge about how species
within the signaling network of interest relate to each other, one
can build a mathematical model that is then trained to data. These
models provide insight into experimentally measured or perturbed
species as well as other species included in the model. In the
particular context of toxicology and pharmacology, this ability is
critical because compounds may affect species that were not directly
measured or perturbed experimentally.
The amount of detail known about the biological system of
interest and specific type of data collected should guide the choice
of the type of mathematical model to construct. A common
approach is to describe the signaling biochemical processes as a
set of differential equations, providing a natural and detailed
description of the underlying molecular events. However, this
detailed description requires a great deal of knowledge about biochemical
interactions between proteins and a great deal of data for
training the resultant model (14), so these models are typically
limited to a dozen or so proteins and one or two pathways
(15–17).
8 Construction of Cell TypeSpecific Logic Models of Signaling Networks. . . 181
2. Materials
3. Methods
3.1. Logic Modeling in CellNOpt
CellNOpt trains either Boolean logic (BL) or constrained fuzzy
logic (cFL) models. In BL models, species states are
modeled as either inactive (zero) or active (one). In cFL, species
states are quantitative in that they can take any value on the closed
interval between 0 and 1.
3.1.1. Boolean Logic
In the general sense, Boolean (logic) modeling refers to a common
system for specifying logical relationships between nodes that can
occupy one of two states (0 or 1) (41). In the context of the
CellNOpt methodology, Boolean logic is most easily conceptualized
by understanding how the state of a downstream species is determined
from the states of its input species. In the simplest case, when one
input activates a downstream output, the output is active if the
input is active (Eq. 1 in Fig. 1a). Conversely, when one input
inhibits a downstream output, the output is inactive if the input is
active (Eq. 2, Fig. 1a).
If two inputs influence a downstream output, CellNOpt relates
the inputs to outputs with either an AND or OR gate. In the case of
an AND gate, both species must be activated for the downstream
species to be active, whereas with an OR gate, activation of either
species is sufficient to activate the downstream species. Thus, if the
logic gate is an AND gate, the output state is the minimum of the
[Fig. 1 image. (a) Gate equations for Boolean logic and constrained fuzzy logic. Activation of D by A: D = A (Eq. 1) and D = f(A) (Eq. 6). Inhibition of D by A: D = 1 - A (Eq. 2) and D = 1 - f(A) (Eq. 7). A AND B: Eqs. 3 and 8. (b) Plot of output value against input value (both from 0 to 1) for the transfer function $\text{output} = g\,(1 + k^{n})\,\dfrac{\text{input}^{n}}{\text{input}^{n} + k^{n}}$; the parameter values of the plotted curves are not recoverable.]
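The gate rules of Section 3.1.1 can be sketched in a few lines of Python. This is an illustration only, not CellNOpt code (CellNOpt itself is implemented in MATLAB), and the function names are hypothetical. The AND gate follows the minimum rule stated in the text; evaluating the OR gate as the maximum of its inputs is the usual counterpart and is assumed here.

```python
def activate(a):
    """Eq. 1: the output copies an activating input (D = A)."""
    return a


def inhibit(a):
    """Eq. 2: the output is the complement of an inhibiting input (D = 1 - A)."""
    return 1 - a


def and_gate(*inputs):
    """AND gate: active only if every input is active (minimum of the inputs)."""
    return min(inputs)


def or_gate(*inputs):
    """OR gate: active if any input is active (maximum of the inputs; assumed)."""
    return max(inputs)


# Print the truth table for two inputs A and B.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_gate(a, b), "OR:", or_gate(a, b))
```

Because the gates are written with `min` and `max` rather than Boolean operators, the same functions also accept values between 0 and 1.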
3.1.3. Simulation Algorithm and Its Limitations
CellNOpt simulates networks by evaluating what each node should
be given its input values at the previous simulation step (synchronous
updating). This is carried out until node states no longer
change, that is, until the system reaches the logic steady state (20). If
networks contain feedback, this might cause node activation states
to oscillate and never stabilize. For example, if A activates B but B
inhibits A, when A is active, B will be active. However, the activation
of B leads to the deactivation of A, which in turn deactivates B,
allowing A to be reactivated. If node states do not stabilize within a
prespecified number of simulation steps, their value is considered
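A minimal Python sketch of synchronous updating follows, using the A/B feedback loop from the paragraph above. It is not CellNOpt code; the function name, the dictionary representation of rules, and the choice to return `None` when no steady state is found within `max_steps` are all assumptions made for illustration.

```python
def simulate(update_rules, state, max_steps=100):
    """Synchronously update all nodes until the state stops changing.

    update_rules maps each node name to a function of the previous state.
    Returns the steady state, or None if none is reached within max_steps.
    """
    for _ in range(max_steps):
        # Every node is computed from the same previous state (synchronous).
        new_state = {node: rule(state) for node, rule in update_rules.items()}
        if new_state == state:
            return state
        state = new_state
    return None


# Feedback loop from the text: A activates B, B inhibits A.
loop_rules = {
    "A": lambda s: 1 - s["B"],
    "B": lambda s: s["A"],
}
print(simulate(loop_rules, {"A": 1, "B": 0}))

# A cascade without feedback settles after a few steps.
cascade_rules = {
    "A": lambda s: s["A"],      # stimulus held fixed
    "B": lambda s: s["A"],      # A activates B
    "C": lambda s: 1 - s["B"],  # B inhibits C
}
print(simulate(cascade_rules, {"A": 1, "B": 0, "C": 0}))
```

The loop example cycles through four states and never repeats a state on consecutive steps, so the sketch reports no steady state, whereas the cascade settles.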
3.3. Loading the Prior Knowledge Network and Data into CellNOpt
A prior knowledge network (PKN) is a graph in which nodes are biological entities (proteins,
phosphorylation sites of proteins, mRNA, etc.). The edges in a
PKN, termed "interactions," are directed in that they relate an
input to an output node and signed in that they indicate whether the
effect is activating or inhibitory.
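A signed, directed graph of this kind can be held as a list of (source, target, sign) triples. The Python sketch below assumes a plain-text layout with one interaction per line written as source, sign, target, where 1 marks activation and -1 marks inhibition. That layout and the three interactions are assumptions for illustration; they are not the contents of Table 1.

```python
def parse_signed_edges(text):
    """Parse lines of the form 'source sign target' into (source, target, sign)."""
    edges = []
    for line in text.strip().splitlines():
        source, sign, target = line.split()
        sign = int(sign)
        if sign not in (1, -1):
            raise ValueError(f"unexpected sign in line: {line!r}")
        edges.append((source, target, sign))
    return edges


# Hypothetical three-interaction network (not the toy PKN of Table 1).
example = """
EGF 1 Ras
Ras 1 Raf
Akt -1 Raf
"""
edges = parse_signed_edges(example)

# Group the incoming interactions by output node.
inputs_of = {}
for source, target, sign in edges:
    inputs_of.setdefault(target, []).append((source, sign))
print(inputs_of)
```

Grouping interactions by output node is convenient later, because logic gates are defined per output over its set of possible inputs.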
[Fig. 2 image. (a) Candidate relationships between inputs A and B and output C, with their truth tables.]
Inputs | Output C
A  B   | Both A AND B are necessary | A OR B can | Only A      | Only B      | Neither A nor B
       | to activate C              | activate C | activates C | activates C | activate C
0  0   | 0                          | 0          | 0           | 0           | 0
0  1   | 0                          | 1          | 0           | 1           | 0
1  0   | 0                          | 1          | 1           | 0           | 0
1  1   | 1                          | 1          | 1           | 1           | 0
[(b) Table of experiments (activation or inhibition of P and Q) and the states accessed; its contents are not recoverable.]
Fig. 2. Possible logic gates underpinning prior knowledge network. (a) If it is believed that either A and/or B might influence
the activity of C, then the influence of A and/or B on C might take any of the logical formalisms indicated by the truth tables.
In order to differentiate between these possibilities, one must access all possible state combinations of inputs. (b) If A and
B are activated by P and Q as shown in (c), then in order to access all possible state combinations, the experiments shown
must be conducted, one of which includes inhibition of species B.
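The point made in the caption, that all input state combinations are needed to tell the candidate gates apart, can be checked with a short Python sketch. The candidate names mirror the five truth tables of Fig. 2a; the helper function and the choice of "only A" as the hidden true gate are hypothetical.

```python
from itertools import product

# Candidate relationships between inputs A, B and output C (Fig. 2a).
candidates = {
    "A AND B": lambda a, b: a and b,
    "A OR B": lambda a, b: a or b,
    "only A": lambda a, b: a,
    "only B": lambda a, b: b,
    "neither": lambda a, b: 0,
}


def consistent(observations):
    """Return the candidates that reproduce every observed (a, b) -> c."""
    return [name for name, gate in candidates.items()
            if all(gate(a, b) == c for (a, b), c in observations.items())]


# Suppose the true gate is "only A" and C is observed under some conditions.
truth = candidates["only A"]
partial = {(0, 0): 0, (1, 1): 1}
full = {(a, b): truth(a, b) for a, b in product((0, 1), repeat=2)}

print(consistent(partial))  # several candidates remain
print(consistent(full))     # a single candidate remains
```

With only the (0, 0) and (1, 1) conditions, four of the five candidates are still compatible with the observations; the mixed conditions are what separate them.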
Table 1
Example of initial toy prior knowledge network in sif
format
Table 2
CellNOpt data input format, a MATLAB structure called
CNOProject
3.4. Training a Logic Model with CellNOpt
After the PKN is saved in the Models directory and the data has
been imported into DataRail, one can either run CellNOpt through
DataRail using GUIs or save the data in the Data folder and run
CellNOpt with dedicated scripts (Fig. 4; see Note 5). The latter
allows for more flexibility but requires greater MATLAB scripting
skill. Regardless of which method is used, CellNOpt has many
parameters that affect PKN preprocessing and model training
[Fig. 3 image. Panel labels: (b) DataRail GUI; (c) MIDAS Importer (Choose File, Load Data); (d) array named Normalized (source: RawData; code: Normalize); (e) array named CNOData (source: Normalized; code: CreateCNOData).]
Fig. 3. Loading data into CellNOpt. DataRail is a freely available resource for the storage, transformation, and analysis of
experimental data describing the signaling and phenotypic response of a biological system to environmental perturbations
(54). (a) Data in the MIDAS format can be loaded into DataRail. The MIDAS format lists cell type, stimuli condition, and
3.4.1. Model Preprocessing
In the first preprocessing step, termed "compression," species that are
either measured or perturbed are labeled as "designated" and
remain in the network. Species that are neither measured nor
perturbed ("undesignated" species) are removed from the prior
knowledge network if it is possible to remove them while ensuring
that the relationships between designated species remain logically
consistent. Practically, logical consistency is preserved by not
removing an undesignated species if it is the output of more than
one interaction and the input of more than one interaction (26).
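The retention rule stated above can be sketched as a simple count of incoming and outgoing interactions. The Python sketch below only identifies which undesignated species the rule allows to be removed; it does not rewire the network or track edge signs, and the function name and example network are hypothetical rather than taken from CellNOpt.

```python
def removable_species(edges, designated):
    """List undesignated species that the stated rule allows to be removed.

    edges: iterable of (source, target) pairs; designated: measured or
    perturbed species. A species is kept if it is designated, or if it is both
    the output of more than one interaction and the input of more than one.
    """
    n_in, n_out = {}, {}
    nodes = set()
    for source, target in edges:
        nodes.update((source, target))
        n_out[source] = n_out.get(source, 0) + 1  # interactions leaving source
        n_in[target] = n_in.get(target, 0) + 1    # interactions entering target
    removable = []
    for node in sorted(nodes - set(designated)):
        if n_in.get(node, 0) > 1 and n_out.get(node, 0) > 1:
            continue  # removal could break logical consistency
        removable.append(node)
    return removable


# Hypothetical network: X relays one signal; Y integrates two and fans out to two.
edges = [("S1", "X"), ("X", "M1"),
         ("S1", "Y"), ("S2", "Y"), ("Y", "M1"), ("Y", "M2")]
print(removable_species(edges, designated={"S1", "S2", "M1", "M2"}))
```

In this example X can be compressed away, while Y is retained because it has two inputs and two outputs.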
In the second preprocessing step, termed expansion, interac-
tions are converted into logic gates. OR gates are implicitly repre-
sented when a species has more than one possible input species.
However, AND gates must be explicitly added to the model. The
expansion step can create AND gates from all possible combina-
tions of inputs (26). This adds complexity to the model and can
significantly decrease computational efficiency. For example, if a
species has six possible inputs, 57 AND gates can be constructed,
22 of which have more than three inputs. In many biological
systems, it is highly unlikely that activation of four independent
pathways is necessary for activation of a downstream species.
Fig. 3. (continued) inhibitor conditions as treatments, indicated by "TR" in the column header. Data acquisition times
("DA") and data values ("DV") for each measured species are also listed. Each row represents a different treatment
condition, and the values in each column indicate the treatment or measurement value for that condition. This spreadsheet
should be saved as a .csv file, which can then be loaded into DataRail. (b) After starting DataRail, the "Load Data from
Local File" button in the main GUI allows the investigator to select the saved MIDAS file. (c) The importer window allows
the investigator to indicate which treatments are cell types, stimuli, and inhibitors. (d) After the data is loaded into DataRail,
a variety of normalization procedures can be selected from the "Transform Data and Add Array" option in the main GUI to
scale the data between zero and one. (e) Once the data processing is complete, the CreateCNOData code will create a
CNOProject through the "Transform Data and Add Array" option in the main GUI.
[Fig. 4 image. (a) Workflow: load data into DataRail, normalize, and convert to CellNOpt format; save model in Models folder of CellNOpt; Run Optimization; Edit Parameters.]
b Example Script
% FileWithData, PGet, and ModelName are assumed to be defined earlier.
[CNOProject, SourceArray] = CNOGetArrayFromFile(FileWithData, PGet);
Model = CNOLoadModel(ModelName);
Parameters.Model = Model;
%% Run Identification
DRArray = CNORunIdentification(CNOProject, Parameters);
Fig. 4. Running CellNOpt. (a) Once the model and data are in the proper format, CellNOpt can be run with DataRail or using
scripts. The CellNOpt button in DataRail starts the software and results in the appearance of a GUI for choosing the prior
knowledge network after selecting the required dataset. After selecting the appropriate PKN and pressing the Run
Optimization button, a GUI appears with parameters for the CellNOpt optimization process. (b) Example script for running
CellNOpt.
Table 3
CellNOpt parameters
Thus, the number of inputs used to make AND gates can be limited
to a user-specified value.
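The counts quoted for a species with six possible inputs can be reproduced by counting every subset of two or more inputs. In the Python sketch below, the function and its `max_inputs_per_gate` argument are hypothetical names used for illustration; they are not CellNOpt parameters.

```python
from math import comb


def count_and_gates(n_inputs, max_inputs_per_gate=None):
    """Count AND gates formed from every subset of two or more inputs."""
    if max_inputs_per_gate is None:
        largest = n_inputs
    else:
        largest = min(n_inputs, max_inputs_per_gate)
    return sum(comb(n_inputs, k) for k in range(2, largest + 1))


total = count_and_gates(6)                               # subsets of size 2 to 6
up_to_three = count_and_gates(6, max_inputs_per_gate=3)  # subsets of size 2 or 3
print(total, total - up_to_three)  # all gates, and gates with more than three inputs
print(count_and_gates(6, max_inputs_per_gate=2))
```

For six inputs this gives 57 gates in total (equivalently 2^6 - 6 - 1), of which 22 have more than three inputs; restricting gates to two inputs leaves 15.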
3.4.2. Model Training
The processed PKN is now a logic superstructure that contains all
allowed logic gates between species that were measured, perturbed,
or necessary to preserve logical consistency. The next step is to train
the superstructure to the data by finding the optimal combination
of gates as defined by an objective function (model score; Fig. 5).
[Fig. 5 image. Workflow: the prior knowledge network (PKN) and CNOProject enter model preprocessing, in which unnecessary undesignated species are removed to give the compressed PKN and AND gates are then inserted (if a cFL model is to be trained, a transfer function is added to each logic gate) to give the logic superstructure. Model training then yields the trained model, using
$\text{fit} = \dfrac{\sum_{\text{data}} (\text{simulated value} - \text{measured value})^{2}}{\text{number of data values}}$ and $\text{size} = \text{number of inputs to gates in the model}$.
If the model is cFL, reduction thresholds and parameter optimization further refine the models.]
Fig. 5. Workflow of CellNOpt. Prior knowledge networks are first preprocessed to remove unmeasured, unperturbed
species (i.e., undesignated species) that are not necessary to ensure logic consistency and AND gates are then added,
resulting in a logic superstructure. This superstructure is then trained to the data using a genetic algorithm which
minimizes the score: sum of mean squared error (MSE) and a size penalty if the model is Boolean or simply the MSE if the
model is cFL. If a cFL model was trained, reduction thresholds and parameter optimization can be used to further refine the
model (27).
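The score described in the caption can be sketched as follows. The mean squared error follows the definition given in Fig. 5; combining it with the size term as MSE plus penalty times size is an assumed additive form chosen for illustration, and the function names and data are hypothetical.

```python
def mean_squared_error(simulated, measured):
    """Sum of squared differences divided by the number of data values."""
    if len(simulated) != len(measured):
        raise ValueError("simulated and measured must have the same length")
    return sum((s - m) ** 2 for s, m in zip(simulated, measured)) / len(measured)


def model_score(simulated, measured, n_gate_inputs, size_penalty):
    """MSE plus size_penalty times model size (assumed additive form)."""
    return mean_squared_error(simulated, measured) + size_penalty * n_gate_inputs


measured = [0.0, 1.0, 1.0, 0.0]

# Two hypothetical models with identical fit but different sizes.
small = model_score([0, 1, 1, 1], measured, n_gate_inputs=4, size_penalty=1e-5)
large = model_score([0, 1, 1, 1], measured, n_gate_inputs=9, size_penalty=1e-5)
print(small < large)  # with equal fit, the smaller model has the lower score
```

With a penalty of zero the two models would tie, which is why a small nonzero penalty is enough to favor the simpler structure without changing the fit.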
3.5. Model Analysis
Results of a CellNOpt optimization run can be analyzed in several
ways, described below. CellNOpt contains functionality to aid in the
analysis of the models' structures and fit to the data, described in
the manual that is shipped with CellNOpt. Other types of analysis
are more user-specific, but because CellNOpt is implemented in
MATLAB, they can easily be scripted.
Table 4
Results structure
3.5.1. Cross Validation and Determination of Statistical Significance
Cross validation can be accomplished by manually constructing a
subset of the data, either chosen randomly or specified by the user, as
a test set. This test set is left out of the training set, and
CellNOpt is used to train a PKN to the remaining training data.
The trained models can then predict the test set in order to estimate
the predictive capacity of the resultant models.
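A random hold-out split of treatment conditions might look like the Python sketch below. The function name, the 20% test fraction, the fixed seed, and the list of conditions are assumptions for illustration; training on the remaining conditions would still be done with CellNOpt itself.

```python
import random


def split_conditions(conditions, test_fraction=0.2, seed=0):
    """Randomly assign treatment conditions to a training set and a test set."""
    rng = random.Random(seed)
    shuffled = list(conditions)
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical treatment conditions as (stimulus, inhibitor) pairs.
conditions = [(stim, inh)
              for stim in ("EGF", "TNFa", "EGF+TNFa")
              for inh in ("none", "Raf", "PI3K")]
train, test = split_conditions(conditions, test_fraction=0.2)
print(len(train), "training conditions;", len(test), "held out")
```

Fixing the seed makes the split reproducible, so the same test set can be reused when comparing models trained with different settings.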
3.5.2. Analysis of Model Structural Features
The simplest way to analyze the structures of the trained models is
to visualize the average structure. CellNOpt contains functionality
for exporting models as either Cytoscape (www.cytoscape.org) or
graphviz (graphviz.org) structures. CellNOpt automatically generates
graphviz layouts tailored to common analytical procedures
(see Note 8).
To compute the average structure of the trained models, one
simply averages the bits for each interaction. For example, when
averaging the results of ten structures, if a bit was present in every
trained model, it would have an average value of one. However, if a
bit were present in only one of the ten models, it would have an average
value of 1/10 = 0.1. The resultant structure indicates which interactions
were present in a large fraction of trained models. Interactions
that are always present in the trained models are likely
necessary to fit the data. On the other hand, interactions that
are never or infrequently present are likely to be either inconsistent
with or not necessary to fit the data. These relationships can
easily be verified by examining experimental data for the removed
interactions' inputs and outputs. The interactions that are sometimes
present can be investigated and experiments devised to determine
which structural feature is correct in the biological system of
interest.
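The averaging described above amounts to a column-wise mean over the models' bitstrings. A minimal Python sketch, with a hypothetical function name and invented bitstrings for ten models over three interactions:

```python
def average_structure(bitstrings):
    """Average each bit position over a family of trained models."""
    n_models = len(bitstrings)
    return [sum(bits) / n_models for bits in zip(*bitstrings)]


# Ten hypothetical models over three interactions: the first is always
# present, the second appears in one model, the third never appears.
models = [[1, 0, 0] for _ in range(9)] + [[1, 1, 0]]
print(average_structure(models))
```

The three averages correspond to the three cases discussed in the text: an interaction that is always kept, one that is kept in 1 of 10 models, and one that is never kept.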
3.5.3. Analysis of Model Fit to Training Data
CellNOpt creates a plot of the data compared to the model simulation.
The background darkness is indicative of error (darker for
more error). If the models are consistently unable to fit a subset of
the data, this systematic error indicates that the prior knowledge
network was inadequate for capturing the data. Additional interactions
can be hypothesized from literature, databases, or experimental
sources and added to the prior knowledge network to explain
the data (see Note 9).
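Outside the plot, systematic error can also be looked for numerically by computing the error per measured species. The Python sketch below is an illustration with invented values and a hypothetical function name; it is not the plotting functionality of CellNOpt.

```python
def error_by_species(simulated, measured):
    """Mean squared error for each measured species across conditions.

    Both arguments map a species name to a list of values, one per condition.
    """
    errors = {}
    for species, observed in measured.items():
        predicted = simulated[species]
        squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
        errors[species] = sum(squared) / len(observed)
    return errors


# Hypothetical values: ERK is fit in every condition, NFkB is missed in one.
measured = {"ERK": [1, 0, 1], "NFkB": [1, 1, 0]}
simulated = {"ERK": [1, 0, 1], "NFkB": [0, 1, 0]}
errors = error_by_species(simulated, measured)
worst = max(errors, key=errors.get)
print(worst, errors[worst])
```

A species whose error stays high across repeated training runs is a candidate for a missing interaction in the prior knowledge network.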
4. Examples
4.1. Simple Toy Example
We pose a toy example to illustrate several important considerations
when using CellNOpt and include the code for training the PKN
and analyzing the results (Fig. 6). The PKN and data files are
located in the Models/SampleModels and Data/SampleData
directories of CellNOpt, and the main directory contains the scripts
used to run all examples. For this example, the prior knowledge
network shown in Fig. 7a is compared to an in silico dataset that
describes the change in activation state of many species in the prior
knowledge network after stimulation with the TNFα and EGF
ligands (Fig. 7b). The data is bounded by zero and one, so no
rescaling is needed.
4.1.1. Model Preprocessing
In our toy example, the PKN (Fig. 7a) indicates that EGF can
activate the Ras/Raf/Mek pathway while TNFα can activate the
PI3K pathway or several inflammatory pathways (JNK, p38, and NFκB) via
TRAF6. In our in silico data set, many species were either measured
or perturbed (designated). However, the PKN also contained several
species that were neither measured nor perturbed (undesignated).
Many undesignated species are removed during the first
structure processing step in CellNOpt, the compression step
(Fig. 7c). The removal of these undesignated species does not compromise
logical consistency because they have only one input or
only one output (Fig. 7a, dashed circles). However, one undesignated
species, MEK, was the input of more than one interaction
involving a designated species as well as the output of more than
one interaction involving a designated species. If it were removed,
logical consistency between designated species in the trained model
and starting PKN could not be guaranteed. Thus, it is kept in the
compressed model.
In the next step, model expansion, AND gates are added to
the model (Fig. 7d). Any node with more than one input could be
related to those inputs with AND gates, OR gates, or gates that are
independent of one or more of its possible inputs. For example, in
the compressed model, Hsp27 could be activated by either ERK or
TNFα. Thus, Hsp27 could be related to ERK and TNFα by an
AND gate, an OR gate, or through only one of the inputs (analogous
to the case presented in Fig. 2a). The expanded model contains
all of these possibilities.
[Fig. 6 table residue; only steps 2 and 4 of each column were recovered.]
Left column (general procedure):
2. Save sif file or CNA files in directory: CNO/Models/User-specified PKN Name/
4. All mat result files are in CNO/Results directory.
Right column (toy example):
2. Model is saved in CellNOpt directory: Models/SampleModels/ToyPKNMMB
4. All mat result files are in CNO/Results directory.
Fig. 6. Training and analyzing a model with CellNOpt. The column on the left describes how to train and analyze a family of
models with CellNOpt, and examples of code for each step applied to the toy example are provided in the right column. In
practical applications, the data does not always constrain the models to one single optimum, and because a genetic
algorithm is used to train the models, running CellNOpt multiple times results in a family of models that fit the data well,
albeit with slightly different features. We provide here a description of how to compile and analyze the results of multiple
CellNOpt runs. See Note 7 for a description of functionality that generates a report of trained Boolean logic models resulting
from a single CellNOpt run.
4.1.2. BL Training and Analysis
The expanded model was then fit to the in silico data using the
genetic algorithm described above. For our toy example, we performed
the fitting ten times with each of two size penalties: 0 and 10⁻⁵. We
then assessed the structures of the resulting models by calculating
the average structure.
The average structures of the ten models trained with a size
penalty of 0 (Fig. 8a) indicate that several AND gates could be used
to describe the interactions between species. However, inclusion of
[Fig. 7 image. (a) Prior knowledge network linking the TNFα and EGF stimuli to TRAF6, PI3K, Ras, Raf, MEK, ERK, Akt, JNK, p38, NFκB, c-Jun, Hsp27, and p90RSK. (b) In silico data (color scale 0 to 0.9) under EGF, TNFα, and EGF + TNFα stimulation with no inhibitor, Raf inhibition, or PI3K inhibition. (c) Compressed model. (d) Expanded model.]
Fig. 7. Toy example and prior knowledge network processing. A prior knowledge network (PKN) consists of a signed,
directed graph that summarizes how the investigator believes the studied species will interact. We train a PKN (a)
summarizing how we think several downstream proteins depend on TNFα and EGF to an in silico dataset (b) of stimulation
with one or both of these ligands in the presence or absence of Raf or PI3K inhibition. In the PKN, nodes are light grey
rectangles for stimuli, light grey ellipses for measured, and dark grey rectangles for inhibited species. Undesignated
species are white, with a dashed outline indicating that they will be removed during compression. Species both measured and
inhibited are light grey with a black outline. Before training the model, CellNOpt first compresses the PKN (c) by removing
unmeasured, unperturbed species not necessary for logical consistency. In the case of our toy example, all unmeasured,
unperturbed species were compressed except for MEK, which is the output of more than one interaction as well as the
input of more than one interaction. CellNOpt then expands the model (d) to explicitly add AND gates when a species has
more than one possible input.
a small size penalty of 10⁻⁵ (Fig. 8b) yielded trained models with
identical fit in which the additional complexity of AND gates is not
needed to describe the data. Thus, for the remainder of the analysis,
we will focus on models trained with a size penalty of 10⁻⁵.
The models resulting from training the PKN to in silico data
indicated that, although ERK was hypothesized to activate Hsp27
[Fig. 8 image. (a) Trained with size penalty of zero. (b) Trained with size penalty of 10⁻⁵. (c) Fit of the trained models to the data with no inhibitor, Raf inhibition, or PI3K inhibition.]
Fig. 8. Training initial PKN to in silico data with Boolean logic. The initial PKN (Fig. 7a) was trained to the in silico data
(Fig. 7b) using CellNOpt for ten independent optimization runs at each size penalty, 0 (a) and 10⁻⁵ (b). The bitstrings of the ten
runs were averaged for each size penalty and graphed with CNOCompileMultiBoolSol such that the darkness
of the lines indicates the frequency of the interaction in the average structure. The models resulting from training with a
modest size penalty were simpler than those with no size penalty, although the fit to the data was identical. (c) Upon
examination of the fit of the models trained with a modest size penalty to the data, where background darkness indicates
error (darker means more error), one immediately notices that the models failed to fit NFκB activation under EGF
stimulation. This is because the initial PKN (Fig. 7a) lacked a path from EGF to NFκB.
in the PKN, this gate was inconsistent with the data. Thus, it was
removed during the model training. By examining the data, we find
that the experimental basis for the removal of this edge was that ERK
but not Hsp27 is activated with EGF stimulation and the converse
is true with TNFα stimulation (Fig. 7b).
[Fig. 9 image. (a) Extended PKN. (b) Trained with size penalty of 10⁻⁵. (c) Fit of the trained models to the data with no inhibitor, Raf inhibition, or PI3K inhibition.]
Fig. 9. Training extended PKN to in silico data with Boolean logic. Due to the systematic error in modeling NFκB (Fig. 8c),
links from PI3K to NFκB and from Ras to NFκB were added to the PKN (a). When this extended PKN was fit to the data using
CellNOpt, the resultant models contained only the PI3K to NFκB link (b) and were able to fit the activation of NFκB under
EGF (c).
4.1.3. cFL Training and Analysis
We thus use cFL to train the PKN to the initial dataset (Fig. 10). To
train a cFL model, there are two requirements in addition to those
for BL training: (1) specification of the subset of transfer functions
from which the genetic algorithm (GA) chooses and (2) specification of reduction
thresholds to use after model training (not needed for this toy
example, so we set the parameter to be an empty vector). In our
case, we specify seven possible transfer functions for each interaction
(Table 3). These transfer functions include one linear transfer
[Fig. 10 image. (a) Structure of the trained cFL models. (b) Fit of the trained models to the data with no inhibitor, PI3K inhibition, or Raf inhibition.]
Fig. 10. Training extended PKN to in silico data with constrained fuzzy logic. Constrained
fuzzy logic was used to train the extended PKN (Fig. 9a) to the in silico data using CellNOpt
for ten independent optimization runs. The resultant models had identical structures (a),
albeit with slightly different transfer function parameters and were able to fit all aspects of
the data, even the species that were activated to an intermediate level (b).
4.2. Application to Data From the Hepatocyte Cell Line HepG2
As our second example, we describe the use of CellNOpt to analyze
the published data first used to validate the CellNOpt methodology
(9, 26). This dataset describes phosphorylation of 15 signaling
proteins in response to a battery of pro-growth and pro-inflammatory
cytokines. We will compare this data to a prior knowledge
network originally derived from the Ingenuity database and
altered as described in ref. 27 (Fig. 11a).
4.2.1. Model Preprocessing
As previously mentioned, CellNOpt first processes the PKN structure
before fitting it to the data. In this case, CellNOpt compresses
the PKN to result in a model containing the stimulated, perturbed,
and measured species as well as a few intermediate nodes necessary
to preserve logical consistency (Fig. 11b). CellNOpt then expands
the model to contain AND gates. In our case, we have specified that
the model should expand all input–output relationships into AND
gates if possible, but limit the inputs into an AND gate to two, by
setting the IdentGates parameter to "All" and the MaxInputsPerGate
parameter to 2 (see Table 3). This limited expansion
allows us to capture some logical complexity without overly complicating
the optimization process. The resulting logic superstructure
to be trained is shown in Fig. 11b.
4.2.2. Model Training and Analysis
After training this superstructure to the data 100 times with a modest penalty
for model size (1 × 10⁻⁴), we obtained a family of
solutions, some of which fit the data better than others
(Fig. 12a). These models were sorted according to how well they
fit the data, with a subset of best-fitting models chosen for further
analysis (ref. 26 describes one way to define such a subset). In this
case, there were two distinct populations of models: those with
scores greater than 0.06 and those with scores less than 0.06. We thus chose the subset of 41
models with scores <0.06. We used the CellNOpt function
CNOCompileMultiBoolSol to view the trained models directly
(Fig. 12b) and map the trained models to the original model to
see what pathways were kept by the optimization process
(Fig. 12c).
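Choosing the better-scoring population by a threshold is a simple filter. The Python sketch below uses invented scores from five runs and a hypothetical function name; only the 0.06 threshold comes from the text.

```python
def select_best(models, threshold):
    """Keep the models whose score lies below the threshold, best first."""
    kept = [m for m in models if m["score"] < threshold]
    return sorted(kept, key=lambda m: m["score"])


# Hypothetical scores from five optimization runs.
runs = [{"run": i, "score": s}
        for i, s in enumerate([0.052, 0.063, 0.051, 0.064, 0.053])]
best = select_best(runs, threshold=0.06)
print([m["run"] for m in best])
```

The retained models can then be averaged bit by bit to obtain the average structure of the selected family.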
From the trained BL models, one can immediately deduce that
no AND gate was present in all models, indicating that AND gates were
not necessary to fit the data. Furthermore, the models do not allow
TNFα or IL6 to activate downstream pathways through PI3K, and
most interactions from a growth signal to an inflammatory signal
were removed during the training process (e.g., Akt → Cot →
IKK; PI3K → Rac → Map3k1; and Ras → Map3k1). Additionally,
by plotting the fit of these models to the original data (Fig. 12d),
one finds that most of the data was fit very well with the exception
of partially activated signals downstream of IL6 and TNFα as well as
partial activation of cJun after stimulation with TNFα and TGFα. As
in the previous case, CellNOpt can be run with cFL in order to fit
the partially activated data values (27).
[Fig. 11 image. (a) Starting model. (b) Processed model.]
Fig. 11. Preprocessing PKN to train to HepG2 data. (a) PKN consisting of 82 nodes and 115 interactions trained to HepG2
data. The data consisted of measurement of the phosphorylation of the 15 species depicted in light grey ellipses after
30 min of stimulation with one of the six species in light grey rectangles, in the presence or absence of inhibition of one
of the seven species in dark grey rectangles. (b) During the preprocessing stage, the PKN was compressed to include only
designated species and undesignated species necessary to ensure preservation of logical consistency. All possible
two-input AND gates were then added in the expansion step, resulting in a logic superstructure with 30 nodes and 97
hyperedges.
[Fig. 12 image. (a) Histogram of number of models against score (0.05 to 0.065), with the better-fitting models marked. (b) Average structure of the trained models, with stimuli tgfa, igf1, il6, il1a, lps, and tnfa. (c) Average structure overlaid on the original model. (d) Fit to data for GSK3, Ikb, JNK12, p38, p70S6, p90RSK, STAT3, cJUN, CREB, HistH3, HSP27, IRS1s, MEK12, and p53 under inhibition of mTOR, MEK12, JNK12, GSK3, PI3K, p38, or IKK, or with no inhibitor.]
Fig. 12. Training PKN to HepG2 data with Boolean logic. (a) CellNOpt was used to train the PKN to the data 100 times, and
we readily identified that 41 of the optimization runs resulted in scores slightly better than the others. We focus the
remainder of our analysis on this subset, but note here that the other subset of models did not fit activation of IκB under
TNFα stimulation, resulting in smaller models that did not fit the data as well. (b) The average structure of the trained
models, where darkness of the interactions indicates frequency in the average structure. (c) The overlay of the average
structure onto the original model gives the user an idea of which pathways in the original PKN were kept and which were removed
during training (AND gates are not explicitly shown). (d) The fit to data can be plotted to visualize systematic error, where a
darker background indicates more error and a distorted background indicates that the data was unreliable.
5. Notes
1. Run Master Script from the command line with the Optimizer parameter set to
none and SaveResults parameter to true.
Fig. 13. Code to check CNOProject and PKN prior to training. To check that the names are
consistent in the PKN and CNOProject, after running the commands above, the file tagged
with InitialModelAndProject is the initial model compared to the initial CNOProject.
Species that are green are stimuli, orange are inhibited, and blue are measured. Check
that all species you meant to be included in the model are actually labeled as such. This is
a very good procedure for catching naming inconsistencies. Additionally, you will see
undesignated species that are compressed (dashed outline in black) or not compressed
(solid outline in black). The file tagged with ModelToTrain is the fully processed model.
You will see the results of model compression and all of the added AND gates. This model
can also be used to check for naming inconsistencies and see the additional complexity
added by the chosen expansion procedure.
Acknowledgments
References
1. Jorgensen C, Linding R (2010) Simplistic pathways or complex networks? Curr Opin Genet Dev 20:15–22
2. Gaudet S, Janes KA, Albeck JG, Pace EA, Lauffenburger DA, Sorger PK (2005) A compendium of signals and responses triggered by prodeath and prosurvival cytokines. Mol Cell Proteomics 4:1569–1590
3. Kestler HA, Wawra C, Kracher B, Kuhl M (2008) Network modeling of signal transduction: establishing the global view. Bioessays 30:1110–1125
4. Heinrichs A, Kritikou E, Pulverer B, Raftopoulou M (eds) (2006) Systems biology: a user's guide. Nature Publishing Group, New York, NY
212 M.K. Morris et al.
5. Boutros PC, Okey AB (2005) Unsupervised 18. Calzone L, Tournier L, Fourquet S, Thieffry
pattern recognition: an introduction to the D, Zhivotovsky B, Barillot E, Zinovyev A
whys and wherefores of clustering microarray (2010) Mathematical modelling of cell-fate
data. Brief Bioinform 6:331343 decision in response to death receptor engage-
6. DHaeseleer P (2005) How does gene expres- ment. PLoS Comput Biol 6:e1000702
sion clustering work? Nat Biotechnol 19. Pandey S, Wang RS, Wilson L, Li S, Zhao Z,
23:14991501 Gookin TE, Assmann SM, Albert R (2010)
7. Janes KA, Albeck JG, Gaudet S, Sorger P, Lauf- Boolean modeling of transcriptome data reveals
fenburger DA, Yaffe MB (2005) A systems novel modes of heterotrimeric G-protein
model of signaling identifies a molecular basis action. Mol Syst Biol 6:372
set for cytokine-induced apoptosis. Science 20. Samaga R, Von Kamp A, Klamt S (2010) Com-
310:16461653 puting combinatorial intervention strategies
8. Miller-Jensen K, Janes K, Brugge J, Lauffen- and failure modes in signaling networks. J
burger D (2007) Common effector processing Comput Biol 17:3953
mediates cell-specific responses to stimuli. 21. Morris MK, Saez-Rodriguez J, Sorger PK,
Nature 448:604608 Lauffenburger DA (2010) Logic-based models
9. Alexopoulos L, Saez-Rodriguez J, Cosgrove B, for the analysis of cell signaling networks. Bio-
Lauffenburger D, Sorger P (2010) Networks chemistry 49:32163224
inferred from biochemical data reveal profound 22. Berger SI, Iyengar R (2009) Network analyses
differences in Toll-like receptor and inflamma- in systems pharmacology. Bioinformatics
tory signaling between normal and trans- 25:24662472
formed hepatocytes. Mol Cell Proteomics 23. Mitsos A, Melas IN, Siminelakis P, Chairakai
9:18491865 AD, Saez-Rodriguez J, Alexopoulos LG
10. Markowetz F, Spang R (2007) Inferring cellu- (2009) Identifying drug effects via pathway
lar networksa review. BMC Bioinform 8 alteractions using an Interger Linear Program-
(Suppl 6):S5 ming optimization formulation on phospho-
11. Bansal M, Belcastro V, Ambesi-Impiombato A, proteomic data. PLoS Comput Biol 5:
di Bernardo D (2007) How to infer gene net- e1000591
works from expression profiles. Mol Syst Biol 24. Saez-Rodriguez J, Alexopoulos L, Zhang MS,
3:78 Morris MK, Lauffenburger DA, Sorger PK
12. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger (2011) Comparative logical models of signal-
PK, Alexopoulos LG, Xue X, Clarke ND, Altan- ing networks in normal and transformed hepa-
Bonnet G, Stolovitzky G (2010) Towards a tocytes. Cancer Res 71:1
rigorous assessment of systems biology models: 25. Cosgrove B, Alexopoulos L, Hang T, Hendriks
the DREAM3 challenges. PLoS One 5:e9202 B, Sorger P, Griffith L, Lauffenburger D (2010)
13. Sachs K, Perez O, Peer D, Lauffenburger DA, Cytokine-associated drug toxicity in human
Nolan GP (2005) Causal protein-signaling net- hepatocytes is associated with signaling network
works derived from multiparameter single-cell dysregulation. Mol Biosyst 6:11951206
data. Science 308:523529 26. Saez-Rodriguez J, Alexopoulos LG, Epperlein
14. Jaqaman K, Danuser G (2006) Linking data to J, Samaga R, Lauffenburger DA, Klamt S, Sor-
models: data regression. Nat Rev Mol Cell Biol ger PK (2009) Discrete logic modelling as a
7:813819 means to link protein signalling networks with
15. Chen WW, Schoeberl B, Jasper PJ, Niepel M, functional analysis of mammalian signal trans-
Nielsen UB, Lauffenburger DA, Sorger PK duction. Mol Syst Biol 5:331
(2009) Inputoutput behavior of ErbB signal- 27. Morris MK, Saez-Rodriguez J, Clarke DC,
ing pathways as revealed by a mass action Sorger PK, Lauffenburger DA (2011) Training
model trained against dynamic data. Mol Syst signaling pathway maps to biochemical data
Biol 5:239 with constrained fuzzy logic: quantitative anal-
16. Becker V, Schilling M, Bachmann J, Baumann ysis of liver cell responses to inflammatory sti-
U, Raue A, Maiwald T, Timmer J, Klingmuller muli. PLoS Comp Biol 7(3):e1001099
U (2010) Covering a broad dynamic range: 28. Alves R, Antunes F, Salvador A (2006) Tools
information processing at the erythropoietin for kinetic modeling of biochemical networks.
receptor. Science 328:14041408 Nat Biotechnol 24:667672
17. Rangamani P, Iyengar R (2008) Modelling cel- 29. Maly IVE (2009) Systems biology, vol 500.
lular signalling systems. Essays Biochem Humana, New York, NY
45:8394
8 Construction of Cell TypeSpecific Logic Models of Signaling Networks. . . 213
Sander C, Bader GD (2010) The BioPAX Haber DA, Settleman J (2007) Identification
community standard for pathway data sharing. of genotype-correlated sensitivity to selective
Nat Biotechnol 28:935942 kinase inhibitors by using high-throughput
54. Saez-Rodriguez J, Goldsipe A, Muhlich J, tumor cell line profiling. Proc Natl Acad Sci
Alexopoulos L, Millard B, Lauffenburger D, U S A 104:1993619941
Sorger P (2008) Flexible informatics for link- 59. Mori S, Chang JT, Andrechek ER, Potti A,
ing experimental data to mathematical models Nevins JR (2009) Utilization of genomic sig-
via DataRail. Bioinformatics 24:840847 natures to identify phenotype-specific drugs.
55. Bian D, Su S, Mahanivong C, Cheng RK, Han PLoS One 4:e6772
Q, Pan ZK, Sun P, Huang S (2004) Lysopho- 60. Iorio F, Bosotti R, Scacheri E, Belcastro V,
sphatidic acid stimulates ovarian cancer cell Mithbaokar P, Ferriero R, Murino L, Taglia-
migration via a Ras-MEK Kinase 1 pathway. ferri R, Brunetti-Pierri N, Isacchi A, di Ber-
Cancer Res 64:42094217 nardo D (2010) Discovery of drug mode of
56. Sander EE, van Delft S, ten Klooster JP, Reid T, action and drug repositioning from transcrip-
van der Kammen RA, Michiels F, Collard JG tional responses. Proc Natl Acad Sci U S A
(1998) Matrix-dependent Tiam1/Rac signal- 107:1462114626
ing in epithelial cells promotes either cell-cell 61. Lamb J, Crawford ED, Peck D, Modell JW,
adhesion or cell migration and is regulated by Blat IC, Wrobel MJ, Lerner J, Brunet JP, Sub-
phosphatidylinositol 3-kinase. J Cell Biol ramanian A, Ross KN, Reich M, Hieronymus
143:13851398 H, Wei G, Armstrong SA, Haggarty SJ, Clem-
57. Fanger GR, Johnson NL, Johnson GL (1997) ons PA, Wei R, Carr SA, Lander ES, Golub TR
MEK kinases are regulated by EGF and selec- (2006) The Connectivity Map: using gene-
tively interact with Rac/Cdc42. EMBO J 16: expression signatures to connect small mole-
49614972 cules, genes, and disease. Science 313:
58. McDermott U, Sharma SV, Dowell L, Grenin- 19291935
ger P, Montagut C, Lamb J, Archibald H, 62. Iskar M, Campillos M, Kuhn M, Jensen LJ, van
Raudales R, Tam A, Lee D, Rothenberg SM, Noort V, Bork P (2010) Drug-induced regula-
Supko JG, Sordella R, Ulkus LE, Iafrate AJ, tion of target expression. PLoS Comput Biol 6:
Maheswaran S, Njauw CN, Tsao H, Drew L, e1000925
Hanke JH, Ma XJ, Erlander MG, Gray NS,
Chapter 9
Regulatory Networks
Gilles Bernot, Jean-Paul Comet, and Christine Risso-de Faverney
Abstract
The usefulness of mathematical models for the biological regulatory networks relies on the predictive
capability of the models in order to suggest interesting hypotheses and suitable biological experiments.
All mathematical frameworks dedicated to biological regulatory networks must manage a large number of
abstract parameters, which are not directly measurable in the cell. The cornerstone to establish predictive
models is the identification of the possible parameter values. Formal frameworks involve qualitative models
with discrete values and computer-aided logic reasoning. They can provide the biologists with an automatic
identification of the parameters via a formalization of some biological knowledge into temporal logic
formulas. For pedagogical reasons, we focus on gene regulatory networks and develop a qualitative
model of the detoxification of benzo[a]pyrene in human cells to illustrate the approach.
Key words: Biological regulatory networks, Gene regulatory networks, Mathematical modeling,
Systems biology, Temporal logic, Model checking, Benzo[a]pyrene, Detoxification pathway, CYP,
Metabolizing enzymes
1. Introduction
Almost all the difficult questions that involve biological systems, with
several interacting entities, need mathematical models and
computer-aided reasoning in order to predict the global behavior
of the system, or to establish some characteristics of the dynamics of
the system. Domain-oriented formal frameworks are then required
in order to efficiently design the mathematical models, to perform
low-cost simulations, and to extract relevant predictions from the
models. The choice of the best suited formal framework is guided
by the biological question(s) under consideration. For instance if
the position of the entities in a two- or three-dimensional space is
important, as well as their trajectories, or the form of some com-
partments, diffusion speeds, etc., then frameworks such as cellular
automata (1), multi-agent systems (2), discrete geometry (3), and
so on may be suited. If, on the contrary, it is relevant to ignore the
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_9, © Springer Science+Business Media, LLC 2013
216 G. Bernot et al.
2. Example: Detoxification Induced by Benzo[a]pyrene Exposure
Benzo[a]pyrene (BaP) is an environmental carcinogenic polycyclic
aromatic hydrocarbon (PAH) that is formed through incomplete
combustion of organic materials; common sources are tobacco
smoke, automobile exhaust, and food (7, 8). BaP toxicity is largely
mediated through binding to the aryl hydrocarbon receptor (AhR),
a ligand-activated transcription factor found in vertebrate species
from fish to humans (9).
The unliganded AhR is maintained in the cytoplasm in association
with a chaperone complex (Hsp90/XAP/p23).
Depending upon BaP binding and concentration, the activated
AhR sheds the chaperone proteins and translocates into the nucleus,
where it forms a heterodimer with the AhR nuclear translocator
(ARNT) already present in the nucleus (10). This complex recognizes
an enhancer DNA element known as the aryl hydrocarbon
response element (AHRE), also called xenobiotic response element
(XRE) or dioxin response element (DRE), in the promoter
region of target genes collectively known as the AhR gene
battery, which results in their transcriptional activation (11–13).
The AhR gene battery includes cytochrome P450 genes (e.g., the Cyp 1
family) as well as non-P450 genes (e.g., a glutathione S-transferase
(Gsta1), a UDP glucuronosyltransferase (Ugt a6), ...) that are
3. Method: Thomas Framework
3.1. Mathematical Models of Regulatory Networks
In order to design a predictive mathematical model for regulatory
networks, one has to collect two kinds of biological knowledge:
1. The sensible set of biological objects that are supposed to drive
the biological system (or the biological phenomenon) under
consideration, and the mutual influences between these
objects; this knowledge will constitute the structure of the
model.
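[Interaction graph figure: nodes BaP, CYP, and BM; multiplex dp1 labeled CYP & not(BM); multiplex dp2 labeled CYP & BaP; edge thresholds 1 and 2.]
[Fig. 4 plot: sigmoid curves for g1, g2, and g3 against the level of g, with intervals numbered 0 to 3.]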
Fig. 4. Example of sigmoid shapes where a source gene g activates its target genes g1
and g3 and inhibits g2. Four qualitatively different intervals appear, numbered by the
number of genes on which g is acting.
3.3. Dynamics: State Graphs
The dynamics of a regulatory network define how the biological
system evolves autonomously by describing the successive states
that the system shall exhibit, starting from any initial state. Following
René Thomas, within a discrete model, a state is defined by the
number of the interval (as described before) associated to each
variable. Intuitively, the state of a given variable g represents the
number of edges in the graph on which g is acting (as already
pointed out, it can be lower if there are some shared thresholds).
Given a current state, it describes for each variable the
targets on which the variable is currently acting. For example,
(BaP = 1, CYP = 1, BM = 0) is a state where, according to Fig. 3:
• BaP activates CYP via the left-hand side edge with threshold 1
but not via the right-hand side edge with threshold 2, and BaP
is not acting on dp2.
• CYP is acting on dp1 but it is not acting on dp2.
• BM is not acting on dp1 (thus not(BM) is true) and BM does
not repress BaP.
Consequently, given a current state, one can make an
inventory of the edges that are acting on a variable. For example,
within the current state (BaP = 1, CYP = 1, BM = 0):
• BaP is repressed by dp1 because CYP & not(BM) is satisfied
(because CYP passes its threshold 1 and BM does not); BaP is
not repressed by BM because BM does not pass the threshold 1.
• CYP is activated by BaP at level 1 (left-hand side edge), but not
at level 2 (right-hand side edge).
• BM is not activated by dp2 because CYP & BaP is false (because
CYP does not pass its threshold 2, and neither does BaP).
All mathematical frameworks for gene regulatory networks
consider that this inventory decides what are all the possible futures
9 Regulatory Networks 225
[Fig. 5 panels, each plotting BM (0 to 1) against CYP (0 to 2): transition from state (2, 0) (left); transitions from state (1, 1) (middle); state graph for BaP = 0 (right).]
• Let us consider now the state (2,0). Its focal state is still (0,0) as
shown on the left part of Fig. 5 with the long, crossed out
arrow. Obviously a direct transition from (2,0) to (0,0) would
be biologically impossible because the CYP degradation must
cross the interval numbered 1 instead of jumping from 2 to 0.
Consequently, the dynamics convert the long arrow into a
transition of length 1, as shown with the short arrow in the
same figure.
• Lastly, let us consider the state (1,1). Its focal state is still (0,0) as
shown in the middle part of Fig. 5 with the diagonal, crossed out
arrow. A transition from (1,1) to (0,0) would mean that both
CYP and BM cross their respective sigmoidal thresholds at exactly
the same time. In fact, one of them is likely to cross the
threshold first, depending on which one is closer to its threshold
in the real current state in vivo. Consequently, the dynamics
replace the oblique arrow by two transitions, as shown with the
two perpendicular arrows, one of them modifying the CYP state
alone and the other one modifying the BM state alone.
These two principles (the length of a transition is 1 and a
transition modifies only one variable at a time) define how to
build the state graph. The right part of Fig. 5 shows all the transitions
that stay in the BaP = 0 plane with KCYP,{} = 0 and KBM,{} = 0.
This part of the state graph shows that, in the absence of
BaP, the state (CYP = 0, BM = 0) is a stable state toward which all
states converge and that CYP and BM can decrease in any order.
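The two principles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name async_successors is hypothetical, states are written (CYP, BM), and every state of the BaP = 0 plane is given the focal state (0, 0), as in the right part of Fig. 5.

```python
from typing import List, Tuple

State = Tuple[int, ...]


def async_successors(state: State, focal: State) -> List[State]:
    """Return the asynchronous successors of `state` given its focal state.

    Principle 1: a transition has length 1 (one interval is crossed).
    Principle 2: a transition modifies only one variable at a time.
    """
    successors = []
    for i, (current, target) in enumerate(zip(state, focal)):
        if current == target:
            continue  # this variable already sits at its focal value
        step = 1 if target > current else -1
        successors.append(state[:i] + (current + step,) + state[i + 1:])
    return successors


# States are (CYP, BM); in the BaP = 0 plane the focal state is always (0, 0).
focal = (0, 0)
plane = [(cyp, bm) for cyp in range(3) for bm in range(2)]
state_graph = {s: async_successors(s, focal) for s in plane}

for s, nxt in state_graph.items():
    print(s, "->", nxt if nxt else "stable")
```

A state with no successor is a stable state; here only (0, 0) has that property, which matches the convergence described above.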
4. An Example of Possible Parameter Values, Among Others
When the environment brings BaP into the cell, one has to consider
all possible values of BaP. Consequently, the state graph is three
dimensional, with three planes (one for each value of BaP) whose
transitions are similarly deduced from the parameters, and
several transitions jump between planes when BaP varies. Let
us consider a first case where the quantity of intracellular BaP can be
handled by the dp1. This case is modeled by KBaP,{} = 1, and
except for KCYP,{} = 0 and KBM,{} = 0, the other parameters are a
priori unknown. The next section explains how the computer can
help in finding the parameter values. For the moment, let us
arbitrarily consider the following reasonable values to complete
Fig. 3:
1. KBaP,{dp1} = 0 (meaning that the dp1 can be sufficient to
reduce BaP from 1 to 0) and KBaP,{BM} = 1 (following the
intuition that BM characterizes the dp2 by the presence of
oxidative stress, and that the role of the dp2 is to reduce BaP
from 2 to 1).
2. KCYP,{BaP1} = 1 (the expression of CYP when BaP is maintained
at level 1) and KCYP,{BaP1,BaP2} = 2 (the expression of
CYP when BaP is maintained at level 2).
3. KBM,{dp2} = 1 (BM triggers a significant oxidative stress when
both BaP and CYP are maintained at level 2).
For each of the 18 possible states, we exhaustively list the edges
that act on each variable, and this determines the parameter that
plays the role of focal point. The table on the left of Fig. 6 gives the
18 corresponding lines.
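The construction of such a table can be sketched as follows. This is a minimal illustration, not the authors' tool: the parameter values are those quoted above, while the function names and the threshold tests (dp1 active when CYP ≥ 1 and BM = 0, BM acting on BaP when BM ≥ 1, dp2 active when CYP ≥ 2 and BaP ≥ 2) are assumptions read off the state examples given for Fig. 3.

```python
from itertools import product

# Parameter values quoted in the text (first case, K_BaP,{} = 1).
K_BAP = {frozenset(): 1, frozenset({"dp1"}): 0, frozenset({"BM"}): 1}
K_CYP = {frozenset(): 0, frozenset({"BaP1"}): 1, frozenset({"BaP1", "BaP2"}): 2}
K_BM = {frozenset(): 0, frozenset({"dp2"}): 1}


def acting_edges(bap, cyp, bm):
    """Inventory of the edges acting on BaP, CYP, and BM (assumed thresholds)."""
    on_bap = set()
    if cyp >= 1 and bm < 1:  # dp1 = CYP & not(BM)
        on_bap.add("dp1")
    if bm >= 1:  # BM acts on BaP at threshold 1
        on_bap.add("BM")
    on_cyp = set()
    if bap >= 1:  # left-hand side edge, threshold 1
        on_cyp.add("BaP1")
    if bap >= 2:  # right-hand side edge, threshold 2
        on_cyp.add("BaP2")
    on_bm = {"dp2"} if (cyp >= 2 and bap >= 2) else set()  # dp2 = CYP & BaP
    return frozenset(on_bap), frozenset(on_cyp), frozenset(on_bm)


def focal_state(state):
    """Look up, for each variable, the parameter selected by its acting edges."""
    on_bap, on_cyp, on_bm = acting_edges(*state)
    return (K_BAP[on_bap], K_CYP[on_cyp], K_BM[on_bm])


# States are (BaP, CYP, BM): 3 x 3 x 2 = 18 states.
states = list(product(range(3), range(3), range(2)))
table = {s: focal_state(s) for s in states}

for s in states:
    print(s, "focal point:", table[s])
```

Because dp1 requires BM = 0 and the BM edge requires BM ≥ 1, the two edges never act on BaP together under these assumptions, so the three listed values of KBaP suffice for the lookup.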
Then, by applying the two principles explained before, we get
the state graph drawn on the right of Fig. 6. Of course, the lower
level plane, where BaP = 0, is the one obtained in Fig. 5, but we
can see that the state (BaP = 0, CYP = 0, BM = 0) is not a stable
state anymore, because the new value KBaP,{} = 1 creates the red
[Fig. 6 state graph: three planes (BaP = 2, BaP = 1, BaP = 0), each with axes CYP and BM.]
Fig. 6. A discrete model of the interleaving pathways dp1 and dp2 when the environment imposes an in-between level of BaP.
The table shows for each possible state the set of resources of each variable and the possible evolution directions (left).
The associated state graph (right).
[Fig. 7 state graph: three planes (BaP = 2, BaP = 1, BaP = 0), each with axes CYP and BM.]
Fig. 7. A discrete model of the interleaving pathways dp1 and dp2 when the environment imposes a high level of BaP.
The table shows for each possible state the set of resources of each variable and the possible evolution directions (left).
The associated state graph (right).
5. Materials: Model Checking and SMBioNet
Since the parameters are generally not measurable in vivo, finding a
suitable class of parameters constitutes a major issue of the modeling
activity. In fact, while available data on the connectivity between
elements of the network are more and more numerous, the kinetic
data of the associated interactions remain difficult to interpret in
order to identify the strength of the gene activations or inhibitions.
While it is rather easy to construct the interaction graph,
determining the dynamics of the model is quite difficult. This
parameter identification problem constitutes the cornerstone of the
modeling activity. It would therefore be useful to derive automatically,
from biologically known or hypothetical behaviors, the parameter
values that lead to dynamics coherent with the available knowledge
on the behavior of the system. In the context of the purely discrete
modeling presented before, this problem is simpler because the number
of parameterizations to consider is finite. Nevertheless, this number
is so large that a computer-aided method is needed to help biologists
go further in the comprehension of the biological system
under study. We show in this section how formal methods from
computer science are able to perform computer-aided identification
of parameters.
5.1. Temporal Logic
Temporal logics are languages that allow us to formalize known
biological behaviors or hypothetical behaviors in such a way that
computers can automatically check whether a model exhibits those
behaviors or not. The building blocks of a temporal logic are atoms,
connectives, and temporal modalities. Let us here consider the
Computation Tree Logic (26), CTL for short, which is one of the
most common temporal logics:
1. Atoms in CTL are simple statements about the current state of
a variable of the network: equalities (e.g., (BaP = 2)) or
inequalities (e.g., (CYP < 1) or (CYP > 1)).
2. Connectives are the standard connectives: ¬, as negation
(e.g., ¬(BaP = 0) is the negation of the atom (BaP = 0));
∧, as "and", stands for the conjunction (e.g., (BaP = 0) ∧
(CYP > 1)); ∨, as "or", stands for the disjunction (e.g.,
(BaP = 0) ∨ (CYP > 1)); ⇒, as "implies", stands for implication
(e.g., (BaP = 0) ⇒ (CYP > 1)), and so on.
3. Temporal modalities are combinations of two types of information
(one quantifier followed by one time operator, e.g., EX, AF, or AG):
(a) Quantifiers: A formula can be checked with respect to all
possible choices of path in the asynchronous state graph (universal
quantifier, denoted by the character A), or one can
check whether there exists at least one path such that the formula is
satisfied (existential quantifier, denoted by the character E).
(b) Discrete time elapsing: A formula can be checked at the
next state (character X), in some future state which is not
necessarily the next one (character F), and in all future
states (character G). Moreover, a formula can be checked
until another formula becomes satisfied in the future (character U).
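To make the modalities concrete, here is a small Python sketch that evaluates EX, EF, AF, and AG on an explicit state graph by fixpoint iteration. It is only an illustration and not the algorithm of any particular model checker: the function names are hypothetical, formulas are represented by the set of states that satisfy them, and stable states are given a self-loop so that every state has a successor (a usual convention for CTL, assumed here). The state graph is the BaP = 0 plane with states written (CYP, BM).

```python
def EX(succ, target):
    """States with at least one successor in `target`."""
    return {s for s, nxt in succ.items() if nxt & target}


def AX(succ, target):
    """States whose successors all lie in `target`."""
    return {s for s, nxt in succ.items() if nxt <= target}


def EF(succ, target):
    """Least fixpoint: `target` holds now, or EF(target) holds at some successor."""
    result = set(target)
    while True:
        new = result | EX(succ, result)
        if new == result:
            return result
        result = new


def AF(succ, target):
    """Least fixpoint: `target` holds now, or AF(target) holds at every successor."""
    result = set(target)
    while True:
        new = result | AX(succ, result)
        if new == result:
            return result
        result = new


def AG(succ, target):
    """Greatest fixpoint: `target` holds now and AG(target) holds at every successor."""
    result = set(target)
    while True:
        new = result & AX(succ, result)
        if new == result:
            return result
        result = new


# BaP = 0 plane: each variable can only decrease toward the focal state (0, 0).
states = [(cyp, bm) for cyp in range(3) for bm in range(2)]
succ = {}
for cyp, bm in states:
    nxt = set()
    if cyp > 0:
        nxt.add((cyp - 1, bm))
    if bm > 0:
        nxt.add((cyp, bm - 1))
    succ[(cyp, bm)] = nxt or {(cyp, bm)}  # self-loop on the stable state

cyp_is_0 = {s for s in states if s[0] == 0}  # atom (CYP = 0)
print(sorted(AF(succ, AG(succ, cyp_is_0))))  # states satisfying AF(AG(CYP = 0))
```

In this plane every path eventually reaches CYP = 0 and stays there, so AF(AG(CYP = 0)) holds in all six states.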
5.2. CTL to Encode Biological Properties
CTL formulas are useful to express temporal properties of
biological systems. Once such properties have been elaborated, a
model of the biological system will be acceptable only if its state
graph satisfies the CTL formulas; otherwise, it is discarded.
Considering our running example, three temporal properties
seem relevant.
The first temporal property focuses on the behavior of the
system when the toxic exposure level is null (KBaP = 0). In such a
case, the system is able to reset the expression level of CYP toward
its basal level, that is, toward 0. Let us first denote by (x,y,z) the
formula ((BaP = x) ∧ (CYP = y) ∧ (BM = z)). Since (KBaP = 0)
is equivalent to the fact that from the state (0,0,0), the increasing of
BaP is not possible, this behavior is translated into CTL as
φ0 ≡ [(0,0,0) ⇒ ¬EX(1,0,0)] ∧ [(BaP = 0) ⇒ AF(AG(CYP = 0))]
The second property focuses on the behavior of the system
when the toxic exposure level is set to 1 (KBaP = 1). The dp1 is
supposed to be sufficient to detoxify the cell completely. Besides,
BaP cannot increase up to level 2. In addition, the dp1 (when the
BaP level is decreasing from level 1 to 0) does not involve the
oxidative stress/ARE pathway (in other words, BM = 0). Also,
(KBaP = 1) is equivalent to the fact that, on the one hand, from
the state (0,0,0), the increasing of BaP is possible, and on the other
hand, from the state (2,0,0), the decreasing of BaP is possible.
Thus, these properties are translated in CTL as follows:
φ1 ≡ ([(0,0,0) ⇒ EX(1,0,0)] ∧ [(2,0,0) ⇒ EX(1,0,0)])
∧ ((BaP > 0) ⇒ {EF(BaP = 0) ∧ AF(AG(BaP < 2)) ∧ AG[((BaP = 1) ∧ EX(BaP = 0)) ⇒ (BM = 0)]})
The third temporal property focuses on the behavior when the
toxic exposure level is set to 2 (KBaP = 2). In such a case, a path
which detoxifies completely exists:
φ2 ≡ (BaP = 2) ⇒ EF(BaP = 0)
In practice, CTL formulas are sufficient to express the majority
of useful biological properties even if in some cases the translation
of a property is tricky.
References
1. Kier LB, Bonchev D, Buck GA (2005) Modeling biochemical networks: a cellular-automata approach. Chem Biodivers 2:233–243
2. Hoehme S, Drasdo D (2010) A cell-based simulation software for multicellular systems. Bioinformatics 26(20):2641–2642
3. Poudret M, Comet J-P, Le Gall P et al (2008) Topology-based abstraction of complex biological systems: application to the Golgi apparatus. Theory Biosci 127:79–88
4. Eungdamrong NJ, Iyengar R (2004) Modeling cell signaling networks. Biol Cell 96:355–362
5. Schuster S, Hilgetag C, Woods JH et al (2002) Elementary flux modes in biochemical reaction systems: algebraic properties, validated calculation procedure and example from nucleotide metabolism. J Math Biol 45:153–181
6. Thomas R, d'Ari R (1990) Biological feedback. CRC, Boca Raton
7. Bostrom CE, Gerde P, Hanberg A et al (2002) Cancer risk assessment, indicators, and guidelines for polycyclic aromatic hydrocarbons in the ambient air. Environ Health Perspect 110(suppl 3):451–488
8. Phillips DH (1999) Polycyclic aromatic hydrocarbons in the diet. Mutat Res 443(1–2):39–147
9. Schmidt JV, Bradfield CA (1996) Ah receptor signaling pathways. Annu Rev Cell Dev Biol 12:55–89
10. Hankinson O (1995) The aryl hydrocarbon receptor complex. Annu Rev Pharmacol Toxicol 35:307–340
11. Gu YZ, Hogenesch JB, Bradfield CA (2000) The PAS superfamily: sensors of environmental and developmental signals. Annu Rev Pharmacol Toxicol 40:519–561
12. Nebert DW, Roe AL, Dieter MZ et al (2000) Role of the aromatic hydrocarbon receptor and [Ah] gene battery in the oxidative stress response, cell cycle control, and apoptosis. Biochem Pharmacol 59:65–85
13. Nebert DW, Dalton TP, Okey AB et al (2004) Role of aryl hydrocarbon receptor-mediated induction of the CYP1 enzymes in environmental toxicity and cancer. J Biol Chem 279(23):23847–23850
14. Nebert DW, Vasiliou V (2004) Analysis of the glutathione S-transferase (GST) gene family. Hum Genomics 1(6):460–464
15. Nioi P, Hayes JD (2004) Contribution of NAD(P)H:quinone oxidoreductase 1 to protection against carcinogenesis, and regulation of its gene by the Nrf2 basic-region leucine zipper and the arylhydrocarbon receptor basic helix-loop-helix transcription factors. Mutat Res 555(1–2):149–171
16. Yoshinari K, Okino N, Sato T et al (2006) Induction of detoxifying enzymes in rodent white adipose tissue by aryl hydrocarbon receptor agonists and antioxidants. Drug Metab Dispos 4:1081–1089
17. Wills LP, Zhu S, Willett KL et al (2009) Effect of CYP1A inhibition on the biotransformation of benzo[a]pyrene in two populations of Fundulus heteroclitus with different exposure histories. Aquat Toxicol 92(3):195–201
18. Parkinson A (1996) Biotransformation of xenobiotics. In: Klaassen CD (ed) Casarett and Doull's toxicology: the basic science of poisons. McGraw-Hill, New York
19. Yang SK (1988) Stereoselectivity of cytochrome P-450 isozymes and epoxide hydrolase in the metabolism of polycyclic aromatic hydrocarbons. Biochem Pharmacol 37:61–70
20. Kohle C, Bock KW (2007) Coordinate regulation of phase I and II xenobiotic metabolism by the Ah receptor and Nrf2. Biochem Pharmacol 73:1853–1862
21. de Jong H (2002) Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol 9(1):67–103
22. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361
23. Tyson JJ, Othmer HG (1978) The dynamics of feedback control circuits in biochemical pathways. Prog Theor Biol 5:1–62
24. Bernot G, Comet J-P, Richard A et al (2004) Application of formal methods to biological regulatory networks: extending Thomas' asynchronous logical approach with temporal logic. J Theor Biol 229(3):339–347
25. Khalis Z, Comet J-P, Richard A et al (2009) The SMBioNet method for discovering models of gene regulatory networks. Genes Genomes Genomics 3(special issue 1):15–22
26. Emerson EA (1990) Temporal and modal logic. In: Van Leeuwen J (ed) Handbook of theoretical computer science. MIT Press, Cambridge
27. Cimatti A, Clarke EM, Giunciglia EF et al (2002) NuSMV 2: An open source tool for symbolic model checking. In: Proceeding of international conference on computer-aided verification (CAV 2002), pp 27–31
Chapter 10
Abstract
Reconstruction of metabolic networks from metabolites, enzymes, and reactions is the foundation of the
network-based study on metabolism. In this chapter, we describe a practical method for reconstructing
metabolic networks from KEGG. This method makes use of organism-specific pathway data in the KEGG/
PATHWAY database to reconstruct metabolic networks on pathway level, and the pathway hierarchy data in
the KEGG/ORTHOLOGY database to guide the network reconstruction on higher levels. By calling upon
the KEGG Web services, this method ensures that the data used in the reconstruction are correct and up-to-date.
Incorporating a local relational database allows pathway data to be cached, which improves performance
and speeds up network reconstruction. Applications of the reconstructed networks to network alignment
and network topology analysis are illustrated, and notes are given at the end.
Key words: Metabolic network, Network reconstruction, KEGG, Web service, Relational database,
Network alignment, Network topology analysis
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_10, © Springer Science+Business Media, LLC 2013
236 T. Zhou
2. Materials
3. Methods
Fig. 1. Illustration of metabolic network representation scheme. Two reversible reactions, R1 and R2, are taken as
examples in (a). In R1, substrate S is transformed to product P under the catalysis of enzyme E1; in R2, substrate P, also the
product in R1, is transformed to Q under the catalysis of enzyme E2. Thus, metabolites S, P, and Q are connected to one
another via enzymes or reactions, which form a simple metabolite graph, as (b) illustrates. Meanwhile, enzyme E1 and E2,
reaction R1 and R2, interact with each other via the common metabolite P and form the enzyme graph (c) and the
reaction graph (d) respectively. In these graphs, the undirected edge is split into two arcs with opposite directions.
Fig. 2. Part of enzyme data in hsa00010.xml. The highlighted and italic parts are used in the enzyme relation extraction.
Fig. 3. Extraction of enzyme relations. The dashed lines represent the relation data given in Fig. 2. The double lines with
open-headed arrows at both ends indicate the extracted bidirectional enzyme relation.
Fig. 4. The reconstructed enzyme network of hsa00010. This network consists of 27 nodes and 91 arcs, rendered in
Cytoscape. Notice that this network shows strong modularity. Two main modules are distinguished by different gray levels.
6. Retrieve the enzyme list from the gene-enzyme table. This is the list
of enzymes in hsa00010.
7. Retrieve all the entry relations in hsa00010 using
get_element_relations_by_pathway().
8. Form the gene relation list by filtering out the relations whose type is not
ECrel. Notice that gene relations are directed: the direction
is from entry 1 to entry 2 in the ECrel relation.
9. Map the gene relation list to the enzyme relation list according
to the gene-enzyme table. The directions of the gene relations
are kept for the corresponding enzyme relations in the enzyme relation list.
10. With the enzyme list derived from step 6 as the node list and
the enzyme relation list from step 9 as the arc list, complete the
reconstruction of the enzyme network for hsa00010.
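Steps 8 to 10 can be sketched as follows, assuming the gene-enzyme table and the entry relations have already been retrieved. The function name build_enzyme_network, the tuple layout (entry 1, entry 2, type), and the identifiers in the sample data are hypothetical; no KEGG call is made here.

```python
def build_enzyme_network(gene_to_enzymes, relations):
    """Build (node list, arc list) of an enzyme network.

    gene_to_enzymes: dict mapping a gene id to the enzymes it encodes.
    relations: iterable of (gene1, gene2, relation_type) tuples, directed
               from entry 1 to entry 2.
    """
    # Node list: every enzyme in the gene-enzyme table.
    nodes = sorted({ec for ecs in gene_to_enzymes.values() for ec in ecs})

    arcs = set()
    for gene1, gene2, rel_type in relations:
        if rel_type != "ECrel":
            continue  # keep only enzyme-enzyme relations
        # Map the directed gene relation onto directed enzyme relations.
        for ec1 in gene_to_enzymes.get(gene1, []):
            for ec2 in gene_to_enzymes.get(gene2, []):
                arcs.add((ec1, ec2))
    return nodes, sorted(arcs)


# Made-up sample data, for illustration only.
gene_to_enzymes = {
    "gene:A": ["ec:E1"],
    "gene:B": ["ec:E2"],
    "gene:C": ["ec:E3", "ec:E4"],
}
relations = [
    ("gene:A", "gene:B", "ECrel"),
    ("gene:B", "gene:C", "ECrel"),
    ("gene:A", "gene:C", "maplink"),  # filtered out: not an ECrel relation
]

nodes, arcs = build_enzyme_network(gene_to_enzymes, relations)
print(nodes)
print(arcs)
```

A gene that encodes several enzymes contributes one arc per enzyme pair, which is one simple reading of step 9; other mapping rules are possible.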
3.1.2. On Higher Level
Pathways can be considered as subnetworks that have comparatively
independent functions. Therefore, the reconstruction of
enzyme networks on higher levels can be considered as a recursive
union of the enzyme networks on the
pathway level, guided by the organism-specific KO hierarchy. Since
KEGG/ORTHOLOGY provides only the reference hierarchy,
the key problem is how to determine the KO hierarchy for the
specific organism.
The list below shows how to reconstruct the enzyme network of hsa on
the process level, that is, hsa01100. Because of the limit on
chapter length, the reconstructed network is not shown.
1. Use get_ko_by_ko_class() to retrieve the ko list of systems on
the subprocess level of hsa01100.
2. For each system in the subprocess-level ko list, use
get_ko_by_ko_class() again to get the ko list of pathways on the pathway
level. The three-level KO reference hierarchy is now
retrieved.
3. Use list_pathways() to retrieve all the pathways of the organism
hsa. This gives the pathway list of the given organism.
4. Use the pathway list as a filter to prune the three-level KO
reference hierarchy. The hsa-specific KO hierarchy is now set up.
5. Reconstruct enzyme networks for all the pathways listed in the
hsa-specific KO hierarchy.
6. Take the union of the reconstructed enzyme networks to form a whole
network. This is the enzyme network of hsa01100.
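Steps 4 and 6 reduce to a filter and a set union. The sketch below assumes the reference hierarchy and the organism pathway list are already in memory and use comparable identifiers; the function names, the dictionary layout, and the sample identifiers are all hypothetical, and dropping a subprocess that keeps no pathway is a design choice made here for illustration.

```python
def prune_hierarchy(reference, organism_pathways):
    """Keep only the pathways present in the organism (step 4)."""
    pruned = {}
    for subprocess, pathways in reference.items():
        kept = [p for p in pathways if p in organism_pathways]
        if kept:  # assumption: a subprocess with no pathway left is dropped
            pruned[subprocess] = kept
    return pruned


def union_networks(networks):
    """Union of (nodes, arcs) pairs into one network (step 6)."""
    all_nodes, all_arcs = set(), set()
    for nodes, arcs in networks:
        all_nodes |= set(nodes)
        all_arcs |= set(arcs)
    return all_nodes, all_arcs


# Made-up sample data, for illustration only.
reference = {"subprocess1": ["p1", "p2"], "subprocess2": ["p3"]}
organism_pathways = {"p1", "p2"}
pathway_networks = {
    "p1": (["E1", "E2"], [("E1", "E2")]),
    "p2": (["E2", "E3"], [("E2", "E3")]),
}

hierarchy = prune_hierarchy(reference, organism_pathways)
selected = [pathway_networks[p] for ps in hierarchy.values() for p in ps]
nodes, arcs = union_networks(selected)
print(hierarchy)
print(sorted(nodes), sorted(arcs))
```

Enzymes shared by several pathways (E2 in the sample) appear once in the union and connect the pathway-level subnetworks.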
3.2. Pathway Network Reconstruction
KEGG/PATHWAY contains not only the molecular interaction
data within the pathway, but also the relation data among pathways.
Figure 5 shows an example of how to extract the pathway relation
data from the pathway data file.
Fig. 5. Part of pathway data in hsa00010.xml. The highlighted and italic parts are used in the pathway relation extraction.
Fig. 6. Extraction of pathway relations. The dashed rectangle represents the pathway
hsa00010. The dash-dotted rectangle represents the pathway hsa00030. Gene hsa:2821
is recorded as entry id 31 in the pathway data file hsa00010.xml and as entry
id 4 in hsa00030.xml. This gene and its related enzymes and reactions are
common to the two pathways and are placed in the overlap of the two
rectangles. This shows how the relation between pathways hsa00010 and hsa00030 is
formed as a maplink.
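The idea in Fig. 6 can be sketched with Python's standard XML parser. This is only an illustration under stated assumptions: the two snippets below are hand-written, heavily abbreviated stand-ins for the pathway data files, only the entry ids of gene hsa:2821 are taken from the caption, the other gene names are placeholders, and the helper names gene_entries() and pathway_relation() are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written KGML-like snippets. Only the entry ids for gene
# hsa:2821 come from the Fig. 6 caption; everything else is a placeholder.
HSA00010 = """<pathway name="path:hsa00010" org="hsa" number="00010">
  <entry id="31" name="hsa:2821" type="gene"/>
  <entry id="50" name="hsa:PLACEHOLDER1 hsa:PLACEHOLDER2" type="gene"/>
</pathway>"""
HSA00030 = """<pathway name="path:hsa00030" org="hsa" number="00030">
  <entry id="4" name="hsa:2821" type="gene"/>
  <entry id="9" name="hsa:PLACEHOLDER3" type="gene"/>
</pathway>"""


def gene_entries(kgml_text):
    """Return (pathway name, {gene id: entry id}) for one pathway file."""
    root = ET.fromstring(kgml_text)
    genes = {}
    for entry in root.iter("entry"):
        if entry.get("type") == "gene":
            # One entry may list several genes separated by spaces.
            for gene in entry.get("name", "").split():
                genes[gene] = entry.get("id")
    return root.get("name"), genes


def pathway_relation(kgml_a, kgml_b):
    """Relate two pathways through the genes that both files record."""
    name_a, genes_a = gene_entries(kgml_a)
    name_b, genes_b = gene_entries(kgml_b)
    shared = {g: (genes_a[g], genes_b[g]) for g in genes_a.keys() & genes_b.keys()}
    return name_a, name_b, shared


print(pathway_relation(HSA00010, HSA00030))
```

A non-empty set of shared genes is read here as evidence of a relation between the two pathways; how the chapter's own tool records that relation is not shown in this excerpt.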
3.3. Local Database Design
A local relational database helps manage the downloaded data. It
not only helps keep the data accurate and speeds up the process of
network reconstruction, but also allows you to reuse the data for
other purposes.
Generally speaking, reference data such as the KO reference
hierarchy do not change frequently. Neither do the metabolic data
of organisms whose genomes were sequenced very early. However, for
organisms whose genomes are incompletely sequenced or were completely
annotated only recently, the metabolic data are updated frequently.
In this case, assigning a different TTL (time-to-live) to each type
of data reduces the network overhead and data retrieval time caused
by frequent remote access to KEGG. For example, it would be helpful
to set the renewal period of the KO reference hierarchy data to
30 days while that of the organism-specific
Fig. 7. Sequence diagram of pathway data retrieval. The CachingMetabolicDataDaoWrapper delegates the request to the
other DAOs to get the metabolic data, caching data along the way. The JdbcMatabolicDataDao gets the cached
metabolic data from the local database. The cached data carry an expiry date so that they are refreshed from time to time.
The KeggMetabolicDataDao retrieves the latest metabolic data by calling the KEGG SOAP Web service.
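The caching pattern described in Fig. 7 can be sketched as follows. This is a simplified Python illustration, not the classes named in the caption: CachingDao is a hypothetical name, an in-memory dictionary stands in for the local database, fake_remote() stands in for the remote KEGG call, and the one-day period for organism-specific data is an arbitrary placeholder (only the 30-day period for the KO reference hierarchy is taken from the text).

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


class CachingDao:
    """Serve data from a local store until its time-to-live expires."""

    def __init__(self, remote_fetch, ttl_seconds, clock=time.time):
        self._remote_fetch = remote_fetch  # callable(kind, key) -> value
        self._ttl = ttl_seconds            # {kind: lifetime in seconds}
        self._clock = clock                # injectable for testing
        self._store = {}                   # (kind, key) -> (value, expiry)

    def get(self, kind, key):
        now = self._clock()
        cached = self._store.get((kind, key))
        if cached is not None and cached[1] > now:
            return cached[0]               # still fresh: no remote access
        value = self._remote_fetch(kind, key)
        self._store[(kind, key)] = (value, now + self._ttl[kind])
        return value


remote_calls = []


def fake_remote(kind, key):
    """Stand-in for the remote service; records each call it receives."""
    remote_calls.append((kind, key))
    return f"{kind}/{key} (fetch #{len(remote_calls)})"


dao = CachingDao(
    fake_remote,
    {
        "ko_reference": 30 * SECONDS_PER_DAY,  # period given in the text
        "organism": 1 * SECONDS_PER_DAY,       # placeholder value
    },
)
dao.get("ko_reference", "hierarchy")
dao.get("ko_reference", "hierarchy")  # second request is served from the cache
print(len(remote_calls), "remote call(s)")
```

Keeping the lifetime per data type in one table makes it easy to give slowly changing reference data a long period and recently annotated organisms a short one.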
4. Applications of the Reconstructed Metabolic Networks
Reconstruction of different types of metabolic networks at different
levels paves the way for network-based research on metabolism.
Most of this research focuses on, but is not limited to, the exploration
10 Computational Reconstruction of Metabolic Networks from KEGG 245
4.1. Network Alignment
Network alignment helps locate the differences between metabolic
networks. Differences between an abnormal pathway in a certain
disease state and the healthy one indicate potential drug targets.
Differences between the same pathway in different organisms reveal
a possible evolutionary trail. In this section, we give a simple
example of aligning the reconstructed enzyme networks of
glycolysis/gluconeogenesis of H. sapiens and A. aeolicus (hsa00010
and aae00010) to show the evolutionary difference in
glycolysis/gluconeogenesis from bacteria to eukaryotes.
The reconstructed enzyme network of hsa00010 consists of
27 nodes and 62 arcs, while the reconstructed enzyme network of
aae00010 contains only 13 nodes and 12 arcs. The alignment result,
displayed using Cytoscape, is shown in Fig. 8. In the integrated
enzyme network, 15 nodes are hsa-specific, accounting for 53.6% of
all nodes. This figure is much lower than 82.5%, the proportion of
hsa-specific arcs in the integrated network. Since, to some extent,
the number of nodes determines the network size while the number of
arcs measures the network complexity, the alignment result shows
that both network size and network complexity increase as
glycolysis/gluconeogenesis moves from simple to advanced life forms.
Moreover, the increase in network complexity is more noticeable.
This possibly implies that, under evolutionary pressure, life forms
would rather make use of their own available resources than
introduce new components. This is reasonable, since adaptation
always consumes less energy than generation.
However, please keep in mind that this is no more than an
example of network alignment and does not yet have statistical
significance. More statistical studies are needed to confirm
this conclusion.
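The bookkeeping behind such an alignment can be sketched with set operations. The sketch below is an assumption-laden illustration, not the procedure used to produce Fig. 8: the functions align() and share_specific() are hypothetical, the node labels are synthetic, and only the set sizes (27 and 13 nodes) are taken from the text. The overlap of 12 shared nodes is inferred from the quoted figures (15 hsa-specific nodes being 53.6% implies 28 nodes in total) and is not stated in the chapter.

```python
def align(nodes_a, arcs_a, nodes_b, arcs_b):
    """Split nodes and arcs into network-specific and shared parts."""

    def split(a, b):
        return {"a_only": a - b, "b_only": b - a, "shared": a & b}

    return split(set(nodes_a), set(nodes_b)), split(set(arcs_a), set(arcs_b))


def share_specific(parts, which="a_only"):
    """Fraction of the integrated network that belongs to one category."""
    total = sum(len(members) for members in parts.values())
    return len(parts[which]) / total if total else 0.0


# Synthetic labels sized to match the node counts quoted in the text.
hsa_nodes = {f"E{i}" for i in range(27)}
aae_nodes = {f"E{i}" for i in range(12)} | {"X0"}
# Toy arcs only; the real arc lists are not reproduced in the chapter.
hsa_arcs = {("E0", "E1"), ("E1", "E2"), ("E2", "E3")}
aae_arcs = {("E0", "E1")}

node_parts, arc_parts = align(hsa_nodes, hsa_arcs, aae_nodes, aae_arcs)
print(len(node_parts["a_only"]), "of",
      sum(len(m) for m in node_parts.values()), "nodes are specific to network a")
print(f"{100 * share_specific(node_parts):.1f}% of nodes")
```

Arcs are compared as ordered pairs, so a relation counts as shared only when both networks contain it in the same direction.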
4.2. Network Topology Analysis
Most biological networks are complex networks. Properties of
complex networks such as scale-freeness, small-worldness, and
modularity reflect, from different angles, how robust the network
is. In complex network theory, the degree distribution (DD),
clustering coefficient (CC), and average shortest path length (ASL)
are often used to study the scale-free, modular, and small-world
properties, respectively (28).
Jeong et al. reconstructed and analyzed the metabolite networks of
43 model organisms and concluded that metabolite networks are
scale-free, small-world, and modular (29). However, their study
does not show whether this conclusion holds for enzyme networks.
To answer this question, we reconstruct the entire enzyme
network for H. sapiens, hsa01100, from KEGG (release 50.0), and
regard this network as an example to study the topological
Fig. 8. Alignment of enzyme networks hsa00010 and aae00010. In this figure, the round nodes represent hsa00010-specific
enzymes, the rectangle node represents the aae00010-specific enzymes, and the triangle nodes represent the enzymes
shared by the two pathways. These nodes are also distinguished by different gray levels. The solid lines represent the
hsa00010-specific enzyme relations, the sine-wave lines represent the aae00010-specific enzyme relations, and the dashed
lines represent the enzyme relations that the two networks share.
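The three topological measures named in Section 4.2 (degree distribution, clustering coefficient, and average shortest path length) can be computed with a few lines of plain Python. The sketch below is illustrative only: it treats the network as undirected, which is a simplifying assumption because the reconstructed enzyme networks have directed arcs, it averages path lengths over reachable pairs only, and the four-node graph is a toy example rather than any KEGG network.

```python
from collections import Counter, deque


def build_adjacency(edges):
    """Undirected adjacency sets; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj):
    """Map each degree k to the fraction of nodes having that degree."""
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for u in neighbours for v in neighbours
                    if u < v and v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean breadth-first-search distance over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
toy = build_adjacency([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
print(degree_distribution(toy))
print(average_clustering(toy))
print(average_shortest_path_length(toy))
```

For a real analysis, the shape of the degree distribution (for example, whether it follows a power law) would still have to be assessed separately; the functions above only produce the raw quantities.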
5. Notes
References
1. Papp B, Pal C, Hurst LD (2004) Metabolic network analysis of the causes and evolution of enzyme dispensability in yeast. Nature 429:661-664
2. Zhou T, Chan K, Wang Z (2008) TopEVM: using co-occurrence and topology patterns of enzymes in metabolic networks to construct phylogenetic trees. LNCS (LNBI) 5265:225-236
3. Becker SA, Palsson BO (2005) Genome-scale reconstruction of the metabolic network in Staphylococcus aureus N315: an initial draft to the two-dimensional annotation. BMC Microbiol 5:8
4. Borenstein E, Feldman MW (2009) Topological signatures of species interactions in metabolic networks. J Comput Biol 16:191-200
5. Chou CH, Chang WC, Chiu CM et al (2009) FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Res. doi:10.1093/nar/gkp264
6. Sridhar P, Kahveci T, Ranka S (2007) An iterative algorithm for metabolic network-based drug target identification. Pac Symp Biocomput 12:88-99
7. Kell DB (2006) Systems biology, metabolic modelling and metabolomics in drug discovery and development. Drug Discov Today 11:1085-1092
8. Kanehisa M, Goto S, Furumichi M et al (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38:D355-D360
9. Kanehisa M, Goto S, Hattori M et al (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34:D354-D357
10. Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28:27-30
11. Caspi R, Karp PD (2007) Using the MetaCyc pathway database and the BioCyc database collection. Curr Protoc Bioinform 20:151
12. Maltsev N, Glass E, Sulakhe D et al (2006) PUMA2-grid-based high-throughput analysis of genomes and metabolic pathways. Nucleic Acids Res 34:D369-D372
13. Ma H, Zeng A (2003) Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19:270
14. Zhao J, Ding G-H, Tao L et al (2007) Modular co-evolution of metabolic networks. BMC Bioinform 8:311
15. Ma H, Sorokin A, Mazein A et al (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Mol Syst Biol 3:135
16. Klukas C, Schreiber F (2007) Dynamic exploration and editing of KEGG pathway diagrams. Bioinformatics 23:344-350
17. Zhang JD, Wiemann S (2009) KEGGgraph: a graph approach to KEGG PATHWAY in R and bioconductor. Bioinformatics 25:1470-1471
18. Antonov AV, Dietmann S, Mewes HW (2008) KEGG spider: interpretation of genomics data in the context of the global gene metabolic network. Genome Biol 9:R179
19. Batagelj V, Mrvar A (2002) Pajek - analysis and visualization of large networks. In: Mutzel P, Jünger M, Leipert S (eds) Graph drawing. Springer, Berlin, pp 8-11
20. Cline MS, Smoot M, Cerami E et al (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2:2366-2382
21. Faust K, Croes D, van Helden J (2009) Metabolic pathfinding using RPAIR annotation. J Mol Biol 388:390-414
22. Yamanishi Y, Vert JP, Kanehisa M (2005) Supervised enzyme network inference from the integration of genomic data and chemical information. Bioinformatics 21(Suppl 1):i468-i477
23. Verkhedkar KD, Raman K, Chandra NR et al (2007) Metabolome based reaction graphs of M. tuberculosis and M. leprae: a comparative network analysis. PLoS One 2:e881
24. Goto S, Nishioka T, Kanehisa M (1998) LIGAND: chemical database for enzyme reactions. Bioinformatics 14:591-599
25. Ogata H, Goto S, Fujibuchi W et al (1998) Computation with the KEGG pathway database. Biosystems 47:119-128
26. Kanehisa M, Araki M, Goto S et al (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36:D480-D484
27. Palsson B (2006) Systems biology: properties of reconstructed networks. Cambridge University Press, New York, NY
28. Albert R, Barabasi A (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47-97
29. Jeong H, Tombor B, Albert R et al (2000) The large-scale organization of metabolic networks. Nature 407:651-654
30. Zhou T, Yung K, Chan C et al (2010) MetaGen: a promising tool for modeling metabolic networks from KEGG. Prog Biochem Biophys 37:63-68
Part III
Biomarkers
Chapter 11
Biomarkers
Harmony Larson, Elena Chan, Sucha Sudarsanam,
and Dale E. Johnson
Abstract
Biomarkers are characteristics objectively measured and evaluated as indicators of: normal biologic pro-
cesses, pathogenic processes, or pharmacologic response(s) to a therapeutic intervention. In environmental
research and risk assessment, biomarkers are frequently referred to as indicators of human or environmental
hazards. Discovering and implementing new biomarkers for toxicity caused by exposure to a chemical either
from a therapeutic intervention or accidentally through the environment continues to be pursued through
the use of animal models to predict potential human effects, from human studies (clinical or epidemiologic)
or biobanked human samples, or the combination of all such approaches. The key to discovering or
inferring biomarkers through computational means involves the identification or prediction of the molecu-
lar target(s) of the chemical(s) and the association of these targets with perturbed biological pathways. Two
examples are given in this chapter: (1) inferring potential human biomarkers from animal toxicogenomics
data, and (2) the identification of protein targets through computational means and associating these in one
example with potential drug interactions and in another case with increasing the risk of developing certain
human diseases.
Key words: Biomarkers, Adverse drug reactions, Pharmacogenomics, Systems biology, Meta-analysis,
Gene ontology, Biological pathways
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_11, © Springer Science+Business Media, LLC 2013
2. Materials
2.1. Animal Study-Derived Biomarkers and Animal/Human ADME Genes
Specific initiatives for risk identification using either animal
toxicology study data or advanced screening techniques include the
Predictive Safety Testing Consortium (PSTC): a unique public-private
partnership, led by the nonprofit Critical Path Institute (C-Path)
(17), which brings together pharmaceutical companies to share and
validate each other's safety testing methods. The PSTC is advised by
the United States FDA and its European counterpart, the European
Medicines Agency (EMA). The corporate members of the consortium
share internally developed preclinical safety biomarkers in five
workgroups: carcinogenicity, kidney, liver, muscle, and vascular
injury. Participants are attempting to link biomarkers with ADRs
in vivo, starting by establishing biomarkers in animal models and
linking those to effects in clinical trials. The PSTC identified
seven urinary protein biomarkers useful in detecting drug-induced
kidney injury in preclinical rat toxicology studies in addition to
the long-established markers, serum creatinine and Blood
Urea-Nitrogen (BUN). These new biomarkers include KIM-1, Albumin,
Total Protein, β2-microglobulin, Cystatin C, Clusterin, and Trefoil
Factor-3. As of June 2010, the biomarkers have been validated and
qualified by all ICH regulatory agencies and are being used
extensively in preclinical studies.
PharmaADME (18) is an initiative of scientists from academia,
the pharmaceutical industry, and the genomic technology industry.
They have identified key genes and variants involved in the
absorption, distribution, metabolism, and excretion (ADME) of
medications. The purpose of this initiative is to stimulate the production
2.2. Human Samples for ADR Research
ADRs are being studied prospectively using pharmacogenomics via
partnerships and networks of scientists in healthcare systems, academia,
agencies. Several networks have been established to study specific
ADRs, particularly as they apply to certain specific patient popula-
tions. These include the Canadian Genotype-specific Approaches to
Therapy in Childhood Program (GATC) focused on reducing
ADRs in children (19). The network includes ten major Canadian
pediatric health centers that monitor and report the occurrence of
ADRs in children. The GATC network collects DNA and plasma
samples for its genetic discovery studies. The initiative also includes
a surveillance program which collects reports from over 2,300
pediatricians. The GATC project has a mission to influence pediat-
ric medical practice on a global scale.
The United States Drug Induced Liver Injury Network
(DILIN) (20) was established with the aim of discovering underly-
ing causes of drug-induced liver disease. The endeavor is sponsored
by the National Institute of Diabetes and Digestive and Kidney
Diseases of the US National Institutes of Health. The overall goal
of the program is to discover why some individuals develop hepa-
totoxicity and others do not even when on the same drugs and
regimens. A registry of people experiencing liver injury from one of
four drugs since 1994 has been established. A retrospective study is
monitoring patients with hepatotoxicity from a specific class of
drugs, and a prospective study is ongoing and follows patients
who have recently experienced adverse liver reactions to any drug
or herbal medicine. The challenge in this endeavor is to define the
cause and effect link of hepatotoxicity to a specific drug particularly
within a patient context that can include alcoholism, hepatitis, and
other conditions that lead to liver toxicity.
The primary objective of the European collaboration to establish
a case-control DNA collection for studying the genetic basis of
adverse drug reactions (EUDRAGENE) is to advance the under-
standing of the basis of adverse drug reactions, which hopefully will
lead to the development of tests for predicting individual suscepti-
bility to ADRs. Several studies have been established and sample
banks created. The network has 11 participating centers in Europe
and Canada and initially studied myopathy from cholesterol-
lowering drugs, agranulocytosis from several different drugs, ten-
donitis and tendon rupture from fluoroquinolone antibiotics, long
QT syndrome caused by several classes of drugs, liver injury caused
3. Methods
3.2. Inferring Human Endpoints from Animal Data
Toxicogenomics studies are typically conducted in rodents, and
methodology as outlined above can be used to infer pathway
mechanisms and further functional insights. But translating
inferences from animal data to mechanism of action in humans remains
a difficult problem (6). The availability of complete or nearly
complete genomes of human, rodents, and dog (the relevant safety
species in the literature, especially for drugs) enables the
assignment of gene orthology, which offers one way to transfer
functional information between species. Orthology assignment is
not trivial because paralogs may split before or after speciation.
Remm and Sonnhammer (41) have proposed an algorithm that
takes into account the so-called in-paralogs (paralogs that arose
after the species split) and out-paralogs (paralogs that arose before
the species split). Because of the high degree of sequence
conservation, orthology assignment may be straightforward in some
cases; however, an orthology assignment does not imply that the gene
functions identically in the two species, and in some cases no
rodent ortholog of a human gene exists. For example, IL8, which
is one of the major mediators of inflammatory response in humans,
has no known ortholog in rodents.
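A common simplification of orthology assignment is the reciprocal best hit rule, sketched below in Python. This is not the algorithm of Remm and Sonnhammer: it ignores in-paralogs and out-paralogs entirely and is shown only to make the idea of transferring gene identities between species concrete. The gene names (other than IL8, used as the text's example of a gene without a rodent ortholog) and all similarity scores are invented placeholders.

```python
def best_hits(scores):
    """scores: {(query, target): similarity} -> {query: best-scoring target}."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pair genes that are each other's best hit in both search directions."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if backward.get(b) == a}


def map_orthologs(genes, orthologs):
    """Translate a gene list; genes without an ortholog map to None."""
    return {gene: orthologs.get(gene) for gene in genes}


# Invented similarity scores between placeholder human and rat genes.
human_vs_rat = {
    ("hGeneA", "rGeneA"): 950, ("hGeneA", "rGeneA2"): 700,
    ("hGeneB", "rGeneB"): 500, ("hGeneC", "rGeneB"): 480,
}
rat_vs_human = {
    ("rGeneA", "hGeneA"): 950, ("rGeneA2", "hGeneA"): 700,
    ("rGeneB", "hGeneB"): 500,
}

orthologs = reciprocal_best_hits(human_vs_rat, rat_vs_human)
print(map_orthologs(["hGeneA", "hGeneC", "IL8"], orthologs))
```

Genes that map to None, like IL8 in this toy run, are exactly the cases where an animal finding cannot be transferred by orthology alone and a pathway-level comparison may be needed.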
Focusing on conserved pathways between species is another
approach to inferring human functional significance from animal
data. For example, experimental evidence indicates that innate
immunity is conserved across nematodes, arthropods, and
vertebrates (Kim and Ausubel (42)), which results in conserved
pathways. It is reasonable to infer that conserved pathways imply
conserved functions across species. Kelley et al. (43) outline a
network alignment strategy to detect pathways conserved between
yeast and bacteria. In a toxicogenomics context, this
could be accomplished as follows. First, a list of genes is converted
into a network of interacting genes using the pathway or literature
network approach described above for human and the other species
Table 1
Genes identified in the study along with putative human
and mouse orthologs
4. Examples
Fig. 1. (a) Literature evidence for interaction between rat genes in Table 1. Two genes are linked only if at least one paper
provides evidence for interaction between the two genes. (b) Literature evidence for interaction between the mouse
orthologs and (c) between the human orthologs.
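The linking rule in the caption (an edge exists if at least one paper supports it) and the comparison across species can be sketched as follows. This is a hedged illustration, not the tool used to draw Fig. 1: the functions literature_network() and conserved_edges() are hypothetical, and the gene names, paper identifiers, and ortholog table are placeholders.

```python
def literature_network(evidence):
    """evidence: iterable of (gene_a, gene_b, paper_id).

    Returns {edge: set of supporting papers}, where an edge is an unordered
    gene pair; a pair appears only if at least one paper supports it.
    """
    network = {}
    for gene_a, gene_b, paper in evidence:
        if gene_a == gene_b:
            continue
        network.setdefault(frozenset((gene_a, gene_b)), set()).add(paper)
    return network


def conserved_edges(net_a, net_b, orthologs):
    """Edges of net_a whose ortholog-mapped endpoints are also linked in net_b."""
    conserved = set()
    for edge in net_a:
        mapped = frozenset(orthologs[g] for g in edge if g in orthologs)
        if len(mapped) == 2 and mapped in net_b:
            conserved.add((edge, mapped))
    return conserved


# Placeholder evidence records and ortholog table.
rat_net = literature_network([
    ("r1", "r2", "paper_1"), ("r1", "r2", "paper_3"), ("r2", "r3", "paper_2"),
])
human_net = literature_network([
    ("h1", "h2", "paper_4"), ("h3", "h4", "paper_5"),
])
rat_to_human = {"r1": "h1", "r2": "h2", "r3": "h3"}

for rat_edge, human_edge in conserved_edges(rat_net, human_net, rat_to_human):
    print(sorted(rat_edge), "->", sorted(human_edge))
```

Keeping the set of supporting papers per edge, rather than a bare yes/no link, makes it possible to rank edges later by the amount of evidence behind them.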
4.1. More Complex Problem
Traditional Chinese Medicine (TCM) often uses teas and other
recipe formulations of organic and mineral ingredients to treat
various maladies. Patients often combine TCM preparations with
Western therapeutics, frequently without a clear understanding of
the risks associated with potential drug-phytochemical
interactions. In this example, several freely available databases
(some requiring registration) and a commercial software package
(Genego) were used to identify phytochemical ingredients of herbal
preparations and to associate/predict molecular targets of each
phytochemical. The function of each target was assessed and
compared to those of Western medications used in the same treatment
regimen to predict potential drug interactions. An example of such
a recipe is the TCM formulation Kang Ai Pian, used to treat
cervical, ovarian, breast, nasopharyngeal, lung, liver, and
gastrointestinal cancer. To identify the molecular targets of Kang
Ai Pian (KAP) (49), the recipe's primary ingredients were first
identified as chen pi (Citrus reticulata; Citri reticulatae
Pericarpium), huang bo (Phellodendron amurense; Phellodendri
Cortex), huang lian (Coptis chinensis, Coptis deltoidea and Coptis
teeta; Coptidis Rhizoma), huang qin (Scutellaria baicalensis;
Scutellariae Radix), hu po (amber; Succinum), niu huang (Bos taurus
domesticus; Bovis Calculus), and san qi (Panax notoginseng;
Notoginseng Radix). Next, each ingredient's phytochemicals were
identified using a combination of the Comprehensive Herbal Medicine
Information System for Cancer (CHMIS-C) from the University of
Michigan, the National University of Singapore's TCM-ID database,
and Bensky's Materia Medica. Approximately 201 compounds and 12
minerals were found within Kang Ai Pian's ingredients altogether.
Subsequently, structural data files (SDFs) of these 213 chemicals
were generated using PubChem and uploaded into GeneGo MetaDrug (a
compound-based, Web-based pathway analysis software), where they
were cross-referenced against the software's compound database for
similar chemicals using a Tanimoto coefficient similarity filter of
95-100%. Targets of KAP chemicals with 95% similarity or greater
were identified via GeneGo's literature meta-searching and/or
predicted using GeneGo's QSAR modeling. Additionally, the GeneCards
and UniProtKB databases were used to confirm the findings.
Approximately 568 unique molecular targets were found for KAP using
this method. Using these findings, detailed lists of potential
synergies and antagonisms within a recipe can be elucidated and
verified by literature searching. In addition, phytochemical-drug
interactions can be proposed, if not previously known, based on
known or predicted effects (induction or inhibition) of key
metabolic enzymes and transporters. In an example of synergy, the
KAP phytochemicals quercetin and norwogonin activate CASP3, as do
the Western drugs cisplatin and 5-fluorouracil. For antagonism,
quercetin inhibits CYP2C9 and CYP3A4, whereas cyclophosphamide
activates CYP2C9 and paclitaxel activates CYP3A4. Chan et al.
describe 15 Western oncology drugs with potential interactions with
12 phytochemicals in the KAP recipe.
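The Tanimoto similarity filter mentioned above can be illustrated with a short Python sketch. The Tanimoto coefficient of two fingerprints is the size of their intersection divided by the size of their union. How the commercial software derives its fingerprints is not described in the text, so the fingerprints below are arbitrary sets of integers standing in for structural features, and the compound names and the function similar_compounds() are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_compounds(query_fp, database, threshold=0.95):
    """Return (name, similarity) for database entries at or above the threshold."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in database.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Placeholder fingerprints: sets of integers standing in for feature bits.
query = set(range(20))
database = {
    "cmpd_same": set(range(20)),          # identical features
    "cmpd_close": set(range(20)) | {99},  # one extra feature: 20/21
    "cmpd_far": set(range(10)),           # half of the features: 10/20
}

for name, score in similar_compounds(query, database):
    print(f"{name}: {score:.3f}")
```

With a threshold of 0.95, only near-identical structures pass, which is consistent with using the filter to borrow known targets from database compounds rather than to explore distant analogs.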
A similar in silico method was used to identify and prioritize
environmental compounds of concern that may increase the risk of
human breast cancer. The California Breast Cancer & Chemicals
Policy Project (3) first evaluated several disease-specific endpoints
that could be evaluated by data inquiry for an initial dataset of
216 compounds associated with mammary tumors in animals.
The first step in this example was to find or predict key molecular
targets and activating pathways for each chemical, which were
deduced or predicted from Genego software and pertinent pub-
lished information (50, 51). Metabolic activation of each com-
pound was assessed based on enzymes known to be expressed in
breast tissue (52). Known targets (genes) associated with breast
cancer were developed from data in Genego, and could also be
assessed using the CTD (25). Eleven compounds were predicted to
be modulators of human breast cancer-related targets/pathways
and about 25 compounds were predicted to activate key metabolic
enzymes known to play an important role in carcinogen metabo-
lism. Several compounds were shown to activate NR1I2/PXR
(Pregnane X receptor), PXR-related transport, and/or a network
of PXR upstream and downstream effectors.
5. Notes
References
1. Johnson DE, Smith DA, Park BK (2007) Biomarkers: the pièce de resistance of innovative medicines. Curr Opin Drug Discov Devel 10:22-24
2. Daly AK (2007) Individualized drug therapy. Curr Opin Drug Discov Devel 10:29-36
3. Walker DB (2006) Serum chemical biomarkers of cardiac injury for nonclinical safety testing. Toxicol Pathol 34:94-104
4. Druker BJ, Lydon NB (2000) Lessons learned from the development of an abl tyrosine kinase inhibitor for chronic myelogenous leukemia. J Clin Invest 105:3-7
5. Frank R, Hargreaves R (2003) Clinical biomarkers in drug discovery and development. Nat Rev Drug Discov 2:566-580
6. Johnson DE, Rodgers AD, Sudarsanam S (2006) Future of computational toxicology: broad application into human disease and therapeutics. John Wiley & Sons, Inc, Hoboken, NJ
7. Pepe MS, Etzioni R, Feng Z, Potter JD, Thompson ML, Thornquist M, Winget M, Yasui Y (2001) Phases of biomarker development for early detection of cancer. J Natl Cancer Inst 93:1054-1061
8. Johnson DE, Smith DA, Park BK (2009) Pharmacogenomics and adverse drug reactions; prospective screening for risk identification. Curr Opin Drug Discov Devel 12:27-30
9. Ingelman-Sundberg M (2008) Pharmacogenomic biomarkers for prediction of severe adverse drug reactions. N Engl J Med 358:637-639
10. Link E, Parish S, Armitage J, Bowman L, Heath S, Matsuda F, Gut I, Lathrop M, Collins R (2008) SLCO1B1 variants and statin-induced myopathy - a genomewide study. N Engl J Med 359:789-799
11. Office of the commissioner safety alerts for human medical products - Phenytoin (marketed as dilantin, phenytek and generics) and fosphenytoin sodium (marketed as Cerebyx and generics), http://www.fda.gov/Safety/MedWatch/SafetyInformation/SafetyAlertsforHumanMedicalProducts/ucm094919.htm
12. Chiao SK, Romero DL, Johnson DE (2009) Current HIV therapeutics: mechanistic and chemical determinants of toxicity. Curr Opin Drug Discov Devel 12:53-60
13. Bonnet E, Bernard J, Fauvel J, Massip P, Ruidavets J, Perret B (2008) Association of APOC3 polymorphisms with both dyslipidemia and lipoatrophy in HAART-receiving patients. AIDS Res Hum Retroviruses 24:169-171
14. Paik S, Shak S, Tang G, Kim C, Baker J, Cronin M, Baehner FL, Walker MG, Watson D, Park T, Hiller W, Fisher ER, Wickerham DL, Bryant J, Wolmark N (2004) A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 351:2817-2826
15. Wallach I, Jaitly N, Lilien R (2010) A structure-based approach for mapping adverse drug reactions to the perturbation of underlying biological pathways. PLoS One 5:e12063
16. Milletti F, Vulpetti A (2010) Predicting polypharmacology by binding site similarity: from kinases to the protein universe. J Chem Inf Model 50:1418-1431
17. Critical Path Institute, http://www.c-path.org/
18. www.pharmaadme.org - Home, http://pharmaadme.org/joomla/
19. Genome Canada, http://www.genomecanada.ca/
20. Drug Induced Liver Injury Network DILIN site, https://dilin.dcri.duke.edu/
21. The pharmacogenomics knowledge base [PharmGKB], http://pharmgkb.org/
22. Waters MD, Fostel JM (2004) Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet 5:936-948
23. Thompson Reuters Genego http://www.genego.com/
24. Ekins S (2006) Systems-ADME/Tox: resources and network approaches. J Pharmacol Toxicol Methods 53:38-66
25. Davis AP, King BL, Mockus S, Murphy CG, Saraceni-Richards C, Rosenstein M, Wiegers T, Mattingly CJ (2011) The comparative toxicogenomics database: update 2011. Nucleic Acids Res 39:D1067-D1072
26. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM (2004) Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 101:9309-9314
27. Mattingly CJ, Rosenstein MC, Davis AP, Colby GT, Forrest JN, Boyer JL (2006) The comparative toxicogenomics database: a cross-species resource for building chemical-gene interaction networks. Toxicol Sci 92:587-595
28. The gene ontology, http://www.geneontology.org/
29. GoMiner Home Page, http://discover.nci.nih.gov/gominer/index.jsp
30. Currie RA, Bombail V, Oliver JD, Moore DJ, Lim FL, Gwilliam V, Kimber I, Chipman K, Moggs JG, Orphanides G (2005) Gene ontology mapping as an unbiased method for identifying molecular pathways and processes affected by toxicant exposure: application to acute effects caused by the rodent non-genotoxic carcinogen diethylhexylphthalate. Toxicol Sci 86:453-469
31. Yu X, Griffith WC, Hanspers K, Dillman JF, Ong H, Vredevoogd MA, Faustman EM (2006) A system-based approach to interpret dose- and time-dependent microarray data: quantitative integration of gene ontology analysis for risk assessment. Toxicol Sci 92:560-577
32. Xirasagar S, Gustafson SF, Huang C, Pan Q, Fostel J, Boyer P, Merrick BA, Tomer KB, Chan DD, Yost KJ, Choi D, Xiao N, Stasiewicz S, Bushel P, Waters MD (2006) Chemical effects in biological systems (CEBS) object model for toxicology data, SysTox-OM: design and application. Bioinformatics (Oxford, England) 22:874-882
33. Mao X, Cai T, Olyarchuk JG, Wei L (2005) Automated genome annotation and pathway identification using the KEGG Orthology
Abstract
This chapter discusses the use of biomonitoring-based indicators of exposure to environmental pollutants in
environmental health information systems. Matrices for biomonitoring, organization and standardization of
surveillance programs, the use of intake and body burden data, and the interpretation of surveillance data are
discussed. The concept of environmental public health indicators is demonstrated using the Persistent
organic pollutants in human milk indicator implemented in the Environment and Health Information
System (ENHIS) of the WHO Regional Office for Europe. This indicator is based on the data from the
WHO-coordinated surveillance of persistent organic pollutants in human milk as well as data from selected
national studies. The WHO survey data demonstrate a steady decline in breast milk concentrations of dioxins
across Europe. The data from biomonitoring surveys in Sweden also show a steady decline of breast milk
concentrations of most persistent organic pollutants since the 1970s, with the exception of polybrominated
diphenyl ethers (PBDEs) which increased rapidly until the late 1990s and then started to decline after the
implementation of policy measures aiming at reducing exposures. The application of human biomonitoring
data in support of environmental public health policy actions requires carefully designed standardized and
sustainable surveillance, comprehensive interpretation of the data, and an effective communication strategy
based on credible information presented in the form of indicator factsheets.
Key words: Persistent organic pollutants, Dioxins, Polybrominated diphenyl ethers, Human milk,
Environment and Health Information System, Environmental public health indicators, World Health
Organization
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_12, © Springer Science+Business Media, LLC 2013
276 A.I. Egorov et al.
2. Materials and Methods
2.1. Matrices Used for Biomonitoring
The type of samples to be selected for biomonitoring depends on
many factors including bioaccumulation, metabolism and excretion
rates of the chemical, potential contamination, matrix interference,
required sample volume, and the ease of sampling. Biomonitoring
can provide a wealth of information on chemicals that are stored in
the body for a long period of time, such as Persistent Organic
Pollutants (POPs), lead, and cadmium. For chemicals that are
excreted rapidly, cross-sectional biomonitoring data reflect recent
exposure, while characterization of long-term exposure patterns
requires repetitive sampling.
2.1.1. Blood
Blood is the preferred matrix for many chemicals because it is in
contact and in dynamic equilibrium with all tissues (5-7). The main
disadvantage of blood is the invasive sampling procedure. Blood is a
common matrix for biomonitoring of water soluble pollutants,
such as metals. Blood lead level (BLL) is the primary biomarker
of lead exposure (8). Blood (serum) is also widely used for biomo-
nitoring of fat-soluble chemicals, such as persistent organic pollu-
tants (POPs). Since the fat content of serum varies, concentrations
of fat soluble organic compounds are typically expressed per gram
of fat. Serum samples are also used for biomonitoring of antibody
responses to pathogens, such as Human Immunodeficiency Virus
(HIV), Herpes Simplex Virus HSV-2 (9), cytomegalovirus (10),
papillomaviruses (11), Helicobacter pylori (12), and Toxoplasma
gondii (13), and common food allergens, such as peanuts (14, 15).
2.1.2. Urine
Urine is the second most common matrix for biomonitoring. Because
the composition of urine is variable, the results need to be adjusted for
the creatinine concentration. Another caveat is that the urine level of
a chemical or its metabolite does not directly reflect the body burden.
Thus, the application of toxicokinetics is necessary for interpreting
the data. Urine is suitable for biomonitoring of water-soluble
chemicals, such as metals (8), and is most useful for monitoring
rapidly metabolized organic chemicals, such as PAHs and phthalates
(5–7). Data on PAH metabolites in urine have been linked with the
biomarker of biological effect, serum Immunoglobulin E (IgE)
specific to indoor allergens, as well as allergy symptoms (16).
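The creatinine adjustment mentioned above can be illustrated with a minimal sketch. It assumes the common convention of dividing the analyte concentration in urine by the creatinine concentration of the same sample; the function name, units, and sample values are hypothetical and are not taken from the cited studies.

```python
def creatinine_adjusted(analyte_ug_per_l: float, creatinine_g_per_l: float) -> float:
    """Return the analyte level in micrograms per gram of creatinine.

    Assumption: simple ratio adjustment (analyte / creatinine) on one sample.
    """
    if creatinine_g_per_l <= 0:
        raise ValueError("creatinine concentration must be positive")
    return analyte_ug_per_l / creatinine_g_per_l


# Hypothetical spot samples from one person: (analyte in ug/L, creatinine in g/L).
# The dilute and the concentrated sample differ threefold before adjustment.
dilute = creatinine_adjusted(2.0, 0.5)
concentrated = creatinine_adjusted(6.0, 1.5)
print(dilute, concentrated)
```

In this illustration the two unadjusted concentrations differ only because of urine dilution, and the adjusted values coincide. The same ratio logic applies to expressing serum or milk results per gram of fat.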
2.1.3. Hair and Nails
Nails, especially toenails, which are less likely to be contaminated by
direct contact with environmental chemicals, are used for biomo-
nitoring of metals, such as mercury, lead, and cadmium (5, 8). Hair
is also a stable matrix that can be used in biomonitoring, most
commonly of metals. It is an especially reliable matrix for mercury.
Important issues for ensuring comparability of data are the length
of hair collected, position on the scalp, distance from the scalp, and
preparation of samples (5, 7).
2.1.4. Milk
Breast milk is commonly used as a matrix for monitoring of lipophilic
POPs (6). Emerging nonpersistent chemicals, such as phthalates, can
also be detected in milk. As the content of milk is not constant, the
results are usually adjusted to the concentration of lipids. Due to
depuration of POPs from the mother during lactation, parity and
sampling time after delivery are important for the interpretation of
results. The age of the mother is also an important factor, as concentrations
of some POPs, such as dioxins, increase with age. The
quantitation of analytically complex POPs, such as dioxins, requires
the use of gas chromatography-high resolution mass spectrometry
(GC-HRMS), which can detect femtogram or picogram quantities of
chemicals. This technique requires very strict laboratory conditions
and is expensive. A lower cost alternative, the Chemically Activated
Luciferase eXpression (CALUX) assay, allows the quantitation of the
total activity of dioxins and similar compounds acting as aryl hydrocarbon
receptor (AhR) agonists. It shows a strong correlation with
chemical analysis of milk samples (17).
2.1.5. Other Matrices
Many toxic elements are preferentially excreted from the body in
sweat (18). Difficulties with collecting sweat samples, however,
limit the application of this matrix in biomonitoring. Since 90% of
lead is accumulated in bones, the level of lead in bones or deciduous
teeth is a good marker of long-term exposure and toxic effects (8).
Lead in bones is measured using noninvasive methods, such as K-
line X-ray fluorescence (19). An important disadvantage of this
method is that survey participants need to come to a specialized
laboratory to be tested. Saliva offers many advantages as an easy-to-
12 Biomonitoring-based Environmental Public Health Indicators 279
3. Examples
Fig. 1. Communication and information dissemination in ENHIS: the indicator factsheet format. [Diagram of layers of information: policy-makers (contexts: policy in place, PH significance); health professionals and researchers (interpretation of data: assessment of EH situation); data (presentations such as graphs and charts, data tables, meta-data, i.e., data about data).]
3.2. ENHIS Indicator "Persistent Organic Pollutants in Human Milk"
The "Persistent organic pollutants in human milk" indicator is
based on data from the WHO survey of POPs in human milk
(Fig. 2) and selected data from national surveys in Sweden
(Fig. 3). It addresses exposure to an important class of pollutants
[Fig. 2. Bar chart of dioxin levels in human milk by country; vertical axis: pg TEQ/g fat; survey rounds 1988, 1993, 2002, and 2007. Caption not recovered.]
Fig. 3. POPs levels in human milk, Sweden, 1972–2007, with trend lines produced using two data points per moving
average. [Plotted series: DDE, DDT, PCBs, HCB, PCNs, PBDEs, and dioxins (PCDDs/PCDFs/PCBs, pg TEQ/g fat), each expressed per gram of fat.]
3.2.2. Dioxin Levels in Human Milk
Figure 2 presents country-level summary data on breast milk levels
of PCDDs, PCDFs and dioxin-like PCBs (hereafter called "dioxins"),
an important class of POPs that have similar toxicological
properties. These compounds act as Aryl-hydrocarbon Receptor
(AhR) agonists and elicit profound biochemical responses in verte-
brates and altered regulation of a large number of genes (48, 49).
Endocrine disrupting effects of dioxins are likely due to a
3.2.3. Levels of Polybrominated Diphenyl Ethers and Other POPs in Human Milk
Figure 3 presents (from the same ENHIS factsheet) the results of
several national surveys that involved monitoring of PCDDs/
PCDFs, PCBs, polybrominated diphenyl ethers (PBDEs),
polychlorinated naphthalenes (PCNs), hexachlorobenzene (HCB),
dichlorodiphenyltrichloroethane (DDT) and its breakdown product
dichlorodiphenyldichloroethylene (DDE) in human milk in Sweden
(refs. 45–47 and Malisch et al., unpublished data). These national data
were included in the indicator factsheet to provide a detailed illustration
of temporal trends of all major classes of POPs in pooled milk
samples. The graph shows that concentrations of most POPs have
been declining steadily since the 1970s. The previously published
statistical analysis of these data demonstrated that the declines followed
first-order kinetics. Breast milk concentrations of DDT, total
PCBs and dioxins (expressed in TEQ units) were decreasing by 50%
every 4, 14 and 15 years, respectively (46). It should be noted that the
Fig. 3 data on dioxins are not directly comparable with the WHO data
for Sweden in Fig. 2 due to different recruitment and sampling
procedures (41, 46).
One notable exception to the steady decline of POPs in Sweden
was a rapid exponential increase in PBDEs in pooled milk samples
until the late 1990s, with concentrations doubling every 5 years.
This was followed by a rapid reduction after a voluntary ban on the
use of the most toxic and bioaccumulating PBDE congeners.
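The halving and doubling times quoted for the Swedish milk data can be turned into rate constants and projected changes with a short calculation. The sketch below assumes pure first-order (exponential) behavior; the halving times of 4, 14, and 15 years and the 5-year doubling time come from the text, while the 20-year horizon and the function names are illustrative choices.

```python
import math


def rate_constant(half_life_years: float) -> float:
    """First-order rate constant k (per year) from a half-life: k = ln(2) / t_half."""
    return math.log(2) / half_life_years


def fraction_remaining(years: float, half_life_years: float) -> float:
    """Fraction of the starting concentration left after `years` of first-order decline."""
    return math.exp(-rate_constant(half_life_years) * years)


def fold_increase(years: float, doubling_time_years: float) -> float:
    """Fold change after `years` of exponential growth with the given doubling time."""
    return 2.0 ** (years / doubling_time_years)


# Halving times reported in the text for Swedish pooled milk samples.
HALVING_TIMES = {"DDT": 4.0, "total PCBs": 14.0, "dioxins (TEQ)": 15.0}

for name, t_half in HALVING_TIMES.items():
    left = fraction_remaining(20.0, t_half)
    print(f"{name}: k = {rate_constant(t_half):.3f} per year, "
          f"{left:.1%} of the starting level after 20 years")

# PBDEs before the late 1990s: doubling every 5 years.
print(f"PBDEs: {fold_increase(20.0, 5.0):.0f}-fold increase over 20 years")
```

The contrast between a 4-year and a 15-year halving time is easier to appreciate in these terms: over the same period, DDT falls to a few percent of its starting level while dioxins remain at roughly 40%.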
PBDEs have the general formula C12H(9–0)Br(1–10)O, in which
the sum of H and Br atoms always equals ten. Congeners are
numbered in accordance with the International Union of Pure
and Applied Chemistry (IUPAC) system. PBDEs have been used
as flame retardants in plastics and upholstery since the 1970s.
These chemicals are mixed with the material but not chemically
linked to it. Therefore, they can migrate from the material. Three
commercial PBDE mixtures have been produced: pentabromodi-
phenyl ether (penta-BDE), octabromodiphenyl ether (octa-BDE),
and decabromodiphenyl ether (deca-BDE consisting primarily of
BDE-209). Most of the penta-BDE and almost half of octa- and
deca-BDEs produced in the world have been used within the USA.
As a result, body burdens in the USA are one to two orders of
magnitude greater than in western European countries (66, 67).
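The general PBDE formula given above (twelve carbons, one oxygen, and H plus Br summing to ten) can be made concrete by enumerating the homolog groups. This is a small sketch; the rounded standard atomic masses and the function names are assumptions added for illustration.

```python
# Rounded standard atomic masses in g/mol (assumed values for illustration).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "Br": 79.904, "O": 15.999}


def _check(n_br: int) -> None:
    if not 1 <= n_br <= 10:
        raise ValueError("a PBDE carries between 1 and 10 bromine atoms")


def pbde_formula(n_br: int) -> str:
    """Molecular formula of a PBDE homolog with n_br bromine atoms."""
    _check(n_br)

    def part(symbol: str, count: int) -> str:
        if count == 0:
            return ""
        return symbol if count == 1 else f"{symbol}{count}"

    # H and Br always sum to ten on the diphenyl ether skeleton.
    return "C12" + part("H", 10 - n_br) + part("Br", n_br) + "O"


def pbde_molar_mass(n_br: int) -> float:
    """Approximate molar mass (g/mol) of a PBDE homolog with n_br bromine atoms."""
    _check(n_br)
    return (12 * ATOMIC_MASS["C"] + (10 - n_br) * ATOMIC_MASS["H"]
            + n_br * ATOMIC_MASS["Br"] + ATOMIC_MASS["O"])


for n in (5, 8, 10):  # penta-, octa-, and deca-BDE homologs
    print(pbde_formula(n), round(pbde_molar_mass(n), 1))
```

The three homologs printed correspond to the bromination levels that give the commercial penta-, octa-, and deca-BDE mixtures their names.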
A major exposure route in adults is contaminated food,
especially fish, meat and poultry (47). Breastfed infants are especially
heavily exposed through breast milk due to lactational off-loading of
PBDEs (66). Inhalation of house dust is another important expo-
sure pathway, especially for higher brominated varieties (66).
4. Notes
References
1. Muñoz B, Albores A (2010) The role of molecular biology in the biomonitoring of human exposure to chemicals. Int J Mol Sci 11(11):4511–4525
2. Decordier I, Papine A, Vande Loock K et al (2011) Automated image analysis of micronuclei by IMSTAR for biomonitoring. Mutagenesis 26(1):163–168
3. Rossnerova A, Spatova M, Schunck C et al (2011) Automated scoring of lymphocyte micronuclei by the MetaSystems Metafer image cytometry system and its application in studies of human mutagen sensitivity and biodosimetry of genotoxin exposure. Mutagenesis 26(1):169–175
4. McGeehin MA, Qualters JR, Niskar AS (2004) National environmental public health tracking program: bridging the information gap. Environ Health Perspect 112(14):1409–1413
5. Esteban M, Castaño A (2009) Non-invasive matrices in human biomonitoring: a review. Environ Int 35(2):438–449
6. Paustenbach D, Galbraith D (2006) Biomonitoring and biomarkers: exposure assessment will never be the same. Environ Health Perspect 114(8):1143–1149
7. Clewell HJ, Tan YM, Campbell JL, Andersen ME (2008) Quantitative interpretation of human biomonitoring data. Toxicol Appl Pharmacol 231(1):122–133
8. Sanders T, Liu Y, Buchner V et al (2009) Neurotoxic effects and biomarkers of lead exposure: a review. Rev Environ Health 24(1):15–45
9. Xu F, Sternberg MR, Markowitz LE (2010) Men who have sex with men in the United States: demographic and behavioral characteristics and prevalence of HIV and HSV-2 infection: results from National Health and Nutrition Examination Survey 2001–2006. Sex Transm Dis 37(6):399–405
10. Bate SL, Dollard SC, Cannon MJ (2010) Cytomegalovirus seroprevalence in the United States: the national health and nutrition examination surveys, 1988–2004. Clin Infect Dis 50(11):1439–1447
11. Markowitz LE, Sternberg M, Dunne EF et al (2009) Seroprevalence of human papillomavirus types 6, 11, 16, and 18 in the United States: National Health and Nutrition Examination Survey 2003–2004. J Infect Dis 200(7):1059–1067
12. Brenner H, Berg G, Lappus N et al (1999) Alcohol consumption and Helicobacter pylori infection: results from the German National Health and Nutrition Survey. Epidemiology 10(3):214–218
13. Jones JL, Kruszon-Moran D, Sanders-Lewis K et al (2007) Toxoplasma gondii infection in the United States, 1999–2004, decline from the prior decade. Am J Trop Med Hyg 77(3):405–410
14. Visness CM, London SJ, Daniels JL et al (2009) Association of obesity with IgE levels and allergy symptoms in children and adolescents: results from the National Health and Nutrition Examination Survey 2005–2006. J Allergy Clin Immunol 123(5):1163–1169, 1169.e1-4
15. Branum AM, Lukacs SL (2009) Food allergy among children in the United States. Pediatrics 124(6):1549–1555
16. Miller RL, Garfinkel R, Lendor C et al (2010) Polycyclic aromatic hydrocarbon metabolite levels and pediatric allergy and asthma in an inner-city cohort. Pediatr Allergy Immunol 21(2 Pt 1):260–267
17. Hui LL, Hedley AJ, Nelson EA et al (2007) Agreement between breast milk dioxin levels by CALUX bioassay and chemical analysis in a population survey in Hong Kong. Chemosphere 69(8):1287–1294
18. Genuis SJ, Birkholz D, Rodushkin I et al (2011) Blood, urine, and sweat (BUS) study: monitoring and elimination of bioaccumulated toxic elements. Arch Environ Contam Toxicol 61(2):344–357
19. Hu H, Rabinowitz M, Smith D (1998) Bone lead as a biological marker in epidemiologic studies of chronic toxicity: conceptual paradigms. Environ Health Perspect 106(1):1–8
20. Morris-Cunnington MC, Edmunds WJ, Miller E et al (2004) A population-based seroprevalence study of hepatitis A virus using oral fluid in England and Wales. Am J Epidemiol 159(8):786–794
21. Griffin SM, Chen IM, Fout GS et al (2011) Development of a multiplex microsphere immunoassay for the quantitation of salivary antibody responses to selected waterborne pathogens. J Immunol Methods 364(1–2):83–93
22. Centers for Disease Control and Prevention (2009) Fourth National Report on Human Exposure to Environmental Chemicals. Atlanta, GA, USA
23. Bocca B, Mattei D, Pino A et al (2010) Italian network for human biomonitoring of metals: preliminary results from two regions. Ann Ist Super Sanità 46(3):259–265
24. Haines DA, Arbuckle TE, Lye E et al (2011) Reporting results of human biomonitoring of environmental chemicals to study participants: a comparison of approaches followed in two Canadian studies. J Epidemiol Community Health 65(3):191–198
25. Bonefeld-Jørgensen EC (2010) Biomonitoring in Greenland: human biomarkers of exposure and effects – a short review. Rural Remote Health 10(2):1362
26. Rappaport SM (2011) Implications of the exposome for exposure science. J Expo Sci Environ Epidemiol 21(1):5–9
27. Baccarelli A, Giacomini SM, Corbetta C et al (2008) Neonatal thyroid function in Seveso 25 years after maternal exposure to dioxin. PLoS Med 5(7):e161
28. Boogaard PJ, Hays SM, Aylward LL (2011) Human biomonitoring as a pragmatic tool to support health risk management of chemicals – Examples under the EU REACH programme. Regul Toxicol Pharmacol 59(1):125–132
29. Aylward LL, Lakind JS, Hays SM (2008) Derivation of biomonitoring equivalent (BE) values for 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD) and related compounds: a screening tool for interpretation of biomonitoring data in a risk assessment context. J Toxicol Environ Health A 71(22):1499–1508
30. Ruiz P, Fowler BA, Osterloh JD et al (2010) Physiologically based pharmacokinetic (PBPK) tool kit for environmental pollutants – metals. SAR QSAR Environ Res 21(7–8):603–618
31. Ritter R, Scheringer M, Macleod M et al (2011) Intrinsic human elimination half-lives of polychlorinated biphenyls derived from the temporal evolution of cross-sectional biomonitoring data from the UK population. Environ Health Perspect 119(2):225–231
32. Heinzl H, Mittlböck M, Edler L (2007) On the translation of uncertainty from toxicokinetic to toxicodynamic models – the TCDD example. Chemosphere 67(9):S365–S374
33. Aylward LL, Brunet RC, Carrier G et al (2005) Concentration-dependent TCDD elimination kinetics in humans: toxicokinetic modeling for moderately to highly exposed adults from Seveso, Italy, and Vienna, Austria, and impact on dose estimates for the NIOSH cohort. J Expo Anal Environ Epidemiol 15(1):51–65
34. Aylward LL, Brunet RC, Starr TB et al (2005) Exposure reconstruction for the TCDD-exposed NIOSH cohort using a concentration- and age-dependent model of elimination. Risk Anal 25(4):945–956
35. Cheng H, Aylward L, Beall C et al (2006) TCDD exposure-response analysis and risk assessment. Risk Anal 26(4):1059–1071
36. Corvalán C, Kjellström T, Smith KR (1999) Health, environment and sustainable development: identifying links and indicators to promote action. Epidemiology 10(5):656–660
37. Pond K, Kim R, Carroquino MJ et al (2007) Workgroup report: developing environmental health indicators for European children: World Health Organization Working Group. Environ Health Perspect 115(9):1376–1382
38. World Health Organization (2007) Children's health and the environment in Europe. A baseline assessment report. World Health Organization Regional Office for Europe, Copenhagen, http://www.euro.who.int/__data/assets/pdf_file/0009/96750/E90767.pdf. Accessed 04 Jul 2012
39. World Health Organization (2010) Health and environment in Europe: progress assessment. World Health Organization Regional Office for Europe, Copenhagen, http://www.euro.who.int/__data/assets/pdf_file/0010/96463/E93556.pdf. Accessed 04 Jul 2012
40. World Health Organization (2010) Parma declaration on environment and health. EUR/55934/5.1. World Health Organization Regional Office for Europe, Copenhagen, http://www.euro.who.int/en/home/conferences/fifth-ministerial-conference-on-
62. Gies A, Neumeier G, Rappolder M et al (2007) Risk assessment of dioxins and dioxin-like PCBs in food – Comments by the German Federal Environmental Agency. Chemosphere 67(9):S344–S349
63. Kiviranta H, Tuomisto JT, Tuomisto J et al (2005) Polychlorinated dibenzo-p-dioxins, dibenzofurans, and biphenyls in the general population in Finland. Chemosphere 60(7):854–869
64. Kiviranta H, Vartiainen T, Tuomisto J (2002) Polychlorinated dibenzo-p-dioxins, dibenzofurans, and biphenyls in fishermen in Finland. Environ Health Perspect 110(4):355–361
65. Dewey KG, Heinig MJ, Nommsen LA et al (1991) Adequacy of energy intake among breast-fed infants in the DARLING study: relationships to growth velocity, morbidity, and activity levels. Davis area research on lactation, infant nutrition and growth. J Pediatr 119(4):538–547
66. Birnbaum LS, Cohen Hubal EA (2006) Polybrominated diphenyl ethers: a case study for using biomonitoring data to address risk assessment questions. Environ Health Perspect 114(11):1770–1775
67. Schecter A, Pavuk M, Päpke O et al (2003) Polybrominated diphenyl ethers (PBDEs) in U.S. mothers' milk. Environ Health Perspect 111(14):1723–1729
68. Main KM, Kiviranta H, Virtanen HE et al (2007) Flame retardants in placenta and breast milk and cryptorchidism in newborn boys. Environ Health Perspect 115(10):1519–1526
69. US EPA (2008) Final integrated risk information system assessment for decabromodiphenyl ether (BDE-209). http://www.epa.gov/ncea/iris/subst/0035.htm. Accessed 10 Jan 2011
70. Directive 2003/11/EC of the European Parliament and of the Council of 6 February 2003 amending for the 24th time Council Directive 76/769/EEC relating to restrictions on the marketing and use of certain dangerous substances and preparations (pentabromodiphenyl ether, octabromodiphenyl ether). 2003. Official Journal of the European Union. http://eur-lex.europa.eu/LexUriServ/site/en/oj/2003/l_042/l_04220030215en00450046.pdf. Accessed 04 Jul 2012
71. Pakalin S, Cole T, Steinkellner J et al (2007) Review on production processes of decabromodiphenyl ether (decaBDE) used in polymeric applications in electrical and electronic equipment, and assessment of the availability of potential alternatives to decaBDE. European Commission. Directorate-General Joint Research Centre, Institute of Health and Consumer Protection, European Chemicals Bureau, Ispra, Italy. EUR 22693 EN
72. Prüss-Ustün A, Vickers C, Haefliger P et al (2011) Knowns and unknowns on burden of disease due to chemicals: a systematic review. Environ Health 10(1):9
Part IV
Abstract
Chemicals provide many key building blocks that are converted into end-use products or used in industrial
processes to make products that benefit society. Ensuring the safety of chemicals and their associated
products is a key regulatory mission. Current processes and procedures for evaluating and assessing the
impact of chemicals on human health, wildlife, and the environment were, in general, designed decades ago.
These procedures depend on generation of relevant scientific knowledge in the laboratory and interpretation
of this knowledge to refine our understanding of the related potential health risks. In practice, this
often means that estimates of dose–response and time-course behaviors for apical toxic effects are needed as
a function of relevant levels of exposure. In many situations, these experimentally determined functions are
constructed using relatively high doses in experimental animals. In the absence of experimental data, the
application of computational modeling is necessary to extrapolate risk or safety guidance values for
human exposures at low but environmentally relevant levels.
1. Dose/Response
Relationships
A dose–response relationship is an association between dose and the
incidence of a defined biological effect in an exposed population,
usually expressed as a percentage. Dose is defined as the total quantity
of a substance administered to, taken up, or absorbed by an organ-
ism, organ, or tissue and can be measured with in vitro or in vivo
experiments. When measured in vivo, doses are usually expressed as
milligrams per kilogram of body weight in the case of oral exposure,
as parts per million or billion (ppm or ppb) in cases of inhalation
exposure, or milligrams per square meter in cases of dermal expo-
sure. In vitro dosing units will depend on the experiment that is
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_13, © Springer Science+Business Media, LLC 2013
298 H. El-Masri
\[
\text{RfD or RfC} = \frac{\mathrm{POD}_{[\mathrm{HEC}]}}{\mathrm{UF} \times \mathrm{MF}}, \tag{1}
\]
where:
POD[HEC] = point of departure (NOAEL, LOAEL, or BMC)
dosimetrically adjusted to a human equivalent concentration
(HEC)
13 Modeling for Regulatory Purposes (Risk and Safety Assessment) 299
1.1. Point of Departure Estimates
The starting point for RfD/RfC calculation is the identification of
the POD for the critical effect in a key study. Subsequent steps
involve the following: (1) adjustment for the difference in duration
between experimental procedure (e.g., 6 h) and expected human
exposure (24 h), (2) calculation of HEC based on dosimetric
adjustments or allometric scaling using body weights or surface
areas, and (3) application of uncertainty/modifying factors.
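The three steps listed above can be sketched as a short calculation. This is an illustration under stated assumptions rather than a regulatory procedure: the reference value is taken as the human-equivalent POD divided by the product of the uncertainty factors (UF) and a modifying factor (MF), the duration adjustment is a simple proportional scaling from the experimental exposure period to 24 h, and the dosimetric adjustment factor, NOAEL, and factor values are hypothetical.

```python
from math import prod
from typing import Iterable


def adjust_for_duration(pod: float, hours_per_day: float,
                        days_per_week: float = 7.0) -> float:
    """Scale an experimental POD to continuous exposure (24 h/day, 7 days/week).

    Assumption: simple proportional (concentration x time) scaling.
    """
    return pod * (hours_per_day / 24.0) * (days_per_week / 7.0)


def reference_value(pod_hec: float, uncertainty_factors: Iterable[float],
                    modifying_factor: float = 1.0) -> float:
    """RfD or RfC = POD[HEC] / (UF x MF), with UF the product of the individual factors."""
    uf = prod(uncertainty_factors)
    if uf <= 0 or modifying_factor <= 0:
        raise ValueError("uncertainty and modifying factors must be positive")
    return pod_hec / (uf * modifying_factor)


# Hypothetical inhalation study: NOAEL of 100 mg/m3 with 6 h/day exposure.
noael = 100.0
pod_adj = adjust_for_duration(noael, hours_per_day=6.0)   # step 1: duration
dosimetric_adjustment_factor = 1.0                        # step 2: hypothetical HEC factor
pod_hec = pod_adj * dosimetric_adjustment_factor
rfc = reference_value(pod_hec, [10.0, 10.0])              # step 3: hypothetical UFs
print(pod_adj, pod_hec, rfc)
```

Keeping the duration adjustment, the dosimetric adjustment, and the uncertainty factors as separate steps makes it straightforward to see how much each one contributes to the final reference value.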
Ideally, the POD used in the RfD/RfC calculation process
should be the no observed adverse effect level (NOAEL), lowest
observed adverse effect level (LOAEL), or benchmark dose
(BMD). The NOAEL is identified as the highest dose tested at which
no statistically significant adverse effect is observed. The LOAEL is
identified as the lowest dose tested with a statistically significant
effect. A BMD is defined as the
statistical lower confidence limit of the dose producing a predetermined
level of change in adverse response compared with the
response in untreated animals (the benchmark response, BMR)
(1). The BMD is determined by modeling a dose–response curve in
the region of the relationship where biologically observable data
are available. The BMR is generally set near the lower limit of
responses that can be measured directly in animal experiments of
typical size.
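The NOAEL and LOAEL definitions can be illustrated with a small helper that reads them off a set of dose groups. This is a sketch with hypothetical data; it assumes each group has already been tested against the control and flagged as statistically significant or not, and it takes the NOAEL to be the highest nonzero dose below the LOAEL without a significant effect.

```python
from typing import Optional, Sequence, Tuple


def noael_loael(groups: Sequence[Tuple[float, bool]]
                ) -> Tuple[Optional[float], Optional[float]]:
    """Return (NOAEL, LOAEL) from (dose, is_significant) pairs.

    The control group (dose 0) is ignored. Either value may be None when the
    study does not identify it.
    """
    ordered = sorted(groups)
    significant = [dose for dose, sig in ordered if sig and dose > 0]
    loael = significant[0] if significant else None
    below = [dose for dose, sig in ordered
             if not sig and dose > 0 and (loael is None or dose < loael)]
    noael = below[-1] if below else None
    return noael, loael


# Hypothetical study: doses in mg/kg/day with significance flags versus control.
study = [(0.0, False), (10.0, False), (30.0, False), (100.0, True), (300.0, True)]
print(noael_loael(study))
```

A study in which even the lowest dose shows a significant effect yields a LOAEL but no NOAEL, which is one of the situations in which a modeled benchmark dose is a useful alternative POD.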
1.2. Modeling Dose–Response Data
The interrelationship among dose, time of exposure, and response
is fundamental to the quantitative analysis of the dose–response
relationship. Standard dose–response models are generally based
2. Balancing and
Judging Uncertainty
With the understanding that there is no complete mechanistic
model that can accurately predict and describe all relevant ADME
and biological processes of a chemical, regulatory agencies in charge of
estimating risk or safe levels are usually faced with judging and
balancing the uncertainty inherent in every computational model.
Quantitative uncertainty analysis in the form of Bayesian analysis of
models is time consuming and resource and data intensive. In
many situations, scientists address model uncertainty using qualitative
criteria based on their in-depth evaluation of the structure and
parameterization of the model. An additional step that adds confidence
in model behavior is testing its ability to simulate data that were not
used in its calibration (parameter determination by fitting to data).
Table 1 is an illustration of some examples that can add to or
Table 1
Information or items that may contribute to or reduce model
uncertainty
References
1. Crump K, Viren J, Silvers A, Clewell H 3rd, Gearhart J, Shipp A (1995) Reanalysis of dose-response data from the Iraqi methylmercury poisoning episode. Risk Anal 15:523–532
2. Bunce NJ, Remillard RB (2003) Haber's rule: the search for quantitative relationships in toxicology. Hum Ecol Risk Assess 9:1547–1559
3. Dedrick RL, Forrester DD (1973) Blood flow limitations in interpreting Michaelis constants for ethanol oxidation in vivo. Biochem Pharmacol 22:1133–1140
Chapter 14
Abstract
Developmental toxicity may be estimated using commercial and noncommercial software that is already
available in the market and/or literature, or models may be built from scratch using both commercial and
noncommercial software packages. In this chapter, commonly available software programs that can predict
the developmental toxicity of chemicals are described. In addition, a method for developing qualitative
structure–activity relationship (SAR) models to predict the developmental toxicity of chemicals qualitatively
(yes/no prediction) and quantitative structure–activity relationship (QSAR) models to predict quantitative
estimates (e.g., LOAEL) of developmental toxicants is also described in this chapter. Additional information
described in this chapter includes methods to predict physicochemical properties of chemicals that can be
used as descriptor variables in the model building process, statistical methods that can be used to build QSAR
models, as well as methods to validate the models that are developed. Most of the methods described in this
chapter can be used to develop models for health endpoints other than developmental toxicity as well.
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_14, © Springer Science+Business Media, LLC 2013
306 R. Venkatapathy and N.C.Y. Wang
2. Materials
2.1. Types of (Q)SAR Models
SARs for predicting the developmental toxicity of chemicals are
either qualitative in nature or expert-system based. In the former
case, there is a qualitative relationship between developmental tox-
icity and a chemical and/or its substructure (such as its one- and
two-atom fragments). A substructure associated with a biological
activity may also be referred to as a structural alert. SARs are
generally able to qualitatively estimate the toxicity of a given chem-
ical as positive, negative or indeterminate. In the case of expert
systems, experts in teratology review the experimental data on a
case-by-case basis and make judgments regarding the developmen-
tal toxicity potential for any given chemical. Expert systems may be
classified as knowledge based (rules are based on human expert
knowledge), induction rule based (based on artificial intelligence,
neural networks, machine learning, or data mining to automatically
derive the rules) or hybrid (rules are initially based on human expert
knowledge; machine can automatically learn new rules).
A QSAR is a mathematical relationship between a quantifiable
biological activity (such as developmental toxicant status or a developmental
toxicity LOAEL) and one or more physicochemical properties of
the chemical (also referred to as molecular descriptors), derived using various
statistical methods such as regression analysis, principal component
analysis (PCA) or factor analysis. In general, such
Correlative (or statistical) models try to find associations between
molecular descriptors (physicochemical properties) of chemical
structures and developmental toxicological data by statistical
means, which are then used to predict the developmental toxicity
of a test chemical. Mechanistic models, on the other hand, are
developed using human expertise on known teratogenic mechanisms
of action or are limited to a congeneric series of chemicals with the
assumption that congeneric chemicals have similar mechanism(s)
or mode(s) of action.
Other approaches to predicting the developmental toxicities of
chemicals include read-across and chemical analogues. Read-across
is a nonformalized approach in which developmental toxicity information
for one or more source chemicals is used to make a toxicity
prediction for another chemical based on a structural or functional
form of similarity (15). Read-across can be either qualitative or
quantitative, depending on whether the data being used
to make the prediction are qualitative or quantitative in nature.
To estimate the developmental toxicity of a given chemical,
read-across can be performed in a one-to-one manner (one analogue
is used to make a single toxicity prediction) or in a many-to-one manner
(two or more analogues are used to make a single toxicity prediction).
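The one-to-one and many-to-one variants of quantitative read-across can be sketched as follows. The similarity-weighted mean used here is an assumption chosen for illustration, since read-across is described above as a nonformalized approach; the similarity scores and LOAEL values are hypothetical.

```python
from typing import Sequence, Tuple


def read_across(analogues: Sequence[Tuple[float, float]]) -> float:
    """Predict a toxicity value for a target chemical from its analogues.

    Each analogue is a (similarity, toxicity_value) pair with similarity > 0.
    One analogue gives one-to-one read-across (its value is copied);
    several analogues give many-to-one read-across (similarity-weighted mean).
    """
    if not analogues:
        raise ValueError("at least one analogue is required")
    if any(similarity <= 0 for similarity, _ in analogues):
        raise ValueError("similarity scores must be positive")
    total_weight = sum(similarity for similarity, _ in analogues)
    return sum(similarity * value for similarity, value in analogues) / total_weight


# Hypothetical developmental toxicity LOAELs (mg/kg/day) for source chemicals.
one_to_one = read_across([(0.9, 50.0)])
many_to_one = read_across([(0.5, 40.0), (0.5, 60.0)])
print(one_to_one, many_to_one)
```

A qualitative read-across would replace the weighted mean with a categorical rule, for example assigning the target chemical the call (positive or negative) shared by its analogues.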
14 Developmental Toxicity Prediction 309
2.3. Developing (Q)SAR Models to Predict Developmental Toxicity
Sometimes, the commercial and noncommercial software available
may not be suitable for a particular user's application, either because
of the expense involved in acquiring the software or because the
software is not suitable for chemicals that are of particular interest
to the user. In such a case, a user may be able to build a (Q)SAR or
other mathematical model to predict the developmental toxicity of
chemicals using a database of training chemicals that is relevant for
their purpose.
A large number of mathematical methods are available to
express the mathematical function in a generic (Q)SAR equation.
The most frequently used methods are multiple linear regression
(MLR), principal component and factor analysis, principal compo-
nent regression analysis, partial least squares, discriminant analysis
and other classification analysis and neural networks. All these
methods relate the biological activity (developmental toxicity) to
a number of physicochemical descriptors that characterize a
chemical structure. In general, the physicochemical properties/
descriptors may be categorized as experimental quantities, spectro-
scopic data, substituent constants (electronic, hydrophobic, steric),
parameters derived from molecular modeling and quantum chemi-
cal computations, graph theoretical indices, and variables that
describe the presence or absence of certain substructures or frag-
ments. The (Q)SAR equation(s) is of the general form:
\[
\log\frac{1}{T} = f(V) + C, \tag{3}
\]
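A minimal worked example of fitting such an equation is given below. It assumes the general form log(1/T) = f(V) + C with f taken to be linear in a single descriptor V, and fits slope and intercept by ordinary least squares; the training data, the interpretation of T as a toxic dose, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training set: (descriptor V, toxic dose T in arbitrary units).
training = [(1.0, 100.0), (2.0, 10.0), (3.0, 1.0)]
descriptor = [v for v, _ in training]
activity = [math.log10(1.0 / t) for _, t in training]  # log(1/T)

slope, intercept = fit_line(descriptor, activity)


def predict_toxic_dose(v: float) -> float:
    """Invert log(1/T) = slope * V + intercept to estimate T for a new chemical."""
    return 10.0 ** -(slope * v + intercept)


print(slope, intercept, predict_toxic_dose(4.0))
```

Multiple linear regression generalizes this to several descriptors, and the other methods listed above replace the linear f with projections, latent variables, or nonlinear functions.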
2.3.1. Software Suites for Developing QSARs
Commercial and noncommercial applications that include descriptor
generator and statistical modules are available in the market to
build a SAR or QSAR model. Examples of such software include
ADAPT, Cerius 2, Discovery Studio, MDL QSAR, and Sybyl.
Commonly used software programs to predict the developmental
toxicity of chemicals are described in more detail below.
ADAPT. ADAPT (Automated Data Analysis using Pattern Recog-
nition Toolkit; Dr. Peter Jurs, University Park, PA; http://research.
chem.psu.edu/pcjgroup/adapt.html) is a software system designed
to allow the user to develop qualitative or quantitative SARs.
ADAPT provides facilities for graphical entry and storage of molec-
ular structures and their associated data, generation of 3D molecu-
lar models, molecular descriptor calculation and analysis of the
descriptors using multivariate statistical, pattern recognition, or
neural network methods to build predictive equations. In addition,
it can import structures as molfiles and 3D coordinates. ADAPT
has a large selection of molecular descriptor generation routines
(topological, geometrical, electronic, and physicochemical) and
the ability to generate hybrid descriptions that combine features.
Statistical approaches supported include MLR, clustering, discrim-
inant analysis and neural networks. ADAPT runs on Sun work-
stations under the UNIX operating system, and has been ported
to Linux using the Intel Fortran compiler. A number of scripts to
automate various tasks are also available. Approximately 120 MB of
storage are required for source code, libraries, executables, and
documentation.
2.3.2. Unbundled Software Stand alone applications may also be used to build a SAR or QSAR
for Developing QSARs models. In this case, separate software programs will have to be
used to calculate molecular descriptors and performing statistical
analysis to build the final model. Todeschini and Consonni (25)
provide a comprehensive list of molecular descriptors that can be
used to develop QSAR models. Apart from the applications
mentioned in the previous section, other common software appli-
cations that can be used to calculate molecular descriptors include
ACD/Labs PhysChem and ADME (http://www.acdlabs.com/
products/pc_admet/), ADRIANA. Code (http://www.molecular-
networks.com/products/adrianacode), Almond (http://www.
moldiscovery.com/soft_almond.php), Cranium (http://www.mol
know.com/Cranium/cranium.htm), CODESSA PRO (http://
www.codessa-pro.com/index.htm), EPI Suite (http://www.epa.
gov/oppt/exposure/pubs/episuite.htm), Dragon (http://www.
talete.mi.it/products/dragon_description.htm), HYBOT-PLUS
14 Developmental Toxicity Prediction 323
(http://www.timtec.net/software/hybot-plus.htm), Molinspiration
(http://www.molinspiration.com/cgi-bin/properties), and
Molconn-Z (http://www.edusoft-lc.com/molconn/). Some quantum
mechanical (QM) descriptors mentioned by Todeschini and Consonni (25)
cannot be computed by the software programs mentioned previously.
QM descriptors generally require computational
chemistry/molecular modeling software such as
ADF (http://www.scm.com/Products/Overview/ADFinfo.html),
AMPAC (http://www.semichem.com/ampac/),
Gaussian (http://www.gaussian.com/),
GAMESS (http://www.msg.ameslab.gov/GAMESS/GAMESS.html),
Molpro (http://www.molpro.net/),
NWChem (http://www.nwchem-sw.org/index.php/Main_Page),
pDynamo (http://www.pdynamo.org/mainpages/), and
Q-Chem (http://www.q-chem.com/). Once molecular descriptors for the chemical structures
have been calculated, statistical programs such as
CoStat/CoPlot (http://www.cohort.com/costat.html),
DataFit (http://www.curvefitting.com/datafit.htm),
IgorPro (http://www.wavemetrics.com/index.html),
JMP (http://www.jmp.com/software/jmp9/),
Minitab (http://www.minitab.com/en-US/products/minitab/default.aspx),
Mathematica (http://www.wolfram.com/products/mathematica/index.html),
Maple (http://www.maplesoft.com/products/maple/),
MATLAB (http://www.mathworks.com/products/matlab/),
Miner3D (http://miner3d.com/),
NLREG (http://www.nlreg.com/index.htm),
Origin (http://www.originlab.com/index.aspx?go=Products/Origin),
R (http://www.r-project.org/),
SAS (http://www.sas.com/software/sas9/),
SigmaStat, SigmaPlot and Systat (http://www.systat.com/SystatProducts.aspx),
S-Plus (http://spotfire.tibco.com/products/s-plus/statistical-analysis-software.aspx),
Statistica (http://www.statsoft.com/), and
WEKA (http://www.cs.waikato.ac.nz/~ml/weka/index.html) may be used
to develop the final (Q)SAR model. Most of the software mentioned
in this section runs on Windows, though a few programs also run on
Linux-, UNIX-, or Mac-based platforms.
3. Methods
3.1. QSAR Analysis
The general procedure for developing a QSAR model for predicting
the developmental toxicity of a wide variety of chemicals is outlined
in the following steps:
Choosing the training set. The values of the dependent variable for
the chemicals in the model training set, such as the lowest dose that
causes developmental effects (for quantitative models) or whether
exposure to a chemical causes developmental effects (for qualitative
models), are chosen from the literature. Since a good QSAR
validation, which tests for balanced test sets. The validation process
involves developing the QSAR equation for a series of subsets of
chemicals from the initial chemical dataset and then comparing the
following: coefficients of the descriptors for the two resultant
QSAR equations (initial and subset), the Pearson correlation coef-
ficients and predictive ability of the subset equation for chemicals
that were not present in the subset. External validation involves
using the QSAR models developed in the previous step to predict
the toxicities of chemicals that are not part of the original dataset.
3.2. Independent Variables
Various independent variables can be used to predict the developmental
toxicity of a wide variety of chemicals. For example, the
descriptors in Subheading 14.4 were calculated using the commercial
descriptor-generator programs Dragon, CAChe, AMPAC/CODESSA
and Molecular Connectivity (Molconn-Z). The methods
used to derive the independent variables with each descriptor-generator
program are described in the following paragraphs.
CAChe Descriptors. The chemicals considered in this study were first
optimized to their most stable conformation using the CONFLEX
module in CAChe WorkSystem Pro (version 6.1.12.33, Fujitsu,
Beaverton, OR). The independent variables (descriptors) from
CAChe were then calculated using the Project Leader interface in
CAChe.
Charge-based descriptors such as the partial charge, HOMO
and LUMO were calculated by first optimizing the chemicals in
vacuum using MOPAC with PM5 parameters, followed by a population
analysis on the molecular orbitals. The following keywords
were used for the MOPAC job: GEO-OK PM5 NOMM ENPART
ESR LOCALIZE MULLIK PI POLAR BONDS DENSITY FOCK
GRADIENTS SPIN VECTORS PRECISE LARGE XYZ NODIIS
GRAPH ITRY=2000 T=10D GNORM=0.0 SCFCRT=1.D-15
DMAX=0.1.
Solvent-based descriptors such as the solvent accessible surface
area, energy of solvation and dipole moment were calculated by first
optimizing the chemicals in water using MOPAC with PM5 parameters
and the Conductor-like Screening Model (COSMO).
In addition to the keywords used for optimizations in vacuum,
the following two keywords were added: EPS=78.4 RSOLV=1.00.
Reactivity-based descriptors such as the nucleophilic, electrophilic
and radical superdelocalizability were calculated by optimizing
the chemicals in vacuum using MOPAC with PM5 parameters,
followed by running the Tabulator interface in CAChe to get the
isodensity values. In the Tabulator, default reagent energy values of
8 eV, 2 eV, and 5 eV were assigned for the calculation of electrophilic,
nucleophilic, and radical superdelocalizability, respectively.
Infrared descriptors were calculated by optimizing the chemi-
cals in vacuum using MOPAC with PM5 parameters followed by a
vibrational spectra calculation using the FORCE and THERMO
REPROTOX (http://www.reprotox.org/Default.aspx) is an
information system developed by the Reproductive Toxicology
Center for its members. REPROTOX contains summaries on the
effects of medications, chemicals, infections, and physical agents on
pregnancy, reproduction, and development. The REPROTOX
system was developed as an adjunct information source for clini-
cians, scientists, and government agencies.
The Distributed Structure-Searchable Toxicity (DSSTox;
http://www.epa.gov/ncct/dsstox/) and Toxicity Reference Database
(ToxRefDB; http://www.epa.gov/ncct/toxrefdb/) databases are
maintained by the US EPA's National Center for Computational
Toxicology (NCCT). ToxRefDB contains results from 383 rat and
368 rabbit prenatal studies on 387 chemicals, mostly pesticides.
US FDA Informatics and Computational Safety Analysis Staff
(ICSAS) maintains a genetic toxicity, reproductive and develop-
mental toxicity and carcinogenicity database (http://www.fda.
gov/AboutFDA/CentersOffices/CDER/ucm092217.htm). The
ICSAS reproductive and developmental toxicity database contains
data records from FDA segment I (reproductive toxicity in male
and female animals), segment II (teratology, organ toxicity, and
nonspecific toxicity to the fetus), and segment III (behavioral
toxicity in newborn pups) studies in Glires (primarily rats, mice,
rabbits, and hamsters) and other animals. The data were acquired
from publicly available sources, such as Shepard's Catalog of Teratogenic
Agents, TERIS, REPROTOX, and RTECS, as well as
studies reported in drug labeling, and other reproductive toxicity
studies obtained from the EPA Toxdata-1g database.
The International Life Sciences Institute (http://www.ilsi.org/
Pages/ViewActivityDetail.aspx?ID114&ListNameActivities&
diaPage/Pages/RiskAssessmentDiagram.aspx) is developing a
prototype database of developmental toxicity data; once populated,
the database will facilitate systematic consideration of alternative
approaches for utilizing toxicity data in SAR modeling efforts.
3.4. Statistical Analysis
Statistical analyses can be performed using any commonly available
statistical software such as JMP (SAS Institute Inc., Cary,
NC), Microsoft Excel (Microsoft Corporation, Redmond, WA),
R (R-Project, http://www.r-project.org), SAS (SAS Institute
Inc., Cary, NC), SigmaPlot (Systat Software Inc., Chicago, IL), and
SYSTAT (Systat Software Inc., Chicago, IL). Alternatively, statistical
analysis can be performed in any of the QSAR model building
software mentioned in Subheading 14.2. Common methods used
to develop QSAR models include stepwise MLR, genetic algorithms,
logistic regression, and classification and regression trees,
among others. In general, SAR models for predicting developmental
toxicity of chemicals qualitatively (yes/no) are developed using
3.5. Domain of Applicability
The predictive ability of the QSARs also depends on their prediction
domain, which in turn depends on the range of descriptor
values in the QSAR equation. One way of defining the prediction
domain is according to the principles of the domain of applicability
as defined by Gombar and Enslein (24). However, the procedure
for calculating their prediction domain is proprietary and hence
may not be available for implementation in QSAR models.
A second method of determining the QSAR prediction domain
is a procedure based on Hotelling's T² statistic (29).
Briefly, the procedure involves performing a PCA on a training set
of chemicals and using the resultant loading matrix to calculate the
PCA scores for the external set. Using the scores and a given
significance level, α, a chemical i of the external set is considered
to belong to the prediction domain if its Hotelling's score, T_i²,
satisfies the following criterion:

$$T_i^2 < \frac{A\,(N^2 - 1)}{N\,(N - A)}\; F(p = \alpha) \tag{4}$$

where F(p = α) is the tabulated value of the F-distribution at
significance level α, A is the number of principal components
used to build the Hotelling's test, and N is the number of chemicals
in the training set. The score is calculated as

$$T_i^2 = \sum_{a=1}^{A} \frac{t_{ia}^2}{s_a^2} \tag{5}$$

where t_ia is the score of chemical i on principal component a and
s_a² is the variance of the training-set scores on that component.
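The Hotelling's T² criterion described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the chapter's own implementation: the function names, the autoscaling step, the synthetic descriptor matrix, and the use of (A, N − A) degrees of freedom for the tabulated F value are all assumptions introduced here. The F value is passed in as a number read from a table, as the text describes.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Autoscale the training descriptors and extract the PCA loadings."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / std
    # Rows of Vt are the principal axes of the autoscaled training data.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T          # shape: (descriptors, A)
    scores = Z @ loadings                   # training-set scores t_ia
    score_var = scores.var(axis=0, ddof=1)  # s_a^2 for each component
    return mean, std, loadings, score_var


def hotelling_t2(X, mean, std, loadings, score_var):
    """Project chemicals with the training loadings and sum t_ia^2 / s_a^2."""
    scores = ((X - mean) / std) @ loadings
    return (scores ** 2 / score_var).sum(axis=1)


def t2_limit(n_train, n_components, f_crit):
    """Right-hand side of the criterion: A(N^2 - 1) / (N(N - A)) * F."""
    A, N = n_components, n_train
    return A * (N ** 2 - 1) / (N * (N - A)) * f_crit


# Synthetic stand-in for a descriptor matrix: 20 chemicals x 4 descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 4))
A = 2
mean, std, loadings, score_var = fit_pca(X_train, A)

# Tabulated F value for alpha = 0.05 with (2, 18) degrees of freedom (assumed).
F_CRIT = 3.55
limit = t2_limit(len(X_train), A, F_CRIT)

# Two hypothetical external chemicals: one at the training centroid,
# one displaced far along the first principal axis.
X_ext = np.vstack([mean, mean + std * 100.0 * loadings[:, 0]])
t2_ext = hotelling_t2(X_ext, mean, std, loadings, score_var)
in_domain = t2_ext < limit
print(limit, t2_ext, in_domain)
```

The external chemicals are projected with the training-set mean, scaling, and loadings only, so the domain is defined entirely by the training set. A chemical at the centroid has a score of zero and falls inside the domain, while the displaced chemical exceeds the limit.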
3.6. Performance Evaluation of a QSAR Model
Two methods are commonly used to determine the
predictive capability of a (Q)SAR model (33). The first is
cross-validation, which includes leave-one-out (LOO), leave-many-out
(LMO), and k-fold cross-validation. In LOO, a compound is left out of the training set
and the remaining compounds are used to train the machine
learning method. The derived (Q)SAR model is then used to
predict the activity of the left-out compound. This process is
repeated until every compound in the training set has been left
out once. In k-fold cross-validation, the training set is randomly
divided into k mutually exclusive subsets of approximately equal
size. k − 1 of the subsets are combined to form a modeling
training set for developing a QSAR model. The remaining subset is
then used as a modeling testing set to assess the predictive capability
of the QSAR model. This process is repeated until each of the k subsets
has been used as the testing set once.
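The k-fold and LOO procedures can be sketched as follows. This is a minimal illustration, assuming a multiple linear regression model fitted by least squares and a synthetic two-descriptor data set; the function names and the cross-validated q² summary (1 − PRESS/SS) are assumptions introduced here rather than part of the chapter's procedure. LOO is obtained by setting k equal to the number of compounds.

```python
import numpy as np


def kfold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k mutually exclusive subsets."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)


def fit_mlr(X, y):
    """Least-squares MLR; the first coefficient is the intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]


def cross_val_predict(X, y, k):
    """Predict every compound from a model that never saw it."""
    n = len(y)
    y_pred = np.empty(n)
    for test_idx in kfold_indices(n, k):
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False           # hold out one subset
        coef = fit_mlr(X[train_mask], y[train_mask])
        y_pred[test_idx] = predict_mlr(coef, X[test_idx])
    return y_pred


def q2(y, y_pred):
    """Cross-validated q^2 = 1 - PRESS / total sum of squares."""
    press = np.sum((y - y_pred) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Synthetic example: 30 compounds, 2 descriptors, small added noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=30)

q2_5fold = q2(y, cross_val_predict(X, y, k=5))
q2_loo = q2(y, cross_val_predict(X, y, k=len(y)))   # LOO: k = n
print(q2_5fold, q2_loo)
```

Because each compound is predicted exactly once by a model trained without it, the predictions can be pooled into a single statistic for the whole training set.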
Cross-validation methods may sometimes not be a good indicator
of the prediction capability of a (Q)SAR model (34–37). Moreover,
cross-validation methods tend to underestimate
the prediction capability of a QSAR model, especially if important
molecular features are present in only a minority of the compounds in
the training set (38, 39). Thus a model with poor cross-validation
results can still be quite predictive (38). Some studies have suggested
that an independent validation set may provide a more reliable estimate
of the prediction capability of a QSAR model (33, 34). Despite
these disadvantages, cross-validation methods are still useful for assessing
QSAR models during optimization of the parameters of machine
learning methods and during descriptor selection.
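$$SE = \frac{TP}{TP + FN} \times 100\% \tag{7}$$

$$SP = \frac{TN}{TN + FP} \times 100\% \tag{8}$$

$$Q = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \tag{9}$$

$$MCC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \tag{10}$$

where TP, TN, FP, and FN are the numbers of true positives, true
negatives, false positives, and false negatives, respectively; SE is the
sensitivity, SP the specificity, Q the overall accuracy, and MCC the
Matthews correlation coefficient.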
4. Examples
Chemical  dev_endpt  v1      v2     v3        v4      v5        v6
1         2.759712   10.441  0.02   0.068041  0.1156  0.0203    26.1079
2         4.587676   10.871  0.018  0         0       0         12.8101
3         9.701267   9.134   0.189  0.096225  0       0         3.7703
4         9.502622   7.977   0.279  1.626968  0.0678  2.24E-05  5.4028
5         3.542805   6.989   0.269  0.602671  0       0         1.4531
6         6.612875   9.471   0.058  0.117851  0       0         0.4964
7         5.693281   10.747  0      0.146385  0.0379  3.04E-05  17.078
8         10.37322   9.289   0.003  4.373589  0       0.00E+00  0.00E+00
9         0.793055   9.966   0.015  0         0       0.00E+00  25.3052
10        6.332414   8.153   0.306  0.66066   0       0.00E+00  0.8482
11        6.991609   10.713  0.008  0         0       0.00E+00  0.00E+00
12        7.704999   10.243  0.123  0         0       0.00E+00  4.9914
13        8.988404   10.826  0.044  0         0.1805  0.0248    10.1478
14        5.439386   8.68    0.72   0.166667  0       0.00E+00  1.1007
15        9.1427     10.941  0.762  0.117851  0.1748  1.03E-04  24.8754
16        9.033307   8.64    0.207  0.51953   0.106   7.42E-08  7.6008
17        5.181727   7.811   0.094  0.85042   0       0.00E+00  0.9331
18        8.114655   8.702   0.529  0.92633   0       0.00E+00  0.8272
19        11.88782   9.646   0.175  1.494564  0.1245  7.86E-03  13.1668
20        8.617923   9.405   0.133  0.831987  0       0.00E+00  0.6074
21        1.387535   9.579   0.067  0.141421  0.07    0.0266    21.9634
22        2.923085   9.631   0.116  0.1       0.0698  0.0271    15.2603
23        3.86922    9.485   0.122  0.1       0.0686  1.32E-03  10.7721
24        0.38433    9.635   0.086  0.1       0.0699  0.0302    19.4625
25        0.373352   9.591   0.039  0.2       0.0701  0.0165    28.6147
26        10.60792   8.375   1.004  0.199071  0.0677  8.37E-03  10.1463
27        3.864432   9.575   0.065  0.1       0.0697  0.0168    22.1775
28        5.488927   12.092  0.744  0.166667  0.1804  0.0554    25.7257
29        3.762447   8.677   0.096  0.147883  0       0.00E+00  3.6157
30        9.795335   8.62    0.241  0.572211  0.1597  0.01      6.1058
31        10.6815    6.918   1.012  0.595928  0.0712  5.12E-04  1.5184
32        8.932089   8.198   0.579  0.45111   0.0722  2.40E-04  4.7366
33        4.46365    10.691  0.002  0         0       0.00E+00  6.9086
Fig. 1. Williams Plot for the QSAR model generated using the data in Table 1.
From the result, one can see that the majority of the training set is
indeed inside the AD, defined by the rectangular box bounded
by h* and the residual limits (−3 < residuals < 3); only two chemicals (#28
and #8) are outside the AD. At this point, one may consider
removing these two chemicals one by one and redeveloping a
new QSAR model, or simply keep the current model (it already
has reasonable predictability). Lastly, one of the great utilities of
the AD is that it allows one to test the predictability and uncertainty of
a QSAR model for new or untested chemicals that are not part of the
training set (i.e., external validation). Here, descriptors have already
been generated for three new hypothetical chemicals, A, B, and C
(lines bolded and italicized in the code). As one
can see, A and B have leverages less than h*
(0.5454545), whereas C has a very high leverage (h = 288.0573).
This result suggests that one should not use the current QSAR
model to predict Chemical C, because the prediction would be a
large extrapolation with very high uncertainty. For A and B, predictions
by the model may be reasonably reliable for screening
and prioritization. However, further testing (e.g., in vitro and/or
in vivo assays) should be conducted if these two chemicals are of
high concern, especially in a regulatory setting with policy implications.
One should always report the predicted toxicity values together with
the associated leverages and the underlying model limitations.
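The leverage check for new chemicals can be sketched as below. This is not the chapter's code, which is not reproduced in this section: the function names, the synthetic training matrix, and the two query chemicals are hypothetical. The sketch assumes the usual definitions, with leverage h_i = x_iᵀ(XᵀX)⁻¹x_i computed from the training-set design matrix (with an intercept column) and a warning leverage h* equal to three times the number of model parameters divided by the number of training chemicals. The value of h* quoted in the text, 0.5454545, equals 18/33, which is consistent with six model parameters and the 33 training chemicals.

```python
import numpy as np


def design_matrix(X):
    """Prepend an intercept column to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])


def leverages(X_train, X_query):
    """h_i = x_i^T (X^T X)^-1 x_i, with X built from the training set only."""
    Xt = design_matrix(X_train)
    Xq = design_matrix(X_query)
    xtx_inv = np.linalg.inv(Xt.T @ Xt)
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)


def warning_leverage(n_chemicals, n_params):
    """Assumed definition: h* = 3 * (number of model parameters) / n."""
    return 3.0 * n_params / n_chemicals


# Synthetic stand-in: 33 training chemicals, 5 descriptors + intercept.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(33, 5))
h_star = warning_leverage(33, 6)

h_train = leverages(X_train, X_train)

# Two hypothetical query chemicals: one at the training centroid,
# one far outside the descriptor range of the training set.
centroid = X_train.mean(axis=0)
X_query = np.vstack([centroid, centroid + 10.0])
h_query = leverages(X_train, X_query)
reliable = h_query <= h_star
print(h_star, h_query, reliable)
```

A Williams plot is then a scatter of standardized residuals against these leverages, with a vertical line at h* and horizontal lines at ±3.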
5. Notes
References
1. Anderson ME, Al-Zoughool M, Croteau M et al (2010) The future of toxicity testing. J Toxicol Environ Health B Crit Rev 13(2–4):163–196
2. European Union Directive (2003) DIRECTIVE 2003/89/EC OF THE EUROPEAN PARLIAMENT AND OF THE COUNCIL of 10 November 2003 amending Directive 2000/13/EC as regards indication of the ingredients present in foodstuffs. http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2003:308:0015:0018:EN:PDF. Accessed 23 Oct 2010
3. Doull J, Borzelleca JF, Becker R et al (2007) Framework for use of toxicity screening tools in context-based decision-making. Food Chem Toxicol 45(5):759–796
4. Basak SC, Bertelsen S, Grunwald GD (1995) Use of graph theoretic parameters in risk assessment of chemicals. Toxicol Lett 79:239–250
5. Devillers J, Chezeau A, Thybaud E et al (2002) QSAR modeling of the adult and developmental toxicity of glycols, glycol ethers and xylenes to Hydra attenuata. SAR QSAR Environ Res 13(5):555–566
6. Devillers J, Chezeau A, Thybaud E (2002) PLS-QSAR of the adult and developmental toxicity of chemicals to Hydra attenuata. SAR QSAR Environ Res 13(7–8):705–712
7. Kavlock RJ (1990) Structure-activity relationships in the developmental toxicity of substituted phenols: in vivo effects. Teratology 41(1):43–59
8. Richard AM, Hunter ES III (1996) Quantitative structure-activity relationships for the developmental toxicity of haloacetic acids in mammalian whole embryo culture. Teratology 53(6):352–360
9. Vedani A, McMasters DR, Dobler M (1991) Genetic algorithms in 3D-QSAR: predicting the toxicity of dibenzodioxins, dibenzofurans and biphenyls. ALTEX 16(1):9–14
10. Matthews EJ, Kruhlak NL, Cimino MC et al (2006) An analysis of genetic toxicity, reproductive and developmental toxicity, and carcinogenicity data: II. Identification of genotoxicants, reprotoxicants, and carcinogens using in silico methods. Regul Toxicol Pharmacol 44(2):97–110
11. Matthews EJ, Kruhlak NL, Benz DR (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: I. Development of a weight of evidence QSAR database. Regul Toxicol Pharmacol 47(2):115–135
12. Matthews EJ, Kruhlak NL, Benz RD et al (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: II. Construction of QSAR models to predict activities of untested chemicals. Regul Toxicol Pharmacol 47(2):136–155
13. Cassano A, Manganaro A, Martin T et al (2010) CAESAR models for developmental toxicity. Chem Cent J 4(Suppl 1):S4
14. Merlot C (2008) In silico methods for early toxicity assessment. Curr Opin Drug Discov Devel 11(1):80–85
15. Worth AP, Bassan A, de Brujin J et al (2007) The role of the European Chemicals Bureau in promoting the regulatory use of (Q)SAR methods. SAR QSAR Environ Res 18:111–125
16. Klopman G (1984) Artificial intelligence approach to structure-activity studies. Computer automated structure evaluation of biological activity of organic molecules. J Am Chem Soc 106:7315–7321
17. Klopman G (1992) MULTICASE: a hierarchical computer automated structure evaluation program. Quant Struct Act Relat 11:176–184
18. Klopman G, Rosenkranz HS (1994) Approaches to SAR in carcinogenesis and mutagenesis. Prediction of carcinogenicity/mutagenicity using Multi-CASE. Mutat Res 305:33–46
19. Maslankiewicz L, Hulzebos EM, Vermeire TG et al (2005) Can chemical structure predict reproductive toxicity? RIVM, Bilthoven. www.rivm.nl/bibliotheek/rapporten/601200005.html. Accessed 24 Oct 2010
20. Accelrys (2001) TOPKAT 6.1: User Guide. Oxford Molecular Ltd, Burlington, MA
21. Gombar VK (1998) Quantitative structure-activity relationships in toxicology: from fundamentals to application. In: Reiss C, Parvez S, Labbe G et al (eds) Advances in molecular toxicology. VSP, Utrecht
22. Purcell WP, Bass GE, Clayton JM (1973) Strategy of drug design: a guide to biological activity. Wiley, New York
23. Moudgal CJ, Venkatapathy R, Choudhury H et al (2003) Application of QSTRs in the selection of a surrogate toxicity value for a chemical of concern. Environ Sci Technol 37:5228–5235
24. Gombar VK, Enslein K (1996) Assessment of n-octanol/water partition coefficient: when is the assessment reliable? J Chem Inf Comput Sci 36:1127–1134
25. Todeschini R, Consonni V (2000) Handbook of molecular descriptors, vol 11, Methods and principles in medicinal chemistry. Wiley-VCH Verlag GmbH, Weinheim
26. Ghose AK, Crippen GM (1987) Atomic physicochemical parameters for three-dimensional-structure-directed quantitative structure-activity relationships. 2. Modeling dispersive and hydrophobic interactions. J Chem Inf Comput Sci 27(1):21–35
27. Kier LB (1986) Shape indexes of orders one and three from molecular graphs. Quant Struct Act Relat 5:1–7
28. Kier LB, Hall LH (1999) The Kappa indices for modeling molecular shape and flexibility. In: Devillers J, Balaban AT (eds) Topological indices and related descriptors in QSAR and QSPR. Gordon and Breach, Reading
29. Eriksson L, Johansson E, Kettaneh-Wold N et al (1999) Introduction to multi- and megavariate data analysis using projection methods (PCA & PLS). Umetrics, Inc., Kinnelon
30. Eriksson L, Jaworska JS, Worth AP et al (2003) Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ Health Perspect 111:1361–1375
31. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701
32. Atkinson AC (1985) Plots, transformations and regression. Clarendon, Oxford
33. Wold S, Eriksson L (1995) Statistical validation of QSAR results. In: van de Waterbeemd H (ed) Chemometric methods in molecular design. VCH, New York
34. Golbraikh A, Tropsha A (2002) Beware of q²! J Mol Graph Model 20(4):269–276
35. Kozak A, Kozak R (2003) Does cross validation provide additional information in the evaluation of regression models? Can J Forest Res 33(6):976–987
36. Reunanen J (2003) Overfitting in making comparisons between variable selection methods. J Mach Learn Res 3:1371–1382
37. Olsson I-M, Gottfries J, Wold S (2004) D-optimal onion designs in statistical molecular design. Chemometr Intell Lab Syst 73(1):37–46
38. Mosier PD, Jurs PC (2002) QSAR/QSPR studies using probabilistic neural networks and generalized regression neural networks. J Chem Inf Comput Sci 42(6):1460–1470
39. Hawkins DM, Basak SC, Mills D (2004) Assessing model fit by cross-validation. J Chem Inf Comput Sci 43(2):579–586
40. Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 405(2):442–451
41. Obach RS, Baxter JG, Liston TE et al (1997) The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. J Pharmacol Exp Ther 283(1):46–58
42. Hawkins DM (2004) The problem of overfitting. J Chem Inf Comput Sci 44(1):1–12
Further Reading
http://www.scripps.edu/rc/softwaredocs/msi/cerius45/qsar/T_qsar1.html
http://voyagememoirs.com/pharmine/archives
Arena VC, Sussman NB, Mazumdar S et al (2004) Selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR QSAR Environ Res 15(1):1–18
Cronin MT, Jaworska JS, Walker JD et al (2003) Use of QSARs in international decision-making frameworks to predict health effects of chemical substances. Environ Health Perspect 111(10):1391–1401
Cunningham AR, Carrasquer CA, Mattison DR (2009) A categorical structure-activity relationship analysis of the developmental toxicity of antithyroid drugs. Int J Pediatr Endocrinol. doi:10.1155/2009/936154
Cunningham AR, Consoer DM, Iype SA et al (2009) A structure-activity relationship (SAR) analysis for the identification of environmental estrogens: the categorical-SAR (Cat-SAR) approach. In: Devillers J (ed) Endocrine disruption modeling. CRC, New York
Hewitt M, Ellison CM, Enoch SJ et al (2010) Integrating (Q)SAR models, expert systems and read-across approaches for the prediction of developmental toxicity. Reprod Toxicol 30(1):147–160
Knudsen TB, Martin MT, Kavlock RJ et al (2009) Profiling the activity of environmental chemicals in prenatal developmental toxicity studies using the U.S. EPA's ToxRefDB. Reprod Toxicol 28(2):209–219
Martin MT, Mendez E, Corum DG et al (2009) Profiling the reproductive toxicity of chemicals from multigeneration studies in the toxicity reference database (ToxRefDB). Toxicol Sci 110(1):181–190
OECD (1995) Test No. 421: Reproduction/Developmental Toxicity Screening Test, OECD guidelines for the testing of chemicals, section 4: health effects, OECD Publishing. doi:10.1787/9789264070967-en
Saiakhov RD, Klopman G (2008) MultiCASE expert systems and the REACH initiative. Toxicol Mech Methods 18(2–3):159–175
Sakamoto K, Yamauchi A, Sasaki M et al (2007) A structural similarity evaluation by SimScore in a teratogenicity information sharing system. J Comput Chem Jpn 6(2):117–122
Todeschini R, Consonni V (2000) Handbook of molecular descriptors, vol 11, Methods and principles in medicinal chemistry. Wiley-VCH Verlag GmbH, Weinheim
Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22(1):69–77
US EPA (1991) Guidelines for developmental toxicity risk assessment. Office of Research and Development, Washington, DC, EPA/600/FR-91/001
Yang CH, Valerio LG, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37(5):523–531
Chapter 15
Abstract
Use of predictive technologies is an important aspect of many efforts in today's research, development, and
regulatory landscapes. Computational methods as predictive tools for supporting drug safety assessments are
of widespread interest as the field of in silico assessment rapidly changes with emerging technologies and
the large amount of existing data available for modeling. There are challenges associated with the application of
in silico analyses to drug toxicity prediction, and a need for strategies and harmonization to enable an
acceptable in silico evaluation for predicting specific toxicity assay outcomes. This chapter provides an
overview focused on computational tools using structure–activity relationships and highlights initiatives
for the use of computational assessments and realistic applications of predictive modeling in evaluating
potential toxicities of drug-related molecules.
Key words: Computational toxicology, Drug safety, Safety assessment, Drug toxicity, QSAR, SAR,
In silico toxicology, Predictive toxicology, Chemoinformatics, Molecular screening
1. Introduction
There are many recognized potential drug safety issues that can
arise during the course of drug development through post-marketing.
Among the most recognized are safety-based issues
centered on cardiovascular and hepatic adverse drug reactions (1).
In addition, some data indicate that the average success rate for new
molecular entities (NMEs), aggregated over all therapeutic areas,
from first-in-human studies to registration during
1991–2000 was approximately 11% (2). In 2003, the US Food
and Drug Administration (FDA) approved 21 NMEs (2), but that number
has decreased recently, with only 15 NMEs approved in 2010 (3).
Although lack of efficacy is a major contributor, toxicity can also be
a major cause of failure in drug development, with estimates of
approximately 30% of root causes of attrition (4). In an effort to
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_15, # Springer Science+Business Media, LLC 2013
342 L.G. Valerio Jr.
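Leadscope Model Applier: structure interpretation, structural features/scaffolds; molecular descriptors, 2D (n~27,000 features); applicability domain, similarity to training set + features in domain; operating system, Windows; application architecture, desktop.
SciQSAR: structure interpretation, molecular connectivity indices (2D descriptors); molecular descriptors, 2D (n~200, Kier and Hall); applicability domain, descriptor-based membership in class; operating system, Windows; application architecture, desktop.
Symmetry: structure interpretation, fingerprint molecular descriptors (3D in the future); molecular descriptors, 2D (n~800 descriptors); applicability domain, descriptor-based membership in class; operating system, Windows in the server, any in the clients; application architecture, web application.
Toxtree: structure interpretation, rules (alerts); molecular descriptors, 2D; applicability domain, none; operating system, Windows; application architecture, desktop.
Derek Nexus: structure interpretation, 2D structural alert (molecular fragment); molecular descriptors, 2D; applicability domain, similarity to knowledge base; operating system, Windows; application architecture, desktop.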
Fig. 1. List of common computational toxicology software platforms. Information shows a wide range of approaches.
All software can be enabled to predict the genetic toxicity of chemicals using various modules (e.g., bacterial mutagenic-
ity). No endorsement intended or implied.
2. Materials and Methods
Table 1
Summary of validation statistics reported in the literature for some computational
toxicology software that have QSAR models or SAR methods designed to predict
mutagenicity
Derek in the prediction of the Ames assay (27). The other factor to
consider is that the training sets and knowledge bases of the computational
software programs used to make the models have
very low drug content (<25%) (30). Thus, the following questions
arise: can nondrug molecules present in the training set of a
model predict the mutagenicity of drugs, and moreover, can
these public training sets deal with proprietary drug molecular
space? To help answer these questions, a recent examination of a
public Ames QSAR training set using over 1,000 drug impurities
(public and private) as a test set found that only 14% of the drug
impurities were not in the applicability domain space of the model
(31, 63). In other words, useful Ames predictions could be made
for 86% of the drug impurities that were screened with the model.
Thus, at least in this example of a QSAR training set of
public structures that are predominantly not drugs, there is an
acceptable level of molecular coverage and overlapping chemical
space with a test set of drug impurities. The next logical step will be
to test drug impurities with known Ames assay outcomes and
validate the available computational models for accuracy, sensitivity,
and specificity. Various arguments can be made
for preferring a model that performs at high sensitivity
or at high specificity, and these have been described recently (7, 9, 26, 28). In a
perfect world we would have 100% sensitivity and 100% specificity
in a single model. Since that is not possible, one usually has to favor
one of them (e.g., higher sensitivity over specificity). However,
it is not impossible to construct a model with somewhat balanced
sensitivity and specificity (32). As an alternative solution, construction
of different models with optimized levels of predictive performance
for sensitivity or specificity is becoming more common in
order to customize models to the different use cases.
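The trade-off between sensitivity and specificity can be illustrated with a small sketch. The scores, labels, and cutoffs below are hypothetical and do not come from any of the software discussed here; the sketch only assumes a model that outputs a continuous score, with a chemical called positive when its score reaches a chosen cutoff.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold means 'positive'."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model scores and assay outcomes (1 = mutagenic, 0 = not).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1]

# A low cutoff favors sensitivity; a high cutoff favors specificity.
low_cutoff = sens_spec(scores, labels, threshold=0.35)    # (1.0, 0.6)
high_cutoff = sens_spec(scores, labels, threshold=0.75)   # (0.5, 1.0)
print(low_cutoff, high_cutoff)
```

Tuning separate models or cutoffs for each use case, as described above, amounts to choosing where on this trade-off a given application should sit.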
The software programs illustrated in Fig. 1 do not have special
software requirements, since all of these systems can operate in a
standard Windows environment. Other computational toxicology
software uses Linux as the operating system (33) or requires Oracle
databases (34), so depending upon the user's Information Technology
(IT) infrastructure there are considerations to be made.
The computational software most commonly used for
predicting drug safety endpoints using SAR and QSAR approaches
includes confidential in-house systems (35, 36), freely available systems
(8, 37, 38), and several commercially available ones (12, 13,
39, 40, 67). The confidential in-house systems have the advantage of
being able to incorporate knowledge from private study data into a
model and to use algorithms that can be customized to better align
with business goals, work flow, and scientific processes. Moreover,
these systems are thought to be innovative, as there are fewer
restrictions on the design of the tools, and they clearly have an advantage
in the quantity and quality of training set data compared
with the public data sets that commercial systems often tend to offer.
15 Predictive Computational Toxicology to Support Drug Safety Assessment 347
Fig. 2. Hierarchy classification tree of a QSAR training set used to develop a public model for the Ames Salmonella
mutagenicity assay using Leadscope technology (Enterprise version 3.0). Tree shows the positive structural features
contained in the QSAR training set that are important components of the model. Negative features except for negative
parents of positive features were filtered out. Frequency is the number of compounds in the training set with that feature.
The precision of each feature indicates how well an alert candidate would work against the training set. Features with
significant z-scores and large total (+) loadings and zero (or near zero) total (−) loadings explain activity by themselves and
are candidates for use as alerts.
3. Discussion
Fig. 3. The importance of research and innovation of computational tools to support safety assessments during the drug
development process.
4. Summary
References
1. Redfern WS, Carlsson L, Davis AS, Lynch WG, MacKenzie I, Palethorpe S, Siegl PKS, Strang I, Sullivan AT, Wallis R, Camm AJ, Hammond TG (2003) Relationships between preclinical cardiac electrophysiology, clinical QT interval prolongation and torsade de pointes for a broad range of drugs: evidence for a provisional safety margin in drug development. Cardiovasc Res 58:32-45
2. Kola I, Landis J (2004) Can the pharmaceutical industry reduce attrition rates? Nat Rev Drug Discov 3:711-715
3. Mullard A (2011) 2010 FDA drug approvals. Nat Rev Drug Discov 10:82-85
4. Kola I (2008) The state of innovation in drug development. Clin Pharmacol Ther 83:227-230
5. FDA (2004) Challenge and opportunity on the critical path to new medical products. US Department of Health and Human Services, F (ed), FDA, Rockville. http://wcms.fda.gov/FDAgov/ScienceResearch/SpecialTopics/CriticalPathInitiative/CriticalPathOpportunitiesReports/ucm077262.htm
6. FDA (2011) FDA Critical Path Initiative. In: Critical path website, US Food and Drug Administration, Silver Spring. http://www.fda.gov/ScienceResearch/SpecialTopics/CriticalPathInitiative/default.htm
7. Arvidson KB, Chanderbhan R, Muldoon-Jacobs K, Mayer J, Ogungbesan A (2010) Regulatory use of computational toxicology tools and databases at the United States Food and Drug Administration's Office of Food Additive Safety. Expert Opin Drug Metab Toxicol 6:793-796
8. Yang C, Valerio LG Jr, Arvidson KB (2009) Computational toxicology approaches at the US Food and Drug Administration. Altern Lab Anim 37:523-531
9. Valerio LG Jr (2009) In silico toxicology for the pharmaceutical sciences. Toxicol Appl Pharmacol 241:356-370
10. Arvidson KB, Valerio LG, Diaz M, Chanderbhan RF (2008) In silico toxicological screening of natural products. Toxicol Mech Methods 18:229-242
11. Matthews EJ, Kruhlak NL, Benz RD, Contrera JF, Marchant CA, Yang C (2008) Combined Use of MC4PC, MDL-QSAR, BioEpisteme, Leadscope PDM, and Derek for Windows Software to Achieve High-Performance, High-Confidence, Mode of Action-Based Predictions of Chemical Carcinogenesis in Rodents. Toxicol Mech Methods 18:189-206
12. Contrera JF, Matthews EJ, Kruhlak NL, Benz RD (2008) In Silico Screening of Chemicals for Genetic Toxicity Using MDL-QSAR, Nonparametric Discriminant Analysis, E-State, Connectivity, and Molecular Property Descriptors. Toxicol Mech Methods 18:207-216
13. Valerio LG, Yang C, Arvidson KB, Kruhlak NL (2010) A structural feature-based computational approach for toxicology predictions. Expert Opin Drug Metab Toxicol 6:505-518
14. Arvidson KB (2008) FDA toxicity databases and real-time data entry. Toxicol Appl Pharmacol 233:17-19
15. FDA (2008) Draft guidance for industry: genotoxic and carcinogenic impurities in drug substances and products: recommended approaches. U.S. FDA/CDER, Silver Spring
16. ICH (2010) Final concept paper. M7: genotoxic impurities. In: International conference on harmonisation of technical requirements for registration of pharmaceuticals for human use. International Conference on Harmonisation, Geneva
17. Ashby J, Lefevre PA, Styles JA, Charlesworth J, Paton D (1982) Comparisons between carcinogenic potency and mutagenic potency to Salmonella in a series of derivatives of 4-dimethylaminoazobenzene (DAB). Mutat Res 93:67-81
18. Kazius J, McGuire R, Bursi R (2005) Derivation and validation of toxicophores for mutagenicity prediction. J Med Chem 48:312-320
19. Bailey AB, Chanderbhan R, Collazo-Braier N, Cheeseman MA, Twaroski ML (2005) The use of structure-activity relationship analysis in the food contact notification program. Regul Toxicol Pharmacol 42:225-235
20. Munro IC, Ford RA, Kennepohl E, Sprenger JG (1996) Correlation of structural class with no-observed-effect levels: a proposal for establishing a threshold of concern. Food Chem Toxicol 34:829-867
21. Benigni R, Bossa C (2008) Structure alerts for carcinogenicity, and the Salmonella assay system: a novel insight through the chemical relational databases technology. Mutat Res 659:248-261
22. Cramer GM, Ford RA, Hall RL (1978) Estimation of toxic hazard - a decision tree approach. Food Cosmet Toxicol 16:255-276
23. Worth A, Lapenna S, Lo Piparo E, Mostrag-Szlichtyng A, Serafimova R (2010) The applicability of software tools for genotoxicity and carcinogenicity prediction: case studies relevant to the assessment of pesticides. In: JRC scientific and technical reports. EC Joint Research Centre Institute for Health and Consumer Protection, Ispra, pp 18-19
24. Gasteiger J (2007) De novo design and synthetic accessibility. J Comput Aided Mol Des 21:307-309
25. Gasteiger J (2003) Physicochemical effects in the representation of molecular structures for drug designing. Mini Rev Med Chem 3:789-796
26. Naven RT, Louise-May S, Greene N (2010) The computational prediction of genotoxicity. Expert Opin Drug Metab Toxicol 6:797-807
27. Snyder RD (2009) An update on the genotoxicity and carcinogenicity of marketed pharmaceuticals with reference to in silico predictivity. Environ Mol Mutagen 50:435-450
28. Durham SK, Pearl GM (2001) Computational methods to predict drug safety liabilities. Curr Opin Drug Discov Devel 4:110-115
29. Snyder RD, Ewing DE, Hendry LB (2004) Evaluation of DNA intercalation potential of pharmaceuticals and other chemicals by cell-based and three-dimensional computational approaches. Environ Mol Mutagen 44:163-173
30. Contrera JF, Matthews EJ, Kruhlak NL, Benz RD (2005) In silico screening of chemicals for bacterial mutagenicity using electrotopological E-state indices and MDL QSAR software. Regul Toxicol Pharmacol 43:313-323
31. Myatt G, Cross KP, Valerio LG (2011) Supporting safety assessment of drug impurities through examination of an Ames assay QSAR model. In: SOT (ed) The toxicologist. Society of Toxicology, Washington, DC, p 1812
32. Benigni R, Bossa C (2008) Predictivity of QSAR. J Chem Inf Model 48:971-980
33. Boyer S (2009) The use of computer models in pharmaceutical safety evaluation. Altern Lab Anim 37:467-475
34. Ekins S, Andreyev S, Ryabov A, Kirillov E, Rakhmatulin EA, Sorokina S, Bugrim A, Nikolskaya T (2006) A combined approach to drug metabolism and toxicity assessment. Drug Metab Dispos 34:495-503
35. Bercu JP, Morton SM, Deahl JT, Gombar VK, Callis CM, van Lier RB (2010) In silico approaches to predicting cancer potency for risk assessment of genotoxic impurities in drug substances. Regul Toxicol Pharmacol 57(2-3):300-306
36. Boyer S (2010) The use of computer models in pharmaceutical safety evaluation. Altern Lab Anim 37:467-475
37. Richard AM, Yang C, Judson RS (2008) Toxicity data informatics: supporting a New paradigm for toxicity prediction. Toxicol Mech Methods 18:103-118
38. Mostrag-Szlichtyng A, Zaldivar Comenges JM, Worth AP (2010) Computational toxicology at the European Commission's Joint Research Centre. Expert Opin Drug Metab Toxicol 6:785-792
39. Saiakhov RD, Klopman G (2008) MultiCASE Expert Systems and the REACH Initiative. Toxicol Mech Methods 18:159-175
40. Marchant CA, Briggs KA, Long A (2008) In silico tools for sharing data and knowledge on toxicity and metabolism: derek for windows, meteor, and vitic. Toxicol Mech Methods 18:177-187
41. Jaworska JS, Comber M, Auer C, Van Leeuwen CJ (2003) Summary of a workshop on regulatory acceptance of (Q)SARs for human health and environmental endpoints. Environ Health Perspect 111:1358-1360
42. Fjodorova N, Novich M, Vrachko M, Smirnov V, Kharchevnikova N, Zholdakova Z, Novikov S, Skvortsova N, Filimonov D, Poroikov V, Benfenati E (2008) Directions in QSAR modeling for regulatory uses in OECD member countries, EU and in Russia. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev 26:201-236
43. Yang C, Hasselgren CH, Boyer S, Arvidson K, Aveston S, Dierkes P, Benigni R, Benz RD, Contrera J, Kruhlak NL, Matthews EJ, Han X, Jaworska J, Kemper RA, Rathman JF, Richard AM (2008) Understanding genetic toxicity through data mining: the process of building knowledge by integrating multiple genetic toxicity databases. Toxicol Mech Methods 18:277-295
44. Benigni R, Zito R (2003) Designing safer drugs: (Q)SAR-based identification of mutagens and carcinogens. Curr Top Med Chem 3:1289-1300
45. Matthews EJ, Kruhlak NL, Cimino MC, Benz RD, Contrera JF (2006) An analysis of genetic toxicity, reproductive and developmental toxicity, and carcinogenicity data: I. Identification of carcinogens using surrogate endpoints. Regul Toxicol Pharmacol 44:83-96
46. Contrera JF, Kruhlak NL, Matthews EJ, Benz RD (2007) Comparison of MC4PC and MDL-QSAR rodent carcinogenicity predictions and the enhancement of predictive performance by combining QSAR models. Regul Toxicol Pharmacol 49:172-182
47. Benigni R, Bossa C (2008) Predictivity and reliability of QSAR models: the case of mutagens and carcinogens. Toxicol Mech Methods 18:137-147
354 L.G. Valerio Jr.
48. Benfenati E, Benigni R, Demarini DM, Helma C, Kirkland D, Martin TM, Mazzatorta P, Ouedraogo-Arras G, Richard AM, Schilter B, Schoonen WG, Snyder RD, Yang C (2009) Predictive models for carcinogenicity and mutagenicity: frameworks, state-of-the-art, and perspectives. J Environ Sci Health C Environ Carcinog Ecotoxicol Rev 27:57-90
49. Di Carlo FJ (1990) Structure-activity relationships (SAR) and structure-metabolism relationships (SMR) affecting the teratogenicity of carboxylic acids. Drug Metab Rev 22:411-449
50. Pearl GM, Livingston-Carr S, Durham SK (2001) Integration of computational analysis as a sentinel tool in toxicological assessments. Curr Top Med Chem 1:247-255
51. Matthews EJ, Kruhlak NL, Daniel Benz R, Ivanov J, Klopman G, Contrera JF (2007) A comprehensive model for reproductive and developmental toxicity hazard identification: II. Construction of QSAR models to predict activities of untested chemicals. Regul Toxicol Pharmacol 47:136-155
52. Pery AR, Desmots S, Mombelli E (2010) Substance-tailored testing strategies in toxicology: an in silico methodology based on QSAR modeling of toxicological thresholds and Monte Carlo simulations of toxicological testing. Regul Toxicol Pharmacol 56:82-92
53. Glowienke S, Hasselgren C (2011) Use of structure activity relationship (SAR) evaluation as a critical tool in the evaluation of the genotoxic potential of impurities. In: Teasdale A (ed) Genotoxic impurities: strategies for identification and control. Wiley, Hoboken, pp 97-120
54. Arvidson K, McCarthy A, Yang C, Hristozov D (2011) Design and development of an institutional knowledgebase at FDA's Center for Food Safety and Applied Nutrition. In: SOT (ed) The toxicologist. Society of Toxicology, Washington, DC, p 155
55. Matthews EJ, Ursem CJ, Kruhlak NL, Daniel Benz R, Sabate DA, Yang C, Klopman G, Contrera JF (2009) Identification of structure-activity relationships for adverse effects of pharmaceuticals in humans: B. Use of (Q)SAR systems for early detection of drug-induced hepatobiliary and urinary tract toxicities. Regul Toxicol Pharmacol 54(1):23-42
56. Ursem CJ, Kruhlak NL, Contrera JF, MacLaughlin PM, Benz RD, Matthews EJ (2009) Identification of structure-activity relationships for adverse effects of pharmaceuticals in humans. Part A: use of FDA post-market reports to create a database of hepatobiliary and urinary tract toxicities. Regul Toxicol Pharmacol 54:1-22
57. Matthews EJ, Frid AA (2010) Prediction of drug-related cardiac adverse effects in humans - A: creation of a database of effects and identification of factors affecting their occurrence. Regul Toxicol Pharmacol 56:247-275
58. Keiser MJ, Setola V, Irwin JJ, Laggner C, Abbas AI, Hufeisen SJ, Jensen NH, Kuijer MB, Matos RC, Tran TB, Whaley R, Glennon RA, Hert J, Thomas KLH, Edwards DD, Shoichet BK, Roth BL (2009) Predicting new molecular targets for known drugs. Nature 462:175-181
59. Berger SI, Ma'ayan A, Iyengar R (2010) Systems pharmacology of arrhythmias. Sci Signal 3:ra30
60. Rodgers AD, Zhu H, Fourches D, Rusyn I, Tropsha A (2010) Modeling liver-related adverse effects of drugs using k-nearest neighbor quantitative structure-activity relationship method. Chem Res Toxicol 23:724-732
61. Franke R, Gruska A, Bossa C, Benigni R (2010) QSARs of aromatic amines: identification of potent carcinogens. Mutat Res 691:27-40
62. NRC (2007) Toxicity testing in the 21st century: a vision and a strategy. National Research Council of the National Academies, Washington, DC
63. Valerio LG Jr, Cross KP (2012) Characterization and validation of an in silico toxicology model to predict the mutagenic potential of drug impurities. Toxicol Appl Pharmacol 260:209-221
64. Valerio LG Jr, Dixit A, Moghaddam S, Mora O, Prous J, Valencia A (2012) QSAR Modeling for the mutagenic potential of drug impurities with Symmetry. Suppl. Toxicol. Sci. 2906, 437
65. Nigsch F, Lounkine E, McCarren P, Cornett B, Glick M, Azzaoui K, Urban L, Marc P, Muller A, Hahne F, Heard DJ, Jenkins JL (2011) Computational methods for early predictive safety assessment from biological and chemical data. Expert Opin Drug Metab Toxicol 7:1497-1511
66. Valerio LG Jr (2012) Application of advanced in silico methods for predictive modeling and information integration. Expert Opin Drug Metab Toxicol 8:395-398
67. Myshkin E, Brennan R, Khasanova T, Sitnik T, Serebriyskaya T, Litvinova E, Guryanov A, Nikolsky Y, Nikolskaya T, Bureeva S (2012) Prediction of organ toxicity endpoints by QSAR modeling based on precise chemical-histopathology annotations. DOI: 10.1111/j.1747-0285.2012.01411.x
Part V
Abstract
Comprehensive gene expression analysis has been applied to investigate the molecular mechanisms of
toxicity, an approach generally known as toxicogenomics (TGx). When analyzing large-scale gene expression
data obtained by microarray analysis, typical multivariate data analysis methods performed with commercial
software, such as hierarchical clustering or principal component analysis, usually do not provide conclusive
outputs by themselves. To best utilize TGx data for toxicity evaluation in the drug development process,
fit-for-purpose customization of the analytical algorithm, with a user-friendly interface and intuitive outputs,
is required to address toxicologists' practical demands. However, commercial software is usually not
very flexible in the customization of its functions or outputs. Owing to the recent advancement and
accumulation of open-source software contributed by bioinformaticians all over the world, it has become
easier to develop practical, fit-for-purpose analytical software in-house at fairly low cost and
effort. The aim of this article is to present an example of developing an automated TGx data processing
system (ATP system), which implements gene set-level toxicogenomic profiling by the D-score method
and generates straightforward output that makes it easy to interpret the biological and toxicological
significance of the TGx data. Our example provides basic clues for readers to develop and customize
their own TGx data analysis system that complements the functions of existing commercial software.
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_16, © Springer Science+Business Media, LLC 2013
358 T. Hirai and N. Kiyosawa
[Fig. 1 graphic: excerpt of a GeneChip data table (columns: Affy probe ID, Gene Title, Gene Symbol, Entrez Gene,
Group1_ratio, Group2_ratio, Group3_ratio) in which probe sets are grouped into Gene set #1 through Gene set #N for
D-score calculation. GeneChip data: >30,000 probe sets; N gene sets; N scores.]
Fig. 1. Concept of D-score analysis. Affymetrix Rat 230 2.0 GeneChip array contains >30,000 probe sets. Users are
required to define gene sets whose expression levels are associated with certain biological or toxicological pathways
(in this figure a total of N gene sets are defined). Gene expression data contained in the predefined gene sets were used for
the calculation of D-score, which represents the general tendency of expression changes of certain gene sets (2). During
this process, the dimension of the data was reduced from >30,000 to N, and therefore it is easier for toxicologists to
understand the overall profile of TGx data.
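The dimension reduction described in Fig. 1 can be sketched in a few lines of R. The block below is only an illustration under stated assumptions: the ratios are the Group1 to Group3 values of six probe sets shown in the figure, the two gene sets and their names are hypothetical, and the score is a plain mean of log2 ratios used as a stand-in for the published D-score (2), which also takes the Detection Call into account.

```r
# Expression ratios for six probe sets, taken from rows of Fig. 1;
# rows are gene symbols, columns are treatment groups.
ratios <- rbind(
  Hmgcs1  = c(1.1, 1.1, 1.7),
  Cyp51   = c(1.1, 1.0, 2.4),
  Mvk     = c(1.2, 1.2, 2.1),
  Nupr1   = c(0.9, 1.8, 1.1),
  Slc22a5 = c(1.0, 2.1, 1.3),
  Igfbp1  = c(1.0, 5.3, 0.8)
)
colnames(ratios) <- c("Group1", "Group2", "Group3")

# Hypothetical gene sets (the grouping is an assumption for illustration).
gene_sets <- list(
  sterol_synthesis = c("Hmgcs1", "Cyp51", "Mvk"),
  stress_response  = c("Nupr1", "Slc22a5", "Igfbp1")
)

# Stand-in score: mean log2 ratio of the member genes, per group.
# This is NOT the published D-score formula.
set_score <- function(ratios, genes) {
  colMeans(log2(ratios[genes, , drop = FALSE]))
}

# One row per gene set, one column per group: 6 probe sets become 2 scores.
scores <- t(sapply(gene_sets, set_score, ratios = ratios))
print(round(scores, 2))
```

With real data the same pattern applies: however many probe sets the array carries, the table handed to the toxicologist has one row per gene set.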
2. Open-Source Software
The analytical environment described in the present article was developed and implemented on the
Windows XP operating system. We developed a computational workflow using KNIME software that
automates the TGx data analysis procedures, including data normalization, statistical analysis, QC
analysis, and D-score calculation, all of which were coded in the statistical software R. Although
this approach may compromise computational performance and lengthen the analysis, the KNIME workflow
makes the logical flow of TGx data processing easy to understand for users who are not experts in
programming. Users can therefore easily modify the nodes in the workflow when they want to try an
alternative algorithm for the analysis. The following open-source software and resources were
used in our system:
• R (http://cran.r-project.org/) is one of the most widely used statistical software environments in
  the molecular biology community (3). We chose R partly because of its active user community: a number
  of books and online tutorials are available to help users solve the problems they encounter, and
  the software is updated very frequently. We used R-2.11.1 to develop this software.
• The Comprehensive R Archive Network (CRAN) (http://cran.r-project.org/) archives a vast number of
  high-quality, freely available user-contributed packages, including graphics packages. The CRAN
  packages used in the present article are listed below. Users can download these packages from the
  CRAN archives and install them from within R by calling the install.packages function.
  - ggplot2 (http://cran.r-project.org/web/packages/ggplot2/index.html) is a plotting system for R.
    It provides a powerful model of graphics that makes it easy to produce complex multilayered
    graphics (e.g., heat maps).
  - plotrix (http://cran.r-project.org/web/packages/plotrix/index.html) provides various plotting
    functions. We use its radial.plot command to generate the radar chart.
• The Bioconductor (http://www.bioconductor.org/) project develops and archives R-based bioinformatics
  software for analyzing transcriptome, proteome, and metabolome data obtained by microarray,
  LC/GC-mass spectrometry, and other techniques (6); the software runs on Linux, Windows, and Mac
  OS. The following is the list of R packages we used in the
16 Developing a Practical Toxicogenomics Data Analysis System Utilizing. . . 361
3. Methods
3.1. Gene Expression Data Sets
Affymetrix GeneChip microarrays were used to obtain the TGx data sets. Numerical gene expression
data were generated from the scanned GeneChip data using the MAS5 algorithm (10) as implemented in
GCOS software (Affymetrix). For each probe set, the MAS5 algorithm generates a Signal (a numerical
value that represents the gene expression level) and a Detection Call (Presence, Marginal, or
Absence Call, which indicates the reliability of the Signal value); both are required for D-score
calculation. Although we use GeneChip data as an example in the present article, TGx data obtained
from any other microarray platform may also be used, provided that the platform generates
appropriate indices for the gene expression level (i.e., Signal in GeneChip) and for the
reliability of the data (i.e., Detection Call in GeneChip).
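As a rough illustration of the two indices, the sketch below derives a signal log ratio and a Presence Call ratio for one probe set from MAS5-style output. The numbers, the helper names `signal_log_ratio` and `presence_ratio`, and the exact definitions (log2 of the ratio of group mean Signals; fraction of arrays called Present) are assumptions for illustration and may differ from the definitions used in the ATP system.

```r
# Hypothetical MAS5 output for one probe set: three control and three
# treated arrays, each with a Signal value and a Detection Call
# (P = Presence, M = Marginal, A = Absence).
control <- data.frame(signal = c(100, 120, 110), call = c("P", "P", "M"),
                      stringsAsFactors = FALSE)
treated <- data.frame(signal = c(410, 450, 460), call = c("P", "P", "P"),
                      stringsAsFactors = FALSE)

# Assumed definition: log2 ratio of the mean treated Signal to the mean
# control Signal.
signal_log_ratio <- function(treated_signal, control_signal) {
  log2(mean(treated_signal) / mean(control_signal))
}

# Assumed definition: fraction of arrays whose Detection Call is Presence.
presence_ratio <- function(calls) {
  mean(calls == "P")
}

slr <- signal_log_ratio(treated$signal, control$signal)
pr  <- presence_ratio(c(control$call, treated$call))
cat("SLR:", round(slr, 2), " Presence ratio:", round(pr, 2), "\n")
```

Any platform that yields an expression index and a reliability index per probe set could feed the same two helpers.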
3.2. Preparation of Gene Sets
To perform the D-score analysis, users must predefine gene sets whose expression levels are closely
associated with certain biological pathways or toxicity endpoints. Such gene sets can be prepared
either from the published literature or from experiment-based information (4). Gene set information
can also be obtained from public databases of biological pathways such as Gene Ontology (11),
KEGG (12), BioCarta (http://www.biocarta.com/genes/allPathways.asp), or GenMAPP (13). Notably, the
authors found that the Molecular Signatures Database (MSigDB), available on the Gene Set Enrichment
Analysis (GSEA) Web site (http://www.broadinstitute.org/gsea/index.jsp), is an excellent resource
for gene sets such as those for biological pathways or targets of transcription factors, which are
either derived from public pathway databases or computationally determined by the GSEA algorithm
(14). Although gene set information obtained from public databases is useful as it is, the
directions of the gene expression changes (i.e., upregulation or downregulation) induced by chemical
treatment are usually not uniform within a gene set, which compromises the performance of the
D-score calculation. To obtain the best results from D-score analysis, users should carefully
classify the genes into upregulated and downregulated ones for each biological/toxicological pathway.
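One way to record the direction of change is to store each pathway as a pair of probe set lists, one for upregulated and one for downregulated genes. The sketch below assumes hypothetical log2 ratios from a reference experiment and a hypothetical helper `split_by_direction`; the probe set IDs appear in Fig. 1, but their assignment to a pathway, the values, and the cutoff are illustrative only.

```r
# Hypothetical reference log2 ratios for probe sets in one candidate pathway.
reference_slr <- c("1367979_s_at" = 1.3, "1367932_at" = 0.8,
                   "1368160_at" = -0.9, "1368251_at" = 0.1)

# Split a candidate gene set into upregulated and downregulated members.
# Probe sets whose absolute change is below `cutoff` are left out
# (the cutoff value is an assumption, not taken from the chapter).
split_by_direction <- function(slr, cutoff = 0.5) {
  list(
    up   = names(slr)[slr >=  cutoff],
    down = names(slr)[slr <= -cutoff]
  )
}

pathway <- split_by_direction(reference_slr)
str(pathway)
```

Keeping the two directions as separate sets means each can be scored on its own, so opposing changes do not cancel out within one pathway.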
[Fig. 2 graphic: flowchart with the boxes "Data pre-processing (normalization, signal log ratio, Presence Call ratio,
etc.)", "Prepare gene sets whose expression levels are associated with certain biological/toxicological pathways",
and "Calculation of D-scores for all the gene sets".]
Fig. 2. Overview of the automated toxicogenomics data processing system (ATP system). Users can customize the content of
the gene sets based on their own research interests. The ATP system helps users organize the gene sets, run D-score
calculations after importing the TGx data, and summarize the results in an HTML format for efficient interpretation of
their biological and toxicological significance.
Fig. 3. KNIME workflow. Each node represents a program written in the statistical software R and is incorporated into a
workflow designed with KNIME software. In addition to D-score calculation, the ATP system includes functions for data
normalization, QC, and statistical analysis to generate a differentially expressed gene list. However, we will focus on
examples of D-score analysis in the present article.
Fig. 4. Preparation of data set object. The following three files are required for implementing the ATP system: (a) MAS5-
analyzed GeneChip data, (b) group definition file to define which column of the GeneChip data is assigned to specific
groups, and (c) comparison definition file to determine the combination of groups to be compared in the D-score
calculation.
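The relationship between the three inputs can be illustrated with small in-memory tables. The column names (`column`, `group`, `test`, `reference`), the helper `columns_for`, and all Signal values are hypothetical; the actual file layouts of the ATP system are not given here.

```r
# (a) Expression data: one row per probe set, one column per array
#     (Signal values are made up for illustration).
expr <- data.frame(
  probe = c("1367659_s_at", "1367667_at"),
  ctrl_1 = c(200, 150), ctrl_2 = c(220, 140),
  bb_2h_1 = c(400, 160), bb_2h_2 = c(440, 150),
  stringsAsFactors = FALSE
)

# (b) Group definition: which data column belongs to which group.
groups <- data.frame(
  column = c("ctrl_1", "ctrl_2", "bb_2h_1", "bb_2h_2"),
  group  = c("control", "control", "BB_2h", "BB_2h"),
  stringsAsFactors = FALSE
)

# (c) Comparison definition: which pairs of groups to compare.
comparisons <- data.frame(test = "BB_2h", reference = "control",
                          stringsAsFactors = FALSE)

# Look up the data columns assigned to a group.
columns_for <- function(group_name) {
  groups$column[groups$group == group_name]
}

# Resolve each comparison into a log2 ratio of group means per probe set.
for (k in seq_len(nrow(comparisons))) {
  test_cols <- columns_for(comparisons$test[k])
  ref_cols  <- columns_for(comparisons$reference[k])
  slr <- log2(rowMeans(expr[test_cols]) / rowMeans(expr[ref_cols]))
  names(slr) <- expr$probe
  print(round(slr, 2))
}
```

Separating the group and comparison definitions from the data lets the same expression file be reused for different contrasts without editing it.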
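[Figure graphic: index formula summed over the N probe sets of a gene set (i = 1 to N) and divided by N, with terms in
the signal log ratio (SLR_i) and the Presence Call ratio (PR_i); the operators did not survive extraction. Heat map
panels: gene probe sets; statistics (P-value). Horizontal axis: time after BB treatment (1, 2, 6, 12, 24 h).]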
Fig. 5. Summarization of the calculated D-score. The calculated D-score for each gene set is presented in an HTML format.
The D-scores are linked to heat maps of the corresponding expression changes, Presence/Absence Calls, and statistics
(P-values) of the individual genes. In addition, the HTML has links to the radar chart (Fig. 6) and the gene set-level
network presentation (Fig. 7).
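A summary page of this kind can be produced with base R alone. The following sketch writes a minimal HTML table of D-scores with one link per gene set; the function name `dscore_html`, the file naming scheme for the heat maps, and the scores are all hypothetical, and the real ATP output is richer.

```r
# Build a minimal HTML table from a named vector of D-scores.
# Each score links to a heat map image assumed to be named "<set>_heatmap.png".
dscore_html <- function(dscores) {
  rows <- sprintf(
    "<tr><td>%s</td><td><a href=\"%s_heatmap.png\">%.1f</a></td></tr>",
    names(dscores), names(dscores), dscores
  )
  c("<html><body><table>",
    "<tr><th>Gene set</th><th>D-score</th></tr>",
    rows,
    "</table></body></html>")
}

# Illustrative scores for two gene sets.
html <- dscore_html(c(Set1 = -25, Set2 = 70))

# Write the page to a temporary file and echo it to the console.
out_file <- file.path(tempdir(), "dscore_summary.html")
writeLines(html, out_file)
cat(html, sep = "\n")
```

Because the page is plain HTML, links to the radar chart or network images could be added as further table cells in the same way.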
a
# Illustrative D-scores for four gene sets
dscore.names <- c("Set1", "Set2", "Set3", "Set4")
dscore <- c(-25, 70, 5, 10)
names(dscore) <- dscore.names
library(plotrix)
# Draw the scores as a polygon (rp.type = "p") on a radar chart
radial.plot(dscore, labels = dscore.names, rp.type = "p",
            line.col = "blue",
            radial.lim = c(-100, 100))
b
[Radar chart graphic: 2 h after BB treatment; D-score = 66.9 (DNA damage)]
Fig. 6. Radar chart of calculated D-score. (a) Sample R code. (b) The calculated D-scores
are presented in a radar chart. The solid line indicates D-scores for 60 gene sets.
a
library("Rgraphviz")
# Define a directed graph whose nodes are gene sets
rEG <- new("graphNEL", nodes = c("AhR", "Cyp1a1",
    "Oxidative_stress"), edgemode = "directed")
rEG <- addEdge("AhR", "Cyp1a1", rEG, 1)
rEG <- addEdge("Cyp1a1", "Oxidative_stress", rEG, 1)
# Node fill colors; the color can be changed according to the D-score
nAttrs <- list(fillcolor = c(Cyp1a1 = "red"))
plot(rEG, nodeAttrs = nAttrs)
b
[Network graphic: nodes AhR, Cyp1a, and Oxidative stress]
Fig. 7. Gene set-level network. Rgraphviz was used to draw the gene set-level network. (a) Sample Rgraphviz code.
The network structure must be predefined by users. (b) Network structure for TGx profiling in the rat liver. The color of
each node represents its D-score (i.e., upregulation: orange and red; downregulation: blue), which helps users
understand which biological/toxicological pathways were affected by chemical treatment.
4. Practical Examples for the Evaluation of Drug-Induced Toxicity
Although the concept of the ATP system is quite simple, its output is considerably enriched with
toxicity-relevant information and is helpful for interpreting the biological/toxicological
significance of the TGx data, because the contents of the gene sets and the structures of the gene
set-level networks contain essential information. In this section, we present two practical
examples of how D-score analysis can be applied to the preclinical evaluation of drug-induced
toxicity using TGx data. The ATP system was used for D-score calculation, and the resulting output
was analyzed either by the ATP system or by commercial software. For presentation purposes, we used
TIBCO Spotfire software (http://spotfire.tibco.com/) to perform hierarchical clustering in Figs. 8
and 9.
4.1. Example 1: Identification of Genes and Pathways Associated with Drug-Induced Toxicity
The first example shows how to identify genes and biological pathways that were affected in rat
livers following treatment with 300 mg/kg bromobenzene (BBz), a representative hepatotoxicant that
causes oxidative stress following hepatic glutathione depletion (16). Livers were harvested at 1, 2,
6, 12, and 24 h after a single BBz treatment, and microarray analysis was performed on them using
the Affymetrix Rat 230 2.0 GeneChip. A total of 217 gene sets associated with specific biological
pathways registered in the BioCarta
[Heat map graphic. Rows, gene set (biological/toxicological pathway): Hematotoxicity, Cyp1a2, LXR targets, Cyp4a,
Cyp2c, Cyp3a, Cyp2b, Carcinogenesis, Aldo-keto reductase, ABC transporter, Proteasome, DNA damage, UGT, Oxidative
stress, Gst, Cyp1a1, Glutathione homeostasis. Columns, compounds: #3_high dose, #4, #8, #5, #9, #2, #6, #7,
#3_low dose, #1_low dose, #1_high dose. Color scale: D-score, -80 to 80.]
4.2. Example 2: Application of D-Score for Compound Screening
The second example shows how the D-score can be applied to screening for less toxic compounds in the
preclinical drug development process. Nine compounds that share the same pharmacological target were
administered to rats, and the liver TGx data were subjected to D-score calculation using a total of
12 gene sets, the contents of which are the same as previously published (17). The calculated
D-scores were then subjected to hierarchical clustering (Fig. 9). The heat map of D-scores indicates
the tendency of these compounds to induce oxidative stress and affect glutathione homeostasis in the
rat liver. Downregulation of Cyp1a2 gene expression was also noted, as was a
hematotoxicity-associated gene expression change. Figure 9 shows that the magnitude of these
molecular perturbations tends to be greater on the left side of the heat map, indicating that the
potential to induce them is higher for compounds located on the left side, such as #3 (high dose),
#4, and #8, whereas those located on the right side, such as #1 and #3 (low dose), are considered to
be less toxic. Interestingly, regarding
compound #1, D-score profile showed unique characteristics
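Hierarchical clustering of a D-score matrix does not require commercial software; base R provides `dist` and `hclust`. The sketch below uses a made-up matrix (rows are gene sets, columns are compounds), so the values, the compound names, the Euclidean distance, and the average linkage are illustrative assumptions rather than the settings used for Fig. 9.

```r
# Made-up D-scores: rows are gene sets, columns are compounds.
dscores <- matrix(
  c( 70,  65, 10, 5,
     60,  55,  5, 0,
    -40, -35, -5, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Oxidative stress", "Glutathione homeostasis", "Cyp1a2"),
                  c("cmpd_A", "cmpd_B", "cmpd_C", "cmpd_D"))
)

# Cluster compounds by their D-score profiles
# (Euclidean distance and average linkage are assumptions).
compound_tree <- hclust(dist(t(dscores)), method = "average")

# Cut the tree into two clusters to separate stronger from weaker responders.
clusters <- cutree(compound_tree, k = 2)
print(clusters)

# heatmap(dscores) would draw a clustered heat map in an interactive session.
```

In a screening setting, compounds that fall into the cluster with the smaller perturbations would be the candidates to carry forward.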
5. Conclusion
6. Notes
Acknowledgments
The authors would like to acknowledge the people who have contributed to developing the open-source
software and publicly available databases used herein. The authors are grateful to Dr. Kazumi
Ito, Dr. Takashi Yamoto, Kyoko Watanabe, Noriyo Niino, and Miyuki Kanbori of Medicinal Safety
Research Laboratories for their devotion to the TGx research activity in Daiichi Sankyo. The authors
also thank Drs. Masatoshi Nishimura, Koichi Tazaki, and Kazuhiko Mori for their productive
discussion and advice, and Drs. Atsushi Sanbuissho, Yuichi Kubo, Hideyuki Haruyama, and Sunao Manabe
for their continuous support of the toxicoinformatic research activity in Daiichi Sankyo.
References
1. Bass AS, Cartwright ME, Mahon C, Morrison R, Snyder R, McNamara P, Bradley P, Zhou YY, Hunter J (2009) Exploratory drug safety: a discovery strategy to reduce attrition in development. J Pharmacol Toxicol Methods 60:69-78
2. Kiyosawa N, Ando Y, Watanabe K, Niino N, Manabe S, Yamoto T (2009) Scoring multiple toxicological endpoints using a toxicogenomic database. Toxicol Lett 188:91-97
3. R Development Core Team (2008) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.R-project.org
4. Kiyosawa N, Ando Y, Manabe S, Yamoto T (2009) Toxicogenomic biomarkers for liver toxicity. J Toxicol Pathol 22:35-52
5. Grewal A, Lambert P, Stockton J (2007) Analysis of expression data: an overview. Curr Protoc Bioinformatics. Chapter 7, Unit 7.1
6. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80
7. Berthold MR, Borgelt C, Hoppner F, Klawonn F (2010) KNIME. In: Gries D, Schneider FB (eds) Guide to intelligent data analysis. Springer, London, pp 375-382
8. Gansner ER, North SC (1999) An open graph visualization system and its applications to software engineering. Softw Pract Exper 0:1-5
9. Carey VJ, Gentry J, Whalen E, Gentleman R (2005) Network structures and algorithms in Bioconductor. Bioinformatics 21:135-136
10. Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression analysis. Bioinformatics 18:1585-1592
11. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25-29
12. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M (1999) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 27:29-34
13. Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR (2002) GenMAPP, a new tool for viewing and analyzing microarray
Abstract
This unique overview of systems toxicology methods and techniques begins with a brief account of systems
thinking in biology over the last century. We discuss how systems biology and toxicology continue to
leverage advances in computational modeling, informatics, large-scale computing, and biotechnology.
Next, we chart the genesis of systems toxicology from previous work in physiologically based models,
models of early development, and more recently, molecular systems biology. For readers interested in
further details, this background provides useful linkages to the relevant literature. It also lays the foundations
for new ideas in systems toxicology that could translate laboratory measurements of molecular responses
from xenobiotic perturbations to adverse organ-level effects in humans. By providing innovative solutions
across disciplinary boundaries and highlighting key scientific gaps, we believe this chapter provides useful
information about the current state, and valuable insight about future directions in systems toxicology.
Key words: Systems toxicology, Cellular systems biology, Biological network inference, Agent-based
modeling, Virtual tissues, Dose–response modeling, In vitro to in vivo extrapolation
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_17, © Springer Science+Business Media, LLC 2013
376 J. Jack et al.
Fig. 1. The progression from chemical stimulus to adverse clinical outcome transcends multiple levels of biological
resolution. Computational systems-based approaches can help to unravel the mode of action or pathways to toxicity,
bringing information on genetic and molecular perturbations to the scale of whole organ and organism. Reproduced from
ref. 64 with permission from Taylor & Francis.
1.1. Background
The foundations of systems-level thinking are not unique to toxicology
and date back almost 50 years (2, 3). The trends in technological
advancements over the last few decades may provide
valuable lessons about the evolution of systems approaches in the
future. Although systems biology is generally considered a fruit
of the post-genomic revolution (4–6), complex systems in nature
have been studied for decades. Early cell-based models of morphogenesis (7),
genetic regulatory networks (8), and metabolic reactions (9, 10)
predate genome sequencing (11), and some even predate
the knowledge of the double-helical structure of DNA. These
systems models involve some of the same mathematical methods
in use today. However, due to the dearth of molecular data, they
were more phenomenological and provided limited insight into
molecular mechanisms. Post-genomic systems biology was driven
by key technological advancements in molecular profiling, data
storage, and computational capacity. Hence, it focused on reconstructing
large-scale molecular interaction networks, investigating
their static and dynamic behaviors, and translating these molecular
models to higher order cellular events (i.e., changes in cell
state and fate).
2. Systems Toxicology: Challenges
Bridging the molecular effects with adverse chemical-induced out-
comes is an important challenge for systems toxicology. While this
does require the ability to identify key molecular mechanisms, it is
equally important to recognize their role in a broader physiological
context. Systems biology research has produced valuable tools to
analyze molecular networks and we will discuss the additional cap-
abilities required for toxicology applications. First, it is important to
estimate the adverse effects of chemical exposure across different
doses and durations. Second, it is important to evaluate the contri-
bution of life-stages, genotypes and health states to individual out-
comes. Third, there are practical issues in using in vitro data to make
in vivo predictions, and to infer human effects from animal models.
This will require a new in silico and in vitro toolbox for uncovering
key linkages across biological organization from chemical exposure
to effects, and to estimate their dose-dependence. Lastly, living
systems are homeostatic, which means that they have the capacity
to maintain their internal state through physiological programs that
operate at different levels of organization. A recurring theme in
systems toxicology research is to provide insight into the design
principles underlying the homeostatic behavior of living systems,
and factors that lead to disease progression.
2.1. Defining the System
Defining the spatial and temporal bounds of the physiological
systems relevant for analyzing toxicity is important for designing
experiments and computational models. Based on our knowledge
of pathways that lead to toxicity, the early steps are generally molecular
events while the downstream events are usually organ-level
injury. The complex series of events between these two extremes
depends on a range of factors that can be very difficult to elucidate.
While chemicals interact with proteins on the order of seconds,
17 Systems Toxicology from Genes to Organs 379
Fig. 2. The liver lobule is heterogeneous with respect to blood flow, blood composition,
and cellular phenotypes. Zonated cellular function both maintains and emerges from this
complicated mix of gradients. Reproduced from ref. 65 with permission from John Wiley
and Sons.
2.3. Probing Systems: Multiresolution Data
Models are evaluated with regard to their ability to provide insight
while reproducing the phenomena being modeled. For this reason,
cell- and tissue-scale microscopy, long the tools of pathology, are
needed both for development and validation of models to simulate
in vivo context for the results of in vitro toxicity assays. Histopathology
images have long been used to obtain information on
microanatomic regions, vasculature, individual cells, cell types,
and cell phenotypes from two- and three-dimensional images.
Though such analysis has traditionally been time-intensive, advances in automated extraction
of information from histopathology images are making it
possible to analyze these images at single-cell resolution
(21–23). Additionally, it is possible to extract information about
the functional state of cells using cytomorphologic features or
molecular markers (24). In vivo tissues are three dimensional;
therefore, the ability to quantify cellular phenotypes as well as tissue
organization (e.g., ref. 13) is crucial for simulating in vivo context
for in vitro data.
In toxicology, the amount of chemical that is distributed to a
tissue or part of a tissue, i.e., tissue dosimetry, is often estimated
using physiologically based pharmacokinetic (PBPK) models. A
PBPK model consists of a system of ordinary differential equations
for the concentration of a compound (or compounds) in different
tissues. Typically some key tissues are treated as separate compart-
ments for which a homogeneous, well-mixed tissue-specific con-
centration is calculated, while other tissues are modeled in
aggregate as a single compartment (e.g., rapidly perfused tissues).
More complicated dynamics within a tissue, such as diffusion or
membrane transport, are often modeled with additional subcom-
partments with each subcompartment being well mixed. PBPK
models relate the concentration of compounds inhaled or ingested
from the environment to internal tissue doses (25–27).
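As a minimal sketch of such a system of ordinary differential equations, the following Python example integrates a flow-limited model with three well-mixed compartments (blood, liver, and a lumped remainder of the body) after an intravenous bolus. The compartment structure, the names and values in PARAMS, and the fixed-step Runge-Kutta integrator are illustrative assumptions, not values taken from any of the cited models.

```python
# Minimal flow-limited PBPK sketch; every parameter value is hypothetical.
PARAMS = {
    "V_blood": 5.0,    # compartment volumes (L)
    "V_liver": 1.8,
    "V_rest": 60.0,
    "Q_liver": 90.0,   # blood flows (L/h)
    "Q_rest": 210.0,
    "P_liver": 2.0,    # tissue:blood partition coefficients
    "P_rest": 1.5,
    "CL_int": 30.0,    # hepatic intrinsic clearance (L/h)
}


def pbpk_rhs(state, p):
    """Time derivatives for [C_blood, C_liver, C_rest, amount_eliminated]."""
    c_blood, c_liver, c_rest, _ = state
    # Blood leaving a well-mixed tissue is in equilibrium with that tissue.
    cv_liver = c_liver / p["P_liver"]
    cv_rest = c_rest / p["P_rest"]
    metabolized = p["CL_int"] * cv_liver  # amount per hour
    d_blood = (p["Q_liver"] * (cv_liver - c_blood)
               + p["Q_rest"] * (cv_rest - c_blood)) / p["V_blood"]
    d_liver = (p["Q_liver"] * (c_blood - cv_liver) - metabolized) / p["V_liver"]
    d_rest = p["Q_rest"] * (c_blood - cv_rest) / p["V_rest"]
    return [d_blood, d_liver, d_rest, metabolized]


def rk4_step(f, state, dt, p):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state, p)
    k2 = f([s + 0.5 * dt * k for s, k in zip(state, k1)], p)
    k3 = f([s + 0.5 * dt * k for s, k in zip(state, k2)], p)
    k4 = f([s + dt * k for s, k in zip(state, k3)], p)
    return [s + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]


def simulate(dose, p, t_end, dt=0.001):
    """Simulate an intravenous bolus (amount `dose`) for `t_end` hours."""
    state = [dose / p["V_blood"], 0.0, 0.0, 0.0]
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(pbpk_rhs, state, dt, p)
    return state


def total_amount(state, p):
    """Amount in all compartments plus the amount already eliminated."""
    c_blood, c_liver, c_rest, eliminated = state
    return (c_blood * p["V_blood"] + c_liver * p["V_liver"]
            + c_rest * p["V_rest"] + eliminated)


if __name__ == "__main__":
    final = simulate(100.0, PARAMS, 24.0)
    print("blood concentration at 24 h:", final[0])
    print("mass balance check:", total_amount(final, PARAMS))
```

Tracking the eliminated amount as a fourth state variable gives a simple mass-balance check: the sum returned by total_amount should stay equal to the administered dose throughout the simulation.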
3. Methods for Dynamic Systems Modeling
As we progress through the chapter, we start with some higher level
modeling approaches such as PBPK models and cell-based models
of morphogenesis. Then we discuss some of the mechanistically driven
modeling approaches, which have become more feasible with the
advent of new assaying technologies developed from the latter half of
the twentieth century to the present. These techniques employ a
better understanding of the underlying processes of cell state and
function, with molecular-level detail, to trace the effects of xenobiotic
perturbations and their potential tissue-level outcomes.
Finally, we show how modeling approaches can be combined to
explore toxicity, for instance by coupling a PBPK model with an
agent-based liver simulation.
3.1. Compartmental Models
For pharmacokinetic (PK) models, the available data are typically
chemical concentrations in tissues over time. If only serum data are
available, then often only an empirical (e.g., one- or two-compartment)
model can be built. If data from additional tissues are available, a PBPK
model may be built in which the actual tissue volumes and blood flows
are used along with the tissue concentrations to determine how
compounds partition into specific tissues. Tissue partitioning
reflects that the concentration of a chemical within the tissue may
be much higher or lower than the concentration of chemical in the
blood, perhaps due to sequestering within cells or tissue-specific
binding. Given heterogeneous tissue, chemical distribution within
a tissue is expected to be heterogeneous. Thus, in order to determine
the average concentration within a tissue using a minimal
number of samples, the entire tissue (e.g., the many deposits of
fat throughout the body or the many lobes of the liver) is typically
homogenized before analysis. In other words, pharmacokinetic
tissue samples are homogenized in recognition of heterogeneous
tissue distribution and not because the tissues are thought to be
homogeneous. While homogenization facilitates accounting for the
total amount of compound within the body, it obscures precise
estimation of the localization of the compound within the tissue.
For this reason, tissues within PBPK models are typically described
by a few well-mixed (i.e., homogeneous) compartments. Given the
available data, PK models can do a good job of describing the
absorption, distribution, metabolism, and excretion (ADME) of a
compound, but without additional data further predictive capacity
on biological effects is unattainable.
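To make the serum-only case concrete, the sketch below fits an empirical one-compartment model, C(t) = (dose / volume) * exp(-k_elim * t), to concentration-time data by least squares on the log-transformed concentrations. The function names, the synthetic data, and the choice of a log-linear fit are assumptions made for illustration; in practice other estimation methods are often preferred.

```python
import math


def one_compartment_conc(dose, volume, k_elim, t):
    """Serum concentration after an intravenous bolus, one compartment."""
    return dose / volume * math.exp(-k_elim * t)


def fit_log_linear(times, concs):
    """Least-squares line through (t, ln C).

    Returns (k_elim, c0), where the slope of the line is -k_elim and the
    intercept is ln(c0), the extrapolated concentration at time zero.
    """
    n = len(times)
    ys = [math.log(c) for c in concs]
    t_mean = sum(times) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Synthetic, noise-free serum data generated from hypothetical parameters.
    dose, volume, k_true = 100.0, 20.0, 0.3
    times = [0.5, 1.0, 2.0, 4.0, 8.0]
    concs = [one_compartment_conc(dose, volume, k_true, t) for t in times]
    k_hat, c0_hat = fit_log_linear(times, concs)
    print("estimated k_elim:", k_hat)
    print("apparent volume of distribution:", dose / c0_hat)
    print("half-life:", math.log(2.0) / k_hat)
```

The fitted parameters describe the time course in serum only; they carry no information about where in the body the compound resides, which is the limitation that motivates PBPK models.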
Biologically based dose–response (BBDR) models are often
used in conjunction with an empirical PK or PBPK model to predict
a biological, and often toxicological, response to a compound
within a tissue as a function of chemical exposure. The BBDR
models are only able to make additional predictions because they
have been built using additional data, either chemical-specific
experiments or prior knowledge about a biological process. If
chemical-specific information is available, an empirical model may
be calibrated to reproduce the results seen in experiments. Without
this information, a greater mechanistic understanding of the
homeostatic function of the biological processes that respond
to chemical concentration is needed in order to predict
dose–response.
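As one hedged illustration of the empirical route, a tissue concentration predicted by a PK or PBPK model can be passed through a calibrated dose-response function. The sketch below uses a Hill (sigmoid Emax) function, which is a common empirical choice but is an assumption here and not a form prescribed by the text; the parameter names e_max, ec50, and n are likewise hypothetical.

```python
def hill_response(conc, e_max, ec50, n=1.0):
    """Empirical Hill (sigmoid Emax) response to a tissue concentration."""
    if conc <= 0.0:
        return 0.0
    return e_max * conc ** n / (ec50 ** n + conc ** n)


def response_time_course(tissue_concs, e_max, ec50, n=1.0):
    """Map a simulated tissue concentration time course to a response."""
    return [hill_response(c, e_max, ec50, n) for c in tissue_concs]


if __name__ == "__main__":
    # Hypothetical tissue concentrations, e.g. from a PBPK simulation.
    tissue_concs = [0.0, 0.5, 2.0, 4.0, 1.0]
    print(response_time_course(tissue_concs, e_max=1.0, ec50=2.0, n=3.0))
```

Calibration would consist of choosing e_max, ec50, and n so that the predicted responses match chemical-specific experiments; without such data, the parameters have no mechanistic interpretation.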
The data needed to build a mechanistic understanding of
homeostasis for an arbitrary biological process sufficient to predict
the effects of a chemical perturbation on that process may exceed
what is currently known about the relevant biology. For this reason, systems-based
models in which simpler rules at a lower level produce the complex
phenomena (such as homeostasis) are attractive. If these simpler
rules can be validated, then predictions might be possible. Cell-based
models may offer deeper insight into mechanism.
3.2. Cell-Based Models
A diverse collection of models is described in the literature on
simulating dynamic cellular behavior. Cell-based models, recapitulating
temporal changes in cellular phenotypes, have been used to
investigate emergent phenomena: complex patterns or properties of
a system that cannot be derived simply from intuition and mechanism,
but are discovered through simulation. For example, in the
last 50 years, impressive strides have been made in cell-based models
of morphogenesis. These models were described mathematically
as Cellular Automata (CA) or Partial Differential Equations (PDE).
Indeed, the seminal work in the field of morphogenesis modeling
was encoded as a system of PDE (7).
There are important factors to consider when selecting a
cell-based modeling approach. While one could argue that PDE
have greater precision, they also require detailed information on
kinetics which is often unavailable. A recurring theme in systems
toxicology (and, more generally, systems biology) is the lack of
sufficient data to fit mechanistic models based on partial or ordinary
differential equations. Additionally, PDE are computationally more
intensive than CA. While both techniques offer their own unique
contributions to the field, we will discuss the advent and develop-
ment of models involving CA.
The early pioneering efforts in CA include the following: John
von Neumann's self-replicating robot (28), Norbert Wiener and
Arturo Rosenblueth's work on conduction in cardiac systems (29),
John Conway's Game of Life, and the work of Stephen Wolfram
(30). The beauty in these models lies in the emergence of complex
patterns from sets of simple rules. Indeed, a wide range of phenomena
have been modeled using CA. The models we highlight illustrate
the use of CA in simulating developmental cues relevant to
systems toxicology and virtual tissues applications, since they
encompass mechanistic information at the molecular level which
may be useful in understanding chemical perturbations.
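The emergence of patterns from simple rules can be seen in a few lines of code. The sketch below implements one update of Conway's Game of Life on an unbounded grid, storing only the coordinates of live cells; the function name life_step and the set-based representation are choices made for this illustration.

```python
from collections import Counter


def life_step(live):
    """Advance Conway's Game of Life by one generation.

    `live` is a set of (x, y) coordinates of live cells. A dead cell with
    exactly three live neighbors becomes live; a live cell with two or
    three live neighbors survives; every other cell is dead.
    """
    neighbor_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    return {
        cell
        for cell, count in neighbor_counts.items()
        if count == 3 or (count == 2 and cell in live)
    }


if __name__ == "__main__":
    # A "blinker": three cells in a row that oscillate with period two.
    cells = {(0, 1), (1, 1), (2, 1)}
    for generation in range(3):
        print(generation, sorted(cells))
        cells = life_step(cells)
```

Each cell responds only to its eight neighbors, yet oscillating and stationary structures arise at the level of the grid, which is the sense in which such models are described as emergent.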
In ref. 31, a CA model of cell differentiation is
described. Intracellular signaling is encoded as a genetic regulatory
network with binary gene activity states, based on an
earlier CA model for single gene networks (32). Intracellular
dynamics, cell division, and cell–cell interactions are juxtaposed in
the model to show the main features of biological morphogenesis. An
important feature of this model is the use of temporal evolution of
genetic networks, which highlights the mechanism underlying the
higher order cell processes. Another model (33) incorporates a
genetic switch for differentiation of tissue into substrate-depleting
vessels as well as the lateral inhibition of an autocatalytic morphogen.
The model is able to recapitulate interesting features of morphogenesis,
including dichotomous and lateral branching, blind
vessel ends, and closed loops from anastomosis. The model is
general enough to describe a variety of phenomena including leaf
veins, insect trachea, and neovascularization.
3.3. Molecular Systems Models
Molecular systems have been abstracted generally as graphical networks
in which the nodes represent molecular entities (e.g., genes
or proteins) and edges are interactions. A molecular interaction
network represents a hypothesis about the causal structure of the
underlying biological system, which is derived from available evi-
dence. Different computational methods can be used to simulate
the dynamic behavior of networks in response to perturbations,
such as chemical-induced receptor activity. These methods can be
classified along a number of dimensions. For instance, the values of
interacting nodes in the network can be represented as discrete or
continuous variables. The procedures for updating the state of the
nodes over time can be deterministic or nondeterministic (or sto-
chastic). We will describe two distinct modeling techniques which
illustrate these contrary modeling assumptions: metabolic pathways
and genetic regulatory networks.
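The discrete, deterministic corner of this classification can be sketched with a small Boolean network. In the hypothetical example below, two genes repress each other and all nodes are updated synchronously; the helper names step and find_attractor, and the network itself, are illustrative assumptions and do not reproduce any cited model.

```python
def step(state, rules):
    """Synchronously update every node of a Boolean network."""
    return tuple(rule(state) for rule in rules)


def find_attractor(state, rules):
    """Iterate from `state` until a state repeats; return the cycle reached.

    A cycle of length one is a fixed point (a stable expression pattern).
    """
    first_seen = {}
    trajectory = []
    while state not in first_seen:
        first_seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return trajectory[first_seen[state]:]


# Hypothetical toggle switch: gene A is on when B is off, and vice versa.
TOGGLE_RULES = [
    lambda s: int(not s[1]),  # next value of A
    lambda s: int(not s[0]),  # next value of B
]

if __name__ == "__main__":
    for start in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(start, "->", find_attractor(start, TOGGLE_RULES))
```

Replacing the synchronous update with a randomly chosen node per step would give a nondeterministic variant of the same network, and replacing the binary states with concentrations governed by rate equations would give the continuous case.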
A metabolic pathway can be described as a network in which
individual metabolites are the nodes and enzymatic reactions are
the edges. The relationship between the continuous concentrations
of metabolites can be expressed using the law of mass action as a
system of deterministic ODE. In many cases ODE must be solved
numerically but under certain equilibrium assumptions, the
Michaelis–Menten (10) closed form solution (for a single
3.4. Organ Modeling
By definition physiology varies from organ to organ, but in some
respects the approach to modeling any organ is standardized.
While the heart is perhaps the most thoroughly modeled (57, 58),
the liver draws more attention with respect to toxicology (59–64).
The liver is often the site of initial exposure to hazardous compounds
and their metabolites due to first-pass metabolism of blood
from the gastrointestinal tract via the hepatic portal vein. Although
mechanisms of chronic chemical-induced injury are not completely
understood, they are believed to involve multiscale molecular and cellular
interactions that culminate in tissue damage.
In a review of liver tissue simulation approaches, Ierapetritou
et al. (65) identified approaches ranging from systems of ODE
describing spatially homogeneous tissues (e.g., ref. 66) to high
dimensional models including fluid dynamics approaches based
upon approximations of the Navier–Stokes partial differential equations
(67). The more complicated the approach, the greater the
data and computational needs, especially given the convoluted and
dynamic cellular boundary of the sinusoidal spaces.
3.6. Multiscale Simulation
The challenges of multiscale simulation are easily seen in terms of
the liver. A mixed-media (e.g., air, food, water) whole-body exposure
to a compound leads to a concentration in the blood (possibly
via the gut) that flows into the liver, where it filters past hepatocytes,
potentially causing toxicity (including rearrangement of the vascular
flows) or induction of enzymes and transporters within the hepatocytes.
This in turn changes the rate at which the chemical is
eliminated from the blood, which in turn changes the amount
of whole-body exposure that would be needed to achieve a
similar effect the next time. The different scales interact, and
changes on one scale may manifest themselves as hysteresis on a
second scale; that is, memory of the earlier exposure may impact
the outcome of future exposures.
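A toy calculation can show how induction at the cellular scale creates memory at the whole-body scale. In the sketch below, a single blood compartment is cleared at a rate proportional to a relative enzyme level, and the enzyme is induced by the compound itself. The model structure, the rate constants, and the forward Euler integration are all assumptions chosen for illustration.

```python
def simulate_two_doses(dose=10.0, interval=24.0, dt=0.01,
                       k_cl=0.2, k_deg=0.05, i_max=5.0, ec50=1.0):
    """Give two identical bolus doses `interval` hours apart.

    Returns (AUC after dose 1, AUC after dose 2, enzyme level at dose 2).
    The enzyme level is expressed relative to its uninduced baseline of 1.
    """
    conc, enzyme = 0.0, 1.0
    aucs = []
    enzyme_at_second_dose = None
    steps = int(round(interval / dt))
    for dose_index in range(2):
        if dose_index == 1:
            enzyme_at_second_dose = enzyme
        conc += dose  # instantaneous bolus into the blood compartment
        auc = 0.0
        for _ in range(steps):
            # Enzyme synthesis rises with concentration and saturates.
            induction = 1.0 + i_max * conc / (ec50 + conc)
            d_conc = -k_cl * enzyme * conc
            d_enzyme = k_deg * (induction - enzyme)
            new_conc = conc + dt * d_conc
            auc += 0.5 * (conc + new_conc) * dt  # trapezoid rule
            enzyme += dt * d_enzyme
            conc = new_conc
        aucs.append(auc)
    return aucs[0], aucs[1], enzyme_at_second_dose


if __name__ == "__main__":
    auc1, auc2, enzyme = simulate_two_doses()
    print("AUC after first dose: ", auc1)
    print("AUC after second dose:", auc2)
    print("relative enzyme level at second dose:", enzyme)
```

Because the enzyme has not returned to baseline when the second dose arrives, the same external dose yields a smaller internal exposure (area under the curve) the second time; setting i_max to zero removes this memory.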
Many numerical models have been developed to assess the
impact of zonation on the metabolism of a compound. One of
the earliest models investigated differential distributions of
enzymes competing for the same compound (73, 74).
PBPK model subcompartments have been used to
model inhomogeneous induction of metabolic enzymes after
chemical exposure, e.g., CYP1A1/A2 by dioxin (75). That
model, which is illustrated in Fig. 3, describes the uneven P450
distribution; there were no cells, but the concentrations of compounds
in different, discrete zones of a lobule were modeled continuously
and could therefore be coupled to a PBPK model. Other
models have been developed with similar liver subcompartments
coupled to PBPK structures for modeling zonal heterogeneity
due to not only enzymes but also transporters (76, 77).
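The idea of zonal subcompartments can be sketched as a chain of well-mixed zones through which blood flows in series, each zone with its own clearance. The example below computes the steady-state concentration in each zone under linear (non-saturating) metabolism; the five clearance values and the function name are hypothetical and are not taken from the models cited above.

```python
def zonal_steady_state(c_in, flow, zone_clearances):
    """Steady-state concentrations in well-mixed liver zones in series.

    For each zone the mass balance is
        flow * (c_upstream - c_zone) = clearance * c_zone,
    so c_zone = flow * c_upstream / (flow + clearance).
    """
    concs = []
    c_upstream = c_in
    for clearance in zone_clearances:
        c_zone = flow * c_upstream / (flow + clearance)
        concs.append(c_zone)
        c_upstream = c_zone
    return concs


if __name__ == "__main__":
    # Hypothetical clearances (L/h) rising from periportal to perivenous.
    clearances = [5.0, 10.0, 20.0, 40.0, 80.0]
    concs = zonal_steady_state(c_in=1.0, flow=90.0, zone_clearances=clearances)
    for zone, (cl, c) in enumerate(zip(clearances, concs), start=1):
        print("zone", zone, "concentration", c, "metabolic rate", cl * c)
    print("hepatic extraction ratio:", 1.0 - concs[-1])
```

Because each zone sees the outflow of the zone upstream, cells near the outlet are exposed to lower concentrations than cells near the inlet, and the per-zone metabolic rates show where in the lobule the compound is actually transformed.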
Since there are many interacting in-flows and out-flows, solving
for blood flow through the sinusoids is nontrivial (67). This is
particularly true if one wishes to allow for changes to the geometry,
whether due to regeneration, tumor growth, or chemical
insult. Analytic mathematical descriptions rely on
regular, hexagonal approximations to the lobule structure in
order to obtain chemical disposition and effects (59, 75).
Fig. 3. The typically homogeneous compartments of a PBPK model were subdivided for the liver into five zones differing in
proximity from the central vein/portal triad. This allowed for zonal differences in induction of P450 enzymes as a function of
concentration in each subcompartment. Reproduced from ref. 75 with permission from Elsevier.
4. Methods for System Reconstruction
Dynamic systems models in biology are usually constructed based
on domain knowledge of the underlying mechanisms. The advent
of large-scale -omic profiling has produced an unprecedented
amount of data on the molecular state of living systems under
normal conditions, following chemical treatment, and for different
phenotypic outcomes. A variety of computational methods have
been developed to generate hypotheses about the putative molecu-
lar mechanisms of chemical-induced injury. In toxicology, useful
methods for system reconstruction should generate hypotheses
about the pathways relating molecular perturbations (e.g., from
-omic assays) to phenotypic effects (e.g., histological lesions). This
is relevant for understanding the mode of action (MOA) for toxic-
ity, and for quantitatively evaluating the role of key events in
dynamic systems models.
A number of bottom-up approaches have been used to recon-
struct maps of the metabolic, signaling and genetic regulatory
relationships in complex molecular interaction networks (referred
to as the interactome) in eukaryotes (79). Such methods have
also been applied to identify functionally relevant subnetworks
(called modules) from molecular profiles (80), and to link genes
to human diseases (81). It has been difficult to systematically
reverse-engineer the molecular network modules in cells, which
can explain their functional responses, using large-scale molecular
assays and publicly available data. In toxicology (and other areas of
disease research), this is motivating a top-down approach in which a
hypothesis-driven approach is used to relate phenotypic responses,
such as cellular outcomes and histologic effects, to the underlying
molecular pathways. Knowledge-based systems could offer a prac-
tical solution that encodes information on tissue-level effects, cel-
lular phenotypes, and molecular mechanisms, in order to support
hypothesis generation by top-down and bottom-up inference. The
mode of action (MOA) framework (82), for example, outlines the
main issues and describes an approach for synthesizing knowledge
by extensive evaluation of prior evidence about the events involved
in chemical-induced toxicity. This is a resource-intensive process
that results in a theory (hypothesis) about causal relationships
between molecular, cellular, and tissue-level effects.
5. Agent-Based Models of Tissues
Cell-oriented agent-based modeling (ABM) of tissues offers a
number of unique advantages (91, 92). First, since cells are the
functional units of tissues, the ABM has more physiologic relevance
than a continuum model. Second, the agent responses can be
calibrated and verified through comparison with actual cells
in vitro (or ex vivo). Third, spatial outcomes from the ABM can
be translated to histopathologic effects such as acute lesion and
tumor formation. While the agent-based strategy is suitable for
modeling tissue responses, the approaches to the liver taken so far
have not provided a framework for estimating tissue dosimetry.
One of the first cell-based approaches to understanding
liver-specific injury was made by Hunt et al. (59), where individual
hepatocytes were represented by independent agents within which
metabolism can occur. In this and other models the cells are understood
in terms of energy: they generally act to minimize energy
expenditure while stochastically selecting some less favorable
energy options with a probability that decreases the more energy
5.1. EPA Virtual Liver (v-Liver)
To provide simulated in vivo context for a suite of high-throughput
assays conducted on thousands of chemicals (i.e., the ToxCast (1)
and Tox21 projects), the US Environmental Protection Agency has
been developing a model to relate whole-body chemical exposures
to cell-scale concentrations within the hepatic lobule, and then
simulate the response of individual hepatocytes to chemical perturbations.
The goal of this Virtual Liver project is to provide a
framework for simulating a canonical lobule for extended periods of
time ranging from hours to months.
5.3. Cellular and Microvascular Architecture
In the hybrid model of Ohno et al. (61), cells are represented with
separate, but interacting, ODE models arrayed in a sequence from
periportal to centrilobular. This hybrid approach, with cells as agents and blood as
a continuum, was expanded by Wambaugh and Shah (63)
to allow the impact of sinusoidal geometry to be examined, as in
Hunt et al. (59). Figure 4 illustrates a hybrid model (63) in which
flow is simulated through sinusoidal segments in a given geometry,
allowing the response of individual hepatocytes (agents) to
depend on the microenvironment (chemical concentration or
nutrients) around each cell.
In order to quantitatively describe the gradient of any compound
within the sinusoids, several quantities must be known: the
amount flowing into the liver (requires a PBPK model, preferably calibrated
from time course data, but potentially predicted from partitioning
experiments or pure chemical descriptors (93)); the fraction of the
compound not bound to protein (requires an in vitro experiment
(94)); the rate of hepatic metabolism per enzyme (requires additional
in vitro experiments); and the distribution/induction of the
Fig. 4. Flow through a spatially heterogeneous vascular network was simulated using
mass-balanced hemodynamics equations in order to allow the discrete agents (hepato-
cytes) to be coupled into a continuous PBPK model. Inset: The spatial complexity was
reduced by treating small segments of sinusoidal space as well-mixed subcompartments,
similar to the Andersen et al. (75) approach but with a complex, graph-based structure.
Reproduced from ref. 63 with permission from the Public Library of Science (PLoS).
Fig. 5. Chemically induced expression of P450 metabolizing enzymes was compared between hepatocytes harvested from
rats (a) and predicted using an agent-based model (b). Perivenous (solid) and periportal (open) hepatocytes differed in
expression levels (i.e., displayed zonation). Three different concentrations (low: square, medium: triangle, high: circle)
were used. Both axes were manually converted from semiarbitrary units of the ABM to the lab units. Reproduced from ref.
62 with permission from Elsevier.
6. Conclusion
Disclaimer
References
1. Dix DJ, Houck KA, Martin MT et al (2007) The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci 95:5–12
2. Bertalanffy L (1957) Life, language, law: essays in honor of Arthur F Bentley. Antioch, Yellow Springs, OH
3. Bertalanffy L (1968) General systems theory: foundations, development, applications. George Braziller, New York, NY
4. Ideker T, Galitski T, Hood L (2001) A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet 2:343–372
5. Kitano H (2002) Systems biology: a brief overview. Science 295:1662–1664
6. Kitano H (2002) Computational systems biology. Nature 420:206–210
7. Turing AM (1952) The chemical basis of morphogenesis. Phil Trans Royal Soc Lond 237:37–72
8. Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22:437–467
9. Hill (1910) Proceedings of the physiological society: Jan 22 1910. J Physiol 40:i–vii
10. Michaelis L, Menten M (1913) Die Kinetik der Invertinwirkung. Biochem Z 49:333–369
11. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351
12. Costanzo M, Baryshnikova A, Bellay J et al (2010) The genetic landscape of a cell. Science 327:425–431
13. Teutsch HF, Schuerfeld D, Groezinger E (1999) Three-dimensional reconstruction of parenchymal units in the liver of the rat. Hepatology 29:494–505
14. Crawford AR, Lin X-Z, Crawford JM (1998) The normal adult human liver biopsy: a quantitative reference standard. Hepatology 28:323–331
15. Motta P, Porter KR (1974) Structure of rat liver sinusoids and associated tissue spaces as revealed by scanning electron microscopy. Cell Tissue Res 148:111–125
16. Taub R (2004) Liver regeneration: from myth to mechanism. Nat Rev Mol Cell Biol 5:836–847
17. Fausto N, Campbell JS, Riehle KJ (2006) Liver regeneration. Hepatology 43:S45–S53
18. Michalopoulos GK (2010) Liver regeneration after partial hepatectomy: critical analysis of mechanistic dilemmas. Am J Pathol 176:2–13
19. Katz NR (1992) Metabolic heterogeneity of hepatocytes across the liver acinus. J Nutr 122:843–849
20. Gumucio JJ (1989) Hepatocyte heterogeneity: the coming of age from the description of a biological curiosity to a partial understanding of its physiological meaning and regulation. Hepatology 9:154–160
21. Athelogou M, Schmidt G, Schape A et al (2007) Cognition network technology: a novel multimodal image analysis technique for automatic identification and quantification of biological image contents. In: Shorte SL, Frischknecht F (eds) Imaging cellular and molecular biological functions. Springer, Berlin, Heidelberg, pp 407–422
22. Roysam B, Ancin H, Bhattacharjya AK et al (1994) Algorithms for automated characterization of cell populations in thick specimens from 3-d confocal fluorescence microscopy data. J Microsc 173:115–126
23. Turner JN, Ancin H, Becker DE et al (1997) Automated image analysis technologies for biological 3d light microscopy. Int J Imag Sys Technol 8:240–254
24. Karacali B, Vamvakidou A, Tozeren A (2007) Automated recognition of cell phenotypes in histology images based on membrane- and nuclei-targeting biomarkers. BMC Med Imaging 7:7
25. Andersen ME, Clewell HJ, Frederick CB (1995) Applying simulation modeling to problems in toxicology and risk assessment: a short perspective. Toxicol Appl Pharmacol 133:181–187
26. Clark LH, Woodrow Setzer R, Barton HA (2004) Framework for evaluation of physiologically-based pharmacokinetic models for use in safety or risk assessment. Risk Anal 24:1697–1717
27. Clewell HJ III, Andersen ME, Barton HA (2002) A consistent approach for the application of pharmacokinetic modeling in cancer and noncancer risk assessment. Environ Health Perspect 110(1):85–93
28. von Neumann J (1966) Theory of self-reproducing automata. University of Illinois Press, Champaign, IL
29. Wiener N, Rosenblueth A (1946) The mathematical formulation of the problem of conduction of impulses in a network of connected excitable elements, specifically in cardiac muscle. Arch Inst Cardiol Mex 16:205–265
30. Wolfram S, Gad-el-Hak M (2003) A new kind of science. Appl Mech Rev 56:B18–B19
31. Silva HS, Martins ML (2003) A cellular automata model for cell differentiation. Physica A 322:555–566
32. de Sales JA, Martins ML, Stariolo DA (1997) Cellular automata model for gene networks. Phys Rev E 55:3262
33. Markus M, Bohm D, Schmick M (1999) Simulation of vessel morphogenesis using cellular automata. Math Biosci 156:191–206
34. Savill NJ, Hogeweg P (1997) Modelling morphogenesis: from single cells to crawling slugs. J Theor Biol 184:229–235
35. Glazier JA, Graner F (1993) Simulation of the differential adhesion driven rearrangement of biological cells. Phys Rev E 47:2128
36. Gillespie D (1976) A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. J Comput Phys 22:403–434
37. Gillespie DT (1977) Exact stochastic simulation of coupled chemical reactions. J Phys Chem 81:2340–2361
38. Gibson MA, Bruck J (2000) Efficient exact stochastic simulation of chemical systems with many species and many channels. J Phys Chem A 104:1876–1889
39. Shmulevich I, Dougherty ER, Zhang W (2002) Gene perturbation and intervention in probabilistic Boolean networks. Bioinformatics 18:1319–1331
40. Shmulevich I, Dougherty ER, Kim S et al
based uncertainty model for gene regulatory networks. Bioinformatics 18:261–274
41. Biggar SR, Crabtree GR (2001) Cell signaling can direct either binary or graded transcriptional responses. 20:3167–3176
42. Saez-Rodriguez J, Alexopoulos LG, Epperlein J et al (2009) Discrete logic modelling as a means to link protein signalling networks with functional analysis of mammalian signal transduction. Mol Syst Biol 5:331
43. Klamt S, Saez-Rodriguez J, Lindquist J et al (2006) A methodology for the structural and functional analysis of signaling and regulatory networks. BMC Bioinformatics 7:56
44. Jack J, Wambaugh J, Shah I (2011) Simulating quantitative cellular responses using asynchronous threshold Boolean network ensembles. BMC Systems Biology Accepted.
45. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19:524–531
46. Hucka M, Bergmann F, Hoops S et al (2010) The systems biology markup language (SBML): language specification for level 3 version 1 core (release 1 candidate). Nature Precedings
47. Cuellar A, Lloyd C, Nielsen P et al (2003) An overview of CellML 1.1, a biological model description language. Simulation 79:740–747
48. Lloyd C, Halstead M, Nielsen P (2004) CellML: its future, present and past. Model Cell Tissue Funct 85:433–450
49. Bergmann FT, Sauro HM (2006) SBW: a modular framework for systems biology. In: Proceedings of the 38th conference on winter simulation. Winter Simulation Conference, Monterey, California, pp 1637–1645
50. Hoops S, Sahle S, Gauges R et al (2006) COPASI: a complex pathway simulator. Bioinformatics 22:3067–3074
51. Paun A, Perez-Jimenez M, Romero-Campero F (2006) Modeling signal transduction using P systems. In: Hoogeboom H, Paun G, Rozenberg G, Salomaa A (eds) Membrane computing. Springer, Berlin, Heidelberg, pp 100–122
52. Manca V (2008) The metabolic algorithm for P systems: principles and applications. Theor Comput Sci 404:142–155
53. Jack J, Paun A (2009) Discrete modeling of biochemical signaling with memory enhancement. In: Priami C, Back RJ, Petre I (eds) Transactions on computational systems biology XI. Springer, Berlin, Heidelberg,
(2002) Probabilistic boolean networks: a rule- pp 200215
396 J. Jack et al.
54. Jack J, Paun A, Rodrguez-Paton A (2010) A 69. West GB, Brown JH, Enquist BJ (1999) The
review of the nondeterministic waiting time fourth dimension of life: fractal geometry and
algorithm. Nat Comput 111 allometric scaling of organisms. Science
55. Priami C, Regev A, Shapiro E et al (2001) 284:16771679
Application of a stochastic name-passing calcu- 70. Baish JW, Jain RK (2000) Fractals and cancer.
lus to representation and simulation of molec- Cancer Res 60:36833688
ular processes. Inf Process Lett 80:2531 71. Di Ieva A, Grizzi F, Gaetani P et al (2008)
56. Curti M, Degano P, Priami C et al (2004) Euclidean and fractal geometry of microvascu-
Modelling biochemical pathways through lar networks in normal and neoplastic pituitary
enhanced [pi]-calculus. Theor Comput Sci tissue. Neurosurg Rev 31:271281
325:111140 72. Cross SS (1997) Fractals in pathology. J Pathol
57. Nickerson DP, Hunter PJ (2005) The noble 182:18
cardiac ventricular electrophysiology 73. Pang KS (1983) The effect of intercellular dis-
models in cellml. Prog Biophys Mol Biol tribution of drug-metabolizing enzymes on the
90:346359 kinetics of stable metabolite formation and
58. Bassingthwaighte J, Hunter P, Noble D (2009) elimination by liver: first-pass effects. Drug
The cardiac physiome: perspectives for the Metab Rev 14:6176
future. Exp Physiol 94:597605 74. Pang KS, Stillwell RN (1983) An understand-
59. Hunt CA, Yan L, Ropella G et al (2007) The ing of the role of enzyme localization of the liver
multiscale in silico liver. J Crit Care on metabolite kinetics: a computer simulation.
22:348349 J Pharmacokinet Pharmacodyn 11:451468
60. Hohme S, Hengstler JG, Brulport M et al 75. Andersen ME, Eklund CR, Mills JJ et al (1997)
(2007) Mathematical modelling of liver regen- A multicompartment geometric model of the
eration after intoxication with ccl4. Chem Biol liver in relation to regional induction of cyto-
Interact 168:7493 chrome p450s. Toxicol Appl Pharmacol
61. Ohno H, Naito Y, Nakajima H et al (2008) 144:135144
Construction of a biological tissue model 76. Abu-Zahra TN, Pang KS (2000) Effect of
based on a single-cell model: a computer simu- zonal transport and metabolism on hepatic
lation of metabolic heterogeneity in the liver removal: enalapril hydrolysis in zonal, isolated
lobule. Artif Life 14:328 rat hepatocytes in vitro and correlation with
62. Sheikh-Bahaei S, Maher JJ, Anthony Hunt C perfusion data. Drug Metab Dispos
(2010) Computational experiments reveal 28:807813
plausible mechanisms for changing patterns of 77. Liu L, Pang KS (2006) An integrated approach
hepatic zonation of xenobiotic clearance and to model hepatic drug clearance. Eur J Pharm
hepatotoxicity. J Theor Biol 265:718733 Sci 29:215230
63. Wambaugh J, Shah I (2010) Simulating micro- 78. Basciano C, Kleinstreuer C, Kennedy A et al
dosimetry in a virtual hepatic lobule. PLoS (2010) Computer modeling of controlled
Comput Biol 6:e1000756 microsphere release and targeting in a repre-
64. Shah I, Wambaugh J (2010) Virtual tissues in sentative hepatic artery system. Ann Biomed
toxicology. J Toxicol Environ Health Eng 38:18621879
13:314328 79. Li S, Armstrong CM, Bertin N et al (2004) A
65. Lerapetritou MG, Georgopoulos PG, Roth map of the interactome network of the meta-
CM et al (2009) Tissue-level modeling of zoan C. elegans. Science 303:540543
xenobiotic metabolism in liver: an emerging 80. Bruggeman FJ, Westerhoff HV (2007) The
tool for enabling clinical translational research. nature of systems biology. Trends Microbiol
Clin Transl Sci 2:228237 15:4550
66. Rowland M, Benet LZ, Graham GG (1973) 81. Goh K-I, Cusick ME, Valle D et al (2007) The
Clearance concepts in pharmacokinetics. J human disease network. Proc Natl Acad Sci
Pharmacokinet Pharmacodyn 1:123136 104:86858690
67. Rani HP, Sheu TWH, Chang TM et al (2006) 82. Meek ME, Bucher JR, Cohen SM et al (2003)
Numerical investigation of non-newtonian A framework for human relevance analysis of
microcirculatory blood flow in hepatic lobule. information on carcinogenic modes of action.
J Biomech 39:551563 Crit Rev Toxicol 33:591653
68. West GB, Brown JH, Enquist BJ (1997) A 83. Karp PD (2001) Pathway databases: a case
general model for the origin of allometric scal- study in computational symbolic theories. Sci-
ing laws in biology. Science 276:122126 ence 293:20402044
17 Systems Toxicology from Genes to Organs 397
84. Gruber TR (1995) Toward principles for the 92. Merks RMH, Glazier JA (2005) A
design of ontologies used for knowledge shar- cell-centered approach to developmental biol-
ing. Int J Hum Comput Stud 43:907928 ogy. Physica A 352:113130
85. Demir E, Cary MP, Paley S et al (2010) The 93. Poulin P, Theil F-P (2002) Prediction of phar-
biopax community standard for pathway data macokinetics prior to in vivo studies. Ii.
sharing. Nat Biotechnol 28:935942 Generic physiologically based pharmacokinetic
86. de Jong H (2002) Modeling and simulation of models of drug disposition. J Pharm Sci
genetic regulatory systems: a literature review. 91:13581370
J Comput Biol 9:67103 94. Brian Houston J, Carlile DJ (1997) Prediction
87. Aldridge BB, Saez-Rodriguez J, Muhlich JL of hepatic clearance from microsomes, hepato-
et al (2009) Fuzzy logic analysis of kinase path- cytes, and liver slices. Drug Metab Rev
way crosstalk in tnf/egf/insulin-induced sig- 29:891922
naling. PLoS Comput Biol 5:e1000340 95. Naritomi Y, Terashita S, Kagayama A et al
88. Sachs K, Perez O, Peer D et al (2005) Causal (2003) Utility of hepatocytes in predicting
protein-signaling networks derived from multi- drug metabolism: comparison of hepatic
parameter single-cell data. Science 308:523529 intrinsic clearance in rats and humans
89. Noy NF, Shah NH, Whetzel PL et al (2009) in vivo and in vitro. Drug Metab Dispos
Bioportal: ontologies and integrated data 31:580588
resources at the click of a mouse. Nucleic 96. Santostefano MJ, Richardson VM, Walker NJ
Acids Res 37:W170W173 et al (1999) Dose-dependent localization of
90. Kim J-D, Ohta T, Tateisi Y et al (2003) Genia tcdd in isolated centrilobular and periportal
corpusa semantically annotated corpus for hepatocytes. Toxicol Sci 52:919
bio-textmining. Bioinformatics 19:i180i182 97. Collins FS, Gray GM, Bucher JR (2008) Trans-
91. Noble D (2006) The music of life: biology forming environmental health protection. Sci-
beyond the genome, Oxford University Press ence 319:906907
Chapter 18
Abstract
Software agents are particularly suitable for engineering models and simulations of cellular systems. In a
very natural and intuitive manner, individual software components are delegated to reproduce in
silico the behavior of individual components of living systems at a given level of resolution. The actions
of individuals and the interactions among them allow complex collective behavior to emerge. In this chapter we
first introduce the reader to software agents and multi-agent systems and review the evolution of
agent-based modeling of biomolecular systems over the last decade. We then describe the main tools, platforms, and
methodologies available for programming societies of agents, including toolkits that do not
require advanced programming skills.
Key words: Agent-based modeling and simulation, Agent-oriented software engineering, Behavioral
models of biological systems, Multi-agent systems
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_18, © Springer Science+Business Media, LLC 2013
400 N. Cannata et al.
1.1. Software Agents and Multi-agent Systems
Complex software systems can be designed around autonomous
interacting components. Robin Milner suggested calling each of the
several parts of a software system an "agent," with its own identity,
which persists through time (1). The name "agent" derives from the
Latin agere, "to act." A software agent is, by definition, a piece of
software that acts for a user or for another program in a relationship
of agency. We can think of this metaphorically as some kind of agreement
to act on one's behalf. Such action "on behalf of" implies that the
agent has the authority to decide which action (if any) is appropriate.
Wooldridge defined an agent as a computer system situated in
some environment and capable of autonomous, flexible action in
that environment in order to meet its design objectives (2).
The autonomy property (autonomous agent) implies that the
agent has control over its internal state and over its own
behavior. A comparison between agent-based programming and
object-oriented programming (OOP) helps explain this property.
Both methodologies allow an abstraction of an internal state to be
defined. In OOP, the state encapsulated in an object can be
controlled by the software entity that controls the object and
invokes its methods. An agent, instead, has full control over the
state it encapsulates.
The situatedness property (situated agent) implies that the
agent perceives its environment through "sensors" and is able
to act in the environment through "actuators." Usually sensors
and actuators are software components, but we can also think of robotic
agents in which they are physical devices. The environment
in which the agent is situated is typically dynamic, likely open,
unpredictable, and populated by other agents (i.e., multi-agent).
The flexibility property (flexible agent) can take
different forms. A reactive agent responds in a timely fashion to
environmental change. An adaptive agent responds to environmental
change according to its internal state. A proactive agent acts in
anticipation of future goals.
An additional property of agents is mobility. A mobile agent is
able to move from one distributed environment to another.
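The properties described above can be made concrete with a small sketch. The following Python fragment is only an illustration under stated assumptions: the `Environment` and `Agent` classes, the nutrient quantity, and the `goal_energy` parameter are hypothetical names invented for this example and do not come from any agent platform. It shows an agent that senses its environment, decides for itself which action to take, and acts through an actuator, while its internal state is changed only by the agent itself.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical shared environment holding a single nutrient pool."""
    nutrient: float = 10.0


@dataclass
class Agent:
    """Illustrative agent: only the agent itself modifies its private state."""
    name: str
    goal_energy: float = 5.0                      # proactive: a target to reach
    _energy: float = field(default=0.0, repr=False)

    def sense(self, env: Environment) -> float:
        # Situatedness: perceive the environment through a software "sensor".
        return env.nutrient

    def decide(self, perceived: float) -> float:
        # Autonomy: the agent, not its caller, chooses the action (if any).
        if perceived <= 0.0:
            return 0.0                            # reactive: nothing to take
        if self._energy >= self.goal_energy:
            return 0.0                            # adaptive: internal state says stop
        return min(1.0, perceived, self.goal_energy - self._energy)

    def act(self, env: Environment, amount: float) -> None:
        # Situatedness: change the environment through a software "actuator".
        env.nutrient -= amount
        self._energy += amount

    def step(self, env: Environment) -> None:
        self.act(env, self.decide(self.sense(env)))

    @property
    def energy(self) -> float:
        return self._energy


env = Environment(nutrient=10.0)
agents = [Agent("a1"), Agent("a2", goal_energy=2.0)]
for _ in range(6):
    for agent in agents:
        agent.step(env)
```

In contrast with a plain object, no external caller sets `_energy` here; the outside world can only invoke `step` and let the agent decide what to do.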
A multi-agent system (MAS) consists of a number of agents
interacting with each other in a dynamic environment (3). The agents
of a MAS act with their own, possibly different, goals and motivations.
Furthermore, agents require the ability to communicate, cooperate,
and perceive other agents. A well-designed MAS is one
that achieves a global task through the tasks of the single agents and
18 Agent-Based Models of Cellular Systems 401
1.2. Agent-Based Modeling
Biological systems are complex systems, i.e., many of their components
are coupled in a nonlinear fashion (4). They are characterized
by variables with complicated, discontinuous behaviors over
time. Furthermore, they exhibit emergence, i.e., complex
patterns emerge from simpler interaction rules. The global
behavior of such a system can be determined by defining the
lower-level interaction rules among its components. Developing software
for agent-based systems can benefit from modern software engineering
techniques, including decomposition, abstraction, and organization:
a problem can be divided into smaller, manageable subproblems;
some details of a problem can be chosen to be modeled,
while others can be ignored; and the relationships among the various
system components can be identified and managed. Software
agents are situated in space and time and have some properties
and some sets of local interaction rules. Though "intelligent,"
they cannot by themselves deduce the global behavior resulting
from their dynamic interactions. An agent-based system usually
evolves from the microlevel to the macrolevel, and agent-based
modeling (ABM) usually adopts a bottom-up design strategy rather
than a top-down one. Agents are commonly assumed to have
well-defined bounds and interfaces, as well as spatial and temporal
properties, including such dynamic properties as movement, velocity,
acceleration, and collision. Agent-based systems also allow for
easy modification of interaction rules or behavior, as well as for
viewing agents or groups of agents at different levels of abstraction.
Modeling a system with a bottom-up approach requires that every
individual agent's behavior be described. The more detail
that goes into describing the behavior of the system, the
greater the computational power required to simulate the
behaviors of all constituent agents. This is a limitation in modeling
large systems using ABM (5). A reasonable approach is to provide
several levels of abstraction and granularity, which can be chosen
depending on the level of detail needed and the computational
resources available.
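As a rough illustration of the bottom-up strategy, the sketch below lets a macrolevel quantity (the amount of product formed) arise from two purely local rules. Everything in it is an assumption made for this example: the ring of sites, the agent counts, the random-walk rule, and the function name `simulate` are hypothetical and are not taken from any of the platforms discussed in this chapter.

```python
import random


def simulate(n_sites=20, n_enzymes=3, n_substrates=30, steps=200, seed=0):
    """Toy bottom-up model in which every agent follows only local rules."""
    rng = random.Random(seed)
    enzymes = [rng.randrange(n_sites) for _ in range(n_enzymes)]
    # Each metabolite agent is [position, kind]; kind is "S" (substrate)
    # or "P" (product).
    metabolites = [[rng.randrange(n_sites), "S"] for _ in range(n_substrates)]
    history = []
    for _ in range(steps):
        # Local rule 1: every agent takes one random step on a ring of sites.
        enzymes = [(pos + rng.choice((-1, 1))) % n_sites for pos in enzymes]
        for m in metabolites:
            m[0] = (m[0] + rng.choice((-1, 1))) % n_sites
        # Local rule 2: a substrate sharing a site with an enzyme is converted.
        occupied = set(enzymes)
        for m in metabolites:
            if m[1] == "S" and m[0] in occupied:
                m[1] = "P"
        # Macrolevel observable: total product, never prescribed directly.
        history.append(sum(1 for m in metabolites if m[1] == "P"))
    return history


history = simulate()
```

The cost of each step grows with the number of agents, which is the computational limitation noted above; a coarser level of abstraction would, for instance, track one agent per group of molecules instead of one per molecule.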
2. Materials
3. Methods
4. Examples
[Figure 1: UML diagram showing three agents linked by "communicate" relationships. AgMitochondrialInnerMembrane simulates the role of the inner mitochondrial membrane (IMM) in the Transport and Electron Transport Respiration Chain functions. AgMitochondrialMatrix simulates the role of the mitochondrial matrix (MM) in the Partial Oxidation of Pyruvate and Krebs Cycle functions. Environment is a service agent (UpdateSharedData) that simulates the execution environment in which any cell reaction takes place.]
Fig. 1. The PASSI agent identification UML diagram for our CO example.
Fig. 2. A snapshot of our metabolic reaction simulation, showing enzyme, metabolite, and
complex agents moving and acting in a portion of cytoplasm.
5. Notes
References
1. Milner R (1989) Communication and concurrency. Prentice-Hall, New York, London
2. Wooldridge M (1997) Agent-based software engineering. IEE proceedings on software engineering: 26–37
3. Wooldridge M (2002) An introduction to MultiAgent systems. Wiley, West Sussex, UK
4. Vallurupalli V, Purdy C (2007) Agent-based modeling and simulation of biomolecular reactions. Scalable Computing: Practice and Experience 8(2):185–196
5. Bonabeau E (2002) Agent-based modeling: methods and techniques for simulating human systems. Proc Natl Acad Sci USA 99(Suppl 3):7280–7287
6. Kitano H (2001) Foundations of systems biology. MIT Press, Cambridge, MA
7. Gonzalez PP, Cardenas M, Camacho D et al (2003) Cellulat: an agent-based intracellular signalling model. Biosystems 68(2–3):171–185
8. Wolfram S (1984) Cellular automata as models of complexity. Nature 311:419–424
9. Wishart DS, Yang R, Arndt D et al (2005) Dynamic cellular automata: an alternative approach to cellular simulation. Silico Biol 5(2):139–161
10. Troisi A, Wong V, Ratner MA (2005) An agent-based approach for modeling molecular self-organization. Proc Natl Acad Sci U S A 102(2):255–260
11. Emonet T, Macal CM, North MJ et al (2005) AgentCell: a digital single-cell assay for bacterial chemotaxis. Bioinformatics 21(11):2714–2721
12. Korobkova E, Emonet T, Vilar JM et al (2004) From molecular noise to behavioural variability in a single bacterium. Nature 428(6982):574–578
13. Lemerle C, Di Ventura B, Serrano L (2005) Space as the final frontier in stochastic simulations of biological systems. FEBS Lett 579(8):1789–1794
14. Lloyd CM, Halstead MD, Nielsen PF (2004) CellML: its future, present and past. Prog Biophys Mol Biol 85(2–3):433–450
15. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19(4):524–531
16. Takahashi K, Arjunan SN, Tomita M (2005) Space in systems biology of signaling pathways - towards intracellular molecular crowding in silico. FEBS Lett 579(8):1783–1788
17. Tang J, Ley KF, Hunt CA (2007) Dynamics of in silico leukocyte rolling, activation, and adhesion. BMC Syst Biol 1:14
18. The breve simulation environment. http://www.spiderland.org/. Accessed 19 Nov 2010
19. Guo Z, Sloot PM, Tay JC (2008) A hybrid agent-based approach for modeling microbiological systems. J Theor Biol 255(2):163–175
20. Guo Z, Tay JC (2008) Multi-timescale event-scheduling in multi-agent immune simulation models. Biosystems 91(1):126–145
21. Walker DC, Southgate J, Hill G et al (2004) The epitheliome: agent-based modelling of the social behaviour of cells. Biosystems 76(1–3):89–100
22. MATLAB - The language of technical computing. http://www.mathworks.com/products/matlab/. Accessed 19 Nov 2010
23. Pogson M, Smallwood R, Qwarnstrom E et al (2006) Formal agent-based modelling of intracellular chemical interactions. Biosystems 85(1):37–45
24. Walker DC, Southgate J (2009) The virtual cell - a candidate co-ordinator for middle-out modelling of biological systems. Brief Bioinform 10(4):450–461
25. Adra S, Sun T, MacNeil S et al (2010) Development of a three dimensional multiscale computational model of the human epidermis. PLoS One 5(1):e8511
26. Hoops S, Sahle S, Gauges R et al (2006) COPASI - a COmplex PAthway SImulator. Bioinformatics 22(24):3067–3074
27. Galvao V, Miranda JG (2010) A three-dimensional multi-agent-based model for the evolution of Chagas disease. Biosystems 100(3):225–230
28. Pennisi M, Pappalardo F, Palladini A (2010) Modeling the competition between lung metastases and the immune system using agents. BMC Bioinformatics 11(Suppl 7):S13
29. Bauer AL, Beauchemin CA, Perelson AS (2009) Agent-based modeling of host-pathogen systems: the successes and challenges. Inf Sci (Ny) 179(10):1379–1389
30. Christley S, Lee B, Dai X et al (2010) Integrative multicellular biological modeling: a case study of 3D epidermal development using GPU algorithms. BMC Syst Biol 4:107
31. Dematte L, Prandi D (2010) GPU computing for systems biology. Brief Bioinform 11(3):323–333
32. Halling-Brown M, Pappalardo F, Rapin N et al (2010) ImmunoGrid: towards agent-based simulations of the human immune system at a natural scale. Philos Transact A Math Phys Eng Sci 368(1920):2799–2815
33. Viceconti M, Clapworthy G, Van Sint JS (2008) The virtual physiological human - a European initiative for in silico human modeling. J Physiol Sci 58(7):441–446
34. Shah I, Wambaugh J (2010) Virtual tissues in toxicology. J Toxicol Environ Health B Crit Rev 13(2–4):314–328
35. Devillers J, Devillers H, Decourtye A (2010) Internet resources for agent-based modelling. SAR QSAR Environ Res 21(3–4):337–50
36. Gilbert N, Bankes S (2002) Platforms and methods for agent-based modeling. Proc Natl Acad Sci USA 99(Suppl 3):7197–7198
37. REPAST - Recursive porous agent simulation toolkit. http://repast.sourceforge.net/. Accessed 19 Nov 2010
38. A resource for agent- and individual-based modelers and the home of Swarm. http://www.swarm.org/index.php/Main_Page. Accessed 19 Nov 2010
39. JADE - Java Agent DEvelopment Framework. http://jade.tilab.com/. Accessed 19 Nov 2010
40. JADE - A White Paper. http://jade.tilab.com/papers/2003/WhitePaperJADEEXP.pdf. Accessed 19 Nov 2010
41. IEEE foundation for intelligent physical agents. http://www.fipa.org/. Accessed 19 Nov 2010
42. IEEE - the world's largest professional association for the advancement of technology. http://www.ieee.org/index.html. Accessed 19 Nov 2010
43. Java-based intelligent agent componentware. http://jiac.de/. Accessed 19 Nov 2010
44. SPADE - Smart python multi-agent development environment. http://code.google.com/p/spade2/. Accessed 19 Nov 2010
45. JACK autonomous software. http://aosgrp.com/products/jack/index.html. Accessed 19 Nov 2010
46. Comparison of agent-based modeling software. http://en.wikipedia.org/wiki/Comparison_of_agent-based_modeling_software. Accessed 19 Nov 2010
47. Macal CM, North MJ (2008) Agent-based modeling and simulation: desktop ABMS. In proc. of winter simulation conference 2007:95–106
48. Sanchez D, Isern D, Rodríguez-Rozas A et al (2010) Agent-based platform to support the execution of parallel tasks. Expert Syst Appl. doi:10.1016/j.eswa.2010.11.073
49. CUDA Zone - Official Website. http://www.nvidia.com/object/cuda_home_new.html. Accessed 19 Nov 2010
50. Allan R (2009) Survey of agent based modelling and simulation tools. Technical report. Computational science and engineering department, STFC Daresbury laboratory, Daresbury, Warrington. http://epubs.cclrc.ac.uk/work-details?w50398. Accessed 19 Nov 2010
51. Nikolai C, Madey G (2009) Tools of the trade: a survey of various agent based modeling platforms. J Artif Soc Soc Simulat 12(2):2
52. Railsback SF, Lytinen SL, Jackson SK (2006) Agent-based simulation platforms: review and development recommendations. Simulation 82:609–623
53. Tobias R, Hofmann C (2004) Evaluation of free Java-libraries for social-scientific agent based simulation. J Artif Soc Soc Simulat 7(1):6
54. Groovy - An agile dynamic language for the Java platform. http://groovy.codehaus.org/. Accessed 19 Nov 2010
55. Mason multiagent simulation toolkit. http://cs.gmu.edu/~eclab/projects/mason/. Accessed 19 Nov 2010
56. AgentSheets. http://www.agentsheets.com/. Accessed 19 Nov 2010
57. NetLogo Home Page. http://ccl.northwestern.edu/netlogo/. Accessed 19 Nov 2010
58. SeSAm - Integrated environment for multi-agent simulation. http://www.simsesam.de/. Accessed 19 Nov 2010
59. StarLogo on the Web. http://education.mit.edu/starlogo/. Accessed 19 Nov 2010
60. Why AnyLogic simulation software? http://www.xjtek.com/anylogic/why_anylogic/. Accessed 19 Nov 2010
61. SimBioSys class framework. http://www.lucifer.com/~david/SimBioSys/. Accessed 19 Nov 2010
62. Unified modeling language, resource page. http://www.uml.org/. Accessed 19 Nov 2010
63. Rational unified process: best practices for software development teams whitepaper. http://www.augustana.ab.ca/~mohrj/courses/2000.winter/csc220/papers/rup_best_practices/rup_bestpractices.html. Accessed 19 Nov 2010
64. Bernon C, Cossentino M, Pavon J (2005) Agent-oriented software engineering. Knowl Eng Rev 20(2):99–116
65. Macal CM, North MJ (2008) Agent-based modeling and simulation: desktop ABMS. In proc. of winter simulation conference 2007, pp. 95–106
66. Axelrod R (1997) Advancing the art of simulation in social sciences. In: Conte R, Hegselmann R, Terna P (eds) Simulating social phenomena. Springer-Verlag, Berlin
67. Wooldridge M, Jennings NR, Kinny D (2000) The Gaia methodology for agent-oriented analysis and design. J Auton Agent Multi-Agent Syst 3(3):285–312
68. Zambonelli F, Jennings N, Kinny D (2003) Developing multiagent systems: the Gaia methodology. ACM Trans Softw Eng Methodol 12(3):417–470
69. Bresciani P, Giorgini P, Giunchiglia F et al (2004) Tropos: an agent-oriented software development methodology. J Auton Agent Multi-Agent Syst 8:203–236
70. Padgham L, Winikoff M (2002) Prometheus: a methodology for developing intelligent agents. In proc. of 3rd international conference on agent-oriented software engineering III:174–185
71. Pavon J, Gomez-Sanz J, Fuentes R (2005) The INGENIAS methodology and tools. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
72. Cossentino M (2005) From requirements to code with the PASSI methodology. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
73. Bernon C, Camps V, Gleizes M-P et al (2005) Engineering adaptive multi-agent systems: the ADELFE methodology. In: Henderson-Sellers B, Giorgini P (eds) Agent-oriented methodologies. Idea Group, London
74. Cossentino M, Fortino G, Garro A et al (2008) PASSIM: a simulation-based process for the development of multi-agent systems. Int J Agent-Oriented Softw Eng 2(2):132–170
75. Henderson-Sellers B, Giorgini P (2005) Agent-oriented methodologies. Idea Group, London
76. The FIPA agent UML Web site. http://www.auml.org/. Accessed 19 Nov 2010
77. Padgham L, Winikoff M, Deloach S, Cossentino M (2009) A unified graphical notation for AOSE. In proc. agent-oriented software engineering IX. doi:10.1007/978-3-642-01338-6_9
78. García-Magariño I, Gutierrez C, Fuentes-Fernandez R (2009) The INGENIAS development kit: a practical application for crisis-management. In bio-inspired systems: computational and ambient intelligence. Lect Notes Comput Sci 5517:537–544
79. PASSI toolkit. http://sourceforge.net/projects/ptk/. Accessed 19 Nov 2010
80. Cossentino M, Fortino G, Gleizes MP et al (2010) Simulation-based design and evaluation of multi-agent systems. Modell Pract Theory 18(10):1425–1427
81. Fortino G, Garro A, Russo W (2005) An integrated approach for the development and validation of multi-agent systems. Comput Syst Sci Eng 20(4):94–107
82. Fortino G, Garro A, Mascillaro S et al (2007) ELDATool: a statechart-based tool for prototyping multi-agent systems. Proc. of workshop on objects and agents (WOA 07)
83. Cannata N, Corradini F, Merelli E et al (2005) An agent-oriented conceptual framework for systems biology. Trans Comput Syst Biol, Lect Notes Comput Sci (3737):105–122
84. Finkelstein A, Hetherington J, Li L et al (2004) Computational challenges of systems biology. Computer 37(5):26–33
85. Corradini F, Merelli E, Vita M (2005) A multi-agent system for modelling carbohydrate oxidation in cell. Lect Notes Comput Sci 3481:1264–1273
86. Cannata N, Corradini F, Merelli E (2008) Multiagent modelling and simulation of
Linear Algebra
Kenneth Kuttler
Abstract
This chapter is a short review of linear algebra leading to a discussion of the singular value decomposition.
Key words: Linear algebra, Inner product space, Singular value decomposition
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_19, © Springer Science+Business Media, LLC 2013
430 K. Kuttler
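$$(a, b) = \overline{(b, a)}, \tag{1}$$
$$(a, a) \ge 0 \text{ and equals zero if and only if } a = 0, \tag{2}$$
$$(\alpha a + \beta b, c) = \alpha (a, c) + \beta (b, c), \tag{3}$$
$$(c, \alpha a + \beta b) = \overline{\alpha}\,(c, a) + \overline{\beta}\,(c, b), \tag{4}$$
$$|a|^2 = (a, a), \tag{5}$$
$$0 \le (a, b) = (a, \alpha a) = \overline{\alpha}\,|a|^2.$$
Therefore, $\alpha$ must be a real number which is nonnegative.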
To get the other form of the triangle inequality,
$$a = a - b + b$$
so
$$|a| = |a - b + b| \le |a - b| + |b|.$$
Therefore,
$$|a| - |b| \le |a - b|. \tag{9}$$
Similarly,
$$|b| - |a| \le |b - a| = |a - b|. \tag{10}$$
It follows from Eqs. 9 and 10 that Eq. 8 holds. This is because
$\big||a| - |b|\big|$ equals the left side of either Eq. 9 or Eq. 10 and either
way, $\big||a| - |b|\big| \le |a - b|$.
$$|a - b|^2 = (a - b) \cdot (a - b) = |a|^2 + |b|^2 - 2\, a \cdot b$$
and so comparing the above two formulas,
$$a \cdot b = |a||b| \cos\theta. \tag{14}$$
Since $|\cos\theta| \le 1$,
$$|a \cdot b| \le |a||b|.$$
Also note that Eq. 14 provides an easy way to find the angle
between two vectors and also a method to tell whether two vectors
are perpendicular.
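As a small illustration of how Eq. 14 is used in practice, the following Python sketch computes the angle between two real vectors and tests for perpendicularity. The function names and the tolerance `tol` are choices made for this example only; the clamp before `acos` is a guard against floating-point rounding rather than part of the formula.

```python
import math


def dot(a, b):
    """Dot product of two real vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Length |a| = sqrt(a . a)."""
    return math.sqrt(dot(a, a))


def angle(a, b):
    """Angle in radians from a . b = |a||b| cos(theta); a and b nonzero."""
    c = dot(a, b) / (norm(a) * norm(b))
    return math.acos(max(-1.0, min(1.0, c)))  # clamp against rounding error


def perpendicular(a, b, tol=1e-12):
    """Two vectors are perpendicular exactly when their dot product is zero."""
    return abs(dot(a, b)) <= tol


right_angle = angle((1.0, 0.0), (0.0, 1.0))
diagonal = angle((1.0, 0.0), (1.0, 1.0))
```

Here `right_angle` is a quarter turn and `diagonal` is half of that, up to rounding.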
4. Addition and Scalar Multiplication
A matrix is a rectangular array of numbers. Several of them are
referred to as matrices. For example, here is a matrix.
$$\begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 2 & 8 & 7 \\ 6 & 9 & 1 & 2 \end{pmatrix}$$
This matrix is a 3 × 4 matrix because there are three rows and four
columns. The first row is $(1\ 2\ 3\ 4)$, the second row is $(5\ 2\ 8\ 7)$,
and so forth. The first column is $\begin{pmatrix} 1 \\ 5 \\ 6 \end{pmatrix}$. The convention in dealing
with matrices is to always list the rows first and then the columns. Also,
you can remember the columns are like columns in a Greek temple.
They stand upright while the rows just lie there like rows made by a
tractor in a plowed field. Elements of the matrix are identified according
to position in the matrix. For example, 8 is in position 2, 3 because
it is in the second row and the third column. You might remember
that you always list the rows before the columns by using the phrase
Rowman Catholic. The symbol $(a_{ij})$ refers to a matrix in which the i
denotes the row and the j denotes the column. Using this notation on
the above matrix, $a_{23} = 8$, $a_{32} = 9$, $a_{12} = 2$, etc.
There are various operations which are done on matrices. They
can sometimes be added, multiplied by a scalar, and sometimes
multiplied. To illustrate scalar multiplication, consider the following
example:
$$3 \begin{pmatrix} 1 & 2 & 3 & 4 \\ 5 & 2 & 8 & 7 \\ 6 & 9 & 1 & 2 \end{pmatrix} = \begin{pmatrix} 3 & 6 & 9 & 12 \\ 15 & 6 & 24 & 21 \\ 18 & 27 & 3 & 6 \end{pmatrix}.$$
The new matrix is obtained by multiplying every entry of the
original matrix by the given scalar. If A is an m × n matrix, −A is
defined to equal (−1)A.
Two matrices which are of the same size can be added. When
this is done, the result is the matrix which is obtained by adding
corresponding entries. Thus
$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 2 \end{pmatrix} + \begin{pmatrix} -1 & 4 \\ 2 & 8 \\ 6 & -4 \end{pmatrix} = \begin{pmatrix} 0 & 6 \\ 5 & 12 \\ 11 & -2 \end{pmatrix}.$$
Two matrices are equal exactly when they are the same size and the
corresponding entries are identical. Thus
$$\begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \neq \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}$$
because they are different sizes. As noted above, you write $(c_{ij})$ for the
matrix C whose ijth entry is $c_{ij}$. In doing arithmetic with matrices
you must define what happens in terms of the $c_{ij}$, sometimes called
the entries of the matrix or the components of the matrix.
The above discussion stated for general matrices is given in the
following definition.
Definition 4.1: Let $A = (a_{ij})$ and $B = (b_{ij})$ be two m × n matrices. Then
$A + B = C$ where
$$C = (c_{ij})$$
for $c_{ij} = a_{ij} + b_{ij}$. Also if x is a scalar,
$$xA = (c_{ij}),$$
where $c_{ij} = x a_{ij}$. The number $A_{ij}$ will typically refer to the ijth entry
of the matrix A. The zero matrix, denoted by 0, will be the matrix
consisting of all zeros.
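Definition 4.1 translates almost word for word into code. The sketch below is one possible rendering in plain Python, with a matrix stored as a list of rows; the function names are illustrative, and the minus signs in the second and third matrices of the addition example are an assumption chosen so that the entrywise sums are consistent. Note that Python counts from zero, so the entry written $a_{23}$ in the text is `A[1][2]`.

```python
def scalar_multiply(x, A):
    """Return xA: every entry of A multiplied by the scalar x."""
    return [[x * a for a in row] for row in A]


def add(A, B):
    """Return A + B by adding corresponding entries; sizes must agree."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same size to be added")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


A = [[1, 2, 3, 4],
     [5, 2, 8, 7],
     [6, 9, 1, 2]]
tripled = scalar_multiply(3, A)

P = [[1, 2], [3, 4], [5, 2]]
Q = [[-1, 4], [2, 8], [6, -4]]   # signs assumed, see the note above
total = add(P, Q)
```

Trying to add `A` and `P` raises an error, which mirrors the remark that only matrices of the same size can be added.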
Note there are 2 × 3 zero matrices, 3 × 4 zero matrices, etc. In
fact for every size there is a zero matrix.
With this definition it is easy to verify all of the following
properties valid for A, B, and C, m × n matrices and 0 an m × n
zero matrix.
$$A + B = B + A, \tag{15}$$
the commutative law of addition,
$$(A + B) + C = A + (B + C), \tag{16}$$
5. Matrix Multiplication
All the above is fine, but the real reason for considering matrices is
that they can be multiplied. This is where things quit being banal.
The rule for matrix multiplication is this. You can multiply an m × n
matrix times an n × p matrix. The following may be helpful.
$$(m \times \overbrace{n)\ (n}^{\text{these must match}} \times p) = m \times p$$
with the jth column of B and the dot product must make sense!
In symbols,
$$(AB)_{ij} = \sum_k A_{ik} B_{kj}. \tag{23}$$
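Eq. 23 can be read directly as a triple loop. The following sketch, with an illustrative function name and small made-up matrices, forms each entry of the product as the dot product of a row of A with a column of B and refuses to multiply when the inner sizes do not match.

```python
def matmul(A, B):
    """Return AB with (AB)_ij = sum over k of A_ik * B_kj."""
    m, n, p = len(A), len(B), len(B[0])
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must match rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [0, 1],
     [1, 1]]             # 3 x 2
AB = matmul(A, B)        # 2 x 2
BA = matmul(B, A)        # 3 x 3
```

The two products have different sizes here, which is a reminder that AB and BA need not be equal even when both are defined.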
6. Systems of Linear Equations and Matrices
A system of linear equations is something of the form
$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= b_1 \\ &\ \ \vdots \\ a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mn} x_n &= b_m \end{aligned}$$
The variables to solve for are $x_1, \ldots, x_n$. You notice there are m
equations for the n variables. As just explained, another way to
write the above system of equations in terms of matrix multiplication
is in the form
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ \vdots \\ b_m \end{pmatrix},$$
$Ax = b$ for short. Thus a solution to the system of equations will be
$x \in \mathbb{R}^n$ such that $Ax = b$. Another way to say the same thing is to
find x such that
$$\begin{pmatrix} A & b \end{pmatrix} \begin{pmatrix} x \\ -1 \end{pmatrix} = 0.$$
Here the matrix on the left is obtained by starting with A and
adding a column at the end which consists of b. The vector which
follows is obtained by starting with x and adding an entry at the
bottom consisting of −1.
7. Row Operations
When one deals with matrices there are operations called row
operations. These are as follows:
Definition 7.1: The row operations consist of the following:
1. Switch two rows.
2. Multiply a row by a nonzero number.
3. Replace a row by a multiple of another row added to it.
The major result for row operations is the following.
Lemma 7.2: Let $A$ be an $m \times n$ matrix and let $x$ be a vector in $\mathbb{F}^n$ such that
$Ax = 0$. Suppose $B$ is an $m \times n$ matrix which results from $A$ by doing a row
operation to $A$. Then $Bx = 0$ also.
438 K. Kuttler
It is now easy to identify a vector $\begin{pmatrix} x \\ -1 \end{pmatrix}$ which the above matrix
sends to $0$. It is to let
\[ x = (x, y, z) = \left( \frac{55}{52}, \frac{9}{8}, \frac{10}{13} \right). \]
This x is the solution to the original system of equations. Note the
solution to the original system is just the column vector on the right
in the reduced row echelon form.
More can be said here but this chapter is not oriented in this
direction. It suffices to see that this is always a reasonable way to
find solutions to a system of equations. Later, least squares solu-
tions will be discussed. These are important when there is no
solution to such a system of equations just discussed. To accomplish
the process of row reduction just described, you can use a computer
algebra system. Here are directions using Maple 12.
1. First open the application. On the left you will see a column of
gray boxes. Click on the one which is labeled matrix to obtain a
display which says rows, and right below it, columns. Enter the
appropriate numbers for the augmented matrix.
2. Next left click on insert matrix. This will give you a template in
which you can fill in the appropriate numbers. It is convenient
to use the tab key on your keyboard to advance from one slot to
the next when entering the entries of the matrix.
3. When you have entered the entries of the matrix, right click on
it. Then click on solvers and forms, place the cursor on row
echelon form, and click on reduced. This gives the reduced
row-echelon form from which it will be maximally easy to spot
the solution to the original system.
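For readers without Maple, the same reduction can be sketched in plain Python. This is a minimal illustration, not the chapter's method: it assumes exact rational arithmetic via `fractions.Fraction`, and the function name `rref` and the 3 x 4 augmented matrix are made up for the example. Each step in the loop is one of the three row operations of Definition 7.1.

```python
from fractions import Fraction

def rref(M):
    """Reduced row-echelon form of M using only the three row operations."""
    A = [[Fraction(x) for x in row] for row in M]
    rows, cols = len(A), len(A[0])
    pivot_row = 0
    for col in range(cols):
        # Look for a row at or below pivot_row with a nonzero entry in this column.
        pivot = next((r for r in range(pivot_row, rows) if A[r][col] != 0), None)
        if pivot is None:
            continue
        A[pivot_row], A[pivot] = A[pivot], A[pivot_row]      # 1. switch two rows
        scale = A[pivot_row][col]
        A[pivot_row] = [x / scale for x in A[pivot_row]]     # 2. scale by a nonzero number
        for r in range(rows):
            if r != pivot_row and A[r][col] != 0:
                factor = A[r][col]
                # 3. replace a row by itself plus a multiple of another row
                A[r] = [x - factor * y for x, y in zip(A[r], A[pivot_row])]
        pivot_row += 1
        if pivot_row == rows:
            break
    return A

# Hypothetical system: x + y + z = 6, 2y + 5z = -4, 2x + 5y - z = 27.
augmented = [[1, 1, 1, 6],
             [0, 2, 5, -4],
             [2, 5, -1, 27]]
reduced = rref(augmented)
solution = [row[-1] for row in reduced]   # right-hand column of the reduced form
print(solution)
```

When the left block reduces to the identity, as it does here, the solution is read off from the last column, exactly as described for the reduced row echelon form above.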
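8. Properties of Matrix Multiplication
First note the following example.
Example 8.1: Multiply the following in two different orders:
\[
\begin{pmatrix} 1 & 2 & 0 \\ 0 & 3 & 1 \\ -2 & 1 & 1 \end{pmatrix}, \qquad
\begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix}.
\]
Note that it is not possible to multiply these matrices in the given
order because the first is $3 \times 3$ and the second is $2 \times 3$. However, you
can multiply them in the other order. You should check that
\[
\begin{pmatrix} 1 & 2 & 1 \\ 0 & 2 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 2 & 0 \\ 0 & 3 & 1 \\ -2 & 1 & 1 \end{pmatrix}
= \begin{pmatrix} -1 & 9 & 3 \\ -2 & 7 & 3 \end{pmatrix}.
\]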
Order Matters!
Matrix multiplication is not commutative. This is very different
than multiplication of numbers!
As pointed out above, sometimes it is possible to multiply
matrices in one order but not in the other order. What if it makes
sense to multiply them in either order? Will the respective products
be equal then?
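Example 8.2: Compare $\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$.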
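The comparison in Example 8.2 is quick to carry out by machine. Here is a small sketch in plain Python; the helper `matmul` is an illustrative name, and matrices are assumed to be lists of rows.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2],
     [3, 4]]
B = [[0, 1],
     [1, 0]]

AB = matmul(A, B)   # multiplying by B on the right swaps the columns of A
BA = matmul(B, A)   # multiplying by B on the left swaps the rows of A
print(AB, BA, AB == BA)
```

Both products are defined and have the same size, yet they differ, so the order of the factors matters even when both orders make sense.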
\[ A(aB + bC) = a(AB) + b(AC). \tag{25} \]
\[ (B + C)A = BA + CA. \tag{26} \]
\[ A(BC) = (AB)C. \tag{27} \]
What happened? The first column became the first row and the
second column became the second row. Thus the $3 \times 2$ matrix
became a $2 \times 3$ matrix. The number 3 was in the second row and
the first column and it ended up in the first row and second column.
This motivates the following definition of the transpose of a matrix.
Definition 8.4: Let $A$ be an $m \times n$ matrix. Then $A^T$ denotes the $n \times m$
matrix which is defined as follows:
\[ (A^T)_{ij} = A_{ji}. \]
\[ (AB)^T = B^T A^T \tag{28} \]
and if $a$ and $b$ are scalars,
\[ (aA + bB)^T = aA^T + bB^T. \tag{29} \]
In the case of the adjoint,
\[ (AB)^* = B^* A^* \tag{30} \]
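Proof: From the definition and the fact that the complex conjugate of a
product equals the product of the conjugates,
\[
(Ax, y) = \sum_{j=1}^{m}\sum_{i=1}^{n} A_{ji} x_i \overline{y_j}
= \sum_{i=1}^{n}\sum_{j=1}^{m} x_i \overline{\overline{A^T_{ij}}\, y_j}
= \sum_{i=1}^{n}\sum_{j=1}^{m} x_i \overline{A^*_{ij} y_j}
= (x, A^* y). \qquad \blacksquare
\]
Proof:
\[ (A I_n)_{ij} = \sum_k A_{ik}\delta_{kj} = A_{ij}. \]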
the sum of the entries down the main diagonal. Then here is a nice
theorem about the trace of a product.
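Theorem 8.11: Let $A$ be an $m \times n$ matrix and let $B$ be an $n \times m$ matrix.
Then
\[ \operatorname{trace}(AB) = \operatorname{trace}(BA). \]
Proof:
\[
\operatorname{trace}(AB) = \sum_i \left( \sum_k A_{ik} B_{ki} \right)
= \sum_k \sum_i B_{ki} A_{ik} = \operatorname{trace}(BA). \qquad \blacksquare
\]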
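A numerical spot check makes the theorem concrete, since $AB$ and $BA$ need not even have the same size. The sketch below uses plain Python with made-up matrices; `matmul` and `trace` are illustrative helper names.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def trace(M):
    """Sum of the entries down the main diagonal of a square matrix."""
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2, 3],
     [4, 5, 6]]          # 2 x 3
B = [[1, 0],
     [2, 1],
     [0, 3]]             # 3 x 2

trace_AB = trace(matmul(A, B))   # AB is 2 x 2
trace_BA = trace(matmul(B, A))   # BA is 3 x 3
print(trace_AB, trace_BA)
```

The two products have different shapes, but the sums of their diagonal entries agree.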
9. Inverses
\[ (A \mid e_i) \]
such that $\begin{pmatrix} b_i \\ -1 \end{pmatrix}$ is a solution to the system
\[ (A \mid e_i)\begin{pmatrix} b_i \\ -1 \end{pmatrix} = 0. \]
This involves doing row operations till you obtain
\[ (I \mid b_i) \]
and then observing that this $b_i$ works.
Each time this is done, the same row operations can be used to
get $A \to I$. Therefore, a shortcut to doing this is to simply write the
$n \times 2n$ matrix
\[ (A \mid I) \]
and do row operations on it till the left half equals $I$ and then what
sits on the right side will be the inverse of $A$.
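Example 9.3: Find the inverse of $A = \begin{pmatrix} 1 & 3 & 4 \\ 5 & -2 & 3 \\ 2 & 1 & 5 \end{pmatrix}$.
You write
\[
\left(\begin{array}{ccc|ccc}
1 & 3 & 4 & 1 & 0 & 0 \\
5 & -2 & 3 & 0 & 1 & 0 \\
2 & 1 & 5 & 0 & 0 & 1
\end{array}\right).
\]
Then you do row operations on this until the left half equals $I$, and
it follows that what is left on the right side will be the inverse.
\[
\left(\begin{array}{ccc|ccc}
1 & 0 & 0 & \frac{13}{34} & \frac{11}{34} & -\frac{1}{2} \\
0 & 1 & 0 & \frac{19}{34} & \frac{3}{34} & -\frac{1}{2} \\
0 & 0 & 1 & -\frac{9}{34} & -\frac{5}{34} & \frac{1}{2}
\end{array}\right).
\]
Thus the inverse of the given matrix is
\[
\begin{pmatrix}
\frac{13}{34} & \frac{11}{34} & -\frac{1}{2} \\
\frac{19}{34} & \frac{3}{34} & -\frac{1}{2} \\
-\frac{9}{34} & -\frac{5}{34} & \frac{1}{2}
\end{pmatrix}.
\]
It is a good idea to check your work when you do one of these by
hand. From the above observation that the left and right inverses
are the same for a square matrix, it suffices to multiply on only one
side. In this case,
\[
\begin{pmatrix}
\frac{13}{34} & \frac{11}{34} & -\frac{1}{2} \\
\frac{19}{34} & \frac{3}{34} & -\frac{1}{2} \\
-\frac{9}{34} & -\frac{5}{34} & \frac{1}{2}
\end{pmatrix}
\begin{pmatrix} 1 & 3 & 4 \\ 5 & -2 & 3 \\ 2 & 1 & 5 \end{pmatrix}
= \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
\]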
Of course you don't want to do this sort of busy work. It is
easier to use a computer algebra system. Here is how you would
find the inverse of a matrix using Maple 12.
1. First open the application. On the left you will see a column of
gray boxes. Click on the one which is labeled matrix to obtain a
display which says rows, and right below it, columns. Enter the
appropriate numbers.
2. Next left click on insert matrix. This will give you a template
in which you can fill in the appropriate numbers. It is conve-
nient to use the tab key on your keyboard to advance from one
slot to the next when entering the entries of the matrix.
3. When you have entered the entries of the matrix, right click on
it. Then click on inverse in standard operations. (When
you place the mouse on standard operations it will give you a
choice and one will be the inverse.)
It is that easy. If you don't want your answer to be in terms of
fractions, simply write one number in the display for the matrix
with a decimal point and it will give the answer in terms of decimals.
For example, write 5.0 instead of 5.
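The shortcut of row reducing $(A \mid I)$ can also be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the chapter's procedure: it uses exact fractions from the standard library, the name `inverse` is hypothetical, and the 2 x 2 matrix is a made-up example.

```python
from fractions import Fraction

def inverse(A):
    """Invert a square matrix by row reducing the n x 2n matrix (A | I)."""
    n = len(A)
    # Build (A | I) with exact rational entries.
    M = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is not invertible")
        M[col], M[pivot] = M[pivot], M[col]            # switch two rows
        scale = M[col][col]
        M[col] = [x / scale for x in M[col]]           # make the pivot equal to 1
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]                      # right half is the inverse

A = [[2, 1],
     [5, 3]]
A_inv = inverse(A)
print(A_inv)
```

Because the entries are `Fraction` objects the answer is exact, which corresponds to leaving the decimal point out in Maple.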
The next section will be on the determinant and many of its
applications. Determinants are extremely unpleasant, but they pro-
vide the shortest proofs of many important theorems. This is why I
am presenting certain things in terms of determinants. They also
occur in the description of the multivariate normal distribution
which will be discussed later.
In what follows $\operatorname{sgn}$ will often be used rather than $\operatorname{sgn}_n$ because the
context supplies the appropriate $n$.
Definition 11.1: Let $f$ be a real-valued function which has the set of
ordered lists of numbers from $\{1, \dots, n\}$ as its domain. Define
\[ \sum_{(k_1, \dots, k_n)} f(k_1 \cdots k_n) \]
to be the sum of all the $f(k_1 \cdots k_n)$ for all possible choices of ordered
lists $(k_1, \dots, k_n)$ of numbers of $\{1, \dots, n\}$. For example, in the case
where $n = 2$,
\[ \sum_{(k_1, k_2)} f(k_1, k_2) = f(1,2) + f(2,1) + f(1,1) + f(2,2). \]
where the sum is taken over all ordered lists of numbers from $\{1, \dots, n\}$.
Note it suffices to take the sum over only those ordered lists in which
there are no repeats because if there are, $\operatorname{sgn}(k_1, \dots, k_n) = 0$ and so that
term contributes $0$ to the sum.
Let $A$ be an $n \times n$ matrix, $A = (a_{ij})$, and let $(r_1, \dots, r_n)$ denote an
ordered list of $n$ numbers from $\{1, \dots, n\}$. Let $A(r_1, \dots, r_n)$ denote the
matrix whose $k$th row is the $r_k$ row of the matrix $A$. Thus
\[ \det(A(r_1, \dots, r_n)) = \sum_{(k_1, \dots, k_n)} \operatorname{sgn}(k_1, \dots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n} \tag{35} \]
and
\[ A(1, \dots, n) = A. \]
and then showing that when you switch $r_i$ and $r_j$ in both sides, the
sign of both sides changes. Thus a sequence of switches sufficient to
obtain $A(r_1, \dots, r_n)$ on the left will yield $\operatorname{sgn}(r_1, \dots, r_n) \det(A)$ on the
right.
Proposition 12.1: Let $(r_1, \dots, r_n)$ be an ordered list of numbers from $\{1, \dots, n\}$. Then
\[ \operatorname{sgn}(r_1, \dots, r_n) \det(A) = \sum_{(k_1, \dots, k_n)} \operatorname{sgn}(k_1, \dots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n} \tag{36} \]
\[ = \det(A(r_1, \dots, r_n)). \tag{37} \]
13. A Symmetric Definition
With the above, it is possible to give a more symmetric description
of the determinant from which it will follow that $\det(A) = \det(A^T)$.
Corollary 13.1: The following formula for $\det(A)$ is valid:
\[ \det(A) = \frac{1}{n!} \sum_{(r_1, \dots, r_n)} \sum_{(k_1, \dots, k_n)} \operatorname{sgn}(r_1, \dots, r_n) \operatorname{sgn}(k_1, \dots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}. \tag{38} \]
And also $\det(A^T) = \det(A)$ where $A^T$ is the transpose of $A$. (Recall that
for $A^T = (a^T_{ij})$, $a^T_{ij} = a_{ji}$.)
Proof: From Proposition 12.1, if the $r_i$ are distinct,
\[ \det(A) = \sum_{(k_1, \dots, k_n)} \operatorname{sgn}(r_1, \dots, r_n) \operatorname{sgn}(k_1, \dots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}. \]
Summing over all ordered lists $(r_1, \dots, r_n)$ where the $r_i$ are distinct
(if the $r_i$ are not distinct, $\operatorname{sgn}(r_1, \dots, r_n) = 0$ and so there is no
contribution to the sum),
\[ n! \det(A) = \sum_{(r_1, \dots, r_n)} \sum_{(k_1, \dots, k_n)} \operatorname{sgn}(r_1, \dots, r_n) \operatorname{sgn}(k_1, \dots, k_n)\, a_{r_1 k_1} \cdots a_{r_n k_n}. \]
This proves the corollary since the formula gives the same number
for $A$ as it does for $A^T$.
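The sum over ordered lists without repeats can be written out directly in code. The following is a sketch in plain Python, assuming the sign of a list with no repeats is computed by counting inversions; the names `sgn`, `det`, and `transpose` and the sample matrix are illustrative. It is only practical for small $n$, since the sum has $n!$ terms.

```python
from itertools import permutations

def sgn(perm):
    """+1 or -1 according to whether the number of inversions is even or odd."""
    inversions = sum(1 for i in range(len(perm))
                     for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inversions % 2 else 1

def det(A):
    """Sum of sgn(k_1..k_n) * a_{1 k_1} ... a_{n k_n} over lists with no repeats."""
    n = len(A)
    total = 0
    for k in permutations(range(n)):
        term = sgn(k)
        for row in range(n):
            term *= A[row][k[row]]
        total += term
    return total

def transpose(A):
    return [list(col) for col in zip(*A)]

A = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]
swapped = [A[1], A[0], A[2]]    # first two rows switched

print(det(A), det(transpose(A)), det(swapped))
```

For this matrix the determinant and the determinant of the transpose coincide, and switching two rows flips the sign.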
14. Properties of Determinants
The following corollary says that the determinant is linear in each
row (column). The meaning of this term is described in the statement
of the corollary. The proof of this corollary is immediate from
the above definition of the determinant.
Corollary 14.1: If two rows or two columns in an $n \times n$ matrix, $A$, are
switched, the determinant of the resulting matrix equals $-1$ times the determinant
of the original matrix. If $A$ is an $n \times n$ matrix in which two rows are
equal or two columns are equal then $\det(A) = 0$. Suppose the $i$th row of $A$
equals $(xa_1 + yb_1, \dots, xa_n + yb_n)$. Then
15. Cofactor Expansions
The most important theoretical property of determinants is the
method of Laplace expansion. In fact, this method is sometimes
used as the basis for the definition of the determinant. Virtually all
the major applications of determinants depend on it. Nevertheless,
it is interesting that it yields an impractical way to actually compute
the determinant.
Definition 15.1: Let $A = (a_{ij})$ be an $n \times n$ matrix. Then a new matrix called
the cofactor matrix, $\operatorname{cof}(A)$, is defined by $\operatorname{cof}(A) = (c_{ij})$ where to obtain $c_{ij}$ delete
the $i$th row and the $j$th column of $A$, take the determinant of the
$(n-1) \times (n-1)$ matrix which results (this is called the $ij$th minor of $A$)
and then multiply this number by $(-1)^{i+j}$. To make the formulas easier to
remember, $\operatorname{cof}(A)_{ij}$ will denote the $ij$th entry of the cofactor matrix.
Now here is the method of cofactor expansions.
Theorem 15.2: Let $A$ be an $n \times n$ matrix where $n \geq 2$. Then
\[ \det(A) = \sum_{j=1}^{n} a_{ij} \operatorname{cof}(A)_{ij} = \sum_{i=1}^{n} a_{ij} \operatorname{cof}(A)_{ij}. \]
The first formula consists of expanding the determinant along the $i$th
row and the second expands the determinant along the $j$th column.
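A short recursive sketch in plain Python shows that every row expansion and every column expansion gives the same number. The names `minor`, `det`, `cofactor`, `expand_row`, and `expand_col` and the sample matrix are illustrative; indices are zero-based in the code, which does not change the sign $(-1)^{i+j}$.

```python
def minor(A, i, j):
    """Matrix obtained by deleting row i and column j of A."""
    return [row[:j] + row[j + 1:] for r, row in enumerate(A) if r != i]

def det(A):
    """Determinant by cofactor expansion along the first row."""
    if len(A) == 1:
        return A[0][0]
    return sum(A[0][j] * (-1) ** j * det(minor(A, 0, j)) for j in range(len(A)))

def cofactor(A, i, j):
    """ij-th entry of the cofactor matrix: signed determinant of the minor."""
    return (-1) ** (i + j) * det(minor(A, i, j))

def expand_row(A, i):
    return sum(A[i][j] * cofactor(A, i, j) for j in range(len(A)))

def expand_col(A, j):
    return sum(A[i][j] * cofactor(A, i, j) for i in range(len(A)))

A = [[2, 0, 1],
     [1, 3, 2],
     [1, 1, 4]]
row_values = [expand_row(A, i) for i in range(3)]
col_values = [expand_col(A, j) for j in range(3)]
print(row_values, col_values)
```

The recursion makes the impracticality mentioned above visible: an $n \times n$ determinant spawns $n$ determinants of size $n-1$, so the work grows like $n!$.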
Note that this gives an easy way to write a formula for the
inverse of an $n \times n$ matrix.
\[ a^{-1}_{ij} = \det(A)^{-1} \operatorname{cof}(A)_{ji} \]
\[ \sum_{i=1}^{n} a_{ir} \operatorname{cof}(A)_{ir} \det(A)^{-1} = \det(A) \det(A)^{-1} = 1. \]
Now consider
\[ \sum_{i=1}^{n} a_{ir} \operatorname{cof}(A)_{ik} \det(A)^{-1} \]
when $k \neq r$. Replace the $k$th column with the $r$th column to obtain a
matrix $B_k$ whose determinant equals zero by Corollary 1. However,
expanding this matrix along the $k$th column yields
\[ 0 = \det(B_k) \det(A)^{-1} = \sum_{i=1}^{n} a_{ir} \operatorname{cof}(A)_{ik} \det(A)^{-1}. \]
Summarizing,
\[ \sum_{i=1}^{n} a_{ir} \operatorname{cof}(A)_{ik} \det(A)^{-1} = \delta_{rk}. \]
\[ \det(B) \det(A) = 1 \]
and so $\det(A) \neq 0$. Therefore from Theorem 1, $A^{-1}$ exists. Therefore,
\[ A^{-1} = (BA)A^{-1} = B(AA^{-1}) = BI = B. \]
The conclusion of this corollary is that left inverses, right
inverses and inverses are all the same in the context of $n \times n$
matrices.
Theorem 1 says that to find the inverse, take the transpose of
the cofactor matrix and divide by the determinant. The transpose of
the cofactor matrix is called the adjugate or sometimes the classical
adjoint of the matrix $A$. In words, $A^{-1}$ is equal to one divided by
the determinant of $A$ times the adjugate matrix of $A$.
1. $\det(A) = 0$.
2. $A$, $A^T$ are not one to one. (For $B = A, A^T$, there exists $x \neq 0$ such
that $Bx = 0$.)
3. $A$ is not onto.
An equivalent formulation of the above theorem is the follow-
ing corollary.
Corollary 19.4: Let $A$ be an $n \times n$ matrix. Then the following are
equivalent:
1. $\det(A) \neq 0$.
2. $A$ and $A^T$ are one to one.
3. A is onto.
Proof: This follows immediately from the above theorem.
20. Schur's Theorem
Consider the following system of equations for $x_1, x_2, \dots, x_n$:
\[ \sum_{j=1}^{n} a_{ij} x_j = 0, \quad i = 1, 2, \dots, m, \tag{39} \]
\[ x^T = (x_1, x_2, \dots, x_n) \in \mathbb{F}^n, \]
there exists $x \neq 0$ such that the components satisfy each of the equations
of Eq. 39. Here $\mathbb{F}$ denotes numbers, either the real numbers $\mathbb{R}$
or the complex numbers $\mathbb{C}$.
Proof: The above system is of the form
\[ Ax = 0, \]
where $A$ is an $m \times n$ matrix with $m < n$. Therefore, if you form the
matrix $\begin{pmatrix} A \\ 0 \end{pmatrix}$, an $n \times n$ matrix having $n - m$ rows of zeros on the
bottom, it follows this matrix has determinant equal to $0$. Therefore,
from Theorem 3, there exists $x \neq 0$ such that $Ax = 0$.
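Definition 20.2: A set of vectors in $\mathbb{F}^n$, $\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$, $\{x_1, \dots, x_k\}$ is called an
orthonormal set of vectors if
\[ (x_i, x_j) = x_i^T \overline{x_j} = \delta_{ij} \equiv \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases} \]
\[ (U_1 U_2)^T (U_1 U_2) = U_2^T U_1^T U_1 U_2 = I \]
\[ (U_1 U_2)^* (U_1 U_2) = U_2^* U_1^* U_1 U_2 = I \]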
\[ U^* A U = T, \tag{40} \]
where T is an upper triangular matrix having the eigenvalues of A on
the main diagonal listed according to multiplicity as zeros of the
characteristic equation. If A has all real entries and eigenvalues,
then U can be chosen to be orthogonal.
Proof: The theorem is clearly true if $A$ is a $1 \times 1$ matrix. Just let $U = (1)$, the
$1 \times 1$ matrix which has 1 down the main diagonal and zeros elsewhere.
Suppose it is true for $(n-1) \times (n-1)$ matrices and let $A$ be an $n \times n$
matrix. Then let $v_1$ be a unit eigenvector for $A$. There exists $\lambda_1$ such that
\[ A v_1 = \lambda_1 v_1, \quad |v_1| = 1. \]
By Theorem 3, there exists $\{v_1, \dots, v_n\}$, an orthonormal set in $\mathbb{R}^n$. Let
$U_0$ be a matrix whose $i$th column is $v_i$. Then from the above, it
follows $U_0$ is unitary. Then from the way you multiply matrices
$U_0^* A U_0$ is of the form
\[ \begin{pmatrix} \lambda_1 & \ast \\ \begin{matrix} 0 \\ \vdots \\ 0 \end{matrix} & A_1 \end{pmatrix}, \]
where $A_1$ is an $(n-1) \times (n-1)$ matrix. The above matrix is similar to
$A$ so it has the same eigenvalues and indeed the same characteristic
equation. Now by induction there exists an $(n-1) \times (n-1)$ unitary
matrix $\widetilde{U}_1$ such that
\[ \widetilde{U}_1^* A_1 \widetilde{U}_1 = T_{n-1}, \]
\[ U^* A U = T, \]
where $T$ is upper triangular. Now
\[ T^* = U^* A^* U = U^* A U = T \]
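\[ (Ax, x) \geq \lambda_1 |x|^2. \]
If $(Ax, x) \geq 0$, then all the eigenvalues of $A$ are nonnegative.
Proof: Let $U$ be the orthogonal matrix of Theorem 1. Then
\[
\begin{aligned}
Ax \cdot x &= x^T A x = x^T (U D U^T) x = (U^T x)^T D (U^T x) = \sum_i \lambda_i \left| (U^T x)_i \right|^2 \\
&\geq \lambda_1 \sum_i \left| (U^T x)_i \right|^2 = \lambda_1 (U^T x) \cdot (U^T x) \\
&= \lambda_1 (U^T x)^T U^T x = \lambda_1 x^T U U^T x = \lambda_1 x^T I x = \lambda_1 |x|^2.
\end{aligned}
\]
For the last part, if $Ax = \lambda x$, then
\[ 0 \leq (Ax, x) = \lambda |x|^2 \]
so $\lambda \geq 0$.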
\[ U^* A U = D, \tag{41} \]
where $D$ is a diagonal matrix with the eigenvalues of $A$ down the
main diagonal. Hence $A = U D U^*$. Define $A^{1/2} \equiv U D^{1/2} U^*$. Here
$D^{1/2}$ comes from replacing each diagonal entry of $D$ with its square
root. This is clearly symmetric and
\[ \left(A^{1/2}\right)^2 = U D^{1/2} U^* U D^{1/2} U^* = U D^{1/2} D^{1/2} U^* = U D U^* = A. \]
\[ AU = UD \]
and so, considering the $i$th column of both sides,
\[ A u_i = \lambda_i u_i. \]
The columns of $U$ are orthonormal because
\[ (u_i, u_j) = u_i^T \overline{u_j} = \delta_{ij} \]
by the fact that $U^* U = I$.
23. An Application to Statistics
A random vector is a function $X : \Omega \to \mathbb{R}^p$ where $\Omega$ is a probability
space. A probability space involves also a $\sigma$-algebra of measurable
sets $\mathcal{F}$ and a probability measure $P : \mathcal{F} \to [0, 1]$. In practice, people
often don't worry too much about the underlying probability
space and instead pay more attention to the distribution measure
of the random variable. For $E$ a suitable subset of $\mathbb{R}^p$, this measure
gives the probability that $X$ has values in $E$. There are often excellent
reasons for believing that a random vector is normally
distributed (central limit theorem). This means that the probability
that $X$ has values in a set $E$ is given by
\[ \int_E \frac{1}{(2\pi)^{p/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^* \Sigma^{-1} (x - \mu) \right) dx. \]
The expression in the integral is called the normal probability
density function. There are two parameters, $\mu$ and $\Sigma$, where $\mu$ is
called the mean and $\Sigma$ is called the covariance matrix. It is a
symmetric matrix which has all real eigenvalues which are all positive.
While it may be reasonable to assume this is the distribution, in
general you won't know $\mu$ and $\Sigma$, and in order to use this formula
to predict anything, you would need to know these quantities.
What people do to estimate these is to take $n$ independent
observations $x_1, \dots, x_n$ and try to predict what $\mu$ and $\Sigma$ should
be based on these observations. One criterion used for making this
determination is the method of maximum likelihood. In this
method, you seek to choose the two parameters $\Sigma$, $\mu$ in such a
way as to maximize the likelihood which is given as
\[ \prod_{i=1}^{n} \frac{1}{\det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (x_i - \mu)^* \Sigma^{-1} (x_i - \mu) \right). \]
For convenience the term $(2\pi)^{p/2}$ was ignored. This leads to the
estimate for $\mu$ as
\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \equiv \bar{x}. \]
This part follows fairly easily from taking the ln and then setting
partial derivatives equal to 0. The estimation of $\Sigma$ is harder. However,
it is not too hard using the theorems presented above. I am
following a nice discussion given in Wikipedia. It will make use of
Theorem 8.11 on the trace as well as the theorem about the square
root of a linear transformation given above. First note that the trace
of a $1 \times 1$ matrix is the single entry of the matrix. Therefore, by
Theorem 8.11,
\[
\overbrace{(x_i - \mu)^* \Sigma^{-1} (x_i - \mu)}^{1 \times 1}
= \operatorname{trace}\left( (x_i - \mu)^* \Sigma^{-1} (x_i - \mu) \right)
= \operatorname{trace}\left( (x_i - \mu)(x_i - \mu)^* \Sigma^{-1} \right).
\]
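Therefore, the thing to maximize is
\[
\begin{aligned}
&\prod_{i=1}^{n} \frac{1}{\det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} \operatorname{trace}\left( (x_i - \mu)(x_i - \mu)^* \Sigma^{-1} \right) \right) \\
&\quad = \det\left(\Sigma^{-1}\right)^{n/2} \exp\left( -\frac{1}{2} \sum_{i=1}^{n} \operatorname{trace}\left( (x_i - \mu)(x_i - \mu)^* \Sigma^{-1} \right) \right) \\
&\quad = \det\left(\Sigma^{-1}\right)^{n/2} \exp\Bigg( -\frac{1}{2} \operatorname{trace}\Bigg( \overbrace{\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^*}^{S} \, \Sigma^{-1} \Bigg) \Bigg) \\
&\quad \equiv \det\left(\Sigma^{-1}\right)^{n/2} \exp\left( -\frac{1}{2} \operatorname{trace}\left( S \Sigma^{-1} \right) \right),
\end{aligned}
\]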
where $S$ is the $p \times p$ matrix indicated above, $\mu$ defined as $\bar{x}$. Now $S$ is
symmetric and has eigenvalues which are all nonnegative because
\[ (Sy, y) = \sum_{i=1}^{n} \left( (x_i - \mu)^* y, (x_i - \mu)^* y \right) = \sum_{i=1}^{n} \left| (x_i - \mu)^* y \right|^2 \geq 0. \]
\[ \frac{n}{2} \sum_{i=1}^{p} \ln \lambda_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i. \]
the eigenvalues $1/n$. It follows $B^{-1}$ must equal the diagonal matrix
which has $1/n$ down the diagonal. The reason for this is that $B$ is
similar to a diagonal matrix because it is symmetric. Thus
$B^{-1} = P^{-1} n^{-1} I P = n^{-1} I$ because the identity commutes with every
matrix. But now it follows that $\Sigma = n^{-1} S$. Of course this is just an
estimate and so we write $\hat{\Sigma}$ instead of $\Sigma$.
This has shown that the maximum likelihood estimate for $\Sigma$ is
the following for $\mu = \bar{x}$.
\[ \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^*. \]
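The two estimates are simple to compute from data. Below is a minimal sketch in plain Python for real-valued observations, assuming each observation is a tuple of $p$ numbers; the function name `mle_mean_cov` and the three sample points are made up for illustration. Note the division by $n$, as in the formula above, rather than by $n - 1$.

```python
from fractions import Fraction

def mle_mean_cov(samples):
    """Maximum likelihood estimates: sample mean and (1/n) * sum of outer products."""
    n, p = len(samples), len(samples[0])
    mean = [sum(Fraction(x[j]) for x in samples) / n for j in range(p)]
    cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in samples) / n
            for b in range(p)]
           for a in range(p)]
    return mean, cov

# Hypothetical observations in R^2.
samples = [(1, 2), (3, 6), (5, 4)]
mean, cov = mle_mean_cov(samples)
print(mean, cov)
```

For real data the conjugate transpose in $(x_i - \mu)(x_i - \mu)^*$ is just the transpose, so each term is an ordinary outer product and the resulting matrix is symmetric.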
\[ \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_k \end{pmatrix} \]
and the bottom row of zero matrices in the partitioned matrix, as
well as the right columns of zero matrices are each of the right size.
Either could vanish completely. However, I will write it in the above
form. It is easy to make the necessary adjustments in the other two
cases.
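Theorem 24.3: Let $A$ be an $m \times n$ matrix. Then there exist unitary
matrices $U$ and $V$ of the appropriate size, such that
\[ U^* A V = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}, \]
where $\sigma$ is of the form
\[ \sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_k \end{pmatrix} \]
for the $\sigma_i$ the singular values of $A$, arranged in order of decreasing
size.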
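Proof: By the above lemma and Theorem 1, there exists an orthonormal
basis $\{v_i\}_{i=1}^{n}$ such that $A^* A v_i = \sigma_i^2 v_i$ where $\sigma_i^2 > 0$ for $i = 1, \dots, k$, ($\sigma_i > 0$),
and equals zero if $i > k$. Just order the columns of $V$ where $V^* A^* A V = D$
according to decreasing eigenvalues. Thus for $i > k$, $A v_i = 0$ because
\[ V \equiv \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix}. \]
Thus $U$ is the matrix which has the $u_i$ as columns and $V$ is defined as
the matrix which has the $v_i$ as columns. Then
\[
U^* A V = \begin{pmatrix} u_1^* \\ \vdots \\ u_k^* \\ \vdots \\ u_m^* \end{pmatrix} A \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix}
= \begin{pmatrix} u_1^* \\ \vdots \\ u_k^* \\ \vdots \\ u_m^* \end{pmatrix} \begin{pmatrix} \sigma_1 u_1 & \cdots & \sigma_k u_k & 0 & \cdots & 0 \end{pmatrix}
= \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix},
\]
where $\sigma$ is given in the statement of the theorem.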
There is some interesting terminology connected to the singular
value decomposition. The vectors
\[ y_i = A v_i = \sigma_i u_i, \quad i \leq k \]
obtained in the construction of the singular value decomposition
are called the principal component vectors. As pointed out in the
above derivation, these principal component vectors are orthogonal.
The singular value decomposition has as an immediate corol-
lary the following interesting result.
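Corollary 24.4: Let $A$ be an $m \times n$ matrix. Then the rank of $A$ and
$A^*$ both equal the number of singular values of $A$.
Proof: Since $V$ and $U$ are unitary, they are each one to one and onto and so
it follows that
\[ \operatorname{rank}(A) = \operatorname{rank}(U^* A V) = \operatorname{rank} \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} = \text{number of singular values.} \]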
Also since U, V are unitary,
25. Approximation in the Frobenius Norm
A norm is a measure of magnitude which satisfies the axioms of a
norm which are:
\[ \|u + v\| \leq \|u\| + \|v\| \]
\[ \|\alpha u\| = |\alpha| \, \|u\| \]
\[ \|u\| \geq 0 \text{ and equals } 0 \text{ only if } u = 0. \]
The Frobenius norm is one of many norms for a matrix. It is
arguably the most obvious. Here is its definition.
Definition 25.1: Let $A$ be a complex $m \times n$ matrix. Then
Proof: From the definition and letting $U$, $V$ be unitary and of the right
size,
\[ \|A - A'\|_F^2 = \left\| U \begin{pmatrix} \sigma - \sigma' & 0 \\ 0 & 0 \end{pmatrix} V^* \right\|_F^2 = \sum_{k=l+1}^{r} \sigma_k^2. \]
Thus $A$ is approximated by $A'$ where $A'$ has rank $l < r$. In fact, it is
also true that out of all matrices of rank $l$, this $A'$ is the one which is
closest to $A$ in the Frobenius norm. Here is why.
Let $B$ be a matrix which has rank $l$. Then from Lemma 2
\[ \|A - B\|_F = \left\| \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} - U^* B V \right\|_F, \]
and since the singular values of $A$ decrease from the upper left to
the lower right, it follows that for $B$ to be closest as possible to $A$ in
the Frobenius norm,
\[ U^* B V = \begin{pmatrix} \sigma' & 0 \\ 0 & 0 \end{pmatrix}, \]
which implies $B = A'$ above.
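This is obvious if you look at a simple example. Say
\[ \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \]
for example. Then what rank 1 matrix would be closest to this one
in the Frobenius norm? Obviously
\[ \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \]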
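The comparison in this example can be checked numerically. The sketch below uses plain Python and assumes the usual entrywise formula for the Frobenius norm, the square root of the sum of the squared absolute values of the entries; the helper names are illustrative. It compares only two rank 1 candidates, so it illustrates the claim rather than proving it.

```python
import math

def frobenius(M):
    """Square root of the sum of squared absolute values of the entries."""
    return math.sqrt(sum(abs(x) ** 2 for row in M for x in row))

def subtract(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[3, 0, 0, 0],
     [0, 2, 0, 0],
     [0, 0, 0, 0]]
keep_largest = [[3, 0, 0, 0],     # keeps the singular value 3
                [0, 0, 0, 0],
                [0, 0, 0, 0]]
keep_second = [[0, 0, 0, 0],      # keeps the singular value 2 instead
               [0, 2, 0, 0],
               [0, 0, 0, 0]]

error_largest = frobenius(subtract(A, keep_largest))
error_second = frobenius(subtract(A, keep_second))
print(error_largest, error_second)
```

Keeping the largest singular value leaves an error equal to the discarded singular value 2, while keeping the smaller one leaves an error of 3.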
[Figure: labels $A(x)$ and $A(\mathbb{F}^n)$]
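Proof: Let $t$ be real and let $\theta$ be a complex number with $|\theta| = 1$ chosen so that
$\overline{\theta}\,(y - Ax, Ax - Az) = |(y - Ax, Ax - Az)|$. Then
\[
\begin{aligned}
|y - A(x + t\theta(x - z))|^2
&= (y - Ax - t\theta(Ax - Az),\; y - Ax - t\theta(Ax - Az)) \\
&= |y - Ax|^2 + t^2 |Ax - Az|^2 - 2t \operatorname{Re} \overline{\theta}\,(y - Ax, Ax - Az) \\
&= |y - Ax|^2 + t^2 |Ax - Az|^2 - 2t \left| (A^* y - A^* A x, x - z) \right|.
\end{aligned}
\tag{44}
\]
If $x$ is a least squares solution then this is no smaller than $|y - Ax|^2$
which requires
\[ t^2 |Ax - Az|^2 - 2t \left| (A^* y - A^* A x, x - z) \right| \geq 0. \]
If $\left| (A^* y - A^* A x, x - z) \right| > 0$, then the above could not be valid for
all $t$, as seen by taking $t$ very small and positive. It follows that
\[ (A^* y - A^* A x, x - z) = 0 \]
for all $z \in \mathbb{F}^n$. Since $z$ is arbitrary, this requires that
\[ A^* y - A^* A x = 0. \]
If the condition $A^* A x = A^* y$ holds, then if $z$ is arbitrary,
\[
\begin{aligned}
|y - Az|^2 &= |y - Ax + A(x - z)|^2 \\
&= |y - Ax|^2 + |A(x - z)|^2 + 2 \operatorname{Re}\,(y - Ax, A(x - z)) \\
&= |y - Ax|^2 + |A(x - z)|^2 + 2 \operatorname{Re}\,(A^*(y - Ax), x - z) \\
&= |y - Ax|^2 + |A(x - z)|^2,
\end{aligned}
\]
which yields $x$ is a least squares solution.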
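Do least squares solutions exist? Yes they always do, and there is
an interesting connection with the singular value decomposition. It
is desired to find a solution to the equation $A^* A x = A^* y$. Consider
this equation in terms of the singular value decomposition.
\[
\overbrace{V \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} U^*}^{A^*}
\overbrace{U \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} V^*}^{A} x
= \overbrace{V \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} U^*}^{A^*} y.
\]
Therefore, this yields the following, using block multiplication and
multiplying on the left by $V^*$.
\[ \begin{pmatrix} \sigma^2 & 0 \\ 0 & 0 \end{pmatrix} V^* x = \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} U^* y. \tag{45} \]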
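28. Linear Regressions
An important application of least squares is to the problem of
finding a straight line approximating some data. Thus you are
given data points $(t_i, x_i)$, $i = 1, \dots, n$, and you would like to find $m$
and $b$ such that the line $x = mt + b$ has the property that
$x_i = m t_i + b$ for every $i$. Of course this will usually be impossible to do and this is
why you look for a least squares solution to the system
\[
\begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_n \end{pmatrix}
\begin{pmatrix} b \\ m \end{pmatrix}
= \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
Thus you want to find a solution to
\[
\begin{pmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}
\begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_n \end{pmatrix}
\begin{pmatrix} b \\ m \end{pmatrix}
= \begin{pmatrix} 1 & 1 & \cdots & 1 \\ t_1 & t_2 & \cdots & t_n \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}.
\]
Simplifying this, you need
\[
\begin{pmatrix} n & \sum_i t_i \\ \sum_i t_i & \sum_i t_i^2 \end{pmatrix}
\begin{pmatrix} b \\ m \end{pmatrix}
= \begin{pmatrix} \sum_i x_i \\ \sum_i t_i x_i \end{pmatrix}.
\]
Solving this for $b$ and $m$ yields the formulas for the linear regression
line. Thus
\[
b = \frac{\left(\sum_i x_i\right)\left(\sum_i t_i^2\right) - \left(\sum_i t_i\right)\left(\sum_i t_i x_i\right)}{n \sum_i t_i^2 - \left(\sum_i t_i\right)^2}, \qquad
m = \frac{n \sum_i t_i x_i - \left(\sum_i x_i\right)\left(\sum_i t_i\right)}{n \sum_i t_i^2 - \left(\sum_i t_i\right)^2}.
\]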
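These two formulas can be evaluated directly. The sketch below uses plain Python with exact fractions; the function name `regression_line` is an illustrative choice, and the eight data points are the ones used in the Maple example of this section.

```python
from fractions import Fraction

def regression_line(t, x):
    """Slope m and intercept b of the least squares line x = m*t + b."""
    n = len(t)
    sum_t, sum_x = sum(t), sum(x)
    sum_tt = sum(ti * ti for ti in t)
    sum_tx = sum(ti * xi for ti, xi in zip(t, x))
    denom = n * sum_tt - sum_t ** 2          # zero only if all t_i coincide
    b = Fraction(sum_x * sum_tt - sum_t * sum_tx, denom)
    m = Fraction(n * sum_tx - sum_x * sum_t, denom)
    return m, b

# Data points (1,3), (1,4), (1,5), (2,4), (2,7), (3,8), (4,11), (5,7).
t = [1, 1, 1, 2, 2, 3, 4, 5]
x = [3, 4, 5, 4, 7, 8, 11, 7]
m, b = regression_line(t, x)
print(m, b, float(m), float(b))
```

The exact values are $m = 165/127$ and $b = 386/127$, which agree with the decimal coefficients that Maple reports for the same data.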
People used to plug in and compute these things and then draw
the graphs on graph paper. Fortunately you don't have to work so
hard anymore. Suppose you wanted to find the regression line
which goes with the data (1, 3), (1, 4), (1, 5), (2, 4), (2, 7), (3, 8), (4, 11),
and (5, 7). Here is how you can do it using Maple 12.
1. Open Maple. Then click on file and then click on worksheet
mode. At > type the following: with(Statistics) and press
return. It will display all sorts of stuff which you can ignore.
2. At the next > you type X:=Vector([1,1,1,2,2,3,4,5]);Y:=
Vector([3,4,5,4,7,8,11,7]) (The X comes from the first components
in the data and the Y comes from the second components.)
Press return. This has now defined X and Y and they will
be displayed in blue on the screen.
3. At the next > you type the following: Fit(a*x+b,X,Y,x) and
then press return. It will then give you the desired equation. In
this case it is
1.29921259842519632x + 3.03937007874015785.
Now suppose you want to graph this along with the data
points. You do the following.
4. At the next > you type the following: with(plots); A:=plot(X,
Y,style=point,symbol=asterisk,symbolsize=15,color=red);
B:=plot(1.29921259842519632*x+3.03937007874015785,
x=0..6,color=black,tickmarks=[4,4],font=[TIMES,
Bold,24]); display(A,B); and then press return. This will graph
the regression line along with the data points which will be
plotted in this case as asterisks.
You can adjust the various parameters if you like. If you want to
do another one, just go back and redefine your vectors, press return,
and so on. It does all the work for you. In the given example, the plot
it returns is as follows.
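[Figure: labels $y$, $x$, and $\ker(A^* A)$]
\[ |A A^+ y - y| \leq |Ax - y| \text{ for all } x \]
\[ A A^+ A = A, \quad A^+ A A^+ = A^+, \quad A^+ A \text{ and } A A^+ \text{ are Hermitian.} \tag{47} \]
\[ \lim_{\delta \to 0} (A^* A + \delta I)^{-1} A^* = A^+ \]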
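\[ A = U \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} V^*, \qquad A^* = V \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^*. \]
Therefore, $A^* A + \delta I = V \begin{pmatrix} \sigma^2 + \delta I & 0 \\ 0 & \delta I \end{pmatrix} V^*$ and so it clearly has
an inverse. Now also recall that $A^+$ is given by $A^+ = V \begin{pmatrix} \sigma^{-1} & 0 \\ 0 & 0 \end{pmatrix} U^*$ while
\[
\begin{aligned}
(A^* A + \delta I)^{-1} A^*
&= V \begin{pmatrix} (\sigma^2 + \delta I)^{-1} & 0 \\ 0 & \delta^{-1} I \end{pmatrix} V^* V \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^* \\
&= V \begin{pmatrix} (\sigma^2 + \delta I)^{-1} & 0 \\ 0 & \delta^{-1} I \end{pmatrix} \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix}^T U^* \\
&= V \begin{pmatrix} (\sigma^2 + \delta I)^{-1} & 0 \\ 0 & \delta^{-1} I \end{pmatrix} \begin{pmatrix} \sigma & 0 \\ 0 & 0 \end{pmatrix} U^* \\
&= V \begin{pmatrix} (\sigma^2 + \delta I)^{-1} \sigma & 0 \\ 0 & 0 \end{pmatrix} U^*
\end{aligned}
\]
after adjusting the size of the zero blocks as described above. Now a
short computation verifies that
\[ (\sigma^2 + \delta I)^{-1} \sigma = \sigma^{-1} - \delta (\sigma^2 + \delta I)^{-1} \sigma^{-1}. \]
Now recall the formula for the inverse of a matrix given in Theorem
16.1. Using this formula, it is clear that $\lim_{\delta \to 0} \delta (\sigma^2 + \delta I)^{-1} \sigma^{-1} = 0$
in the sense that each entry of the matrices on the left converges
to 0.
Note that from the formula, you could get an idea of approximately
how close it is to the desired solution, and it is not a very
good way to approximate if the entries of $(\sigma^2 + \delta I)^{-1} \sigma^{-1}$ are large.
This would happen for example if some singular values are small.
Still, it is an interesting formula.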
Example 29.6: Find an approximation for the Moore Penrose inverse $A^+$ if
\[ A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix}. \]
To do it right, you would get the singular value decomposition
and use it. However, you can also pick small $\delta > 0$ and follow the
above procedure.
\[
\left( \begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix}^T \begin{pmatrix} 1 & 2 & 3 \\ 4 & -2 & 3 \end{pmatrix} + .0001 \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \right)^{-1}
\begin{pmatrix} 1.0 & 4.0 \\ 2.0 & -2.0 \\ 3.0 & 3.0 \end{pmatrix}
= \begin{pmatrix} -0.02152 & 0.14463 \\ 0.23385 & -0.14153 \\ 0.18462 & 0.04617 \end{pmatrix}
\]
It is hoped this would be close. You could try another $\delta$ and see if it
varies by much. You could also find it directly from the above
formula for $A^+$ involving the singular value decomposition. If you
do it this way, you find that it should be
\[
\begin{pmatrix}
-2.15384616 \times 10^{-2} & 0.144615385 \\
0.233846154 & -0.141538462 \\
0.184615385 & 4.61538464 \times 10^{-2}
\end{pmatrix}.
\]
References
1. Baker R (2001) Linear algebra. Rinton Press, Princeton, NJ
2. Friedberg S, Insel A, Spence L (2003) Linear algebra. Prentice Hall, Upper Saddle River, NJ
3. Golub G, Van Loan C (1996) Matrix computations. Johns Hopkins University Press, Baltimore, MD
4. Hofman K, Kunze R (1971) Linear algebra. Prentice Hall, Upper Saddle River, NJ
5. Householder A (1975) The theory of matrices in numerical analysis. Dover, New York
6. Horn R, Johnson C (1985) Matrix analysis. Cambridge University Press, Cambridge, UK
7. Marcus M, Minc H (1964) A survey of matrix theory and matrix inequalities. Allyn and Bacon, Boston
8. Nobel B, Daniel J (1977) Applied linear algebra. Prentice Hall, Upper Saddle River, NJ
9. Gilbert S (1980) Linear algebra and its applications. Harcourt Brace Jovanovich, San Diego, CA
Chapter 20
Abstract
In this chapter we provide an overview of the basic theory of ordinary differential equations (ODE).
We give the basics of analytical methods for their solutions and also review numerical methods. The chapter
should serve as a primer for the basic application of ODEs and systems of ODEs in practice. As an example,
we work out the equations arising in Michaelis–Menten kinetics and give a short introduction to using
Matlab for their numerical solution.
Key words: Ordinary differential equations, ODE, Systems of ODE, Numerical methods, Matlab,
Michaelis–Menten kinetics
1. Introduction to Differential Equations
1.1. Differential Equations
A differential equation is simply an equation that depends not only
on the value of a variable but also on its derivatives. For example,
Newton's law of cooling with a varying ambient temperature yields
the equation
\[ \frac{dx}{dt} + x = 2\cos t. \tag{1} \]
Here $x$ is the dependent variable (temperature) and $t$ is the independent
variable (time). A solution to (1) is simply a function $x(t)$ that
satisfies the equation. All solutions to (1) can be written as
\[ x(t) = \cos t + \sin t + Ce^{-t}, \]
for an arbitrary constant $C$. If we write an expression for all solutions
of a differential equation we often call it the general solution.
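One quick way to gain confidence in a proposed general solution is to plug it back into the equation numerically. The sketch below, in plain Python, assumes equation (1) reads $dx/dt + x = 2\cos t$ and approximates the derivative by a central difference; the function names, the step size `h`, and the sample values of $t$ and $C$ are illustrative choices.

```python
import math

def x(t, C):
    """Proposed general solution x(t) = cos t + sin t + C e^{-t}."""
    return math.cos(t) + math.sin(t) + C * math.exp(-t)

def residual(t, C, h=1e-6):
    """dx/dt + x - 2 cos t, with dx/dt approximated by a central difference."""
    dxdt = (x(t + h, C) - x(t - h, C)) / (2 * h)
    return dxdt + x(t, C) - 2 * math.cos(t)

max_residual = max(abs(residual(t, C))
                   for t in (0.0, 0.5, 1.0, 2.0)
                   for C in (-2.0, 0.0, 3.0))
print(max_residual)
```

The residual is zero up to the error of the finite difference for every tested value of $C$, which is what one expects of a general solution. Such a check is not a proof, but it catches sign and algebra mistakes quickly.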
Suppose the temperature at a given time is known (an initial
condition). For example, $x(0) = x_0$ for some constant $x_0$. We solve
for $C$ to obtain $C = x_0 - 1$. We obtain a particular solution that
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_20, © Springer Science+Business Media, LLC 2013
476 J. Lebl
1.2. Systems of Differential Equations
When there is more than one dependent variable and more than
one equation we will talk of a system of ODEs. Let us write the
dependent variables as a vector
\[
\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad \text{then we write} \quad
\frac{d\vec{x}}{dt} = \begin{bmatrix} \frac{dx_1}{dt} \\ \frac{dx_2}{dt} \\ \vdots \\ \frac{dx_n}{dt} \end{bmatrix}.
\]
1.3. Analytical Versus Numerical Solutions
Some ODEs can be solved explicitly by analytical methods.
In general, we will not be able to find an explicit formula for the
solution, and we will have to be satisfied with a numerical approximation.
When analytical methods do apply, such as for linear constant
coefficient ODEs, the form of the solution often tells us much
more about the behavior of the system than simply knowing its
values approximately at certain points. Furthermore, a complicated
system can be linearized and the linearization can be solved explicitly
to obtain a local approximation.
Analytical methods should also not be shunned for their complexity.
Symbolic computation can greatly simplify the effort
required. If we can find an explicit solution symbolically, it is almost
always preferable, both in computational time and gained theoretical
insight, to applying a purely numerical method.
Numerical methods have an advantage of always working.
On the other hand, they only give us an approximation of the
solution, and it is up to us to find out how good the approximation
is. Furthermore, applying numerical methods blindly might lead to
nonsense solutions. We will touch on the topics of error estimation
and stability of algorithms later.
1.4. Further Reading
For further details on differential equations see the author's online
book (7) or the many other available books (1, 2, 4, 5, 6). For a
reference of ODEs with applications in biology and related sciences
see especially part II of (3).
2. Analytical Methods
2.1. First-Order Equations
2.1.1. The Exponential
Perhaps the most important differential equation to understand is
the equation of exponential growth or decay:
\[ \frac{dx}{dt} = kx, \tag{3} \]
for some constant $k$. When $k$ is positive, the equation gives exponential
growth; when $k$ is negative we get decay. The equation
comes up often, whenever the rate of growth (or decay) of a
quantity is proportional to the quantity itself. For example, the
equation gives the simplest model of population growth. Without
much difficulty we can guess the solution to be the exponential
\[ x = Ce^{kt}. \]
Here $C$ is an arbitrary constant. In fact, one way to define and
compute the exponential function is precisely as the solution to (3)
(with $k = 1$ and $C = 1$). It turns out that the majority of interesting
20 Ordinary Differential Equations 479
2.1.4. Linear Equations
Linear equations appear commonly in applications. They have
many nice properties and of all differential equations are perhaps
the best understood. Let us first focus on first-order linear
equations. A first-order equation is linear if we can put it into the
following form:
$$\frac{dx}{dt} + p(t)x = f(t). \tag{5}$$
Solutions of linear equations have nice properties. For example,
the solution exists wherever p(t) and f(t) are defined, and has the
same regularity (that is, has the same number of derivatives). But
most importantly for us right now, there is a method for solving
linear first-order equations.
First we find a function r(t) such that
$$r(t)\frac{dx}{dt} + r(t)p(t)x = \frac{d}{dt}\Big[r(t)x\Big].$$
Then we can multiply both sides of (5) by r(t) to obtain
$$\frac{d}{dt}\Big[r(t)x\Big] = r(t)f(t). \tag{6}$$
Now we integrate both sides. The right-hand side does not depend
on x and the left-hand side is written as a derivative of a function.
Afterwards, we can solve for x. The function r(t) is called the
integrating factor and the method is called the integrating factor
method.
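In particular, we let
$$r(t) = e^{\int p(t)\,dt}. \tag{7}$$
We substitute (7) into (6) and compute:
$$\frac{d}{dt}\Big[e^{\int p(t)\,dt}\,x\Big] = e^{\int p(t)\,dt} f(t),$$
$$e^{\int p(t)\,dt}\,x = \int e^{\int p(t)\,dt} f(t)\,dt + C,$$
$$x = e^{-\int p(t)\,dt}\left(\int e^{\int p(t)\,dt} f(t)\,dt + C\right).$$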
$$e^{t^2}\frac{dx}{dt} + 2te^{t^2}x = e^{t-t^2}e^{t^2},$$
$$\frac{d}{dt}\Big[e^{t^2}x\Big] = e^{t}.$$
We integrate
$$e^{t^2}x = e^{t} + C,$$
$$x = e^{t-t^2} + Ce^{-t^2}.$$
$$x = e^{t-t^2} - 2e^{-t^2}.$$
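A closed-form solution obtained with an integrating factor is easy to check numerically. The sketch below assumes that the worked example is the equation dx/dt + 2tx = e^(t - t^2) with initial condition x(0) = -1 (the condition implied by C = -2); the helper names `x_exact` and `residual` are illustrative only.

```python
import math

def x_exact(t):
    """Candidate closed-form solution x(t) = e^(t - t^2) - 2 e^(-t^2)."""
    return math.exp(t - t * t) - 2.0 * math.exp(-t * t)

def residual(t, h=1e-6):
    """Left side minus right side of dx/dt + 2 t x = e^(t - t^2).

    The derivative is approximated by a central difference, so the
    residual should be tiny but not exactly zero.
    """
    dxdt = (x_exact(t + h) - x_exact(t - h)) / (2.0 * h)
    return dxdt + 2.0 * t * x_exact(t) - math.exp(t - t * t)

# Largest residual on a grid of points t = 0, 0.1, ..., 3.0
max_residual = max(abs(residual(0.1 * i)) for i in range(31))
print("x(0) =", x_exact(0.0))
print("max residual =", max_residual)
```

A residual near the size of the finite-difference error suggests that the formula does satisfy the equation; it is a sanity check, not a proof.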
Fig. 2. Slope field and some solutions of $\frac{dx}{dt} = 0.1\,x\,(5 - x)$.
2.2. Systems of ODEs
Often we do not have just one dependent variable and one
equation. We write
$$\frac{d\vec{x}}{dt} = \vec{F}(\vec{x}, t)$$
for a first-order system of ODEs.
As we mentioned in the introduction, we can write any nth-order
equation as a first-order system of n equations. A similar
process can be followed for a system of higher-order differential
equations. For example, a system of k differential equations in k
unknowns, all of order n, can be transformed into a first-order
system of $nk$ equations and $nk$ unknowns.
2.2.1. Linear First-Order Systems and Linearization
An example of a system with a rather simple and elegant solution is a
constant coefficient first-order system. That is, suppose that we
have a constant $k \times k$ matrix A and we wish to solve the system
$$\vec{x}\,' = A\vec{x} + \vec{F}(t),$$
for some k-vector $\vec{x}$. We will mostly assume that $\vec{F}(t) = 0$. That is,
we will consider the homogeneous system $\vec{x}\,' = A\vec{x}$.
Many of the equations and systems that come up in applications
are neither constant coefficient nor linear, and to solve them we
must satisfy ourselves with a numerical approximation. In this case,
however, we could linearize the equations just as we do in calculus.
We will obtain a constant coefficient linear system, which we can
then analyze easily. This technique can be useful if we wish to study
approximately the behavior for a short period of time. For example,
in the next section we will see that simply studying the eigenvalues
and eigenvectors of the resulting matrix can tell us much about the
behavior of the system. Some of these qualitative observations may
be hard to see if we simply look at numerical solutions. For exam-
ple, we can analyze the stability of equilibrium solutions of an
autonomous system.
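For example, we start with a nonlinear autonomous system
$$\frac{d\vec{x}}{dt} = \vec{F}(\vec{x}).$$
We wish to study the system near some point $\vec{x}_0$ and for some small
period of time. We could instead study the linear constant
coefficient system
$$\frac{d\vec{x}}{dt} = J_{\vec{F}}(\vec{x}_0)\,(\vec{x} - \vec{x}_0),$$
where $J_{\vec{F}}$ is the Jacobian matrix of $\vec{F}$ evaluated at the point $\vec{x}_0$
(this form applies when $\vec{x}_0$ is an equilibrium, so that $\vec{F}(\vec{x}_0) = 0$;
otherwise the constant term $\vec{F}(\vec{x}_0)$ must be added to the right-hand side). That
is,
$$J_{\vec{F}}(\vec{x}_0) = \begin{bmatrix} \left.\frac{\partial F_1}{\partial x_1}\right|_{\vec{x}_0} & \cdots & \left.\frac{\partial F_1}{\partial x_k}\right|_{\vec{x}_0} \\ \vdots & \ddots & \vdots \\ \left.\frac{\partial F_k}{\partial x_1}\right|_{\vec{x}_0} & \cdots & \left.\frac{\partial F_k}{\partial x_k}\right|_{\vec{x}_0} \end{bmatrix}.$$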
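The linearization step can be sketched numerically: approximate the Jacobian by central differences at an equilibrium and inspect its eigenvalues. The system below (a damped pendulum) is a hypothetical example chosen for illustration and does not come from the chapter; `jacobian` and `eigenvalues_2x2` are likewise illustrative helper names.

```python
import cmath
import math

def F(v):
    """Hypothetical autonomous system: x' = y, y' = -sin(x) - 0.5 y."""
    x, y = v
    return [y, -math.sin(x) - 0.5 * y]

def jacobian(F, v0, eps=1e-6):
    """Central-difference approximation of the Jacobian of F at v0."""
    n = len(v0)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        vp, vm = list(v0), list(v0)
        vp[j] += eps
        vm[j] -= eps
        Fp, Fm = F(vp), F(vm)
        for i in range(n):
            J[i][j] = (Fp[i] - Fm[i]) / (2.0 * eps)
    return J

def eigenvalues_2x2(J):
    """Eigenvalues of a 2 x 2 matrix from its trace and determinant."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

J = jacobian(F, [0.0, 0.0])            # (0, 0) is an equilibrium: F = 0 there
lam1, lam2 = eigenvalues_2x2(J)
is_stable = lam1.real < 0 and lam2.real < 0
print(J, lam1, lam2, is_stable)
```

Both eigenvalues have negative real part here, so the linearized system decays toward the equilibrium; the nonzero imaginary parts indicate that it does so while oscillating.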
$$e^{\lambda I + B} = e^{\lambda I} e^{B}.$$
We know how to compute $e^{\lambda I}$. As $B^k = 0$, the power series for $e^{B}$ is
finite.
2.2.2. Two-Dimensional Homogeneous Constant Coefficient Systems
Let us consider a very simple system of two equations. The
geometric intuition gained from this simple example can be easily
generalized to more complex systems, not necessarily constant
coefficient nor homogeneous. Suppose we have a real
diagonalizable $2 \times 2$ matrix A and the system
$$\begin{bmatrix} x \\ y \end{bmatrix}' = A \begin{bmatrix} x \\ y \end{bmatrix}. \tag{11}$$
We will be able to visually tell how the vector field looks once we
find the eigenvalues and eigenvectors of the matrix A.
When we diagonalize the matrix, $A = EDE^{-1}$, the columns of E
are the eigenvectors of A. As $E^{-1}$ is just a
constant matrix, we can write any solution to (11) as
$$\begin{bmatrix} x \\ y \end{bmatrix} = c_1 \vec{v}_1 e^{\lambda_1 t} + c_2 \vec{v}_2 e^{\lambda_2 t},$$
where $\lambda_1$ and $\lambda_2$ are the two eigenvalues of A, $\vec{v}_1$ and $\vec{v}_2$ are the
corresponding eigenvectors, and $c_1$ and $c_2$ are constants.
Case 1. Suppose that the eigenvalues are real and positive. If (x, y) is
on the line determined by an eigenvector $\vec{v}$ corresponding to a real
positive eigenvalue $\lambda$ then our solution is simply a multiple of $\vec{v}e^{\lambda t}$.
That is, our solution moves along a line away from the origin along
the vector $\vec{v}$.
If we start at a general point in the plane we simply move along a
certain curve away from the origin. See Fig. 4 for an example where
$A = \begin{bmatrix} 1 & 1 \\ 0 & 2 \end{bmatrix}$. The eigenvalues are 1 and 2 and the corresponding
eigenvectors are $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$. The eigenvectors are shown in the
figure by the two large arrows. Several solutions are also shown.
This type of behavior is called a source or an unstable node.
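The eigenvalues and eigenvectors quoted for this matrix can be verified directly, and the sign pattern of the eigenvalues turned into a label. This is a minimal sketch for real eigenvalues only; the labels other than "source" follow standard textbook terminology rather than anything stated in this excerpt, and the helper names are illustrative.

```python
import math

A = [[1.0, 1.0],
     [0.0, 2.0]]

def matvec(A, v):
    """Product of a 2 x 2 matrix with a 2-vector."""
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def real_eigenvalues_2x2(A):
    """Real eigenvalues (smaller first) from the characteristic polynomial."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = tr * tr - 4.0 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; not handled in this sketch")
    root = math.sqrt(disc)
    return (tr - root) / 2.0, (tr + root) / 2.0

def classify(l1, l2):
    """Label the origin by the signs of two real eigenvalues."""
    if l1 > 0 and l2 > 0:
        return "source (unstable node)"
    if l1 < 0 and l2 < 0:
        return "sink (stable node)"
    if l1 * l2 < 0:
        return "saddle"
    return "degenerate (zero eigenvalue)"

l1, l2 = real_eigenvalues_2x2(A)
label = classify(l1, l2)
print(l1, l2, label)
print(matvec(A, [1.0, 0.0]), matvec(A, [1.0, 1.0]))  # A v = lambda v for each pair
```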
2.2.3. Nonhomogeneous Linear Systems
So far we have only talked about homogeneous systems $\vec{x}\,' = A\vec{x}$.
Let us see what to do with a nonhomogeneous equation
$$\vec{x}\,' = A\vec{x} + \vec{F}(t). \tag{12}$$
For many problems the homogeneous part of an equation tells us
what the system wants to do on its own, while $\vec{F}(t)$ will be some
external input.
The way the system is solved is to simply find any one particular
solution $\vec{x}_p$ to (12). Then we find the general solution $\vec{x}_h$ to the
homogeneous equation $\vec{x}_h' = A\vec{x}_h$. Finally, we use the linearity of
the equation to note that
$$\vec{x} = \vec{x}_h + \vec{x}_p$$
is the general solution to (12) and we can solve for the initial
conditions.
Therefore, the trick is to find a solution $\vec{x}_p$ (any solution) to
(12), ignoring any initial conditions we have. There exist analytic
methods for finding such solutions, but they are beyond the scope
of this chapter. Often one simply makes an educated guess to find
the solution.
3. Numerical Methods
3.1. Euler's Method
As we said before, unless f(t, x) is of a special form, it is generally
very hard if not impossible to get a nice formula for the solution of
the problem
$$\frac{dx}{dt} = f(t, x), \qquad x(t_0) = x_0.$$
What if we want to find the value of the solution at some
particular t? Or perhaps we want to produce a graph of the solution
to inspect the behavior. In this section we will learn about the basics
of numerical approximation of solutions.
The simplest method for approximating a solution is Euler's
method. There are much better numerical methods, but Euler's
method best illustrates the ideas.
It works as follows: We take $t_0$ and compute the slope
$k = f(t_0, x_0)$. The slope is the change in x per unit change in t. We follow
the line of this slope for an interval of length h on the t axis. Hence if
$x = x_0$ at $t_0$, then we will say that $x_1$ (the approximate value of x at
$t_1 = t_0 + h$) will be $x_1 = x_0 + hk$. We repeat the procedure to
compute $t_2$ and $x_2$ in the same way using $t_1$ and $x_1$.
More abstractly, for any i = 0, 1, 2, 3, . . ., we compute
$$t_{i+1} = t_i + h, \qquad x_{i+1} = x_i + h\,f(t_i, x_i).$$
The line segments we get are an approximate graph of the solution.
See Fig. 8 for the plot of the real solution and the first two steps of
the approximation with h = 1.
Let us see what happens with the equation $\frac{dx}{dt} = \frac{x^2}{3}$, x(0) = 1.
Let us try to approximate x(2) using Euler's method. In Fig. 8 we
have essentially graphically approximated x(2) with step size 1.
With step size 1 we have x(2) ≈ 1.926. The real answer is 3.
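The procedure just described can be written in a few lines. The following is a minimal sketch of Euler's method applied to dx/dt = x^2/3, x(0) = 1; the function names `euler` and `f` are illustrative, and the exact value x(2) = 3 is the one quoted in the text.

```python
def euler(f, t0, x0, t_end, h):
    """Euler's method: repeat x <- x + h f(t, x), t <- t + h."""
    n_steps = round((t_end - t0) / h)   # assumes h divides the interval evenly
    t, x = t0, x0
    for _ in range(n_steps):
        x = x + h * f(t, x)
        t = t + h
    return x

def f(t, x):
    """Right-hand side of dx/dt = x^2 / 3."""
    return x * x / 3.0

exact = 3.0
for h in (1.0, 0.5, 0.25):
    approx = euler(f, 0.0, 1.0, 2.0, h)
    print(f"h = {h:<5} x(2) ~ {approx:.5f}   error = {exact - approx:.5f}")
```

With h = 1 this reproduces the two steps drawn in Fig. 8; smaller step sizes give the approximations tabulated for this example.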
The difference between the actual solution and the approximate
solution we will call the error. We will usually talk about just
the size of the error (absolute value) and we do not care much
about its sign. The main point is that we usually do not know the
real solution, so we only have a vague understanding of the error.
So using h = 1 we are approximately 1.074 off. Let us halve the
step size. Computing $x_4$ with h = 0.5, we find that x(2) ≈ 2.209,
so an error of about 0.791. Table 1 gives the values computed
for various parameters.
We notice that except for the first few times, every time we
halved the interval the error approximately halved. This halving of
the error is a general feature of Euler's method as it is a first-order
Fig. 8. Two steps of Euler's method (step size 1) and the exact solution for the equation
$\frac{dx}{dt} = \frac{x^2}{3}$ with initial conditions x(0) = 1.
Table 1
Euler's method approximation of x(2), where dx/dt = x^2/3, x(0) = 1

h            Approximate x(2)   Error     Error / Previous error
1            1.92593            1.07407
0.5          2.20861            0.79139   0.73681
0.25         2.47250            0.52751   0.66656
0.125        2.68034            0.31966   0.60599
0.0625       2.82040            0.17960   0.56184
0.03125      2.90412            0.09588   0.53385
0.015625     2.95035            0.04965   0.51779
0.0078125    2.97472            0.02528   0.50913
Table 2
Attempts to use Euler's method to approximate x(3), where dx/dt = x^2/3, x(0) = 1

h            Approximate x(3)
1            3.16232
0.5          4.54329
0.25         6.86079
0.125        10.80321
0.0625       17.59893
0.03125      29.46004
0.015625     50.40121
0.0078125    87.75769
same error. This reduction can be a big deal. With ten halvings
(starting at h = 1) we have 1,024 steps, whereas with five halvings
we only have to do 32 steps, assuming that the error was
comparable to start with. So a higher-order method should drastically
reduce the computations necessary. We will give a fourth-order
method (Runge-Kutta) below, which happens to be essentially
the simplest practical method.
Note that we do not know the error! How do we know what is
the right step size? We can keep halving the interval, and we expect
that we can estimate the error from a few of these calculations
and the assumption that the error goes down by a factor of
one half each time (if we are using Euler's method). That is, we
compute the approximation for h and h/2 and we compute their
difference. This difference we can expect to be about half the
error of the approximation with step size h.
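The step-halving estimate can be tried on the same example, where the exact answer happens to be known and the estimate can therefore be compared with the true error. This is a sketch only; the step size h = 1/128 and the helper names are choices made here for illustration.

```python
def euler(f, t0, x0, t_end, h):
    """Euler's method with a fixed step size h."""
    n_steps = round((t_end - t0) / h)
    t, x = t0, x0
    for _ in range(n_steps):
        x = x + h * f(t, x)
        t = t + h
    return x

def f(t, x):
    return x * x / 3.0

h = 1.0 / 128
coarse = euler(f, 0.0, 1.0, 2.0, h)        # approximation with step h
fine = euler(f, 0.0, 1.0, 2.0, h / 2.0)    # approximation with step h/2

# For a first-order method the difference is about half the error at step h,
# so doubling it estimates that error without knowing the exact solution.
estimated_error = 2.0 * abs(fine - coarse)
true_error = abs(3.0 - coarse)             # available here only because x(2) = 3 is known
print(estimated_error, true_error)
```

The estimate relies on the error actually behaving like a constant times h; near a singularity, as in the attempt to compute x(3), that assumption fails and the estimate is meaningless.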
Let us talk a little bit more about the example $\frac{dx}{dt} = \frac{x^2}{3}$, x(0) = 1.
Suppose that instead of the value x(2) we wish to find x(3). The
results of this effort are listed in Table 2 for successive halvings of h.
What is going on here? If we solve the equation exactly we will
notice that the solution does not exist at t = 3. In fact, the solution
goes to infinity when we approach t = 3.
Further problems might arise when the solution oscillates
wildly near some point.
3.2. Fourth-Order Runge-Kutta
In real applications we would not use a simple method such as
Euler's. The simplest method that would likely be used in a real
application is the standard fourth-order Runge-Kutta method.
$$t_{i+1} = t_i + h,$$
$$x_{i+1} = x_i + \frac{k_1 + 2k_2 + 2k_3 + k_4}{6}\,h.$$
That is, $k_1$ is just the slope at the starting point as in Euler's.
However, we then only go half the distance and compute the slope
$k_2$ at this new approximation. We repeat this step using $k_2$ to
compute $k_3$. Finally we use $k_3$ and a full step to obtain the slope
$k_4$. We then use a weighted average of the slopes to compute the
next $x_{i+1}$.
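The description of the four slopes translates directly into code. The sketch below uses the standard textbook formulas for classical fourth-order Runge-Kutta, which match the verbal description; the test problem dx/dt = x^2/3, x(0) = 1 is reused from the Euler example, and the function names are illustrative.

```python
def rk4_step(f, t, x, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    k1 = f(t, x)                                   # slope at the starting point
    k2 = f(t + h / 2.0, x + h * k1 / 2.0)          # slope after half a step using k1
    k3 = f(t + h / 2.0, x + h * k2 / 2.0)          # slope after half a step using k2
    k4 = f(t + h, x + h * k3)                      # slope after a full step using k3
    return x + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def rk4(f, t0, x0, t_end, h):
    """Apply rk4_step repeatedly with a fixed step size h."""
    n_steps = round((t_end - t0) / h)
    t, x = t0, x0
    for _ in range(n_steps):
        x = rk4_step(f, t, x, h)
        t = t + h
    return x

def f(t, x):
    return x * x / 3.0

rk4_error = abs(3.0 - rk4(f, 0.0, 1.0, 2.0, 0.125))
print("RK4 error with h = 0.125:", rk4_error)
```

For comparison, Euler's method with the same step size h = 0.125 has an error of about 0.32 on this problem (Table 1), at a quarter of the function evaluations per step.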
Choosing the right method to use and the right step size can be
very tricky. There are several competing factors to consider.
• Computational time: Each step takes computer time. Even if
  the function f is simple to compute, we do it many times over.
  Large step size means faster computation, but perhaps not the
  right precision.
• Roundoff errors: Computers only compute with a certain number
  of significant digits. Errors introduced by rounding numbers
  off during our computations become noticeable when the step
  size becomes too small relative to the quantities we are working
  with. So reducing step size may in fact make errors worse.
• Stability: Certain equations may be numerically unstable. Small
  errors may lead to large errors down the line. What may also
  happen is that the numbers may never stabilize no matter how
  many times we halve the interval. In the worst case the numerical
  computations might be giving us bogus numbers that look
  like a correct answer.
We have seen just the beginnings of the challenges that appear
in real applications. Numerical approximation of solutions to
3.3. Systems and Higher-Order Equations
The numerical methods that we have described can be easily applied
to systems of ODEs. We simply treat the dependent variable
as a vector. That is, suppose we have the equation
$$\vec{x}\,' = \vec{f}(t, \vec{x}), \qquad \vec{x}(t_0) = \vec{x}_0.$$
Let us describe Euler's method. We pick a step size h as before. And
as before we compute $t_{i+1}$ and $\vec{x}_{i+1}$ from $t_i$ and $\vec{x}_i$:
$$t_{i+1} = t_i + h, \qquad \vec{x}_{i+1} = \vec{x}_i + h\,\vec{f}(t_i, \vec{x}_i).$$
Runge-Kutta and other methods are converted for systems of
differential equations in exactly the same way.
As we have said in the introduction, we can convert any
equation of any order (or any system of any order) into a first-order
system. Thus, having numerical methods for first-order systems of
ODEs is sufficient for solving any single ODE or system of ODEs.
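As a sketch of both ideas, the second-order equation x'' = -x (a hypothetical example, not taken from the chapter) can be rewritten as the first-order system x1' = x2, x2' = -x1 and handed to a vector version of Euler's method. With x(0) = 1 and x'(0) = 0 the exact solution is x(t) = cos t, which gives something to compare against.

```python
import math

def euler_system(f, t0, x0, t_end, h):
    """Euler's method where the state x is a list of numbers."""
    n_steps = round((t_end - t0) / h)
    t, x = t0, list(x0)
    for _ in range(n_steps):
        dx = f(t, x)
        x = [xi + h * dxi for xi, dxi in zip(x, dx)]
        t = t + h
    return x

def oscillator(t, x):
    """x'' = -x written as a first-order system: x[0] = position, x[1] = velocity."""
    return [x[1], -x[0]]

pos, vel = euler_system(oscillator, 0.0, [1.0, 0.0], 1.0, 0.001)
print(pos, math.cos(1.0))     # approximate vs exact position at t = 1
print(vel, -math.sin(1.0))    # approximate vs exact velocity at t = 1
```

Only the update line changes compared with the scalar case: the same formula is applied componentwise.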
3.3.1. Example: Michaelis-Menten Kinetics
An example of a system of ODEs that must be solved with
numerical methods is the set of equations arising from enzyme kinetics.
For more details on these derivations see for example (3).
We construct a model that allows us to understand how fast a
certain reaction occurs and what the concentrations of product
and substrate are at any particular point in time. We will obtain a
nonlinear system of ODEs that we will solve numerically.
Let S be the concentration of substrate, let P be the concentration
of the product. Let $E_S$ denote the concentration of substrate-bound
enzyme and $E_F$ the concentration of free enzyme. All of S, P, $E_S$, and
$E_F$ are functions of the time t. The reactions occurring are as follows
$$E_F + S \;\underset{k_{-1}}{\overset{k_1}{\rightleftharpoons}}\; E_S \;\overset{k_2}{\longrightarrow}\; E_F + P,$$
$$\frac{dP}{dt} = k_2 E_S.$$
We also need to understand how $E_S$ and $E_F$ change. $E_F$ is being
produced by the first reaction at the rate $k_{-1}E_S - k_1 S E_F$ and by the
second reaction at the rate of $k_2 E_S$. That is,
$$\frac{dE_F}{dt} = k_{-1}E_S - k_1 S E_F + k_2 E_S.$$
Assuming that $E = E_S + E_F$, the total concentration of enzyme, is
constant we obtain
$$\frac{dE_S}{dt} = -k_{-1}E_S + k_1 S E_F - k_2 E_S.$$
We have obtained an autonomous system of four equations in four
unknowns. We can still do some simplification. First we note that
only one equation depends on P. As soon as we know $E_S$, we also
know $\frac{dP}{dt}$. So we might forget about the equation $\frac{dP}{dt} = k_2 E_S$ until
$$\frac{dP}{dt} = k_2 E_S = \frac{V_{max}\,S}{K_M + S},$$
where we let $V_{max} = k_2 E$. The value $V_{max}$ becomes the maximum
reaction rate. This simplified equation makes it easier to
determine the constants $K_M$ and $V_{max}$ from the measured data.
3.3.2. Using Matlab or Octave
Let us give a short overview of how to use Matlab (8) to find and
plot numerical solutions to ODEs. All examples also work in the
free software clone Octave (9), which has essentially equivalent
syntax. In particular we will use the system of equations we derived
in the last example.
Let us pick some initial values for the parameters. Of course, in
practice obtaining these numbers can be very hard. Let the reaction
rate constants be $k_{-1} = 3$, $k_1 = 12$, $k_2 = 5$. Let the initial total
concentration of enzyme be E = 0.004. Let the initial concentration
of the substrate-bound enzyme be $E_S = 0$. Finally we set the initial
concentration of substrate S to be 1 and the initial concentration of
the product P to be 0.
We must define the system for Matlab. We define a vector $\vec{x}$ to
store the variables S and $E_S$, that is, $x_1 = S$ and $x_2 = E_S$. We are also
interested in P, therefore we will let $x_3 = P$. Our system becomes
$$\frac{dx_1}{dt} = (k_{-1} + k_1 x_1)\,x_2 - k_1 x_1 E,$$
$$\frac{dx_2}{dt} = -(k_{-1} + k_1 x_1 + k_2)\,x_2 + k_1 x_1 E,$$
$$\frac{dx_3}{dt} = k_2 x_2.$$
Adding in the last variable is not strictly necessary, but if we are
interested in P, then we might as well let Matlab solve for all
variables at once. We define new Matlab variables E, km1, k1, and
k2 to hold the parameters E, $k_{-1}$, $k_1$, and $k_2$. We also define a vector
xinit to hold the initial values for S, $E_S$, and P.
E = 0.004;
km1 = 3;
k1 = 12;
k2 = 5;
xinit = [1,0,0];
[Figure: concentrations of Product and Substrate plotted against time, from 0 to 100.]
The three dots allow us to split the input over several lines.
Matlab now knows about the function called dxdt that is a func-
tion of t and the vector x.
To solve the ODE, we use the function ode45. This will use a
variable step method from the Runge-Kutta family of methods and
return two values: first a vector of times, and second a column
vector of values (which are in turn 3-vectors, so an n by 3 matrix).
Let us solve for time from t = 0 to t = 100.
[t,x] = ode45(dxdt, [0,100], xinit);
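For readers without Matlab or Octave, the same system can be integrated with a short fixed-step Runge-Kutta routine in plain Python. This is a sketch under stated assumptions, not a port of ode45: ode45 chooses its step size adaptively, whereas the step h = 0.01 below is simply a value assumed small enough for these parameters, and the function names `dxdt` and `rk4_system` are illustrative.

```python
E, km1, k1, k2 = 0.004, 3.0, 12.0, 5.0   # parameter values from the text

def dxdt(t, x):
    """Right-hand side with x = [S, ES, P], as in the system above."""
    s, es, p = x
    return [(km1 + k1 * s) * es - k1 * s * E,
            -(km1 + k1 * s + k2) * es + k1 * s * E,
            k2 * es]

def rk4_system(f, t0, x0, t_end, h):
    """Classical fourth-order Runge-Kutta with fixed step h for a list-valued state."""
    n_steps = round((t_end - t0) / h)
    t, x = t0, list(x0)
    for _ in range(n_steps):
        k1v = f(t, x)
        k2v = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k1v)])
        k3v = f(t + h / 2, [xi + h / 2 * ki for xi, ki in zip(x, k2v)])
        k4v = f(t + h, [xi + h * ki for xi, ki in zip(x, k3v)])
        x = [xi + h * (a + 2 * b + 2 * c + d) / 6
             for xi, a, b, c, d in zip(x, k1v, k2v, k3v, k4v)]
        t = t + h
    return x

S, ES, P = rk4_system(dxdt, 0.0, [1.0, 0.0, 0.0], 100.0, 0.01)
print("S =", S, " ES =", ES, " P =", P)
print("S + ES + P =", S + ES + P)   # conserved by the model; a useful sanity check
```

The three right-hand sides sum to zero, so S + ES + P should stay at its initial value of 1 throughout; a drift in that sum would point to a coding error.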
References
1. Berg PW, McGregor JL (1966) Elementary partial differential equations. Holden-Day, San Francisco, CA
2. Boyce WE, DiPrima RC (2008) Elementary differential equations and boundary value problems, 9th edn. Wiley, New York
3. Edelstein-Keshet L (1988) Mathematical models in biology. Random House, New York
4. Edwards CH, Penney DE (2008) Differential equations and boundary value problems: computing and modeling, 4th edn. Prentice Hall, Englewood Cliffs, NJ
5. Farlow SJ (1994) An introduction to differential equations and their applications. McGraw-Hill, Princeton, NJ
6. Ince EL (1956) Ordinary differential equations. Dover Publications, New York
7. Lebl J (2012) Notes on diffy Qs: differential equations for engineers. http://www.jirka.org/diffyqs/.
8. Matlab (2012) The MathWorks Inc., Massachusetts
9. Octave (2012) http://www.gnu.org/software/octave/.
Chapter 21
Abstract
The fundamental and more critical steps that are necessary for the development and validation of QSAR
models are presented in this chapter as best practices in the field. These procedures are discussed in the
context of predictive QSAR modelling that is focused on achieving models of the highest statistical quality
and with external predictive power. The most important and most used statistical parameters needed to
verify the real performances of QSAR models (of both linear regression and classification) are presented.
Special emphasis is placed on the validation of models, both internally and externally, as well as on the need
to define model applicability domains, which should be done when models are employed for the prediction
of new external compounds.
Key words: Statistical methods, Regression, Classification, Validation, Applicability domain, Molecu-
lar descriptors, OECD principles
1. Introduction
In recent years more and more studies have been carried out in
which computational methods (so-called in silico) have been used
to predict the physicochemical properties and biological activities
(acute toxicity, carcinogenicity, mutagenicity, etc.) of chemical
compounds. These applications are historically
older and more frequent for pharmaceuticals in drug design,
but environmental applications have recently increased
greatly, mainly in the context of the new European legislation
on chemicals, REACH (Registration, Evaluation, Authorisation and
restriction of Chemicals) (1). The OECD principles for QSAR
model validation have been defined (2) to establish recognized
rules for the use of QSAR predictions in regulation: these principles
are an excellent summary of the most important aspects of QSAR
modelling, which are discussed below.
Quantitative structure-activity relationship (QSAR) modelling
is based on the fundamental assumption that the structure of a
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_21, # Springer Science+Business Media, LLC 2013
500 P. Gramatica
2. Important Concepts
2.1. Input Experimental Data: The Dependent Variable Y
Input experimental data (often called endpoints) are any kind of
response (physicochemical properties, toxicity or ecotoxicity
endpoints, environmental parameters) that could be modelled by
QSAR or QSPR as dependent variable (Y). The number of input
data must be reasonably as high as possible, in order to have a wide
21 On the Development and Validation of QSAR Models 501
2.2. Molecular Descriptors: The Independent Variables X
The second crucial point for QSAR modelling is the need to
translate the molecular structure of the studied chemicals into
numbers (molecular descriptors). A molecular descriptor is either
the result of some standardized experiment (a physicochemical
property) or the final result of a logical and mathematical procedure
that transforms the chemical information, encoded within a
symbolic representation of a molecule, into a useful number. In a QSAR
model, molecular descriptors, which characterize a specific aspect of
a molecule, are the independent variables X and are predictors of a
dependent variable Y (the modelled response).
Molecular descriptors include empirical, quantum chemical,
and other theoretical parameters. Empirical descriptors can be
measured or estimated and include physicochemical properties
(for instance, descriptors of hydrophobicity such as logP,
as well as solubility, ionization constants, etc.). Quantum
chemical properties include charge and energy values.
Nonempirical or theoretical descriptors can be based on individual atoms,
substituents, or the whole molecule. They are typically structural
features (i.e., spatial disposition like conformation, geometry, shape
and molecular volume). They can be based on topology or graph
theory and, as such, they are developed from the knowledge of the
2D structure, or they can be calculated from the 3D structural
conformations of a molecule. Binding properties involve biological
macromolecules and are important in receptor-mediated responses
(e.g., in 3D-QSAR methods such as Comparative Molecular Field Analysis
(CoMFA)).
In modern QSAR approaches, it is becoming quite common to
calculate, as input variables, a wide set of molecular descriptors of
different kinds able to capture all the structural aspects of a chemical
to translate its molecular structure into numbers. In fact, different
descriptors are different ways or perspectives to view a molecule,
taking into account the various features of its chemical structure.
A lot of software is available to calculate wide sets of different
theoretical descriptors, from SMILES, 2D-topological graphs or
3D-x,y,z-coordinates. Some of the most widely used are mentioned here:
ADAPT (5), OASIS (6), CODESSA (7), MolConnZ (8), and
DRAGON (9). Freely available software, such as MOPAC (10), is
able to perform the calculations of a variety of molecular orbital
properties (electronic properties, including the energies of the
highest occupied and lowest unoccupied molecular orbitals
($E_{HOMO}$ and $E_{LUMO}$, respectively), atomic charges and
superdelocalizabilities, dipole moment, and electrostatic potential), which are
estimated by applying quantum chemical calculations
(semiempirical or ab initio methods) to molecular structures. Several thousand
molecular descriptors are now available and most of them have been
summarized and explained (11). The great advantage of theoretical
descriptors is that they can be calculated homogeneously and in a
2.3. Data Exploration by Principal Component Analysis
Before modelling, the chemicals under study should be analyzed by
explorative methods to highlight their distribution and trends in
chemical space and to identify potential strong outliers due to their
peculiar structure.
Probably the most widely known and used explorative multivar-
iate method is Principal Component Analysis (PCA) (13). In PCA,
linear combinations of studied variables (here molecular descrip-
tors) are created, and these combinations explain, to the greatest
possible degree, the variation in the original data. The first principal
component (PC1) accounts for the maximum amount of possible
data variance in a single variable, while subsequent PCs account for
successively smaller quantities of the original variance. Principal
components are derived in such a way that they are orthogonal.
Indeed, it is good practice, especially when the original variables
have different ranges of scales, to derive the principal components
from the standardized data (mean of 0 and standard deviation of 1),
i.e., via the correlation matrix. In this way all the variables are treated
as if they are of equal importance, regardless of their scale of mea-
surement. To be useful, it is desirable that the first two PCs account
for a substantial proportion of the variance in the original data, thus
they can be considered sufficiently representative of the main infor-
mation included in the data, while the remaining PCs condense
irrelevant information and noise. It is quite common for a PCA
Fig. 1. Biplot of the first two principal components of a PCA. The points are the scores (the studied objects), the lines are the
loadings (the used variables/descriptors).
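The idea of PCA on standardized descriptors can be illustrated in the smallest possible case, two descriptors, where everything has a closed form: the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|, with loadings along the two diagonals. The sketch below works only for this two-variable case, and the descriptor values are invented for illustration; real QSAR datasets need a general eigenvalue routine.

```python
import math

def standardize(values):
    """Scale to mean 0 and (sample) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def pca_two_descriptors(x1, x2):
    """PCA via the correlation matrix for exactly two descriptors."""
    z1, z2 = standardize(x1), standardize(x2)
    n = len(z1)
    r = sum(a * b for a, b in zip(z1, z2)) / (n - 1)     # correlation coefficient
    sign = 1.0 if r >= 0 else -1.0
    w = 1.0 / math.sqrt(2.0)                             # loadings (w, sign*w) and (w, -sign*w)
    scores_pc1 = [w * (a + sign * b) for a, b in zip(z1, z2)]
    scores_pc2 = [w * (a - sign * b) for a, b in zip(z1, z2)]
    explained = [(1.0 + abs(r)) / 2.0, (1.0 - abs(r)) / 2.0]   # fraction of variance per PC
    return r, explained, scores_pc1, scores_pc2

# Hypothetical descriptor values for five chemicals
log_p = [1.0, 2.0, 3.0, 4.0, 5.0]
mol_weight = [100.0, 150.0, 190.0, 260.0, 300.0]

r, explained, pc1, pc2 = pca_two_descriptors(log_p, mol_weight)
print("r =", r, " explained variance:", explained)
```

Because the two invented descriptors are strongly correlated, PC1 carries almost all of the variance and PC2 very little, which is the situation the text describes as desirable for the first components.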
2.4. Statistical Methods: Regression and Relative Statistical Parameters
The core of QSAR modelling lies in the statistical methods applied
to relate the response (Y) to the molecular descriptors (X). This
relationship between the molecular descriptors of the chemical
structure and the modelled endpoint in a QSAR model is called
the algorithm of the model (the unambiguous algorithm of
OECD Principle 2).
A brief overview of the simplest algorithms commonly used in
QSAR, namely those for linear models, is presented here.
Regression analysis is the use of statistical methods for the
modelling of a dependent variable Y in terms of predictors X
(independent variables or molecular descriptors). Univariate
regression (ULR) involves only one dependent response variable
(Y) and one independent variable (X) that models a simple
Fig. 3. Plot of experimental vs. predicted values in a regression model. The training and prediction chemicals are labeled
differently.
Table 1
Statistical parameters for fitting

s … (n − p − 1)

(6) F, F-value:
$$F = \frac{MSS/p}{RSS/(n - p - 1)}$$
Table 2
Statistical parameters for cross-validation and K multivariate correlation index

(4) $Q^2$, explained variance in prediction:
$$Q^2 = 1 - \frac{PRESS_{CV}}{TSS}$$

(5) K, multivariate correlation index:
$$K = \frac{\sum_n \left| \dfrac{\lambda_n}{\sum_n \lambda_n} - \dfrac{1}{p} \right|}{\dfrac{2(p-1)}{p}} \times 100$$
$\lambda_n$: eigenvalues obtained from the correlation matrix of the dataset X(n, p)
n: number of objects
p: number of variables
Table 3
Statistical parameters for external validation

(5) $Q^2_{F1}$ (32), variance explained in external prediction:
$$Q^2_{F1} = 1 - \frac{PRESS_{EXT}}{SS_{EXT}(\bar{y}_{TR})}, \qquad SS_{EXT}(\bar{y}_{TR}) = \sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{TR})^2$$
$\bar{y}_{TR}$: average of training observed responses

(6) $Q^2_{F2}$ (33), variance explained in external prediction:
$$Q^2_{F2} = 1 - \frac{PRESS_{EXT}}{SS_{EXT}(\bar{y}_{EXT})}, \qquad SS_{EXT}(\bar{y}_{EXT}) = \sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{EXT})^2$$
$\bar{y}_{EXT}$: average of external observed responses

(7) $r^2_m$ (34), closeness between the $R^2$ and $R_0^2$ determination coefficients:
$$r^2_m = R^2\left(1 - \sqrt{R^2 - R_0^2}\right)$$

(8) $Q^2_{F3}$ (35, 36), variance explained in external prediction:
$$Q^2_{F3} = 1 - \frac{PRESS_{EXT}/n_{EXT}}{TSS/n_{TR}}$$
$n_{EXT}$: number of external objects
$n_{TR}$: number of training objects

(9) CCC (37), concordance correlation coefficient:
$$CCC = \frac{2\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{x} - \bar{y})^2}$$
$x_i$: external response observed for the ith object
$y_i$: external response predicted using the model
$\bar{x}$: average of observed responses
$\bar{y}$: average of responses predicted by the model
Note: observed and predicted responses can be interchanged
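The external validation parameters of Table 3 are straightforward to compute once observed and predicted responses are available. The sketch below assumes the usual definitions PRESS_EXT = sum of squared differences between observed and predicted external responses, and TSS = sum of squared deviations of the training responses from their mean; the function name and the small datasets are invented for illustration and do not reproduce any published software.

```python
def external_validation(y_train, y_ext, y_ext_pred):
    """Q2_F1, Q2_F2, Q2_F3 and CCC for an external prediction set."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    mean_pred = sum(y_ext_pred) / n_ext

    press_ext = sum((o - p) ** 2 for o, p in zip(y_ext, y_ext_pred))
    ss_ext_tr = sum((o - mean_tr) ** 2 for o in y_ext)     # deviations from training mean
    ss_ext_ext = sum((o - mean_ext) ** 2 for o in y_ext)   # deviations from external mean
    tss = sum((o - mean_tr) ** 2 for o in y_train)

    q2_f1 = 1.0 - press_ext / ss_ext_tr
    q2_f2 = 1.0 - press_ext / ss_ext_ext
    q2_f3 = 1.0 - (press_ext / n_ext) / (tss / n_tr)

    numerator = 2.0 * sum((o - mean_ext) * (p - mean_pred)
                          for o, p in zip(y_ext, y_ext_pred))
    denominator = (ss_ext_ext
                   + sum((p - mean_pred) ** 2 for p in y_ext_pred)
                   + n_ext * (mean_ext - mean_pred) ** 2)
    ccc = numerator / denominator
    return {"Q2_F1": q2_f1, "Q2_F2": q2_f2, "Q2_F3": q2_f3, "CCC": ccc}

# Hypothetical responses: training set, external set, and external predictions
y_train = [1.2, 2.3, 3.1, 4.0, 5.2, 6.1]
y_ext = [1.8, 3.5, 4.6, 5.9]
shifted = [v + 0.5 for v in y_ext]          # predictions with a constant bias

perfect = external_validation(y_train, y_ext, list(y_ext))
biased = external_validation(y_train, y_ext, shifted)
print(perfect)
print(biased)
```

A constant bias leaves the ordinary correlation between observed and predicted values at 1, but lowers CCC and the Q2 parameters, which is why these criteria are stricter than a simple correlation coefficient.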
2.4.1. Variable Selection
The number of independent variables in an MLR model must be as
low as possible, and the ratio of observations to variables must be as
high as possible (a ratio of 5:1 is considered an absolute minimum).
Care must also be taken to ensure that all the variables in an MLR
analysis are significant (this can be assessed by the standardized
regression coefficient) and preferably that no combinations of
independent variables are collinear (in fact collinearity of variables
should be avoided in MLR, unless an appropriate method has
been applied to control the collinearity (see below) (14)).
For a large number of independent variables (i.e.,
physicochemical and/or structural properties) variable selection
techniques are commonly applied. This selection may be an empirical
process of the model developer, i.e., the selection of properties
known or thought to be important. Alternatively, variable selection
may employ stepwise selection techniques (forward or backward),
best subsets selection, or the use of genetic algorithms.
Genetic algorithms (GA) (15) are optimization methods based
on evolutionary principles (16). In GA terminology, a chromosome
is a p-dimensional vector (a string of bits) where each position
(a gene) corresponds to a variable (1 if included in the model,
0 otherwise). Each chromosome or individual in the population
represents a model with a subset of variables. A population of
models is obtained that evolves, according to genetic algorithm
rules, in order to maximize the predictive power of the models
(for instance, the explained variance in prediction Q2 (see below)).
In the first generation, the variables are chosen randomly. In
the next step reproduction takes place, so that the new individual
contains characteristics of both its parents. The next steps are
crossovers and mutations, which allow better variable combinations to
be found. This reproduction-crossover-mutation process is
repeated during the evolution of the population until a desired
target fitness score is reached. Only the models producing the
highest predictive power are finally retained and further analyzed.
GAs are used in QSAR analysis as a strategy for variable subset
selection (VSS) in multivariate situations where a large number of
molecular descriptors are potential X-variables (17-22). There are
different types of GA analysis, which perform reproduction,
crossover and mutation in different ways. An important characteristic of
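The evolutionary loop described above can be sketched with bit-string chromosomes. This is a toy illustration, not the algorithm of any particular QSAR package: the fitness function is a hypothetical stand-in for a real criterion such as Q2 (it rewards three arbitrarily chosen "relevant" variables and penalizes model size), and the population size, mutation rate, and number of generations are arbitrary assumptions.

```python
import random

def genetic_selection(p, fitness, pop_size=20, generations=30,
                      mutation_rate=0.05, seed=0):
    """Evolve bit strings of length p; bit i = 1 means variable i is in the model."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(p)] for _ in range(pop_size)]
    history = []                                   # best fitness per generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        elite = population[:pop_size // 2]         # best half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            mother, father = rng.sample(elite, 2)  # reproduction
            cut = rng.randint(1, p - 1)            # one-point crossover
            child = mother[:cut] + father[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]               # mutation
            children.append(child)
        population = elite + children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history

def toy_fitness(chromosome):
    """Hypothetical stand-in for Q2: reward variables 0, 3, 7; penalize model size."""
    relevant = (0, 3, 7)
    hits = sum(chromosome[i] for i in relevant)
    return hits - 0.2 * sum(chromosome)

best, history = genetic_selection(10, toy_fitness)
print(best, history[-1])
```

Because the best half of each generation is carried over unchanged, the best fitness can never decrease from one generation to the next. In a real application the fitness function would fit a regression model on the selected descriptors and return its cross-validated Q2.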
2.5. Validation of QSAR Models (OECD Principle 4)
A necessary condition for the validity of a regression model is that
the multiple correlation coefficient $R^2$ is as close as possible to one
and the standard error of the estimate s is small. However, this
condition, which measures how well the model is able to
mathematically reproduce the endpoint data of the training set (fitting
ability), is an insufficient condition for model robustness and
validity, as it does not express the ability of the model to make reliable
predictions on new data. This matters because the main utility of
predictive QSAR modelling lies in obtaining
predicted data, in filling data gaps, and above all in virtual
screening for prioritizing chemicals. A necessary approach is to
apply various cross-validations (24). Cross-validation refers to
the use of one or more statistical techniques for internal validation
in which different proportions of chemicals are omitted from the
training set (e.g., Leave-one-out (LOO), Leave-More-out (LMO),
bootstrapping) and iteratively put in a test (or validation) set. The QSAR
model is developed on the basis of the data of the remaining chemicals,
and then used to make predictions for the chemicals that were
omitted (test chemicals). This procedure is repeated a number of
times, so that a number of statistics can be derived from the
comparison of predicted data with the known data. Cross-validation
techniques allow the assessment of the internal prediction power
21 On the Development and Validation of QSAR Models 513
Fig. 4. Comparison of the explained variance in fitting (R²) with the explained variance in cross-validation prediction (Q²), plotted against the number of components.
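The gap between fitting and cross-validated prediction can be reproduced with a small leave-one-out experiment. The sketch below uses simulated data (eight descriptors, of which only two carry signal) and assumes the common definition Q² = 1 − PRESS/TSS, with TSS computed around the mean of the full training set; other Q² definitions exist and give slightly different values.

```r
# R2 (fitting) versus Q2 (leave-one-out prediction) for an MLR model
set.seed(1)
n <- 25
X <- matrix(rnorm(n * 8), n, 8)            # 8 simulated descriptors, 2 informative
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X)                # columns y, X1, ..., X8

fit <- lm(y ~ ., data = dat)
r2 <- summary(fit)$r.squared               # explained variance in fitting

# Leave-one-out: refit without chemical i, then predict chemical i
pred_loo <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ ., data = dat[-i, ])
  predict(fit_i, newdata = dat[i, , drop = FALSE])
})
press <- sum((y - pred_loo)^2)             # predictive residual sum of squares
q2 <- 1 - press / sum((y - mean(y))^2)     # explained variance in prediction

c(R2 = r2, Q2_LOO = q2)
```

For ordinary least squares the explicit loop is not strictly needed, because the leave-one-out residual of chemical i equals its ordinary residual divided by (1 − h_ii); the loop is kept here because it carries over unchanged to leave-more-out schemes.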
2.5.1. External Validation
Validation on chemicals not used in model development, the so-called
external validation, is especially important in the context of
using QSAR models for prediction of new data in virtual screening
more than one criterion each time or to prefer our new proposal in
QSAR modelling, the concordance correlation coefficient CCC
(Eq. 9), which is the most precautionary in accepting QSAR models
as externally predictive (22, 37). All these validation criteria are
implemented in the new software QSARINS for QSAR MLR
model development and validation, developed by the author's
group, which will be freely available on the Web (38).
The standard deviation error in prediction (SDEP) is similar to
SDEC, but the residuals are calculated by using the predicted value
of the dependent variable when an observation is left out from the
training set and put in the test set (CV) or when it is calculated on
chemicals in the prediction set (EXT).
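Both statistics are easy to compute once observed and predicted values are available. The sketch below is written under two assumptions that should be checked against the chapter's own equations: the error measure takes the usual root-mean-square form with n in the denominator, and CCC is Lin's concordance correlation coefficient. The function names `sd_error` and `ccc` and the numerical values are made up for illustration.

```r
# Root-mean-square error: gives SDEC when 'pred' holds fitted values,
# and SDEP when 'pred' holds cross-validated or external predictions
sd_error <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Lin's concordance correlation coefficient between observed and predicted
ccc <- function(obs, pred) {
  mo <- mean(obs); mp <- mean(pred)
  2 * sum((obs - mo) * (pred - mp)) /
    (sum((obs - mo)^2) + sum((pred - mp)^2) + length(obs) * (mo - mp)^2)
}

obs  <- c(1.2, 2.5, 3.1, 4.8, 5.0)   # made-up experimental values
pred <- c(1.0, 2.9, 3.0, 4.1, 5.6)   # made-up predictions for the same chemicals

sd_error(obs, pred)
ccc(obs, pred)
ccc(obs, obs)        # perfect agreement
ccc(obs, obs + 1)    # constant bias: correlation is perfect, concordance is not
```

The last line illustrates why CCC is a cautious criterion: it penalizes systematic shifts between predicted and experimental values that a correlation coefficient alone would not detect.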
Finally, when the developed QSAR model has been verified for
its predictive ability by one or more different splittings, the set of
combined descriptors is used to derive a full model based on all the
chemicals: this is the final proposal, in order to have the maximum
information possible from the available experimental data (20–22).
In Fig. 5 a scheme of the complete procedure in predictive QSAR
modelling is presented.
Fig. 6. A CART decision tree. At each node the value of the specified variable is used for
class discrimination, down to the terminal nodes with the assigned classes (here four classes).
Fig. 7. The confusion matrix of a classification model. In the diagonal: the number of
chemicals correctly assigned to each of the G classes; out of the diagonal: the number of
misclassified chemicals with their erroneous class assignments.
Statistical parameters
Sensitivity = a/(a + b): the proportion (or percentage) of the active chemicals (chemicals that give positive results experimentally) that are predicted to be active
Specificity = d/(c + d): the proportion (or percentage) of the inactive chemicals (chemicals that give negative results experimentally) that are predicted to be inactive
Concordance or accuracy = (a + d)/(a + b + c + d): the proportion (or percentage) of the chemicals that are classified correctly
Positive predictivity = a/(a + c): the proportion (or percentage) of the chemicals predicted to be active that give positive results experimentally
Negative predictivity = d/(b + d): the proportion (or percentage) of the chemicals predicted to be inactive that give negative results experimentally
False positive rate (overclassification) = c/(c + d) = 1 − specificity: the proportion (or percentage) of the inactive chemicals that are falsely predicted to be active
False negative rate (underclassification) = b/(a + b) = 1 − sensitivity: the proportion (or percentage) of the active chemicals that are falsely predicted to be inactive
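The parameters listed above follow directly from the four cells of a two-class confusion matrix. The helper below is a small sketch; the function name `classification_stats` and the example counts are invented, and the arguments are named after what each cell means so that the mapping to a, b, c, and d is explicit.

```r
# a = actives predicted active, b = actives predicted inactive,
# c = inactives predicted active, d = inactives predicted inactive
classification_stats <- function(act_pred_act, act_pred_inact,
                                 inact_pred_act, inact_pred_inact) {
  a <- act_pred_act; b <- act_pred_inact
  cc <- inact_pred_act; d <- inact_pred_inact   # 'cc' avoids masking c()
  c(sensitivity           = a / (a + b),
    specificity           = d / (cc + d),
    accuracy              = (a + d) / (a + b + cc + d),
    positive_predictivity = a / (a + cc),
    negative_predictivity = d / (b + d),
    false_positive_rate   = cc / (cc + d),
    false_negative_rate   = b / (a + b))
}

# Made-up example: 50 active and 50 inactive chemicals
stats <- classification_stats(40, 10, 5, 45)
round(stats, 3)
```

A single classifier contributes one point to a ROC plot, with `false_positive_rate` on the x axis and `sensitivity` on the y axis.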
Fig. 8. The ROC plot (sensitivity versus 1 − specificity) for comparison of the performances in sensitivity and specificity of
different classification models (kNN, Lazy, RF, Consensus).
2.7. Other Statistical Modelling Methods
For modelling nonlinear relationships, both in regression and
classification, Artificial Neural Networks (ANN) are particularly
accurate. ANN are nonlinear computational models that make
predictions by simulating the functioning of human neurons.
They can be used for pattern recognition problems and for QSAR
modelling. However, they are often considered as black-boxes, as
2.8. Applicability Domain (OECD Principle 3)
Not even a robust, significant, and validated QSAR model can be
expected to reliably predict the modelled property for the entire
universe of chemicals. In fact, only predictions for new chemicals
falling within the model domain can be considered as reliable as
those for the training chemicals; predictions outside it are extrapolations of the model.
The applicability domain is a theoretical region in physicochemical
space (the response and chemical structure space) for which a
QSAR model should make predictions with a given reliability.
This region is defined by the nature of the chemicals in the training
set, and can be characterized in various ways (43).
While the range of the independent variable X (descriptor) is
useful to define the chemical domain of a univariate QSAR (based
on one descriptor), in multivariate models the descriptor ranges are
too limited to highlight those chemicals lying outside the domain.
In Multiple Linear Regression, where the data and residual
distribution is generally normal, the predicted values ŷ are obtained
from the experimental values y through the hat (influence) matrix H:
$$\hat{y} = X (X^T X)^{-1} X^T y = H y$$
where X is the descriptor matrix.
A simple measure of a chemical being too far from the applica-
bility domain of the model is its leverage in the original variable
space, hii, which is defined (44) as:
$$h_{ii} = x_i^T (X^T X)^{-1} x_i, \quad i = 1, \ldots, n$$
where x_i is the descriptor row-vector of the query compound, and
X is the n × p matrix of p model descriptor values for n training set
compounds. The superscript T refers to the transpose of the
matrix/vector. The i-th main diagonal entry of the hat matrix H,
h_ii, provides a measure of how far observation i is from the
center of the X data (leverage).
The warning leverage h* is generally fixed at 3k/n, where n is
the number of training compounds, and k the number of model
parameters plus one (p + 1). A chemical with high leverage in the
training greatly influences the regression: the fitted regression line
is forced near the observed value and the residuals are small, thus
the chemical in the training set is not an outlier for the response
fitting. On the contrary, a chemical in a test set with a hat value
greater than the warning leverage h* means that the predicted
Fig. 9. The Williams plot for the graphical visualization of outliers for the response (on the Y axis: standardized residuals
>2.5s) or for the structure (on the X axis: highest Hat value: >h* cut-off line) in regression models.
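The quantities shown in a Williams plot can be computed with a few lines of R. The sketch below uses simulated training data and makes one assumption that goes beyond the text: the model matrix contains an intercept column in addition to the p descriptors, so that the hat values sum to k = p + 1, consistent with the warning leverage h* = 3k/n. The query compound `x_new` is invented and deliberately placed far from the training data.

```r
# Leverages, warning leverage, and standardized residuals for an MLR model
set.seed(7)
n <- 40; p <- 3
train <- data.frame(matrix(rnorm(n * p), n, p))     # simulated descriptors X1..X3
train$y <- 1 + train$X1 - 2 * train$X2 + rnorm(n, sd = 0.3)
fit <- lm(y ~ X1 + X2 + X3, data = train)

X <- model.matrix(fit)                   # n x (p + 1); first column is the intercept
XtX_inv <- solve(crossprod(X))           # (X'X)^-1
h_train <- rowSums((X %*% XtX_inv) * X)  # diagonal of H = X (X'X)^-1 X'
h_star <- 3 * (p + 1) / n                # warning leverage

# Leverage of a query compound: x' (X'X)^-1 x, using the training X
x_new <- c(1, 4, -4, 4)                  # intercept term plus three descriptor values
h_new <- drop(t(x_new) %*% XtX_inv %*% x_new)
h_new > h_star                           # TRUE: prediction would be an extrapolation

# Williams plot: standardized residuals against hat values
std_res <- rstandard(fit)
plot(h_train, std_res, xlab = "Hat value", ylab = "Standardized residual")
abline(v = h_star, h = c(-2.5, 2.5), lty = 2)
```

Chemicals to the right of the vertical line are structurally influential or outside the domain; chemicals above or below the horizontal lines are outliers for the response.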
3. Examples
more commonly used and which can be easily applied, taking into
account all the crucial points explained above for QSAR model
reliability, particularly external validation and applicability domain.
Freely available computational tools:
EPI Suite: http://www.epa.gov/opptintr/exposure/pubs/episuite.htm
Caesar models: http://www.caesar-project.eu/
Toxtree: http://toxtree.sourceforge.net/download.html and http://ecb.jrc.ec.europa.eu/qsar/qsar-tools/index.php?cTOXTREE
OECD QSAR toolbox: http://www.oecd.org/document/54/0,3746,en_2649_34379_42923638_1_1_1_1,00.html
OpenTox models: http://www.opentox.org/
CADASTER models: http://www.cadaster.eu/
Commercial computational tools:
ACD labs (Advanced Chemistry Development): http://acdlabs.com/home/
MultiCASE: http://www.multicase.com/
PASS: http://195.178.207.233/PASS/Ref.html
Leadscope: http://www.leadscope.com/model_appliers/
Derek: http://www.lhasalimited.org
Topkat: http://accelrys.com/products/discovery-studio/predictive-toxicology.html
Acknowledgments
References
1. REACH (2007) http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm
2. OECD Guidelines (2004) http://www.oecd.org/dataoecd/33/37/37849783.pdf
3. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in chemoinformatics and QSAR modelling research. J Chem Inf Model 50:1189–1204
4. Tropsha A (2010) Best practices for QSAR model development, validation, and exploitation. Mol Inform 29:476–488
5. http://www.netsci.org/Resources/Software/Modeling/CADD/adapt.html
6. http://oasis-lmc.org
7. Katritzky AR, Karelson M, Petrukhin R CODESSA PRO, University of Florida 2001–2005. http://www.codessa-pro.com/
8. MolConnZ (2003) Ver. 4.05, Hall Ass. Consult., Quincy, MA. http://www.edusoft-lc.com/molconn/
9. DRAGON: Software for the calculation of molecular descriptors. Talete srl, Milan, Italy. (http://www.talete.mi.it/products/dragon_description.htm)
10. http://openmopac.net/
11. Todeschini R, Consonni V (2009) Molecular descriptors for chemoinformatics. Wiley-VCH, Weinheim
12. (2002) HyperChem 7.03 Hypercube, Inc., Florida, USA. www.hyper.com
13. Jackson JE (1991) A user's guide to principal components. Wiley, New York
14. Todeschini R, Consonni V, Maiocchi A (1999) The K correlation index: theory development and its application in chemometrics. Chemom Int Lab Syst 46:13–29
15. Leardi R, Boggia R, Terrile M (1992) Genetic algorithms as a strategy for feature selection. J Chemom 6:267–281
16. Kubinyi H (1996) Evolutionary variable selection in regression and PLS analyses. J Chemom 10:119–133
17. Gramatica P, Pilutti P, Papa E (2004) Validated QSAR prediction of OH tropospheric degradability: splitting into training-test set and consensus modelling. J Chem Inf Comp Sci 44:1794–1802
18. Papa E, Villa F, Gramatica P (2005) Statistically validated QSARs and theoretical descriptors for the modelling of the aquatic toxicity of organic chemicals in Pimephales promelas (Fathead Minnow). J Chem Inf Model 45:1256–1266
19. Liu H, Papa E, Gramatica P (2006) QSAR prediction of estrogen activity for a large set of diverse chemicals under the guidance of OECD principles. Chem Res Toxicol 19:1540–1548
20. Gramatica P, Giani E, Papa E (2007) Statistical external validation and consensus modeling, A QSPR case study for Koc prediction. J Mol Graph Model 25:755–766
21. Gramatica P (2009) Chemometric methods and theoretical molecular descriptors in predictive QSAR modeling of the environmental behaviour of organic pollutants. In: Puzyn T, Leszczynski J, Cronin MTD (eds) Recent advances in QSAR studies. Springer, New York
22. Bhhatarai B, Gramatica P (2010) Per- and poly-fluoro toxicity (LC50 inhalation) study in rat and mouse using QSAR modeling. Chem Res Toxicol 23:528–539
23. Eriksson L, Jaworska J, Worth A et al (2003) Methods for reliability, uncertainty assessment, and applicability evaluations of regression based and classification QSARs. Environ Health Perspect 111:1361–1375
24. Hawkins DM (2004) The problem of overfitting. J Chem Inf Comput Sci 44:1–12
25. Golbraikh A, Tropsha A (2002) Beware of q2. J Mol Graph Model 20:269–276
26. Tropsha A, Gramatica P, Gombar VK (2003) The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci 22:69–77
27. Gramatica P (2007) Principles of QSAR models validation: internal and external. QSAR Comb Sci 26:694–701
28. Efron B (1979) Bootstrap methods, another look at the jackknife. Ann Stat 7:1–26
29. Marengo E, Todeschini R (1992) A new algorithm for optimal distance-based experimental design. Chemom Int Lab Syst 16:37–44
30. Golbraikh A, Tropsha A (2002) Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J Comput Aid Mol Des 16:357–369
31. Gasteiger J, Zupan J (1993) Neural networks in chemistry. Angew Chem Int Ed Engl 32:503–527
32. Shi LM, Fang H, Tong W et al (2001) QSAR models using a large diverse set of estrogens. J Chem Inf Comput Sci 41:186–195
33. Schuurmann G, Ebert RU, Chen J et al (2008) External validation and prediction employing the predictive squared correlation coefficients test set activity mean vs training set activity mean. J Chem Inf Model 48:2140–2145
34. Roy PP, Somnath P, Indrani M et al (2009) On two novel parameters for validation of predictive QSAR models. Molecules 14:1660–1701
35. Consonni V, Ballabio D, Todeschini R (2009) Comments on the definition of the Q2 parameter for QSAR validation. J Chem Inf Model 49:1669–1678
36. Consonni V, Ballabio D, Todeschini R (2010) Evaluation of model predictive ability by external validation techniques. J Chemom 24:194–201
37. Chirico N, Gramatica P (2011) Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J Chem Inf Model 51(9):2320–2335
38. Chirico N, Papa E, Kovarich S, Cassani S, Gramatica P (2011) QSARINS, software for QSAR MLR model calculation and validation,
Abstract
Principal components analysis (PCA) is a standard tool in multivariate data analysis to reduce the number of
dimensions, while retaining as much as possible of the data's variation. Instead of investigating thousands of
original variables, the first few components containing the majority of the data's variation are explored. The
visualization and statistical analysis of these new variables, the principal components, can help to find
similarities and differences between samples. Important original variables that are the major contributors
to the first few components can be discovered as well.
This chapter seeks to deliver a conceptual understanding of PCA as well as a mathematical description.
We describe how PCA can be used to analyze different datasets, and we include practical code examples.
Possible shortcomings of the methodology and ways to overcome these problems are also discussed.
Key words: Principal components analysis, Multivariate data analysis, Metabolite profiling, Codon
usage, Dimensionality reduction
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_22, # Springer Science+Business Media, LLC 2013
528 D. Groth et al.
2. Important Concepts
The -omics technologies mentioned above require careful data
preprocessing and normalization before the actual data analysis can
be performed. The methods used for this purpose have recently
been discussed for microarrays (2) and metabolite data (3). We here
only briefly outline important concepts specific to these data types
and give a general introduction using different examples.
2.1. Data Normalization and Transformation
Data normalization aims to remove the technical variability and
to impute missing values that result from experimental issues.
Data transformation, in contrast, aims to move the data distribution
into a Gaussian one, and ensures that more powerful parametric
statistical methods can be used later on. The basic steps for
data preparation are shown in Fig. 1 and are detailed below.
22 Principal Components Analysis 529
[Figure: four panels, each showing the variables cm, kg, and grade on the x axis.]
2.2. Principal Components Analysis
We next demonstrate PCA using the example of the students
dataset, with the variables cm, kg, and grade. The students
are the samples in this case. As can be seen in Table 1, the height,
measured in cm, and the weight, measured in kg, have a higher
variance than the grade, which ranges from 1 to
5. After scaling to unit variance, it can be seen in Table 2 that all
variables have a variance of one. The weight and the height of our
sample students have a larger covariance than that between grade
and either weight or height. Remember that the covariance
between variables indicates the degree of correlation between two
Fig. 3. Pairs plot of unscaled (upper triangle) and scaled (lower triangle) data for the variables cm, kg, and grade. Individuals can be identified by their letter
codes.
Table 1
Covariance matrix for the original data. The diagonal contains the variances
cm kg Grade
cm 47.54 36.90 0.11
kg 36.90 123.18 0.84
Grade 0.11 0.84 0.62
Table 2
Covariance/correlation matrix for scaled data
cm kg Grade
cm 1.00 0.48 0.02
kg 0.48 1.00 0.10
Grade 0.02 0.10 1.00
Table 3
Covariance matrix for the principal components. The matrix
is diagonal: all off-diagonal values are zero, which
means that the principal components are uncorrelated with
each other
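The relationships summarized in Tables 1 to 3 can be checked numerically. The students dataset itself is not reproduced here, so the sketch below simulates a stand-in with the same three variables and 26 individuals labelled A to Z; all numbers are therefore invented, and only the structure of the results should be compared with the tables.

```r
# Simulated stand-in for the students data (values are invented)
set.seed(3)
n <- 26
cm    <- rnorm(n, mean = 172, sd = 7)
kg    <- 70 + 0.8 * (cm - 172) + rnorm(n, sd = 9)
grade <- sample(1:5, n, replace = TRUE)
students <- data.frame(cm, kg, grade)
rownames(students) <- LETTERS

round(cov(students), 2)      # like Table 1: variances differ strongly

scaled <- scale(students)    # center each column and scale to unit variance
round(cov(scaled), 2)        # like Table 2: identical to cor(students)

pca <- prcomp(students, center = TRUE, scale. = TRUE)
C <- cov(pca$x)              # covariance matrix of the principal components
round(C, 2)                  # like Table 3: diagonal matrix
```

The diagonal of `C` holds the component variances (`pca$sdev^2`), which sum to the number of scaled variables, here three.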
Fig. 4. Common plots for PCA visualization. (a) Screeplot for the first few components using a bar plot; (b) screeplot using
lines; (c) biplot showing the most relevant loading vectors; (d) correlation plot.
Fig. 5. Illustrative projection of a three-dimensional data point cloud onto a two-dimensional surface.
Fig. 6. PCA plots using students' height and weight data. (a) Scaled data (cm versus kg); (b) data projected
into the new coordinate system of principal components (PC1 versus PC2); (c) screeplot of the two resulting
PCs; (d) biplot showing the loadings vectors of the original variables.
3. Biological Examples
PCA was first applied to microarray data in 2000 (9, 10) and has
been reviewed in this context (11). We decided to choose other
types of data to illustrate PCA. First we use a dataset which deals
with the transformation of qualitative data into numerical data:
codon frequencies from various taxa will be used to evaluate their
main codon differences with a PCA. Next, we use data from a
recent study about principal changes in the Escherichia coli
(E. coli) metabolism after applying stress conditions (12). The
data analysis is extended by adding PCA-based visualization
to better understand the E. coli primary stress metabolism.
3.1. Sequence Data Analysis
Here we use PCA to demonstrate that the codon usage differs for
protein-coding genes of different organisms, a fact that is well-known.
Genome sequences for many taxa are freely available from
different online resources. We just take five genomes for this example,
although we could easily have taken 50 or 500: Arabidopsis
thaliana (a higher plant), Caenorhabditis elegans (a nematode),
Drosophila melanogaster (the fruit fly), Canis familiaris (the dog),
and Saccharomyces cerevisiae (yeast). For each of these genomes, we
only use protein-coding genes and end up with about 33,000
(plant), 24,000 (nematode), 15,000 (dog), 18,000 (fruit fly), and
6,000 (yeast) genes each.
For one species at a time, we then record for each of these gene
sequences which of the 64 possible codons is used how many times.
The data to be analyzed for interesting patterns is therefore a
5 × 64 matrix. It describes the codon usage combined for all
protein-coding genes for each of the five taxa. An abbreviated and
transposed version of this matrix is shown below; note that absolute
frequencies were recorded.
> codons <- read.table("http://bitbucket.org/mittelmark/r-code/downloads/codonsTaxa.txt", header=TRUE, row.names=1)
> head(codons[,c(1:3,62:64)])
        AAA    AAC    AAG    TTC    TTG    TTT
ATHA 419472 275641 432728 269345 285985 299032
CELE 394707 192068 272375 249129 210524 240164
CFAM 243989 187183 314228 205704 134896 186681
DMEL 185136 280003 417111 227142 173632 141944
SSCE 124665  72230  88924  52059  77227  76198
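Because the genomes differ considerably in their number of genes, absolute counts mainly reflect genome size. One plausible preprocessing step, shown here only as a sketch and not necessarily the one used for the figures in this chapter, is to convert the counts to relative frequencies per taxon before running the PCA. The block uses just the six codon columns printed above, so that it runs without downloading the full table; with the complete 5 × 64 matrix the same lines apply unchanged.

```r
# The six codon columns printed above, typed in by hand
codons6 <- matrix(c(419472, 275641, 432728, 269345, 285985, 299032,
                    394707, 192068, 272375, 249129, 210524, 240164,
                    243989, 187183, 314228, 205704, 134896, 186681,
                    185136, 280003, 417111, 227142, 173632, 141944,
                    124665,  72230,  88924,  52059,  77227,  76198),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(c("ATHA", "CELE", "CFAM", "DMEL", "SSCE"),
                                  c("AAA", "AAC", "AAG", "TTC", "TTG", "TTT")))

# Relative frequencies: every row (taxon) is divided by its own total
rel <- codons6 / rowSums(codons6)
round(rel, 3)

# PCA with taxa as samples and codons as variables
pca <- prcomp(rel, center = TRUE, scale. = TRUE)
summary(pca)          # variance captured by each component
round(pca$x[, 1:2], 2)  # positions of the taxa on the first two components
```

With five samples, at most four components carry variance after centering, so the last component reported by `summary(pca)` is numerically zero.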
Fig. 7. Codon data. (a) Screeplot of PCs' variances; (b) biplot for first two PCs and most important loadings; (c) correlation
plot for all variables.
3.2. Metabolite Data Analysis
In this example we employ PCA to analyze the system level stress
adjustments following the response of E. coli to five different perturbations.
We make use of time-resolved metabolite measurements
to get a detailed understanding of the successive events
following heat- and cold-shock, oxidative stress, lactose diauxie,
and stationary phase. A previous analysis of the metabolite data
together with transcript data measured under the exact same perturbations
and time-points was able to show that E. coli's response
on the metabolic level shows a higher degree of specificity as
[Figure: pathway scheme covering glycolysis, the pentose phosphate pathway, and the TCA cycle.]
Fig. 10. Metabolite concentrations of conditions and different time-points. Within each time-series, each metabolite
concentration is normalized to preperturbation levels.
4. PCA Improvements and Alternatives
PCA is an excellent method for finding orthogonal directions that
correspond to maximum variance. Datasets can, of course, contain
other types of structures that PCA is not designed to detect. For
example, the largest variations might not be of the greatest
biological importance. This is a problem which cannot easily be
solved as it requires knowledge of the biology behind the data.
In this case it may be important to remove the outliers to minimize
the effect of single values on the overall outcome. Outlier-insensitive
PCA algorithms such as robust PCA (14) or
weighted PCA (15) are available, as is an R package, rrcov (16), which can be
used to apply some of the advanced PCA methods to the data set.
The R package provides the function PcaCov, which
calls robust estimators of covariance.
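The idea behind covariance-based robust PCA can be illustrated without extra packages beyond those shipped with R. The sketch below does not use rrcov; it swaps in `cov.rob` from the MASS package (minimum covariance determinant estimator) and then takes the eigenvectors of the robust covariance matrix. The data and the five outliers are simulated, and the comparison is meant only to show how a few extreme samples can redirect the first classical component.

```r
library(MASS)   # MASS ships with standard R distributions

set.seed(11)
# 100 samples with strongly correlated variables; PC1 should follow the diagonal
x <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2))
# Replace five samples with gross outliers lying off the main trend
x[1:5, ] <- cbind(rnorm(5, 8, 0.2), rnorm(5, -8, 0.2))

# Classical PCA: eigenvectors of the ordinary covariance matrix
classical <- eigen(cov(x))

# Robust variant: eigenvectors of an MCD covariance estimate
rob <- cov.rob(x, method = "mcd")
robust <- eigen(rob$cov)

classical$vectors[, 1]   # pulled toward the outliers (entries of opposite sign)
robust$vectors[, 1]      # follows the bulk of the data (entries of equal sign)
```

Scores for the robust components can be obtained by centering the data with `rob$center` and multiplying by `robust$vectors`.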
In datasets with many variables it is sometimes difficult to
obtain a general description of a certain component. For this
purpose, e.g., in microarray analysis, the enrichment for
certain ontology terms among the variables contributing most to a
component is often used to get an impression of what the component is
actually representing (17).
Sometimes a problem with PCA is that the components,
although uncorrelated and orthogonal to each other, may still be
statistically dependent. Independent components analysis (ICA) does not have this
shortcoming. Some authors have found that ICA outperforms PCA
(18), other authors have found the opposite (19, 20). Which
method is best in practice depends on the actual data structure,
and ICA is in some cases a possible alternative to PCA. The
fastICA algorithm can be used for this purpose (21, 22). Because
ICA does not reduce the number of variables as PCA does, ICA can be
used in conjunction with PCA to get a decreased number of variables
to consider. For instance, it has been shown that ICA, when performed
on the first few principal components, i.e., on the results of
a preceding PCA, can improve the sample differentiation (23).
Higher-order dependencies, for instance when data are scattered in a
ringlike manner around a certain point, are sometimes difficult to
resolve with standard PCA, and a nonlinear approach may be required
to first transform the data into a new coordinate system. This
parametric approach is sometimes called kernelPCA (24, 25). To
obtain deeper insights into the relevant variables required to differentiate
between the samples, factor analysis might be a better choice.
5. Availability of R-Code
The example data and the R-code required to create the graphics of
this article are available at the webpage:
http://bitbucket.org/mittelmark/r-code/wiki/Home.
The script file ma.pca.r contains some functions which can be
used to simplify data analysis using R. The data and functions of the
ma.pca object can be investigated by typing the ls(ma.pca)
command. Some of the most important functions and objects are:
• ma.pca$new(data): performs a new PCA analysis on data, needs to be called first
• ma.pca$summary(): returns a summary, with the variances for the most important components
• ma.pca$scores: the positions of the new data points in the new coordinate system
• ma.pca$loadings: numerical values to describe the amount each variable contributes to a certain component
• ma.pca$plot(): a pairs plot for the most important components, % of variance in the diagonal
• ma.pca$biplot(): produces a biplot for the samples and for the most important variables
• ma.pca$corplot(): produces a correlation plot for all variables on selected components
• ma.pca$screeplot(): produces an improved screeplot for the PCA
These functions have different parameters; for example, components
other than the first two can be chosen for plotting with
the pcs argument. For instance, ma.pca$corplot(pcs=c("PC2","PC3"),cex=1.2)
would plot the second versus
the third component and slightly enlarge the text labels. To get
comfortable with the functions users should study the material on
the project website and the R-source code.
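Readers without access to the script file can reproduce most of these steps with functions from base R. The sketch below is not the ma.pca interface; it uses `prcomp` and its standard methods on `USArrests`, a dataset that ships with R, as a stand-in for the chapter's example data.

```r
# Base R counterparts of the steps listed above (USArrests is a built-in dataset)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

summary(pca)        # variances and proportion of variance per component
head(pca$x)         # scores: positions of the samples in the new coordinate system
pca$rotation        # loadings: contribution of each variable to each component

screeplot(pca, type = "lines")             # screeplot of the component variances
biplot(pca, choices = c(2, 3), cex = 0.8)  # second versus third component
```

The `choices` argument of `biplot` plays a role similar to the pcs argument described above, selecting which pair of components is drawn.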
Acknowledgments
References
1. Hotelling H (1933) Analysis of complex statistical variables into principal components. J Educ Psychol 24:417–441, and 498–520
2. Quackenbush J (2002) Microarray data normalization and transformation. Nat Genet 32(Suppl):496–501
3. Steinfath M, Groth D, Lisec J, Selbig J (2008) Metabolite profile analysis: from raw data to regression and classification. Physiol Plant 132:150–161
4. Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
5. Bo TM, Dysvik B, Jonassen I (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 32:e34
6. Stacklies W, Redestig H, Scholz M et al (2007) pcaMethods: a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23:1164–1167
7. Troyanskaya O, Cantor M, Sherlock G et al (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17:520–525
8. Celton M, Malpertuy A, Lelandais G, de Brevern AG (2010) Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. BMC Genomics 11:15
9. Alter O, Brown PO, Botstein D (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97:10101–10106
10. Alter O, Brown PO, Botstein D (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci USA 100:3351–3356
11. Quackenbush J (2001) Computational analysis of microarray data. Nat Rev Genet 2:418–427
12. Jozefczuk S, Klie S, Catchpole G et al (2010) Metabolomic and transcriptomic stress response of Escherichia coli. Mol Syst Biol 6:364
13. Gasch AP, Spellman PT, Kao CM et al (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell 11:4241–4257
14. Hubert M, Engelen S (2004) Robust PCA and classification in biosciences. Bioinformatics 20:1728–1736
15. Kriegel HP, Kroger P, Schubert E, Zimek A (2008) A general framework for increasing the robustness of PCA-based correlation clustering algorithms. In: Ludascher B, Mamoulis N (eds) Scientific and statistical database management. Springer, Berlin
16. Todorov V, Filzmoser P (2009) An object-oriented framework for robust multivariate analysis. J Stat Softw 32:1–47
17. Ma S, Kosorok MR (2009) Identification of differential gene pathways with principal component analysis. Bioinformatics 25:882–889
18. Draper BA, Baek K, Bartlett MS, Beveridge JR (2003) Recognizing faces with PCA and ICA. Comput Vis Image Understand 91:115–137
19. Virtanen J, Noponen T, Merilainen P (2009) Comparison of principal and independent component analysis in removing extracerebral interference from near-infrared spectroscopy signals. J Biomed Opt 14:054032
20. Baek K, Draper BA, Beveridge JR, She K (2002) PCA vs. ICA: a comparison on the feret data set. In Proc of the 4th Intern Conf on Computer Vision, ICCV 20190, pp 824–827
21. Hyvarinen A (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw 10:626–634
22. Marchini JL, Heaton C, Ripley BD (2009) fastICA: FastICA algorithms to perform ica and projection pursuit. http://cran.r-project.org/web/packages/fastICA
23. Scholz M, Selbig J (2007) Visualization and analysis of molecular data. Methods Mol Biol 358:87–104
24. Scholz M, Kaplan F, Guy CL et al (2005) Non-linear PCA: a missing data approach. Bioinformatics 21:3887–3895
25. Scholkopf B, Smola A, Müller KR (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319
26. Hotelling H (1936) Relations between two sets of variates. Biometrika 28:321–377
27. de Leeuw J, Mair P (2009) Simple and canonical correspondence analysis using the R package anacor. J Stat Softw 31:1–18
Chapter 23
Abstract
Partial least square (PLS) methods (also sometimes called projection to latent structures) relate the information
present in two data tables that collect measurements on the same set of observations. PLS methods proceed by
deriving latent variables which are (optimal) linear combinations of the variables of a data table. When the goal
is to find the shared information between two tables, the approach is equivalent to a correlation problem and
the technique is then called partial least square correlation (PLSC) (also sometimes called PLS-SVD). In this
case there are two sets of latent variables (one set per table), and these latent variables are required to have
maximal covariance. When the goal is to predict one data table from the other one, the technique is then called
partial least square regression (PLSR). In this case there is one set of latent variables (derived from the predictor table)
and these latent variables are required to give the best possible prediction. In this paper we present and
illustrate PLSC and PLSR and show how these descriptive multivariate analysis techniques can be extended to
deal with inferential questions by using cross-validation techniques such as the bootstrap and permutation
tests.
Key words: Partial least square, Projection to latent structure, PLS correlation, PLS-SVD,
PLS-regression, Latent variable, Singular value decomposition, NIPALS method, Tucker inter-battery
analysis
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_23, # Springer Science+Business Media, LLC 2013
550 H. Abdi and L.J. Williams
2. Notations
Data are stored in matrices which are denoted by upper case bold
letters (e.g., X). The identity matrix is denoted I. Column vectors
23 Partial Least Squares Methods: Partial Least Squares. . . 551
are denoted by lower case bold letters (e.g., x). Matrix or vector
transposition is denoted by an uppercase superscript T (e.g., X^T).
Two bold letters placed next to each other imply matrix or vector
multiplication unless otherwise mentioned. The number of rows,
columns, or sub-matrices is denoted by an uppercase italic letter
(e.g., I) and a given row, column, or sub-matrix is denoted by a
lowercase italic letter (e.g., i).
PLS methods analyze the information common to two matrices.
The first matrix is an I by J matrix denoted X whose generic element
is x_{i,j} and where the rows are observations and the columns are
variables. For PLSR the X matrix contains the predictor variables
(i.e., independent variables). The second matrix is an I by K matrix,
denoted Y, whose generic element is y_{i,k}. For PLSR, the Y matrix
contains the variables to be predicted (i.e., dependent variables). In
general, matrices X and Y are statistically preprocessed in order to
make the variables comparable. Most of the time, the columns of X
and Y will be rescaled such that the mean of each column is zero and
its norm (i.e., the square root of the sum of its squared elements) is
one. When we need to mark the difference between the original data
and the preprocessed data, the original data matrices will be denoted
X and Y and the rescaled data matrices will be denoted ZX and ZY.
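The preprocessing described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the function name `center_and_normalize` and the toy values are hypothetical and not taken from the chapter.

```python
import numpy as np

def center_and_normalize(table):
    """Return a copy of `table` whose columns have mean 0 and norm 1."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)          # column means become zero
    norms = np.sqrt((centered ** 2).sum(axis=0))   # square root of the sum of squares
    if np.any(norms == 0):
        raise ValueError("a constant column cannot be rescaled to norm one")
    return centered / norms

# Hypothetical toy table: 4 observations (rows) by 2 variables (columns).
X = np.array([[13.0, 5.3],
              [9.0, 5.1],
              [11.0, 5.2],
              [17.0, 4.4]])
ZX = center_and_normalize(X)
```

With this scaling, the cross-product of two preprocessed columns is their correlation coefficient, which is why the cross-product matrix of two such tables is a correlation matrix.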
4. Partial Least Squares Correlation
PLSC generalizes the idea of correlation between two variables to
two tables. It was originally developed by Tucker (51), and refined
by Bookstein (14, 15, 46). This technique is particularly popular in
brain imaging because it can handle the very large data sets gener-
ated by these techniques and can easily be adapted to handle
sophisticated experimental designs (31, 38–41). For PLSC, both
tables play a similar role (i.e., both are dependent variables) and the
goal is to analyze the information common to these two tables. This
is obtained by deriving two new sets of variables (one for each table)
called latent variables that are obtained as linear combinations of
the original variables. These latent variables, which describe the
observations, are required to explain the largest portion of the
covariance between the two tables. The original variables are
described by their saliences.
For each latent variable, the X or Y variables whose saliences have a
large magnitude have large weights in the computation of the
latent variable. Therefore, they contribute a large amount to
creating the latent variable and should be used to interpret that
latent variable (i.e., the latent variable is mostly made from these
high contributing variables). By analogy with principal component
analysis (see, e.g., (13)), the latent variables are akin to factor scores
and the saliences are akin to loadings.
4.1. Correlation Between the Two Tables
Formally, the pattern of relationships between the columns of X and Y is stored in a K × J cross-product matrix, denoted R (that is usually a correlation matrix in that we compute it with ZX and ZY instead of X and Y). R is computed as:
$$\mathbf{R} = \mathbf{Z}_Y^{\mathsf{T}} \mathbf{Z}_X. \tag{2}$$
The SVD (see Eq. 1) of R decomposes it into three matrices:
$$\mathbf{R} = \mathbf{U} \boldsymbol{\Delta} \mathbf{V}^{\mathsf{T}}, \tag{3}$$
where Δ is the diagonal matrix of the singular values.
In the PLSC vocabulary, the singular vectors are called saliences:
so U is the matrix of Y-saliences and V is the matrix of X-saliences.
Because they are singular vectors, the norm of the saliences for a
given dimension is equal to one. Some authors (e.g., (31)) prefer
to normalize the saliences to their singular values (i.e., the delta-normed
Y saliences will be equal to UΔ instead of U, where Δ is the diagonal
matrix of the singular values) because the plots of the saliences will be
interpretable in the same way as factor score plots for PCA. We will
follow this approach here because it makes the interpretation of the
saliences easier.
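As a rough illustration of Eqs. 2 and 3, the sketch below computes R, its SVD, and the delta-normed saliences with NumPy. The data are random placeholders (not the wine data), and the helper names `center_and_normalize` and `plsc_svd` are hypothetical.

```python
import numpy as np

def center_and_normalize(table):
    """Center each column and rescale it to unit norm."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

def plsc_svd(ZX, ZY):
    """SVD of R = ZY^T ZX: returns R, Y-saliences U, singular values, X-saliences V."""
    R = ZY.T @ ZX                                    # K x J matrix (Eq. 2)
    U, delta, Vt = np.linalg.svd(R, full_matrices=False)
    return R, U, delta, Vt.T

rng = np.random.default_rng(0)                       # hypothetical random data
ZX = center_and_normalize(rng.normal(size=(36, 5)))  # I = 36 rows, J = 5 columns
ZY = center_and_normalize(rng.normal(size=(36, 9)))  # I = 36 rows, K = 9 columns
R, U, delta, V = plsc_svd(ZX, ZY)

FY = U * delta   # delta-normed Y saliences, same as U @ np.diag(delta)
FX = V * delta   # delta-normed X saliences, same as V @ np.diag(delta)
```

Multiplying by `delta` rescales each column, so the norm of a delta-normed salience equals the singular value of its dimension.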
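4.1.1. Common Inertia
The quantity of common information between the two tables can be directly quantified as the inertia common to the two tables. This quantity, denoted $\mathcal{I}_{\text{Total}}$, is defined as
$$\mathcal{I}_{\text{Total}} = \sum_{\ell}^{L} \delta_{\ell}, \tag{4}$$
where $\delta_{\ell}$ denotes the ℓth singular value of R (see Eq. 3) and L is the number of nonzero singular values of R.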
4.2. Latent Variables
The latent variables are obtained by projecting the original matrices onto their respective saliences. So, a latent variable is a linear combination of the original variables and the weights of this linear combination are the saliences. Specifically, we obtain the latent variables for X as:
$$\mathbf{L}_X = \mathbf{Z}_X \mathbf{V}, \tag{5}$$
and for Y as:
$$\mathbf{L}_Y = \mathbf{Z}_Y \mathbf{U}. \tag{6}$$
(NB: some authors compute the latent variables with Y and X
rather than ZY and ZX; this difference is only a matter of normali-
zation, but using ZY and ZX has the advantage of directly relating
the latent variables to the maximization criterion used). The latent
variables combine the measurements from one table in order to find
the common information between the two tables.
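The projection in Eqs. 5 and 6 can be sketched as follows. This is only an illustration on random placeholder data, assuming NumPy; the helper `center_and_normalize` is a hypothetical name.

```python
import numpy as np

def center_and_normalize(table):
    """Center each column and rescale it to unit norm."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

rng = np.random.default_rng(1)                       # hypothetical random data
ZX = center_and_normalize(rng.normal(size=(36, 5)))
ZY = center_and_normalize(rng.normal(size=(36, 9)))

# Saliences from the SVD of R = ZY^T ZX.
U, delta, Vt = np.linalg.svd(ZY.T @ ZX, full_matrices=False)
V = Vt.T

LX = ZX @ V   # latent variables of X, one column per dimension (Eq. 5)
LY = ZY @ U   # latent variables of Y, one column per dimension (Eq. 6)

# Cross-products between the X and Y latent variables.
cross_products = LX.T @ LY
```

In this sketch `cross_products` comes out diagonal, with the singular values on the diagonal: latent variables of the same dimension share as much cross-product as possible, and those of different dimensions share none.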
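4.3. What Does PLSC Optimize?
The goal of PLSC is to find pairs of latent vectors $\mathbf{l}_{X,\ell}$ and $\mathbf{l}_{Y,\ell}$ with maximal covariance and with the additional constraints that (1) the pairs of latent vectors made from two different indices are uncorrelated and (2) the coefficients used to compute the latent variables are normalized (see (48, 51), for proofs).
Formally, we want to find
$$\mathbf{l}_{X,\ell} = \mathbf{Z}_X \mathbf{v}_{\ell} \quad\text{and}\quad \mathbf{l}_{Y,\ell} = \mathbf{Z}_Y \mathbf{u}_{\ell}$$
such that
$$\operatorname{cov}\left(\mathbf{l}_{X,\ell}, \mathbf{l}_{Y,\ell}\right) \propto \mathbf{l}_{X,\ell}^{\mathsf{T}} \mathbf{l}_{Y,\ell} = \max \tag{7}$$
[where $\operatorname{cov}(\mathbf{l}_{X,\ell}, \mathbf{l}_{Y,\ell})$ denotes the covariance between $\mathbf{l}_{X,\ell}$ and $\mathbf{l}_{Y,\ell}$] under the constraints that
$$\mathbf{l}_{X,\ell}^{\mathsf{T}} \mathbf{l}_{Y,\ell'} = 0 \quad\text{when}\quad \ell \neq \ell' \tag{8}$$
(note that $\mathbf{l}_{X,\ell}^{\mathsf{T}} \mathbf{l}_{X,\ell'}$ and $\mathbf{l}_{Y,\ell}^{\mathsf{T}} \mathbf{l}_{Y,\ell'}$ are not required to be null) and
$$\mathbf{u}_{\ell}^{\mathsf{T}} \mathbf{u}_{\ell} = \mathbf{v}_{\ell}^{\mathsf{T}} \mathbf{v}_{\ell} = 1. \tag{9}$$
It follows from the properties of the SVD (see, e.g., (13, 21, 30, 47)) that $\mathbf{u}_{\ell}$ and $\mathbf{v}_{\ell}$ are singular vectors of R. In addition, from Eqs. 3, 5, and 6, the cross-product of a pair of latent variables (which is proportional to their covariance) is equal to the corresponding singular value:
$$\mathbf{l}_{X,\ell}^{\mathsf{T}} \mathbf{l}_{Y,\ell} = \delta_{\ell}. \tag{10}$$
So, when ℓ = 1, we have the largest possible covariance between the pair of latent variables. When ℓ = 2 we have the largest possible covariance for the latent variables under the constraints that the latent variables are uncorrelated with the first pair of latent variables (as stated in Eq. 8, e.g., $\mathbf{l}_{X,1}$ and $\mathbf{l}_{Y,2}$ are uncorrelated), and so on for larger values of ℓ.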
So in brief, for each dimension, PLSC provides two sets of
saliences (one for X, one for Y) and two sets of latent variables.
The saliences are the weights of the linear combination used to
compute the latent variables which are ordered by the amount of
covariance they explain. By analogy with principal component anal-
ysis, saliences are akin to loadings and latent variables are akin to
factor scores (see, e.g., (13)).
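As a short check on why the singular values appear as the cross-products of the latent variables, the following derivation sketches the argument. It uses only the definitions of R, its SVD, and the latent variables given above, and it writes Δ for the diagonal matrix of singular values and $\mathbf{L}_X$, $\mathbf{L}_Y$ for the matrices whose columns are the latent variables.

```latex
\mathbf{L}_X^{\mathsf{T}} \mathbf{L}_Y
  = \mathbf{V}^{\mathsf{T}} \mathbf{Z}_X^{\mathsf{T}} \mathbf{Z}_Y \mathbf{U}
  = \mathbf{V}^{\mathsf{T}} \mathbf{R}^{\mathsf{T}} \mathbf{U}
  = \mathbf{V}^{\mathsf{T}} \left( \mathbf{V} \boldsymbol{\Delta} \mathbf{U}^{\mathsf{T}} \right) \mathbf{U}
  = \boldsymbol{\Delta},
\qquad \text{because } \mathbf{U}^{\mathsf{T}} \mathbf{U} = \mathbf{V}^{\mathsf{T}} \mathbf{V} = \mathbf{I}.
```

The diagonal entries of Δ give the cross-product of each pair of latent variables of the same dimension, and the zero off-diagonal entries correspond to the constraint that latent variables of different dimensions are uncorrelated.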
4.4.1. Permutation Test for Omnibus Tests and Dimensions
The permutation test, originally developed by Student and Fisher (37), provides a nonparametric estimation of the sampling distribution of the indices computed and allows for null hypothesis testing. For a permutation test, the rows of X and Y are randomly permuted (in practice only one of the matrices needs to be permuted) so that any relationship between the two matrices is now replaced by a random configuration. The matrix Rperm is computed from the permuted matrices (this matrix reflects only random associations of the original data because of the permutations) and the analysis of Rperm is performed: the singular value decomposition of Rperm is computed. This gives a set of singular values, from which the overall index of effect $\mathcal{I}_{\text{Total}}$ (i.e., the common inertia) is computed. The process is repeated a large number of times (e.g., 10,000 times). Then, the distribution of the overall index and the distribution of the singular values are used to estimate the probability distribution of $\mathcal{I}_{\text{Total}}$ and of the singular values, respectively. If the common inertia computed for the sample is rare enough (e.g., less than 5%) then this index is considered statistically significant.
4.4.2. What are the Important Variables for a Dimension
The Bootstrap (23, 24) can be used to derive confidence intervals and bootstrap ratios (5, 6, 9, 40), which are also sometimes called test-values (32). Confidence intervals give lower and higher values, which together comprise a given proportion (e.g., often 95%) of the values of the saliences. If the zero value is not in the confidence interval of the saliences of a variable, this variable is considered relevant (i.e., significant). Bootstrap ratios are computed by dividing the mean of the bootstrapped distribution of a variable by its standard deviation. The bootstrap ratio is akin to a Student t criterion and so if a ratio is large enough (say 2.00 because it roughly corresponds to an α = .05 critical value for a t-test) then the variable is considered important for the dimension. The bootstrap estimates a sampling distribution of a statistic by computing multiple instances of this statistic from bootstrapped samples obtained by sampling with replacement from the original sample. For example, in order to evaluate the saliences of Y, the first step is to select with replacement a sample of the rows. This sample is then used to create Yboot and Xboot that are transformed into ZYboot and ZXboot, which are in turn used to compute Rboot as:
$$\mathbf{R}_{\text{boot}} = \mathbf{Z}_{Y,\text{boot}}^{\mathsf{T}} \mathbf{Z}_{X,\text{boot}}. \tag{11}$$
The Bootstrap values for Y, denoted Uboot, are then computed as
$$\mathbf{U}_{\text{boot}} = \mathbf{R}_{\text{boot}} \mathbf{V} \boldsymbol{\Delta}^{-1}, \tag{12}$$
where Δ is the diagonal matrix of the singular values of R. The values from a large set of bootstrapped samples (e.g., 10,000) are then used to compute confidence intervals and bootstrap ratios.
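The bootstrap procedure for the Y saliences can be sketched as below. This is a minimal illustration on simulated data, assuming NumPy; the function names, the use of percentile confidence intervals, and the small number of resamples are assumptions made for the example, not choices documented in the chapter.

```python
import numpy as np

def center_and_normalize(table):
    """Center each column and rescale it to unit norm."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

def bootstrap_y_saliences(X, Y, n_boot=1000, seed=0):
    """Bootstrap ratios and 95% percentile intervals for the Y saliences."""
    rng = np.random.default_rng(seed)
    ZX, ZY = center_and_normalize(X), center_and_normalize(Y)
    U, delta, Vt = np.linalg.svd(ZY.T @ ZX, full_matrices=False)
    V = Vt.T
    n_obs = X.shape[0]
    boots = np.empty((n_boot,) + U.shape)
    for b in range(n_boot):
        rows = rng.integers(0, n_obs, size=n_obs)       # sample rows with replacement
        R_boot = center_and_normalize(Y[rows]).T @ center_and_normalize(X[rows])  # Eq. 11
        boots[b] = R_boot @ V / delta                   # Eq. 12: R_boot V Delta^{-1}
    ratios = boots.mean(axis=0) / boots.std(axis=0, ddof=1)
    lower, upper = np.percentile(boots, [2.5, 97.5], axis=0)
    return U, ratios, lower, upper

# Simulated data (hypothetical): Y is a noisy linear function of X.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
Y = X @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(40, 4))
U, ratios, lower, upper = bootstrap_y_saliences(X, Y, n_boot=200)
```

Projecting each bootstrapped correlation matrix onto the fixed V (rather than recomputing an SVD per resample) keeps the dimensions aligned across resamples. With discrete ratings, a resample can produce a constant column, which this sketch does not guard against.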
4.5. PLSC: Example
We will illustrate PLSC with an example in which I = 36 wines are
described by a matrix X which contains J = 5 objective measurements
(price, total acidity, alcohol, sugar, and tannin) and by a
matrix Y which contains K = 9 sensory measurements (fruity, floral,
vegetal, spicy, woody, sweet, astringent, acidic, hedonic) provided
(on a 9 point rating scale) by a panel of trained wine assessors (the
ratings given were the median rating for the group of assessors).
Table 1 gives the raw data (note that columns two to four, which
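Table 1
Physical and chemical descriptions (matrix X) and assessor sensory evaluations (matrix Y) of 36 wines
| Wine | Varietal | Origin | Color | Price | Total acidity | Alcohol | Sugar | Tannin | Fruity | Floral | Vegetal | Spicy | Woody | Sweet | Astringent | Acidic | Hedonic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Merlot | Chile | Red | 13 | 5.33 | 13.8 | 2.75 | 559 | 6 | 2 | 1 | 4 | 5 | 3 | 5 | 4 | 2 |
| 2 | Cabernet | Chile | Red | 9 | 5.14 | 13.9 | 2.41 | 672 | 5 | 3 | 2 | 3 | 4 | 2 | 6 | 3 | 2 |
| 3 | Shiraz | Chile | Red | 11 | 5.16 | 14.3 | 2.20 | 455 | 7 | 1 | 2 | 6 | 5 | 3 | 4 | 2 | 2 |
| 4 | Pinot | Chile | Red | 17 | 4.37 | 13.5 | 3.00 | 348 | 5 | 3 | 2 | 2 | 4 | 1 | 3 | 4 | 4 |
| 5 | Chardonnay | Chile | White | 15 | 4.34 | 13.3 | 2.61 | 46 | 5 | 4 | 1 | 3 | 4 | 2 | 1 | 4 | 6 |
| 6 | Sauvignon | Chile | White | 11 | 6.60 | 13.3 | 3.17 | 54 | 7 | 5 | 6 | 1 | 1 | 4 | 1 | 5 | 8 |
| 7 | Riesling | Chile | White | 12 | 7.70 | 12.3 | 2.15 | 42 | 6 | 7 | 2 | 2 | 2 | 3 | 1 | 6 | 9 |
| 8 | Gewurztraminer | Chile | White | 13 | 6.70 | 12.5 | 2.51 | 51 | 5 | 8 | 2 | 1 | 1 | 4 | 1 | 4 | 9 |
| 9 | Malbec | Chile | Rose | 9 | 6.50 | 13.0 | 7.24 | 84 | 8 | 4 | 3 | 2 | 2 | 6 | 2 | 3 | 8 |
| 10 | Cabernet | Chile | Rose | 8 | 4.39 | 12.0 | 4.50 | 90 | 6 | 3 | 2 | 1 | 1 | 5 | 2 | 3 | 8 |
| 11 | Pinot | Chile | Rose | 10 | 4.89 | 12.0 | 6.37 | 76 | 7 | 2 | 1 | 1 | 1 | 4 | 1 | 4 | 9 |
| 12 | Syrah | Chile | Rose | 9 | 5.90 | 13.5 | 4.20 | 80 | 8 | 4 | 1 | 3 | 2 | 5 | 2 | 3 | 7 |
| 13 | Merlot | Canada | Red | 20 | 7.42 | 14.9 | 2.10 | 483 | 5 | 3 | 2 | 3 | 4 | 3 | 4 | 4 | 3 |
| 14 | Cabernet | Canada | Red | 16 | 7.35 | 14.5 | 1.90 | 698 | 6 | 3 | 2 | 2 | 5 | 2 | 5 | 4 | 2 |
| 15 | Shiraz | Canada | Red | 20 | 7.50 | 14.5 | 1.50 | 413 | 6 | 2 | 3 | 4 | 3 | 3 | 5 | 1 | 2 |
| 16 | Pinot | Canada | Red | 23 | 5.70 | 13.3 | 1.70 | 320 | 4 | 2 | 3 | 1 | 3 | 2 | 4 | 4 | 4 |
| 17 | Chardonnay | Canada | White | 20 | 6.00 | 13.5 | 3.00 | 35 | 4 | 3 | 2 | 1 | 3 | 2 | 2 | 3 | 5 |
| 18 | Sauvignon | Canada | White | 16 | 7.50 | 12.0 | 3.50 | 40 | 8 | 4 | 3 | 2 | 1 | 3 | 1 | 4 | 8 |
| 19 | Riesling | Canada | White | 16 | 7.00 | 11.9 | 3.40 | 48 | 7 | 5 | 1 | 1 | 3 | 3 | 1 | 7 | 8 |
| 20 | Gewurztraminer | Canada | White | 18 | 6.30 | 13.9 | 2.80 | 39 | 6 | 5 | 2 | 2 | 2 | 3 | 2 | 5 | 6 |
| 21 | Malbec | Canada | Rose | 11 | 5.90 | 12.0 | 5.50 | 90 | 6 | 3 | 3 | 3 | 2 | 4 | 2 | 4 | 8 |
| 22 | Cabernet | Canada | Rose | 10 | 5.60 | 12.5 | 4.00 | 85 | 5 | 4 | 1 | 3 | 2 | 4 | 2 | 4 | 7 |
| 23 | Pinot | Canada | Rose | 12 | 6.20 | 13.0 | 6.00 | 75 | 5 | 3 | 2 | 1 | 2 | 3 | 2 | 3 | 7 |
| 24 | Syrah | Canada | Rose | 12 | 5.80 | 13.0 | 3.50 | 83 | 7 | 3 | 2 | 3 | 3 | 4 | 1 | 4 | 7 |
| 25 | Merlot | USA | Red | 23 | 6.00 | 13.6 | 3.50 | 578 | 7 | 2 | 2 | 5 | 6 | 3 | 4 | 3 | 2 |
| 26 | Cabernet | USA | Red | 16 | 6.50 | 14.6 | 3.50 | 710 | 8 | 3 | 1 | 4 | 5 | 3 | 5 | 3 | 2 |
| 27 | Shiraz | USA | Red | 23 | 5.30 | 13.9 | 1.99 | 610 | 8 | 2 | 3 | 7 | 6 | 4 | 5 | 3 | 1 |
| 28 | Pinot | USA | Red | 25 | 6.10 | 14.0 | 0.00 | 340 | 6 | 3 | 2 | 2 | 5 | 2 | 4 | 4 | 2 |
| 29 | Chardonnay | USA | White | 16 | 7.20 | 13.3 | 1.10 | 41 | 6 | 4 | 2 | 3 | 6 | 3 | 2 | 4 | 5 |
| 30 | Sauvignon | USA | White | 11 | 7.20 | 13.5 | 1.00 | 50 | 6 | 5 | 5 | 1 | 2 | 4 | 2 | 4 | 7 |
| 31 | Riesling | USA | White | 13 | 8.60 | 12.0 | 1.65 | 47 | 5 | 5 | 3 | 2 | 2 | 4 | 2 | 5 | 8 |
| 32 | Gewurztraminer | USA | White | 20 | 9.60 | 12.0 | 0.00 | 45 | 6 | 6 | 3 | 2 | 2 | 4 | 2 | 3 | 8 |
| 33 | Malbec | USA | Rose | 8 | 6.20 | 12.5 | 4.00 | 84 | 8 | 2 | 1 | 4 | 3 | 5 | 2 | 4 | 7 |
| 34 | Cabernet | USA | Rose | 9 | 5.71 | 12.5 | 4.30 | 93 | 8 | 3 | 3 | 3 | 2 | 6 | 2 | 3 | 8 |
| 35 | Pinot | USA | Rose | 11 | 5.40 | 13.0 | 3.10 | 79 | 6 | 1 | 1 | 2 | 3 | 4 | 1 | 3 | 6 |
| 36 | Syrah | USA | Rose | 10 | 6.50 | 13.5 | 3.00 | 89 | 9 | 3 | 2 | 5 | 4 | 3 | 2 | 3 | 5 |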
describe the varietal, origin, and color of the wine, are not used in
the analysis but can help interpret the results).
4.5.1. Centering and Normalization
Because X and Y measure variables with very different scales, each column of these matrices is centered (i.e., its mean is zero) and rescaled so that its norm (i.e., square root of the sum of squares) is equal to one. This gives two new matrices called ZX and ZY which are given in Table 2.
The matrix of correlations R is then computed from ZX and ZY. In this example it is laid out with the J = 5 variables of X as rows and the K = 9 variables of Y as columns (i.e., as the transpose of the K by J layout of Eq. 2), so that, for this example, U holds the X-saliences and V holds the Y-saliences (compare with Eq. 16):
$$\mathbf{R} = \mathbf{Z}_X^{\mathsf{T}} \mathbf{Z}_Y =
\begin{bmatrix}
0.278 & 0.083 & 0.068 & 0.115 & 0.481 & 0.560 & 0.407 & 0.020 & 0.540 \\
0.029 & 0.531 & 0.348 & 0.168 & 0.162 & 0.084 & 0.098 & 0.202 & 0.202 \\
0.044 & 0.387 & 0.016 & 0.431 & 0.661 & 0.445 & 0.730 & 0.399 & 0.850 \\
0.305 & 0.187 & 0.198 & 0.118 & 0.400 & 0.469 & 0.326 & 0.054 & 0.418 \\
0.008 & 0.479 & 0.132 & 0.525 & 0.713 & 0.408 & 0.936 & 0.336 & 0.884
\end{bmatrix} \tag{13}$$
The R matrix contains the correlation between each variable in X and each variable in Y.
4.5.3. From Salience to Factor Score
The saliences can be plotted as a PCA-like map (one per table), but here we preferred to plot the delta-normed saliences FX and FY, which are also called factor scores. These graphs give the same information as the salience plots, but their normalization makes the plots easier to interpret.
Table 2
The matrices ZX and ZY (corresponding to X and Y). ZX: centered and normalized version of X (physical/chemical description; columns Price to Tannin). ZY: centered and normalized version of Y (assessors' evaluation; columns Fruity to Hedonic). The sign of each entry follows from the position of the raw value in Table 1 relative to its column mean.
| Wine | Varietal | Origin | Color | Price | Total acidity | Alcohol | Sugar | Tannin | Fruity | Floral | Vegetal | Spicy | Woody | Sweet | Astringent | Acidic | Hedonic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Merlot | Chile | Red | -0.046 | -0.137 | 0.120 | -0.030 | 0.252 | -0.041 | -0.162 | -0.185 | 0.154 | 0.211 | -0.062 | 0.272 | 0.044 | -0.235 |
| 2 | Cabernet | Chile | Red | -0.185 | -0.165 | 0.140 | -0.066 | 0.335 | -0.175 | -0.052 | -0.030 | 0.041 | 0.101 | -0.212 | 0.385 | -0.115 | -0.235 |
| 3 | Shiraz | Chile | Red | -0.116 | -0.162 | 0.219 | -0.088 | 0.176 | 0.093 | -0.271 | -0.030 | 0.380 | 0.211 | -0.062 | 0.160 | -0.275 | -0.235 |
| 4 | Pinot | Chile | Red | 0.093 | -0.278 | 0.061 | -0.003 | 0.098 | -0.175 | -0.052 | -0.030 | -0.072 | 0.101 | -0.361 | 0.047 | 0.044 | -0.105 |
| 5 | Chardonnay | Chile | White | 0.023 | -0.283 | 0.022 | -0.045 | -0.124 | -0.175 | 0.058 | -0.185 | 0.041 | 0.101 | -0.212 | -0.178 | 0.044 | 0.025 |
| 6 | Sauvignon | Chile | White | -0.116 | 0.049 | 0.022 | 0.015 | -0.118 | 0.093 | 0.168 | 0.590 | -0.185 | -0.229 | 0.087 | -0.178 | 0.204 | 0.155 |
| 7 | Riesling | Chile | White | -0.081 | 0.210 | -0.175 | -0.093 | -0.127 | -0.041 | 0.387 | -0.030 | -0.072 | -0.119 | -0.062 | -0.178 | 0.364 | 0.220 |
| 8 | Gewurztraminer | Chile | White | -0.046 | 0.064 | -0.136 | -0.055 | -0.120 | -0.175 | 0.497 | -0.030 | -0.185 | -0.229 | 0.087 | -0.178 | 0.044 | 0.220 |
| 9 | Malbec | Chile | Rose | -0.185 | 0.034 | -0.037 | 0.444 | -0.096 | 0.227 | 0.058 | 0.125 | -0.072 | -0.119 | 0.386 | -0.066 | -0.115 | 0.155 |
| 10 | Cabernet | Chile | Rose | -0.220 | -0.275 | -0.234 | 0.155 | -0.091 | -0.041 | -0.052 | -0.030 | -0.185 | -0.229 | 0.237 | -0.066 | -0.115 | 0.155 |
| 11 | Pinot | Chile | Rose | -0.150 | -0.202 | -0.234 | 0.352 | -0.102 | 0.093 | -0.162 | -0.185 | -0.185 | -0.229 | 0.087 | -0.178 | 0.044 | 0.220 |
| 12 | Syrah | Chile | Rose | -0.185 | -0.054 | 0.061 | 0.123 | -0.099 | 0.227 | 0.058 | -0.185 | 0.041 | -0.119 | 0.237 | -0.066 | -0.115 | 0.090 |
| 13 | Merlot | Canada | Red | 0.197 | 0.169 | 0.337 | -0.098 | 0.197 | -0.175 | -0.052 | -0.030 | 0.041 | 0.101 | -0.062 | 0.160 | 0.044 | -0.170 |
| 14 | Cabernet | Canada | Red | 0.058 | 0.159 | 0.258 | -0.119 | 0.354 | -0.041 | -0.052 | -0.030 | -0.072 | 0.211 | -0.212 | 0.272 | 0.044 | -0.235 |
| 15 | Shiraz | Canada | Red | 0.197 | 0.181 | 0.258 | -0.162 | 0.145 | -0.041 | -0.162 | 0.125 | 0.154 | -0.009 | -0.062 | 0.272 | -0.435 | -0.235 |
| 16 | Pinot | Canada | Red | 0.301 | -0.083 | 0.022 | -0.141 | 0.077 | -0.309 | -0.162 | 0.125 | -0.185 | -0.009 | -0.212 | 0.160 | 0.044 | -0.105 |
| 17 | Chardonnay | Canada | White | 0.197 | -0.039 | 0.061 | -0.003 | -0.132 | -0.309 | -0.052 | -0.030 | -0.185 | -0.009 | -0.212 | -0.066 | -0.115 | -0.040 |
| 18 | Sauvignon | Canada | White | 0.058 | 0.181 | -0.234 | 0.049 | -0.128 | 0.227 | 0.058 | 0.125 | -0.072 | -0.229 | -0.062 | -0.178 | 0.044 | 0.155 |
| 19 | Riesling | Canada | White | 0.058 | 0.108 | -0.254 | 0.039 | -0.122 | 0.093 | 0.168 | -0.185 | -0.185 | -0.009 | -0.062 | -0.178 | 0.523 | 0.155 |
| 20 | Gewurztraminer | Canada | White | 0.127 | 0.005 | 0.140 | -0.024 | -0.129 | -0.041 | 0.168 | -0.030 | -0.072 | -0.119 | -0.062 | -0.066 | 0.204 | 0.025 |
| 21 | Malbec | Canada | Rose | -0.116 | -0.054 | -0.234 | 0.261 | -0.091 | -0.041 | -0.052 | 0.125 | 0.041 | -0.119 | 0.087 | -0.066 | 0.044 | 0.155 |
| 22 | Cabernet | Canada | Rose | -0.150 | -0.098 | -0.136 | 0.102 | -0.095 | -0.175 | 0.058 | -0.185 | 0.041 | -0.119 | 0.087 | -0.066 | 0.044 | 0.090 |
| 23 | Pinot | Canada | Rose | -0.081 | -0.010 | -0.037 | 0.313 | -0.102 | -0.175 | -0.052 | -0.030 | -0.185 | -0.119 | -0.062 | -0.066 | -0.115 | 0.090 |
| 24 | Syrah | Canada | Rose | -0.081 | -0.068 | -0.037 | 0.049 | -0.097 | 0.093 | -0.052 | -0.030 | 0.041 | -0.009 | 0.087 | -0.178 | 0.044 | 0.090 |
| 25 | Merlot | USA | Red | 0.301 | -0.039 | 0.081 | 0.049 | 0.266 | 0.093 | -0.162 | -0.030 | 0.267 | 0.321 | -0.062 | 0.160 | -0.115 | -0.235 |
| 26 | Cabernet | USA | Red | 0.058 | 0.034 | 0.278 | 0.049 | 0.363 | 0.227 | -0.052 | -0.185 | 0.154 | 0.211 | -0.062 | 0.272 | -0.115 | -0.235 |
| 27 | Shiraz | USA | Red | 0.301 | -0.142 | 0.140 | -0.110 | 0.290 | 0.227 | -0.162 | 0.125 | 0.493 | 0.321 | 0.087 | 0.272 | -0.115 | -0.300 |
| 28 | Pinot | USA | Red | 0.370 | -0.024 | 0.160 | -0.320 | 0.092 | -0.041 | -0.052 | -0.030 | -0.072 | 0.211 | -0.212 | 0.160 | 0.044 | -0.235 |
| 29 | Chardonnay | USA | White | 0.058 | 0.137 | 0.022 | -0.204 | -0.127 | -0.041 | 0.058 | -0.030 | 0.041 | 0.321 | -0.062 | -0.066 | 0.044 | -0.040 |
| 30 | Sauvignon | USA | White | -0.116 | 0.137 | 0.061 | -0.214 | -0.121 | -0.041 | 0.168 | 0.435 | -0.185 | -0.119 | 0.087 | -0.066 | 0.044 | 0.090 |
| 31 | Riesling | USA | White | -0.046 | 0.342 | -0.234 | -0.146 | -0.123 | -0.175 | 0.168 | 0.125 | -0.072 | -0.119 | 0.087 | -0.066 | 0.204 | 0.155 |
| 32 | Gewurztraminer | USA | White | 0.197 | 0.489 | -0.234 | -0.320 | -0.124 | -0.041 | 0.278 | 0.125 | -0.072 | -0.119 | 0.087 | -0.066 | -0.115 | 0.155 |
| 33 | Malbec | USA | Rose | -0.220 | -0.010 | -0.136 | 0.102 | -0.096 | 0.227 | -0.162 | -0.185 | 0.154 | -0.009 | 0.237 | -0.066 | 0.044 | 0.090 |
| 34 | Cabernet | USA | Rose | -0.185 | -0.082 | -0.136 | 0.134 | -0.089 | 0.227 | -0.052 | 0.125 | 0.041 | -0.119 | 0.386 | -0.066 | -0.115 | 0.155 |
| 35 | Pinot | USA | Rose | -0.116 | -0.127 | -0.037 | 0.007 | -0.100 | -0.041 | -0.271 | -0.185 | -0.072 | -0.009 | 0.087 | -0.178 | -0.115 | 0.025 |
| 36 | Syrah | USA | Rose | -0.150 | 0.034 | 0.061 | -0.003 | -0.092 | 0.361 | -0.052 | -0.030 | 0.267 | 0.101 | -0.062 | -0.066 | -0.115 | -0.040 |
Fig. 2. The Saliences (normalized to their eigenvalues) for the physical attributes of the
wines.
$$\mathbf{F}_Y = \mathbf{V} \boldsymbol{\Delta} =
\begin{bmatrix}
0.210 & 0.297 & 0.198 & 0.006 & 0.037 \\
0.611 & 0.552 & 0.156 & 0.001 & 0.023 \\
0.079 & 0.389 & 0.145 & 0.056 & 0.013 \\
0.696 & 0.151 & 0.080 & 0.013 & 0.056 \\
1.161 & 0.117 & 0.022 & 0.001 & 0.007 \\
0.871 & 0.342 & 0.169 & 0.012 & 0.021 \\
1.287 & 0.009 & 0.169 & 0.072 & 0.015 \\
0.480 & 0.271 & 0.052 & 0.100 & 0.011 \\
1.417 & 0.067 & 0.017 & 0.034 & 0.007
\end{bmatrix} \tag{16}$$
Figures 2 and 3 show the plots of the X and Y saliences for
Dimensions 1 and 2.
4.5.4. Latent Variables
The latent variables for X and Y are computed according to Eqs. 5 and 6. These latent variables are shown in Tables 3 and 4. The corresponding plots for Dimensions 1 and 2 are given in Figures 4 and 5. These plots show clearly that wine color is a major determinant of the wines from both the physical and the sensory points of view.
Fig. 3. The Saliences (normalized to their eigenvalues) for the sensory evaluation of the attributes of the wines.
Table 3
PLSC. The X latent variables. LX = ZXV
Table 4
PLSC. The Y-latent variables. LY = ZYU
Fig. 4. Plot of the wines: The X-latent variables for Dimensions 1 and 2. (Legend entries: Chile, Canada, USA; red, rose, white.)
Fig. 5. Plot of the wines: The Y-latent variables for Dimensions 1 and 2. (Legend entries: Chile, Canada, USA; red, rose, white.)
4.5.5. Permutation Test
In order to evaluate if the overall analysis extracts relevant information, we computed the total inertia extracted by the PLSC. Using Eq. 4, we found that the inertia common to the two tables was equal to $\mathcal{I}_{\text{Total}} = 7.8626$. To evaluate its significance, we generated 10,000 R matrices by permuting the rows of X. The distribution of the values of the inertia is given in Fig. 6, which shows that the observed value was never obtained in the 10,000 permutations.
Fig. 6. Permutation test for the inertia explained by the PLSC of the wines (horizontal axis: inertia of the permuted sample). The observed value was never obtained in the 10,000 permutations. Therefore we conclude that PLSC extracted a significant amount of common variance between these two tables (P < 0.0001).
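The omnibus permutation test can be sketched as follows, assuming NumPy and simulated data. Eq. 4 sums the singular values, whereas the value of 7.8626 reported for the wine example appears to match the sum of the squared singular values (the squared column norms of the matrix in Eq. 16 add up to about 7.862), so the sketch exposes both through a hypothetical `squared` flag. The function names and the "+1" correction in the p-value are assumptions for the example, not choices stated in the chapter.

```python
import numpy as np

def center_and_normalize(table):
    """Center each column and rescale it to unit norm."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

def common_inertia(ZX, ZY, squared=False):
    """Sum of the singular values of R = ZY^T ZX (or of their squares)."""
    delta = np.linalg.svd(ZY.T @ ZX, compute_uv=False)
    return float(np.sum(delta ** 2 if squared else delta))

def permutation_test(X, Y, n_perm=1000, seed=0, squared=False):
    """Compare the observed inertia with inertias obtained after permuting the rows of X."""
    rng = np.random.default_rng(seed)
    ZX, ZY = center_and_normalize(X), center_and_normalize(Y)
    observed = common_inertia(ZX, ZY, squared)
    permuted = np.empty(n_perm)
    for i in range(n_perm):
        shuffled_rows = rng.permutation(ZX.shape[0])   # break the row correspondence
        permuted[i] = common_inertia(ZX[shuffled_rows], ZY, squared)
    # Proportion of permuted values at least as large as the observed one
    # (with the observed sample counted once, a common convention).
    p_value = (np.sum(permuted >= observed) + 1) / (n_perm + 1)
    return observed, permuted, p_value

# Simulated data (hypothetical): Y is a noisy linear function of X.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
Y = X @ rng.normal(size=(3, 4)) + 0.1 * rng.normal(size=(40, 4))
observed, permuted, p_value = permutation_test(X, Y, n_perm=199)
```

Because the normalization is done column by column, permuting the rows of ZX gives the same result as permuting the rows of X and normalizing again, so ZX is computed only once.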
4.5.6. Bootstrap
Bootstrap ratios and 95% confidence intervals for X and Y are given for Dimensions 1 and 2 in Table 5. As is often the case, bootstrap ratios and confidence intervals concur in indicating the relevant variables for a dimension. For example, for Dimension 1, the important variables (i.e., variables with a Bootstrap ratio > 2 or whose confidence interval excludes zero) for X are Tannin, Alcohol, Price, and Sugar; whereas for Y they are Hedonic, Astringent, Woody, Sweet, Floral, Spicy, and Acidic.
Table 5
PLSC. Bootstrap Ratios and Confidence Intervals for X and Y.
Dimension 1 Dimension 2
5. Partial Least Square Regression
Partial least square Regression (PLSR) is used when the goal of the
analysis is to predict a set of variables (denoted Y) from a set of
predictors (called X). As a regression technique, PLSR is used to
predict a whole table of data (by contrast with standard regression
which predicts one variable only), and it can also handle the case of
multicollinear predictors (i.e., when the predictors are not linearly
independent). These features make PLSR a very versatile tool
because it can be used with very large data sets for which standard
regression methods fail.
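In order to predict a table of variables, PLSR finds latent variables, denoted T (in matrix notation), that model X and simultaneously predict Y. Formally this is expressed as a double decomposition of X and the predicted Ŷ:
$$\mathbf{X} = \mathbf{T} \mathbf{P}^{\mathsf{T}} \quad\text{and}\quad \hat{\mathbf{Y}} = \mathbf{T} \mathbf{B} \mathbf{C}^{\mathsf{T}}.$$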
5.1. Iterative Computation of the Latent Variables in PLSR
In PLSR, the latent variables are computed by iterative applications of the SVD. Each run of the SVD produces orthogonal latent variables for X and Y and corresponding regression weights (see, e.g., (4) for more details and alternative algorithms).
5.1.1. Step One
To simplify the notation we will assume that X and Y are mean-centered and normalized such that the mean of each column is zero and its sum of squares is one. At step one, X and Y are stored (respectively) in matrices X0 and Y0. The matrix of correlations (or covariance) between X0 and Y0 is computed as
$$\mathbf{R}_1 = \mathbf{X}_0^{\mathsf{T}} \mathbf{Y}_0. \tag{20}$$
The SVD of R1 decomposes it as
$$\mathbf{R}_1 = \mathbf{W}_1 \boldsymbol{\Delta}_1 \mathbf{C}_1^{\mathsf{T}}. \tag{21}$$
The first pair of singular vectors (i.e., the first columns of W1 and C1) are denoted w1 and c1 and the first singular value (i.e., the first diagonal entry of Δ1) is denoted δ1. The singular value represents the maximum covariance between the singular vectors. The first latent variable of X is given by (compare with Eq. 5 defining LX):
$$\mathbf{t}_1 = \mathbf{X}_0 \mathbf{w}_1 \tag{22}$$
where t1 is normalized such that $\mathbf{t}_1^{\mathsf{T}} \mathbf{t}_1 = 1$. The loadings of X0 on t1 (i.e., the projection of X0 on the space of t1) are given by
$$\mathbf{p}_1 = \mathbf{X}_0^{\mathsf{T}} \mathbf{t}_1. \tag{23}$$
The least square estimate of X from the first latent variable is given by
$$\hat{\mathbf{X}}_1 = \mathbf{t}_1 \mathbf{p}_1^{\mathsf{T}}. \tag{24}$$
5.1.2. Last Step
The iterative process continues until X is completely decomposed into L components (where L is the rank of X). When this is done, the weights (i.e., all the $\mathbf{w}_{\ell}$'s) for X are stored in the J by L matrix W (whose ℓth column is $\mathbf{w}_{\ell}$). The latent variables of X are stored in the I by L matrix T. The weights for Y are stored in the K by L matrix C. The pseudo latent variables of Y are stored in the I by L matrix U. The loadings for X are stored in the J by L matrix P. The regression weights are stored in a diagonal matrix B. These regression weights are used to predict Y from X; therefore, there is one $b_{\ell}$ for every pair of $\mathbf{t}_{\ell}$ and $\mathbf{u}_{\ell}$, and so B is an L × L diagonal matrix.
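The iteration can be sketched as below, assuming NumPy and simulated data. The first part of each pass follows the step-one equations (SVD of the cross-product matrix, normalized latent variable, loadings). The pseudo latent variable u = Yc, the regression weight b = tᵀu, and the deflation of X and Y are not spelled out in the text above; they are filled in here from the standard form of this algorithm and should be read as assumptions, as should the function and variable names. The regression-matrix form computed at the end uses the Moore-Penrose pseudo-inverse of Pᵀ.

```python
import numpy as np

def center_and_normalize(table):
    """Center each column and rescale it to unit norm."""
    table = np.asarray(table, dtype=float)
    centered = table - table.mean(axis=0)
    return centered / np.sqrt((centered ** 2).sum(axis=0))

def plsr_fit(X, Y, n_components):
    """Iterative SVD-based PLSR on preprocessed X (I x J) and Y (I x K)."""
    X_res = np.asarray(X, dtype=float).copy()
    Y_res = np.asarray(Y, dtype=float).copy()
    n_obs, n_x = X_res.shape
    n_y = Y_res.shape[1]
    T = np.zeros((n_obs, n_components))
    U = np.zeros((n_obs, n_components))
    W = np.zeros((n_x, n_components))
    C = np.zeros((n_y, n_components))
    P = np.zeros((n_x, n_components))
    b = np.zeros(n_components)
    for ell in range(n_components):
        left, _, right_t = np.linalg.svd(X_res.T @ Y_res, full_matrices=False)
        w, c = left[:, 0], right_t[0, :]         # first pair of singular vectors
        t = X_res @ w
        t = t / np.linalg.norm(t)                # latent variable of X, unit norm
        u = Y_res @ c                            # pseudo latent variable of Y (assumed step)
        p = X_res.T @ t                          # loadings of X on t
        b[ell] = t @ u                           # regression weight (assumed step)
        X_res = X_res - np.outer(t, p)           # deflate X (assumed step)
        Y_res = Y_res - b[ell] * np.outer(t, c)  # deflate Y (assumed step)
        T[:, ell], U[:, ell], W[:, ell], C[:, ell], P[:, ell] = t, u, w, c, p
    B = np.diag(b)
    B_pls = np.linalg.pinv(P.T) @ B @ C.T        # regression-matrix form
    return {"T": T, "U": U, "W": W, "C": C, "P": P, "B": B, "B_pls": B_pls}

# Simulated data (hypothetical): 20 observations, J = 3 predictors, K = 2 responses.
rng = np.random.default_rng(3)
X = center_and_normalize(rng.normal(size=(20, 3)))
Y = center_and_normalize(X @ rng.normal(size=(3, 2)) + 0.3 * rng.normal(size=(20, 2)))
model = plsr_fit(X, Y, n_components=3)           # L equals the rank of X here
Y_hat = model["T"] @ model["B"] @ model["C"].T
```

The example uses as many components as the rank of X, the case described in the text, in which X is reproduced exactly by T Pᵀ and the two ways of computing the prediction (from T or from X through `B_pls`) coincide.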
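The predicted Y scores are now given by
$$\hat{\mathbf{Y}} = \mathbf{T} \mathbf{B} \mathbf{C}^{\mathsf{T}} = \mathbf{X} \mathbf{B}_{\text{PLS}}, \tag{30}$$
where $\mathbf{B}_{\text{PLS}} = \mathbf{P}^{\mathsf{T}+} \mathbf{B} \mathbf{C}^{\mathsf{T}}$ (where $\mathbf{P}^{\mathsf{T}+}$ is the Moore-Penrose pseudo-inverse of $\mathbf{P}^{\mathsf{T}}$).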
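5.2. What Does PLSR Optimize?
PLSR finds a series of L latent variables $\mathbf{t}_{\ell}$ such that the covariance between t1 and Y is maximal and such that t1 is uncorrelated with t2 which has maximal covariance with Y and so on for all L latent variables (see, e.g., (4, 17, 19, 26, 48, 49), for proofs and developments). Formally, we seek a set of L linear transformations of X that satisfies (compare with Eq. 7):
$$\mathbf{t}_{\ell} = \mathbf{X} \mathbf{w}_{\ell} \quad\text{such that}\quad \operatorname{cov}\left(\mathbf{t}_{\ell}, \mathbf{Y}\right) = \max \tag{31}$$
under the constraints that
$$\mathbf{t}_{\ell}^{\mathsf{T}} \mathbf{t}_{\ell'} = 0 \quad\text{when}\quad \ell \neq \ell' \tag{32}$$
and
$$\mathbf{t}_{\ell}^{\mathsf{T}} \mathbf{t}_{\ell} = 1. \tag{33}$$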
5.3. How Good is the Prediction?
5.3.1. Fixed Effect Model
A common measure of the quality of prediction of observations within the sample is the Residual Estimated Sum of Squares (RESS), which is given by (4)
$$\text{RESS} = \left\| \mathbf{Y} - \hat{\mathbf{Y}} \right\|^2, \tag{34}$$
where $\| \cdot \|^2$ is the square of the norm of a matrix (i.e., the sum of squares of all the elements of this matrix). The smaller the value of RESS, the better the quality of prediction (4, 13).
5.3.2. Random Effect Model
The quality of prediction generalized to observations outside of the sample is measured in a way similar to RESS and is called the Predicted Residual Estimated Sum of Squares (PRESS). Formally PRESS is obtained as (4):
$$\text{PRESS} = \left\| \mathbf{Y} - \tilde{\mathbf{Y}} \right\|^2. \tag{35}$$
The smaller PRESS is, the better the prediction.
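The two indices can be sketched as follows, assuming NumPy. The text above does not say how the out-of-sample predictions are obtained, so this sketch assumes a leave-one-out scheme. To stay short it also uses ordinary least squares as a stand-in predictor rather than PLSR; any function with the same `fit_predict(X_train, Y_train, X_test)` signature (a hypothetical interface) could be plugged in instead.

```python
import numpy as np

def ress(Y, Y_hat):
    """Residual sum of squares between Y and in-sample predictions."""
    return float(np.sum((np.asarray(Y) - np.asarray(Y_hat)) ** 2))

def press_leave_one_out(X, Y, fit_predict):
    """Sum of squared errors when each row is predicted from all the other rows."""
    n_obs = X.shape[0]
    Y_tilde = np.empty_like(Y, dtype=float)
    for i in range(n_obs):
        keep = np.arange(n_obs) != i             # every observation except the ith
        Y_tilde[i] = fit_predict(X[keep], Y[keep], X[i:i + 1])[0]
    return float(np.sum((Y - Y_tilde) ** 2))

def least_squares_predict(X_train, Y_train, X_test):
    """Stand-in predictor (ordinary least squares, not PLSR)."""
    coefficients, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
    return X_test @ coefficients

# Simulated data (hypothetical).
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 3))
Y = X @ rng.normal(size=(3, 2)) + 0.5 * rng.normal(size=(30, 2))

ress_value = ress(Y, least_squares_predict(X, Y, X))
press_value = press_leave_one_out(X, Y, least_squares_predict)
```

With the least squares stand-in, the leave-one-out error is never smaller than the in-sample error, which illustrates why a fixed effect index tends to be optimistic about new observations.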
5.3.3. How Many Latent Variables?
By contrast with the fixed effect model, the quality of prediction for a random model does not always increase with the number of latent variables used in the model. Typically, the quality first increases and then decreases. If the quality of the prediction decreases when the number of latent variables increases this indicates that the model is overfitting the data (i.e., the information useful to fit the observations from the learning set is not useful to fit new observations). Therefore, for a random model, it is critical to determine the optimal number of latent variables to keep for building the model. A straightforward approach is to stop adding latent variables as soon as the PRESS stops decreasing. A more elaborate approach (see, e.g., (48)) starts by computing the ratio $Q_{\ell}^2$ for the ℓth latent variable, which is defined as
$$Q_{\ell}^2 = 1 - \frac{\text{PRESS}_{\ell}}{\text{RESS}_{\ell-1}}, \tag{36}$$
with $\text{PRESS}_{\ell}$ (resp. $\text{RESS}_{\ell-1}$) being the value of PRESS (resp. RESS) for the ℓth (resp. ℓ − 1) latent variable [where $\text{RESS}_0 = K \times (I - 1)$]. A latent variable is kept if its value of $Q_{\ell}^2$ is larger than some arbitrary value generally set equal to $(1 - .95^2) = .0975$ (an alternative set of values sets the threshold to .05 when I ≤ 100 and to 0 when I > 100, see, e.g., (48, 58), for more details). Obviously, the choice of the threshold is important from a theoretical point of view, but, from a practical point of view, the values indicated above seem satisfactory.
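The selection rule can be sketched as below, assuming NumPy. The PRESS and RESS values are made-up numbers chosen only to exercise the formula, the function names are hypothetical, and the sketch assumes that latent variables are added in order and that selection stops at the first one whose ratio does not exceed the threshold.

```python
import numpy as np

def q2_ratios(press, ress, n_obs, n_y):
    """Q2 for each latent variable: 1 - PRESS_l / RESS_(l-1), with RESS_0 = K (I - 1)."""
    press = np.asarray(press, dtype=float)
    ress = np.asarray(ress, dtype=float)
    ress_previous = np.concatenate(([n_y * (n_obs - 1)], ress[:-1]))
    return 1.0 - press / ress_previous

def n_latent_variables_to_keep(q2, threshold=1 - 0.95 ** 2):
    """Number of leading latent variables whose Q2 exceeds the threshold (.0975 by default)."""
    failing = np.flatnonzero(np.asarray(q2) <= threshold)
    return int(failing[0]) if failing.size else len(q2)

# Made-up values for three latent variables, with I = 36 observations and K = 9 variables.
press = [200.0, 190.0, 195.0]
ress = [180.0, 150.0, 140.0]
q2 = q2_ratios(press, ress, n_obs=36, n_y=9)
n_keep = n_latent_variables_to_keep(q2)
```

Here the first ratio is 1 − 200/315 (about 0.37, above the threshold) and the second is 1 − 190/180 (negative), so only the first latent variable would be kept under this rule.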
5.3.4. Bootstrap Confidence Intervals for the Dependent Variables
When the number of latent variables of the model has been decided, confidence intervals for the predicted values can be derived using the Bootstrap. Here, each bootstrapped sample provides a value of BPLS which is used to estimate the values of the observations in the testing set. The distribution of the values of these observations is then used to estimate the sampling distribution and to derive Bootstrap ratios and confidence intervals.
5.4. PLSR: Example
We will use the same example as for PLSC (see data in Tables 1 and 2). Here we used the physical measurements stored in matrix X to predict the sensory evaluation data stored in matrix Y. In order to facilitate the comparison between PLSC and PLSR, we have decided to keep two latent variables for the analysis. However if we had used the Q² criterion of Eq. 36, with values of 1.3027 for Dimension 1 and 0.2870 for Dimension 2, we should have kept only one latent variable for further analysis.
Table 6 gives the values of the latent variables (T), the reconstituted values of X (X̂) and the predicted values of Y (Ŷ). The value of BPLS computed with two latent variables is equal to
$$\mathbf{B}_{\text{PLS}} =
\begin{bmatrix}
0.0981 & 0.0558 & 0.0859 & 0.0533 & 0.1785 & 0.1951 & 0.1692 & 0.0025 & 0.2000 \\
0.0877 & 0.3127 & 0.1713 & 0.1615 & 0.1204 & 0.0114 & 0.1813 & 0.1770 & 0.1766 \\
0.0276 & 0.2337 & 0.0655 & 0.2135 & 0.3160 & 0.2097 & 0.3633 & 0.1650 & 0.3936 \\
0.1253 & 0.1728 & 0.1463 & 0.0127 & 0.1199 & 0.1863 & 0.0877 & 0.0707 & 0.1182 \\
0.0009 & 0.3373 & 0.1219 & 0.2675 & 0.3573 & 0.2072 & 0.4247 & 0.2239 & 0.4536
\end{bmatrix}. \tag{37}$$
The values of W which play the role of loadings for X are equal to
$$\mathbf{W} =
\begin{bmatrix}
0.3660 & 0.4267 \\
0.1801 & 0.5896 \\
0.5844 & 0.0771 \\
0.2715 & 0.6256 \\
0.6468 & 0.2703
\end{bmatrix}. \tag{38}$$
A plot of the first two dimensions of W given in Fig. 7 shows that
X is structured around two main dimensions. The first dimension
opposes the wines rich in alcohol and tannin (which are the red
wines) to wines that are sweet or acidic. The second
dimension opposes sweet wines to acidic wines (which are also
more expensive) (Figs. 8 and 9).
Table 6
PLSR: Prediction of the sensory data (matrix Y) from the physical measurements (matrix X). Matrices T, U, X̂, and Ŷ
Columns, in order: Wine; T (Dim 1, Dim 2); U (Dim 1, Dim 2); X̂ (Price, Total acidity, Alcohol, Sugar, Tannin); Ŷ (Fruity, Floral, Vegetal, Spicy, Woody, Sweet, Astringent, Acidic, Hedonic)
1 0.16837 0.16041 2.6776 0.97544 15.113 5.239 14.048 3.3113 471.17 6.3784 2.0725 1.7955 3.6348 4.3573 2.9321 4.0971 3.1042 2.8373
2 0.18798 0.22655 2.8907 0.089524 14.509 4.8526 14.178 3.6701 517.12 6.4612 1.6826 1.6481 3.8384 4.5273 2.9283 4.337 2.9505 2.4263
3 0.17043 0.15673 3.1102 2.1179 15.205 5.2581 14.055 3.2759 472.39 6.3705 2.0824 1.8026 3.6365 4.3703 2.9199 4.108 3.1063 2.8139
4 0.12413 0.14737 1.4404 1.0106 14.482 5.3454 13.841 3.438 413.67 6.4048 2.3011 1.8384 3.4268 4.0358 3.0917 3.7384 3.2164 3.5122
5 0.0028577 0.07931 0.13304 0.5399 13.226 5.8188 13.252 3.5632 245.11 6.4267 3.0822 2.0248 2.7972 3.14 3.4934 2.7119 3.5798 5.422
6 0.080038 0.015175 2.6712 2.671 13.069 6.4119 12.82 3.319 113.62 6.3665 3.8455 2.2542 2.2783 2.5057 3.7148 1.9458 3.9108 6.8164
7 0.15284 0.18654 2.4224 2.2504 14.224 7.4296 12.383 2.4971 31.847 6.1754 4.9385 2.6436 1.6593 1.9082 3.8112 1.1543 4.3538 8.205
8 0.11498 0.09827 2.9223 2.4331 13.636 6.9051 12.61 2.9187 43.514 6.2735 4.3742 2.4429 1.9796 2.2185 3.7601 1.5647 4.1249 7.4844
9 0.18784 0.21492 1.952 0.98895 7.6991 5.1995 12.482 5.4279 62.365 6.8409 3.1503 1.8016 2.2553 1.8423 4.3943 1.4234 3.7333 7.975
10 0.18149 0.21809 1.8177 0.95013 7.7708 5.1769 12.513 5.4187 71.068 6.8391 3.1112 1.7927 2.2876 1.8891 4.3728 1.4767 3.7149 7.8756
11 0.21392 0.25088 2.1158 1.4184 6.6886 5.017 12.388 5.8026 43.283 6.9247 3.0763 1.7341 2.2134 1.6728 4.5368 1.2706 3.7243 8.2918
12 0.080776 0.11954 1.2197 1.3413 11.084 5.6554 12.902 4.2487 158.44 6.5782 3.2041 1.9678 2.524 2.5621 3.8671 2.121 3.6803 6.5773
13 0.26477 0.085879 1.5647 0.88629 20.508 6.5509 14.323 1.1469 503.24 5.8908 2.8881 2.2864 3.5804 4.9319 2.2795 4.5097 3.3327 1.8765
14 0.27335 0.012467 2.4386 0.84706 19.593 6.1319 14.409 1.6096 538.44 5.9966 2.5048 2.1273 3.7516 5.0267 2.3272 4.6744 3.1889 1.6141
15 0.22148 0.14773 2.5658 0.88267 20.609 6.931 14.089 0.93334 430.34 5.8398 3.3465 2.4328 3.2863 4.595 2.3812 4.0928 3.5271 2.6278
16 0.15251 0.089213 1.1964 1.2017 18.471 6.6538 13.817 1.6729 367.45 6.0044 3.3258 2.332 3.1078 4.1299 2.7176 3.6395 3.5663 3.5337
17 0.020577 0.072286 0.3852 0.65881 15.773 6.6575 13.235 2.4344 214.94 6.1706 3.7406 2.3412 2.5909 3.197 3.2555 2.6449 3.805 5.4426
18 0.16503 0.15453 1.8587 0.17362 13.53 7.2588 12.349 2.7767 35.61 6.2384 4.8313 2.5797 1.6678 1.8359 3.8946 1.1032 4.3234 8.3249
19 0.15938 0.12373 2.0114 1.163 13.184 7.0815 12.394 2.9608 18.379 6.2806 4.6627 2.5123 1.7481 1.8903 3.9066 1.1882 4.2588 8.1846
20 0.034285 0.071934 0.99958 1.5624 16.023 6.6453 13.297 2.3698 231.5 6.1567 3.6874 2.3357 2.6485 3.2949 3.202 2.7511 3.7765 5.2403
21 0.20205 0.12592 1.0834 0.53399 8.7377 5.7103 12.362 4.8856 15.13 6.7166 3.6292 1.9958 2.0319 1.7003 4.3514 1.1944 3.9154 8.3491
22 0.13903 0.095646 0.90872 0.44113 10.351 5.8333 12.626 4.3693 80.458 6.6025 3.5372 2.0386 2.2379 2.1358 4.0699 1.6397 3.8397 7.4783
23 0.13566 0.14176 0.67329 0.26414 9.7392 5.5716 12.67 4.6698 100.15 6.6711 3.3041 1.9394 2.3371 2.1809 4.1077 1.7276 3.7534 7.3432
24 0.077587 0.048002 0.95125 0.55919 12.19 6.055 12.871 3.7413 137.99 6.4628 3.5342 2.1189 2.4052 2.5521 3.7752 2.0495 3.797 6.6631
25 0.21821 0.043304 2.897 1.0065 17.752 5.8598 14.197 2.2626 491.22 6.1423 2.4453 2.0275 3.6255 4.659 2.6061 4.3241 3.2047 2.3216
26 0.26916 0.13515 2.5723 0.85536 17.355 5.3054 14.484 2.6448 583.5 6.2322 1.8146 1.8147 4.0069 5.0644 2.5074 4.8404 2.9431 1.4018
27 0.29345 0.034272 3.4006 1.2348 19.282 5.8542 14.529 1.8326 578.41 6.0485 2.2058 2.021 3.9215 5.1914 2.3 4.8922 3.0675 1.2318
28 0.25617 0.20133 2.1121 0.9005 22.038 7.2062 14.211 0.39528 453.76 5.7192 3.4724 2.5349 3.3314 4.8178 2.1853 4.2883 3.549 2.2172
29 0.011979 0.21759 0.85732 0.18988 17.295 7.4986 12.996 1.5947 126.59 5.9776 4.5577 2.6614 2.1872 2.8984 3.2224 2.1988 4.1213 6.1909
30 0.034508 0.16317 1.5868 2.1363 16.08 7.2096 12.93 2.079 118.03 6.0867 4.3821 2.5534 2.1941 2.7626 3.3714 2.0981 4.0733 6.4213
31 0.17235 0.29489 1.6713 1.187 15.448 8.0531 12.226 1.8476 92 6.0264 5.5299 2.8808 1.3781 1.7195 3.7677 0.85837 4.58 8.6928
32 0.098879 0.52412 1.5407 1.0685 20.167 9.2864 12.41 0.087465 81.643 5.5898 6.35 3.3433 1.26 2.1385 3.2243 1.1171 4.8257 8.0376
33 0.1672 0.072228 0.62606 2.4774 10.171 5.986 12.484 4.3461 38.717 6.5956 3.7551 2.0981 2.0776 1.9242 4.1547 1.391 3.9372 7.9361
34 0.15281 0.11474 1.6241 1.4483 9.816 5.7363 12.576 4.5679 70.404 6.6469 3.4977 2.0027 2.2159 2.0463 4.1453 1.5591 3.8348 7.6456
35 0.072566 0.066931 0.35548 1.7924 12.006 5.9449 12.906 3.8469 150.44 6.4872 3.4248 2.0769 2.461 2.5965 3.7765 2.1136 3.7542 6.5542
36 0.056807 0.0071035 0.76977 1.6816 13.174 6.2693 12.938 3.3586 149.05 6.3768 3.6517 2.1988 2.416 2.6815 3.6481 2.1548 3.8253 6.4334
Fig. 8. The circle of correlation between the Y variables and the latent variables for Dimensions 1 and 2. [Plot labels: sugar, tannin, alcohol, price, total acidity, spicy, fruity, sweet, astringent, woody, acidic, hedonic, vegetal; axes labeled 1 and 2.]
Fig. 9. PLSR. Plot of the latent variables (wines) for Dimensions 1 and 2. [Points are the numbered wines; legend entries: Chile, Canada, USA and red, rose, yellow; axes labeled 1 and 2.]
6. Software
7. Related Methods
8. Conclusion
References
1. Abdi H (2001) Linear algebra for neural net- 6. Abdi H, Edelman B, Valentin D, Dowling WJ
works. In: Smelser N, Baltes P (eds) Interna- (2009b) Experimental design and analysis for
tional encyclopedia of the social and behavioral psychology. Oxford University Press, Oxford
sciences. Elsevier, Oxford UK 7. Abdi H, Valentin D (2007a) Multiple factor
2. Abdi H (2007a) Eigen-decomposition: eigen- analysis (MFA). In: Salkind N (ed) Encyclope-
values and eigenvectors. In: Salkind N (ed) dia of measurement and statistics. Sage, Thou-
Encyclopedia of measurement and statistics. sand Oaks, CA
Sage, Thousand Oaks, CA 8. Abdi H, Valentin D (2007b) STATIS. In: Sal-
3. Abdi H (2007) Singular value decomposition kind N (ed) Encyclopedia of measurement and
(SVD) and generalized singular value decom- statistics. Sage, Thousand Oaks, CA
position (GSVD). In: Salkind N (ed) Encyclo- 9. Abdi H, Valentin D, OToole AJ, Edelman B
pedia of measurement and statistics. Sage, (2005) DISTATIS: the analysis of multiple
Thousand Oaks, CA distance matrices. In: Proceedings of the
4. Abdi H (2010) Partial least square regression, IEEE computer society: international confer-
projection on latent structure regression, PLS- ence on computer vision and pattern recogni-
regression. Wiley Interdiscipl Rev Comput Stat tion pp 4247
2:97106 10. Abdi H, Williams LJ (2010a) Barycentric dis-
5. Abdi H, Dunlop JP, Williams LJ (2009) How to criminant analysis. In: Salkind N (ed) Encyclo-
compute reliability estimates and display confi- pedia of research design. Sage, Thousand Oaks,
dence and tolerance intervals for pattern classi- CA
fiers using the Bootstrap and 3-way 11. Abdi H, Williams LJ (2010b) The jackknife. In:
multidimensional scaling (DISTATIS). Neuro- Salkind N (ed) Encyclopedia of research
Image 45:8995 design. Sage, Thousand Oaks, CA
578 H. Abdi and L.J. Williams
12. Abdi H, Williams LJ (2010c) Matrix algebra. 28. Gittins R (1985) Canonical analysis. Springer,
In: Salkind N (ed) Encyclopedia of research New York
design. Sage, Thousand Oaks, CA 29. Good P (2005) Permutation, parametric and
13. Abdi H, Williams LJ (2010d) Principal compo- bootstrap tests of hypotheses. Springer, New
nents analysis. Wiley Interdiscipl Rev Comput York
Stat 2:433459 30. Greenacre M (1984) Theory and applications
14. Bookstein F (1982) The geometric meaning of of correspondence analysis. Academic, London
soft modeling with some generalizations. In: 31. Krishnan A, Williams LJ, McIntosh AR, Abdi
Joreskog K, Wold H (eds) System under indi- H (2011) Partial least squares (PLS) methods
rect observation, vol 2. North-Holland, for neuroimaging: a tutorial and review. Neu-
Amsterdam. roImage 56:455475
15. Bookstein FL (1994) Partial least squares: a 32. Lebart L, Piron M, Morineau A (2007) Statis-
dose-response model for measurement in the tiques exploratoires multidimensionelle.
behavioral and brain sciences. Psycoloquy 5 Dunod, Paris
16. Boulesteix AL, Strimmer K (2006) Partial least 33. Mardia KV, Kent JT, Bibby JM (1979) Multi-
squares: a versatile tool for the analysis of high- variate analysis. Academic, London
dimensional genomic data. Briefing in Bioin- 34. Martens H, Martens M (2001) Multivariate
formatics 8:3244 analysis of quality: an introduction. Wiley, Lon-
17. Burnham A, Viveros R, MacGregor J (1996) don
Frameworks for latent variable multivariate 35. Martens H, Naes T (1989) Multivariate cali-
regression. J Chemometr 10:3145 bration. Wiley, London
18. Chessel D, Hanafi M (1996) Analyse de la co- 36. Mazerolles G, Hanafi M, Dufour E, Bertrand
inertie de k nuages de points. Revue de Statis- D, Qannari ME (2006) Common components
tique Appliquee 44:3560 and specific weights analysis: a chemometric
19. de Jong S (1993) SIMPLS: an alternative method for dealing with complexity of food
approach to partial least squares regression. products. Chemometr Intell Lab Syst
Chemometr Intell Lab Syst 18:251263 81:4149
20. de Jong S, Phatak A (1997) Partial least squares 37. McCloskey DN, Ziliak J (2008) The cult of
regression. In: Proceedings of the second inter- statistical significance: how the standard error
national workshop on recent advances in total costs us jobs, justice, and lives. University of
least squares techniques and error-in-variables Michigan Press, Michigan
modeling. Society for Industrial and Applied 38. McIntosh AR, Gonzalez-Lima F (1991) Struc-
Mathematics tural modeling of functional neural pathways
21. de Leeuw J (2007) Derivatives of generalized mapped with 2-deoxyglucose: effects of acous-
eigen-systems with applications. Department tic startle habituation on the auditory system.
of Statistics Papers, 128 Brain Res 547:295302
22. Dray S, Chessel D, Thioulouse J (2003) 39. McIntosh AR, Lobaugh NJ (2004) Partial least
Co-inertia analysis and the linking of ecological squares analysis of neuroimaging data: applica-
data tables. Ecology 84:30783089 tions and advances. NeuroImage 23:
23. Efron B, Tibshirani RJ (1986) Bootstrap meth- S250S263
ods for standard errors, confidence intervals, 40. McIntosh AR, Chau W, Protzner A (2004)
and other measures of statistical accuracy. Stat Spatiotemporal analysis of event-related fMRI
Sci 1:5477 data using partial least squares. NeuroImage
24. Efron B, Tibshirani RJ (1993) An introduction 23:764775
to the bootstrap. Chapman & Hall, New York 41. McIntosh AR, Bookstein F, Haxby J, Grady C
25. Escofier B, Page`s J (1990) Multiple factor anal- (1996) Spatial pattern analysis of functional
ysis. Comput Stat Data Anal 18:120140 brain images using partial least squares. Neuro-
26. Esposito-Vinzi V, Chin WW, Henseler J, Wang Image 3:143157
H (eds) (2010) Handbook of partial least 42. McIntosh AR, Nyberg L, Bookstein FL, Tul-
squares: concepts, methods and applications. ving E (1997) Differential functional connec-
Springer, New York. tivity of prefrontal and medial temporal
27. Gidskehaug L, Stdkilde-Jrgensen H, cortices during episodic memory retrieval.
Martens M, Martens H (2004) Bridge-PLS Hum Brain Mapp 5:323327
regression: two-block bilinear regression with- 43. Mevik B-H, Wehrens R (2007) The PLS pack-
out deflation. J Chemometr 18:208215 age: principal component and partial least
23 Partial Least Squares Methods: Partial Least Squares. . . 579
squares regression in R. J Stat Software 53. van den Wollenberg A (1977) Redundancy
18:124 analysis: an alternative to canonical correlation.
44. Rao C (1964) The use and interpretation of Psychometrika 42:207219
principal component analysis in applied 54. Williams LJ, Abdi H, French R, Orange JB
research. Sankhya 26:329359 (2010) A tutorial on Multi-Block Discriminant
45. Stone M, Brooks RJ (1990) Continuum Correspondence Analysis (MUDICA): a new
regression: cross-validated sequentially con- method for analyzing discourse data from clin-
structed prediction embracing ordinary least ical populations. J Speech Lang Hear Res
squares, partial least squares and principal com- 53:13721393
ponents regression. J Roy Stat Soc B 55. Wold H (1966) Estimation of principal com-
52:237269 ponent and related methods by iterative least
46. Streissguth A, Bookstein F, Sampson P, Barr H squares. In: Krishnaiah PR (ed) Multivariate
(1993) Methods of latent variable modeling by analysis. Academic Press, New York
partial least squares. In: The enduring effects of 56. Wold H (1973) Nonlinear Iterative Partial
prenatal alcohol exposure on child develop- Least Squares (NIPALS) modeling: some cur-
ment. University of Michigan Press rent developments. In: Krishnaiah PR (ed)
47. Takane Y (2002) Relationships among various Multivariate analysis. Academic Press, New
kinds of eigenvalue and singular value decom- York
positions. In: Yanai H, Okada A, Shigemasu K, 57. Wold H (1982) Soft modelling, the basic
Kano Y, Meulman J (eds) New developments design and some extensions. In: Wold H, Jor-
in psychometrics. Springer, Tokyo eskog K-G (eds) Systems under indirect obser-
48. Tenenhaus M (1998) La regression PLS. Tech- vation: causality-structure-prediction, Part II.
nip, Paris North-Holland, Amsterdam
49. Tenenhaus M, Tenenhaus A (in press) Regular- 58. Wold S (1995) PLS for multivariate linear
ized generalized canonical correlation analysis. modelling. In: van de Waterbeenl H (ed)
Psychometrika QSAR: chemometric methods in molecular
50. Thioulouse J, Simier M, Chessel D (2003) design, methods and principles in medicinal
Simultaneous analysis of a sequence of paired chemistry, vol 2. Verla Chemie, Weinheim
ecological tables. Ecology 20:21972208 Germany
51. Tucker L (1958) An inter-battery method of 59. Wold S, Sjostrom M, Eriksson L (2001)
PLS-regression: a basic tool of chemometrics.
factor analysis. Psychometrika 23:111136
Chemometr Intell Lab Syst 58:109130
52. Tyler DE (1982) On the optimality of the
simultaneous redundancy transformations.
Psychometrika 47:7786
Chapter 24
Maximum Likelihood
Shuying Yang and Daniela De Angelis
Abstract
The maximum likelihood method is a popular statistical inferential procedure widely used in many areas to
obtain the estimates of the unknown parameters of a population of interest. This chapter gives a brief
description of the important concepts underlying the maximum likelihood method, the definition of the
key components, the basic theory of the method, and the properties of the resulting estimates. Confidence
intervals and the likelihood ratio test are also introduced. Finally, a few examples of applications are given to
illustrate how to derive maximum likelihood estimates in practice. A list of references to relevant papers and
software for a further understanding of the method and its implementation is provided.
Key words: Likelihood, Maximum likelihood estimation, Censored data, Confidence interval,
Likelihood ratio test, Logistic regression, Linear regression, Dose response
1. Introduction
The maximum likelihood method is, like the least squares method,
a statistical inferential technique to obtain estimates of the
unknown parameters of a population using the information from
an observed sample. It was primarily introduced by RA Fisher
between 1912 and 1920, though the idea has been traced back to
the late nineteenth century (1, 2).
The principle of the maximum likelihood method is to find the value of the population parameter, the maximum likelihood estimate (MLE), that maximizes the probability of observing the given data. The maximum likelihood method, by motivation, is different from the least squares method, but the MLEs coincide with the least squares estimates (LSEs) under certain assumptions, e.g., that residual errors follow a normal distribution.
While the maximum likelihood theory has as its basis the point
estimate of unknown parameters in a population described by a
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_24, # Springer Science+Business Media, LLC 2013
2. Important Concepts
2.1. Likelihood and Log-Likelihood Function
Suppose we have a sample $y = (y_1, \ldots, y_n)$ where each $y_i$ is independently drawn from a population characterized by a distribution $f(y; \theta)$. Here $f(y; \theta)$ denotes the probability density function (PDF) (for continuous $y$) or the probability mass function (for discrete $y$) of the population, and $\theta$ are unknown parameters. Depending on the distribution, $\theta$ can be a single scalar parameter or a parameter vector.
2.1.1. Likelihood Function
If $\theta$ were specified, the probability of observing $y_i$, given the population parameter $\theta$, can be written as $f(y_i; \theta)$, which is the probability density function or the probability function evaluated at $y_i$. Then the joint probability of observing $(y_1, \ldots, y_n)$ is $\prod_i f(y_i; \theta)$. This is the likelihood function. Throughout this chapter, we use interchangeably the notation $L(\theta)$ and $L(\theta; y)$ to describe the likelihood function, where $y = (y_1, \ldots, y_n)$. In practice, $\theta$ is unknown and it is our
Fig. 1. The likelihood (left) and log-likelihood (right) functions based on $n = 1{,}000$ samples randomly selected from a standard normal distribution [mu denotes the mean, sigma2 indicates $\sigma^2$].
$p^{k}(1-p)^{n-k}$
and the log-likelihood is:
$LL(\theta; y) = k \ln p + (n-k)\ln(1-p)$. In Fig. 2, the top panel shows the likelihood and log-likelihood function of this example for $n = 10$ and $k = 2$.
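The two functions for this example can be evaluated numerically. The following is a minimal sketch, assuming a simple grid search over $p$ (the function names and the grid step of 0.001 are illustrative choices, not part of the chapter); it locates the value of $p$ at which the log-likelihood for $n = 10$ and $k = 2$ is largest.

```python
import math

def likelihood(p, n, k):
    """Likelihood p^k (1 - p)^(n - k) of k events among n Bernoulli trials."""
    return p**k * (1.0 - p) ** (n - k)

def log_likelihood(p, n, k):
    """Log-likelihood k ln(p) + (n - k) ln(1 - p), defined for 0 < p < 1."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

n, k = 10, 2
# Grid over the open interval (0, 1); the step of 0.001 is an arbitrary choice.
grid = [i / 1000 for i in range(1, 1000)]
# The log is monotone, so maximizing LL also maximizes L.
p_hat = max(grid, key=lambda p: log_likelihood(p, n, k))

print(f"p_hat     = {p_hat:.3f}")
print(f"L(p_hat)  = {likelihood(p_hat, n, k):.5f}")
print(f"LL(p_hat) = {log_likelihood(p_hat, n, k):.3f}")
```

Working on the log scale is usually preferred in practice because products of many small probabilities underflow, whereas sums of logarithms do not.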
2.1.3. Likelihood Function of Censored Data
There are cases where a subset $y_{k+1}, \ldots, y_n$ of data $y_1, \ldots, y_n$ may not be precisely observed, but the values are known to be either below or above a certain threshold. For example, many laboratory based measurements are censored due to the assay accuracy limit, usually referred to as the lower limit of quantification (LLQ). This happens when the bioanalysis system cannot accurately distinguish the level of the component of interest from the system noise. For such cases, the exact value for $y_i$, $i = k+1, \ldots, n$, is not available. However, it is known that the value is equal to or below the LLQ. Such data are referred to as left censored data.
In other cases, the data are ascertained to be above a certain threshold, with no specific value assigned. For example, in animal experiments, the animals are examined every day to monitor the appearance of particular features, e.g., skin lesions. The time to the appearance of lesions is then recorded and analyzed. For animals with no skin lesions by the end of the study (2 weeks for example), the time to lesions will be recorded as > 2 weeks. These data are referred to as right censored data. As it is only known that an animal has no lesion at 2 weeks after the treatment, whether the animal will have and when it will have skin lesions is not known.
Fig. 2. Likelihood and log-likelihood functions with respect to $p$ (top panels, x-axis: probability $p$) and $\alpha$ (bottom panels, x-axis: logit $p$). Note: the red solid square points mark the maximum of $L(p)$ and $LL(p)$; the text above the x-axis indicates the MLE of $p$ (0.2) or $\alpha$ (logit of $p$) (-1.386).
However, suppose examinations were not carried out between days 7 and 11, and at day 11 an animal was found to have lesions. Then the time to lesions will be between 7 and 11 days, although the exact time of lesion appearance is not known. In this case, the time to lesion for this animal will be interval censored, i.e., it is longer than 7 days but shorter than 11 days.
When such cases arise in practice, ignoring the characteristics of the data in the analysis may cause biases (see ref. 11 and the references cited therein), so appropriate adjustments must be applied.
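One common form of such an adjustment can be sketched as follows. This is an illustration under assumptions that are not taken from the chapter: the measurements are assumed to be normally distributed, the data values and the LLQ of 1.0 are made up, and a coarse grid search stands in for a proper optimizer. Each quantified value contributes its log density, while each left-censored value contributes the log probability of lying at or below the LLQ.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log density of N(mu, sigma^2) at y."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (y - mu) ** 2 / (2.0 * sigma**2)

def normal_cdf(y, mu, sigma):
    """P(Y <= y) for Y ~ N(mu, sigma^2); erfc keeps small tail values above zero."""
    return 0.5 * math.erfc(-(y - mu) / (sigma * math.sqrt(2.0)))

def censored_loglik(observed, n_censored, llq, mu, sigma):
    """Log-likelihood with n_censored values known only to be <= llq."""
    ll = sum(normal_logpdf(y, mu, sigma) for y in observed)
    ll += n_censored * math.log(normal_cdf(llq, mu, sigma))
    return ll

# Hypothetical data: five quantified values and three values below an LLQ of 1.0.
observed = [1.2, 1.9, 2.4, 3.1, 1.5]
llq, n_censored = 1.0, 3

# Coarse grid search (step 0.02) over mu in [0, 3] and sigma in [0.2, 3].
candidates = ((m / 50, s / 50) for m in range(0, 151) for s in range(10, 151))
mu_hat, sigma_hat = max(
    candidates, key=lambda t: censored_loglik(observed, n_censored, llq, t[0], t[1])
)

naive_mean = sum(observed) / len(observed)  # ignores the censored values
print(f"naive mean = {naive_mean:.2f}, censoring-aware mu_hat = {mu_hat:.2f}")
```

In this made-up example the naive mean of the quantified values sits above the censoring-aware estimate, which is the kind of bias the text warns about.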
Let $y_1, \ldots, y_k$ represent the observed data, and $y_{k+1}, \ldots, y_n$ those not precisely observed but known to be left censored (assumed to lie within the interval $(-\infty, LLQ]$, or $[0, LLQ]$ for laboratory measurements that must be greater than or equal to 0). Assume that
2.3. Properties of MLE
The MLE has several important properties making the maximum likelihood approach an attractive method for statistical inference. The MLEs are, for example, asymptotically normal, consistent, efficient, and parameterization invariant (13). What follows is focused on the properties most used in practice.
2.3.1. Asymptotically MLE Follows Normal or Multinormal Distribution
When $n$ (the number of observations) tends to infinity, the MLE of $\theta$, $\hat{\theta}_{MLE}$, follows a normal distribution with mean the true parameter $\theta$, and variance (or variance-covariance matrix if $\theta$ is a vector) the inverse of the expectation of the second derivative of $LL(\theta)$ with
2.3.2. Parameterization Invariance
This property states that if $\hat{\theta}_{MLE}$ is the MLE for parameter $\theta$, then the MLE of any function of $\theta$, say, $g(\theta)$, is $g(\hat{\theta}_{MLE})$, that is, the value of the function $g$ evaluated at $\theta = \hat{\theta}_{MLE}$.
This parameterization invariance property of MLEs allows flexibility in choosing model parameterizations, which is important in cases where parameter transformation can make the maximization step simpler and easier. Example 3.2 illustrates how parameters can be transformed in practice in order to obtain appropriate estimates of the unknown parameters of interest.
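The invariance property can be checked numerically on the Bernoulli example ($n = 10$, $k = 2$). The sketch below is an illustration only: it maximizes the log-likelihood directly in $\alpha = \mathrm{logit}(p)$ with a Newton-Raphson iteration (the function names, starting value, and tolerance are assumptions) and compares the result with the logit of the closed-form estimate $k/n$.

```python
import math

def expit(alpha):
    """Inverse logit: p = e^alpha / (1 + e^alpha)."""
    return 1.0 / (1.0 + math.exp(-alpha))

def maximize_in_alpha(n, k, alpha=0.0, tol=1e-12, max_iter=100):
    """Newton-Raphson on LL(alpha) = k*alpha - n*ln(1 + e^alpha).

    The first derivative is k - n*p and the second is -n*p*(1 - p),
    where p = expit(alpha).
    """
    for _ in range(max_iter):
        p = expit(alpha)
        step = (k - n * p) / (n * p * (1.0 - p))
        alpha += step
        if abs(step) < tol:
            break
    return alpha

n, k = 10, 2
p_hat = k / n                                    # MLE on the probability scale
alpha_hat = maximize_in_alpha(n, k)              # MLE found on the logit scale
alpha_from_p = math.log(p_hat / (1.0 - p_hat))   # g(p_hat) with g = logit

print(f"alpha_hat = {alpha_hat:.3f}, logit(p_hat) = {alpha_from_p:.3f}")
print(f"expit(alpha_hat) = {expit(alpha_hat):.3f}")
```

Both routes give the same estimate, so the parameterization can be chosen for numerical convenience, for instance to keep $p$ inside $(0, 1)$ without constraints.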
2.5. Likelihood Ratio Test
Often several different models can be found to describe a population from which the data are selected. The problem is then to identify which model is more appropriate to explain the data and represent the population. It is possible to use likelihood theory to test the appropriateness of a given model in comparison to a model from the same family but with a different number of parameters (nested models). For example, assume $L_A$ is the maximum of the
2.6. Further Reading and Statistical Software
Readers who are interested in the general theory and application of the maximum likelihood method are referred to refs. 13–18.
A number of statistical packages are available for obtaining maximum likelihood estimates of model parameters in both linear and nonlinear modelling. Packages commonly used in academia and the pharmaceutical industry include SAS (19), Stata (20), and R (21). In addition, a few other packages are dedicated to the analysis of nonlinear mixed effects models, like NONMEM (10) and Monolix (22). R and Monolix are freely downloadable online.
3. Examples
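3.1. Maximum Likelihood Estimation of Exponential Distribution
Suppose the random variable $y$ represents the time to an event, e.g., a certain toxicity from an experiment, and it is assumed that $y$ follows an exponential distribution with density function $f(y; \theta) = \frac{1}{\theta} e^{-y/\theta}$, where $y > 0$.
Let $y_1, \ldots, y_n$ be a random sample from this exponential distribution; then the likelihood and log-likelihood functions are given by:
$L(\theta) = \frac{1}{\theta} e^{-y_1/\theta} \cdot \frac{1}{\theta} e^{-y_2/\theta} \cdots \frac{1}{\theta} e^{-y_n/\theta} = \frac{1}{\theta^{n}} e^{-\sum_i y_i/\theta}$, and $LL(\theta) = -n \ln\theta - \frac{1}{\theta}\sum_i y_i$.
The solution of the equation $\frac{d\,LL(\theta)}{d\theta} = -\frac{n}{\theta} + \frac{\sum_i y_i}{\theta^{2}} = 0$ is $\hat{\theta} = \frac{1}{n}\sum_i y_i = \bar{y}$. Since $\frac{d^{2} LL(\theta)}{d\theta^{2}} = \frac{n}{\theta^{2}} - \frac{2\sum_i y_i}{\theta^{3}}$ equals $-n/\bar{y}^{2} < 0$ at $\theta = \bar{y}$, the log-likelihood is maximized there and $\hat{\theta}_{MLE} = \frac{1}{n}\sum_i y_i = \bar{y}$.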
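Note: $\frac{d\,LL(\theta)}{d\theta}$ represents the derivative of $LL(\theta)$ with respect to $\theta$, and $\frac{d^{2} LL(\theta)}{d\theta^{2}}$ is the derivative of $\frac{d\,LL(\theta)}{d\theta}$ with respect to the parameter $\theta$, i.e., the second order derivative of $LL(\theta)$ with respect to $\theta$.
An alternative way of proving that $LL(\theta)$ has its maximum at $\bar{y}$ (the sample mean) is through the analysis of the behavior of $\frac{d\,LL(\theta)}{d\theta}$: rewrite $\frac{d\,LL(\theta)}{d\theta}$ as
$\frac{d\,LL(\theta)}{d\theta} = -\frac{n}{\theta} + \frac{n\bar{y}}{\theta^{2}} = \frac{n\bar{y} - n\theta}{\theta^{2}} = \frac{n(\bar{y} - \theta)}{\theta^{2}}$.
So if $\theta < \bar{y}$, then $\frac{d\,LL(\theta)}{d\theta} > 0$, indicating $LL(\theta)$ is increasing, and if $\theta > \bar{y}$, then $\frac{d\,LL(\theta)}{d\theta} < 0$, indicating $LL(\theta)$ is decreasing; therefore $LL(\theta)$ has its maximum at $\bar{y}$.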
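The sign argument above can be checked on simulated data. The sketch below is illustrative only: the true mean of 5.0, the sample size of 200, and the random seed are arbitrary assumptions. It draws exponential event times, computes the sample mean, and evaluates the derivative of the log-likelihood on either side of it.

```python
import math
import random

def exp_loglik(theta, ys):
    """LL(theta) = -n ln(theta) - sum(y_i) / theta for exponential data."""
    return -len(ys) * math.log(theta) - sum(ys) / theta

def exp_score(theta, ys):
    """dLL/dtheta written as n (ybar - theta) / theta^2."""
    n = len(ys)
    ybar = sum(ys) / n
    return n * (ybar - theta) / theta**2

random.seed(1)                       # arbitrary seed, for reproducibility only
true_theta = 5.0                     # hypothetical mean time to event
# random.expovariate takes the rate, i.e., 1 / mean.
ys = [random.expovariate(1.0 / true_theta) for _ in range(200)]

theta_hat = sum(ys) / len(ys)        # closed-form MLE: the sample mean

print(f"theta_hat = {theta_hat:.3f}")
print(f"score below theta_hat: {exp_score(0.9 * theta_hat, ys):+.3f}")
print(f"score above theta_hat: {exp_score(1.1 * theta_hat, ys):+.3f}")
```

A positive derivative below the sample mean and a negative one above it is exactly the increasing-then-decreasing behavior described in the text.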
3.2. Probability of Toxicity
In an animal toxicity study, ten cynomolgus monkeys were administered 100 mg of compound X intravenously every week for 8 weeks. During the study, two out of the ten monkeys developed skin lesions. What is the probability of having skin lesions in the entire population?
Solution:
Let $y = 1$ indicate the event of having skin lesions, and $y = 0$ indicate no skin lesions, where $y$ follows a Bernoulli distribution with the probability of having skin lesions $p$.
The data obtained from the monkey study were: $(y_1, \ldots, y_{10}) = (0,0,0,0,1,0,0,0,0,1)$.
Then the likelihood and log-likelihood of these data can be written as:
$\frac{d^{2} LL}{dp^{2}} = -\frac{k}{p^{2}} - \frac{n-k}{(1-p)^{2}}$; substituting $\hat{p} = 0.2$ gives $\frac{d^{2} LL}{dp^{2}} < 0$.
Therefore, $\hat{p}$ is the maximum likelihood estimate of the population parameter $p$.
As illustrated in Subheading 2.4 above, the confidence interval of $p$ around $\hat{p}$ can be calculated using $I(\hat{p}) = -\frac{d^{2} LL(\hat{p})}{dp^{2}} = \frac{n^{3}}{k(n-k)} = 62.5$; the standard error of $\hat{p}$ is 0.126, therefore, the 90%
Using this transformation, assume $\alpha = \mathrm{logit}(p)$, then $p = \frac{e^{\alpha}}{1+e^{\alpha}}$.
0.4776).
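The numbers quoted in this example can be reproduced with a few lines of arithmetic. The sketch below is an illustration, assuming a Wald-type interval (estimate plus or minus a normal quantile times the standard error) and a 90% normal quantile of 1.645; both the function names and the quantile value are assumptions rather than details confirmed by the text. It computes the interval on the probability scale and on the logit scale, back-transforming the latter to $p$.

```python
import math

def wald_interval_p(n, k, z):
    """Information n^3 / (k (n - k)), standard error, and Wald interval for p."""
    p_hat = k / n
    info = n**3 / (k * (n - k))          # equals -d2LL/dp2 evaluated at p_hat
    se = 1.0 / math.sqrt(info)
    return info, se, (p_hat - z * se, p_hat + z * se)

def wald_interval_logit(n, k, z):
    """Wald interval for alpha = logit(p), back-transformed to the p scale."""
    p_hat = k / n
    alpha_hat = math.log(p_hat / (1.0 - p_hat))
    se_alpha = 1.0 / math.sqrt(n * p_hat * (1.0 - p_hat))
    expit = lambda a: 1.0 / (1.0 + math.exp(-a))
    return expit(alpha_hat - z * se_alpha), expit(alpha_hat + z * se_alpha)

n, k = 10, 2
z90 = 1.645                               # assumed two-sided 90% normal quantile

info, se, ci_p = wald_interval_p(n, k, z90)
ci_logit = wald_interval_logit(n, k, z90)

print(f"I(p_hat) = {info}, SE = {se:.3f}")
print(f"90% CI on the p scale:     ({ci_p[0]:.4f}, {ci_p[1]:.4f})")
print(f"90% CI via the logit scale: ({ci_logit[0]:.4f}, {ci_logit[1]:.4f})")
```

With so few animals the interval computed directly on $p$ extends below zero, whereas the interval built on the logit scale stays inside $(0, 1)$ after back-transformation. The upper limit of 0.4776 printed in the text appears consistent with a quantile rounded to 1.64.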
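3.3. Linear Regression
Assume that $y_i = \beta_0 + \beta x_i + \varepsilon_i$, where $i = 1, 2, \ldots, n$, the $y_i$'s are independently drawn from a population, the $x_i$'s are independent variables, and $\varepsilon_i \sim N(0, \sigma^{2})$ is the residual error.
The likelihood of $y = (y_1, \ldots, y_n)$ is: $L(\theta) = \prod_i \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{-\frac{1}{2\sigma^{2}}(y_i - \beta_0 - \beta x_i)^{2}}$, where $\theta = (\beta_0, \beta, \sigma^{2})$,
$LL(\theta) = -\frac{n}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}\sum_i (y_i - \beta_0 - \beta x_i)^{2}$
or
$-2LL(\theta) = n(\ln 2\pi + \ln\sigma^{2}) + \frac{1}{\sigma^{2}}\sum_i (y_i - \beta_0 - \beta x_i)^{2}$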
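Setting the first order partial derivatives of $-2LL(\theta)$ to zero gives:
$\frac{\partial(-2LL(\theta))}{\partial\beta_0} = -\frac{2}{\sigma^{2}}\sum_i (y_i - \beta_0 - \beta x_i) = 0$ (1)
$\frac{\partial(-2LL(\theta))}{\partial\beta} = -\frac{2}{\sigma^{2}}\sum_i x_i (y_i - \beta_0 - \beta x_i) = 0$ (2)
$\frac{\partial(-2LL(\theta))}{\partial\sigma^{2}} = \frac{n}{\sigma^{2}} - \frac{1}{\sigma^{4}}\sum_i (y_i - \beta_0 - \beta x_i)^{2} = 0$ (3)
where $\frac{\partial(-2LL(\theta))}{\partial\beta_0}$, $\frac{\partial(-2LL(\theta))}{\partial\beta}$ and $\frac{\partial(-2LL(\theta))}{\partial\sigma^{2}}$ represent the first order partial derivatives of $-2LL(\theta)$ with respect to $\beta_0$, $\beta$ and $\sigma^{2}$, respectively.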
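Solving the above three equations, we have:
$\hat{\beta}_{MLE} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^{2} - \left(\sum_i x_i\right)^{2}}$, $\hat{\beta}_{0,MLE} = \frac{1}{n}\sum_i (y_i - \hat{\beta} x_i)$ and $\hat{\sigma}^{2}_{MLE} = \frac{\sum_i (y_i - \hat{\beta}_0 - \hat{\beta} x_i)^{2}}{n}$.
It is noted that the MLEs of $\beta$, $\beta_0$ are equivalent to their corresponding least squares estimates.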
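The closed-form estimates can be computed directly from the sums that appear in them. The following is a minimal sketch on a made-up data set of five points (the values and the function name are illustrative assumptions); note that the variance estimate divides by $n$, as the maximum likelihood solution does, rather than by $n - 2$.

```python
def linear_mle(xs, ys):
    """Closed-form MLEs (beta0, beta, sigma2) for y = beta0 + beta*x + N(0, sigma2) error."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    beta = (n * sxy - sx * sy) / (n * sxx - sx**2)
    beta0 = (sy - beta * sx) / n
    # The MLE of the variance divides the residual sum of squares by n.
    sigma2 = sum((y - beta0 - beta * x) ** 2 for x, y in zip(xs, ys)) / n
    return beta0, beta, sigma2

# Hypothetical data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

beta0_hat, beta_hat, sigma2_hat = linear_mle(xs, ys)
residuals = [y - beta0_hat - beta_hat * x for x, y in zip(xs, ys)]

print(f"beta0 = {beta0_hat:.3f}, beta = {beta_hat:.3f}, sigma2 = {sigma2_hat:.4f}")
```

At the solution the residuals sum to zero and are uncorrelated with $x$, which is another way of stating that the first two score equations hold.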
3.4. Dose Response Model
In early drug development, compound X was tested in monkeys to assess its toxicological effects. Three doses (10, 100, 300 mg) of compound X and placebo were given to 40 monkeys, 10 monkeys in each dose group, every week for 8 weeks. During the 8 weeks of study, the appearance of skin lesions was observed and recorded. The question is whether the probability of skin lesions is associated with the dosage given. The number of monkeys with skin lesions in each dose group was: 0, 1, 5, 9 for the placebo, 10 mg, 100 mg and 300 mg groups, respectively.
Solutions:
Let $y = 1$ indicate the presence of skin lesion, and $y = 0$ indicate no skin lesion. The question can then be rephrased by asking whether
> summary(res.logit)
Call:
glm(formula = y ~ log(dose + 1), family = "binomial", data = d)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.95101 -0.37867 -0.07993 0.56824 2.31126
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.7448 2.0339 -2.825 0.00473 **
log(dose + 1) 1.3118 0.4324 3.034 0.00242 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
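The coefficient table above can be approximated without a statistics package. The sketch below is not the authors' code: it is a hand-written Newton-Raphson fit of $\mathrm{logit}(p) = \alpha + \beta \ln(D + 1)$ to the grouped counts given in the example (0, 1, 5 and 9 lesions out of 10 at doses 0, 10, 100 and 300 mg), with the function name, starting values, and iteration count chosen for illustration. Standard errors are taken from the inverse of the information matrix at the final iteration.

```python
import math

def fit_grouped_logistic(doses, events, group_size, iterations=25):
    """Newton-Raphson for logit(p) = a + b * ln(dose + 1) with grouped binomial counts."""
    xs = [math.log(d + 1.0) for d in doses]
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for x, k in zip(xs, events):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = group_size * p * (1.0 - p)
            g0 += k - group_size * p            # score for the intercept
            g1 += x * (k - group_size * p)      # score for the slope
            i00 += w                            # information matrix entries
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        # Newton step: parameters += inverse(information) * score
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    # Standard errors from the inverse information of the last iteration.
    return a, b, math.sqrt(i11 / det), math.sqrt(i00 / det)

a_hat, b_hat, se_a, se_b = fit_grouped_logistic([0, 10, 100, 300], [0, 1, 5, 9], 10)
p_placebo = 1.0 / (1.0 + math.exp(-a_hat))

print(f"intercept = {a_hat:.4f} (SE {se_a:.4f})")
print(f"slope     = {b_hat:.4f} (SE {se_b:.4f})")
print(f"predicted probability at D = 0: {p_placebo:.4f}")
```

Grouped counts and the 40 individual 0/1 records give the same likelihood up to a constant, so the estimates and standard errors should agree with the table to the printed precision.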
Fig. 5. Observed and model predicted probability of skin lesion (black dots are the observed proportion of monkeys having skin lesion for the corresponding dose group; solid line is the model predicted probability of skin lesion; dashed lines are the 95% confidence interval of the probability ($p$); Dose = 1 represents placebo). [Legend: Observed, Model Prediction, 95% Confidence Interval.]
$p = \frac{\exp(\alpha + \beta \ln(D+1))}{1 + \exp(\alpha + \beta \ln(D+1))}$. For example, when no drug is given, i.e., $D = 0$, the probability of having skin lesions is $\frac{e^{\hat{\alpha}}}{1+e^{\hat{\alpha}}} = 0.003$ and
References
1. Hald A (1999) On the history of maximum likelihood in relation to inverse probability and least squares. Statist Sci 14(2):214–222
2. Aldrich J (1997) R.A. Fisher and the making of maximum likelihood 1912–1922. Statist Sci 12(3):162–176
3. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Second international symposium on information theory. Akademiai Kiado, Budapest, pp 267–281
4. Schwarz G (1978) Estimating the dimension of a model. Ann Statist 6:461–464
5. McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman and Hall, New York
6. Cox DR (1970) The analysis of binary data. Chapman and Hall, London
7. Cox DR (1972) Regression models and life tables. J Roy Statist Soc 34:187–220
8. Lindsey JK (2001) Nonlinear models in medical statistics. Oxford University Press, Oxford, UK
9. Wu L (2010) Mixed effects models for complex data. Chapman and Hall, London
10. Beal SL, Sheiner LB, Boeckmann AJ (eds) (1989–2009) NONMEM users guides. Icon Development Solutions, Ellicott City
11. Yang S, Roger J (2010) Evaluations of Bayesian and maximum likelihood methods in PK models with below-quantification-limit data. Pharm Stat 9(4):313–330
12. Fletcher R (1987) Practical methods of optimization, 2nd edn. Wiley, New York
13. Young GA, Smith RL (2005) Essentials of statistical inference, chapter 8. Cambridge University Press, Cambridge, UK
14. Bickel PJ, Doksum KA (1977) Mathematical statistics. Holden-Day, Inc., Oakland, CA
15. Casella G, Berger RL (2002) Statistical inference, 2nd edn. Duxbury, Pacific Grove, CA
16. DeGroot MH, Schervish MJ (2002) Probability and statistics, 3rd edn. Addison-Wesley, Boston, MA
17. Spanos A (1999) Probability theory and statistical inference. Cambridge University Press, Cambridge, UK
18. Pawitan Y (2001) In all likelihood: statistical modelling and inference using likelihood. Cambridge University Press, Cambridge, UK
19. SAS Institute Inc. (2009) SAS manuals. http://support.sas.com/documentation/index.html
20. STATA Data analysis and statistical software. http://www.stata.com/
21. The R project for statistical computing. http://www.r-project.org/
22. The Monolix software. http://www.monolix.org/
Chapter 25
Bayesian Inference
Frederic Y. Bois
Abstract
This chapter provides an overview of the Bayesian approach to data analysis, modeling, and statistical
decision making. The topics covered go from basic concepts and definitions (random variables, Bayes' rule,
prior distributions) to various models of general use in biology (hierarchical models, in particular) and ways
to calibrate and use them (MCMC methods, model checking, inference, and decision). The second half of
this Bayesian primer develops an example of model setup, calibration, and inference for a physiologically
based analysis of 1,3-butadiene toxicokinetics in humans.
Key words: Bayes' rule, Bayesian statistics, Posterior distribution, Prior distribution, Markov chain
Monte Carlo simulations
1. Introduction
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5_25, # Springer Science+Business Media, LLC 2013
2. Important Concepts
2.1. Random Variables Everywhere
A fundamental tenet of the Bayesian approach is that probabilities are formal representations of (i.e., tools to quantify) degrees of belief. First, the model posited to generate the data is probabilistic and therefore gives the probability of occurrence of any particular data value. For example, if we assume that the data values are normally distributed with mean $\mu$ and SD $\sigma$, we would bet that data values exceeding $\mu + 3\sigma$ will be quite rare. Actually, any model output can be considered a random variable. Second, the model parameters, if their values are not precisely known to the analyst, are also described by probability distributions and are treated as random variables. This can also apply to the model's boundary conditions, e.g., the initial concentrations in a reactor. Following our simple
Fig. 1. A simple graphical model. The data $Y$ are supposed to be distributed normally around mean $\mu$ with SD $\sigma$. The conditional dependence of $Y$ on $\mu$ and $\sigma$ is indicated by arrows.
2.2. Graphical Models
A statistical model can be usefully represented by a graph and more precisely by a directed acyclic graph (DAG) (8, 9). In such graphs (see Fig. 1) the vertices (nodes) represent model-related random variables (inputs or design parameters, observable quantities, etc.) and directed edges (arrows) connect one vertex to another to indicate a stochastic dependence between them. Figure 1 represents graphically the above Gaussian example: the data values $Y$ are supposed to be distributed around mean $\mu$ and SD $\sigma$. The dependence of $Y$ on $\mu$ and $\sigma$ is indicated by arrows. To be coherent these graphs need to be acyclic, which simply means that there is no way to start at a vertex and follow a sequence of edges that eventually loops back to the start. Besides the clarity of a graphical view, they also offer simple algorithmic ways to automatically decompose a complex model into smaller, more tractable, subunits. Such simplifications can be achieved through rules to determine conditional independence between nodes, as explained below.
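The acyclicity requirement is easy to check mechanically. The sketch below is an illustration only: it encodes the graph of Fig. 1 as a dictionary mapping each node to its parents (the node names are assumptions) and uses the `graphlib` module of the Python standard library (available from Python 3.9) to test for cycles and to list the nodes in an order in which parents always precede their children.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a node; its value is the set of parent nodes it depends on.
gaussian_model = {"Y": {"mu", "sigma"}, "mu": set(), "sigma": set()}

def sampling_order(parents):
    """Return the nodes with parents before children, or None if there is a cycle."""
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError:
        return None

order = sampling_order(gaussian_model)
print("valid DAG, one admissible order:", order)

# A deliberately incoherent graph: A depends on B and B depends on A.
cyclic = {"A": {"B"}, "B": {"A"}}
print("cyclic graph detected:", sampling_order(cyclic) is None)
```

An order of this kind is also what a simulation would use, drawing each parent before the variables that depend on it.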
2.3. Multilevel (Hierarchical) Models
A very useful and common extension of the simple model of Fig. 1 is to build a hierarchy of dependencies (10). Take the example of inter- and intra-individual variability. Interindividual variability means that individuals are different in their characteristics or more precisely in the measure of their characteristics. For example, body mass differs between subjects. This difference can be modeled deterministically (mass increases nonlinearly with age, etc.), but such modeling usually leaves out some unexplained variability. People of the same age, sex, ethnicity, birth cohort, etc. still have different body masses. Faced with such residual uncertainty (or very early on if we have no mechanistic clue) we can resort to probabilistic
2.4. Mixture Models
Mixture models can be seen as a special class of hierarchical model, in which the intermediate nodes are discrete indicator variables (sort of labels). For example, assume that we are given measurements of the body mass of a group of male and female rats. We just know that animals of the two sexes were used, but we do not know the sex of any particular animal. We just have one body mass for it. We also know that males and females have different average body masses, and we are in fact also interested in recovering (guessing) the sex of these animals. This is in fact a particular case of a very general classification problem, and we could easily imagine more than two classes. The corresponding model can be viewed hierarchically with a set of indicator variables, $\delta$, above the individual data and conditioning the values of the individual parameters. The variables $\delta$ are themselves given a probabilistic specification, usually in the guise of a binomial model if there are two classes or multinomial beyond that. Further details on this class of models are given for example in (5).
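The role of the indicator variable can be illustrated with a two-class normal mixture. The sketch below makes strong simplifying assumptions that are not from the chapter: the class means and SDs (450 g and 40 g for males, 300 g and 30 g for females) and the 50% mixing weight are invented and treated as known, whereas a full analysis would estimate them together with the indicators. Given those values, the probability that an animal is male follows from its body mass alone.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-((y - mu) ** 2) / (2.0 * sigma**2)) / (sigma * math.sqrt(2.0 * math.pi))

def prob_male(mass, w_male=0.5, male=(450.0, 40.0), female=(300.0, 30.0)):
    """Probability that the indicator equals 'male' given one body mass (g).

    All default values are hypothetical and serve only as an illustration.
    """
    num = w_male * normal_pdf(mass, *male)
    den = num + (1.0 - w_male) * normal_pdf(mass, *female)
    return num / den

for mass in (300.0, 370.0, 450.0):
    print(f"mass = {mass:.0f} g -> P(male | mass) = {prob_male(mass):.3f}")
```

Masses near one class mean are classified almost with certainty, while intermediate masses leave the sex genuinely uncertain, which is why the indicators are treated as random variables rather than fixed labels.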
2.5. Nonparametric Models
The flexibility of hierarchical and mixture modeling is often sufficient to
yield useful predictive models and data analyses. However, flexibility can be
further increased by calling upon infinite-dimensional models of functions
(paradoxically called nonparametric models). These nonparametric models can be
used to replace fixed functional forms, like the linear model, or even fixed
distributions, like the lognormal. Relaxing the assumption of a fixed
functional link leads to nonparametric regression models, for which Gaussian
processes, spline functions or mixtures, or Dirichlet processes have been well
investigated (13, 14). For example, the assumption of a simple unimodal
distribution of body masses within a population is probably incorrect if the
population is not homogeneous. This may not prevent posterior individual
estimates from forming clumps which can be identified, but estimation may be
unstable, convergence may be difficult to obtain, and in any case the
population variance will mean very little. A mixture of Dirichlet processes
can be used to put a flexible model on the distribution of the individual
parameters in the population. Similarly, with enough daily data, we may want
to model flexibly the uneven evolution of body mass during pregnancy rather
than resorting to the simplistic variability model of Fig. 2 (15).
2.6. Conditioning on the Data to Update Prior Knowledge: Bayes' Rule
The Bayesian approach also considers that the data, once they have been
observed (i.e., have become actual, as opposed to imagined), cease to be
random and have to be treated as fixed values. This is similar to the quantum
physics view of our world, in which particles are described as density
functions (waves) before they are observed, and collapse into actual particles
with precise characteristics and position only after observation (16). Thomas
Bayes' idea was then simply to apply the definition of conditional
probabilities to reverse the propagation of uncertainty.
By definition the conditional probability of an event A, given
event B, is as follows:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad (1)$$
25 Bayesian Inference 603
Since the data are considered fixed numerical values, [y] can be considered as
a normalization constant. It can be calculated exactly or numerically, or even
remain unspecified, as when using MCMC sampling, which can sample values of θ
from [θ|y] regardless of the value of [y].
The posterior parameter distribution summarizes what is known about θ after
collecting the data y, and the remaining uncertainty about it. It is obtained
by updating the prior [θ] using the data likelihood (Eq. 2), and this updating
is a simple multiplication. In the usual case where several parameters have to
be estimated, the posterior distribution is a joint (multivariate)
distribution, which can be quite complex (see Subheading 2.8).
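The "simple multiplication" and the role of the normalization constant can be illustrated on a grid. The sketch below is a minimal example under assumptions of my own choosing (a single parameter that is a probability, a flat prior, and invented data of 7 successes in 10 trials); it also shows that rescaling the likelihood by an arbitrary constant leaves the normalized posterior unchanged.

```python
def grid_posterior(prior, likelihood, grid):
    """Return posterior probabilities on a grid: prior x likelihood, renormalized."""
    unnorm = [prior(t) * likelihood(t) for t in grid]
    norm_const = sum(unnorm)          # plays the role of [y] on the grid
    return [u / norm_const for u in unnorm], norm_const

# Hypothetical example: theta is a probability, data are 7 successes in 10 trials.
grid = [(i + 0.5) / 200 for i in range(200)]

def flat_prior(t):
    return 1.0

def binom_like(t):
    return t ** 7 * (1.0 - t) ** 3

post, _ = grid_posterior(flat_prior, binom_like, grid)
# Rescaling the likelihood by any constant leaves the posterior unchanged.
post_scaled, _ = grid_posterior(flat_prior, lambda t: 1000.0 * binom_like(t), grid)

post_mean = sum(t * p for t, p in zip(grid, post))
print(f"posterior mean ~ {post_mean:.3f}")
```

Grid evaluation is only practical for one or two parameters; for joint posteriors in many dimensions, sampling methods such as MCMC take over.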
Posterior predictive probability. The probability distribution for new
(future) data when we have updated the parameters distribution
should also reflect that updating. Hopefully, having analyzed data
1
Note that if the data were exactly what we expected a priori, there would be not much need to improve the
model.
2.7. Quantifying Prior Knowledge
Informative priors. The existence of prior information about a model's
parameter values is quite common in biology. For example, we can define
reasonable bounds for almost any parameter having a natural meaning. That is
even more the case when the models are detailed and mechanistically based.
PBPK models (see the corresponding chapter in this volume) are good examples
of those. The scientific literature, nowadays often abstracted in databases,
gives us values or ranges for organ volumes, blood flows, etc. Such data can
be used directly to inform prior parameter distributions, without being
analyzed through the model. Such priors may also come from Bayesian analyses
using simpler models. In that case, we often proceed by analogy, assuming that
the prior information comes from individuals or species similar to those for
which we have the data to analyze. The fact that interindividual variability
is often present may cast doubt on the validity of such analogies, but
hierarchical models (see above) can be used to protect against unwarranted
similarity assumptions. Another way to obtain informative priors is to elicit
them from expert judgment (2, 4, 17). Various techniques can be used for such
an elicitation. They usually boil down to having field experts summarize their
knowledge about a parameter in the form of percentiles and then fitting a
distribution function to such data. The distribution functions used can be of
maximum entropy, reference, or conjugate forms, etc. (see below). In any
case, we should in general strive to use carefully chosen informative priors,
as that is an efficient use of the knowledge already painstakingly acquired.
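As an illustration of the elicitation step (percentiles in, distribution out), the sketch below fits a lognormal distribution to two percentiles supplied by an expert. The choice of a lognormal form, the function name, and the numerical values are assumptions made for this example only.

```python
import math
from statistics import NormalDist

def lognormal_from_percentiles(q_low, q_high, p_low=0.05, p_high=0.95):
    """Fit a lognormal to two elicited percentiles; return (geometric mean, geometric SD)."""
    z_low = NormalDist().inv_cdf(p_low)
    z_high = NormalDist().inv_cdf(p_high)
    sigma = (math.log(q_high) - math.log(q_low)) / (z_high - z_low)
    mu = math.log(q_low) - sigma * z_low
    return math.exp(mu), math.exp(sigma)

# Hypothetical elicitation: an expert places a blood flow between 4 and 9 L/min
# with 90% probability (5th and 95th percentiles).
gm, gsd = lognormal_from_percentiles(4.0, 9.0)
fitted = NormalDist(math.log(gm), math.log(gsd))   # distribution of the logarithm
print(f"geometric mean = {gm:.2f}, geometric SD = {gsd:.2f}")
```

With more than two elicited percentiles the fit becomes a small least-squares problem, and the quality of the fit is itself a check on the chosen distributional form.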
Caveat. The prior should not be constructed using the data entering the
likelihood (i.e., the data to be analyzed). This would be a double use of the
same information and clearly violates Bayes' rule.
Vague (noninformative) priors. In some cases (ad hoc parameters, symmetry of
the problem, very poor information a priori, overwhelming data which will
surely dominate the inference, etc.) a vague prior may be preferred. In that
case it is first assumed that parameters are a priori independent (i.e., what
we know about one has no bearing on what we know about another). Second, all
values of the parameter (or of its logarithm if it is a variance parameter)
are considered equiprobable. An example of symmetry reasoning is to use a
priori the same probability of occurrence for each of the six sides of an
ordinary die. A number of mathematical criteria have been proposed to derive
vague priors, such as Jeffreys, maximum entropy, hierarchical, or reference
priors (2, 4, 6, 18).
2.8. Getting at the Posterior
Analytical solutions. Simple problems, or problems for which conjugate or flat
priors can be used, may admit closed-form solutions. That is usually the case
with distributions from the exponential family (35). For example, for
binomially distributed data, and hence a binomial model, the parameter to
estimate is usually the sampling probability p. Imagine a device drawing
random series of 0 or 1. You assume that each time it draws either 1 with
fixed probability p, or 0 with probability 1 − p. The probability of drawing
x ones in a series of n draws is then:
$$[x \mid p, n] \propto p^{x} (1 - p)^{n - x}. \qquad (7)$$
The conjugate prior for parameter p can be shown to be the beta distribution
with hyper-parameters a and b (46):
ri 0 with i 0: (11)
f ui J u ui
2. If r ≥ 1, accept θ′; otherwise, accept it only with probability r (to that
effect it is enough to sample a uniform random number u between zero and one
and accept θ′ if u ≤ r).
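The acceptance rule in step 2 can be sketched in a few lines of Python. This is an illustrative random-walk sampler, not the algorithm of any particular software: it assumes a symmetric Gaussian jumping kernel (so that the ratio r reduces to a ratio of target densities) and uses as target the binomial model of Eq. 7 with a Beta(a, b) prior. The values x = 7, n = 10, a = b = 2 are invented. For this target the posterior is known to be Beta(a + x, b + n − x), which provides a check on the sampler.

```python
import math
import random

def log_target(p, x=7, n=10, a=2.0, b=2.0):
    """Unnormalized log posterior: binomial likelihood times Beta(a, b) prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (x + a - 1.0) * math.log(p) + (n - x + b - 1.0) * math.log(1.0 - p)

def metropolis(n_iter, start=0.5, step=0.2, seed=1):
    """Random-walk Metropolis sampler with a symmetric Gaussian jumping kernel."""
    rng = random.Random(seed)
    current, samples = start, []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)
        # With a symmetric kernel the ratio r reduces to f(proposal) / f(current).
        log_r = log_target(proposal) - log_target(current)
        # Accept if r >= 1, otherwise with probability r (u < r, u uniform on [0, 1)).
        if log_r >= 0.0 or rng.random() < math.exp(log_r):
            current = proposal
        samples.append(current)
    return samples

samples = metropolis(50_000)
kept = samples[len(samples) // 2:]            # discard the first half as burn-in
mcmc_mean = sum(kept) / len(kept)
exact_mean = (2.0 + 7) / (2.0 + 2.0 + 10)     # mean of Beta(a + x, b + n - x)
print(f"MCMC mean = {mcmc_mean:.3f}, conjugate mean = {exact_mean:.3f}")
```

Working with log densities avoids numerical underflow when the likelihood involves many data points, which is the usual situation in practice.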
2.9. Checking the Results
The above models and analyses can be quite complex. Common sense dictates that
we should check them thoroughly before jumping to conclusions. Several aspects
of the model can be checked:
Posterior distribution consistency. The joint posterior distribution of the
model unknowns summarizes all our knowledge about them from priors and data.
Hopefully, the data have been sufficient to modify the prior substantially, so
that prior and posterior differ, the latter being still reasonable. For
example, the posterior density should not be concentrated on a boundary if a
bounded prior has been used. That would be a sure sign that data and prior
conflict about the value of some parameters. Such conflicts can also lead to
multi-peak posterior distributions, which are hard to estimate and sample
from. In that event, the data, the prior, and the model itself should be
carefully questioned. If the data are not informative, we fall back on the
prior, which is not a problem as long as informative prior distributions have
been used. There are ways to estimate retrospectively or prospectively the
information gain brought by specific experiments (32, 33), but they go beyond
the scope of this chapter. If neither the data nor the prior (or even part of
the prior) is informative, the posterior is left vague, and numerical
algorithms usually have problems converging in such cases. So, beware of
noninformative priors on any parameter if you are not sure that the data will
be informative about it. The problem may arise in higher dimension than the
usual univariate marginal, and it may happen that only combinations of
parameters are identifiable given the data. For example, if two parameters a
and b are only found as the product ab in the model formulation, chances are
that the data likelihood only constrains that product and that any combination
of a and b will be acceptable if vague (even if proper) priors are used. This
can be diagnosed by examining the correlations between pairs of parameters,
and such a check should be done routinely. Higher-dimensional correlations can
indeed happen, but they are harder to check, and they tend to translate into
2D correlations anyway.
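The product example can be reproduced numerically. In the sketch below (an illustration with invented numbers), log a and log b receive vague but proper uniform priors, the hypothetical likelihood only says that the product ab is close to 2, and posterior draws are obtained by simple rejection sampling. The pairwise correlation check then reveals the problem.

```python
import math
import random

rng = random.Random(0)
log_a, log_b = [], []
for _ in range(200_000):
    # Vague (but proper) priors: log a and log b uniform on [-2, 2].
    la, lb = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
    # Hypothetical likelihood: the data only say that a * b is close to 2.
    z = (la + lb - math.log(2.0)) / 0.05
    if rng.random() < math.exp(-0.5 * z * z):   # rejection sampling
        log_a.append(la)
        log_b.append(lb)

def correlation(u, v):
    """Pearson correlation coefficient of two equally long samples."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    var_u = sum((x - mu) ** 2 for x in u)
    var_v = sum((y - mv) ** 2 for y in v)
    return cov / math.sqrt(var_u * var_v)

r = correlation(log_a, log_b)
spread_a = max(log_a) - min(log_a)
print(f"{len(log_a)} posterior draws, corr(log a, log b) = {r:.3f}")
print(f"range of log a still covered by the posterior: {spread_a:.2f}")
```

A correlation close to −1 together with a marginal for a that remains almost as wide as its prior is the signature of a parameter pair that the data constrain only through their product.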
Data fit. Let us imagine that we have a well-formed estimate of the posterior
distribution or a large sample of parameter vectors drawn from it. The next,
obvious, step is to check whether the data analyzed are well modeled. That is
relatively straightforward to do if the graphs to construct (e.g., observed
vs. predicted data values) are easy to obtain. If it is analytical, the
posterior predictive distribution of the data should be used to assess
systematic deviations in scale or location between the actual data and their
estimated distribution. If a posterior parameter sample was obtained by
numerical methods, the model can usually be run for each parameter vector
sampled to simply simulate data values. For each data point, a histogram of
predicted values can be constructed and the probability of the data under the
model can be checked (confidence bands can, for example, be formed for the
data).
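The simulation-based check described above can be sketched as follows. Everything in this example is a stand-in chosen for illustration: the deterministic model is a simple first-order decay rather than a PBPK model, the "posterior sample" is drawn from an invented distribution instead of an MCMC run, and the observed values are made up.

```python
import math
import random

rng = random.Random(42)

def model(k, t, c0=2.0):
    """Hypothetical deterministic model: first-order decay from c0."""
    return c0 * math.exp(-k * t)

# Stand-in for an MCMC output: 2,000 (k, geometric SD) vectors, invented for illustration.
posterior = [(rng.lognormvariate(math.log(0.2), 0.1), 1.15) for _ in range(2000)]
times = [2.0, 5.0, 10.0, 20.0]
observed = [1.40, 0.70, 0.30, 0.035]     # invented data values

def predictive_band(t, level=0.95):
    """Simulate one data value per parameter vector; return the central interval."""
    sims = sorted(model(k, t) * math.exp(rng.gauss(0.0, math.log(gsd)))
                  for k, gsd in posterior)
    lo = sims[int((1.0 - level) / 2.0 * len(sims))]
    hi = sims[int((1.0 + level) / 2.0 * len(sims)) - 1]
    return lo, hi

bands = [predictive_band(t) for t in times]
inside = [lo <= y <= hi for (lo, hi), y in zip(bands, observed)]
for t, y, (lo, hi), ok in zip(times, observed, bands, inside):
    print(f"t = {t:4.1f}: observed {y:.3f}, 95% band [{lo:.3f}, {hi:.3f}], inside: {ok}")
```

Note that each simulated value includes both parameter uncertainty (through the sampled k) and measurement error (through the lognormal noise); leaving out the second would make the bands too narrow.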
Cross validation. Whenever possible, it is worth keeping part of the data
unused for model calibration and reserving it for the predictive data check
described above (rather than using the data used for calibration). If the
cross-validation data are reasonably well modeled, they can always be
reintroduced in the calibration data set for an increased accuracy of
parameter estimates.
2.10. Inference and Decision
Summarizing the results. The results of your analysis could be given
(published) in the form of a large sample from the posterior distribution, but
it is usual and probably useful to summarize that information in the form of
point estimates, credibility intervals, etc. In the Bayesian framework all
these can be cast easily as a decision problem, possibly involving a valuation
of the consequences of errors (an overestimation error may not have the same
importance or cost as an underestimation). It can be shown, for example, that
with a loss proportional to the square of errors, the optimal marginal
estimate for a parameter is its posterior mean. Similarly, under absolute
error loss, the optimal Bayes estimate is the posterior median, and under
zero-one loss the Bayes estimate is the posterior mode. However, most people
to whom we communicate results are unable to justify a preference for a
particular loss function, and offering them several estimates just confuses
them. In that case, I tend to prefer reporting the mode, particularly with
many-parameter models, because it is the best compromise between prior and
data. While point estimates are useful summaries, many people
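The link between loss functions and point estimates stated in Subheading 2.10 can be verified numerically on a posterior sample. The sketch below uses an invented, right-skewed sample in place of a real MCMC output and searches a grid of candidate estimates for the one minimizing the posterior expected loss; the zero-one loss (posterior mode) is left out for brevity.

```python
import random
import statistics

rng = random.Random(3)
# Stand-in posterior sample: a right-skewed (lognormal) marginal, for illustration.
sample = [rng.lognormvariate(0.0, 0.5) for _ in range(5000)]

def expected_loss(estimate, loss):
    """Posterior expected loss of reporting `estimate`, averaged over the sample."""
    return sum(loss(estimate, theta) for theta in sample) / len(sample)

def squared(e, th):
    return (e - th) ** 2

def absolute(e, th):
    return abs(e - th)

post_mean = statistics.fmean(sample)
post_median = statistics.median(sample)

candidates = [0.6 + 0.01 * i for i in range(81)]        # grid from 0.6 to 1.4
best_sq = min(candidates, key=lambda e: expected_loss(e, squared))
best_abs = min(candidates, key=lambda e: expected_loss(e, absolute))
print(f"mean {post_mean:.3f} vs squared-loss minimizer {best_sq:.2f}")
print(f"median {post_median:.3f} vs absolute-loss minimizer {best_abs:.2f}")
```

An asymmetric loss (for example, one penalizing underestimation more than overestimation) can be plugged into the same function and will move the optimal estimate toward a higher posterior percentile.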
3. An Example of Application: Butadiene Population PBPK Modeling
The following example is taken from a series of clinical studies on
1,3-butadiene metabolism in humans. We were primarily interested in
identifying the determinants of human variability in butadiene metabolism
(36-42). I will present here an application of Bayesian population modeling to
that question, along the lines of Gelman et al. (43). The deterministic link
between the parameters of interest and the data will be a so-called
physiologically based pharmacokinetic (PBPK) model (44-48).
3.1. The Data
The data we will analyze in this example were obtained from 11 healthy adults
(aged between 19 and 65 years) who gave informed consent to the study. They
were exposed on two occasions for 20 min to 2 ppm butadiene in the air. The
second test took place 48 weeks after the first. Timed measurements of
butadiene were made in their exhaled breath during exposure and for 40 min
after it ended. For each subject, on each occasion, pulmonary flow rate, Fpul,
was monitored (with a coefficient of variation, CV, of 10%) and a blood sample
was taken to determine the butadiene blood over air partition coefficient, Pa
(estimated with a CV of 17%). In addition, the sex, age, body mass, and height
of the subjects were recorded upon enrollment. For details see ref. 41.
3.2. Statistical Models
Visual inspection of the data (Fig. 3) strongly suggests that intraindividual
variability is present, in addition to interindividual variability. To test
the statistical significance of this observation, we will
[Figure 3: nine panels (a to i). x-axis: Time (min); y-axis: Concentration in exhaled air (ppm).]
Fig. 3. Nine (randomly taken out of 11) subjects' data on 1,3-butadiene concentration in exhaled air after a 20 min
inhalation exposure to 2 ppm 1,3-butadiene in the air. Both inter- and intraindividual variability are apparent.
[Figure 4: graphical model; nodes include P, f, Yi, Yθ1i, and Yθ2i.]
Fig. 4. A hierarchical interindividual variability model for 1,3-butadiene data analysis. For
subject i the measured exhaled air concentrations Yi are supposed to be distributed log-normally
around the predictions of a PBPK model, f, with geometric SD σ1. Pulmonary flow
rate measurements, Yθ1, and blood over air partition coefficient measurements, Yθ2, are
assumed to be distributed around the corresponding parameters (Fpul and Pa, members of the
subject-specific parameter set θi) with geometric SDs σ2 and σ3, respectively. Parameters θi,
together with inhaled air concentration C, measurement times t, and covariates φi, condition
the predictions of f. At the population level, θi are supposed to be log-normally distributed
around a geometric mean μ with geometric SD Σ. Prior distributions P are placed on σ1, μ,
and Σ. Known quantities are placed in square nodes, estimands in circular nodes.
[Figure 5: graphical model; nodes include P, C, t, f, Yij, Yθ1ij, and Yθ2ij.]
Fig. 5. Model B, describing inter- and intraindividual variability, for 1,3-butadiene data
analysis. The symbols are identical to those of Fig. 4, with an extension to describe a pair
of occasions j at which each subject i is observed. The data Y are now doubly subscripted
and parameters θij describe the state of subject i on occasion j. They are assumed to be
log-normally distributed around the subject-specific parameters θi with geometric SD Δ.
covariates φi in input. In our case t and C were the same for all subjects and
are not subscripted. In model A, the data from the two occasions j for each
subject are pooled together and considered as repeated measurements made at
the same time. The direct measurements, Yθ1 and Yθ2, of pulmonary flow rate,
Fpul, and butadiene blood over air partition coefficient, Pa, are not
considered as known covariates, but modeled as data (which in effect they
are), log-normally distributed around their true values (θ1 and θ2) with
geometric SDs σ2 and σ3, respectively. The data likelihood is then:
$$
\begin{aligned}
[\log(Y_i, Y_{\theta_1 i}, Y_{\theta_2 i}) \mid f(\theta_i, \varphi_i, t, C), \sigma_1, \sigma_2, \sigma_3]
&= N(\log Y_i \mid \log f(\theta_i, \varphi_i, t, C), \log \sigma_1) \\
&\quad \times N(\log Y_{\theta_1 i} \mid \log \theta_{i1}, \log \sigma_2) \\
&\quad \times N(\log Y_{\theta_2 i} \mid \log \theta_{i2}, \log \sigma_3). \qquad (15)
\end{aligned}
$$
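Written as code, the logarithm of the likelihood of Eq. 15 for one subject is a sum of normal log-densities on the log scale. The sketch below is an illustration only: the function and argument names are hypothetical, the PBPK predictions are assumed to have been computed elsewhere, and all numerical values are invented.

```python
import math

def log_normal_density(x, mean, sd):
    """Log of the normal density N(x | mean, sd)."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def log_likelihood(y_obs, y_pred, fpul_obs, pa_obs, theta, sigma):
    """Log of Eq. 15 for one subject.

    y_obs, y_pred : measured and PBPK-predicted exhaled concentrations
    fpul_obs, pa_obs : direct measurements of Fpul and Pa
    theta : dict holding the subject's 'Fpul' and 'Pa' parameter values
    sigma : geometric SDs (sigma1, sigma2, sigma3), each greater than 1
    """
    s1, s2, s3 = (math.log(s) for s in sigma)
    total = sum(log_normal_density(math.log(y), math.log(f), s1)
                for y, f in zip(y_obs, y_pred))
    total += log_normal_density(math.log(fpul_obs), math.log(theta["Fpul"]), s2)
    total += log_normal_density(math.log(pa_obs), math.log(theta["Pa"]), s3)
    return total

# Invented numbers, for illustration only.
theta = {"Fpul": 7.5, "Pa": 1.5}
sigma = (1.2, 1.1, 1.17)
good = log_likelihood([1.0, 0.5], [1.05, 0.48], 7.4, 1.45, theta, sigma)
poor = log_likelihood([1.0, 0.5], [2.0, 0.1], 7.4, 1.45, theta, sigma)
print(f"log-likelihood, close predictions: {good:.2f}; distant predictions: {poor:.2f}")
```

Summing such terms over subjects gives the full data log-likelihood, to which the log-densities of the population model and of the priors are added to obtain the posterior log-density.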
Model B (Fig. 5) differs only slightly from A: an intermediate
layer of occasion-specific parameters is added. We now differentiate
[Figure 6: compartments LUNGS, FAT, POORLY PERFUSED, and WELL PERFUSED.]
Fig. 6. Representation of the PBPK model used for 1,3-butadiene. This model corresponds
to the function f in Figs. 4 and 5. Its parameters are listed in Table 1 and the equations are
given in the text.
3.3. Embedded PBPK Model
The same PBPK model f is embedded in models A and B. It is a minimal
description of butadiene distribution and metabolism in the body after
inhalation (Fig. 6). Three compartments lump together tissues with similar
perfusion rates (blood flow per unit of tissue mass): the well-perfused
compartment groups the liver, brain, lungs, kidneys, and other viscera; the
poorly perfused compartment lumps muscles and skin; the third is fat tissue.
Butadiene is transported to each of these compartments via arterial blood. At
the organ exit, venous blood is assumed to be in equilibrium with the
compartment tissues. Butadiene can also be metabolized into an epoxide by the
liver, kidneys, and lungs, which are part of the well-perfused compartment.
The kinetics of butadiene in each of the three compartments can therefore be
described classically by the following set of differential equations:
$$
\begin{aligned}
\frac{dQ_{pp}}{dt} &= F_{pp} \left( C_{art} - \frac{Q_{pp}}{P_{pp} V_{pp}} \right), \\
\frac{dQ_{fat}}{dt} &= F_{fat} \left( C_{art} - \frac{Q_{fat}}{P_{fat} V_{fat}} \right), \\
\frac{dQ_{wp}}{dt} &= F_{wp} \left( C_{art} - \frac{Q_{wp}}{P_{wp} V_{wp}} \right) - k_{met} Q_{wp}, \qquad (18)
\end{aligned}
$$
where Qx is the quantity of butadiene in each compartment (x = pp for poorly
perfused, fat, or wp for well-perfused). Fx and Vx are the corresponding blood
flow rate and volume, respectively. Cart is the butadiene arterial blood
concentration. The partition coefficients Px are equilibrium constants between
butadiene concentration in compartment x and its concentration in venous
blood. The first-order rate constant for metabolism is noted kmet. The
arterial blood concentration Cart is computed as follows, assuming
instantaneous equilibrium between blood and air in the lung:
$$C_{art} = \frac{F_{pul} (1 - r_{ds}) C_{inh} + F_{total} C_{ven}}{F_{pul} (1 - r_{ds}) / P_a + F_{total}}, \qquad (19)$$
where Ftotal is the blood flow to the lung, Fpul the pulmonary ventilation
rate, rds the fraction of dead space (volume unavailable for blood-air
exchange) in the lung, and Pa the blood over air partition coefficient. In our
experiments, dead space is artificially increased by the use of a face mask.
Cven is the concentration of butadiene in venous blood and is simply obtained
as the sum of butadiene concentrations in venous blood at the organ exits
weighted by corresponding blood flows:
$$C_{ven} = \frac{\sum_{x \in \{pp, fat, wp\}} F_x Q_x / (P_x V_x)}{F_{total}}, \qquad (20)$$
with
$$F_{total} = F_{pp} + F_{fat} + F_{wp}. \qquad (21)$$
Finally, butadiene concentration in exhaled air, Cexh, can be obtained as:
$$C_{exh} = (1 - r_{ds}) \frac{C_{art}}{P_a} + r_{ds} C_{inh}. \qquad (22)$$
Remember that Cexh, Fpul, and Pa have been measured on the exposed subjects.
The model values for those form, with the data values, the basis of the
computation of the data likelihood (Eqs. 15 and 17). In Eqs. 15 and 17, Fpul
was noted θ1 and Pa noted θ2 for convenience.
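Eqs. 18 to 22 are simple enough to integrate in a few lines. The Python sketch below uses explicit Euler steps and also applies the flow scaling relations (numbered Eqs. 23 to 26 in this chapter). It is a rough illustration and not the GNU MCSim implementation: the function name is hypothetical, concentrations are handled in arbitrary consistent units, most parameter values are loosely based on the model A posterior modes, and the compartment volumes and Pfat are placeholders rather than the values used in the study.

```python
def simulate_cexh(p, c_inh=2.0, t_exposure=20.0, t_end=60.0, dt=0.005):
    """Integrate Eqs. 18-22 with explicit Euler steps; return (times, Cexh)."""
    alv = p["Fpul"] * (1.0 - p["rds"])                 # Fpul (1 - rds)
    f_total = alv / 1.14                               # Eq. 23
    flow = {"fat": p["fFfat"] * f_total,               # Eq. 24
            "pp": p["fFpp"] * f_total}                 # Eq. 25
    flow["wp"] = f_total - flow["fat"] - flow["pp"]    # Eq. 26
    q = {"pp": 0.0, "fat": 0.0, "wp": 0.0}             # quantities Qx, zero at start
    n_exposure = int(round(t_exposure / dt))
    times, cexh = [], []
    for step in range(int(round(t_end / dt)) + 1):
        c_in = c_inh if step < n_exposure else 0.0
        # Venous concentration at each organ exit, Qx / (Px Vx).
        c_out = {x: q[x] / (p["P" + x] * p["V" + x]) for x in q}
        c_ven = sum(flow[x] * c_out[x] for x in q) / f_total                 # Eq. 20
        c_art = (alv * c_in + f_total * c_ven) / (alv / p["Pa"] + f_total)   # Eq. 19
        times.append(step * dt)
        cexh.append((1.0 - p["rds"]) * c_art / p["Pa"] + p["rds"] * c_in)    # Eq. 22
        for x in q:                                                          # Eq. 18
            dq = flow[x] * (c_art - c_out[x])
            if x == "wp":
                dq -= p["kmet"] * q[x]
            q[x] += dt * dq
    return times, cexh

# Flows, partition coefficients, and kmet loosely follow the model A posterior
# modes; the volumes (L) and Pfat are placeholders chosen for illustration.
params = {"Fpul": 7.7, "rds": 0.39, "fFpp": 0.17, "fFfat": 0.049, "Pa": 1.5,
          "Pwp": 0.78, "Ppp": 0.58, "Pfat": 20.0, "kmet": 0.19,
          "Vwp": 14.0, "Vpp": 35.0, "Vfat": 15.0}
times, cexh = simulate_cexh(params)
print(f"Cexh at the start of exposure: {cexh[0]:.3f}")
print(f"Cexh at 60 min: {cexh[-1]:.4f}")
```

A production code would use an adaptive ODE solver rather than fixed Euler steps; the small step size used here is a simple way to keep the sketch dependency-free.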
Model parameter scaling. You may have noticed that the inter- and
intraindividual variances Σ and Δ are just vectors, rather than full
variance-covariance matrices, as is customary in population models (49, 50).
We have not modeled covariances between parameter values for a given
individual. The reason is that, by model construction, those covariances are
modeled deterministically by scaling functions, which render the actual
parameters (scaling coefficients) independent of each other. That approach is
heavily used in purely predictive PBPK models (51). It is well known, for
example, that total blood flow (cardiac output) is correlated with alveolar
ventilation rate, which depends in turn on pulmonary ventilation rate and the
fraction of dead space, defined above (52). We model this dependency as:
$$F_{total} = \frac{F_{pul} (1 - r_{ds})}{1.14}. \qquad (23)$$
The coefficient 1.14 corresponds to the value of the so-called ventilation
over perfusion ratio at rest (for example while seated, as our subjects were
during the controlled exposure experiments).
In turn, blood flow rates to the various tissues and organs depend on cardiac
output; at least, their sum must equal cardiac output. Those relationships
were modeled by the following algebraic equations:
$$F_{fat} = fF_{fat} \cdot F_{total}, \qquad (24)$$
$$F_{pp} = fF_{pp} \cdot F_{total}, \qquad (25)$$
$$F_{wp} = F_{total} (1 - fF_{fat} - fF_{pp}). \qquad (26)$$
The choice to condition Fwp upon the others is quite arbitrary and is dictated
by the pragmatic consideration that the fractional flows fFfat and fFpp are
rather small and, even if sampled independently, will not add up to more than
one. For a more balanced alternative see ref. 43.
Tissue volume scaling is a bit more sophisticated and uses the subject's age
(A, in years), sex (S, coded as 1 for males and 2 for females), height (Bh, in
m), and mass (Bm, in kg) (53):
Bm
Vfat Bm 0:012 0:1082 S 0:0023A 5:4: (27)
Bh2
Table 1
List of PBPK model parameters used for 1,3-butadiene
(see Fig. 6)
3.4. Choosing the Priors
As can be seen in Table 1, nine parameters will be sampled during model
calibration. The others are either measured with sufficient precision (Bm, Bh)
or determined without ambiguity (S, A). The case of Pfat is special: knowledge
of the model behavior (see Eq. 18) indicates that it determines the rate of
butadiene exit from the fat, which in turn is rate limiting for the terminal
half-life of butadiene in blood. However, with a follow-up of only 60 min (see
Fig. 3), too short to observe that terminal elimination phase, we have no hope
of getting information about that parameter from the data. We therefore set it
to a published value (54).
The prior knowledge we have about the value of the other parameters is rather
general and in fact concerns population averages. For most of the population
mean parameters, μ, we use lognormal distributions, which constrain them to be
positive, with a geometric SD of 1.2 (corresponding approximately to a CV of
20%), further truncated to stay within physiological bounds (38). Those
parameters are rather well known and those priors are very informative. We
make an exception for the major focus of our study: the metabolic rate
constant kmet. We have a general idea of its value (0.24 min⁻¹ ± 20%) (36),
but we chose for its population mean a uniform prior spanning a factor of 60
in order to let the data speak about it.
For the population standard deviations, Σk, which measure interindividual
variability, we know from the previous analyses (36, 38) that they correspond
to CVs of the order of 10-100%. Remember (Eq. 14) that we defined the
population distribution to be lognormal (i.e., normal after logarithmic
transformation). We will use MCMC sampling to sample from the posterior
distribution and we do not need to use a conjugate prior distribution (which
in this case would be inverse-gamma). For flexibility and simplicity, we use a
half-normal prior (with an SD of 0.3) for the logarithm of all the population
variances Σk² (55). This is quite informative: its mean is 0.24, so, as
expected, our prior population distributions will have a CV around 0.5 (square
root of 0.24), with values concentrated between 0 and 1.
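The numbers quoted for these priors can be checked with two standard formulas: the CV of a lognormal distribution with geometric SD g is sqrt(exp((ln g)²) − 1), and the mean of a half-normal distribution with scale s is s·sqrt(2/π). The short sketch below simply evaluates them.

```python
import math

# Lognormal prior with geometric SD 1.2: exact CV of the corresponding lognormal.
gsd = 1.2
cv_exact = math.sqrt(math.exp(math.log(gsd) ** 2) - 1.0)

# Half-normal prior with scale 0.3: its mean is scale * sqrt(2 / pi).
scale = 0.3
half_normal_mean = scale * math.sqrt(2.0 / math.pi)
typical_cv = math.sqrt(half_normal_mean)

print(f"CV for geometric SD 1.2: {cv_exact:.3f}")
print(f"half-normal mean: {half_normal_mean:.3f}, square root: {typical_cv:.3f}")
```

The first value comes out slightly below 0.2, consistent with "approximately 20%"; the second reproduces the mean of 0.24 and a CV near 0.5.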
The above population priors will be used for both model A and model B. The
latter requires in addition the specification of the intraindividual SDs Δk.
We do not have much knowledge about them, but we do not expect them to be much
larger than interindividual variabilities, given the small time span
separating the two observations. So, again, we will use the same half-normal
prior (with SD 0.3) for the logarithm of each intraindividual variance Δk².
We are left with defining priors for the geometric SDs of the measurement
errors (Eqs. 15 and 17) σ1, σ2, and σ3. We expect σ1 to be well identified
because we have 110 data points altogether, and therefore as many differences
between model and data, to estimate the analytical error on butadiene
concentrations in exhaled
3.5. Computing the Posterior
Using GNU MCSim it is enough to specify the prior distributions and
dependencies for the various parameters and the data likelihoods. The
hierarchical links are automatically made and the posteriors sampled
numerically by a standard Metropolis-Hastings algorithm (31) (see also the
user's manual online at http://www.gnu.org/software/mcsim). It is best, in our
experience, to run at least three Markov chains, starting from different
pseudo-random number seeds. They should all converge, in distribution, to the
posterior, but there is no general rule about the rate of convergence for
complex problems like ours. A good diagnostic for convergence is provided in
(23) and that is the one we will use. A first step is to run a single chain
for about 10-50 iterations to evaluate the time needed for an iteration. That
will give you an idea of the number of chains and iterations you can run given
your time constraints and available hardware. The more iterations, the better.
It is usually recommended to run chains as long as possible, to keep the
second half of the samples generated, and to check convergence on that set.
You can in fact try to keep the last 3/4 of the iterations if they appear to
have converged by the time the first quarter is done. There is no real rule
about that. A general strategy, if computations seem to require more than an
hour, is to run batches of 3-10 chains for a day or so. Then check visually
that they progress toward convergence (checking at least the trajectory of the
population parameters, the data likelihood, and the posterior density); run
the convergence diagnostic tool and focus on monitoring problem parameters
that are slow to converge. GNU MCSim allows you to restart a chain at the
point where you stopped it (that is probably a must for any useful MCMC
software). You can run batches until convergence is achieved. Storage space
can be saved by recording only the samples generated during one iteration in
every five, or ten, etc., and discarding the others. You still probably need a
few thousand samples to form good approximations of the posterior
distribution, confidence regions, etc. For nice problems, even as complex as
our current example, convergence is usually achieved between a few thousand
and a hundred thousand iterations. It depends on the weight of the data
compared to the prior, the distance between prior and likelihood, the quality
of the model calibrated, and the identifiability of its parameters:
• Few data will not move the joint posterior very far from the priors (unless
  vague priors are used); the posterior will be rather flat and easy to sample
  from. You may learn little from them, but you will get the result quickly.
  Note that an MCMC sampling strategy has been proposed to take advantage of
  this feature and of the consistent updating properties of the Bayesian
  approach: in essence, data are gradually introduced in the problem to
  smoothly reach the posterior distribution (56).
• If the data weigh a lot and conflict with the priors, the posterior might be
  multimodal and difficult to sample from. The sampler can get stuck in local
  modes and convergence may never be reached. Also, a lot of data will usually
  tell a rich story and will require a detailed model to explain them; good
  detailed models are harder to come by.
• There seem to be many incompatible ways to fit a bad model to data, and that
  translates into a multimodal posterior distribution, with none of the modes
  giving a satisfying fit. The problem may lie in the deterministic portion of
  the model or in its probabilistic part (for example, your data come from a
  mixture of distributions, while you assume unimodality). In my experience
  bad models are very hard to work with and hardly converge.
• Your model, even if good, may be over-parameterized. For example, your model
  may include a sum, a product, or a ratio of parameters, while the data only
  constrain the result of that operation. Think about fitting data nicely
  aligned along a straight line with a model like y = abx + c. Chances are
  that an infinity of couples (a, b) will fit as well and have equal posterior
  density. This translates into very high correlations between parameters and
  a posterior that is very hard to sample from. In the simple case evoked,
  that problem could be diagnosed in advance and corrected from the start, but
  that is much harder to do with a complex nonlinear model. In theory, if you
  have placed informative priors on the parameters (here, on a and b) you
  should be safe. However, it is difficult to say how informative they have to
  be compared to the data. You can try to understand the cause of the problem
  by examining the correlations between parameters and the autocorrelation
  within chains (high autocorrelations are bad news). The solution is usually
  to simplify the model or to reparameterize it.
For our butadiene example, which runs quickly, 20,000 itera-
tions per chain are enough to reach convergence for model A and
30,000 for model B. We run five MCMC chains of 50,000 itera-
tions for A and 60,000 iterations for B, keeping 1 in 10 of the last
30,000 iterations. That leaves us with a total of 15,000 samples
(vectors) from the joint posterior distribution for each model, with
the data log-likelihood and posterior log-density for each of them.
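The bookkeeping described above (several chains, keeping 1 in 10 of the last 30,000 iterations) and a multi-chain convergence check can be sketched as follows. The diagnostic implemented here is the potential scale reduction factor commonly attributed to Gelman and Rubin; whether it is exactly the diagnostic of ref. 23 is an assumption on my part. The chains are stand-ins made of independent Gaussian noise, not real MCMC output.

```python
import math
import random

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equally long chains."""
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    between = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, means)) / m
    var_plus = (n - 1) / n * within + between / n
    return math.sqrt(var_plus / within)

def keep_tail(chain, n_last=30_000, thin=10):
    """Keep 1 in `thin` of the last `n_last` iterations of a chain."""
    return chain[-n_last::thin]

# Stand-in chains: independent Gaussian noise instead of real MCMC output.
rng = random.Random(7)
raw = [[rng.gauss(0.2, 0.04) for _ in range(50_000)] for _ in range(5)]
kept = [keep_tail(c) for c in raw]
n_total = sum(len(c) for c in kept)
r_hat = gelman_rubin(kept)
print(f"{n_total} samples kept, R-hat = {r_hat:.3f}")
```

Values of the factor close to 1 indicate that the chains are indistinguishable from one another; a chain stuck in a different region inflates the between-chain variance and pushes the factor well above 1.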
Table 2
Summary (mode, mean ± SD [2.5th percentile, 97.5th percentile]) of the posterior
distributions of the population geometric means, μ, and population
(or interindividual) geometric SDs, Σ, for model A parameters

Parameter | μ                                     | Σ
fVwp      | 0.24, 0.22 ± 0.032 [0.16, 0.29]       | 1.02, 1.5 ± 0.31 [1.02, 2.2]
Fpul      | 7.7, 7.4 ± 0.77 [5.9, 9.1]            | 1.26, 1.4 ± 0.26 [1.14, 2.1]
rds       | 0.39, 0.38 ± 0.039 [0.30, 0.45]       | 1.17, 1.5 ± 0.32 [1.08, 2.3]
fFpp      | 0.17, 0.17 ± 0.020 [0.13, 0.21]       | 1.26, 1.5 ± 0.24 [1.15, 2.1]
fFfat     | 0.049, 0.053 ± 0.0086 [0.038, 0.072]  | 1.14, 1.6 ± 0.32 [1.08, 2.3]
Pa        | 1.5, 1.5 ± 0.15 [1.2, 1.8]            | 1.25, 1.4 ± 0.26 [1.10, 2.1]
Pwp       | 0.78, 0.74 ± 0.11 [0.54, 0.99]        | 1.20, 1.5 ± 0.32 [1.08, 2.2]
Ppp       | 0.58, 0.63 ± 0.097 [0.47, 0.86]       | 1.75, 1.5 ± 0.32 [1.05, 2.2]
kmet      | 0.19, 0.20 ± 0.044 [0.14, 0.31]       | 1.30, 1.6 ± 0.26 [1.15, 2.2]
Table 3
Summary (mode, mean ± SD [2.5th percentile, 97.5th percentile]) of the posterior
distributions of the population geometric means, μ, and population
(or interindividual) geometric SDs, Σ, and intraindividual SDs, Δ, for model B
parameters

θ     | μ                                   | Σ                             | Δ
fVwp  | 0.21, 0.23 ± 0.032 [0.17, 0.30]     | 1.2, 1.5 ± 0.30 [1.1, 2.2]    | 1.2, 1.5 ± 0.30 [1.1, 2.2]
Fpul  | 7.6, 7.4 ± 0.77 [6.0, 9.2]          | 1.3, 1.4 ± 0.25 [1.1, 2.0]    | 1.002, 1.1 ± 0.06 [1.01, 1.2]
rds   | 0.36, 0.38 ± 0.04 [0.29, 0.44]      | 1.3, 1.5 ± 0.34 [1.02, 2.3]   | 1.1, 1.5 ± 0.33 [1.05, 2.2]
fFpp  | 0.15, 0.19 ± 0.022 [0.13, 0.22]     | 1.3, 1.4 ± 0.27 [1.04, 2.1]   | 1.4, 1.6 ± 0.22 [1.3, 2.2]
fFfat | 0.048, 0.054 ± 0.008 [0.04, 0.071]  | 1.3, 1.5 ± 0.33 [1.06, 2.2]   | 1.4, 1.6 ± 0.31 [1.09, 2.2]
Pa    | 1.3, 1.4 ± 0.15 [1.1, 1.8]          | 1.2, 1.3 ± 0.28 [1.03, 2.1]   | 1.1, 1.2 ± 0.16 [1.03, 1.6]
Pwp   | 0.80, 0.75 ± 0.11 [0.55, 0.99]      | 1.9, 1.5 ± 0.32 [1.07, 2.2]   | 1.3, 1.6 ± 0.31 [1.13, 2.3]
Ppp   | 0.61, 0.68 ± 0.11 [0.49, 0.93]      | 1.2, 1.5 ± 0.32 [1.1, 2.3]    | 1.9, 1.7 ± 0.31 [1.2, 2.3]
kmet  | 0.15, 0.21 ± 0.062 [0.14, 0.39]     | 1.2, 1.6 ± 0.29 [1.14, 2.2]   | 1.6, 1.7 ± 0.27 [1.2, 2.2]
[Figure 7: y-axis: Population average of kmet (1/min).]
Fig. 7. Trajectory of the MCMC sampler for the population mean of kmet (1,3-butadiene metabolic rate constant) in model A,
with interindividual variability only. The five chains converged quickly to the posterior distribution. The last 30,000
iterations were kept to form the posterior sample used for further analyses. On that basis, a smoothed (Gaussian kernel)
estimate of the posterior density for kmet is shown on the right.
[Figure 8: x-axis: Iteration; y-axis: Logarithm of the population variance of kmet.]
Fig. 8. Trajectory of the MCMC sampler for the logarithm of kmet population variance in model A, with interindividual
variability only. On the right, a smoothed (Gaussian kernel) estimate of the last 30,000 samples, in the dashed rectangle,
gives an estimate of the posterior density. It is truncated at zero because variances cannot be negative.
The subject average parameters are much less correlated, and that is
also true of the population parameters (not shown). You can also
see (series of diagonals beside the main, trivial, one) that the
occasion-level parameters influence the subject averages and that the
occasions influence each other.
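A correlation matrix such as the one discussed here can be obtained from the sampler output by treating each iteration as one row and each parameter as one column. The sketch below is only an illustration under stated assumptions: the three-parameter sample is simulated, with the first two columns deliberately correlated, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(rows):
    """rows: list of parameter vectors, one per MCMC iteration."""
    columns = list(zip(*rows))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical posterior sample with two deliberately correlated parameters.
rng = random.Random(0)
rows = []
for _ in range(2000):
    a = rng.gauss(0.0, 1.0)
    b = 0.8 * a + 0.6 * rng.gauss(0.0, 1.0)  # correlation with a is about 0.8
    c = rng.gauss(0.0, 1.0)                  # independent of a and b
    rows.append((a, b, c))

corr = correlation_matrix(rows)
for line in corr:
    print(["%.2f" % v for v in line])
```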
3.6. Checking the Models
We have in fact already started checking the models when assessing
the fits to the data (Figs. 9 and 10). The fit of model B is markedly
better and its residual variance, σ1, much lower. We have also seen
that the posterior parameter estimates are reasonable. In the case
of a parametric population model, it is also useful to check the
distributional assumptions of the hierarchy. We have assumed
lognormal distributions throughout. Was that reasonable? Figure 11
shows that, at least for kmet, individual values seem reasonably
spread around their mean. Figure 13 may be clearer. It shows a
simple Gaussian kernel density estimate of the posterior
Fig. 9. Observed versus predicted concentrations of 1,3-butadiene in exhaled air, all data together. For a perfect fit, all
points would fall on the diagonal. The fit of model B (with inter- and intraindividual variability) is markedly better than that
of model A (interindividual variability only).
[Figure 10 plot, panels a to i. Vertical axis: concentration in exhaled air (ppm); horizontal axis: time (min).]
Fig. 10. Model predictions (lines) and observed concentrations, on two occasions (open and closed circles), of 1,3-
butadiene in exhaled air, for the same nine volunteers as in Fig. 3. Model A predictions are indicated by dashed lines;
Model B predictions for the two occasions are indicated by solid lines.
[Figure 11 plot. Vertical axis: kmet (1/min); horizontal axis: subjects A to K.]
Fig. 11. Box-plot of the posterior samples of kmet for the population (noted μ) and for each
subject, estimated using model B, with inter- and intraindividual variability. For each
subject the first box corresponds to the individual average, θi, and the other two to the
occasion-specific values θi1 and θi2.
3.7. Inference on Model Structure
As we have seen above, model B, with both inter- and intraindividual
variability, really seems to be a better model than model A. In our case,
intraindividual variability could be as high as interindividual variability.
[Figure 12 plot: correlation matrix of the parameters fVwp, rds, fFfat, fFpp, Fpul, Pa, Ppp, Pwp, and kmet, in three groups labeled 1, 11, and 12; correlation scale from -1.0 to 1.0.]
Fig. 12. Graphical representation of the correlation matrix between posterior parameter values for subject A, using model
B. The first group of parameters are subject-level averages, θ1, the second and third groups are the occasion-specific
parameters θ11 and θ12. Correlations are stronger at the occasion level.
3.8. Making Predictions
We are interested in assessing the capacity of different subjects to
metabolize 1,3-butadiene. Figure 11 summarizes the posterior
distribution of a key parameter involved, but that may not be the
whole story. The health risks due to butadiene exposure may rather
be linked to the quantity of butadiene metabolized, Qmet. That
quantity is certainly a function of kmet, but it also depends on the
quantity of butadiene inhaled, which in turn depends on Fpul, etc. It
is easy, using the PBPK model and the posterior parameter samples
we have, to compute Qmet for each subject within 60 min at each
occasion after their 20 min exposure to 2 ppm butadiene in air
(Fig. 14). We can see that the subject ranking using Qmet is
different from the one using kmet. When computing those
predictions, it is important to use the entire sampled parameter
vectors, one by one, even if only a subset of the samples is used.
That is because the sampled parameter values are correlated (see
Fig. 12) and must stay so. Usually, MCMC samplers output one
parameter vector at a time, or one per line of the output file, and you
just need to do one computation per line, using all the parameter
values of that line.
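The point about keeping each sampled vector intact can be illustrated with a small numerical experiment. This is a sketch under loud assumptions: `toy_prediction` is a hypothetical stand-in for a model output and is not the chapter's PBPK model, and the joint sample of (kmet, Fpul) is simulated with a built-in negative correlation rather than taken from any real MCMC run.

```python
import random
import statistics

def toy_prediction(kmet, fpul):
    # Hypothetical stand-in for a PBPK model output; NOT the chapter's model.
    return kmet * fpul

rng = random.Random(42)
# Hypothetical joint posterior sample: one (kmet, Fpul) vector per output line,
# built so that the two parameters are negatively correlated.
rows = []
for _ in range(5000):
    kmet = rng.lognormvariate(-1.6, 0.3)
    fpul = (1.5 / kmet) * rng.lognormvariate(0.0, 0.05)
    rows.append((kmet, fpul))

# Correct: one prediction per sampled vector, keeping the values of a line together.
joint = [toy_prediction(k, f) for k, f in rows]

# Incorrect: shuffling one column breaks the posterior correlation.
shuffled_fpul = [f for _, f in rows]
rng.shuffle(shuffled_fpul)
broken = [toy_prediction(k, f) for (k, _), f in zip(rows, shuffled_fpul)]

print("SD with intact vectors:  ", statistics.stdev(joint))
print("SD with shuffled columns:", statistics.stdev(broken))
```

In this toy case the prediction is much more dispersed once the columns are shuffled, because the compensation between the two parameters has been lost. With other correlation structures the error could go in the opposite direction.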
Purely predictive simulation for new random subjects
requires a bit more care. In the case of model B, we have a posterior
sample of 15,000 × 9 (μk, Σk) pairs, because we updated the
distribution of k = 9 model parameters. We can use each of the nine
pairs in a given sample to sample nine random parameter values.
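The sampling step just described can be sketched as follows. The sketch assumes, in line with the lognormal population distributions used in this example, that each parameter of a new subject is drawn from a lognormal distribution with geometric mean μk and geometric SD Σk. The posterior sample itself is invented for illustration, the function name `sample_new_subject` is hypothetical, and any truncation bounds or occasion-level (intraindividual) draws are ignored.

```python
import math
import random

PARAMS = ["fVwp", "Fpul", "rds", "fFpp", "fFfat", "Pa", "Pwp", "Ppp", "kmet"]

def sample_new_subject(draw, rng):
    """draw maps each parameter name to a (geometric mean, geometric SD) pair
    taken from ONE posterior sample; returns one random subject."""
    return {
        name: rng.lognormvariate(math.log(gm), math.log(gsd))
        for name, (gm, gsd) in draw.items()
    }

rng = random.Random(7)
# Hypothetical posterior sample of population pairs, invented for illustration.
posterior = [
    {name: (rng.uniform(0.5, 1.5), rng.uniform(1.1, 2.0)) for name in PARAMS}
    for _ in range(1000)
]

# One new random subject per posterior draw, so that uncertainty about the
# population parameters and interindividual variability are both propagated.
new_subjects = [sample_new_subject(draw, rng) for draw in posterior]
print(new_subjects[0])
```

Drawing one subject per posterior draw, rather than many subjects from a single "best" pair, keeps the nine pairs of a given sample together, for the same reason that sampled parameter vectors are kept together when making predictions.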
3.9. Conclusion of the Example
The above example does not exhaust the topic of Bayesian inference
and is not even complete by itself. The distributional assumptions
made for the variances Σ and Δ, in particular, should be checked, as
well as the sensitivity of the results to the bounds imposed on the
prior distributions. The conclusion that intraindividual variability is
important was probably obvious from the start (Fig. 3). This may
be quite general, but the data needed to check it are seldom
available, and most population pharmacokinetic analyses do not
investigate intraindividual variability. Note, however, that when it is
omitted, interindividual variability estimates remain quite reasonable
(Tables 2 and 3), at least in this case. There are plenty of aspects of
Bayesian inference that we have only mentioned in passing
(nonparametrics, for example) or not at all (e.g., optimal design, see
refs. 36, 57, and the vast area of clinical trial design and analysis).
But their principles remain the same. The interested reader can refer
to the literature indicated at the end of the introduction of this
chapter to go deeper and beyond what we have surveyed here.
References
1. Albert J (2007) Bayesian computation with R. Springer, New York
2. Berger JO (1985) Statistical decision theory and Bayesian analysis, 2nd edn. Springer, New York
3. Box GEP, Tiao GC (1973) Bayesian inference in statistical analysis. Wiley, New York
4. O'Hagan A (1994) Kendall's advanced theory of statistics, volume 2B: Bayesian inference. Edward Arnold, London
5. Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman & Hall, London
6. Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York
7. Press SJ (1989) Bayesian statistics: principles, models, and applications. Wiley, New York
8. Whittaker J (1990) Graphical models in applied multivariate statistics. Wiley, Chichester
9. Shafer G, Pearl J (1990) Readings in uncertain reasoning. Morgan Kaufmann, San Mateo, CA
10. Gelman A (2006) Multilevel (hierarchical) modeling: what it can and cannot do. Technometrics 48:432–435
11. Chiu WA, Bois F (2007) An approximate method for population toxicokinetic analysis with aggregated data. J Agr Biol Environ Stat 12:346–363
12. Pillai G, Mentre F, Steimer JL (2005) Non-linear mixed effects modeling - from methodology and software development to driving implementation in drug development science. J Pharmacokinet Pharmacodyn 32:161–183
13. Dunson DB (2009) Bayesian nonparametric hierarchical modeling. Biom J 51:273–284
14. Ghosh JK, Ramamoorthi RV (2003) Bayesian non-parametrics. Springer, New York
15. Bigelow JL, Dunson DB (2007) Bayesian adaptive regression splines for hierarchical data. Biometrics 63:724–732
16. Oppenheim J, Wehner S (2010) The uncertainty principle determines the nonlocality of quantum mechanics. Science 330:1072–1074
17. Garthwaite PH, Kadane JB, O'Hagan A (2005) Statistical methods for eliciting probability distributions. J Am Stat Assoc 100:680–700
18. Jaynes ET (2003) Probability theory: the logic of science. Cambridge University Press, Cambridge
19. Gilks WR, Richardson S, Spiegelhalter DJ (1996) Markov Chain Monte Carlo in practice. Chapman & Hall, London
20. Liu JS (2001) Monte Carlo strategies in scientific computing. Springer, New York
21. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculation by fast computing machines. J Chem Phys 21:1087–1092
22. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
23. Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences (with discussion). Stat Sci 7:457–511
24. Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
25. Doucet A, de Freitas N, Gordon N (2001) Sequential Monte Carlo methods in practice. Springer, New York
26. Andrieu C, Doucet A, Holenstein R (2010) Particle Markov chain Monte Carlo methods. J R Stat Soc B 72:269–342
27. Smith A, Gelfand A (1992) Bayesian statistics without tears: a sampling–resampling perspective. Am Stat 46:84–88
28. Rubin DB (1988) Using the SIR algorithm to simulate posterior distributions. In: Bernardo JM, De Groot MH, Lindley DV, Smith AFM (eds) Bayesian Statistics 3. Oxford University Press, Oxford, pp 395–402
29. R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
30. Bois FY, Maszle D (1997) MCSim: a simulation program. J Stat Software 2(9). http://www.jstatsoft.org/v02/i09
31. Bois FY (2009) GNU MCSim: Bayesian statistical inference for SBML-coded systems biology models. Bioinformatics 25:1453–1454
32. Hammitt JK, Shlyakhter AI (1999) The expected value of information and the probability of surprise. Risk Anal 19:135–152
33. Yokota F, Gray G, Hammitt JK, Thompson KM (2004) Tiered chemical testing: a value of information approach. Risk Anal 24:1625–1639
34. Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90:773–795
35. Green PJ (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82:711–732
36. Bois FY, Smith T, Gelman A, Chang H-Y, Smith A (1999) Optimal design for a study of butadiene toxicokinetics in humans. Toxicol Sci 49:213–224
37. Brochot C, Smith TJ, Bois FY (2007) Development of a physiologically based toxicokinetic model for butadiene and four major metabolites in humans: global sensitivity analysis for experimental design issues. Chem Biol Interact 167:168–183
38. Mezzetti M, Ibrahim JG, Bois FY, Ryan LM, Ngo L, Smith TJ (2003) A Bayesian compartmental model for the evaluation of 1,3-butadiene metabolism. J R Stat Soc C 52:291–305
39. Micallef S, Smith TJ, Bois FY (2002) Modelling of intra-individual and inter-individual variability in 1,3-butadiene metabolism. In: PAGE 11, annual meeting of the population approach group in Europe, Population Approach Group in Europe, Paris, ISSN 1871-6032
40. Ngo L, Ryan LM, Mezzetti M, Bois FY, Smith TJ (2011) Estimating metabolic rate for butadiene at steady state using a Bayesian physiologically-based pharmacokinetic model. J Environ Ecol Stat 18:131–146
41. Smith T, Bois FY, Lin Y-S, Brochot C, Micallef S, Kim D, Kelsey KT (2008) Quantifying heterogeneity in exposure-risk relationships using exhaled breath biomarkers for 1,3-butadiene exposures. J Breath Res 2:037018 (037010 p.)
42. Smith T, Lin Y-S, Mezzetti L, Bois FY, Kelsey K, Ibrahim J (2001) Genetic and dietary factors affecting human metabolism of 1,3-butadiene. Chem Biol Interact 135–136:407–428
43. Gelman A, Bois FY, Jiang J (1996) Physiological pharmacokinetic analysis using population modeling and informative prior distributions. J Am Stat Assoc 91:1400–1412
44. Bischoff KB, Dedrick RL, Zaharko DS, Longstreth JA (1971) Methotrexate pharmacokinetics. J Pharm Sci 60:1128–1133
45. Bois FY, Zeise L, Tozer TN (1990) Precision and sensitivity analysis of pharmacokinetic models for cancer risk assessment: tetrachloroethylene in mice, rats and humans. Toxicol Appl Pharmacol 102:300–315
46. Droz PO, Guillemin MP (1983) Human styrene exposure. V. Development of a model for biological monitoring. Int Arch Occup Environ Health 53:19–36
47. Gerlowski LE, Jain RK (1983) Physiologically based pharmacokinetic modeling: principles and applications. J Pharm Sci 72:1103–1127
48. Reddy M, Yang RS, Andersen ME, Clewell HJ III (2005) Physiologically based pharmacokinetic modeling: science and applications. Wiley, Hoboken, New Jersey
49. Racine-Poon A, Wakefield J (1998) Statistical methods for population pharmacokinetic modelling. Stat Methods Med Res 7:63–84
50. Lunn DJ, Best N, Thomas A, Wakefield J, Spiegelhalter D (2002) Bayesian analysis of population PK/PD models: general concepts and software. J Pharmacokinet Biopharm 29:271–307
51. Bois F, Jamei M, Clewell HJ (2010) PBPK modelling of inter-individual variability in the pharmacokinetics of environmental chemicals. Toxicology 278:256–267
52. Fiserova-Bergerova V (1983) Physiological models for pulmonary administration and elimination of inert vapors and gases. In: Fiserova-Bergerova F (ed) Modeling of inhalation exposure to vapors: uptake, distribution, and elimination. CRC Press, Boca Raton, FL, pp 73–100
53. Deurenberg P, Weststrate JA, Seidell JC (1991) Body mass index as a measure of body fatness: age- and sex-specific prediction formulas. Br J Nutr 65:105–141
54. Filser JG, Johanson G, Kessler W, Kreuzer PE, Stei P, Baur C, Csanady GA (1993) A pharmacokinetic model to describe toxicokinetic interactions between 1,3-butadiene and styrene in rats: predictions for human exposure, IARC Scientific Publication No. 127. In: Sorsa M, Pletonen K, Vainio H, Hemminki K (eds) Butadiene and styrene: assessment of health hazards, International Agency for Research on Cancer, Lyon, France
55. Gelman A (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal 1:515–534
56. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation (with discussion). J Am Stat Assoc 82:528–550
57. Amzal B, Bois FY, Parent E, Robert CP (2006) Bayesian optimal design via interacting MCMC. J Am Stat Assoc 101:773–785
INDEX
Brad Reisfeld and Arthur N. Mayeno (eds.), Computational Toxicology: Volume II, Methods in Molecular Biology, vol. 930,
DOI 10.1007/978-1-62703-059-5, # Springer Science+Business Media, LLC 2013
637
638 C OMPUTATIONAL TOXICOLOGY
Index
Binding membrane ................................................................ 500
affinity ............................................................... 40, 130 receptor.................................................. 167, 174, 175
domain, signaling................................................................... 384
energy, CellML.................................................................. 386, 405
ligand ....................................................................... 179 CellNetOptimizer ................................................ 179211
model, Cellular
plasma, automata ........................................383, 402, 403, 407
site ..................................................................... 24, 272 compartments........................................ 403, 405, 420
tissue ........................................................................ 275 systems ............................................................ 399423
Bioaccumulation .................................277, 288, 310, 314 Cephalosporins..................................................... 119, 124
Bioavailability ............................................................7, 300 CHARMM,
BioCarta....................................................... 362, 368, 370 ChEBI
Bioconductor.............................. 168, 169, 236, 360361 identifier..................................................................... 47
BioEpisteme ......................................................... 316317 ontology .................................................................... 49
Biomarker ................................... 235, 251290, 357, 359 web............................................................................. 52
BioSolveIT, Chemaxon .................................................................47, 50
Bond Chembl ............................................................................ 56
acceptor........................................................................ 9 Cheminformatics.......................................................50, 58
breaking, ChemOffice,
contribution ................................................................ 9 Chemometrics ...................................................... 500, 549
contribution method, ChemProp ...........................................100, 306, 319, 502
donor ........................................................................... 9 ChemSpider..................................................................... 50
Bone.....................................................254, 278, 614, 620 Classification
Boolean ................................................... 57, 11, 33, 181, models..................................... 99119, 501, 519, 521
182184, 192197, 200, 202204, 208, 209, molecular ................................................................. 119
211, 385, 402 qsar.................................................................... 61, 519
Bossa tree ......................................................... 103, 347, 516
predictivity ................................................................... 7 Clearance
rulebase ........................................................... 130, 138 drug ......................................................................... 396
structure....................................................................... 7 metabolite,
Brain barrier model,
penetration, process,
permeability ............................................................. 110 rate,
ClogP,
C Clonal growth,
Caco Cluster
cells........................................................................... 105 analysis .................................. 103, 107109, 504, 516
pbs,
permeability ............................................................. 105
Cadmium .............................................................. 277, 278 Cmax,
Caffeine, CODESSA ..................................319, 322, 325, 326, 502
Combinatorial chemistry ....................................... 47, 343
Calcium.......................................................................... 326
Cancer Comparative Molecular Field Analysis
risk................................................................... 270, 349 (CoMFA) ........................8, 9, 306, 321, 322, 502
risk assessment................................................ 233, 635 Comparative molecular similarity indices
analysis (CoMSIA) ........................................9, 322
Capillary,
Carbamazepine, Compartmental
Carcinogenic absorption................................................................ 382
analysis ..................................................................... 382
activity.........................................................69, 72, 106
effects .............................................................. 283, 299 model ....................................................................... 382
potency ...............................59, 74, 88, 112, 145, 281 systems ..................................................................... 382
COmplex PAthway SImulator
potential................................ 7, 7274, 131, 143, 342
Cardiovascular ............................................. 254, 256, 341 (COPASI) ................................................. 386, 407
Cell Comprehensive R Archive Network
cycle ......................................................................... 407 (CRAN) ............................................................ 360
growth ............................................................ 387, 541 CoMSIA. See Comparative molecular similarity
indices analysis (CoMSIA)
COMPUTATIONAL TOXICOLOGY
Index 639
Conformational physicochemical....................................................... 314
dynamics .................................................................. 403 prediction..................................................................... 7
energetic ...................................................................... 9 properties................................................................. 314
search ........................................................................... 9 QSAR........................................................................... 8
space.............................................................. 37, 38, 40 QSPR ........................................................................... 8
Connectivity Developmental
index ............................................................... 114, 115 effects ..................................................... 303, 315, 323
topochemical ........................................................... 114 toxicants.........................................308, 310, 312, 315
Consensus toxicity ....................... 55, 87, 90, 127, 129, 305337
predictions ............................................................... 313 Diazepam,
score, Dibenzofurans ............................................................... 306
COPASI. See COmplex PAthway Dietary
SImulator (COPASI) exposure assessment................................................ 128
COSMOS ...................................................................... 325 intake ....................................................................... 280
Covariance ................................. 512, 531533, 545, 550, Differential equations ................................ 180, 381, 383,
552554, 561, 568570, 576, 604, 633 388, 403, 405407, 475497, 618
matrix ............................................110, 458, 528, 530, Dioxin ................................ 217, 268, 281, 283287, 387
532534, 587, 633 Discriminant analysis................................ 14, 17, 19, 103,
CRAN. See Comprehensive R Archive 105, 110, 129, 314, 315, 516
Network (CRAN) DNA
Crystal structure............................................................ 317 adducts..................................................................... 134
Cytochrome binding.....................82, 83, 132, 138, 140, 142, 152
metabolism ........................................ 74, 77, 130, 141 Docking
substrate, methods .......................................................... 258, 259
Cytokine ..............................................167, 206, 209, 385 molecular,
Cytoscape.................................................... 168, 172, 186, scoring,
198, 211, 236, 240, 245, 246 simulations,
Cytotoxicity ............................................. 56, 61, 181, 310 tools,
Dose
D administered ................................................... 300, 593
Database extrapolation............................................................ 299
metric ....................................................................... 285
chemical ............................................................... 2951
KEGG ............................................171, 174, 175, 236 pharmacokinetic ...................................................... 257
network...................................................................... 92 reference ......................................................... 280, 298
response ..........................................55, 126, 288, 290,
search ..................................................... 29, 33, 34, 47
software................................................................36, 45 297301, 382, 385, 388, 592595
DDT (dichlorodiphenyltrichloroethane)..................... 287 Dosimetry ...................................174, 300, 381, 389, 390
3D QSAR ................................................ 8, 306, 322, 502
Decision
forest ........................................................................ 103 Dragon.................................................322, 324326, 502
processes, Dragon descriptors........................................................ 326
Drug
support............................................................ 342, 351
tree ................................................................. 6, 7479, binding..................................................................... 258
103, 104, 107, 108, 130, 318, 320, databases .................................................................. 100
343, 344, 348, 518 development ............................................99, 119, 253,
254, 260, 341, 342, 349, 350, 357,
Derek ....................................................68, 127, 138, 307,
309, 311, 345, 346, 348, 524 370, 371, 592
Dermal .................................................105, 275, 285, 297 distribution .............................................................. 593
drug interactions ...........................257, 268, 269, 317
Descriptors
chemical .......................................37, 5861, 324, 391 impurities ............................................... 342, 343, 346
models............................................................. 511, 522 induced toxicity ......................................357, 368372
metabolism ..........................................................74, 77
molecular ..................................................... 3, 4, 612,
14, 16, 1820, 38, 100, 103, 110, 308, 315317, plasma,
319323, 326, 327, 344, 502506, 511, 515, receptor........................................................... 104, 105
517, 522 resistance,
640 C OMPUTATIONAL TOXICOLOGY
Index
Drug (cont.) dose .......................................................................... 126
safety .............................................. 101, 256, 341351 hazard ...................................................................... 126
solubility, level ....................................... 231, 270, 276, 277, 289
targets ............................................................. 235, 245 model ....................................................................... 173
DSSTox.......................................... 55, 88, 89, 92, 93, 95, population ............................................................... 276
145, 327, 328, 345 response ................................................................... 126
Dynamic systems ........................................................... 389 route......................................................................... 287
scenario .................................................................... 300
E
F
Ecotoxicity................................41, 83, 90, 110, 310, 500
Ecotoxicology.................................................................... 4 Factor analysis ........................................ 8, 105, 109110,
Effectors...............................................254, 269, 276, 298 308, 314, 320, 545, 576
Electrotopological ......................................................... 321 FastICA.......................................................................... 545
Elimination Fat
chemical ..................................................................... 16 compartment .................................382, 617, 618, 620
drug, tissue ..................................... 300, 381, 382, 618, 620
model ......................................................................... 16 Fate and transport,
process ....................................................................... 16 Fingerprint............................................... 35, 57, 318, 344
rate ........................................................................... 281 Food
Emax model, additives ................................................................... 285
Endocrine consumption data........................................... 276, 286
disruptors............................................... 270, 282, 287 intake ..................................................... 281, 285, 286
system ........................................................................ 19 safety ...............................................70, 126, 135, 307,
Ensemble methods.......................................103, 112113 311, 349, 350
Enterocytes, Force field,
Environmental Forest ...................................................103, 306, 309, 313
agents ........................................................58, 345, 350 decision ........................................................... 103, 313
contaminants .................................................. 276, 305 decision tree............................................................. 103
fate .................................................8, 32, 81, 125, 133 method ........................................................... 130, 313
indicators ........................................................ 275290 Formaldehyde,
pollutants ........................................................ 100, 275 Fortran ........................................................................... 315
protection .......................................72, 100, 279, 313, Fractal ................................................................... 386, 387
391, 394 Fragment based ............................................................... 38
toxicity ....................................................................... 32 Functional
Environmental public health analysis ............................................... 72, 74, 264, 318
indicators................................................... 275290 groups ................................................ 6, 7, 8284, 153
Enzyme theory.............................................................. 153, 245
complex, units ....................................................... 379, 390, 391
cytochrome .............................................................. 285 Fuzzy
metabolites .............................................................. 422 adaptive .................................................. 306, 309, 316
networks ........................................ 237241, 245247 logic ........................................................181184, 205
receptors .................................................................. 269
substrates ................................................................. 257 G
transporters.............................................................. 261
Gastrointestinal ........................................... 254, 268, 386
EPISuite................................................................ 322, 524 GastroPlus,
Epoxide...........................................................70, 270, 618 Genechip............................................ 359, 362, 363, 365,
Estrogenic...................................................................... 288
368, 370
Ethanol .......................................................................... 155 Gene, genetic
Excel.......................................................92, 168, 328, 412 algorithms..................................................12, 13, 103,
Expert systems.......................................... 69, 7084, 128,
306, 320, 328, 511
129, 136, 139, 143, 152, 306308, 311, 337, expression networks ....................................... 165177
343, 350 function ................................................................... 318
Exposure networks ................................................ 168, 177, 383
assessment.............................................. 126, 270, 275
COMPUTATIONAL TOXICOLOGY
Index 641
neural networks .............................................. 103, 318 Immunotoxicity..............................................55, 128, 313
omnibus ................................................................... 601 InChI .................................................................. 38, 78, 85
ontology ........................................170, 171, 262, 362 Information
profiling .........................................362, 377, 385, 389 systems ........................................................32, 37, 277
regulatory networks ...................................... 217, 221, theory....................................................................... 321
224, 377, 384, 402 Ingestion..............................................135, 275, 285, 286
regulatory systems................................................... 406 Inhalation .......................................... 135, 275, 285, 287,
Genotoxicity .................................. 72, 82, 106, 125–159, 297–299, 614, 617
270, 342, 343, 345 Interaction
Glomerular ...................................................................... 56 energy ..................................................................9, 404
Glucuronidation ............................................................ 143 fields ............................................................................. 9
Glutathione ................................................. 217, 368, 370 model ......................................................................... 68
Graph network........................ 168, 172, 217, 377, 385, 389
model .............................................322, 413, 599, 600 rules................................................................. 401–404
theory....................................................................... 502 Interindividual variability........................... 276, 599, 601,
GraphViz ....................................198, 210, 361, 367, 368 605, 613–616, 621, 624–627, 629, 633, 634
Interspecies
H differences................................................................ 298
Hazard extrapolation............................................................ 298
assessment..................................71, 81, 126, 130–131 Intestinal
characterization ........................................75, 126, 277 absorption,
permeability,
identification...............................................72, 95, 126
HazardExpert .................................... 128, 133, 146, 147, tract .......................................................................... 386
155, 156, 158, 307, 309, 312, 313 IRIS (Integrated Risk Information System)............89, 92
Hepatic
J
lobule ..................................................... 379, 389, 391
metabolism ..................................................... 302, 391 Java............................................171, 236, 309, 313, 361,
Hepatitis ............................................................... 260, 279 366, 410–413
Hepatotoxic ................................175, 260, 262, 268, 370 JSim,
HERG
blockers........................................................... 104, 105 K
channel, KEGG
Hierarchical ligand .............................................237, 239, 247, 421
clustering ......................................108, 314, 322, 358,
pathway .........................................171, 174, 175, 237,
368, 370 238, 241, 244, 247, 263
models............................................ 599–601, 605, 606 Ketoconazole,
HIV .............................................105, 115, 254, 258, 277
Kidney
Homeostasis ........................................370, 376, 382, 393 cortex ....................................................................... 268
Homology ..................................................................... 262 injury...................................................... 259, 260, 267
models,
K-nearest neighbor .................................... 103, 108, 128,
Hormone 316, 321, 505, 516, 517, 529
binding..................................................................... 285 KNIME.........................................................360–362, 364
receptor.................................................................... 285 Kow (octanol-water partition
HPXR
coefficient) ............................................7, 8, 75, 83
activation,
agonists, L
antagonists,
HQSAR ....................................................... 306, 321, 322 Langevin,
Leadscope ........................................................47, 50, 313,
I 344, 345, 347, 524
Least squares........................................................... 9, 172,
Immune 314, 316, 319, 322, 429, 439, 466–468,
cells.................................................................. 173–174 470, 505, 506, 512, 529, 549–577,
response ......................................... 167, 168, 173–174
581, 592
Ligand 446, 483, 502, 505, 512, 528, 533–536, 582,
binding................................................... 179, 217, 306 586, 587, 598, 605
complex........................................................... 217, 421 Matlab......................................................... 182, 187, 188,
interactions ........................................ 8, 205, 404, 421 196, 197, 210, 323, 407, 494–497, 575, 576
library, Maximum likelihood estimation
receptor........................................................ 8, 41, 179, (MLE) ......................................550, 581, 585–594
185, 217, 306 MCSim.................................................609, 622, 630, 633
screening .................................................................. 259 Mercury ................................................................ 278, 280
Likelihood Meta-analysis ........................................................ 262, 601
functions .......................................582–585, 589, 603, MetabolExpert ............................................ 128, 312, 313
605, 606, 619 Metabolic Network Reconstruction ................... 244–247
method ................................................. 113, 127, 458, Metabolism
581, 582, 589 (bio)activation .................................................. 72, 300
ratio ........................................................588–589, 593, drug .....................................................................74, 77
607, 608, 619, 623, 630 liver ...............................................140, 259, 382, 386,
Linear algebra ....................................................... 429–473
Linear discriminant analysis .................................... 14, 17, prediction............................................ 7, 41, 126, 130,
19, 103, 105, 110 133–134, 140–143, 153–157, 310, 316
Lipinski, rate .................................................170, 276, 391, 618
Lipophilic..................................................... 278, 306, 320 Metabolomics/metabonomics,
Liver Metacore............................................................... 264, 358
enzyme......................................................55, 391, 392 Metadrug ....................................................................... 269
injury..................................... 259, 260, 357, 386, 390 Metal ....................................................................... 72, 317
microsomes.................................................................. 4 Metapc ........................................................................... 310
regeneration.................................................... 379, 387 Metasite ........................................................................408,
tissue ..................................................... 381, 382, 386, Meteor ........................................................ 127, 128, 133,
390, 392, 409, 618 134, 142, 143, 154, 155
toxicity ..................................................................... 260 Methanol,
LOAEL (lowest observed adverse Methemoglobin ............................................................ 311
effect level)............................................61, 62, 129, Methotrexate,
298, 299, 308, 312, 315, 327, 329, 332 Metyrapone,
Logistic MexAlert,
growth ..................................................................... 402 Michaelis-Menten equation....................... 300, 301, 384,
regression.............................. 315, 328, 329, 344, 593 494–496
Logit .......................................................... 585, 590, 591, Microglobulinuria ......................................................... 259
593, 594 Micronuclei alerts.......................................................... 276
function ................................................. 590, 591, 593 Microsomes ....................................................................... 4
Lognormal ..........................................601, 602, 615–617, Milk............................................. 143, 276–279, 282–289
620, 621, 627, 629, 631, 632, 633 Minitab ................................................................. 117, 323
LogP .................................................8, 10–13, 18, 47–49, Missing data................................................................... 129
58, 128, 143, 326, 500, 502 MLE. See Maximum likelihood
Lungs ............................................................................. 618 estimation (MLE)
Model
M checking.......................................... 92, 229–233, 622,
627–629
Madonna (Berkeley-Madonna),
Malarial ...........................................................10, 105, 261 development ..............................................57, 62, 110,
Mammillary ................................................................... 269 302, 317, 319, 324, 337, 501, 513–516
error ...............................................117, 192, 196, 519
Markov chain Monte Carlo ................................. 597, 607
Markup language ...........................................36, 386, 405 evaluation.................................................4, 19, 40, 54,
Maternal....................................................... 279, 280, 285 59, 78, 79, 100, 132, 159, 210, 220, 307, 309,
318, 320, 330–331, 342, 343, 350, 412, 499,
Mathematica ......................................................3, 6, 8, 30,
32–34, 68, 70, 107, 112, 180, 184, 215–217, 501, 576, 610
219–221, 224, 300, 301, 306, 308, 309, fitting ..............................................17, 200, 206, 302,
314–316, 323, 335, 377, 383, 386, 387, 399, 318, 506, 508, 512, 513, 522, 605, 623
identification............................................76, 100, 113, property ................................................ 4, 7, 8, 20, 75,
132, 172, 173, 191, 216, 225, 229, 76, 316, 320–322, 500, 502–503, 522
259, 343, 418 shape ..................................................... 326, 327, 404,
prediction....................................................... 100, 349, 405, 420, 421, 500, 502–503
628, 633 similarity ............................................... 9, 38, 83, 180,
refinement......................................190, 194, 318, 415 259, 269, 308, 309, 318, 319, 344, 515
selection .....................................................4, 9, 12, 17, targets ....................19, 253, 254, 259, 268, 269, 362
57, 81, 85, 103, 132, 193, 194, 255, 256, 315, Molfile................................ 131, 132, 309, 311, 314, 315
319–321, 329, 331, 504, 511, 512, 514–516, Molpro........................................................................... 323
582, 608, 612 Monte Carlo simulation ............................................... 597
structure..................................................612, 629–631 Mopac ..................................................316, 325, 326, 502
uncertainty.................................................19, 71, 112, Morphogenesis ...........................313, 381, 383, 384, 407
128, 299, 302, 336, 337, 588, 598, 599, 601, Multi-Agent Systems ..........................215, 400–401, 415
602, 603, 611, 632, 633 Multidimensional drug discovery,
validation ........................................... 57, 86, 418, 499 Multidrug resistance,
Modeling Multiscale.............................................386–389, 407, 409
homology................................................................. 262 Multivariate
molecular ............................................... 318, 413, 502 analysis ..................................................................... 358
in vitro .................................................. 54, 57, 58, 61, regression................................................................. 507
68, 88, 93, 95, 100, 136, 138, 261, 337, Mutagenicity
377–379, 381, 390–393, 406 alerts............................................................................. 7
Models ames test,
animal..............................................32, 54, 58, 62, 68, prediction,
88, 99, 104, 256, 259, 271, 280, 288, 299, 300, Myelosuppression,
301, 378, 409, 600, 601 MySQL ................................................................. 169, 236
antitubercular .......................................................... 104
biological activity......................................... 6, 8, 9, 11, N
19, 41, 57, 68, 88, 100, 101, 103, 104,
NAMD,
114, 118, 119, 306, 308, 312, 314, Nanoparticles,
318, 321, 322, 500 Nasal/pulmonary,
bone ................................................................ 254, 255 Nearest neighbor.......................................... 59, 103, 108,
carcinogenicity...........................................5, 7, 55, 59,
128, 313, 316, 321, 404, 505, 516,
60, 68, 70, 71, 76, 78, 79, 86, 88, 517, 518, 529
100, 104, 127–129, 259, 299, 309, Neoplastic ...................................................................... 106
312, 313, 328, 348, 349
Nephrotoxicity .............................................................. 266
developmental .........................................8, 21, 41, 50, Nervous system,
55, 57, 62, 68, 71, 95, 100, 112 Network
intestinal .................................................................. 379
gene....................................... 175–176, 217, 268, 383
myths, KEGG ...................................................................... 237
predict binding ........................................................ 403 metabolic ....................................... 217, 235–248, 402
reproductive..............................................55, 306, 328
neural .................................14, 16, 17, 103, 113, 308,
MoKa, 314–316, 318, 320, 402, 515, 516, 518, 521
Molecular Neurotoxicity ........................................... 55, 87, 90, 128,
descriptor ........................................ 3, 4, 6–12, 14, 16, 258, 288, 311, 313
18–20, 38, 100, 103, 110, 308, 315–317,
Newborn........................................................................ 328
319–323, 326, 327, 344, 502–506, 511, 515, Newton method ................................................... 475, 587
517, 522 NHANES ............................................................. 279, 289
docking .................................................................... 259
Nicotine,
dynamics ............................... 358, 367, 371, 403, 408 Nitrenium ion............................................. 134, 138, 140,
fragments ................................ 38, 308, 310, 314, 321 144, 152, 154, 155
geometry..................................................38, 215, 319,
NOAEL .......................................... 61, 62, 298, 299, 327
326, 405, 502–503 Non
mechanics ....................................................... 321, 326 bonded interactions,
networks ........................ 76, 133, 134, 166, 378, 384 congeneric ..................... 17, 68, 69, 71, 76, 309, 351
Non (cont.) Partition coefficient.................................... 7, 10, 75, 306,
genotoxic ......................................70, 76, 77, 78, 131, 312, 500, 613, 615, 616, 618, 620, 622
138, 143, 157, 263, 285, 347 Passi toolkit .......................................................... 416–418
mutagenic ........................................84, 101, 146, 343 Pathway
Noncancer risk assessment........................ 72, 74, 89, 299 analysis .................................. 259, 263, 264, 269, 358
Non-compartmental analysis ........................................ 300 maps ......................................................................... 236
NONMEM.................................................................... 589 Pattern recognition .................... 103–107, 315, 516, 521
Nonspecific binding .................................... 105, 290, 328 Pediatric ......................................................................... 260
Nuclear receptor............................................................ 172 Perchlorate,
Nucleic acids.................................................................... 50 Perfusion............................................................... 618, 619
Nucleophiles ........................................139, 144, 152, 154 Permeability
Numerical brain barrier ............................................................. 110
integration ...................................................... 479–480 drug ....................................................... 105, 110, 113
methods ......................................... 478, 490–497, 610 intestinal,
in vitro,
O Persistent organic pollutants (POPs) ..........................275,
Oasis database ....................................... 37–40, 45, 46, 50 277, 279, 280, 282–289
Pesticide .................................. 4, 5, 7, 40, 50, 55, 68, 69,
Objective function .......................................172, 192–194
Occam's razor,
Occupational safety, Pharmacogenomics ..................................... 211, 253, 260
Ocular ............................................................................ 312 Pharmacophore ...........................................................9, 67
Physiome
OECD
guidelines.......................................................... 84, 307 jsim models,
qsar toolbox........................................ 71, 82–84, 129,
Phytochemical ............................................. 259, 268, 269
132, 134–137, 140, 157, 524
Omics.................................................................... 264, 528 Pitfalls ................................................................. 4, 10, 408
OpenMolGRID............................................................. 319 PKa................................................................................. 128
Plasma
Open MPI,
OpenTox Framework................................................84–86 concentration .......................................................... 280
Optimization protein binding,
dosage ...................................................................... 592 Pollution ........................................................................ 284
Polybrominated diphenyl ethers
methods .......................................................... 195, 511
pre clinical, (PBDEs)...................................275, 279, 287–288
Oral Polychlorinated biphenyls
(PCBs)......................................268, 279–287, 312
absorption................................................................ 314
dose .......................................................................... 129 Polycyclic aromatic hydrocarbons
Organochlorine ........................................... 279, 284, 285 (PAHs) ......................................68, 138, 140, 217,
270, 276, 278
Orthologs ............................................................. 266–268
Outlier ........................................... 15, 16, 110, 324, 330, Polymorphism ...................................................... 256–258
503, 522, 523, 529–531, 545, 600 Pooled data.................................................. 286, 289, 616
Poorly perfused tissues ................................................. 620
Overfitting ................................................... 337, 508, 570
Oxidative stress........................................... 175–177, 218, Population based model ...................................... 279, 280
219, 221–223, 227, 228, 231, 368, 370, 371, Portal vein ............................................................ 379, 388
540, 543, 544 Posterior distribution.........................597, 603, 606–611,
613, 621–626, 631, 633
P Predict
absorption................................... 7, 41, 259, 316, 382
Paracetamol, ADME parameters .................................................. 316
Paralogs.......................................................................... 265 aqueous solubility ..................................................... 75
Parameter binding............................................18, 40, 56, 75, 77,
estimation ................................................................ 300 82, 83, 104, 106, 130, 132, 138, 140, 142, 144,
scaling ...................................................................... 619 152, 154, 179, 258, 301, 306, 403
Paraoxon, biological activity.................................. 6, 8, 9, 11, 16,
Partial least squares (PLS) .................................. 9, 14, 17, 19, 41, 57, 67, 68, 88, 90, 99, 100, 101, 103,
57, 314, 316, 318, 319, 320, 322, 324, 504, 505, 104–107, 110, 114, 118, 119, 306, 308, 310,
512, 549–577 312, 314, 318, 321, 322, 500, 506, 516
boiling point, interaction......................................172, 262, 264, 266
carcinogenicity...........................................5, 7, 55, 59, ligand,
60, 67–95, 100, 104, 127–130, 133, 135, 136, structure................................................................... 258
138–140, 143, 151, 152, 259, 270, 309–312, targets .................................................... 253, 258, 259
342, 348, 349, 499 Proteomics............................................................ 254, 265
clearance .........................................11, 14, 16, 58, 60, Prothrombin.................................................................. 105
68, 74, 92, 132, 146, 268, 288, 409, 601, 627 Pulmonary .................................................. 379, 613, 615,
CNS permeability, 616, 618–620, 622
cytochrome P450 ......................................74, 77, 130, Pyrene ..................................................215, 217–219, 222
134, 139, 141, 144
developmental toxicity ................................ 55, 87, 90, Q
127, 129, 305–337
QikProp,
fate ................................................8, 32, 81, 125, 127, QSARPro ....................................................................... 321
128, 131, 133, 142 QSIIR ........................................................................53–62
genotoxicity ...............................................72, 82, 106,
Quantum chemical descriptors.............................. 37, 324
125–133, 136–159, 270, 342, 343, 345 Quinone......................................................................... 105
Henry constant,
melting point................................................ 10, 58, 75 R
metabolism ....................................... 7, 41, 72, 74, 77,
125–133, 136–159, 235–248, 259, 269, 301, R (statistical software)................................ 328, 358, 360,
302, 305, 310, 316, 320, 351, 382, 391, 392, 613 364, 589
mutagenicity ...............................7, 67–74, 77, 79–88, Random
91, 94, 95, 100, 125, 127130, 132, 133, effects ....................................................................... 570
136–140, 145–147, 152, 156, 158, 272, 309, forest ..............................................103, 306, 309, 313
311–313, 344–347, 499 Ranking.................................................73, 141, 170, 171,
pharmacokinetic parameters ............................ 99, 119 193, 265, 631
physicochemical properties......................... 3, 7, 8, 20, Reabsorption,
41, 58, 72, 75, 131, 305, 308, 311, 314, 315, Reactive intermediates ......................................... 139, 153
499, 502 Receptor
safety ........................................ 29, 53, 54, 70, 86, 89, agonists ........................................................... 278, 284
99, 101, 113, 114, 115, 118, 126, 135, 216, AhR ...................................... 166, 217–219, 222, 223,
258, 259, 271, 301, 302, 305–337, 341–351 268, 278, 279, 284, 368
toxicity ..................... 85, 86, 100, 136, 155, 305–337 binding affinity .......................................................... 40
PredictPlus, mediated toxicity..................................................... 502
Pregnancy ...................................279, 327, 328, 602, 617 Recirculation,
Pregnane Xenobiotic receptors .................................... 269 Reconstructed enzyme network................. 240, 245, 246
Prenatal ................................................................... 55, 328 Reference concentration (RfC) ........................... 298, 299
Prior distribution ....................................... 597, 599, 600, Reference dose (RfD) ................................. 280, 298, 299
621, 622, 634 Relational databases .....................................30, 47–49, 69
621, 622, 634 Renal clearance,
Prioritization ..................................................68, 320, 337 Reproductive toxicity ............................................. 55, 328
toxicity testing, Reprotox ............................................................... 327, 328
Procheck, Rescaling........................................................................ 199
Pro Chemist .................................................................. 320 Residual errors...................................................... 592, 625
Progesterone, Respiratory
Project Caesar............................................. 100, 117, 127, system ...................................................................... 298
138, 147, 309 tract,
Propranolol, RetroMex,
ProSA, Reverse engineering .....................................172–173, 180
Protein Richly perfused tissues,
binding................................................... 106, 300, 359 Risk
databank (PDB) ..................................................46, 50 analysis .................................................................54, 86
docking ........................................................... 258, 259 characterisation.......................................................... 75
folding estimation ................................................................ 349
Risk (cont.) Singular value decomposition ................... 429, 460–462,
Integrated risk information system 464, 465–468, 472, 473, 549, 551, 554
(IRIS)...................................................89, 92, 318 Sinusoids ...................................................... 379, 387, 391
management ............................................................ 142 Size penalty................................. 195, 196, 200–202, 204
Risk/safety assessment Skin
chemical ................................................................... 126 lesion ....................................583, 584, 585, 590, 592,
pharmaceutical......................................................... 300 593–595
screening .................................................................. 301 sensitization ...................................... 74, 77, 100, 104,
testing ...................................................................... 299 130, 309, 311, 312
Robustness............................................19, 107, 110, 348, SMARTCyp ...........................................77, 130, 141, 142
512–514, 606, 613 SMBioNet............................................................. 229–233
SMILES ................................................31, 38, 55, 78, 82,
S 85, 86, 131, 132, 135, 144, 148–151, 309,
Saccharomyces cerevisiae ........ 538 312–314, 327, 502, 503, 506
Smoking............................. 126, 134, 135, 139, 174, 217
Salicylic acid
Sample size .................................................................... 597 Sodium.................................................................. 326, 359
SBML.................................................................... 386, 405 Solubility prediction........................................................ 58
Source-to-effect continuum ........................................... 58
Scalability ....................................................................... 412
Scaling...................................................20, 193, 280, 299, SPARC,
365, 386, 420, 530, 531, 534, 619, 620 SPARQL,
factor ........................................................................ 365 Sparse data,
Species
procedure................................................................. 530
SCOP, differences....................................................... 268, 302
Screening extrapolation............................................................ 299
scaling ............................................................. 280, 420
drug,
drug discovery ...............................100, 101, 256, 343 SPSS ...................................................................... 117, 576
environmental chemicals.................................. 94, 259 Statistica ..........................................................20, 117, 323
Stereoisomers ................................................................ 326
hts............................................................................... 54
methods, Steroid................................................................... 105, 285
protocols .................................................................. 101 Stomach,
Screeplot .....................................535, 537, 539, 540, 546 Storage compartment,
Stress response............................................................... 175
Searchable toxicity database .................................. 88, 145
Secondary structure prediction, Structural
Selectivity index..........................101, 115, 116, 118, 119 alert ........................................ 7, 67, 76–79, 125, 136,
138, 152, 270, 308, 311, 342–344, 348
Self-organizing maps............................................ 316, 515
Sensitivity similarity ..........................................87, 153, 270, 301
analysis, Structure-activity relationship (SAR)
analysis ..............................72, 74, 129, 136, 342, 343
coefficient,
Sensitization ...........................................74, 77, 100, 104, model ................................6, 100, 323, 330, 331, 337
127, 130, 309, 311–313 Sub
cellular...................................................................... 407
Sequence
alignment, compartments................................................. 387, 392
homology................................................................. 262 Substituted benzenes .................................................... 105
Serum albumin, Substrate
active site ................................................................. 422
Sevoflurane .................................................................... 266
Shellfish, binding............................................................ 104, 106
Signal transduction ....................170, 179, 185, 263, 402 inducers.................................................................... 121
inhibitor ................................................................... 106
SIMCA.......................................... 20, 103, 322, 516, 576
Simcyp, Substructure
Similarity searching ..............................................................33, 34
similarity ..............................................................92, 93
analysis ........................................................9, 318, 515
indices .......................................................................... 9 Sugar ........................................................... 555, 556, 559,
search ...................................... 34, 35, 38, 92, 93, 319 560, 566, 567, 572, 574
Simulated annealing ........................................................ 57 Sulfamoyl adenosines .................................................... 104
Sulfur dioxide,
Supersaturation,
Support vector machine (SVM) .......... 36, 103, 105, 110–112
Surrogate endpoint .......... 254, 257
Surveillance programs .......... 260, 276, 279, 289
Switch .......... 383, 385, 437, 438, 446–449, 538
SYBYL .......... 315, 321, 322
SYSTAT .......... 117, 323, 328
Systems
  biology .......... 215, 216, 253, 261, 375, 377, 378, 383, 386, 399, 402, 404, 405
  toxicology .......... 375–393

T

Tanimoto coefficient .......... 269
TEFs. See Toxic equivalency factors (TEFs)
Teratogenicity .......... 128, 309–311, 313, 342, 348
Tetrachlorodibenzo dioxin (TCDD) .......... 281, 285
Tetrahymena pyriformis .......... 18, 104, 314
Theophylline,
Therapeutic
  doses,
  index .......... 115
Thermodynamic properties,
Thiazoles .......... 106, 139
Threshold value .......... 517, 598
Thyroid .......... 280, 285, 379
Tissue
  dosimetry .......... 381, 390
  grouping,
  partition coefficient .......... 618, 620
  volumes .......... 382, 619
Tmax,
TNF stimulation .......... 208
Tolerable
  daily intake .......... 280, 286
  weekly intake,
Tolerance .......... 329, 587
TopKat .......... 69, 129, 138, 140, 147, 151, 155, 156, 158, 307, 309, 312
Topliss tree,
Topological
  index .......... 114
Total clearance,
ToxCast program .......... 95
ToxCreate .......... 85, 86
Toxic equivalency factors (TEFs) .......... 285
Toxicity/toxicological
  chemical .......... 53, 54, 58, 71, 86–95, 301, 349, 376, 389, 392, 520
  database .......... 54–56, 86, 87, 327, 328, 348, 349
  drug .......... 101, 341, 342, 357, 358, 368–372
  DSSTox .......... 328
  endocrine disruption .......... 90, 104
  endpoint .......... 7, 56, 62, 69–71, 92, 100, 101, 127, 128, 159, 301, 311, 312, 313, 342, 348, 349, 351, 358, 362, 371
  environmental .......... 32, 83, 88, 93
  estimates .......... 307
  mechanism .......... 54, 167, 175, 301
  organic,
  pathways .......... 359, 362, 363, 366, 368, 377, 379
  potential .......... 90, 257, 306–308, 312, 342
  prediction .......... 69–71, 100, 155, 305–337
  rodent carcinogenicity .......... 59, 312
  screening .......... 372
  testing .......... 53, 54, 61, 305, 351, 376, 377, 393
Toxicogenomics .......... 94, 259, 261–266, 268, 357–373
Toxicoinformatics .......... 373
Toxicologically equivalent,
Toxicophore .......... 67, 127, 130, 137, 311–313
ToxMic .......... 74, 76, 130, 133, 140
TOXNET .......... 56, 88, 89, 90, 92, 93, 135, 137, 145, 324, 327
ToxPredict .......... 85, 86
ToxRefDB .......... 55, 90, 94, 145, 327, 328
Toxtree .......... 6, 7, 68, 70, 71, 74–80, 82, 86, 117, 130, 133, 136, 138, 139, 147, 155, 156, 158, 270, 344, 345, 348, 524
TPSA,
Tracers,
Training sets .......... 107, 147, 323, 346, 350, 351, 501, 513
Transcription factor .......... 172, 179, 217, 362, 370
Transcriptome .......... 360
Transduction .......... 170, 179, 185, 263, 402
Transit compartment,
Transport
  mechanisms .......... 302
  models .......... 381
  proteins (transporters) .......... 260, 261, 269, 369, 387
Tree .......... 6, 39, 74–79, 82, 103, 104, 107, 108, 130, 136, 230, 318, 320, 344, 347, 386, 516–518
  self organizing .......... 320
Trichloroacetic acid (TCA) .......... 544, 545
TTC decision tree .......... 130
Tumor .......... 55, 72, 88, 104, 254, 269, 359, 387, 390, 391
Turnover,
Tyrosinase inhibitors .......... 104, 105
Tyrosine .......... 254

U

UML .......... 406, 414, 416, 418, 419
Uracils .......... 114–116
Urinary cadmium concentration .......... 278
Urine .......... 276, 278, 279

V

Valacyclovir,
Validation
  external .......... 6, 21, 59, 319, 325, 337, 345, 349, 510, 514–516, 524
  internal .......... 17, 512, 514, 515
  LOO .......... 17, 330, 513
  methods .......... 129, 330, 331, 554
  QSAR .......... 19, 325, 345, 499–524
  techniques .......... 319, 349, 512
Van der Waals .......... 9
Vapor pressure .......... 75
Variability .......... 70, 71, 262, 276, 280, 289, 385, 404, 501, 528, 599–602, 604, 613–616, 621, 624–630, 632–634
Variable selection .......... 12, 321, 504, 511–512, 516
Vascular endothelial .......... 56, 105
VCCLAB,
Venlafaxine,
Vinyl chloride,
Virtual
  high throughput screening (vHTS),
  libraries .......... 101
  screening .......... 30, 101, 343, 512, 514
  tissue .......... 375, 383, 384, 409
VolSurf,
Volume of distribution,

W

Warfarin .......... 257, 261
WinBUGS,
WinNonLin,
WSKOWWIN,

X

XPPAUT,