GSPLab
Data Mining
Edward R. Dougherty
Department of Electrical and Computer Engineering
Center for Bioinformatics and Genomic Systems Engineering
Texas A&M University
gsp.tamu.edu
Texas A&M
GSPLab
Reading
Book: Chapter 8
Paper: Dougherty, E. R., Prudence, Risk, and
Reproducibility in Biomarker Discovery,
BioEssays, Vol. 34, No. 4, 277-279, 2012.
03/28/15
Knowledge Discovery
Knowing the constitution of scientific knowledge
and how to validate it leaves open the question of
how to discover knowledge.
Obviously, we need to observe Nature, but in what
manner?
An Experiment is a Question
Hans Reichenbach (Rise of Scientific
Philosophy): An experiment is a question
addressed to Nature. As long as we
depend on the observation of occurrences
not involving our assistance, the
observable happenings are usually the
product of so many factors that we cannot
determine the contribution of each
individual factor to the total result.
Reasoning to Science
Hans Reichenbach: By means of the
artificial occurrences of planned
experiments, the complex occurrence of
Nature is thus analyzed into its
components. That Greek science did not
use experiments in any significant way
proves how difficult it was to turn from
reasoning to empirical science.
Science is not constituted by reasoning about
data; it is constituted by pragmatic, predictive
models.
Mere Observation
Hannah Arendt: [Natural science]
seemed to be liberated by the discovery that
our senses by themselves do not tell the
truth. Henceforth, sure of the unreliability
of sensation and the resulting insufficiency
of mere observation, the natural sciences
turned toward the experiment, which, by
directly interfering with nature, assured the
development whose progress has ever since
appeared to be limitless.
Efficient Experimentation
Douglas Montgomery: If an experiment is
to be performed most efficiently, then a
scientific approach to planning the
experiment must be considered. By the
statistical design of experiments we refer to
the process of planning the experiment so
that appropriate data will be collected, which
may be analyzed by statistical methods
resulting in valid and objective conclusions.
The statistical approach to experimental
design is necessary if we wish to draw
meaningful conclusions from the data.
Everyday Classification
Some algorithm is proposed.
The algorithm separates some data set.
We are not told the distribution from which the data come.
Data Mining
Data mining is a return to pre-Baconian groping, albeit at
a much faster groping rate than was then possible.
It suffers from three debilitating properties:
It does not ask precise questions.
There is no statistical characterization of the procedure.
As opposed to pattern recognition, it lacks a characterization of
prediction in the context of a distribution.
Asymptopia
Edward Leamer: Two of the latest products-to-end-all-suffering
are nonparametric estimation
and consistent standard errors, which promise
results without assumptions, as if we were
already in Asymptopia where data are so plentiful
that no assumptions are needed. ... By
disguising the assumptions on which nonparametric
methods and consistent standard errors rely, the purveyors
of these methods have made it impossible to have an
intelligible conversation about the circumstances in which
their gimmicks do not work well and ought not to be used.
As for me, I prefer to carry parameters on my journey so I
know where I am and where I am going, not travel stoned
on the latest euphoria drug.
A Huge Challenge
Janet Woodcock (Director, Center for Drug
Evaluation and Research, FDA): [As much as
75 percent of published biomarker associations
are not replicable] This poses a huge
challenge for industry in biomarker
identification and diagnostics development.
Dougherty, E. R., Prudence, Risk, and Reproducibility in
Biomarker Discovery, BioEssays, 34(4), 277-279, 2012.
Yousefi, M., and E. R. Dougherty, Performance Reproducibility
Index for Classification, Bioinformatics, 28(21), 2824-2833,
2012.
Yousefi, M. R., Hua, J., Sima, C., and E. R. Dougherty, Reporting Bias When Using Real
Data Sets to Analyze Classification Performance, Bioinformatics, 26(1),
68-76, 2010.
Multiple-Rule Bias
Use r classification rules and s error
estimation rules. Select the pair with
the minimum estimated error, $\varepsilon_{\min,\mathrm{est}}$.
$\mathrm{Bias}(m) = E[\varepsilon_{\min,\mathrm{est}} - \varepsilon_{\mathrm{true}}(i_{\min})]$, over
sampling distribution, m = rs, n = 60.
Yousefi, M. R., Hua, J., and E. R. Dougherty, Multiple-Rule Bias in the Comparison of Classification Rules,
Bioinformatics, 27(12), 1675-1683, 2011.
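The selection effect behind this bias can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not in the cited paper: all m rule pairs share the same true error, and each estimated error is the true error plus independent Gaussian noise. The function name `multiple_rule_bias` and the parameters `true_error`, `noise_sd`, and `trials` are hypothetical. The quantity averaged is the minimum estimated error minus the true error of the selected pair.

```python
import random

def multiple_rule_bias(m, true_error=0.25, noise_sd=0.05, trials=20000, seed=0):
    """Monte Carlo estimate of E[min estimated error - true error of the selected pair].

    Toy assumption: every one of the m (rule, estimator) pairs has the same
    true error, and each estimate is unbiased with Gaussian noise.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # One sample: m noisy error estimates, one per (rule, estimator) pair
        estimates = [true_error + rng.gauss(0.0, noise_sd) for _ in range(m)]
        # Select the pair with the minimum estimated error
        i_min = min(range(m), key=estimates.__getitem__)
        # True error is identical for all pairs in this toy setup
        total += estimates[i_min] - true_error
    return total / trials

if __name__ == "__main__":
    for m in (1, 2, 5, 10, 20):
        print(m, round(multiple_rule_bias(m), 4))
```

Even though each individual estimate is unbiased in this setup, reporting the minimum over m pairs is optimistically biased, and the bias grows in magnitude as m increases.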
[Figure labels: time course or experiments; patterns; genes]
Clustering Algorithm
An algorithm that partitions a set of points into several
groups, based on a measure of similarity (or
dissimilarity) between the points.
Example: [Figure: points plotted on axes x1, x2, x3 and partitioned into Group 1, Group 2, Group 3]
Fuzzy c-means
K-means
S.O.M.
Hierarchical clustering (Euclidean distance)
Hierarchical clustering (correlation)
K-means Clustering
Goal: Partition points into tight clusters.
Algorithm:
Randomly initialize with k means $m_1, \ldots, m_k$
Place $x$ into $C_i$ if $\|x - m_i\| \le \|x - m_j\|$ for $j = 1, \ldots, k$
Update $m_1, \ldots, m_k$ as the means of $C_1, \ldots, C_k$
Repeat until means do not change
Clusters determined by Voronoi diagram of $m_1, \ldots, m_k$
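The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, initialization by sampling k data points, and the rule of keeping the old mean when a cluster becomes empty are all assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as the initial means
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest mean
        # (squared Euclidean distance gives the same nearest mean)
        new_labels = [
            min(range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, means[i])))
            for x in points
        ]
        # If no label changed, the means cannot change either, so stop
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each mean becomes the centroid of its cluster
        for i in range(k):
            members = [x for x, lab in zip(points, labels) if lab == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, means

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, means = kmeans(data, k=2)
    print(labels, means)
```

Assigning each point to its nearest mean is exactly what makes the final clusters the Voronoi cells of the means.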
Hierarchical Clustering
Iteratively join clusters based on similarity measure
(agglomerative clustering).
Farthest neighbor similarity measure:
$d(C_i, C_j) = \max\{\|x - y\| : x \in C_i,\ y \in C_j\}$
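A minimal sketch of agglomerative clustering with the farthest-neighbor measure follows. Since d is a distance, the two clusters with the smallest d are joined at each step. The function names `farthest_neighbor` and `agglomerate`, and stopping at a requested number of clusters, are assumptions made for the example.

```python
import math

def farthest_neighbor(ci, cj):
    """Largest Euclidean distance between a point of ci and a point of cj."""
    return max(math.dist(x, y) for x in ci for y in cj)

def agglomerate(points, num_clusters):
    """Join the two closest clusters repeatedly until num_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest farthest-neighbor distance
        _, i, j = min(
            (farthest_neighbor(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        # Merge cluster j into cluster i (j > i, so deleting j leaves i valid)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    print(agglomerate(pts, 3))
```

Recomputing every pairwise distance at each step keeps the sketch short but is far slower than needed for large data sets.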
A. cholesterol biosynthesis
B. cell cycle
C. immediate-early response
D. signaling and angiogenesis
E. wound healing and tissue remodeling
Source: Michael B. Eisen et al., PNAS, Vol. 95, 1998.
Solution
Mathematical theory
Pattern recognition theory and random set theory
Dougherty, E. R. , Barrera, J., Brun, M., Kim, S., Cesar, R. M., Chen, Y.,
Synthetic Example
5 synthetic templates
Simulated data from the templates, with different variances
5 different clustering methods
Experiment ($\sigma^2 = 3.0$)
[Figure labels: many misclassifications; 22 misclassifications (8.8%)]
Before clustering
24.5% Error!!
Clustering Error
Points are a realization S of a labeled random point
process.
A clustering algorithm assigns to S a label function $\phi_S$.
The error of the algorithm is the expected difference between its
labels and the labels generated by the point process.
Error must take into account that we do not care about
the ordering, only the partitions generated.
Expectation taken with respect to the distribution of the
point process.
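For a single realization, the order-invariant part of this definition can be sketched as follows: the mismatch rate is minimized over all relabelings of the clusters, so only the partition matters. The function name `partition_error` is hypothetical, and the brute-force search over permutations is an assumption that is only practical for a small number of clusters. The expectation over the point process could then be approximated by averaging this quantity over many simulated realizations.

```python
from itertools import permutations

def partition_error(true_labels, cluster_labels, k):
    """Fraction of mislabeled points, minimized over relabelings of the k clusters.

    Labels are integers in range(k). Two labelings that induce the same
    partition have error 0 regardless of how the clusters are numbered.
    """
    n = len(true_labels)
    best = n
    for perm in permutations(range(k)):
        # perm[c] is the true label proposed for cluster c under this relabeling
        mismatches = sum(
            1 for t, c in zip(true_labels, cluster_labels) if perm[c] != t
        )
        best = min(best, mismatches)
    return best / n

if __name__ == "__main__":
    truth = [0, 0, 0, 1, 1, 1]
    found = [1, 1, 1, 0, 0, 1]  # same grouping up to renaming, one point wrong
    print(partition_error(truth, found, k=2))
```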
Clustering Validity
Clustering validity is analogous to classification
validity.
Replace classifier with cluster operator and
classification error with clustering error.
Validation Indices
Validation indices are meant to judge the validity of a
clustering output.
They can be based on a number of heuristic
considerations and methodologies.
Do they correspond to scientific validity?
Does a validation index correlate to clustering error?
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., and E. R.
Dougherty, Model-Based Evaluation of Clustering Validation Measures,
Pattern Recognition, 40 (3), 807-824, 2007.
Scientific Knowledge
Requires a mathematical model.
In classification, the model is learned from training data.