Bayes theorem
A priori / A posteriori prob.
Loss function
Bayes decision rule
Min. error rate classification
Discriminant functions
Error bounds and prob.
Computational Intelligence & Pattern Recognition © Robi Polikar, 2013 Rowan University, Glassboro, NJ
Today in PR
Top 10 algorithms in machine learning
Bayes theorem
Bayes Decision Theory
Bayes rule
Loss function & expected loss
Minimum error rate classification
Classification using discriminant functions
Error bounds & probabilities
X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, et al., "Top 10 Algorithms in Data Mining," Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
* C4.5 is listed as one of the top 10 in the Wu et al. paper. Dr. Polikar disagrees with this, as C4.5 is a variant of CART. The MLP is a far more deserving classifier to be in the top 10. Also, note that J. Quinlan, the creator of C4.5, is one of the authors of this paper.
K-Nearest Neighbor
Given a set of labeled training points, a test instance is assigned the label that appears most frequently among its k nearest neighbors.
[Figure: k-NN example with k = 11. Axes: Sensor 1 measurements (feature 1) vs. Sensor 2 measurements (feature 2); points are measurements from classes 1-4.]
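The neighborhood vote described above can be sketched in a few lines. This is an illustrative Python sketch, not course code; the data, labels, and function name are made up:

```python
from collections import Counter
import math

def knn_predict(train, test_point, k=3):
    """Classify test_point by majority vote among the k nearest
    labeled training points (Euclidean distance)."""
    # train: list of ((feature1, feature2), label) pairs
    neighbors = sorted(train, key=lambda p: math.dist(p[0], test_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1, 1), "class1"), ((1, 2), "class1"),
         ((5, 5), "class2"), ((6, 5), "class2"), ((5, 6), "class2")]
print(knn_predict(train, (5.5, 5.2), k=3))  # "class2": the point sits in class2 territory
```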
Naïve Bayes
Given an observation x, the correct class ω is the one that maximizes the posterior probability P(ω|x). The posterior is computed using Bayes rule from
• the prior probability P(ω): the probability of class ω occurring in general, and
• the likelihood p(x|ω): the probability of the specific observed value of x occurring in class ω.
P(ωj|x) ∝ p(x|ωj) · P(ωj)

p(x|ωj) = ∏_{i=1}^{d} p(xi|ωj) = p(x1|ωj) · p(x2|ωj) ⋯ p(xd|ωj)
If the features are assumed to be class-conditionally independent, then p(x|ωj) is just a product of one-dimensional individual likelihoods, which is much easier to compute. This is the "naïve" assumption made by NB.
If the distributional form is also assumed (usually Gaussian), then NB is very easy to implement.
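Under the naïve assumption, the class score is simply prior × product of per-feature likelihoods. A minimal Python sketch, not course code; the Gaussian parameters, class names, and data are invented for illustration:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """One-dimensional Gaussian likelihood p(x_i | class)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def nb_posterior_scores(x, classes):
    """Score each class by prior * product of per-feature Gaussian
    likelihoods (the naive independence assumption)."""
    scores = {}
    for name, (prior, params) in classes.items():
        likelihood = 1.0
        for xi, (mu, sigma) in zip(x, params):
            likelihood *= gaussian_pdf(xi, mu, sigma)
        scores[name] = prior * likelihood
    return scores

# made-up per-class (mean, std) for two independent features
classes = {"w1": (0.5, [(0.0, 1.0), (0.0, 1.0)]),
           "w2": (0.5, [(3.0, 1.0), (3.0, 1.0)])}
scores = nb_posterior_scores([2.8, 3.1], classes)
print(max(scores, key=scores.get))  # "w2": the observation lies near w2's means
```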
K-Means
Instances that belong to the same class/cluster should "look alike," i.e., be located within close proximity of each other.
K-means iteratively partitions the data into k clusters, each centered around its cluster center, such that the within-cluster distance (the sum of distances of all instances to their cluster center), summed over all clusters, is minimized.
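The iterative partitioning can be sketched as Lloyd's algorithm: alternate between assigning each point to its nearest center and moving each center to the mean of its cluster. An illustrative Python sketch with toy data (not course code):

```python
import math

def kmeans(points, centers, iters=20):
    """Lloyd's algorithm: assign points to the nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # recompute each center as the coordinate-wise mean of its cluster
        centers = [tuple(sum(c) / len(members) for c in zip(*members)) if members else ctr
                   for members, ctr in zip(clusters, centers)]
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, [(0, 0), (10, 10)])))  # two centers near (1/3, 1/3) and (31/3, 31/3)
```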
C4.5 / C5.0 / See5
C4.5 is a successor of ID3, and was itself succeeded by C5.0. It is essentially CART with different splitting criteria.
Differences between the two (C4.5 and CART), which in my view are minor:
• C4.5 uses information theory based splitting criteria
instead of CART’s Gini Index
• CART creates binary trees (binary splits), whereas
C4.5 can handle multiple outcomes
• C4.5 has a more efficient pruning mechanism
• CART can handle unequal misclassification costs.
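The two splitting criteria in the first bullet can be compared directly. A small illustrative Python sketch (not course code), for a list of class probabilities at a node:

```python
import math

def gini(p):
    """CART's Gini impurity for a list of class probabilities p."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """Information-theoretic impurity (in bits), the basis of
    C4.5-style splitting criteria."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Both measures peak for an even 50/50 split and vanish for a pure node:
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
print(gini([1.0, 0.0]))                       # 0.0
```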
Could the fact that R. Quinlan (the creator of C4.5) is one of the authors of the Wu et al. paper be the reason why C4.5 was included in their list?
Multilayer Perceptron (MLP)
Mimics the structure of a physiological neural network, a massively interconnected web of neurons, where each neuron performs a relatively simple function.
Each neuron (node) computes a weighted sum of its inputs,
and then passes that sum through a nonlinear thresholding
function. The neuron “fires” (or not)
based on the output of the thresholding function.
The optimal weights are determined using a
gradient descent optimization.
[Figure: an MLP with d input nodes (i = 1, 2, …, d), H hidden-layer nodes (j = 1, 2, …, H), and c output nodes (k = 1, 2, …, c); weights Wij connect inputs xi to hidden nodes yj, and weights Wjk connect hidden nodes to outputs zk. Inset: gradient descent on the error surface J(w) from different initial weights w1, w2, w3.]
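The computation of a single node, as described above, in a minimal Python sketch (the weights, bias, and inputs are made up; not course code):

```python
import math

def neuron(inputs, weights, bias):
    """One MLP node: a weighted sum of the inputs passed through a
    nonlinear (here sigmoid) thresholding function. An output near 1
    corresponds to the neuron 'firing'."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-a))

print(neuron([1.0, 0.0], [4.0, 4.0], -2.0))  # strongly activated
print(neuron([0.0, 0.0], [4.0, 4.0], -2.0))  # mostly off
```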
Support Vector Machines
In a two-class linear classification problem, the best decision boundary is the one that maximizes the margin between the classes.
SVM uses quadratic programming to find this optimal boundary. This may not seem terribly useful, since most problems are not linear.
[Figures from Theodoridis & Koutroumbas, Pattern Recognition, 4/e, Academic Press, and R. Gutierrez-Osuna, Lecture Notes.]
Expectation-Maximization (EM) & Gaussian Mixture Models
An extremely versatile optimization approach, EM is an iterative algorithm that cycles between the expectation (E) and maximization (M) steps to find the parameter estimates of a statistical model.
Designed for parameter estimation (determining the
values of unknown parameters 𝛉 of a model), and commonly
used in conjunction with other algorithms, such as k-means,
Gaussian Mixture Models (GMMs), hierarchical mixture of
experts, or in missing data analysis.
In the E-step, the expected value of a likelihood function (the figure of merit in determining the true value of the unknown parameters) is computed under the current estimate θ̂ of the unknown parameters θ (which are to be estimated).
In the M-step, a new estimate of θ is computed such that this new estimate maximizes the expected likelihood. The E and M steps are then iterated until convergence.
In GMMs, data are modeled using a weighted combination of Gaussians, and EM is used to determine the Gaussian parameters, as well as the mixing coefficients (the mixing weights).
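The E and M steps can be sketched for the simplest GMM case: a two-component, one-dimensional mixture with known and equal variances (a simplifying assumption made here for brevity; the data are synthetic, and this is not course code):

```python
import math

def em_gmm_1d(data, mu1, mu2, sigma=1.0, iters=50):
    """EM for a two-component 1-D Gaussian mixture with known, equal
    variance. E-step: compute responsibilities; M-step: re-estimate
    the means and the mixing weight from those responsibilities."""
    w1 = 0.5  # initial mixing weight of component 1
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = []
        for x in data:
            p1 = w1 * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            p2 = (1 - w1) * math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means and mixing weight
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
        w1 = sum(r) / len(data)
    return mu1, mu2, w1

data = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9, 5.2]
print(em_gmm_1d(data, 0.5, 4.0))  # means near 0.03 and 5.05, weight near 3/7
```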
This will become clearer when we discuss the density estimation problem.
Apriori
Evaluates a dataset of lists (transactions) to determine which items appear together, i.e., learns the associations among the items in the lists.
Apriori is a breadth-first search that uses hash trees to quickly search large datasets.
It is an iterative search: start with single items whose frequency of occurrence exceeds a threshold, called the minimum support. Then form all pairs that include those frequent single items (the candidate lists), and scan the dataset to determine the pairs whose frequency of occurrence exceeds the threshold. Continue with triplets, quadruplets, etc.
The fundamental premise: any item or list of items whose frequency of occurrence falls below the threshold cannot be part of a frequent superset that includes those items. This is how Apriori limits the search space.
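The level-wise search above can be sketched as follows. This is an illustrative Python sketch, not course code; the transaction data are invented:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all itemsets appearing in at least min_support transactions.
    Level k candidates are built only from frequent (k-1)-itemsets,
    which is how the search space is pruned."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    frequent = {}
    level = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support]
    while level:
        for s in level:
            frequent[s] = count(s)
        # join step: unions of frequent sets that are one item larger
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if count(c) >= min_support]
    return frequent

transactions = [frozenset(t) for t in
                [{"milk", "bread"}, {"milk", "bread", "eggs"},
                 {"bread", "eggs"}, {"milk", "bread"}]]
freq = apriori(transactions, min_support=3)
print(freq[frozenset({"milk", "bread"})])  # 3: the pair appears in three transactions
```

Note that "eggs" (support 2) never enters the candidate lists, so no superset containing it is ever counted.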
PageRank™
The importance of a webpage is proportional to the number of links pointing to it from other web pages, as well as the importance of those web pages. A page P that receives links from many webpages gets a higher PageRank. If those links come from pages with high PageRank themselves, then P receives an even higher PageRank.
This is the original algorithm used by Google
for ranking its search results. Currently, it is
only part of the (undisclosed) algorithm used
by Google.
PageRank is named after its inventor Larry
Page. The fact that it is a “page ranking”
algorithm is a convenient coincidence.
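The circular definition above (a page's rank depends on the ranks of its linkers) has a fixed point that power iteration finds. A minimal Python sketch with an invented three-page web and the commonly cited 0.85 damping factor; not course code:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration: a page's rank is the damped sum of the ranks
    of the pages linking to it, each divided by that page's out-degree."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# toy web: "c" is linked by both other pages, so it ranks highest
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
print(max(r, key=r.get))  # c
```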
AdaBoost
Combine the decisions of an ensemble of classifiers to reduce the likelihood of having chosen a poorly trained classifier.
Conceptually similar to seeking several opinions before
making an important decision.
Based on the premise that there is increased confidence
that a decision agreed upon by many (experts, reviewers,
doctors, “classifiers”) is usually correct.
AdaBoost generates an ensemble of classifiers using a
given “base model,” which can be any supervised
classifier. The accuracy of the ensemble, based on
weighted majority voting of its member classifiers, is
usually higher than that of a single classifier of that type.
The weaker the base classifier (the poorer its
performance), the greater the impact of AdaBoost.
AdaBoost trains the ensemble members on different subsets of the training data. Each additional classifier is trained with data that is biased towards the instances misclassified by the previous classifier, so that the ensemble focuses on increasingly difficult-to-learn samples.
AdaBoost turns a dumb classifier into a smart one!
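The re-weighting loop can be sketched with one-dimensional threshold "stumps" as the weak base model. An illustrative Python sketch (not course code; the toy data are made up):

```python
import math

def adaboost(X, y, rounds=10):
    """AdaBoost with 1-D threshold stumps as the base model: each round
    fits the best stump on the current weights, then up-weights the
    instances that stump misclassified."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # pick the (threshold, polarity) stump with lowest weighted error
        best = None
        for thr in X:
            for pol in (1, -1):
                pred = [pol if x >= thr else -pol for x in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, thr, pol, pred)
        err, thr, pol, pred = best
        err = max(err, 1e-10)                      # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)    # this stump's vote weight
        ensemble.append((alpha, thr, pol))
        # re-weight: misclassified instances become relatively heavier
        w = [wi * math.exp(-alpha * p * t) for wi, p, t in zip(w, pred, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    def predict(x):
        score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
        return 1 if score >= 0 else -1
    return predict

X = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [-1, -1, -1, 1, 1, 1]
f = adaboost(X, y)
print([f(x) for x in X])  # matches y
```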
A comparison
…but don’t take my word for it:
The Elements of Statistical Learning, Hastie, Tibshirani and Friedman, Springer, 2009.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/
Probability Theory in One Slide
Here are the most important things to know in probability:

• Probabilities are nonnegative and normalize to 1:
P(x) ≥ 0; Σ_x P(x) = 1 (discrete); ∫ p(x) dx = 1 (continuous)
• If you have two r.v. X and Y, they have a joint distribution P(X, Y):
P(X, Y) = P(Y, X); 0 ≤ P(X = xi, Y = yj) ≤ 1; Σ_i Σ_j P(X = xi, Y = yj) = 1, or ∬ p(x, y) dy dx = 1

• The sum rule (marginalization): p(x) = ∬ p(x, y, z) dy dz; e.g., P(A, C) = Σ_B Σ_D Σ_E P(A, B, C, D, E)
• The product rule: the joint probability can always be obtained by multiplying the conditional probability (conditioned on one of the variables) with the marginal probability of the conditioning variable:
P(X, Y) = P(Y|X) P(X) = P(X|Y) P(Y)

• which gives rise to Bayes rule:
P(Y|X) = P(X, Y) / P(X) = P(X|Y) P(Y) / P(X) ∝ P(X|Y) P(Y)
Bayes Rule
We pose the following question: given that event A (e.g., the observation of some data) has occurred, what is the probability that any single one of the events Bj occurs (i.e., that the correct class is one of the category choices)?
P(Bj|A) = P(A ∩ Bj) / P(A) = P(A|Bj) · P(Bj) / Σ_{k=1}^{N} P(A|Bk) · P(Bk)
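As a worked example of the formula, with hypothetical numbers (a rare event and an imperfect test; this sketch is illustrative, not course code):

```python
# Hypothetical numbers: event B1 (e.g., a disease) with prior P(B1) = 0.01,
# and an observation A (a positive test) with P(A|B1) = 0.95 and P(A|B2) = 0.05.
priors = {"B1": 0.01, "B2": 0.99}
likelihood = {"B1": 0.95, "B2": 0.05}

evidence = sum(likelihood[b] * priors[b] for b in priors)              # P(A)
posterior = {b: likelihood[b] * priors[b] / evidence for b in priors}  # P(Bj|A)
print(round(posterior["B1"], 3))  # 0.161: a positive test is far from conclusive
```

The low prior keeps the posterior modest even though the likelihood strongly favors B1.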
The Bayesian Way of Thinking
In Bayesian statistics, we compute the probability based on three pieces of information:
Prior: Our (subjective?) degree of belief that the event is plausible in the first place.
Likelihood: The probability of making an observation, under the condition that the
event has occurred: how likely is it to observe what I just observed, if event A did in
fact happen (or, how likely is it to observe this outcome, if A [class 𝜔𝐴 ] were true). Likelihood
describes what kind of data we expect to see in each class.
Evidence: The probability of making such an observation.
It is the combination of these three that gives the probability of an event, given that an
observation (however incomplete information it may provide) has been made. The
probability computed based on such an observation is then called the posterior probability.
Given the observation, the Bayesian thinking updates the original belief (the prior) based
on the likelihood and evidence.
posterior = (likelihood × prior) / evidence
Sometimes the combination of evidence and likelihood is so compelling that it can override our original belief.
• Recall the Rowan medical school example: as of 2010, there was only some chatter about the possibility of such a school → prior: very low. In 2011, we see a building start construction → likelihood P(building | medical school): very high. This high likelihood trumps our low prior → posterior P(medical school | building in place): very high!
On Frequentists & Bayesians
http://www.statisticalengineering.com/frequentists_and_bayesians.htm
http://www25.brinkster.com/ranmath/bayes02.htm
http://en.wikipedia.org/wiki/Bayesian_probability
http://en.wikipedia.org/wiki/Frequency_probability
What Can Bayes Do for You?
How about $9.5 billion?
A success story: Mike Lynch, the “Bayesian Millionaire”,
founded his company (Autonomy) in 1991. Developed
systems for
• matching fingerprints for the Essex
police force
• reading car number plates
WHY?
Bayes Theorem for Spam Filtering!
Thomas Bayes: An obscure 18th century clergyman and statistician who published 2 minor works
but who nowadays is “more important than Marx and Einstein put together”…
Telegraph Magazine, 3 February, 2001
[My favorite fellow of the Royal Society is the Reverend Thomas Bayes, an obscure 18th-century
Kent clergyman and a brilliant mathematician who] devised a complex equation known as the
Bayes theorem, which can be used to work out probability distributions. It had no practical
application in his lifetime, but today, thanks to computers, is routinely used in the modelling of
climate change, astrophysics and stock-market analysis.
Bill Bryson
Quoted in Max Davidson, 'Bill Bryson: Have faith, science can solve our problems', Daily Telegraph (26 Sep 2010)
While we in pattern recognition have long known the virtues of Bayes theorem, it has recently been made popular by its success in spam filters.
Bayes Classifier
Statistically, the best classifier you can build!
Based on quantifying the trade-offs between various classification decisions using a probabilistic approach.
The theory assumes:
Decision problem can be posed in probabilistic terms
All relevant probability values are known or can be estimated (in practice this is
not true)
Back to our fish example:
Assume that we know the probabilities of observing sea bass and salmons, 𝑃(𝜔1)
and 𝑃(𝜔2), for a particular location of fishing and time of year
• Prior probability
Based on this information, how would you guess the type of the next fish to be
caught?
ω = ω1 if P(ω1) > P(ω2); ω = ω2 if P(ω2) > P(ω1). A reasonable decision rule?
Setting this up
We now make an observation, say, the length and width of the fish caught: 14 inches by 6 inches.
Random variables, say 𝑋1 (𝑙𝑒𝑛𝑔𝑡ℎ): 𝑥1 = 14; and 𝑋2 (𝑤𝑖𝑑𝑡ℎ): 𝑥2 = 6;
How to use this information?
• There are two possibilities, the fish is seabass ω1 or salmon ω2
Probabilistically, then, with x = [x1, x2]^T:

ω = ω1 if P(ω = ω1|x1, x2) > 0.5; ω = ω2 if P(ω = ω2|x1, x2) > 0.5

or, equivalently: ω = ω1 if P(ω = ω1|x) > P(ω = ω2|x); ω = ω2 otherwise

P(ωj|x) = p(x|ωj) · P(ωj) / p(x) = p(x|ωj) · P(ωj) / Σ_{k=1}^{C} p(x|ωk) · P(ωk)
Posterior probability: the (conditional) probability of the correct class being ωj, given that feature value x has been observed. Based on the measurement (observation), the probability of the correct class being ωj has shifted from P(ωj) to P(ωj|x).

Evidence: the total probability of observing the feature value x. Serves as a normalizing constant, ensuring that the posterior probabilities add up to 1.
A Bayes classifier decides on the class ωj that has the largest posterior probability.
The Bayes classifier is statistically the best classifier one can possibly construct. Why?
How Do We Compute Class-Conditional Probabilities?
ω1: sea bass; p(x|ω1): class-conditional probability (likelihood) for sea bass
ω2: salmon; p(x|ω2): class-conditional probability (likelihood) for salmon
Likelihood: for example, given that a salmon (ω2) is observed, what is the probability that this salmon's length is between 11 and 12 inches? Or simply, what is the probability that a salmon's length is between 11 and 12 inches? Or, how likely is it that a salmon is between 11 and 12 inches long?
P(ωj|x) = p(x|ωj) · P(ωj) / p(x) = p(x|ωj) · P(ωj) / Σ_{k=1}^{C} p(x|ωk) · P(ωk)
Maximum A Posteriori Estimate: MAP!
Let’s formalize our understanding before we go any further.
We are given a training dataset, 𝒟, which contains 𝐶 classes. Hence, each instance belongs
to one of 𝜔1 , … , 𝜔𝐶 classes. We are then given an instance 𝐱, based on which we are asked
to predict its class label.
Of all classes, we pick our best guess as the one that has the maximum posterior probability,
i.e., that is the most likely (most consistent with the observed data – the likelihood) while
best conforming to our original gut feeling – the prior probability.
If the label indicated by the likelihood differs from our gut feeling, we may choose a label that is different from our original belief. This would happen, for example, if we see a lot of data (evidence) that contradicts our prior belief, so that the likelihood overwhelms the prior.
Choosing the label that has the highest posterior probability is known as the maximum a posteriori (MAP) decision, formally given as

ω̂ = arg max_{c=1,…,C} P(ω = ωc | x, D)

where ω̂ is our best estimate of the true class (the so-called MAP estimate), and the conditioning on the dataset D is made explicit.
Maximum Likelihood Estimate: MLE
Now recall that the posterior depends on three pieces of information, the prior, the likelihood, and the evidence: P(ωj|x) = p(x|ωj) · P(ωj) / p(x). Since we are choosing the class with the largest P(ωj|x), and p(x) does not depend on the class j, the denominator is just a normalization constant that does not affect the classification.
Hence, we really need only two pieces of information, the likelihood and the prior:

P(ωj|x) ∝ p(x|ωj) · P(ωj)
Of these two, the prior is independent of the data; after all, it is based on our prior subjective belief. As we receive more and more data, the decision becomes more and more dependent on the likelihood. If we make the decision purely on the likelihood (choose the class that maximizes the likelihood), i.e.,

ω̂ = arg max_{c=1,…,C} p(x | ω = ωc, D)

we obtain the maximum likelihood estimate (MLE) of the true class label.
• We will see more on MLE later.
In general, as we see more and more data, the MAP estimate usually converges towards the
MLE.
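The difference between the two estimates can be seen in a small sketch: MLE ignores the prior, MAP weighs the likelihood by it. An illustrative Python sketch (Gaussian likelihoods with equal, unit variances; all numbers are invented; not course code):

```python
import math

def gauss(x, mu, sigma=1.0):
    """Unnormalized Gaussian likelihood (the shared constant cancels)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

priors = {"w1": 0.9, "w2": 0.1}   # strong prior belief in w1
mus = {"w1": 0.0, "w2": 2.0}
x = 1.2                            # an observation slightly favoring w2

mle = max(mus, key=lambda c: gauss(x, mus[c]))              # likelihood only
map_ = max(mus, key=lambda c: gauss(x, mus[c]) * priors[c]) # likelihood * prior
print(mle, map_)  # w2 w1: the strong prior overrides a weak likelihood advantage
```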
The Real World: The World of Many Dimensions
In most practical applications we have more than one feature, and therefore the random variable x must be replaced with a random vector x.
The joint probability distribution p(x) still satisfies the axioms of probability
The Bayes rule is then
P(ωj|x) = p(x|ωj) · P(ωj) / p(x) = p(x|ωj) · P(ωj) / Σ_{k=1}^{C} p(x|ωk) · P(ωk)

where p(x) = p(x1, x2, ⋯, xd) and p(x|ωj) = p(x1, x2, ⋯, xd|ωj).

If, and only if, the random variables in the vector are statistically independent:

p(x) = p(x1) · p(x2) ⋯ p(xd) = ∏_{i=1}^{d} p(xi)
While the notation changes only slightly, the implications are quite substantial:
The curse of dimensionality
The Curse of Dimensionality
Remember: in order to approximate the distribution, we need to create a histogram. Say that, on average, we need 30 instances in each of 20 bins per feature to adequately populate the histogram:
1-D: 20 × 30 = 600 fishes
2-D: 20 × 20 × 30 = 12,000 fishes!
3-D: 20 × 20 × 20 × 30 = 240,000 fishes!
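The arithmetic above, which grows exponentially in the number of dimensions, in a two-line Python sketch:

```python
# Required instances grow as bins^d cells, times 30 instances per cell
bins, per_bin = 20, 30
for d in (1, 2, 3):
    print(d, bins ** d * per_bin)  # 600, 12000, 240000
```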
The Loss Function
Mathematical description of how costly each action (making a class decision) is.
Are certain mistakes costlier than others?
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) · P(ωj|x)
Bayes decision takes the action that minimizes this conditional risk !
Bayes Decision Rule Using Conditional Risk
1. Compute the conditional risk R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) · P(ωj|x) for each action, where λ(αi|ωj) is the loss incurred by taking action αi when the true class is ωj, P(ωj|x) is the probability that the true class is ωj, and the sum runs over all classes.
2. Select the action that has the minimum conditional risk. Let this be action 𝑘
3. The overall risk is then

R = ∫_{x∈X} R(αk|x) · p(x) dx

the conditional risk associated with taking action αk(x) based on the observation x, integrated over all possible values of x, weighted by the probability p(x) that x will be observed.
4. This is the Bayes Risk, the minimum possible risk that can be taken by any classifier !
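Steps 1-2 can be sketched directly. This illustrative Python sketch (not course code) reuses the cancer/healthy loss table from the two-class example that follows:

```python
def bayes_decision(loss, posteriors):
    """Pick the action a_i that minimizes the conditional risk
    R(a_i|x) = sum_j loss[i][j] * P(w_j|x)."""
    risks = [sum(l * p for l, p in zip(row, posteriors)) for row in loss]
    return min(range(len(risks)), key=risks.__getitem__), risks

# loss[i][j] = cost of action i when the true class is j
loss = [[0.5, 10.0],     # a1: diagnose cancer  (true cancer, true healthy)
        [1000.0, 0.0]]   # a2: diagnose healthy (true cancer, true healthy)
action, risks = bayes_decision(loss, [0.02, 0.98])
print(action)  # 0: even at a 2% cancer posterior, diagnosing cancer is the lower-risk action
```

Because a missed cancer costs 1000 while a false alarm costs only 10, the minimum-risk action flips to "healthy" only when the cancer posterior is tiny.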
Two-Class Special Case
Definitions:
α1: decide on ω1; α2: decide on ω2
λij = λ(αi|ωj): loss for deciding on ωi when the true class is ωj

Sample loss function (rows: true class; columns: diagnosis):

                 Cancer diagnosis   Healthy diagnosis
True cancer            0.5               1000
True healthy           10                  0
Conditional risk:
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x): risk associated with choosing class 1
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x): risk associated with choosing class 2

Note that λ11 and λ22 need not be zero, though we expect λ11 < λ12 and λ22 < λ21.
Clearly, we decide on ω1 if R(α1|x) < R(α2|x), and on ω2 otherwise:

choose ω1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)
⇔ choose ω1 if (λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2)
Minimum Error-Rate Classification: Multiclass Case
If we associate taking action i with selecting class ωi, and if all errors are equally costly, we obtain the zero-one loss (symmetric cost function):

λ(αi|ωj) = 0 if i = j; 1 if i ≠ j

This loss function assigns no loss to a correct classification, and a loss of 1 to any misclassification. The risk corresponding to this loss function is then

R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) · P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)
What does this tell us? To minimize this risk (the average probability of error), we need to choose the class that maximizes the posterior probability P(ωi|x); only this selection minimizes the risk.
Error Probabilities (Bayes Rule Rules!)
In a two-class case, there are two sources of error: x falls in R1 but the true class is ω2, or x falls in R2 but the true class is ω1. The total probability of error is then

P(error) = ∫ P(error|x) p(x) dx = ∫_{R1} P(ω2|x) p(x) dx + ∫_{R2} P(ω1|x) p(x) dx

[Figure: the two error regions under p(x|ω1)P(ω1) and p(x|ω2)P(ω2); xB marks the optimal Bayes boundary, x* a non-optimal one.]
Another example

P(error) = ∫ p(error|x) p(x) dx = ∫_{R1} P(ω2|x) p(x) dx + ∫_{R2} P(ω1|x) p(x) dx

[Figure: if the decision boundary is not at the optimal boundary x0, the error additionally includes the pink shaded region.]
Probability of Error
In the multi-class case, there are more ways to be wrong than to be right, so we exploit the fact that P(error) = 1 − P(correct), where p(x, ωi) = P(ωi|x) p(x) = p(x|ωi) P(ωi):

Discrete: P(correct) = Σ_{i=1}^{C} P(x ∈ Ri, ωi) = Σ_{i=1}^{C} P(x ∈ Ri|ωi) P(ωi)

Continuous: P(correct) = Σ_{i=1}^{C} ∫_{x∈Ri} p(x|ωi) P(ωi) dx = Σ_{i=1}^{C} ∫_{x∈Ri} P(ωi|x) p(x) dx
Of course, in order to minimize P(error), we need to maximize P(correct), for which we need to maximize each and every one of the integrals. Note that p(x) is common to all integrals; therefore, the expression is maximized by choosing the decision regions Ri where the posterior probabilities P(ωi|x) are maximum.
[Figure: posteriors P(ωi|x) over decision regions I-V; from R. Gutierrez @ TAMU.]
Reject Option
Sometimes, when the posterior probabilities are not high enough, we may decide
that refusing to make a decision may reduce the overall risk.
Refusing to make a decision practically means that the available information is not adequate to make a sufficiently confident decision; this forces the user to obtain additional information.
The reject option can be controlled simply by using a posterior probability threshold θ, below which we refuse to make a decision.
If θ = 1, all samples are rejected; if θ < 1/C, no samples are rejected.
[Figure: posteriors P(ω1|x) and P(ω2|x), with the reject region where both fall below θ.]
Inference & Decision
Typically, a Bayes classifier requires two steps:
• Inference: Use training data to determine the prior / likelihood probabilities
• Decision: Given new data, choose the class for which the posterior is maximum
There are other general approaches:
Generative models: As in Bayes classifier, determine the joint probabilities explicitly,
followed by Bayes rule to obtain posteriors, then determine class membership for each input
• Having access to the joint distribution allows us to "compute" additional data without observing it, since the data are assumed to come from that distribution.
• Immensely useful, but often computationally difficult (sometimes impossible!)
Probabilistic Discriminative models: Determine the posterior probability – or a function
of it – directly (without computing joint probabilities). The function then discriminates
among classes.
Discriminative models: find a function f(x) that approximates the unknown mapping from the data x to their correct classes ωj. Probability does not necessarily play a role, and posteriors are not computed. The outputs of some models (e.g., some neural networks) can, however, be interpreted as posterior probabilities under certain conditions.
Discriminant-Based Classification
A discriminant is a function g(x) that discriminates between classes. It assigns the input vector to a class according to its definition: choose class i if

gi(x) > gj(x) ∀ j ≠ i, i, j = 1, 2, …, c

Bayes rule can be implemented in terms of discriminant functions simply by choosing the posterior as the discriminant: gi(x) = P(ωi|x).
Normal Densities
If likelihood probabilities are normally distributed, then a number of simplifications can be made.
In particular, the discriminant function can be written as in this greatly simplified form (!) by using
the log transformation:
p(x|ωi) = 1 / ( (2π)^{d/2} |Σi|^{1/2} ) · exp( −½ (x − μi)^T Σi^{-1} (x − μi) ), i.e., p(x|ωi) ~ N(μi, Σi)

gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln P(ωi)
Features are statistically independent, and all features have the same variance: Distributions
are spherical in d dimensions, the boundary is a generalized hyperplane (linear discriminant) of
d-1 dimensions, and features create equal sized hyperspherical clusters.
Starting from

gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln|Σi| + ln P(ωi)

with Σi = σ²I, the (d/2) ln 2π and ½ ln|Σi| terms are independent of the class i and can be dropped, leaving

gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)
Euclidean Distance
Recall the definition of distance between two (vectors) points u and v.
This can be computed regardless of the dimensionality (of course, so long as u and
v are of the same dimension)
d(u, v) = ‖u − v‖ = √( (u1 − v1)² + ⋯ + (un − vn)² )

[Figure: points u and v in the x-y plane, with horizontal leg |ux − vx| and vertical leg |uy − vy|.]
Case 1: Σi = σ²I

This case results in linear discriminants, i.e., the decision boundaries are (hyper)planes:

gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi)

which, after dropping the class-independent x^T x term, can be written in the linear form gi(x) = wi^T x + wi0, with

wi = μi / σ² and wi0 = −(1/(2σ²)) μi^T μi + ln P(ωi) (the threshold, or bias, of the ith category)
[Figures: decision boundaries for the 1-D, 2-D, and 3-D cases (with non-equal priors).]
Note how priors shift the discriminant function away from the more likely mean!
Example

function discriminant1
% This function demonstrates the Bayesian classifier for data drawn from a Gaussian
% distribution, for the type 1 case, where the covariance matrix is diagonal and constant.
% This is equivalent to all features being independent and classes having equal variances.
% To see a different effect, change the parameters mu1, mu2, mu3, and sigma.
% Robi Polikar, September 2005 - modified September 2007, September 2009.
% (Setup lines below are reconstructed to mirror discriminant2; the
% original parameter values may have differed.)
mu1=[3 2]'; mu2=[7 4]'; mu3=[2 5]'; prior1=1/3; prior2=1/3; prior3=1/3; sigma=1;
x=-2:0.1:10; y=-2:0.1:10;
for i=1:length(x)
for j=1:length(y)
g1(i,j)=(1/sigma^2)*mu1'*[x(i); y(j)]-1/(2*sigma^2)*(mu1'*mu1)+log(prior1);
end
end
% Compute the discriminants for the equal sigma case (implements the discriminant
% on slide 51). Note that in this case, the x'x term is removed, since it is
% independent of the class information.
for i=1:length(x)
    for j=1:length(y)
        g2(i,j)=(1/sigma^2)*mu2'*[x(i); y(j)]-1/(2*sigma^2)*(mu2'*mu2)+log(prior2);
    end
end
for i=1:length(x)
    for j=1:length(y)
        g3(i,j)=(1/sigma^2)*mu3'*[x(i); y(j)]-1/(2*sigma^2)*(mu3'*mu3)+log(prior3);
    end
end
% Need to determine, for each point, whether g1, g2 or g3 is maximum. To do this
% effectively, put all 3 matrices into a 3-dimensional array, and find the max
% index along the third dimension.
g(:,:,1)=g1; g(:,:,2)=g2; g(:,:,3)=g3;
[a b]=max(g, [], 3); %b: indicates 1, 2, 3 which of the g functions is maximum for each point.
figure ; pcolor(x,y,b); xlabel('Feature2'); ylabel('Feature1'); title('Pseudocolor plot of the decision boundaries'); shading interp;
%Create colormap for the colors Red, Green and Blue.
RGB=zeros(64,3); RGB(1:21,:)=repmat([1 0 0],21,1);
RGB(22:43,:)=repmat([0 1 0],22,1); RGB(44:64,:)=repmat([0 0 1],21,1);
colormap(RGB); colorbar;
Example
function p=gauss2d(X,Y, MU, SIGMA, C)
% This function creates a 2D Gaussian probability distribution function
%
% MU: Mean vector, its length must be equal to 2 (must be column vector)
% SIGMA: Covariance matrix, must be a 2x2 positive semi-definite matrix
% X and Y: Cartesian coordinates of the points at which the Gaussian will be computed
% C: color matrix in [R G B] form indicating the edge colors
I=length(X); J=length(Y);
%mu=[-1 1];
%Sigma = [.9 .4; .4 .3];
for i=1:I
for j=1:J
p(i,j) = mvnpdf([X(i) Y(j)],MU,SIGMA);
end
end
h=surf(X,Y,p');
if nargin==5
alpha(0.75);
set(h, 'facecolor', C, 'edgecolor', C, 'facealpha', 0.9); %Color the faces of the mesh plot according to C
title('The theoretical distribution of 2-D Gaussian - {\itp }({\bf x }| \omega_j )')
xlabel('Feature1'); ylabel('Feature2');
grid on
end
Example

[Figure: surface and contour plots of the theoretical distribution of the 2-D Gaussian, p(x|ωj), over Feature1 and Feature2.]
Case 2: Σi = Σ (Section 4.2.2 in Murphy). Demo: discriminant2.m

Covariance matrices are arbitrary, but equal to each other for all classes:

Σi = Σ = [ σ11² σ12² ⋯ σ1d² ; σ21² σ22² ⋯ σ2d² ; ⋮ ; σd1² σd2² ⋯ σdd² ]

Features then form hyper-ellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes:

gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + ln P(ωi)

which can be written as gi(x) = wi^T x + wi0, with wi = Σ^{-1} μi and wi0 = −½ μi^T Σ^{-1} μi + ln P(ωi)
Mahalanobis Distance
In this case, instances are classified based not on the minimum Euclidean distance, but on the minimum Mahalanobis distance.
Samples drawn from a 2-D Gaussian lie in a cloud centered around the mean μ. The quantity

$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$$

is known as the squared Mahalanobis distance from x to the mean of a group of points normally distributed as N(μ, Σ).
The contours of constant density are (hyper)ellipsoids of constant Mahalanobis distance from the mean.
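As a quick illustration (a Python sketch added here, not part of the original slides), the squared Mahalanobis distance follows directly from its definition, and shows how the covariance rescales distances along each axis:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    # Squared Mahalanobis distance r^2 = (x - mu)' Sigma^-1 (x - mu)
    d = x - mu
    return float(d @ np.linalg.inv(Sigma) @ d)

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])
# Two points at the same Euclidean distance from mu:
print(mahalanobis_sq(np.array([1.0, 0.0]), mu, Sigma))  # 0.5 (along the high-variance axis)
print(mahalanobis_sq(np.array([0.0, 1.0]), mu, Sigma))  # 2.0 (along the low-variance axis)
```

A step along a high-variance direction counts as a shorter distance, which is exactly why the contours of constant r are ellipsoids rather than circles.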
[Figure: contours of constant Mahalanobis distance r around the mean of a 2-D Gaussian.]
PR Case 2: $\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$ (continued)
Now note that the clusters are not spherical but ellipsoidal, because the covariance matrix is not diagonal.
Also note that unequal priors shift the decision boundary away from the more likely mean.
PR Case 2 Example
function discriminant2
% This function demonstrates the Bayesian classifier for data drawn from a
% Gaussian distribution, for the type 2 case, where the covariance matrix
% is arbitrary, but constant for each class. To see a different effect, change
% the parameters mu1, mu2, mu3, and Sigma.
% Robi Polikar, September 2005 - modified September 2007.
mu1=[3 2]'; mu2=[7 4]'; mu3=[2 5]';
prior1=1/3; prior2=1/3; prior3=1/3;
Sigma=[0.9 0.3; 0.3 0.4];
x=-2:0.1:10; y=-2:0.1:10;
p1=gauss2d(x, y, mu1', Sigma, [0 0 0]); hold on;
p2=gauss2d(x, y, mu2', Sigma, [1 0.3 0]);
p3=gauss2d(x, y, mu3', Sigma, [1 1 0.9]);
% Compute the discriminants for the equal-Sigma case (implements the linear
% discriminant on slide 58). Note that in this case the x'x term is removed,
% since it is independent of the class information.
for i=1:length(x)
    for j=1:length(y)
        g1(i,j)=(inv(Sigma)*mu1)'*[x(i); y(j)] - 0.5*mu1'*inv(Sigma)*mu1 + log(prior1);
        g2(i,j)=(inv(Sigma)*mu2)'*[x(i); y(j)] - 0.5*mu2'*inv(Sigma)*mu2 + log(prior2);
        g3(i,j)=(inv(Sigma)*mu3)'*[x(i); y(j)] - 0.5*mu3'*inv(Sigma)*mu3 + log(prior3);
    end
end
[Figure: surface plot of the three class-conditional densities, and a pseudocolor plot of the decision boundaries, over Feature 1 and Feature 2.]
PR Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary (Section 4.2.1 in Murphy)
Demo: discriminant3.m
All bets are off! No simplifications are possible to the general g(x). In the two-class case, the decision boundaries form hyperquadrics. The discriminant functions are now, in general, quadratic (not linear) and non-contiguous. This is then a quadratic classifier:

$$g_i(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

$$\mathbf{W}_i = -\frac{1}{2}\boldsymbol{\Sigma}_i^{-1}, \qquad \mathbf{w}_i = \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T \boldsymbol{\Sigma}_i^{-1}\boldsymbol{\mu}_i - \frac{1}{2}\ln|\boldsymbol{\Sigma}_i| + \ln P(\omega_i)$$

[Figure: examples of two-class decision boundaries for this case: hyperbolic, circular, parabolic, linear, and ellipsoidal.]
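The quadratic discriminant above can be sketched in Python (an illustration added here, not part of the slides; the function name is ours):

```python
import math
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    # Case 3: g_i(x) = x' W_i x + w_i' x + w_i0, with
    #   W_i  = -0.5 * Sigma_i^-1
    #   w_i  = Sigma_i^-1 mu_i
    #   w_i0 = -0.5 mu_i' Sigma_i^-1 mu_i - 0.5 ln|Sigma_i| + ln P(w_i)
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * math.log(np.linalg.det(Sigma)) + math.log(prior)
    return float(x @ W @ x + w @ x + w0)

# Sanity check: for a standard normal class, g differs from ln[p(x|w)P(w)]
# only by the class-independent constant -d/2 ln(2*pi).
g = quadratic_discriminant(np.array([1.0, 0.0]), np.zeros(2), np.eye(2), 0.5)
print(abs(g - (-0.5 + math.log(0.5))) < 1e-9)  # True
```

Since each class now carries its own Σᵢ, neither the quadratic term nor the ln|Σᵢ| term can be dropped, which is what produces the curved (hyperquadric) boundaries.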
PR Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary (continued)
Demo: discriminant3.m
For the multi-class case, the boundaries will look even more complicated, as in the example below.
[Figure: decision boundaries for a multi-class quadratic classifier.]
PR Case 3: $\boldsymbol{\Sigma}_i$ = arbitrary, in 3-D
[Figure: quadric decision surfaces in three dimensions.]
PR Example
[Figure: surface plot of three Gaussian class-conditional densities with arbitrary covariances, and the resulting quadratic decision boundaries, over Feature 1 and Feature 2.]
PR Are we done?
Since we now know the best classifier that can be built, are we done with PR? Can
we go home…?
Not quite: the Bayes classifier cannot be used if we don't know the probability distributions.
This is typically the rule, not the exception.
In most applications of practical interest, we do not know the underlying
distributions
The distributions can be estimated, if there is sufficient data
Sufficient ??? Make that “a ton of” , or better yet… “lots x 10(tons of data)”
Estimating the prior distribution is relatively easy; however, estimating the class-conditional distribution is difficult, and it gets giga-difficult as dimensionality increases… the curse of dimensionality.
If we know the form of the distribution, say normal (but of course, what else), but not its parameters, say mean and variance, the problem reduces from distribution estimation to parameter estimation.
If not, there are nonparametric density estimation techniques.
PR Naïve Bayes Classifier
There is one quick-and-dirty fix to the curse of dimensionality.
A highly practical solution to this problem is to assume class-conditional independence of the features, which yields the so-called naïve Bayes classifier:

$$p(\mathbf{x}\mid\omega_j) = \prod_{i=1}^{d} p(x_i\mid\omega_j) = p(x_1\mid\omega_j)\cdot p(x_2\mid\omega_j)\cdots p(x_d\mid\omega_j), \qquad \mathbf{x} = [x_1,\dots,x_d]$$

Note that this is not as restrictive a requirement as full feature independence:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p(x_i) = p(x_1)\,p(x_2)\cdots p(x_d)$$

The discriminant function corresponding to the naïve Bayes classifier is then

$$g_j^{NB}(\mathbf{x}) = P(\omega_j) \prod_{i=1}^{d} p(x_i\mid\omega_j)$$

The main advantage of this is that we only need to compute the univariate densities p(xᵢ|ωⱼ), which are much easier to compute (next week!) than the multivariate density p(x|ωⱼ).
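To see when the naïve factorization is exact, consider a quick numeric check (a Python illustration added here, not part of the slides): for a Gaussian with a diagonal covariance matrix, the product of the univariate likelihoods equals the full multivariate likelihood; with correlated features it does not.

```python
import math
import numpy as np

def gauss1d(x, mu, var):
    # univariate Gaussian density
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gauss_nd(x, mu, Sigma):
    # full multivariate Gaussian density
    d = len(x)
    diff = x - mu
    norm = math.sqrt((2 * math.pi) ** d * np.linalg.det(Sigma))
    return math.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

x = np.array([1.0, 2.0])
mu = np.array([0.0, 1.0])
naive = gauss1d(x[0], mu[0], 0.5) * gauss1d(x[1], mu[1], 2.0)

# Exact when the covariance is diagonal (features uncorrelated)...
print(abs(naive - gauss_nd(x, mu, np.diag([0.5, 2.0]))) < 1e-12)  # True
# ...only an approximation when it is not:
print(abs(naive - gauss_nd(x, mu, np.array([[0.5, 0.4], [0.4, 2.0]]))) > 1e-3)  # True
```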
PR Naïve Bayes Classifier (NBC)
So, the NBC assumes that the features are class-conditionally independent. Is this a reasonable assumption?
For most applications, probably not. However, in practice, naïve Bayes has been shown to have respectable performance, comparable to that of neural networks, even when this assumption is clearly violated.
"How come…?" you ask…
One reason is that the model is quite simple!
• For C classes and d features, the computational complexity of the NBC is 𝒪(Cd)
• This makes the NBC very resilient to over-fitting, the Achilles' heel of more sophisticated classifiers.
PR Naïve Bayes Classifier (Known Distribution Type)
So, given some training data, how do we implement the naïve Bayes classifier?
If you can assume that the class-conditional features follow a specific distribution, e.g., the Gaussian distribution, then implement the following pseudocode.
Let x = [x₁ x₂ … x_d] be a d-dimensional training instance.
for j = 1,…,c
  for i = 1,…,d
    (1) Compute the mean and variance of xᵢ over all instances from class ωⱼ:
        $$\mu_{ij} = \frac{1}{N_{ij}} \sum_{\mathbf{x}\in\omega_j} x_i, \qquad \sigma_{ij}^2 = \frac{1}{N_{ij}} \sum_{\mathbf{x}\in\omega_j} (x_i - \mu_{ij})^2 \qquad (1)$$
    (2) For a test instance, use these parameters in the Gaussian density to compute the likelihood p(xᵢ|ωⱼ).
  (3) Compute the discriminant g_j^{NB}(x) = P(ωⱼ) ∏ᵢ p(xᵢ|ωⱼ).
(4) Assign the test instance to the class with the largest g_j^{NB}(x).
If the likelihoods follow a different, non-Gaussian (but of known type, say chi-square) distribution, compute the parameters of that distribution for (1) and use the definition of that distribution for (2).
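The pseudocode above translates almost line for line into Python (a sketch with illustrative names, added here; the course's own demos are in Matlab). Log-likelihoods are summed instead of multiplying raw densities, a standard trick to avoid numerical underflow for large d:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    # Step (1): per-class, per-feature means and variances, plus class priors
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0), len(Xc) / len(X))
    return params

def predict_gaussian_nb(x, params):
    # Steps (2)-(4): Gaussian likelihoods, discriminant, argmax over classes
    best, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        loglik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = loglik + np.log(prior)
        if score > best_score:
            best, best_score = c, score
    return best

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([5, 5], 1, (100, 2))])
y = np.repeat([0, 1], 100)
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([5.0, 5.0]), params))  # 1
```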
PR Naïve Bayes Classifier
(Unknown Distribution type)
If the form of the distribution is not known, then it must be estimated. Use either one of the approaches discussed in the density estimation lecture (such as Parzen windows, k-nearest neighbor, etc.), or simply follow the histogram approach shown earlier (slide 29):
For each class, look at the minimum and maximum values of each feature
Divide that range into a meaningful number of bins, based on the number of training data
• Optimize the number of bins vs. the number of instances in each bin
• The larger the number of bins, the finer the resolution of the estimate
• The larger the number of instances in each bin, the more accurate (and smoother) the estimate will be
• For a given dataset size, you can only maximize one or the other.
Create the histogram by counting the number of instances that fall into each bin
The normalized histogram is the estimate of the class-conditional likelihood of that feature.
Follow steps (3) and (4) of the pseudo code in the previous slide.
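As an illustration of the histogram step (a Python sketch added here, not from the slides), a normalized histogram of one feature's values within one class serves as the likelihood estimate:

```python
import numpy as np

def histogram_likelihood(feature_values, n_bins):
    # Estimate p(x_i | w_j) for one feature of one class as a normalized histogram.
    # Returns the bin edges and the density estimate per bin.
    density, edges = np.histogram(feature_values, bins=n_bins, density=True)
    return edges, density

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 1000)   # one feature, one class
edges, density = histogram_likelihood(samples, n_bins=20)
# A valid density estimate integrates to 1:
print(np.isclose(np.sum(density * np.diff(edges)), 1.0))  # True
```

Raising n_bins with the same 1000 samples gives finer resolution but fewer samples per bin, which is exactly the trade-off described in the bullets above.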
PR Naïve Bayes in Matlab
Matlab's Statistics Toolbox has a built-in naïve Bayes classifier.
This is not a standard Matlab function, however, but rather an object, and you need to use it as such. See the examples in the Matlab documentation by typing
>> help NaiveBayes
NaiveBayes class: Naive Bayes classifier
A NaiveBayes object defines a naive Bayes classifier. A naive Bayes classifier assigns a new observation to the most probable class, assuming the features are conditionally independent given the class value.
Construction:
• NaiveBayes: Create NaiveBayes object
Methods (what you can do with objects of this class):
• disp: Display NaiveBayes classifier object
• display: Display NaiveBayes classifier object
• fit: Create naive Bayes classifier object by fitting training data
• posterior: Compute posterior probability of each class for test data
• predict: Predict class label for test data
Properties (the information / fields saved by each instance of this object):
• CIsNonEmpty: Flag for non-empty classes
• ClassLevels: Class levels
• CNames: Class names
• Prior: Class priors
• Dist: Distribution names
• NClasses: Number of classes
• NDims: Number of dimensions
• Params: Parameter estimates
For example, nb = NaiveBayes.fit(data, labels) fits a naïve Bayes classifier object to the training data and saves the result in the object nb. The object nb has several fields carrying different types of information; for example, nb.NClasses gives you the number of classes that the nb object recognizes (was trained on). To use the NaiveBayes object to classify test data, call test_labels = predict(nb, test_data), or alternatively test_labels = nb.predict(test_data). Similarly, [post, cpre] = posterior(nb, test) returns post, the posterior probabilities of each of the instances (rows) of the test dataset test, along with the class cpre to which each is assigned.
PR Using Matlab's NaiveBayes: naivebayes_ocr_demo.m
% This function demonstrates the functionality of the Naive Bayes classifier
% on the 10-class, 62-feature OCR database. Uses Matlab 2012a.
% Robi Polikar, September 2012
opt_class = vec2ind(opt_class);
opttest_class = vec2ind(opttest_class);
% Create, compute and plot the confusion matrix
[C, CM, ind, performance] = confusion(true_labels, predicted_labels);
plotconfusion(true_labels, predicted_labels)
[Figure: confusion matrix (Output Class vs. Target Class) produced by plotconfusion; for example, class 6 is recognized with 57.8% accuracy.]
PR Naïve Bayes Example (1)
naivebayes.m
Mu1 = [-1, 1]; Mu2 = [2, 3]; Mu3 = [3, 0];
Sigma1 = [.9 .1; .1 .6];
Sigma2 = [1 -0.5; -0.5 1];
Sigma3 = [.3 0.8; 0.8 5];
tr_COUNT = [500 500 500];
[Figures: (a) randomly generated Gaussian data from the three classes above; (b) decision boundaries generated by a naive Bayes classifier implemented from scratch; (c) decision boundaries generated by Matlab's naive Bayes. Observe the effect of the change in the prior probabilities!]
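The data setup used in this example can be reproduced in Python as well (a sketch added here for illustration; the slide's variable names are kept):

```python
import numpy as np

rng = np.random.default_rng(0)
Mu1, Mu2, Mu3 = [-1, 1], [2, 3], [3, 0]
Sigma1 = np.array([[0.9, 0.1], [0.1, 0.6]])
Sigma2 = np.array([[1.0, -0.5], [-0.5, 1.0]])
Sigma3 = np.array([[0.3, 0.8], [0.8, 5.0]])
tr_COUNT = [500, 500, 500]

# Draw 500 samples from each of the three 2-D Gaussians
X = np.vstack([rng.multivariate_normal(m, S, n)
               for m, S, n in zip([Mu1, Mu2, Mu3], [Sigma1, Sigma2, Sigma3], tr_COUNT)])
y = np.repeat([1, 2, 3], tr_COUNT)
print(X.shape, y.shape)  # (1500, 2) (1500,)
```

Changing the entries of tr_COUNT changes the empirical priors, which is what produces the boundary shifts highlighted in the figure.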
PR Error Bounds
The Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute; oftentimes even non-Gaussian cases are treated as Gaussian.
Red stars on titles indicate graduate-level material only. Grad students: please read the error bounds in chapter 2 of DHS.
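For two Gaussian classes, the Bhattacharyya bound from chapter 2 of DHS can be computed directly. The sketch below is an illustration added here, not part of the slides:

```python
import numpy as np

def bhattacharyya_bound(mu1, Sigma1, mu2, Sigma2, P1, P2):
    # P(error) <= sqrt(P1*P2) * exp(-k(1/2)), where for Gaussians
    # k(1/2) = 1/8 (mu2-mu1)' [(S1+S2)/2]^-1 (mu2-mu1)
    #          + 1/2 ln( |(S1+S2)/2| / sqrt(|S1| |S2|) )
    S = (Sigma1 + Sigma2) / 2
    dmu = mu2 - mu1
    k = (dmu @ np.linalg.inv(S) @ dmu) / 8 + 0.5 * np.log(
        np.linalg.det(S) / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2)))
    return np.sqrt(P1 * P2) * np.exp(-k)

# Identical classes give the trivial bound 0.5; separating the means tightens it:
I = np.eye(2)
b_same = bhattacharyya_bound(np.zeros(2), I, np.zeros(2), I, 0.5, 0.5)
b_far = bhattacharyya_bound(np.zeros(2), I, np.array([4.0, 0.0]), I, 0.5, 0.5)
print(b_same)        # 0.5
print(b_far < 0.1)   # True
```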
PR Exercise
PR Midterm Project – I
Due: Sept 27
Choose 10 datasets from the UCI repository, including Wine, Iris, Vehicle, Optical Recognition of Handwritten Digits, and Gas Sensor Array.
Using built-in Matlab functions or by coding on your own, implement the kNN, naïve Bayes, SVM, MLP, CART and AdaBoost (with CART) algorithms.
• Read about their implementations
• What parameters do they have? What did you use / choose, and why?
• Perform appropriate cross-validation and statistical analysis. Which algorithms work better?
BONUS
Use your favorite algorithm on any of the publicly available face recognition datasets and see if you can get good results.