
PR

Dr. Robi Polikar


Lecture 2
Top 10 Algorithms in Machine Learning

Bayes Decision Theory

Bayes theorem
A priori / A posteriori prob.
Loss function
Bayes decision rule
Min. error rate classification
Discriminant functions
Error bounds and prob.
Computational Intelligence & Pattern Recognition © Robi Polikar, 2013 Rowan University, Glassboro, NJ
PR Today in PR
 Top 10 algorithms in machine learning
 Bayes theorem
 Bayes Decision Theory
 Bayes rule
 Loss function & expected loss
 Minimum error rate classification
 Classification using discriminant functions
 Error bounds & probabilities

Image / photo credits:


A E. Alpaydin, Introduction to Machine Learning, MIT Press, 2010
B C. Bishop, Machine Learning & Pattern Recognition, Springer, 2006
D Duda, Hart & Stork, Pattern Classification, 2/e Wiley, 2000
G R. Gutierrez-Osuna, Lecture Notes, Texas A&M - http://research.cs.tamu.edu/prism/rgo.htm
RP Original graphic created / generated by Robi Polikar – All Rights Reserved © 2001 – 2013. May be used with permission and citation.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR

X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, et al., “Top 10 Algorithms in Data Mining,” Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
* C4.5 is listed as one of the top 10 in the Wu et al. paper. Dr. Polikar disagrees with this, as C4.5 is a variant of CART; the MLP is a far more
deserving classifier to be in the top 10. Also, note that J. Quinlan, the creator of C4.5, is one of the authors of this paper.
PR K-Nearest Neighbor
Given a set of labeled training points, a test instance should be
given the label that appears most often in its surroundings.

[Figure: k-NN example with k = 11; axes: Sensor 1 measurements (feature 1) vs. Sensor 2 measurements (feature 2); points show measurements from classes 1-4.]
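As an illustration of the rule (not part of the original deck), a minimal from-scratch MATLAB sketch, assuming Xtrain is an N×d matrix of labeled training points, ytrain holds their class labels, and xtest is a 1×d test instance:

% Minimal k-NN sketch (assumed variable names Xtrain, ytrain, xtest).
k = 11;                                                        % number of neighbors, as in the figure
d2 = sum((Xtrain - repmat(xtest, size(Xtrain,1), 1)).^2, 2);   % squared Euclidean distances
[~, idx] = sort(d2);                                           % order training points by distance
neighborLabels = ytrain(idx(1:k));                             % labels of the k closest points
predictedClass = mode(neighborLabels);                         % majority vote among the neighbors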

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes
Given an observation x, the correct class ω is the one that
maximizes the posterior probability P(ω|x).
The posterior is computed using the Bayes rule from
• the prior probability P(ω), the probability of class ω occurring in general, and
• the likelihood p(x|ω), the probability of the observed value of x occurring in class ω:

P(ω_j|x) ∝ p(x|ω_j) ⋅ P(ω_j)

For x ∈ ℝ^d, p(x|ω_j) is a d-dimensional joint
probability that is difficult to compute for large d.
However, if the features are conditionally independent:

p(x|ω_j) = ∏_{i=1}^{d} p(x_i|ω_j) = p(x_1|ω_j) ⋅ p(x_2|ω_j) ⋅ ⋯ ⋅ p(x_d|ω_j)

then p(x|ω_j) is just a product of one-dimensional
individual likelihoods, which is much easier to compute.
This is the “naïve” assumption made by NB.
If the distributional form is also assumed (usually
Gaussian), then NB is very easy to implement.
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR K-Means
Instances that belong to the same class/cluster should “look
alike,” i.e., be located in close proximity to each other.
 K-means iteratively partitions the data into k clusters,
each centered around its cluster center, in such a way
that the within-cluster distance (the sum of distances of
all instances to their cluster center), summed over all
clusters, is minimized.
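A minimal sketch of the k-means iteration (from scratch, for illustration; X is an assumed N×d data matrix and k the number of clusters):

% Minimal k-means sketch: alternate between assigning points to the nearest center
% and recomputing each center as the mean of its assigned points.
k = 3; maxIter = 100;
centers = X(randperm(size(X,1), k), :);              % initialize centers from k random data points
for iter = 1:maxIter
    D = zeros(size(X,1), k);                         % squared distance of every point to every center
    for j = 1:k
        D(:,j) = sum((X - repmat(centers(j,:), size(X,1), 1)).^2, 2);
    end
    [~, assign] = min(D, [], 2);                     % assign each point to its closest center
    for j = 1:k
        if any(assign == j)
            centers(j,:) = mean(X(assign == j, :), 1);   % update each center
        end
    end
end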

M K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012


Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Classification &
Regression Trees
CART creates a “decision tree,” basically a hierarchical organization
of IF-THEN rules, based on the values of each feature.
 The tree starts at the root node, which represents the first question
(rule) to be answered by the tree. Usually, the root is associated
with the most important attribute (feature).
 The root is connected to other internal nodes with directional
links, called branches, that represent the values of the attributes
for that node. Each decision made at a node splits the data through
the branches. A leaf (or terminal) node is where no further
split occurs, and each leaf is associated with a category label, i.e., a class.
 CART progressively evaluates all features, determines the most
informative one and the critical value at which to split, i.e., the split that
provides the best classification. Impurity-based criteria (entropy /
information gain, or CART’s Gini index) are used for this purpose.

[Figure (D): example decision tree.]
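As a from-scratch illustration of the split-selection step (not the deck's code; x, y, and the candidate threshold t are assumed inputs), one can compare the entropy of the parent node with the weighted entropy of the two children produced by the split:

% Evaluate one candidate split "feature value <= t" using entropy as the impurity measure.
% x: N x 1 values of one feature, y: N x 1 class labels, t: candidate threshold (assumed inputs;
% t should leave both children non-empty).
entropyOf = @(labels) -sum((histc(labels, unique(labels))/numel(labels)) .* ...
                           log2(histc(labels, unique(labels))/numel(labels)));
left = y(x <= t);  right = y(x > t);
wL = numel(left)/numel(y);  wR = numel(right)/numel(y);
infoGain = entropyOf(y) - (wL*entropyOf(left) + wR*entropyOf(right));   % larger gain = better split

CART would scan all features and all candidate thresholds, pick the split with the largest gain (or, equivalently, the largest decrease in Gini impurity), and recurse on the two children.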
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR C4.5 / C5.0/See5
 C4.5 is a successor of ID3, and is itself followed by its successor C5.0 (See5). It is essentially
CART with different splitting criteria.
 Differences between the two, which in my view are minor:
• C4.5 uses information theory based splitting criteria
instead of CART’s Gini index
• CART creates binary trees (binary splits), whereas
C4.5 can handle multiple outcomes
• C4.5 has a more efficient pruning mechanism
• CART can handle unequal misclassification costs.
 Could the fact that R. Quinlan (the creator of C4.5)
is one of the authors of the Wu et al. paper be the
reason why C4.5 was included in their list?

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Multilayer
Perceptron (MLP)
Mimics the structure of a physiological neural network: a massively
interconnected web of neurons, where each neuron
performs a relatively simple function.
 Each neuron (node) computes a weighted sum of its inputs,
and then passes that sum through a nonlinear thresholding
function. The neuron “fires” (or not)
based on the output of the thresholding function.
 The optimal weights are determined using
gradient descent optimization.
[Figure: an MLP with d input nodes x_1,…,x_d, H hidden-layer nodes y_j, and c output nodes z_k; W_ij are the input-to-hidden weights and W_jk the hidden-to-output weights. Inset: gradient descent on the error criterion J(w).]

y_j = f(net_j) = f( Σ_{i=1}^{d} w_ji x_i )        z_k = f(net_k) = f( Σ_{j=1}^{H} w_kj y_j )
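A tiny sketch of the forward pass these equations describe (toy sizes and random weights, for illustration only; bias terms and the backpropagation / gradient-descent update are omitted):

% Forward pass of a single-hidden-layer MLP with a sigmoid thresholding function.
f   = @(net) 1./(1 + exp(-net));     % nonlinear thresholding (activation) function
x   = rand(4,1);                     % d = 4 input features (toy example)
Wji = randn(3,4);                    % H = 3 hidden nodes; row j holds the weights w_ji
Wkj = randn(2,3);                    % c = 2 output nodes; row k holds the weights w_kj
y = f(Wji*x);                        % hidden outputs:  y_j = f( sum_i w_ji * x_i )
z = f(Wkj*y);                        % network outputs: z_k = f( sum_j w_kj * y_j )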

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
Support Vector
PR
Machines
In a two-class linear classification
problem, the best decision boundary is the one that
maximizes the margin between the classes.
SVM uses quadratic
programming to find
this optimal boundary.
 This may not seem too
terribly useful, since
most problems are
not linear.
TK Theodoridis & Koutroumbas, Pattern Recognition, 4/e, Academic Press.

 But the SVM’s biggest feat is its transformation, the
so-called “kernel trick,” which allows a non-linear problem
to be solved in a high-dimensional space, where it becomes linear,
without doing any high-dimensional calculations!!!
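A tiny numerical illustration of the kernel trick (a toy example, not from the deck): for 2-D inputs, the polynomial kernel k(u,v) = (u'v + 1)^2 equals the inner product of an explicit second-order feature map, so the "high-dimensional" inner product is obtained without ever computing the mapping.

% Kernel trick illustration: (u'*v + 1)^2 equals phi(u)'*phi(v) for an explicit 6-D feature map phi.
u = [1; 2];  v = [3; -1];
phi = @(a) [a(1)^2; a(2)^2; sqrt(2)*a(1)*a(2); sqrt(2)*a(1); sqrt(2)*a(2); 1];
k_direct   = (u'*v + 1)^2;        % kernel evaluated in the original 2-D space
k_explicit = phi(u)'*phi(v);      % same value via the explicit high-dimensional mapping
disp([k_direct k_explicit])       % both are 4, since u'*v = 1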

G R. Gutierrez-Osuna, Lecture Notes
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Expectation – Maximization
Gaussian Mixture Models
An extremely versatile optimization algorithm,
EM is an iterative approach that cycles between the expectation
(E) and maximization (M) steps to find the parameter estimates
of a statistical model.
 Designed for parameter estimation (determining the
values of unknown parameters θ of a model), and commonly
used in conjunction with other algorithms, such as k-means,
Gaussian Mixture Models (GMMs), hierarchical mixtures of
experts, or in missing data analysis.
 In the E-step, the expected value of a likelihood function (the
figure of merit in determining the true values of the unknown
parameters) is computed under the current estimate of the
unknown parameters θ (that are to be estimated).
 In the M-step, a new estimate of θ is computed such that this new
estimate maximizes the current expected likelihood. The E and M steps
are then iterated until convergence.
 In GMMs, data are modeled using a weighted combination of
Gaussians, and EM is used to determine the Gaussian parameters,
as well as the mixing coefficients (the mixing weights).
 This will become clearer when we discuss the density
estimation problem.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR A Priori
Evaluates a dataset of lists to
determine which items appear together, i.e., learns
the associations among the items in the lists.
 Apriori is a breadth-first search that uses hash tree
structures to quickly search large datasets.
 It is an iterative search: start with single items whose
frequency of occurrence exceeds a threshold, called
the minimum support. Then form all pairs of items
that include those single items (called the candidate
lists), and scan the dataset to determine those pairs
whose frequency of occurrence exceeds the
threshold. Continue with triplets, quadruplets, etc.
 The fundamental premise: any item or list of items
whose frequency of occurrence falls below the
threshold cannot be part of a frequent superset that includes
these items. This is how Apriori limits the search
space.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR PageRankTM
The importance of a web page is proportional to the number of links pointing to it from other
web pages, as well as to the importance of those web pages. A page P that receives links from many
web pages gets a higher PageRank. If those links come
from pages with high PageRank themselves, then P
receives an even higher PageRank.
 This is the original algorithm used by Google
for ranking its search results. Currently, it is
only part of the (undisclosed) algorithm used
by Google.
 PageRank is named after its inventor Larry
Page. The fact that it is a “page ranking”
algorithm is a convenient coincidence.
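A sketch of the underlying computation, as commonly described (power iteration on a toy 4-page link graph; the graph and the damping factor 0.85 are illustrative assumptions, not taken from the patent):

% PageRank by power iteration on a toy web graph. L(i,j) = 1 if page j links to page i.
L = [0 0 1 1;
     1 0 0 0;
     1 1 0 0;
     0 1 1 0];
n = size(L,1);  d = 0.85;                        % damping factor (commonly used value)
M = L ./ repmat(sum(L,1), n, 1);                 % column-stochastic matrix: each page splits its vote
r = ones(n,1)/n;                                 % start from a uniform importance vector
for iter = 1:100
    r = (1-d)/n + d*M*r;                         % pages inherit importance from pages linking to them
end
disp(r')                                         % PageRank scores; they sum to 1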

See US Patent #6,285,999


http://en.wikipedia.org/wiki/Pagerank http://www.google.com/patents?vid=6285999

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR AdaBoost
Combines the decisions of an
ensemble of classifiers to reduce the likelihood of having
chosen a poorly trained classifier.
 Conceptually similar to seeking several opinions before
making an important decision.
 Based on the premise that there is increased confidence
that a decision agreed upon by many (experts, reviewers,
doctors, “classifiers”) is usually correct.
 AdaBoost generates an ensemble of classifiers using a
given “base model,” which can be any supervised
classifier. The accuracy of the ensemble, based on
weighted majority voting of its member classifiers, is
usually higher than that of a single classifier of that type.
 The weaker the base classifier (the poorer its
performance), the greater the impact of AdaBoost.
 AdaBoost trains the ensemble members on different
subsets of the training data. Each additional classifier is
trained with data that are biased towards those instances
that were misclassified by the previous classifier → a focus
on increasingly difficult-to-learn samples.
 AdaBoost turns a dumb classifier into a smart one!
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR A comparison
 …but don’t take my word for it:
 Elements of Statistical Learning, Hastie,
Tibshirani and Friedman, Springer, 2009.
http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Probability Theory in one slide
 Here are the most important things to know in probability:
 Probabilities are nonnegative and normalize to 1:
P(x) ≥ 0,   Σ_x P(x) = 1   (discrete)        p(x) ≥ 0,   ∫ p(x) dx = 1   (continuous)
 If you have two r.v. X and Y, they have a joint distribution P(X, Y):
P(X,Y) = P(Y,X),     0 ≤ P(X = x_i, Y = y_j) ≤ 1,     Σ_i Σ_j P(X = x_i, Y = y_j) = 1,   or   ∬ p(x,y) dy dx = 1
• If X and Y are independent (and only then) → P(X,Y) = P(X) P(Y)
• The sum rule: the marginal probability of a single r.v. can always be obtained by summing
(integrating) the joint pdf over all values of all other variables, for example
p(x) = ∫ p(x,y) dy,     P(y) = Σ_i P(X = x_i, Y = y),     p(x) = ∬ p(x,y,z) dy dz,     e.g.,  P(A,C) = Σ_B Σ_D Σ_E P(A,B,C,D,E)
• The product rule: the joint probability can always be obtained by multiplying the conditional
probability (conditioned on one of the variables) with the marginal probability of the conditioning
variable:  P(X,Y) = P(Y|X) P(X) = P(X|Y) P(Y)
• which gives rise to the Bayes rule:  P(Y|X) = P(X|Y) P(Y) / P(X)  ∝  P(X|Y) P(Y)

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Rule
 We pose the following question: given that event A (e.g., the observation of some data) has
occurred, what is the probability that any single one of the events B_j occurs (i.e., that the correct class
is one of the category choices)?

P(B_j|A) = P(A ∩ B_j) / P(A) = P(A|B_j) ⋅ P(B_j) / Σ_{k=1}^{N} P(A|B_k) ⋅ P(B_k)

This is known as the Bayes rule, and is one of the most
important cornerstones of machine learning.
Rev. Thomas Bayes (1702-1761)
The denominator, a summation over all values of B, is just a normalization constant, ensuring
that all probabilities add up to 1.
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayesian Way
of Thinking
 The classic war in statistics: Frequentists vs. Bayesians
 They cannot even agree on the meaning of probability
• Frequentist: the expected likelihood of an event, over a long run: 𝑃(𝐴) = 𝑛/𝑁.
• Bayesian: measure of plausibility of an event happening, given an observation
providing incomplete data, and previous (sometimes / possibly subjective) degree of
belief (known as the prior, or a priori probability)
 Many phenomena of random nature can be explained by the frequentist definition
of probability:
• The probability of hitting the jackpot in NJ State Lottery;
• The probability that there will be at least 20 non-rainy days in Glassboro in September;
• The probability that at least one student will fail this class;
• The probability that the sum of two random cards will be 21;
 …but some cannot!
• The probability that a meteor will end life on Earth in the next 100 years ;
• The probability that the war in Afghanistan will end by 2014;
• The probability that there will be another major recession in the next 10 years;
• The probability that Rowan will have two medical schools by 2013 (imagine this
question being asked in 2000, and again in 2009).
 Yet, you can make approximate estimations of such probabilities
 You are following a Bayesian way of thinking to do so

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayesian Way
of Thinking
 In Bayesian statistics, we compute the probability based on three pieces of information:
 Prior: Our (subjective?) degree of belief that the event is plausible in the first place.
 Likelihood: The probability of making an observation, under the condition that the
event has occurred: how likely is it to observe what I just observed, if event A did in
fact happen (or, how likely is it to observe this outcome, if A [class 𝜔𝐴 ] were true). Likelihood
describes what kind of data we expect to see in each class.
 Evidence: The probability of making such an observation.
 It is the combination of these three that gives the probability of an event, given that an
observation (however incomplete information it may provide) has been made. The
probability computed based on such an observation is then called the posterior probability.
 Given the observation, the Bayesian thinking updates the original belief (the prior) based
on the likelihood and evidence.
posterior = (likelihood × prior) / evidence

 Sometimes, the combination of evidence and likelihood is so compelling that it can
override our original belief.
• Recall the Rowan medical school example: as of 2010, there was only some chatter about the
possibility of such a school → prior: very low. In 2011 we see a building start construction →
likelihood P(building | medical school): very high. This high likelihood trumps our low
prior → posterior P(medical school | building in place): very high!
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR On Frequentists & Bayesians

 http://www.statisticalengineering.com/frequentists_and_bayesians.htm
 http://www25.brinkster.com/ranmath/bayes02.htm
 http://en.wikipedia.org/wiki/Bayesian_probability
 http://en.wikipedia.org/wiki/Frequency_probability

 

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR What Can Bayes

do for You?
How about $9.5 billion?
A success story: Mike Lynch, the “Bayesian Millionaire”,
founded his company (Autonomy) in 1991. Developed
systems for
• matching fingerprints for the Essex
police force
• reading car number plates

Autonomy has been estimated at £ 4.7 billion (!!!)


Slide inspired by and/or courtesy of Dr. L. I. Kuncheva

Thomas Bayes: an obscure 18th century clergyman and statistician who published 2 minor works,
but who nowadays is “more important than Marx and Einstein put together”…
Telegraph Magazine, 3 February, 2001

WHY?
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Theorem for
 
Spam Filtering!
Thomas Bayes: An obscure 18th century clergyman and statistician who published 2 minor works
but who nowadays is “more important than Marx and Einstein put together”…
Telegraph Magazine, 3 February, 2001

[My favorite fellow of the Royal Society is the Reverend Thomas Bayes, an obscure 18th-century
Kent clergyman and a brilliant mathematician who] devised a complex equation known as the
Bayes theorem, which can be used to work out probability distributions. It had no practical
application in his lifetime, but today, thanks to computers, is routinely used in the modelling of
climate change, astrophysics and stock-market analysis.
Bill Bryson
Quoted in Max Davidson, 'Bill Bryson: Have faith, science can solve our problems', Daily Telegraph (26 Sep 2010)

 While we in pattern recognition have long known the virtues of the Bayes theorem, it has
recently been made popular by its success in spam filters

 …and you can do it too. Step-by-step instructions available in Wikipedia:


 http://en.wikipedia.org/wiki/Bayesian_spam_filtering

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Classifier
 Statistically, the best classifier you can build !!!
 Based on quantifying the tradeoffs between various classification decisions using
a probabilistic approach
 The theory assumes:
 Decision problem can be posed in probabilistic terms
 All relevant probability values are known or can be estimated (in practice this is
not true)
 Back to our fish example:
 Assume that we know the probabilities of observing sea bass and salmon, P(ω_1)
and P(ω_2), for a particular fishing location and time of year
• Prior probability
 Based on this information, how would you guess the type of the next fish to be
caught?
ω = ω_1  if  P(ω_1) > P(ω_2);   ω = ω_2  if  P(ω_2) > P(ω_1)        A reasonable decision rule?
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Setting this up
 We now make an observation – say the length and width of the fish caught: 14 inches by 6
inches
 Random variables, say X_1 (length): x_1 = 14; and X_2 (width): x_2 = 6
 How to use this information?
• There are two possibilities: the fish is a sea bass → ω_1, or a salmon → ω_2
 Probabilistically, then, with x = [x_1, x_2]^T:
ω = ω_1  if  P(ω = ω_1 | x_1, x_2) > 0.5, ω_2 otherwise      or      ω = ω_1  if  P(ω = ω_1 | x) > P(ω = ω_2 | x), ω_2 otherwise
 So how do we compute P(ω_1|x) and P(ω_2|x)?
• The posterior probability
 This can be set in the Bayesian framework, which computes the probability conditioned on
one variable from the probability conditioned on the other variable:

P(ω_j|x) = p(x|ω_j) ⋅ P(ω_j) / p(x) = p(x|ω_j) ⋅ P(ω_j) / Σ_{k=1}^{C} p(x|ω_k) ⋅ P(ω_k)

 But then, what is p(x|ω_1)?
• The likelihood
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Setting This Up
 Notice first that the following hold:
P(ω = ω_1) + P(ω = ω_2) = 1        There are only two possibilities in our case; the fish on the conveyor belt is either a salmon or a sea bass → the sum of their probabilities must add up to 1.
P(ω = ω_1|x) + P(ω = ω_2|x) = 1    This fact does not change, even after we make an observation (say the length, or length, width, color, etc.)
Evidence:
p(x) = p(x|ω = ω_1) P(ω = ω_1) + p(x|ω = ω_2) P(ω = ω_2)
Think in reverse, in terms of the measurement, say length = 14. The probability that the length of any
given fish is 14 can come from two sources: it is either a sea bass (and its length is 14), or it is a salmon (and its length is 14).
 These apply even if we have multiple classes and many features:
Σ_{j=1}^{C} P(ω = ω_j) = Σ_{j=1}^{C} P(ω_j) = 1                    Sum of the prior probabilities of all classes must add up to 1
Σ_{j=1}^{C} P(ω_j|x) = P(ω_1|x) + ⋯ + P(ω_C|x) = 1                 Sum of the posterior probabilities of all classes must add up to 1
p(x) = Σ_{j=1}^{C} p(x|ω_j) P(ω_j) = p(x|ω_1) P(ω_1) + ⋯ + p(x|ω_C) P(ω_C)    The evidence is the total probability of observing x, summed over all classes
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Rule
In Pattern Recognition
 Suppose we know P(ω_1), P(ω_2), p(x|ω_1) and p(x|ω_2), and that we have observed the
value of the feature (a random variable) x, say, the length, and it is 14 inches…
 How would you decide on the “state of nature” – type of fish, based on this info?
 Bayes theory allows us to compute the posterior probabilities from prior and class-
conditional probabilities
Likelihood: The (class-conditional) probability of observing a feature value Prior Probability: The total probability of
of x, given that the correct class is ωj, or what kind of data do we expect to correct class being class ωj determined
see in class 𝜔𝑗 . All things being equal, the category with higher class based on prior experience (before an
conditional probability is more “likely” to be the correct class. observation is made)

P(ω_j|x) = p(x|ω_j) ⋅ P(ω_j) / p(x) = p(x|ω_j) ⋅ P(ω_j) / Σ_{k=1}^{C} p(x|ω_k) ⋅ P(ω_k)

Posterior Probability: the (conditional) probability of the correct class being ω_j, given that feature value x has been observed. Based on the measurement (observation), the probability of the correct class being ω_j has shifted from P(ω_j) to P(ω_j|x).        Evidence: the total probability of observing the feature value x. Serves as a normalizing constant, ensuring that the posterior probabilities add up to 1.
A Bayes classifier decides on the class ω_j that has the largest posterior probability.
The Bayes classifier is statistically the best classifier one can possibly construct. Why?
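A quick numerical sketch of this computation (the likelihood values below are made up for illustration; the priors follow the 2/3 - 1/3 example used later):

% Bayes rule with made-up numbers: two classes and one observed feature value x.
prior      = [2/3  1/3];                        % P(w1), P(w2)
likelihood = [0.05 0.20];                       % p(x|w1), p(x|w2) at the observed x (assumed values)
evidence   = sum(likelihood .* prior);          % p(x) = sum_k p(x|wk) P(wk)
posterior  = (likelihood .* prior) / evidence;  % P(wk|x); the posteriors add up to 1
[~, decision] = max(posterior);                 % Bayes decision: the class with the largest posterior
fprintf('P(w1|x)=%.3f, P(w2|x)=%.3f -> decide w%d\n', posterior(1), posterior(2), decision);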
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR How do we compute
Class Conditional Probabilities?
ω_1: Sea bass → p(x|ω_1): class-conditional probability (likelihood) for sea bass
ω_2: Salmon → p(x|ω_2): class-conditional probability (likelihood) for salmon
Likelihood: for example, given that a salmon (ω_2) is observed, what is the probability that this salmon’s
length is between 11 and 12 inches? Or simply, what is the probability that a salmon’s length is between
11 and 12 inches? Or, how likely is it that a salmon is between 11 and 12 inches?

P(ω_j|x) = p(x|ω_j) ⋅ P(ω_j) / p(x) = p(x|ω_j) ⋅ P(ω_j) / Σ_{k=1}^{C} p(x|ω_k) ⋅ P(ω_k)

To find the likelihood, let’s approximate the continuous-valued distribution with a histogram.
This is the kind of data we expect to see in class ω_1 and class ω_2.

[Figure (D, RP): class-conditional histograms of length for ω_1 (sea bass) and ω_2 (salmon).]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Posterior Probabilities
 Bayes rule allows us to convert the likelihood, prior and evidence (easier to determine) into
the posterior probability (difficult to determine directly).
 If in fact I observed a fish that is 14 inches long, I can now ask: given that I observed a 14’’
long fish, what is the probability that it is a sea bass? That it is a salmon? The answer is the
posterior probability of these classes. Of course, we choose the class with the larger
posterior probability.

[Figure (D)]: Posterior probabilities for priors P(ω_1) = 2/3 and P(ω_2) = 1/3. For example, given that a
pattern is measured to have feature value x = 14, the probability it is in category ω_2 is roughly 0.08,
and that it is in ω_1 is 0.92. At every x, the posteriors sum to 1.0.

Which class would you choose now?
How good is your decision? What is your probability of making an error with this decision?

P(error|x) = P(ω_1|x) if we decide on class ω_2;   P(ω_2|x) if we decide on class ω_1
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Decision Rule
Choose ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i, j = 1, 2, …, c

If there are multiple features, x = {x_1, x_2, …, x_d} →

Choose ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i, j = 1, 2, …, c

Choose the class that has
the largest posterior
probability !!!
See page 65-72 in Murphy for an alternate explanation

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
Maximum A Posteriori Estimate:
PR
MAP!
 Let’s formalize our understanding before we go any further.
 We are given a training dataset, 𝒟, which contains 𝐶 classes. Hence, each instance belongs
to one of 𝜔1 , … , 𝜔𝐶 classes. We are then given an instance 𝐱, based on which we are asked
to predict its class label.
 Of all classes, we pick our best guess as the one that has the maximum posterior probability,
i.e., that is the most likely (most consistent with the observed data – the likelihood) while
best conforming to our original gut feeling – the prior probability.
 If the label indicated by the likelihood is different from our gut feeling, we may choose a
label that is different from that of our original belief. This would happen, for example, if we
see a lot of data (evidence) that contradict our prior belief, where the likelihood overwhelms
(overrides) the prior.
 Choosing the label that has the highest posterior probability is known as the maximum a
posteriori (MAP) decision, and is formally given as

ω̂ = arg max_{c=1,…,C} p(ω = ω_c | x, 𝒟)

where ω̂ is our best estimate of the true class (the so-called MAP estimate), and where the
conditioning on the dataset 𝒟 is made explicit

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
Maximum Likelihood Estimate
PR
MLE
 Now, recall that the posterior depends on three pieces of information: the prior, the likelihood
and the evidence: P(ω_j|x) = p(x|ω_j) ⋅ P(ω_j) / p(x). Since we are choosing the class with the
largest P(ω_j|x), and p(x) does not depend on the class j, the denominator is just a
normalization constant that does not affect the classification.
 Hence, we really need two pieces of information, the likelihood and the prior:
P(ω_j|x) ∝ p(x|ω_j) ⋅ P(ω_j)
 Of these two, the prior is independent of the data – after all, it is based on our prior
subjective belief. As we receive more and more data, the decision becomes more and more
dependent on the likelihood. If we make the decision purely on the likelihood (choose the
class that maximizes the likelihood), i.e., ω̂ = arg max_{c=1,…,C} p(x | ω = ω_c, 𝒟), we obtain the
maximum likelihood estimate (MLE) of the true class label.
• We will see more on MLE later.
 In general, as we see more and more data, the MAP estimate usually converges towards the
MLE.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR The Real World:
The World of Many Dimensions

 In most practical applications, we have more than one feature, and therefore the
random variable x must be replaced with a random vector x:  p(x) → p(x)
 The joint probability distribution p(x) still satisfies the axioms of probability
 The Bayes rule is then

P(ω_j|x) = p(x|ω_j) ⋅ P(ω_j) / p(x) = p(x|ω_j) ⋅ P(ω_j) / Σ_{k=1}^{C} p(x|ω_k) ⋅ P(ω_k)
where  p(x) = p(x_1, x_2, ⋯, x_d)  and  p(x|ω_j) = p(x_1, x_2, ⋯, x_d | ω_j)

 If – and only if – the random variables in the vector are statistically independent:
p(x) = p(x_1) ⋅ p(x_2) ⋯ p(x_d) = ∏_{i=1}^{d} p(x_i)
 While the notation changes only slightly, the implications are quite substantial:
 The curse of dimensionality
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR The Curse of
Dimensionality
Remember: in order to approximate the distribution, we need to create a histogram.
1-D: On average, let’s say we need 30 instances for each of the 20 bins to adequately populate the histogram → 20 × 30 = 600 fishes
2-D: 20 × 20 × 30 = 12,000 fishes!
3-D: 20 × 20 × 20 × 30 = 240,000 fishes!
[Figure (RP): 1-D, 2-D and 3-D histogram grids.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR The Loss Function
 Mathematical description of how costly each action (making a class decision) is.
Are certain mistakes costlier than others?

{ω_1, ω_2, …, ω_c}: set of states of nature (classes)
{α_1, α_2, …, α_a}: set of possible actions. Note that a need not be the same as c, because we
may have more (or fewer) actions than classes.
For example, not making a decision (rejection) is also an action.
{λ_1, λ_2, …, λ_a}: losses associated with each action
λ(α_i|ω_j): the loss function – the loss incurred by taking action α_i when the true state of nature
is in fact ω_j
R(α_i|x): conditional risk – the expected loss for taking action α_i

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) ⋅ P(ω_j|x)
Bayes decision takes the action that minimizes this conditional risk !
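A sketch of this computation with assumed numbers (the loss matrix and the posteriors are illustrative only):

% Conditional risk R(a_i|x) = sum_j lambda(a_i|w_j) * P(w_j|x), evaluated for each action.
lambda = [0  10;                 % row i = action a_i, column j = true class w_j (assumed losses)
          1   0];
post   = [0.3; 0.7];             % P(w_1|x), P(w_2|x) for the observed x (assumed values)
R = lambda * post;               % R(1): risk of action a_1, R(2): risk of action a_2
[~, bestAction] = min(R);        % Bayes decision: take the action with the minimum conditional risk
fprintf('R(a1|x)=%.2f, R(a2|x)=%.2f -> take action a%d\n', R(1), R(2), bestAction);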
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Bayes Decision Rule
Using Conditional Risk
1. Compute the conditional risk R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) ⋅ P(ω_j|x) for each action
[λ(α_i|ω_j): loss incurred by taking action α_i when the true class is ω_j; P(ω_j|x): probability that the
true class is ω_j; the sum runs over all classes].
2. Select the action that has the minimum conditional risk. Let this be action α_k.
3. The overall risk is then

R = ∫_{x∈X} R(α_k(x)|x) ⋅ p(x) dx

integrated over all possible values of x, where R(α_k(x)|x) is the conditional risk associated with
taking action α_k(x) based on the observation x, and p(x) is the probability that x will be observed.

4. This is the Bayes risk, the minimum possible risk that can be achieved by any classifier!

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Two-Class
Special Case
 Definitions:
 α_1: decide on ω_1;  α_2: decide on ω_2
 λ_ij ≡ λ(α_i|ω_j) → loss for deciding on ω_i when the true class is ω_j
Sample loss function (rows: true class, columns: diagnosis):
                     Cancer diagnosis    Healthy diagnosis
True cancer               0.5                 1000
True healthy              10                    0
 Conditional risk:
 R(α_1|x) = λ_11 P(ω_1|x) + λ_12 P(ω_2|x): risk associated with choosing class 1
 R(α_2|x) = λ_21 P(ω_1|x) + λ_22 P(ω_2|x): risk associated with choosing class 2
 Note that λ_11 and λ_22 need not be zero, though we expect λ_11 < λ_12 and λ_22 < λ_21
Clearly, we decide on ω_1 if R(α_1|x) < R(α_2|x), and on ω_2 otherwise. Equivalently:
(λ_21 − λ_11) P(ω_1|x) > (λ_12 − λ_22) P(ω_2|x)   →   choose ω_1
(λ_21 − λ_11) p(x|ω_1) P(ω_1) > (λ_12 − λ_22) p(x|ω_2) P(ω_2)   →   choose ω_1

Λ(x) = p(x|ω_1) / p(x|ω_2)  >  [(λ_12 − λ_22) / (λ_21 − λ_11)] ⋅ [P(ω_2) / P(ω_1)]   ⇒ choose ω_1 (otherwise ω_2)

The Likelihood Ratio Test (LRT): pick ω_1 if the LRT is greater than a threshold that is independent of x.
This rule, which minimizes the Bayes risk, is also called the Bayes criterion.
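A small sketch of the likelihood ratio test (the Gaussian class likelihoods, priors, and losses below are illustrative assumptions; normpdf is from the Statistics Toolbox):

% Likelihood ratio test for one observed value x, with assumed 1-D Gaussian class likelihoods.
x = 14;                                                % observed feature value
pxw1 = normpdf(x, 12, 2);   pxw2 = normpdf(x, 16, 3);  % p(x|w1), p(x|w2)
Pw1 = 2/3;  Pw2 = 1/3;                                 % priors
lam11 = 0; lam22 = 0; lam12 = 1000; lam21 = 10;        % losses (illustrative values)
LRT    = pxw1 / pxw2;                                        % Lambda(x)
thresh = ((lam12 - lam22)/(lam21 - lam11)) * (Pw2/Pw1);      % threshold, independent of x
if LRT > thresh, w = 1; else w = 2; end                      % Bayes criterion decision
fprintf('Lambda(x)=%.3f, threshold=%.3f -> decide w%d\n', LRT, thresh, w);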
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Example
(with zero loss for correct decisions: λ_11 = λ_22 = 0; λ_12 = λ_21 = 1)
[Worked example figure (G), from R. Gutierrez @ TAMU.]


Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
Note in both of these examples that you need to know the closed-form expression of p(x|ω) to
solve for the likelihood ratio Λ(x). What do you do otherwise?
PR Example (Try this at home!) – with losses λ_11, λ_22, λ_12, λ_21
[Worked example figure (G), modified from R. Gutierrez @ TAMU.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Minimum Error-Rate Classification:
Multiclass Case

 If we associate taking action i with selecting class ω_i, and if all errors are equally costly, we
obtain the zero-one loss (symmetric cost function):

λ(α_i|ω_j) = 0 if i = j;  1 if i ≠ j

This loss function assigns no loss to a correct classification, and a loss of 1 to any misclassification.
The risk corresponding to this loss function is then

R(α_i|x) = Σ_{j=1}^{c} λ(α_i|ω_j) ⋅ P(ω_j|x) = Σ_{j≠i} P(ω_j|x) = 1 − P(ω_i|x)    (*)

What does this tell us…?
 To minimize this risk (the average probability of error), we need to choose the class that
maximizes the posterior probability P(ω_i|x). Only this selection will minimize the risk in (*):

Λ(x) = p(x|ω_1)/p(x|ω_2) > P(ω_2)/P(ω_1) ⇒ choose ω_1 (otherwise ω_2)   ⟺   P(ω_1|x)/P(ω_2|x) > 1 ⇒ choose ω_1

Maximum a posteriori (MAP) criterion; it reduces to the maximum likelihood criterion for equal priors.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Error Probabilities
(Bayes Rule Rules!)

In a two-class case, there are two sources of error:
x is in R_1, but the true class is ω_2
x is in R_2, but the true class is ω_1

P(error) = ∫ P(error|x) p(x) dx = ∫_{R_1} P(ω_2|x) p(x) dx + ∫_{R_2} P(ω_1|x) p(x) dx

[Figure (D): the curves p(x|ω_2)P(ω_2) and p(x|ω_1)P(ω_1); x_B: optimal Bayes solution; x*: non-optimal solution.]

P(error) = p(x ∈ R_1, ω_2) + p(x ∈ R_2, ω_1),  where  p(x ∈ R_1, ω_2) = p(x ∈ R_1|ω_2) P(ω_2)  and  p(x ∈ R_2, ω_1) = p(x ∈ R_2|ω_1) P(ω_1)
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR In case of
non-optimal Solution
Note that the Bayes error, achieved by the Bayes boundary xB, is the
smallest error that can be achieved under the given distribution. Additional
error is introduced in case of any other (non-optimal) solution

[Figure (D): x_B: optimal Bayes solution; x*: non-optimal solution.]

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Another example

P(error) = ∫ P(error|x) p(x) dx = ∫_{R_1} P(ω_2|x) p(x) dx + ∫_{R_2} P(ω_1|x) p(x) dx

The error includes the additional (pink shaded) region if the decision boundary is not at the optimal boundary x_0.

[Figure (B): p(x|ω_2)P(ω_2) and p(x|ω_1)P(ω_1), with the optimal boundary x_0 and the error regions
∫_{R_1} P(ω_2|x) p(x) dx and ∫_{R_2} P(ω_1|x) p(x) dx.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Probability of Error
 In the multi-class case, there are more ways to be wrong than to be right, so we exploit the fact
that P(error) = 1 − P(correct), where p(x, ω_i) = P(ω_i|x) p(x) = p(x|ω_i) P(ω_i):

Discrete:    P(correct) = Σ_{i=1}^{C} P(x ∈ R_i, ω_i) = Σ_{i=1}^{C} P(x ∈ R_i|ω_i) P(ω_i)
Continuous:  P(correct) = Σ_{i=1}^{C} ∫_{x∈R_i} p(x|ω_i) P(ω_i) dx = Σ_{i=1}^{C} ∫_{x∈R_i} P(ω_i|x) p(x) dx

 Of course, in order to minimize P(error), we need to maximize P(correct), for which we
need to maximize each and every one of the integrals. Note that p(x) is common to all
integrals, therefore the expression will be maximized by choosing the decision regions
R_i where the posterior probabilities P(ω_i|x) are maximum:

For example: in regions I and IV, P(ω_2|x) is greater than either P(ω_1|x) or P(ω_3|x).
Therefore, these regions are assigned to class ω_2 as region R_2.

[Figure (G): posterior probabilities of three classes over x, partitioned into regions I-V. From R. Gutierrez @ TAMU.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Reject Option
 Sometimes, when the posterior probabilities are not high enough, we may decide
that refusing to make a decision may reduce the overall risk.
 Refusing to make a decision practically means that the available information is not
adequate to make a sufficiently confident decision. This forces the user to obtain
additional information.
 The reject option can be controlled simply by using a posterior probability threshold θ,
below which we refuse to make a decision
 If θ = 1, all samples are rejected
 If θ < 1/C, no samples are rejected

[Figure (B): posteriors P(ω_1|x) and P(ω_2|x), with a reject region where both fall below the threshold θ.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Inference & Decision
 Typically, a Bayes classifier requires two steps:
• Inference: Use training data to determine the prior / likelihood probabilities
• Decision: Given new data, choose the class for which the posterior is maximum
 There are other general approaches:
 Generative models: as in the Bayes classifier, determine the joint probabilities explicitly,
followed by the Bayes rule to obtain the posteriors, then determine the class membership for each input
• Having access to the joint distribution allows us to “generate” additional
data without observing it, since the data are assumed to come from that
distribution.
• Immensely useful, but often computationally difficult (sometimes impossible!)
 Probabilistic Discriminative models: Determine the posterior probability – or a function
of it – directly (without computing joint probabilities). The function then discriminates
among classes.
 Discriminative models: find a function f(x) that approximates the unknown mapping
from the data x to their correct classes ω_j. Probability does not necessarily play a
role, and posteriors are not computed. The outputs of some models (e.g., some neural
networks) can, however, be interpreted as posterior probabilities under certain conditions.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Discriminant Based
Classification
 A discriminant is a function g(x) that discriminates between classes. This function
assigns the input vector to a class according to its definition: choose class i if
g_i(x) > g_j(x)   ∀ j ≠ i,  i, j = 1, 2, …, c
 The Bayes rule can be implemented in terms of discriminant functions, simply by choosing the
posterior as the discriminant: g_i(x) = P(ω_i|x)

The discriminant functions generate
c decision regions, R_1, …, R_c, which are
separated by decision boundaries. Decision
regions need NOT be contiguous.

The decision boundary satisfies g_i(x) = g_j(x)

x ∈ R_i   ⇔   g_i(x) > g_j(x)   ∀ j ≠ i,  i, j = 1, 2, …, c

[Figure (D): discriminant functions and the resulting decision regions and boundaries.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Discriminant Functions
 We may view the classifier as an automated machine that computes 𝑐
discriminants, and selects the category corresponding to the largest discriminant
 A neural network is one such classifier

 For a Bayes classifier with non-uniform risks R(α_i|x):  g_i(x) = −R(α_i|x)
 For the MAP classifier (uniform risks):  g_i(x) = P(ω_i|x)
 For the maximum likelihood classifier (equal priors):  g_i(x) = p(x|ω_i)
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Discriminant Functions
 In fact, multiplying every discriminant by the same positive constant, or adding or
subtracting the same constant to all discriminants, does not change the decision boundary
 In general, every g_i(x) can be replaced by f(g_i(x)), where f(⋅) is any
monotonically increasing function, without affecting the actual decision boundary
 Some linear or non-linear transformations of the previously stated discriminants
may greatly simplify the design of the classifier

 What examples can you think of…?

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Normal Densities
 If the likelihoods are normally distributed, then a number of simplifications can be made.
In particular, the discriminant function can be written in this greatly simplified form (!) by using
the log transformation:

p(x|ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) ⋅ exp( −½ (x − μ_i)^T Σ_i^{-1} (x − μ_i) ),        p(x|ω_i) ~ N(μ_i, Σ_i)

g_i(x) = −½ (x − μ_i)^T Σ_i^{-1} (x − μ_i) − (d/2) ln 2π − ½ ln|Σ_i| + ln P(ω_i)

(doesn’t this make everything so crystal clear…?)
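A direct sketch of evaluating this log-discriminant for one class (toy parameters; mvnpdf from the Statistics Toolbox is used only as a cross-check):

% Evaluate the Gaussian log-discriminant g_i(x) for one class with assumed toy parameters.
x  = [1; 2];  mu = [0; 0];
Sigma = [2 0.5; 0.5 1];  Pw = 0.5;  d = numel(x);
g = -0.5*(x-mu)'*(Sigma\(x-mu)) - (d/2)*log(2*pi) - 0.5*log(det(Sigma)) + log(Pw);
gCheck = log(mvnpdf(x', mu', Sigma) * Pw);     % g should equal log( p(x|w_i) * P(w_i) )
fprintf('g(x) = %.4f, log(p(x|w)P(w)) = %.4f\n', g, gCheck);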

There are three distinct cases that can occur:


Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
1 0 0
 0 
PR   i   2 0


1


𝚺 𝑖 = 𝜎
Case 1: _________
2
𝑰
0 0 1

Features are statistically independent, and all features have the same variance: Distributions
are spherical in d dimensions, the boundary is a generalized hyperplane (linear discriminant) of
d-1 dimensions, and features create equal sized hyperspherical clusters.

gi  x     x  μi  i1   x  μi    ln 2  ln i  ln P i 
1 T d 1
2   2 2
Σi    independent of i
2d
1
   x  μ T 1
   x  μ    d ln 2  1 ln   ln P  
Σ 1  1  2  I  ditto 2 
i 
i i
 2 2
i i

1 
 x  μi    x  μi    ln P i 
T

2 2  

The general form of the …and if we have unit variance, 2 =1,


…and if priors are the same: we have the nearest
discriminant is then Euclidean distance classifier
𝐱 − 𝛍𝑖 2 𝐱 − 𝛍𝑖 𝑇 𝐱 − 𝛍 𝑖 Recall Euclidean norm:
𝑔𝑖 𝐱 = − + ln𝑃 𝜔𝑖 𝑔𝑖 𝐱 = − ‖𝐱 − 𝜇𝑖 ‖2 = 𝐱 − 𝛍𝑖 𝑇 𝐱 − 𝛍𝑖
2𝜎 2 2𝜎 2
𝐱 − 𝛍𝑖 𝑇 𝐱 − 𝛍𝑖
 =− 2𝜎 2
+ ln𝑃 𝜔𝑖 Minimum Distance Classifier

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Euclidean Distance
 Recall the definition of the distance between two points (vectors) u and v.
 This can be computed regardless of the dimensionality (of course, so long as u and
v are of the same dimension):

d(u, v) = ‖u − v‖ = sqrt( (u_1 − v_1)² + ⋯ + (u_n − v_n)² )

[Figure (RP): two points u and v in the plane, with legs |u_x − v_x| and |u_y − v_y|.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR i   I
Case 1: _________
2
 

Examples of such hyperspherical clusters are:

1-D 2-D 3-D


Note that when prior probabilities are identical for all classes,
the class distributions are equidistance from the decision boundary.
D
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Demo: discriminant1.m        Case 1: Σ_i = σ²I

 This case results in linear discriminants that can be written in the form

g_i(x) = w_i^T x + w_i0,    where   w_i = μ_i / σ²,    w_i0 = −(1/(2σ²)) μ_i^T μ_i + ln P(ω_i)

(w_i0 is the threshold, or bias, of the ith category.)

[Figure (D): 1-D, 2-D and 3-D cases (with non-equal priors).]

Note how priors shift the discriminant function away from the more likely mean !!!
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Example

function discriminant1
% This function demonstrates the Bayesian classifier for data drawn from a Gaussian
% distribution, for the type 1 case, where the covariance matrix is diagonal and constant.
% This is equivalent to all features being independent and all classes having equal variances.
% To see different effects, change the parameters mu1, mu2, mu3, and sigma.
% Robi Polikar, September 2005 - modified September 2007, September 2009.

mu1=[3 2]'; mu2=[7 4]'; mu3=[2 5]'; prior1=1/3; prior2=1/3; prior3=1/3;
sigma=2; Sigma=[sigma 0 ; 0 sigma]; x=[-2:0.1:10]; y=[-2:0.1:10]; X=[x; y];
p1=gauss2d(x,y,mu1', Sigma, [1 0 0]); hold on; p2=gauss2d(x,y,mu2', Sigma, [0 1 0]); p3=gauss2d(x,y,mu3', Sigma, [0 0 1]);

% Compute the discriminants for the equal-sigma case (implements the boxed equation on slide 51).
% Note that in this case the x'x term is removed, since it is independent of the class information.
for i=1:length(x)
    for j=1:length(y)
        g1(i,j)=(1/sigma^2)*mu1'*[x(i); y(j)]-1/(2*sigma^2)*(mu1'*mu1)+log(prior1);
        g2(i,j)=(1/sigma^2)*mu2'*[x(i); y(j)]-1/(2*sigma^2)*(mu2'*mu2)+log(prior2);
        g3(i,j)=(1/sigma^2)*mu3'*[x(i); y(j)]-1/(2*sigma^2)*(mu3'*mu3)+log(prior3);
    end
end

% Determine for each point whether g1, g2 or g3 is maximum. To do this effectively, put all
% three matrices into a 3-dimensional array, and find the index of the maximum along the third dimension.
g(:,:,1)=g1; g(:,:,2)=g2; g(:,:,3)=g3;
[a b]=max(g, [], 3); % b indicates which of the g functions (1, 2 or 3) is maximum at each point.
figure; pcolor(x,y,b); xlabel('Feature2'); ylabel('Feature1'); title('Pseudocolor plot of the decision boundaries'); shading interp;

% Create a colormap with the colors Red, Green and Blue.
RGB=zeros(64,3); RGB(1:21,:)=repmat([1 0 0],21,1);
RGB(22:43,:)=repmat([0 1 0],22,1); RGB(44:64,:)=repmat([0 0 1],21,1);
colormap(RGB); colorbar;
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR   Example
function p=gauss2d(X,Y, MU, SIGMA, C)
% This function creates a 2D Gaussian probability distribution function
%
% MU: Mean vector, its length must be equal to 2 (must be column vector)
% SIGMA: Covariance vector, must be semi-positive definite matrix of 2x2
% X and Y: Cartesian coordinates of the points at which the Gaussian will be computed
% C: color matrix in [R G B] form indicating the edge colors

I=length(X); J=length(Y);

%mu=[-1 1];
%Sigma = [.9 .4; .4 .3];

for i=1:I
for j=1:J
p(i,j) = mvnpdf([X(i) Y(j)],MU,SIGMA);
end
end

h=surf(X,Y,p');

if nargin==5
alpha(0.75);
set(h, 'facecolor', C, 'edgecolor', C, 'facealpha', 0.9); %Color the faces of the mesh plot according to C
title('The theoretical distribution of 2-D Gaussian - {\itp }({\bf x }| \omega_j )')
xlabel('Feature1'); ylabel('Feature2');
grid on
end

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR   Example
The theoretical distribution of 2-D Gaussian - p ( x |  )
j

0.08

0.06

0.04 Pseudocolor plot of the decision boundaries


10 3
0.02

0 8
10
10 2.5
5
5 6
0 0

Feature1
Feature2 -5 -5
Feature1
4 2

2
1.5
0

-2 1
-2 0 2 4 6 8 10
Feature2
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
  12  122  12d 
 2  Demo: discriminant2.m
PR Demo: discriminant2.m        Case 2: Σ_i = Σ  (Section 4.2.2 in Murphy)

Σ_i = Σ = [σ_11²  σ_12²  ⋯  σ_1d²;  σ_21²  σ_22²  ⋯  σ_2d²;  ⋯;  σ_d1²  σ_d2²  ⋯  σ_dd²]

Covariance matrices are arbitrary, but equal to each other for all classes. The features then form hyper-
ellipsoidal clusters of equal size and shape. This also results in linear discriminant functions
whose decision boundaries are again hyperplanes:

g_i(x) = −½ (x − μ_i)^T Σ^{-1} (x − μ_i) + ln P(ω_i)

g_i(x) = w_i^T x + w_i0,    w_i = Σ^{-1} μ_i,    w_i0 = −½ μ_i^T Σ^{-1} μ_i + ln P(ω_i)

[Figure (D): equal-covariance Gaussian classes with linear (hyperplane) decision boundaries.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Mahalanobis Distance
 In this case, instances are classified not based on the minimum Euclidean distance, but on the
minimum Mahalanobis distance.
 Samples drawn from a 2-D Gaussian lie in a cloud centered around the mean μ. The quantity
r² = (x − μ)^T Σ^{-1} (x − μ) is known as the squared Mahalanobis distance from x to the mean of the group of
points, normally distributed as N(μ, Σ).
 The contours of constant density are (hyper)ellipsoids of constant Mahalanobis distance
from the mean.

[Figures (RP, D): contours of constant Mahalanobis distance r around the mean of a 2-D Gaussian.]
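A small sketch contrasting the squared Mahalanobis distance with the squared Euclidean distance (toy mean and covariance, for illustration):

% Squared Mahalanobis vs. squared Euclidean distance from a point x to a class mean mu.
x  = [2; 1];  mu = [0; 0];
Sigma = [4 0; 0 1];                         % much larger variance along the first feature
r2_mahal  = (x-mu)'*(Sigma\(x-mu));         % (x-mu)' * inv(Sigma) * (x-mu)  ->  2^2/4 + 1^2/1 = 2
r2_euclid = (x-mu)'*(x-mu);                 % ordinary squared Euclidean distance  ->  5
fprintf('squared Mahalanobis = %.2f, squared Euclidean = %.2f\n', r2_mahal, r2_euclid);

The Mahalanobis distance discounts differences along directions in which the class has large variance, which is exactly what the (x − μ)^T Σ^{-1} (x − μ) term in the discriminant does.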

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Case 2: Σ_i = Σ

Now note that clusters are not spherical but ellipsoidal, due to covariances not being diagonal.
Also, note that unequal priors shift the discriminant function away from the more likely mean.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Case 2 Example

function discriminant2
% This function demonstrates the Bayesian classifier for data drawn from a Gaussian
% distribution, for the type 2 case, where the covariance matrix is arbitrary, but constant
% (shared) across classes. To see different effects, change the parameters mu1, mu2, mu3, and Sigma.
% Robi Polikar, September 2005 - modified September 2007.

mu1=[3 2]'; mu2=[7 4]'; mu3=[2 5]'; prior1=1/3; prior2=1/3; prior3=1/3; Sigma=[0.9 0.3 ; 0.3 0.4];
x=[-2:0.1:10]; y=[-2:0.1:10]; X=[x; y];

p1=gauss2d(x,y,mu1', Sigma, [0 0 0]); hold on;
p2=gauss2d(x,y,mu2', Sigma, [1 0.3 0]); p3=gauss2d(x,y,mu3', Sigma, [1 1 0.9]);
contour(x,y,p1, 'linewidth', 3); contour(x,y,p2, 'linewidth', 3); contour(x,y,p3, 'linewidth', 3);

% Compute the discriminants for the shared-covariance case (implements the boxed equation on
% slide 58). Note that in this case the quadratic x'*inv(Sigma)*x term is removed, since it is
% independent of the class information. (The original used log(prior1) in all three discriminants;
% fixed here to use each class's own prior.)
for i=1:length(x)
    for j=1:length(y)
        g1(i,j)=(inv(Sigma)*mu1)'*[x(i); y(j)]-0.5*mu1'*inv(Sigma)*mu1+log(prior1);
        g2(i,j)=(inv(Sigma)*mu2)'*[x(i); y(j)]-0.5*mu2'*inv(Sigma)*mu2+log(prior2);
        g3(i,j)=(inv(Sigma)*mu3)'*[x(i); y(j)]-0.5*mu3'*inv(Sigma)*mu3+log(prior3);
    end
end

% Determine for each point whether g1, g2 or g3 is maximum: put the three matrices into a
% 3-dimensional array, and find the index of the maximum along the third dimension.
g(:,:,1)=g1; g(:,:,2)=g2; g(:,:,3)=g3;
[a b]=max(g, [], 3); % b indicates which of the g functions (1, 2 or 3) is maximum at each point.
figure
pcolor(x,y,b); xlabel('Feature2'); ylabel('Feature1');
title('Pseudocolor plot of the decision boundaries')
shading interp;
colormap(hot);
colorbar;
% imagesc(b); % plotting "b" using imagesc clearly shows the decision boundaries
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Case 2 Example
[Figure: the theoretical distribution of the 2-D Gaussians p(x|ω_j) with a shared covariance matrix, and a pseudocolor plot of the resulting (linear) decision boundaries over Feature1 and Feature2.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR i  Arbitrary
Case 3:____________
Demo: discriminant3.m (Section 4.2.1 in Murphy)

All bets are off ! No simplifications are possible to the general 𝑔(𝐱). In two class case, the
decision boundaries form hyperquadratics. The discriminant functions are now, in general,
quadratic (nor linear) and non-contiguous. This is then a quadratic classifier.
1 −1 −1
𝐖 𝑖 = − 𝚺 𝑖 , 𝐰 𝑖 = 𝚺 𝑖 𝛍𝑖
𝑇 𝑇
𝑔𝑖 (𝐱) = 𝐱 𝐖𝑖 𝐱 + 𝐰𝑖 𝐱 + 𝑤𝑖0 2
1 𝑇 −1 1
𝑤𝑖0 = − 𝛍𝒊 𝚺𝑖 𝛍𝑖 − ln 𝚺𝑖 + ln𝑃(𝜔𝑖
2 2

Hyperbolic Circular
Parabolic Linear Ellipsoidal
D
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Demo: discriminant3.m        Case 3: Σ_i = arbitrary
For the multi-class case, the boundaries will look even more complicated. As an example:
[Figure (D): multi-class quadratic decision boundaries.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Case 3: Σ_i = arbitrary, in 3-D
[Figure (D): a 3-D example.]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Example
[Figure: the theoretical distribution of the 2-D Gaussians p(x|ω_j) and a pseudocolor plot of the resulting decision boundaries (Feature1 vs. Feature2).]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Are we done?
 Since we now know the best classifier that can be built, are we done with PR? Can
we go home…?
 Not quite: Bayes classifier cannot be used if we don’t know the prob. distributions
 This is typically the rule, not the exception
 In most applications of practical interest, we do not know the underlying
distributions
 The distributions can be estimated, if there is sufficient data
 Sufficient??? Make that “a ton of” data, or better yet… “lots × 10^(tons of data)”
 Estimating the prior distribution is relatively easy; however, estimating the class-
conditional distribution is difficult, and it gets giga-difficult as the dimensionality
increases… → the curse of dimensionality
 If we know the form of the distribution, say normal (but of course, what else), but
not its parameters, say the mean and variance, the problem reduces from
distribution estimation to parameter estimation.
 If not, there are nonparametric density estimation techniques.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Classifier
 There is one quick-and-dirty fix to the curse of dimensionality.
 A highly practical solution to this problem is to assume class-conditional independence of
the features, which yields the so-called Naïve Bayes classifier:

p(x|ω_j) = ∏_{i=1}^{d} p(x_i|ω_j) = p(x_1|ω_j) ⋅ p(x_2|ω_j) ⋅ ⋯ ⋅ p(x_d|ω_j),        x = [x_1, ⋯, x_d]

 Note that this is not as restrictive a requirement as full feature independence:

p(x) = ∏_{i=1}^{d} p(x_i) = p(x_1) p(x_2) ⋯ p(x_d)

 The discriminant function corresponding to the Naïve Bayes classifier is then

g_j^NB(x) = P(ω_j) ∏_{i=1}^{d} p(x_i|ω_j)

 The main advantage of this is that we only need to compute the univariate densities p(x_i|ω_j),
which are much easier to compute (next week!) than the multivariate densities p(x|ω_j).

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Classifier
(NBC)
 So, the NBC assumes that features are class-conditionally independent. Is this a reasonable
assumption?
 For most applications, probably not. However, in practice, Naïve Bayes has been shown to have
respectable performance, comparable to that of neural networks, even when this assumption
is clearly violated.
 “How come…?” you ask…
 One reason is that the model is quite simple!
• For C classes and d features, the computational complexity of the NBC is O(Cd)
• This makes the NBC very resilient to over-fitting, the Achilles’ heel of more sophisticated classifiers.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Classifier
(Known Distribution type)
 So given some training data how do we implement the Naïve Bayes classifier?
 If you can assume that the class-conditional features follow a specific distribution,
e.g., the Gaussian distribution, then implement the following pseudocode.

Let x = {x_1 x_2 … x_d} denote the d features of the training data, with class labels ω_j, j = 1,…,c
for j = 1,…,c
    for i = 1,…,d
        Compute the mean and variance of feature x_i over all training instances from class ω_j:
            μ_ij = (1/N_j) Σ_{x∈ω_j} x_i,        σ_ij² = (1/N_j) Σ_{x∈ω_j} (x_i − μ_ij)²        (1)

Let z = {z_1 z_2 … z_d} be the test data to be classified. For each z:
for j = 1,…,c
    for i = 1,…,d
        Compute the class-conditional likelihood of each feature for each class:
            p(z_i|ω_j) = (1/sqrt(2π σ_ij²)) exp( −(z_i − μ_ij)² / (2 σ_ij²) )        (2)
    Compute the posterior / discriminant, assuming class-conditional independence of the features:
            g_j^NB(z) = P(ω_j) ∏_{i=1}^{d} p(z_i|ω_j) ∝ P(ω_j|z)        (3)
Choose the class for which the posterior / discriminant is the largest:
            ω* = argmax_j g_j^NB(z) = argmax_j P(ω_j|z)        (4)

 If the likelihoods follow a different, non-Gaussian (but known, say chi-square) distribution,
compute the parameters of that distribution in step (1) and use the definition of that distribution in step (2).
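A minimal from-scratch MATLAB sketch of steps (1)-(4), assuming Xtrain (N×d), integer class labels ytrain in 1,…,c, and a test matrix Xtest; this is an illustrative sketch, not the deck's naivebayes.m:

% Gaussian naive Bayes from scratch (assumed variable names; illustrative sketch only).
c = max(ytrain);  d = size(Xtrain,2);  Ntest = size(Xtest,1);
mu = zeros(c,d);  sig2 = zeros(c,d);  prior = zeros(c,1);
for j = 1:c                                       % step (1): per-class, per-feature statistics
    Xj = Xtrain(ytrain == j, :);
    mu(j,:)   = mean(Xj, 1);
    sig2(j,:) = var(Xj, 1, 1);                    % variance of each feature within class j
    prior(j)  = size(Xj,1) / size(Xtrain,1);
end
logpost = zeros(Ntest, c);
for j = 1:c                                       % steps (2)-(3): log-likelihoods plus log-prior
    loglik = repmat(-0.5*log(2*pi*sig2(j,:)), Ntest, 1) ...
             - 0.5*((Xtest - repmat(mu(j,:), Ntest, 1)).^2) ./ repmat(sig2(j,:), Ntest, 1);
    logpost(:,j) = sum(loglik, 2) + log(prior(j));    % sum of logs = log of the product in (3)
end
[~, ypred] = max(logpost, [], 2);                 % step (4): the class with the largest posterior

Working with logarithms avoids the numerical underflow that the product of many small likelihoods in (3) would otherwise cause.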
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Classifier
(Unknown Distribution type)
 If the form of the distribution is not known, then it must be estimated. Use either one of the
approaches discussed in the density estimation lecture (such as Parzen windows, k-nearest
neighbor, etc.), or simply follow the histogram approach shown earlier (slide 29):
 For each class, look at the minimum and maximum values of each feature
 Divide that range into a meaningful number of bins, based on the
number of training data
• Optimize the number of bins vs. the number of instances
in each bin
• The larger the number of bins, the finer the resolution of the estimate
• The larger the number of instances in each bin,
the more accurate (smoother) the estimate will be
• For a given dataset size, you can only maximize one or the other.
 Create the histogram by counting the number of instances that fall into each bin
 The distribution is the estimate of the class conditional likelihood of that feature.
 Follow steps (3) and (4) of the pseudo code in the previous slide.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes in
Matlab
 Matlab’s Statistics Toolbox has a built-in naïve Bayes classifier.
 This is not a standard matlab function, however, but rather an object. You need to
use it as such. See examples in Matlab documentation by typing
>>help naivebayes
NaiveBayes class – Naive Bayes classifier

A NaiveBayes object defines a Naive Bayes classifier. A Naive Bayes classifier assigns a new observation to the most probable class, assuming the features
are conditionally independent given the class value.

Construction:
NaiveBayes – create a NaiveBayes object

Methods (what you can do with objects of this class):
disp / display – display a NaiveBayes classifier object
fit – create a Naive Bayes classifier object by fitting training data
posterior – compute the posterior probability of each class for test data
predict – predict the class label for test data

Properties (the information / fields saved by each instance of this object):
CIsNonEmpty – flag for non-empty classes
ClassLevels – class levels
CNames – class names
Prior – class priors
Dist – distribution names
NClasses – number of classes
NDims – number of dimensions
Params – parameter estimates

For example, nb = NaiveBayes.fit (data, labels) fits a naïve Bayes classifier object to the training data, and saves the results in the object nb. The object nb
has several fields for carrying different types of information. For example, nb.NClasses gives you the number of classes that the nb object recognizes
(trained on). To use the NaiveBayes object to classify test data, test_labels = predict(nb, test_data) or alternatively test_labels = nb.predict(test_data) .
Similarly, [post,cpre] = posterior(nb,test) returns post, the posterior probabilities of each of the instances (rows) of the test dataset test, along with cpre, the class to which each instance is assigned.
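 A minimal sketch of this workflow (the data and label variables are placeholders; the NaiveBayes object is the pre-R2014b interface used here, newer releases also provide fitcnb):

% Illustrative use of the NaiveBayes object described above
nb           = NaiveBayes.fit(train_data, train_labels);   % Gaussian likelihoods by default
test_labels  = predict(nb, test_data);                     % predicted class labels
[post, cpre] = posterior(nb, test_data);                   % per-class posteriors and assigned classes
% nb = NaiveBayes.fit(train_data, train_labels, 'distribution', 'kernel');  % nonparametric likelihoods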

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Using Matlab's NaiveBayes
naivebayes_ocr_demo.m

%This function demonstrates the functionality of the Naive Bayes classifier on the 10-class, 62 feature OCR database. Uses Matlab 2012a
%Robi Polikar, September 2012

clear; close all; clc

load opt_train.mat;
load opt_class.mat;
load opt_test.mat;
load opttest_class.mat;

opt_train=opt_train'; %The data must be in row format.
opt_test=opt_test';

%convert vectorized labels to category labels
opt_class = vec2ind(opt_class);
opttest_class = vec2ind(opttest_class);

%Fit the NB classifier to the data
nb=NaiveBayes.fit(opt_train, opt_class, 'distribution', 'kernel');
nb_out=predict(nb, opt_test);
[post nb_out2]=posterior(nb, opt_test);
% nb_out2=nb.predict(opt_test); %alternative usage

%convert labels to one-hot model to
%be used with confusion() function
true_labels = ind2vec(opttest_class);
predicted_labels=ind2vec(nb_out);

%Create, compute and plot confusion matrix
[C CM ind performance]=confusion(true_labels, predicted_labels);
plotconfusion(true_labels, predicted_labels)

[Figure: Confusion Matrix - plotconfusion output for the naïve Bayes (kernel) classifier on the 10-class OCR test set (Output Class vs. Target Class); overall accuracy 69.7%]
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Example (1)
naivebayes.m

[Figures: randomly generated Gaussian data; decision boundaries generated by naive Bayes implemented from scratch; decision boundaries generated by Matlab's naive Bayes]

Mu1 = [-1 , 1];
Mu2 = [ 2 , 3];
Mu3 = [ 3 , 0];

Sigma1 = [.9 .1; .1 .6];
Sigma2 = [1 -0.5; -0.5 1];
Sigma3 = [.3 0.8; 0.8 5];

tr_COUNT = [500 500 500];

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Naïve Bayes Example (2)

[Figures: randomly generated Gaussian data; decision boundaries generated by naive Bayes implemented from scratch; decision boundaries generated by Matlab's naive Bayes]

Mu1 = [-1 , 1];
Mu2 = [ 2 , 3];
Mu3 = [ 3 , 0];

Sigma1 = [.9 .1; .1 .6];
Sigma2 = [1 -0.5; -0.5 1];
Sigma3 = [.3 0.8; 0.8 5];

tr_COUNT = [500 500 4500];

Observe the effect of the change in the prior probabilities!

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Conclusions
 The Bayes classifier for normally distributed classes is in general a quadratic classifier and
can be computed analytically.
 The Bayes classifier for normally distributed classes with equal covariance matrices is a
linear classifier
 The Bayes classifier for normally distributed classes with equal covariance matrices and equal priors is a
minimum (Mahalanobis) distance classifier
 The Bayes classifier for normally distributed classes with equal covariance matrices that are proportional to the
identity matrix, and with equal priors, is a minimum (Euclidean) distance classifier
 Note that using a minimum Euclidean or Mahalanobis distance classifier implicitly makes
certain assumptions regarding the statistical properties of the data, which may or may not be – and
in general are not – true.
 However, in many cases, certain simplifications and approximations can be made that
warrant making such assumptions even if they are not strictly true. The bottom line in practice,
when deciding whether the assumptions are warranted, is: does the damn thing solve my classification
problem…?
 For multivariate data, use the naïve Bayes classifier, if the class-conditional independence
assumption (at least approximately) holds.
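 For reference, these statements follow from the standard Gaussian discriminant function (see, e.g., DHS Ch. 2; class-independent constants are dropped):

$g_j(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_j)^{T}\Sigma_j^{-1}(\mathbf{x}-\boldsymbol{\mu}_j) - \tfrac{1}{2}\ln|\Sigma_j| + \ln P(\omega_j)$

With equal covariance matrices ($\Sigma_j = \Sigma$), the terms quadratic in $\mathbf{x}$ are identical for all classes and cancel, leaving a linear discriminant; adding equal priors reduces the rule to minimizing the Mahalanobis distance $(\mathbf{x}-\boldsymbol{\mu}_j)^{T}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}_j)$; and with $\Sigma = \sigma^2 I$ it reduces to minimizing the Euclidean distance $\lVert\mathbf{x}-\boldsymbol{\mu}_j\rVert^2$.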

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Error Bounds

It is difficult at best, if at all possible, to analytically compute the error probabilities, particularly
when the decision regions are not contiguous. However, upper bounds for this error can be obtained:

The Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds
that are often used. If the distributions are Gaussian, these expressions are relatively
easy to compute; oftentimes even non-Gaussian cases are treated as Gaussian.

Red Stars on titles indicate graduate level material only – Grad students: please read the
error bounds in chapter 2 of DHS.
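 For the two-class Gaussian case, the Bhattacharyya bound takes the form (a standard result; see DHS Ch. 2):

$P(\mathrm{error}) \le \sqrt{P(\omega_1)P(\omega_2)}\, e^{-k(1/2)}, \qquad k(1/2) = \frac{1}{8}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1)^{T}\left[\frac{\Sigma_1+\Sigma_2}{2}\right]^{-1}(\boldsymbol{\mu}_2-\boldsymbol{\mu}_1) + \frac{1}{2}\ln\frac{\left|\frac{\Sigma_1+\Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}}$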
Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR
Exercise

Generate a multi-class (three- or four-class) artificial dataset with various Gaussian
distributions (of each of the three types) and demonstrate the Bayesian classifier
(graduate students: calculate its theoretical and empirical error). Choose examples that
demonstrate your understanding. Show decision boundaries by plotting the classifier
decisions on a grid that encompasses the entire feature space.

Recommended Completion time: September 20, 2013.
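 A minimal Matlab sketch of this exercise, reusing the class parameters from the earlier examples (grid resolution and plotting choices are arbitrary):

% Three 2-D Gaussian classes and Bayes decision regions evaluated over a grid
Mu    = {[-1 1], [2 3], [3 0]};                           % class means
Sigma = {[.9 .1; .1 .6], [1 -.5; -.5 1], [.3 .8; .8 5]};  % class covariance matrices
N     = [500 500 500];                                    % samples per class (sets the priors)
X = []; y = [];
for j = 1:3
    X = [X; mvnrnd(Mu{j}, Sigma{j}, N(j))];               % draw the class-j training samples
    y = [y; j*ones(N(j), 1)];
end
prior = N / sum(N);

% evaluate the Bayes discriminants P(w_j)p(x|w_j) on a grid covering the feature space
[x1, x2] = meshgrid(linspace(min(X(:,1)), max(X(:,1)), 200), ...
                    linspace(min(X(:,2)), max(X(:,2)), 200));
g = zeros(numel(x1), 3);
for j = 1:3
    g(:, j) = prior(j) * mvnpdf([x1(:) x2(:)], Mu{j}, Sigma{j});
end
[~, region] = max(g, [], 2);                              % winning class at every grid point

contourf(x1, x2, reshape(region, size(x1))); hold on      % decision regions
gscatter(X(:,1), X(:,2), y);                              % overlay the generated data

Replacing the true Mu / Sigma / prior values in the grid evaluation with estimates computed from X gives the boundaries of the trained (empirical) classifier rather than the theoretical ones.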

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ
PR Midterm Project – I
Due: Sept 27
 Choose 10 datasets from the UCI database, including Wine, Iris, Vehicle, Optical
Recognition of Handwritten Digits, and Gas Sensor Array
 Using built-in Matlab functions or by coding on your own, implement the kNN,
naïve Bayes, SVM, MLP, CART and AdaBoost (with CART) algorithms.
• Read about their implementations
• What parameters do they have? What did you use / choose, why?
• Perform appropriate cross-validation and statistical analysis. Which algorithm works better? (A cross-validation sketch is given after this list.)
 BONUS
 Use your favorite algorithm on any of the publicly available face recognition
datasets and see if you can get good results.
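 A minimal sketch of the cross-validation loop mentioned above, shown here with the NaiveBayes object (data and labels are placeholders; swap in any of the required algorithms):

% 10-fold stratified cross-validation sketch
k   = 10;
cvp = cvpartition(labels, 'KFold', k);           % stratified folds over the class labels
acc = zeros(k, 1);
for f = 1:k
    tr = training(cvp, f);  te = test(cvp, f);   % logical index vectors for this fold
    nb = NaiveBayes.fit(data(tr,:), labels(tr)); % train on k-1 folds
    pred = predict(nb, data(te,:));              % classify the held-out fold
    acc(f) = mean(pred == labels(te));           % fold accuracy
end
fprintf('Mean accuracy: %.3f (std %.3f)\n', mean(acc), std(acc));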

 Turn in Matlab code / plots for all implementation assignments.

Computational Intelligence & Pattern Recognition – Fall 2013 © Robi Polikar, Rowan University, Glassboro, NJ