
SE: Artificial Intelligence

Lectures 3, 4, 5
Naïve Bayes Classifier

By

Dr Musi Ali
Mustansar.ali@uettaxila.edu.pk
Twitter: @musiali007

Outline

Classification
Text Categorization
Probability
Bayesian Classifier
Naïve Bayes Classifier
Application of Naïve Bayes
Conclusion

Categorization/Classification

Given:
A description of an instance, x ∈ X, where X is the instance language or instance space.
(Issue: how to represent text documents.)
A fixed set of categories: C = {c1, c2, …, cn}

Determine:
The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

We want to know how to build categorization functions (classifiers).

Learning for Categorization

A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
Given a set of training examples, D.
Find a hypothesized categorization function, h(x), such that:

∀ <x, c(x)> ∈ D : h(x) = c(x)

Sample Category Learning Problem

Instance language: <size, color, shape>
size ∈ {small, medium, large}
color ∈ {red, blue, green}
shape ∈ {square, circle, triangle}
C = {positive, negative}
D:

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

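As a concrete illustration (my own sketch, not from the slides), the training set D above and one candidate hypothesis h can be written in a few lines of Python; the rule shown is simply one hypothesis that happens to be consistent with these four examples.

```python
# Training set D over the <size, color, shape> instance language.
# Each training example pairs an instance x with its correct category c(x).
D = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]

def h(x):
    """A hypothesized categorization function: 'red circles are positive'."""
    size, color, shape = x
    return "positive" if (color == "red" and shape == "circle") else "negative"

# h is consistent with D: for all <x, c(x)> in D, h(x) = c(x).
assert all(h(x) == c for x, c in D)
```
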
Text Categorization

Assigning documents to a fixed set of categories.
Applications:
Web pages: recommending, Yahoo-like classification
Newsgroup messages: recommending, spam filtering
News articles: personalized newspaper
Email messages: routing, prioritizing, folderizing, spam filtering

Is this spam?

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!

=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Document Classification

[Figure: a test document containing the words "planning, language, proof, intelligence" is assigned to one of several classes grouped under (AI), (Programming), and (HCI), with subclasses ML, Planning, Semantics, Garb.Coll., Multimedia, and GUI. Each subclass has training documents with characteristic words, e.g. "learning, intelligence, algorithm, reinforcement, network" (ML); "planning, temporal, reasoning, plan, language" (Planning); "programming, semantics, language, proof" (Semantics); "garbage, collection, memory, optimization, region" (Garb.Coll.).]

Text Categorization Examples

Assign labels to each document or web page:
Labels are most often topics such as Yahoo-categories
  e.g., "finance", "sports", "news>world>asia>business"
Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
Labels may be opinion
  e.g., like, hate, neutral
Labels may be domain-specific binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., spam : not-spam
  e.g., "is a toner cartridge ad" : "isn't"

Methods (1)

Manual classification
Used by Yahoo!, Looksmart, about.com, ODP, Medline
Very accurate when the job is done by experts
Consistent when the problem size and team are small
Difficult and expensive to scale

Automatic document classification
Hand-coded rule-based systems
Used by CS departments' spam filters, Reuters, CIA, Verity, …
E.g., assign a category if the document contains a given Boolean combination of words
Commercial systems have complex query languages

Methods (2)

Accuracy is often very high if a query has been carefully refined over time by a subject expert
Building and maintaining these queries is expensive

Supervised learning of a document-label assignment function
Many new systems rely on machine learning (Autonomy, Kana, MSN, Verity, …)
k-Nearest Neighbors (simple, powerful)
Naive Bayes (simple, common method)
Support-vector machines (new, more powerful)
Plus many other methods
No free lunch: requires hand-classified training data

Text Categorization: Attributes

Representations of text are very high dimensional (one feature for each word).
Algorithms that prevent overfitting in high-dimensional space are best.
For most text categorization tasks, there are many irrelevant and many relevant features.

Bayesian Methods

Learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Build a generative model that approximates how the data is produced.
Uses the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Axioms of Probability Theory

All probabilities lie between 0 and 1:
0 ≤ P(A) ≤ 1
A true proposition has probability 1, a false proposition has probability 0:
P(true) = 1,  P(false) = 0
The probability of a disjunction is:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Conditional Probability

P(A | B) is the probability of A given B.
Assumes that B is all and only the information known.
Defined by:
P(A | B) = P(A ∧ B) / P(B)

Independence

A and B are independent if:
P(A | B) = P(A)
P(B | A) = P(B)
(These two constraints are logically equivalent.)

Therefore, if A and B are independent:
P(A | B) = P(A ∧ B) / P(B) = P(A)
P(A ∧ B) = P(A) P(B)

Joint Distribution

The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). If all variables are discrete with v values, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.

          positive           negative
          circle  square     circle  square
red       0.20    0.02       0.05    0.30
blue      0.02    0.01       0.20    0.20

The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution:
P(red ∧ circle) = 0.20 + 0.05 = 0.25
P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
Therefore, all conditional probabilities can also be calculated.

Joint Distribution, Example

P(positive ∧ red ∧ circle) = 0.20
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80

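A short, illustrative Python sketch (my own, not from the slides) that encodes the joint distribution above and recomputes these marginal and conditional probabilities:

```python
# Joint distribution P(Category, Color, Shape), copied from the table above.
joint = {
    ("positive", "red", "circle"): 0.20, ("positive", "red", "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red", "circle"): 0.05, ("negative", "red", "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

def prob(**fixed):
    """Sum the joint entries consistent with the fixed variable values."""
    names = ("category", "color", "shape")
    return sum(p for values, p in joint.items()
               if all(dict(zip(names, values))[k] == v for k, v in fixed.items()))

print(prob(color="red", shape="circle"))        # P(red ∧ circle) = 0.25
print(prob(category="positive", color="red", shape="circle")
      / prob(color="red", shape="circle"))      # P(positive | red ∧ circle) = 0.8
```
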
Probabilistic Classification

Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
Let X be the random variable describing an instance, consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible value for X and xij a possible value for Xi.
For classification, we need to compute P(Y = yi | X = xk) for i = 1, …, m.

Motivational Stuff

Life's battles don't always go to the stronger or faster man.
But sooner or later, the man who wins is the man who thinks he CAN!

(Note: not part of the course)


Bayes' Theorem

P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

P(H | E) = P(H ∧ E) / P(E)    (def. of conditional probability)
P(E | H) = P(H ∧ E) / P(H)    (def. of conditional probability)
Therefore P(H ∧ E) = P(E | H) P(H), and so
P(H | E) = P(E | H) P(H) / P(E)

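As a sanity check (my own addition), Bayes' theorem can be verified numerically with the joint distribution from the earlier slides, taking H = "category is positive" and E = "red ∧ circle":

```python
# Probabilities read off the joint distribution table shown earlier.
p_H = 0.20 + 0.02 + 0.02 + 0.01        # P(positive) = 0.25
p_E = 0.20 + 0.05                      # P(red ∧ circle) = 0.25
p_E_given_H = 0.20 / p_H               # P(red ∧ circle | positive) = 0.8

p_H_given_E = p_E_given_H * p_H / p_E  # Bayes' theorem
print(round(p_H_given_E, 4))           # 0.8, matching P(positive | red ∧ circle) above
```
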
Bayesian Categorization

Determine the category of xk by determining, for each yi:

P(Y = yi | X = xk) = P(Y = yi) P(X = xk | Y = yi) / P(X = xk)

P(X = xk) can be determined since the categories are complete and disjoint:

Σ_{i=1..m} P(Y = yi | X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi) / P(X = xk) = 1

P(X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi)

Bayesian Categorization (cont.)

Need to know:
Priors: P(Y = yi)
Conditionals: P(X = xk | Y = yi)

We still need to make some sort of independence assumption about the features to make learning tractable.

Naïve Bayesian Categorization

If we assume the features of an instance are independent given the category (conditionally independent):

P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)

Therefore, we then only need to know P(Xi | Y) for each possible pair of a feature value and a category.

Smoothing

To account for estimation from small samples, probability estimates are adjusted or smoothed:

P(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m)

where nijk is the number of training examples with Y = yk and Xi = xij, nk is the number of training examples with Y = yk, p is a prior estimate of P(Xi = xij | Y = yk), and m is the weight given to that prior (a "virtual" sample size).

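A minimal sketch of the m-estimate (my own illustration; the function name is hypothetical). Choosing p = 1/(number of feature values) and m = (number of feature values) reduces it to add-one (Laplace) smoothing:

```python
def m_estimate(n_ijk, n_k, p, m):
    """Smoothed estimate of P(Xi = xij | Y = yk): (n_ijk + m*p) / (n_k + m)."""
    return (n_ijk + m * p) / (n_k + m)

# Example: a feature value never seen among 5 examples of class yk, with a
# uniform prior over 3 possible values (p = 1/3, m = 3). This is exactly
# add-one smoothing: (0 + 1) / (5 + 3) = 0.125, instead of an estimate of 0.
print(m_estimate(0, 5, 1/3, 3))
```
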
Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.
Calculate the required P(cj) and P(xk | cj) terms:
For each cj in C do:
  docsj ← the subset of documents for which the target class is cj
  P(cj) = |docsj| / |total # documents|
  Textj ← a single document containing all of docsj
  For each word xk in Vocabulary:
    nk ← the number of occurrences of xk in Textj
    n ← the total number of word occurrences in Textj
    P(xk | cj) = (nk + 1) / (n + |Vocabulary|)    (with add-one smoothing)

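A compact, illustrative implementation of this training procedure (my own sketch; the function and variable names are hypothetical, and add-one smoothing is assumed as above):

```python
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: the class label of each document."""
    vocabulary = {w for doc in docs for w in doc}
    priors, cond_prob = {}, {}
    for c in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)          # P(c_j)
        # Text_j: all documents of class c concatenated into one bag of words.
        counts = Counter(w for doc in class_docs for w in doc)
        n = sum(counts.values())
        # Add-one smoothed estimate of P(x_k | c_j).
        cond_prob[c] = {w: (counts[w] + 1) / (n + len(vocabulary))
                        for w in vocabulary}
    return vocabulary, priors, cond_prob
```
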
Naïve Bayes: Classifying

positions ← all word positions in the current document that contain tokens found in Vocabulary
Return cNB, where

cNB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)

Simply compute the posterior probability of each class and assign the output label to the class having the maximum posterior probability.

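Continuing the sketch above (again with hypothetical names, reusing train_naive_bayes from the previous slide's example), classification is a straightforward argmax over the product of the prior and the word likelihoods:

```python
def classify_naive_bayes(doc, vocabulary, priors, cond_prob):
    """Return argmax over classes of P(c) times the product of P(x_i | c)."""
    positions = [w for w in doc if w in vocabulary]
    scores = {c: priors[c] for c in priors}
    for c in scores:
        for w in positions:
            scores[c] *= cond_prob[c][w]
    return max(scores, key=scores.get)

# Tiny usage example with made-up documents:
docs = [["buy", "cheap", "meds"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
vocab, priors, cond = train_naive_bayes(docs, labels)
print(classify_naive_bayes(["cheap", "meds", "now"], vocab, priors, cond))  # -> spam
```
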
Naive Bayes: Time Complexity

Training time: O(|D|·Ld), where Ld is the average length of a document in D. Why?
Test time: O(Lt), where Lt is the average length of a test document.
Very efficient overall: linearly proportional to the time needed to just read in all the data.

Naïve Bayes Application: Digit Recognition System

Things We'd Like to Do

Spam classification
Given an email, predict whether it is spam or not

Weather
Based on temperature, humidity, etc., predict if it will rain tomorrow

Bayesian Classification Formulation

Problem statement:
Given features X1, X2, …, Xn
Predict a label Y

Another Application

Digit Recognition

[Figure: an image of a handwritten digit is fed to a classifier, which outputs "5".]

X1, …, Xn ∈ {0, 1} (black vs. white pixels)
Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)

The Bayes Classifier

In class, we saw that a good strategy is to predict the class with the highest posterior probability P(Y | X1, …, Xn).
(For example: what is the probability that the image represents a 5, given its pixels?)

So how do we compute that?

The Bayes Classifier

Use Bayes' rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

Here P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, P(Y | X1, …, Xn) is the posterior probability, and P(X1, …, Xn) is a normalization constant.

The Bayes Classifier

Let's expand this for our digit recognition task:

P(Y = 5 | X1, …, Xn) = P(X1, …, Xn | Y = 5) P(Y = 5) / P(X1, …, Xn)
P(Y = 6 | X1, …, Xn) = P(X1, …, Xn | Y = 6) P(Y = 6) / P(X1, …, Xn)

To classify, we'll simply compute these two probabilities and predict based on which one is greater.

Model Parameters

For the Bayes classifier, we need to learn two functions: the likelihood P(X1, …, Xn | Y) and the prior P(Y).

Model Parameters

The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters (for n binary pixels, the full conditional joint has on the order of 2^n entries per class):
We'll run out of space
We'll run out of time
And we'll need tons of training data (which is usually not available)

The Naïve Bayes Model

The Naïve Bayes assumption: assume that all features are independent given the class label Y.
Equationally speaking:

P(X1, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)

Naïve Bayes Training

Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data:

[Figure: training data of example digit images.]

Naïve Bayes Training

Training in Naïve Bayes is easy:
Estimate P(Y = v) as the fraction of records with Y = v
Estimate P(Xi = u | Y = v) as the fraction of records with Y = v for which Xi = u

Naïve Bayes Training

For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

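An illustrative sketch (my own, with a tiny made-up array standing in for real training digits) of what this averaging looks like for binary pixel features:

```python
import numpy as np

# images: (num_examples, num_pixels) array of 0/1 pixels; labels: 5s and 6s.
images = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
labels = np.array([5, 5, 6, 6])

# P(Y = v): fraction of training records with each label.
prior = {v: np.mean(labels == v) for v in (5, 6)}

# P(Xi = 1 | Y = v): per-pixel average over the examples of class v,
# i.e. "averaging all of the training fives together" (likewise the sixes).
pixel_prob = {v: images[labels == v].mean(axis=0) for v in (5, 6)}

print(prior[5], pixel_prob[5])  # 0.5 [1.  0.5 1. ]
```
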
Naïve Bayes Classification

[Figure/equation: a new image is classified by computing P(Y = v) Π_{i} P(Xi | Y = v) for v = 5 and v = 6 and predicting the label with the larger value.]

Naïve Bayes Assumption

Recall the Naïve Bayes assumption: that all features are independent given the class label Y.
Does this hold in the real world?

Exclusive-OR Example

For an example where conditional independence fails:
Y = XOR(X1, X2)

X1  X2  P(Y=0 | X1,X2)  P(Y=1 | X1,X2)
0   0   1               0
0   1   0               1
1   0   0               1
1   1   1               0

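A quick numerical check (my own sketch) of why Naïve Bayes cannot represent XOR: with uniform inputs, every class-conditional marginal P(Xi | Y) equals 0.5, so both classes receive identical scores for every input.

```python
from itertools import product

# The four equally likely XOR examples: (x1, x2, y).
data = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

for y in (0, 1):
    rows = [(x1, x2) for x1, x2, label in data if label == y]
    prior = len(rows) / len(data)
    # Class-conditional marginals P(X1 | Y=y) and P(X2 | Y=y) are all 0.5 ...
    p_x1 = {v: sum(x1 == v for x1, _ in rows) / len(rows) for v in (0, 1)}
    p_x2 = {v: sum(x2 == v for _, x2 in rows) / len(rows) for v in (0, 1)}
    # ... so the Naive Bayes score P(y) P(x1|y) P(x2|y) is 0.125 for every input,
    # and the classifier cannot separate the two classes at all.
    print(y, [prior * p_x1[x1] * p_x2[x2] for x1, x2, _ in data])
```
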
Actually, the Naïve Bayes assumption is almost never true.
Still, Naïve Bayes often performs surprisingly well even when its assumptions do not hold.

Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

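A log-space variant of the classifier sketched earlier (my own illustration; it assumes the smoothed cond_prob values are strictly positive so the logs are defined). It returns the same argmax but avoids underflow on long documents:

```python
import math

def classify_naive_bayes_log(doc, vocabulary, priors, cond_prob):
    """Return argmax over classes of log P(c) + sum of log P(x_i | c)."""
    positions = [w for w in doc if w in vocabulary]
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(cond_prob[c][w]) for w in positions)
    return max(priors, key=log_score)
```
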
Recap

We defined a Bayes classifier but saw that it's intractable to compute P(X1, …, Xn | Y).
We then used the Naïve Bayes assumption that everything is independent given the class label Y.

Conclusions

Naïve Bayes is:
Really easy to implement and often works well
Often a good first thing to try

Questions?

References

Rada Mihalcea, Information Retrieval and Web Search (www.cse.unt.edu/~rada/CSCE5200)
Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze (nlp.stanford.edu/IR-book/newslides.html)

Appendix: Mathematical Formulation

Appendix: Joint Distribution of Naïve Bayes (NB)

The numerator is equivalent to the joint probability model:
P(Y) P(X1, …, Xn | Y) = P(Y, X1, …, Xn)

Appendix: Conditional Independence of NB

Under the Naïve Bayes assumption, each feature Xi is conditionally independent of every other feature given the class label Y:
P(Xi | Y, Xj, …) = P(Xi | Y)

Appendix: NB Final Model

P(Y | X1, …, Xn) ∝ P(Y) Π_{i=1..n} P(Xi | Y)
