
SE: Artificial Intelligence

Lectures 3, 4, 5
Naïve Bayes Classifier

By

Dr Musi Ali
Mustansar.ali@uettaxila.edu.pk
Twitter: @musiali007

Outline

Classification
Text Categorization
Probability
Bayesian Classifier
Naïve Bayes Classifier
Application of Naïve Bayes
Conclusion

Categorization/Classification

Given:
A description of an instance, x ∈ X, where X is the instance language or instance space.
(Issue: how to represent text documents.)
A fixed set of categories: C = {c1, c2, …, cn}

Determine:
The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.

We want to know how to build categorization functions (classifiers).

Learning for Categorization

A training example is an instance x ∈ X paired with its correct category c(x): <x, c(x)>, for an unknown categorization function, c.
Given a set of training examples, D.
Find a hypothesized categorization function, h(x), such that:

∀ <x, c(x)> ∈ D : h(x) = c(x)

Sample Category Learning Problem

Instance language: <size, color, shape>
size ∈ {small, medium, large}
color ∈ {red, blue, green}
shape ∈ {square, circle, triangle}
C = {positive, negative}
D:

Example  Size   Color  Shape     Category
1        small  red    circle    positive
2        large  red    circle    positive
3        small  red    triangle  negative
4        large  blue   circle    negative

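As a concrete illustration (my own sketch, not from the slides), the training set D above and one candidate hypothesis h can be written in a few lines of Python; the rule shown is simply one hypothesis that happens to be consistent with these four examples.

```python
# Training set D over the <size, color, shape> instance language.
# Each training example pairs an instance x with its correct category c(x).
D = [
    (("small", "red", "circle"), "positive"),
    (("large", "red", "circle"), "positive"),
    (("small", "red", "triangle"), "negative"),
    (("large", "blue", "circle"), "negative"),
]

def h(x):
    """A hypothesized categorization function: 'red circles are positive'."""
    size, color, shape = x
    return "positive" if (color == "red" and shape == "circle") else "negative"

# h is consistent with D: for all <x, c(x)> in D, h(x) = c(x).
assert all(h(x) == c for x, c in D)
```
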
Text Categorization

Assigning documents to a fixed set of categories.
Applications:
Web pages: recommending, Yahoo-like classification
Newsgroup messages: recommending, spam filtering
News articles: personalized newspaper
Email messages: routing, prioritizing, folderizing, spam filtering

Is this spam?

From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!

=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Document Classification

[Figure: a test document containing the words "planning, language, proof, intelligence" is assigned to one of several classes grouped under (AI), (Programming), and (HCI), with subclasses ML, Planning, Semantics, Garb.Coll., Multimedia, and GUI. Each subclass has training documents with characteristic words, e.g. "learning, intelligence, algorithm, reinforcement, network" (ML); "planning, temporal, reasoning, plan, language" (Planning); "programming, semantics, language, proof" (Semantics); "garbage, collection, memory, optimization, region" (Garb.Coll.).]

Text Categorization Examples

Assign labels to each document or web page:
Labels are most often topics such as Yahoo-categories
  e.g., "finance", "sports", "news>world>asia>business"
Labels may be genres
  e.g., "editorials", "movie-reviews", "news"
Labels may be opinion
  e.g., like, hate, neutral
Labels may be domain-specific binary
  e.g., "interesting-to-me" : "not-interesting-to-me"
  e.g., spam : not-spam
  e.g., "is a toner cartridge ad" : "isn't"

Methods (1)

Manual classification
Used by Yahoo!, Looksmart, about.com, ODP, Medline
Very accurate when the job is done by experts
Consistent when the problem size and team are small
Difficult and expensive to scale

Automatic document classification
Hand-coded rule-based systems
Used by CS departments' spam filters, Reuters, CIA, Verity, …
E.g., assign a category if the document contains a given Boolean combination of words
Commercial systems have complex query languages

Methods (2)

Accuracy is often very high if a query has been carefully refined over time by a subject expert
Building and maintaining these queries is expensive

Supervised learning of a document-label assignment function
Many new systems rely on machine learning (Autonomy, Kana, MSN, Verity, …)
k-Nearest Neighbors (simple, powerful)
Naive Bayes (simple, common method)
Support-vector machines (new, more powerful)
Plus many other methods
No free lunch: requires hand-classified training data

Text Categorization: Attributes

Representations of text are very high dimensional (one feature for each word).
Algorithms that prevent overfitting in high-dimensional space are best.
For most text categorization tasks, there are many irrelevant and many relevant features.

Bayesian Methods

Learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Build a generative model that approximates how the data is produced.
Uses the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Axioms of Probability Theory

All probabilities lie between 0 and 1:
0 ≤ P(A) ≤ 1
A true proposition has probability 1, a false proposition has probability 0:
P(true) = 1,  P(false) = 0
The probability of a disjunction is:
P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Conditional Probability

P(A | B) is the probability of A given B.
Assumes that B is all and only the information known.
Defined by:
P(A | B) = P(A ∧ B) / P(B)

Independence

A and B are independent if:
P(A | B) = P(A)
P(B | A) = P(B)
(These two constraints are logically equivalent.)

Therefore, if A and B are independent:
P(A | B) = P(A ∧ B) / P(B) = P(A)
P(A ∧ B) = P(A) P(B)

Joint Distribution

The joint probability distribution for a set of random variables X1, …, Xn gives the probability of every combination of values: P(X1, …, Xn). If all variables are discrete with v values, this is an n-dimensional array with v^n entries, and all v^n entries must sum to 1.

          positive           negative
          circle  square     circle  square
red       0.20    0.02       0.05    0.30
blue      0.02    0.01       0.20    0.20

The probability of any conjunction (an assignment of values to some subset of the variables) can be calculated by summing the appropriate subset of values from the joint distribution:
P(red ∧ circle) = 0.20 + 0.05 = 0.25
P(red) = 0.20 + 0.02 + 0.05 + 0.30 = 0.57
Therefore, all conditional probabilities can also be calculated.

Joint Distribution, Example

P(positive ∧ red ∧ circle) = 0.20
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80

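A short, illustrative Python sketch (my own, not from the slides) that encodes the joint distribution above and recomputes these marginal and conditional probabilities:

```python
# Joint distribution P(Category, Color, Shape), copied from the table above.
joint = {
    ("positive", "red", "circle"): 0.20, ("positive", "red", "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red", "circle"): 0.05, ("negative", "red", "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

def prob(**fixed):
    """Sum the joint entries consistent with the fixed variable values."""
    names = ("category", "color", "shape")
    return sum(p for values, p in joint.items()
               if all(dict(zip(names, values))[k] == v for k, v in fixed.items()))

print(prob(color="red", shape="circle"))        # P(red ∧ circle) = 0.25
print(prob(category="positive", color="red", shape="circle")
      / prob(color="red", shape="circle"))      # P(positive | red ∧ circle) = 0.8
```
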
Probabilistic Classification

Let Y be the random variable for the class, which takes values {y1, y2, …, ym}.
Let X be the random variable describing an instance, consisting of a vector of values for n features <X1, X2, …, Xn>; let xk be a possible value for X and xij a possible value for Xi.
For classification, we need to compute P(Y = yi | X = xk) for i = 1, …, m.

Motivational Stuff

Life's battles don't always go to the stronger or faster man.
But sooner or later, the man who wins is the man who thinks he CAN!

(Note: not part of the course)


Bayes' Theorem

P(H | E) = P(E | H) P(H) / P(E)

Simple proof from the definition of conditional probability:

P(H | E) = P(H ∧ E) / P(E)    (def. of conditional probability)
P(E | H) = P(H ∧ E) / P(H)    (def. of conditional probability)
Therefore P(H ∧ E) = P(E | H) P(H), and so
P(H | E) = P(E | H) P(H) / P(E)

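As a sanity check (my own addition), Bayes' theorem can be verified numerically with the joint distribution from the earlier slides, taking H = "category is positive" and E = "red ∧ circle":

```python
# Probabilities read off the joint distribution table shown earlier.
p_H = 0.20 + 0.02 + 0.02 + 0.01        # P(positive) = 0.25
p_E = 0.20 + 0.05                      # P(red ∧ circle) = 0.25
p_E_given_H = 0.20 / p_H               # P(red ∧ circle | positive) = 0.8

p_H_given_E = p_E_given_H * p_H / p_E  # Bayes' theorem
print(round(p_H_given_E, 4))           # 0.8, matching P(positive | red ∧ circle) above
```
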
Bayesian Categorization

Determine the category of xk by determining, for each yi:

P(Y = yi | X = xk) = P(Y = yi) P(X = xk | Y = yi) / P(X = xk)

P(X = xk) can be determined since the categories are complete and disjoint:

Σ_{i=1..m} P(Y = yi | X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi) / P(X = xk) = 1

P(X = xk) = Σ_{i=1..m} P(Y = yi) P(X = xk | Y = yi)

Bayesian Categorization (cont.)

Need to know:
Priors: P(Y = yi)
Conditionals: P(X = xk | Y = yi)

We still need to make some sort of independence assumption about the features to make learning tractable.

Naïve Bayesian Categorization

If we assume the features of an instance are independent given the category (conditionally independent):

P(X | Y) = P(X1, X2, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)

Therefore, we then only need to know P(Xi | Y) for each possible pair of a feature value and a category.

Smoothing

To account for estimation from small samples, probability estimates are adjusted or smoothed:

P(Xi = xij | Y = yk) = (nijk + m·p) / (nk + m)

where nijk is the number of training examples with Y = yk and Xi = xij, nk is the number of training examples with Y = yk, p is a prior estimate of P(Xi = xij | Y = yk), and m is the weight given to that prior (a "virtual" sample size).

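A minimal sketch of the m-estimate (my own illustration; the function name is hypothetical). Choosing p = 1/(number of feature values) and m = (number of feature values) reduces it to add-one (Laplace) smoothing:

```python
def m_estimate(n_ijk, n_k, p, m):
    """Smoothed estimate of P(Xi = xij | Y = yk): (n_ijk + m*p) / (n_k + m)."""
    return (n_ijk + m * p) / (n_k + m)

# Example: a feature value never seen among 5 examples of class yk, with a
# uniform prior over 3 possible values (p = 1/3, m = 3). This is exactly
# add-one smoothing: (0 + 1) / (5 + 3) = 0.125, instead of an estimate of 0.
print(m_estimate(0, 5, 1/3, 3))
```
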
Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.
Calculate the required P(cj) and P(xk | cj) terms:
For each cj in C do:
  docsj ← the subset of documents for which the target class is cj
  P(cj) = |docsj| / |total # documents|
  Textj ← a single document containing all of docsj
  For each word xk in Vocabulary:
    nk ← the number of occurrences of xk in Textj
    n ← the total number of word occurrences in Textj
    P(xk | cj) = (nk + 1) / (n + |Vocabulary|)    (with add-one smoothing)

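A compact, illustrative implementation of this training procedure (my own sketch; the function and variable names are hypothetical, and add-one smoothing is assumed as above):

```python
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: the class label of each document."""
    vocabulary = {w for doc in docs for w in doc}
    priors, cond_prob = {}, {}
    for c in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)          # P(c_j)
        # Text_j: all documents of class c concatenated into one bag of words.
        counts = Counter(w for doc in class_docs for w in doc)
        n = sum(counts.values())
        # Add-one smoothed estimate of P(x_k | c_j).
        cond_prob[c] = {w: (counts[w] + 1) / (n + len(vocabulary))
                        for w in vocabulary}
    return vocabulary, priors, cond_prob
```
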
Naïve Bayes: Classifying

positions ← all word positions in the current document that contain tokens found in Vocabulary
Return cNB, where

cNB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)

Simply compute the posterior probability of each class and assign the output label to the class having the maximum posterior probability.

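Continuing the sketch above (again with hypothetical names, reusing train_naive_bayes from the previous slide's example), classification is a straightforward argmax over the product of the prior and the word likelihoods:

```python
def classify_naive_bayes(doc, vocabulary, priors, cond_prob):
    """Return argmax over classes of P(c) times the product of P(x_i | c)."""
    positions = [w for w in doc if w in vocabulary]
    scores = {c: priors[c] for c in priors}
    for c in scores:
        for w in positions:
            scores[c] *= cond_prob[c][w]
    return max(scores, key=scores.get)

# Tiny usage example with made-up documents:
docs = [["buy", "cheap", "meds"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
vocab, priors, cond = train_naive_bayes(docs, labels)
print(classify_naive_bayes(["cheap", "meds", "now"], vocab, priors, cond))  # -> spam
```
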
Naive Bayes: Time Complexity

Training time: O(|D|·Ld), where Ld is the average length of a document in D. Why?
Test time: O(Lt), where Lt is the average length of a test document.
Very efficient overall: linearly proportional to the time needed to just read in all the data.

Naïve Bayes Application: Digit Recognition System

Things We'd Like to Do

Spam classification
Given an email, predict whether it is spam or not

Weather
Based on temperature, humidity, etc., predict if it will rain tomorrow

Bayesian Classification Formulation

Problem statement:
Given features X1, X2, …, Xn
Predict a label Y

Another Application

Digit Recognition

[Figure: an image of a handwritten digit is fed to a classifier, which outputs "5".]

X1, …, Xn ∈ {0, 1} (black vs. white pixels)
Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)

The Bayes Classifier

In class, we saw that a good strategy is to predict the class with the highest posterior probability P(Y | X1, …, Xn).
(For example: what is the probability that the image represents a 5, given its pixels?)

So how do we compute that?

The Bayes Classifier

Use Bayes' rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

Here P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, P(Y | X1, …, Xn) is the posterior probability, and P(X1, …, Xn) is a normalization constant.

The Bayes Classifier

Let's expand this for our digit recognition task:

P(Y = 5 | X1, …, Xn) = P(X1, …, Xn | Y = 5) P(Y = 5) / P(X1, …, Xn)
P(Y = 6 | X1, …, Xn) = P(X1, …, Xn | Y = 6) P(Y = 6) / P(X1, …, Xn)

To classify, we'll simply compute these two probabilities and predict based on which one is greater.

Model Parameters

For the Bayes classifier, we need to learn two functions: the likelihood P(X1, …, Xn | Y) and the prior P(Y).

Model Parameters

The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters (for n binary pixels, the full conditional joint has on the order of 2^n entries per class):
We'll run out of space
We'll run out of time
And we'll need tons of training data (which is usually not available)

The Naïve Bayes Model

The Naïve Bayes assumption: assume that all features are independent given the class label Y.
Equationally speaking:

P(X1, …, Xn | Y) = Π_{i=1..n} P(Xi | Y)

Naïve Bayes Training

Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data:

[Figure: training data of example digit images.]

Naïve Bayes Training

Training in Naïve Bayes is easy:
Estimate P(Y = v) as the fraction of records with Y = v
Estimate P(Xi = u | Y = v) as the fraction of records with Y = v for which Xi = u

Naïve Bayes Training

For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

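An illustrative sketch (my own, with a tiny made-up array standing in for real training digits) of what this averaging looks like for binary pixel features:

```python
import numpy as np

# images: (num_examples, num_pixels) array of 0/1 pixels; labels: 5s and 6s.
images = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
labels = np.array([5, 5, 6, 6])

# P(Y = v): fraction of training records with each label.
prior = {v: np.mean(labels == v) for v in (5, 6)}

# P(Xi = 1 | Y = v): per-pixel average over the examples of class v,
# i.e. "averaging all of the training fives together" (likewise the sixes).
pixel_prob = {v: images[labels == v].mean(axis=0) for v in (5, 6)}

print(prior[5], pixel_prob[5])  # 0.5 [1.  0.5 1. ]
```
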
Naïve Bayes Classification

[Figure/equation: a new image is classified by computing P(Y = v) Π_{i} P(Xi | Y = v) for v = 5 and v = 6 and predicting the label with the larger value.]

Naïve Bayes Assumption

Recall the Naïve Bayes assumption: that all features are independent given the class label Y.
Does this hold in the real world?

Exclusive-OR Example

For an example where conditional independence fails:
Y = XOR(X1, X2)

X1  X2  P(Y=0 | X1,X2)  P(Y=1 | X1,X2)
0   0   1               0
0   1   0               1
1   0   0               1
1   1   1               0

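A quick numerical check (my own sketch) of why Naïve Bayes cannot represent XOR: with uniform inputs, every class-conditional marginal P(Xi | Y) equals 0.5, so both classes receive identical scores for every input.

```python
from itertools import product

# The four equally likely XOR examples: (x1, x2, y).
data = [(x1, x2, x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]

for y in (0, 1):
    rows = [(x1, x2) for x1, x2, label in data if label == y]
    prior = len(rows) / len(data)
    # Class-conditional marginals P(X1 | Y=y) and P(X2 | Y=y) are all 0.5 ...
    p_x1 = {v: sum(x1 == v for x1, _ in rows) / len(rows) for v in (0, 1)}
    p_x2 = {v: sum(x2 == v for _, x2 in rows) / len(rows) for v in (0, 1)}
    # ... so the Naive Bayes score P(y) P(x1|y) P(x2|y) is 0.125 for every input,
    # and the classifier cannot separate the two classes at all.
    print(y, [prior * p_x1[x1] * p_x2[x2] for x1, x2, _ in data])
```
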
Actually, the Naïve Bayes assumption is almost never true.
Still, Naïve Bayes often performs surprisingly well even when its assumptions do not hold.

Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

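A log-space variant of the classifier sketched earlier (my own illustration; it assumes the smoothed cond_prob values are strictly positive so the logs are defined). It returns the same argmax but avoids underflow on long documents:

```python
import math

def classify_naive_bayes_log(doc, vocabulary, priors, cond_prob):
    """Return argmax over classes of log P(c) + sum of log P(x_i | c)."""
    positions = [w for w in doc if w in vocabulary]
    def log_score(c):
        return math.log(priors[c]) + sum(math.log(cond_prob[c][w]) for w in positions)
    return max(priors, key=log_score)
```
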
Recap

We defined a Bayes classifier but saw that it's intractable to compute P(X1, …, Xn | Y).
We then used the Naïve Bayes assumption that everything is independent given the class label Y.

Conclusions

Naïve Bayes is:
Really easy to implement and often works well
Often a good first thing to try

Questions?

References

Rada Mihalcea, Information Retrieval and Web Search (www.cse.unt.edu/~rada/CSCE5200)
Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze (nlp.stanford.edu/IR-book/newslides.html)

Appendix: Mathematical Formulation

Appendix: Joint Distribution of Naïve Bayes (NB)

The numerator is equivalent to the joint probability model:
P(Y) P(X1, …, Xn | Y) = P(Y, X1, …, Xn)

Appendix: Conditional Independence of NB

Under the Naïve Bayes assumption, each feature Xi is conditionally independent of every other feature given the class label Y:
P(Xi | Y, Xj, …) = P(Xi | Y)

Appendix: NB Final Model

P(Y | X1, …, Xn) ∝ P(Y) Π_{i=1..n} P(Xi | Y)
