
CS5011: Introduction to Machine Learning

Fall 2014

Lecture 1: A - Introduction to Probability Theory


Lecturers: Nalini Deswal & Harini A

Scribes: Dibu John Philip

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.
Probability theory is the branch of mathematics that deals with uncertainty and the analysis of random
phenomena. The central objects of probability theory are random variables, stochastic processes, and events.
In this class, the basic concepts of probability theory are discussed.

Basic Elements
Sample Space S
All the possible outcomes of an experiment. For example, tossing a coin has the sample space
$S = \{H, T\}$ (a discrete sample space); selecting a number between 0 and 1 has the sample space
$S = [0, 1]$ (a continuous sample space).
Events F
The collection of subsets of the sample space that are of interest. Each element $A \in F$ is an
event, i.e. a subset of S.
If A and B are two sets,
containment: $A \subset B \iff (x \in A \Rightarrow x \in B)$;
equality: $A = B \iff A \subset B$ and $B \subset A$.

Operations on Sets
Union
$A \cup B$: the set of all elements that are in A or in B or in both A and B.
Intersection
$A \cap B$: the set of all elements that are in both A and B.
Complement
$A^c$: the set of all elements of S that are not in A.

See Fig.1 for the Venn diagram representation of these operations.

Properties of Set Operations


Commutative: $A \cup B = B \cup A$; $A \cap B = B \cap A$
Associative: $A \cup (B \cup C) = (A \cup B) \cup C$

Figure 1: Set operations

Figure 2: Disjoint sets

Distributive: $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$
De Morgan's law¹: $(A \cup B)^c = A^c \cap B^c$
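
The set laws listed in this section can be spot-checked with Python's built-in set type. This is only a sketch: the universe S and the subsets A, B, C are arbitrary illustrative choices that do not come from the lecture, and a check on one example is not a proof.

```python
# Universe (sample space) and three example subsets; the specific
# elements are arbitrary illustrative choices, not from the notes.
S = set(range(1, 11))
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}
C = {6, 7, 8}


def complement(X, universe=S):
    """Return the complement of X relative to the universe."""
    return universe - X


commutative = (A | B == B | A) and (A & B == B & A)
associative = (A | (B | C)) == ((A | B) | C)
distributive = (A & (B | C)) == ((A & B) | (A & C))
de_morgan = complement(A | B) == (complement(A) & complement(B))

print(commutative, associative, distributive, de_morgan)
```

Here `|` is union, `&` is intersection, and the complement is taken relative to S, matching the definitions in the section on operations on sets.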

Disjoint Sets

Two sets A and B are said to be disjoint if there are no elements common to both A and B; i.e.,
$A \cap B = \emptyset$. See Fig. 2.
¹ Augustus De Morgan was a British mathematician and logician. He formulated De Morgan's laws and introduced the term
"mathematical induction", making its idea rigorous.


Figure 3: Partitioning of Sample Space

Partitioning

Dividing a sample space S into sets $A_1, A_2, \ldots, A_n$ such that there is nothing in common
between the sets and together they cover S; i.e., $A_i \cap A_j = \emptyset$ for all $i \neq j$, and
$\bigcup_{i=1}^{n} A_i = S$. See Fig. 3.

Sigma Algebra

A sigma algebra B is a collection of subsets of S that satisfies the following properties:
1. $\emptyset \in B$
2. $A \in B \Rightarrow A^c \in B$
3. $A_1, A_2, \ldots, A_n \in B \Rightarrow \bigcup_{i=1}^{n} A_i \in B$

Example: $S = \{1, 2\}$; $B = \{\{1\}, \{2\}, \{1, 2\}, \emptyset\}$ (the power set of S, $P_S$).
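
As a rough illustration of the three properties, the sketch below builds the power set of S = {1, 2} and checks each property directly. The helper names `power_set` and `is_sigma_algebra` are hypothetical, and the check only makes sense for a finite S.

```python
from itertools import chain, combinations


def power_set(S):
    """All subsets of S, each as a frozenset."""
    items = list(S)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(sub) for sub in subsets}


def is_sigma_algebra(B, S):
    """Check the three listed properties for a finite collection B."""
    S = frozenset(S)
    if frozenset() not in B:                       # property 1
        return False
    if any((S - A) not in B for A in B):           # property 2
        return False
    # Property 3: for a finite collection, closure under pairwise
    # unions implies closure under all finite unions.
    return all((A1 | A2) in B for A1 in B for A2 in B)


S = {1, 2}
B = power_set(S)
print(is_sigma_algebra(B, S))                      # the power set of S
print(is_sigma_algebra({frozenset(), frozenset({1})}, S))
```

The second call uses a collection that contains {1} but not its complement {2}, so it fails property 2.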

Probability Function

$P: F \to \mathbb{R}$ is a function that satisfies the following properties, called the Axioms of Probability:

1. $P(A) \geq 0$
2. $P(S) = 1$
3. For disjoint events $A_1, A_2, \ldots, A_n$, $P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i)$

Example: Tossing of a coin; $S = \{H, T\}$; $B = \{\{H\}, \{T\}, \{H, T\}, \emptyset\}$ is the domain over which
probability is defined.

$P(H) = \frac{1}{2}$; $P(T) = \frac{1}{2}$;
$P(H \text{ or } T) = 1$; $P(\emptyset) = 0$.

Properties of Probability
1. $P(A) \leq 1$
2. $P(A^c) = 1 - P(A)$
3. $P(A \cup B) = P(A) + P(B) - P(A \cap B)$

From (1) and (3), $P(A \cup B) \leq 1$
$\Rightarrow P(A) + P(B) - P(A \cap B) \leq 1$
$\Rightarrow P(A \cap B) \geq P(A) + P(B) - 1$

This is called Bonferroni's² inequality, which gives a lower bound on the probability of the intersection of
two events. However, when P(A) and P(B) are small (their sum is less than 1), the RHS is negative, which
makes the bound trivial. In other words, when you have two rare events, this inequality does not help you
much. The inequality can also be used to bound $P(A \cap B)$ when it is difficult to calculate exactly.
Generalizing this,

$P\left(\bigcap_{i=1}^{n} A_i\right) \geq \sum_{i=1}^{n} P(A_i) - (n - 1)$
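
To see when the bound is useful and when it is trivial, here is a small sketch on a fair die with equally likely outcomes. The events A, B, R1 and R2 are hypothetical choices made for illustration; only the inequality itself comes from the notes.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}           # fair die, all outcomes equally likely


def prob(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event & S), len(S))


A = {1, 2, 3, 4, 5}              # hypothetical event: "not a six"
B = {2, 3, 4, 5, 6}              # hypothetical event: "not a one"

lower_bound = prob(A) + prob(B) - 1
print(prob(A & B), lower_bound)  # exact value and its lower bound

# Property 3 (inclusion-exclusion) holds exactly.
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Two rare events: the bound is negative, hence trivial.
R1, R2 = {1}, {2}
trivial_bound = prob(R1) + prob(R2) - 1
print(prob(R1 & R2), trivial_bound)
```

For the two likely events the bound equals the exact value 2/3, while for the two rare events it is -2/3, which says nothing beyond the first axiom.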

Conditional Probability

The conditional probability of an event A given an event B with $P(B) > 0$ is

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

Fig. 4 explains this with Venn diagrams.
Example: Roll of a single die. $S = \{1, 2, 3, 4, 5, 6\}$. Let A be the event of getting 1. Let B be the event
that the die shows an odd number. $P(A) = P(1) = \frac{1}{6}$. Knowing that the die shows an odd number
increases our confidence that 1 has shown up:
$P(A \mid B) = P(1 \mid \text{odd no.}) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3}$.
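
The die example can be reproduced with exact fractions. This is a minimal sketch assuming equally likely outcomes; the helper names `prob` and `cond_prob` are hypothetical.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))


def cond_prob(A, B):
    """P(A | B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    if prob(B) == 0:
        raise ValueError("conditioning event has probability zero")
    return prob(A & B) / prob(B)


A = {1}            # the die shows 1
B = {1, 3, 5}      # the die shows an odd number

print(prob(A))          # 1/6
print(cond_prob(A, B))  # 1/3
```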

Independence

Two events A and B are independent if the occurrence of one has no effect on the probability of the other.
² Carlo Emilio Bonferroni's first interests were in music, and he studied conducting and the piano at the Music Conservatory
of Turin. However, his interests turned towards mathematics, encouraged by his father.


Figure 4: Conditional Probability

Formally, A and B are independent if $P(A \cap B) = P(A)P(B)$; equivalently, when $P(B) > 0$,
$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(A)$.
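
The product rule gives a direct test for independence. The sketch below applies it to a fair die; the pair (even, at most four) is a hypothetical example chosen because it happens to be independent, while the pair (one, odd) reuses the events from the conditional probability example.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))


def independent(A, B):
    """True when P(A ∩ B) = P(A) P(B)."""
    return prob(A & B) == prob(A) * prob(B)


one, odd = {1}, {1, 3, 5}
even, at_most_four = {2, 4, 6}, {1, 2, 3, 4}

print(independent(one, odd))            # 1/6 versus 1/6 * 1/2
print(independent(even, at_most_four))  # 1/3 versus 1/2 * 2/3
```

The first pair is dependent, which agrees with P(A | B) = 1/3 differing from P(A) = 1/6 in the die example.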


Bayes' Theorem³

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$

$\Rightarrow P(A \cap B) = P(A \mid B)P(B)$    (1)

Also,

$P(A \cap B) = P(B \mid A)P(A)$    (2)

From eqns. 1 and 2,

$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$

where $P(A \mid B)$ is called the posterior probability, $P(B \mid A)$ the likelihood, P(A) the prior, and P(B) the
evidence.
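
As a sanity check, the formula can be applied to the die example from the conditional probability section, with A the event of rolling a 1 and B the event of rolling an odd number. The function name `bayes_posterior` is a hypothetical helper, not something defined in the lecture.

```python
from fractions import Fraction


def bayes_posterior(likelihood, prior, evidence):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    if evidence == 0:
        raise ValueError("evidence P(B) must be positive")
    return likelihood * prior / evidence


# Die example: A = "a 1 is rolled", B = "an odd number is rolled".
likelihood = Fraction(1)       # P(B | A): a 1 is always odd
prior = Fraction(1, 6)         # P(A)
evidence = Fraction(1, 2)      # P(B)

posterior = bayes_posterior(likelihood, prior, evidence)
print(posterior)               # 1/3
```

The result matches the value obtained directly from the definition of conditional probability.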


Random Variables

A random variable is a function that maps the sample space S to $\mathbb{R}$. See Fig. 5.

Example: Tossing a coin thrice. $S = \{HHH, HHT, HTT, TTT, TTH, THH, THT, HTH\}$. Suppose we are interested in the number of heads in the experiment: $X \in \{0, 1, 2, 3\}$, with $P(X = x_i) = P(\{s_j \in S : X(s_j) = x_i\})$.

$P(X = 0) = \frac{1}{8}$; $P(X = 1) = \frac{3}{8}$;
$P(X = 2) = \frac{3}{8}$; $P(X = 3) = \frac{1}{8}$.
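
The probabilities above follow from enumerating the sample space and applying X to every outcome. A minimal sketch, assuming a fair coin so that all eight outcomes are equally likely:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2^3 outcomes of three tosses, e.g. "HHT".
sample_space = ["".join(t) for t in product("HT", repeat=3)]


def X(outcome):
    """Random variable: number of heads in the outcome string."""
    return outcome.count("H")


# Count how many outcomes map to each value of X.
counts = Counter(X(s) for s in sample_space)
pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

Each probability is the number of outcomes mapped to that value divided by 8, which is exactly the set $\{s_j \in S : X(s_j) = x_i\}$ in the definition.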
³ Thomas Bayes never published what would eventually become his most famous accomplishment; his notes were edited and
published after his death by Richard Price.


Figure 5: Random Variable

Another example: $S = [-1, 1]$, with a taking a value between -1 and 1. Let $X: a \mapsto a^2$; then
$X: S \to [0, 1] \subset \mathbb{R}$ is a continuous random variable.

Probability Mass Function (pmf): discrete random variable

Gives the probability of each numerical value that the random variable can take.

$p_X(x_i) = P(X = x_i) \quad \forall x_i$

With respect to the previous 3-coin tossing example,

$p_X(x) = \begin{cases} \frac{1}{8}, & \text{if } x = 0 \text{ or } 3 \\ \frac{3}{8}, & \text{if } x = 1 \text{ or } 2 \end{cases}$

Probability Density Function (pdf): continuous random variable

$f_X(x)$ is a continuous and non-negative function.

$P(X \in B) = \int_B f_X(x)\,dx$

$P(a \leq X \leq b) = \int_a^b f_X(x)\,dx$

The probability of a single point is zero in the continuous case (check by putting a = b in the above equation).


Figure 6: cdf of X


Cumulative Distribution Function


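$F_X(x) = P(X \leq x) = \sum_{x_i \leq x} P(X = x_i)$; discrete.

$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t)\,dt$; continuous.

Example: Again from the previous 3-coin tossing example;

$F_X(x) = \begin{cases}
0, & x < 0 \\
\frac{1}{8}, & 0 \leq x < 1 \\
\frac{1}{8} + \frac{3}{8} = \frac{1}{2}, & 1 \leq x < 2 \\
\frac{1}{2} + \frac{3}{8} = \frac{7}{8}, & 2 \leq x < 3 \\
\frac{7}{8} + \frac{1}{8} = 1, & x \geq 3
\end{cases}$

Fig. 6 shows the cdf of X. The cdf has the following properties:
1. Every cumulative distribution function $F_X$ is non-decreasing and right-continuous;
2. $\lim_{x \to -\infty} F_X(x) = 0$;
3. $\lim_{x \to \infty} F_X(x) = 1$.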
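
For a discrete random variable the cdf is a running sum of the pmf. The sketch below evaluates it for the 3-coin example at a few test points; the points themselves are arbitrary choices.

```python
from fractions import Fraction

# pmf of the number of heads in three fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def cdf(x):
    """F_X(x) = P(X <= x): sum of the pmf over all values x_i <= x."""
    return sum((p for xi, p in pmf.items() if xi <= x), Fraction(0))


for x in (-1, 0, 0.5, 1, 2, 2.9, 3, 10):
    print(x, cdf(x))
```

The printed values stay constant between the integers and jump at 0, 1, 2 and 3, which is the step shape of a discrete cdf, and they run from 0 up to 1 as the limit properties require.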


Functions of Random Variables

Y = g(X) is a function of the random variable X; Y is also a random variable.


e.g. X: temperature in °C;
Y = 1.8X + 32: the same temperature in °F.


Expected Value / Mean

The mean of a discrete random variable X is a weighted average of the possible values that the random
variable can take.
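Let $\mathcal{X}$ be the set of values that the random variable X can take, and let g(X) be a function of X.
Then the expected value of g(X) is given as

$\mu = E[g(X)] = \sum_{x \in \mathcal{X}} g(x) P(X = x)$

For continuous random variables,

$\mu = E[g(X)] = \int_{x \in \mathcal{X}} g(x) f_X(x)\,dx$

Example: Toss a biased coin twice. Let the probability of it showing heads be $p = \frac{3}{4}$, and let X be
the number of heads.

$P(X = 0) = (\frac{1}{4})^2$; $P(X = 1) = 2 \cdot \frac{3}{4} \cdot \frac{1}{4}$; $P(X = 2) = (\frac{3}{4})^2$.

$\mu = E[X] = 0 \cdot (\frac{1}{4})^2 + 1 \cdot 2 \cdot \frac{3}{4} \cdot \frac{1}{4} + 2 \cdot (\frac{3}{4})^2 = \frac{3}{2}$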
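
The biased-coin expectation can be recomputed with exact fractions. In this sketch the pmf is built from the binomial formula, a standard result that the notes do not state explicitly but that reproduces the probabilities of 0, 1 and 2 heads; the helper name `expectation` is hypothetical.

```python
from fractions import Fraction
from math import comb

p = Fraction(3, 4)   # probability of heads on one toss
n = 2                # number of tosses

# P(X = k) = C(n, k) p^k (1 - p)^(n - k) for k heads in n tosses.
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def expectation(g, pmf):
    """E[g(X)] = sum over x of g(x) P(X = x)."""
    return sum(g(x) * px for x, px in pmf.items())


mean = expectation(lambda x: x, pmf)
print(pmf)
print(mean)          # 3/2
```

Passing a different `g` gives other expectations from the same pmf, for example `expectation(lambda x: x**2, pmf)` for $E[X^2]$.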

Moment of a Random Variable

For an integer n, the nth moment of X is

$\mu'_n = E[X^n]$

When n = 1,

$\mu'_1 = \mu = E[X]$

where $\mu$ is the mean or expected value of the random variable. That is, the mean is the first moment.


Central Moment
$\mu_n = E[(X - \mu)^n]$

When n = 2,
Variance $= \sigma^2 = E[(X - \mu)^2]$
$= E[X^2 + \mu^2 - 2\mu X]$
$= E[X^2] + \mu^2 - 2\mu E[X]$
$= E[X^2] - E[X]^2$

Standard deviation $= \sigma$, the positive square root of the variance.

The variance of a discrete random variable X measures the spread, or variability, of the distribution. Variance
is the second central moment.


Some Properties of Mean and Variance

1. $E[a\,g_1(X) + b\,g_2(X)] = a\,E[g_1(X)] + b\,E[g_2(X)]$, where a and b are scalars.
2. $\mathrm{Var}(a\,g(X)) = a^2\,\mathrm{Var}(g(X))$
3. $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$
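
Properties 1 and 2, along with the variance identity $E[X^2] - E[X]^2$, can be checked numerically on the biased-coin example (two tosses, p = 3/4). This is a sketch only: the scalars a and b are arbitrary, the helper names are hypothetical, and property 3 is left out because it needs a joint distribution for X and Y.

```python
from fractions import Fraction

# pmf of the number of heads in two tosses of a coin with p = 3/4.
pmf = {0: Fraction(1, 16), 1: Fraction(3, 8), 2: Fraction(9, 16)}


def expectation(g, pmf):
    """E[g(X)] = sum over x of g(x) P(X = x)."""
    return sum(g(x) * px for x, px in pmf.items())


def variance(g, pmf):
    """Second central moment of g(X): E[(g(X) - E[g(X)])^2]."""
    mu = expectation(g, pmf)
    return expectation(lambda x: (g(x) - mu) ** 2, pmf)


mean = expectation(lambda x: x, pmf)
second_moment = expectation(lambda x: x**2, pmf)
var = variance(lambda x: x, pmf)

a, b = 3, 5   # arbitrary scalars chosen for illustration

# Variance as the second moment minus the squared mean.
print(var, second_moment - mean**2)

# Property 1 with g1(x) = x and g2(x) = x^2.
print(expectation(lambda x: a * x + b * x**2, pmf), a * mean + b * second_moment)

# Property 2 with g(x) = x.
print(variance(lambda x: a * x, pmf), a**2 * var)
```

Each printed pair should agree; using `Fraction` keeps the comparison exact rather than subject to floating-point rounding.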


Next topic

The next part of this lecture will introduce the different probability distributions.

