UNIT-I
Population:

In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of actually existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all possible hands in a game of poker). In short, a population is the collection of all individuals or items under consideration in a statistical study. A common aim of statistical analysis is to produce information about some chosen population. In statistical inference, a subset of the population (a statistical sample) is chosen to represent the population in a statistical analysis. If a sample is chosen properly, characteristics of the entire population that the sample is drawn from can be estimated from corresponding characteristics of the sample.
Sample: In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations. Typically, the population is very large, making a census or a complete enumeration of all the values in the population either impractical or impossible. The sample is that part of the population from which information is collected. The sample usually represents a subset of manageable size. Samples are collected and statistics are calculated from the samples so that one can make inferences or extrapolations from the sample to the population. The data sample may be drawn from a population without replacement, in which case it is a subset of the population; or with replacement, in which case it is a multisubset.

Parameters: The various constants such as the mean, variance, correlation coefficient etc. of the population are known as parameters. The parameters are functions of the population observations. A parameter is usually unknown and is estimated from the sample, so the inference about some specific unknown parameter is based on a statistic.

Statistic: The estimator which is used to estimate a population parameter is called a statistic. A statistic is also a constant such as the mean, median, variance, correlation coefficient etc., computed from the sample observations. Hence, a statistic is a function of the sample observations and is used to make inferences about parameters. The primary focus of most research studies is the parameter of the population, not the statistics calculated for the particular sample selected. The sample and the statistics describing it are important only insofar as they provide information about the unknown parameters.

Sampling distribution of a statistic:

Suppose we have a population of size N and we are interested in drawing a sample of size n from the population. Each time we draw a sample of size n we may get a different set of observations, i.e. we can get ᴺCₙ possible samples. If we calculate some particular statistic from each of the ᴺCₙ samples, the distribution of the sample statistic is called the sampling distribution of the statistic. For example, if we consider the mean as the statistic, then the distribution of all possible means of the samples is called the sampling distribution of the mean. The sampling distribution depends on the underlying distribution of the population, the statistic being considered, the sampling procedure employed, and the sample size used. There is often considerable interest in whether the sampling distribution can be approximated by an asymptotic distribution, which corresponds to the limiting case either as the number of random samples of finite size, taken from an infinite population and used to produce the distribution, tends to infinity, or when just one equally-infinite-size "sample" is taken of that same population.
For example, consider a normal population with mean μ and variance σ². Assume we repeatedly take samples of a given size from this population and calculate the arithmetic mean x̄ for each sample; this statistic is called the sample mean. Each sample has its own average value, and the distribution of these averages is called the "sampling distribution of the sample mean". This distribution is normal, N(μ, σ²/n) (n is the sample size), since the underlying population is normal, although sampling distributions may also often be close to normal even when the population distribution is not (see the central limit theorem).
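
To make this concrete, the sampling distribution of the mean can be simulated. The following sketch (an illustration of mine, not part of the original notes; all numbers are made up) draws repeated samples from a normal population and checks that the sample means cluster around μ with spread σ/√n:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n = 50.0, 10.0, 25        # population mean, population sd, sample size
n_samples = 100_000                  # number of repeated samples drawn

# Draw n_samples samples of size n and compute each sample mean
means = rng.normal(mu, sigma, size=(n_samples, n)).mean(axis=1)

print(means.mean())                  # close to mu = 50
print(means.std())                   # close to sigma/sqrt(n) = 10/5 = 2
```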
Standard error:

The standard deviation of the sampling distribution of a statistic t is called the standard error of t,

i.e. S.E.(t) = √Var(t)
Utility of standard error:

1. It is a useful instrument in the testing of hypotheses. If we are testing a hypothesis at the 5% level of significance and the test statistic Z = (t − E(t))/S.E.(t) satisfies |Z| > 1.96, then the null hypothesis is rejected at the 5% level of significance; otherwise it is accepted.
2. With the help of the S.E. we can determine the limits within which the parameter value is expected to lie.
3. The S.E. provides an idea about the precision of the sample. If the S.E. increases the precision decreases, and vice versa. The reciprocal of the S.E., i.e. 1/S.E., is a measure of the precision of a sample.
4. It is used to determine the size of the sample.

Standard error of the sample mean:

Theorem: Show that the standard error of the sample mean x̄ of a random sample of size n drawn at random from a population with mean μ and variance σ² is σ/√n, i.e. S.E.(x̄) = σ/√n.

Proof: Let x1, x2, ..., xn be a random sample of size n drawn at random from a population with mean μ and variance σ². Therefore we have

E(xi) = μ and V(xi) = σ² for all i = 1, 2, ..., n.

The sample mean is given by

x̄ = (1/n)(x1 + x2 + ... + xn)

∴ V(x̄) = (1/n²)[V(x1) + V(x2) + ... + V(xn)]   (the sample observations being independent)

       = (1/n²)(σ² + σ² + ... + σ²) = (1/n²)·nσ² = σ²/n

⟹ S.E.(x̄) = σ/√n.

Standard error of the sample proportion:

Theorem: Show that the standard error of the sample proportion p̂ of an attribute in a random sample of size n drawn at random from a population with population proportion P is √(PQ/n), where Q = 1 − P.

Proof: Let X be the number of persons possessing the given attribute in a random sample of size n drawn at random from a population with population proportion P; then the distribution of X is binomial, i.e. X ~ B(n, P). Let p̂ be the proportion of persons possessing the given attribute in the random sample; then

p̂ = X/n

E(p̂) = E(X)/n = nP/n = P   [∵ E(X) = nP]

V(p̂) = V(X)/n² = nPQ/n² = PQ/n

⟹ S.E.(p̂) = √(PQ/n), where Q = 1 − P.
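
The result can be checked numerically. This is an illustrative simulation of mine (the values of P and n are arbitrary), comparing the empirical spread of simulated sample proportions with √(PQ/n):

```python
import numpy as np

rng = np.random.default_rng(0)
P, n = 0.3, 400                       # population proportion, sample size

# 50,000 simulated sample proportions p_hat = X/n with X ~ B(n, P)
p_hat = rng.binomial(n, P, size=50_000) / n

print(p_hat.std())                    # empirical S.E. of p_hat
print(np.sqrt(P * (1 - P) / n))       # theoretical sqrt(PQ/n) ≈ 0.0229
```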
a
Statistical Hypotheses and their Types

Hypothesis: A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses. The best way to determine whether a statistical hypothesis is true would be to examine the entire population. Since that is often impractical, researchers typically examine a random sample from the population. If the sample data are not consistent with the statistical hypothesis, the hypothesis is rejected.

Example: A coin may be tossed 200 times and we may get heads 80 times and tails 120 times; we may now be interested in testing the hypothesis that the coin is unbiased. To take another example, we may study the average weight of 100 students of a particular college and may get the result 110 lb. We may now be interested in testing the hypothesis that the sample has been drawn from a population with average weight 115 lb.

Hypotheses are of two types:

1. Null hypothesis
2. Alternative hypothesis

Null hypothesis:

The hypothesis under verification is known as the null hypothesis, denoted by H0, and is always set up for possible rejection under the assumption that it is true.

For example, if we want to find out whether extra coaching has benefited the students or not, we shall set up a null hypothesis that "extra coaching has not benefited the students". Similarly, if we want to find out whether a particular drug is effective in curing malaria, we will take the null hypothesis that "the drug is not effective in curing malaria".
.
Alternative hypothesis:

The rival hypothesis, i.e. the hypothesis which is likely to be accepted in the event of rejection of the null hypothesis H0, is called the alternative hypothesis and is denoted by H1 or Ha.

For example, if a psychologist wishes to test whether or not a certain class of people has a mean I.Q. of 100, then the following null and alternative hypotheses can be established.

The null hypothesis would be

H0 : μ = 100

Then the alternative hypothesis could be any one of the statements:

H1 : μ ≠ 100
(or) H1 : μ > 100
(or) H1 : μ < 100

Errors in testing of hypothesis:

After applying a test, a decision is taken about the acceptance or rejection of the null hypothesis against the alternative hypothesis. The decisions may be of four types:

1) The hypothesis is true but our test rejects it (type-I error).
2) The hypothesis is false but our test accepts it (type-II error).
3) The hypothesis is true and our test accepts it (correct decision).
4) The hypothesis is false and our test rejects it (correct decision).

The first two decisions are called errors in testing of hypothesis, i.e.

1) Type-I error
2) Type-II error

Type-I error: A type-I error is said to be committed if the null hypothesis (H0) is true but our test rejects it.

Type-II error: A type-II error is said to be committed if the null hypothesis (H0) is false but our test accepts it.
Level of significance:

The maximum probability of committing a type-I error is called the level of significance and is denoted by α.

α = P(committing a type-I error)
  = P(H0 is rejected when it is true)

This is usually expressed as a percentage, e.g. 5%, 1%, 10% etc.

Power of the test:

The probability of rejecting a false hypothesis is called the power of the test and is denoted by 1 − β.

Power of the test = P(H0 is rejected when it is false)
                  = 1 − P(H0 is accepted when it is false)
                  = 1 − P(committing a type-II error)
                  = 1 − β

• A test for which both α and β are small and kept at a minimum level is considered desirable.
• The only way to reduce both α and β simultaneously is by increasing the sample size.
• A type-II error is more dangerous than a type-I error.
Critical region:

A statistic is used to test the hypothesis H0. The test statistic follows a known distribution. In a test, the area under the probability density curve is divided into two regions, i.e. the region of acceptance and the region of rejection. The region of rejection is the region in which H0 is rejected: if the value of the test statistic lies in this region, H0 will be rejected. This region is called the critical region. The area of the critical region is equal to the level of significance α. The critical region is always in the tail(s) of the distribution curve. It may be on both sides or on one side, depending upon the alternative hypothesis.
One-tailed and two-tailed tests:

A test of the null hypothesis H0 : θ = θ0 against the alternative hypothesis H1 : θ ≠ θ0 is called a two-tailed test. In this case the critical region is located on both tails of the distribution.

A test of the null hypothesis H0 : θ = θ0 against the alternative hypothesis H1 : θ > θ0 (right-tailed alternative) or H1 : θ < θ0 (left-tailed alternative) is called a one-tailed test. In this case the critical region is located on one tail of the distribution.

H0 : θ = θ0 against H1 : θ > θ0 ------- right-tailed test
H0 : θ = θ0 against H1 : θ < θ0 ------- left-tailed test

Test statistic:

The test statistic is defined as the difference between the sample statistic value and the hypothesized value, divided by the standard error of the statistic,

i.e. test statistic Z = (t − E(t))/S.E.(t)

Procedure for testing of hypothesis:

1. Set up a null hypothesis, i.e. H0 : θ = θ0.
2. Set up an alternative hypothesis, i.e. H1 : θ ≠ θ0 or H1 : θ > θ0 or H1 : θ < θ0.
3. Choose the level of significance α.
4. Select an appropriate test statistic Z.
5. Select a random sample and compute the test statistic.
6. Find the tabulated value of Z at the α% level of significance, i.e. Zα.
7. Compare the test statistic value with the tabulated value at the α% level of significance and decide whether to accept or to reject the null hypothesis (a sketch of these steps in code is given below).
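
The seven steps can be collected into a small helper. This is a hedged illustration of mine, not part of the original notes; the function name and its arguments are my own, and only the standard normal quantile from scipy is assumed:

```python
from scipy.stats import norm

def z_test(statistic, expected, std_error, alpha=0.05, tail="two"):
    """Generic large-sample Z-test: Z = (t - E(t)) / S.E.(t)."""
    z = (statistic - expected) / std_error
    if tail == "two":
        z_alpha = norm.ppf(1 - alpha / 2)      # e.g. 1.96 for alpha = 0.05
        reject = abs(z) > z_alpha
    elif tail == "right":
        z_alpha = norm.ppf(1 - alpha)          # e.g. 1.645 for alpha = 0.05
        reject = z > z_alpha
    else:                                      # left-tailed
        z_alpha = norm.ppf(1 - alpha)
        reject = z < -z_alpha
    return z, z_alpha, ("reject H0" if reject else "accept H0")

# Illustrative numbers only: t = 110, E(t) = 115, S.E.(t) = 2
print(z_test(statistic=110, expected=115, std_error=2.0))   # Z = -2.5 → reject
```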
Large sample tests:

A sample of size greater than or equal to 30 is called a large sample, and a test based on a large sample is called a large sample test.

The assumptions made while dealing with problems relating to large samples are:

Assumption 1: The random sampling distribution of the statistic is approximately normal.

Assumption 2: Values given by the sample are sufficiently close to the population values and can be used in their place for calculating the standard error of the statistic.

Large sample test for a single mean (or) test for significance of a single mean:

For this test,

the null hypothesis is H0 : μ = μ0

against the two-sided alternative H1 : μ ≠ μ0,

where μ is the population mean and μ0 is the hypothesized value of μ.

Let x1, x2, x3, ..., xn be a random sample from a normal population with mean μ and variance σ²,

i.e. if X ~ N(μ, σ²) then x̄ ~ N(μ, σ²/n), where x̄ is the sample mean.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = (x̄ − E(x̄))/S.E.(x̄) ~ N(0, 1)

⟹ Z = (x̄ − μ0)/(σ/√n) ~ N(0, 1)

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.

Note: If the population standard deviation is unknown then we can use its estimate s, calculated from the sample as s = √[(1/(n − 1)) Σ(x − x̄)²].
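
A minimal sketch of this test with made-up summary numbers of my own (n = 64 observations, sample mean 52, known σ = 8, testing H0 : μ = 50 at the 5% level):

```python
from math import sqrt
from scipy.stats import norm

n, x_bar, sigma, mu0, alpha = 64, 52.0, 8.0, 50.0, 0.05

z = (x_bar - mu0) / (sigma / sqrt(n))        # Z = (x̄ - μ0)/(σ/√n) = 2.0
z_alpha = norm.ppf(1 - alpha / 2)            # 1.96 for a two-sided 5% test

print(z, z_alpha)
print("reject H0" if abs(z) > z_alpha else "accept H0")   # 2.0 > 1.96 → reject
```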

Large sample test for the difference between two means:

Suppose two random samples of sizes n1 and n2 are drawn from two normal populations with means μ1 and μ2 and variances σ1² and σ2² respectively.

Let x̄1 and x̄2 be the sample means for the first and second populations respectively. Then

x̄1 ~ N(μ1, σ1²/n1) and x̄2 ~ N(μ2, σ2²/n2)

Therefore x̄1 − x̄2 ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2)

For this test,

the null hypothesis is H0 : μ1 = μ2, i.e. μ1 − μ2 = 0,

against the two-sided alternative H1 : μ1 ≠ μ2.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = [(x̄1 − x̄2) − E(x̄1 − x̄2)]/S.E.(x̄1 − x̄2) ~ N(0, 1)

⟹ Z = [(x̄1 − x̄2) − (μ1 − μ2)]/S.E.(x̄1 − x̄2) ~ N(0, 1)

⟹ Z = (x̄1 − x̄2)/√(σ1²/n1 + σ2²/n2) ~ N(0, 1)   [since μ1 − μ2 = 0 under H0]

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.

Note: If σ1² and σ2² are unknown then we can use the sample variances s1² and s2² as estimates of σ1² and σ2² respectively.
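
A minimal sketch of the two-mean test, with hypothetical summary numbers of my own:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical summary data for two large samples
n1, x1_bar, s1 = 100, 68.0, 6.0
n2, x2_bar, s2 = 120, 66.5, 5.5

# Z = (x̄1 - x̄2)/√(s1²/n1 + s2²/n2), sample variances estimating σ1², σ2²
z = (x1_bar - x2_bar) / sqrt(s1**2 / n1 + s2**2 / n2)
z_alpha = norm.ppf(0.975)                    # 1.96 at the 5% level, two-sided

print(round(z, 3), "reject H0" if abs(z) > z_alpha else "accept H0")
```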

Large sample test for a single standard deviation (or) test for significance of a standard deviation:

Let x1, x2, x3, ..., xn be a random sample of size n drawn from a normal population with mean μ and variance σ². For large samples, the sample standard deviation s follows a normal distribution with mean σ and variance σ²/2n,

i.e. s ~ N(σ, σ²/2n)

For this test,

the null hypothesis is H0 : σ = σ0

against the two-sided alternative H1 : σ ≠ σ0.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = (s − E(s))/S.E.(s) ~ N(0, 1)

⟹ Z = (s − σ)/(σ/√(2n)) ~ N(0, 1), where σ = σ0 under H0.

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.
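
A sketch with hypothetical numbers of my own (sample s.d. 11.2 from n = 200 observations, testing H0 : σ = 10):

```python
from math import sqrt
from scipy.stats import norm

n, s, sigma0, alpha = 200, 11.2, 10.0, 0.05

z = (s - sigma0) / (sigma0 / sqrt(2 * n))    # Z = (s - σ0)/(σ0/√(2n)) = 2.4
z_alpha = norm.ppf(1 - alpha / 2)            # 1.96, two-sided 5% test

print(round(z, 3), "reject H0" if abs(z) > z_alpha else "accept H0")
```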

Large sample test for the difference between two standard deviations:

Suppose two random samples of sizes n1 and n2 are drawn from two normal populations with means μ1 and μ2 and variances σ1² and σ2² respectively.

Let s1 and s2 be the sample standard deviations for the first and second populations respectively. Then

s1 ~ N(σ1, σ1²/2n1) and s2 ~ N(σ2, σ2²/2n2)

Therefore s1 − s2 ~ N(σ1 − σ2, σ1²/2n1 + σ2²/2n2)

For this test,

the null hypothesis is H0 : σ1 = σ2, i.e. σ1 − σ2 = 0,

against the two-sided alternative H1 : σ1 ≠ σ2.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = [(s1 − s2) − E(s1 − s2)]/S.E.(s1 − s2) ~ N(0, 1)

⟹ Z = [(s1 − s2) − (σ1 − σ2)]/S.E.(s1 − s2) ~ N(0, 1)

⟹ Z = (s1 − s2)/√(σ1²/2n1 + σ2²/2n2) ~ N(0, 1)   [since σ1 − σ2 = 0 under H0]

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.

Large sample test for a single proportion (or) test for significance of a proportion:

Let x be the number of successes in n independent trials with constant probability p of success; then x follows a binomial distribution with mean np and variance npq.

In a sample of size n, let x be the number of persons possessing a given attribute. Then the sample proportion is given by p̂ = x/n.

Then E(p̂) = E(x/n) = (1/n)E(x) = (1/n)np = p

And V(p̂) = V(x/n) = (1/n²)V(x) = (1/n²)npq = pq/n

S.E.(p̂) = √(pq/n)

For this test,

the null hypothesis is H0 : p = p0

against the two-sided alternative H1 : p ≠ p0.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = (p̂ − E(p̂))/S.E.(p̂) ~ N(0, 1)

⟹ Z = (p̂ − p)/√(pq/n) ~ N(0, 1), where p = p0 under H0.

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.
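
A sketch of this test using the coin example given earlier in these notes (80 heads in 200 tosses, testing H0 : p = 0.5):

```python
from math import sqrt
from scipy.stats import norm

n, x, p0, alpha = 200, 80, 0.5, 0.05

p_hat = x / n                                   # sample proportion 0.40
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)      # Z = (p̂ - p)/√(pq/n)
z_alpha = norm.ppf(1 - alpha / 2)

print(round(z, 3))                              # ≈ -2.828
print("reject H0" if abs(z) > z_alpha else "accept H0")   # reject: coin biased
```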

Large sample test for the difference between two proportions:

Let x1 and x2 be the numbers of persons possessing a given attribute in random samples of sizes n1 and n2 respectively; then the sample proportions are given by p̂1 = x1/n1 and p̂2 = x2/n2.

Then E(p̂1) = p1 and E(p̂2) = p2 ⟹ E(p̂1 − p̂2) = p1 − p2

And V(p̂1) = p1q1/n1 and V(p̂2) = p2q2/n2 ⟹ V(p̂1 − p̂2) = p1q1/n1 + p2q2/n2

S.E.(p̂1) = √(p1q1/n1) and S.E.(p̂2) = √(p2q2/n2) ⟹ S.E.(p̂1 − p̂2) = √(p1q1/n1 + p2q2/n2)

For this test,

the null hypothesis is H0 : p1 = p2

against the two-sided alternative H1 : p1 ≠ p2.

Now the test statistic is

Z = (t − E(t))/S.E.(t) ~ N(0, 1)

  = [(p̂1 − p̂2) − E(p̂1 − p̂2)]/S.E.(p̂1 − p̂2) ~ N(0, 1)

⟹ Z = [(p̂1 − p̂2) − (p1 − p2)]/S.E.(p̂1 − p̂2) ~ N(0, 1)

⟹ Z = (p̂1 − p̂2)/√(p1q1/n1 + p2q2/n2) ~ N(0, 1)

⟹ Z = (p̂1 − p̂2)/√[pq(1/n1 + 1/n2)] ~ N(0, 1)   [since p1 = p2 = p under H0]

When p is not known, it can be estimated by p = (n1p̂1 + n2p̂2)/(n1 + n2) and q = 1 − p.

Now calculate |Z|.

Find the tabulated value of Z at the α% level of significance, i.e. Zα.

If |Z| > Zα, reject the null hypothesis H0.

If |Z| < Zα, accept the null hypothesis H0.
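
A sketch of the two-proportion test with hypothetical counts of my own, using the pooled estimate of p described above:

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical counts: 45 of 200 in sample 1 and 70 of 250 in sample 2
n1, x1 = 200, 45
n2, x2 = 250, 70

p1_hat, p2_hat = x1 / n1, x2 / n2
p = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)   # pooled estimate = (x1+x2)/(n1+n2)
q = 1 - p

z = (p1_hat - p2_hat) / sqrt(p * q * (1 / n1 + 1 / n2))
z_alpha = norm.ppf(0.975)                      # 5% level, two-sided

print(round(z, 3), "reject H0" if abs(z) > z_alpha else "accept H0")
```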

CHI-SQUARE DISTRIBUTION

The χ² distribution was first obtained by Helmert in 1875 and rediscovered by Karl Pearson in 1900.

The square of a standard normal variate is known as a chi-square variate with 1 degree of freedom (d.f.).

Thus if X ~ N(μ, σ²), then Z = (X − μ)/σ ~ N(0, 1) and Z² = [(X − μ)/σ]² is a chi-square variate with 1 d.f., abbreviated by the letter χ² of the Greek alphabet.

In general, if X1, X2, ..., Xn are n independent normal variates with means μ1, μ2, ..., μn and standard deviations σ1, σ2, ..., σn respectively, then the variate

χ² = [(X1 − μ1)/σ1]² + [(X2 − μ2)/σ2]² + ... + [(Xn − μn)/σn]² = Σᵢ₌₁ⁿ [(Xᵢ − μᵢ)/σᵢ]²,

which is the sum of squares of n independent standard normal variates, follows the chi-square distribution with n d.f.
a
APPLICATIONS OF CHI-SQUARE DISTRIBUTION

The χ² distribution has a large number of applications, some of which are listed below:

1. Chi-square test of goodness of fit.
2. Chi-square test for independence of attributes.
3. Chi-square test for the population variance.

1. CHI-SQUARE TEST OF GOODNESS OF FIT

A very powerful test to describe the magnitude of the discrepancy between theory and observation was given by Prof. Karl Pearson in 1900. It enables us to find whether the deviations of the observations from theory are just by chance or are really due to the inadequacy of the theory to fit the observed data. This test is known as the χ²-test of goodness of fit.

If Oᵢ (i = 1, 2, ..., n) is a set of observed frequencies and Eᵢ (i = 1, 2, ..., n) is the corresponding set of expected (theoretical) frequencies, then the statistic χ² may be defined as
χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)²/Eᵢ,   [with Σᵢ₌₁ⁿ Oᵢ = Σᵢ₌₁ⁿ Eᵢ]

which follows the chi-square distribution with (n − 1) d.f.

In order to determine whether the divergence is due to chance or otherwise, we have to compare the computed value of χ² with the table values. Table values of χ² as given by R.A. Fisher are available for various levels of significance, ordinarily up to 30 degrees of freedom. If the calculated value of χ² is less than the table value at the particular level of significance, the divergence is said to arise due to fluctuations of sampling. If the calculated value of χ² exceeds the table value, the divergence is said to be significant.

Illustration: A die is thrown 132 times with the following results:

Number turned up:  1   2   3   4   5   6
Frequency:         16  20  25  14  29  28

Test the hypothesis that the die is unbiased.

Solution: Null hypothesis: Set up the null hypothesis that the die is unbiased.

On the basis of the hypothesis that the die is unbiased, we expect each number to turn up 132/6 = 22 times.

Apply the χ²-test:

O    E    (O − E)²   (O − E)²/E
16   22   36         1.64
20   22   4          0.18
25   22   9          0.41
14   22   64         2.91
29   22   49         2.23
28   22   36         1.64

Σ (O − E)²/E = 9.01

Number of degrees of freedom = n − 1 = 6 − 1 = 5.

For 5 degrees of freedom at the 5% level of significance, the table value of χ² is 11.07. The calculated value of χ² is less than the table value, and hence there is no evidence against the hypothesis that the die is unbiased.
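
This computation can be reproduced with scipy's goodness-of-fit routine (an illustration of mine; note it gives the exact statistic 9.0, while 9.01 above comes from rounding each term to two decimals):

```python
from scipy.stats import chisquare, chi2

observed = [16, 20, 25, 14, 29, 28]
expected = [22] * 6                        # 132 throws / 6 faces

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
critical = chi2.ppf(0.95, df=5)            # ≈ 11.07 at the 5% level, 5 d.f.

print(round(stat, 2), round(critical, 2), round(p_value, 3))
print("reject H0" if stat > critical else "no evidence against H0")
```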
Illustration: Theory predicts that the proportions of beans in the four groups A, B, C and D should be 9 : 3 : 3 : 1. In an experiment among 1600 beans, the numbers in the four groups were 882, 313, 287 and 118. Does the experimental result support the theory?

Solution: Null hypothesis: We set up the null hypothesis that the experimental results support the theory.

On the basis of the hypothesis, the theoretical frequencies can be computed as follows:

Total number of beans = 882 + 313 + 287 + 118 = 1600.

These can be divided in the ratio 9 : 3 : 3 : 1:

E(882) = (9/16) × 1600 = 900,   E(313) = (3/16) × 1600 = 300
E(287) = (3/16) × 1600 = 300,   E(118) = (1/16) × 1600 = 100

Apply the χ²-test:

O     E     (O − E)²   (O − E)²/E
882   900   324        0.3600
313   300   169        0.5633
287   300   169        0.5633
118   100   324        3.2400

Σ (O − E)²/E = 4.7266

Number of degrees of freedom = n − 1 = 4 − 1 = 3.

For 3 d.f. at the 5% level of significance, the table value of χ² is 7.815. The calculated value of χ² is less than the table value. Hence the null hypothesis may be accepted at the 5% level of significance, and we conclude that there is good correspondence between theory and experiment.

2. CHI-SQUARE TEST FOR INDEPENDENCE OF ATTRIBUTES

Under this test, we can find out whether two or more attributes are associated or not. Let us consider two attributes A and B, where A is divided into r classes A1, A2, ..., Ar and B is divided into s classes B1, B2, ..., Bs. Such a classification, in which attributes are divided into more than two classes, is known as manifold classification. The various cell frequencies can be expressed in the following table, known as the r×s manifold contingency table, where (Ai) denotes the number of persons possessing the attribute Ai (i = 1, 2, ..., r), (Bj) denotes the number of persons possessing the attribute Bj (j = 1, 2, ..., s) and (AiBj) denotes the number of persons possessing both attributes Ai and Bj. Also Σᵢ₌₁ʳ (Ai) = Σⱼ₌₁ˢ (Bj) = N, the total frequency.

r×s contingency table:

B \ A   A1       A2       ....  Ai       ....  Ar       Total
B1      (A1B1)   (A2B1)   ....  (AiB1)   ....  (ArB1)   (B1)
B2      (A1B2)   (A2B2)   ....  (AiB2)   ....  (ArB2)   (B2)
⁞       ⁞        ⁞              ⁞              ⁞        ⁞
Bj      (A1Bj)   (A2Bj)   ....  (AiBj)   ....  (ArBj)   (Bj)
⁞       ⁞        ⁞              ⁞              ⁞        ⁞
Bs      (A1Bs)   (A2Bs)   ....  (AiBs)   ....  (ArBs)   (Bs)
Total   (A1)     (A2)     ....  (Ai)     ....  (Ar)     N
Under the null hypothesis that the two attributes A and B are independent, the expected frequencies are calculated as follows:

P(Ai) = probability that a person possesses the attribute Ai = (Ai)/N ;  i = 1, 2, ..., r

P(Bj) = probability that a person possesses the attribute Bj = (Bj)/N ;  j = 1, 2, ..., s

P(AiBj) = P(Ai)·P(Bj)   (the attributes Ai and Bj are independent under the null hypothesis)

⟹ P(AiBj) = (Ai)/N · (Bj)/N

If (AiBj)₀ denotes the expected frequency of (AiBj), then

(AiBj)₀ = N·P(AiBj) = (Ai)(Bj)/N,   (i = 1, 2, ..., r ; j = 1, 2, ..., s)

By using this formula, the expected frequencies for each of the cell frequencies (AiBj), (i = 1, 2, ..., r ; j = 1, 2, ..., s) can be worked out. The exact test for independence of attributes is very complicated, but a fair degree of approximation is given, for large samples, by the χ²-test of goodness of fit, i.e.

χ² = Σᵢ₌₁ʳ Σⱼ₌₁ˢ [(AiBj) − (AiBj)₀]² / (AiBj)₀

follows the χ² distribution with (r − 1)(s − 1) degrees of freedom.

Now, comparing this calculated value with the tabulated value for (r − 1)(s − 1) d.f. at a certain level of significance, we reject or retain the null hypothesis of independence of attributes at that level of significance.
Illustration: A certain drug was administered to 456 males out of a total of 720 in a certain locality to test its efficacy against typhoid. The incidence of typhoid is shown below. Find out the effectiveness of the drug against the disease.

                                  Infection   No infection   Total
Administering the drug:           144         312            456
Without administering the drug:   192         72             264
Total:                            336         384            720

Solution: We set up the null hypothesis that the two attributes, incidence of typhoid and administration of the drug, are independent.

Under the hypothesis of independence:

E(144) = (336 × 456)/720 = 212.8
E(312) = (384 × 456)/720 = 243.2
E(192) = (336 × 264)/720 = 123.2
E(72) = (264 × 384)/720 = 140.8

Apply the χ²-test:

O     E       (O − E)²   (O − E)²/E
144   212.8   4733.44    22.244
312   243.2   4733.44    19.463
192   123.2   4733.44    38.420
72    140.8   4733.44    33.618

Σ (O − E)²/E = 113.745

Degrees of freedom = (r − 1)(s − 1) = (2 − 1)(2 − 1) = 1 d.f.

For 1 d.f. at the 5% level of significance, the table value of χ² is 3.84. Since the calculated value is very much greater than the table value, it is highly significant. Hence the null hypothesis is rejected at the 5% level of significance, and we conclude that the drug is certainly effective in controlling typhoid.
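
The same test can be run with scipy's contingency-table routine (an illustration of mine; `correction=False` turns off the Yates continuity correction so the statistic matches the hand computation above):

```python
from scipy.stats import chi2_contingency

observed = [[144, 312],    # with the drug:    infection, no infection
            [192, 72]]     # without the drug: infection, no infection

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(round(chi2, 3), dof)   # ≈ 113.745 with 1 d.f.
print(expected)              # [[212.8, 243.2], [123.2, 140.8]]
print("reject H0" if p_value < 0.05 else "accept H0")
```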
3. CHI-SQUARE TEST FOR THE POPULATION VARIANCE

Suppose we want to test if a given normal population has a specified variance σ² = σ0² (say).

Under the null hypothesis that the population variance is σ² = σ0², the statistic

χ² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/σ0² = ns²/σ0²

follows the chi-square distribution with (n − 1) d.f., where s² = (1/n) Σ(xᵢ − x̄)² is the sample variance.

Comparing the calculated value with the tabulated value of χ² for (n − 1) d.f. at a certain level of significance, we may retain or reject the null hypothesis.

Illustration: A random sample of size 10 from a normal population gave the following values:

65, 72, 68, 74, 77, 61, 63, 69, 73, 71

Test the hypothesis that the population variance is 32.

Solution: We set up the null hypothesis H0 : σ² = 32 against the alternative H1 : σ² ≠ 32.

Computation of the sample variance:

x     x − x̄    (x − x̄)²
65    −4.3     18.49
72    2.7      7.29
68    −1.3     1.69
74    4.7      22.09
77    7.7      59.29
61    −8.3     68.89
63    −6.3     39.69
69    −0.3     0.09
73    3.7      13.69
71    1.7      2.89

Σx = 693,   Σ(x − x̄)² = 234.10

Sample mean x̄ = Σx/n = 693/10 = 69.3

Under the null hypothesis H0 : σ² = 32, the test statistic is

χ² = ns²/σ0² = 234.10/32 = 7.3156,

which follows the χ² distribution with (10 − 1) = 9 d.f.

The table value of χ² at the 5% level of significance for 9 d.f. is 16.9. Since the calculated value of χ² is less than the tabulated value of χ² for 9 d.f. at the 5% level of significance, H0 may be accepted.
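
A sketch of the same computation in code (mine, mirroring the notes' comparison with the upper 5% point of χ² for 9 d.f.):

```python
import numpy as np
from scipy.stats import chi2

x = np.array([65, 72, 68, 74, 77, 61, 63, 69, 73, 71])
sigma0_sq, n = 32.0, len(x)

stat = np.sum((x - x.mean()) ** 2) / sigma0_sq   # ns²/σ0² = 234.10/32 ≈ 7.316
critical = chi2.ppf(0.95, df=n - 1)              # ≈ 16.92 for 9 d.f. at 5%

print(round(stat, 4), round(critical, 2))
print("accept H0" if stat < critical else "reject H0")
```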

CONDITIONS FOR THE VALIDITY OF THE CHI-SQUARE TEST

The chi-square test can be used if the following conditions are satisfied:

i. N, the total number of observations, must be sufficiently large; otherwise the differences between the actual and expected frequencies would not be normally distributed.
ii. The sample observations should be independent.
iii. No theoretical cell frequency should be less than 5. If any theoretical cell frequency is less than 5, then for the application of the χ²-test it is pooled with the preceding or succeeding frequency so that the pooled frequency is more than 5, and we finally adjust for the d.f. lost in pooling.
iv. The constraints on the cell frequencies, if any, should be linear (i.e. they should not involve squares or higher powers of the frequencies), such as ΣOᵢ = ΣEᵢ = N.

2×2 contingency table

For the 2×2 table

a   b
c   d

prove that the chi-square test for independence gives

χ² = N(ad − bc)² / [(a + c)(b + d)(a + b)(c + d)],   N = a + b + c + d

Solution: Under the hypothesis of independence of attributes,

a      b      a+b
c      d      c+d
a+c    b+d    N

E(a) = (a + b)(a + c)/N    (1)
E(b) = (a + b)(b + d)/N    (2)
E(c) = (a + c)(c + d)/N    (3)
E(d) = (b + d)(c + d)/N    (4)

χ² = [a − E(a)]²/E(a) + [b − E(b)]²/E(b) + [c − E(c)]²/E(c) + [d − E(d)]²/E(d)    (A)

Now, using (1),

a − E(a) = a − (a + b)(a + c)/N
         = [Na − (a² + ac + ab + bc)]/N
         = [a(a + b + c + d) − a² − ac − ab − bc]/N   (∵ N = a + b + c + d)
         = (a² + ab + ac + ad − a² − ac − ab − bc)/N

⟹ a − E(a) = (ad − bc)/N

Similarly, we get

b − E(b) = −(ad − bc)/N,   c − E(c) = −(ad − bc)/N   and   d − E(d) = (ad − bc)/N

so each squared deviation equals (ad − bc)²/N². Substituting these values in (A), we get

χ² = [(ad − bc)²/N²] [1/E(a) + 1/E(b) + 1/E(c) + 1/E(d)]

   = [(ad − bc)²/N²] · N [1/((a + b)(a + c)) + 1/((a + b)(b + d)) + 1/((a + c)(c + d)) + 1/((b + d)(c + d))]

   = [(ad − bc)²/N] · [1/(a + b) + 1/(c + d)] · [1/(a + c) + 1/(b + d)]

   = [(ad − bc)²/N] · [N/((a + b)(c + d))] · [N/((a + c)(b + d))]

⟹ χ² = N(ad − bc)² / [(a + b)(c + d)(a + c)(b + d)]   (∵ a + b + c + d = N)
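
A quick numerical check of this shortcut formula against the general routine (a sketch of mine, reusing the drug/typhoid table from above):

```python
from scipy.stats import chi2_contingency

a, b, c, d = 144, 312, 192, 72
N = a + b + c + d

# Shortcut formula for a 2×2 table derived above
shortcut = N * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# General contingency-table computation (no continuity correction)
chi2_stat, _, _, _ = chi2_contingency([[a, b], [c, d]], correction=False)

print(round(shortcut, 3), round(chi2_stat, 3))   # both ≈ 113.745
```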
STUDENT'S t-DISTRIBUTION:

At the beginning of the 20th century, a statistician named William S. Gosset, an employee of Guinness Breweries in Ireland, was interested in making inferences about the mean when σ was unknown. Because Guinness employees were not permitted to publish research work under their own names, Gosset adopted the pseudonym "Student". The distribution that he developed has come to be known as Student's t-distribution.

If the random variable X is normally distributed, then the following statistic has a t-distribution with n − 1 degrees of freedom:

t = (x̄ − μ)/(S/√n)

Notice that this expression has the same form as the Z-statistic, except that S is used to estimate σ, which is unknown in this case.

In appearance, the t-distribution is very similar to the standardized normal distribution. Both are bell-shaped and symmetrical. However, the t-distribution has more area in the tails and less in the centre than does the standardized normal distribution: because the value of σ is unknown and S is used to estimate it, the values of t that are observed will be more variable than for Z. As the number of degrees of freedom increases, the t-distribution gradually approaches the standardized normal distribution until the two are virtually identical, since S becomes a better estimate of σ as the sample size gets large.

ASSUMPTIONS OF THE t-DISTRIBUTION:
I. The parent population from which the sample has been drawn is normal.
II. The sample observations are independent.
III. The population standard deviation σ is unknown.
APPLICATIONS:
The t-distribution has a number of applications in Statistics and other disciplines, some of which are:
I. t-test for significance of a single mean, the population variance being unknown.
II. t-test for the significance of the difference between two sample means, the population variances being equal but unknown.
III. t-test for significance of an observed sample correlation coefficient.
IV. t-test for significance of an observed regression coefficient.
TEST FOR A SINGLE MEAN:

Suppose we are interested in testing:
a) whether the given normal population has a specified value of the population mean, say μ0;
b) whether the sample mean x̄ differs significantly from a specified value μ0 of the population mean;
c) whether a given random sample x1, x2, ..., xn of size n has been drawn from a normal population with specified mean μ0.

Basically, all three problems are the same. For all the cases we set up the null hypothesis as

H0 : μ = μ0, i.e. the population mean is μ0.

Under H0, the test statistic is

t = (x̄ − μ0)/(S/√n) ~ t(n−1),

where S² = (1/(n − 1)) Σ(xᵢ − x̄)². The above test statistic is computed and is compared with the tabulated value of t for (n − 1) d.f. at a certain level of significance. If the calculated value of t is less than the tabulated value of t, then H0 is accepted; otherwise it is rejected.

Ex: The life expectancy of people in the year 1970 in Brazil is expected to be 50 years. A survey was conducted in eleven regions of Brazil and the data obtained are given below. Do the data confirm the expected view?

Life expectancy (years) x: 54.2, 50.4, 44.2, 49.7, 55.4, 57.0, 58.2, 56.6, 61.9, 57.5, 53.4

Sol: Here we have to test H0 : μ = 50 against H1 : μ ≠ 50.

Under H0, the test statistic t is

t = (x̄ − μ0)/(S/√n) ~ t(n−1)

Here x̄ = 598.5/11 = 54.41 and

s² = (1/(n − 1)) Σ(xᵢ − x̄)² = (1/(n − 1)) [Σxᵢ² − (Σxᵢ)²/n]
   = (1/10) [32799.91 − 598.5²/11] = 23.607

Therefore s = 4.859.

Therefore t = (54.41 − 50)√11/4.859 = 14.626/4.859 = 3.01.

The tabulated value of t at α = 0.05 for 10 d.f. is 2.228. Since the calculated t is greater than the tabulated value of t, H0 is rejected, and we conclude that the life expectancy differs significantly from 50 years (the sample suggests it is more than 50 years).
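
The same test can be run directly on the data with scipy (an illustration of mine):

```python
from scipy.stats import ttest_1samp

x = [54.2, 50.4, 44.2, 49.7, 55.4, 57.0, 58.2, 56.6, 61.9, 57.5, 53.4]

t_stat, p_value = ttest_1samp(x, popmean=50)

print(round(t_stat, 2))              # ≈ 3.01, as computed by hand above
print("reject H0" if p_value < 0.05 else "accept H0")
```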
t-TEST FOR DIFFERENCE OF MEANS:

Suppose we want to test if two independent samples have been drawn from two normal populations having the same means, the population variances being equal.

Let x1, x2, ..., xn1 and y1, y2, ..., yn2 be two independent random samples from the given normal populations. We set up the null hypothesis H0 : μx = μy, i.e. the two samples have been drawn from normal populations with the same means; in other words, the sample means x̄ and ȳ do not differ significantly. We assume that σ1² = σ2² = σ², i.e. the population variances are equal but unknown. The test statistic under H0 is

t = (x̄ − ȳ) / [S √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2),

where S² = [Σ(xᵢ − x̄)² + Σ(yᵢ − ȳ)²] / (n1 + n2 − 2).

Proof: Here we know that

Z = [(x̄ − ȳ) − E(x̄ − ȳ)] / √V(x̄ − ȳ) ~ N(0, 1)

But E(x̄ − ȳ) = E(x̄) − E(ȳ) = μx − μy = 0 (by assumption)

and V(x̄ − ȳ) = V(x̄) + V(ȳ) = σ1²/n1 + σ2²/n2 = σ²(1/n1 + 1/n2) (by assumption).

Therefore

Z = (x̄ − ȳ) / √[σ²(1/n1 + 1/n2)] ~ N(0, 1)

Let

χ² = [Σ(xᵢ − x̄)² + Σ(yᵢ − ȳ)²] / σ²

Then Fisher's t-statistic is given by

t = Z / √[χ²/(n1 + n2 − 2)],

which gives

t = (x̄ − ȳ) / [S √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)

Now, by comparing the computed value of t with the tabulated value of t for n1 + n2 − 2 d.f. at a certain level of significance, we may reject or accept the null hypothesis.

In case the assumption σ1² = σ2² = σ² does not hold, the t-statistic is given as

t = (x̄ − ȳ) / √(s1²/n1 + s2²/n2)
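
A sketch of this test with hypothetical samples of my own; `equal_var=True` gives the pooled-variance form derived above, while `equal_var=False` gives the unequal-variance form noted at the end:

```python
from scipy.stats import ttest_ind

# Hypothetical samples from two normal populations
x = [23.1, 25.4, 24.8, 26.0, 22.7, 25.1]
y = [21.9, 23.3, 22.5, 24.0, 21.4, 23.8, 22.1]

t_stat, p_value = ttest_ind(x, y, equal_var=True)   # pooled-variance t-test

print(round(t_stat, 3), round(p_value, 4))
print("reject H0" if p_value < 0.05 else "accept H0")
```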
PAIRED t-TEST FOR DIFFERENCE OF MEANS:

In the t-test for the difference of means, the two samples were independent of each other. Let us now take a particular situation where:
i) the sample sizes are equal, i.e. n1 = n2 = n; and
ii) the sample observations x1, x2, ..., xn and y1, y2, ..., yn are not completely independent but are dependent in pairs, i.e. (x1, y1), (x2, y2), ..., (xn, yn) corresponding to the 1st, 2nd, ..., nth unit respectively.

Let dᵢ = xᵢ − yᵢ (i = 1, 2, ..., n) denote the difference in the observations for the ith unit.

Under the null hypothesis that the increments are just by chance, i.e. H0 : μx = μy, the test statistic is given by

t = d̄ / (s/√n) ~ t(n−1),

where d = x − y, d̄ = (1/n) Σdᵢ and s² = (1/(n − 1)) [Σd² − (Σd)²/n].

Ex: The following table gives the monthly average of total solar radiation on a horizontal and an inclined surface at a particular place.

Month                                Jan. Feb. March April May  June July Aug. Sept. Oct. Nov. Dec.
Radiation on horizontal surface (x)  363  404  518   521   613  587  365  412  469   468  371  330
Radiation on inclined surface (y)    536  474  556   549   479  422  315  414  505   552  492  507

Test whether the average daily radiation in a year on a horizontal and an inclined surface are equal.

Sol: Here we set up the null hypothesis H0 : μx = μy against H1 : μx ≠ μy. Under H0, the test statistic is given as

t = (x̄ − ȳ) / [S √(1/n1 + 1/n2)] ~ t(n1 + n2 − 2)

Here Σx = 5421, x̄ = 451.75, Σx² = 2543583, Σy = 5801, ȳ = 483.42, Σy² = 2859497.

S² = [Σ(xᵢ − x̄)² + Σ(yᵢ − ȳ)²] / (n1 + n2 − 2)
   = [Σxᵢ² − (Σxᵢ)²/n1 + Σyᵢ² − (Σyᵢ)²/n2] / (n1 + n2 − 2)
   = 6811.06

Therefore t = −0.94, |t| = 0.94.

The tabulated value of t at α = 0.05 for 22 d.f. is 2.074. Since the calculated value of |t| is less than the tabulated value of t, we accept our null hypothesis and conclude that the average daily total radiations on the horizontal surface and the inclined surface are equal.
Ex: The following table gives the pulsatility index (PI) of 11 patients:

Patient No.                 1    2    3    4    5    6    7    8    9    10   11
PI value during seizure (x) 0.45 0.54 0.48 0.62 0.48 0.60 0.45 0.46 0.35 0.40 0.44
PI value after seizure (y)  0.60 0.65 0.63 0.78 0.63 0.80 0.69 0.62 0.68 0.50 0.57
Difference (d = y − x)      0.15 0.11 0.15 0.16 0.15 0.20 0.24 0.16 0.33 0.10 0.13

Test whether there is a significant increase on the average in PI values after seizure as compared to during seizure.

Sol: We set up the null hypothesis that there is no increase in the average PI value after seizure compared with during seizure, i.e. H0 : μx = μy against H1 : μy > μx. Under H0, the test statistic is

t = d̄ / (s/√n) ~ t(n−1),

where d = y − x, d̄ = (1/n) Σdᵢ and s² = (1/(n − 1)) [Σd² − (Σd)²/n].

Here Σdᵢ = 1.88, Σd² = 0.3642, d̄ = (1/n) Σdᵢ = 0.171.

Therefore s² = (1/(n − 1)) [Σd² − (Σd)²/n] = 0.004289.

Therefore t ≈ 8.66.

The tabulated value of t at α = 0.05 for 10 d.f. is 1.812. Since the calculated value of t is greater than the tabulated value, we reject our null hypothesis and conclude that there is a significant increase in the PI value after seizure in comparison to during seizure.
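
The paired test can be reproduced with scipy (an illustration of mine; `alternative="greater"` makes it the one-sided test used above):

```python
from scipy.stats import ttest_rel

during = [0.45, 0.54, 0.48, 0.62, 0.48, 0.60, 0.45, 0.46, 0.35, 0.40, 0.44]
after  = [0.60, 0.65, 0.63, 0.78, 0.63, 0.80, 0.69, 0.62, 0.68, 0.50, 0.57]

# Paired (related-samples) test of H1: mean(after) > mean(during)
t_stat, p_value = ttest_rel(after, during, alternative="greater")

print(round(t_stat, 2))             # ≈ 8.66
print("reject H0" if p_value < 0.05 else "accept H0")
```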
F-TEST:

A large number of surveys or experiments are conducted to draw conclusions about the effect of certain factors or treatments, and observations are taken pertaining to the character under study. The F-test is used either for testing the hypothesis about the equality of two population variances or the equality of two or more population means. For this reason, it is considered to be a very popular and useful distribution and is the backbone of the analysis of variance.

F-STATISTIC:

If X is a χ²-variate with n1 degrees of freedom and Y is an independent χ²-variate with n2 degrees of freedom, then the F-statistic is defined as

F = (X/n1) / (Y/n2),

and it follows G.W. Snedecor's F-distribution with (n1, n2) d.f.
F-TEST FOR EQUALITY OF POPULATION VARIANCES:

Let x1, x2, ..., xn1 be a random sample of size n1 from the first normal population with variance σ1², and y1, y2, ..., yn2 be a random sample of size n2 from the second normal population with variance σ2². Obviously the two samples are independent. We set up the null hypothesis H0 : σ1² = σ2² = σ², i.e. the population variances are the same; in other words, H0 is that the two independent estimates of the common population variance are homogeneous, i.e. do not differ significantly.

Under H0, the test statistic is given as

F = S1²/S2² ~ F(n1 − 1, n2 − 1),

where S1² = (1/(n1 − 1)) Σ(xᵢ − x̄)² and S2² = (1/(n2 − 1)) Σ(yᵢ − ȳ)².

Since the F-test is based on the ratio of two variances, it is also known as the variance ratio test. Also, it should be noted that the available tables of the significant values of F are for the right-tailed test, i.e. against the alternative H1 : σ1² > σ2²; in numerical problems we take the greater of the variances S1² and S2² in the numerator and adjust the degrees of freedom accordingly.

Ex: Life expectancy in 9 regions of Brazil in 1900 and in 11 regions of Brazil in 1970 was as given in the table below:

Regions                     1    2    3    4    5    6    7    8    9    10   11
Life expectancy (yrs) 1900  42.7 43.7 34.0 39.2 46.1 48.7 49.4 45.9 55.3 -    -
Life expectancy (yrs) 1970  54.2 50.4 44.2 49.7 55.4 57.0 58.2 56.6 61.9 57.5 53.4

Test whether the variation in life expectancy in the various regions in 1900 and in 1970 is the same or not.

Sol: First of all we set up the null hypothesis H0 : σ1² = σ2² against H1 : σ1² ≠ σ2².

Here Σx = 405, Σx² = 18527.78, Σy = 598.5, Σy² = 32799.91.

s1² = (1/(n1 − 1)) [Σxᵢ² − (Σxᵢ)²/n1] = 37.848

s2² = (1/(n2 − 1)) [Σyᵢ² − (Σyᵢ)²/n2] = 23.607

Since s1² > s2², the test statistic is given as

F = S1²/S2² = 1.603 ~ F(n1 − 1, n2 − 1)

The tabulated value of F at α = 0.05 for (8, 10) d.f. is 3.85. Since the calculated F is less than the tabulated F, we accept our null hypothesis and conclude that the variation in life expectancy in the various regions of Brazil in 1900 and 1970 is the same.
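
A sketch of this variance ratio test in code (mine; the critical value uses the α/2 = 0.025 upper tail, which matches the tabulated 3.85 for the two-sided alternative):

```python
import numpy as np
from scipy.stats import f

x = np.array([42.7, 43.7, 34.0, 39.2, 46.1, 48.7, 49.4, 45.9, 55.3])              # 1900
y = np.array([54.2, 50.4, 44.2, 49.7, 55.4, 57.0, 58.2, 56.6, 61.9, 57.5, 53.4])  # 1970

s1_sq = x.var(ddof=1)                 # ≈ 37.848
s2_sq = y.var(ddof=1)                 # ≈ 23.607

F = max(s1_sq, s2_sq) / min(s1_sq, s2_sq)     # larger variance in the numerator
dfn, dfd = (len(x) - 1, len(y) - 1) if s1_sq > s2_sq else (len(y) - 1, len(x) - 1)

critical = f.ppf(1 - 0.05 / 2, dfn, dfd)      # ≈ 3.85 for (8, 10) d.f.

print(round(F, 3), round(critical, 2))        # 1.603 vs 3.85
print("reject H0" if F > critical else "accept H0")
```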
