
Chanderprabhu Jain College of Higher Studies

&

School of Law

An ISO 9001:2008 Certified Quality Institute

(Recognized by Govt. of NCT of Delhi, Affiliated to GGS Indraprastha University, Delhi)

Class : BBA(CAM)

Unit 1

Statistics: Definition, Importance & Limitation.

Definition of statistics:- According to Horace Secrist, "By statistics we mean aggregates of facts affected to a marked extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other."

Importance of Statistics

These days statistical methods are applicable everywhere; there is hardly any field of work in which they are not applied. According to A.L. Bowley, "A knowledge of statistics is like a knowledge of foreign languages or of algebra; it may prove of use at any time under any circumstances." The importance of statistical science is increasing in almost all spheres of knowledge, e.g., astronomy, biology, meteorology, demography, economics and mathematics. Economic planning without statistics is bound to be baseless.

Statistics serve in administration and facilitate the formulation of new policies. Financial institutions and investors utilise statistical data to summarise past experience. Statistics are also helpful to an auditor when he uses sampling techniques or test checking to audit the accounts of his client.

Limitations of statistics

The scope of the science of statistics is restricted by certain limitations:

1. The use of statistics is limited to numerical studies: Statistical methods cannot be applied to study the nature of all types of phenomena. Statistics deal with only those facts which can be numerically expressed. For example, the health, poverty and intelligence of a group of individuals cannot be quantitatively measured, and thus are not suitable subjects for statistical study.

2. Statistical methods deal with populations or aggregates of individuals rather than with individuals. When we say that the average height of an Indian is 1 metre 80 centimetres, it shows the height not of any one individual but a value found by the study of all individuals.

3. Statistics relies on estimates and approximations: Statistical laws are not exact laws like mathematical or chemical laws. They are derived by taking a majority of cases and are not true for every individual. Thus statistical inferences are uncertain.

4. Statistical results might lead to fallacious conclusions through deliberate manipulation of figures and unscientific handling. This is so because statistical results are represented by figures, which are liable to be manipulated. Also, data placed in the hands of an inexpert may lead to fallacious results: the figures may be stated without their context or may be applied to a fact other than the one to which they really relate. An interesting example is a survey made some years ago which reported that 33% of all the girl students at Johns Hopkins University had married university teachers.

What is a frequency distribution?

Collected and classified data are presented in the form of a frequency distribution. A frequency distribution is simply a table in which the data are grouped into classes on the basis of common characteristics and the number of cases falling in each class is recorded. It shows the frequency of occurrence of different values of a single variable. A frequency distribution is constructed to satisfy three objectives :

(i) to facilitate the analysis of data,

(ii) to estimate frequencies of the unknown population distribution from the

distribution of sample data, and

(iii) to facilitate the computation of various statistical measures.

Frequency distributions can be of two types :

1. Univariate Frequency Distribution, and

2. Bivariate Frequency Distribution.


In this lesson, we shall study the univariate frequency distribution. A univariate distribution incorporates different values of one variable only, whereas a bivariate frequency distribution incorporates the values of two variables. The univariate frequency distribution is further classified into three categories:

(i) Series of individual observations,

(ii) Discrete frequency distribution, and

(iii) Continuous frequency distribution.

A series of individual observations is a simple listing of the items of each observation. If the marks of 14 students of a class in statistics are given individually, they form a series of individual observations.

Marks obtained in Statistics :

Roll Nos. 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Marks: 60 71 80 41 81 41 85 35 98 52 50 91 30 88

Marks in Ascending Order Marks in Descending Order

30 98

35 91

41 88

41 85

50 81

52 80

60 71

71 60

80 52

81 50

85 41

88 41

91 35

98 30

Discrete Frequency Distribution:-


In a discrete series, the data are presented in such a way that exact measurements of units are indicated. In a discrete frequency distribution, we count the number of times each value of the variable occurs in the given data. This is facilitated through the technique of tally bars.

In the first column, we write all the values of the variable. In the second column, we put a vertical bar, called a tally bar, against a value each time it occurs. After a particular value has occurred four times, for the fifth occurrence we put a cross tally mark ( / ) across the four tally bars to make a block of 5. The technique of putting a cross tally bar at every fifth repetition facilitates the counting of the number of occurrences of the value. After putting tally bars for all the values in the data, we count the number of times each value is repeated and write it against the corresponding value of the variable in the third column, entitled frequency. This type of representation of the data is called a discrete frequency distribution.

We are given marks of 42 students:

55 51 57 40 26 43 46 41 46 48 33 40 26 40 40 41

43 53 45 53 33 50 40 33 40 26 53 59 33 39 55 48

15 26 43 59 51 39 15 45 26 15

We can construct a discrete frequency distribution from the above given marks.

Marks of 42 Students

------------------------------------------

Marks Tally Bars Frequency

------------------------------------------

15 ||| 3

26 |||| / 5

33 |||| 4

39 || 2

40 |||| / 5

41 || 2

43 ||| 3

45 || 2

46 || 2

Chanderprabhu Jain College of Higher Studies

&

School of Law

An ISO 9001:2008 Certified Quality Institute

(Recognized by Govt. of NCT of Delhi, Affiliated to GGS Indraprastha University, Delhi)

48 || 2

50 | 1

51 || 2

53 ||| 3

55 ||| 3

57 | 1

59 || 2

Total 42

The presentation of data in the form of a discrete frequency distribution is better than a mere arrangement, but it does not condense the data as much as needed, and it is quite difficult to grasp and comprehend. This distribution is quite simple when the values of the variable are repeated; otherwise there will be hardly any condensation.
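The tally procedure above amounts to counting how often each value occurs; a minimal Python sketch, using the marks of the 42 students listed earlier, where `collections.Counter` plays the role of the tally bars:

```python
from collections import Counter

# Marks of the 42 students as listed in the text
marks = [55, 51, 57, 40, 26, 43, 46, 41, 46, 48, 33, 40, 26, 40, 40, 41,
         43, 53, 45, 53, 33, 50, 40, 33, 40, 26, 53, 59, 33, 39, 55, 48,
         15, 26, 43, 59, 51, 39, 15, 45, 26, 15]

# Counting occurrences of each value is equivalent to tallying
freq = Counter(marks)

for value in sorted(freq):
    print(value, freq[value])
```

(The raw listing and the printed table disagree by one for the values 40 and 55, apparently a transcription slip; the code simply reports what the listing contains.)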

Continuous Frequency Distribution:-

If neither the identity of the units about which a particular piece of information is collected nor the order in which the observations occur is relevant, then the first step of condensation is to classify the data into different classes: we divide the entire range of values of the variable into a suitable number of groups and then record the number of observations in each group. Thus, if we divide the total range of values of the variable (marks of 42 students), i.e. 59 – 15 = 44, into groups of 10 each, we get (44/10 ≈) 5 groups, and the distribution of marks is displayed by the following frequency distribution:

Marks of 42 Students

---------------------------------------------------------------------

Marks (X) Tally Bars Number of Students (f)

---------------------------------------------------------------------

15-25 ||| 3

25-35 |||| / |||| 9

35-45 |||| / |||| / || 12

45-55 |||| / |||| / || 12

55-65 |||| / | 6


----------------------------------------------------------------------

Total 42
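The grouping step can be sketched in Python: each mark is assigned to the class whose lower limit it reaches (lower limit inclusive, upper limit exclusive, as in the table). Note that since the raw listing above appears to contain one transcription slip, the counts it yields for the middle classes can differ by one from the printed table.

```python
# Bin the 42 marks listed earlier into the classes 15-25, 25-35, ..., 55-65
marks = [55, 51, 57, 40, 26, 43, 46, 41, 46, 48, 33, 40, 26, 40, 40, 41,
         43, 53, 45, 53, 33, 50, 40, 33, 40, 26, 53, 59, 33, 39, 55, 48,
         15, 26, 43, 59, 51, 39, 15, 45, 26, 15]

bins = [15, 25, 35, 45, 55, 65]          # class boundaries
counts = [0] * (len(bins) - 1)
for m in marks:
    for i in range(len(bins) - 1):
        if bins[i] <= m < bins[i + 1]:   # lower limit inclusive, upper exclusive
            counts[i] += 1
            break

print(counts)
```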

Graphs of Frequency Distributions

The guiding principles for the graphic representation of frequency distributions are the same as for the diagrammatic and graphic representation of other types of data. The information contained in a frequency distribution can be shown in graphs, which reveal important characteristics and relationships that are not easily discernible on a simple examination of the frequency tables. The most commonly used graphs for charting a frequency distribution are :

1. Histogram

2. Frequency polygon

3. Smoothed frequency curves

4. Ogives or cumulative frequency curves.

1. Histogram

The term ‘histogram’ must not be confused with the term ‘historigram’, which relates to time charts. A histogram is the best way of presenting a simple frequency distribution graphically. The statistical meaning of a histogram is that it is a graph that represents the class frequencies in a frequency distribution by vertical adjacent rectangles.

While constructing a histogram, the class intervals of the variable are always taken on the X-axis. The width of each rectangle on the X-axis remains the same if the class intervals are uniform throughout; if they differ, the widths of the rectangles change proportionately. The Y-axis represents the frequency of each class, which constitutes the height of its rectangle. We thus get a series of rectangles, each having the class-interval distance as its width and the frequency distance as its height. The area of the histogram represents the total frequency.

The histogram should be clearly distinguished from a bar diagram. A bar diagram is one-dimensional: only the length of the bar matters, not the width. A histogram is two-dimensional: both the length and the width are important. A histogram can be misleading if the distribution has unequal class intervals and suitable adjustments in frequencies are not made.

The technique of constructing histogram is explained for :

(i) distributions having equal class-intervals, and

(ii) distributions having unequal class-intervals.

--------------------------------------

Classes Frequency

--------------------------------------

0-10 5

10-20 11

20-30 19

30-40 21

40-50 16

50-60 10

60-70 8

70-80 6

80-90 3

90-100 2

--------------------------------------

Solution :

When class intervals are unequal, the frequencies must be adjusted before constructing a histogram. We take the class which has the lowest class interval and adjust the frequencies of the other classes accordingly. If one class interval is twice as wide as the lowest, we divide the height of its rectangle by two; if it is three times as wide, we divide it by three, and so on. The heights will then be proportional to the ratios of the frequencies to the widths of the classes.
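This adjustment amounts to plotting frequency density: each frequency is divided by the ratio of its class width to the smallest class width. A small sketch with illustrative class limits (these are not taken from the text):

```python
# Adjusted rectangle heights for unequal class intervals.
# Each tuple is (lower limit, upper limit, frequency); illustrative data only.
classes = [(0, 10, 5), (10, 30, 24), (30, 60, 30)]

smallest = min(u - l for l, u, f in classes)          # narrowest class width
# Divide each frequency by (width / smallest width) to get the bar height
heights = [f / ((u - l) / smallest) for l, u, f in classes]

print(heights)
```

Here the 20-wide class has its frequency halved and the 30-wide class has its frequency divided by three, so the rectangle areas stay proportional to the frequencies.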


2. Frequency Polygon

This is a graph of a frequency distribution which has more than four sides. It is particularly effective in comparing two or more frequency distributions. There are two ways of constructing a frequency polygon.

(i) We may draw a histogram of the given data and then join by straight lines the mid-points of the upper horizontal side of each rectangle with those of the adjacent ones. The figure so formed is the frequency polygon. Both ends of the polygon should be extended to the base line in order to make the area under the frequency polygon equal to the area under the histogram.

(ii) Another method is to take the mid-points of the various class intervals, plot the frequency corresponding to each point, and join all these points by straight lines. The figures obtained by the two methods are identical.

The frequency polygon has an advantage over the histogram: the frequency polygons of several distributions can be drawn on the same axes, which makes comparisons possible, whereas histograms cannot be used in this way. To compare histograms we need to draw them on separate graphs.

3. Cumulative Frequency Curves or Ogives

We have discussed the charting of simple distributions where each frequency refers to the measurement of the class interval against which it is placed. Sometimes it becomes necessary to know the number of items whose values are greater or less than a certain amount. We may, for example, be interested in knowing the number of students whose weight is less than 65 lbs. or more than, say, 15.5 lbs. To get this information, it is necessary to change the form of the frequency distribution from a simple to a cumulative distribution. In a cumulative frequency distribution, the frequency of each class is made to include the frequencies of all the lower or all the upper classes, depending upon the manner in which cumulation is done. The graph of such a distribution is called a cumulative frequency curve or an ogive.

There are two methods of constructing ogives, namely:

(i) less than method, and

(ii) more than method.


In the less than method, we start with the upper limit of each class and go on adding the frequencies. When these cumulative frequencies are plotted, we get a rising curve. In the more than method, we start with the lower limit of each class and go on subtracting the cumulated frequencies from the total frequency. When these frequencies are plotted, we get a declining curve. The following example illustrates both types of ogives.

Example : Draw ogives by both the methods from the following data.

Distribution of weights of the students of a college (lbs.)

-----------------------------------------------------

Weights No. of Students

-----------------------------------------------------

90.5-100.5 5

100.5-110.5 34

110.5-120.5 139

120.5-130.5 300

130.5-140.5 367

140.5-150.5 319

150.5-160.5 205

160.5-170.5 76

170.5-180.5 43

180.5-190.5 16

190.5-200.5 3

200.5-210.5 4

210.5-220.5 3

220.5-230.5 1

-----------------------------------------------------

Solution : First of all we shall find out the cumulative frequencies of the given

data by less than method.

--------------------------------------------------------------

Less than (Weights) Cumulative Frequency


--------------------------------------------------------------

100.5 5

110.5 39

120.5 178

130.5 478

140.5 845

150.5 1164

160.5 1369

170.5 1445

180.5 1488

190.5 1504

200.5 1507

210.5 1511

220.5 1514

230.5 1515

--------------------------------------------------------------

Plot these frequencies and weights on a graph paper. The curve formed is called an ogive. Now we calculate the cumulative frequencies of the given data by the more than method.

--------------------------------------------------------------

More than (Weights) Cumulative Frequencies

--------------------------------------------------------------

90.5 1515

100.5 1510

110.5 1476

120.5 1337

130.5 1037

140.5 670

150.5 351

160.5 146


170.5 70

180.5 27

190.5 11

200.5 8

210.5 4

220.5 1

--------------------------------------------------------------

By plotting these frequencies on a graph paper, we get a declining curve, which is our cumulative frequency curve or ogive by the more than method.
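Both cumulations can be checked mechanically. A short Python sketch using the class frequencies of the weight table (5, 34, 139, ... from the lowest class upward):

```python
from itertools import accumulate

# Class frequencies of the weight distribution, from 90.5-100.5 upward
freq = [5, 34, 139, 300, 367, 319, 205, 76, 43, 16, 3, 4, 3, 1]

less_than = list(accumulate(freq))            # 'less than' ogive ordinates
total = less_than[-1]                         # total number of students
# 'more than' ordinate at each lower limit = total minus everything below it
more_than = [total] + [total - c for c in less_than[:-1]]

print(less_than)
print(more_than)
```

The printed lists reproduce the two cumulative-frequency columns above (1515 students in total).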

Although graphs are a powerful and effective method of presenting statistical data, they are not, under all circumstances and for all purposes, complete substitutes for tabular and other forms of presentation. The specialist in this field is one who recognizes not only the advantages but also the limitations of these techniques. He knows when to use and when not to use these methods and, from his experience and expertise, is able to select the most appropriate method for every purpose.

Example : Draw an ogive by the less than method and determine the number of companies earning profits between Rs. 45 crores and Rs. 75 crores :

------------------------------------------------------------------------

Profits No. of Profits No. of

(Rs. crores) Companies (Rs. crores) Companies

------------------------------------------------------------------------

10—20 8 60—70 10

20—30 12 70—80 7

30—40 20 80—90 3

40—50 24 90—100 1

50—60 15

------------------------------------------------------------------------

Solution :

OGIVE BY LESS THAN METHOD


-----------------------------------------------

Profits No.of

(Rs. crores) Companies

----------------------------------------------

Less than 20 8

Less than 30 20

Less than 40 40

Less than 50 64

Less than 60 79

Less than 70 89

Less than 80 96

Less than 90 99

Less than 100 100

-----------------------------------------------

It is clear from the graph that the number of companies earning profits of less than Rs. 75 crores is 92 and the number earning less than Rs. 45 crores is 51. Hence the number of companies earning profits between Rs. 45 crores and Rs. 75 crores is 92 – 51 = 41.
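Reading values off the ogive corresponds to linear interpolation between the plotted points. The sketch below interpolates on the cumulative table; exact interpolation gives 52.0 and 92.5, close to the graph readings of 51 and 92 quoted above (values read off a drawn curve are approximate):

```python
# 'Less than' ogive for the profits data: upper class limits and cumulative counts
upper_limits = [20, 30, 40, 50, 60, 70, 80, 90, 100]
cum_freq     = [8, 20, 40, 64, 79, 89, 96, 99, 100]

def below(x):
    """Estimate the number of companies with profits below x (Rs. crores).

    Valid for x between the first and last upper limits."""
    for i in range(1, len(upper_limits)):
        if x <= upper_limits[i]:
            x0, x1 = upper_limits[i - 1], upper_limits[i]
            y0, y1 = cum_freq[i - 1], cum_freq[i]
            # straight-line interpolation between adjacent ogive points
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return cum_freq[-1]

print(below(75) - below(45))
```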

Example : The following distribution relates to the weight in grams of mangoes of a given variety. If mangoes weighing less than 443 grams are considered unsuitable for the foreign market, what is the percentage of the total mangoes suitable for it? Assume the given frequency distribution to be typical of the variety:

------------------------------------------------------------------------------------------------

Weight in gms. No. of mangoes Weight in gms. No. of mangoes

---------------------------------------------------------------------------------

410 – 419 10 450 – 459 45

420 – 429 20 460 – 469 18

430 – 439 42 470 – 479 7

440 – 449 54

---------------------------------------------------------------------------------


Draw an ogive of ‘more than’ type of the above data and deduce how many

mangoes will be more than 443 grams.

Solution : Mangoes weighing more than 443 gms. are suitable for the foreign market. These mangoes lie in the last four classes. The number of mangoes weighing between 444 and 449 grams would be 54 × 6/10 = 32.4.

Total number of mangoes weighing more than 443 gms. = 32.4 + 45 + 18 + 7 = 102.4

Percentage of mangoes = (102.4/196) × 100 = 52.24

Therefore, the percentage of the total mangoes suitable for the foreign market is about 52.24.

OGIVE BY MORE THAN METHOD

------------------------------------------------------------------

Weight more than (gms.) No. of Mangoes

------------------------------------------------------------------

410 196

420 186

430 166

440 124

450 70

460 25

470 7

------------------------------------------------------------------

From the graph it can be seen that there are about 103 mangoes whose weight is more than 443 gms. and which are suitable for the foreign market.
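The proportional-allocation step in the solution above can be verified as follows:

```python
# Frequency of each weight class (grams); keys are (lower, upper) limits
freq = {(410, 419): 10, (420, 429): 20, (430, 439): 42, (440, 449): 54,
        (450, 459): 45, (460, 469): 18, (470, 479): 7}

total = sum(freq.values())                    # 196 mangoes in all

# Of the ten weight values 440..449, the six values 444..449 exceed 443 g,
# so 6/10 of that class is allocated above the cut-off
part_440s = freq[(440, 449)] * 6 / 10         # 32.4

above_443 = part_440s + freq[(450, 459)] + freq[(460, 469)] + freq[(470, 479)]
print(round(100 * above_443 / total, 2))      # percentage suitable for export
```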

DIAGRAMS:-

Statistical data can be presented by means of frequency tables, graphs and diagrams. So far in this lesson we have discussed graphical presentation; now we shall take up the study of diagrams. There are many varieties of diagrams, but here we are concerned with the following type only :

(i) Bar diagrams

Bar Diagram:-

A bar diagram may be simple, component or multiple. A simple bar diagram is used to represent only one variable; the length of each bar is proportional to the magnitude to be represented. When we are interested in showing the various parts of a whole, a component bar diagram is used; and when comparisons of more than one variable are to be made at the same time, a multiple bar chart, which groups two or more bars together, is made use of. We shall now illustrate these by examples.

We shall now illustrate these by examples.

Example 1 : The following table gives the average approximate yield of rice in lbs. per acre in various countries of the world in 2000–05.

-------------------------------------------------------

Country Yield in lbs. per acre

-------------------------------------------------------

India 728

Siam 943

U.S.A. 1469

Italy 2903

Egypt 2153

Japan 2276

-------------------------------------------------------

Indicate this by a suitable diagram.

Solution :

In the above example, bars have been erected vertically; bars may also be erected horizontally.

One of the important objectives of statistics is to find out various numerical values which explain the inherent characteristics of a frequency distribution. The first of such measures is the average. Averages are measures which condense a huge, unwieldy set of numerical data into single numerical values that represent the entire distribution. The inherent inability of the human mind to remember a large body of numerical data compels us to seek a few constants that will describe the data. Averages provide us with the gist, and give a bird's eye view, of a huge mass of unwieldy numerical data. Averages are the typical values around which the other items of the distribution congregate. These values lie between the two extreme observations of the distribution and give us an idea about the concentration of values in the central part of the distribution. They are called measures of central tendency.


Averages are also called measures of location, since they enable us to locate the position or place of the distribution in question. Averages are statistical constants which enable us to comprehend in a single value the significance of the whole group. According to Croxton and Cowden, an average value is a single value within the range of the data that is used to represent all the values in that series. Since an average is somewhere within the range of the data, it is sometimes called a measure of central value. An average is the most typical representative item of the group to which it belongs and is capable of revealing all the important characteristics of that group or distribution.

What are the Objects of Central Tendency?

The most important object of calculating an average or measuring central tendency is to determine a single figure which may be used to represent a whole series involving magnitudes of the same variable. The second object is that, since an average represents the entire data, it facilitates comparison within one group or between groups of data. Thus, the performance of the members of a group can be compared with the average performance of different groups.

The third object is that an average helps in computing various other statistical measures such as dispersion, skewness, kurtosis, etc.

Essentials of a Good Average

Since an average represents the statistical data and is used for purposes of comparison, it must possess the following properties:

1. It must be rigidly defined and not left to the mere estimation of the observer. If the definition is rigid, the value of the average computed by different persons will be the same.

2. The average must be based upon all the values given in the distribution. If it is not based on all the values, it might not be representative of the entire group of data.

3. It should be easily understood. The average should possess simple and obvious properties. It should not be too abstract for the common people.

4. It should be capable of being calculated with reasonable care and rapidity.

5. It should be stable and unaffected by sampling fluctuations.

6. It should be capable of further algebraic manipulation.

Different methods of measuring “Central Tendency” provide us with different

kinds of averages. The following are the main types of averages that are commonly


used:

1. Mean

(i) Arithmetic mean

(ii) Weighted mean

(iii) Geometric mean

(iv) Harmonic mean

2. Median

3. Mode

Arithmetic Mean: The arithmetic mean of a series is the quotient obtained by dividing the sum of the values by the number of items. In algebraic language, if X1, X2, X3, ......., Xn are the n values of a variate X, then the arithmetic mean is defined by the following formula:

X̄ = (X1 + X2 + X3 + ....... + Xn)/n = ∑X/N

Example : The following are the monthly salaries (Rs.) of ten employees in an office. Calculate the mean salary of the employees: 250, 275, 265, 280, 400, 490, 670, 890, 1100, 1250.

Solution : X̄ = ∑X/N = 5870/10 = Rs. 587

Short-cut Method: The direct method is suitable where the number of items is moderate and the figures are small and integral. But if the number of items is large and/or the values of the variate are big, the process of adding together all the values may be lengthy. To overcome this difficulty of computation, a short-cut method may be used. The short-cut method is based on an important characteristic of the arithmetic mean: the algebraic sum of the deviations of a series of individual observations from their mean is always equal to zero. Thus the deviations of the various values of the variate from an assumed mean are computed, their sum is divided by the number of items, and the quotient is added to the assumed mean to find the arithmetic mean.

Symbolically, X̄ = A + ∑dx/N, where A is the assumed mean and dx = (X – A) are the deviations.

We can solve the previous example by short-cut method.

Computation of Arithmetic Mean

----------------------------------------------------------------------------------


Number X dx = (X – A), where A = 400

----------------------------------------------------------------------------------

1. 250 –150

2. 275 –125

3. 265 –135

4. 280 –120

5. 400 0

6. 490 +90

7. 670 +270

8. 890 +490

9. 1100 + 700

10. 1250 + 850

----------------------------------------------------------------

N = 10 ∑dx = 1870

--------------------------------------------------------------

By substituting the values in the formula, we get

X̄ = 400 + 1870/10 = 400 + 187 = Rs. 587
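Both the direct and the short-cut computations for the salary example can be checked in a few lines of Python:

```python
# Monthly salaries (Rs.) of the ten employees from the example above
salaries = [250, 275, 265, 280, 400, 490, 670, 890, 1100, 1250]

direct = sum(salaries) / len(salaries)       # direct method: sum X / N

A = 400                                      # assumed mean
dx = [x - A for x in salaries]               # deviations dx = X - A
short_cut = A + sum(dx) / len(salaries)      # short-cut: A + sum dx / N

print(direct, short_cut)                     # both give 587.0
```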

Computation of the Arithmetic Mean in a Discrete Series: In a discrete series, the arithmetic mean may be computed by both the direct and the short-cut method. The formula according to the direct method is:

X̄ = ∑fX/N

where the variable values X1 X2, .......... Xn, have frequencies f1, f2, ................fn

and N = ∑f.

Example : The following table gives the distribution of 100 accidents during the seven days of the week in a given month. During that month there were 5 Fridays and 5 Saturdays and only four of each of the other days. Calculate the average number of accidents per day.

Days : Sun. Mon. Tue. Wed. Thur. Fri. Sat. Total

Number of accidents : 20 22 10 9 11 8 20 100


------------------------------------------------------------

Day No. of No. of Days Total Accidents

Accidents in Month

X f fX

-------------------------------------------------------------

Sunday 20 4 80

Monday 22 4 88

Tuesday 10 4 40

Wednesday 9 4 36

Thursday 11 4 44

Friday 8 5 40

Saturday 20 5 100

--------------------------------------------------------------

∑X = 100 N = 30 ∑fX = 428

--------------------------------------------------------------

X̄ = ∑fX/N = 428/30 = 14.27 ≈ 14 accidents per day

The formula for the computation of the arithmetic mean according to the short-cut method is X̄ = A + ∑fdx/N, where A is the assumed mean, dx = (X – A) and N = ∑f.

We can solve the previous example by short-cut method as given below :

Calculation of Average Accidents per Day

-------------------------------------------------------------------------

Day X dx = X – A f fdx

(where A = 10)

--------------------------------------------------------------------------

Sunday 20 + 10 4 + 40

Monday 22 + 12 4 + 48

Tuesday 10 +0 4 +0

Wednesday 9 –1 4 –4

Thursday 11 +1 4 +4

Friday 8 –2 5 - 10

Saturday 20 + 10 5 + 50

----------------------------------------------------------------------


N = 30 ∑fdx = + 128

-----------------------------------------------------------------------

X̄ = A + ∑fdx/N = 10 + 128/30 = 10 + 4.27 = 14.27 ≈ 14 accidents per day
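The discrete-series (weighted) mean above can be sketched as:

```python
# Accidents per weekday (X) and the number of times each weekday occurred
# in the month (f), Sunday through Saturday
accidents = [20, 22, 10, 9, 11, 8, 20]
days      = [4, 4, 4, 4, 4, 5, 5]

N = sum(days)                                        # 30 days in the month
mean = sum(x * f for x, f in zip(accidents, days)) / N   # sum fX / N

print(round(mean, 2))                                # 14.27 accidents per day
```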

Calculation of arithmetic mean for Continuous Series: The arithmetic mean can

be computed both by direct and short-cut method. In addition, a coding method or

step deviation method is also applied for simplification of calculations. In any case,

it is necessary to find out the mid-values of the various classes in the frequency

distribution before arithmetic mean of the frequency distribution can be computed.

Once the mid-points of various classes are found out, then the process of the

calculation of arithmetic mean is same as in the case of discrete series. In case of

direct method, the formula to be used:

= , when m = mid points of various classes and N = total frequency In the short-cut

method, the following formula is applied:

= where dx = (m – A) and N = ∑f

The short-cut method can further be simplified in practice and is named coding

method. The deviations from the assumed mean are divided by a common factor to

reduce their size. The sum of the products of the deviations and frequencies is

multiplied by this common factor and then it is divided by the total frequency and

added to the assumed mean. Symbolically

X̄ = A + (∑fdx′ / N) × i, where dx′ = (m – A) / i and i = common factor
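A sketch of the direct and step-deviation (coding) methods for a continuous series; the class intervals and frequencies below are hypothetical, chosen only to illustrate the formulas:

```python
# Hypothetical continuous series (the intervals are not from the text).
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freq    = [5, 8, 15, 8, 4]
mids    = [(lo + hi) / 2 for lo, hi in classes]   # mid-points m
N = sum(freq)

# Direct method: mean = sum(f * m) / N
mean_direct = sum(f * m for m, f in zip(mids, freq)) / N

# Coding method: dx' = (m - A) / i, mean = A + i * sum(f * dx') / N
A, i = 25, 10                                     # assumed mean and class width
mean_coded = A + i * sum(f * (m - A) / i for m, f in zip(mids, freq)) / N
```

The coding method keeps every intermediate number small, which is the whole point of the simplification described above.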

Geometric Mean :

In general, if we have n numbers (none of them being zero), then the GM. is

defined as

G.M. = n√(x1 × x2 × … × xn)

In case of a discrete series, if x1, x2,............. xn occur f1, f2, ............... fn times

respectively and N is

the

total frequency (i.e. N = f1 + f2...................fn ), then

G.M. = N√(x1^f1 × x2^f2 × … × xn^fn)

For convenience, use of logarithms is made extensively to calculate the nth root. In

terms of logarithms

G.M. = antilog (∑f log x / N)


and in case of continuous series, G.M. = antilog (∑f log m / N), where m = mid-points of the classes.

Example : Calculate geometric mean of the following data :

x 5 6 7 8 9 10 11

f 2 4 7 10 9 6 2

Solution : Calculation of G.M.

-----------------------------------------------------------------------

x log x f f log x

----------------------------------------------------------------------

5 0.6990 2 1.3980

6 0.7782 4 3.1128

7 0.8451 7 5.9157

8 0.9031 10 9.0310

9 0.9542 9 8.5878

10 1.0000 6 6.0000

11 1.0414 2 2.0828

------------------------------------------------------------------------

N = 40 ∑f log x = 36.1281

--------------------------------------------------------------------------

G.M. = antilog (∑f log x / N) = antilog (36.1281 / 40) = antilog (0.9032) = 8.002
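The logarithmic computation can be reproduced as follows (an illustrative sketch using common logarithms, as in the table above):

```python
import math

x = [5, 6, 7, 8, 9, 10, 11]
f = [2, 4, 7, 10, 9, 6, 2]
N = sum(f)                                            # 40

# G.M. = antilog( sum(f * log10 x) / N )
log_sum = sum(fi * math.log10(xi) for xi, fi in zip(x, f))
gm = 10 ** (log_sum / N)                              # antilog

print(round(log_sum, 4), round(gm, 3))   # 36.1281 8.002
```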

Median

The median is that value of the variable which divides the group in two equal parts.

One part comprising the values greater than and the other all values less than

median. Median of a distribution may be defined as that value of the variable

which exceeds and is exceeded by the same number of observation. It is the value

such that the number of observations above it is equal to the number of

observations below it.

While the arithmetic mean is based on all items of the distribution, the

median is a positional average; that is, it depends upon the position occupied by a

value in the frequency distribution.

When the items of a series are arranged in ascending or descending order of

magnitude the value of the middle item in the series is known as median in the case

of individual observation. Symbolically.

Median = size of (N + 1)/2 th item


If the number of items is even, then there is no value exactly in the middle of the

series. In such a situation the median is arbitrarily taken to be halfway between the

two middle items. Symbolically.

Median = [size of (N/2)th item + size of (N/2 + 1)th item] / 2

Location of Median in Discrete series: In a discrete series, median is computed

in the following manner:

(i) Arrange the given variable data in ascending or descending order,

(ii) Find cumulative frequencies.

(iii) Apply Med. = size of (N + 1)/2 th item

(iv) Locate the median according to this size, i.e., the variable corresponding to

the cumulative frequency equal to or next greater than this size.

Example: Following are the number of rooms in the houses of a particular locality.

Find median of the data:

No. of rooms: 3 4 5 6 7 8

No of houses: 38 654 311 42 12 2

Solution: Computation of Median

------------------------------------------------------------------------

No. of Rooms No. of Houses cumulative Frequency

X f Cf

-----------------------------------------------------------------------

3 38 38

4 654 692

5 311 1003

6 42 1045

7 12 1057

8 2 1059

------------------------------------------------------------------

Median = size of (N + 1)/2 th item = size of (1059 + 1)/2 th item = 530th item.

Median lies in the cumulative frequency of 692 and the value corresponding to this

is 4

Therefore, Median = 4 rooms.
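The cumulative-frequency lookup can be sketched as:

```python
rooms  = [3, 4, 5, 6, 7, 8]
houses = [38, 654, 311, 42, 12, 2]
N = sum(houses)                      # 1059

target = (N + 1) / 2                 # the 530th item
cum = 0
for x, f in zip(rooms, houses):
    cum += f
    if cum >= target:
        median = x                   # first value whose cum. frequency covers 530
        break

print(median)   # 4
```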

In a continuous series, median is computed in the following manner:

(i) Arrange the given variable data in ascending or descending order.


(ii) If an inclusive series is given, it must be converted into an exclusive series to find

the real class intervals.

(iii) Find cumulative frequencies.

(iv) Apply Median = size of N/2 th item to ascertain median class.

(v) Apply formula of interpolation to ascertain the value of median.

Median = l1 + ((N/2 – cf0) / f) × (l2 – l1)  or  Median = l2 – ((cf – N/2) / f) × (l2 – l1), where cf is the cumulative frequency of the median class,

where, l1 refers to lower limit of median class,

l2 refers to higher limit of median class,

cf0 refers to the cumulative frequency of the class preceding the median class,

f refers to the frequency of the median class,

Example: The following table gives you the distribution of marks secured by some

students in an examination:

Marks No. of Students

0—20 42

21—30 38

31—40 120

41—50 84

51— 60 48

61—70 36

71—80 31

Find the median marks.

Solution: Calculation of Median Marks

---------------------------------------------------

Marks No. of Students cf

(x) (f)

--------------------------------------------------

0 – 20 42 42

21 – 30 38 80

31 – 40 120 200

41 – 50 84 284

51 – 60 48 332

61 – 70 36 368

71 – 80 31 399


---------------------------------------------------

Median = size of N/2 th item = size of 399/2 th item = 199.5th item.

which lies in (31 – 40) group, therefore the median class is 30.5 – 40.5.

Applying the formula of interpolation.

Median = l1 +

= 30.5 + ((199.5 – 80) / 120) × 10 = 30.5 + 9.96 = 40.46 marks
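The interpolation for this example can be verified with a short sketch:

```python
N   = 399
cf0 = 80          # cumulative frequency before the median class
f   = 120         # frequency of the median class
l1, l2 = 30.5, 40.5   # real limits of the median class

median = l1 + (N / 2 - cf0) / f * (l2 - l1)
print(round(median, 2))   # 40.46
```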

Mode

Mode is that value of the variable which occurs or repeats itself the maximum number

of times. The mode is the most “fashionable” size in the sense that it is the most

common and typical and is defined by Zizek as “the value occurring most

frequently in series of items and around which the other items are distributed most

densely.” In the words of Croxton and Cowden, the mode of a distribution is the

value at the point where the items tend to be most heavily concentrated. According

to A.M. Tuttle, mode is the value which has the greatest frequency density in its

immediate neighbourhood. In the case of individual observations, the mode is that

value which is repeated the maximum number of times in the series. The value of

mode can be denoted by the alphabet z also.

Example: The marks obtained by ten students are given below. Calculate the mode.

Sr. Number : 1 2 3 4 5 6 7 8 9 10

Marks obtained : 10 27 24 12 27 27 20 18 15 30

Solution :

----------------------------------------

Marks No. of students

----------------------------------------

10 1

12 1

15 1

18 1

20 1

24 1

27 3 Mode is 27 marks

30 1


----------------------------------------

Calculation of Mode in Discrete series: In a discrete series, the mode is quite often

determined by inspection. We can understand this with the help of an example:

X 1 2 3 4 5 6 7

f 4 5 13 6 12 8 6

By inspection, the modal size is 3 as it has the maximum frequency. But this test of

greatest frequency is not foolproof, as it is not only the frequency of a single class

but also the frequencies of the neighbouring classes that decide the mode. In such

cases, we shall be using the method of grouping and the analysis table.

Example: Determine the mode of the following data:

Size of shoe 1 2 3 4 5 6 7

Frequency 4 5 13 6 12 8 6

Solution : By inspection, the mode is 3, but the mode may in fact be 5. This is so

because the neighbouring frequencies of size 5 are greater than the neighbouring

frequencies of size 3. This effect of neighbouring frequencies is examined with the help

of the grouping and analysis table technique.
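A minimal sketch of the inspection method alone; it simply reports the value with the largest single frequency (3 for the shoe data), while the grouping and analysis-table refinement described above, which is not implemented here, may point to 5 instead:

```python
size = [1, 2, 3, 4, 5, 6, 7]
freq = [4, 5, 13, 6, 12, 8, 6]

# Mode by inspection: the value carrying the maximum frequency.
mode_by_inspection = size[freq.index(max(freq))]
print(mode_by_inspection)   # 3
```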

Measures of dispersion

For the study of dispersion, we need some measures which show whether the

dispersion is small or large. There are two types of measures of dispersion:

(a) Absolute Measures of Dispersion

(b) Relative Measures of Dispersion

Absolute Measures of Dispersion

These measures give us an idea about the amount of dispersion in a set of

observations. They give the answers in the same units as the units of the original

observations. When the observations are in kilograms, the absolute measure is also

in kilograms. If we have two sets of observations, we cannot always use the

absolute measures to compare their dispersions. We shall explain later as to when

the absolute measures can be used for comparison of dispersion in two or more sets

of data. The absolute measures which are commonly used are:

1. The Range


2. The Quartile Deviation

3. The Mean Deviation

4. The Standard Deviation and Variance

Relative Measures of Dispersion

These measures are calculated for the comparison of dispersion in two or more sets

of observations. These measures are free of the units in which the original data is

measured. If the original data is in dollars or kilometers, we do not use these units

with relative measures of dispersion. These measures are a sort of ratio and are

called coefficients. Each absolute measure of dispersion can be converted into its

relative measure. Thus the relative measures of dispersion are the coefficient of range, the coefficient of quartile deviation, the coefficient of mean deviation, and the coefficient of variation.

Range

The range is defined as the difference between the maximum and minimum observations. It is intuitively obvious why we define range in

statistics this way - range should suggest how diversely spread out the values


are, and by computing the difference between the maximum and minimum

values, we can get an estimate of the spread of the data.

For example, suppose an experiment involves finding out the weight of lab rats and

the values in grams are 320, 367, 423, 471 and 480. In this case, the range is 480 – 320 = 160 grams.

Range is quite a useful indication of how spread out the data is, but it has some

serious limitations. This is because sometimes data can have outliers that are

widely off the other data points. In these cases, the range might not give a true

indication of the spread of data.

For example, in our previous case, consider a small baby rat added to the data set

that weighs only 50 grams. Now the range is computed as 480-50 = 430 grams,

which looks like a false indication of the dispersion of data.

Another limitation of the range is that it is computed taking only two data points

into consideration. Thus it cannot give a very good estimate of how the overall data behaves.
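The rat-weight example, including the outlier effect, can be sketched as:

```python
weights = [320, 367, 423, 471, 480]            # lab-rat weights in grams
rng = max(weights) - min(weights)              # 480 - 320 = 160

# Adding the 50 g baby rat distorts the range badly:
with_outlier = weights + [50]
rng_outlier = max(with_outlier) - min(with_outlier)   # 480 - 50 = 430
```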

Mean deviation

Mean deviation is the sum of the absolute deviations (ignoring signs) taken from an

average, divided by the number of items in the distribution. The average can be the

mean, median or mode. Theoretically, the median is the best average of choice,

because the sum of deviations from the median is minimum, provided signs are

ignored. However, practically speaking, the arithmetic mean is the most commonly

used average for calculating mean deviation, and it is denoted by the symbol MD.


We're going to discuss methods to compute the Mean Deviation for three types of

series:

• Individual Data Series

• Discrete Data Series

• Continuous Data Series

Individual Data Series

When data is given on an individual basis, we have an individual series. Following is an example of an individual

series:

Items 5 10 20 30 40 50 60 70

Discrete Data Series

When data is given along with frequencies, we have a discrete series. Following is an example of a

discrete series:

Items 5 10 20 30 40 50 60 70

Frequency 2 5 1 3 12 0 5 7
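For the discrete series above, the mean deviation taken about the arithmetic mean can be sketched as:

```python
items = [5, 10, 20, 30, 40, 50, 60, 70]
freq  = [2, 5, 1, 3, 12, 0, 5, 7]
N = sum(freq)                                                  # 35

mean = sum(f * x for x, f in zip(items, freq)) / N             # about 41.14
md = sum(f * abs(x - mean) for x, f in zip(items, freq)) / N   # mean deviation
```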

Continuous Data Series

When data is given based on ranges along with their frequencies, we have a continuous series. Following is an

example of a continuous series:

Frequency 2 5 1 3 12

The mean difference (more correctly, 'difference in means') is a standard statistic

that measures the absolute difference between the mean value in two groups in a

clinical trial. It estimates the amount by which the experimental intervention

changes the outcome on average compared with the control.


Formula

Mean Difference = ∑x1/n − ∑x2/n

Where −

• x1 = values of group one

• x2 = values of group two

• n = sample size

Example

Problem Statement:

There are 2 dance groups whose data is listed below. Find the mean difference

between these dance groups.

Group 1 3 9 5 7

Group 2 5 3 4 4

Solution:

∑x1 = 3 + 9 + 5 + 7 = 24

∑x2 = 5 + 3 + 4 + 4 = 16

M1 = ∑x1/n = 24/4 = 6

M2 = ∑x2/n = 16/4 = 4

Mean Difference = M1 − M2 = 6 − 4 = 2
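The same computation as a sketch:

```python
group1 = [3, 9, 5, 7]
group2 = [5, 3, 4, 4]
n = len(group1)                                   # both groups have n = 4

mean_diff = sum(group1) / n - sum(group2) / n
print(mean_diff)   # 2.0
```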

Standard Deviation



• 1. Work out the Mean (the simple average of the numbers)

• 2. Then for each number: subtract the Mean and square the result

• 3. Then work out the mean of those squared differences.

• 4. Take the square root of that and we are done!

The formula actually says all of that, and I will show you how.


In the formula above μ (the greek letter "mu") is the mean of all our values ...

μ = (9+2+5+4+12+7+8+11+9+3+7+4+12+5+4+10+9+6+9+4) / 20

= 140/20 = 7

So:

μ=7

Step 2. Then for each number: subtract the Mean and square the result


So it says "for each value, subtract the mean and square the result", like this

Example (continued):

(9 − 7)² = (2)² = 4

(2 − 7)² = (−5)² = 25

(5 − 7)² = (−2)² = 4

(4 − 7)² = (−3)² = 9

(7 − 7)² = (0)² = 0

(8 − 7)² = (1)² = 1

To work out the mean, add up all the values then divide by how many.


But how do we say "add them all up" in mathematics? We use "Sigma": Σ

Sigma Notation

We want to add up all the values from 1 to N, where N=20 in our case because

there are 20 values:

Example (continued):

We already calculated (x1 − 7)² = 4 etc. in the previous step, so just sum them up:

= 4+25+4+9+25+0+1+16+4+16+0+9+25+4+9+9+4+1+4+9 = 178

But that isn't the mean yet, we need to divide by how many, which is done

by multiplying by 1/N (the same as dividing by N):

Example (continued):

(1/20) × 178 = 8.9

(This mean of the squared differences is called the Variance.)


Example (concluded):

σ = √(8.9) = 2.983...

DONE!
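The whole population calculation can be verified with a sketch:

```python
values = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]
N = len(values)                                          # 20

mu = sum(values) / N                                     # 7.0
variance = sum((x - mu) ** 2 for x in values) / N        # 178 / 20 = 8.9
sigma = variance ** 0.5                                  # about 2.983
```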

Example: Sam has 20 rose bushes, but only counted the flowers on 6 of them!


The "population" is all 20 rose bushes, and the "sample" is the 6 bushes that Sam counted the flowers of.

9, 2, 5, 4, 12, 7

But when we use the sample as an estimate of the whole population, the Standard

Deviation formula changes: instead of dividing by N, we divide the sum of the

squared differences by N − 1 (this is known as "Bessel's correction").

The symbols also change to reflect that we are working on a sample instead of the

whole population:

• The mean is now x (for sample mean) instead of μ (the population mean),

• And the answer is s (for Sample Standard Deviation) instead of σ.

But that does not affect the calculations. Only N-1 instead of N changes the

calculations.


Example 2: Using sampled values 9, 2, 5, 4, 12, 7

So:

x = 6.5

Step 2. Then for each number: subtract the Mean and square the result

Example 2 (continued):

(9 − 6.5)² = 6.25
(2 − 6.5)² = 20.25
(5 − 6.5)² = 2.25
(4 − 6.5)² = 6.25
(12 − 6.5)² = 30.25
(7 − 6.5)² = 0.25

To work out the mean, add up all the values then divide by how many.

But hang on ... we are calculating the Sample Standard Deviation, so instead of

dividing by how many (N), we will divide by N-1


Example 2 (continued):

Sum of squared differences = 6.25 + 20.25 + 2.25 + 6.25 + 30.25 + 0.25 = 65.5

65.5 / (6 − 1) = 65.5/5 = 13.1

Example 2 (concluded):

s = √(13.1) = 3.619...

DONE!
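The sample version differs from the population version only in the N − 1 divisor:

```python
sample = [9, 2, 5, 4, 12, 7]
n = len(sample)

x_bar = sum(sample) / n                                  # 6.5
s2 = sum((x - x_bar) ** 2 for x in sample) / (n - 1)     # 65.5 / 5 = 13.1
s = s2 ** 0.5                                            # about 3.619
```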

Coefficient of variation

How to Find a Coefficient of Variation: Contents:

1. What is the Coefficient of Variation?

2. How to Find the Coefficient of Variation

What is the Coefficient of Variation?


the ratio of the standard deviation to the mean (average). For example, the

expression “The standard deviation is 15% of the mean” is a CV.

The CV is particularly useful when you want to compare results from two different

surveys or tests that have different measures or values. For example, if you are

comparing the results from two tests that have different scoring mechanisms. If

sample A has a CV of 12% and sample B has a CV of 25%, you would say that

sample B has more variation, relative to its mean.

Formula

The formula for the coefficient of variation is:

Coefficient of Variation = (Standard Deviation / Mean) * 100.

In symbols: CV = (SD / x̄) * 100.

Multiplying the coefficient by 100 is an optional step to get a percentage, as

opposed to a decimal.

A researcher is comparing two multiple-choice tests with different conditions. In

the first test, a typical multiple-choice test is administered. In the second test,

alternative choices (i.e. incorrect answers) are randomly assigned to test takers.

The results from the two tests are:


Regular Test Randomized Answers

Mean 59.9 44.8

SD 10.2 12.7

Comparing the standard deviations doesn't really work, because the means are also

different. Calculation using the formula CV = (SD/Mean)*100 helps to make sense of the data:

Regular Test Randomized Answers

Mean 59.9 44.8

SD 10.2 12.7

CV 17.03 28.35

Looking at the standard deviations of 10.2 and 12.7, you might think that the tests

have similar results. However, when you adjust for the difference in the means, the

results have more significance:

Regular test: CV = 17.03

Randomized answers: CV = 28.35
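A sketch of the comparison; the means of 59.9 and 44.8 are assumed values consistent with the CVs quoted above:

```python
def coeff_of_variation(sd, mean):
    # CV = (SD / Mean) * 100, expressed as a percentage.
    return sd / mean * 100

cv_regular    = coeff_of_variation(10.2, 59.9)   # assumed mean 59.9 -> about 17.03
cv_randomized = coeff_of_variation(12.7, 44.8)   # assumed mean 44.8 -> about 28.35
```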

The coefficient of variation can also be used to compare variability between

different measures. For example, you can compare IQ scores to scores on the

Woodcock-Johnson III Tests of Cognitive Abilities.

Why Sample?

• Pool of possible cases is too large (e.g., 260 million Americans) -- would

cost too much and take too long

• Don't want to use up the cases: e.g., when testing light bulbs to see how long

they last, you take a bulb and leave it on until it burns out. You can't test all

the bulbs this way, because their whole objective is to sell the bulbs, not

burn them out.


• It's not necessary to survey all cases: for most purposes, taking a sample

yields estimates that are accurate enough.

• The trade-off is that sampling does introduce some error. You didn't

interview everybody, so certain opinions or combinations of opinions won't

be represented in your data. When the population is very diverse, your

sample can't include all the possible combinations of attributes that are

found in the population, such as blacks and whites, men and women, cardiac

patients and non-patients, black women, white men, white women with heart

trouble who like Oprah and don't like Ally McBeal, etc.

• Population is the universe of cases. It is the group that you ultimately want

to say something about. For example, if you want to report 'what Americans

think about Clinton', then the population is all Americans.

• Elements are the individual cases in the population (usually, persons)

• Sampling ratio is size of sample divided by size of population. Contrary to

popular belief, a large sampling ratio is not crucial.

• Sampling frame is a specific list of names from which sample elements will

be chosen. The Literary Digest poll in 1936 used a sample of 10 million,

drawn from government lists of automobile and telephone owners. Predicted

Alf Landon would beat Franklin Roosevelt by a wide margin. But instead

Roosevelt won by a landslide. The reason was that the sampling frame did

not match the population. Only the rich owned automobiles and telephones,

and they were the ones who favored Landon.

• Replacement. Sampling with replacement means that after you draw a name

out of the hat and record it, you put the name back and it can be chosen

again. Sampling without replacement means that once you draw the name

out, it is not available to be chosen again.

• Bias. Systematic errors produced by your sampling procedure. For example,

if you sample people and ask them whether they watch Ally McBeal, but the

percentage always comes out too high (maybe because you are interviewing

your friends and your whole group really likes Ally McBeal)


Non-Probability Sampling

Haphazard/Convenience

• Whoever happens to walk by your office; who's on the street when the

camera crews come out

• If you have a choice, don't use this method. Often produces really wrong

answers, because certain attributes tend to cluster with certain geographic

and temporal variables. For example, at 8am in NYC, most of the people on

the street are workers heading for their jobs. At 10am, there are many more

people who don't work, and the proportion of women is much higher. At

midnight, there are young people and muggers.

Quota

• Is an improvement, but still has problems. How do you know which

categories are key? How many do you get of each category?

Purposive/Judgement

• Good for exploratory, qualitative work, and for pre-testing a questionnaire.

Snowball

• Begin with a few respondents, then ask them to name others who can be interviewed

• Useful for studying invisible/illegal populations, such as drug addicts

Probability Sampling

• Sampling methods in which the probability of selection of each individual is the same (or at least known, so it can be readjusted


mathematically). These are also called random sampling. They require more work,

but are much more accurate. They also allow the researcher to calculate the amount

of error she can expect, and this is really important.

Simple Random

• Develop a sampling frame, then randomly select elements (place all names

on cards, then randomly draw cards from hat; in Excel, there is a function

for attaching a random number to each cell, then sort and take N largest)

• Typically use sampling without replacement, but with replacement can be

done (and is easier mathematically)

• Any one sample is likely to yield statistics (such as the average income or

the percentage of respondents that watch Ally McBeal) that are different

from the population parameters

• The average statistic from many random samples should equal the

population parameter. In other words, if you took 150 different samples of

Americans, each of 300 people, and calculated the percentage that like Ally

McBeal in each of the samples, then averaged all those percentages together,

that should equal the "real" percentage of all Americans that like Ally

McBeal

• It is the Central Limit Theory that guarantees that as the number of random

samples increases, the average of those samples converges on the population

parameter

• Because of these mathematical guarantees, we can estimate how far off a

sample might be from the population, giving rise to confidence intervals

• Random samples are unbiased and, on average, representative of the

population.
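A simple random sample without replacement can be sketched as follows (the frame of 500 IDs is hypothetical):

```python
import random

frame = list(range(1, 501))          # hypothetical sampling frame of 500 IDs

random.seed(42)                      # reproducible draw
sample = random.sample(frame, 30)    # sampling without replacement

# No element can be chosen twice when sampling without replacement.
assert len(sample) == len(set(sample))
```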

Example. A company is considering instituting a program to deal with employee drug-taking. To find out, they will test

a sample of employees on an anonymous basis: if a person tests positive, the

company will not know who it is and will not try to find out. The objective is

solely to estimate what percentage of the company might be doing drugs. If the


percentage is high enough, the company will consider instituting a mandatory drug

testing program. Given this objective, a simple random sampling design is perfect:

the results will generalize to the whole company.

Stratified Sampling

• Used to guarantee that key subgroups of the population are represented in the sample as accurately as possible

• Procedure is this: Divide the population into strata (mutually exclusive

classes), such as men and women. Then randomly sample within strata.

• Suppose a population is 51% male and 49% female. To get a sample of 100

people, we randomly choose 51 males (from the population of all males)

and, separately, choose 49 females. Our sample is then guaranteed to have

exactly the correct proportion of sexes.

• This avoids problem of random sampling that the proportions could be 50-

50, 48-52, etc.

• Especially important when one group is so small (say, 3% of the population)

that a random sample might miss them entirely.

Example. A human-resources manager is creating a stress-management program for employees. To get an idea of what kinds

of needs the program would have to fill, she will interview a sample of 50

employees first. If she does a simple random sample, it's possible that her sample

will not include any representatives of some of the smaller departments, just by

chance. Since she knows that different kinds of jobs within the company produce

different kinds of stress, she wants to get separate samples from the workmen (who

handle dangerous chemicals), the foremen (who balance the interests of the

workmen with management), and the managers (who are responsible to

shareholders). So she uses a stratified random sample.
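Stratified sampling with proportional allocation, as in the 51%/49% example above, can be sketched with hypothetical member lists:

```python
import random

# Hypothetical population: 510 males and 490 females (51% / 49%).
population = {
    "male":   ["M%d" % i for i in range(510)],
    "female": ["F%d" % i for i in range(490)],
}
total = sum(len(v) for v in population.values())

random.seed(0)
sample = []
for stratum, members in population.items():
    k = round(len(members) / total * 100)   # proportional share of a 100-person sample
    sample.extend(random.sample(members, k))
```

The sample is then guaranteed to contain exactly 51 males and 49 females, which a simple random sample cannot promise.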

Cluster Sampling

• Used when (a) sampling frame not available or too expensive, and (b) cost

of reaching an individual element is too high


• For example, consider surveying auto mechanics across the US. Even if a sampling frame existed or you could

construct it, it would cost too much money to reach randomly selected

mechanics across the entire US: you would need an unbelievable travel

budget

• In cluster sampling, first define large clusters of people. These clusters

should have a lot of heterogeneity within, but be fairly similar to other clusters.

For example, cities make good clusters.

• Then sample among the clusters. Then once you have chosen the clusters,

randomly sample within the clusters.

• Clusters might be cities. Once you've chosen the cities, might be able to get

a reasonably accurate list of all the mechanics in each of those cities. Is also

much less expensive to fly to just 10 cities instead of 2000 cities.

• Cluster sampling is less expensive than other methods, but less accurate.

o Each stage introduces its own sampling error.

• Suppose you want to sample college students. You start by sampling 300

colleges. Then choose 10 students from each college. Problem is, if the

colleges are of different size, the probability of a person being chosen if they

are from a big college is smaller than for a small college. So need to choose

a proportion of students, not a fixed number. Or don't choose colleges with

equal probability (let the big schools be more likely to be in the sample).

This is called PPS, Probability Proportionate to Size sampling

Example. Once a quarter, a large retail chain sends auditors to randomly chosen

stores to check that proper procedures are being carried out. They look at the

physical layout, the interactions between staff and customers, backroom

procedures, and so on. A simple random sample could have an auditor visiting a

California store one day, a New York store the next, then another California store, and

so on. Using cluster sampling, the auditor might first select a random sample of

states, then visit a random sample of stores within each state, thus reducing travel

time.

Sample Size


• The bigger the better, up to 2500. Beyond 2500, it doesn't really matter

(accuracy increases very slowly after this point)

• The smaller the population, the bigger the sampling ratio that is needed.

• For populations under 1000, you need sampling ratio of 30% (300 elements)

to be really accurate.

• For populations of about 10,000 need sampling ratio of about 10%

• This lesson will show the difference between sampling and nonsampling

errors. Using a sample in order to get information about a population is often

better than conducting a census for many reasons.

• Sampling is less costly and it can be done more quickly than a census which

requires data for the entire population.

• Different samples selected from the same population will give different results because these samples contain different

elements. Because of this discrepancy, we say that there is a sampling error.

• Sampling error is the difference between the value of a sample statistic and

the value of the population parameter.

• Suppose, we need to find the sampling error for the mean. Suppose also

there is no nonsampling error which we define below.

• Sampling error = x̄ – μ

• For example, in the lesson about sampling distribution, the 5 scores below

are for the entire population and μ = 86.4

• 80 85 85 90 92


Assume that the sample scores are 85, 90, and 92. Then x̄ = (85 + 90 + 92)/3 = 89.

• Sampling error = x̄ − μ = 89 − 86.4 = 2.6

• The mean score estimated from the sample is 2.6 higher than the mean score

of the population.
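The sampling-error arithmetic above as a sketch:

```python
population = [80, 85, 85, 90, 92]
mu = sum(population) / len(population)    # 86.4

sample = [85, 90, 92]
x_bar = sum(sample) / len(sample)         # 89.0

sampling_error = x_bar - mu               # 2.6
```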

• Any collection errors, recording errors, and/or tabulation errors are called

nonsampling errors.

• These errors are the result of human mistakes.

• Suppose that when recording the sample above, the score 92 was mistakenly written as

91. Then x̄ = (85 + 90 + 91)/3 = 88.67, and x̄ − μ = 88.67 − 86.4 = 2.26.

• 2.26 does not really represent the sampling error, since we already calculated

it as 2.6.

• The difference between 2.6 and 2.26, i.e. 0.34, is the nonsampling error,

because it occurred as a result of a human mistake.


Reliability of samples

Sampling is the selection of units of inquiry for the collection of data and should be done in a scientific manner.

Therefore it is important to know how closely the measures based on sample

represent the parameters and how much variation one may expect if other samples

are analysed.The measures of reliability are concerned only with fluctuations due

to random sampling and they have nothing to do with observational and

computational errors. Whenever a measure of reliability is computed it is

understood that the sample is adequate and has been selected according to a

rigorously scientific procedure.

According to large sample theory the reliability of a measure such as the arithmetic

mean depends upon the number of cases in the sample and the variability of the

values in the sample. The reliability of a measure is related to the size of the

sample. The degree of variability of cases in a sample also has an important

influence on the reliability of the measures computed from the sample. If the cases

in the sample show a pronounced scatter, a greater chance of fluctuation in the

measures would naturally be expected.

Brief explanation of the Central limit theorem

The central limit theorem is a statistical theory which states that, given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population; furthermore, as the sample size increases, the distribution of the sample means approaches a normal distribution.
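A small Python simulation illustrates the theorem (a sketch; the uniform population of the integers 1 to 100 is an arbitrary choice, not part of the example above):

```python
import random
import statistics

random.seed(42)

# A non-normal (uniform) population with finite variance
population = list(range(1, 101))
mu = statistics.mean(population)          # population mean = 50.5

# Draw many samples of size n and record each sample mean
n, trials = 30, 2000
sample_means = [statistics.mean(random.choices(population, k=n))
                for _ in range(trials)]

# The mean of the sample means is close to the population mean,
# and the sample means are far less spread out than the population.
print(mu, round(statistics.mean(sample_means), 2))
```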

Unit-2



The probability of an event happening can easily be found using the definition of probability. But the definition alone cannot be used to find the probability of both of the given events happening. A theorem known as the “Multiplication theorem” solves these types of problems. The statement and proof of the “Multiplication theorem”, and its usage in various cases, are as follows.

If A and B are any two events of a sample space such that P(A) ≠0 and P(B)≠0,

then

P(A∩B) = P(A) * P(B|A) = P(B) *P(A|B).


INDEPENDENT EVENTS:

Two events A and B are said to be independent if there is no change in the

happening of an event with the happening of the other event.

i.e. Two events A and B are said to be independent if

P(A|B) = P(A) where P(B)≠0.

Equivalently, A and B are independent if P(A∩B) = P(A) * P(B).

Example:

When drawing a card from a pack of cards, let A be the event of drawing a diamond and B

be the event of drawing an ace.
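The independence of these two events can be verified by enumerating the deck (a sketch; the rank and suit labels are illustrative):

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = list(product(ranks, suits))      # 52 equally likely cards

p_diamond = sum(1 for r, s in deck if s == "diamonds") / 52   # 13/52 = 1/4
p_ace = sum(1 for r, s in deck if r == "A") / 52              # 4/52 = 1/13
p_ace_of_diamonds = 1 / 52

# A and B are independent: P(A∩B) equals P(A) * P(B)
print(abs(p_ace_of_diamonds - p_diamond * p_ace) < 1e-12)
```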


Note:

(1) If 3 events A, B and C are independent, then

P(A∩B∩C) = P(A)*P(B)*P(C).


The probability of an event happening can easily be found using the definition of probability. But the definition alone cannot be used to find the probability that at least one of the given events happens. A theorem known as the “Addition theorem” solves these types of problems. The statement and proof of the “Addition theorem”, and its usage in various cases, are as follows.

Two or more events are said to be mutually exclusive if they don’t have any

element in common, i.e. if the occurrence of one of the events prevents the

occurrence of the others then those events are said to be mutually exclusive.

Example:

The event of getting 2 heads, A and the event of getting 2 tails, B when two

coins are tossed are mutually exclusive.

Because A = {HH}; B = {TT}.


Two or more events are said to be mutually exhaustive if at least one of those events must occur, i.e. one of those events will definitely happen.

If A and B are two mutually exhaustive events then the probability of their union is 1,

i.e. P(AUB)=1.

Example:

The event of getting a head and the event of getting a tail when a coin is tossed

are mutually exhaustive.

If A and B are any two events then the probability of happening of at least one of

the events is defined as P(AUB) = P(A) + P(B)- P(A∩B).

Proof:

Since events are nothing but sets,

n(AUB) = n(A) + n(B) – n(A∩B). Dividing both sides by n(S),

n(AUB)/ n(S) = n(A)/ n(S) + n(B)/ n(S) – n(A∩B)/ n(S)

i.e. P(AUB) = P(A) + P(B) – P(A∩B).


Example:

If the probability of solving a problem by two students George and James are 1/2

and 1/3 respectively then what is the probability of the problem to be solved.

Solution:

Let A and B be the events that the problem is solved by George and James

respectively.

Then P(A)=1/2 and P(B)=1/3.

Assuming the two students work independently, P(A∩B) = P(A) * P(B), so P(AUB) = 1/2 + 1/3 – 1/2 * 1/3 = 1/2 + 1/3 – 1/6 = (3+2-1)/6 = 4/6 = 2/3

Note:

If A and B are any two mutually exclusive events then P(A∩B)=0.

Then P(AUB) = P(A)+P(B).
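The worked example above can be checked with exact fractions (a sketch; independence of the two students is assumed, as in the solution):

```python
from fractions import Fraction

# P(A): George solves it; P(B): James solves it (assumed independent)
p_a = Fraction(1, 2)
p_b = Fraction(1, 3)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B),
# with P(A and B) = P(A) * P(B) by independence.
p_solved = p_a + p_b - p_a * p_b
print(p_solved)   # 2/3
```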


Conditional Probability

The conditional probability of an event B is the probability that the event will

occur given the knowledge that an event A has already occurred. This probability is

written P(B|A), notation for the probability of B given A. In the case where

events A and B are independent (where event A has no effect on the probability of

event B), the conditional probability of event B given event A is simply the

probability of event B, that is P(B).

If events A and B are not independent, then the probability of the intersection

of A and B (the probability that both events occur) is defined by

P(A and B) = P(A)P(B|A).

From this definition, the conditional probability P(B|A) is obtained by dividing by P(A): P(B|A) = P(A and B) / P(A).

Examples

In a card game, suppose a player needs to draw two cards of the same suit in order

to win. Of the 52 cards, there are 13 cards in each suit. Suppose first the player

draws a heart. Now the player wishes to draw a second heart. Since one heart has

already been chosen, there are now 12 hearts remaining in a deck of 51 cards. So

the conditional probability P(Draw second heart|First card a heart) = 12/51.
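The card computation can be verified numerically (a sketch using the counts stated above):

```python
from fractions import Fraction

# 52-card deck with 13 hearts; after one heart is drawn,
# 12 hearts remain among 51 cards.
p_first_heart = Fraction(13, 52)
p_both_hearts = Fraction(13, 52) * Fraction(12, 51)   # P(A and B) = P(A) * P(B|A)

# Recover the conditional probability by dividing: P(B|A) = P(A and B) / P(A)
p_second_given_first = p_both_hearts / p_first_heart
print(p_second_given_first)   # 4/17, i.e. 12/51
```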


Suppose a college applicant has determined that he has a 0.80 probability of being accepted, and he knows that dormitory housing will only be provided for

60% of all of the accepted students. The chance of the student being

accepted and receiving dormitory housing is defined by

P(Accepted and Dormitory Housing) = P(Dormitory

Housing|Accepted)P(Accepted) = (0.60)*(0.80) = 0.48.

To calculate the probability of the intersection of more than two events, the

conditional probabilities of all of the preceding events must be considered. In

the case of three events, A, B, and C, the probability of the intersection P(A

and B and C) = P(A)P(B|A)P(C|A and B).

Consider the college applicant who has determined that he has 0.80 probability of

acceptance and that only 60% of the accepted students will receive dormitory

housing. Of the accepted students who receive dormitory housing, 80% will have

at least one roommate. The probability of being accepted and receiving dormitory

housing and having no roommates is calculated by:

P(Accepted and Dormitory Housing and No Roommates) =

P(Accepted)P(Dormitory Housing|Accepted)P(No Roomates|Dormitory Housing

and Accepted) = (0.80)*(0.60)*(0.20) = 0.096. The student has about a 10% chance

of receiving a single room at the college.
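The three-event chain-rule computation above can be sketched as:

```python
# Chain rule for three events, with the values from the example:
p_accepted = 0.80
p_housing_given_accepted = 0.60
p_no_roommate_given_both = 0.20   # 80% have at least one roommate

# P(A and B and C) = P(A) * P(B|A) * P(C|A and B)
p_all_three = p_accepted * p_housing_given_accepted * p_no_roommate_given_both
print(round(p_all_three, 3))   # 0.096
```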

When P(B|A) is known, the reverse conditional probability P(A|B) can be computed by Bayes's formula. The formula is based on the expression P(B) = P(B|A)P(A) +

P(B|Ac)P(Ac), which simply states that the probability of event B is the sum of the

conditional probabilities of event B given that event A has or has not occurred. For

independent events A and B, this is equal to P(B)P(A) + P(B)P(Ac) = P(B)(P(A) +

P(Ac)) = P(B)(1) = P(B), since the probability of an event and its complement

must always sum to 1. Bayes's formula is defined as follows:

P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|Ac)P(Ac))


Example

Suppose a voter poll is taken in three states. In state A, 50% of voters support the

liberal candidate, in state B, 60% of the voters support the liberal candidate, and in

state C, 35% of the voters support the liberal candidate. Of the total population of

the three states, 40% live in state A, 25% live in state B, and 35% live in state C.

Given that a voter supports the liberal candidate, what is the probability that she

lives in state B?

By Bayes's formula,

P(Voter lives in state B|Voter supports lib. cand.) = P(Voter supports lib. cand.|Voter lives in state B)P(Voter lives in state B)/

(P(Voter supports lib. cand.|Voter lives in state A)P(Voter lives in state A)

+

P(Voter supports lib. cand.|Voter lives in state B)P(Voter lives in state B)

+

P(Voter supports lib. cand.|Voter lives in state C)P(Voter lives in state C))

= (0.60)*(0.25)/((0.50)*(0.40) + (0.60)*(0.25) + (0.35)*(0.35))

= (0.15)/(0.20 + 0.15 + 0.1225) = 0.15/0.4725 = 0.3175.
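The voter-poll computation can be reproduced directly (a sketch; the dictionary layout is an arbitrary choice):

```python
# Prior: where a randomly chosen voter lives
p_state = {"A": 0.40, "B": 0.25, "C": 0.35}
# Likelihood: support for the liberal candidate in each state
p_support_given_state = {"A": 0.50, "B": 0.60, "C": 0.35}

# Total probability of support (the denominator of Bayes's formula)
p_support = sum(p_state[s] * p_support_given_state[s] for s in p_state)

# Bayes's formula: P(lives in B | supports liberal candidate)
p_b_given_support = p_state["B"] * p_support_given_state["B"] / p_support
print(round(p_b_given_support, 4))   # approximately 0.3175
```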

Independent Events


LO 6.7: Determine whether two events are independent or dependent and justify

your conclusion.

Two events are independent if the occurrence of one does not affect the chance that the other occurs (below we use probability notation to define this more precisely).

Independent Events:

• Two events A and B are said to be independent if the fact that one event has

occurred does not affect the probability that the other event will occur.

• If whether or not one event occurs does affect the probability that the other

event will occur, then the two events are said to be dependent.

Here are a few examples:

EXAMPLE:

A woman’s pocket contains two quarters and two nickels.

She randomly extracts one of the coins and, after looking at it, replaces it before

picking a second coin.

Let Q1 be the event that the first coin is a quarter and Q2 be the event that the

second coin is a quarter.

• Q1 and Q2 are independent events. Why?

Since the first coin that was selected is replaced, whether or not Q1 occurred (i.e.,

whether the first coin was a quarter) has no effect on the probability that the second

coin will be a quarter, P(Q2).

In either case (whether Q1 occurred or not), when she is selecting the second coin,

she has in her pocket: two quarters and two nickels, so in either case P(Q2) = 2/4 = 1/2.


EXAMPLE:

A woman’s pocket contains two quarters and two nickels.

She randomly extracts one of the coins, and without placing it back into her

pocket, she picks a second coin.

As before, let Q1 be the event that the first coin is a quarter, and Q2 be the event

that the second coin is a quarter.

• Q1 and Q2 are not independent. They are dependent. Why?

Since the first coin that was selected is not replaced, whether Q1 occurred (i.e.,

whether the first coin was a quarter) does affect the probability that the second

coin is a quarter, P(Q2).

If Q1 occurred (i.e., the first coin was a quarter), then when the woman is

selecting the second coin, she has in her pocket: one quarter and two nickels, so P(Q2) = 1/3.

However, if Q1 has not occurred (i.e., the first coin was not a quarter, but a

nickel), then when the woman is selecting the second coin, she has in her pocket: two quarters and one nickel, so P(Q2) = 2/3.
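The without-replacement case can be checked by enumerating every ordered pair of draws (a sketch; "Q" marks a quarter, "N" a nickel):

```python
from itertools import permutations

coins = ["Q", "Q", "N", "N"]          # two quarters, two nickels

# Without replacement: all ordered ways to draw two different coins
draws = list(permutations(coins, 2))  # 12 equally likely ordered pairs

p_q2 = sum(1 for a, b in draws if b == "Q") / len(draws)
q1_draws = [(a, b) for a, b in draws if a == "Q"]
p_q2_given_q1 = sum(1 for a, b in q1_draws if b == "Q") / len(q1_draws)

# P(Q2) = 1/2 but P(Q2 | Q1) = 1/3, so Q1 and Q2 are dependent
print(p_q2, round(p_q2_given_q1, 4))
```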


In these last two examples, we could actually have done some calculation in order to check whether or not the two events are independent.

Sometimes we can just use common sense to guide us as to whether two events are

independent. Here is an example.

EXAMPLE:

Two people are selected simultaneously and at random from all people in the

United States.

Let B1 be the event that one of the people has blue eyes and B2 be the event that

the other person has blue eyes.

In this case, since they were chosen at random, whether one of them has blue eyes

has no effect on the likelihood that the other one has blue eyes, and therefore B1

and B2 are independent.

On the other hand …

EXAMPLE:

A family has 4 children, two of whom are selected at random.

Let B1 be the event that one child has blue eyes, and B2 be the event that the other

chosen child has blue eyes.

In this case, B1 and B2 are not independent, since we know that eye color is

hereditary.

Thus, whether or not one child is blue-eyed will increase or decrease the chances

that the other child has blue eyes, respectively.

Comments:

• It is quite common for students to initially get confused about the distinction

between the idea of disjoint events and the idea of independent events. The


purpose of this comment (and the activity that follows it) is to help students

develop more understanding about these very different ideas.

The idea of disjoint events is about whether or not it is possible for the events to

occur at the same time (see the examples on the page for Basic Probability Rules).

The idea of independent events is about whether or not the events affect each

other in the sense that the occurrence of one event affects the probability of the

occurrence of the other (see the examples above).

The following activity deals with the distinction between these concepts.

The purpose of this activity is to help you strengthen your understanding about the

concepts of disjoint events and independent events, and the distinction between

them.


Why did we leave out the case when the events are disjoint and independent?


If events are disjoint then they must be not independent, i.e. they must be

dependent events.


Why is that?

• Recall: If A and B are disjoint then they cannot happen together.

• In other words, A and B being disjoint events implies that if event A occurs

then B does not occur and vice versa.

• Well… if that’s the case, knowing that event A has occurred dramatically

changes the likelihood that event B occurs – that likelihood is zero.

• This implies that A and B are not independent.

Now that we understand the idea of independent events, we can finally get to rules

for finding P(A and B) in the special case in which the events A and B are

independent.

Later we will present a more general version for use when the events are not

necessarily independent.

LO 6.8: Apply the multiplication rule for independent events to calculate P(A and

B) for independent events.

We begin with the multiplication rule for independent events.

Using a Venn diagram, we can visualize “A and B,” which is represented by the

overlap between events A and B:


• If A and B are two INDEPENDENT events, then P(A and B) = P(A) * P(B).

Comment:

• When dealing with probability rules, the word “and” will always be associated

with the operation of multiplication; hence the name of this rule, “The

Multiplication Rule.”


EXAMPLE:

Recall the blood type example:

Two people are selected simultaneously and at random from all people in the

United States.

• Let O1= “person 1 has blood type O” and

• O2= “person 2 has blood type O”


Since they were chosen simultaneously and at random, the blood type of one has

no effect on the blood type of the other. Therefore, O1 and O2 are independent,

and we may apply Rule 6: P(O1 and O2) = P(O1) * P(O2).


Comments:

• We now have an Addition Rule that says

P(A or B) = P(A) + P(B) for disjoint events,

and a Multiplication Rule that says P(A and B) = P(A) * P(B) for independent events.

The purpose of this comment is to point out the magnitude of P(A or B) and of P(A

and B) relative to either one of the individual probabilities.

Since probabilities are never negative, the probability of one event or another is

always at least as large as either of the individual probabilities.

Since probabilities are never more than 1, the probability of one event and another

generally involves multiplying numbers that are less than 1, and therefore can never be

more than either of the individual probabilities.

Here is an example:

EXAMPLE:

Consider the event A that a randomly chosen person has blood type A.

Modify it to a more general event — that a randomly chosen person has blood type

A or B — and the probability increases.

Modify it to a more specific (or restrictive) event — that not just one randomly

chosen person has blood type A, but that out of two simultaneously randomly


chosen people, person 1 will have type A and person 2 will have type B — and the

probability decreases.

• The word “and” is associated in our minds with “adding more stuff.” Therefore,

some students incorrectly think that P(A and B) should be larger than either one

of the individual probabilities, while it is actually smaller, since it is a more

specific (restrictive) event.

• Also, the word “or” is associated in our minds with “having to choose between”

or “losing something,” and therefore some students incorrectly think that P(A

or B) should be smaller than either one of the individual probabilities, while it is

actually larger, since it is a more general event.

Practically, you can use this comment to check yourself when solving problems.

For example, if you solve a problem that involves “or,” and the resulting

probability is smaller than either one of the individual probabilities, then you know

you have made a mistake somewhere.

Comment:

• Probability rule six can be used as a test to see if two events are independent or

not.

• If you can easily find P(A), P(B), and P(A and B) using logic or are provided

these values, then we can test for independent events using the multiplication

rule for independent events:

IF P(A)*P(B) = P(A and B) THEN A and B are independent events,

otherwise, they are dependent events.
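This independence test can be written as a small helper (a sketch; the function name and the two illustrative inputs are my own, but the inputs restate the card and two-coin examples from this section):

```python
def are_independent(p_a, p_b, p_a_and_b, tol=1e-9):
    """Multiplication-rule test: A and B are independent iff P(A and B) = P(A)*P(B)."""
    return abs(p_a * p_b - p_a_and_b) < tol

# One card: A = "diamond" (13/52), B = "ace" (4/52), A and B = 1/52 -> independent
print(are_independent(13 / 52, 4 / 52, 1 / 52))
# Two coin tosses: A = "two heads" (1/4) and B = "two tails" (1/4) are disjoint,
# so P(A and B) = 0 -> dependent
print(are_independent(1 / 4, 1 / 4, 0))
```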

As you’ve seen, the last three rules that we’ve introduced (the Complement Rule,

the Addition Rules, and the Multiplication Rule for Independent Events) are

frequently used in solving problems.


Before we move on to our next rule, here are two comments that will help you use

these rules in broader types of problems and more effectively.

Comment:

• As we mentioned before, the Addition Rule for Disjoint events (rule four) can

be extended to more than two disjoint events.

• Likewise, the Multiplication Rule for independent events (rule six) can be

extended to more than two independent events.

• So if A, B and C are three independent events, for example, then P(A and B and

C) = P(A) * P(B) * P(C).

• These extensions are quite straightforward, as long as you remember that “or”

requires us to add, while “and” requires us to multiply.

EXAMPLE:

Three people are chosen simultaneously and at random.

We’ll use the usual notation of B1, B2 and B3 for the events that persons 1, 2 and

3 have blood type B, respectively.

We need to find P(B1 and B2 and B3). Let’s solve this one together:

EXAMPLE:

A fair coin is tossed 10 times. Which of the following two outcomes is more

likely?

(a) HHHHHHHHHH


(b) HTTHHTHTTH

In fact, they are equally likely. The 10 tosses are independent, so we’ll use the

Multiplication Rule for Independent Events:

• P(HHHHHHHHHH) = P(H) * P(H) * … * P(H) = 1/2 * 1/2 * … * 1/2 = (1/2)^10

• P(HTTHHTHTTH) = P(H) * P(T) * … * P(H) = 1/2 * 1/2 * … * 1/2 = (1/2)^10

Here is the idea:

• There are actually 1,024 possible outcomes to this experiment, all of which are

equally likely.

Therefore,

• while it is true that it is more likely to get an outcome that has 5 heads and 5

tails than an outcome that has only heads

since there is only one possible outcome which gives all heads

and many possible outcomes which give 5 heads and 5 tails

• if we are comparing 2 specific outcomes, as we do here, they are equally likely.
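The counting argument above can be sketched numerically:

```python
from math import comb

# Any one specific sequence of 10 independent fair-coin tosses
p_sequence = (1 / 2) ** 10
n_outcomes = 2 ** 10                  # 1,024 equally likely outcomes

print(p_sequence)                     # 1/1024 for each specific sequence
print(comb(10, 5))                    # 252 outcomes have exactly 5 heads
print(comb(10, 10))                   # only 1 outcome is all heads
```

So 5 heads and 5 tails is far more likely as a *category* (252/1024) than all heads (1/1024), yet each individual sequence has the same probability.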

IMPORTANT Comments:

• Only use the multiplication rule for independent events, rule six, which says

P(A and B) = P(A)P(B) if you are certain the two events are independent.

o Probability rule six is ONLY true for independent events.

• When finding P(A or B) using the general addition rule: P(A) + P(B) – P(A

and B),

o do NOT use the multiplication rule for independent events to calculate P(A

and B), use only logic and counting.


Bayes’ theorem

Bayes’ theorem is a way to figure out conditional probability. Conditional

probability is the probability of an event happening, given that it has some

relationship to one or more other events. For example, your probability of getting a

parking space is connected to the time of day you park, where you park, and what

conventions are going on at any time. Bayes’ theorem is slightly more nuanced. In

a nutshell, it gives you the actual probability of an event given information

about tests.

• “Events” are different from “tests.” For example, there is a test for liver

disease, but that’s separate from the event of actually having liver disease.

• Tests are flawed: just because you have a positive test does not mean you

actually have the disease. Many tests have a high false positive rate. Rare

events tend to have higher false positive rates than more common events.

We’re not just talking about medical tests here. For example, spam filtering can

have high false positive rates. Bayes’ theorem takes the test results and

calculates your real probability that the test has identified the event.

Bayes’ Theorem (also known as Bayes’ rule) is a deceptively simple formula used

to calculate conditional probability. The Theorem was named after English

mathematician Thomas Bayes (1701-1761). The formal definition for the rule is:

P(A|B) = P(B|A) * P(A) / P(B)

In most cases, you can’t just plug numbers into an equation; You have to figure out

what your “tests” and “events” are first. For two events, A and B, Bayes’ theorem

allows you to figure out p(A|B) (the probability that event A happened, given that

test B was positive) from p(B|A) (the probability that test B happened, given that

event A happened). It can be a little tricky to wrap your head around as technically

you’re working backwards; you may have to switch your tests and events around,


which can get confusing. An example should clarify what I mean by “switch the

tests and events around.”

Bayes’ Theorem Example #1

You might be interested in finding out a patient’s probability of having liver

disease if they are an alcoholic. “Being an alcoholic” is the test (kind of like a

litmus test) for liver disease.

• A could mean the event “Patient has liver disease.” Past data tells you that 10%

of patients entering your clinic have liver disease. P(A) = 0.10.

• B could mean the litmus test that “Patient is an alcoholic.” Five percent of the

clinic’s patients are alcoholics. P(B) = 0.05.

• You might also know that among those patients diagnosed with liver disease,

7% are alcoholics. This is your B|A: the probability that a patient is alcoholic,

given that they have liver disease, is 7%.

Bayes’ theorem tells you:

P(A|B) = (0.07 * 0.1)/0.05 = 0.14

In other words, if the patient is an alcoholic, the chance of having liver disease is

0.14 (14%). This is a large increase from the 10% suggested by past data. But it’s

still unlikely that any particular patient has liver disease.

More Bayes’ Theorem Examples

Bayes’ Theorem Problems Example #2

Another way to look at the theorem is to say that one event follows another. Above

I said “tests” and “events”, but it’s also legitimate to think of it as the “first event”

that leads to the “second event.” There’s no one right way to do this: use the

terminology that makes most sense to you.

In a particular pain clinic, 10% of patients are prescribed narcotic pain killers.

Overall, five percent of the clinic’s patients are addicted to narcotics (including

pain killers and illegal substances). Out of all the people prescribed pain pills, 8%

are addicts. If a patient is an addict, what is the probability that they will be

prescribed pain pills?

Step 1: Figure out what your event “A” is from the question. That information

is in the italicized part of this particular question. The event that happens first (A)

is being prescribed pain pills. That’s given as 10%.


Step 2: Figure out what your event “B” is from the question. That information

is also in the italicized part of this particular question. Event B is being an addict.

That’s given as 5%.

Step 3: Figure out what the probability of event B (Step 2) given event A (Step

1). In other words, find what (B|A) is. We want to know “Given that people are

prescribed pain pills, what’s the probability they are an addict?” That is given in

the question as 8%, or 0.08.

Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.

P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16

The probability of an addict being prescribed pain pills is 0.16 (16%).
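Both worked Bayes problems follow the same formula, so they can be checked with one small function (a sketch; the function name is my own):

```python
def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Example 1: liver disease (A) given alcoholic (B)
print(round(bayes(0.07, 0.10, 0.05), 2))   # 0.14
# Example 2: prescribed pain pills (A) given addict (B)
print(round(bayes(0.08, 0.10, 0.05), 2))   # 0.16
```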

Expected Values

The expected value of a random variable is also known as the expectation, mathematical expectation, EV, average, mean value, mean, or first moment. More practically, the expected value of a discrete random variable is the probability-weighted average of all its possible values.
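A probability-weighted average is easy to compute directly (a sketch; the helper function and the fair-die example are my own illustration):

```python
from fractions import Fraction

def expected_value(values, probs):
    """Probability-weighted average of a discrete random variable's values."""
    assert sum(probs) == 1
    return sum(v * p for v, p in zip(values, probs))

# Expected value of one roll of a fair six-sided die
faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6
print(expected_value(faces, probs))   # 7/2, i.e. 3.5
```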

Contents:

1. What is a Binomial Distribution?

2. The Bernoulli Distribution

3. The Binomial Distribution Formula

4. Worked Examples

What is a Binomial Distribution?

A binomial distribution can be thought of as simply the probability of a

SUCCESS or FAILURE outcome in an experiment or survey that is repeated

multiple times. The binomial is a type of distribution that has two possible

outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only


two possible outcomes: heads or tails and taking a test could have two possible

outcomes: pass or fail.

The first variable in the binomial formula, n, stands for the number of times the

experiment runs. The second variable, p, represents the probability of one specific

outcome. For example, let’s suppose you wanted to know the probability of getting

a 1 on a die roll. if you were to roll a die 20 times, the probability of rolling a one

on any throw is 1/6. Roll twenty times and you have a binomial distribution of

(n=20, p=1/6). SUCCESS would be “roll a one” and FAILURE would be “roll

anything else.” If the outcome in question was the probability of the die landing on

an even number, the binomial distribution would then become (n=20, p=1/2).

That’s because your probability of throwing an even number is one half.

Criteria

Binomial distributions must also meet the following three criteria:

1. The number of observations or trials is fixed. In other words, you can only

figure out the probability of something happening if you do it a certain number

of times. This is common sense—if you toss a coin once, your probability of


getting a tails is 50%. If you toss a coin 20 times, your probability of getting at least

one tail is very, very close to 100%.

2. Each observation or trial is independent. In other words, none of your trials

have an effect on the probability of the next trial.

3. The probability of success (tails, heads, fail or pass) is exactly the same from

one trial to another.

b(x; n, P) = nCx * P^x * (1 – P)^(n – x)

Where:

b = binomial probability

x = total number of “successes” (pass or fail, heads or tails etc.)

P = probability of a success on an individual trial

n = number of trials

The formula can also be written in a slightly different way, because nCx = n!/(x!(n-x)!); this form of the binomial distribution formula uses factorials. “q” in this formula is just the probability of failure (subtract your probability of success from 1).

The binomial distribution formula can calculate the probability of success for

binomial distributions. Often you’ll be told to “plug in” the numbers to

the formula and calculate. This is easy to say, but not so easy to do—unless you

are very careful with order of operations, you won’t get the right answer. If you

have a Ti-83 or Ti-89, the calculator can do much of the work for you. If not,

here’s how to break down the problem into simple steps so you get the answer

right—every time.

Example 1

Q. A coin is tossed 10 times. What is the probability of getting exactly 6

heads?


The number of trials (n) is 10

The probability of success (“tossing a heads”) is 0.5 (so 1 – p = 0.5)

x=6

P(x=6) = 10C6 * 0.5^6 * 0.5^4 = 210 * 0.015625 * 0.0625 = 0.205078125
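This calculation can be confirmed in Python:

```python
from math import comb

# P(exactly 6 heads in 10 tosses of a fair coin)
n, x, p = 10, 6, 0.5
prob = comb(n, x) * p**x * (1 - p)**(n - x)   # 210 * 0.5^6 * 0.5^4
print(prob)   # 0.205078125
```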

Tip: You can use the combinations calculator to figure out the value for nCx.

How to Work a Binomial Distribution Formula: Example 2

80% of people who purchase pet insurance are women. If 9 pet insurance

owners are randomly selected, find the probability that exactly 6 are women.

Step 1: Identify ‘n’ from the problem. Using our sample question, n (the number

of randomly selected items) is 9.

Step 2: Identify ‘X’ from the problem. X (the number you are asked to find the

probability for) is 6.

Step 3: Work the first part of the formula. The first part of the formula is

n! / (n – X)! X!

Substitute your variables:

9! / ((9 – 6)! × 6!)

Which equals 84. Set this number aside for a moment.

Step 4: Find p and q. p is the probability of success and q is the probability of

failure. We are given p = 80%, or .8. So the probability of failure is 1 – .8 = .2

(20%).

Step 5: Work the second part of the formula.

p^X

= .8^6

= .262144

Set this number aside for a moment.

Step 6: Work the third part of the formula.


q^(n – X)

= .2^(9 – 6)

= .2^3

= .008

Step 7: Multiply your answers from steps 3, 5, and 6 together.

84 × .262144 × .008 = 0.176.

Example 3

60% of people who purchase sports cars are men. If 10 sports car owners are

randomly selected, find the probability that exactly 7 are men.

Step 1:: Identify ‘n’ and ‘X’ from the problem. Using our sample question, n (the

number of randomly selected items—in this case, sports car owners are randomly

selected) is 10, and X (the number you are asked to “find the probability” for) is

7.

Step 2: Figure out the first part of the formula, which is:

n! / ((n – X)! × X!)

Substituting the variables:

10! / ((10 – 7)! × 7!)

Which equals 120. Set this number aside for a moment.

Step 3: Find “p” the probability of success and “q” the probability of failure. We

are given p = 60%, or .6. Therefore, the probability of failure is 1 – .6 = .4 (40%).

Step 4: Work the next part of the formula.

p^X

= .6^7

= .0279936

Set this number aside while you work the third part of the formula.

Step 5: Work the third part of the formula.

q^(n – X)

= .4^(10 – 7)


= .4^3

= .064

Step 6: Multiply the three answers from steps 2, 4 and 5 together.

120 × 0.0279936 × 0.064 = 0.215.

Poisson Distribution

A Poisson distribution gives the probability of a number of events occurring in a given interval when the expected number of events is known and the events occur independently of one another. The symbol μ denotes the expected number of events that occur during the interval. The probability that there are exactly x occurrences in the interval can be determined by the formula

P(x) = (μ^x × e^(–μ)) / x!

where e is a constant equal to approximately 2.71828 (the base of the natural logarithm system), μ is the expected number of events that occur in the interval, x is the actual number of events that occur in the interval, and x! is the factorial of x.
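As a quick sketch (function name and example numbers are our own), the Poisson formula can be computed with Python's standard library:

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    # P(x occurrences) = (mu^x * e^(-mu)) / x!
    return mu ** x * exp(-mu) / factorial(x)

# e.g. if mu = 3 events are expected per interval,
# the probability of exactly 5 occurrences:
print(round(poisson_pmf(5, 3), 4))  # 0.1008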

Applications of Normal Distribution

The normal distribution will be used throughout the rest of the course. In this lecture, we will look at a few problems that illustrate what you can do with normal distributions. One of the variables that we know does follow a normal distribution is the height of people. For all these problems, we’re going to assume that women’s heights are normally distributed with a mean of 65 inches and a standard deviation of 3 inches. In the textbook’s notation, we can also state X ~ N(65, 3), with mean 65 and standard deviation 3.


1) What is the probability that a woman is between 64 and 69 inches tall (5’4” to 5’9”)? Put another way, what fraction of women’s heights are in this range? Using the notation of random variables, we would write this as P(64 < X < 69).

First, draw a horizontal axis and label it x, write the units (inches) below it, and

draw a normal pdf centered over the mean of 65 inches. Then mark and label 65 on

the axis, mark and label 64 to the left of it and 69 to the right of it, draw vertical

lines from the 64 and the 69 to the curve and shade the part between them, above

the x-axis, and under the curve:

If you are using GeoGebra, then you will immediately see that the software tells

you P(64 < X <69) =0.5393. If you are using the calculator, then you need to find

the normalcdf (not normalpdf) function. Enter the number on the left where the

shading begins, the number on the right where it ends, the mean of the distribution,

and its standard deviation, all separated by commas, normalcdf (64, 69, 65, 3), and

you will get 0.539347. Round this to the nearest ten-thousandth (four places after

the decimal point), or equivalently to the nearest hundredth of a percent, and you

come up with the correct answer: 0.5393, or 53.93%.
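If you have neither GeoGebra nor a TI calculator, the same area can be found with Python's standard library (a sketch; NormalDist is in the statistics module of Python 3.8+):

```python
from statistics import NormalDist

women = NormalDist(mu=65, sigma=3)    # heights: mean 65 in, sd 3 in
prob = women.cdf(69) - women.cdf(64)  # same area as normalcdf(64, 69, 65, 3)
print(round(prob, 4))  # 0.5393
```

The difference of the two CDF values is exactly the shaded area between 64 and 69 in the diagram.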

In the last lecture, we mentioned that in the old days, everyone had to learn how to look up a Z-table, the table that shows the relationship between area and Z-score for the standard normal. So how do GeoGebra and normalcdf do it? Well, it’s no


magic. The software simply converts any normal distribution to a standard normal, using the familiar Z-score relationship Z = (X – μ)/σ. The converted problem has the same area, just under a different scale:

It’s not necessary that you always convert all normal distributions to Z, but it’s

useful to recognize how it is handled by the software, since we will be doing the

same later in inferential statistics.

2) What is the probability that a woman is taller than 5 feet, 10 inches, or 70

inches? Put another way, what fraction of women are taller than 70 inches?

This would be written as P(X > 70).

Start the same way as in Problem 1, but you have to mark and label only one

number besides the mean, the 70. Then shade to the right of the 70, because that’s

where the taller heights are:


The complication using normalcdf is that there is no number on the right where the

shading ends, so put in a big one, and if you’re not sure if it’s big enough put in a

bigger one and see if it changes your answer, at least to the nearest ten-thousandth.

normalcdf ( 70, 1000, 65, 3)=0.04779, so the rounded answer is 0.0478, or 4.78%.
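In software there is no need for the artificial “big number” on the right: you can take the complement of the CDF directly. A Python sketch with the standard library:

```python
from statistics import NormalDist

women = NormalDist(mu=65, sigma=3)
prob = 1 - women.cdf(70)  # P(X > 70), no upper bound needed
print(round(prob, 4))  # 0.0478
```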

In the problems above, we found the probability that the random variable falls

within a certain range. Now we’re going to reverse the process. We’ll start with the

probability of a certain range, and then we’ll have to find the values of the random

variable that determine that range. I’ll call these values cut-offs. Sometimes they

are also called “inverse probability” problems.

In these three problems, we’ll use the same situation as above: Women’s heights

are normally distributed with a mean of 65 inches and a standard deviation of 3

inches.

1) How short does a woman have to be to be in the shortest 10% of women?

If we call this cut-off c, this could be written as finding c such that P(X < c)

= 0.10.


We’ll do the same kind of diagram as before, but this time we’ll label the known

probability, 10%, and we do this above the shaded area, definitely not on the x-

axis, because it’s an area, not a height. The hardest part of the diagram is deciding

which side of the mean to put the c on and which side of the c to shade.

You really have to think about it. In this case, since by definition 50% of women

are shorter than the mean, the cut-off for 10% has to be less than the mean.

The picture here shows how GeoGebra can be used to find the cut-off values:

instead of entering the cut-off values, you can enter 0.10 as the probability, and

GeoGebra will solve for the cut-off value (61.1553).

Using the calculator, you will need to resort to the invNorm function, followed by

the percent of data under the normal curve to the left of (always to the left of, no

matter which side of c the shading is on) the cut-off, then the mean and standard

deviation, separated by commas.


So in our example, we will do invNorm (0.10, 65, 3), which gives 61.1553, or, rounded to the nearest inch like the mean and standard deviation, 61 inches. So about 10% of women are shorter than 61 inches. You can check this using normalcdf, and you might as well use more digits of the cut-off than we rounded to, for greater assurance that your check shows you got the right answer. You get normalcdf (0, 61.1553, 65, 3), which comes to 0.0999997, or 10%.
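The invNorm computation corresponds to the inverse CDF (percentile) function. A Python sketch with the standard library:

```python
from statistics import NormalDist

women = NormalDist(mu=65, sigma=3)
c = women.inv_cdf(0.10)   # same as invNorm(0.10, 65, 3)
print(round(c, 4))        # 61.1553
check = women.cdf(c)      # verify: area to the left of the cut-off
print(round(check, 2))    # 0.1
```

The round trip through cdf is the same check the text performs with normalcdf.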

2) How tall does a woman have to be to be in the tallest fourth of women?

(What is the cut-off for the tallest 25% of women?) If we call this height c,

we want to find the value of c such that P(X > c) = 0.25. Here’s the diagram:

In GeoGebra it’s quite simple: you will just have to switch the left to the right tail.

In the calculator, when we use invNorm we must put in 0.75, because the

calculator finds cut-offs for areas to the left only: invNorm (0.75, 65, 3). Here 0.75

comes from the fact that the total area must be equal to 1. When we subtract the

area to the right, we are getting the area to the left of the cut-off.


The software uses the Z-table relationship to compute these values. To see how this is done, you will first need to find the cut-off value for the 25% area to the right:

P(Z > 0.67) = 0.25

Then, using the relationship between the Z-score and X, Z = (x – 65)/3, we can solve for x as the unknown:

Using the algebra you have learned, you will find x = 3*0.67 + 65 = 67.0, which is

how the software arrived at the answer. You won’t have to do it this way every

time, but it’s helpful to keep in mind, since this relation is used later on in finding

the margin of error for confidence intervals.

3) What if we’re interested in finding cut-offs for a middle group of women’s heights, say the middle 40%? Obviously, we’re looking for two numbers here, one on either side of the mean, with the same distance to the mean. Call them c1 and c2. Then we are looking for these values so that P(c1 < X < c2) = 0.40.


You probably noticed that the normal calculator in GeoGebra can’t really find two cut-offs at once; in fact, the figure above was drawn using a different tool. But c1 and c2 are not two independent values, since they are equally far from 65, the mean. To use the normal calculator, we must find out how much area is under the curve to the left of c1. Well, if 100% of the area is under the entire curve, then what’s left over after taking away the middle 40% is 1 – 0.40 = 0.60, and since that 60% is split evenly between the two tails (the parts at the sides), that gives 30% for each tail. So c1 is the number such that P(X < c1) = 0.30.


How much area is there under the curve to the left of c2? Either subtract the 30% to the right from 100%, or add up the 30% in the left tail and the 40% in the middle, and you’ll get 70% either way. So c2 is the number such that P(X < c2) = 0.70, and you will find that c2 ≈ 66.6 inches. So to the first decimal, the middle 40% of heights go from 63.4 to 66.6 inches. If you use invNorm on a calculator, the process will be similar.
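The two cut-offs bracketing the middle 40% can be checked the same way: the left one sits at the 30th percentile and the right one at the 70th. A Python sketch:

```python
from statistics import NormalDist

women = NormalDist(mu=65, sigma=3)
low = women.inv_cdf(0.30)   # 30% in the left tail
high = women.inv_cdf(0.70)  # 30% tail + 40% middle = 70% to the left
print(round(low, 1), round(high, 1))  # 63.4 66.6
```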

Summary

Here are a few tips that may help you solve problems related to the normal

distribution:

1) First identify the distribution: is it continuous? Is it Normal?


2) Draw a graph of the normal PDF with the mean and standard deviation

3) Examine the question to see whether you are looking for a probability, or

cut-off values.

4) Shade the approximate areas under the normal PDF.

5) Use the software/calculator to solve the unknown, and compare the output

with your graph.

Unit-3

Gather sample data and calculate a test statistic, where the sample statistic is compared to the parameter value. The test statistic is calculated under the assumption that the null hypothesis is true.

Level of Significance

What is statistical significance anyway? In this post, I’ll continue to focus on

concepts and graphs to help you gain a more intuitive understanding of how

hypothesis tests work in statistics.

To bring it to life, I’ll add the significance level and P value to the graph in my

previous post in order to perform a graphical version of the 1 sample t-test. It’s

easier to understand when you can see what statistical significance truly means!


Here’s where we left off in my last post. We want to determine whether our sample

mean (330.6) indicates that this year's average energy cost is significantly different

from last year’s average energy cost of $260.

The sampling distribution below shows the sample means we’d obtain under the assumption that the null hypothesis is true (population mean = 260) and we repeatedly drew a large number of random samples.


I left you with a question: where do we draw the line for statistical significance on

the graph? Now we'll add in the significance level and the P value, which are the

decision-making tools we'll need.

• Null hypothesis: The population mean equals the hypothesized mean (260).

• Alternative hypothesis: The population mean differs from the hypothesized

mean (260).

The significance level, also denoted as alpha or α, is the probability of rejecting the

null hypothesis when it is true. For example, a significance level of 0.05 indicates a

5% risk of concluding that a difference exists when there is no actual difference.

These definitions can be hard to understand because of their technical nature. A picture makes the concepts much easier to comprehend!

The significance level determines how far out from the null hypothesis value we'll

draw that line on the graph. To graph a significance level of 0.05, we need to shade

the 5% of the distribution that is furthest away from the null hypothesis.


In the graph above, the two shaded areas are equidistant from the null hypothesis

value and each area has a probability of 0.025, for a total of 0.05. In statistics, we

call these shaded areas the critical region for a two-tailed test. If the population

mean is 260, we’d expect to obtain a sample mean that falls in the critical region

5% of the time. The critical region defines how far away our sample statistic must

be from the null hypothesis value before we can say it is unusual enough to reject

the null hypothesis.

Our sample mean (330.6) falls within the critical region, which indicates it is

statistically significant at the 0.05 level.


We can also see if it is statistically significant using the other common significance

level of 0.01.

The two shaded areas each have a probability of 0.005, which adds up to a total

probability of 0.01. This time our sample mean does not fall within the critical

region and we fail to reject the null hypothesis. This comparison shows why you

need to choose your significance level before you begin your study. It protects you

from choosing a significance level because it conveniently gives you significant

results!

Thanks to the graph, we were able to determine that our results are statistically

significant at the 0.05 level without using a P value. However, when you use the


output of statistical software, you’ll compare the P value to your significance level to make this determination.

P-values are the probability of obtaining an effect at least as extreme as the one in

your sample data, assuming the truth of the null hypothesis.

This definition of P values, while technically correct, is a bit convoluted. It’s easier

to understand with a graph!

To graph the P value for our example data set, we need to determine the distance

between the sample mean and the null hypothesis value (330.6 - 260 = 70.6). Next,

we can graph the probability of obtaining a sample mean that is at least as extreme

in both tails of the distribution (260 +/- 70.6).


In the graph above, the two shaded areas each have a probability of 0.01556, for a

total probability of 0.03112. This probability represents the likelihood of obtaining a

sample mean that is at least as extreme as our sample mean in both tails of the

distribution if the population mean is 260. That’s our P value!

When a P value is less than or equal to the significance level, you reject the null

hypothesis. If we take the P value for our example and compare it to the common

significance levels, it matches the previous graphical results. The P value of

0.03112 is statistically significant at an alpha level of 0.05, but not at the 0.01

level.


If we stick to a significance level of 0.05, we can conclude that the average energy

cost for the population is greater than 260.

A common mistake is to interpret the P-value as the probability that the null

hypothesis is true. To understand why this interpretation is incorrect, please read

my blog post How to Correctly Interpret P Values.

A hypothesis test uses two opposing statements about a population to determine which statement is best supported by the sample data. A test result is

statistically significant when the sample statistic is unusual enough relative to the

null hypothesis that we can reject the null hypothesis for the entire population.

“Unusual enough” in a hypothesis test is defined by:

• The assumption that the null hypothesis is true—the graphs are centered on

the null hypothesis value.

• The significance level—how far out do we draw the line for the critical

region?

• Our sample statistic—does it fall in the critical region?

Keep in mind that there is no magic significance level that distinguishes between

the studies that have a true effect and those that don’t with 100% accuracy. The

common alpha values of 0.05 and 0.01 are simply based on tradition. For a

significance level of 0.05, expect to obtain sample means in the critical region 5%

of the time when the null hypothesis is true. In these cases, you won’t know that

the null hypothesis is true but you’ll reject it because the sample mean falls in the

critical region. That’s why the significance level is also referred to as an error rate!

This type of error doesn’t imply that the experimenter did anything wrong or

require any other unusual explanation. The graphs show that when the null

hypothesis is true, it is possible to obtain these unusual sample means for no reason

other than random sampling error. It’s just luck of the draw.


Significance levels and P values are important tools that help you quantify and

control this type of error in a hypothesis test. Using these tools to decide when to

reject the null hypothesis increases your chance of making the correct decision.

The null hypothesis can be thought of as the opposite of the "guess" the researcher

made (in this example the biologist thinks the plant height will be different for the

fertilizers). So the null would be that there will be no difference among the groups

of plants. Specifically in more statistical language the null for an ANOVA is that

the means are the same. We state the Null hypothesis as:

H0: μ1 = μ2 = … = μk

for k levels of an experimental treatment.

Note: Why do we do this? Why not simply test the working hypothesis directly?

The answer lies in the Popperian Principle of Falsification. Karl Popper (a

philosopher) discovered that we can’t conclusively confirm a hypothesis, but we

can conclusively negate one. So we set up a Null hypothesis which is effectively

the opposite of the working hypothesis. The hope is that based on the strength of

the data we will be able to negate or Reject the Null hypothesis and accept an

alternative hypothesis. In other words, we usually see the working hypothesis in

HA.

Step 2: State the Alternative Hypothesis

HA: not all of the μi are equal

The reason we state the alternative hypothesis this way is that if the Null is

rejected, there are many possibilities.

For example, μ1 ≠ μ2 = μ3 = … = μk is one possibility, as is μ1 = μ2 ≠ μ3 = … = μk. Many people make the mistake of stating the Alternative Hypothesis as μ1 ≠ μ2 ≠ μ3 ≠ … ≠ μk, which says that every mean differs from every other


mean. This is a possibility, but only one of many possibilities. To cover all

alternative outcomes, we resort to a verbal statement of ‘not all equal’ and then

follow up with mean comparisons to find out where differences among means

exist. In our example, this means that fertilizer 1 may result in plants that are

really tall, but fertilizers 2, 3 and the plants with no fertilizers don't differ from one

another. A simpler way of thinking about this is that at least one mean is different

from all others.

Step 3: Set α (the significance level)

If we look at what can happen in a hypothesis test, we can construct the following contingency table:

Decision   | H0 is true (In Reality)                         | H0 is false (In Reality)
Accept H0  | OK                                              | Type II Error (β = probability of Type II Error)
Reject H0  | Type I Error (α = probability of Type I Error)  | OK

You should be familiar with type I and type II errors from your introductory

course. It is important to note that we want to set α before the experiment (a priori) because the Type I error is the more ‘grievous’ error to make. The typical value of α is 0.05, establishing a 95% confidence level. For this course we will assume α = 0.05.

Step 4: Collect Data

Remember the importance of recognizing whether data is collected through an experimental design or an observational study.


For categorical treatment level means, we use an F statistic, named after R.A.

Fisher. We will explore the mechanics of computing the F statistic beginning in

Lesson 2. The F value we get from the data is labeled Fcalculated.

As with all other test statistics, a threshold (critical) value of F is established.

This F value can be obtained from statistical tables, and is referred to

as Fcritical or Fα. As a reminder, this critical value is the

minimum value for the test statistic (in this case the F test) for us to be able to

reject the null.

The F distribution, Fα, and the location of Acceptance /

Rejection regions are shown in the graph below:

If the Fcalculated from the data is larger than the Fα, then you are in the Rejection

region and you can reject the Null Hypothesis with (1-α) level of confidence.

Note that modern statistical software condenses steps 6 and 7 by providing a p-

value. The p-value here is the probability of getting an Fcalculated even greater than

what you observe. If by chance, the Fcalculated = Fα, then

the p-value would exactly equal to α. With larger Fcalculated values, we move further


into the rejection region and the p-value becomes less than α. So the decision rule

is as follows:

If the p-value obtained from the ANOVA is less than α, then Reject H0 and Accept

HA.
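The mechanics of Fcalculated are covered in later lessons, but the computation can be sketched now with standard-library Python (the function name and the plant-height numbers are our own illustration, not data from the notes):

```python
from statistics import mean

def one_way_anova_f(*groups):
    # One-way ANOVA: F = MS(between) / MS(within),
    # with k - 1 numerator df and n - k denominator df.
    k = len(groups)                      # number of treatment levels
    n = sum(len(g) for g in groups)      # total observations
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k

# Hypothetical plant heights (cm): control vs. one fertilizer
f_calc, df_num, df_den = one_way_anova_f([20.1, 21.3, 19.8, 20.7],
                                         [24.5, 25.1, 26.0, 24.8])
# Compare f_calc against F-critical for (df_num, df_den) at the chosen alpha
```

If f_calc exceeds the tabled Fα for those degrees of freedom, you are in the rejection region, exactly as in the decision rule above.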

Z-test Vs T-test

Sometimes, measuring every single piece of item is just not practical. That is why

we developed and use statistical methods to solve problems. The most practical

way to do it is to measure just a sample of the population. Some methods test hypotheses by comparison. Two of the better-known statistical hypothesis tests are the T-test and the Z-test. Let us try to break down the two.

A T-test is a statistical hypothesis test in which the test statistic follows a Student’s T-distribution if the null hypothesis is true. The T-statistic was introduced by W.S. Gosset under the pen name “Student”, so the T-test is also referred to as the “Student’s T-test”. The T-test is very likely the most commonly used statistical procedure for hypothesis testing, since it is straightforward and easy to use. Additionally, it is flexible and adaptable to a broad range of circumstances.

There are various T-tests, and the two most commonly applied are the one-sample and two-sample T-tests. One-sample T-tests are used to compare a sample mean

with the known population mean. Two-sample T-tests, on the other hand, are used to

compare either independent samples or dependent samples.

T-test is best applied, at least in theory, if you have a limited sample size (n < 30)

as long as the variables are approximately normally distributed and the variation of

scores in the two groups is not reliably different. It is also great if you do not know

the population’s standard deviation. If the standard deviation is known, then it


would be best to use another type of statistical test, the Z-test. The Z-test is also

applied to compare sample and population means to know if there’s a significant

difference between them. Z-tests always use normal distribution and also ideally

applied if the standard deviation is known. Z-tests are often applied if certain

conditions are met; otherwise, other statistical tests like T-tests are applied in

substitute. Z-tests are often applied to large samples (n > 30). When the T-test is used with large samples, it becomes very similar to the Z-test. There are

fluctuations that may occur in T-tests sample variances that do not exist in Z-tests.

Because of this, there are differences in both test results.
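The two statistics differ only in where the standard deviation comes from, which a short sketch makes concrete (function names are ours):

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    # t = (xbar - mu0) / (s / sqrt(n)); uses the SAMPLE standard
    # deviation s, compared against a t-distribution with n - 1 df
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(len(sample)))

def one_sample_z(sample, mu0, sigma):
    # z = (xbar - mu0) / (sigma / sqrt(n)); requires the POPULATION
    # standard deviation sigma to be known
    return (mean(sample) - mu0) / (sigma / sqrt(len(sample)))
```

With large n, stdev(sample) settles close to σ, which is why the two statistics converge for large samples.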

Summary:

1. A T-test is a statistical hypothesis test in which the test statistic follows a Student’s T-distribution.

2. A T-test is appropriate when you are handling small samples (n < 30) while a Z-

test is appropriate when you are handling moderate to large samples (n > 30).

3. T-test is more adaptable than Z-test since Z-test will often require certain

conditions to be reliable. Additionally, T-test has many methods that will suit any

need.

4. T-tests are more commonly used than Z-tests.

5. Z-tests are preferred over T-tests when standard deviations are known.

Chi-square Test

The Chi-square test is one of the important nonparametric tests, used to compare more than two variables for randomly selected data. The expected frequencies are calculated based on the conditions of the null hypothesis. The rejection of the null hypothesis is based on the differences between the actual values and the expected values.


The data can be examined using the two types of Chi-square test given below:

1. Chi-square goodness of fit test

It is used to observe how closely a sample matches a population. The Chi-square test statistic is

χ² = Σ (Oi – Ei)² / Ei, summed over the k categories (i = 1, …, k),

where Oi is the observed count and Ei is the expected count for category i.

2. Chi-square test for independence of two variables

It is used to check whether the variables are independent of each other or not. The Chi-square test statistic is

χ² = Σ (Oij – Eij)² / Eij, summed over the r rows and c columns,

where Oij is the observed count for the cell in row i and column j, r is the number of rows, c is the number of columns, and Eij is the expected count.
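The goodness-of-fit statistic is a one-line sum, sketched here in Python (the function name and the die-roll counts are our own illustration):

```python
def chi_square_stat(observed, expected):
    # sum over categories of (O - E)^2 / E
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical fairness check: 60 die rolls, 10 expected per face
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6
print(round(chi_square_stat(observed, expected), 4))  # 1.0
```

The resulting statistic is then compared to a chi-square critical value with k – 1 degrees of freedom.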

The F distribution is formed by the ratio of two independent chi-square variables divided by their respective degrees of freedom. The following are properties of the F distribution:

• The distribution is non-symmetric

• The mean is approximately 1

• There are two independent degrees of freedom, one for the numerator, and

one for the denominator.

• There are many different F distributions, one for each pair of degrees of

freedom.


F-Test

The F-test is designed to test if two population variances are equal. It does this by

comparing the ratio of two variances. So, if the variances are equal, the ratio of the

variances will be 1.

All hypothesis testing is done under the assumption that the null hypothesis is true. If the null hypothesis is true, then the F test-statistic can be simplified dramatically: the ratio of the two sample variances, F = s1^2 / s2^2, will be the test statistic used. If the null hypothesis is false, then we will reject the claim that the ratio equals 1 and our assumption that the variances were equal.

There are several different F-tables. Each one has a different level of significance.

So, find the correct level of significance first, and then look up the numerator

degrees of freedom and the denominator degrees of freedom to find the critical

value.

You will notice that all of the tables only give levels of significance for right-tail

tests. Because the F distribution is not symmetric, and there are no negative values,

you may not simply take the opposite of the right critical value to find the left

critical value. The way to find a left critical value is to reverse the degrees of

freedom, look up the right critical value, and then take the reciprocal of this value.

For example, the critical value with 0.05 on the left with 12 numerator and 15

denominator degrees of freedom is found by taking the reciprocal of the critical

value with 0.05 on the right with 15 numerator and 12 denominator degrees of

freedom.

Since the left critical values are a pain to calculate, they are often avoided

altogether. This is the procedure followed in the textbook. You can force the F test

into a right tail test by placing the sample with the large variance in the numerator


and the smaller variance in the denominator. It does not matter which sample has

the larger sample size, only which sample has the larger variance.

The numerator degrees of freedom will be the degrees of freedom for whichever

sample has the larger variance (since it is in the numerator) and the denominator

degrees of freedom will be the degrees of freedom for whichever sample has the

smaller variance (since it is in the denominator).

If a two-tail test is being conducted, you still have to divide alpha by 2, but you

only look up and compare the right critical value.

Assumptions / Notes

• The test statistic is F = s1^2 / s2^2 where s1^2 > s2^2

• Divide alpha by 2 for a two tail test and then find the right critical value

• If standard deviations are given instead of variances, they must be squared

• When the degrees of freedom aren't given in the table, go with the value

with the larger critical value (this happens to be the smaller degrees of

freedom). This is so that you are less likely to reject in error (type I error)

• The populations from which the samples were obtained must be normal.

• The samples must be independent
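The "larger variance in the numerator" rule above can be sketched in a few lines. The two samples here are made up purely for illustration:

```python
from statistics import variance

# Hypothetical samples (illustrative data only)
sample_a = [21.0, 23.5, 19.8, 24.1, 22.2, 20.7]
sample_b = [18.9, 19.4, 19.1, 19.8, 18.6, 19.2]

var_a, var_b = variance(sample_a), variance(sample_b)

# Force a right-tail test: put the larger variance in the numerator.
if var_a >= var_b:
    f_stat, df_num, df_den = var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
else:
    f_stat, df_num, df_den = var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

print(f_stat, df_num, df_den)  # F >= 1 by construction
```

Because the larger variance always goes on top, the computed F is at least 1 and only the right tail of the F table is ever needed.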

Non-parametric tests are tests that do not require that the underlying population be
normal, or indeed that it have any single mathematical form, and some even
apply to non-numerical data. Non-parametric methods are also known as
distribution-free methods since they do not assume any particular underlying population distribution.


Definition


Non-parametric tests are defined as the mathematical methods used in statistical
hypothesis testing which, unlike parametric tests, do not make assumptions about
the frequency distribution of the variables to be assessed. Non-parametric tests are
used when the data are skewed, and they cover techniques that do not rely on the
data belonging to any particular distribution.

The word non-parametric does not exactly mean that these models have no
parameters at all. Rather, the nature and number of the parameters are flexible
and not fixed in advance. This is why non-parametric models are also known
as distribution-free models.

The situations in which non-parametric tests are used are listed below.
(i) When the assumptions of parametric tests are not satisfied.
(ii) When the hypothesis being tested does not concern a distribution parameter.
(iii) When a quick data analysis is required.
(iv) When the data are unscaled.


Advantages:
(i) Easy to understand.
(ii) No lengthy calculations.
(iii) No assumption about the distribution is required.
(iv) Applicable to all kinds of data.
Limitations:
(i) They are less efficient in comparison to parametric tests.
(ii) The results may or may not provide an actual answer because they are distribution
free.

Sign Test Statistics


The sign test rests on the following assumptions:
1. The paired data are selected randomly.
2. The paired data are obtained from similar conditions.
3. No assumptions are made regarding the original population.
The sign test is based merely on the signs (+ or −) of the deviations x − y and not
on their magnitudes. The test assumes that ties or zero differences between the
paired observations do not occur; if they do occur, they must be excluded from the
analysis and the number of paired observations counted is reduced accordingly.
This method can also be used to analyze individual data.

Sign Test

Let d_i = x_i − y_i. Each d_i can be positive or negative, and all the d_i are independent.
We take the null hypothesis as H0: p = 1/2 and the alternative hypothesis as
H1: p ≠ 1/2.


If both np > 5 and nq > 5, the normal approximation can be used, with
σ_p = √(pq/n),
where z_p is the value obtained from the standard normal table for the α level of
significance. If α is not given, take α = 0.05.


we reject the hypothesis.
Case 3 (n ≥ 30)
If n ≥ 30, we use the normal approximation with mean = np and standard deviation = √(npq).
Find the normal table value for the given α. If |z| ≤ table value, we accept the
null hypothesis of the sign test; otherwise we reject it.
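The large-sample procedure above can be sketched as follows. The paired data are made up for illustration, and the helper name sign_test_z is ours:

```python
import math

def sign_test_z(before, after):
    """Large-sample sign test: z based on the number of + signs of
    (before - after); ties (zero differences) are excluded."""
    diffs = [b - a for b, a in zip(before, after) if b != a]
    n = len(diffs)                      # usable pairs after dropping ties
    plus = sum(1 for d in diffs if d > 0)
    p = q = 0.5                         # under H0 a + sign is as likely as a -
    mean, sd = n * p, math.sqrt(n * p * q)
    return (plus - mean) / sd

# 32 hypothetical pairs: 'after' values mostly below 'before'
before = [10, 12, 9, 11, 13, 10, 12, 11] * 4
after  = [ 9, 11, 10, 10, 12, 9, 11, 10] * 4
z = sign_test_z(before, after)
print(round(z, 2))  # → 4.24
```

Here 28 of the 32 differences are positive, so z = (28 − 16)/√8 ≈ 4.24, far beyond any usual normal-table cutoff.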

Kruskal-Wallis H-test


The Kruskal-Wallis H test is used to test whether two or more populations are
identical. With three populations, the null hypothesis is H0: μ1 = μ2 = μ3 and the
alternative hypothesis is H1: not all of μ1, μ2, μ3 are equal.
In the Kruskal-Wallis test, we first rank all the observations in the samples
(treating the whole group as one) and then determine the rank sum for each sample.
The test statistic is
H = [12 / (n(n + 1))] Σ_{i=1}^{m} (R_i² / n_i) − 3(n + 1)
where
n is the total number of observations in all samples,
m is the number of samples,
n_i is the number of observations in the i-th sample, and R_i is the rank sum of the i-th sample.


Here, we use the χ² distribution with m − 1 degrees of freedom and α level of
significance to calculate the critical value. If the calculated value of H is less than
the χ² critical value, the null hypothesis is accepted; otherwise it is rejected.


Whenever some assumptions about the given population are uncertain, we use non-
parametric tests, which serve as counterparts of the parametric tests. When data are
not normally distributed, or when they are on an ordinal level of measurement, non-
parametric tests should be used. The basic rule is to use a parametric t test for
normally distributed data and a non-parametric test for skewed data.
The paired sample t test is used to compare two mean scores when the scores come
from the same group. It is used when the independent variable has two levels and
those levels are repeated measures.

Examples


Solved Examples

Question 1: Use Kruskal Wallis test to test for differences in mean among 3

samples for α = 0.05

Sample 1 : 100, 65, 102, 86, 80, 89, 98, 96, 91, 101

Sample 2: 84, 103, 126, 62, 92, 97, 95, 90, 94, 76


Sample 3: 90, 99, 57, 106, 88, 91, 88, 102, 77, 90.

Solution:

H0: μ1 = μ2 = μ3
H1: not all of μ1, μ2, μ3 are equal.
We first find the rank of the items in the samples (considering the whole group as one)
and then find the rank sums of each sample.

Sample 1  Rank   Sample 2  Rank   Sample 3  Rank
100       24     84        7      90        13
65        3      103       28     99        23
102       26.5   126       30     57        1
86        8      62        2      106       29
80        6      92        17     88        9.5
89        11     97        21     91        15.5
98        22     95        19     88        9.5
96        20     90        13     102       26.5
91        15.5   94        18     77        5
101       25     76        4      90        13
Rank sums: R1 = 161, R2 = 159, R3 = 145.


Here
n = 30
n_i = 10 for all i
m = 3
Degrees of freedom = m − 1 = 3 − 1 = 2.
Test statistic:
H = [12 / (n(n + 1))] Σ_{i=1}^{m} (R_i² / n_i) − 3(n + 1)
  = [12 / (30 × 31)] (161²/10 + 159²/10 + 145²/10) − 3(30 + 1)
  = 0.196

From the χ² distribution with m − 1 = 2 degrees of freedom and α = 0.05 level of
significance we get the critical value 5.991. Since H = 0.196 < 5.991, we accept the null
hypothesis and conclude that there is no difference in the means among the 3
samples.
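The computation in Question 1 can be replicated with a short script. Mid-ranks (average ranks) are used for ties, exactly as in the rank table:

```python
def kruskal_h(*samples):
    """Kruskal-Wallis H statistic with mid-ranks for ties
    (no tie correction, matching the hand computation above)."""
    pooled = sorted(x for s in samples for x in s)
    rank = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    total = sum(sum(rank[x] for x in s) ** 2 / len(s) for s in samples)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

s1 = [100, 65, 102, 86, 80, 89, 98, 96, 91, 101]
s2 = [84, 103, 126, 62, 92, 97, 95, 90, 94, 76]
s3 = [90, 99, 57, 106, 88, 91, 88, 102, 77, 90]
print(round(kruskal_h(s1, s2, s3), 3))  # → 0.196
```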

Question 2: The following data show the employee’s rate of defective work before

and after a change in the wage incentive plan. Compare the following two sets of

data to see whether the change has lowered the defective units produced. Use the
sign test with α = 0.01.

Before: 9, 8, 7, 10, 8, 11, 9, 7, 6, 9, 11, 9

After: 7, 6, 9, 7, 10, 9, 10, 8, 6, 7, 10, 9.

Solution:


Null Hypothesis H0: p = 1/2
Alternate Hypothesis H1: p < 1/2
Here we use a one-tailed test, since we have to check whether the change has lowered
the rate, with α = 0.01.
np = (4/10) × 10 = 4 < 5, so the normal approximation cannot be used and the exact
binomial distribution is applied.
Since P > 0.01, we accept the null hypothesis of the sign test.
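One way to complete the arithmetic, counting the signs of before − after, dropping the two ties, and using the exact binomial distribution (our completion of the steps the notes summarize):

```python
from math import comb

before = [9, 8, 7, 10, 8, 11, 9, 7, 6, 9, 11, 9]
after  = [7, 6, 9, 7, 10, 9, 10, 8, 6, 7, 10, 9]

diffs = [b - a for b, a in zip(before, after) if b != a]  # drop the two ties
n = len(diffs)                           # 10 usable pairs
minus = sum(1 for d in diffs if d < 0)   # 4 negative signs

# Exact binomial tail: P(X <= 4) for X ~ Bin(10, 1/2)
p_value = sum(comb(n, k) for k in range(minus + 1)) / 2 ** n
print(n, minus, round(p_value, 3))  # → 10 4 0.377
```

Since 0.377 > 0.01, the null hypothesis is accepted, matching the conclusion in the notes.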


ANOVA is used to test general rather than specific differences among means.


Assumptions

ANOVA models are parametric, relying on assumptions about the distribution of

the dependent variables (DVs) for each level of the independent variable(s) (IVs).

Initially the array of assumptions for various types of ANOVA may seem

bewildering. In practice, the first two assumptions here are the main ones to check.

Note that the larger the sample size, the more robust ANOVA is to violation of the

first two assumptions: normality and homoscedasticity (homogeneity of variance).

1. Normality: Each cell (i.e. each group for each DV) should be approximately
normally distributed. Check via histograms, skewness and kurtosis overall
and for each cell.

2. Homogeneity of variance: The variance in each cell should be similar.

Check via Levene's test or other homogeneity of variance tests which are

generally produced as part of the ANOVA statistical output.

3. Sample size: per cell > 20 is preferred; aids robustness to violation of the

first two assumptions, and a larger sample size increases power


4. Independence of observations: the observations should not be dependent on
another variable or group (usually guaranteed by the design of the study)

These assumptions apply to independent sample t-tests (see also t-test

assumptions), one-way ANOVAs and factorial ANOVAs.

For ANOVA models involving repeated measures, there are also the assumptions
of:
1. Sphericity: the differences between the levels of the repeated-measures factor have
similar variances

2. Homogeneity of covariance matrices of the dependent variables: tests the
null hypothesis that the observed covariance matrices of the dependent
variables are equal across groups (see Box's M)

The model that describes the relationship between the response and the treatment
(between the dependent and independent variables) for the one-way ANOVA is given by
Y_ij = μ + τ_i + ε_ij,
where Y_ij represents the j-th observation (j = 1, 2, …, n_i) on the i-th treatment
(i = 1, 2, …, k levels). So, Y_23 represents the third observation using level 2 of the
factor. μ is the common effect for the whole experiment, τ_i represents the i-th
treatment effect, and ε_ij represents the random error present in the j-th observation
on the i-th treatment.
In the fixed effects model, the errors ε_ij are assumed to be normally and
independently (NID) distributed, with mean zero and variance σ²_ε. μ is always a
fixed parameter, and τ_1, τ_2, …, τ_k are considered to be fixed parameters if the
levels of the treatment are fixed and not a random sample from a population of
possible levels. It is also assumed that μ is chosen so that
Σ_{i=1}^{k} τ_i = 0
holds. This is the fixed effects model.
In the random effects model, the model equation remains the same. However, now
the τ_i values are random variables assumed to be NID(0, σ²_τ). This is the random
effects model. Which model is appropriate depends on how these levels are chosen
in a given experiment.
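A small simulation of the fixed effects model makes the pieces concrete. The values μ = 50, τ = (−2, 0, 2) and σ_ε = 1 are arbitrary choices for illustration:

```python
import random
from statistics import mean

random.seed(1)
mu = 50.0
tau = [-2.0, 0.0, 2.0]          # fixed treatment effects, chosen to sum to zero
assert sum(tau) == 0

# Y_ij = mu + tau_i + eps_ij with eps ~ N(0, 1), 2000 observations per level
data = [[mu + t + random.gauss(0, 1) for _ in range(2000)] for t in tau]

# The group means recover mu + tau_i up to sampling error
group_means = [mean(g) for g in data]
print([round(m, 1) for m in group_means])  # close to 48, 50, 52
```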

Two-Way ANOVA
The two-way analysis of variance is an extension of the one-way analysis of
variance. There are two independent variables (hence the name two-way).

Assumptions

• The populations from which the samples were obtained must be normally or

approximately normally distributed.

• The samples must be independent.

• The variances of the populations must be equal.


Hypotheses

The null hypotheses for each of the sets are given below.

1. The population means of the first factor are equal. This is like the one-way

ANOVA for the row factor.

2. The population means of the second factor are equal. This is like the one-

way ANOVA for the column factor.

3. There is no interaction between the two factors. This is similar to performing

a test for independence with contingency tables.

Factors

The two independent variables in a two-way ANOVA are called factors. The idea

is that there are two variables, factors, which affect the dependent variable. Each

factor will have two or more levels within it, and the degrees of freedom for each

factor is one less than the number of levels.

Treatment Groups

Treatment groups are formed by making all possible combinations of the two
factors. For example, if the first factor has 3 levels and the second factor has 2
levels, then there will be 3×2 = 6 different treatment groups.

As an example, let's assume we're planting corn. The type of seed and type of
fertilizer are the two factors we're considering in this example. With 3 types of seed
and 5 types of fertilizer, this example has 3×5 = 15 treatment groups. There are
3−1 = 2 degrees of freedom for the type of seed, and 5−1 = 4 degrees of freedom
for the type of fertilizer. There are 2×4 = 8 degrees of freedom for the interaction
between the type of seed and type of fertilizer.


The data that actually appears in the table are samples. In this case, 2 samples from
each treatment group were taken. Each column below corresponds to one type of
fertilizer.
Seed A-402: 106, 110 | 95, 100 | 94, 107 | 103, 104 | 100, 102
Seed B-894: 110, 112 | 98, 99 | 100, 101 | 108, 112 | 105, 107

Main Effect

The main effect involves the independent variables one at a time. The interaction is

ignored for this part. Just the rows or just the columns are used, not mixed. This is

the part which is similar to the one-way analysis of variance. Each of the variances

calculated to analyze the main effects is like the between-group variance.

Interaction Effect

The interaction effect is the effect that one factor has on the other factor. The

degrees of freedom here are the product of the two degrees of freedom for each

factor.

Within Variation

The Within variation is the sum of squares within each treatment group. You have

one less than the sample size (remember all treatment groups must have the same

sample size for a two-way ANOVA) for each treatment group. The total number of

treatment groups is the product of the number of levels for each factor. The within

variance is the within variation divided by its degrees of freedom.


F-Tests

There is an F-test for each of the hypotheses, and the F-test is the mean square for

each main effect and the interaction effect divided by the within variance. The

numerator degrees of freedom come from each effect, and the denominator degrees

of freedom is the degrees of freedom for the within variance in each case.

It is assumed that main effect A has a levels (and A = a-1 df), main effect B has b

levels (and B = b-1 df), n is the sample size of each treatment, and N = abn is the

total sample size. Notice the overall degrees of freedom is once again one less than

the total sample size.

The degrees of freedom are summarized in the following table.
Source              df
Main effect A       a − 1
Main effect B       b − 1
Interaction effect  (a − 1)(b − 1)
Within              ab(n − 1)
Total               abn − 1
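The bookkeeping can be sketched for the seed/fertilizer data shown earlier. Only the two seed rows given in that table are used here, so a = 2, b = 5, n = 2 (a reduced version of the 3-seed example):

```python
# Two-way ANOVA: sums of squares, degrees of freedom, and F ratios.
data = {
    "A-402": [[106, 110], [95, 100], [94, 107], [103, 104], [100, 102]],
    "B-894": [[110, 112], [98, 99], [100, 101], [108, 112], [105, 107]],
}
cells = list(data.values())          # a rows, each with b cells of n values
a, b, n = len(cells), len(cells[0]), len(cells[0][0])

grand = sum(x for row in cells for cell in row for x in cell) / (a * b * n)
row_means = [sum(x for cell in row for x in cell) / (b * n) for row in cells]
col_means = [sum(cells[i][j][k] for i in range(a) for k in range(n)) / (a * n)
             for j in range(b)]

ss_total = sum((x - grand) ** 2 for row in cells for cell in row for x in cell)
ss_a = b * n * sum((m - grand) ** 2 for m in row_means)       # main effect A
ss_b = a * n * sum((m - grand) ** 2 for m in col_means)       # main effect B
ss_within = sum((x - sum(cell) / n) ** 2
                for row in cells for cell in row for x in cell)
ss_ab = ss_total - ss_a - ss_b - ss_within                    # interaction

df_a, df_b, df_ab, df_within = a - 1, b - 1, (a - 1) * (b - 1), a * b * (n - 1)

# Each F is the effect mean square divided by the within variance
ms_within = ss_within / df_within
f_a = (ss_a / df_a) / ms_within
f_b = (ss_b / df_b) / ms_within
f_ab = (ss_ab / df_ab) / ms_within
print(df_a, df_b, df_ab, df_within)  # → 1 4 4 10
```

Note that the four degrees-of-freedom values sum to abn − 1 = 19, one less than the total sample size, as stated above.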

Statistical Inference
Statistical inference is the process of using data analysis to deduce properties of
an underlying probability distribution.[1] Inferential statistical analysis infers
properties of a population, for example by testing hypotheses and deriving
estimates. It is assumed that the observed data set is sampled from a larger
population.

Inferential statistics can be contrasted with descriptive statistics. Descriptive

statistics is solely concerned with properties of the observed data, and it does not

rest on the assumption that the data come from a larger population.

Theory of estimation: basic concepts. Let ϑ be the true parameter of the population.
An estimator is unbiased if, over a large number of samples, the average of all
estimates lies near the true parameter. Given two estimators θ̂1 and θ̂2, the
estimator θ̂1 is relatively efficient compared to θ̂2 if it has the smaller variance; an
estimator is efficient if it has the smallest variance among all unbiased estimators
of ϑ. Common estimation methods include the maximum likelihood method, which
maximizes the likelihood L(x | ϑ), and the least squares method, which minimizes
the quadratic form Q(ϑ) = Σ_{i=1}^{n} (x_i − E[X_i | ϑ])².

Introduction to Estimation

To estimate means to esteem (to give value to). An estimator is any quantity

calculated from the sample data which is used to give information about an

unknown quantity in the population. For example, the sample mean is an estimator

of the population mean µ.

An estimate can be a single number, referred to as a point
estimate, or a range of values, referred to as a confidence interval. Whenever we
use point estimation, we calculate the margin of error associated with that point
estimate.
An estimate of a parameter is often denoted by placing the symbol 'hat' over the
parameter. For example, the true population standard deviation σ is estimated by
the sample standard deviation s.

Again, the usual estimator of the population mean is x̄ = Σx_i / n, where n is the size
of the sample and x1, x2, x3, ......., xn are the values of the sample. If the value of the
estimator in a particular sample is found to be 5, then 5 is the estimate of the
population mean µ.


A "Good" estimator is the one which provides an estimate with the following
qualities:
Unbiasedness: An estimator is said to be unbiased when the expected value of that
estimator can be shown to be equal to the parameter being estimated. For example,
the mean of a sample is an unbiased estimate of the mean of the population from
which the sample was drawn. Unbiasedness is a good quality for an estimate, since,
in such a case, using a weighted average of several estimates provides a better
estimate than each one of those estimates. Therefore, unbiasedness allows us to
upgrade our estimates. For example, if your estimates of the population mean µ are,
say, 10 and 11.2 from two independent samples of sizes 20 and 30 respectively,
then a better estimate of the population mean µ based on both samples is
[20(10) + 30(11.2)] / (20 + 30) = 10.72.
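The weighted-average arithmetic in one place:

```python
# Pooled (weighted) estimate of the population mean from two samples
n1, n2 = 20, 30        # sample sizes
m1, m2 = 10.0, 11.2    # sample means
pooled = (n1 * m1 + n2 * m2) / (n1 + n2)
print(pooled)  # → 10.72
```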

Minimum standard error: The standard deviation of an estimate is called the
standard error of that estimate. The larger the standard error, the more error in your
estimate. The standard deviation of an estimate is a commonly used index of the
error entailed in estimating a population parameter based on the information in a
random sample of size n from the entire population.
Consistency: As the sample size increases, a consistent estimator produces an
estimate with smaller standard error. Therefore, your estimate is "consistent" with
the sample size. That is, spending more money to obtain a larger sample produces
a better estimate.

Efficiency: An efficient estimate is one which has the smallest standard error

among all unbiased estimators.


The "best" estimator is the one which is the closest to the population parameter

being estimated.

The above figure illustrates the concept of closeness by means of aiming at the

center for unbiased with minimum variance. Each dart board has several samples:

The first one has all its shots clustered tightly together, but none of them hit the

center. The second one has a large spread, but around the center. The third one is

worse than the first two. Only the last one has a tight cluster around the center,

therefore has good efficiency.

If an unbiased estimator is extremely variable, then the estimates it produces may not, on average,
be as close to the population parameter as those of a biased estimator with small variance.
The following chart depicts the quality of a few popular estimators for the
population mean µ:

The widely used estimator of the population mean µ is x̄ = Σx_i/n, where n is the
size of the sample and x1, x2, x3, ......., xn are the values of the sample. It has all
of the above good properties; therefore, it is a "good" estimator.

If the estimates are to be used for comparison, then small sample sizes are unlikely to yield any stable estimate. The

mean is sensible in a symmetrical distribution as a measure of central tendency;

but, e.g., with ten cases, you will not be able to judge whether you have a

symmetrical distribution. However, the mean estimate is useful if you are trying to

estimate the population sum, or some other function of the expected value of the

distribution. Would the median be a better measure? In some distributions (e.g.,

shirt size) the mode may be better. BoxPlot will indicate outliers in the data set. If

there are outliers, the median is better than the mean as a measure of central

tendency.


You might like to use Descriptive Statistics Applet for obtaining "good" estimates.

A confidence interval reflects the uncertainty about the parameter being estimated. There is uncertainty because inferences are based on a random

sample of finite size from the entire population or process of interest. To judge the

statistical procedure we can ask what would happen if we were to repeat the same

study, over and over, getting different data (and thus different confidence intervals)

each time.

It is often more informative to report the size of the difference of a measured outcome between groups, rather than a simple indication

of whether or not it is statistically significant. Confidence intervals present a range

of values, on the basis of the sample data, in which the value of such a difference

may lie.

Know that a confidence interval computed from one sample will be different from

a confidence interval computed from another sample.

Understand the relationship between sample size and width of confidence interval,

moreover, know that sometimes the computed confidence interval does not contain

the true value.

Let's say you compute a 95% confidence interval for a mean µ. The way to
interpret this is to imagine an infinite number of samples from the same
population: 95% of the computed intervals will contain the population mean µ,
and at most 5% will not. However, it is wrong to state, "I am 95% confident that
the population mean µ falls within the interval."
What is true is that the interval was computed by a process such that the interval
will contain the true value 95% of the time. This means that "95%" is a property
of the process, not the interval.


Is the probability of occurrence of the population mean highest at the confidence
interval (CI) center and lowest at the boundaries? Does the probability of
occurrence of the population mean in a confidence interval vary in a measurable
way from the center to the boundaries? In a general sense, the normality condition is
assumed, and then the interval between the CI limits is represented by a bell-shaped t
distribution. The expectation (E) of another value is highest at the calculated mean
value, and decreases as the values approach the CI limits.

Tolerance Interval and CI: A good approximation for the single measurement
tolerance interval is √n times the confidence interval of the mean.

You need to use Sample Size Determination JavaScript at the design stage of your

statistical investigation in decision making with specific subjective requirements.

If the confidence intervals from two samples do not overlap, there is a statistically
significant difference, say at the 5% level. However, the converse is not true: two
confidence intervals can overlap even when there is a significant difference
between the means.

As a numerical example, consider two statistics whose values are 10 and 22, each with
standard error 4. The 95% confidence intervals for the two statistics (using the critical
value of 1.96) are [2.2, 17.8] and [14.2, 29.8], respectively. As you see, they display
considerable overlap. However, the z-statistic for the difference of the two means is
|22 − 10| / (16 + 16)^½ = 2.12, which is clearly significant under the same conditions
as applied for constructing the confidence intervals.
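The arithmetic of this example, showing overlapping intervals alongside a significant z-statistic:

```python
import math

m1, m2, se = 10.0, 22.0, 4.0
z_crit = 1.96

# 95% confidence intervals for each statistic
ci1 = (m1 - z_crit * se, m1 + z_crit * se)   # ≈ [2.2, 17.8]
ci2 = (m2 - z_crit * se, m2 + z_crit * se)   # ≈ [14.2, 29.8]

overlap = ci1[1] > ci2[0]                    # the intervals overlap

# z-statistic for the difference of the two means
z = abs(m2 - m1) / math.sqrt(se**2 + se**2)
print(overlap, round(z, 2))  # → True 2.12
```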

One should examine the confidence interval for the difference explicitly. Even if

the confidence intervals are overlapping, it is hard to find the exact overall


confidence level. However, the sum of the individual confidence levels can serve as an
upper limit. This is evident from the fact that P(A or B) ≤ P(A) + P(B).

Interval estimation
Interval estimation is the use of sample data to calculate an interval of possible (or
probable) values of an unknown population parameter, in contrast to point
estimation, which produces a single number.

Test of hypothesis
A statistical hypothesis is a hypothesis that is testable on the basis of observing a
process that is modeled via a set of random variables.[1] A statistical hypothesis
test is a method of statistical

inference. Commonly, two statistical data sets are compared, or a data set obtained

by sampling is compared against a synthetic data set from an idealized model. A

hypothesis is proposed for the statistical relationship between the two data sets, and

this is compared as an alternative to an idealized null hypothesis that proposes no

relationship between two data sets. The comparison is deemed statistically

significant if the relationship between the data sets would be an unlikely realization

of the null hypothesis according to a threshold probability—the significance level.

Hypothesis tests are used in determining what outcomes of a study would lead to a

rejection of the null hypothesis for a pre-specified level of significance. The

process of distinguishing between the null hypothesis and the alternative

hypothesis is aided by identifying two conceptual types of errors, type 1 and type

2, and by specifying parametric limits on e.g. how much type 1 error will be

permitted.

An alternative framework for statistical hypothesis testing is to specify a set

of statistical models, one for each candidate hypothesis, and then use model

selection techniques to choose the most appropriate model.[2] The most common
selection techniques are based on either the Akaike information criterion or the
Bayes factor.

Confirmatory data analysis can be contrasted with exploratory data analysis, which

may not have pre-specified hypotheses.

In the statistics literature, statistical hypothesis testing plays a fundamental

role.[5] The usual line of reasoning is as follows:

1. There is an initial research hypothesis of which the truth is unknown.
2. The first step is to state the relevant null and alternative hypotheses. This is
important, as mis-stating the hypotheses will muddy the rest of the process.

3. The second step is to consider the statistical assumptions being made about

the sample in doing the test; for example, assumptions about the statistical

independence or about the form of the distributions of the observations. This

is equally important as invalid assumptions will mean that the results of the

test are invalid.

4. Decide which test is appropriate, and state the relevant test statistic T.

5. Derive the distribution of the test statistic under the null hypothesis from the

assumptions. In standard cases this will be a well-known result. For

example, the test statistic might follow a Student's t distribution or a normal

distribution.

6. Select a significance level (α), a probability threshold below which the null

hypothesis will be rejected. Common values are 5% and 1%.

7. The distribution of the test statistic under the null hypothesis partitions the

possible values of T into those for which the null hypothesis is rejected—the

so-called critical region—and those for which it is not. The probability of

the critical region is α.

8. Compute from the observations the observed value tobs of the test statistic T.

9. Decide to either reject the null hypothesis in favor of the alternative or not

reject it. The decision rule is to reject the null hypothesis H0 if the observed


value tobs is in the critical region, and to accept or "fail to reject" the

hypothesis otherwise.

An alternative process is commonly used:

1. Compute from the observations the observed value tobs of the test statistic T.

2. Calculate the p-value. This is the probability, under the null hypothesis, of

sampling a test statistic at least as extreme as that which was observed.

3. Reject the null hypothesis, in favor of the alternative hypothesis, if and only

if the p-value is less than the significance level (the selected probability)

threshold.

The two processes are equivalent.[6] The former process was advantageous in the

past when only tables of test statistics at common probability thresholds were

available. It allowed a decision to be made without the calculation of a probability.

It was adequate for classwork and for operational use, but it was deficient for

reporting results.

The latter process relied on extensive tables or on computational support not

always available. The explicit calculation of a probability is useful for reporting.

The calculations are now trivially performed with appropriate software.

Assume that a biological population is sampled and you wish to estimate the mean
value of some variable within that population. In chapter 3, we saw that the Central
Limit Theorem indicates that, when the population distribution is normal, the
sampling distribution of the mean also will be normal. In addition, we saw that,
when using the sample standard deviation, s, to estimate σ, the t distribution can
be used to represent the sampling distribution of the mean. Thus, the t distribution
can be used to test hypotheses about the population mean, μ. This is referred to as
the "one sample t test."


The t test evaluates the hypothesis that the parametric mean, μ, is equal to a
particular value. That is, it tests H0: μ = μ0, where μ0 is the specific value of
interest. If H0 is true, then the value
t = (ȳ − μ0) / (s/√n)
will follow a t distribution. If μ0 = 0, then the formula simplifies to
t = ȳ / (s/√n).

Because we know that, theoretically, this test statistic will follow a t distribution, we have a
way of calculating a p-value that can be used to evaluate H0 within the context
of the approach laid out by Neyman and Pearson.

Going back to the example in chapter 3, a random sample of 20 items was taken
from a population in which we knew that the variable of interest, y, followed a
normal distribution. In the sample, ȳ = 8.48 and s = 5. Now, let's
assume that you wish to test H0: μ = 6 vs. Ha: μ ≠ 6. The value of our
observed test statistic can be calculated.

would expect, assuming that HoHo is true. Specifically, we will calculate the two-

tailed p-value.

The approach is illustrated in Fig. 6.2. Keep in mind that this p-value, P(|t| ≥ t_obs), is valid for Hₐ: μ ≠ 6. Had the alternative hypothesis been Hₐ: μ < 6, the appropriate p-value would have been P(t ≤ t_obs). If the alternative hypothesis had been Hₐ: μ > 6, the appropriate p-value would have been P(t ≥ t_obs).

Under a strict Neyman-Pearson interpretation, because p = 0.039 is less than α = 0.05, we would reject H₀ in favor of Hₐ and conclude that μ ≠ 6.
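As a quick check, the observed t statistic can be reproduced in a few lines of Python. This is a minimal sketch with variable names of our own choosing; obtaining the p-value itself (0.039 above) additionally requires a t-distribution CDF, e.g. from a statistics package.

```python
import math

# One-sample t test statistic for the worked example:
# n = 20, sample mean 8.48, sample sd 5, H0: mu = 6.
n, ybar, s, mu0 = 20, 8.48, 5.0, 6.0

# t = (ybar - mu0) / (s / sqrt(n))
t_obs = (ybar - mu0) / (s / math.sqrt(n))
print(round(t_obs, 3))  # compared against a t distribution with n - 1 = 19 df
```

The two-tailed p-value is then P(|t| ≥ t_obs) with 19 degrees of freedom, which a calculator or statistical software reports as about 0.039, matching the text.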

Hypothesis Testing for a Proportion



Ultimately we will measure statistics (e.g. sample proportions and sample means)

and use them to draw conclusions about unknown parameters (e.g. population

proportion and population mean). This process, using statistics to make judgments or decisions regarding population parameters, is called statistical inference.

P-hat (p̂) is called the sample proportion; remember, it is a statistic (soon we will look at sample means, x̄). But how can p-hat be an accurate measure of p, the population parameter, when another sample of 100 coin flips could produce 53 heads? And for that matter, we only did 100 coin flips out of an uncountable possible total!

The fact that these samples will vary in repeated random sampling taken at the

same time is referred to as sampling variability. The reason sampling variability is

acceptable is that if we took many samples of 100 coin flips and calculated the

proportion of heads in each sample then constructed a histogram or boxplot of the

sample proportions, the resulting shape would look normal (i.e. bell-shaped) with a

mean of 50%.

[The reason we selected a simple coin flip as an example is that the concepts just

discussed can be difficult to grasp, especially since earlier we mentioned that rarely

is the population parameter value known. But most people accept that a coin will

produce an equal number of heads as tails when flipped many times.]

A statistical hypothesis test is a procedure for deciding between two possible

statements about a population. The phrase significance test means the same thing

as the phrase "hypothesis test."

The two competing statements about a population are called the null hypothesis

and the alternative hypothesis.

▪ A typical null hypothesis is a statement that two variables are not related. Other

examples are statements that there is no difference between two groups (or

treatments) or that there is no difference from an existing standard value.


▪ A typical alternative hypothesis is a statement that there is a relationship between two variables, or there is a difference between two groups, or there is a difference from a previous or existing standard.

NOTATION: The notation Ho represents a null hypothesis and Ha represents an alternative hypothesis; po is read as "p-naught" or "p-zero" and represents the null hypothesized value. Shortly, we will substitute μo for po when discussing a test of means.

Ho: p = po

Ha: p ≠ po or Ha: p > po or Ha: p < po [Remember, only select one Ha]

The first Ha is called a two-sided test since "not equal" implies that the true value

could be either greater than or less than the test value, po. The other two Ha are

referred to as one-sided tests since they are restricting the conclusion to a specific

side of po.

Example 3 – This is a test of a proportion:

A Tufts University study finds that 40% of 12th grade females feel they are

overweight. Is this percent lower for college age females? Let p = proportion of

college age females who feel they are overweight. Competing hypotheses are:

Ho: p = .40 (or greater) That is, no difference from Tufts study finding.

Ha: p < .40 (the proportion feeling they are overweight is less for college age females).

Example 4 – This is a test of a mean:

Is there a difference between the mean amount that men and women study per

week? Competing hypotheses are:

Null hypothesis: There is no difference between mean weekly hours of study for

men and women, written in statistical language as μ1 = μ2

Alternative hypothesis: There is a difference between mean weekly hours of study

for men and women, written in statistical language as μ1 ≠ μ2

This notation is used since the study would consider two independent samples: one

from Women and another from Men.

Test Statistic and p-value

▪ A test statistic is a summary of a sample that is in some way sensitive to

differences between the null and alternative hypothesis.


▪ A p-value is the probability that the test statistic would "lean" as much (or more)

toward the alternative hypothesis as it does if the real truth is the null hypothesis.

That is, the p-value is the probability that the sample statistic would occur under

the presumption that the null hypothesis is true.

A small p-value favors the alternative hypothesis. A small p-value means the

observed data would not be very likely to occur if we believe the null hypothesis is

true. So we believe in our data and disbelieve the null hypothesis. An easy

(hopefully!) way to grasp this is to consider the situation where a professor states

that you are just a 70% student. You doubt this statement and want to show that

you are better than a 70% student. If you took a random sample of 10 of your

previous exams and calculated the mean percentage of these 10 tests, which mean

would be less likely to occur if in fact you were a 70% student (the null

hypothesis): a sample mean of 72% or one of 90%? Obviously the 90% would be

less likely and therefore would have a small probability (i.e. p-value).

Using the p-value to Decide between the Hypotheses

▪ The significance level of a test is the border used for deciding between the null and

alternative hypotheses.

▪ Decision Rule: We decide in favor of the alternative hypothesis when a p-value is

less than or equal to the significance level. The most commonly used significance

level is 0.05.

In general, the smaller the p-value the stronger the evidence is in favor of the

alternative hypothesis.

Example 3 Continued:

In a recent elementary statistics survey, the sample proportion (of women) saying

they felt overweight was 37 /129 = .287. Note that this leans toward the alternative

hypothesis that the "true" proportion is less than .40. [Recall that the Tufts

University study finds that 40% of 12th grade females feel they are overweight. Is

this percent lower for college age females?]

Step 1: Let p = proportion of college age females who feel they are overweight.

Ho: p = .40 (or greater) That is, no difference from Tufts study finding.

Ha: p < .40 (the proportion feeling they are overweight is less for college age females).

Step 2:


If npo ≥ 10 and n(1 − po) ≥ 10, then we can use the following Z-test statistic. Since both (129) × (0.4) and (129) × (0.6) exceed 10 [or consider that the numbers of successes and failures, 37 and 92 respectively, are both at least 10], we calculate the test statistic by:

z = (p̂ − po) / √(po(1 − po)/n)

Note: In computing the Z-test statistic for a proportion, we use the hypothesized value po here, not the sample proportion p-hat, in calculating the standard error! We do this because we "believe" the null hypothesis to be true until evidence says otherwise.

z = (0.287 − 0.40) / √(0.40(1 − 0.40)/129) = −2.62

Step 3: The p-value can be found from Standard Normal Table

Calculating p-value:

The method for finding the p-value is based on the alternative hypothesis:

2 × P(Z ≥ | z | ) for Ha : p ≠ po where |z| is the absolute value of z

P(Z ≥ z ) for Ha : p > po

P(Z ≤ z) for Ha : p < po

In our example we are using Ha : p < .40 so our p-value will be found from P(Z ≤

z) = P(Z ≤ -2.62) and from Standard Normal Table this is equal to 0.0044.

Step 4: We compare the p-value to alpha, which we set at 0.05. Since

0.0044 is less than 0.05 we will reject the null hypothesis and decide in favor of the

alternative, Ha.

Step 5: We’d conclude that the percentage of college age females who felt they

were overweight is less than 40%. [Note: we are assuming that our sample, though not random, is representative of all college age females.]

The p-value= .004 indicates that we should decide in favor of the alternative

hypothesis. Thus we decide that less than 40% of college women think they are

overweight.

The "Z-value" (-2.62) is the test statistic. It is a standardized score for the

difference between the sample p and the null hypothesis value p = .40. The p-

value is the probability that the z-score would lean toward the alternative

hypothesis as much as it does if the true population really was p = .40.
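The whole Example 3 computation can be sketched in a few lines of Python using only the standard library; the variable names are illustrative, and `NormalDist` supplies the standard normal CDF used in place of the printed table.

```python
import math
from statistics import NormalDist  # standard normal distribution (Python 3.8+)

# Z test for a proportion, Example 3: p_hat = 37/129, H0: p = 0.40, n = 129.
n, p0 = 129, 0.40
p_hat = 37 / n

# The standard error uses the hypothesized p0, not p_hat.
se = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se

# Left-tailed p-value, since Ha: p < p0.
p_value = NormalDist().cdf(z)
print(round(z, 2), round(p_value, 4))
```

The statistic comes out near −2.62 and the p-value near 0.0044, matching the values read from the Standard Normal Table in the text.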


Chi-Square-tests and F-tests for variance or standard deviation both require that the

original population be normally distributed.

To test a claim about the value of the variance or the standard deviation of a population, the test statistic follows a chi-square distribution with n − 1 degrees of freedom and is given by the following formula:

χ² = (n − 1)s² / σ₀²

The television habits of 30 children were observed. The sample mean was found to

be 48.2 hours per week, with a standard deviation of 12.4 hours per week. Test the

claim that the standard deviation was at least 16 hours per week.

H₀: σ = 16

Hₐ: σ < 16

• We shall choose α = 0.05.

• The test statistic is χ² = (n − 1)s²/σ₀² = (30 − 1)(12.4)²/16² = 17.418.

• The p-value is p = χ²cdf(0, 17.418, 29) = 0.0447.

• Since p < α, we reject H₀.

• We conclude that the standard deviation of television watching was less than 16 hours per week.
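The test statistic for the television example is a one-line computation; the sketch below uses our own variable names, and the p-value (0.0447 above) still needs a chi-square CDF from a calculator or statistics package.

```python
# Chi-square test statistic for the television example:
# n = 30, s = 12.4 hours/week, H0: sigma = 16.
n, s, sigma0 = 30, 12.4, 16.0

chi2 = (n - 1) * s**2 / sigma0**2   # (n-1) s^2 / sigma0^2
print(round(chi2, 3))               # compared against chi-square with n - 1 = 29 df
```

Evaluating χ²cdf(0, 17.418, 29) with statistical software then gives the p-value of about 0.0447 quoted in the example.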

The null hypothesis of equal population variances, H₀: σ₁² = σ₂², is equivalent to σ₁²/σ₂² = 1. Since sample variances are related to chi-square distributions, and the ratio of chi-square distributions is an F-distribution,


we can use the F-distribution to test against a null hypothesis of equal variances.

Note that this approach does not allow us to test for a particular magnitude of

difference between variances or standard deviations.

Given sample sizes of n₁ and n₂, the test statistic will have n₁ − 1 and n₂ − 1 degrees of freedom, and is given by the following formula:

F = s₁² / s₂²

If the larger variance (or standard deviation) is present in the first sample, then the

test is right-tailed. Otherwise, the test is left-tailed. Most tables of the F-

distribution assume right-tailed tests, but that requirement may not be necessary

when using technology.

Samples from two makers of ball bearings are collected, and their diameters (in

inches) are measured, with the following results:

• First maker: n₁ = 80, s₁ = 0.0395

• Bigelow: n₂ = 120, s₂ = 0.0428

Assuming that the diameters of the bearings from both companies are normally

distributed, test the claim that there is no difference in the variation of the

diameters between the two companies.

H₀: σ₁ = σ₂

Hₐ: σ₁ ≠ σ₂

• We shall choose α = 0.05.

• The test statistic is F = s₁²/s₂² = (0.0395)²/(0.0428)² = 0.8517.

• Since the first sample had the smaller standard deviation, this is a left-tailed test. The p-value is p = Fcdf(0, 0.8517, 79, 119) = 0.2232.


• Since p > α, we fail to reject H₀. There is insufficient evidence to conclude that the diameters of the ball bearings from the two companies have different standard deviations.

If the two samples had been reversed in our computations, we would have obtained the test statistic F = 1.1741 and, performing a right-tailed test, found the p-value p = Fcdf(1.1741, ∞, 119, 79) = 0.2232. Of course, the answer is the same.
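Both orientations of the F statistic are easy to verify numerically; this sketch (variable names ours) computes the statistic only, since the p-value requires an F-distribution CDF from a calculator or statistics package.

```python
# F statistic for comparing the two samples' variances:
s1, s2 = 0.0395, 0.0428   # sample standard deviations

F = s1**2 / s2**2         # first sample in the numerator -> left-tailed here
F_rev = s2**2 / s1**2     # samples reversed -> right-tailed test
print(round(F, 4), round(F_rev, 4))
```

Note that F_rev is simply 1/F, which is why the two versions of the test give the same p-value.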

UNIT-4

Correlation Analysis:

Correlation is a statistical tool that helps to measure and analyze the degree of relationship between two variables. Correlation analysis deals with the association between two or more variables.

Types of Correlation Type I:- 1) Positive Correlation 2) Negative Correlation

Positive Correlation: The correlation is said to be positive when the values of the two variables change in the same direction. Ex.: publicity expenditure & sales; height & weight.

Negative Correlation: The correlation is said to be negative when the values of the variables change in opposite directions. Ex.: price & quantity demanded.

Direction of the Correlation:-


Positive relationship – Variables change in the same direction. As X is increasing, Y is increasing; as X is decreasing, Y is decreasing. E.g., as height increases, so does weight.

Negative relationship – Variables change in opposite directions. As X is increasing, Y is decreasing; as X is decreasing, Y is increasing. E.g., as TV time increases, grades decrease.

Types of Correlation Type II:- 1) Simple Correlation 2) Multiple Correlation 3) Partial Correlation 4) Total Correlation

Simple correlation: Under a simple correlation problem only two variables are studied.

Multiple correlation: Under multiple correlation three or more variables are studied. Ex.: Qd = f(P, Pc, Ps, t, Y)

Partial correlation: The analysis recognizes more than two variables but considers only two of them, keeping the others constant.

Total correlation: Based on all the relevant variables, which is normally not feasible.

Types of Correlation Type III:-

Linear correlation: Correlation is said to be linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other. The graph of variables having a linear relationship will form a straight line. Ex.: X = 1, 2, 3, 4, 5, 6, 7, 8; Y = 5, 7, 9, 11, 13, 15, 17, 19; here Y = 3 + 2X.

Non-linear correlation: The correlation is non-linear if the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable.

2) Karl Pearson's Coefficient of Correlation


• Pearson's 'r' is the most common correlation coefficient.

• Karl Pearson's Coefficient of Correlation is denoted by 'r'. The coefficient of correlation 'r' measures the degree of linear relationship between two variables, say x and y.

• Karl Pearson's Coefficient of Correlation: −1 ≤ r ≤ +1

• The degree of correlation is expressed by the value of the coefficient; the direction of change is indicated by the sign (−ve or +ve).

• When deviations are taken from the actual means: r(x, y) = Σxy / √(Σx² Σy²)

• When deviations are taken from assumed means: r = (NΣdxdy − ΣdxΣdy) / (√(NΣdx² − (Σdx)²) √(NΣdy² − (Σdy)²))

• The correlation coefficient lies between −1 and +1; symbolically, −1 ≤ r ≤ +1.

• The correlation coefficient is independent of change of origin and scale.

• The coefficient of correlation is the geometric mean of the two regression coefficients: r = √(bxy × byx)

• If one regression coefficient is (+ve), the other regression coefficient is also (+ve), and the correlation coefficient is then (+ve).
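The deviation formula r = Σxy / √(Σx² Σy²) can be sketched directly in Python, here applied to the linear-correlation example given earlier (X = 1…8, Y = 3 + 2X); variable names are illustrative.

```python
import math

# Pearson's r from deviations about the actual means.
X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [5, 7, 9, 11, 13, 15, 17, 19]   # Y = 3 + 2X, a perfect linear relation

xbar = sum(X) / len(X)
ybar = sum(Y) / len(Y)
x = [xi - xbar for xi in X]          # deviations from the mean of X
y = [yi - ybar for yi in Y]          # deviations from the mean of Y

# r = sum(xy) / sqrt(sum(x^2) * sum(y^2))
r = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y))
print(r)  # a perfectly linear positive relationship gives r = 1
```

Because Y increases by a fixed amount for each unit of X, the coefficient comes out exactly 1, illustrating perfect positive linear correlation.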


• When the variables under study in a statistical series are not capable of quantitative measurement but can be arranged in serial order, Pearson's correlation coefficient cannot be used; in such cases Spearman's rank correlation can be used.

• R = 1 − (6ΣD²) / (N(N² − 1))

• R = rank correlation coefficient

• D = difference of ranks between paired items in the two series

• N = total number of observations

• Equal ranks, or ties in ranks: in such cases average ranks should be assigned to each tied item, and the formula becomes

• R = 1 − (6ΣD² + AF) / (N(N² − 1)), where AF = 1/12(m₁³ − m₁) + 1/12(m₂³ − m₂) + …

m = the number of times an item is repeated

Example:-The scores for nine students in physics and math are as follows:

Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28

Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31

Compute the student’s ranks in the two subjects and compute the Spearman rank

correlation.

Step 1: Find the ranks for each individual subject. I used the Excel rank function to

find the ranks. If you want to rank by hand, order the scores from greatest to

smallest; assign the rank 1 to the highest score, 2 to the next highest and so on:


Step 2: Add a third column, d, to your data. The d is the difference between ranks. For example, the first student's physics rank is 3 and math rank is 5, so the difference is −2. In a fourth column, square your d values.

Step 3: Sum your d-squared values:

4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You'll need this for the formula (ΣD² is just "the sum of the d-squared values").

Step 4: Insert the values into the formula. These ranks are not tied, so use the first

formula:

= 1 – (6*12)/(9(81-1))

= 1 – 72/720

= 1-0.1

= 0.9

The Spearman Rank Correlation for this set of data is 0.9.
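The worked example can be reproduced end to end in Python. This is a minimal sketch assuming no tied scores (true for this data set); the helper function and variable names are ours.

```python
# Spearman rank correlation for the physics/maths example (no tied scores).
physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths   = [30, 33, 45, 23,  8, 49, 12, 4, 31]

def ranks(scores):
    """Assign rank 1 to the highest score, 2 to the next, and so on."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 for s in scores]

# Squared differences between paired ranks.
d2 = [(a - b) ** 2 for a, b in zip(ranks(physics), ranks(maths))]

n = len(physics)
R = 1 - (6 * sum(d2)) / (n * (n**2 - 1))
print(sum(d2), R)  # sum of squared rank differences is 12, R = 0.9
```

With ties present, the `ranks` helper would instead need to assign average ranks and the tie-corrected formula would apply.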

Regression Analysis

In statistical modeling, regression analysis is a set of statistical processes

for estimating the relationships among variables. It includes many techniques for

modeling and analyzing several variables, when the focus is on the relationship

between a dependent variable and one or more independent variables (or

'predictors'). More specifically, regression analysis helps one understand how the


typical value of the dependent variable (or 'criterion variable') changes when any

one of the independent variables is varied, while the other independent variables

are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the

dependent variable given the independent variables – that is, the average value of

the dependent variable when the independent variables are fixed. Less commonly,

the focus is on a quantile, or other location parameter of the conditional

distribution of the dependent variable given the independent variables. In all cases,

a function of the independent variables called the regression function is to be

estimated. In regression analysis, it is also of interest to characterize the variation

of the dependent variable around the prediction of the regression function using

a probability distribution. A related but distinct approach is Necessary Condition

Analysis[1] (NCA), which estimates the maximum (rather than average) value of

the dependent variable for a given value of the independent variable (ceiling line

rather than central line) in order to identify what value of the independent variable

is necessary but not sufficient for a given value of the dependent variable.

Regression analysis is widely used for prediction and forecasting, where its use has

substantial overlap with the field of machine learning. Regression analysis is also

used to understand which among the independent variables are related to the

dependent variable, and to explore the forms of these relationships. In restricted

circumstances, regression analysis can be used to infer causal

relationships between the independent and dependent variables. However this can

lead to illusions or false relationships, so caution is advisable;[2] for

example, correlation does not prove causation.

Many techniques for carrying out regression analysis have been developed.

Familiar methods such as linear regression and ordinary least squares regression

are parametric, in that the regression function is defined in terms of a finite number

of unknown parameters that are estimated from the data. Nonparametric

regression refers to techniques that allow the regression function to lie in a

specified set of functions, which may be infinite-dimensional.

The performance of regression analysis methods in practice depends on the form of

the data generating process, and how it relates to the regression approach being


used. Since the true form of the data-generating process is generally not known,

regression analysis often depends to some extent on making assumptions about this

process. These assumptions are sometimes testable if a sufficient quantity of data is

available. Regression models for prediction are often useful even when the

assumptions are moderately violated, although they may not perform optimally.

However, in many applications, especially with small effects or questions

of causality based on observational data, regression methods can give misleading

results.[3][4]

In a narrower sense, regression may refer specifically to the estimation of

continuous response (dependent) variables, as opposed to the discrete response

variables used in classification.[5] The case of a continuous dependent variable may

be more specifically referred to as metric regression to distinguish it from related

problems.[6]

Definition: The Regression Line is the line that best fits the data, such that the

overall distance from the line to the points (variable values) plotted on a graph is

the smallest. In other words, a line used to minimize the squared deviations of

predictions is called as the regression line.

There are as many regression lines as variables. Suppose we take two

variables, say X and Y, then there will be two regression lines:

▪ Regression line of Y on X: This gives the most probable values of Y from the

given values of X.

▪ Regression line of X on Y: This gives the most probable values of X from the

given values of Y.

The algebraic expressions of these regression lines are called the Regression Equations. There will be two regression equations for the two regression lines.


The correlation between the variables depends on the distance between these two regression lines: the nearer the regression lines are to each other, the higher the degree of correlation; the farther apart they are, the lower the degree of correlation.

The correlation is said to be either perfect positive or perfect negative when the two regression lines coincide, i.e. only one line exists. If the variables are independent, the correlation will be zero, and the lines of regression will be at right angles, i.e. parallel to the X axis and Y axis.

Note: The regression lines cut each other at the point of the averages of X and Y. That is, if from the point where the lines intersect a perpendicular is drawn to the X axis, we get the mean value of X; similarly, if a horizontal line is drawn to the Y axis, we get the mean value of Y.

Question: Find the equation of the two lines of regression and hence find

correlation coefficient from the following data.


Here n = 8


X̄ = 68 + Σx/n = 68 + 0/8 = 68

Ȳ = 69 + Σy/n = 69 + (−2)/8 = 68.75

bxy = (nΣxy − ΣxΣy) / (nΣy² − (Σy)²) = (8(26) − (0)(−2)) / (8(52) − (−2)²) = 0.5049

byx = (nΣxy − ΣxΣy) / (nΣx² − (Σx)²) = (8(26) − (0)(−2)) / (8(36) − (0)²) = 0.7222

∴ Regression equation of Y on X is

Y − Ȳ = byx(X − X̄)

∴ Y − 68.75 = 0.7222(X − 68)

∴ Y = 0.7222X + 19.6389

∴ Regression equation of X on Y is

X − X̄ = bxy(Y − Ȳ)

∴ X − 68 = 0.5049(Y − 68.75)

∴ X = 0.5049Y + 33.2913

Now, r = ±√(byx × bxy) = ±√(0.5049 × 0.7222) = ±0.6039

Since byx and bxy are both positive, r is positive.

∴ r = 0.6039
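The whole computation can be sketched in Python from the deviation sums used in the worked example (n = 8, Σx = 0, Σy = −2, Σxy = 26, Σx² = 36, Σy² = 52, with assumed means 68 and 69); the variable names are ours.

```python
import math

# Regression coefficients from deviation sums about assumed means 68 and 69.
n = 8
Sx, Sy, Sxy, Sx2, Sy2 = 0, -2, 26, 36, 52

byx = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx**2)   # slope of Y on X
bxy = (n * Sxy - Sx * Sy) / (n * Sy2 - Sy**2)   # slope of X on Y

xbar = 68 + Sx / n    # mean of X
ybar = 69 + Sy / n    # mean of Y

# r is the geometric mean of the two regression coefficients;
# both slopes are positive here, so r takes the positive root.
r = math.sqrt(byx * bxy)
print(round(byx, 4), round(bxy, 4), round(r, 4))
```

The two regression lines then follow by substituting the means and slopes into Y − ȳ = byx(X − x̄) and X − x̄ = bxy(Y − ȳ).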


coefficient of correlation (r) = 0.6039

Bivariate Frequency Distribution

When data involve only one variable, the distribution is known as a univariate frequency distribution ("uni" means "one").

It is the simplest form of representing data. It doesn't deal with relationships between variables. One of its most important uses is for taking data, summarizing it, and finding patterns in it.

But now, what if the number of students and their respective marks in several subjects were given? Data in statistics need to be classified according to how many variables are in a particular study. If two variables are involved, their frequency distribution is known as a "bivariate frequency distribution".

Definition

Let us discuss the concept with the help of an example.

The backbone of statistics is data collection and the processing and analysis of that data. When only a small amount of data is available for analysis, it poses no problem.

For example: marks achieved by a student out of 100 in each subject. Find the percentage:

Mathematics =75


Statistics = 90

English = 80

Physics = 75

Chemistry = 85

This data seems easy to work with, doesn't it?

But when a large amount of data is available,

For example:

a class of 60 students: 40, 50, 80, 90, 56, 60, 40, 77, 78, 92, 95 … and so on.

Similarly, 60 values are available. Then it is easier to represent the data in tabular format, such as:

Marks (interval)    Frequency (number of students with marks in that interval)

0 - 10 0

11 - 20 1


21 - 30 2

31 - 40 2

41 - 50 5

51 - 60 10

61 - 70 10

71 - 80 20

81 - 90 10

91 - 100 0

Total 60

1. Scatter plot:

In a scatter plot, it is possible to get an idea of the relationship between the two variables at a glance. Points are plotted on the X and Y axes: the dependent variable is taken on the Y axis and the independent variable on the X axis. The scatter plot looks as follows:


2. Regression analysis:

Regression analysis allows one to estimate future trends in the data. It fits the data to a straight line; then, by substituting values of the independent variable, future values of the dependent variable can easily be found. It also gives the slope and intercept of the line, so hypotheses about the whole population behind the sample can be tested.

3. Correlation coefficients:

The correlation coefficient indicates how strongly two variables are related to each other. The steps and calculations to be performed are shown below. The value of the correlation coefficient always lies between −1 and +1: −1 means perfect negative correlation, +1 stands for perfect positive correlation, and a value of zero indicates no linear relationship between x and y at all. [A negative relationship means that when one variable increases, the other decreases; a positive relationship means that when one variable increases, the other increases too.]


If the given data has numerical values on both sides and it is required to determine how strongly the two variables are related, there is a way to find out whether there is correlation between them and, if so, how much: the "correlation coefficient (r)".

Consider given table,



of Y, with respective frequency totals S₁, S₂, …, Sₖ.

Where,


E[XY] = (1/G) ΣΣ Xᵢ Yⱼ Aᵢⱼ

σx = √( (1/ΣTᵢ) Σ Tᵢ(Xᵢ − x̄)² )

σy = √( (1/ΣSⱼ) Σ Sⱼ(Yⱼ − ȳ)² )


Such a table is known as a bivariate frequency distribution table or joint frequency distribution table. This is very useful in real life. When we have two variables (X and Y) and they


are related, we may perform bivariate analysis on them to find out their relationship.

For example:

2. Student’s grade with hours spent on studies

The standard error of the estimate is a measure of the accuracy of predictions made

with a regression line. Consider the following data.

The second column (Y) is predicted by the first column (X). The slope and Y

intercept of the regression line are 3.2716 and 7.1526 respectively. The third column, (Y'), contains the predictions and is computed according to the formula:

Y' = 3.2716X + 7.1526

The fourth column (Y-Y') is the error of prediction. It is simply the difference

between what a subject's actual score was (Y) and what the predicted score is (Y').


The sum of the errors of prediction is zero. The last column, (Y-Y')², contains the

squared errors of prediction.

The regression line seeks to minimize the sum of the squared errors of prediction.

The square root of the average squared error of prediction is used as a measure of

the accuracy of prediction. This measure is called the standard error of the estimate

and is designated as σest. The formula for the standard error of the estimate is:

σest = √[ Σ(Y − Y′)² / N ]

where N is the number of pairs of (X,Y) points. For this example, the sum of the

squared errors of prediction (the numerator) is 70.77 and the number of pairs is 12.

The standard error of the estimate is therefore equal to:

σest = √(70.77 / 12) ≈ 2.43

There is also a version of the formula based on the population correlation between X and Y. In practice, one does not know these population parameters and therefore has to estimate them from a sample. The symbol s est is used for the estimate of σest. The relevant formulas are:


s est = √[ Σ(Y − Y′)² / (N − 2) ]

and

s est = Sy √[ (1 − r²) × N / (N − 2) ]

(Note that Sy has a capital rather than a small "S" so it is computed with N in the

denominator). The similarity between the standard error of the estimate and the

standard deviation should be noted: The standard deviation is the square root of the

average squared deviation from the mean; the standard error of the estimate is the

square root of the average squared deviation from the regression line. Both

statistics are measures of unexplained variation.
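The computation of both measures can be sketched in a few lines of Python; the (X, Y) data here are hypothetical, since the original data table is not reproduced in these notes:

```python
import math

# Hypothetical (X, Y) observations standing in for the original table.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Least-squares slope and intercept of the regression line.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Sum of the squared errors of prediction, Σ(Y - Y')².
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

sigma_est = math.sqrt(sse / n)        # standard error of the estimate
s_est = math.sqrt(sse / (n - 2))      # sample estimate of sigma_est
print(round(sigma_est, 4), round(s_est, 4))
```

With real data, s est (denominator N − 2) is always somewhat larger than σest, reflecting the two parameters estimated from the sample.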

Time Series

A time series is a series of data points indexed (or listed or graphed) in time order.

Most commonly, a time series is a sequence taken at successive equally spaced

points in time. Thus it is a sequence of discrete-time data. Examples of time series

are heights of ocean tides, counts of sunspots, and the daily closing value of

the Dow Jones Industrial Average.

Time series are very frequently plotted via line charts. Time series are used

in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.

Time series analysis comprises methods for analyzing time series data in order to

extract meaningful statistics and other characteristics of the data. Time

series forecasting is the use of a model to predict future values based on


While regression analysis is often employed in such a way as to test theories that the current values of one or more independent time

series affect the current value of another time series, this type of analysis of time

series is not called "time series analysis", which focuses on comparing values of a

single time series or multiple dependent time series at different points in

time.[1] Interrupted time series analysis is the analysis of interventions on a single

time series.

Time series data have a natural temporal ordering. This makes time series analysis

distinct from cross-sectional studies, in which there is no natural ordering of the

observations (e.g. explaining people's wages by reference to their respective

education levels, where the individuals' data could be entered in any order). Time

series analysis is also distinct from spatial data analysis where the observations

typically relate to geographical locations (e.g. accounting for house prices by the

location as well as the intrinsic characteristics of the houses). A stochastic model

for a time series will generally reflect the fact that observations close together in

time will be more closely related than observations further apart. In addition, time

series models will often make use of the natural one-way ordering of time so that

values for a given period will be expressed as deriving in some way from past

values, rather than from future values (see time reversibility.)

Time series analysis can be applied to real-valued, continuous

data, discrete numeric data, or discrete symbolic data (i.e. sequences of characters,

such as letters and words in the English language[2]).

There are many objectives of time series analysis; they may be classified as:

1. Description

2. Explanation

3. Prediction

4. Control


Description

The first step in the analysis is to plot the data and obtain simple descriptive measures, looking for trends, seasonal fluctuations and so on. A graph may show, for example, a regular seasonal pattern of price change even where that pattern is not fully consistent. A graph also enables one to look for "wild" observations or outliers (values that do not appear to be consistent with the rest of the data). Graphing the time series makes it possible to spot turning points, where an upward trend suddenly changes to a downward trend. If there is a turning point, different models may have to be fitted to the two parts of the series.

Explanation

When observations are taken on two or more variables, it may be possible to use the variation in one time series to explain the variation in another series. This may lead to a deeper understanding. A multiple regression model may be helpful in this case.

Prediction

Given an observed time series, one may want to predict the future values of the series. This is an important task in sales forecasting and in the analysis of economic and industrial time series. The terms prediction and forecasting are used interchangeably.

Control

When a time series is generated to measure the quality of a manufacturing process, the aim may be to control the process. Control procedures are of several different kinds. In quality control, the observations are plotted on a control chart; in a more sophisticated approach, a model is


fitted to the series. Future values of the series are predicted and then the input

process variables are adjusted so as to keep the process on target.

The factors that are responsible for bringing about changes in a time series,

also called the components of time series, are as follows:

1. Secular Trends

2. Seasonal Movements

3. Cyclical Movements

4. Irregular Fluctuations

Secular Trends

The secular trend is the main component of a time series which results from long

term effects of socio-economic and political factors. This trend may show the

growth or decline in a time series over a long period. This is the type of tendency

which continues to persist for a very long period. Prices and export and import

data, for example, reflect obviously increasing tendencies over time.

Seasonal Trends

These are short term movements occurring in data due to seasonal factors. The

short term is generally considered as a period in which changes occur in a time

series with variations in weather or festivities. For example, it is commonly

observed that the consumption of ice-cream during summer is generally high and

hence an ice-cream dealer's sales would be higher in some months of the year

while relatively lower during winter months. Employment, output, exports, etc.,

are subject to change due to variations in weather. Similarly, the sale of garments,

umbrellas, greeting cards and fire-works are subject to large variations during

festivals like Valentine’s Day, Eid, Christmas, New Year's, etc. These types of


variations in a time series are isolated only when the series is provided biannually,

quarterly or monthly.

Cyclic Movements

These are long term oscillations occurring in a time series. These oscillations are

mostly observed in economic data, and the periods of such oscillations are

generally extended from five to twelve years or more. These oscillations are

associated with the well known business cycles. These cyclic movements can be

studied provided a long series of measurements, free from irregular fluctuations, is

available.

Irregular Fluctuations

These are sudden changes occurring in a time series which are unlikely to be

repeated. They are components of a time series which cannot be explained by

trends, seasonal or cyclic movements. These variations are sometimes called

residual or random components. These variations, though accidental in nature, can

cause a continual change in the trends, seasonal and cyclical oscillations during the

forthcoming period. Floods, fires, earthquakes, revolutions, epidemics, strikes etc.,

are the root causes of such irregularities.

A time series may not be affected by all types of variations. Some types of variations may affect only a few time series, while other series may be affected by

all of them. Hence, in analysing time series, these effects are isolated. In classical

time series analysis it is assumed that any given observation is made up of trend,

seasonal, cyclical and irregular movements and these four components have

multiplicative relationship.

Symbolically :


O = T × S × C × I

where O refers to original data,

T refers to trend.

S refers to seasonal variations,

C refers to cyclical variations and

I refers lo irregular variations.

This is the most commonly used model in the decomposition of time series.

There is another model called Additive model in which a particular observation in

a time series is the sum of these four components.

O=T+S+C+I

To prevent confusion between the two models, it should be made clear that in

Multiplicative model S, C, and I are indices expressed as decimal percents whereas

in Additive model S, C and I are quantitative deviations about trend that can be

expressed as seasonal, cyclical and irregular in nature. If in a multiplicative model,

T = 500, S = 1.4, C = 1.20 and I = 0.7 then

O = T × S × C × I

By substituting the values we get

O = 500 × 1.4 × 1.20 × 0.7 = 588

In additive model, T = 500, S = 100, C = 25, I = –50

O = 500 + 100 + 25 – 50 = 575
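The arithmetic of the two schemes can be checked directly; a minimal sketch using the figures from the example:

```python
# Multiplicative model: S, C and I are indices expressed as decimals.
T, S, C, I = 500, 1.4, 1.20, 0.7
O_mult = T * S * C * I

# Additive model: S, C and I are deviations in the same units as T.
T2, S2, C2, I2 = 500, 100, 25, -50
O_add = T2 + S2 + C2 + I2

print(round(O_mult, 1), O_add)   # 588.0 575
```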

The assumption underlying the two schemes of analysis is that whereas there is no

interaction among the different constituents or components under the additive

scheme, such interaction is very much present in the multiplicative scheme. Time series analysis generally proceeds on the assumption of the multiplicative formulation.

Methods of Measuring Trend

Trend can be determined by: (i) the freehand curve method; (ii) the moving averages method; (iii) the semi-averages method; and (iv) the least-squares method. Each of these methods is described below.

(i) Freehand Curve Method : The term freehand is applied to any non-mathematical curve in statistical analysis, even if it is drawn with the aid of drafting instruments. This is the simplest method of studying the trend of a time series. The procedure for drawing a freehand curve is as follows :

(i) The original data are plotted on a graph.


(ii) The direction of the plotted data is carefully observed.

(iii) A smooth line is drawn through the plotted points.

While fitting a trend line by the freehand method, an attempt should be made that

the fitted curve conforms to these conditions.

(i) The curve should be smooth either a straight line or a combination of long

gradual curves.

(ii) The trend line or curve should be drawn through the graph of the data in such a

way that the areas below and above the trend line are equal to each other.

(iii) The vertical deviations of the data above the trend line must equal the deviations below the line.

(iv) Sum of the squares of the vertical deviations of the observations from the trend

should be minimum.

Illustration : Draw a time series graph relating to the following data and fit the

trend by freehand method :

Year Production of Steel

(million tonnes)

1994 20

1995 22

1996 30

1997 28

1998 32

1999 25

2000 29

2001 35

2002 40

2003 32

The trend line drawn by the freehand method can be extended to project future values. However, freehand curve fitting is too subjective and should not be used as a basis for prediction.

Method of Moving Averages : The moving average is a simple and flexible process of trend measurement which is quite accurate under certain conditions. This method establishes a trend by means of a series of averages covering overlapping periods of the series.


The process of successively averaging, say, three years data, and establishing each

average as the moving-average value of the central year in the group, should be

carried throughout the entire series. For a five-item, seven-item or other moving

averages, the same procedure is followed : the average obtained each time being

considered as representative of the middle period of the group.

The choice of a 5-year, 7-year, 9-year, or other moving average is determined by

the length of period necessary to eliminate the effects of the business cycle and

erratic fluctuations. A good trend must be free from such movements, and if there

is any definite periodicity to the cycle, it is well to have the moving average to

cover one cycle period. Ordinarily, the necessary periods will range between three

and ten years for general business series but even longer periods are required for

certain industries.

In the preceding discussion, the moving averages of odd number of years were

representatives of the middle years. If the moving average covers an even number

of years, each average will still be representative of the midpoint of the period

covered, but this mid-point will fall halfway between the two middle years. In the

case of a four-year moving average, for instance each average represents a point

halfway between the second and third years . In such a case, a second moving

average may be used to ‘recentre’ the averages.

That is, if the first moving averages gives averages centering half-way between the

years, a further two-point moving average will recentre the data exactly on the

years.

This method, however, is valuable in approximating trends in a period of transition

when the mathematical lines or curves may be inadequate. This method provides a

basis for testing other types of trends, even though the data are not such as to

justify its use otherwise.

Illustration : Calculate 5-yearly moving average trend for the time series given

below.

Year :     1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000
Quantity :  239  242  238  252  257  250  273  270  268  288  284

Year :     2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
Quantity :  282  300  303  298  313  317  309  329  333  327

Solution :

Year Quantity 5-yearly moving total 5-yearly moving average

1990 239

1991 242

1992 238 1228 245.6

1993 252 1239 247.8

1994 257 1270 254.0

1995 250 1302 260.4

1996 273 1318 263.6

1997 270 1349 269.8

1998 268 1383 276.6

1999 288 1392 278.4

2000 284 1422 284.4

2001 282 1457 291.4

2002 300 1467 293.4

2003 303 1496 299.2

2004 298 1531 306.2

2005 313 1540 308.0

2006 317 1566 313.2

2007 309 1601 320.2

2008 329 1615 323.0

2009 333

2010 327

To simplify the calculation work: obtain the total of the first five years' data. Find the difference between the first and sixth terms and add it to this total to obtain the total of the second to sixth terms. In this way, the difference between the term to be omitted and


the term to be included is added to the preceding total in order to obtain the next

successive total.
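The running-total shortcut just described translates directly into code; a minimal sketch using the quantity series from the illustration:

```python
def moving_averages(values, k):
    """k-period moving averages via a running total: drop the term leaving
    the window and add the term entering it (the shortcut described above)."""
    if len(values) < k:
        return []
    total = sum(values[:k])
    result = [total / k]
    for i in range(k, len(values)):
        total += values[i] - values[i - k]   # add entering term, drop leaving term
        result.append(total / k)
    return result

quantity = [239, 242, 238, 252, 257, 250, 273, 270, 268, 288, 284,
            282, 300, 303, 298, 313, 317, 309, 329, 333, 327]
ma5 = moving_averages(quantity, 5)
print(ma5[0], ma5[1])   # 245.6 247.8 -- centred on 1992 and 1993
```

Each average is centred on the middle year of its five-year window, matching the solution table.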

Illustration : Fit a trend line by the method of four-yearly moving average to the

following time series data.

Year : 1995 1996 1997 1998 1999 2000 2001 2002

Sugar production (lakh tons) : 5 6 7 7 6 8 9 10

Year : 2003 2004 2005 2006

Sugar production (lakh tons) : 9 10 11 11

Solution :

Remark : Observe carefully the placement of totals, averages between the lines.
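For an even-period average, the recentring step described earlier can be sketched as follows, using the sugar-production figures; the helper name is ours, not from the notes:

```python
def centered_moving_average(values, k=4):
    """k-period moving averages (k even), recentred by a further
    2-point moving average so each value lines up with an actual year."""
    firsts = [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
    return [(firsts[i] + firsts[i + 1]) / 2 for i in range(len(firsts) - 1)]

sugar = [5, 6, 7, 7, 6, 8, 9, 10, 9, 10, 11, 11]   # lakh tons, 1995-2006
cma = centered_moving_average(sugar, 4)
print(len(cma))   # 8 centred values, aligned with the years 1997-2004
```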

Merits

1. This is a very simple method.

2. The element of flexibility is always present in this method: the existing calculations do not have to be altered when new data are added; the new data simply provide additional trend values.

3. If the period of the moving average coincides with the period of the cyclical fluctuations, the fluctuations automatically disappear.

4. The shape of the moving average is determined by the trend of the data itself and is unaffected by the analyst's choice of a mathematical function.

5. It is especially useful for series having a strikingly irregular trend.

Limitations

1. It is not possible to have a trend value for each and every year. As the period of

moving average increases, there is always an increase in the number of years for

which trend values cannot be calculated and known. For example, in a five yearly

moving average, trend value cannot be obtained for the first two years and last two

years, in a seven yearly moving average for the first three years and last three years

and so on. But usually values of the extreme years are of great interest.

2. There is no hard and fast rule for the selection of a period of moving average.

3. Forecasting is one of the leading objectives of trend analysis. But this objective

remains unfulfilled because moving average is not represented by a mathematical


function.

4. Theoretically it is claimed that cyclical fluctuations are ironed out if the period of the moving average coincides with the period of the cycle, but in practice cycles are not perfectly periodic.

Trend by the Method of Semi-averages : This method can be used if a straight

line trend is to be obtained. Since the location of only two points is necessary to

obtain a straight line equation, it is obvious that we may select two representative

points and connect them by a straight line. The data are divided into two halves and an average is obtained for each half. Each such average is plotted against the mid-point of its half period, giving two points on a graph paper. By joining these points, a straight-line trend is obtained.

The method is to be commended for its simplicity and used to some extent in

practical work. This method is also flexible, for it is permissible to select

representative periods to determine the two points. Unrepresentative years may be

ignored.
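A sketch of the semi-average computation on hypothetical figures (an even number of years, so each half is well defined):

```python
# Method of semi-averages on hypothetical data.
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008]
y = [3, 5, 6, 6, 8, 10, 11, 12]

half = len(y) // 2
avg1 = sum(y[:half]) / half       # average of first half, plotted at its mid-point
avg2 = sum(y[half:]) / half       # average of second half
mid1 = sum(years[:half]) / half   # mid-point of first half period
mid2 = sum(years[half:]) / half   # mid-point of second half period

b = (avg2 - avg1) / (mid2 - mid1)   # slope of the line joining the two points
a = avg1 - b * mid1                 # intercept, in calendar-year units
print(round(b, 4))
```

Joining the two half-period averages fixes the straight line; the slope b is the average yearly change implied by the two points.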

Method of Least Squares : If a straight line is to be fitted to the data to serve as a satisfactory trend, perhaps the most accurate method of fitting is that of least squares. This method is designed to accomplish two results.

(i) The sum of the vertical deviations from the straight line must equal zero.

(ii) The sum of the squares of all deviations must be less than the sum of the

squares for any other conceivable straight line.

There will be many straight lines which can meet the first condition. Among all

different lines, only one line will satisfy the second condition. It is because of this

second condition that this method is known as the method of least squares. It may

be mentioned that a line fitted to satisfy the second condition, will automatically

satisfy the first condition.

The formula for a straight-line trend can most simply be expressed as

Yc = a + bX

where X represents time variable, Yc is the dependent variable for which trend

values are to be calculated and a and b are the constants of the straight line to be

found by the method of least squares.

Constant a is the Y-intercept: the distance between the origin (O) and the point where the trend line intersects the Y-axis. It shows the value of Y


when X = 0, constant b indicates the slope which is the change in Y for each unit

change in X.

Let us assume that we are given observations of Y for n number of years. If we

wish to find the values of constants a and b in such a manner that the two

conditions laid down above are satisfied by the fitted equation.

Mathematical reasoning suggests that, to obtain the values of constants a and b

according to the Principle of Least Squares, we have to solve simultaneously the

following two equations.

∑Y = na + b∑X ...(i)

∑XY = a∑X + b∑X2 ...(ii)

Solution of the two normal equations yield the following values for the constants a

and b :

b = [n∑XY − ∑X∑Y] / [n∑X² − (∑X)²]

and a = (∑Y − b∑X) / n

Least Squares Long Method : It makes use of the above mentioned two normal

equations without attempting to shift the time variable to convenient mid-year.

This method is illustrated by the following example.

Illustration : Fit a linear trend curve by the least-squares method to the following

data :

Year Production (Kg.)

2001 3

2002 5

2003 6

2004 6

2005 8

2006 10

2007 11

2008 12

2009 13

2010 15


Solution : The first year 2001 is assumed to be 0, 2002 would become 1, 2003

would be 2 and so on. The various steps are outlined in the following table.

----------------------------------------------------

Year Production

Y X XY X2

1 2 3 4 5

----------------------------------------------------

2001 3 0 0 0

2002 5 1 5 1

2003 6 2 12 4

2004 6 3 18 9

2005 8 4 32 16

2006 10 5 50 25

2007 11 6 66 36

2008 12 7 84 49

2009 13 8 104 64

2010 15 9 135 81

Total 89 45 506 285

-----------------------------------------------------

The above table yields the following values for various terms mentioned below :

n = 10, ∑X = 45, ∑X2 = 285, ∑Y = 89, and ∑XY = 506

Substituting these values in the two normal equations, we obtain

89 = 10a + 45b ...(i)

506 = 45a + 285b ...(ii)

Multiplying equation (i) by 9 and equation (ii) by 2, we obtain

801 = 90a + 405b ...(iii)

1012 = 90a + 570b ...(iv)

Subtracting equation (iii) from equation (iv), we obtain

211 = 165b or b = 211/165 = 1.28

Substituting the value of b in equation (i), we obtain

89 = 10a + 45 × 1.28


89 = 10a + 57.60

10a = 89 – 57.6

10a = 31.4

a = 31.4/10 = 3.14

Substituting these values of a and b in the linear equation, we obtain the following

trend line

Yc = 3.14 + 1.28X

Inserting various values of X in this equation, we obtain the trend values as below :

-----------------------------------------------------------------

Year Observed Y a b × X Yc (Col. 3 plus Col. 4)

1 2 3 4 5

-----------------------------------------------------------------

2001 3 3.14 1.28 × 0 3.14

2002 5 3.14 1.28 × 1 4.42

2003 6 3.14 1.28 × 2 5.70

2004 6 3.14 1.28 × 3 6.98

2005 8 3.14 1.28 × 4 8.26

2006 10 3.14 1.28 × 5 9.54

2007 11 3.14 1.28 × 6 10.82

2008 12 3.14 1.28 × 7 12.10

2009 13 3.14 1.28 × 8 13.38

2010 15 3.14 1.28 × 9 14.66

-------------------------------------------------------------------
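The long method can be checked programmatically. Note that the hand computation above rounds b to 1.28 before back-substituting, which is why it reports a = 3.14; the unrounded solution of the same normal equations gives a ≈ 3.145, b ≈ 1.279:

```python
# Least-squares fit of the worked example (X = 0 for 2001, 1 for 2002, ...).
y = [3, 5, 6, 6, 8, 10, 11, 12, 13, 15]
x = list(range(len(y)))

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

# Solve the two normal equations for b, then a.
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n
print(round(a, 3), round(b, 3))   # 3.145 1.279
```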

Least Squares Method : We can take any other year as the origin, and for that

year X would be 0. Considerable saving of both time and effort is possible if the

origin is taken in the middle of the whole time span covered by the entire series.

The origin would than be located at the mean of the X values. Sum of the X values

would then equal 0. The two normal equations would then be simplified to

∑Y = Na ...(i) or a = ∑Y/N

∑XY = b∑X² ...(ii) or b = ∑XY/∑X²


Two cases of the short-cut method are given below. In the first case there is an odd number of years, while in the second case the number of observations is even.

Illustration : Fit a straight line trend on the following data :

Year 1996 1997 1998 1999 2000 2001 2002 2003 2004

Y 4 7 7 8 9 11 13 14 17

Solution : Since we have 9 observations, therefore, the origin is taken at 2000 for

which X is assumed to be 0.

------------------------------

Year Y X XY X2

------------------------------

1996 4 – 4 – 16 16

1997 7 – 3 – 21 9

1998 7 – 2 – 14 4

1999 8 – 1 – 8 1

2000 9 0 0 0

2001 11 1 11 1

2002 13 2 26 4

2003 14 3 42 9

2004 17 4 68 16

-----------------------------

Total 90 0 88 60

------------------------------

Thus n = 9, ∑Y = 90, ∑X = 0, ∑XY = 88, and ∑X² = 60

Substituting these values in the two normal equations, we get

90 = 9a or a = 90/9 or a = 10

88 = 60b or b = 88/60 = 1.47

Trend equation is : Yc = 10 + 1.47 X

Inserting the various values of X, we obtain the trend values as below :
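The short-cut computation and the resulting trend values can be generated as follows (the trend values use the rounded equation Yc = 10 + 1.47X, as in the text):

```python
# Coded X with the origin at the middle year (2000), so that sum(x) = 0.
y = [4, 7, 7, 8, 9, 11, 13, 14, 17]
x = [-4, -3, -2, -1, 0, 1, 2, 3, 4]

a = sum(y) / len(y)                                                  # 90/9 = 10.0
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)  # 88/60 ≈ 1.47

trend = [round(10 + 1.47 * xi, 2) for xi in x]
print(trend)   # [4.12, 5.59, 7.06, 8.53, 10.0, 11.47, 12.94, 14.41, 15.88]
```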


Illustration : Fit a straight-line trend by the method of least squares to the following data :

Year : 2003 2004 2005 2006 2007 2008 2009 2010
Y :    6.7  5.3  4.3  6.1  5.6  7.9  5.8  6.1

Solution : Here there are two mid-years, viz., 2006 and 2007. The mid-point of the two years is assumed to be 0 and a period of six months is treated as the unit. On this basis the calculations are as shown below:

----------------------------------------------

Years Observed Y X XY X2

----------------------------------------------

2003 6.7 – 7 – 46.9 49

2004 5.3 – 5 – 26.5 25

2005 4.3 – 3 – 12.9 9

2006 6.1 – 1 – 6.1 1

2007 5.6 1 5.6 1

2008 7.9 3 23.7 9

2009 5.8 5 29.0 25

2010 6.1 7 42.7 49


----------------------------------------------

Total 47.8 0 8.6 168

----------------------------------------------

From the above computations, we get the following values.

n = 8, ∑Y = 47.8, ∑X = 0, ∑XY = 8.6, ∑X2 = 168

Substituting these values in the two normal equations, we obtain

47.8 = 8a or a = 47.8/8 = 5.98, and 8.6 = 168b or b = 8.6/168 = 0.051

The equation for the trend line is : Yc = 5.98 + 0.051X

Trend values generated by this equation are below :
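These trend values can be generated from the fitted equation; recall that X is counted in six-month units, so successive years differ by 2 units:

```python
# Yc = 5.98 + 0.051X with X in half-year units (origin between 2006 and 2007).
a, b = 5.98, 0.051
xs = [-7, -5, -3, -1, 1, 3, 5, 7]          # the years 2003 .. 2010
trend = [round(a + b * x, 3) for x in xs]
print(trend)   # [5.623, 5.725, 5.827, 5.929, 6.031, 6.133, 6.235, 6.337]
```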

The simplest example of the non-linear trend is the second degree parabola, the

equation is written in the form :

Yc = a + bX + cX²

When numerical values for a, b and c have been derived, the trend value for any

year may be

computed substituting in the equation the value of X for that year. The values of a,

b and c can be determined

by solving the following three normal equations simultaneously:

(i) ∑Y = Na + b∑X + c∑X²

(ii) ∑XY = a∑X + b∑X² + c∑X³


(iii) ∑X²Y = a∑X² + b∑X³ + c∑X⁴

Note that the first equation is merely the summation of the given function, the

second is the summation of X multiplied into the given function, and the third is

the summation of X2 multiplied into the given function.

When time origin is taken between two middle years SX would be zero. In that

case the equations are reduced to :

(i) ∑Y = Na + c∑X²

(ii) ∑XY = b∑X²

(iii) ∑X²Y = a∑X² + c∑X⁴

The value of b can now directly be obtained from equation (ii), and the values of a and c by solving equations (i) and (iii) simultaneously. Thus,

b = ∑XY / ∑X², c = [N∑X²Y − ∑X²∑Y] / [N∑X⁴ − (∑X²)²], a = (∑Y − c∑X²) / N

Illustration : The price of a commodity during 2000 – 2005 is given below. Fit a

parabola Y = a + bX + cX2 to this data. Estimate the price of the commodity for the

year 2010 :

Year Price Year Price

2000 100 2003 140

2001 107 2004 181

2002 128 2005 192

Also plot the actual and trend values on graph.

Solution : To determine the value a, b and c, we solve the following normal

equations:

∑Y = Na + b∑X + c∑X²

∑XY = a∑X + b∑X² + c∑X³

∑X²Y = a∑X² + b∑X³ + c∑X⁴

-----------------------------------------------------------------------------------

Year Y X X2 X3 X4 XY X2Y Yc

-----------------------------------------------------------------------------------

2000 100 – 2 4 – 8 16 – 200 400 97.744

2001 107 – 1 1 – 1 1 – 107 107 110.426

2002 128 0 0 0 0 0 0 126.680

2003 140 +1 1 +1 1 +140 140 146.506

2004 181 +2 4 +8 16 + 362 724 169.904

2005 192 +3 9 +27 81 +576 1728 196.874

--------------------------------------------------------------------------------------

N = 6, ∑Y = 848, ∑X = 3, ∑X² = 19, ∑X³ = 27, ∑X⁴ = 115, ∑XY = 771, ∑X²Y = 3099, ∑Yc = 848.134

--------------------------------------------------------------------------------------

848 = 6a + 3b + 19c ...(i)

771 = 3a +19b +27c ...(ii)

3,099 = 19a + 27b +115c ...(iii)

Solving Eqns. (i) and (ii), get

35b + 35c = 694 ...(iv)

Multiplying Eqn. (ii) by 19 and Eqn. (iii) by 3. Subtracting (iii) from (ii), we get

5352 = 280b + 168 c ...(v)

Solving Eqns. (iv) and (v), we get

c = 1.786

Substituting the value of c in Eqn. (iv), we get

b = 18.04 [35b + (35 × 1.786) = 694]

Putting the value of b and c in Eqn. (i), we get

a = 126.68 [848 = 6a + (3 × 18.04) + (19 × 1.786)]

Thus a = 126.68, b =18.04 and c = 1.786

Substituting the values in the equation

Yc = 126.68 + 18.04X + 1.786X2

When X = –2, Y = 126.68 + 18.04(–2) + 1.786(–2)² = 126.68 – 36.08 + 7.144 = 97.744

When X = –1, Y = 126.68 + 18.04(–1) + 1.786(–1)² = 126.68 – 18.04 + 1.786 = 110.426

When X = 0, Y = 126.68

When X = 1, Y = 126.68 + 18.04 + 1.786 = 146.506

When X = 2, Y = 126.68 + 18.04(2) + 1.786(2)² = 126.68 + 36.08 + 7.144 = 169.904


When X = 3, Y = 126.68 + 18.04(3) + 1.786(3)² = 126.68 + 54.12 + 16.074 = 196.874

When X = 8 (the year 2010): Y = 126.68 + 18.04(8) + 1.786(8)² = 126.68 + 144.32 + 114.304 = 385.304

Thus the likely price of the commodity for the year 2010 is Rs.385.304.

The graph of the actual and trend values is given below:
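As a cross-check, the three normal equations of this example can be solved by elimination in code. The exact solution is a ≈ 126.657, b ≈ 18.043, c ≈ 1.786; the hand computation's a = 126.68 differs slightly because b and c were rounded before back-substitution. The small solver below is our own sketch:

```python
def solve3(m, v):
    """Gauss-Jordan elimination for a 3x3 system (no pivoting; adequate here)."""
    rows = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for i in range(3):
        p = rows[i][i]
        rows[i] = [e / p for e in rows[i]]
        for j in range(3):
            if j != i:
                f = rows[j][i]
                rows[j] = [e - f * g for e, g in zip(rows[j], rows[i])]
    return [rows[k][3] for k in range(3)]

# Coefficients from the worked table: N = 6, ΣX = 3, ΣX² = 19, ΣX³ = 27, ΣX⁴ = 115.
A = [[6, 3, 19], [3, 19, 27], [19, 27, 115]]
rhs = [848, 771, 3099]
a, b, c = solve3(A, rhs)

forecast_2010 = a + b * 8 + c * 64   # X = 8 for the year 2010
print(round(a, 3), round(b, 3), round(c, 3), round(forecast_2010, 2))
```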

Conversion of Annual Trend Equation to Monthly Trend Equation

Fitting a trend line by least squares to monthly data may be excessively time-consuming. It is more convenient to compute the trend equation from annual data and then convert this trend equation to a monthly trend equation.

There are two possible situations: (i) the Y units are annual totals, for example, the

total number of passenger cars sold; (ii) the Y units are monthly averages, for

example average monthly wholesale price Index.

Where Data are Annual Totals

A trend equation operating at an annual level is to be reduced to a monthly level. The constant a is expressed in terms of annual Y values; to express it in terms of monthly values, we must divide it by 12. Similarly, b must be divided by 12 to convert the annual change into a monthly change. But this division gives only the change between the same month of two consecutive years, whereas we want the change between two consecutive months, so b must be divided by 12 once again. Consequently, to convert an annual trend equation to a monthly trend equation when the data are expressed as annual totals, we divide a by 12 and b by 144.

Where the Data are given as monthly averages per year

In this case the Y values are already on a monthly level, so the value of a remains unchanged in the conversion. The value of b shows the change on a monthly level, but from a month in one year to the corresponding month in the following year; it is therefore only necessary to divide b by 12 to make it measure the change between consecutive months.
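These two conversion rules can be sketched in Python; the annual equation and its coefficients below are made up purely for illustration.

```python
# Sketch of the two conversion rules. The annual equation below is
# hypothetical, chosen only to make the arithmetic easy to follow.

def annual_totals_to_monthly(a, b):
    # Y values are annual totals: divide a by 12 to move to a monthly
    # level, and b by 144 (12 for the level, 12 again for the time unit).
    return a / 12, b / 144

def monthly_averages_to_monthly(a, b):
    # Y values are monthly averages per year: a is already on a monthly
    # level, so only b is divided by 12.
    return a, b / 12

# Hypothetical annual-totals equation Y = 1200 + 288X
a_m, b_m = annual_totals_to_monthly(1200, 288)
print(a_m, b_m)   # 100.0 2.0
```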

Merits

(i) This method leaves no room for subjectivity, since it is a mathematical method of measuring trend.


(ii) This method gives the line of best fit because from this line the sum of the

positive and negative deviations is zero and the total of the squares of these

deviations is minimum.

Limitations

The best practical use of mathematical trends is describing movements in a time series. A trend does not provide a clue to the causes of such movements; therefore, forecasting on this basis may be quite risky.

Forecasting will be valid only if there is a functional relationship between the variable under consideration and time for the particular trend. A trend describes past behaviour; it hardly throws light on the causes which may influence future behaviour.

The other limitation is that if some items are added to the original data, a new

equation has to be obtained.

Curvilinear Trend

Sometimes a time series cannot be represented by a straight-line trend. Such trends are known as curvilinear trends. If the curvilinear trend can be represented by a straight line on semi-log paper, by polynomials of second or higher degree, or by a double-logarithmic function, then the method of least squares is applicable to such cases as well.

MEASUREMENT OF SEASONAL VARIATIONS

Seasonal variations are those rhythmic changes in the time series data that occur

regularly each year. They have their origin in climatic or institutional factors that

affect either supply or demand or both. It is important that these variations be

measured accurately, for three reasons. First, the investigator may want to eliminate seasonal variations from the data being studied. Second, precise knowledge of the seasonal pattern aids in planning future operations. Lastly, complete knowledge of seasonal variations is useful to those trying to remove the causes of seasonality, or to mitigate the problem by diversification, by offsetting opposing seasonal patterns, or by other means.

Since the number of calendar days and working days varies from month to month, it is essential to adjust the monthly figures if they are based on daily quantities. No such adjustment is needed when we deal with volumes of inventories or bank deposits, because then the values are not affected by the number of days in the month.


Methods of Measuring Seasonal Variations

1. Method of Simple Averages (Weekly, Monthly or Quarterly).

2. Ratio-to-Trend Method.

3. Ratio-to-Moving Average Method.

4. Link Relatives Method.

Methods of Simple Average

This is the simplest method of obtaining a seasonal index. The following steps are

necessary for calculating the index :

(i) Arrange the unadjusted data by years and months (or quarters if quarterly data are given).

(ii) Find the totals for January, February, etc.

(iii) Divide each total by the number of years for which data are given. For

example, if we are given monthly data for five years then we shall first obtain total

for each month for five years and divide each total by 5 to obtain an average.

(iv) Obtain an average of monthly averages by dividing the total of monthly

averages by 12.

(v) Taking the average of monthly average as 100, compute the percentage of

various monthly averages as follows:

Seasonal Index for January

= (Monthly average for January / Average of monthly averages) × 100

If instead of the average of each month, the totals for each month are obtained, we will get the same result.
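The steps above can be sketched in Python; the figures are hypothetical, and quarters stand in for months to keep the example short.

```python
# Method of simple averages on hypothetical quarterly data
# (3 years x 4 quarters).
data = [
    [30, 40, 36, 34],   # year 1
    [34, 52, 40, 44],   # year 2
    [40, 58, 54, 48],   # year 3
]
n_years = len(data)

# steps (ii)-(iii): total each season over the years and average
seasonal_avg = [sum(year[q] for year in data) / n_years for q in range(4)]

# step (iv): grand average of the seasonal averages
grand_avg = sum(seasonal_avg) / 4

# step (v): express each seasonal average as a percentage of the grand average
seasonal_index = [100 * s / grand_avg for s in seasonal_avg]
print([round(i, 2) for i in seasonal_index])   # the four indices sum to 400
```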

The following example shall illustrate the method.

Illustration: Monthly consumption of electric power, in millions of kilowatt-hours (kWh), for street lighting in India during 1999-2003 is given below:


(i) Column No. 7 gives the total for each month for five years.

(ii) In column No. 8 each total of column No. 7 has been divided by 5 to obtain an

average for each month.

(iii) The average of monthly averages is obtained by dividing the total of monthly


averages by 12.

(iv) In column No. 9 each monthly average has been expressed as a percentage of the average of monthly averages. Thus, the percentage for January = (January average / average of monthly averages) × 100, and similarly for February and the remaining months.

If instead of monthly data, we are given weekly or quarterly data, we shall

compute weekly or quarterly averages by following the same procedure.

Ratio-to-moving-average method: The method of monthly totals or monthly averages gives no consideration to any trend which may be present in the data. The ratio-to-moving-average method is one of the simplest commonly used devices for measuring seasonal variation that takes the trend into consideration. The steps to compute seasonal variation are as follows:

(i) Arrange the unadjusted data by years and months.

(ii) Compute the trend values by the method of moving averages. For this purpose take a 12-month moving average, followed by a two-month moving average of the results, to centre the trend values.

(iii) Express the data for each month as a percentage ratio of the corresponding

moving-average trend value.

(iv) Arrange these ratios by months and years.

(v) Aggregate the ratios for January, February etc.

(vi) Find the average ratio for each month.

(vii) Adjust the average monthly ratios found in step (vi) so that they will

themselves average 100 percent. These adjusted ratios will be the seasonal indices

for various months.
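A minimal sketch of these steps, using hypothetical quarterly data and a 4-quarter centred moving average in place of the 12-month average described above:

```python
# Ratio-to-moving-average on hypothetical quarterly data; a 4-quarter
# centred moving average stands in for the 12-month one in the text.
y = [30, 40, 36, 34, 34, 52, 40, 44, 40, 58, 54, 48]
p = 4

# centred moving average: mean of two consecutive p-period means
ma = [(sum(y[i:i + p]) / p + sum(y[i + 1:i + 1 + p]) / p) / 2
      for i in range(len(y) - p)]

# ratio of each observation to its trend value, in per cent
# (the first and last p//2 observations have no trend value)
ratios = [100 * y[i + p // 2] / ma[i] for i in range(len(ma))]

# average the ratios quarter by quarter
by_q = {q: [] for q in range(p)}
for i, r in enumerate(ratios):
    by_q[(i + p // 2) % p].append(r)
avg = [sum(by_q[q]) / len(by_q[q]) for q in range(p)]

# adjust so that the p indices average 100 (sum to 400 for quarters)
index = [a * 100 * p / sum(avg) for a in avg]
print([round(i, 2) for i in index])
```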

A seasonal index computed by the ratios-to-moving-average method ordinarily

does not fluctuate so much as the index based on straight-line trends. This is

because the 12-month moving average follows the cyclical course of the actual

data quite closely. Therefore the index ratios obtained by this method are often

more representative of the data from which they are obtained than is the case in the

ratio-to-trend method which will be discussed later on.

Illustration : Prepare a monthly seasonal index from the following data, using

moving averages method :

Monthly Sales of XYZ Products Co,. Ltd. (Rs.)


Year

2000 2001 2002

January 3,639 3,913 4,393

February 3,591 3,856 4,530

March 3,326 3,714 4,287

April 3,469 3,820 4,405

May 3,321 3,647 4,024

June 3,320 3,498 3,992

July 3,205 3,476 3,795

August 3,205 3,354 3,492

September 3,255 3,594 3,571

October 3,550 3,830 3,923

November 3,771 4,183 3,984

December 3,772 4,482 3,880


Taking the average of monthly averages as 100, the monthly averages have been adjusted to obtain the seasonal index for each month.



Merits

This method is more widely used in practice than other methods. The index calculated by the ratio-to-moving-average method does not fluctuate very much, because the 12-month moving average follows the cyclical course of the actual data closely. The index ratios are therefore truly representative of the data from which they have been obtained.

Limitations

Seasonal index numbers cannot be calculated for every month for which data are available. When a four-month moving average is taken, two months at the beginning and two months at the end are left out, for which seasonal index numbers cannot be calculated.

The ratio-to-trend method: The ratio-to-trend method is similar to the ratio-to-moving-average method. The only difference is the way the trend values are obtained: in the ratio-to-moving-average method they are obtained by the method of moving averages, whereas in the ratio-to-trend method the corresponding trend is obtained by the method of least squares.

The steps in the calculation of seasonal variation are as follows :

(i) Arrange the unadjusted data by years and months.

(ii) Compute the trend values for each month with the help of least squares

equation.

(iii) Express the data for each month as a percentage ratio of the corresponding

trend value.

(iv) Aggregate the January’s ratios, February’s ratios, etc., computed previously

(v) Find the average ratio for each month.

(vi) Adjust the average ratios found in step (v) so that they will themselves average

100 per cent.

The last step gives us the seasonal index for each month.
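A sketch of these steps in Python, with one simplification: the straight-line trend is fitted directly to the quarterly observations, whereas the worked example below fits an annual trend and converts it, so the resulting figures would differ slightly.

```python
# Ratio-to-trend, with a straight-line trend fitted by least squares
# directly to 20 quarterly observations (hypothetical figures).
y = [30, 40, 36, 34, 34, 52, 40, 44, 40, 58, 54, 48,
     54, 76, 68, 62, 80, 92, 86, 82]
n = len(y)
x = [i - (n - 1) / 2 for i in range(n)]     # centred time codes, sum = 0

# least-squares straight line: a = mean(y), b = sum(xy)/sum(x^2)
a = sum(y) / n
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
trend = [a + b * xi for xi in x]

# step (iii): each observation as a percentage of its trend value
ratios = [100 * yi / t for yi, t in zip(y, trend)]

# steps (iv)-(vi): average the ratios quarter by quarter, then rescale
# so the four indices average 100 (i.e. sum to 400)
avg = [sum(ratios[q::4]) / (n // 4) for q in range(4)]
index = [r * 400 / sum(avg) for r in avg]
print([round(i, 2) for i in index])
```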

Sometimes the median is used in place of the arithmetic average of the ratios-to-trend. The choice depends upon circumstances, but the median is preferred if several erratic ratios are found. In fact, if a fairly large number of years,


say 15 or 20, is used in the computation, it is not uncommon to omit extremely erratic ratios from the average of monthly ratios. For a small number of years only the arithmetic average should be used.

This method has the advantage of simplicity and ease of interpretation. Although it makes allowance for the trend, it may be influenced by errors in the calculation of the trend, and also by cyclical and erratic influences. This source of possible error is reduced by selecting a period of time in which depression is offset by prosperity.

Illustration : Find seasonal variations by the ratio-to-trend method from the

following data :

Year 1st Quarter 2nd Quarter 3rd Quarter 4th Quarter

2000 30 40 36 34

2001 34 52 40 44

2002 40 58 54 48

2003 54 76 68 62

2004 80 92 86 82

Solution: To find seasonal variations by the ratio-to-trend method, the trend for the yearly data is obtained first and then converted into quarterly trend values.

Average 92.78 118.28 102.92 89.12

The average of the quarterly averages of the trend ratios = (92.78 + 118.28 + 102.92 + 89.12) / 4 = 100.775

Quarterly seasonal index for the 1st quarter = (92.78 / 100.775) × 100 = 92.07

Quarterly seasonal index for the 2nd quarter = (118.28 / 100.775) × 100 = 117.37

Quarterly seasonal index for the 3rd quarter = (102.92 / 100.775) × 100 = 102.13

Quarterly seasonal index for the 4th quarter = (89.12 / 100.775) × 100 = 88.43

The total of the quarterly seasonal indices should equal 400 (and for monthly indices, 1,200); here 92.07 + 117.37 + 102.13 + 88.43 = 400.

Merits

(i) This method is based on a logical procedure for measuring seasonal variations. It has an advantage over the moving-average method in that there is a ratio-to-trend value for each month for which data are available, so it avoids the loss of data inherent in moving averages. If the period of the time series is very short, this advantage becomes even more prominent.


(ii) It is easy to understand.

Limitations

If the cyclical changes in the time series are very wide, the trend can never follow the actual data as closely as a 12-month moving average does. A seasonal index computed by the ratio-to-trend method will therefore carry more bias.

4. Link Relatives Method

Among all the methods of measuring seasonal variation, the link relatives method is the most difficult. When this method is adopted, the following steps are taken to calculate the seasonal variation indices:

(i) Calculate the link relatives of the seasonal figures. Link relatives are calculated by dividing the figure of each season by the figure of the immediately preceding season and multiplying by 100.

These percentages are called link relatives since they link each month (or quarter

or other time period) to the preceding one.

(ii) Calculate the average of the link relatives for each season. The arithmetic average may be used, but the median is probably better, since the arithmetic average gives undue weight to extreme cases which are not primarily due to seasonal influences.

(iii) Convert these averages into chain relatives on the base of the first season.

(iv) Calculate the chain relative of the first season on the basis of the last season. There will be some difference between this chain relative and the one computed in the previous step; the difference is due to long-term changes, and it is therefore necessary to correct the chain relatives.

(v) For correction, the chain relative of the first season calculated by the first method is deducted from the chain relative of the first season calculated by the second method. The difference is divided by the number of seasons, and the resulting figure, multiplied by 1, 2, 3 (and so on), is deducted respectively from the chain relatives of the 2nd, 3rd, 4th (and so on) seasons. These are the corrected chain relatives.

(vi) Express the corrected chain relatives as percentages of their average. These provide the required seasonal indices by the method of link relatives. The following example illustrates the process.
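The six steps can be sketched in Python; the quarterly figures are hypothetical, and arithmetic means of the link relatives are used although the text notes the median is often preferable.

```python
# Link relatives method on hypothetical quarterly data.
y = [30, 40, 36, 34, 34, 52, 40, 44, 40, 58, 54, 48]
p = 4

# (i)-(ii): link relatives, collected and averaged by quarter
by_q = {q: [] for q in range(p)}
for i in range(1, len(y)):
    by_q[i % p].append(100 * y[i] / y[i - 1])
avg = [sum(by_q[q]) / len(by_q[q]) for q in range(p)]

# (iii): chain relatives on the base of the first quarter (= 100)
chain = [100.0]
for q in range(1, p):
    chain.append(chain[-1] * avg[q] / 100)

# (iv): chain relative of the first quarter from the last quarter
new_q1 = chain[-1] * avg[0] / 100

# (v): spread the discrepancy evenly over the quarters
d = (new_q1 - 100) / p
corrected = [c - d * q for q, c in enumerate(chain)]

# (vi): express corrected chain relatives as percentages of their average
m = sum(corrected) / p
index = [100 * c / m for c in corrected]
print([round(i, 2) for i in index])
```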


Chain relative of the first quarter (on the basis of the first quarter) = 100

Chain relative of the first quarter (on the basis of the last quarter) = 106.7

Difference between these chain relatives = 106.7 – 100 = 6.7

Difference per quarter = 6.7/4 = 1.675

Adjusted chain relatives are obtained by subtracting 1 × 1.675, 2 × 1.675, 3 ×

1.675 from the chain relatives of 2nd , 3rd and 4th quarters, respectively.

Seasonal variation indices are calculated as below:

Seasonal variation index = (Corrected chain relative / Average of corrected chain relatives) × 100

Meaning of “Normal” in Business Statistics

Business is often said to be “above normal” or “below normal”. When so used the

term “normal” is generally recognized to mean a level of activity which is

characterized by the presence of basic trend and seasonal variation. This implies

that the influence of business cycles and erratic fluctuations on the level of activity

is assumed to be insignificant. Therefore, the trend value for any period, adjusted by the seasonal index for that period, gives us an estimate of the normal level of activity during that period.

Measuring Cycle as the Residual

Cyclical variations are measured as the residual: whatever remains after the elimination of secular trend and seasonal variations from the time series is said to be composed of cyclical variations and irregular movements.
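A minimal sketch of this residual calculation under the multiplicative model Y = T × S × C × I, with made-up numbers:

```python
# "Normal" = trend x seasonal index / 100; the cyclical-irregular
# component is the ratio of actual to normal. All figures are made up.
actual   = [105.0, 98.0, 112.0, 95.0]
trend    = [100.0, 101.0, 102.0, 103.0]
seasonal = [102.0, 97.0, 105.0, 96.0]   # seasonal indices, per cent

normal = [t * s / 100 for t, s in zip(trend, seasonal)]
cyclical_irregular = [100 * a / n for a, n in zip(actual, normal)]
print([round(c, 2) for c in cyclical_irregular])
```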

Second Degree Parabola

The simplest form of non-linear trend is the second degree parabola. It is used to find the long-term trend. We use the following equation for finding a second degree trend:

Yc = a + bX + cX²

To find the values of a, b and c we use the following three normal equations:

∑Y = Na + b∑X + c∑X²

∑XY = a∑X + b∑X² + c∑X³

∑X²Y = a∑X² + b∑X³ + c∑X⁴


A second degree trend equation is appropriate for the secular trend component of a time series when the data do not fall in a straight line.

Illustration: Fit a parabola (Yc = a + bX + cX²) to the following data:

Years 1 2 3 4 5 6 7

Values 35 38 40 42 36 39 45


– 84c = – 4

c = 4/84 = 0.0476 ≈ 0.05

By substituting the value of c in equation (i) we get the value of a


7a + 28 × 4/84 = 275

7a = 275 – 1.33

a = 273.67/7 = 39.09

We may get the value of b with the help of equation (ii)

28b = 28

b = 1

The required equation would be:

Yc = 39.09 + 1X + 0.05X²

= 39.09 + X + 0.05X²

With the help of the above equation we can estimate the value for year 8, where X = 4:

Yc = 39.09 + 4 + 0.05(4)²

= 39.09 + 4 + 0.8 = 43.89
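The fit above can be checked in plain Python. With centred time codes, ∑X = ∑X³ = 0 and the normal equations decouple; the small differences from the text's figures come from its rounding of c to 0.05.

```python
# Re-doing the parabola fit Yc = a + bX + cX^2. With centred time codes,
# sum(X) = sum(X^3) = 0, so the normal equations reduce to:
#   sumY = n*a + c*sumX2,  sumXY = b*sumX2,  sumX2Y = a*sumX2 + c*sumX4.
y = [35, 38, 40, 42, 36, 39, 45]
n = len(y)
x = [i - n // 2 for i in range(n)]          # -3 .. 3

sy   = sum(y)
sx2  = sum(xi ** 2 for xi in x)
sx4  = sum(xi ** 4 for xi in x)
sxy  = sum(xi * yi for xi, yi in zip(x, y))
sx2y = sum(xi ** 2 * yi for xi, yi in zip(x, y))

b = sxy / sx2
c = (n * sx2y - sx2 * sy) / (n * sx4 - sx2 ** 2)
a = (sy - c * sx2) / n
print(round(a, 2), b, round(c, 4))          # 39.1 1.0 0.0476

y8 = a + b * 4 + c * 4 ** 2                 # year 8 -> X = 4
print(round(y8, 2))                         # 43.86 (43.89 with c rounded to 0.05)
```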

Exponential Trend

The equation for an exponential trend is of the form y = abˣ.

Taking logs of both sides, we get log y = log a + x log b.

To get the values of a and b we have the normal equations:

∑log y = N log a + log b ∑x

∑(x · log y) = log a ∑x + log b ∑x²

When the origin is chosen so that ∑x = 0, solving these equations gives:

log a = ∑log y / N and log b = ∑(x · log y) / ∑x²

Illustration : The production of certain raw material by a company in lakh tons for

the years 1996 to 2002 are given below:

Year : 1996 1997 1998 1999 2000 2001 2002

Production : 32 47 65 92 132 190 275

Estimate the production figure for the year 2003 using an equation of the form y = abˣ, where x = years and y = production.

Solution :


For 2003, x = 4, and log y = 1.9704 + 0.154(4) = 2.5864

y = antilog 2.5864 = 385.9

Thus the estimated production for 2003 would be 385.9 lakh tons.
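The same fit can be verified in Python using base-10 logs and centred time codes; the slight difference from 385.9 comes from the text's rounding of the coefficients.

```python
import math

# Exponential trend y = a * b**x fitted by least squares on logs, with
# centred time codes x = -3 .. 3 so that sum(x) = 0.
y = [32, 47, 65, 92, 132, 190, 275]
x = [i - 3 for i in range(7)]

logy = [math.log10(v) for v in y]
log_a = sum(logy) / len(y)
log_b = sum(xi * ly for xi, ly in zip(x, logy)) / sum(xi * xi for xi in x)
print(round(log_a, 4), round(log_b, 4))      # 1.9704 0.1544

est_2003 = 10 ** (log_a + log_b * 4)         # 2003 -> x = 4
print(round(est_2003, 1))                    # ~387; 385.9 with rounded coefficients
```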

SIMPLE LINEAR REGRESSION

In correlation we study the linear relationship between two random variables x and y. We now look at the line in the xy-plane that best fits the data (x1, y1), …, (xn, yn). Recall that the equation of a straight line is y = bx + a, where

b = the slope of the line

a = the y-intercept, i.e. the value of y where the line intersects the y-axis

For our purposes we write the equation of the best fit line as ŷ = bx + a.


The best fit line is the line for which the sum of the squared vertical distances between the n data points and the line is as small as possible; that is, we find the line that minimizes ∑(yi – (bxi + a))².

Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by ŷ = bx + a, where

b = ∑(xi – x̄)(yi – ȳ) / ∑(xi – x̄)² and a = ȳ – b·x̄


Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (x̄, ȳ) and has equation ŷ = ȳ + b(x – x̄).

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can

be viewed as either the population covariance and variance or the sample

covariance and variance. Thus a and b can be calculated in Excel as follows where

R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)
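Plain-Python equivalents of these two formulas, following b = cov(x, y)/var(x) and a = ȳ − b·x̄; the data points are made up.

```python
# Plain-Python versions of Excel's SLOPE and INTERCEPT, using
# b = cov(x, y) / var(x) and a = mean(y) - b * mean(x). Data are made up.
def slope(ys, xs):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
    var = sum((xi - mx) ** 2 for xi in xs)
    return cov / var

def intercept(ys, xs):
    return sum(ys) / len(ys) - slope(ys, xs) * sum(xs) / len(xs)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(slope(ys, xs), intercept(ys, xs))   # 0.6 and 2.2 (up to float rounding)
```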

Property 1: b = r · (sy / sx), where r is the correlation coefficient and sx, sy are the standard deviations of x and y.

Proof: By Definition 2 of Correlation, r = cov(x, y) / (sx · sy). Since b = cov(x, y) / var(x) = cov(x, y) / sx², it follows that b = r · sy / sx.


Excel Functions: Excel provides the following functions for forecasting the value

of y for any x based on the regression line. Here R1 = the array of y data values

and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x.

Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b =

SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values

corresponding to x values stored in array R2, based on the regression line

calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding

to the x values in R3 based on the regression line based on the x values stored in

array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted

values of y. Then enter TREND and a left parenthesis. Next highlight the array of

observed values for y (array R1), enter a comma and highlight the array of

observed values for x (array R2) followed by a right parenthesis. Finally

press Ctrl-Shift-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the

predicted values of y. Then enter TREND and a left parenthesis. Next highlight the

array of observed values for y (array R1), enter a comma and highlight the array of

observed values for x(array R2) followed by another comma and highlight the

array R3 containing the values for x for which you want to predict y values based

on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.
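What FORECAST and TREND compute can be sketched in Python; the helper below regresses y on x and evaluates the fitted line at new x values (the data points are made up).

```python
# What FORECAST(x, R1, R2) computes: a + b*x from the regression of the
# y values on the x values; TREND is the same thing applied elementwise.
def forecast(x, ys, xs):
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
         / sum((xi - mx) ** 2 for xi in xs))
    a = my - b * mx
    return a + b * x

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(forecast(6, ys, xs), 2))               # 5.8
print([round(forecast(v, ys, xs), 2) for v in [6, 7]])
```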

Excel 2016 Function: Excel 2016 introduces a new

function FORECAST.LINEAR, which is equivalent to FORECAST.

Example 1: Calculate the regression line for the data in Example 1 of One Sample

Hypothesis Testing for Correlation and plot the results.


Using Theorem 1 and the observation following it, we can calculate the slope b and

y-intercept a of the regression line that best fits the data as in Figure 1 above.

Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above, then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).
