HS550: Statistical Methods (3-0-2-4)

(Feb-June Semester, 2019)

Course instructor: Shyamasree Dasgupta


Office: A1-303; Phone: 1905267122
Drop me an email anytime at
shyamasree@iitmandi.ac.in
Indian Institute of Technology Mandi

2/28/2019 1
Module 1: Representation of Data and Descriptive Statistics
[Week 1-3 (7 lectures)]*

• How do you represent the data that you have: tables and charts?
• What is the “one” representative value of the dataset?
• What is the average deviation from the representative value?
• If you have data on two variables, how do you check their relationship?

* No classes on 20th and 22nd Feb

Books (Also consult Basic Statistics by Nagar and Das)

Let’s appreciate the need for
“quantity” as well as “quality”

A comparison between these two events is possible only when both quantitative and qualitative information is available.

Jallianwala Bagh:
• The British Indian Army killed a huge number of civilians in Jallianwala Bagh
• Around 400 people were killed
• Troops opened fire on a group of unarmed, nonviolent protesters and pilgrims
• It happened in the year 1919

Thuggees:
• The Thuggees killed a huge number of civilians in various parts of India
• Around 2 million individuals were killed
• The killings were part of the criminal activities of the Thuggees and were related to robbery
• They happened over a period of roughly 600 years (1290-1870)

• The distinction between quantitative and qualitative information, as it is often articulated, is misleading.

• Both are equally important for knowing whether an event or a finding is typical or atypical.

Tracing the history of data representation

Mortality Table of John Graunt (1661)

Overcoming the problem of “can’t see the forest for the trees”

Scottish imports/exports by W. Playfair (1786)

1859: Florence Nightingale’s polar area diagram

Not the numbers but the arrangement of the numbers tells the story!

Types of Data

• Nominal variable (or attribute-based variable): a category, e.g. Pass/Fail
• Cardinal variable: a number, e.g. a height of 5.5 ft
• Ordinal variable: a rank, e.g. 1st/2nd/3rd/...../last but one/last

Observe Data_1 carefully and identify the variables as nominal, ordinal or cardinal. Also observe that all the cardinal variables can be converted to ordinal variables.
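
The conversion from a cardinal to an ordinal variable can be sketched in a few lines of Python. This is only an illustration: the function name `to_ranks` and the choice to rank from largest to smallest, with tied values sharing a rank, are assumptions. The five heights are taken from the student-height example in these slides.

```python
def to_ranks(values):
    """Convert cardinal values to ordinal ranks (1 = largest).

    Tied values share the same rank (an assumption; other tie rules exist).
    """
    ordered = sorted(set(values), reverse=True)
    rank_of = {value: position + 1 for position, value in enumerate(ordered)}
    return [rank_of[value] for value in values]


heights_ft = [5.2, 5.9, 4.9, 5.6, 6.1]  # cardinal data
ranks = to_ranks(heights_ft)            # ordinal data
print(ranks)  # [4, 2, 5, 3, 1]
```

The ranks keep the ordering of the heights but discard the distances between them, which is why the conversion only works in one direction.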

What are the heights of the students in IIT Mandi?
Height of Roll no. ..... is 5.2 ft
Height of Roll no. ..... is 4.9 ft
and so on....

When data are in a raw format, the first task is to arrange them in a meaningful manner. You may lose some of the details while doing so, but that’s fine!

Arrangement makes life easy!

• Height (in ft) of 30 students:

5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8,
5.7, 6.0, 5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7,
6.0, 4.8, 5.7, 5.9, 5.4, 5.2, 4.8, 5.4, 5.2,
5.2, 5.4, 5.7, 5.7
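
One simple way to arrange these raw values is to sort them and count how often each height occurs. The sketch below uses Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Heights (in ft) of the 30 students listed above
heights_ft = [
    5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8,
    5.7, 6.0, 5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7,
    6.0, 4.8, 5.7, 5.9, 5.4, 5.2, 4.8, 5.4, 5.2,
    5.2, 5.4, 5.7, 5.7,
]

freq = Counter(heights_ft)  # height -> number of students

# Print the distinct heights in ascending order with their frequencies
for height in sorted(freq):
    print(f"{height:.1f} ft: {freq[height]}")
```

The arranged list shows at a glance that the heights run from 4.8 ft to 6.2 ft and that 5.7 ft is the most common value (6 students), at the cost of no longer knowing which student has which height.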

Tabular representation of data

A table is prepared to summarise the data. The table that you create out of any raw data should depend on your research objective. The same data can be tabulated in various ways to answer the particular research question that you are trying to address. Ask yourself the following questions before you proceed to create any table.

• Tables based on cardinal data? Here you are the one to construct class intervals, which will act as categories.
• Tables based on nominal data? Here you know your categories.
• A table for one variable? This is rather simple!
• A table for more than one variable? Think carefully about how you would like to create subgroups of a variable!

Representing one variable in a table
Table 1: Distribution of households according to ownership of agricultural land (Based on Data_1)

Class         Hectares   Acres    Midpoint, acres (xi)   No. of households (fi)   % of households
Landless      0          0        0                      11                       46%
Marginal      <1         <2.5     1.25                   2                        8%
Small         1-2        2.5-5    3.75                   2                        8%
Semi medium   2-4        5-10     7.5                    0                        0%
Medium        4-10       10-30    20                     2                        8%
Large         >10        >30      45                     7                        29%
Total                                                    24                       100%

Observe that there is a logic behind defining the classes in this manner. [Note: In India, the following categories of landholdings are generally used: Marginal: <1 ha; Small: 1.01–2 ha; Semi-medium: 2–4 ha; Medium: 4–10 ha; Large: >10 ha. To use these categories as your classes, you have to convert the landholding from acres to hectares (ha), where 1 ha = 2.5 acres (approx.).]
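
The conversion and classification described in the note can be sketched as below. The function name `landholding_class` and the treatment of values that fall exactly on a class boundary are assumptions made for this illustration; the conversion factor is the approximate one given in the note.

```python
ACRES_PER_HECTARE = 2.5  # approximate factor from the note above


def landholding_class(acres):
    """Classify a landholding (in acres) using the hectare-based categories.

    Boundary values are assigned to the lower class (an assumption).
    """
    hectares = acres / ACRES_PER_HECTARE
    if hectares == 0:
        return "Landless"
    if hectares < 1:
        return "Marginal"
    if hectares <= 2:
        return "Small"
    if hectares <= 4:
        return "Semi-medium"
    if hectares <= 10:
        return "Medium"
    return "Large"


# Hypothetical holdings, in acres, chosen only to exercise each branch
for acres in [0, 2, 4, 8, 20, 40]:
    print(acres, "acres ->", landholding_class(acres))
```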

Representing 2 variables in one table
Table 2: Distribution of households according to their castes (mentioned as 'category') in various villages (Based on Data_1)

                  Caste
Village           SC   ST   OBC   Gen   Total
Paschim Malipur   0    0    0     9     9
Sherpara          0    0    1     6     7
Madanpur          0    0    0     3     3
Gopalpur          3    0    0     1     4
Purushia          0    1    0     1     2
Total             3    1    1     20    25
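
The row and column totals of a two-variable table can be recomputed from its body, which is a useful check after tabulating by hand. A minimal sketch using the counts of Table 2 (the variable names are illustrative):

```python
castes = ["SC", "ST", "OBC", "Gen"]

# Body of Table 2: village -> household counts in the order of `castes`
counts = {
    "Paschim Malipur": [0, 0, 0, 9],
    "Sherpara":        [0, 0, 1, 6],
    "Madanpur":        [0, 0, 0, 3],
    "Gopalpur":        [3, 0, 0, 1],
    "Purushia":        [0, 1, 0, 1],
}

# Row totals: households per village
row_totals = {village: sum(row) for village, row in counts.items()}

# Column totals: households per caste (zip(*rows) transposes the table)
col_totals = [sum(column) for column in zip(*counts.values())]

grand_total = sum(row_totals.values())

print(row_totals)
print(dict(zip(castes, col_totals)))
print(grand_total)
```

The grand total must come out the same whether the row totals or the column totals are added up.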

Table 3: Distribution of households according to caste, religion and monthly expenditure
(Parts of the table labelled on the slide: Title, Stub, Caption, Body of the Table)

Caste                 SC            ST            OBC           Gen           Total
Religion              H   M   T     H   M   T     H   M   T     H   M   T     H   M   T
Expenditure in Rs.
<5000                 2   0   2     0   0   0     0   1   1     5   5   10    7   6   13
5000-10000            1   0   1     0   0   0     0   0   0     4   2   6     5   2   7
10000-15000           0   0   0     1   0   1     0   0   0     1   1   2     2   1   3
>15000                0   0   0     0   0   0     0   0   0     0   1   1     0   1   1
Total                 3   0   3     1   0   1     0   1   1     10  9   19    14  10  24

Observe that there is a column that displays the total number of households under each category. H: Hindu, M: Muslim, T: Total. Since one data point is missing under the variable monthly expenditure, the total count is 24.

Frequency Distribution

Table 1: Distribution of households according to ownership of agricultural land

xi: midpoint; fi: no. of households (frequency); fi/N: relative frequency; Fi: cumulative frequency; fi/class length: frequency density

Class      Acres    xi     fi     fi/N   Fi (< type)   Fi (> type)   fi/class length
Landless   0        0      11     0.46   11            24            -
Marginal   0-2.5    1.25   2      0.08   13            13            0.8
Small      2.5-5    3.75   2      0.08   15            11            0.8
Semi-med   5-10     7.5    0      0.00   15            9             0
Medium     10-30    20     2      0.08   17            9             0.1
Large      30-60    45     7      0.29   24            7             0.23
Total               N=24   1
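
The derived columns of this frequency distribution can be computed from the class limits and the frequencies alone. The sketch below is one way to do it in Python; the variable names are illustrative, and the landless class is given no frequency density because its class length is zero.

```python
from itertools import accumulate

# (class name, (lower, upper) limits in acres); None marks the zero-width class
classes = [
    ("Landless", None),
    ("Marginal", (0, 2.5)),
    ("Small", (2.5, 5)),
    ("Semi-med", (5, 10)),
    ("Medium", (10, 30)),
    ("Large", (30, 60)),
]
freqs = [11, 2, 2, 0, 2, 7]  # fi
N = sum(freqs)

relative = [f / N for f in freqs]                     # fi / N
less_than = list(accumulate(freqs))                   # cumulative, "< type"
more_than = [N - c + f for c, f in zip(less_than, freqs)]  # cumulative, "> type"
density = [
    None if limits is None else f / (limits[1] - limits[0])  # fi / class length
    for (_, limits), f in zip(classes, freqs)
]

for (name, _), f, r, lt, mt, d in zip(classes, freqs, relative, less_than, more_than, density):
    print(name, f, round(r, 2), lt, mt, "-" if d is None else round(d, 2))
```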

Diagrammatic representation of data

Figure 1: Distribution of households according to ownership of agricultural land
[Bar/column diagram: No. of households by landholding class (Landless, Marginal, Small, Semi medium, Medium, Large). Pie chart: percentage share of each landholding class.]

Figure 2: Distribution of households according to their castes in various villages
[Bar diagram: No. of households by village (Paschim Malipur, Sherpara, Madanpur, Gopalpur, Purushia), with one bar per caste (SC, ST, OBC, Gen). Stacked bar diagrams of the same data: one scaled to 100% and one in number of households.]

Figure 3: Distribution of households according to caste, religion and monthly expenditure
[Stacked bar diagram: No. of households by caste (SC, ST, OBC, Gen) and religion (H, M, T), stacked by monthly expenditure class (<5000, 5000-10000, 10000-15000, >15000).]

• Histogram is another way of representation
– isto-s – ‘mast’ / something set upright
– gram-ma – something written / graphics
– Histogram

Not really!!
This is a column diagram.
Also remember, the term ‘histogram’ was coined by the statistician Karl Pearson while talking about the geometry of statistics (1892).

• Have a careful look at the monthly expenditure data

• Identify the highest value and the lowest value

• The highest is 18000 and the lowest is 2000 (in INR)

• So, consider the range INR 2000 to INR 18000

• Bin the range into a series of intervals (contiguous but non-overlapping) and identify the frequency corresponding to each interval

• Bins may extend below the lowest value and above the highest value
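
The binning steps above can be sketched as follows. The helper names `bin_edges` and `bin_index`, the bin width of 5000 and the convention that each bin includes its lower limit but not its upper limit are assumptions made for this illustration.

```python
import bisect
import math


def bin_edges(lowest, highest, width):
    """Return equally spaced edges that cover [lowest, highest].

    The first edge may lie below the lowest value and the last edge
    above the highest value.
    """
    start = math.floor(lowest / width) * width
    stop = math.ceil(highest / width) * width
    if stop == highest:  # keep the highest value inside a half-open bin
        stop += width
    return list(range(start, stop + 1, width))


def bin_index(value, edges):
    """Index i of the half-open bin [edges[i], edges[i + 1]) holding value."""
    return bisect.bisect_right(edges, value) - 1


edges = bin_edges(2000, 18000, 5000)
print(edges)                    # [0, 5000, 10000, 15000, 20000]
print(bin_index(2000, edges))   # 0, i.e. the bin 0-5000
print(bin_index(18000, edges))  # 3, i.e. the bin 15000-20000
```

With half-open bins, an expenditure of exactly 5000 is counted in 5000-10000, so every value belongs to exactly one interval.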

Frequency Table

Class interval (expenditure in INR)   Midpoint (xi)   Frequency (fi)   Freq. density
0-5000                                2500            13               0.0026
5000-10000                            7500            7                0.0014
10000-15000                           12500           3                0.0006
15000-20000                           17500           1                0.0002
Total                                                 24               0.0048
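
The midpoint and frequency density columns follow directly from the class limits and frequencies: the midpoint is the average of the two limits, and the frequency density is the frequency divided by the class length. A short sketch with illustrative variable names:

```python
intervals = [(0, 5000), (5000, 10000), (10000, 15000), (15000, 20000)]  # INR
freqs = [13, 7, 3, 1]  # fi

midpoints = [(lower + upper) / 2 for lower, upper in intervals]
densities = [f / (upper - lower) for f, (lower, upper) in zip(freqs, intervals)]

for (lower, upper), x, f, d in zip(intervals, midpoints, freqs, densities):
    print(f"{lower}-{upper}: midpoint {x:.0f}, frequency {f}, density {d:.4f}")
```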

[Histogram with frequency curve. Horizontal axis: Monthly expenditure in INR (class midpoints 2500, 7500, 12500, 17500). Vertical axis: Proportion of households (scale 0.0000 to 0.0050).]

1. Bar widths are proportional to the class widths
2. Heights are proportional to frequency density
3. The area of each bar represents the frequency
4. Notice that a histogram is appropriate even when the class intervals are unequal
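
Points 1 to 4 can be checked numerically with the unequal landholding classes from the frequency distribution table. In this sketch each bar is described by its width (class length) and height (frequency density), and the product of the two recovers the frequency. The landless class is left out because it has zero width; the variable names are illustrative.

```python
import math

# (class name, lower limit, upper limit in acres, frequency)
classes = [
    ("Marginal", 0, 2.5, 2),
    ("Small", 2.5, 5, 2),
    ("Semi-med", 5, 10, 0),
    ("Medium", 10, 30, 2),
    ("Large", 30, 60, 7),
]

bars = []
for name, lower, upper, freq in classes:
    width = upper - lower   # bar width = class length
    height = freq / width   # bar height = frequency density
    bars.append((name, width, height))

for (name, width, height), (_, _, _, freq) in zip(bars, classes):
    area = width * height
    print(f"{name}: width {width}, height {height:.3f}, area {area:.1f}, frequency {freq}")
```

Marginal and Medium both contain 2 households, but the Medium bar is eight times wider and therefore eight times lower. Plotting raw frequencies as heights would hide this difference.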

Central tendency and dispersion

Type of data          Measure of central tendency   Measure of dispersion
Cardinal              Mean                          Standard deviation
Ordinal               Median                        Quartile deviation
Nominal/Attributes    Mode                          Range
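
Most of the measures in this table are available in Python's standard `statistics` module. The sketch below applies them to the 30 student heights and to the caste totals of Table 2. Using the population standard deviation (`pstdev`) and the default quartile method of `statistics.quantiles` are assumptions; other conventions give slightly different values.

```python
import statistics

# Cardinal data: heights (in ft) of the 30 students
heights_ft = [
    5.2, 5.9, 4.9, 5.6, 6.1, 4.9, 5.5, 5.8,
    5.7, 6.0, 5.0, 6.2, 5.7, 4.8, 5.8, 5.6, 5.7,
    6.0, 4.8, 5.7, 5.9, 5.4, 5.2, 4.8, 5.4, 5.2,
    5.2, 5.4, 5.7, 5.7,
]

mean = statistics.mean(heights_ft)
std_dev = statistics.pstdev(heights_ft)  # population SD (assumption)

# Measures that only need the ordering of the values
median = statistics.median(heights_ft)
q1, _, q3 = statistics.quantiles(heights_ft, n=4)
quartile_deviation = (q3 - q1) / 2

# Nominal data: caste of each household, rebuilt from the totals of Table 2
castes = ["SC"] * 3 + ["ST"] * 1 + ["OBC"] * 1 + ["Gen"] * 20
mode = statistics.mode(castes)

print(f"mean {mean:.2f}, SD {std_dev:.2f}")
print(f"median {median:.1f}, quartile deviation {quartile_deviation:.2f}")
print("mode", mode)
```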

Correlation:
Chalk and Talk!

Food for thought…

1. The household that spends INR 18000 has 14 family members, whereas the household that spends INR 7000 has only 3 family members! How do you take this information into account?

2. A histogram can be drawn with unequal class intervals. Why? Try creating a histogram based on the data on landholding.

3. How do you calculate the appropriate central value and dispersion for the variables in the data set?
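
For question 1, one possible approach (an assumption, not necessarily the intended answer) is to compare expenditure per family member instead of expenditure per household. The household labels below are hypothetical.

```python
# household label -> (monthly expenditure in INR, number of family members)
households = {
    "A": (18000, 14),
    "B": (7000, 3),
}

per_capita = {
    label: expenditure / members
    for label, (expenditure, members) in households.items()
}

for label, value in per_capita.items():
    print(f"Household {label}: INR {value:.2f} per member")
```

On a per-member basis the ordering reverses: household B spends about INR 2333 per member, while household A spends about INR 1286.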

Questions?

Cartoon courtesy:
The Cartoon Guide to Statistics
by Larry Gonick and Woollcott Smith
