By
Dr. Javed Iqbal
Associate Professor, IBA Karachi
Suggested Text:
Gerald Keller (2014). Statistics for Management and Economics. Cengage Learning
Nowadays, new data-related fields are developing at the intersection of Statistics and Computer Science, such as ‘Data Science’, ‘Big Data Analytics’, ‘Machine Learning’ and ‘Data Mining’. Computer science aided by statistical knowledge is a rapidly expanding area in high demand.
Christopher Scott (2005) [Proceedings of the 2005 CBMS Network Meeting], an expert in public policy and water management, remarks that wherever possible, public policy decisions should be informed by careful analysis using sound and transparent data. Evidence-based policymaking means the systematic and rigorous use of statistics in decision making. A data-driven procedure is a decision-making process that involves collecting data, extracting patterns and facts from that data, and using those facts to make inferences that influence decisions. In other words, organizational decisions are based on actual data rather than on intuition or observation alone, which helps avoid bias in decision making. In practice, criteria other than those associated with evidence-based policymaking are often used to make public choices. These alternative criteria include: (i) power and influence of sectional interests, (ii) corruption, (iii) political ideology, (iv) arbitrariness and (v) anecdotes (hearsay rather than hard facts).
Countries: GDP per capita, human development index score, literacy rate, life
expectancy, infant mortality rate.
2. Time Series: Data obtained at several time points for an individual unit, e.g. the after-tax profit of the ICI company for the last 10 years, or the number of violent crimes recorded in Karachi over the last several years.
3. Panel: Data having both cross-section and time dimensions, e.g. the after-tax profit of a selected set of firms recorded for each year from 2010 to 2018, or the GDP of a selected set of countries recorded for the last 50 years.
Data Sources
1. Primary Data: Data collected by the researcher specifically for the purpose under study, e.g. by an experiment or by a survey (personal interview, telephone survey, self-administered questionnaire, internet survey etc.).
2. Secondary Data: Data that were collected by someone else and are available as a published document, web database, unpublished record etc. The researcher uses these data for his/her own purpose.
All companies listed on the PSX are required by law to disclose their financial statements to the public via their websites.
In addition to these, there are several important but sensitive databases that are not publicly accessible, e.g. the NADRA database.
3. IMF Databases
The International Monetary Fund publishes many important databases related to money and
finance for the UN member countries. An important document is the IFS (International Financial
Statistics) which is published at monthly, quarterly and annual frequencies. This document
contains time series data on many financial, macroeconomic and trade related variables for all
countries.
https://www.imf.org/en/Data
4. International Surveys and Indices
Many different indices are now published by various organizations. For example, the Heritage Foundation's Index of Economic Freedom (https://www.heritage.org/index/ranking) ranks countries with respect to economic freedom. The Pew Research Center (https://www.pewresearch.org/fact-tank/2017/04/05/christians-remain-worlds-largest-religious-group-but-they-are-declining-in-europe/) publishes various statistics, e.g. about world religions. The Human Development Index (http://hdr.undp.org/en/content/human-development-index-hdi), published by UNDP, ranks countries with respect to human development and its various components. This index was first developed by the Pakistani economist Dr. Mehboob-ul-Haq. The Global Peace Index (https://en.wikipedia.org/wiki/Global_Peace_Index) ranks countries with respect to peace indicators. The Corruption Perceptions Index (https://www.transparency.org/cpi2018) is published by Transparency International.
Bar Chart: Simple bar chart presents a rectangle corresponding to each category:
Source: Statistical Abstract of the United States, 2012, Table 1389
Sol: Open the Excel data file CO2.xls, select the two columns including headings, then go to Insert > Column Chart.
[Figure: bar chart "Top Carbon Dioxide CO2 Producing Countries"; vertical axis from 0 to 9,000; one bar each for Australia, Canada, China, Germany, India, Iran, Italy, Japan, Korea (South), Mexico, Russia, Saudi Arabia, South Africa, United Kingdom and United States]
Exercise1: The file crime.xls gives data on the number of crimes in different regions and provinces of Pakistan. Produce some bar charts and comment.
Exercise2: When will the world run out of oil? One way to judge is to determine
the oil reserves of the countries around the world. The next table displays the
known oil reserves of the top 15 countries. Graphically describe the figures. Use
Excel File oil.xls
Pie Chart: A circular diagram divided into sectors, each representing a category. The angle of each sector is proportional to the value of the category.
Example: The following information (Excel file accidents.xls) gives the total number of road accidents and the number of accidents per million population in Pakistan for the four provinces and Islamabad. Draw a pie chart.
[Figure: pie chart "Total Number of Road Accidents (2016-17)": KPK 44%, Punjab 40%, Sindh 9%, Baluchistan 4%, Islamabad 2%]
Sol: In Excel, select the data range (including heads), then use Insert > Chart (select Pie Chart). Then right-click any sector and, under Format Data Labels, select Category Name and Percentage. Alternatively, click the + box at the top of the graph and select Data Labels > More Options, then select Category Name and Percentage.
[Figure: pie chart "Total Number of Road Accidents Per Million Population (2016-17)": KPK 38%, Islamabad 35%, Punjab 11%, Baluchistan 10%, Sindh 6%]
The pie chart of total accidents shows that KPK accounts for the largest share of accidents, followed by Punjab. A better picture is obtained from the number of accidents per million population. The pie chart of accidents per million population shows that the Islamabad region has a very high share of accidents per capita (more precisely, per million persons) in Pakistan.
Exercise1: The following table lists the top 10 countries and amounts of oil
(millions of barrels annually) they exported to the United States in 2010.
Exercise2: The Excel file energy.xls gives the energy consumption pattern of Australia. The figures measure the heat content in metric tons (1,000 kilograms) of oil equivalent. Draw a graph that depicts these numbers.
Multiple Bar Chart: is used to present data on two or more categorical variables.
Sol: In Excel, after selecting the three columns including heads, go to Insert > Chart > Column Chart. Note that here the year category names are numeric, so it is better to put a character such as a quotation mark or a comma before the year so that Excel treats it as text. Otherwise Excel names the series Series1, Series2 etc.
[Figure: multiple bar chart "Number of Fatalities Related to Terrorism in Pakistan"; vertical axis from 0 to 4000; bars grouped by category (e.g. Security Operation, Target Killings, Others)]
Exercise1: The following table lists the percentage of males and females in five age groups that did not have health insurance in the United States in September 2008. Use a graphical technique to present these figures. Use the data file health.xls
Exercise2: Has the educational level of adults changed over the past two decades?
To help answer this question, the Bureau of Labor Statistics compiled the
following table, which lists the number (1,000) of adults 25 years of age and older
who are employed. Use a graphical technique to present these figures. Briefly
describe what the chart tells you. Use Excel file school.xls
Describing Quantitative Data through Tables and Graphs:
Frequency Distribution:
A table showing the ranges/intervals of data values together with number of values
falling in each interval.
Example: Consider the data on life expectancy of 205 countries and territories of
the world for year 1980. The data are obtained from World Development
Indicators and are available in the file lifeexpectancy.xls.
Sol: A few steps are needed before making a sensible frequency distribution in Excel.
Step1: Calculate descriptive statistics to find the range of the data, including the minimum and maximum values. Keeping these in view, choose the upper limits of the intervals (Excel calls them Bins), e.g. 25, 30 etc.
In Excel’s Data tab select Data Analysis and then Histogram. Enter the input data range and the Bin range and select chart output. In the resulting histogram, right-click any bar, select Format Data Series and reduce Gap Width to 2. The histogram looks like this:
[Figure: histogram of life expectancy; horizontal axis shows the bins 25, 30, 35, ..., 80 and More; vertical axis shows frequency]
Note that in the histogram the upper limits are shown at the center of each bar, although they are in fact upper limits of the intervals. The distribution of life expectancy appears to be highly negatively skewed, meaning that it has a long left tail formed by countries with very low life expectancy in 1980.
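The binning rule used by Excel's Histogram tool can be sketched in Python. This is only an illustration: the `sample` values and the `bins` below are hypothetical and are not taken from lifeexpectancy.xls, and the function name `frequency_distribution` is introduced here for the example. Each value is counted in the first bin whose upper limit is greater than or equal to it, and values above the last limit fall into "More".

```python
from bisect import bisect_left

def frequency_distribution(values, upper_limits):
    """Count values per interval using upper limits (bins).

    A value v is counted in the first bin whose upper limit is >= v;
    values above the last upper limit are counted under 'More'.
    """
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)  # the last slot is the 'More' bin
    for v in values:
        counts[bisect_left(limits, v)] += 1
    labels = [str(u) for u in limits] + ["More"]
    return list(zip(labels, counts))

# Hypothetical life-expectancy values (NOT from lifeexpectancy.xls)
sample = [38.2, 44.9, 45.0, 52.3, 58.7, 61.0, 66.4, 69.9, 70.0, 73.5, 76.1, 81.2]
bins = [40, 50, 60, 70, 80]

table = frequency_distribution(sample, bins)
for label, count in table:
    print(f"{label:>5} {count}")
```

Note that the value 70.0 is counted in the bin labelled 70, which is why the labels should be read as upper limits.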
Exercise1: Do a similar exercise with the latest data of 2017 and see if the pattern has changed. The data are in sheet 2.
Median: is defined as the central value of the data. It is obtained by arranging the values in ascending order and taking the middle value if the number of values is odd, or the average of the two middle values if the number of values is even. In general, the median can be defined as the \(\frac{n+1}{2}\)th ordered value of the data set.
Mode: is defined as the value that occurs most frequently in a set of data. In the above example the value zero occurs twice, so the mode for these data is 0, but as can be seen the mode is a very poor measure of center in this case.
Which measure to use? With three measures from which to choose, which one should we use? There are several factors to consider when choosing a measure of central location. The mean is generally our first selection. However, there are several circumstances in which the median is better. The mode is seldom the best measure of central location. One advantage of the median is that it is not as sensitive to extreme values as the mean. Thus for a highly skewed distribution the median is preferred over the mean. Examples are income and size distributions, e.g. the size or sales of companies.
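The sensitivity of the mean to extreme values can be seen in a short Python sketch. The `incomes` list is hypothetical and chosen only so that one value is far larger than the rest.

```python
import statistics

# Hypothetical monthly incomes (in thousands); the last value is extreme
incomes = [30, 32, 35, 35, 38, 40, 500]

mean_income = statistics.mean(incomes)      # pulled upward by the value 500
median_income = statistics.median(incomes)  # middle of the ordered values
mode_income = statistics.mode(incomes)      # most frequent value

print("mean:", round(mean_income, 2))
print("median:", median_income)
print("mode:", mode_income)
```

Here the single large income raises the mean well above what most units earn, while the median stays at a typical value.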
Measures of Variation:
Two or more data sets or distributions may have the same average but very different distributions of values within the range. Thus, to describe the data fully, we also use measures of variation in addition to measures of central tendency (averages). The simplest of these measures is the Range, defined as the difference between the highest and lowest values of the data. However, the range is often not satisfactory because it depends only on the two extreme values.
The most popular measure of variation is the ‘Standard Deviation’, defined as:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
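As a rough illustration, the sample standard deviation can be computed step by step in Python. The sketch assumes the usual sample form with divisor n − 1; the two data sets `a` and `b` are hypothetical and were chosen to have the same mean but very different spread.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation with divisor n - 1 (an assumption of this sketch)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    return math.sqrt(sum_sq_dev / (n - 1))

# Two hypothetical data sets with the same mean (50) but different variation
a = [48, 49, 50, 51, 52]
b = [10, 30, 50, 70, 90]

print("mean a:", statistics.mean(a), "sd a:", round(sample_sd(a), 4))
print("mean b:", statistics.mean(b), "sd b:", round(sample_sd(b), 4))
```

The function agrees with `statistics.stdev` from the Python standard library, which uses the same divisor.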
Ex: For the data in lifeexpectancy.xls, compare the life expectancy of countries in 1980 and 2017 using descriptive statistics.
Inferential Statistics
A pair of related hypotheses is considered. The first is called the null hypothesis and the second the alternative hypothesis. Hypothesis testing is a statistical procedure that enables us to judge whether the sample data provide sufficient evidence to refute the null hypothesis.
For a criminal trial, the null and alternative hypotheses are set as follows:
Null hypothesis H0: The defendant is innocent
Alternative hypothesis H1: The defendant is guilty
The null hypothesis reflects the famous rule of the justice system that the defendant is presumed innocent unless proven guilty.
Only when sufficient evidence against his innocence is available is he declared guilty.
Exactly the same approach is followed in any statistical hypothesis test. The null hypothesis (often stated as a status quo condition) is presented and evidence against it is sought from sample data (through a test statistic). If the data provide sufficient evidence against the null hypothesis, the null is rejected in favor of the alternative hypothesis.
Example1: The average per capita annual health care cost for a sample of U.S. cities (50,000–200,000 population) is $5,886, with a standard deviation of $1,116. Based on a sample of n = 378, a city of similar population contends that its new outpatient treatment program has reduced annual health care costs to $5,525 per person. Is this a significant improvement?
Step 1: Formulation of null and alternative hypotheses.
H0: μ = $5886 (this city’s average health cost is no different from the rest of the country)
H1: μ < $5886 (this city’s average health cost is indeed lower than the rest of the country)
Step 2: Computing the test statistic:
\[ z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{5525 - 5886}{1116/\sqrt{378}} \approx -6.29 \]
Step 3: Taking the probability of a Type I error (level of significance) as α = 0.05, the critical value is -1.645 (from the Z table, or from Excel’s =NORMINV(probability, mean, sd); in this case =NORMINV(0.05,0,1) yields -1.645). Thus values less than or equal to -1.645 (i.e. at least as negative as -1.645) form the rejection region, and the rest is the acceptance region.
Step 4: As the calculated value of the test statistic falls in the rejection region, we reject the null hypothesis and conclude that the city’s claim of having a lower average health care cost is supported by the data.
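The four steps of this example can be reproduced with the Python standard library. This is a minimal sketch using the figures stated in the example; `NormalDist().inv_cdf` plays the role of Excel's NORMINV with mean 0 and sd 1, and the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 5886      # hypothesized mean under H0
sigma = 1116    # population standard deviation (assumed known)
n = 378         # sample size
xbar = 5525     # sample mean claimed by the city
alpha = 0.05    # level of significance

# Step 2: test statistic
z = (xbar - mu0) / (sigma / sqrt(n))

# Step 3: left-tail critical value of the standard normal distribution
z_crit = NormalDist().inv_cdf(alpha)

# Step 4: decision (left-tail test, H1: mu < mu0)
reject_h0 = z <= z_crit
p_value = NormalDist().cdf(z)

print("z =", round(z, 3))
print("critical value =", round(z_crit, 3))
print("reject H0:", reject_h0)
```

The p-value computed in the last step is an extra that the example does not use; it is the left-tail probability of observing a z value at least this small when H0 is true.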
Note that here we assume that the distribution of health cost is normal, with the standard deviation in the city being the same as the SD of the rest of the country, i.e. $1116. If the population standard deviation is not available, the sample standard deviation s (in this case the city’s standard deviation of health cost) can be used instead. In that case the test statistic follows the t distribution (with n-1 degrees of freedom) and is given by:
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]
Exercise1: A business student claims that, on average, an MBA student is required
to prepare more than five cases per week. To examine the claim, a statistics
professor asks a random sample of 10 MBA students to report the number of cases
they prepare weekly. The results are exhibited here. 2, 7, 4, 8, 9, 5, 11, 3, 7,
and 4. Can the professor conclude at the 5% significance level that the claim is
true, assuming that the number of cases is normally distributed with a standard
deviation of 1.5?
Note that Excel’s NORMINV function uses the left cumulative probability, so for a right-tail test like the one above use 0.95 as the probability in the Excel command.
Sol: Let μ1 and μ2 represent the mean annual salaries of female and male managers.
Step1: H0: μ1 = μ2 (No difference in average salaries i.e. no discrimination)
H1: μ1 < μ2 (Females being paid less on average so there is discrimination)
Step 2: Computation of test statistic:
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
From the given data, Female: \(n_1 = 38,\ \bar{x}_1 = 76189,\ s_1 = 22771\)
Male: \(n_2 = 62,\ \bar{x}_2 = 97832,\ s_2 = 29375\)
                               Variable 1     Variable 2
Mean                           76188.94737    97832.09677
Variance                       518509458.3    862916810.3
Observations                   38             62
Hypothesized Mean Difference   0
df                             93
t Stat                         -4.12246734
P(T<=t) one-tail               4.06901E-05
t Critical one-tail            1.661403674
P(T<=t) two-tail               8.13802E-05
t Critical two-tail            1.985801814
Note: There are two forms of the t test. The first assumes the population variances to be equal; the other makes no such assumption, so the latter test is more general, and hence we have used it above.
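The t statistic and the degrees of freedom in the Excel output can be checked from the summary statistics alone. The following Python sketch assumes the unequal-variances form of the test, with the degrees of freedom obtained from the Welch-Satterthwaite approximation; the function name `welch_t` is introduced here for illustration, and the critical value is copied from the output above rather than computed.

```python
from math import sqrt

def welch_t(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """Two-sample t statistic without assuming equal population variances."""
    a = var1 / n1
    b = var2 / n2
    t = (mean1 - mean2 - hypothesized_diff) / sqrt(a + b)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Summary statistics from the output: female (1) and male (2) managers
t_stat, df = welch_t(76188.94737, 518509458.3, 38,
                     97832.09677, 862916810.3, 62)

T_CRITICAL_ONE_TAIL = 1.661403674  # taken from the Excel output, not computed here
reject_h0 = t_stat < -T_CRITICAL_ONE_TAIL  # left-tail test, H1: mu1 < mu2

print("t =", round(t_stat, 4))
print("df =", round(df))
print("reject H0:", reject_h0)
```

Because the alternative hypothesis is μ1 < μ2, the calculated t is compared with the negative of the one-tail critical value.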
Exercise: The manager of a company wanted to investigate the job offers recent
MBAs were obtaining. In particular, she wanted to know whether finance majors
were being offered higher salaries than marketing majors. In a preliminary study,
she randomly sampled 50 recently graduated MBAs, half of whom majored in
finance and half in marketing. From each she obtained the highest salary offer
(including benefits). These data are listed here. Can we infer that finance majors
obtain higher salary offers than do marketing majors among MBAs? (Use the excel
file mba.xls)
Given a sample of n pairs (x, y), the model parameters can be estimated by the method of least squares, which takes as estimates of β0 and β1 those values b0 and b1 for which the sum of squared errors \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) is minimum, where \(\hat{y}_i = b_0 + b_1 x_i\) is the predicted value of observation \(y_i\).
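For a single predictor, the least squares estimates have a closed form: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The Python sketch below applies these formulas to a small hypothetical data set (the `x` and `y` values are not from the course files) and checks that moving the estimates away from the solution increases the sum of squared errors.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between observed and predicted y."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
sse_best = sum_squared_errors(x, y, b0, b1)

print("b0 =", round(b0, 4), "b1 =", round(b1, 4))
print("SSE at the estimates:", round(sse_best, 4))
print("SSE with a different slope:", round(sum_squared_errors(x, y, b0, b1 + 0.1), 4))
```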
Using Excel’s regression function (a part of Data Analysis Pack available from Data tab), the
estimated regression output is as follows:
Regression Statistics
Multiple R           0.7788193
R Square             0.6065596
Adjusted R Square    0.5672155
Standard Error       1.3467958
Observations         12

ANOVA
              df    SS             MS          F           Significance F
Regression    1     27.96391184    27.96391    15.41681    0.002837
Residual      10    18.13858816    1.813859
Total         11    46.1025

               Coefficients   Standard Error   t Stat      P-value     Lower 95%   Upper 95%
Intercept      9.1003739      0.851528022      10.68711    8.62E-07    7.203051    10.9977
Advertising    0.0582309      0.014830505      3.926424    0.002837    0.025186    0.091275
[Figure: chart titled "Sales"; vertical axis from 0 to 18, horizontal axis from 10 to 110]
Coefficients Interpretation:
Intercept: When advertising in a month is zero, sales are expected to be 9.1 million dollars.
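The estimated coefficients define a fitted line that can be evaluated at any advertising level. The Python sketch below uses the intercept and slope from the regression output; the advertising values 0, 50 and 100 are arbitrary illustrations, and the unit of the advertising variable is not stated in this section, so the results are in the same units as the original data.

```python
B0 = 9.1003739   # intercept from the regression output
B1 = 0.0582309   # coefficient of Advertising from the regression output

def predicted_sales(advertising):
    """Predicted sales from the fitted line: sales = B0 + B1 * advertising."""
    return B0 + B1 * advertising

# Illustrative advertising levels (arbitrary choices for this sketch)
for adv in (0, 50, 100):
    print(adv, round(predicted_sales(adv), 2))
```

At zero advertising the prediction equals the intercept, and each additional unit of advertising adds the slope, about 0.058, to predicted sales.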
Example2: Consider district-level data on the total number of fatalities (deaths) for the year 2015 in Pakistan. Let us try to build a regression model that can explain why violent crime (more specifically, the number of fatalities per lac, i.e. per 100,000, population) is high in some districts. Since the data are district-wise, we can consider district-level variables that could explain the variation in the number of fatalities. To begin with, we can expect crime to be lower in districts that enjoy a good standard of living, e.g. good education, health and high income levels. Fortunately, a district-level Human Development Index constructed by UNDP is available for Pakistan, which we can use as an independent variable. An extract of the data is shown below.
Stata can be used for such regressions very conveniently. The Stata output and the scatter plot with the fitted regression line appear below (use the Stata file fatal.dta and the Excel file fatalities excludes Karachi.xls):
The results indicate that as the HDI score of a district increases by one unit (say, from zero to 1), the number of fatalities per lac population decreases by approximately 15. A practically more interesting result is that as the HDI of a district increases by 0.1 (e.g. from 0.6 to 0.7), the number of fatalities per lac decreases by approximately 1.5. This decrease is statistically significant, as seen from the t statistic and p-value.
twoway (scatter fatpop hdi, mlabel(district) mlabangle(45) mlabsize(tiny)) (lfit
fatpop hdi)
We will examine other variables after going through multiple regression topic.
Exercise: Critics of television often refer to the detrimental effects that all the
violence shown on television has on children. However, there may be another
problem. It may be that watching television also reduces the amount of physical
exercise, causing weight gains. A sample of 15 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative
number indicates the child is underweight). In addition, the number of hours of
television viewing per week was also recorded. These data are listed here. Use the
excel file tv.xls
The regression is estimated as usual by the least squares method, which is a built-in procedure in Excel and in all statistical software.
Example1: The fire chief of an area has divided the area into nine fire districts of approximately equal size. The chief wants to know how many fires a month to expect in each district so that firefighters can be allocated accordingly. It is known that older houses are more likely to burn than newer houses. In addition, owner-occupied houses are expected to have fewer incidences of fire than rented houses because of better management by owners. Use the data file fire.xls. The data are as follows:
The data input for this regression is illustrated by above screen shot. The results are as follows:
Regression Statistics
Multiple R           0.99212305
R Square             0.98430816
Adjusted R Square    0.97907751
Standard Error       3.5510343
Observations         9

ANOVA
              df    SS            MS             F           Significance F
Regression    2     4745.8946     2372.94829     188.1822    3.864E-06
Residual      6     75.6590971    12.60984952
Total         8     4821.5556

                     Coefficients   Standard Error   t Stat          P-value      Lower 95%   Upper 95%
Intercept            38.1260379     2.5620847        14.87956568     5.798E-06    31.85294    44.3978
Age (X1)             1.5671063      0.0867279        18.06404664     1.852E-06    1.35483     1.779829
Percent Owned (X2)   -0.4922775     0.0364602        -13.49217536    1.028E-05    -0.58538    -0.402997
R-Square: 0.984 indicates that 98.4% of the variation in the number of fires per month is explained by these two variables through the multiple regression model.
Hypothesis Tests: The t statistics and p-values of the two predictors indicate that both are useful predictors of the number of fires per month.
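The summary measures in the regression output are linked to the ANOVA table by simple formulas, which the following Python sketch reproduces from the residual and total sums of squares reported above. Here `n` is the number of observations and `k` the number of predictors; the variable names are chosen for this illustration.

```python
from math import sqrt

# Figures taken from the ANOVA table of the fire example
ss_residual = 75.6590971
ss_total = 4821.5556
n = 9   # observations (fire districts)
k = 2   # predictors: Age and Percent Owned

# R Square: share of total variation explained by the model
r2 = 1 - ss_residual / ss_total

# Adjusted R Square penalizes for the number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Mean squares and the overall F statistic
ms_regression = (ss_total - ss_residual) / k
ms_residual = ss_residual / (n - k - 1)
f_stat = ms_regression / ms_residual

# Standard Error of the regression
std_error = sqrt(ms_residual)

print("R Square:", round(r2, 4))
print("Adjusted R Square:", round(adj_r2, 4))
print("F:", round(f_stat, 2))
print("Standard Error:", round(std_error, 4))
```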
The results indicate that only HDI remains an important variable associated with violent crime fatalities. Although population density appears with the correct positive sign, the variable is not significant at the % level. Similarly, although the number of fatal crimes appears to be negatively associated with the number of police stations, the variable is statistically insignificant. As the model’s diagnostics indicate a great extent of non-normality, we can try a log-linear model.
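* Log-transform the dependent variable (adding 1 keeps zero values defined)
gen lfatpop=log(1+fatpop)
* Re-estimate the regression with the log-transformed dependent variable
reg lfatpop hdi popdensity policestations
* Store the residuals of this regression in e1
predict e1, resid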
The results indicate that the variables appear with sensible signs and are also significant.
Contingency table: Data on one variable are grouped into a frequency distribution. Data on two variables are called bivariate data, and a frequency distribution for bivariate data is called a contingency table (also a two-way table, cross-tabulation table or cross tabs).
An example of a contingency table appears here, in which the level of job satisfaction (low, medium, high) is tabulated against income level (low, medium, high).
Income
JobSatisfaction Low Medium High Total
Low 100 30 10 140
Medium 60 80 15 155
High 40 40 50 130
Total 200 150 75 425
\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]
This test statistic follows a Chi-Square distribution with (r − 1)(c − 1) degrees of freedom, where r and c are the number of rows and the number of columns in the contingency table for the two variables under consideration, provided that each expected frequency is at least 5.
cell Observed frequency Expected frequency (O-E)^2/E
1 100 65.88235294 17.66806723
2 30 49.41176471 7.62605042
3 10 24.70588235 8.753501401
4 60 72.94117647 2.29601518
5 80 54.70588235 11.69512966
6 15 27.35294118 5.578747628
7 40 61.17647059 7.330316742
8 40 45.88235294 0.754147813
9 50 22.94117647 31.91553544
Sum 425 425 93.61751152
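With (3 − 1)(3 − 1) = 4 degrees of freedom, the critical value of the Chi-Square distribution at the 5% level of significance is 9.488. As the calculated value of 93.62 far exceeds it, the null hypothesis of independence is rejected and we conclude that the two variables are associated or dependent. Thus job satisfaction level depends on income.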
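The expected frequencies and the Chi-Square statistic can be reproduced from the observed table with a few lines of Python. In this sketch each expected frequency is computed as (row total × column total) / grand total, which is the rule that matches the expected values tabulated above; the 5% critical value for 4 degrees of freedom (9.488) is entered as a constant from the Chi-Square table rather than computed.

```python
# Observed frequencies: rows are job satisfaction (Low, Medium, High),
# columns are income (Low, Medium, High)
observed = [
    [100, 30, 10],
    [60, 80, 15],
    [40, 40, 50],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-Square statistic: sum over all cells of (O - E)^2 / E
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
CRITICAL_VALUE_5PCT_DF4 = 9.488  # from the Chi-Square table, not computed here
reject_independence = chi2 > CRITICAL_VALUE_5PCT_DF4

print("chi-square =", round(chi2, 4))
print("degrees of freedom =", df)
print("reject independence:", reject_independence)
```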