
Quantitative Tools for Decision Making

National Institute of Management, Karachi


February 13 and 14, 2020

By
Dr. Javed Iqbal
Associate Professor, IBA Karachi

Email: jiqbal@iba.edu.pk, javed_uniku@yahoo.com


Mobile: 0334 3208 707
Outline:
6.03 QUANTITATIVE TOOLS FOR DECISION MAKING
SCOPE (03 sessions, Total = 6 hours)
• Introduction to Statistics and its Application in Decision Making
• Describing and Interpreting Data: Statistical & Graphical Methods
• Mean, Median, Mode, Ratio & Standard Deviation etc. and their Application
• Sampling Techniques for Feed Back & Monitoring of Public Service
• Testing of Hypothesis, Application of data and computation of results
• Three sessions of 2 hours each with hands-on practice using Excel and STATA
• Data Analytics
• Developing understanding of published data, institutional data sets / data platforms and macro data analysis
• Digging (Secondary) Data
• Understanding Published Data
• Data Visualization 101
• Data and Evidence
• Data Interpretation
(The objective of this day-long workshop is to provide an intermediate-level
understanding of published data, with particular attention to official
statistics on socio-economic and macroeconomic indicators, and their usage
in policy formulation, decision-making and policy evaluation using data
visualization tools.)

Suggested Text:
Gerald Keller (2014). Statistics for Management and Economics. Cengage Learning

The word ‘Statistics’ is used in two different senses:

1. Numerical data, facts and figures: e.g. statistics of crimes, statistics of road accidents, monetary and financial statistics, statistics of education etc.

2. Subject of study: Defined as the science that deals with the collection, presentation, summarization and analysis of data in order to draw valid conclusions on the basis of such analysis. Alternatively, statistics is the science of converting data into information.

Nowadays, new data-related fields are being developed. They combine Statistics and Computer Science to produce fascinating fields such as ‘Data Science’, ‘Big Data Analytics’, ‘Machine Learning’ and ‘Data Mining’. Computer science aided by statistical knowledge is a rapidly expanding area that is in high demand.

Christopher Scott (2005) [Proceedings of the 2005 CBMS Network Meeting], an expert on public policy and water management, remarks that wherever possible, public policy decisions should be informed by careful analysis using sound and transparent data. Evidence-based policymaking means the systematic and rigorous use of statistics in decision making. A data-driven procedure is a decision-making process which involves collecting data, extracting patterns and facts from that data, and utilizing those facts to make inferences that influence decisions. It means making organizational decisions based on actual data rather than on intuition or observation alone. Evidence-based decision making helps to avoid bias. Criteria other than those associated with evidence-based policymaking are often used to make public choices. These alternative criteria include: (i) Power and influence of sectional interests (ii) Corruption (iii) Political ideology (iv) Arbitrariness (v) Anecdotes (hearsay rather than hard facts).

Data are collected over different units e.g.

Individuals: Names, date of birth, age, marital status, education and organization of the participants of this workshop.

Families: Size, income of household head, household location, number of smokers.

Cities/Districts: Literacy rate, population, area, revenue collection, crime rate.

Countries: GDP per capita, human development index score, literacy rate, life expectancy, infant mortality rate.

Several Classifications of Data:


Data Types (Time Dimension)
1. Cross Sectional: Data obtained over different units at a particular point in time, e.g. the after-tax profit of each firm listed at PSX for the year 2019, GDP per capita of a set of countries for the year 2018, etc.

2. Time Series: Data obtained over several time points for an individual unit, e.g. the after-tax profit of ICI company for the last 10 years, the number of violent crimes in Karachi recorded over the last several years, etc.

3. Panel: Data having both the cross-section and time dimensions, e.g. the after-tax profit of a selected set of firms recorded for each year from 2010 to 2018, GDP of a selected set of countries recorded for the last 50 years, etc.

Data Types (Quantitative/Qualitative)


1. Qualitative/Categorical:
a. Nominal: names of individuals, gender, marital status, crime type etc. In the nominal case, categories have no order or rank.

b. Ordinal/Likert scale: Responses to a survey e.g. strongly disagree, disagree, neutral, agree, strongly agree; performance of an employee: poor, average, good, excellent; quality of a bond issued by a company: AAA, AA, A, BBB, BB etc. Note that here the categories have a natural order.

2. Quantitative: Age, height, weight, exchange rate, income etc.

Data Sources
1. Primary Data: Data collected by the researcher specifically for the purpose under study, e.g. by an experiment or by a survey (personal interview, telephone survey, self-administered questionnaire, internet survey etc.)

2. Secondary Data: Data that were collected by someone else and available as
published document, web database, unpublished record etc. The researcher
uses these data for his/her own purpose under study.

Some Important Local and International Sources of Secondary Data


Local Data Sources:
1. Pakistan Economic Survey
Published every year shortly before the budget by the Ministry of Finance, Govt. of Pakistan. It contains updates on different aspects of the macroeconomy. It also covers statistics for the last several years, e.g. monetary and financial statistics, external trade, social statistics, demographic statistics and many other aspects.
http://www.finance.gov.pk/survey_1819.html

2. State Bank of Pakistan’s Database
It includes many important documents, e.g. the Annual Report of the State Bank of Pakistan, Banking Statistics, Balance of Payments, Export Receipts by Commodity and Groups, Statistical Bulletin (monthly), Financial Statement Analysis of Joint Stock Companies (a very good source of firm level data on several balance sheet and income statement items for all the firms listed at the Pakistan Stock Exchange), Inflation Monitor (monthly) etc.
http://www.sbp.org.pk/publications/index2.asp

3. Pakistan Bureau of Statistics
This is the main organization responsible for collecting and disseminating data on many aspects of the economy of Pakistan. Its main documents include the Pakistan Statistical Year Book, National Accounts of Pakistan (GDP, GNP, and their various components), Population Census, Agricultural Statistics, External Trade Statistics, Labor Force Statistics, and Price Statistics and inflation measurement.
http://www.pbs.gov.pk/

4. Publications of Provincial Bureau of Statistics
Examples are the Sindh and Punjab Bureaus of Statistics, which publish province level data on several aspects.
http://www.bos.gop.pk/publicationreports and http://sindhbos.gov.pk/

5. Reports and Databases by NGOs
For example, the NGO AlifAilaan (https://www.alifailaan.pk/) published an online database of district level education conditions and educational rankings of all districts in Pakistan. Update: the project has ended and the data are no longer available.
6. Reports of Centre of Research and Security Studies
These contain annual and quarterly reports on terrorism related activities, number of fatalities, and classification of fatalities by political parties and districts/regions. Their sources are local English newspapers e.g. Dawn, The News, Express etc.
https://crss.pk/reports/research-reports/

7. Database of Pakistan Stock Exchange
Contains a range of information on listed companies. The data portal (https://dps.psx.com.pk/historical) provides historical prices and trading volume of a requested company for several past days.

8. Financial Statements/Annual Reports of Listed Companies
All companies listed on the PSX are required by law to disclose their financial statements to the public via their websites.

In addition to these, there are several important but sensitive databases that are not publicly accessible, e.g. the NADRA database.

International Data Sources:


1. World Development Indicators
This database contains several aspects of development for all the UN member countries. The
data are related to Agriculture and Food Security, Climate Change, Economic Growth,
Education, Energy and Extractives, Environment and Natural Resources, Financial Sector
Development, Poverty, Literacy etc.
https://datacatalog.worldbank.org/dataset/world-development-indicators
2. OECD Economic Indicators
The OECD (Organization for Economic Cooperation and Development) is an organization of 36 highly developed countries. This database contains many important economic indicators including quarterly national accounts, business surveys, retail sales, industrial production, construction, consumer prices, total employment, unemployment rates, interest rates, money and domestic finance, foreign finance, foreign trade, and balance of payments for OECD countries and non-member economies.
https://www.oecd-ilibrary.org/economics/data/main-economic-indicators_mei-data-en

3. IMF Databases
The International Monetary Fund publishes many important databases related to money and
finance for the UN member countries. An important document is the IFS (International Financial
Statistics) which is published at monthly, quarterly and annual frequencies. This document
contains time series data on many financial, macroeconomic and trade related variables for all
countries.
https://www.imf.org/en/Data
4. International Surveys and Indices
There are now many different indices published by various organizations. For example, the Index of Economic Freedom (https://www.heritage.org/index/ranking) ranks countries with respect to economic freedom. The PEW Research Centre (https://www.pewresearch.org/fact-tank/2017/04/05/christians-remain-worlds-largest-religious-group-but-they-are-declining-in-europe/) publishes various statistics e.g. about world religions. The Human Development Index (http://hdr.undp.org/en/content/human-development-index-hdi) published by UNDP ranks countries with respect to human development and its various components. This index was first developed by the Pakistani economist Dr. Mehboob-ul-Haq. The Global Peace Index (https://en.wikipedia.org/wiki/Global_Peace_Index) ranks countries with respect to peace indicators. The Corruption Perceptions Index is published by Transparency International (https://www.transparency.org/cpi2018).

Exercise 1: For each item, identify the data type


Exercise 2: For each item, identify the data type

Statistics is the science that deals with the collection, presentation, summarization and analysis of data in order to draw valid conclusions on the basis of such analysis.
Looking at the definition of statistics, two branches of statistics emerge:

1. Descriptive Statistics: Deals with describing the main features of data through tables, graphs and numerical summary measures. Main tools used are:
a. Tables
b. Charts and Graphs
c. Measures of Central Tendency (Averages)
d. Measures of Variation
e. Measures of Symmetry/Skewness

2. Inferential Statistics: Concerned with drawing conclusions about a larger set of data (called the population) by analyzing only a small part of it (called the sample). Main tools used are:
a. Tests of Hypothesis
b. Confidence Intervals

Most empirical studies in management, economics and social sciences involve both of these types of methods. For example, a study explaining the crime rate may first describe the crime data at different levels through tables and charts and then proceed to more advanced techniques like regression analysis (an inferential technique).

Describing Qualitative/Categorical Data

Bar Chart: Simple bar chart presents a rectangle corresponding to each category:

Example: The planet may be threatened by global warming, possibly caused by burning fossil fuels (petroleum, natural gas, and coal) that produces carbon dioxide (CO2). The following table lists the top 15 producers of CO2 and the annual amounts (millions of metric tons) from fossil fuels. Graphically depict these figures. Use the Excel file co2.xls.

Source: Statistical Abstract of the United States, 2012, Table 1389

Sol: Open the Excel data file CO2.xls and select the two columns including headings, then go to Insert > Column Chart.

[Figure: column chart "Top Carbon Dioxide CO2 Producing Countries". Vertical axis: 0 to 9,000. Horizontal axis: Australia, Canada, China, Germany, India, Iran, Italy, Japan, Korea (South), Mexico, Russia, Saudi Arabia, South Africa, United Kingdom, United States.]

Exercise 1: The file crime.xls gives data on the number of crimes in different regions and provinces of Pakistan. Produce some bar charts and comment.

Exercise 2: When will the world run out of oil? One way to judge is to determine the oil reserves of the countries around the world. The next table displays the known oil reserves of the top 15 countries. Graphically describe the figures. Use the Excel file oil.xls.

Pie Chart: A circular diagram divided into sectors representing a category. The
angle of each sector is proportional to the value of the category.

Example: The following information (Excel file accidents.xls) gives the total number of road accidents and the number of accidents per million population in Pakistan for the four provinces and Islamabad. Draw a pie chart.

Region         Total Number of Traffic    Traffic Accidents Per Million
               Accidents (2016-17)        Population (2016-17)
Punjab         3819                       34.714
Sindh          880                        18.377
KPK            4256                       119.803
Baluchistan    401                        32.484
Islamabad      226                        112.630

[Figure: pie chart "Total Number of Road Accidents (2016-17)". KPK 44%, Punjab 40%, Sindh 9%, Baluchistan 4%, Islamabad 2%.]

Sol: In Excel, select the data range (including headings), then use Insert > Chart (select Pie Chart). Then right click on any sector and, under Format Data Labels, select Category Name and Percentage. Alternatively, click the + box at the top of the graph and select Data Labels > More Options > Category Name and Percentage.
[Figure: pie chart "Total Number of Road Accidents Per Million Population (2016-17)". KPK 38%, Islamabad 35%, Punjab 11%, Baluchistan 10%, Sindh 6%.]

The pie chart of total accidents shows that KPK accounts for the largest share of accidents, followed by Punjab. A better picture is obtained from the number of accidents per million population. The pie chart of that measure shows that the Islamabad region has a very high share of accidents per capita (strictly, per million persons), second only to KPK.
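
The percentages on the two pie charts can be checked with a few lines of code. The sketch below is only an illustration in Python (the handout itself works in Excel); the figures are copied from the accidents table, and the helper names `pie_shares` and `sector_angles` are made up for this example. As in the handout, the per-million rates are simply treated as parts of a whole.

```python
# Accident figures copied from the table in this example (2016-17).
total_accidents = {
    "Punjab": 3819,
    "Sindh": 880,
    "KPK": 4256,
    "Baluchistan": 401,
    "Islamabad": 226,
}
accidents_per_million = {
    "Punjab": 34.714,
    "Sindh": 18.377,
    "KPK": 119.803,
    "Baluchistan": 32.484,
    "Islamabad": 112.630,
}


def pie_shares(values):
    """Return each category's share of the total, in percent."""
    total = sum(values.values())
    return {name: 100 * value / total for name, value in values.items()}


def sector_angles(values):
    """Return each sector's angle in degrees (its share of 360 degrees)."""
    return {name: 3.6 * share for name, share in pie_shares(values).items()}


for title, table in [("Total accidents", total_accidents),
                     ("Accidents per million", accidents_per_million)]:
    print(title)
    for region, share in pie_shares(table).items():
        print(f"  {region:<12} {share:5.1f}%")
```

Rounded to whole percentages, the shares agree with the labels on the two charts.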
Exercise 1: The following table lists the top 10 countries and the amounts of oil (millions of barrels annually) they exported to the United States in 2010.

Exercise 2: The Excel file energy.xls gives the energy consumption pattern of Australia. The figures measure the heat content in metric tons (1,000 kilograms) of oil equivalent. Draw a graph that depicts these numbers.

Multiple Bar Chart: is used to present data on two or more categorical variables.

Example: The following data are related to the number of fatalities in terrorism related incidents in Pakistan over a three-year period. Use the data file ‘terrorism.xls’. Present the data in a suitable graph.

Source                2013    2014    2015
Security Operation     811    3391    2635
Target Killings       2371    2128     803
Militants Attacks     1171     976     472
Terrorism              810     516     247
Other*                 522     600     497
*Other includes robberies, militant and criminal fighting, political rivalries, cross border attacks, accidental explosions and self-detonations, children playing with toy bombs and other brutalities
Source: CRSS Annual Security Report 2015
https://crss.pk/wp-content/uploads/2010/07/CRSS-Annual-Security-Report-2015.pdf

Sol: In Excel, after selecting the columns including headings, go to Insert > Chart > Column Chart. Note that here the year headings are numeric, so it is better to put some character, e.g. a quotation mark or a comma, before each year so that Excel treats it as text. Otherwise Excel labels the series as Series1, Series2 etc.

[Figure: multiple bar chart "Number of Fatalities Related to Terrorism in Pakistan". Vertical axis: 0 to 4,000. Categories: Security Operation, Target Killings, Militants Attacks, Terrorism, Others. Series: 2013, 2014, 2015.]

Exercise 1: The following table lists the percentage of males and females in five age groups that did not have health insurance in the United States in September 2008. Use a graphical technique to present these figures. Use the data file health.xls.

Exercise 2: Has the educational level of adults changed over the past two decades? To help answer this question, the Bureau of Labor Statistics compiled the following table, which lists the number (1,000) of adults 25 years of age and older who are employed. Use a graphical technique to present these figures. Briefly describe what the chart tells you. Use the Excel file school.xls.

Describing Quantitative Data through Tables and Graphs:

Frequency Distribution:
A table showing the ranges/intervals of data values together with number of values
falling in each interval.

Histogram: Graphical Representation of a frequency distribution.

Example: Consider the data on life expectancy of 205 countries and territories of
the world for year 1980. The data are obtained from World Development
Indicators and are available in the file lifeexpectancy.xls.

Sol: A few steps are needed before making a sensible frequency distribution in Excel.
Step 1: Calculate the descriptive statistics to find the range of the data, including the minimum and maximum values. Keeping these in view, choose the upper limits of the intervals (Excel calls them Bins) as 25, 30 etc.
Step 2: In Excel’s Data tab select Data Analysis and then Histogram. Enter the input data range and the Bin range and select Chart Output. In the resulting histogram, right click on any bar, select Format Data Series and reduce Gap Width to 2. The histogram looks like this:

[Figure: "Histogram of Life Expectancy 1980". Horizontal axis: life expectancy, with bins 25, 30, 35, ..., 80 and More. Vertical axis: Frequency, 0 to 50.]

Note that in the histogram the bin values are shown at the center of the bars, although they are the upper limits of the intervals. The distribution of life expectancy appears to be highly negatively skewed, with a long tail toward low values: a number of countries had very low life expectancy in 1980.
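
To make the role of the upper limits (bins) concrete, here is a minimal Python sketch of a frequency distribution in which each value is counted in the first bin whose upper limit is not below it, and anything above the last limit goes to "More". This is assumed to mirror how the Excel Histogram tool treats its bins. The function name `frequency_distribution` and the sample values are hypothetical; the values are not taken from lifeexpectancy.xls.

```python
from bisect import bisect_left


def frequency_distribution(values, upper_limits):
    """Count values per bin, where each bin is labelled by its upper limit.

    A value x is counted in the first bin whose upper limit is >= x;
    values above the last limit go to a final "More" bin.
    """
    limits = sorted(upper_limits)
    counts = [0] * (len(limits) + 1)
    for x in values:
        counts[bisect_left(limits, x)] += 1
    labels = [str(limit) for limit in limits] + ["More"]
    return dict(zip(labels, counts))


# Hypothetical life-expectancy values (years), for illustration only;
# they are NOT taken from lifeexpectancy.xls.
sample = [41.2, 48.9, 50.0, 55.3, 57.8, 62.1, 64.4,
          66.0, 68.7, 69.9, 71.5, 73.2, 74.8, 76.1]
bins = [45, 50, 55, 60, 65, 70, 75]

# Print a crude text histogram: one asterisk per observation.
for label, count in frequency_distribution(sample, bins).items():
    print(f"{label:>5}  {'*' * count}  ({count})")
```

Note that the value 50.0 lands in the bin labelled 50, which is why the bin labels should be read as upper limits.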

Exercise 1: Do a similar exercise with the latest data for 2017 and see if the pattern has changed. The data are in sheet 2.

Numerical Summary Measures: Measures of Central Tendency:


Arithmetic Mean (or simply Mean or average). The arithmetic mean of a set of n values is defined as the sum of the values divided by n, i.e.
$\bar{x} = \frac{\sum x}{n}$
Example: A sample of 10 adults was asked to report the number of hours they spent on the Internet the previous month. The responses received were:
0 7 12 5 33 14 8 0 9 22
Sol: $\bar{x} = \frac{0 + 7 + \dots + 22}{10} = \frac{110}{10} = 11.0$ hours
In Excel, use =AVERAGE(input range)

Median: is defined as the central most value of the data. It is obtained by arranging the values in ascending order and taking the middle value if the number of values is odd, or the average of the two middle values if the number of values is even. In general, the median can be defined as the $\frac{(n+1)}{2}$th ordered value of the data set.

Example: For the data of internet hours:
Ordered data set: 0 0 5 7 8 9 12 14 22 33
Median = 5.5th data value = average of the 5th and 6th values = (8+9)/2 = 8.5 hours

In Excel use =MEDIAN(data range)

Mode: is defined as the value which occurs most frequently in a set of data. In the above example the value zero occurs twice, so the mode for this data is 0, but as can be seen the mode is a very poor measure of center in this case.
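
The three measures for the internet-hours data can be cross-checked outside Excel. The following is a small sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
import statistics

# Internet hours reported by the 10 adults in the example.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

mean_hours = statistics.mean(hours)      # sum of the values divided by n
median_hours = statistics.median(hours)  # average of the 5th and 6th ordered values
mode_hours = statistics.mode(hours)      # most frequently occurring value

print(f"mean   = {mean_hours}")
print(f"median = {median_hours}")
print(f"mode   = {mode_hours}")
```

The mean (11), median (8.5) and mode (0) match the hand calculations, and the gap between mean and median already hints at the influence of the large value 33.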

Which measure to use? With three measures from which to choose, which one
should we use? There are several factors to consider when making our choice of
measure of central location. The mean is generally our first selection. However,
there are several circumstances when the median is better. The mode is seldom the
best measure of central location. One advantage the median holds is that it is not as sensitive to extreme values as is the mean. Thus for a highly skewed distribution the median is preferred over the mean. Examples are income and size distributions, e.g. the size or sales of companies.

Measures of relative standing: Quartiles, Deciles, Percentiles:


Data can be divided into various equal parts. Q1, Q2 and Q3 are the three quartiles of the data set. Q1 is defined as the value having 25% of the values below it and 75% of the values above it. Q2, the median, has 50% of the values on each side. Q3 is the value such that 75% of the values are below it and 25% of the values are above it. Similarly D1, D2,…,D9 are the 9 deciles dividing the data set into 10 equal parts and P1, P2,…,P99 are the 99 percentiles which divide the data into 100 equal parts.
In Excel, the =PERCENTILE(data range, k) command computes the (100k)th percentile of the data specified in the range, where k is a number between 0 and 1. For example k = 0.75 corresponds to the 75th percentile of the data.
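
As an illustration of what a percentile command does, the sketch below computes percentiles by linear interpolation between ordered values. It assumes the position rule (n - 1)k, which is believed to match Excel's inclusive percentile calculation; results should be checked against the spreadsheet. The function name `percentile_inc` is hypothetical, and the data are the internet-hours values from the earlier example.

```python
def percentile_inc(values, k):
    """Return the (100k)th percentile for 0 <= k <= 1 by linear interpolation.

    The fractional position (n - 1) * k is located in the sorted data and
    the two neighbouring ordered values are interpolated.
    """
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    ordered = sorted(values)
    position = (len(ordered) - 1) * k
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        # k == 1: the maximum value.
        return float(ordered[-1])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
q1, q2, q3 = (percentile_inc(hours, k) for k in (0.25, 0.50, 0.75))
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```

Under this rule the second quartile coincides with the median of 8.5 hours found earlier.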

Measures of Variation:
Two or more data sets or distributions may have the same average but very
different distribution of values within the range. Thus to fully describe the data we
also use measures of variation in addition to measures of central tendency
(averages). Simplest of these measures is the Range, defined as the difference
between the highest and lowest values of the data. However, range is often not
satisfactory.

Most popular measure of variation is known as ‘Standard Deviation’ defined as:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Roughly speaking, the standard deviation gives the average deviation (distance) of the observations from their mean.
Example: For the data set 8, 4, 9, 11, 3 the mean is 7. The deviations from the mean are 1, -3, 2, 4, -4, and their squares 1, 9, 4, 16, 16 sum to 46, so $s = \sqrt{46/(5-1)} = 3.391$. In Excel use =STDEV.S().
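
The same calculation can be traced step by step in code. This is a minimal Python sketch of the sample standard deviation with the n - 1 divisor; the variable names are chosen here for illustration.

```python
import math
import statistics

data = [8, 4, 9, 11, 3]

n = len(data)
mean = sum(data) / n                                  # 7.0
squared_deviations = [(x - mean) ** 2 for x in data]  # 1, 9, 4, 16, 16
s = math.sqrt(sum(squared_deviations) / (n - 1))      # sqrt(46 / 4)

print(f"sum of squared deviations = {sum(squared_deviations)}")
print(f"s = {s:.3f}")

# The library function uses the same n - 1 divisor.
assert math.isclose(s, statistics.stdev(data))
```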

Interpretation of Standard Deviation:
For a symmetric, bell-shaped (approximately normal) distribution, approximately
68% of all observations fall within one standard deviation of the mean.
95% of all observations fall within two standard deviations of the mean.
99.7% of all observations fall within three standard deviations of the mean.
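
These three percentages come from the normal distribution and can be reproduced directly. The sketch below uses `NormalDist` from Python's standard library; the helper name `coverage_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def coverage_within(k):
    """Probability that a normal observation lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.1%}")
```

For distributions that are far from bell-shaped these percentages are only a rough guide.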

Ex: For the data in lifeexpectancy.xls, compare the life expectancy of countries in 1980 and 2017 via descriptive statistics.

Inferential Statistics

Test of Hypothesis: A parametric hypothesis is a statement regarding the value of a population parameter.
Most often the parameters under study are the mean of a population, the difference between the means of two populations, a population proportion, the difference between two population proportions, a population variance or the ratio of the variances of two populations.

A pair of related hypotheses is considered. The first one is called the null hypothesis and the second one is called the alternative hypothesis. Hypothesis testing is a statistical procedure which enables us to judge whether the sample data provide sufficient evidence to refute the null hypothesis.

Following are examples of how to set null and alternative hypothesis.

For a criminal trial case, the null and alternative are set as follows:
Null hypothesis H0: The defendant is innocent
Alternative hypothesis H1: The defendant is guilty

The null hypothesis reflects the famous rule in the system of justice that the defendant is assumed innocent unless proven guilty.
Only when sufficient evidence against his innocence is available is he declared guilty.
Exactly the same approach is followed in any statistical hypothesis test. The null hypothesis (often stated as a status quo condition) is presented and evidence from sample data is sought (through a test statistic). If sufficient evidence against the null hypothesis is available in the data, the null is rejected in favor of the alternative hypothesis.

In this connection it is obvious that sometimes the decision may be incorrect, leading to one of two types of errors known as Type I error and Type II error. The error of wrongly rejecting a null hypothesis when it is in fact true is the Type I error. On the other hand, failing to reject a null hypothesis when it is false is the Type II error. Both of these errors can be reduced by bringing more evidence, i.e. by increasing the sample size.

Example 1: The average per capita annual health care cost for a sample of U.S. cities (50,000–200,000 population) is $5,886, with a standard deviation of $1,116. Based on a sample of n = 378, a city of similar population contends that its new outpatient treatment program reduced the annual health care costs to $5,525 per person. Is this a significant improvement?
Step 1: Formulation of Null and Alternative Hypotheses.

H0: μ = $5886 (this city’s average health cost is no different from the rest of the country)
H1: μ < $5886 (this city’s average health cost is indeed lower than the rest of the country)
Step 2: Computing the test statistic:
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} = \frac{5525 - 5886}{1116 / \sqrt{378}} = -6.29$
Step 3: Assuming a probability of Type I error (level of significance) of α = 0.05, the critical value is -1.645 (from the Z table, or from Excel’s =NORMINV(alpha, mean, sd); in this case =NORMINV(0.05,0,1) yields -1.645). Thus values less than or equal to -1.645 (i.e. more negative than -1.645) form the rejection region and the rest is the acceptance region.

Step 4: As the calculated value of the test statistic falls in the rejection region, reject the null hypothesis and conclude that the city’s claim of having a lower average health care cost is supported by the data.
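
The four steps of this left-tailed z test can be reproduced in a few lines. The sketch below is in Python rather than Excel and uses only the figures given in the example; `NormalDist().inv_cdf` plays the role of the NORMINV lookup, and the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 5886      # mean under the null hypothesis
sigma = 1116     # population standard deviation (assumed known)
n = 378          # sample size
x_bar = 5525     # sample mean claimed by the city
alpha = 0.05     # level of significance

# Step 2: test statistic.
z = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 3: left-tail critical value, i.e. the alpha quantile of N(0, 1).
critical_value = NormalDist().inv_cdf(alpha)

# Step 4: decision, plus the p-value P(Z <= z) under H0.
reject_h0 = z <= critical_value
p_value = NormalDist().cdf(z)

print(f"z = {z:.2f}, critical value = {critical_value:.3f}")
print(f"reject H0: {reject_h0} (p-value = {p_value:.2e})")
```

Because the statistic lies far below the critical value, the conclusion is not sensitive to the choice of α.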

Note that here we assume that the distribution of health cost is normal, with the standard deviation in the city being the same as the SD of the rest of the country, i.e. $1116. If the population standard deviation is not available, the sample standard deviation s (in this case the city’s standard deviation of health cost) can be used instead. In that case the test statistic follows the t distribution (with n-1 degrees of freedom) and is given by:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Exercise 1: A business student claims that, on average, an MBA student is required to prepare more than five cases per week. To examine the claim, a statistics professor asks a random sample of 10 MBA students to report the number of cases they prepare weekly. The results are: 2, 7, 4, 8, 9, 5, 11, 3, 7, and 4. Can the professor conclude at the 5% significance level that the claim is true, assuming that the number of cases is normally distributed with a standard deviation of 1.5?
Note: Excel’s NORMINV function uses the left cumulative probability, so for a right-tailed test like this one use 0.95 as the probability in the Excel command.

Exercise 2: Consider the following hypothesis test.


H0: μ=15
H1: μ ≠15
A sample of 50 provided a sample mean of 14.15. The population standard
deviation is 3. Test the hypothesis at the 5% level of significance.
Note: this is a two tailed test, the rejection region is formulated on both left and
right side as follows:

Example 2: The following example illustrates the use of a hypothesis test in assessing whether or not gender discrimination is taking place in the labor force. A large firm employing tens of thousands of workers has been accused of discriminating against its female managers. The accusation is based on a random sample of 100 managers. The mean annual salary of the 38 female managers is $76,189, whereas the mean annual salary of the 62 male managers is $97,832. Use the data file discrimination.xls. The data appear below:

Female
 77290  101750   69640   99090  101400   30200  100250   88350  106830   47400   70160
 57960   56590   70030   83160   75490   50690   57100   80090   63330  100930   89050
129790   66380   46690   94470   87320   98050   80520   91560   99710   44760   31240
 50620   83430   67670   89150   57040

Male
142760  125360   77030  158960   85470   81640  145110   61260   93830   90990   97480
 68690  131560  110660   80320  109680   97790   36860   68620   93390   86010  132010
 91450   74810  114970   62260  116930   93090   76820   91850  126750   90070   97440
120580   20860   82430   92840   81760  175760  139260  165400  115500  105750  107570
 92630   79290  113920   82100   99680   87920   79100  122940  119470   47880  134540
 84340   64540  109360   98940   92720   68670   67920
Test the relevant hypothesis at 5% level of significance.

Sol: Let μ1 and μ2 represent the mean annual salaries of female and male managers.
Step 1: H0: μ1 = μ2 (No difference in average salaries i.e. no discrimination)
H1: μ1 < μ2 (Females being paid less on average so there is discrimination)
Step 2: Computation of test statistic:
$t = \frac{\bar{x}_1 - \bar{x}_2 - (\mu_1 - \mu_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
From the given data
Female: $n_1 = 38$, $\bar{x}_1 = 76189$, $s_1 = 22771$
Male: $n_2 = 62$, $\bar{x}_2 = 97832$, $s_2 = 29375$

$t = \frac{76189 - 97832 - (0)}{\sqrt{\frac{22771^2}{38} + \frac{29375^2}{62}}} = -4.122$
Step 3: Either from the t table or using the Excel command =T.INV(0.05,93), the critical value is -1.661.
Step 4: Since -4.122 is less than -1.661, the null hypothesis is rejected and we conclude that female managers are indeed being paid less than the males on average.
The Excel output is as follows:

t-Test: Two-Sample Assuming Unequal Variances

                               Variable 1     Variable 2
Mean                           76188.94737    97832.09677
Variance                       518509458.3    862916810.3
Observations                   38             62
Hypothesized Mean Difference   0
df                             93
t Stat                         -4.12246734
P(T<=t) one-tail               4.06901E-05
t Critical one-tail            1.661403674
P(T<=t) two-tail               8.13802E-05
t Critical two-tail            1.985801814
Note: There are two forms of the t test. The first assumes the population variances to be equal; the other does not make such an assumption, so the latter test is general enough to use, and hence we have used it above.
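
The unequal-variances test statistic in Example 2 can be re-derived from the summary figures alone. The Python sketch below also computes the Welch-Satterthwaite approximation to the degrees of freedom, which is assumed here to be the source of the df of 93 in the Excel output (after rounding). The critical value is simply the one quoted in Step 3, since the standard library has no t distribution; variable names are chosen for illustration.

```python
from math import sqrt

# Summary statistics from Example 2.
n1, mean1, s1 = 38, 76189, 22771   # female managers
n2, mean2, s2 = 62, 97832, 29375   # male managers

v1 = s1 ** 2 / n1   # estimated variance of the first sample mean
v2 = s2 ** 2 / n2   # estimated variance of the second sample mean

# Test statistic under H0: mu1 - mu2 = 0.
t_stat = (mean1 - mean2 - 0) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom.
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

t_critical = -1.661   # left-tail 5% critical value quoted in Step 3
print(f"t = {t_stat:.3f}, df = {df:.1f}, reject H0: {t_stat < t_critical}")
```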

Exercise: The manager of a company wanted to investigate the job offers recent
MBAs were obtaining. In particular, she wanted to know whether finance majors
were being offered higher salaries than marketing majors. In a preliminary study,
she randomly sampled 50 recently graduated MBAs, half of whom majored in
finance and half in marketing. From each she obtained the highest salary offer
(including benefits). These data are listed here. Can we infer that finance majors
obtain higher salary offers than do marketing majors among MBAs? (Use the excel
file mba.xls)

Study of relationships between a dependent and one or more independent variables: The Regression Analysis
Regression analysis is used to predict the value of one variable on the basis of other variables. This technique may be the most commonly used statistical procedure because, as you can easily appreciate, almost all companies and government institutions forecast variables such as product demand, interest rates, inflation rates, prices of raw materials, and labor costs. For example, a real estate agent wants to predict the selling price of houses more accurately. He believes that the following variables affect the price of a house: size of the house (number of square feet), number of bedrooms, frontage of the lot, condition, and location. This application illustrates that the primary motive for using regression analysis is forecasting. Nonetheless, analyzing the relationship among variables can also be quite useful in managerial decision making.


Given a sample of n pairs (x, y), the parameters β0 and β1 of the linear model y = β0 + β1 x + ε can be estimated by the method of least squares, which takes as estimates those values b0 and b1 for which the sum of squared errors $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is minimum, where $\hat{y}_i = b_0 + b_1 x_i$ is the predicted value of observation $y_i$.
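
For reference, in the one-variable case the least squares problem has a standard closed-form solution. With $\bar{x}$ and $\bar{y}$ denoting the sample means (notation introduced here for illustration), the estimates are:

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

Software such as Excel evaluates these expressions (or their matrix generalization for several independent variables) when it reports the regression coefficients.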

Example 1: Attempting to analyze the relationship between advertising and sales, the owner of a furniture store recorded the monthly advertising budget ($ thousands) and the sales ($ millions) for a sample of 12 months. The data are listed here (use the Excel file advertisingsales.xls).

Advertising   23    46    60    54    28    33    25    31    36    88    90    99
Sales         9.6   11.3  12.8  9.8   8.9   12.5  12.0  11.4  12.6  13.7  14.4  15.9

Using Excel’s Regression tool (part of the Analysis ToolPak, available from the Data tab under Data Analysis), the estimated regression output is as follows:

Regression Statistics
Multiple R           0.7788193
R Square             0.6065596
Adjusted R Square    0.5672155
Standard Error       1.3467958
Observations         12

ANOVA
              df    SS             MS          F          Significance F
Regression    1     27.96391184    27.96391    15.41681   0.002837
Residual      10    18.13858816    1.813859
Total         11    46.1025

              Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept     9.1003739      0.851528022      10.68711   8.62E-07   7.203051    10.9977
Advertising   0.0582309      0.014830505      3.926424   0.002837   0.025186    0.091275

From this output the estimated regression equation is:

$\hat{y}_i = 9.1 + 0.0582\,x_i$
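
The estimates can also be checked outside Excel. The following is a minimal Python sketch that applies the least squares formulas to the twelve observations listed in the example; the function name fit_simple_regression is hypothetical and only the standard library is used.

```python
# Monthly data from the example: advertising ($ thousands) and sales ($ millions).
advertising = [23, 46, 60, 54, 28, 33, 25, 31, 36, 88, 90, 99]
sales = [9.6, 11.3, 12.8, 9.8, 8.9, 12.5, 12.0, 11.4, 12.6, 13.7, 14.4, 15.9]


def fit_simple_regression(x, y):
    """Return (intercept, slope, r_squared) of the least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means.
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    r_squared = s_xy ** 2 / (s_xx * s_yy)
    return intercept, slope, r_squared


intercept, slope, r_squared = fit_simple_regression(advertising, sales)
print(round(intercept, 4), round(slope, 4), round(r_squared, 4))
```

The printed intercept, slope and R-square should agree with the Coefficients and R Square entries of the Excel output to the displayed precision.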

[Figure: scatter plot of Sales (vertical axis, 0 to 18) against Advertising (horizontal axis, 10 to 110)]

Note: In actual applications, samples are usually much larger.

Coefficients Interpretation:
Intercept: When advertising in a month is zero, sales are expected to be 9.1
million dollars.

Slope: When the advertising budget increases by 1 thousand dollars, sales increase
on average by 0.0582 million dollars (i.e. $58,200).

Prediction from the model: Suppose the company is planning an advertising budget
of $50,000 in a month. The estimated sales for that month are
9.1 + 0.0582(50) = 12.01 million dollars.
How much variation in actual sales the model is able to explain: R-square =
0.606 indicates that approx. 61% of the variation in sales is explained by the
advertising budget through this model. The remaining 39% of unexplained variation
is due to error, i.e. factors other than advertising, e.g. sales force, weather
(if the product demand is sensitive to temperature variation), etc.
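
The summary statistics in the Excel output are all functions of the ANOVA sums of squares. The following Python sketch recomputes them from the Regression and Residual sums of squares reported in the output; the variable names are illustrative only.

```python
import math

# Sums of squares copied from the ANOVA table of the advertising example.
ss_regression = 27.96391184
ss_residual = 18.13858816
n_obs = 12          # number of months in the sample
n_predictors = 1    # advertising is the only predictor

# Total variation splits into an explained and an unexplained part.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

# Adjusted R-square penalizes for the number of predictors used.
df_residual = n_obs - n_predictors - 1
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual

# The "Standard Error" line is the square root of the residual mean square.
ms_residual = ss_residual / df_residual
standard_error = math.sqrt(ms_residual)

# The F statistic compares explained and unexplained mean squares.
f_statistic = (ss_regression / n_predictors) / ms_residual

print(round(r_squared, 4), round(adj_r_squared, 4),
      round(standard_error, 4), round(f_statistic, 3))
```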

Test of hypothesis: The t-stat and p-value corresponding to the advertising
coefficient indicate that advertising is a significant predictor of sales. The
null hypothesis being tested, i.e. that advertising is a useless predictor (its
coefficient is zero), is easily rejected since the p-value of 0.0028 is well
below 0.05.
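
The t statistic and the 95% interval for the slope can be recomputed by hand from the coefficient and its standard error. The sketch below does this in Python; the critical value 2.228 (two-sided 5% level, 10 degrees of freedom) is taken from a standard t table and is an input assumed here, not a number reported in the output.

```python
# Slope on advertising and its standard error, copied from the regression output.
coefficient = 0.0582309
standard_error = 0.014830505

# Assumed input: two-sided 5% critical value of t with n - 2 = 10 degrees of freedom.
t_critical = 2.228

# Test statistic for H0: the slope is zero.
t_stat = coefficient / standard_error

# 95% confidence interval for the slope.
lower = coefficient - t_critical * standard_error
upper = coefficient + t_critical * standard_error

# H0 is rejected when the statistic lies beyond the critical value.
reject_null = abs(t_stat) > t_critical

print(round(t_stat, 3), round(lower, 4), round(upper, 4), reject_null)
```

Small differences from the Lower 95% and Upper 95% columns in the last decimals come from rounding the critical value to three decimals.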

Example 2: Consider the district-level data on the total number of fatalities
(deaths) for the year 2015 in Pakistan. Let's try to build a regression model
which can explain why violent crimes leading to fatalities (more specifically,
the number of fatalities per lac, i.e. per 100,000, population) are high. Since
the data are district-wise, we can consider district-level variables which could
explain the variation in the number of fatalities. To begin with, we can expect
crimes to be lower in districts which enjoy a good living standard, e.g.
education, health and high income levels. Fortunately, the district-level Human
Development Index (HDI) constructed by UNDP is available for Pakistan, which we
can use as an independent variable. An extract of the data is shown below.

Stata can be used for such regressions very conveniently. The Stata output and
the scatter plot with the fitted regression line appear below (use the Stata file
fatal.dta and the Excel file fatalities excludes Karachi.xls):

. twoway (scatter fatpop hdi) (lfit fatpop hdi)

The results indicate that as the HDI score of a district increases by one unit
(from zero to 1, say), the number of fatalities per lac population decreases by
approx. 15. A practically more interesting result is that as the HDI of a district
increases by 0.1 (e.g. from 0.6 to 0.7), the number of fatalities per lac
decreases by approximately 1.5. This decrease is statistically significant, as
seen from the t statistic and p-value.
twoway (scatter fatpop hdi, mlabel(district) mlabangle(45) mlabsize(tiny)) (lfit fatpop hdi)

We will examine other variables after going through the multiple regression topic.

Exercise: Critics of television often refer to the detrimental effects that all the
violence shown on television has on children. However, there may be another
problem. It may be that watching television also reduces the amount of physical
exercise, causing weight gains. A sample of 15 10-year-old children was taken.
The number of pounds each child was overweight was recorded (a negative
number indicates the child is underweight). In addition, the number of hours of
television viewing per week was also recorded. These data are listed here. Use the
excel file tv.xls

Estimate the regression to predict y = overweight from hours of TV viewing. Draw
the scatter plot and the regression line. Explain the percent variation in
overweight explained by the model and test whether TV viewing is a useful
predictor of overweight. Predict the overweight for a child who views 35 hours of
TV per week.

Multiple Regression: In any actual study explaining a dependent variable, there
is usually more than one predictor. This calls for a multiple regression.

The regression is estimated as usual by the least squares method, which is a
built-in procedure in Excel and all statistical software.
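
As a sketch of what the software computes, the multiple regression model with k predictors and its least squares estimates can be written compactly. The matrix notation (X for the matrix of predictor values with a leading column of ones, y for the vector of observations, b for the vector of estimates) is introduced here as an assumption and does not appear elsewhere in these notes.

```latex
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i,
\qquad
\mathbf{b} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}
```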

Example 1: The fire chief of an area has divided the area into nine fire districts
of approximately equal size. The chief wants to know how many fires a month to
expect in each district so that firefighters can be allocated accordingly. The
chief knows that older houses are more likely to burn than newer houses. In
addition, owner-occupied houses are expected to have fewer incidences of fire than
rented houses due to better management by owners, so the percent of owner-occupied
houses in a district is used as a second predictor. Use the data file fire.xls.
The data are as follows:

The data input for this regression is illustrated by above screen shot. The results are as follows:
Regression Statistics
Multiple R          0.99212305
R Square            0.98430816
Adjusted R Square   0.97907751
Standard Error      3.5510343
Observations        9

ANOVA
             df   SS           MS            F          Significance F
Regression   2    4745.8946    2372.94829    188.1822   3.864E-06
Residual     6    75.6590971   12.60984952
Total        8    4821.5556

                     Coefficients   Standard Error   t Stat         P-value     Lower 95%   Upper 95%
Intercept            38.1260379     2.5620847        14.87956568    5.798E-06   31.85294    44.3978
Age (X1)             1.5671063      0.0867279        18.06404664    1.852E-06   1.35483     1.779829
Percent Owned (X2)   -0.4922775     0.0364602        -13.49217536   1.028E-05   -0.58538    -0.402997

The estimated regression is:

$\hat{y}_i = 38.13 + 1.567\,X_1 - 0.492\,X_2$
Coefficient Interpretation: The regression coefficients indicate that as the
average age of houses increases by one year, the number of fires per month
increases by about 1.57, keeping the percent of owner-occupied houses fixed.
Similarly, if the percent of owner-occupied houses in a district increases by one
percentage point, the number of fires is reduced by about a half per month,
keeping the average age of houses in the district fixed. In other words, there is
about one less fire per month when the owner-occupied percentage rises by 2
percentage points.

R-Square: 0.984 indicates that 98.4% of the variation in the number of fires per
month is explained by these two variables through the multiple regression model.

Hypothesis Tests: The t stats and p-values of the two predictors indicate that
both are useful predictors of the number of fires per month.

Prediction: Suppose that in a district 60 percent of houses are owner occupied and
the average age of houses is 25 years. For this district the predicted number of
fires per month is 38.126 + 1.567(25) - 0.492(60) = 47.8 approx.
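
The prediction arithmetic can be checked with a short Python sketch that plugs the district's values into the estimated equation. The function name predict_fires is hypothetical; the coefficients are the full-precision values from the regression output.

```python
def predict_fires(avg_age_years, pct_owner_occupied):
    """Predicted fires per month from the estimated multiple regression."""
    # Coefficients copied from the Excel output: Intercept, Age (X1), Percent Owned (X2).
    intercept = 38.1260379
    b_age = 1.5671063
    b_owned = -0.4922775
    return intercept + b_age * avg_age_years + b_owned * pct_owner_occupied


# District with average house age of 25 years and 60 percent owner-occupied houses.
predicted = predict_fires(25, 60)
print(round(predicted, 2))  # about 47.77
```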

Example 2: A regression application in gender discrimination investigation: In
Example 2 of hypothesis testing, comparing the average salaries of male and female
managers, we concluded that female managers are being paid less than males on
average. But this may be due to male managers being more qualified or experienced.
A real assessment of gender discrimination is possible only if these factors are
controlled. Let's do this investigation through the multiple regression model. An
extract of the data on both female and male managers, along with their education
(years) and experience (years), is shown below. Note that the last variable is
gender, a binary variable coded 0 for a female and 1 for a male manager.
(Use the data file discrimination.xls)

The model is estimated as follows:


Regression Statistics
Multiple R          0.8325596
R Square            0.6931554
Adjusted R Square   0.6835666
Standard Error      16273.955
Observations        100

             Coefficients   Standard Error   t Stat   P-value   Lower 95%    Upper 95%
Intercept    -5835.104      16082.798        -0.363   0.718     -37759.207   26088.9
Education    2118.898       1018.486         2.080    0.040     97.220       4140.57
Experience   4099.338       317.194          12.924   0.000     3469.714     4728.96
Gender       1850.985       3703.070         0.500    0.618     -5499.551    9201.52

The estimated regression is:

$\hat{y}_i = -5835.1 + 2118.9\,X_1 + 4099.3\,X_2 + 1851.0\,X_3$

where $X_1$ is education, $X_2$ is experience and $X_3$ is gender.
The last coefficient indicates that, after taking into account education and
experience, male managers are paid on average approx. $1851 more than female
managers. However, this difference is not statistically significant, as indicated
by the high p-value (0.618) of the test of the corresponding coefficient. Thus our
previous conclusion from the t test comparing two means was misleading, since that
test could not control for the education and experience of managers.
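
The reading of the gender coefficient follows from comparing the predicted salaries of a male and a female manager who have the same education and experience. In the short derivation below, $b_0, \dots, b_3$ denote the estimated coefficients in the order of the output; this labelling is an assumption made here for illustration.

```latex
\hat{y}_{\text{male}} - \hat{y}_{\text{female}}
= (b_0 + b_1 X_1 + b_2 X_2 + b_3 \cdot 1) - (b_0 + b_1 X_1 + b_2 X_2 + b_3 \cdot 0)
= b_3 \approx 1851
```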

Example 3: Number of fatalities in Pakistan explained through multiple regression
In the simple regression we used only one variable (HDI) to explain the number of
fatalities per lac population in Pakistan for the year 2015 (Stata data file
fatal.dta). There can be other variables associated with high crime in a district,
e.g. population density, or a policy-related variable, e.g. the number of police
stations in the district. Let's examine the use of these variables. The following
output shows the regression of the number of fatalities per lac population in
districts on HDI and the additional explanatory variables of population density
and number of police stations in a district.

The results indicate that only HDI remains an important variable associated with
violent crime fatalities. Although population density appears with the correct
positive sign, the variable is not significant at % level. Similarly, although the
number of fatal crimes appears to be negatively associated with the number of
police stations, this variable is also statistically insignificant. As the model's
diagnostics indicate a great extent of non-normality, we can try a log-linear
model.

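* Log-transform the dependent variable (adding 1 keeps districts with zero fatalities)
gen lfatpop=log(1+fatpop)
* Re-estimate the regression with the log of fatalities as the dependent variable
reg lfatpop hdi popdensity policestations
* Save the residuals for diagnostic checks
predict e1, resid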

The results indicate that the variables appear with sensible signs and are also
significant.

The Chi Square test of Independence: Relationship between qualitative variables

Contingency table: Data on one variable are grouped into a frequency distribution.
Data on two variables are called bivariate data, and a frequency distribution for
bivariate data is called a contingency table, two-way table, cross-tabulation
table or cross tabs.
An example of a contingency table appears here, where a bivariate distribution is
tabulated on the level of job satisfaction (low, medium, high) and its
relationship with income level (low, medium, high).

                   Income
Job Satisfaction   Low   Medium   High   Total
Low                100   30       10     140
Medium             60    80       15     155
High               40    40       50     130
Total              200   150      75     425

An important analysis of the contingency table is to establish whether or not the
two variables are associated or dependent. The resulting test is known as the
Chi-Square test of independence.
This test is based on finding the expected frequency for each pair of categories
under the assumption of statistical independence and comparing it with the
observed cell frequency. If the two sets of frequencies are sufficiently close,
the null hypothesis of independence is not rejected.
The expected frequencies are computed as follows:

$\text{Expected frequency} = \dfrac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$

For example, the expected frequency corresponding to the first cell is

$\text{Expected frequency} = \dfrac{140 \times 200}{425} = 65.88$
The expected frequencies are computed for each cell and the following test
statistic is used:

$\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$

where $O_i$ and $E_i$ are the observed and expected frequencies of cell i. This
test statistic follows a Chi-Square distribution with (r − 1)(c − 1) degrees of
freedom, where r and c are the number of rows and the number of columns in the
contingency table for the two variables under consideration, provided that each
expected frequency is at least 5.
cell Observed frequency Expected frequency (O-E)^2/E
1 100 65.88235294 17.66806723
2 30 49.41176471 7.62605042
3 10 24.70588235 8.753501401
4 60 72.94117647 2.29601518
5 80 54.70588235 11.69512966
6 15 27.35294118 5.578747628
7 40 61.17647059 7.330316742
8 40 45.88235294 0.754147813
9 50 22.94117647 31.91553544
Sum 425 425 93.61751152

H0: Income and Job Satisfaction levels are independent
H1: Income and Job Satisfaction levels are dependent
The calculated test statistic is 93.62. With (r-1)(c-1) = (3-1)(3-1) = 4 df, the
5% critical Chi-Square value is 9.488 (in Excel, =CHISQ.INV(prob, df), i.e.
=CHISQ.INV(0.95, 4) = 9.488).

Since the calculated statistic exceeds the critical value, the null hypothesis of
independence is rejected and we conclude that the two variables are associated or
dependent, i.e. the job satisfaction level depends on income.
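
The whole calculation, from the observed table to the test decision, can be sketched in a few lines of Python. The function name chi_square_independence is hypothetical, and the 5% critical value for 4 degrees of freedom is entered by hand from a Chi-Square table rather than computed.

```python
# Observed frequencies: rows are job satisfaction (low, medium, high),
# columns are income (low, medium, high).
observed = [
    [100, 30, 10],
    [60, 80, 15],
    [40, 40, 50],
]


def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence: row total x column total / grand total.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


statistic, df = chi_square_independence(observed)

# Assumed input: 5% critical value of the Chi-Square distribution with 4 df.
critical_value = 9.488
reject_independence = statistic > critical_value

print(round(statistic, 2), df, reject_independence)
```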

Generating bivariate frequency tables in Excel: For a bivariate contingency table,
at least three columns are required in Excel. The first column contains the serial
numbers of the observations in the raw data. The other two columns contain the
category codes (1, 2, ...) of the row and column variables. Then go to
Excel > Insert > PivotTable. Drag the row and column variables into the respective
boxes and the serial number into the values (∑) box, summarized by Count rather
than Sum so that each cell shows the number of observations. This will create a
bivariate frequency distribution.
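
The same cross-tabulation logic can be sketched in Python by counting how often each pair of categories occurs. The six records below are hypothetical and serve only to show the layout of raw data (serial number, row category, column category); they are not taken from any data file of this workshop.

```python
from collections import Counter

# Hypothetical raw records: (serial number, job satisfaction, income).
records = [
    (1, "Low", "Low"),
    (2, "Low", "Medium"),
    (3, "High", "High"),
    (4, "Medium", "Low"),
    (5, "Low", "Low"),
    (6, "High", "High"),
]

# Count the observations falling in each (row category, column category) pair.
counts = Counter((satisfaction, income) for _, satisfaction, income in records)

# Arrange the counts as a two-way table with a fixed category order.
levels = ["Low", "Medium", "High"]
table = [[counts[(sat, inc)] for inc in levels] for sat in levels]

for sat, row in zip(levels, table):
    print(sat, row)
```

Each cell holds a count of observations, which mirrors summarizing the PivotTable values by Count.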
