
Chapter 1

An Introduction to Econometrics
and Statistical Inference

Copyright 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

Learning Objectives
Understand the steps involved in
conducting an empirical research project
Understand the meaning of the term
econometrics
Understand the relationship between
populations, samples, and statistical
inference
Understand the important role that
sampling distributions play in statistical
inference
1-2

What is an Empirical Research Project?
An empirical research project is a
project that applies empirical analysis
to observed data to provide insight
into questions of theoretical interest.

1-3

The 5 Steps in
Conducting an Empirical
Research Project

(1) Determining the question of interest
(2) Developing the appropriate theory to address
the question
(3) Collecting data that is appropriate for
empirically investigating the answer
(4) Implementing appropriate empirical techniques,
correctly interpreting results, and drawing
appropriate conclusions based on the estimated
results
(5) Effectively writing up a summary of the first
four steps
1-4

What is Econometrics?
Econometrics is the application of
statistical techniques to economic
data.

1-5

Populations, Samples,
and Statistical Inference
A population is the entire group of entities that we
are interested in learning about.
A sample is a subset or part of the population and
it is what is used to perform statistical inference.
Statistical inference is the process of drawing
conclusions from data that are subject to random
variation.

1-6

Populations, Samples,
and Statistical Inference
Continued

1-7

Some Important Definitions


A parameter is a function that exists
within the population.
A statistic is a function that is
computed from the sample data.
A point estimate is a single valued
statistic that is the best guess of a
population parameter.
1-8

Sampling Distributions
A sampling distribution is the distribution of a
sample statistic such as the sample mean.

A sampling distribution is constructed by
(1) collecting all possible samples of size n that could
be drawn from the unobserved population of size N
(2) calculating the value of a given statistic (say, the
sample mean) for each of those samples
(3) placing those values in order on the number line
to create a distribution known as a sampling
distribution
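
The three construction steps can be sketched in Python for a tiny population. The population values and the sample size below are hypothetical and chosen only so that every possible sample can be listed; samples are assumed to be drawn without replacement.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical population of N = 5 values (illustrative only).
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# (1) Collect all possible samples of size n (drawn without replacement).
samples = list(combinations(population, n))

# (2) Calculate the sample mean for each of those samples.
sample_means = [mean(s) for s in samples]

# (3) Place the values in order and tally how often each one occurs.
sampling_distribution = sorted(Counter(sample_means).items())

for value, count in sampling_distribution:
    print(f"sample mean = {value}: {count} of {len(samples)} samples")
```

In this sketch the average of all the sample means equals the population mean, which is one reason the sampling distribution is useful for inference.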
1-9

A Visual Example

1-10

Chapter 2

Collection and
Management of Data


Learning Objectives
Consider potential sources of data
Work through an example of the first
three steps in conducting an
empirical research project
Develop data management skills
Understand some useful Excel
commands

1-12

Goals of the Chapter

1-13

Types of Data
Cross-sectional data is data collected for
many different individuals, countries, firms,
etc. in a given time-period.
Time-series data is data collected for a
given individual, country, firm, etc. over
many different time periods.
Panel data are data collected for a number
of individuals, countries, firms, etc. over
many different time periods.
1-14

Primary Data Sources


private-use data
government surveys or internal firm-level data
obtained through formal request and/or having the
appropriate connections.

publicly-available data
obtained through the internet or through formal
Freedom of Information Act (FOIA) request

personal survey data


obtained by personally conducting a survey asking
people for information and recording their responses
1-15

An Example of the First Three Steps
Suppose you are trying to convince your
significant other to go camping but he or she
is afraid of bears.
How can you use your empirical research skills
to convince him or her that bear attacks are
not a realistic concern?
Step 1: Identify a question of interest
What factors affect the number of fatal bear
attacks in the US?
1-16

An Example of the First Three Steps
Step 2: Develop appropriate theory
The number of fatal bear attacks in the
US should depend on:
The number of bears
The number of campers
Square feet of national parkland

1-17

An Example of the First Three Steps

Step 3: Collect appropriate data

Start with an internet search for the data you seek

1-18

An Example of the First Three Steps

Download data to Excel and then repeat the process
for the independent variables you seek.
1-19

Data Management Skills


Two important points:
(1) When working with data, it is
common to make mistakes which
alter the initial data
(2) When working on a larger
project, it is common to take
time off before returning to the
project
1-20

Data Management Skills


Our goals with data management
are to be able to:
(1) Recreate our initial data as easily
as possible
(2) Recall what we had previously
done as easily as possible

1-21

Data Management Skills


When working with data, we recommend:
(1) Creating a Master file with the initial
data and performing calculations in a
different working file
(2) Exhaustively documenting all initial data
sources
(3) Making file and variable names as
intuitive as possible
(4) Documenting all commands used when
performing estimation
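
One possible way to follow these recommendations outside of Excel is sketched below in Python. The folder, the file names, and the two data rows are all hypothetical placeholders rather than real data; the point is only that the master file is never edited and every step is written down.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical project folder and file names, for illustration only.
project = Path(tempfile.mkdtemp())
master = project / "bear_attacks_master.csv"
working = project / "bear_attacks_working.csv"
log = project / "data_log.txt"

# Stand-in for the downloaded initial data (made-up placeholder values).
master.write_text("year,fatal_attacks\n2001,1\n2002,3\n")

# (1) Leave the master file untouched; do all calculations on a copy.
shutil.copy(master, working)

# (2) and (4) Record the data source and every command that was run.
log.write_text(
    "Source: <URL and download date go here>\n"
    "Step 1: copied bear_attacks_master.csv to bear_attacks_working.csv\n"
)
```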
1-22

Chapter 3

Summary Statistics


Learning Objectives

Construct relative frequency histograms


Calculate measures of central tendency
Calculate measures of dispersion
Use measures of central tendency and
dispersion
Detect whether outliers are present
Construct scatter diagrams for the
relationship between two variables
Calculate the covariance and the correlation
coefficient between two variables
1-24

1-25

Construct a Relative
Frequency Histogram
A bar chart that shows how often
observations lie within specified classes
Allows a visual inspection of the data
Based on a Relative Frequency Table
The example dataset for constructing a
histogram is states.xls, a survey that asked
econometrics students how many states
they have visited.
1-26

Number of States Visited
[Figure: relative frequency histogram. Vertical axis: Relative Frequency (0 to 0.45). Horizontal axis: Number of States Visited, with classes 0-5.99, 6-11.99, 12-17.99, and 18-24.]

1-27

To create a frequency
distribution we must
1. Select the number of classes
2. Choose the class interval or width of
the classes
3. Select the class boundaries or the
values that form the interval for each
class
4. Count the number of values in the
dataset that fall in each class
1-28

Step 1: Select the number of classes
The rule for determining the approximate
number of classes is:
$\text{Approximate number of classes} = [(2)(\text{Number of observations})]^{0.3333}$
The actual number of classes is the integer
value that just exceeds the computed value.
If the formula gives us 4.66 we use 5
1-29

Step 1: Example
We have 43 data points so the rule is:
Approximate number of classes = [(2)(20)].3333
= 3.503
Round this up to the next integer value which is 4.
The number of classes is 4.
**Always round up!!

1-30

Step 2: Choose the width of the interval
The rule for determining interval width is:
$\text{Approximate interval width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$

The actual interval width is the integer
value that just exceeds the computed value.
If the formula gives us 6.17 we use 7
**Always round up!!
1-31

Step 2: Example
Approximate interval width = (24-1)/4
= 5.75
Round up to 6.
Therefore the class width is 6.

1-32

Step 3: Select the class boundaries
Class boundaries must be chosen such that
each data item belongs to one and only one
class.
Start just below the lowest value in the dataset
to get the lower boundary. The lower
boundary for the second class is then found by
adding the class width. The upper boundary
for the first class is found by subtracting .01
from the lower boundary of the second class.
Keep adding the class width and subtracting
.01 to get the boundaries.

1-33

Step 3: Example
Lowest data point is 1. We will start
our classes at 0.
Lower boundaries:
Class 1 = 0
Class 2 = 6 (=0+6)
Class 3 = 12 (=6+6)
Class 4 = 18 (=12+6)

1-34

Step 3: Example
Continued
Class boundaries are then:
Class 1: 0-5.99
Class 2: 6-11.99
Class 3: 12-17.99
Class 4: 18-24
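
As a rough check of Steps 2 and 3, the width and boundaries can be reproduced in a few lines of Python. The inputs (smallest value 1, largest value 24, 4 classes, starting point 0) are taken from the example; closing the last class at the largest value mirrors the 18-24 class on the slide, and the variable names are just illustrative.

```python
import math

smallest, largest = 1, 24   # smallest and largest data values in the example
num_classes = 4             # result of Step 1 in the example
start = 0                   # chosen just below the lowest value

# Step 2: approximate width, always rounded up.
width = math.ceil((largest - smallest) / num_classes)

# Step 3: lower boundaries by repeatedly adding the width,
# upper boundaries by subtracting .01 from the next lower boundary.
lower = [start + k * width for k in range(num_classes)]
upper = [round(lo + width - 0.01, 2) for lo in lower]
upper[-1] = largest  # the example closes the last class at the largest value

for lo, up in zip(lower, upper):
    print(f"{lo}-{up}")
```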

1-35

Step 4: Count the number of values in the dataset
that fall into each class

Doing this by hand is tedious and, therefore, we want
to rely on Excel to do this for us.
Enter the class boundaries into Excel
next to the data set.
Enter the Upper Boundaries of each
of the classes
Use the Frequency command

1-36

How to use the Frequency command in Excel

1. Select the cells next to the class intervals
where the frequencies should go (say E2:E6).
2. Type, but do not enter, the formula
=Frequency(A2:A44,D2:D6)
A2:A44 contains the data and D2:D6 contains the
ending class boundaries.
3. Press CTRL+SHIFT+ENTER and the array
formula will be entered into each of the cells
E2:E6.
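
For readers working outside Excel, the counting step can be sketched in Python. The function name `frequency` and the ten responses below are hypothetical (they are not the states.xls data); the sketch assumes, as with the upper boundaries entered in Excel, that a value belongs to the first class whose upper limit is at least as large as the value.

```python
from bisect import bisect_left

def frequency(data, upper_limits):
    """Count how many values fall in each class, given ascending upper limits.

    A value is assigned to the first class whose upper limit is >= the value.
    Values above the last upper limit are ignored in this sketch.
    """
    counts = [0] * len(upper_limits)
    for value in data:
        k = bisect_left(upper_limits, value)  # index of first limit >= value
        if k < len(upper_limits):
            counts[k] += 1
    return counts

# Hypothetical responses, for illustration only.
states_visited = [1, 3, 5, 6, 8, 11, 12, 15, 18, 24]
upper_limits = [5.99, 11.99, 17.99, 24]

print(frequency(states_visited, upper_limits))
```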
1-37

Our Excel Results

Class Boundaries   Upper Limit   Frequency
0-5.99             5.99          18
6-11.99            11.99         18
12-17.99           17.99         4
18-24              24.00         3

1-38

Creating relative
frequency and percent
frequency distributions
Recall that the relative frequency is
the proportion of the observations
belonging to a class. With n
observations
$\text{Relative frequency of a class} = \frac{\text{Frequency of the class}}{n}$
The percent frequency is the relative
frequency multiplied by 100.
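
Applying these definitions to the class frequencies reported in the Excel results (18, 18, 4, and 3) gives the relative and percent frequencies. The short Python sketch below is only one way to do the arithmetic; the variable names are illustrative.

```python
classes = ["0-5.99", "6-11.99", "12-17.99", "18-24"]
frequencies = [18, 18, 4, 3]   # class frequencies from the Excel results

n = sum(frequencies)                        # total number of observations
relative = [f / n for f in frequencies]     # proportion in each class
percent = [100 * r for r in relative]       # relative frequency times 100

for label, r, p in zip(classes, relative, percent):
    print(f"{label}: relative = {r:.4f}, percent = {p:.2f}%")
```

The relative frequencies sum to 1, which is a quick check that every observation was counted exactly once.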
1-39

Relative Frequency Table

1-40

Using Excel's Chart Wizard to Construct a Histogram

1. Use the frequency distribution we
just constructed and highlight the
frequencies
2. Click the Chart Wizard and choose
column in the chart type
3. Click on the Category (X) axis labels
box and enter the class boundaries
4. To get the bars to touch, right-click
on any rectangle in the column
chart and choose Format Data
Series. Select the Options tab and

1-41

Number of States Visited
[Figure: relative frequency histogram. Vertical axis: Relative Frequency (0 to 0.45). Horizontal axis: Number of States Visited, with classes 0-5.99, 6-11.99, 12-17.99, and 18-24.]


1-42

Soda Consumption Data


Your mission is to pair up with a
classmate and draw what you think
the histogram for soda consumption
looks like.

1-43

Calculate Measures of
Central Tendency
Central tendency is the middle value of a
dataset.
The measure of central tendency is typically
thought of as the number that best describes
the data.
Measures of central tendency are:
(1)Mean
(2)Median
1-44

Measure of Central
Tendency - Mean
The mean is the arithmetic average of the data. To
calculate the mean, sum all the observations and divide by
the number of observations.

Represented by the symbol $\bar{x}$:
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n)$
Mean
For the following small data set:
95 85 99 92 80
Mean =(95+85+99+92+80)/5 = 451/5 = 90.2
In Excel =average(highlight data)
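
The same calculation can be checked in Python; this is just the arithmetic above written out, using the small data set from the slide.

```python
data = [95, 85, 99, 92, 80]

# Sum all the observations and divide by the number of observations.
x_bar = sum(data) / len(data)

print(x_bar)
```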

1-45

Measure of Central
Tendency - Median
Median: the middle observation when the data are
arranged from smallest to largest, sometimes called
the 50th percentile. Half the observations lie below
the median and half the observations lie above the
median.
The median is the middle observation for an odd
number of ordered observations and the average of
the middle two ordered observations for an even
number of observations.
The median is an order statistic, so in order to calculate
it the data must be ordered from smallest to largest.
1-46

Measure of Central
Tendency - Median
Median: the central observation for an odd number of
observations and the average of the two middle data points
for an even number of observations
For the following small data set :
95 85 99 92 80
(ordered data 80 85 92 95 99)
Median = 92 (the 3rd data point)
If we had 75 80 85 92 95 99
median =(.5*85)+(.5*92) = (85+92)/2 = 42.5+46 = 88.5
In Excel =median(highlight data)
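
A minimal Python sketch of the odd and even cases follows. The function name `median` is an illustrative choice; Python's standard library also offers `statistics.median`, which should give the same results for these data.

```python
def median(values):
    """Middle observation of the ordered data (average of the middle two if n is even)."""
    ordered = sorted(values)          # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the middle two

print(median([95, 85, 99, 92, 80]))        # five observations
print(median([75, 80, 85, 92, 95, 99]))    # six observations
```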
1-47

Calculate Measures of
Dispersion
Dispersion is a measure of how the
data vary.
Measures of dispersion are:
(1)Variance
(2)Standard Deviation
(3)Percentiles
(4)Five Number Summary
1-48

Measure of Dispersion - Variance and Standard
Deviation
Standard deviation: a measure of the typical deviation away from the
mean. It is the square root of the variance.
The variance is calculated by subtracting the mean from
each observation, squaring that value, adding up all n values,
and then dividing that by the number of observations less
one.
Sample variance formula is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
Standard deviation is
$s = \sqrt{s^2}$
In Excel = var(highlight data)
= stdev(highlight data)
1-49

Measure of Dispersion - Variance and Standard
Deviation
Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$

For the following small data set:
95 85 99 92 80
$s^2 = [(95-90.2)^2 + (85-90.2)^2 + (99-90.2)^2 + (92-90.2)^2 + (80-90.2)^2]/4 = 234.8/4 = 58.7$
Sample standard deviation:
$s = \sqrt{s^2} = \sqrt{58.7} = 7.6616$
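
The worked example can be verified with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n-1 divisor. The manual calculation is included alongside so each step of the formula is visible.

```python
from math import sqrt
from statistics import stdev, variance

data = [95, 85, 99, 92, 80]
n = len(data)
x_bar = sum(data) / n

# Subtract the mean from each observation, square, add up, divide by n - 1.
squared_deviations = [(x - x_bar) ** 2 for x in data]
s2_manual = sum(squared_deviations) / (n - 1)
s_manual = sqrt(s2_manual)

# The library functions compute the same sample statistics.
s2 = variance(data)
s = stdev(data)

print(round(s2, 4), round(s, 4))
```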
1-50

Measure of Dispersion - Percentile
A percentile is a number such that p% of the ordered
observations lie below the percentile and (100-p)% of the
observations lie above the percentile.
The median is the 50th percentile and an example of a
percentile where 50% of the ordered data lies below
that level and 50% of the ordered data lies above that
level.
A percentile is an order statistic.
There are many different ways to calculate percentiles.
The next slide shows one of the easiest ways to calculate
percentiles.
1-51

Steps to Calculate a Percentile, p
Here p is expressed as a proportion (for example, .1 for the 10th percentile).
(1) Sort the data from low to high
(2) Count the number of observations, n
(3) Select the p(n+1)th observation
(4) If the value p(n+1) is not a whole number then select the
closest whole number
(5) If p(n+1) is less than 1 then select the smallest number
(6) If p(n+1) is greater than n then select the largest number.
In Excel =percentile(highlight data, p)
Note that the steps to calculate a percentile by hand and
calculating percentiles in Excel will likely not result in the
same value.

1-52

Measure of Dispersion - Percentile

Calculate the 10th and the 70th percentile for the
following small data set:
95 85 99 92 80
(ordered data 80 85 92 95 99)
10th percentile: select the .1(n+1) = .1(6) = .6
number in the data set.
The closest whole number is 1 so the 10th
percentile is the first observation or 80.
70th percentile: select the .7(n+1) = .7(6) = 4.2
number in the data set.
The closest whole number is 4 so the 70th
percentile is the fourth observation or 95.
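
The by-hand procedure can be sketched as a small Python function. The name `percentile` and the tie-breaking choice (positions exactly halfway between two whole numbers are rounded up) are assumptions of this sketch; the slides do not say how such ties are handled. Here p is a proportion such as 0.1 or 0.7.

```python
import math

def percentile(data, p):
    """By-hand percentile: pick the observation at position p * (n + 1)."""
    ordered = sorted(data)                  # (1) sort from low to high
    n = len(ordered)                        # (2) count the observations
    position = p * (n + 1)                  # (3) position of the percentile
    rank = math.floor(position + 0.5)       # (4) closest whole number (halves round up)
    rank = max(1, min(n, rank))             # (5), (6) stay within the data
    return ordered[rank - 1]                # positions are counted from 1

data = [95, 85, 99, 92, 80]
print(percentile(data, 0.1))   # 10th percentile
print(percentile(data, 0.7))   # 70th percentile
```

As the slides note, this procedure will generally not match Excel's percentile command, which interpolates between observations.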

1-53

Measure of Dispersion
Five Number Summary
The Five Number Summary is
(1) Minimum
(2) Q1 or 25th Percentile
(3) Q2 or Median (50th Percentile)
(4) Q3 or 75th Percentile
(5) Maximum

1-54

How to Calculate the Five Number Summary in
Excel
Minimum =Min (data)
Q1 or 25th Percentile
=percentile(data,.25) or
=quartile(data,1)
Q3 or 75th Percentile
=percentile(data,.75) or
=quartile(data,3)
Maximum =Max (data)
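
A comparable calculation in Python is sketched below using the small data set from the earlier slides. It relies on `statistics.quantiles` with `method="inclusive"`, which interpolates between observations; treating that as the counterpart of Excel's quartile command is an assumption of this sketch, and other software may use slightly different interpolation rules.

```python
from statistics import quantiles

data = [95, 85, 99, 92, 80]

# Quartiles by linear interpolation between the ordered observations.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

five_number_summary = {
    "Minimum": min(data),
    "Q1": q1,
    "Median": q2,
    "Q3": q3,
    "Maximum": max(data),
}

for name, value in five_number_summary.items():
    print(f"{name}: {value}")
```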
1-55

Shapes of Histograms
Symmetric
Skewed to the right or Positively
skewed
Skewed to the left or Negatively
Skewed
Bimodal

1-56

Symmetric
Histogram

1-57

Positively Skewed
Distribution

1-58

Negatively Skewed
Distribution

1-59

Bimodal
Distribution

1-60

Positively Skewed
Distribution
Median = 2.77
Mean = 4.16

1-61

Why is the shape of the histogram important?
The shape of the empirical
distribution dictates which summary
statistics should be used
Symmetric: use mean and standard
deviation
Skewed: use median and five number
summary

1-62

How to determine if
your data is skewed or
symmetric
Pearson's coefficient of skewness:
sk = 3*(mean-median)/(standard dev.)
Rule of Thumb:
If sk<-.5 or sk>.5 then the distribution
is skewed.
Otherwise the distribution is symmetric.
sk < -.5: negatively skewed
-.5 <= sk <= .5: symmetric
sk > .5: positively skewed
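
The rule of thumb is easy to automate. The sketch below applies it to the five-observation data set used in the earlier slides; the function names are hypothetical, and with so few observations the result is only an illustration of the arithmetic.

```python
from statistics import mean, median, stdev

def pearson_skewness(data):
    """Pearson's coefficient of skewness: 3 * (mean - median) / standard deviation."""
    return 3 * (mean(data) - median(data)) / stdev(data)

def shape(sk):
    """Classify a distribution with the +/- .5 rule of thumb."""
    if sk < -0.5:
        return "negatively skewed"
    if sk > 0.5:
        return "positively skewed"
    return "symmetric"

data = [95, 85, 99, 92, 80]
sk = pearson_skewness(data)
print(round(sk, 4), shape(sk))
```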
1-63

Symmetric Histogram
Mean = .5013
Standard Deviation =.019

1-64

Positively Skewed
Distribution
Median = 2.779

Five Number Summary

Minimum   0.008
Q1        1.1578
Median    2.779
Q3        5.643
Maximum   29.001
1-65

How to Detect Outliers with Symmetric Data
Use the Empirical Rule
About 68% of the data should be within one standard deviation of the
mean
$\bar{x} \pm s$
About 95% of the data should be within two standard deviations of
the mean
$\bar{x} \pm 2s$
Nearly 100% (about 99.7%) of the data should be within three standard
deviations of the mean
$\bar{x} \pm 3s$
Therefore, an observation is an outlier if it lies beyond three
standard deviations from the mean, that is, outside the interval
$(\bar{x} - 3s, \bar{x} + 3s)$
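
As an illustration, the three-standard-deviation interval can be computed for the symmetric histogram shown earlier (mean = .5013, standard deviation = .019). The two test values passed to `is_outlier` are hypothetical.

```python
x_bar = 0.5013   # mean reported for the symmetric histogram
s = 0.019        # standard deviation reported for the symmetric histogram

# Interval covering three standard deviations on either side of the mean.
lower = x_bar - 3 * s
upper = x_bar + 3 * s

def is_outlier(x):
    """An observation is an outlier if it lies outside (x_bar - 3s, x_bar + 3s)."""
    return x < lower or x > upper

print(round(lower, 4), round(upper, 4))
print(is_outlier(0.60), is_outlier(0.50))   # hypothetical observations
```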

1-66

How to detect an outlier with skewed data
Calculate the interquartile range or
IQR = Q3 - Q1.
If a value is greater than Q3 plus
1.5*IQR or less than Q1 minus
1.5*IQR then it's a moderate outlier
If a value is greater than Q3 plus
3*IQR or less than Q1 minus 3*IQR
then it's an extreme outlier
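
Using the five number summary reported for the positively skewed distribution (Q1 = 1.1578, Q3 = 5.643, maximum = 29.001), the fences can be computed as follows. The function name `classify` and the test value 15 are hypothetical.

```python
q1, q3 = 1.1578, 5.643   # quartiles from the five number summary
iqr = q3 - q1            # interquartile range

def classify(x):
    """Label a value using the 1.5*IQR and 3*IQR rules."""
    if x > q3 + 3 * iqr or x < q1 - 3 * iqr:
        return "extreme outlier"
    if x > q3 + 1.5 * iqr or x < q1 - 1.5 * iqr:
        return "moderate outlier"
    return "not an outlier"

print(round(iqr, 4))
print(classify(29.001))   # the maximum in the five number summary
print(classify(2.779))    # the median
print(classify(15))       # hypothetical value between the two upper fences
```

Checking the extreme rule first matters, because any extreme outlier also satisfies the moderate rule.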
1-67

Construct Scatter Diagrams for the Relationship between
two Random Variables
A scatter diagram (or scatter plot)
is used to show the relationship
between two variables
It contains one variable on the x-axis
and the other variable on the y-axis
A scatter diagram shows how the two
variables are related to each other,
both the strength and direction of the
relationship
1-68

Scatter Diagram
Examples
[Figure: scatter diagrams of y against x illustrating a positive linear relationship, a negative linear relationship, and curvilinear relationships]
1-69

Scatter Diagram
Examples
[Figure: scatter diagrams of y against x illustrating strong relationships and weak relationships]
1-70

Scatter Diagrams
Examples
[Figure: scatter diagrams of y against x illustrating no relationship]
1-71

Salary vs. Years of Education

1-72

How to Create a Scatter Diagram in Excel
Highlight the data making sure that
the variable you want on the y-axis is
on the right
Select Insert and then Scatter
and click on the first option
Make sure to change the chart title,
add axis titles.
Possibly delete the legend and
change the start values for the axis.
1-73

Salary vs. Experience
[Figure: scatter diagram. Vertical axis: Salary (dollars), 0 to 160,000. Horizontal axis: Experience (years), 10 to 22.]

1-74

What does the Scatter Diagram on the previous
slide tell us?
The relationship between education
and salary is positive (in general, as
education increases, salary increases)
The relationship is fairly strong
because the data points are closely
clustered together
This scatter diagram indicates that
while the variable education is
helpful for predicting salaries, it will
not yield perfect predictions.

1-75

Covariance and the Correlation Coefficient for
the Linear Relationship
between two variables
The covariance and the correlation
coefficient supply a numeric value for
the strength and direction of the linear
relationship between two variables
Only concerned with strength of the
relationship
No causal effect is implied
1-76

Covariance
Covariance is a measure of the linear
relationship between two random variables
A positive covariance indicates a positive
linear relationship between x and y (if x is
below its mean then y tends to be below its
mean and if x is above its mean then y
tends to be above its mean)
A negative covariance indicates a negative
linear relationship between x and y (if x is
below its mean then y tends to be above its
mean and if x is above its mean then y
tends to be below its mean)
1-77

Covariance
A covariance near 0 indicates no linear
relationship between x and y
A problem with covariance is that it
depends on the units of measurement for x
and y. If we change from measuring in feet
to inches, the covariance will go up even
though the overall relationship hasn't
changed.

1-78

Covariance - a Measure of
Linear Association
Between Two Variables
Remember the formula for variance is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})}{n-1}$
or how x varies with itself.

The formula for Covariance is
$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
and it measures how x varies with y in a linear
fashion.
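
The covariance formula translates directly into a short Python function. The function name `sample_covariance` and the five (x, y) pairs below are hypothetical and serve only to show the mechanics, including the fact that the covariance of x with itself is the sample variance.

```python
def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (n - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(sample_covariance(x, y))   # how x varies with y
print(sample_covariance(x, x))   # how x varies with itself: the sample variance
```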

1-79

Applying the Covariance Formula

Cov(x,y) = Sum/(n-1) = 743000/9 = 82,555.5556
1-80

Calculating Covariance in
Excel
In some versions of Excel, the covar command divides
by n rather than n-1, so it returns the population
covariance instead of the sample covariance.
The Excel command is
=Covar(highlight x values, highlight
y values)
You should perform this command in Excel for
the data set above and see if it matches the
value 82,555.5556.
If you obtain 74,300 using the covar
command (which is likely), you must multiply
the value you obtain in Excel by n/(n-1) to
obtain the correct value for the sample covariance.
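
The adjustment is simple arithmetic, sketched here in Python with the numbers from the example: a sum of cross-products of 743,000 and n = 10 observations (implied by the divisor n-1 = 9).

```python
sum_of_products = 743000   # sum of (x_i - x_bar)(y_i - y_bar) from the example
n = 10                     # number of observations (so n - 1 = 9)

# Dividing by n gives the value 74,300 mentioned above.
divided_by_n = sum_of_products / n

# Multiplying by n / (n - 1) converts it to the sample covariance.
sample_cov = divided_by_n * n / (n - 1)

print(divided_by_n, round(sample_cov, 4))
```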

1-81

Correlation Coefficient
The sample correlation coefficient,
rxy, is an estimate of the population
correlation coefficient and is used to
measure the strength and direction of
the linear relationship between two random
variables.
The correlation is a unit free measure
(unlike the covariance) and falls
between -1 and 1.
1-82

What Does the Correlation Coefficient Mean?

If all the points in a data set fall on a positively
sloped line, rxy =1.
If all the points in a data set fall on a negatively
sloped line, rxy =-1.
If there is no linear relationship between x and y
then rxy =0.
The closer to -1, the stronger the negative
linear relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear
relationship
1-83

Examples of Approximate
rxy Values
[Figure: scatter diagrams of y against x illustrating r = -1, r = -.6, r = 0, r = +.3, and r = +1]
1-84

Calculating the Correlation Coefficient
Sample correlation coefficient:
$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\text{st.dev.}(x)\,\text{st.dev.}(y)} = \frac{s_{xy}}{s_x s_y}$

From above, the standard deviation of x
is 2.708 and the standard deviation of y
is 38,189.0037.
$r_{xy} = \frac{82{,}555.5556}{(2.708)(38{,}189.0037)} = 0.7983$
A correlation of 0.7983 means that
education and salary are positively related
and the relationship is strong (because this
value lies near 1)
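
The division can be checked in Python using the covariance and standard deviations quoted on this slide; the variable names are illustrative.

```python
cov_xy = 82555.5556     # sample covariance of x and y
s_x = 2.708             # standard deviation of x
s_y = 38189.0037        # standard deviation of y

# Correlation = covariance divided by the product of the standard deviations.
r_xy = cov_xy / (s_x * s_y)

print(round(r_xy, 4))
```

Because the covariance is divided by the two standard deviations, the units cancel and the result always lies between -1 and 1.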
1-85

What Does Correlation Mean?

Correlation provides a measure of linear
association between two variables. A
correlation coefficient near 0 only means
that there is a weak linear association between
the two variables, not that there isn't any
relationship between the two variables.
A high correlation between two variables does
not mean that changes in one variable will
cause changes in the other variable.
We might find that the quality rating and the
typical mean price of restaurants are positively
correlated. However, simply increasing the
mean price at a restaurant will not cause the

1-86