Learning objectives

May 06, 2015

Chapter 1

An Introduction to Econometrics

and Statistical Inference

Copyright 2014 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.

Learning Objectives

Understand the steps involved in conducting an empirical research project

Understand the meaning of the term econometrics

Understand the relationship between populations, samples, and statistical inference

Understand the important role that sampling distributions play in statistical inference

1-2

What Is an Empirical Research Project?

An empirical research project is a

project that applies empirical analysis

to observed data to provide insight

into questions of theoretical interest.

1-3

The 5 Steps in Conducting an Empirical Research Project

(1) Identifying a question of interest

(2) Developing the appropriate theory to address

the question

(3)Collecting data that is appropriate for

empirically investigating the answer

(4)Implementing appropriate empirical techniques,

correctly interpreting results, and drawing

appropriate conclusions based on the estimated

results

(5)Effectively writing up a summary of the first

four steps

1-4

What is Econometrics?

Econometrics is the application of

statistical techniques to economic

data.

1-5

Populations, Samples,

and Statistical Inference

A population is the entire group of entities that we

are interested in learning about.

A sample is a subset or part of the population and

it is what is used to perform statistical inference.

Statistical inference is the process of drawing

conclusions from data that are subject to random

variation.

1-6

Populations, Samples,

and Statistical Inference

Continued

1-7

A parameter is a function of the population values; it exists within the population.

A statistic is a function that is computed from the sample data.

A point estimate is a single-valued statistic that is the best guess of a population parameter.

1-8

Sampling Distributions

A sampling distribution is the distribution of a sample statistic.

A sampling distribution is constructed by

(1) collecting all possible samples of size n that could be drawn from the unobserved population of size N

(2) calculating the value of a given statistic (say, the sample mean) for each of those samples

(3) placing those values in order on the number line to create a distribution known as a sampling distribution
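
The three construction steps can be sketched in Python for a tiny population. This is only an illustration under stated assumptions: the population values are hypothetical (a real population is unobserved), and samples are drawn without replacement and without regard to order.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

# Hypothetical population of size N = 5 (in practice the population is unobserved).
population = [2, 4, 6, 8, 10]
n = 2  # sample size

# (1) Collect all possible samples of size n.
samples = list(combinations(population, n))

# (2) Calculate the statistic (here, the sample mean) for each sample.
sample_means = [mean(s) for s in samples]

# (3) Place the values in order to form the sampling distribution.
sampling_distribution = sorted(Counter(sample_means).items())

for value, count in sampling_distribution:
    print(f"sample mean = {value}: {count} of {len(samples)} samples")
```

In this sketch the average of all the sample means equals the population mean, which is one reason sampling distributions matter for statistical inference.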

1-9

A Visual Example

1-10

Chapter 2

Collection and

Management of Data

Learning Objectives

Consider potential sources of data

Work through an example of the first

three steps in conducting an

empirical research project

Develop data management skills

Understand some useful Excel

commands

1-12

1-13

Types of Data

Cross-sectional data are data collected for many different individuals, countries, firms, etc. in a given time period.

Time-series data are data collected for a given individual, country, firm, etc. over many different time periods.

Panel data are data collected for a number of individuals, countries, firms, etc. over many different time periods.

1-14

Private-use data: government surveys or internal firm-level data obtained through a formal request and/or having the appropriate connections.

Publicly available data: obtained through the internet or through a formal Freedom of Information Act (FOIA) request.

Personally collected data: obtained by personally conducting a survey, asking people for information, and recording their responses.

1-15

Three Steps

Suppose you are trying to convince your

significant other to go camping but he or she

is afraid of bears.

How can you use your empirical research skills

to convince him or her that bear attacks are

not a realistic concern?

Step 1: Identify a question of interest

What factors affect the number of fatal bear

attacks in the US?

1-16

Steps

Step 2: Develop appropriate theory

The number of fatal bear attacks in the

US should depend on:

The number of bears

The number of campers

Square feet of national parkland

1-17

Three Steps

you seek

1-18

Three Steps

for the independent variables you seek.

1-19

Two important points:

(1) When working with data, it is

common to make mistakes which

alter the initial data

(2) When working on a larger

project, it is common to take

time off before returning to the

project

1-20

Our goals with data management

are to be able to:

(1) Recreate our initial data as easily

as possible

(2) Recall what we had previously

done as easily as possible

1-21

When working with data, we recommend:

(1)Creating a Master file with the initial

data and performing calculations in a

different working file

(2)Exhaustively documenting all initial data

sources

(3)Making file and variable names as

intuitive as possible

(4)Documenting all commands used when

performing estimation

1-22

Chapter 3

Summary Statistics

Learning Objectives

Calculate measures of central tendency

Calculate measures of dispersion

Use measures of central tendency and

dispersion

Detect whether outliers are present

Construct scatter diagrams for the

relationship between two variables

Calculate the covariance and the correlation

coefficient between two variables

1-24

1-25

Construct a Relative

Frequency Histogram

A bar chart that shows how often observations lie within specified classes

Allows a visual inspection of the data

Based on a Relative Frequency Table

The example dataset for constructing a histogram is states.xls, a survey of econometrics students that asked how many states they have visited.

1-26

[Figure: relative frequency histogram. Vertical axis: Relative Frequency, 0 to 0.45. Classes on the horizontal axis: 0-5.99, 6-11.99, 12-17.99, 18-24.]

1-27

To create a frequency

distribution we must

1. Select the number of classes

2. Choose the class interval or width of

the classes

3. Select the class boundaries or the

values that form the interval for each

class

4. Count the number of values in the

dataset that fall in each class

1-28

Step 1: Select the number of classes

The rule for determining the approximate number of classes is:

Approximate number of classes = $[(2)(\text{Number of observations})]^{0.3333}$

The actual number of classes is the integer value that just exceeds the computed value.

If the formula gives us 4.66 we use 5
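
A minimal Python sketch of this rule, assuming the exponent .3333 is meant as a cube root (the code uses 1/3) and that "just exceeds" means rounding up. The function name is hypothetical.

```python
import math


def number_of_classes(n_obs: int) -> int:
    """Approximate number of classes: the cube root of (2 * n_obs), rounded up."""
    approx = (2 * n_obs) ** (1 / 3)
    # math.ceil leaves an exact whole number unchanged; the slides do not say
    # how to treat that case, so this is an assumption.
    return math.ceil(approx)


for n_obs in (20, 50, 100):
    print(n_obs, number_of_classes(n_obs))
```

For 50 observations the formula gives about 4.64, which is rounded up to 5 classes.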

1-29

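Step 1: Example

We have 43 data points. Taking the cube root of the number of observations gives:

Approximate number of classes = $(43)^{0.3333} = 3.503$

Round this up to the next integer value, which is 4.

The number of classes is 4.

Applying the rule with the factor of 2 gives $[(2)(43)]^{0.3333} = 4.414$, which rounds up to 5; this example continues with 4 classes.

**Always round up!!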

1-30

Step 2: Choose the width of the interval

The rule for determining interval width is:

Approximate interval width = $\dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$

The actual interval width is the integer value that just exceeds the computed value.

If the formula gives us 6.17 we use 7

**Always round up!!

1-31

Step 2: Example

Approximate interval width = (24-1)/4

= 5.75

Round up to 6.

Therefore the class width is 6.

1-32

Step 3: Select the class boundaries

Class boundaries must be chosen such that each data item belongs to one and only one class.

Start just below the lowest value in the dataset to get the lower boundary. The lower boundary for the second class is then found by adding the class width. The upper boundary for the first class is found by subtracting .01 from the lower boundary of the second class. Keep adding the class width and subtracting .01 to get the boundaries.

1-33

Step 3: Example

Lowest data point is 1. We will start

our classes at 0.

Class 1 = 0

Class 2 = 6 (=0+6)

Class 3 = 12 (=6+6)

Class 4 = 18 (=12+6)

1-34

Step 3: Example

Continued

Class boundaries are then:

Class 1: 0-5.99

Class 2: 6-11.99

Class 3: 12-17.99

Class 4: 18-24
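
Steps 2 and 3 can be sketched together in Python. The function name and arguments are hypothetical. Closing the last class at the next boundary itself (24 rather than 23.99) follows the class list above and is otherwise an assumption.

```python
import math


def class_boundaries(smallest, largest, n_classes, start):
    """Return (width, [(lower, upper), ...]) following Steps 2 and 3."""
    # Step 2: interval width, always rounded up to an integer.
    width = math.ceil((largest - smallest) / n_classes)
    classes = []
    for k in range(n_classes):
        lower = start + k * width
        next_lower = lower + width
        # The upper boundary is .01 below the next lower boundary, except for
        # the last class, which is closed at the next boundary itself.
        if k == n_classes - 1:
            upper = next_lower
        else:
            upper = round(next_lower - 0.01, 2)
        classes.append((lower, upper))
    return width, classes


width, classes = class_boundaries(smallest=1, largest=24, n_classes=4, start=0)
print(width)    # 6
print(classes)  # [(0, 5.99), (6, 11.99), (12, 17.99), (18, 24)]
```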

1-35

Step 4: Count the number of values in the dataset that fall into each class

We want to rely on Excel to do this for us.

Enter the class boundaries into Excel next to the data set.

Enter the upper boundaries of each of the classes.

Use the Frequency command.

1-36

Using the Frequency command in Excel

1. Highlight the cells where the frequencies should go (say E2:E6).

2. Type but do not enter the formula =Frequency(A2:A44,D2:D6)

A2:A44 contains the data and D2:D6 contains the upper class boundaries.

3. Press CTRL+SHIFT+ENTER and the array formula will be entered into each of the cells E2:E6.

1-37

Class Boundaries    Upper Limit    Frequency
0-5.99              5.99           18
6-11.99             11.99          18
12-17.99            17.99          4
18-24               24.00          3
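
The same counting can be sketched in Python. This is an illustration in the spirit of the Frequency command, not a replica of it. The responses below are hypothetical, since the states.xls data are not reproduced here, and the final extra count for values above the last upper limit is a design choice of this sketch.

```python
from bisect import bisect_left


def frequency(data, upper_limits):
    """Count the values falling in each class, given sorted upper class limits.

    A value is counted in the first class whose upper limit it does not exceed.
    Values above the last upper limit go into one extra overflow count.
    """
    counts = [0] * (len(upper_limits) + 1)
    for value in data:
        counts[bisect_left(upper_limits, value)] += 1
    return counts


# Hypothetical responses (not the states.xls data).
states_visited = [1, 3, 5, 6, 6, 9, 11, 12, 15, 18, 24]
upper_limits = [5.99, 11.99, 17.99, 24.00]

print(frequency(states_visited, upper_limits))  # [3, 4, 2, 2, 0]
```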

1-38

Creating relative

frequency and percent

frequency distributions

Recall that the relative frequency is the proportion of the observations belonging to a class. With n observations,

Relative frequency of a class = $\dfrac{\text{Frequency of the class}}{n}$

The percent frequency is the relative frequency multiplied by 100.
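
As a short Python sketch, the relative and percent frequencies for the class frequencies reported in the frequency table (18, 18, 4, and 3) can be computed as follows. The variable names are illustrative.

```python
# Class frequencies from the frequency table in the slides (n = 43).
frequencies = {"0-5.99": 18, "6-11.99": 18, "12-17.99": 4, "18-24": 3}
n = sum(frequencies.values())

# Relative frequency = frequency of the class / n; percent = relative * 100.
relative = {label: count / n for label, count in frequencies.items()}
percent = {label: 100 * rel for label, rel in relative.items()}

for label in frequencies:
    print(f"{label:>9}: relative = {relative[label]:.4f}, percent = {percent[label]:.2f}%")
```

The relative frequencies sum to 1, which is a quick check that every observation was counted in exactly one class.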

1-39

1-40

Using the Chart Wizard to Construct a Histogram

1. Go to the frequency table just constructed and highlight the frequencies

2. Click the Chart Wizard and choose

column in the chart type

3. Click on the Category (X) axis labels

box and enter the class boundaries

4. To get the bars to touch right click

on any rectangle in the column

chart and choose Format Data

Series. Select the Options tab and

1-41

[Figure: relative frequency histogram. Vertical axis: Relative Frequency, 0 to 0.45. Classes on the horizontal axis: 0-5.99, 6-11.99, 12-17.99, 18-24.]

1-42

Your mission is to pair up with a

classmate and draw what you think

the histogram for soda consumption

looks like.

1-43

Calculate Measures of

Central Tendency

Central tendency is the middle value of a dataset.

The measure of central tendency is typically thought of as the number that best describes the data.

Measures of central tendency are:

(1)Mean

(2)Median

1-44

Measure of Central

Tendency - Mean

The mean is the arithmetic average of the data. To calculate the mean, sum all the observations and divide by the number of observations.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}(x_1 + x_2 + \dots + x_n)$$

For the following small data set:

95 85 99 92 80

Mean = (95+85+99+92+80)/5 = 451/5 = 90.2

In Excel =average(highlight data)
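
The same calculation as a minimal Python sketch, done once by hand and once with the standard library. The variable names are illustrative.

```python
from statistics import mean

scores = [95, 85, 99, 92, 80]

# Sum all the observations and divide by the number of observations.
by_hand = sum(scores) / len(scores)

print(by_hand)       # 90.2
print(mean(scores))  # 90.2
```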

1-45

Measure of Central

Tendency - Median

Median: the middle observation when the data are arranged from smallest to largest, sometimes called the 50th percentile. Half the observations lie below the median and half the observations lie above the median.

The median is the middle observation for an odd number of ordered observations and the average of the middle two ordered observations for an even number of observations.

The median is an order statistic, so in order to calculate it the data must be ordered from smallest to largest.

1-46

Measure of Central

Tendency - Median

Median: the central observation for an odd number of observations and an average of the two middle data points for an even number of observations.

For the following small data set:

95 85 99 92 80

(ordered data: 80 85 92 95 99)

Median = 92 (the 3rd data point)

If we had 75 80 85 92 95 99

Median = (.5*85)+(.5*92) = 42.5+46 = 88.5, or equivalently (85+92)/2 = 88.5

In Excel =median(highlight data)
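
The odd and even cases can be sketched in Python as follows. The function name is hypothetical. The standard library's `statistics.median` gives the same results.

```python
from statistics import median


def median_by_hand(data):
    ordered = sorted(data)  # the median is an order statistic
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of observations: the middle observation.
        return ordered[middle]
    # Even number of observations: average of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median_by_hand([95, 85, 99, 92, 80]))      # 92
print(median_by_hand([75, 80, 85, 92, 95, 99]))  # 88.5
```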

1-47

Calculate Measures of

Dispersion

Dispersion is a measure of how the

data vary.

Measures of dispersion are:

(1)Variance

(2)Standard Deviation

(3)Percentiles

(4)Five Number Summary

1-48

Measure of Dispersion - Variance and Standard Deviation

Standard deviation: roughly the average deviation of the observations away from the mean. It is the square root of the variance.

The variance is calculated by subtracting the mean from each observation, squaring that value, adding up all n values, and then dividing that by the number of observations less one.

Sample variance formula is

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

Standard deviation is

$$s = \sqrt{s^2}$$

In Excel =var(highlight data)

=stdev(highlight data)

1-49

Measure of Dispersion - Variance and Standard Deviation

Sample variance:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$$

For the data set 95 85 99 92 80 (mean 90.2):

$$s^2 = \frac{(95-90.2)^2 + (85-90.2)^2 + (99-90.2)^2 + (92-90.2)^2 + (80-90.2)^2}{4} = \frac{234.8}{4} = 58.7$$

Sample standard deviation:

$$s = \sqrt{s^2} = \sqrt{58.7} = 7.6616$$
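
A minimal Python sketch of the same arithmetic, computed from the formula and checked against the standard library's sample variance and standard deviation (both divide by n - 1). The variable names are illustrative.

```python
import math
from statistics import stdev, variance

scores = [95, 85, 99, 92, 80]
n = len(scores)
x_bar = sum(scores) / n

# Sum of squared deviations from the mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in scores) / (n - 1)
s = math.sqrt(s2)

print(round(s2, 4), round(s, 4))                            # 58.7 7.6616
print(round(variance(scores), 4), round(stdev(scores), 4))  # 58.7 7.6616
```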

1-50

Measure of Dispersion - Percentile

A percentile is a number such that p% of the ordered observations lie below the percentile and (100-p)% of the observations lie above the percentile.

The median is the 50th percentile: 50% of the ordered data lies below that level and 50% of the ordered data lies above that level.

A percentile is an order statistic.

There are many different ways to calculate percentiles. The next slide shows one of the easiest ways to calculate percentiles.

1-51

Calculating the pth percentile

(1) Sort the data from low to high

(2) Count the number of observations, n

(3) Select the p(n+1)th observation, with p written as a decimal (for example, .10 for the 10th percentile)

(4) If the value p(n+1) is not a whole number then select the closest whole number

(5) If p(n+1) is less than 1 then select the smallest number

(6) If p(n+1) is greater than n then select the largest number.

In Excel =percentile(highlight data, p)

Note that the steps to calculate a percentile by hand and calculating percentiles in Excel will likely not result in the same value.

1-52

Calculate the 10th and the 70th percentile for the following small data set:

95 85 99 92 80

(ordered data: 80 85 92 95 99)

10th percentile: select the .1(n+1) = .1(6) = .6 number in the data set.

The closest whole number is 1, so the 10th percentile is the first observation, or 80.

70th percentile: select the .7(n+1) = .7(6) = 4.2 number in the data set.

The closest whole number is 4, so the 70th percentile is the fourth observation, or 95.
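
The by-hand rule can be sketched in Python. The function name is hypothetical. Two choices here are assumptions rather than something the slides specify: a position past the last observation returns the largest value, and a position exactly halfway between two whole numbers is rounded up.

```python
import math


def percentile_by_hand(data, p):
    """Percentile using the p(n+1) rule; p is a decimal such as 0.10."""
    ordered = sorted(data)
    n = len(ordered)
    position = p * (n + 1)
    if position < 1:
        return ordered[0]   # before the first observation: smallest value
    if position > n:
        return ordered[-1]  # past the last observation: largest value
    # Closest whole number; exact halves are rounded up (an assumption).
    rank = math.floor(position + 0.5)
    return ordered[rank - 1]  # ranks start at 1, list indices at 0


scores = [95, 85, 99, 92, 80]
print(percentile_by_hand(scores, 0.10))  # 80
print(percentile_by_hand(scores, 0.70))  # 95
```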

1-53

Measure of Dispersion - Five Number Summary

The Five Number Summary is

(1) Minimum

(2) Q1 or 25th Percentile

(3) Q2 or Median (50th Percentile)

(4) Q3 or 75th Percentile

(5) Maximum

1-54

Five Number Summary in

Excel

Minimum =Min(data)

Q1 or 25th Percentile =percentile(data,.25) or =quartile(data,1)

Q2 or Median =median(data)

Q3 or 75th Percentile =percentile(data,.75) or =quartile(data,3)

Maximum =Max(data)
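
A hedged Python sketch of the five number summary, using `statistics.quantiles` (Python 3.8 or later). The "inclusive" method interpolates between the sorted observations. Whether it reproduces Excel's percentile and quartile commands exactly should be treated as an assumption to verify, and its results can differ from the by-hand p(n+1) rule.

```python
from statistics import quantiles


def five_number_summary(data):
    # method="inclusive" treats the smallest and largest observations as the
    # 0th and 100th percentiles and interpolates in between.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


print(five_number_summary([95, 85, 99, 92, 80]))  # (80, 85.0, 92.0, 95.0, 99)
```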

1-55

Shapes of Histograms

Symmetric

Skewed to the right or Positively

skewed

Skewed to the left or Negatively

Skewed

Bimodal

1-56

Symmetric

Histogram

1-57

Positively Skewed

Distribution

1-58

Negatively Skewed

Distribution

1-59

Bimodal

Distribution

1-60

Positively Skewed

Distribution

Median = 2.77

Mean = 4.16

1-61

Why is the shape of the histogram important?

The shape of the empirical distribution dictates which summary statistics should be used

Symmetric: use mean and standard deviation

Skewed: use median and five number summary

1-62

How to determine if

your data is skewed or

symmetric

Pearson's coefficient of skewness:

sk = 3*(mean-median)/(standard dev.)

Rule of Thumb:

If sk < -.5 or sk > .5 then the distribution is skewed. Otherwise the distribution is symmetric.

sk < -.5: negatively skewed

-.5 <= sk <= .5: symmetric

sk > .5: positively skewed
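
The coefficient and the rule of thumb can be sketched in Python. The function names are hypothetical, and the five-score data set from the earlier examples is reused purely for illustration.

```python
from statistics import mean, median, stdev


def pearson_skewness(data):
    """sk = 3 * (mean - median) / (sample standard deviation)."""
    return 3 * (mean(data) - median(data)) / stdev(data)


def shape(sk):
    """Apply the rule of thumb with cutoffs at -.5 and .5."""
    if sk < -0.5:
        return "negatively skewed"
    if sk > 0.5:
        return "positively skewed"
    return "symmetric"


scores = [95, 85, 99, 92, 80]
sk = pearson_skewness(scores)
print(round(sk, 3), shape(sk))  # -0.705 negatively skewed
```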

1-63

Symmetric Histogram

Mean = .5013

Standard Deviation =.019

1-64

Positively Skewed

Distribution

Median = 2.779

Minimum: 0.008

Q1: 1.1578

Median: 2.779

Q3: 5.643

Maximum: 29.001

1-65

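Detecting outliers with symmetric data

Use the Empirical Rule

68% of data should be within one standard deviation of the mean: $\bar{x} \pm s$

95% of the data should be within two standard deviations of the mean: $\bar{x} \pm 2s$

Almost all (about 99.7%) of the data should be within three standard deviations of the mean: $\bar{x} \pm 3s$

An outlier is an observation that lies more than three standard deviations from the mean, or beyond the interval $(\bar{x} - 3s, \bar{x} + 3s)$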
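
A minimal Python sketch of the three-standard-deviation check. The function name and the data are hypothetical: the data are chosen so that one value is unusually large.

```python
from statistics import mean, stdev


def outliers_3s(data):
    """Return the observations outside (x_bar - 3s, x_bar + 3s)."""
    x_bar, s = mean(data), stdev(data)
    lower, upper = x_bar - 3 * s, x_bar + 3 * s
    return [x for x in data if x < lower or x > upper]


# Hypothetical, roughly symmetric data with one unusually large value.
data = [48, 49, 50, 51, 52] * 4 + [120]
print(outliers_3s(data))  # [120]
```

Note that the outlier itself inflates the mean and the standard deviation, so in very small samples no observation can lie three standard deviations from the mean.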

1-66

Detecting outliers with skewed data

Calculate the interquartile range or IQR = Q3 - Q1.

If a value is greater than Q3 plus 1.5*IQR or less than Q1 minus 1.5*IQR then it is a moderate outlier

If a value is greater than Q3 plus 3*IQR or less than Q1 minus 3*IQR then it is an extreme outlier
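
The IQR rule can be sketched in Python using Q1 and Q3 from the five number summary of the positively skewed example (Q1 = 1.1578, Q3 = 5.643). The function name is hypothetical.

```python
def classify_outlier(value, q1, q3):
    """Classify a value with the 1.5*IQR and 3*IQR fences."""
    iqr = q3 - q1
    if value > q3 + 3 * iqr or value < q1 - 3 * iqr:
        return "extreme outlier"
    if value > q3 + 1.5 * iqr or value < q1 - 1.5 * iqr:
        return "moderate outlier"
    return "not an outlier"


# Q1 and Q3 from the five number summary of the positively skewed example.
q1, q3 = 1.1578, 5.643
print(classify_outlier(29.001, q1, q3))  # extreme outlier
print(classify_outlier(0.008, q1, q3))   # not an outlier
```

With these quartiles the maximum of 29.001 lies above Q3 plus 3*IQR (about 19.1), while the minimum of 0.008 lies inside both lower fences.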

1-67

Construct Scatter Diagrams for the Relationship between Two Random Variables

A scatter diagram (or scatter plot)

is used to show the relationship

between two variables

It contains one variable on the x-axis

and the other variable on the y-axis

A scatter diagram shows how the two

variables are related to each other,

both the strength and direction of the

relationship

1-68

Scatter Diagram

Examples

[Figure: scatter diagrams of y against x illustrating a positive linear relationship, a negative linear relationship, and curvilinear relationships]

1-69

Scatter Diagram

Examples

[Figure: scatter diagrams of y against x illustrating strong relationships and weak relationships]

1-70

Scatter Diagrams

Examples

[Figure: scatter diagrams of y against x illustrating no relationship]

1-71

1-72

Constructing a Scatter Diagram in Excel

Highlight the data, making sure that the variable you want on the y-axis is on the right

Select Insert and then Scatter and click on the first option

Make sure to change the chart title and add axis titles.

Possibly delete the legend and change the start values for the axes.

1-73

[Figure: scatter diagram. Vertical axis: Salary (dollars), 0 to 160,000. Horizontal axis: Experience (years), 10 to 22.]

1-74

What does the Scatter Diagram on the previous slide tell us?

The relationship between education and salary is positive (in general, as education increases, salary increases)

The relationship is fairly strong because the data points are closely gathered to each other

This scatter diagram indicates that while the variable education is helpful for predicting salaries, it will not yield perfect predictions.

1-75

Calculate the Covariance and the Correlation Coefficient for the Linear Relationship between Two Variables

The covariance and the correlation coefficient supply a numeric value for the strength and direction of the linear relationship between two variables

Only concerned with the strength of the relationship

No causal effect is implied

1-76

Covariance

Covariance is a measure of the linear

relationship between two random variables

A positive covariance indicates a positive

linear relationship between x and y (if x is

below its mean then y tends to be below its

mean and if x is above its mean then y

tends to be above its mean)

A negative covariance indicates a negative

linear relationship between x and y (if x is

below its mean then y tends to be above its

mean and if x is above its mean then y

tends to be below its mean)

1-77

Covariance

A covariance near 0 indicates no linear

relationship between x and y

A problem with covariance is that it depends on the units of measurement for x and y: if we change from measuring in feet to inches, the covariance will go up even though the overall relationship hasn't changed.

1-78

Covariance - a Measure of Linear Association Between Two Variables

Remember the formula for variance is

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n-1}$$

The formula for covariance is

$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

and it measures how x varies with y in a linear fashion.
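
A minimal Python sketch of the covariance formula. The function name and the data are hypothetical. The second call rescales x by 12 (feet to inches) to show how the covariance depends on the units of measurement.

```python
def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, with at least 2 observations")
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data: y rises with x, so the covariance is positive.
x = [1, 2, 3]
y = [2, 4, 6]
print(sample_covariance(x, y))                      # 2.0
# Measuring x in inches instead of feet multiplies the covariance by 12.
print(sample_covariance([12 * xi for xi in x], y))  # 24.0
```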

1-79

82,555.5556

1-80

Calculating Covariance in

Excel

In Excel, the covar command divides by n rather than n-1 (it returns the population covariance), so its result does not match the sample covariance formula.

The Excel command is =Covar(highlight x values, highlight y values)

You should perform this command in Excel for the data set above and see if it matches the value 82,555.5556.

If you obtain 74,300 using the covar command (which is likely), you must multiply the value you obtain in Excel by n/(n-1) to obtain the correct value for covariance.
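
The n/(n-1) correction as a small Python sketch. The function name is hypothetical, and n = 10 is an assumption: the slides do not state the sample size, but it is the value for which 74,300 times n/(n-1) reproduces 82,555.5556.

```python
def sample_cov_from_population_cov(pop_cov, n):
    """Rescale a covariance computed with divisor n to one with divisor n - 1."""
    return pop_cov * n / (n - 1)


# n = 10 is an assumption (see the note above); 74,300 is the covar result.
print(round(sample_cov_from_population_cov(74_300, 10), 4))  # 82555.5556
```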

1-81

Correlation Coefficient

The sample correlation coefficient, rxy, is an estimate of the population correlation coefficient and is used to measure the strength and direction of the linear relationship between two random variables.

The correlation is a unit-free measure (unlike the covariance) and falls between -1 and 1.

1-82

What Does the Correlation Coefficient Mean?

If all the points in a data set fall on a positively sloped line, rxy =1.

If all the points in a data set fall on a negatively

sloped line, rxy =-1.

If there is no linear relationship between x and y

then rxy =0.

The closer to -1, the stronger the negative

linear relationship

The closer to 1, the stronger the positive linear

relationship

The closer to 0, the weaker the linear

relationship

1-83

Examples of Approximate

rxy Values

[Figure: scatter diagrams of y against x illustrating r = -1, r = -.6, r = 0, r = +.3, and r = +1]

1-84

Calculating the Correlation Coefficient

Sample correlation coefficient:

$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{s_{xy}}{s_x s_y}$$

The standard deviation of x is 2.708 and the standard deviation of y is 38,189.037.

$$r_{xy} = \frac{82{,}555.5556}{(2.708)(38{,}189.0037)} = 0.7983$$

A correlation of 0.7983 means that education and salary are positively related and the relationship is strong (because this value lies near 1)
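
A minimal Python sketch of the correlation calculation. The first part reproduces the arithmetic from the summary values quoted in the slides. The `correlation` function and the three-point data set are hypothetical and only illustrate the formula on raw data.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


# Reproducing the arithmetic from the summary values in the slides.
r_xy = 82_555.5556 / (2.708 * 38_189.0037)
print(round(r_xy, 4))  # 0.7983

# Hypothetical points lying exactly on a positively sloped line.
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```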

1-85

What Does the Correlation Coefficient Mean?

The correlation coefficient measures only the linear association between two variables. A correlation coefficient near 0 only means that there is a weak linear association between the two variables, not that there isn't any relationship between the two variables.

A high correlation between two variables does

not mean that changes in one variable will

cause changes in the other variable.

We might find that the quality rating and the

typical mean price of restaurants are positively

correlated. However, simply increasing the

mean price at a restaurant will not cause the

1-86
