Nonparametric Analysis
[Decision tree for choosing a test, by sample design and level of measurement:
  Related/matched samples vs. independent samples
  Categorical/nominal outcomes: binomial test, McNemar test, chi-square test
  Ordinal/interval outcomes: non-parametric correlation]
A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted to ranks and then correlated.

Syntax: spearman [varlist] [if] [, options]
spearman read write

  Number of obs  =     200
  Spearman's rho =  0.6167
  Test of Ho: read and write are independent
  Prob > |t|     =  0.0000

The results suggest that the relationship between read and write (rho = 0.6167, p = 0.000) is statistically significant.
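For readers working outside Stata, the same rank-based correlation can be sketched in Python with scipy (a toy data set, not the hsb2 read/write variables):

```python
from scipy.stats import spearmanr

# Toy data: ranks agree except the last two pairs are swapped.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]

rho, p = spearmanr(x, y)  # values are converted to ranks, then correlated
```

With only one swapped pair, rho comes out to 0.9.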
P-values meaning
A p-value is a measure of how much evidence we have against the null hypothesis (H0). The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

One often "rejects the null hypothesis" when the p-value is less than the significance level. When the null hypothesis is rejected, the result is said to be statistically significant.
bitest female=.5
The results indicate that there is no statistically significant difference (p = .2292). In other words, the proportion of females does not significantly differ from the hypothesized value of 50%.
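As a cross-check outside Stata, the same binomial test can be sketched with scipy, assuming the counts behind the output above (109 females out of 200 observations, consistent with the reported p-value):

```python
from scipy.stats import binomtest

# Assumed counts: 109 females in a sample of 200; H0: proportion = 0.5.
result = binomtest(k=109, n=200, p=0.5)
p_value = result.pvalue  # close to Stata's .2292
```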
One-sample test of proportion: tests that varname has a proportion of #p.
  prtest varname == #p

Two-sample test of proportion: tests that varname1 and varname2 have the same proportion.
  prtest varname1 == varname2
Assume that we have a sample of 74 automobiles. We wish to test whether the proportion of automobiles that are foreign is different from 40%.
  prtest foreign == .4
The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is 0.40 at the 5% significance level.
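A hand-rolled sketch of the same one-sample proportion z-test in Python, assuming the auto data's 22 foreign cars out of 74:

```python
from math import sqrt
from scipy.stats import norm

n, k, p0 = 74, 22, 0.40                    # assumed counts from the auto data
phat = k / n
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)  # score z-statistic, as prtest computes
p_value = 2 * norm.sf(abs(z))              # two-sided p-value
```

Here z is about -1.80 and p about 0.071, so the 40% hypothesis is not rejected at the 5% level, matching the conclusion above.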
We have two headache remedies that we give to patients. Each remedy's effect is recorded as 0 for failing to relieve the headache and 1 for relieving it. We wish to test the equality of the proportion of people relieved by the two treatments.
  prtest cure1 == cure2
We find that the proportions are statistically different from each other at any level greater than 3.9%.
ksmirnov performs one-sample Kolmogorov-Smirnov tests of the equality of distributions. In the first syntax, varname is the variable whose distribution is being tested, and exp must evaluate to the corresponding (theoretical) cumulative distribution.

Syntax: ksmirnov varname = exp
Example: One-sample test. Let's now test whether x in the example above is distributed normally. Kolmogorov-Smirnov is not a particularly powerful test of normality, and we do not endorse such use of it. In any case, we will test against a normal distribution with the same mean and standard deviation:
  ksmirnov x = normal((x-r(mean))/r(sd))
With the sample values substituted in:
  ksmirnov x = normal((x-4.571429)/3.457222)
The results indicate that the data cannot be distinguished from normally distributed data.
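The same kind of one-sample test can be sketched in Python with scipy's kstest, here on a small hypothetical sample tested against a normal distribution with the sample's own mean and standard deviation:

```python
from statistics import mean, stdev
from scipy.stats import kstest

x = [2, 3, 4, 4, 5, 5, 6, 7]   # hypothetical sample, not the data above
d, p = kstest(x, 'norm', args=(mean(x), stdev(x)))
```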
The first line tests the hypothesis that x for group 1 contains smaller values than for group 2. The largest difference between the distribution functions is 0.5. The approximate p-value for this is 0.424, which is not significant. The second line tests the hypothesis that x for group 1 contains larger values than for group 2. The largest difference between the distribution functions in this direction is 0.1667. The approximate p-value for this small difference is 0.909. Finally, the approximate p-value for the combined test is 0.785, corrected to 0.735.
McNemar test
You would perform McNemar's test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group.

Example:
Consider two questions, Q1 and Q2, from a test taken by 200 students. Suppose 172 students answered both questions correctly, 15 students answered both questions incorrectly, 7 answered Q1 correctly and Q2 incorrectly, and 6 answered Q2 correctly and Q1 incorrectly. These counts can be considered in a two-way contingency table. The null hypothesis is that the two questions are answered correctly or incorrectly at the same rate (or that the contingency table is symmetric). We can enter these counts into Stata using mcci, a command from Stata's epidemiology tables. The outcome is labeled according to case-control study conventions.
McNemar test
McNemar's chi-square statistic suggests that there is not a statistically significant difference in the proportions of correct/incorrect answers to these two questions.
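McNemar's statistic depends only on the discordant pairs, so it is easy to compute by hand; a Python sketch using the counts from the example (7 and 6):

```python
from scipy.stats import chi2

b, c = 7, 6                    # discordant pairs from the example above
stat = (b - c) ** 2 / (b + c)  # McNemar chi-square, 1 degree of freedom
p_value = chi2.sf(stat, df=1)
```

Here stat is about 0.077 and p about 0.78, consistent with the non-significant result above.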
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed.
The results suggest that there is not a statistically significant difference between read and write.
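In Python the same signed-rank test can be sketched with scipy (toy paired scores, not the hsb2 read/write data):

```python
from scipy.stats import wilcoxon

read = [57, 68, 44, 63, 47, 44, 50, 34, 63, 57]   # hypothetical pairs
write = [52, 59, 33, 44, 52, 52, 59, 46, 65, 55]

# Tests whether the paired differences are symmetric about zero.
stat, p = wilcoxon(read, write)
```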
If you believe the differences between read and write are not ordinal but can merely be classified as positive and negative, then you may want to consider a sign test in lieu of the signed-rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal.
This output gives both of the one-sided tests as well as the two-sided test. Assuming that we were looking for any difference, we would use the two-sided test and conclude that no statistically significant difference was found (p = .5565).
The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In the example below, we have cells with observed frequencies of two and one, which may indicate expected frequencies below five, so we will use Fisher's exact test with the exact option on the tabulate command.
These results suggest that there is not a statistically significant relationship between race and type of school (p = 0.597). Note that the Fisher's exact test does not have a "test statistic", but computes the p-value directly.
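A scipy sketch of Fisher's exact test on a small 2x2 table (Fisher's classic tea-tasting counts, used here as a toy example rather than the race/schtyp table):

```python
from scipy.stats import fisher_exact

table = [[3, 1],
         [1, 3]]          # toy 2x2 table with small cell counts
odds_ratio, p = fisher_exact(table)
```

The p-value is computed directly from the hypergeometric distribution; for this table it is 34/70, about 0.486.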
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, we wish to test whether the mean for write is the same for males and females.
The results indicate that there is a statistically significant difference between the mean writing score for males and females (t = -3.7341, p = .0002). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12).
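An equivalent two-sample t-test can be sketched in Python (toy male/female scores, not the hsb2 data):

```python
from scipy.stats import ttest_ind

males = [50, 48, 52, 47, 53]     # hypothetical writing scores
females = [55, 54, 56, 53, 57]

t, p = ttest_ind(males, females)  # pooled-variance t-test by default
```

Here t is about -3.73 with p < 0.01, so the group means differ significantly.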
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal). You will notice that the Stata syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We will use the same variables in this example as we did in the independent t-test example above and will not assume that write, our dependent variable, is normally distributed. The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.0009). You can determine which group has the higher rank by looking at how the actual rank sums compare to the expected rank sums under the null hypothesis. The sum of the female ranks was higher while the sum of the male ranks was lower. Thus the female group had the higher rank.
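The Wilcoxon-Mann-Whitney test can likewise be sketched with scipy (toy groups in which every value of y exceeds every value of x):

```python
from scipy.stats import mannwhitneyu

x = [1, 2, 3]
y = [4, 5, 6]

u, p = mannwhitneyu(x, y, alternative='two-sided')  # exact p for small samples
```

U = 0 here (no x value exceeds any y value), and the exact two-sided p-value is 2/20 = 0.1.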
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical variables. In Stata, the chi2 option is used with the tabulate command to obtain the test statistic and its associated p-value. Example: let's see if there is a relationship between the type of school attended (schtyp) and students' gender (female). Remember that the chi-square test assumes the expected value of each cell is five or higher. These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.0470, p = 0.828).
Chi-square test
Let's look at another example, this time looking at the relationship between gender (female) and socio-economic status (ses). The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels. In this example, female has two levels (male and female) and ses has three levels (low, medium and high).
Again we find that there is no statistically significant relationship between the variables (chi-square with two degrees of freedom = 4.5765, p = 0.101).
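The Pearson chi-square on a contingency table can be sketched in Python (a toy 2x2 table; correction=False reproduces the uncorrected statistic that tabulate, chi2 reports):

```python
from scipy.stats import chi2_contingency

table = [[20, 30],
         [30, 20]]        # toy 2x2 table
stat, p, dof, expected = chi2_contingency(table, correction=False)
```

For this table chi-square = 4.0 with 1 degree of freedom and p about 0.046.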
Kruskal-Wallis
The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable.
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-square. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
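A scipy sketch of the Kruskal-Wallis test on toy scores for three groups (not the hsb2 program types):

```python
from scipy.stats import kruskal

g1, g2, g3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]  # toy groups with no ties
h, p = kruskal(g1, g2, g3)                    # rank-based one-way test
```

With these fully separated groups H = 7.2 and p < 0.05.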
Linear Regression
We use regression to estimate the unknown effect of changing one variable over another. Technically, linear regression estimates how much Y changes when X changes one unit. Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e., which are your outcome and predictor variables).

Previous steps:
1. Examine descriptive statistics
2. Look at the relationship graphically and test correlation(s)
3. Run and interpret the regression
4. Test regression assumptions
+ An example: SAT expenditures

Are SAT scores higher in states that spend more money on education, controlling for other factors?*

Outcome (Y) variable: SAT scores, variable csat in dataset
Predictor (X) variables:
  Per pupil expenditures primary & secondary (expense)
  % HS graduates taking SAT (percent)
  Median household income (income)
  % adults with HS diploma (high)
  % adults with college degree (college)
  Region (region)
* Source: Data and examples come from the book Statistics with Stata (updated for version 9) by Lawrence C. Hamilton (chapter 6). Use the file states.dta (educational data for the U.S.).
[Scatterplot: csat against expense]
How a state's mean SAT changes if its expenditure increases one unit: for each one-point increase in expense, SAT scores decrease by 0.022.
The t-values test the hypothesis that the coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (for 95% confidence). You can get the t-values by dividing the coefficient by its standard error. The t-values also show the importance of a variable in the model.
Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense is statistically significant in explaining SAT.
R-square shows the amount of variance of Y explained by X. In this case expense explains 22% of the variance in SAT scores.
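The mechanics can be sketched outside Stata with a small least-squares fit in numpy (toy data, not the SAT variables): estimate intercept and slope, then compute R-square as the share of variance explained:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # toy predictor
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])       # toy outcome, roughly 2*x

X = np.column_stack([np.ones_like(x), x])     # add intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [intercept, slope]

resid = y - X @ coef
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```

Here the slope is about 1.97 with R-square above 0.99.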
Region is entered here as a set of dummy variables. First, generate the dummies: tab region, g(reg)
. tab region, g(reg)

  Geographical region |  Freq.
  --------------------+-------
  West                |     13
  N. East             |      9
  South               |     16
  Midwest             |     12
  --------------------+-------
  Total               |     50
        csat |     Coef.   Robust Std. Err.   P>|t|    [95% Conf. Interval]
  -----------+-------------------------------------------------------------
     expense |  -.002021       .0035883       0.576    -.0092676   .0052256
     percent | -3.007647       .2358047       0.000    -3.483864   -2.53143
      income | -.1674421       1.196409       0.889    -2.583638   2.248754
        high |  1.814731        1.02694       0.085    -.2592168   3.888679
     college |  4.670564       1.599798       0.006     1.439705   7.901422
        reg2 |  69.45333       17.99933       0.000     33.10295   105.8037
        reg3 |  25.39701       12.52558       0.049      .101086   50.69293
        reg4 |  34.57704        9.44989       0.001      15.4926   53.66149
       _cons |  808.0206       67.86418       0.000     670.9661   945.0751
xi: regress csat expense percent income high college i.region, robust
        csat |     Coef.   Robust Std. Err.   P>|t|    [95% Conf. Interval]
  -----------+-------------------------------------------------------------
     expense |  -.002021       .0035883       0.576    -.0092676   .0052256
     percent | -3.007647       .2358047       0.000    -3.483864   -2.53143
      income | -.1674421       1.196409       0.889    -2.583638   2.248754
        high |  1.814731        1.02694       0.085    -.2592168   3.888679
     college |  4.670564       1.599798       0.006     1.439705   7.901422
  _Iregion_2 |  69.45333       17.99933       0.000     33.10295   105.8037
  _Iregion_3 |  25.39701       12.52558       0.049      .101086   50.69293
  _Iregion_4 |  34.57704        9.44989       0.001      15.4926   53.66149
       _cons |  808.0206       67.86418       0.000     670.9661   945.0751
NOTE: By default xi excludes the first value. To select a different value, before running the regression type:
  char region[omit] 4
  xi: regress csat expense percent income high college i.region, robust
This will select Midwest (4) as the reference category for the dummy variables.
Below is a correlation matrix for all variables in the model. The numbers are Pearson correlation coefficients, which range from -1 to 1; values closer to -1 or 1 indicate a stronger correlation. A negative value indicates an inverse relationship (roughly, when one goes up the other goes down).
pwcorr csat expense percent income high college, star(0.05) sig
                csat   expense   percent    income      high   college
     csat     1.0000
  expense    -0.4663*   1.0000
              0.0006
  percent    -0.8758*             1.0000
              0.0000
   income    -0.4713*                       1.0000
              0.0005
     high     0.0858                                  1.0000
              0.5495
  college    -0.3729*                                 0.5319*   1.0000
              0.0070                                  0.0001
The graph matrix command produces a graphical representation of the correlation matrix by presenting a series of scatterplots for all variables:

  graph matrix csat expense percent income high college, half maxis(ylabel(none) xlabel(none))
Stata offers several user-friendly options for storing and viewing regression output from multiple models:
Regression: eststo/esttab
We run each regression and store its estimates:

  regress csat expense, robust
  eststo model1
  regress csat expense percent income high college, robust
  eststo model2
Regression: eststo/esttab
Now Stata will hold your output in memory until you ask to recall it:

  esttab model1 model2 model3

  ----------------------------------------------------
                     (1)          (2)          (3)
                    csat         csat         csat
  ----------------------------------------------------
  expense        -0.0223***    0.00335     -0.00202
                 (-6.07)       (0.70)      (-0.56)
  percent                     -2.618***    -3.008***
                             (-11.44)     (-12.75)
  income                       0.106       -0.167
                               (0.09)      (-0.14)
  high                         1.631        1.815
                               (1.73)       (1.77)
  college                      2.031        4.671**
                               (0.96)       (2.92)
  _Iregion_2                               69.45***
                                            (3.86)
  _Iregion_3                               25.40*
                                            (2.03)
  _Iregion_4                               34.58***
                                            (3.66)
  _cons          1060.7***    851.6***     808.0***
                 (43.55)      (14.86)      (11.91)
  ----------------------------------------------------
  N                  51           51           50
  ----------------------------------------------------
Regression: eststo/esttab
Some options (type help eststo and help esttab for more options):

  esttab model1 model2 model3, r2 ar2 se label

  --------------------------------------------------------------------
                             (1)           (2)           (3)
                        Mean compo~e  Mean compo~e  Mean compo~e
  --------------------------------------------------------------------
  Per pupil expendit~c   -0.0223***     0.00335      -0.00202
                         (0.00367)     (0.00478)     (0.00359)
  % HS graduates tak~T                 -2.618***     -3.008***
                                       (0.229)       (0.236)
  Median household~000                  0.106        -0.167
                                       (1.207)       (1.196)
  % adults HS diploma                   1.631         1.815
                                       (0.943)       (1.027)
  % adults college degree               2.031         4.671**
                                       (2.114)       (1.600)
  region==2                                          69.45***
                                                     (18.00)
  region==3                                          25.40*
                                                     (12.53)
  region==4                                          34.58***
                                                     (9.450)
  Constant               1060.7***     851.6***      808.0***
                         (24.35)       (57.29)       (67.86)
  --------------------------------------------------------------------
  N                         51            51            50
  R-sq                      0.217         0.824         0.911
  adj. R-sq                 0.201         0.805         0.894
  --------------------------------------------------------------------
Regression: outreg2
  regress csat expense, robust
  outreg2 using prediction.doc
  regress csat expense percent income high college, robust
  outreg2 using prediction.doc, append
  xi: regress csat expense percent income high college i.region, robust
  outreg2 using prediction.doc, append
Regression: outreg2
How good the model is will depend on how well it predicts Y, the linearity of the model, and the behavior of the residuals. Use predict immediately after running the regression:
  xi: regress csat expense percent percent2 income high college i.region, robust
  predict csat_predict
  label variable csat_predict "csat predicted"
[Scatterplot: observed csat (y-axis) against predicted csat (x-axis)]
We should expect a 45-degree pattern in the data. The y-axis is the observed data and the x-axis the predicted data (Yhat). In this case the model seems to be doing a good job of predicting csat.
  The dependent variable is normally distributed
  The errors of the regression equation are normally distributed
  The variance around the regression line is the same for all values of the predictor variable (X)
Assumption 2: Homoscedasticity
The size of one error is not a function of the size of any previous error.
AKA the relationship can be summarized with a straight line. Keep in mind that you can use alternative forms of regression to test non-linear relationships.
The Shapiro-Wilk test of normality tests the null hypothesis that the data are normally distributed.
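A scipy sketch of the Shapiro-Wilk test on a small hypothetical sample:

```python
from scipy.stats import shapiro

x = [4.2, 5.1, 3.8, 4.9, 5.0, 4.4, 4.7, 5.3, 4.1, 4.6]  # hypothetical sample
w, p = shapiro(x)   # H0: the data come from a normal distribution
```

A large p-value means no evidence against normality.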
Note: the rvfplot command needs to be entered after the regression equation is run. Stata uses estimates from the regression to create this plot.
A non-graphical way to detect heteroskedasticity is the Breusch-Pagan test. The null hypothesis is that the residuals are homoskedastic. In the example below we fail to reject the null at 95% and conclude that the residuals are homogeneous. However, at 90% we reject the null and conclude that the residuals are not homogeneous.
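The Breusch-Pagan idea can be sketched by hand: regress the squared OLS residuals on the predictors and compare LM = n * R-square of that auxiliary regression against a chi-square distribution. A numpy sketch on simulated homoskedastic data:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)    # homoskedastic errors by construction
X = np.column_stack([np.ones(n), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta) ** 2              # squared OLS residuals

gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)   # auxiliary regression
fitted = X @ gamma
r2 = 1 - ((e2 - fitted) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()

lm = n * r2                           # Breusch-Pagan LM statistic
p_value = chi2.sf(lm, df=1)           # df = number of slope regressors
```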
Logit/Probit Regression
Logit model
Use logit models whenever your dependent variable is binary (also called dummy), taking values 0 or 1. Logit regression is a nonlinear regression model that forces the output (predicted probabilities) to lie between 0 and 1. Logit models estimate the probability of your dependent variable being 1 (Y=1); this is the probability that some event happens. Logit and probit models are basically the same; the difference is in the distribution:
  Logit: cumulative standard logistic distribution (F)
  Probit: cumulative standard normal distribution (Φ)
Both models provide similar results.
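The two link functions can be compared directly; a Python sketch of how each maps the linear index xb into a probability:

```python
from math import exp
from scipy.stats import norm

def logistic_cdf(xb):
    # logit link: cumulative standard logistic distribution
    return 1.0 / (1.0 + exp(-xb))

xb = 0.0
p_logit = logistic_cdf(xb)   # 0.5 at xb = 0
p_probit = norm.cdf(xb)      # probit link: standard normal CDF, also 0.5
```

Both links are symmetric around 0.5, which is why the two models usually give similar substantive results.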
Logit model
Logit: adjust
Ordinal logit
When a dependent variable has more than two categories and the values of each category have a meaningful sequential order where a value is indeed higher than the previous one, then you can use ordinal logit. Here is an example of the type of variable:
Panel data allows you to control for variables you cannot observe or measure, like cultural factors or differences in business practices across companies, or variables that change over time but not across entities (i.e., national policies, federal regulations, international agreements, etc.). That is, it accounts for individual heterogeneity. With panel data you can include variables at different levels of analysis (i.e., students, schools, districts, states), suitable for multilevel or hierarchical modeling. Some drawbacks are data collection issues (i.e., sampling design, coverage), non-response in the case of micro panels, or cross-country dependency in the case of macro panels (i.e., correlation between countries).
FIXED-EFFECTS MODEL (Covariance Model, Within Estimator, Individual Dummy Variable Model, Least Squares Dummy Variable Model)
Fixed effects
!
Use fixed-effects (FE) whenever you are only interested in analyzing the impact of variables that vary over time. FE explores the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables (for example, being male or female could influence the opinion toward a certain issue; the political system of a particular country could have some effect on trade or GDP; or the business practices of a company may influence its stock price).
Fixed effects
!
When using FE we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of the correlation between the entity's error term and the predictor variables. FE removes the effect of those time-invariant characteristics from the predictor variables so we can assess the predictors' net effect. Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different, therefore the entity's error term and the constant (which captures individual characteristics) should not be correlated with the others. If the error terms are correlated then FE is not suitable, since inferences may not be correct and you need to model that relationship (probably using random effects); this is the main rationale for the Hausman test (presented later in this document).
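The within (demeaning) transformation behind FE can be sketched in numpy: subtracting each entity's means wipes out anything constant within the entity, so the slope is recovered from within-entity variation only (toy panel with two entities whose intercepts differ by 10 but who share the slope 2.0):

```python
import numpy as np

entity = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0, 12.0, 14.0, 16.0])  # entity 1 has a +10 intercept

def demean(v, ids):
    # subtract each entity's own mean (the "within" transformation)
    out = v.astype(float).copy()
    for g in np.unique(ids):
        out[ids == g] -= v[ids == g].mean()
    return out

xd, yd = demean(x, entity), demean(y, entity)
beta_fe = (xd @ yd) / (xd @ xd)   # OLS slope on the demeaned data
```

beta_fe recovers 2.0 exactly; the entity-specific intercepts never enter.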
Fixed effects
OLS regression
+ Fixed effects: comparing xtreg (with fe), regress (OLS with dummies) and areg
A note on fixed-effects
"The fixed-effects model controls for all time-invariant differences between the individuals, so the estimated coefficients of the fixed-effects models cannot be biased because of omitted time-invariant characteristics [like culture, religion, gender, race, etc.]. One side effect of the features of fixed-effects models is that they cannot be used to investigate time-invariant causes of the dependent variables. Technically, time-invariant characteristics of the individuals are perfectly collinear with the person [or entity] dummies. Substantively, fixed-effects models are designed to study the causes of changes within a person [or entity]. A time-invariant characteristic cannot cause such a change, because it is constant for each person." (Underline is mine) Kohler, Ulrich, and Frauke Kreuter, Data Analysis Using Stata, 2nd ed., p. 245.
Random effects
Random effects assume that the entity's error term is not correlated with the predictors, which allows time-invariant variables to play a role as explanatory variables. In random-effects models you need to specify those individual characteristics that may or may not influence the predictor variables. The problem with this is that some variables may not be available, therefore leading to omitted-variable bias in the model. RE allows you to generalize the inferences beyond the sample used in the model.
FIXED OR RANDOM?
+ Testing
Source: Hoechle, Daniel, "Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence," http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf
NOTE: Use the robust option to control for heteroskedasticity (in both fixed and random effects).