Nonparametric Analysis
[Decision tree for choosing a test, by sample design and level of measurement:
  Related/matched samples vs. independent samples
  Categorical/nominal outcomes: binomial test, McNemar test, chi-square test
  Ordinal/interval outcomes: non-parametric correlation]
A Spearman correlation is used when one or both of the variables are not assumed to be normally distributed and interval (but are assumed to be ordinal). The values of the variables are converted to ranks and then correlated.

Syntax: spearman [varlist] [if] [, options]
spearman read write

  Number of obs  =     200
  Spearman's rho =  0.6167
  Test of Ho: read and write are independent
  Prob > |t|     =  0.0000

The results suggest that the relationship between read and write (rho = 0.6167, p = 0.000) is statistically significant.
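For readers working outside Stata, the same rank-based correlation can be sketched in Python with scipy (a toy data set, not the hsb2 read/write variables):

```python
from scipy.stats import spearmanr

# Toy data: ranks agree except the last two pairs are swapped.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]

rho, p = spearmanr(x, y)  # values are converted to ranks, then correlated
```

With only one swapped pair, rho comes out to 0.9.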
P-values meaning
A p-value is a measure of how much evidence we have against the null hypothesis (H0). The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.

One often "rejects the null hypothesis" when the p-value is less than the significance level. When the null hypothesis is rejected, the result is said to be statistically significant.
bitest female=.5
The results indicate that there is no statistically significant difference (p = .2292). In other words, the proportion of females does not significantly differ from the hypothesized value of 50%.
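As a cross-check outside Stata, the same binomial test can be sketched with scipy, assuming the counts behind the output above (109 females out of 200 observations, consistent with the reported p-value):

```python
from scipy.stats import binomtest

# Assumed counts: 109 females in a sample of 200; H0: proportion = 0.5.
result = binomtest(k=109, n=200, p=0.5)
p_value = result.pvalue  # close to Stata's .2292
```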
One-sample test of proportion: tests that varname has a proportion of #p.
  prtest varname == #p

Two-sample test of proportion: tests that varname1 and varname2 have the same proportion.
  prtest varname1 == varname2
Assume that we have a sample of 74 automobiles. We wish to test whether the proportion of automobiles that are foreign is different from 40%.
  prtest foreign == .4
The test indicates that we cannot reject the hypothesis that the proportion of foreign automobiles is 0.40 at the 5% significance level.
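A hand-rolled sketch of the same one-sample proportion z-test in Python, assuming the auto data's 22 foreign cars out of 74:

```python
from math import sqrt
from scipy.stats import norm

n, k, p0 = 74, 22, 0.40                    # assumed counts from the auto data
phat = k / n
z = (phat - p0) / sqrt(p0 * (1 - p0) / n)  # score z-statistic, as prtest computes
p_value = 2 * norm.sf(abs(z))              # two-sided p-value
```

Here z is about -1.80 and p about 0.071, so the 40% hypothesis is not rejected at the 5% level, matching the conclusion above.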
We have two headache remedies that we give to patients. Each remedy's effect is recorded as 0 for failing to relieve the headache and 1 for relieving it. We wish to test the equality of the proportion of people relieved by the two treatments.
  prtest cure1 == cure2
We find that the proportions are statistically different from each other at any level greater than 3.9%.
ksmirnov performs one-sample Kolmogorov-Smirnov tests of the equality of distributions. In the first syntax, varname is the variable whose distribution is being tested, and exp must evaluate to the corresponding (theoretical) cumulative distribution.

Syntax: ksmirnov varname = exp
Example: One-sample test. Let's now test whether x in the example above is distributed normally. Kolmogorov-Smirnov is not a particularly powerful test of normality, and we do not endorse such use of it. In any case, we will test against a normal distribution with the same mean and standard deviation:
  ksmirnov x = normal((x-r(mean))/r(sd))
With the sample values substituted in:
  ksmirnov x = normal((x-4.571429)/3.457222)
The results indicate that the data cannot be distinguished from normally distributed data.
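The same kind of one-sample test can be sketched in Python with scipy's kstest, here on a small hypothetical sample tested against a normal distribution with the sample's own mean and standard deviation:

```python
from statistics import mean, stdev
from scipy.stats import kstest

x = [2, 3, 4, 4, 5, 5, 6, 7]   # hypothetical sample, not the data above
d, p = kstest(x, 'norm', args=(mean(x), stdev(x)))
```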
The first line tests the hypothesis that x for group 1 contains smaller values than for group 2. The largest difference between the distribution functions is 0.5. The approximate p-value for this is 0.424, which is not significant. The second line tests the hypothesis that x for group 1 contains larger values than for group 2. The largest difference between the distribution functions in this direction is 0.1667. The approximate p-value for this small difference is 0.909. Finally, the approximate p-value for the combined test is 0.785, corrected to 0.735.
McNemar test
You would perform McNemar's test if you were interested in the marginal frequencies of two binary outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-control study) or two outcome variables from a single group.

Example:
Consider two questions, Q1 and Q2, from a test taken by 200 students. Suppose 172 students answered both questions correctly, 15 students answered both questions incorrectly, 7 answered Q1 correctly and Q2 incorrectly, and 6 answered Q2 correctly and Q1 incorrectly. These counts can be considered in a two-way contingency table. The null hypothesis is that the two questions are answered correctly or incorrectly at the same rate (or that the contingency table is symmetric). We can enter these counts into Stata using mcci, a command from Stata's epidemiology tables. The outcome is labeled according to case-control study conventions.
McNemar test
McNemar's chi-square statistic suggests that there is not a statistically significant difference in the proportions of correct/incorrect answers to these two questions.
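McNemar's statistic depends only on the discordant pairs, so it is easy to compute by hand; a Python sketch using the counts from the example (7 and 6):

```python
from scipy.stats import chi2

b, c = 7, 6                    # discordant pairs from the example above
stat = (b - c) ** 2 / (b + c)  # McNemar chi-square, 1 degree of freedom
p_value = chi2.sf(stat, df=1)
```

Here stat is about 0.077 and p about 0.78, consistent with the non-significant result above.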
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the two variables is interval and normally distributed (but you do assume the difference is ordinal). We will use the same example as above, but we will not assume that the difference between read and write is interval and normally distributed.
The results suggest that there is not a statistically significant difference between read and write.
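In Python the same signed-rank test can be sketched with scipy (toy paired scores, not the hsb2 read/write data):

```python
from scipy.stats import wilcoxon

read = [57, 68, 44, 63, 47, 44, 50, 34, 63, 57]   # hypothetical pairs
write = [52, 59, 33, 44, 52, 52, 59, 46, 65, 55]

# Tests whether the paired differences are symmetric about zero.
stat, p = wilcoxon(read, write)
```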
If you believe the differences between read and write are not ordinal but can merely be classified as positive and negative, then you may want to consider a sign test in lieu of the signed-rank test. Again, we will use the same variables in this example and assume that this difference is not ordinal.
This output gives both of the one-sided tests as well as the two-sided test. Assuming that we were looking for any difference, we would use the two-sided test and conclude that no statistically significant difference was found (p = .5565).
The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your cells has an expected frequency of five or less. Remember that the chi-square test assumes that each cell has an expected frequency of five or more, but Fisher's exact test has no such assumption and can be used regardless of how small the expected frequency is. In the example below, we have cells with observed frequencies of two and one, which may indicate expected frequencies below five, so we will use Fisher's exact test with the exact option on the tabulate command.
These results suggest that there is not a statistically significant relationship between race and type of school (p = 0.597). Note that the Fisher's exact test does not have a "test statistic", but computes the p-value directly.
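A scipy sketch of Fisher's exact test on a small 2x2 table (Fisher's classic tea-tasting counts, used here as a toy example rather than the race/schtyp table):

```python
from scipy.stats import fisher_exact

table = [[3, 1],
         [1, 3]]          # toy 2x2 table with small cell counts
odds_ratio, p = fisher_exact(table)
```

The p-value is computed directly from the hypergeometric distribution; for this table it is 34/70, about 0.486.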
An independent samples t-test is used when you want to compare the means of a normally distributed interval dependent variable for two independent groups. For example, we wish to test whether the mean for write is the same for males and females.
The results indicate that there is a statistically significant difference between the mean writing score for males and females (t = -3.7341, p = .0002). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12).
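An equivalent two-sample t-test can be sketched in Python (toy male/female scores, not the hsb2 data):

```python
from scipy.stats import ttest_ind

males = [50, 48, 52, 47, 53]     # hypothetical writing scores
females = [55, 54, 56, 53, 57]

t, p = ttest_ind(males, females)  # pooled-variance t-test by default
```

Here t is about -3.73 with p < 0.01, so the group means differ significantly.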
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal). You will notice that the Stata syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We will use the same variables in this example as we did in the independent t-test example above and will not assume that write, our dependent variable, is normally distributed. The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.0009). You can determine which group has the higher rank by looking at how the actual rank sums compare to the expected rank sums under the null hypothesis. The sum of the female ranks was higher while the sum of the male ranks was lower. Thus the female group had the higher rank.
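The Wilcoxon-Mann-Whitney test can likewise be sketched with scipy (toy groups in which every value of y exceeds every value of x):

```python
from scipy.stats import mannwhitneyu

x = [1, 2, 3]
y = [4, 5, 6]

u, p = mannwhitneyu(x, y, alternative='two-sided')  # exact p for small samples
```

U = 0 here (no x value exceeds any y value), and the exact two-sided p-value is 2/20 = 0.1.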
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical variables. In Stata, the chi2 option is used with the tabulate command to obtain the test statistic and its associated p-value. Example: let's see if there is a relationship between the type of school attended (schtyp) and students' gender (female). Remember that the chi-square test assumes the expected value of each cell is five or higher. These results indicate that there is no statistically significant relationship between the type of school attended and gender (chi-square with one degree of freedom = 0.0470, p = 0.828).
Chi-square test
Let's look at another example, this time looking at the relationship between gender (female) and socio-economic status (ses). The point of this example is that one (or both) variables may have more than two levels, and that the variables do not have to have the same number of levels. In this example, female has two levels (male and female) and ses has three levels (low, medium and high).
Again we find that there is no statistically significant relationship between the variables (chi-square with two degrees of freedom = 4.5765, p = 0.101).
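The Pearson chi-square on a contingency table can be sketched in Python (a toy 2x2 table; correction=False reproduces the uncorrected statistic that tabulate, chi2 reports):

```python
from scipy.stats import chi2_contingency

table = [[20, 30],
         [30, 20]]        # toy 2x2 table
stat, p, dof, expected = chi2_contingency(table, correction=False)
```

For this table chi-square = 4.0 with 1 degree of freedom and p about 0.046.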
Kruskal-Wallis
The Kruskal Wallis test is used when you have one independent variable with two or more levels and an ordinal dependent variable.
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different value of chi-square. With or without ties, the results indicate that there is a statistically significant difference among the three types of programs.
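A scipy sketch of the Kruskal-Wallis test on toy scores for three groups (not the hsb2 program types):

```python
from scipy.stats import kruskal

g1, g2, g3 = [1, 2, 3], [4, 5, 6], [7, 8, 9]  # toy groups with no ties
h, p = kruskal(g1, g2, g3)                    # rank-based one-way test
```

With these fully separated groups H = 7.2 and p < 0.05.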
Linear Regression
We use regression to estimate the unknown effect of changing one variable over another. Technically, linear regression estimates how much Y changes when X changes one unit. Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e., which are your outcome and predictor variables).

Previous steps:
1. Examine descriptive statistics
2. Look at the relationship graphically and test correlation(s)
3. Run and interpret the regression
4. Test regression assumptions
+ An example: SAT expenditures

Are SAT scores higher in states that spend more money on education, controlling for other factors?*

Outcome (Y) variable: SAT scores, variable csat in dataset
Predictor (X) variables:
  Per pupil expenditures primary & secondary (expense)
  % HS graduates taking SAT (percent)
  Median household income (income)
  % adults with HS diploma (high)
  % adults with college degree (college)
  Region (region)
* Source: Data and examples come from the book Statistics with Stata (updated for version 9) by Lawrence C. Hamilton (chapter 6). Use the file states.dta (educational data for the U.S.).
[Scatterplot: csat against expense]
How a state's mean SAT changes if its expenditure increases one unit: for each one-point increase in expense, SAT scores decrease by 0.022.
The t-values test the hypothesis that the coefficient is different from 0. To reject this, you need a t-value greater than 1.96 (for 95% confidence). You can get the t-values by dividing the coefficient by its standard error. The t-values also show the importance of a variable in the model.
Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05 (you could also choose an alpha of 0.10). In this case, expense is statistically significant in explaining SAT.
R-square shows the amount of variance of Y explained by X. In this case expense explains 22% of the variance in SAT scores.
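The mechanics can be sketched outside Stata with a small least-squares fit in numpy (toy data, not the SAT variables): estimate intercept and slope, then compute R-square as the share of variance explained:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # toy predictor
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])       # toy outcome, roughly 2*x

X = np.column_stack([np.ones_like(x), x])     # add intercept column
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # [intercept, slope]

resid = y - X @ coef
r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
```

Here the slope is about 1.97 with R-square above 0.99.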
Region is entered here as a set of dummy variables. First, generate the dummies: tab region, g(reg)
. tab region, g(reg)

  Geographical region |  Freq.
  --------------------+-------
  West                |     13
  N. East             |      9
  South               |     16
  Midwest             |     12
  --------------------+-------
  Total               |     50
        csat |     Coef.   Robust Std. Err.   P>|t|    [95% Conf. Interval]
  -----------+-------------------------------------------------------------
     expense |  -.002021       .0035883       0.576    -.0092676   .0052256
     percent | -3.007647       .2358047       0.000    -3.483864   -2.53143
      income | -.1674421       1.196409       0.889    -2.583638   2.248754
        high |  1.814731        1.02694       0.085    -.2592168   3.888679
     college |  4.670564       1.599798       0.006     1.439705   7.901422
        reg2 |  69.45333       17.99933       0.000     33.10295   105.8037
        reg3 |  25.39701       12.52558       0.049      .101086   50.69293
        reg4 |  34.57704        9.44989       0.001      15.4926   53.66149
       _cons |  808.0206       67.86418       0.000     670.9661   945.0751
xi: regress csat expense percent income high college i.region, robust
        csat |     Coef.   Robust Std. Err.   P>|t|    [95% Conf. Interval]
  -----------+-------------------------------------------------------------
     expense |  -.002021       .0035883       0.576    -.0092676   .0052256
     percent | -3.007647       .2358047       0.000    -3.483864   -2.53143
      income | -.1674421       1.196409       0.889    -2.583638   2.248754
        high |  1.814731        1.02694       0.085    -.2592168   3.888679
     college |  4.670564       1.599798       0.006     1.439705   7.901422
  _Iregion_2 |  69.45333       17.99933       0.000     33.10295   105.8037
  _Iregion_3 |  25.39701       12.52558       0.049      .101086   50.69293
  _Iregion_4 |  34.57704        9.44989       0.001      15.4926   53.66149
       _cons |  808.0206       67.86418       0.000     670.9661   945.0751
NOTE: By default xi excludes the first value. To select a different value, before running the regression type:
  char region[omit] 4
  xi: regress csat expense percent income high college i.region, robust
This will select Midwest (4) as the reference category for the dummy variables.
Below is a correlation matrix for all variables in the model. The numbers are Pearson correlation coefficients, which range from -1 to 1; values closer to -1 or 1 indicate a stronger correlation. A negative value indicates an inverse relationship (roughly, when one goes up the other goes down).
pwcorr csat expense percent income high college, star(0.05) sig
                csat   expense   percent    income      high   college
     csat     1.0000
  expense    -0.4663*   1.0000
              0.0006
  percent    -0.8758*             1.0000
              0.0000
   income    -0.4713*                       1.0000
              0.0005
     high     0.0858                                  1.0000
              0.5495
  college    -0.3729*                                 0.5319*   1.0000
              0.0070                                  0.0001
The graph matrix command produces a graphical representation of the correlation matrix by presenting a series of scatterplots for all variables:

  graph matrix csat expense percent income high college, half maxis(ylabel(none) xlabel(none))
Stata offers several user-friendly options for storing and viewing regression output from multiple models:
Regression: eststo/esttab
We run each regression and store its estimates:

  regress csat expense, robust
  eststo model1
  regress csat expense percent income high college, robust
  eststo model2
Regression: eststo/esttab
Now Stata will hold your output in memory until you ask to recall it:

  esttab model1 model2 model3

  ----------------------------------------------------
                     (1)          (2)          (3)
                    csat         csat         csat
  ----------------------------------------------------
  expense        -0.0223***    0.00335     -0.00202
                 (-6.07)       (0.70)      (-0.56)
  percent                     -2.618***    -3.008***
                             (-11.44)     (-12.75)
  income                       0.106       -0.167
                               (0.09)      (-0.14)
  high                         1.631        1.815
                               (1.73)       (1.77)
  college                      2.031        4.671**
                               (0.96)       (2.92)
  _Iregion_2                               69.45***
                                            (3.86)
  _Iregion_3                               25.40*
                                            (2.03)
  _Iregion_4                               34.58***
                                            (3.66)
  _cons          1060.7***    851.6***     808.0***
                 (43.55)      (14.86)      (11.91)
  ----------------------------------------------------
  N                  51           51           50
  ----------------------------------------------------
Regression: eststo/esttab
Some options (type help eststo and help esttab for more options):

  esttab model1 model2 model3, r2 ar2 se label

  --------------------------------------------------------------------
                             (1)           (2)           (3)
                        Mean compo~e  Mean compo~e  Mean compo~e
  --------------------------------------------------------------------
  Per pupil expendit~c   -0.0223***     0.00335      -0.00202
                         (0.00367)     (0.00478)     (0.00359)
  % HS graduates tak~T                 -2.618***     -3.008***
                                       (0.229)       (0.236)
  Median household~000                  0.106        -0.167
                                       (1.207)       (1.196)
  % adults HS diploma                   1.631         1.815
                                       (0.943)       (1.027)
  % adults college degree               2.031         4.671**
                                       (2.114)       (1.600)
  region==2                                          69.45***
                                                     (18.00)
  region==3                                          25.40*
                                                     (12.53)
  region==4                                          34.58***
                                                     (9.450)
  Constant               1060.7***     851.6***      808.0***
                         (24.35)       (57.29)       (67.86)
  --------------------------------------------------------------------
  N                         51            51            50
  R-sq                      0.217         0.824         0.911
  adj. R-sq                 0.201         0.805         0.894
  --------------------------------------------------------------------
Regression: outreg2
  regress csat expense, robust
  outreg2 using prediction.doc
  regress csat expense percent income high college, robust
  outreg2 using prediction.doc, append
  xi: regress csat expense percent income high college i.region, robust
  outreg2 using prediction.doc, append
Regression: outreg2
How good the model is will depend on how well it predicts Y, the linearity of the model, and the behavior of the residuals. Use predict immediately after running the regression:
  xi: regress csat expense percent percent2 income high college i.region, robust
  predict csat_predict
  label variable csat_predict "csat predicted"
[Scatterplot: observed csat (y-axis) against predicted csat (x-axis)]
We should expect a 45-degree pattern in the data. The y-axis is the observed data and the x-axis the predicted data (Yhat). In this case the model seems to be doing a good job of predicting csat.
  The dependent variable is normally distributed
  The errors of the regression equation are normally distributed
  The variance around the regression line is the same for all values of the predictor variable (X)
Assumption 2: Homoscedasticity
The size of one error is not a function of the size of any previous error.
AKA the relationship can be summarized with a straight line. Keep in mind that you can use alternative forms of regression to test non-linear relationships.
The Shapiro-Wilk test of normality tests the null hypothesis that the data are normally distributed.
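A scipy sketch of the Shapiro-Wilk test on a small hypothetical sample:

```python
from scipy.stats import shapiro

x = [4.2, 5.1, 3.8, 4.9, 5.0, 4.4, 4.7, 5.3, 4.1, 4.6]  # hypothetical sample
w, p = shapiro(x)   # H0: the data come from a normal distribution
```

A large p-value means no evidence against normality.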
Note: the rvfplot command needs to be entered after the regression equation is run. Stata uses estimates from the regression to create this plot.
A non-graphical way to detect heteroskedasticity is the Breusch-Pagan test. The null hypothesis is that the residuals are homoskedastic. In the example below we fail to reject the null at 95% and conclude that the residuals are homogeneous. However, at 90% we reject the null and conclude that the residuals are not homogeneous.
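The Breusch-Pagan idea can be sketched by hand: regress the squared OLS residuals on the predictors and compare LM = n * R-square of that auxiliary regression against a chi-square distribution. A numpy sketch on simulated homoskedastic data:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)    # homoskedastic errors by construction
X = np.column_stack([np.ones(n), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e2 = (y - X @ beta) ** 2              # squared OLS residuals

gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)   # auxiliary regression
fitted = X @ gamma
r2 = 1 - ((e2 - fitted) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()

lm = n * r2                           # Breusch-Pagan LM statistic
p_value = chi2.sf(lm, df=1)           # df = number of slope regressors
```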
Logit/Probit Regression
Logit model
Use logit models whenever your dependent variable is binary (also called dummy), taking values 0 or 1. Logit regression is a nonlinear regression model that forces the output (predicted probabilities) to lie between 0 and 1. Logit models estimate the probability of your dependent variable being 1 (Y=1); this is the probability that some event happens. Logit and probit models are basically the same; the difference is in the distribution:
  Logit: cumulative standard logistic distribution (F)
  Probit: cumulative standard normal distribution (Φ)
Both models provide similar results.
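The two link functions can be compared directly; a Python sketch of how each maps the linear index xb into a probability:

```python
from math import exp
from scipy.stats import norm

def logistic_cdf(xb):
    # logit link: cumulative standard logistic distribution
    return 1.0 / (1.0 + exp(-xb))

xb = 0.0
p_logit = logistic_cdf(xb)   # 0.5 at xb = 0
p_probit = norm.cdf(xb)      # probit link: standard normal CDF, also 0.5
```

Both links are symmetric around 0.5, which is why the two models usually give similar substantive results.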
Logit model
Logit: adjust
Ordinal logit
When a dependent variable has more than two categories and the values of each category have a meaningful sequential order where a value is indeed higher than the previous one, then you can use ordinal logit. Here is an example of the type of variable:
Panel data allows you to control for variables you cannot observe or measure, like cultural factors or differences in business practices across companies, or variables that change over time but not across entities (i.e., national policies, federal regulations, international agreements, etc.). That is, it accounts for individual heterogeneity. With panel data you can include variables at different levels of analysis (i.e., students, schools, districts, states), suitable for multilevel or hierarchical modeling. Some drawbacks are data collection issues (i.e., sampling design, coverage), non-response in the case of micro panels, or cross-country dependency in the case of macro panels (i.e., correlation between countries).
FIXED-EFFECTS MODEL (Covariance Model, Within Estimator, Individual Dummy Variable Model, Least Squares Dummy Variable Model)
Fixed effects
!
Use fixed-effects (FE) whenever you are only interested in analyzing the impact of variables that vary over time. FE explores the relationship between predictor and outcome variables within an entity (country, person, company, etc.). Each entity has its own individual characteristics that may or may not influence the predictor variables (for example, being male or female could influence the opinion toward a certain issue; the political system of a particular country could have some effect on trade or GDP; or the business practices of a company may influence its stock price).
Fixed effects
!
When using FE we assume that something within the individual may impact or bias the predictor or outcome variables and we need to control for this. This is the rationale behind the assumption of the correlation between the entity's error term and the predictor variables. FE removes the effect of those time-invariant characteristics from the predictor variables so we can assess the predictors' net effect. Another important assumption of the FE model is that those time-invariant characteristics are unique to the individual and should not be correlated with other individual characteristics. Each entity is different, therefore the entity's error term and the constant (which captures individual characteristics) should not be correlated with the others. If the error terms are correlated then FE is not suitable, since inferences may not be correct and you need to model that relationship (probably using random effects); this is the main rationale for the Hausman test (presented later in this document).
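The within (demeaning) transformation behind FE can be sketched in numpy: subtracting each entity's means wipes out anything constant within the entity, so the slope is recovered from within-entity variation only (toy panel with two entities whose intercepts differ by 10 but who share the slope 2.0):

```python
import numpy as np

entity = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0, 12.0, 14.0, 16.0])  # entity 1 has a +10 intercept

def demean(v, ids):
    # subtract each entity's own mean (the "within" transformation)
    out = v.astype(float).copy()
    for g in np.unique(ids):
        out[ids == g] -= v[ids == g].mean()
    return out

xd, yd = demean(x, entity), demean(y, entity)
beta_fe = (xd @ yd) / (xd @ xd)   # OLS slope on the demeaned data
```

beta_fe recovers 2.0 exactly; the entity-specific intercepts never enter.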
Fixed effects
OLS regression
+ Fixed effects: comparing xtreg (with fe), regress (OLS with dummies) and areg
A note on fixed-effects
"The fixed-effects model controls for all time-invariant differences between the individuals, so the estimated coefficients of the fixed-effects models cannot be biased because of omitted time-invariant characteristics [like culture, religion, gender, race, etc.]. One side effect of the features of fixed-effects models is that they cannot be used to investigate time-invariant causes of the dependent variables. Technically, time-invariant characteristics of the individuals are perfectly collinear with the person [or entity] dummies. Substantively, fixed-effects models are designed to study the causes of changes within a person [or entity]. A time-invariant characteristic cannot cause such a change, because it is constant for each person." (Underline is mine) Kohler, Ulrich, and Frauke Kreuter, Data Analysis Using Stata, 2nd ed., p. 245.
Random effects
Random effects assume that the entity's error term is not correlated with the predictors, which allows time-invariant variables to play a role as explanatory variables. In random-effects models you need to specify those individual characteristics that may or may not influence the predictor variables. The problem with this is that some variables may not be available, therefore leading to omitted-variable bias in the model. RE allows you to generalize the inferences beyond the sample used in the model.
FIXED OR RANDOM?
+ Testing
Source: Hoechle, Daniel, "Robust Standard Errors for Panel Regressions with Cross-Sectional Dependence," http://fmwww.bc.edu/repec/bocode/x/xtscc_paper.pdf
NOTE: Use the robust option to control for heteroskedasticity (in both fixed and random effects).