
Screening Data
Yue Jiao

Screening data
Issues to deal with:
(1) Accuracy
(2) Missing data
(3) Fit between data set and the assumptions
(4) Transformations of variables
(5) Outliers
(6) Perfect or near-perfect correlations

Accuracy
Proofreading: For a large data set, screening
for accuracy involves examining descriptive
statistics and graphic representations of the
variables.
Honest correlation: It is important that the
correlations, whether between two continuous
variables or between a dichotomous and a
continuous variable, be as accurate as possible.

Accuracy
Inflated Correlation: When composite variables are
constructed from several individual items by
pooling responses to individual items, correlations
are inflated if some items are reused
Deflated Correlation: A falsely small correlation
between two continuous variables is obtained if the
range of responses to one or both of the variables
is restricted in the sample
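
The deflating effect of a restricted range can be illustrated with simulated data. The sketch below is only an illustration: the data are randomly generated, and the helper `pearson_r`, the seed, and the band of -0.5 to 0.5 are assumptions chosen for the example.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed; simulated data, not real results
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]  # y depends linearly on x plus noise

# Correlation over the full range of x.
r_full = pearson_r(x, y)

# Correlation when only cases in a narrow band of x are sampled.
kept = [(xi, yi) for xi, yi in zip(x, y) if -0.5 < xi < 0.5]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"full range r = {r_full:.2f}, restricted range r = {r_restricted:.2f}")
```

The underlying relationship is the same in both cases; only the range of x in the sample differs, and the restricted sample yields the smaller correlation.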

Missing Data
Missing data are characterized as:
MCAR (missing completely at random),
MAR (missing at random, called ignorable nonresponse),
MNAR (missing not at random or nonignorable).

The distribution of missing data is unpredictable in
MCAR.
The pattern of missing data is predictable from
other variables in the data set when data are MAR.
In MNAR, the missingness is related to the variable
itself and, therefore, cannot be ignored.

Missing Data:
(1) Delete cases with missing data, if only a few cases have
missing data and they seem to be a random subsample of the
whole sample.
(2) Estimate missing data using prior knowledge, mean
substitution, regression, expectation-maximization, or
multiple imputation.
(3) With randomly missing data, analyze a missing-data
correlation matrix.
(4) Treat missing data as data.
(5) Repeat analyses with and without missing data.
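
Options (1) and (2) can be sketched in a few lines. This is a minimal illustration, not a recommended procedure: the tiny data set, the variable names `age` and `income`, and the helpers `listwise_delete` and `mean_impute` are all hypothetical, and missing values are assumed to be stored as `None`.

```python
# Hypothetical data set; None marks a missing value.
data = [
    {"age": 23, "income": 41.0},
    {"age": 35, "income": None},
    {"age": 41, "income": 58.0},
    {"age": None, "income": 52.0},
    {"age": 29, "income": 45.0},
]

def listwise_delete(rows):
    """Option (1): keep only cases with no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def mean_impute(rows, key):
    """Option (2), simplest form: replace missing values with the observed mean."""
    observed = [r[key] for r in rows if r[key] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, key: mean if r[key] is None else r[key]} for r in rows]

complete = listwise_delete(data)       # 3 complete cases remain
imputed = mean_impute(data, "income")  # the missing income becomes 49.0

print(len(complete), imputed[1]["income"])
```

Running the same analysis on `complete` and on `imputed` is one simple way to carry out option (5) and see whether the conclusions depend on how the missing values were handled.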

Outlier
An outlier is a case with such an extreme value on one variable (a
univariate outlier) or such a strange combination of scores on two
or more variables (multivariate outlier) that it distorts statistics.
Reasons for outliers:
(1) incorrect data entry.
(2) failure to specify missing-value codes in computer syntax so that
missing-value indicators are read as real data.
(3) the outlier is not a member of the population from which you
intended to sample.
(4) the case is from the intended population, but the distribution for
the variable in the population has more extreme values than a normal
distribution.
Options for reducing the influence of outliers:
Variable elimination - delete the variable that is
responsible for most of the outliers
Variable transformation - undertaken to change the
shape of the distribution to more nearly normal
Score alteration - change the score on the variable
for the outlying case so that it is still deviant, but
not as deviant as it was
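
Score alteration can be sketched as follows. The rule used here, setting the outlying score one unit beyond the next most extreme score, is one common convention and is an assumption of this example; the function name `alter_high_outlier` and the income values are hypothetical.

```python
def alter_high_outlier(values, outlier_index, step=1):
    """Replace a high outlier with a score `step` units above the next highest score."""
    others = [v for i, v in enumerate(values) if i != outlier_index]
    altered = list(values)  # copy, so the original scores are left untouched
    altered[outlier_index] = max(others) + step
    return altered

incomes = [31, 28, 35, 40, 33, 150]  # the last case is the outlier
altered = alter_high_outlier(incomes, outlier_index=5)

print(altered)  # [31, 28, 35, 40, 33, 41]
```

The altered case is still the highest score in the distribution, so its rank is preserved, but it no longer dominates the mean and standard deviation.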

Normality
Two components of normality:
(1) Skewness has to do with the symmetry of the
distribution; a skewed variable is a variable whose
mean is not in the center of the distribution.
(2) Kurtosis has to do with the peakedness of a
distribution; a distribution is either too peaked (with
short, thick tails) or too flat (with long, thin tails).
When a distribution is normal, the values of
skewness and kurtosis (measured as excess kurtosis)
are zero.
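
Both statistics can be computed from central moments. The sketch below uses the simple moment-based definitions, with kurtosis reported as excess kurtosis so that a normal distribution scores zero; statistical packages often apply small-sample corrections, so their output may differ slightly. The function names and the two small data sets are assumptions for illustration.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (value - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** k for v in values) / len(values)

def skewness(values):
    """Third central moment divided by the 1.5 power of the variance."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # zero: the distribution is symmetric
print(skewness(right_skewed))      # positive: long tail to the right
print(excess_kurtosis(symmetric))  # negative: flatter than a normal distribution
```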

Linearity
The assumption of linearity is that there is a
straight-line relationship between two variables
(where one or both of the variables can be
combinations of several variables).
Linearity is important in a practical sense because
Pearson's r only captures the linear relationships
among variables; if there are substantial nonlinear
relationships among variables, they are ignored.
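
A small constructed example shows how Pearson's r can miss a perfectly regular nonlinear relationship. The data and the helper `pearson_r` are assumptions made for this illustration.

```python
def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * xi + 1 for xi in x]  # straight-line relationship
y_curved = [xi ** 2 for xi in x]     # U-shaped relationship

r_linear = pearson_r(x, y_linear)  # 1.0 up to rounding
r_curved = pearson_r(x, y_curved)  # 0.0, although y is fully determined by x

print(r_linear, r_curved)
```

This is why scatterplots are inspected before correlations are interpreted: a value of r near zero does not by itself rule out a strong relationship.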

Homoscedasticity
The assumption of homoscedasticity is that the
variability in scores for one continuous variable is
roughly the same at all values of another
continuous variable.
Homoscedasticity is related to the assumption of
normality because when the assumption of
multivariate normality is met, the relationships
between variables are homoscedastic.
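
One rough numerical check, as a complement to a scatterplot, is to compare the spread of y around a fitted line at low and at high values of x. The sketch below uses simulated data; the helpers `fit_line` and `spread_ratio`, the median split, and the seed are assumptions of this example rather than a standard test.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for y on x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def spread_ratio(xs, ys):
    """SD of residuals above the median of x divided by the SD below it."""
    slope, intercept = fit_line(xs, ys)
    cut = statistics.median(xs)
    low = [y - (slope * x + intercept) for x, y in zip(xs, ys) if x <= cut]
    high = [y - (slope * x + intercept) for x, y in zip(xs, ys) if x > cut]
    return statistics.pstdev(high) / statistics.pstdev(low)

rng = random.Random(1)  # fixed seed; simulated data only
x = [rng.uniform(1, 10) for _ in range(2000)]
y_even = [2 * xi + rng.gauss(0, 1) for xi in x]        # constant spread
y_fan = [2 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # spread grows with x

ratio_even = spread_ratio(x, y_even)
ratio_fan = spread_ratio(x, y_fan)
print(f"homoscedastic: {ratio_even:.2f}, heteroscedastic: {ratio_fan:.2f}")
```

A ratio near 1 is consistent with homoscedasticity; a ratio well above or below 1 suggests that the variability of y changes across values of x.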

Common Data Transformations

Data transformations are recommended as a
remedy for outliers and for failures of normality,
linearity, and homoscedasticity.
Transformed variables are sometimes harder to
interpret.
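
The effect of a transformation on the shape of a distribution can be checked directly. The sketch below applies square root and logarithm transformations, two commonly used choices for positively skewed variables, to a made-up set of scores and compares the skewness before and after. The choice of these two transformations, the data, and the helper `skewness` are assumptions of this example.

```python
import math

def skewness(values):
    """Moment-based skewness (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]  # positively skewed, all values > 0
sqrt_scores = [math.sqrt(v) for v in raw]
log_scores = [math.log(v) for v in raw]  # requires strictly positive scores

skew_raw = skewness(raw)
skew_sqrt = skewness(sqrt_scores)
skew_log = skewness(log_scores)

print(f"raw {skew_raw:.2f}, sqrt {skew_sqrt:.2f}, log {skew_log:.2f}")
```

For these scores the logarithm reduces skewness more than the square root does. The cost is interpretability: results are then stated in log units rather than in the original units of measurement.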

Multicollinearity and Singularity

Multicollinearity and singularity are problems with
a correlation matrix that occur when variables are
too highly correlated.
With multicollinearity, the variables are very highly
correlated (say, .90 and above).
With singularity, the variables are redundant; one of
the variables is a combination of two or more of the
other variables.
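
Both problems can be detected from the correlation matrix. The sketch below flags pairs of variables that correlate at .90 or above, and shows that the determinant of the correlation matrix is (numerically) zero when one variable is the sum of two others. The variables and helper functions are hypothetical, and the determinant check is one possible diagnostic chosen for this example.

```python
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def high_pairs(named, cutoff=0.90):
    """Names of variable pairs whose absolute correlation is at or above cutoff."""
    return [(n1, n2)
            for (n1, c1), (n2, c2) in combinations(named.items(), 2)
            if abs(pearson_r(c1, c2)) >= cutoff]

def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 5]
near_copy = [1.1, 2.0, 2.9, 4.2, 5.0, 6.1]  # almost a duplicate of a
total = [x + y for x, y in zip(a, b)]       # exact combination of a and b

# Multicollinearity: a and near_copy correlate above .90.
flagged = high_pairs({"a": a, "b": b, "near_copy": near_copy})

# Singularity: the correlation matrix of a, b, and a + b has determinant zero.
columns = [a, b, total]
corr = [[pearson_r(ci, cj) for cj in columns] for ci in columns]
determinant = det3(corr)

print(flagged, determinant)
```

A determinant at or very near zero means the matrix cannot be inverted reliably, which is why analyses that require matrix inversion fail or become unstable with such variables.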

Checklist for Screening Data

1. Inspect univariate descriptive statistics for accuracy of input
   a. Out-of-range values
   b. Plausible means and standard deviations
   c. Univariate outliers
2. Evaluate amount and distribution of missing data; deal with problem
3. Check pairwise plots for nonlinearity and heteroscedasticity
4. Identify and deal with nonnormal variables and univariate outliers
   a. Check skewness and kurtosis, probability plots
   b. Transform variables (if desirable)
   c. Check results of transformation
5. Identify and deal with multivariate outliers
   a. Variables causing multivariate outliers
   b. Description of multivariate outliers
6. Evaluate variables for multicollinearity and singularity
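
For the multivariate outliers in the checklist, one widely used screening statistic is Mahalanobis distance, which measures how far a case lies from the centroid of the data while taking the correlation between variables into account. The slides do not name a method, so its use here is an assumption, as are the two-variable data set and the helper `mahalanobis_sq_2d`.

```python
def mahalanobis_sq_2d(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2 x 2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# The last case (8, 2) has an ordinary x and an ordinary y,
# but the combination is unusual because x and y otherwise rise together.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 8]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.9, 7.1, 8.0, 9.1, 9.9, 2.0]

d2 = mahalanobis_sq_2d(x, y)
most_extreme = d2.index(max(d2))
print(most_extreme, round(d2[most_extreme], 2))
```

Neither score of the flagged case would stand out in a univariate check, which is why multivariate outliers are screened separately from univariate ones.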
