
Pitfalls of Data Analysis

(or How to Avoid Lies and Damned Lies)

A Review
A popular myth about statistics is that you can prove anything with them, but this is true only when statistics are used improperly. Bending the basic rules of statistics is, unfortunately, a common sight. Mark Twain summed it up perfectly when he said, "There are three kinds of lies: lies, damned lies, and statistics."

Statistics requires the ability to consider things from a probabilistic perspective, which brings in technical terms such as "confidence", "reliability", and "significance". Non-mathematicians, on the other hand, tend to reason with ordinary logic, even though quantitative data carries great weight for them.

Turning to the pitfalls of statistics, which are the core of the paper: there are mainly three types of statistical pitfalls, namely sources of bias, errors in methodology, and problems concerning the interpretation of results.

The fundamental value of statistical methodology is its capability to help one make inferences about a large group based on observations of a smaller subset of that group. For such inferences to be valid, two things must hold:

1) The sample must be similar to the target population in all relevant aspects
2) Certain aspects of the measured variables must conform to assumptions which underlie the
statistical procedures to be applied.

Representative sampling. The sample must be representative of the target population for interpretations to be valid. Of course, the problem comes in applying this principle to real situations. The ideal scenario is one where the sample is chosen by selecting members of the population at random, with each member having an equal probability of being selected. This process may be feasible for manufacturing processes, but it is more complicated and problematic when studying people. The following example given by the author helps in understanding this concept.

It is reasonable to assume that people applying for jobs during a slump might differ as a group from those applying during a period of economic growth. In this case, you would want to exercise caution before applying any kind of statistical approach. There are ways to adjust for, or "control", differences between groups statistically, such as including covariates in a linear model. Levin rightly pointed out that there are problems with this approach too: one can never be sure one has accounted for all the important variables.

Moreover, the inclusion of such controls depends on certain assumptions which may or may not be satisfied in a given situation (see below for more on assumptions).
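As a rough sketch of this kind of covariate adjustment (my own illustration, not the author's; the data, the variable names, and the use of statsmodels are all assumptions), one can add the covariate as an extra term in an ordinary least squares model:

```python
# Hypothetical illustration of "controlling" for a group difference by
# including a covariate in a linear model. All names and numbers are
# invented for this sketch.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
slump = rng.integers(0, 2, n)                       # applied during a slump? (0/1)
experience = rng.normal(5, 2, n) + 1.5 * slump      # groups differ on the covariate
test_score = 50 + 2.0 * experience + 3.0 * slump + rng.normal(0, 5, n)

df = pd.DataFrame({"slump": slump,
                   "experience": experience,
                   "test_score": test_score})

# Adding "experience" adjusts the slump/growth comparison for that
# covariate -- but only if linearity and the other model assumptions hold.
model = smf.ols("test_score ~ slump + experience", data=df).fit()
print(model.params)
```

Note that this adjusts only for the covariates actually included; as the review stresses, an unmeasured variable can still bias the comparison.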

Statistical assumptions.

Every statistical procedure rests on underlying assumptions, and the justification for applying that procedure depends on how well those assumptions are satisfied.

The robustness of any statistical technique only goes so far. If the distributions are non-normal, try to figure out why. A possible method for dealing with unusual distributions is to apply a transformation. However, this has dangers as well; an ill-considered transformation can do more harm than good in terms of the interpretability of results.
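As a small sketch of such a transformation (my own example, not from the paper; the log-normal data and the use of SciPy are assumptions), one might log-transform a right-skewed variable and re-check its shape:

```python
# Hypothetical illustration: a skewed variable before and after a log
# transformation. Whether this helps interpretability depends on the data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)    # strongly right-skewed

print("skewness before:", stats.skew(x))
x_log = np.log(x)                                    # only defined for positive values
print("skewness after: ", stats.skew(x_log))

# Shapiro-Wilk test on the transformed data; a large p-value is merely
# consistent with normality, it does not prove it.
print("Shapiro-Wilk p-value:", stats.shapiro(x_log).pvalue)
```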

Errors in Methodology

Statistical Power: The power of any test of statistical significance is defined as the probability that it will
reject a false null hypothesis. Statistical power is inversely related to beta or the probability of making
a Type II error. In short, power = 1 – β. Statistical power is affected chiefly by the size of the effect and the
size of the sample used to detect it. Bigger effects are easier to detect than smaller effects, while large
samples offer greater test sensitivity than small samples.
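To make the power = 1 − β relationship concrete, here is a small sketch (my illustration; the paper names no software) using statsmodels' power calculator for a two-sample t-test:

```python
# How required sample size falls as the effect size grows, holding
# alpha = 0.05 and target power = 0.80 fixed.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.5, 0.8):                  # Cohen's d
    n = analysis.solve_power(effect_size=effect_size,
                             alpha=0.05, power=0.80)
    print(f"d = {effect_size}: ~{n:.0f} subjects per group")
```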

Multiple Comparisons: In statistics, the multiple comparisons problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values. In certain fields it is known as the look-elsewhere effect. Multiple comparisons arise when a statistical analysis involves multiple statistical tests, each of which has the potential to produce a "discovery." The more inferences are made, the more likely erroneous inferences become.
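A quick simulation makes this concrete (my own sketch; the 100-test setup and the Bonferroni correction via statsmodels are assumptions, not something the paper prescribes):

```python
# Run many t-tests on pure noise: every null hypothesis is true, so
# every rejection is a false "discovery".
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
pvals = np.array([
    stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
    for _ in range(100)
])

print("uncorrected rejections:", np.sum(pvals < 0.05))    # typically ~5 of 100

# A Bonferroni correction controls the family-wise error rate.
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print("Bonferroni rejections: ", np.sum(reject))          # typically 0
```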

Measurement Error: Measurement error is the difference between a measured quantity and its true value. It includes random error (naturally occurring errors that are to be expected with any measurement) and systematic error (caused, for example, by a mis-calibrated instrument that affects all measurements). Two characteristics of measurement which are particularly important in psychological measurement are reliability and validity. Reliability refers to the ability of a measurement instrument to measure the same thing each time it is used, and validity is the extent to which the indicator measures the thing it was designed to measure. Measurement errors can quickly grow in size when the measured quantities are used in formulas. To account for this, we should apply a formula for error propagation whenever uncertain measurements are used to calculate something else.
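For the simple case of a product or quotient, the standard propagation formula adds relative errors in quadrature; here is a minimal sketch (the density example is hypothetical, not from the paper):

```python
# Propagated uncertainty for f = a / b with independent errors:
# (sigma_f / f)^2 = (sigma_a / a)^2 + (sigma_b / b)^2
import math

def propagated_ratio_error(a, sigma_a, b, sigma_b):
    """Return f = a / b and its absolute uncertainty."""
    f = a / b
    rel = math.sqrt((sigma_a / a) ** 2 + (sigma_b / b) ** 2)
    return f, f * rel

# e.g. density = mass / volume, each measured with some uncertainty
density, sigma = propagated_ratio_error(12.4, 0.2, 4.1, 0.1)
print(f"density = {density:.2f} +/- {sigma:.2f}")
```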



Problems with interpretation

The difference between "significance" in the statistical sense and "significance" in the practical sense
continues to elude many consumers of statistical results.

Precision and accuracy are two concepts that are frequently confused. The distinction is subtle but important: precision refers to how finely an estimate is specified, whereas accuracy refers to how close an estimate is to the true value.
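A tiny simulation illustrates the difference (my own sketch; the two "instruments" are invented):

```python
# Instrument A is precise but inaccurate (small spread, biased);
# instrument B is accurate but imprecise (centred on the truth, noisy).
import numpy as np

rng = np.random.default_rng(3)
true_value = 10.0
precise_but_biased = rng.normal(loc=10.8, scale=0.05, size=1000)
accurate_but_noisy = rng.normal(loc=10.0, scale=1.00, size=1000)

for name, x in [("precise/biased", precise_but_biased),
                ("accurate/noisy", accurate_but_noisy)]:
    print(f"{name}: mean error = {x.mean() - true_value:+.2f}, "
          f"spread (sd) = {x.std():.2f}")
```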

Assessing causality is the goal of most statistical analyses, yet its subtleties escape many consumers of statistics. To justify a causal inference, one must generally have random assignment.

Graphical representation: To characterize the relationship between the data and the graphic, Tufte introduces a number he calls the "Lie Factor". This is simply the ratio of the size of the effect shown in the graphic elements to the size of the effect in the quantities they represent.
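In code form the ratio is trivial to compute; a minimal sketch with invented numbers (not data from the paper):

```python
# Tufte's Lie Factor: effect shown in the graphic / effect in the data.
# A value near 1 means the graphic is honest; far from 1, it distorts.
def lie_factor(graphic_change_pct, data_change_pct):
    return graphic_change_pct / data_change_pct

# e.g. a bar drawn 8x longer (a 700% change) for a quantity that only
# doubled (a 100% change) gives a Lie Factor of 7.
print(lie_factor(graphic_change_pct=700.0, data_change_pct=100.0))  # 7.0
```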

Some important points to keep in mind:

1. Make sure the sample is representative of the population in which you're interested.
2. Too much or too little statistical power can be harmful.
3. Use the best measurement tools available.
4. Look at magnitudes (effect sizes) rather than p-values alone.
5. Don't confuse precision with accuracy.
6. Make sure your graphs are accurate and reflect the data variation clearly.

Krishna Zanwar (BD&I)
