
Replicability & Questionable Research Practices

Readings on Sakai: Pashler & Harris, 2012; Simmons et al., 2011

Replication

Replicability: the degree to which similar results are obtained if a study is repeated

Exact replication: Repeat the study using the same methods, as exactly as possible. Rare and difficult to publish (publication bias favors novel research).

Conceptual replication: Use slightly different methods (e.g., measures, manipulations, sample) to test the same hypotheses. Very common (virtually required in some top-tier journals).

Replication-plus-extension: Add something to extend the results (e.g., another condition). Can help show that the results (1) replicate and (2) generalize.

Example of a Direct Replication:

Bargh, Chen, & Burrows (1996)

Priming a feature of a stereotyped group can yield behaviour that is consistent with the stereotype
They used words in a scrambled-sentence task to prime an elderly stereotype (e.g., old, wise, sentimental, bingo, retired, wrinkle) or neutral words (e.g., thirsty, clean, private)
After the study ended, they timed participants as they walked down the hallway
Result: Participants primed with the elderly stereotype walked more slowly

Example of a Conceptual Replication:


Elliot, Maier, Moller, Friedman, & Meinhardt (2007)

Examined the effects of presenting the color red on performance across 6 experiments
All studies used the same variables at the abstract (conceptual) level, but differed at the operational level


Experiment 1:

IV: ID number was written on the page in red, green, or black ink
DV: Number of anagrams solved correctly


Experiment 2:

IV: Cover page was red, green, or white
DV: Number of correct analogy items on an IQ test

The replicability crisis in psychology

Crisis of confidence in psychology this decade:

High-profile fraud cases (e.g., Diederik Stapel, Dirk Smeesters)
Report that psychologists are reluctant to share data for reanalysis (Wicherts, Bakker, & Molenaar, 2011)
Focus on questionable research practices (Simmons, Nelson, & Simonsohn, 2011)
Widely ridiculed publication showing extrasensory perception effects (Bem, 2011) that failed to replicate (Ritchie, Wiseman, & French, 2012)

How many results in psychology would replicate?

Extrasensory Perception Studies

Bem (2011) reported 9 experiments supporting precognition of events before they occurred

The experiments examined well-known psychological effects time-reversed (measuring the outcome before the manipulation)


Example (Experiment 9):

Participants were presented a list of words serially
They then typed all of the words they could remember
After typing the words, they practiced a randomly selected half of the words
Result: Participants recalled significantly more of the practiced words (vs. control words), t(49) = 2.96, p = .002, d = 0.42
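As a quick sanity check on those statistics: for a within-subjects comparison, Cohen's d can be recovered from the t statistic as d = t / sqrt(n). A minimal sketch in Python (n = 50 is an assumption implied by df = 49):

import math

# Reported statistics from Bem's Experiment 9 (within-subjects comparison)
t, df = 2.96, 49
n = df + 1                  # within-subjects t-test: df = n - 1, so n = 50 (assumed)
d = t / math.sqrt(n)        # Cohen's d for a within-subjects effect
print(round(d, 2))          # 0.42, matching the reported effect size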


Stephen Colbert's summary

Extrasensory Perception Studies

Bem (2011) was published in the Journal of Personality and Social Psychology (JPSP), a top-tier journal
The same journal subsequently rejected a manuscript that failed to replicate the finding (JPSP does not publish replications)
Ritchie, Wiseman, & French (2012) failed to replicate Bem's Experiment 9 across 3 pre-registered direct replications

What does a failure to replicate mean?


Failures to replicate are ambiguous! They could represent:

Differences in methods (measures, setting, sample, etc.)
Random variation across samples
Mistakes made during data collection

The failure to replicate could be a Type II error (or the original study could be a Type I error)
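To see how easily a failed replication can be a Type II error, here is a minimal power sketch (the effect size and sample size are illustrative assumptions, not from the readings; it assumes statsmodels is installed):

from statsmodels.stats.power import TTestIndPower

# Hypothetical scenario: the original study reports d = 0.42 and the
# replication runs a two-group design with 30 participants per group.
power = TTestIndPower().power(effect_size=0.42, nobs1=30, alpha=0.05)
print(f"power = {power:.2f}")  # ~0.36: even if the effect is real,
                               # this replication fails roughly 2 times in 3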

Extrasensory Perception Studies


Should Bem have been published? Editorial (Judd & Gawronski, 2011):

"We openly admit that the reported findings conflict with our own beliefs about causality and that we find them extremely puzzling. Yet, as editors we were guided by the conviction that this paper, as strange as the findings may be, should be evaluated just as any other manuscript on the basis of rigorous peer review. Our obligation as journal editors is not to endorse particular hypotheses but to advance and stimulate science through a rigorous review process." (abstract)

Is the Replicability Crisis Overblown?


(Pashler & Harris, 2012)

Choosing α = .05 does not mean the risk of a Type I error is only 5%. Two further questions matter (a worked example follows):
How many of the effects that we examine actually exist?
How much power do we have to detect those effects?
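A worked example of that point (the base rate and power values below are illustrative assumptions): the probability that a significant result reflects a real effect depends on how many tested effects are real and on power, not just on α.

# If 10% of the effects we test are real and our studies have 50% power:
alpha, power, prior = 0.05, 0.50, 0.10

true_positives = power * prior          # real effects that reach p < .05
false_positives = alpha * (1 - prior)   # null effects that reach p < .05
ppv = true_positives / (true_positives + false_positives)
print(f"P(real effect | p < .05) = {ppv:.2f}")  # ~0.53
# So roughly half of the "significant" findings would be false positives,
# even though alpha was set at .05.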

Direct replications are rare, and conceptual replications are problematic
Science is not always self-correcting

The replicability crisis

How many results in psychology would replicate?

We don't know! (File-drawer effect)

The Reproducibility Project (Open Science Collaboration, 2012):

Large-scale (>150 scientists) attempt to replicate studies
Currently replicating studies from 3 prominent psychology journals from 2008
What is the overall rate of replicability in psychology?
What predicts the replicability of studies?

Is this crisis unique to psychology?


Begley and Ellis (2012) attempted to replicate 53 papers in top cancer-research journals
They focused on new (unreplicated) results
47 (89%) of those studies failed to replicate

Questionable Research Practices

Homework Assignment

Pretend that you MUST obtain a statistically significant result in your group project at any cost. Try changing your analyses to find a statistically significant result. For instance, you could:

Exclude participants for any reason
Add control variables
Change scores that look unusual

The significant result does not need to be relevant to your hypotheses.
Could you write a paper that makes sense of this significant result?

Questionable Research Practices

Practices in the collection, analysis, and reporting of results that inflate the risk of making a Type I error
False positive (Type I error): incorrect rejection of a null hypothesis
More common (and perhaps less problematic) than outright fraud

False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis (Simmons, Nelson, & Simonsohn, 2011)

False positives (Type I errors) are problematic!

Persistent, because failures to replicate are not conclusive and are usually not published
Inspire future research that may waste resources

Using a conservative α (e.g., α = .05) does not solve those problems

Researcher degrees of freedom:

Decisions made during data collection, analysis, and reporting
Can yield significant results, but inflate Type I error rates

Researcher degrees of freedom inflate Type I error rates (Simmons et al., 2011)

Simulations using randomly generated data
Proportion of significant results (Type I errors)
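A minimal version of that kind of simulation (my own sketch, not Simmons et al.'s code): generate pure-noise data with two dependent variables, and count a study as "successful" if either DV comes out significant.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_per_group = 10_000, 20
hits = 0

for _ in range(n_sims):
    # Two groups, two DVs, no true effect anywhere
    a = rng.standard_normal((n_per_group, 2))
    b = rng.standard_normal((n_per_group, 2))
    p1 = stats.ttest_ind(a[:, 0], b[:, 0]).pvalue
    p2 = stats.ttest_ind(a[:, 1], b[:, 1]).pvalue
    if min(p1, p2) < 0.05:   # flexibility: report whichever DV "works"
        hits += 1

print(hits / n_sims)  # ~0.10, double the nominal 5% Type I error rate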

Checking data and adding subjects if p > .05 inflates Type I error rates (Simmons et al., 2011)
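A sketch of that optional-stopping scenario (again my own illustration; the batch size and cap are arbitrary assumptions): test after 20 participants per group, and if p > .05, add 10 more per group and test again, up to 50 per group.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, false_positives = 10_000, 0

for _ in range(n_sims):
    a = list(rng.standard_normal(20))   # null is true: both groups are pure noise
    b = list(rng.standard_normal(20))
    while True:
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_positives += 1        # stop as soon as the test "works"
            break
        if len(a) >= 50:
            break                       # give up at 50 per group
        a.extend(rng.standard_normal(10))  # otherwise add subjects and re-test
        b.extend(rng.standard_normal(10))

print(false_positives / n_sims)  # well above .05 (roughly double, with these settings)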

How common are questionable research practices? (John, Loewenstein, & Prelec, 2012)

Sent an anonymous survey to 5,964 academic psychologists at U.S. universities; 2,155 (36%) responded
Asked: Have you done this? Is it defensible? (0 = no, 1 = possibly, 2 = yes)

Item                                                                           Admission rate (%)   Mean defensibility
Falsifying data                                                                 0.6                  0.16
Wrongly claiming results are unaffected by demographic variables                3.0                  1.32
Reporting an unexpected finding as having been predicted from the start       27.0                  1.50
Deciding whether to exclude data after looking at the impact on results       38.2                  1.61
Stopping data collection earlier than planned because a result is significant 15.6                  1.76
Failing to report all conditions in a paper                                   27.7                  1.77

How common are questionable research practices? (John, Loewenstein, & Prelec, 2012)

This study had several methodological limitations:


Initial response rate of 36%
33% of participants dropped out of the survey before finishing
Some participants argued that the questions were worded in a biased manner (e.g., Norbert Schwarz, 2012, listserv posting)

Some QRPs may be justifiable in some contexts

Another approach to identifying false positives: p-values

Masicampo & Lalande (2012), "A peculiar prevalence of p values just below .05"

Examined p-values from three prominent journals: JEPG, JPSP, and PS
Collected 3,627 p values between .01 and .10 from 36 issues

With real effects, you expect relatively low p-values (an exponentially declining curve of p-values)
In reality, there were more p-values just below .05 than would be expected

Another approach to identifying false positives: p-values (Masicampo & Lalande, 2012)

(Figure: frequency histogram of p-values between .01 and .10, with a spike just below .05)

P-curve: A key to the file drawer


(Simonsohn, Nelson, & Simmons, 2013)

P-hacking: engaging in questionable research practices in order to push the p-value below .05
The shape of the distribution of p-values (the p-curve) can help to identify p-hacking
With a large effect size, p-hacking should not matter much
With a small effect or no effect, p-hacking will lead to more p-values just under .05
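A sketch of the p-curve logic (my own simulation, not Simonsohn et al.'s code; the effect sizes and the one-extra-look hacking rule are assumptions):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def significant_ps(effect, hack, n=20, sims=20_000):
    # Two-group t-tests; if hack=True, add 10 per group and re-test
    # whenever the first look is non-significant (one form of p-hacking).
    ps = []
    for _ in range(sims):
        a = rng.standard_normal(n) + effect
        b = rng.standard_normal(n)
        p = stats.ttest_ind(a, b).pvalue
        if hack and p >= 0.05:
            a = np.append(a, rng.standard_normal(10) + effect)
            b = np.append(b, rng.standard_normal(10))
            p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            ps.append(p)
    return np.array(ps)

for effect, hack in [(0.8, False), (0.0, True)]:
    ps = significant_ps(effect, hack)
    print(f"d={effect}, hack={hack}: "
          f"{np.mean(ps < 0.025):.0%} of significant p's fall below .025")
# Real effect: the p-curve is right-skewed (most significant p's are tiny).
# P-hacked null: p's pile up just under .05, flattening/left-skewing the curve.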

P-curves for different effect sizes with and without p-hacking (Simonsohn, Nelson, & Simmons, 2013)

How can we minimize false positives?


(Simmons et al., 2011)

Guidelines for authors:
1. Decide the rule for terminating data collection before data collection begins, and report that rule
2. Collect at least 20 observations per cell, or provide a compelling justification otherwise
3. List all variables collected in a study
4. Report all experimental conditions
5. If observations are eliminated, report results with and without those observations
6. If covariates are included, report results with and without the covariate
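As an illustration of guidelines 5 and 6, here is a minimal sketch of reporting an effect with and without a covariate (variable names are hypothetical; assumes pandas and statsmodels):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "condition": np.repeat([0, 1], 40),    # hypothetical 2-group experiment
    "age": rng.integers(18, 30, size=80),  # hypothetical covariate
})
df["score"] = 0.4 * df["condition"] + rng.standard_normal(80)

# Report both models rather than only whichever one reaches p < .05
without_cov = smf.ols("score ~ condition", data=df).fit()
with_cov = smf.ols("score ~ condition + age", data=df).fit()
print(without_cov.pvalues["condition"], with_cov.pvalues["condition"])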

How can we minimize false positives?


(Simmons et al., 2011)

Guidelines for journal reviewers:
1. Ask authors to follow the guidelines above
2. Be tolerant of imperfections in results
3. Ask authors to show that results are robust (vs. hinging on one very specific analysis)
4. In some cases, require an exact replication

Additional guidelines for increasing replicability (Asendorpf et al., 2013)


Increase sample size
Increase the reliability of measures (see the sketch below)
Choose study designs that minimize error variance:

Use clear and standardized instructions
Use controlled conditions
Design strong manipulations
Test and address assumptions
Control for covariates, when justified

Use appropriate statistical methods
Avoid multiple underpowered studies
Publish all relevant information (materials, sample-size justifications, etc.)
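On the reliability point above: unreliable measures shrink observed effects, which in turn lowers power. A small sketch using the classic attenuation formula, r_observed = r_true * sqrt(rel_x * rel_y) (the numbers are hypothetical):

# Spearman's attenuation formula: measurement error shrinks observed correlations
r_true = 0.30                 # hypothetical true correlation
rel_x, rel_y = 0.70, 0.70     # hypothetical reliabilities of the two measures
r_observed = r_true * (rel_x * rel_y) ** 0.5
print(round(r_observed, 2))   # 0.21: a weaker observed effect, hence lower power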

Any questions?
