
A self-referential discussion of scientific reproducibility

Mark Galassi¹ and Laura Fortunato²

¹ Los Alamos National Laboratory, mark@galassi.org
² Santa Fe Institute and University of Oxford, fortunato@santafe.edu

July 14, 2013

Abstract

We discuss some aspects of reproducibility in scientific research. We propose an accessible approach to rigorous software pipelines, and use such an approach to generate this very paper. We use two toy examples which illustrate the two most common types of computationally-heavy research activity: simulation and data analysis. We provide complete end-to-end software pipelines for these two problems. We have chosen simple toy models so that we can clearly demonstrate the fundamental issue of provenance: how can we be sure what software steps were used to go from which data and which assumptions to the final paper? There is nothing new in the data analysis and model: this paper's goal is to give you a skeleton of reproducible software pipelines which you can adapt to your own software projects.

1 Motivation and plan

We now dive into the motivation, but we first address the issue of demotivation. Not all scientists are top hackers, and it can be demotivating to feel you have to implement a rigorous software pipeline. We hope that our demonstration will convince you that a) even just a few of the steps, such as making data and source code available, already get you a long way there; and b) introducing several aspects of a rigorous pipeline approach might be quite enjoyable and not technically arduous. But first let us scare you into feeling motivated:

1.1 Negative motivation: sloppy research and fraud

You have probably been in the situation of asking someone how they generated a plot in a paper or in a presentation, and received an answer along the lines of "Well, I imported the data into Excel¹ and played with it until it looked good." If you experienced discomfort at that answer then you might consider reading this paper. Or you might have been in a project meeting in which a collaborator showed a neat result, but when pressed could not say exactly what settings of apparatus or software had generated those results. If you were tempted to bang your fist on the table, ranting about sloppy work in the meeting and protesting "I don't want to see that happen ever again!", then you should definitely read this paper and consider joining us in future revisions of it.

These two are simple examples of how, in the day-to-day rush of doing research, we might get sloppy and lose track of the precise lineage of our results. Some might excuse this sloppiness as being part of the early exploration phase of a project, arguing that final scientific work goes through a sort of polishing and a process of peer review which makes it robust. This dismissal of concern, a sort of "what happens in Vegas stays in Vegas" attitude, might seem reasonable, but strong arguments have been made that sloppy early work often stays sloppy in final results, and that the peer review mechanism does not flush it out.

A less frequent problem is that of scientific fraud: a community of scientists accustomed to not tracking data from its origin and through its processing steps will accept cases of deliberate falsification of data and calculation. An infamous example is the case of Jan Hendrik Schön, the condensed matter physicist who built a career on a series of papers based on falsified results [6]. This is discussed in more detail below.

The existence of intriguing anecdotal examples of sloppy and fraudulent publication does not mean that the phenomenon is widespread. The only work we know of that tries to quantify the impact of this lack of rigor is in the medical field, by John Ioannidis. In 2005 Ioannidis published two forceful papers with the provocative titles "Why Most Published Research Findings Are False" [14] and "Contradicted and Initially Stronger Effects in Highly Cited Clinical Research" [13]. In the former he studies the larger picture of false results; in the latter he analyzes 49 highly cited papers from clinical journals and finds that a third of them have replication problems. These suspect papers are at the basis of several widely used medical diagnoses and treatment protocols.

Although we do not know of analogous systematic studies in other scientific areas, other fields have analogous incentives for scientists to publish recklessly and a similarly lethargic peer review process. In 2005 Jeremy Stribling made a big impression by submitting an obviously incoherent paper to a conference and having it accepted; the paper had been generated by his program SCIgen [15, 26, 2]. Apart from the amusement it has provided, SCIgen is allowing researchers to study the extent and effects of bogus article submission and careless peer review. For example Labbé and Labbé [16] have studied whether indexing services are robust to SCIgen submissions, noting that many articles generated by SCIgen, or slightly modified from SCIgen-generated text, are accepted by indexing engines, with open-access and proprietary engines accepting bogus articles at about the same rate.

Another initiative that has tried to bring light to the problem of sloppy science and fraud is the Retraction Watch blog [20], started by Ivan Oransky and Adam Marcus. Retraction Watch reports on retractions and other cases of unvalidated scientific publications, trying to illuminate the cause for each retraction. This effort naturally leads to analysis of individual cases, but Oransky and Marcus also report on articles that try to characterize the problem quantitatively, for example by reporting on Mobley's finding that half of all investigators have had trouble validating at least one of their previously reported results [19, 18]. It is interesting to note that Oransky and Marcus, like Ioannidis, come from the medical research community. Social scientists also contribute to the debate: Hackett [11] and others have written about the role and incentives of journals and peer review, sparking a debate about whether referees can stop the flow of sloppy publications. Given our angle in this paper (provenance and reproducibility), we propose a possible point of view in this debate in section 8.1.

We conclude this subsection by mentioning two more famous examples, one of scientific fraud and one of sloppy work, both of which had great negative impact in their areas and far beyond. The first is that of Jan Hendrik Schön, a researcher at Bell Labs whose results in solid state physics were much too good to be true. After he published a series of claims to strong inventions in prestigious journals, some researchers noticed that he was reusing the same data for different situations, while others tried to reproduce his results and could not. Eventually Bell Labs conducted an inquiry, his work was discredited, and many of his articles were retracted [22].

Another example comes from economics. Harvard economists Reinhart and Rogoff [23] wrote a paper titled "Growth in a Time of Debt". Even a glance at the paper shows that the authors did not provide a software pipeline which can reproduce their results. University of Massachusetts Amherst graduate student Thomas Herndon and two collaborators analyzed the article and noticed that the authors had used a proprietary spreadsheet program for their data analysis and had forgotten to select some parts of the data [12]. This falsely inflated the magnitude of the effect reported by Reinhart and Rogoff [23]. This is unfortunate, since the results of Reinhart and Rogoff [23] have had significant impact on government decision-making in the US and Europe.

¹ Excel is a proprietary spreadsheet program widely used at the time of writing.

1.2 Positive examples

Many researchers have already quietly been following good practices for scientific reproducibility. We report just a couple of examples which have come to our attention.

Andy Fraser (Los Alamos National Laboratory) is the author of a book on Hidden Markov Models [7]. He commendably makes the full source code for the book available [8] in the fullest possible manner: not only does he provide the text and figures, but he also provides the source code to all the programs he used to generate the figures. You can download his distribution and use a script to build the full book. If you modify any text or source code, the build system will re-generate the book to exactly match the current state of the software. Andy Fraser's approach to automated turnkey building of something as big as a full book is what inspired us to come up with our self-referential step-by-step approach.

Titus Brown (Michigan State University) is a biologist and computer scientist who maintains an active blog [5] which often discusses issues of open science. He uses strong words to press any scientist to change habits and move toward best practices [1]: use version control, publish source code and data, have test suites for scientific software, put your paper on an open-access web site such as the arXiv, and many more. In one blog entry on automated testing he uses the colorful expression "[...] if you aren't using version control for your code, stop pretending to be a scientist [...]" [4].

1.3 Great respect for provenance

We might vaguely remember hearing the word provenance used to discuss how art experts track the successive ownership of a work of art. We might read of van Meegeren [29], whose forgery of Vermeer's Christ with the Adulteress was sold to Hermann Göring during World War II for what would be $7 million today. Today the use of the word provenance (or lineage) is growing in scientific discussions. Newly designed large scientific programs enable the tracing of data through all its processing steps, thus allowing reproducibility. An example of this is SciDB, a replacement for the traditional SQL databases that have their origin in financial transaction processing.

For example, one of the most important requirements in the design of SciDB [24, 25] (the modern replacement for traditional SQL databases) is data provenance. Stonebraker and his collaborators note:

    Most scientists are adamant about not discarding any data. If a data item is shown to be wrong, they want to add the replacement value and the time of the replacement, retaining the old value for provenance (lineage) purposes. As such, they require a no-overwrite storage manager, in contrast to most commercial systems today, which overwrite the old value with a new one. [24]

We encourage you to consider the provenance of your instrument specs, data, information, parameters (and all other assumptions) to be almost sacred.

1.4 What to do? (i.e. Is this paper useful?)

[... The problem is usually that people don't have a simple blueprint to start their project, a sort of "wizard" to use GUI terminology; that is what we provide here.]

1.5 The plan

We will start with a few assorted sections discussing how you might think about certain aspects of your work's reproducibility. We then describe how you can reproducibly build this entire paper (with its two toy models and their data analysis and plots) using our publicly available source code repository. We then introduce our two toy examples so that you see what the result of this pipeline can be. Our examples are: 1) simulation of predator-prey interactions based on the (rather simplistic) Lotka-Volterra equations, and 2) analysis of data from measurements (a Gamma Ray Burst from NASA's Swift satellite). We then return to how we went from our public data and source to this paper, with an eye toward documentation: we describe the documentation we have written (beyond this paper). We hope that we have documented all the processes we used to conceive the project, build it up from scratch, and then have end-users build it. We welcome feedback on how effective this documentation is. Finally we ramble about the kind of awareness you should have in your design, and the kinds of debates that all scientists should have about whether you can believe results.

2 From raw data to this paper, part I

[21]

2.1 It's OK to not be perfect, but put out all you have

Not all projects have the resources to organize a polished pipeline and to write programs that use state-of-the-art data formats. But with a very small effort you can put out all the information needed to reproduce the work. For example, in our analysis we use "poor person's metadata" (cite Mark Galassi, unpublished conversations) to show how we obtained the data:

[FIXME: added the git version and run date to the metadata]
[FIXME: this verbatim should use TeX's \input to include the metadata generated by the program; we are violating our principles by pasting it in here. Need to fix this soon.]

## COMMENT: GRB light curves retrieved from NASA
## CREATED_BY_PROGRAM: ../analysis/grb_retrieve.py
## GAMMA_RAY_BURST_ID: 080503
## SWIFT_BAT_TRIGGER_ID: 310785
## URL_FOR_FITS_LC: http://gcn.gsfc.nasa.gov/notices_s/sw00310785000msb.fits
## RETRIEVE_START_DATETIME: 2013-07-09T10:52:27.335601
## COLUMN: time [seconds since 2000 -- ?? is that really it?]
## COLUMN: raw_counts
## COLUMN: total_counts
## ORIGINAL_FITS_FILE_SHA1: 4a789ed62b7c20e7d1a25bdf799cb096aa7880ad
## FITS_FILE_SPEC: http://gcn.gsfc.nasa.gov/swift_gnd_ana.html

As you see, anyone can use our software and this information to make the same run and obtain the same data.
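Generating such a header takes only a few lines of code. Here is a minimal sketch (the helper names are ours for illustration, not necessarily those used in grb_retrieve.py) of how a retrieval program might write it, including the git version and run date mentioned in the first FIXME above:

import datetime
import hashlib
import subprocess

def git_version():
    # Record exactly which commit of the pipeline produced this file.
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True).strip()

def write_poor_persons_metadata(out, fits_path, url):
    # Write "## KEY: value" comment lines at the top of a data file,
    # so the file itself says where it came from and how it was made.
    with open(fits_path, "rb") as f:
        sha1 = hashlib.sha1(f.read()).hexdigest()
    out.write("## CREATED_BY_PROGRAM: ../analysis/grb_retrieve.py\n")
    out.write("## CREATED_BY_GIT_VERSION: %s\n" % git_version())
    out.write("## URL_FOR_FITS_LC: %s\n" % url)
    out.write("## RETRIEVE_START_DATETIME: %s\n"
              % datetime.datetime.utcnow().isoformat())
    out.write("## ORIGINAL_FITS_FILE_SHA1: %s\n" % sha1)

A key like CREATED_BY_GIT_VERSION pins the output not just to a program name but to an exact commit of the pipeline.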

3 Putting it all out

We have created a version control repository which has the source code for this project: everything you need to completely reproduce the results we present here. You can check out the repository from the project hosting site gitorious (cite gitorious URL and gitorious-book by Marius Mrnes Mathiesen, marius@gitorious.com, https://git.gitorious.org/gitorious/gitorious-book.git). To check out (or clone, as one says in version control language) the repository you will need the version control program git, which is easily installed on all computers. The instructions to do so are at https://gitorious.org/end-to-end-sci-sw-tutorial

You should examine this repository. You will find a meta-document called project-admin.org in which we annotated every step as we built the project. We encourage you to use this as a template for creating your own software project. If you use git to examine the project history (for example with git log --reverse) you will see that we annotated all our steps as we carried them out, not later.

You will also find a README file which briefly describes how you can build this paper; it will direct you to go to the directory paper and type make:

cd paper
make

The make command reads a Makefile to see which files need to be built and what they depend on. It will:

1. follow a deterministic path to download the Gamma Ray Burst data used for our analysis example;
2. run the analysis program and create the plots needed for this paper;
3. run the separate simulation program and create plots of solutions to the Lotka-Volterra equations;
4. run pdflatex on the paper to typeset it into a PDF file.

You can view the resulting file toy-reproducibility.pdf using your favorite PDF viewer. You will notice that there was no user intervention involved: the paper was generated automatically, nobody pasted columns into a spreadsheet GUI, etc. The Makefile will know (and let you know) when the PDF file is up to date.
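To make step 1 concrete, here is a minimal sketch of a deterministic download step (the function and output file names are ours for illustration; the URL and SHA1 checksum are the ones recorded in the metadata of section 2.1). It refuses to proceed if the retrieved bytes do not match the recorded checksum:

import hashlib
import urllib.request

URL = "http://gcn.gsfc.nasa.gov/notices_s/sw00310785000msb.fits"
EXPECTED_SHA1 = "4a789ed62b7c20e7d1a25bdf799cb096aa7880ad"

def fetch_deterministic(url, expected_sha1, dest):
    # Download url to dest, verifying the checksum so that every
    # build of the paper starts from bit-identical input data.
    data = urllib.request.urlopen(url).read()
    sha1 = hashlib.sha1(data).hexdigest()
    if sha1 != expected_sha1:
        raise RuntimeError("checksum mismatch: got %s" % sha1)
    with open(dest, "wb") as f:
        f.write(data)

fetch_deterministic(URL, EXPECTED_SHA1, "grb_080503_lc.fits")

Verifying the hash of the raw input is what makes this step deterministic: if the server ever serves different bytes, the build fails loudly instead of silently producing a different paper.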

4 First project: Lotka-Volterra model for predator-prey


[28] [just the plot at this time]
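While that text is to be written, the model itself is standard [28]: a prey population x (the baboons of Figure 1) and a predator population y (the cheetahs) evolve according to

\frac{dx}{dt} = \alpha x - \beta x y, \qquad \frac{dy}{dt} = \delta x y - \gamma y

Figure 1 was produced with the really trivial Euler method; the following minimal sketch shows an integrator of that kind (the parameter values and initial populations here are illustrative, not necessarily those used for the figure):

# Euler integration of the Lotka-Volterra equations.
alpha, beta, delta, gamma = 1.1, 0.4, 0.1, 0.4   # illustrative values
x, y = 10.0, 10.0        # initial baboon and cheetah populations
dt, t_end = 0.001, 9.0   # step size and end time

t = 0.0
history = [(t, x, y)]
while t < t_end:
    dx = (alpha * x - beta * x * y) * dt    # prey: growth minus predation
    dy = (delta * x * y - gamma * y) * dt   # predators: growth minus death
    x, y = x + dx, y + dy
    t += dt
    history.append((t, x, y))

# Print every thousandth sample of the two populations.
for ti, xi, yi in history[::1000]:
    print("%5.2f %8.3f %8.3f" % (ti, xi, yi))

Explicit Euler is a poor integrator for this system over long times, which is why we call it "really trivial"; the point of the toy is the pipeline, not the numerics.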


[Figure 1 here: plot of baboon and cheetah populations versus time]

Figure 1: Numerical solutions to the Lotka-Volterra equations, approximated with the really trivial Euler method.

5 Second project: analyzing GRBs

[3] [9] [text to be written; for now we just have the plot]

[Figure 2 here: grb_analysis.pdf]

Figure 2: Top panel: Gamma Ray Burst light curve, count rate versus time. Bottom panel: FFT of count rate versus frequency. [17]
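Although the text for this section is still to be written, the computation behind Figure 2 is conceptually simple. The following minimal sketch assumes the retrieval step has saved the light curve as a columned text file (the file name is ours for illustration) with the "##" metadata header and the three columns listed in section 2.1:

import numpy as np
import matplotlib.pyplot as plt

# Load the time and total_counts columns; np.loadtxt skips the
# "##" metadata lines because they start with the comment character.
t, counts = np.loadtxt("grb_080503_lc.dat", usecols=(0, 2), unpack=True)

# FFT of the count rate; the sampling interval sets the frequency axis
# (we assume uniform sampling here).
dt = t[1] - t[0]
spectrum = np.abs(np.fft.rfft(counts))
freqs = np.fft.rfftfreq(len(counts), d=dt)

fig, (ax1, ax2) = plt.subplots(2, 1)
ax1.plot(t, counts)
ax1.set_xlabel("time (s)")
ax1.set_ylabel("count rate")
ax2.plot(freqs, spectrum)
ax2.set_xlabel("frequency (Hz)")
ax2.set_ylabel("|FFT|")
fig.savefig("grb_analysis.pdf")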

6 From raw data to this paper, part II

[what types of documentation do we keep as we work on our project]

- The ChangeLog records all edits by contributors to the project, with information on the files the edits apply to and a brief description. [ADDTOME: how to do this in your text editor, e.g. emacs?]
- We use a bib file to keep track of any resources (papers, websites, blog entries, etc.) related to the project, including those that we have not yet read.
- README files in the sub-directory for each sub-project, with details on how to run the programs therein, etc.

7 An awareness to go along with this prescription

We have given a prescription for how to automate the pipeline for your project: sections 3 and 6 give you a procedure you could apply when you start a project. The procedure we propose might last for the whole lifetime of your project, even if it becomes big, or it might have to grow and evolve with your project (in which case we hope you will share your insights with us).

But you also need awareness of related issues, otherwise... [ADDTOME: tools (make and friends, ...)] Document everything. Visualize and document the flow. Look at other people's projects critically. Look at your own project critically, imagining how you would look at someone else's.

8 The debate that should ensue

[this should pick up on the discussions in the file project-admin.org]
[discussion of apparatus, blueprints, PVO, Cold Fusion]

8.1 Peer review proposal

The process of peer review should have saved us from all this. Our proposal, stated baldly: the role of the reviewer should be to ensure that the authors have given correct provenance for their results and their writing; nothing else.

We recognize that this is a single-focus proposal, artificially stated as an extreme, un-nuanced approach. Still, this extreme formulation offers a few advantages:

- Papers would only be accepted if other researchers have a chance at showing that a result is incorrect.
- Referees are not expected to judge relevance, thus not throttling the introduction of ideas.
- This might continue along a path toward full automation of the review process (imagine a meeting between SciGen and RefereeGen). This process began when Paul Ginsparg created the Los Alamos preprint server [10], which was a precursor to the arXiv. In his "Background" section Ginsparg points out that the physics community had long recognized "the irrelevance of refereed journals to ongoing research". Another prophetic feature of the arXiv was that it would build a paper from the submitted TeX files. We propose that reviewers simply take that to the next step of building the entire paper from a basic reproducible set of software steps.
- The "peer" in peer review would eventually disappear, and the notion of peer review might be replaced with one of consensus on reproducibility, in some ways analogous to what is accepted in the world of pure mathematics.²

² Thurston puts it very nicely in his article "On proof and progress in mathematics" [27]: ". . . But in any field, there is a strong social standard of validity and truth. . . ." In a way our goal here is to give a friendly nudge to the social standard for the natural and social sciences. The elbow in our nudge is the blueprint we have given you to make your own research reproducible from its inception.

References

[1] D. A. Aruliah et al. "Best Practices for Scientific Computing". In: CoRR abs/1210.0530 (2012).
[2] Philip Ball. "Computer conference welcomes gobbledegook paper". In: Nature 434.7036 (2005), p. 946.
[3] Joshua S. Bloom. What Are Gamma-Ray Bursts? Princeton University Press, 2011.
[4] C. Titus Brown. "Automated testing and research software". 2012. URL: http://ivory.idyll.org/blog/automated-testing-and-research-software.html.
[5] C. Titus Brown. "Living in an Ivory Basement" blog: stochastic thoughts on science, testing and programming. 2012. URL: http://ivory.idyll.org/blog/.
[6] Leon Cassuto. "Big Trouble in the World of Big Physics". 2002.
[7] Andrew M. Fraser. Hidden Markov Models and Dynamical Systems. SIAM, 2008.
[8] Andrew M. Fraser. Hidden Markov Models and Dynamical Systems, book source. 2008. URL: http://www.fraserphysics.com/~andy/hmmdsbook/.
[9] Neil Gehrels et al. "The Swift gamma-ray burst mission". In: The Astrophysical Journal 611.2 (2004), p. 1005.
[10] Paul Ginsparg. First Steps Toward Electronic Research Communication. MIT Press, Cambridge, MA, 1997. URL: http://www.fas.org/sgp/othergov/doe/lanl/pubs/00285556.pdf.
[11] Edward J. Hackett. "Essential Tensions: Identity, Control, and Risk in Research". In: Social Studies of Science 35.5 (2005), pp. 787–826.
[12] Thomas Herndon, Michael Ash, and Robert Pollin. "Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff". Tech. rep. Political Economy Research Institute, 2013.
[13] John P. A. Ioannidis. "Contradicted and initially stronger effects in highly cited clinical research". In: JAMA 294.2 (2005), pp. 218–228.
[14] John P. A. Ioannidis. "Why most published research findings are false". In: PLoS Medicine 2.8 (2005), e124.
[15] Cyril Labbé. "Ike Antkare, one of the great stars in the scientific firmament". In: International Society for Scientometrics and Informetrics Newsletter 6.2 (2010), pp. 48–52.
[16] Cyril Labbé and Dominique Labbé. "Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?" In: Scientometrics 94.1 (2013), pp. 379–396.
[17] J. Mao et al. "GRB 080503: Swift detection of a short burst". In: GCN 7665 (2008), pp. 1+.
[18] Adam Marcus. "Half of researchers have reported trouble reproducing published findings: MD Anderson survey". May 2013. URL: http://retractionwatch.wordpress.com/2013/05/16/half-of-researchers-have-reported-trouble-reproducing-published-findings-md-anderson-survey/.
[19] Aaron Mobley et al. "A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic". In: PLoS ONE 8.5 (2013), e63221. DOI: 10.1371/journal.pone.0063221.
[20] Ivan Oransky and Adam Marcus. Retraction Watch. 2010. URL: http://retractionwatch.wordpress.com/.
[21] Heather A. Piwowar, Roger S. Day, and Douglas B. Fridsma. "Sharing Detailed Research Data Is Associated with Increased Citation Rate". In: PLoS ONE 2.3 (Mar. 2007), e308. DOI: 10.1371/journal.pone.0000308. URL: http://dx.plos.org/10.1371/journal.pone.0000308.
[22] Eugenie Samuel Reich. Plastic Fantastic: How the Biggest Fraud in Physics Shook the Scientific World. Macmillan, 2009.
[23] Carmen M. Reinhart and Kenneth S. Rogoff. "Growth in a Time of Debt". Tech. rep. National Bureau of Economic Research, 2010.
[24] Michael Stonebraker et al. "Requirements for Science Data Bases and SciDB". In: CIDR. Vol. 7. 2009, pp. 173–184.
[25] Michael Stonebraker et al. "The Architecture of SciDB". In: Scientific and Statistical Database Management. Springer, 2011, pp. 1–16.
[26] Jeremy Stribling, Max Krohn, and Dan Aguayo. "SCIgen: an automatic CS paper generator". 2005.
[27] William P. Thurston. "On proof and progress in mathematics". In: arXiv preprint math/9404236 (1994).
[28] Wikipedia. "Lotka–Volterra equation". URL: http://en.wikipedia.org/wiki/Lotka%E2%80%93Volterra_equation.
[29] Frank Wynne. I Was Vermeer: The Rise and Fall of the Twentieth Century's Greatest Forger. Bloomsbury Publishing USA, 2011.
