Reproducibility

[Image: Reproducibility.jpg, caption: Reproducibility is paramount in numerically-intensive sciences]

(written by I. Vlad)

"At the end of the day, reproducibility is what separates science from superstition" (S. Fomel)

Making a numerical experiment reproducible means filing and documenting all the data, source code and scripts used to produce the experimental results (figures, tables, graphs, etc.). The filing and documentation are done in such a way that an independent reviewer can reproduce the same results. In numerically-intensive sciences it is possible not only to describe the experimental setup, but to actually reproduce the entire experiment. Paradoxically, the growth in computational complexity over the last decades has meant that "today, few published results are reproducible in any practical sense. To verify them requires almost as much effort as it took to create them originally." [3]
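In Madagascar, for example, an experiment is typically captured as an SConstruct script that records every command needed to rebuild its figures from the raw input. The sketch below is a minimal illustration in that spirit, assuming the standard rsf.proj commands (Flow, Result, End); the spike model and filter parameters are made up for this example and are not taken from any particular paper.

  # SConstruct: minimal sketch of a reproducible Madagascar experiment.
  # Flow, Result and End come from Madagascar's rsf.proj module;
  # the model and filter parameters below are illustrative only.
  from rsf.proj import *

  # Synthesize the input data: a single spike in a 1000-sample trace
  Flow('spike', None, 'spike n1=1000 k1=300')

  # Filter the spike and register the figure as a result of the paper;
  # running "scons" regenerates it from the recorded commands whenever
  # its inputs change, rather than archiving the image by hand
  Result('spike', 'bandpass fhi=2 phase=y | wiggle clip=0.95 title=Welcome')

  End()

Because every figure is declared this way, rebuilding the whole experiment is a single command, which is what makes independent verification practical.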

Why "do" reproducibility?

Knowledge management

  • The complexity of modern numerical experiments makes it necessary for the author himself to keep track meticulously of what he has done: "In the mid 1980's, we noticed that a few months after completing a project, the researchers at our laboratory were usually unable to reproduce their own computational work without considerable agony." [1]

"It takes some effort to organize your research to be reproducible. We found that although the effort seems to be directed to helping other people stand thisup on your shoulders, the principal beneficiary is generally the author herself. This is because time turns each one of us into another person, and by making effort to communicate with strangers, we help ourselves to communicate with our future selves." [2]

  • Collaboration with other researchers, bringing a new employee/student up to speed or continuing to build on the work of a former team member: "Research cooperation can happen effortlessly if you use a uniform system for filing your research." [2]
  • Speeding up the progress of science, and helping readers: "In a traditional article the author merely outlines the relevant computations: the limitations of a paper medium prohibit a complete documentation including experimental data, parameter values, and the author's programs. Consequently, the reader has painfully to re-implement the author's work before verifying and utilizing it. Even if the reader receives the author's source files (a feasible assumption considering the recent progress in electronic publishing), the results can be recomputed only if the various programs are invoked exactly as in the original publication. The reader must spend valuable time merely rediscovering minutiae, which the author was unable to communicate conveniently." [1]
  • A reproducible experiment is the ultimate level of documentation. Code that works shows what actually happened in the experiment, regardless of the quality of theoretical explanations elsewhere
  • A reproducible experiment is the first step towards an industrial implementation

Error catching

  • Makes total peer review possible: "given enough eyeballs, all bugs are shallow" (E. Raymond, The Cathedral and the Bazaar)
  • Discourages false claims and is an essential requirement for scientific integrity [4]
  • Helps avoid errors caused by subtle differences between the implementation of the current experiment and that of the reference experiment [4]
  • A reproducible experiment also serves as a regression test for the underlying software, helping avoid unintentional side effects of future modifications (a sketch of such a check follows this list).
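As an illustration of the regression-test role, the hypothetical script below re-runs an experiment and compares a checksum of its output against a stored baseline. The file names, the baseline file and the use of scons as the build driver are assumptions made for this sketch, not part of any existing tool.

  # regress.py: hypothetical sketch of using a reproducible experiment as a
  # regression test by comparing a freshly rebuilt result against a baseline.
  import hashlib
  import subprocess
  import sys

  RESULT = 'Fig/spike.vpl'       # output produced by the experiment (assumed name)
  BASELINE = 'baseline.sha256'   # digest recorded when the result was last accepted

  def sha256(path):
      # Return the SHA-256 digest of a file's contents
      with open(path, 'rb') as f:
          return hashlib.sha256(f.read()).hexdigest()

  # Rebuild the result from the recorded commands (assumed to be SCons-driven)
  subprocess.check_call(['scons', '-Q', RESULT])

  # Fail loudly if the rebuilt result no longer matches the archived digest
  expected = open(BASELINE).read().strip()
  if sha256(RESULT) != expected:
      sys.exit('Regression: %s no longer matches its archived baseline' % RESULT)
  print('OK: %s reproduces the archived result' % RESULT)

In practice a bitwise digest is usually too strict, since figures may embed timestamps or rendering details, so real systems tend to compare results with a tolerant diff; the byte-level comparison here only illustrates the idea.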

Large benefit for small marginal cost

"Our experience shows that it is only slightly more difficult to give birth to a "living" document than a "dead" one. The major hurdles in preparing a doctoral dissertation, research monograph or textbook are these: (1) mastering the subject matter itself, (2) writing the ancillary technical programs, (3) using a text editor and word processing system, and (4) writing the command scripts that run the programs to make the illustrations. The difference between preparing a "live" document and a "dead" one, lies in the command scripts. Will they be run once and then forgotten, or will they be attached to the figure-caption pushbutton?" [3]

The last resort when returns start to diminish

Scientific progress is not uniform (cf. http://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions). In the first stages after a new idea is found to be valuable, scientists rush to exploit the lowest-hanging fruit and stake out their territory as soon as possible. In this stage experimental methodology tends not to be followed pedantically, because second-order inaccuracies do not have a large influence on the results. However, when returns start to diminish, improvements can drown in these second-order errors, and experiments need to be carefully set up (and made truly reproducible) in order to allow accurate comparisons. More commentary on this topic, with application to the seismic processing industry, is provided in [5].

Pitfalls of reproducibility

A few problems arise whenever reproducibility is put in practice. Fortunately, Madagascar has the potential to avoid all of them:

The "out-of-date" pitfall

Let us presume a reproducible experiment was set up and verified, then archived to disk... is that all?

No. Software dependencies and platforms change over time. Each change is incremental, but overall they add up. An archived experiment will not run a few years later, when libraries, compilers and interpreters have all moved on. The only way to archive an experiment statically, "maintenance-free," is to preserve the actual physical machine it was run on. Nothing less will do: even archiving an image of the whole disk does not help, because old operating systems will lack drivers for new hardware. The solution is to re-run the experiment every time something changes in the system, from a bug fix in the source code of the experiment's tools to a new version of a library called by a dependency of a dependency. The experiments then function as regression tests, as sketched below.
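A minimal sketch of such continuous re-running might look like the driver below, which walks a tree of archived experiments and rebuilds each one on the current platform. The directory layout (one SConstruct per experiment under a book/ tree) and the use of scons are assumptions for the illustration rather than a description of any particular installation.

  # retest.py: hypothetical sketch of re-running every archived experiment
  # whenever something in the surrounding software stack has changed.
  import os
  import subprocess

  EXPERIMENTS_ROOT = 'book'   # assumed layout: one experiment per subdirectory

  failures = []
  for dirpath, dirnames, filenames in os.walk(EXPERIMENTS_ROOT):
      if 'SConstruct' not in filenames:
          continue
      # Clean and rebuild so that stale intermediate files cannot hide an
      # incompatibility with the current compilers, libraries or tools
      subprocess.call(['scons', '-Q', '-c', '-C', dirpath])
      if subprocess.call(['scons', '-Q', '-C', dirpath]) != 0:
          failures.append(dirpath)

  if failures:
      print('Experiments that no longer reproduce:')
      for path in failures:
          print('  ' + path)
  else:
      print('All experiments still reproduce on the current platform.')

Run nightly, or after every change to the toolchain, such a driver turns the archive of experiments into the regression suite described above.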

The "legacy maintenance" pitfall

Perpetual regression testing works well, but somebody thinking two steps ahead may wonder how it scales as the number of experiments grows. By necessity, the entire community will have to participate in testing the experiments. Since every participant in the community scales her own experiments to make maximal use of her own resources, nobody will have enough resources to run all experiments for everybody, or even significantly more than her own share!

True problems arise when people stop participating in the community, by graduating, changing employers, shirking their testing duties, being unable to perform them, etc., while the experiments they created remain in the system and need to be tested. To see the magnitude of the problem, try to imagine the Stanford Exploration Project continuously testing and debugging all numerical experiments done since the inception of the group in 1973. (Moore's Law would help, but unfortunately it is already slowing down dramatically.)

The first solution is finding "sterile" community members who are willing to run tests and perform maintenance but do not generate their own experiments. Possible candidates are corporate sponsors, foundations set up by professional associations, and libraries (yes, 21st-century archiving involves keeping documents "alive"!). Then there is the sad, last-resort solution of dropping support for the oldest, least-referenced experiments. This should be avoided as much as possible, because bringing an experiment up to date after it has been left to become obsolete can involve practically rewriting it, which is much more expensive than continuous maintenance. Those who have any doubts should download one of the SEP CDs from the early 1990s and try to reproduce the papers.

Economic obstacles and how to overcome them

Reproducibility has many obvious advantages, which raises the question: why are there so few practitioners of it today? The answer is that its benefits are enormous in the long term, but most material incentives favor the short term. Getting a junk-food meal in front of the TV or computer is cheaper, takes less time, and is more immediately rewarding than cooking a lean steak with salad at home and exercising to stay in shape. This is commonly framed as an issue of willpower, but sadly, the time or resources (money, education) needed to take the long-term-health option are often simply not available. The same applies to the lack of software testing in many instances of regular software development. Since a reproducible paper is essentially a regression test, since the end product of a numerically-intensive piece of research is often a software application, and since the issues related to software testing have been analyzed by many others, I will discuss the two together.

Many purchasers of software do not understand that a large software application is more complex in its behavior than, say, a nuclear power plant. Unlike a purchase of a common object or machine, it should come with automated tests covering all paths through the source code, and with extensive documentation. Past experience has conditioned purchasers to expect buggy software and poor documentation. Many decision-makers are rewarded for a purchase that visibly minimizes costs now, even if many bugs and a poor interface result in hard-to-quantify productivity losses over the long term. The software market is a lemon market, in which the buyer, lacking the ability to distinguish between high and low quality, pays an average price, so higher-quality products, which cost more to make, are pushed out of the market. Producers of high-quality goods have traditionally fought lemon markets by educating the consumer and by seeking independent certifications, such as ISO 9001 for software development.

Sponsors of research have more ability to distinguish between good and bad research, but the researcher is rewarded in proportion to the number of publications, which encourages hasty, non-reproducible work, even if putting the parameters and data into a reproducibility framework would have taken just a few more hours. There is no extra reward now for the author of a paper that is likely to still be reproducible ten years down the road, hence most people do not bother with it. Of course, they forfeit all the long-term advantages, including the distinction of calling their work science (if it is not reproducible, then it is not science, sorry). But the bottom line is that rewards are proportional to the number of papers. The ease of technology transfer when a reproducible paper already exists is a good argument for convincing industrial sponsors of research to request reproducibility as a deliverable. In the case of academic and government sponsors, the only way forward is to continuously advocate the long-term benefits of reproducibility until they change their internal evaluation policies and give appropriate weight to reproducible research.

Not only are the extrinsic motivations perversely arranged, but many scientists themselves, and even many programmers, are not properly educated when it comes to software engineering. Computer programming, like playing the stock market, is prone to an "it is easy to enter the game, hence I have a chance of winning" mentality. Any child can write a simple "Hello, world" program, and anybody with a credit card can open a brokerage account. However, it is a very long way from that point to writing a world-class piece of software, or to becoming and staying rich by investing. Most of the pilgrims on this road are self-educated, mostly following the example of those around them, and many have gaping holes in their knowledge that they are not even aware of. A regression test suite is part of the software engineer's toolbox, but how many researchers in numerically-intensive sciences are currently doing regression testing or reproducibility? How many even use version control with easy visual comparison tools for their code? How many have read at least one well-known software engineering book, and actively work to improve their skills in this field by following online discussions on the topic? The way to promote reproducible research among practitioners is, again, permanent advocacy of its benefits, as well as making a foundation of software engineering knowledge a prerequisite for graduate-level training in numerically-intensive sciences.

A fundamental reason for the frontier mentality that still pervades the software world is that for decades software had to keep pace with the hardware developments described by Moore's Law, which resulted in a high pace of change in what computer programs did, as well as in the APIs of their dependencies. By the time a full-fledged test suite could be written, the design and function of the software might have had to change in order to expand to what new hardware allowed it to do. However, the tide has turned. The explosive growth in processor clock speeds has already stopped, growth in hard drive capacity has started to slow down, and the main outlet for growth now is the increasing number of CPUs. Should this slowdown persist, there will be more time available for testing and documenting software. This will have consequences in numerically-intensive sciences as well: increases in raw CPU power used to offer an easy avenue of progress, namely applying algorithms that until then had been too expensive to try in practice. Should this go away, practitioners will have to research second-order effects and focus more on algorithm speed. Working with smaller improvements means a lower signal-to-noise ratio. This in turn calls for scientific rigor through reproducible experiments, as well as for commoditization of the code that does not offer a competitive advantage, in order to facilitate comparison between experiments.

A final factor stopping reproducibility from thriving is commercial restrictions. Commercial entities sponsor research because they want to derive a competitive advantage from it. They should be encouraged to share the part of the work that does not constitute a competitive advantage, in effect commoditizing the platform so that they can focus their efforts on the part of the software/research stack that adds the most value. The personnel problems that the oil and gas industry will soon start facing with the retirement of the baby-boomer generation may be instrumental in convincing large consumers of software (oil companies) that manpower in the industry as a whole is too scarce to dedicate to maintaining a large number of competing platforms. The example of other industries (software, banking) can be used to show that cooperation on a small number of common platforms, so that everybody can focus on the value-added parts, is a desirable Nash equilibrium. Several oil companies have already open-sourced their platforms (examples). Even in such cases, companies will keep "the good bits" to themselves, and this is understandable. However, should reproducibility become mainstream, they would be compelled to share more in order to lead the change rather than shield themselves from it and be left behind by the general advance.

About reproducibility

In chronological order:

  • [2] Making Research Reproducible (http://sepwww.stanford.edu/research/redoc/IRIS.html), an introduction to the first SEP reproducibility system distributed through the WWW, 1996
  • Reproducibility in computer-intensive sciences (http://www.ad-astra.ro/journal/2/vlad_reproducibility.pdf), Ad Astra short note, 2000
  • [1] "Making scientific computations reproducible" (http://sep.stanford.edu/lib/exe/fetch.php?id=sep%3Aresearch%3Areproducible&cache=cache&media=sep:research:reproducible:cip.pdf), Schwab, M., Karrenbach, N. and Claerbout, J., Computing in Science & Engineering, Vol. 2, Issue 6, Nov.-Dec. 2000, p. 61-67
  • Complete PostScript: An Archival and Exchange Format for the Sciences? (http://www.agu.org/eos_elec/000381e.html), P. Wessel, U. of Hawaii at Manoa, Honolulu, EOS 84(36), 2003
  • Rethinking Scholarly Communication (http://www.dlib.org/dlib/september04/vandesompel/09vandesompel.html), D-Lib Magazine, Sept. 2004

Reproducible science in practice

  • Reproducible Neurophysiological Data Analysis (http://lcavwww.epfl.ch/reproducible_research/): a setup from U. Paris 5 for reproducible research in neurophysiology with Sweave. Sweave itself is advertised as a package for reproducible research (http://www.stat.umn.edu/~charlie/Sweave/)
  • Open Notebook Science (http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html)
  • reproducibleresearch.net (http://www.reproducibleresearch.net/)

The links above point to explicit attempts to bundle together whole scientific experiments, as described in scientific papers. It is worth noting that:

  • It was not uncommon for scientific articles and books written during the late 1970s and early 1980s to contain printouts of the computer programs used. This alone helped clarify the algorithm and remove any doubts about the implementation. If the input parameters were mentioned in the text and the data was created synthetically, then the article can be considered reproducible for all practical purposes. The brevity of the computer programs of those times helped make them publishable in print and easy to understand, so in practice those results may be even more reproducible than a live working paper based on thousands of lines of poorly written code. Volumes of Geophysics with sections dedicated to computer programs include 45/1980 (nr. 3, 7, 11), 44/1979 (nr. 12) and others [Note to self: complete the list. The earliest example probably gains the title of earliest reproducible experiment]
  • Outside the scientific world there are many instances of using software to automatically re-create a document with figures whenever the underlying data changes. Internet searches on "automatic reporting" or "automatic report generation" yield useful information on such utilities used in finance, meteorology, computer system administration, medicine, government administration, etc.
  • Some proprietary software packages have offered reproducibility capabilities:
    • Matlab notebooks
    • The whole Mathcad environment
    • Mathematica may have "live document" capabilities