Tuesday, August 31. 2010
Reproducible Research, a manifesto-like paper by a number of authors from different scientific disciplines, has been published in Computing in Science and Engineering.
Progress in computational science is often hampered by researchers' inability to independently reproduce or verify published results. Attendees at a roundtable at Yale Law School formulated a set of steps that scientists, funding agencies, and journals might take to improve the situation. We describe those steps here, along with a proposal for best practices using currently available options and some long-term goals for the development of new tools and standards.
Thursday, August 26. 2010
Tuesday, August 17. 2010
Seismic Unix (SU) is a famous open-source seismic processing package maintained by John Stockwell at the Center for Wave Phenomena, Colorado School of Mines. SU has been around for 25 years and has attracted many devoted users. If you are one of them, please consider the following:
- Using Seismic Unix is not an excuse for non-reproducible computational experiments. To facilitate reproducibility, you can use Python and SCons with the rsf.suproj module supplied by Madagascar. The book/rsf/su directory contains many examples of seismic data processing flows using SU and their loose translation to Madagascar analogs. Here is an example SConstruct script from rsf/su/sulab1:

from rsf.suproj import *
Flow('plane',None,'suplane')
Result('plane','suxwigb label1="Time (s)" label2=Trace')
Flow('specfx','plane','suspecfx')
Result('specfx','suxwigb label1="Frequency (Hz)" label2=Trace')
End()
Its loose Madagascar translation is in rsf/su/rsflab1:
from rsf.proj import *
Flow('plane',None,
'''
spike n1=64 n2=32 d2=1 o2=1 label2=Trace unit2=
nsp=3 k1=8,20,32 k2=4 l2=28 p2=2,1,0
''')
Flow('specfx','plane','spectra | scale axis=2')
for plot in ('plane','specfx'):
    Plot(plot,
         '''
         wiggle clip=1 transp=y yreverse=y poly=y
         wanttitle=n wheretitle=b wherexlabel=t
         ''')
Result('specfx','plane specfx','SideBySideAniso')
End()
If you want only rsf.suproj but not the rest of Madagascar, download the madagascar-framework package from SourceForge.
- It is also possible to convert between SU and RSF file formats with sfsuread and sfsuwrite and to combine SU and Madagascar programs in common processing flows.
- If you decide to switch to Madagascar but are missing certain functionality from Seismic Unix, it is possible and legal to borrow code from SU and to add it to Madagascar. The opposite is not true, because the Madagascar license (GPL) is more restrictive than the SU license (BSD-like). The su directory contains some of the codes Madagascar has borrowed from SU at the request of users. Naturally, we try to limit such borrowing to avoid unnecessary forks.
In a recent message to the Seismic Unix mailing list, John Stockwell described a proposal for S3, the third generation SU. The main requirements for S3 are
- a new project managed on SourceForge
- GPL license
- flexible trace headers
- integration with scientific libraries
- integration with the current SU as well as other GPL- or BSD-licensed packages
One cannot help thinking that the project that John describes is Madagascar!
Wednesday, August 4. 2010
Eureka Daily, a science blog at The Times newspaper, published a two-part article by Hannah Devlin about freedom of information in science ("FOI: should scientists be exempt?" and "Freedom of information and climate science" - both require a subscription). Part two discusses issues of openness in the context of a recent investigation of research practices of CRU - a climate research group at The University of East Anglia.
Here is an interesting excerpt from the second part:
As Myles Allen, a climate scientist at the University of Oxford points out, in most cases that which is in the public interest will be good for science too. Validation and replication are central to the scientific method. However, points of contention remain about the optimum degree of information sharing. Allen, for instance, suggests that while open access to data is generally desirable, making the computer code used to analyse data available online could have unintended negative consequences. If everyone's using the same code, who's going to challenge whether it's working correctly?
This view is countered by programmer John Graham-Cumming, who found coding errors after trying to reproduce the CRU/Met Office's CRUTEM and HadCRUT global warming datasets. Working from the raw data released by the Met Office and the description of their process for generating the datasets in a scientific paper he decided to validate their work - a considerable effort that required writing code to implement the algorithm described in the paper. In doing so, he found a problem with the way the error ranges were calculated (amongst other errors), stemming from a bug in their code.
He says: "You could say that by not releasing their buggy code they forced me to find the bug in it by writing my own validation. But actually, if they'd released their code I would have been able to quickly compare the code and the paper and find the bug without the massive effort to write new code. And no one else had actually done this validation (including the Muir Russell review) and as a result the Met Office has been releasing incorrect data for a long time. Perhaps that's because the validation was so hard in the first place, whereas having code to check would have been easy."
John Graham-Cumming demonstrated why reproducibility is crucial for computational sciences: it exposes scientific algorithms and workflows to a greater audience, thus preventing critical bugs from going unnoticed.
Reproducibility is an approach to openness in computational sciences. It assumes that not only data but also source code (and everything else needed to reproduce published results) should be released. At the end of the day, it might save one's scientific credibility from a rather unpleasant public exposé.
Tuesday, August 3. 2010
The July 2010 School and Workshop in Houston went well and attracted more than 50 people, about half of them being graduate students. About 10 companies and 12 universities were represented. The event was sponsored by the Petroleum Technology Transfer Council. All presentation materials from the workshop are now available on the website.
Friday, July 30. 2010
A story about Madagascar is featured today on the SourceForge blog.
Among more than 230,000 software projects hosted by SourceForge, Madagascar is currently the 210th in activity (or top 0.1%) thanks to the recent release of 1.0 and to our active Subversion repository.
Thursday, July 22. 2010
A big event in Madagascar history: the first non-beta stable version (madagascar-1.0) is released! The event is celebrated at the School and Workshop in Houston.
This release features new reproducible papers, major structural improvements (thanks to Nick Vlad) and an automatic testing system (thanks to booklist, figlist, and testlist from Jim Jennings and vplotdiff from Joe Dellinger).
The cumulative number of all previous stable release downloads has exceeded 10,000.
Tuesday, July 20. 2010
SWAG (Seismic Wave Analysis Group) at KAUST is glad to introduce the Madagascar reference manual, which includes descriptions and usage information for all functions in rsf.h. These cover data types, preparing for input, operations with RSF files, error handling, linear operators, data analysis, filtering, solvers, interpolation, smoothing, ray tracing, general tools, geometry, and system. The purpose of this manual is to ease the task of searching for subroutines and understanding their function. You can access the manual at book/rsf/manual.
Monday, July 19. 2010
A hardcore Madagascar user does not need anything more than a friendly editor (to edit SConstruct files) and the good old command line (to run scons commands). However, sometimes it is necessary to provide simplified GUIs (Graphical User Interfaces) for inexperienced users.
Creating GUIs in Python is quite simple. An example is provided in rsf/rsf/gui. In this example, we obtain a compressed approximation of a piecewise-regular signal by a wavelet transform. The figure using default parameters is shown below:
There are two main parameters in this experiment: the type of the wavelet transform (type= parameter in sfdwt) and the thresholding percentile (pclip= parameter in sfthreshold). The first step is to expose these parameters to the CLI (Command Line Interface) by using the ARGUMENTS.get construct in SConstruct:
# Wavelet transform type
type = ARGUMENTS.get('type','b')
# Thresholding percentile
pclip = int(ARGUMENTS.get('pclip',50))
Now one can select parameters on the command line by launching something like scons type=b pclip=50 view
Next, we build the GUI by using one of the Python interfaces. The gui.py script provides an interface using Tkinter, the most standard Python GUI package. It allows the user to select the parameter values graphically. Clicking the Run button would then launch scons with the selected parameters in the background.
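To make the idea concrete, here is a minimal sketch of such a Tkinter front end, written for Python 3. It is not the gui.py shipped with Madagascar: the function names scons_command and build_gui, the widget layout, and the list of wavelet type choices are assumptions made for illustration. Only the type= and pclip= parameters and the view target come from the SConstruct fragment above.

```python
import subprocess


def scons_command(wavelet_type, pclip, target="view"):
    """Return the scons command line for the selected parameters."""
    return ["scons", f"type={wavelet_type}", f"pclip={int(pclip)}", target]


def build_gui():
    """Build a window with one widget per parameter and a Run button."""
    # Imported here so that scons_command stays usable without a display.
    import tkinter as tk

    root = tk.Tk()
    root.title("Wavelet compression")

    # Wavelet transform type (type= in SConstruct).
    # The list of choices is an assumption made for illustration.
    wavelet_type = tk.StringVar(root, value="b")
    tk.Label(root, text="Wavelet type").pack(anchor="w")
    for label, value in (("haar", "h"), ("linear", "l"), ("biorthogonal", "b")):
        tk.Radiobutton(root, text=label, variable=wavelet_type,
                       value=value).pack(anchor="w")

    # Thresholding percentile (pclip= in SConstruct).
    pclip = tk.IntVar(root, value=50)
    tk.Scale(root, label="Thresholding percentile", from_=1, to=99,
             orient=tk.HORIZONTAL, variable=pclip).pack(fill="x")

    def run():
        # Popen returns immediately, so scons runs in the background
        # and the window stays responsive.
        subprocess.Popen(scons_command(wavelet_type.get(), pclip.get()))

    tk.Button(root, text="Run", command=run).pack()
    return root
```

Calling build_gui().mainloop() from the directory that contains the SConstruct file opens the window. Keeping the command construction in a separate function makes it possible to check the generated command line without opening a window at all.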
An alternative GUI package, both simple and powerful, is Traits from Enthought Inc. An example Traits-based interface is provided by gui-traits.py.
For full-featured GUIs exposing all program parameters, one can use the Madagascar interface to TKSU or OpendTect.
Thursday, June 10. 2010
Most of the time, when we talk about reproducibility in computational sciences, we mean reproducibility of numerical results. We expect computational experiments to produce identical results in different execution environments for the same input data.
But it is not the case all the time - quite often, the goal of a research endeavor is to design a faster algorithm. Then, the result of the experiment is performance information and demonstration of a speedup over existing algorithms for finding a solution for the same problem. Speedup, just like numerical results, should be reproducible in different execution environments.
Sid-Ahmed-Ali Touati, Julien Worms, and Sebastien Briais of INRIA published an excellent work on methodology of reproducible speedup tests.
A part of the introduction from their paper is worth quoting on this blog:
Known hints for making a research result non reproducible
Hard natural sciences such as physics, chemistry and biology impose strict experimental methodologies and rigorous statistical measures in order to guarantee the reproducibility of the results with a measured confidence (probability of error/success). The reproducibility of the experimental results in our community of program optimisation is a weak point. Given a research article, it is in practice impossible or too difficult to reproduce the published performance. If the results are not reproducible, the benefit of publishing becomes limited. We note below some hints that make a research article non-reproducible:
- Non using of precise scientific languages such as mathematics. Ideally, mathematics must always be preferred to describe ideas, if possible, with an accessible difficulty.
- Non available software, non released software, non communicated precise data.
- Not providing formal algorithms or protocols make impossible to reproduce exactly the ideas.
- Hide many experimental details.
- Usage of deprecated machines, deprecated OS, exotic environment, etc.
- Doing wrong statistics with the collected data.
Part of the non-reproducibility (and not all) of the published experiments is explained by the fact that the observed speedups are sometimes rare events. It means that they are far from what we could observe if we redo the experiments multiple times. Even if we take an ideal situation where we use exactly the original experimental machines and software, it is sometimes difficult to reproduce exactly the same performance numbers again and again, experience after experience. Since some published performances numbers represent exceptional events, we believe that if a computer scientist succeeds in reproducing the performance numbers of his colleagues (with a reasonable error ratio), it would be equivalent to what rigorous probabilists and statisticians call a surprise. We argue that it is better to have a lower speedup that can be reproduced in practice, than a rare speedup that can be remarked by accident.
Read the full document for a thorough explanation of how to avoid creating non-reproducible and erroneous speedup tests by using proper scientific techniques.
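The point about rare events can be illustrated with a short Python sketch. It is not the protocol from the paper; it only shows, under simple assumptions, why a speedup computed from many repeated runs is more trustworthy than one computed from a single lucky pair of runs. The function names time_runs and speedup_summary and the default of 30 repetitions are choices made for this illustration.

```python
import statistics
import time


def time_runs(func, repeats=30):
    """Return wall-clock times in seconds for `repeats` calls of func."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times


def speedup_summary(baseline_times, optimized_times):
    """Summarize the speedup from two samples of execution times."""
    return {
        # Ratio of typical run times: robust to occasional outliers.
        "median": statistics.median(baseline_times)
        / statistics.median(optimized_times),
        # Ratio of average run times: sensitive to outliers.
        "mean": statistics.mean(baseline_times)
        / statistics.mean(optimized_times),
        # Slowest baseline run against fastest optimized run:
        # the kind of rare event that is hard to reproduce.
        "best_case": max(baseline_times) / min(optimized_times),
    }
```

Passing the outputs of time_runs for a baseline and an optimized implementation to speedup_summary gives three numbers. A large gap between best_case and median suggests that the most impressive observed speedup is an exceptional event rather than a typical one.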
Wednesday, June 9. 2010
Please reserve the date for the Madagascar "event of the year": the School and Workshop in Houston on July 23-24, 2010. The program details and registration information will follow soon.
Sunday, June 6. 2010
After returning from the NSF Workshop Archiving Experiments to Raise Scientific Standards, here are some thoughts on reproducible research. Thanks to Dan Gezelter, Dennis Shasha, and others for inspiring discussions.
First of all, it is important to point out that reproducibility is not a goal in itself. There are many situations in which strict computational reproducibility is not achievable. The goal is to expose scientific communication to skeptical inquiry. A mathematical proof is an example of a scientific communication that is constructed as a dialogue with a skeptic: someone who might say "What if your conclusions are not true?" Step by step, a mathematical proof is designed to convince the skeptic that the conclusion (a theorem) has to be true. As for computational results, even the simplest skeptical inquiry, "What if there is a bug in your code?", cannot be answered unless the software code and every computational step that led to the published result are open for inspection.
If you attend a mathematical conference, you will notice that mathematicians do not usually go through every step of a proof to present a theorem; it is enough to sketch the main idea. However, the audience understands that the detailed proof should be available in the published work, otherwise the theorem cannot be accepted. Similarly, in a presentation of a computational work, one can simply show results of a computational experiment. However, such results cannot be accepted as scientific unless the full computation is disclosed for skeptical inquiry. As stated by Dave Donoho (paraphrasing Jon Claerbout), "An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures." If you don't want to disclose the details of your computation, then the work that you do is not science.
As for reproducibility, there seem to be different degrees of it:
- Replicability: the ability to reproduce the computation as published
- Reusability: the ability to apply a similar algorithm to a different problem or different input data
- Reconfigurability: the ability to obtain a similar result when the parameters of the experiment are perturbed deliberately
Some algorithms are perfectly replicable but of limited use, because they are too sensitive to the choice of parameters to be reusable or reconfigurable. Nevertheless, such algorithms deserve a place in the scientific body of knowledge, because they may lead to a discovery or invention of more robust algorithms.
Those who read Italian may enjoy the philosophical article on open software and reproducible research by Alessandro Frigeri and Gisella Speranza: "Eppur si muove" Software libero e ricerca riproducibile. "Eppur si muove" ("and yet it moves") are the words attributed to Galileo Galilei, the father of modern science.
Tuesday, May 18. 2010
Vladimir Bashkardin contributes a Vplot plugin for PLplot, an open-source scientific plotting library. Here is an example of generating Vplot files with PLplot using sfplsurf.