On the Virtues of Open Data


The scientific world has recently been rocked by some notable instances of allegedly falsified data: Andrew Wakefield, Marc Hauser, Phil Jones and others. These prominent researchers have suffered public humiliation, some would say justifiably, and have engendered scepticism towards their fields of research.

However, to point out their failings without addressing the broader issue of how science proceeds from data collection to publication would be myopic. Scientists collect data of one form or another, then put it through the machinations of statistical techniques and find... well, that's where it gets tricky!

There's often a lot riding on findings: research grants, prestige, even bets. A certain amount of bias is therefore inevitable. How can this be minimised? I would argue that we need to overturn the idea that we keep data close to our chests and only publish digested results. In this endeavour, we can learn much from the free software community. It is certainly possible to pursue a career and profit from software without making it closed and proprietary. Likewise, having other scientists inspect the 'private parts' of our data and statistics can be a useful move, and not just for preventing fraud.

More eyes mean more hypotheses, and data can be pooled to create a corpus for new analyses. That brings in the important issues of which statistical methods we use (more on Bayesian and mega-analysis methods in an upcoming post) and which statistics packages we choose (more on open-source statistics coming too, most notably the R project).

A number of neuroscience sites are leading the charge in providing open data. Here are a few (please post any others you know of in the comments and I'll add them to the list).

| Name | Active | Maintainer | Size | Access tools | Raw data | Processed data | Licence | Notes |
|---|---|---|---|---|---|---|---|---|
| OpenfMRI | Yes | Russ Poldrack, University of Texas | 157 | http .tgz | Yes | No | opendata | |
| INDI - International Neuroimaging Data-sharing Initiative | Yes | Maarten Mennes | 41 | web .tar.gz | Yes | No | Attribution Non-Commercial | resting mode data with phenotypes, also R-fMRI package |
| Brainmap | Active | Research Imaging Institute, University of Texas | 2155 papers | custom, closed source software | No | Yes | Copyrighted with limited use licence | coregistered data from preprocessed, categorised studies |

Many of these sites provide co-registered data, for example EEG and MRI recorded in the same participants. As computing power improves, I believe that many more discoveries will come from re-analysing these massive data sets than from running new experiments, or at least from using the data sets to develop appropriate hypotheses for testing.

To this end, I will be publishing the raw data for my current neuroscience experiments in parallel with the journal articles and encouraging you all to reuse the data under a copyleft licence. Hopefully you'll feel empowered to do the same. More to follow on how and where to publish...


You may want to have a look at the Potsdam Mind Research Repository (http://read.psych.uni-potsdam.de/pmr2/). Perhaps start with the "About" page.

30-10-2011 12:40 pm