27 January 2010

Credibility of Science and the Freedom of Data

Adhering to the free and open source model (from the software world) would help, in the long run, to bring credibility back into the sciences in the eyes of the public. For scientists, this suggestion should sound like a non-suggestion, like suggesting we continue breathing to continue living (i.e., "but we're already doing that"). What seems surprising is that the credibility of science, and its free and open nature, are even in doubt.

I've already talked briefly about the tarnishing of science's credibility, but what of the free and open source nature of science? There are actually two "parts" of science (i.e., data and methodology) that we have to address in terms of freedom and open-source-ness. In this post, I'll briefly address the issue of raw data, since a lot of the recent news has been about the data the climate scientists used in their models. I'll address the issue of methodology in another post.

Why is Open Source Difficult?

It goes without saying that the data generated in any measurements or experiments should be open to scrutiny; this is the open source nature of data. If scientists want to avoid having the quality of their data called into question, the data has to be open to scrutiny. There are, however, many costs to sharing data and very little to gain, as discussed in "Why won't psychologists share their data?" and "Dark-matter paper raises questions over data sharing".

In brief, the most a scientist could gain by sharing data is to not have their conclusions called into question. On the other hand, the most one could lose by sharing is to have the conclusions refuted, and have one's reputation diminished. So a good question to ask is: how can the costs be mitigated or the gains increased so that open sourcing the data is a good default rather than a costly or grudging "I guess I should" moral responsibility as a scientist? It's a good question, but more of an economic one, so let's file it away for now.

What Data to Open Source?
A different question we might ask is, what constitutes "raw data" that should be open sourced? If we did an experiment and measured temperatures using a classic mercury-in-glass thermometer, it seems the temperature measurements are the raw data.  But even this is just an interpretation of the measurement, which is how high the mercury liquid is in the thermometer relative to the "zero", which itself has to be calibrated somehow.  Modern instruments wrap up a lot of interpretations of the raw sensor feed before presenting anything to the user as "raw" data.  So how raw is raw?
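
To make the thermometer point concrete, here is a minimal sketch of the interpretation step hiding behind a "raw" temperature. It assumes a simple linear two-point calibration against the ice point (0 °C) and the steam point (100 °C); the function name and the millimetre readings are hypothetical, chosen only for illustration.

```python
def height_to_celsius(height_mm, ice_point_mm, steam_point_mm):
    """Convert a mercury column height to degrees Celsius.

    Assumes the column height varies linearly with temperature between
    two calibration marks: the height at the ice point (0 C) and the
    height at the steam point (100 C). Both marks are themselves
    measurements, which is why the "raw" temperature is already an
    interpretation.
    """
    span_mm = steam_point_mm - ice_point_mm
    if span_mm == 0:
        raise ValueError("calibration marks must be at different heights")
    return 100.0 * (height_mm - ice_point_mm) / span_mm


# Hypothetical readings: the column sits at 70 mm, with calibration
# marks at 20 mm (ice point) and 220 mm (steam point).
print(height_to_celsius(70.0, 20.0, 220.0))
```

Sharing only the resulting temperature hides the two calibration heights; sharing the column height alone is meaningless without them. Either way, the reader needs more than a single number to reproduce the measurement.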

The answer is tough in that it has to be specific to the scientific field and the instrument in question. A good first approximation may simply be: specify everything required so that a fellow scientist could purchase all the required materials and make the same "raw" data appear. So specify which brand and model of instrument, how it was calibrated, under what usage conditions, etc. Of course, we're edging towards the issue of methodology now, so let's save this issue for next time and, instead, look briefly at the issue of freedom of data.

Data Freedom
The key concept here is about the freedom to modify and redistribute so that "you can give the whole community a chance to benefit from your changes" (The Free Software Definition). Once the idea that the data should be open to scrutiny is established, it's clear that freedom to modify and redistribute the data is a prerequisite to exercising the ability to scrutinize the data.

After all, to scrutinize the data, one must analyse it, possibly modify it, and then disseminate it so that others can put that analysis under scrutiny too. To have the data open without the freedom to modify and redistribute it is like having a diamond open to public viewing without allowing others the opportunity to examine it for validity (i.e., "the diamond looks real, but is it really?"). To allow open access without these freedoms may be acceptable for actual diamonds for obvious security reasons, but for scientific data, it's not effective for making scientific conclusions credible.

Scientific Data Wants to be Free
Is such openness and freedom ever practised? Fortunately, it is! A model citizen for providing such openness and freedom of scientific data may be the recent climate analysis done by NASA: 2009 tied for 2nd-warmest year, 00s hottest decade too. It's just too bad that groups like the Intergovernmental Panel on Climate Change and the University of East Anglia's Climatic Research Unit have given the rest of the scientific community a bad name in the eyes of the general newspaper-reading public.
