Data Freedom: The Expansion of Data Sharing in Research Publications

Posted by Guest Blogger on Aug 5, 2014 2:51:13 PM


This post was contributed by Jim Woodgett.


The Public Library of Science (PLOS) created a stir earlier this year when it announced its data access and sharing policy. Since early March, the open access publisher has required authors to include a note as to where readers may locate data supporting the research reported in PLOS publications. The policy was not an overnight revelation; rather, it was the result of consultations between researchers and publishers. Nonetheless, the initial release caused a storm as the organization left open the question of how much data was necessary and reasonable. PLOS has since clarified its data sharing policy and recently announced that, of the 16,000 manuscripts processed since the declaration, only a small fraction (<1%) of authors have asked for advice about the scope of the policy. End of story? Not quite.

The PLOS policy on data access is a formal statement of what is already practiced or required by many other journals and several funding agencies (e.g. Wellcome Trust, NIH Data Sharing Policies, etc.) and is founded on the principle that scientific publications must contain enough information (or links to information) to allow the veracity of the conclusions drawn to be assessed. The main difference with the PLOS policy is that access to this data is to be provided at the time of manuscript submission. The rationale includes increasing the ability of reviewers (and, once published, anyone) to detect possible errors in assumptions, use of inappropriate statistical methods, omission of some data, etc.

The firestorm that followed implementation of the policy comprised numerous objections and concerns. Perhaps least compelling of these were those from researchers who wanted more time to mine their own data, with some smaller labs feeling that larger, better resourced groups would harvest and reanalyse their data with limited acknowledgement of the initial contributors. This is probable to some extent, but that is the price of publication. Who is to say that the initial data-generating lab has efficiently trawled its own data? A distinct (perhaps similarly impoverished) lab might apply different or novel tools and assumptions to reveal findings that the originator was oblivious to, no matter how much time they had. It could be argued that data generators may, in light of the policy, hold onto data longer before publishing. But there are strong incentives to publish, such as precedence, recognition and “productivity”, so this is unlikely. Usually, the originating scientist and colleagues are privileged in viewing their data months or years before others. In some cases, typically consortia involving genomic screens, the datastream is so prolific that raw information is made available to everyone at once, including the generators. This practice doesn’t appear to harm the originating researchers, even if they are occasionally beaten to press, as there is so much analysis possible. Most scientists are, however, privileged in having exclusive access to data and enjoy reasonable time to make sense of it, with their decision to publish setting the clock for others to dive in.

The real cost of making data accessible

A more reasonable concern with the data policy was raised by researchers worried about the effort needed to make the underlying data easily available and organized in a form that can be understood by others, balanced against how often such data is likely to be accessed. Adding to this burden, the policy requires that data should be available for anonymous request. This means that supporting datasets must be logically annotated and arranged, as in an actual publication, and must not require interaction with the data authors. In some types of research, raw data requires huge amounts of storage. For example, mouse behavioural geneticists may track the motion of multiple animals on different occasions via video, generating huge amounts of data. When published, this data is usually condensed into a statistically evaluated set of values. However, evaluating the interpretation of the experiment would require access to each individual video. To PLOS’s credit, the organization subsequently clarified what the policy intent was and provided an FAQ list to further explain the policy. The fact that so few authors now query what to include suggests pragmatic adoption (although it might be argued that some authors have simply chosen to submit elsewhere). The majority of the scientific community thus appears not to be averse to making data available, but there remains concern about the extra work this may entail.

So if access to data is important, what data should be included?

We should recognize that storing data in a sharable form is indeed a significant effort. Our standards of data recording are highly variable and usually left to individual researchers. Yet accurate and comprehensible organization of data is essential for efficient science. Improvements in our discipline for annotation and archiving of data are likely to improve research. What is published is typically a tiny fraction of what was actually carried out. While no one is suggesting all data be stored in a retrievable format, we too rarely preserve “raw” data of any sort in a form that can be parsed by someone other than the actual author. Instead, selected images are said to be representative or the statistical variation is calculated and placed in the legend. Widely reported difficulties in reproducing experimental data are likely due, at least in part, to our inadequate cataloguing of how results were obtained, and this alone is a reasonable motivation for increasing both data accessibility and methodological detail. Publishers have also contributed to this issue by requiring ever more condensed methods descriptions and legends, and by limiting the number of figures such that many have multiple, complex panels.

PLOS itself has not finalized its definition of necessary data requirements and likely will not do so (calling it a “sisyphean task” given the endless types of data). Rather, PLOS appears to leave the scope of data to be shared to individual authors. Minimally, raw but sufficiently annotated data used to create charts and tables in the paper are in scope. For images, the number of examples supporting the published material is a matter of practicality (given image sizes), but images should be unprocessed and in capture format. This raises another issue: proprietary data formats. Many instruments store raw data in formats that require specific (expensive) software that evolves over time, such that prior formats are no longer supported or readable. Is the author or their host institution expected to provide data in formats exported to common file types? And how long should data be archived?

Simplifying data access

Some of the abovementioned issues could be mitigated by on-line services such as SlideShare, Figshare and Dataverse. Just as Addgene helps with DNA reagents, these resources can offload the burden of data storage and distribution. I would argue that the journals themselves should be primarily responsible for storage.  If datasets are too large to transfer to a journal repository, they are likely too large to be retrieved by others. Other new data initiatives are also taking root. Nature Publishing Group recently launched a publication called Scientific Data, which aims to act as a peer-reviewed repository for “scientifically valuable” datasets. It likely won’t be long before other data repositories emerge that allow efficient searching and extraction of information.

Looking a little further ahead, journals may move towards encouraging authors to integrate datasets with their resultant figures. Instead of providing a “deadend” figure or table, clicking on the published chart could reveal the underlying data and allow other analyses “on the fly”. This becomes intuitive with touch interfaces where the viewer is more likely to want to manipulate how they view the information. We are used to publications presenting a polished, standard and usually limited view of data. In addition to enhancing veracity of collections and interpretation of data, publishing complete data, rather than just a single projection of it, could allow idea generation and extension, based on existing, published data.

The importance of data sharing

This topic is important to my lab for three reasons. Firstly, we rely on many other labs to teach us how to do experiments and understand their data. Likewise, we want to help others understand ours. Improving access to data promotes dissemination, especially for young researchers who may be intimidated by having to contact more senior colleagues. Secondly, science is highly technical and this has a natural, albeit unintentional, tendency to work against transparency. This opens science to charges of elitism and scientists to charges of hiding behind our jargon and esoterica. Public trust in and funding of science are not served when it is perceived as being done in the dark. Lastly, data is our raw product. If we don’t take interest in how it is released, we will be in danger of having to deliver on well-meaning but impractical requirements imposed by publishers and funders.

Clearly, it is in our interest to ensure that our data are reproducible and useful to others and efforts to enhance these attributes should be welcomed. However, expectations must also be reasonable, sensitive to the data type and made as facile for both the author and the reader as possible. Rather than adding to demands, making effective data sharing the norm should open up new opportunities for research as well as increasing transparency.


Thank you to our guest blogger!

Jim Woodgett is Director of Research at the Lunenfeld-Tanenbaum Research Institute at Mount Sinai Hospital, Toronto, Canada where he studies protein-serine/threonine kinases implicated in human diseases. His spare time is spent trying to raise the alpha isoform of GSK-3 to the same (deserved) stature as the beta isoform. Follow him on Twitter @jwoodgett.

 

 



