
Why Data Citation matters to publishers and data repositories

A couple of weeks ago we shared with you that data citation is here, and that you can start doing data citation today. But why would you want to? There are always so many priorities, so why should this be at the top of the list?

I’m sure you’ve heard this before: data sharing and data citation are important for scientific progress. The three key reasons for this are:

1) Transparency and reproducibility

Most scientific results that are shared today are just a summary of what researchers did and found. The underlying data are not available, making it difficult to verify and replicate results. If data were always made available with publications, the transparency of research would be greatly improved.

2) Reuse

The availability of raw data allows other researchers to reuse the data, not just for replication purposes but also to answer new research questions.

3) Credit

When researchers cite the data they used, this forms the basis for a data credit system. Right now researchers are not really incentivized to share their data, because nobody is looking at data metrics and measuring their impact. Data citation is a first step towards changing that.

[Figure: data article nexus]

The benefits described above are all quite long-term, so why, as a publisher or data repository, should you put your resources towards implementing data citation workflows now? During our pre-conference workshop at FORCE2018 we asked repositories and publishers this question. Below you’ll find some of the answers.

Data repositories

For data repositories, data citation leads to increased visibility of both the repository and the datasets. The workshop revealed that many repositories do a lot of work to establish links between articles and datasets, thereby significantly contributing to transparency in research. Some of the repositories explained that they hire curators who text-mine articles to find associations and manually curate datasets to ensure that information about links is part of the metadata. This is reflected in Event Data, where 99% of links between articles and datasets come from data repository metadata. This downstream enrichment of metadata is useful, but it would be more effective if all stakeholders strove to establish these links at a much earlier stage in the research communication process.

ICPSR, the Inter-university Consortium for Political and Social Research, shared:

ICPSR views data citation as vital. As a large social science data archive, ICPSR curates, preserves, and distributes data for the research community to re-use over time. Data citation makes data visible to the research community. Without it, data cannot be accessed for re-use or reproduced for transparency. Its use cannot be tracked and counted to reveal its impact and potential for new uses by investigators in new fields or in combination with new types of data. Data creators cannot receive adequate credit for their intellectual output. And the original investment by funders and scientists to create those data stops producing dividends. Therefore, data citation plays an essential role in the data sharing lifecycle.

Proper data citation, with a unique identifier, makes it much easier to measure impact. When data use is not cited or cited obliquely, it is rendered virtually invisible. Hence, much data use is still not easily detected. The ICPSR Bibliography of Data-related Literature represents ICPSR’s efforts to identify publications that analyze data distributed at ICPSR and link them directly to the data in the ICPSR catalog. As of 2018, ICPSR has a searchable database that contains nearly 80,000 citations of published and unpublished works resulting from analyses of data held in the archive. ICPSR also makes the case for data citation in its brief new video, “ICPSR 101: Why Should I Cite Data?”

GBIF, the Global Biodiversity Information Facility, explained:

The work required to collect, clean, compile and publish biodiversity datasets is significant and deserves recognition. Researchers publish studies based on data made available through GBIF at a rate of about 2 papers every single day. It is crucial for GBIF to link these scientific uses to the underlying data as one measure of demonstrating the value and impact of sharing free and open biodiversity data. At the moment, however, only about 10 percent of authors cite or acknowledge the datasets used in research papers properly. As a result, data publishers’ efforts often risk going unnoticed, and the true impact of sharing data remains invisible. GBIF will continue to work with publishers and researchers to provide guidance and input for how to best cite the use of GBIF-mediated data in scientific journals to ensure proper attribution and reproducible research and to demonstrate the true value of free and open access to biodiversity data.


Publishers

By ensuring data is cited in a consistent way, publishers help provide transparency and context for the content they publish. Depositing that information as part of the Crossref metadata helps that work go further by uncovering how data is being used across multiple publications and publishers. This means patterns can be explored and researchers can gain more comprehensive recognition and credit for the work they have done.
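To make the idea of data use across multiple publications and publishers a little more concrete, here is a minimal sketch in Python. It assumes each data citation can be reduced to a simple article-to-dataset link. The record fields (`article`, `publisher`, `dataset`), the DOIs, and the publisher names are all made up for illustration and are not the Crossref deposit schema.

```python
from collections import defaultdict

# Hypothetical link records: each one says "this article cites this dataset".
# The DOIs and publisher names are invented for illustration only.
links = [
    {"article": "10.1234/article.1", "publisher": "Publisher A", "dataset": "10.5555/dataset.x"},
    {"article": "10.1234/article.2", "publisher": "Publisher A", "dataset": "10.5555/dataset.y"},
    {"article": "10.6789/article.3", "publisher": "Publisher B", "dataset": "10.5555/dataset.x"},
]


def dataset_usage(links):
    """Group citing articles and their publishers by the dataset they cite."""
    usage = defaultdict(lambda: {"articles": set(), "publishers": set()})
    for link in links:
        entry = usage[link["dataset"]]
        entry["articles"].add(link["article"])
        entry["publishers"].add(link["publisher"])
    return dict(usage)


usage = dataset_usage(links)
for dataset, entry in sorted(usage.items()):
    print(dataset, len(entry["articles"]), "articles,", len(entry["publishers"]), "publishers")
```

Once links from different publishers sit in one place, a dataset that is reused across journals shows up as a single entry with several citing articles, which is the kind of pattern that is hard to see from any one publisher's content alone.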

Melissa Harrison, Head of Production Operations at eLife, says:

eLife is committed to ensuring researchers get credit for all their outputs, and data is a major component of this. We’re working with Crossref and JATS4R to enable publishers to tag their JATS data content consistently and thus create an easy crosswalk to their Crossref deposits. The JATS4R guidance on Data Availability Statements, linked to and incorporating data citations, will be updated soon, please watch that space!

It will be really interesting to see how much re-use of previously published data is happening, look for patterns in re-use, and see links and hopefully building up of data by different research groups. Ultimately, this will incentivize researchers and publishers to ensure it is correctly accredited at source and in publications, improving the cycle further.

Anita de Waard, VP of Research Collaborations at Elsevier, says:

One of the key recommendations of the Force11 Manifesto was to “3.3 Add data, software, and workflows into the publication as first-class research objects”, which will allow greater reproducibility and rigor to experimental research, and allow the reuse of all digital artefacts in the scholarly lifecycle. By following the data citation principles, we achieve two things: the author presents a richer representation of their work, and the data producer receives credit for the hard work of curating and publishing citable datasets.

Mendeley Data and Elsevier are active contributors to the Scholix framework, which, as a collaborative and open standard, allows the open mining of relationships between articles and datasets. We are also active participants in the new Enabling FAIR Data Project, and next to supporting the TOP Guidelines in all domains, require all authors in the earth and space sciences to deposit their data before publication.

Next week at Crossref LIVE18, Patricia Cruse from DataCite will talk about Data Citations and why they matter. If you’re in Toronto next week, do not hesitate to ask her or anyone from Crossref anything you want to know about data citation!


Page owner: Helena Cousijn   |   Last updated 2018-November-08