In the scholarly communications environment, the evolution of a journal article can be traced by the relationships it has with its preprints. Those preprint–journal article relationships are an important component of the research nexus. Some of those relationships are provided by Crossref members (including publishers, universities, research groups, funders, etc.) when they deposit metadata with Crossref, but we know that a significant number of them are missing. To fill this gap, we developed a new automated strategy for discovering relationships between preprints and journal articles and applied it to all the preprints in the Crossref database. We made the resulting dataset, containing both publisher-asserted and automatically discovered relationships, publicly available for anyone to analyse.
The second half of 2023 brought with itself a couple of big life changes for me: not only did I move to the Netherlands from India, I also started a new and exciting job at Crossref as the newest Community Engagement Manager. In this role, I am a part of the Community Engagement and Communications team, and my key responsibility is to engage with the global community of scholarly editors, publishers, and editorial organisations to develop sustained programs that help editors to leverage rich metadata.
STM, DataCite, and Crossref are pleased to announce an updated joint statement on research data.
In 2012, DataCite and STM drafted an initial joint statement on the linkability and citability of research data. With nearly 10 million data citations tracked, thousands of repositories adopting data citation best practices, thousands of journals adopting data policies, data availability statements and establishing persistent links between articles and datasets, and the introduction of data policies by an increasing number of funders, there has been significant progress since.
Have you attended any of our annual meeting sessions this year? Ah, yes – there were many in this conference-style event. I, as many of my colleagues, attended them all because it is so great to connect with our global community, and hear your thoughts on the developments at Crossref, and the stories you share.
Let me offer some highlights from the event and a reflection on some emergent themes of the day.
What’s in the metadata matters because it is So.Heavily.Used.
You might be tired of hearing me say it but that doesn’t make it any less true. Our open APIs now see over 1 billion queries per month. The metadata is ingested, displayed and redistributed by a vast, global array of systems and services that in whole or in part are often designed to point users to relevant content. It’s also heavily used by researchers, who author the content that is described in the metadata they analyze. It’s an interconnected supply chain of users large and small, occasional and entirely reliant on regular querying.
A majority of Crossref metadata users rely on our free, open APIs and many are anonymous. A small but growing group of users pay for a guaranteed service level option and while their individual needs and feedback have long been integrated into Crossref’s work, as a group they provide a window into the workflows and use cases for the metadata of the scholarly record. As this use grows in strategic importance, to both Crossref and the wider community, it was clear that we might be overdue for a deeper dive into user workflows.
In 2021, we surveyed these subscribers for their feedback and brought together a few volunteers over a series of 5 calls to dig into a number of topics specific to regular users of metadata. This group, the first primarily non-member working group at Crossref, wrapped up in December 2022, and we are grateful for their time:
Achraf Azhar, Centre pour la Communication Scientifique Directe (CCSD)
Satam Choudhury, HighWire Press
Nees Jan van Eck, CWTS-Leiden University
Bethany Harris, Jisc
Ajay Kumar, Nova Techset
David Levy, Pubmill
Bruno Ohana, biologit
Michael Parkin, European Bioinformatics Institute (EMBL-EBI)
Axton Pitt, Litmaps
Dave Schott, Copyright Clearance Center (CCC)
Stephan Stahlschmidt, German Centre for Higher Education Research and Science Studies (DZHW)
This post is intended to summarize the work we did, to highlight the role of metadata users in research communications, to provide a few ideas for future efforts and, crucially, to get your feedback on the findings and recommendations. Though this particular group set out to meet for a limited time, we hope this report helps facilitate ongoing conversations with the user community.
If you’re looking for an easy overview of users and use cases, here’s a great starting point.
If you interpret this graphic to mean that there is a lot of variety centered on a few high level use cases, the survey and our experiences with users certainly supports that. A few key takeaways from the 2021 survey may be useful context:
Frequency of use: At least 60% of respondents query metadata on a daily basis
Finding and enhancing metadata as well as using it for general discovery are all common use cases
For most users, matching DOIs and citations is a common need but for a significant group, it is their primary use case
Analyzing the corpus for research was a consistent use case for 13% of respondents
Metadata of particular interest
Abstracts are the most desirable non-bibliographic metadata, followed by affiliation information, including RORs
Some other elements (beyond citation information) that respondents find useful are:
NB: The survey did not ask about references but we are frequently asked why they’re not included more often.
It’s also worth noting that about a third of respondents said that correct metadata is more important to them than any particular element.
There is more to this survey that isn’t covered here but it was kept fairly short to help with the response rate. Knowing we would have some focused time to discuss issues too numerous or nuanced to reasonably address in a survey, we compiled a long list of questions and topics for the Working Group then followed up with a second, more detailed survey to kick off the meeting series.
What we set out to address
We had three primary goals for this Working Group:
Highlight the efforts of metadata users in enabling discovery and discoverability
Of course, everyone involved had some questions and topics of interest to cover, including (but not limited to):
Understanding publisher workflows
How best to introduce changes, e.g. for a high volume of updated records
Understanding the Crossref schema
Query efficiencies, i.e. ‘tips and tricks’ (here for the REST API)
Which scripts, tools and/or programs are used in workflows
What other metadata sources are used
What kind of normalization or processing is done on ingest
How metadata errors are handled
What did we learn?
Workflows I started with the admittedly ambitious goal of collecting a library of workflows. After a few years of working with users, I learned never to assume what a user was doing with the metadata, why or how. For example, some subscribers use Plus snapshots (a monthly set of all records), regularly or occasionally and some don’t use them at all. Understanding why users make the choices they do is always helpful.
In my experience, workflows are frequently characterized as “set it and forget it.” It’s hard to know how often and how easily they might be adapted when, for example, a new record type like peer review reports becomes available. So, it’s worth exploring when and how to highlight to users changes that might be of interest.
As it turned out, half the group had their workflows mostly or fully documented. The rest are partially documented, not documented at all or the availability of documentation was unknown. Helping users document their workflows, to the extent possible, should be a mutually beneficial effort to explore going forward. We’re doing similar work with the aim of making ours more transparent and replicable.
Feedback on subscriber services User feedback might be the most obvious and directly consequential work of this group, at least for Crossref - understanding how well the services used meet their needs and what might be improved.
One frequent suggestion for improvement is faster response time on queries. This is an area we’ve focused on for some time, because refining queries to be more efficient is often the most straightforward way to improve response times and one reason for the emphasis on workflows.
We also discussed the possibility of whether or how to notify users of changes of interest. Just defining “change” is complex since they are so frequent and may often be considered very minor. We’ve been experimenting a bit over the past few years with notifying these users in cases where we’re aware of upcoming large volumes of changes, which is sometimes the case when landing page URLs are updated due to a platform change, for example. It was incredibly useful to discuss with the group what volume of records would be a useful threshold to trigger a notification (100K if you’re curious).
But perhaps the most common feedback we get from all users is on the metadata itself and the myriad quality issues involved. The group spent a fair amount of time discussing how this affects their work and shared a few examples of notable concerns:
Author name issues, e.g. ‘Anonymous’ is an option for authors but that or things like ‘n/a’ are sometimes used in surname fields
Invalid DOIs are sometimes found in reference lists
Garbled characters from text not rendering properly
Affiliation information is often not included or incomplete (e.g. doesn’t include RORs)
Inconsistencies in commonly included information, e.g. ISSNs
It’s worth noting that a common misunderstanding - not just among users - is what is required in the metadata. Users nearly always expect more metadata and more consistency than is actually available. The introduction of Participation Reports a few years ago was a very useful start to what is an ongoing discussion about the variable nature of metadata quality and completeness.
The role of metadata users in discoverability of content is key in my view and one that often doesn’t get enough attention, especially given that the systems and services that use this information often use it to point their own users to relevant resources. And because they work so closely with the metadata, users frequently report errors and so serve as a sort of de facto quality control. So, unfortunately, the effects of incomplete or incorrect metadata on these users might be the most powerful way to highlight the need for more and better metadata.
What are the recommendations?
In discussions with the Working Group, a few themes emerged, largely around best practices, which, by their nature, tend to be aspirational.
If you’re not already familiar with the personas and Best Practices and Principles of Metadata 2020, that is a useful starting point (I am admittedly biased here!) and many are echoed in the following recommendations:
Document and periodically review workflows
Report errors to members or to Crossref support and reflect corrections when they’re made (metadata and content)
Define a set of metadata changes, e.g. to affiliations, to further the discussion around thresholds for notifying users of ‘high volumes’ of changes
Provide an output schema.
Continue refining the input schema to include information like preprint server name, journal article sub types (research article, review article, letter, editorial, etc.), corresponding author flags, raw funding statement texts, provenance information, etc.
Collaborate on improving processes for reporting metadata errors and making corrections and enhancements
For metadata providers (publishers, funders and their service providers):
Consistency is important, e.g. using the same, correct relationship for preprint to VoR links for all records
Workarounds such as putting information into a field that is ‘close’ but not meant for it can be considered a kind of error
Understand the roles and needs of users in amplifying your outputs
Respond promptly to reports of metadata errors
Whenever possible, provide PIDs (ORCID IDs, ROR IDs, etc.) in addition to (not as a substitute for) textual metadata
What is still unclear or unfinished?
Honestly, a lot. We knew from the outset that the group would conclude with much more work to be done, in part because there is so much variety under the umbrella of metadata users and many answers lead to more questions and in part because the metadata and the user community will continue to evolve. Even without a standing group that meets regularly, it’s very much an ongoing conversation and we invite you to join it.
Now it’s your turn–can you help fill in the blanks?
Does any or all of this resonate with you? Do you take exception to any of it? Do you have suggestions for continuing the conversation?
Specifically, can you help fill in any of the literal blanks? We’ve prepared a short survey that we hope can serve as a template for collecting (anonymous) workflows. Please take just a few minutes to answer a few short questions such as how often you query for metadata.
If you are willing to share examples of your queries or have questions or further comments, please get in touch.