5 minute read.

Crossref Research and Development: Releasing our Tools from the Ground Up

This is the first post in a series designed to showcase what we do in the Crossref R&D group, also known as Crossref Labs, which has been strengthened over the last few years: first with Dominika Tkaczyk and Esha Datta, last year with part of Paul Davis’s time, and, most recently, with yours truly. Research and development are, obviously, crucial for any organization that doesn’t want to stand still. The R&D group builds prototypes, experimental solutions, and data-mining applications that help us to understand our member base, in the service of the future evolution of the organization. One of the strategic pillars of Crossref is that we want to contribute to an environment in which the scholarly research community identifies shared problems and co-creates solutions for broad benefit. We do this in all teams through research and engagement with our expanding community.

The Crossref Labs Creature

For example, if the metadata team wants to implement a new field in our schema, it helps to have a prototype to show to members. The Labs team would implement such a prototype. If we want to know the answer to a question about the 150m or so metadata records we have – e.g. how many DOIs are duplicates? – it’s the Labs team that will work on this.

When building such prototypes, though, which can often seem esoteric and one-off, it is easy to believe that nobody else would ever re-use our components. At the same time, we find ourselves consistently working with the same infrastructures, re-using the same code blocks across many applications. One of the tasks I have been working on is to extract these duplicated functions and move them into external code libraries.

Why is this important? As many readers doubtless know, Crossref is committed to The Principles of Open Scholarly Infrastructure. In line with the insurance principles of POSI, everything new that we develop is open source, and we want our members to be able to re-use the software that we create. It’s also important because, if we centralize these low-level building blocks, we make it much easier to fix bugs that would otherwise be duplicated across all of our projects.

As a result, Crossref Labs has a series of small code libraries that we have released for various service interactions. We often find ourselves needing to interact with AWS services. Indeed, Crossref’s live systems are in the process of transitioning to running in the cloud, rather than in our own data centre. It makes sense, therefore, for prototype Labs systems to run on this infrastructure, too. However, the boto3 library is not terribly Pythonic, so many of our low-level tools wrap interactions with AWS. These include:

  • CLAWS: the Crossref Labs Amazon Web Services toolkit. The CLAWS library gives speedy and Pythonic access to functions that we use again and again. This includes downloading files from and pushing data to S3 buckets (often in parallel/asynchronously), fetching secrets from AWS Secrets Manager, generating pre-signed URLs, and more.
  • Longsight: A range of common logging functions for the observability of Python AWS cloud applications. Less mature than CLAWS, this is the starting point for observability across Labs applications. It supports running in AWS Lambda function contexts or pushing your logs to AWS Cloudwatch from anywhere else. It also supports logging metrics in structured forms. Crucially, the logs are all converted into machine-readable JSON format. This allows us to export the metrics into Grafana dashboards to visualize failure and performance.
  • Distrunner: decentralized data processing on AWS services. Easily the least mature and experimental of these libraries, distrunner is one of the ways that we distribute the workloads of our recurrent data processing. A number of the Labs projects require us to run recurrent data-processing tasks. For instance, my colleague Dominika Tkaczyk has developed the sampling framework that is regenerated once per week. We use Apache Airflow (and, specifically, Amazon Managed Workflows for Apache Airflow) to host these periodic tasks. This is useful because it gives us quick, visual oversight if tasks fail. However, the Airflow worker instances on AWS are quite severely underpowered and unsuitable for large in-memory activities. Hence, the sampling framework fires up a Spark instance for its processing. Often, though, we do not need the parallelization of Spark and just want to be able to run a generic Python script in a more powerful environment. That’s what distrunner is designed to do. The current version uses Coiled but this may change in the future.

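The machine-readable logging idea behind Longsight can be sketched with the Python standard library alone. This is not Longsight’s actual interface, just a minimal illustration of emitting each log record, with attached metrics, as a single JSON line that a dashboard such as Grafana can ingest.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one machine-readable JSON line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured metrics can be attached via logging's `extra` argument.
        metrics = getattr(record, "metrics", None)
        if metrics is not None:
            payload["metrics"] = metrics
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("labs-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Emits e.g.: {"level": "INFO", "logger": "labs-demo",
#              "message": "weekly sample regenerated",
#              "metrics": {"records_processed": 1500}}
logger.info("weekly sample regenerated",
            extra={"metrics": {"records_processed": 1500}})
```

Because every line is valid JSON, failures and performance figures can be parsed out of the log stream mechanically rather than grepped for by hand.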
While these tools will be useful to nobody except programmers – and this has been quite a technical post – there is a broader philosophical point to be made about this approach, in which everything is available for re-use, “from the ground up”. The point is: we also try, in Labs and in the process of “R&Ding”, to work without privileged access. That is: I don’t get “inside” access to a database that isn’t accessible to external users. I have to work with the same APIs and systems as would an end-user of our services. This means that, when we develop internal libraries, it’s worth releasing them, because they use only systems that are accessible to any of our users.

I should also say that our openness is more than unidirectional. While we are putting a lot of effort into ensuring that everything new we put out is openly accessible, we are also open to contributions coming in. If we’ve built something and you make changes or improve it, please do get in touch or submit a pull request. Openness has to work both ways if projects are truly to be used by the community.

Future posts – coming soon! – will introduce some of the technologies and projects that we have been building atop this infrastructure. This includes a Labs API system; new functionality to retrieve unpaginated datasets of whole API routes; a study of the preservation status of DOI-assigned content; and a mechanism for modeling new metadata fields.


Page owner: Martin Eve   |   Last updated 2023-June-21