What do we know about DOIs

Martin Eve – 2024 February 29

Crossref holds metadata for approximately 150 million scholarly artifacts. These range from peer-reviewed journal articles through scholarly books to scientific blog posts. Amid such heterogeneity, the only factor that unites these items is that each has been assigned a digital object identifier (DOI): a unique identification string that can be used to resolve to a resource pertaining to that metadata (often, but not always, a copy of the work identified by the metadata).

What, though, do we actually know about the state of persistence of these links? How many DOIs resolve correctly? How many landing pages, at the other end of the DOI resolution, contain the information that is supposed to be there, including the title and the DOI itself? How can we find out?

The first and seemingly most obvious way to obtain some of these data is to work through the most recent sample of DOIs and attempt to fetch metadata from each of them using a standard Python script. This involves using the httpx library to resolve each DOI to a resource, visiting that resource, and seeing what the landing page yields.
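As a rough illustration of this first approach, the sketch below resolves a DOI through https://doi.org/ with httpx and records the final status code. The names Resolution, doi_to_url and resolve are hypothetical, and this is not the script that produced the figures in this post. httpx is a third-party package; it is imported inside resolve so that the URL helper can be used without it.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


@dataclass
class Resolution:
    """Outcome of one resolution attempt (hypothetical record type)."""
    doi: str
    status: Optional[int]      # final HTTP status, or None if the request failed
    final_url: Optional[str]   # landing page URL after all redirects
    error: Optional[str]       # description of a network-level failure


def doi_to_url(doi: str) -> str:
    """Build the resolver URL, percent-encoding characters unsafe in a path."""
    return "https://doi.org/" + quote(doi, safe="/")


def resolve(doi: str, timeout: float = 30.0) -> Resolution:
    """Follow the DOI's redirects and report where we ended up."""
    import httpx  # third-party: pip install httpx

    try:
        response = httpx.get(doi_to_url(doi), follow_redirects=True, timeout=timeout)
    except httpx.HTTPError as exc:
        # Timeouts, connection errors, too many redirects, and similar failures.
        return Resolution(doi, None, None, repr(exc))
    return Resolution(doi, response.status_code, str(response.url), None)
```

Calling resolve on each DOI in the sample and counting the records whose status is 200 would give a figure comparable to the HTTP 200 count reported below.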

Even this is not straightforward. Landing pages can be HTML resources or PDF files, among other things. In PDF files, detecting a run of text is not simple, as a single line break can be enough to foil our search. Nonetheless, using this strategy we find the following statistics:
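One way to soften the line-break problem is to normalise whitespace and case before searching. The sketch below is a minimal version of that idea, under the assumption that the "recommended format" for a DOI is the full https://doi.org/ link; the function names are hypothetical and the matching is deliberately crude.

```python
import re
import unicodedata


def normalise_text(text: str) -> str:
    """Collapse every run of whitespace (including line breaks) to one space."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().casefold()


def contains_title(page_text: str, title: str) -> bool:
    """True if the title appears on the page, ignoring case and line breaks."""
    return normalise_text(title) in normalise_text(page_text)


def contains_doi(page_text: str, doi: str) -> bool:
    """True if the DOI appears as a full https://doi.org/ link.

    All whitespace is removed first, so a DOI broken across two lines
    still matches. This can, in principle, produce false positives.
    """
    squeezed = re.sub(r"\s+", "", page_text).casefold()
    return ("https://doi.org/" + doi).casefold() in squeezed
```

For example, a title split over three lines of extracted PDF text would still be found, whereas a plain substring search would miss it.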

Total DOI count in sample: 5000
Number of HTTP 200 responses: 3301*
Percentage of HTTP 200 responses: 66.02%
Number of titles found on landing page: 1580
Percentage of titles found on landing page: 31.60%
Number of DOIs in recommended format found on landing page: 1410
Percentage of DOIs in recommended format found on landing page: 28.20%
Number of titles and DOIs found on landing page: 929
Percentage of titles and DOIs found on landing page: 18.58%
Number of PDFs found on landing page: 1469
Percentage of PDFs found on landing page: 29.38%
Percentage of PDFs found on landing pages that loaded: 44.50%

* an HTTP 200 response means that the web page loaded correctly
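For readers who want to see how a report of this shape can be derived, here is a small sketch that tallies per-page results into counts and percentages. The PageResult record and the report labels are assumptions made for illustration, not the structures used in the actual crawl.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PageResult:
    """What we learned from one landing page (hypothetical record type)."""
    status: int
    title_found: bool
    doi_found: bool


def percentage(count: int, total: int) -> float:
    """Share of the sample, as a percentage; 0.0 for an empty sample."""
    return 100.0 * count / total if total else 0.0


def summarise(results: List[PageResult]) -> Dict[str, Tuple[int, float]]:
    """Map each report line to a (count, percentage of sample) pair."""
    total = len(results)
    counts = {
        "HTTP 200 responses": sum(r.status == 200 for r in results),
        "titles found": sum(r.title_found for r in results),
        "DOIs found": sum(r.doi_found for r in results),
        "titles and DOIs found": sum(r.title_found and r.doi_found for r in results),
    }
    return {label: (n, percentage(n, total)) for label, n in counts.items()}
```

Every percentage uses the full sample as its denominator, which is how 3301 successful responses out of 5000 DOIs becomes 66.02%.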

While these numbers look quite low, the problem here is that a large number of scholarly publishers use Digital Rights Management techniques on their sites that block a crawl of this type. We can use systems like Playwright to remote-control browsers to do the crawling, so that each request looks as much like a genuine user as possible and evades such detection systems. However, lots of these sites detect headless browsers (where the browser is invisible and running on a server) and block them with an HTTP 403 Forbidden error.
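To make the browser-driven approach concrete, here is a minimal sketch using Playwright's synchronous Python API. The function names and the set of status codes treated as a block are assumptions for illustration; Playwright is a third-party package and is imported inside fetch_status.

```python
from typing import Optional

# Assumption: statuses we treat as "the site refused the crawler".
BLOCKED_STATUSES = {403, 429}


def looks_blocked(status: Optional[int]) -> bool:
    """True if the status suggests an anti-bot block rather than a dead link."""
    return status in BLOCKED_STATUSES


def fetch_status(url: str, headless: bool = True) -> Optional[int]:
    """Load a URL in Chromium and return the HTTP status of the final response.

    Navigation errors (timeouts, DNS failures) are left to propagate.
    """
    from playwright.sync_api import sync_playwright  # third-party

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        try:
            page = browser.new_page()
            response = page.goto(url, wait_until="domcontentloaded")
            return response.status if response else None
        finally:
            browser.close()
```

Passing headless=False opens a visible browser window, which is the only difference between the two modes in this sketch.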

There’s a great JavaScript suite on GitHub that aims to help evade headless detection. The tests it uses are:

User Agent: in a browser running with Puppeteer in headless mode, the user agent includes Headless.
App Version: same as User Agent above.
Plugins: headless browsers don’t have any plugins. So we can say that if it has plugins it’s headful, but not the reverse, since some browsers, like Firefox, don’t have default plugins.
Plugins Prototype: check if the Plugin and PluginArray prototypes are correct.
Mime Type: similar to the Plugins test, where headless browsers don’t have any mime types.
Mime Type Prototype: check if the MimeType and MimeTypeArray prototypes are correct.
Languages: all headful browsers have at least one language. So we can say that if it has no language it’s headless.
Webdriver: this property is true when running in a headless browser.
Time elapse: it pops an alert() on the page and, if it’s closed too fast, that means it’s headless.
Chrome element: it’s specific to the Chrome browser, which has an element window.chrome.
Permission: in headless mode Notification.permission and navigator.permissions.query report contradictory values.
Devtool: Puppeteer works on the devtools protocol; this test checks if devtool is present or not.
Broken Image: all browsers have a default nonzero broken image size, and this may not happen on a headless browser.
Outer Dimension: the attributes outerHeight and outerWidth have value 0 on a headless browser.
Connection Rtt: the attribute navigator.connection.rtt, if present, has value 0 on a headless browser.
Mouse Move: the attributes movementX and movementY on every MouseEvent have value 0 on a headless browser.
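The real tests run as JavaScript inside the browser. Purely to illustrate the logic of a few of them, the sketch below re-expresses five checks in Python over a dictionary of values that we assume has already been collected from the page. The dictionary keys and the function name are hypothetical and are not part of the suite.

```python
from typing import Any, Dict, List


def headless_signals(fingerprint: Dict[str, Any]) -> List[str]:
    """Return the names of the checks that suggest a headless browser."""
    signals = []
    if "Headless" in fingerprint.get("user_agent", ""):
        signals.append("User Agent")
    if fingerprint.get("webdriver") is True:
        signals.append("Webdriver")
    if not fingerprint.get("languages"):
        # A missing or empty language list is treated as "no language".
        signals.append("Languages")
    if fingerprint.get("outer_width") == 0 and fingerprint.get("outer_height") == 0:
        signals.append("Outer Dimension")
    if fingerprint.get("rtt") == 0:
        # Only fires when the attribute is present and equal to zero.
        signals.append("Connection Rtt")
    return signals
```

An empty result does not prove that a browser is headful; it only means that none of these particular checks fired.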

Using the stealth plugin for Playwright also allows us to evade most of these checks. This just leaves Mouse Move and Broken Image detection, which I thought would not outweigh all the other factors. We can also jitter the connection with arbitrary delays so that it should appear to be coming at random intervals, rather than a robotic crawl.
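The jitter itself is simple to sketch. The generator below pauses for a random interval between items; the delay bounds are arbitrary assumptions, and the sleep function and random source can be swapped out, which makes the behaviour easy to test without waiting.

```python
import random
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def jittered(
    items: Iterable[T],
    min_delay: float = 2.0,
    max_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Iterator[T]:
    """Yield items one by one, pausing a random interval between them."""
    rng = rng or random.Random()
    for index, item in enumerate(items):
        if index:  # no pause before the first item
            sleep(rng.uniform(min_delay, max_delay))
        yield item
```

A crawl loop would then read `for doi in jittered(dois): ...`, where dois is whatever list of identifiers is being checked.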

Yet the basic fact is that we are still blocked from crawling many sites. This does not happen when we put the browser into headful mode, so current detection techniques have clearly evolved in the past half decade (since Detect Headless was designed).

If, however, we run the browser in a headful mode, the results are somewhat stunningly different:

Total DOI count in sample: 5000
Number of HTTP 200 responses: 4852
Percentage of HTTP 200 responses: 97.04%
Number of titles found on landing page: 2547
Percentage of titles found on landing page: 50.94%
Number of DOIs in recommended format found on landing page: 2424
Percentage of DOIs in recommended format found on landing page: 48.48%
Number of titles and DOIs found on landing page: 1574
Percentage of titles and DOIs found on landing page: 31.48%
Number of PDFs found on landing page: 2085
Percentage of PDFs found on landing page: 41.70%
Percentage of PDFs found on landing pages that loaded: 42.97%
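To put the two runs side by side, the sketch below recomputes each rate from the counts reported above and returns the change in percentage points. The counts are copied from the two lists; the dictionary keys and function names are just labels chosen for this example.

```python
from typing import Dict

TOTAL = 5000  # DOIs in the sample

HEADLESS = {"HTTP 200": 3301, "title": 1580, "DOI": 1410, "title and DOI": 929, "PDF": 1469}
HEADFUL = {"HTTP 200": 4852, "title": 2547, "DOI": 2424, "title and DOI": 1574, "PDF": 2085}


def rate(count: int, total: int = TOTAL) -> float:
    """Share of the sample, as a percentage."""
    return 100.0 * count / total


def compare(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Percentage-point change for each measure, rounded to two places."""
    return {key: round(rate(after[key]) - rate(before[key]), 2) for key in before}


if __name__ == "__main__":
    for measure, change in compare(HEADLESS, HEADFUL).items():
        print(f"{measure}: {change:+.2f} percentage points")
```

The largest change is in the HTTP 200 rate, which rises from 66.02% to 97.04% between the headless and headful runs.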

Let’s talk about the resolution statistics. Other studies, looking at general links on the web, have found a link-rot rate of about 60%-70% over a ten-year period (Lessig, Zittrain, and Albert 2014; Stox 2022). Our DOI resolution rate, with 97% of links resolving (a link-rot rate of about 3%), is far more robust than that of web links in general.

Is 3% a good or a bad number? It’s more robust than the web in general, but it still means that for every 100 DOIs, just under 3 will fail to resolve. We also cannot tell whether these DOIs are resolving to the correct target, except by using the metadata detection metrics (whether the title and DOI appear on the landing page), which we could only detect at a far lower rate. It is entirely possible for a website to resolve with an HTTP 200 (OK) response, but for the page in question to be something very different from what the user expected, a phenomenon dubbed content drift. A good example is domain hijacking, where a domain name expires and a spam company buys it up. The domain still resolves to a web page, but instead of an article on RNA, for a hypothetical example, the user gets adverts for rubber welding hose. That said, other studies are also prone to this, and there is no guarantee that content drift doesn’t affect a huge proportion of supposedly good links in those studies, too.

Of course, one of the most frustrating elements of this exercise is having to work around publisher blocks on content when visiting using a server-only robot script. It’s important for us to monitor the uptime rate of the DOI system periodically. We also recognise, though, that publishers want to block malicious traffic. However, we can’t perform our monitoring in an easy, automatic way if headless scripts are blocked from resolving DOIs and visiting their respective landing pages. This is not even a call for open access; it’s just saying that current anti-bot techniques, sometimes implemented for legitimate reasons, stifle our ability to know the landscape. Even if the bot resolved a DOI to just a paywall, it would be easier for us to monitor this than it is now. Similarly, CAPTCHA systems such as Cloudflare’s, which would seem to offer an easy way to distinguish between humans (good) and robots (bad), can make life very difficult at the monitoring end. We would certainly be grateful for any proposed solution that could help us to work around these mechanisms.

Conclusion

The context in which I wanted to know this information was so that we can take a snapshot of a page and then, at a later stage, determine whether it is down or has changed substantially. To do this, we are developing Shelob, an experimental content drift spider system; that’s what we’ve used so far to conduct this analysis. Over time, Shelob will evolve, we hope, to give us a way to detect when content has drifted or gone offline. If, however, we can’t detect whether an endpoint is good in the first place, then we likewise cannot detect when things have gone wrong. On the other hand, if, when we first visit, we find the DOI and title on the landing page, but at some future point this degrades, we might be able to say with some confidence that the original has died. I, personally, would encourage publishers not to block automated crawlers, because it’s good when we can determine these types of figures.
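The snapshot-and-compare idea can be sketched with Python's standard difflib module. This is not Shelob's actual method, only a simple stand-in: it compares the word sequences of two snapshots and flags drift when their similarity falls below a threshold. The threshold of 0.5 and the function names are arbitrary assumptions.

```python
import difflib
import re
from typing import List


def words(text: str) -> List[str]:
    """Lower-cased words, with all whitespace differences ignored."""
    return re.sub(r"\s+", " ", text).strip().casefold().split(" ")


def similarity(old: str, new: str) -> float:
    """Word-level similarity between two snapshots, from 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, words(old), words(new)).ratio()


def has_drifted(old: str, new: str, threshold: float = 0.5) -> bool:
    """True if the new snapshot differs substantially from the old one."""
    return similarity(old, new) < threshold
```

A reflowed copy of the same page scores 1.0, while a landing page replaced by unrelated adverts shares few or no words with the original and falls well below the threshold.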

Works Cited

Lessig, Lawrence, Jonathan Zittrain, and Kendra Albert. 2014. ‘Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations’. Harvard Law Review 127 (4). https://harvardlawreview.org/forum/vol-127/perma-scoping-and-addressing-the-problem-of-link-and-reference-rot-in-legal-citations/.

Stox, Patrick. 2022. ‘Ahrefs Study on Link Rot’. SEO Blog by Ahrefs. 29 April 2022. https://ahrefs.com/blog/link-rot-study/.
