Text and data mining for researchers

Our API allows researchers to easily harvest full-text documents from all participating members, regardless of whether the content is open access or subscription. The member is responsible for delivering the full-text content requested, so open access content can simply be delivered, while subscription content is available through access control systems.

To mine our metadata, you should have a list of DOIs for the content you want to download, and a safelist of licenses that you accept. You can get a list of DOIs from citations, our metadata search, our metadata API, or another source.

For each DOI, you should:

Use content negotiation to get the metadata for the DOI
Check to see if the DOI has license and full-text details in its metadata
Check the license against your safelist of acceptable licenses
If you agree to the license, follow the link and download the full-text of the content item.

The absence of a license does not mean that the full-text can be used without one. Members should deposit both the license and the full-text link at the same time.
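The per-DOI decision described above can be sketched in a few lines of Python. This is only an illustration: the function name `fulltext_links_to_fetch` and the `example.org` URL are hypothetical, the license URLs and full-text links are assumed to have been extracted from the metadata already, and requiring every listed license to be on the safelist is a conservative assumption rather than a rule of the API.

```python
def fulltext_links_to_fetch(license_urls, fulltext_links, safelist):
    """Return the full-text links that may be downloaded for one DOI.

    license_urls and fulltext_links are assumed to come from the DOI's
    metadata; safelist is the set of license URLs you have accepted.
    """
    # No license or no full-text link in the metadata: nothing to fetch.
    if not license_urls or not fulltext_links:
        return []
    # Assumption: proceed only if every license on the DOI is acceptable.
    if all(url in safelist for url in license_urls):
        return list(fulltext_links)
    return []


safelist = {"http://creativecommons.org/licenses/by/3.0/deed.en_US"}
links = fulltext_links_to_fetch(
    ["http://creativecommons.org/licenses/by/3.0/deed.en_US"],
    ["http://example.org/fulltext/10.5555/515151.pdf"],  # hypothetical link
    safelist,
)
print(links)
```

A DOI with no license in its metadata yields an empty list, which matches the point that a missing license does not grant permission.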

Watch a basic introduction or a more detailed presentation on how to perform TDM using our API.

Example using the cURL utility

You should be able to integrate the API with your TDM software very easily.

Step 1

Fetch the metadata: at its simplest, you can issue an HTTP GET request using a Crossref DOI and use DOI content negotiation. For example, the following cURL command will retrieve the metadata for the DOI 10.5555/515151:

curl -L -iH "Accept: application/vnd.crossref.unixsd+xml" http://0-dx-doi-org.pugwash.lib.warwick.ac.uk/10.5555/515151

This will return the metadata for the specified DOI, as well as a link header which points to several representations of the full-text on the member’s site:

HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk/fulltext/10.5555/515151.pdf>; rel="http://0-id-crossref-org.pugwash.lib.warwick.ac.uk/schema/fulltext"; type="application/pdf", <http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk/fulltext/10.5555/515151.xml>; rel="http://0-id-crossref-org.pugwash.lib.warwick.ac.uk/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: application/vnd.crossref.unixsd+xml;charset=utf-8

Access this full-text link information using Ruby:

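require 'open-uri'
# URI.open is used because a bare open() no longer accepts URLs in Ruby 3.0 and later
r = URI.open("http://0-dx-doi-org.pugwash.lib.warwick.ac.uk/10.5555/515151", "Accept" => "application/vnd.crossref.unixsd+xml")
# Response header names in r.meta are lower-cased
puts r.meta['link']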

Access this full-text link information using Python:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('Accept', 'application/vnd.crossref.unixsd+xml')]
r = opener.open('http://0-dx-doi-org.pugwash.lib.warwick.ac.uk/10.5555/515151')
print(r.info()['Link'])
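The link header value is a single comma-separated string, so it usually needs to be split into individual links before use. The following is a minimal sketch of one way to do that with a regular expression. The function name `parse_link_header` is hypothetical, and the parser assumes every parameter value is double-quoted, as in the example response above; it is not a complete implementation of the Link header grammar.

```python
import re

# One link: <url> followed by zero or more ;name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(value):
    """Split a Link header value into a list of dicts (url, rel, type, ...)."""
    links = []
    for url, params in LINK_PATTERN.findall(value):
        entry = {"url": url}
        entry.update(PARAM_PATTERN.findall(params))  # (name, value) pairs
        links.append(entry)
    return links


# Link header value copied from the example response above.
header = (
    '<http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk'
    '/fulltext/10.5555/515151.pdf>; '
    'rel="http://0-id-crossref-org.pugwash.lib.warwick.ac.uk/schema/fulltext"; '
    'type="application/pdf", '
    '<http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk'
    '/fulltext/10.5555/515151.xml>; '
    'rel="http://0-id-crossref-org.pugwash.lib.warwick.ac.uk/schema/fulltext"; '
    'type="application/xml"'
)
links = parse_link_header(header)
pdf_urls = [link["url"] for link in links if link.get("type") == "application/pdf"]
print(pdf_urls)
```

Filtering on the `type` parameter lets a tool pick the representation it can process, for example XML rather than PDF.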

Access this full-text link information using R:

library(httr)
r = content(GET('http://0-dx-doi-org.pugwash.lib.warwick.ac.uk/10.5555/515151', add_headers(Accept = 'application/vnd.crossref.unixsd+xml')))
r

If present, the full-text URL will also be returned in the metadata for the DOI. For instance, in our unixref schema, you would also see this in the returned metadata:

http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk/fulltext/10.5555/515151.pdf
http://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk/fulltext/10.5555/515151.xml

Step 2

Deciding what to do. Members who enable mining through us need to register a stable license URL using the <license_ref> element. For example, this unixref extract shows that the DOI is licensed under the Creative Commons CC-BY license:

<license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>

But this shows that the DOI is licensed under a member’s proprietary license:

<license_ref>http://www.annalsofpschoceramics.org/art_license.html</license_ref>

The license that the URL points to does not have to be machine-readable. Check the license against your safelist. If you agree to it, you can proceed. If you don’t agree to it, or have not seen it before, put it in a list of licenses to review later, and add it to your safelist (or blocklist) once you have reviewed it.
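The bookkeeping described here can be sketched as a small triage function. The names `triage_license`, `rejected`, and `review_queue` are hypothetical, and the three outcomes are just one way of organising the lists mentioned above.

```python
def triage_license(license_url, safelist, rejected, review_queue):
    """Classify a license URL as 'accept', 'reject', or 'review'."""
    if license_url in safelist:
        return "accept"
    if license_url in rejected:
        return "reject"
    # Unknown license: queue it once for a human to read later.
    if license_url not in review_queue:
        review_queue.append(license_url)
    return "review"


safelist = {"http://creativecommons.org/licenses/by/3.0/deed.en_US"}
rejected = set()
review_queue = []

first = triage_license(
    "http://creativecommons.org/licenses/by/3.0/deed.en_US",
    safelist, rejected, review_queue,
)
second = triage_license(
    "http://www.annalsofpschoceramics.org/art_license.html",
    safelist, rejected, review_queue,
)
print(first, second, review_queue)
```

After a person has read a queued license, its URL can be moved into either `safelist` or `rejected` so that later DOIs under the same license are handled automatically.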

If a content item is under embargo, a slight complication arises: the member can use a start_date attribute on the <license_ref> element. In this example, the content item is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:

<license_ref start_date="2013-02-03">https://0-www-crossref-org.pugwash.lib.warwick.ac.uk/license</license_ref>
<license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>

TDM tools can easily use a combination of the <license_ref> element(s) and the start_date attribute to determine if the content item is currently under embargo.
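One possible way to make that determination is sketched below. It assumes the `start_date` values have already been read from the metadata as ISO dates (or `None` when the attribute is absent), and that the license in force is the one with the latest start date that is not in the future. The function name `active_license` is hypothetical.

```python
from datetime import date


def active_license(license_refs, today):
    """Return the license URL in force on `today`, or None if none applies.

    license_refs is a list of (start_date, url) pairs, where start_date is
    an ISO date string such as "2013-02-03" or None if no date was given.
    """
    current_url = None
    current_start = None
    for start, url in license_refs:
        # Assumption: a license without a start_date applies from the outset.
        start_day = date.fromisoformat(start) if start else date.min
        if start_day > today:
            continue  # this license has not started yet
        if current_start is None or start_day >= current_start:
            current_url, current_start = url, start_day
    return current_url


# The two license_ref elements from the embargo example above.
refs = [
    ("2013-02-03", "https://0-www-crossref-org.pugwash.lib.warwick.ac.uk/license"),
    ("2014-02-03", "http://creativecommons.org/licenses/by/3.0/deed.en_US"),
]
during_embargo = active_license(refs, date(2013, 6, 1))
after_embargo = active_license(refs, date(2014, 6, 1))
print(during_embargo)
print(after_embargo)
```

The URL returned for the current date can then be checked against the safelist in the usual way.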

If you are not interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the link header without the rest of the DOI record.

Step 3

Fetching the full-text: you can now perform a standard GET request on the URL to download the full-text from the member’s site. Because the bulk downloading of large amounts of data may put a strain on the member’s servers, we have defined a set of rate-limiting HTTP headers. You are not obliged to test for and act on these headers, and not all members will use them, but doing so will avoid surprises.

An example session using rate limiting

curl -k "https://0-annalsofpsychoceramics-labs-crossref-org.pugwash.lib.warwick.ac.uk/fulltext/515151" -D - -L -O

HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-TDM-Rate-Limit: 5
CR-TDM-Rate-Limit-Remaining: 4
CR-TDM-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf
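A client could act on the CR-TDM headers in a session like this one by pausing before its next request once the remaining allowance reaches zero. The sketch below is an assumption-laden illustration: it treats CR-TDM-Rate-Limit-Reset as a Unix timestamp in seconds (which is consistent with the example, where the value falls about a minute after the response date), and the function name `seconds_to_wait` is hypothetical.

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next full-text request.

    headers is a mapping of response header names to string values.
    """
    if now is None:
        now = time.time()
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    # Not all members send these headers; without them, do not wait.
    if remaining is None or reset is None:
        return 0.0
    if int(remaining) > 0:
        return 0.0
    # Assumption: the reset value is a Unix timestamp in seconds.
    return max(0.0, float(reset) - now)


# Header values from the example session; 1375427453 corresponds to the
# response date, Fri, 02 Aug 2013 07:10:53 GMT.
example = {
    "CR-TDM-Rate-Limit": "5",
    "CR-TDM-Rate-Limit-Remaining": "4",
    "CR-TDM-Rate-Limit-Reset": "1375427514",
}
wait_now = seconds_to_wait(example, now=1375427453)
exhausted = dict(example, **{"CR-TDM-Rate-Limit-Remaining": "0"})
wait_when_exhausted = seconds_to_wait(exhausted, now=1375427453)
print(wait_now, wait_when_exhausted)
```

In a download loop, the returned value would typically be passed to `time.sleep` before the next GET request.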

Problems accessing full-text URLs using our API

If you are having trouble accessing the full-text URLs returned to you in the link header, this may be because:

You have hit a rate limit (learn more about rate-limiting headers)
You are trying to access content from a publisher that requires you to accept a TDM license; consider modifying your tools to work with such publishers’ licenses.
