CrossRef’s Text and Data Mining (TDM) API
Posted on June 11, 2014 in Uncategorized by Eric Lease Morgan
A few weeks ago I learned that CrossRef’s Text And Data Mining (TDM) API had reached version 1.0, and this blog posting describes my tertiary experience with it.
A number of months ago I learned about Prospect, a fledgling API being developed by CrossRef. Its purpose was to facilitate direct access to full text journal content without going through the hassle of screen scraping journal article splash pages. Since then the API has been upgraded to version 1.0 and renamed the Text And Data Mining API. This is how the API is expected to be used:
- Given a (CrossRef) DOI, resolve the DOI using HTTP content negotiation. Specifically, request text/turtle output.
- From the response, capture the HTTP header named “Link” (called the links header in this posting).
- Parse the links header to extract URIs denoting full text, licenses, and people.
- Make choices based on the values of the URIs.
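The first two steps can be sketched in a few lines of Python. This is only an illustration, not the code behind the scripts described in this posting. The resolver address (https://doi.org/), the function names, and the placeholder DOI are assumptions, and whether a given DOI returns a links header at all depends on the publisher.

```python
import urllib.request
from urllib.parse import quote

# Assumption: the public DOI resolver; not taken from the original posting.
RESOLVER = "https://doi.org/"


def build_request(doi):
    """Build a request that asks for text/turtle via HTTP content negotiation."""
    url = RESOLVER + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "text/turtle"})


def fetch_link_header(doi, timeout=30):
    """Resolve the DOI and return the Link header(s) as one string ('' if absent)."""
    with urllib.request.urlopen(build_request(doi), timeout=timeout) as response:
        # A server may send several Link headers; join them into one value.
        return ", ".join(response.headers.get_all("Link") or [])


if __name__ == "__main__":
    # Placeholder DOI for illustration only; substitute a real CrossRef DOI.
    print(fetch_link_header("10.5555/12345678"))
```

The request is built separately from the network call so the content negotiation part can be inspected without going online.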
What sorts of choices is one expected to make? Good question. First and foremost, a person is supposed to evaluate the license URI. If the URI points to a palatable license, then they may want to download the full text, which seems to come in PDF and/or XML flavors. With version 1.0 of the API, I have discovered that ORCID identifiers are included in the header. I believe these denote authors/contributors of the articles.
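Here is one way the license check might look in Python, assuming the links header has already been parsed into a list of dictionaries with "uri", "rel", and "type" keys. The set of acceptable licenses, the preference for XML over PDF, and the function name are my own assumptions for the sake of illustration; what counts as palatable is a local policy decision.

```python
# rel values as they appear in the links header
FULLTEXT_REL = "http://id.crossref.org/schema/fulltext"
LICENSE_REL = "http://id.crossref.org/schema/license"

# Assumption: a local policy listing the licenses considered palatable.
ACCEPTABLE_LICENSES = {"http://creativecommons.org/licenses/by/3.0/"}


def choose_full_text(links, acceptable=ACCEPTABLE_LICENSES,
                     preferred_types=("application/xml", "application/pdf")):
    """Return a full text URI if every listed license is acceptable, else None."""
    licenses = {link["uri"] for link in links if link.get("rel") == LICENSE_REL}
    # Be conservative: no license, or any unknown license, means no download.
    if not licenses or not licenses <= set(acceptable):
        return None
    fulltext = [link for link in links if link.get("rel") == FULLTEXT_REL]
    for wanted in preferred_types:
        for link in fulltext:
            if link.get("type") == wanted:
                return link["uri"]
    return None
```

The function returns nothing when no license is listed, on the theory that an unstated license is not a palatable one.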
Again, all of this is based on the content of the HTTP links header. Here is an example header, with carriage returns added for readability:
<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.pdf>; rel="http://id.crossref.org/schema/fulltext"; type="application/pdf"; version="vor",
<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.xml>; rel="http://id.crossref.org/schema/fulltext"; type="application/xml"; version="vor",
<http://creativecommons.org/licenses/by/3.0/>; rel="http://id.crossref.org/schema/license"; version="vor",
<http://orcid.org/0000-0002-8443-5196>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0002-0987-9651>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0003-4669-8769>; rel="http://id.crossref.org/schema/person"
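A header like the one above can be taken apart with a couple of regular expressions. The following Python sketch is not extractor.pl; the function name and the patterns are my own, and it assumes every parameter value is wrapped in double quotes, as in the example. A general-purpose parser would need to handle more cases.

```python
import re

# One link: <URI> followed by zero or more ; name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_header(header):
    """Return a list of dicts, one per link, with 'uri' plus its parameters."""
    links = []
    for match in LINK_PATTERN.finditer(header):
        link = {"uri": match.group(1)}
        link.update(PARAM_PATTERN.findall(match.group(2)))
        links.append(link)
    return links


# Shortened copy of the example header, for illustration.
EXAMPLE = (
    '<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.pdf>; '
    'rel="http://id.crossref.org/schema/fulltext"; type="application/pdf"; version="vor", '
    '<http://creativecommons.org/licenses/by/3.0/>; '
    'rel="http://id.crossref.org/schema/license"; version="vor", '
    '<http://orcid.org/0000-0002-8443-5196>; rel="http://id.crossref.org/schema/person"'
)

if __name__ == "__main__":
    for link in parse_link_header(EXAMPLE):
        print(link["rel"], link["uri"])
```

Each link becomes a small dictionary, so the rel value can be used to sort the URIs into full text, licenses, and people.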
I wrote a tiny Perl library, extractor.pl, to do steps #1 through #3, above. It returns a reference to a hash containing the values in the links header. I then wrote three Perl scripts which exploit the library:
- resolver.cgi – a Web-based application taking a DOI as input and returning the URIs in the links header, if they exist. Your mileage with the script will vary because most DOIs are not associated with full text URIs.
- search.cgi – given a simple query, use CrossRef’s Metadata API to find no more than five articles associated with full text content, and then resolve the links to the full text.
- search.pl – a command-line version of search.cgi
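The search step might be approximated in Python as follows. This is a sketch under assumptions, not a port of search.cgi: it assumes CrossRef's REST API at https://api.crossref.org/works, its has-full-text filter, and a JSON response whose records sit under message/items with a DOI field. The scripts above may well talk to a different endpoint.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: CrossRef's REST API endpoint for works.
API = "https://api.crossref.org/works"


def build_search_url(query, rows=5):
    """Build a query URL limited to records that advertise full text links."""
    params = {"query": query, "filter": "has-full-text:true", "rows": rows}
    return API + "?" + urlencode(params)


def extract_dois(payload):
    """Pull the DOIs out of a decoded JSON response (assumed message/items shape)."""
    return [item["DOI"] for item in payload.get("message", {}).get("items", [])]


def search(query, rows=5, timeout=30):
    """Run the search and return at most `rows` DOIs."""
    with urllib.request.urlopen(build_search_url(query, rows), timeout=timeout) as response:
        return extract_dois(json.load(response))
```

Each DOI returned this way would then be resolved with content negotiation to get at its links header.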
Here are a few comments. As a person who increasingly wants direct access to full text articles, I think the Text And Data Mining API is a step in the right direction. Now all that needs to happen is for publishers to get on board and feed CrossRef the URIs of full text content along with the associated licensing terms. I found the links header to be a bit convoluted, but this is what programming libraries are for. I could not find a comprehensive description of what name/value combinations can exist in the links header. For example, the documentation alludes to beginning and ending dates. CrossRef seems to have a growing number of interesting applications and APIs which are probably going unnoticed, and there is an opportunity of some sort lurking in there. Specifically, somebody ought to do something with the text/turtle (RDF) output of the DOI resolutions.
‘More fun with HTTP and bibliographics.