Exploiting the content of the HathiTrust, continued

This blog posting describes how I created a set of MARC records representing public domain content that is in both the University of Notre Dame’s collection as well as in the HathiTrust.


In a previous posting I described how I learned about the amount of overlap between my library’s collection and the ‘Trust. There is about a 33% overlap. In other words, about one out of every three books owned by the Hesburgh Libraries has also been digitized and deposited in the ‘Trust. I wondered how our collections and services could be improved if hypertext links between our catalog and the ‘Trust could be created.

In order to create links between our catalog and the ‘Trust, I needed to identify overlapping titles and their remote ‘Trust URLs. Because OCLC originally wrote the report that started the whole thing, they had to have the necessary information. Consequently I got in touch with the author of the original OCLC report (Constance Malpas), who in turn sent me a list of Notre Dame holdings complete with the most rudimentary of bibliographic data. We then had a conference call among ourselves and two others: Roy Tennant from OCLC and Lisa Stienbarger from Notre Dame. As a group we discussed the challenges of creating an authoritative overlap list. While we all agreed the creation of links would be beneficial to my local readers, we also agreed to limit what gets linked, specifically to public domain items associated with single digitized items. Links to copyrighted materials were deemed more useless than useful: one can’t download the content, and searching the content is limited. Similarly, any OCLC number (the key I planned to use to identify overlapping materials) can be associated with more than one digitized item, which raises the question, “To which digitized item should I link?” Trying to programmatically disambiguate between one digitized item and another was seen as too difficult to handle at the present time.

The hacking

I then read the HathiTrust Bib API documentation, and I learned the API was simple. Construct a URL denoting the type of control number one wants to search with, as well as whether full or brief output is desired. (Full output is just like brief output except that it also includes a stream of MARCXML.) Send the URL off to the ‘Trust and get back a JSON stream of text. The programmer is then expected to read, parse, and analyze the result.
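My own scripts were written in Perl, but the request is simple enough that a short Python sketch can illustrate it. The URL pattern follows the Bib API documentation (control-number type and brief/full output are encoded in the path); treat the example OCLC number and the shape of the reply as illustrative.

```python
import json
from urllib.request import urlopen

def bib_api_url(control_number, level="full", id_type="oclc"):
    """Build a HathiTrust Bib API URL for a single control number.
    level is "brief" or "full"; "full" adds a stream of MARCXML to
    each record in the JSON reply."""
    return (f"https://catalog.hathitrust.org/api/volumes/"
            f"{level}/{id_type}/{control_number}.json")

def fetch_volumes(control_number):
    """Send the URL off to the 'Trust and parse the JSON reply.
    The reply contains bibliographic records and a list of the
    digitized items associated with them."""
    with urlopen(bib_api_url(control_number)) as response:
        return json.load(response)

# Example (requires network access); 424023 is an arbitrary OCLC number:
# data = fetch_volumes(424023)
```

The programmer then reads, parses, and analyzes the returned JSON, as described above.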

Energized with a self-imposed goal, I ran off to my text editor to hack a program. Given the list of OCLC numbers provided by OCLC, I wrote a Perl program that queries the ‘Trust for a single record at a time. I then made sure the resulting record was: 1) denoted as in the public domain, 2) published prior to 1924, and 3) associated with a single digitized item. When a record matched these criteria, I wrote the OCLC number, the title, and the ‘Trust URL pointing to the digitized item to a tab-delimited file. After looping through all the records, I identified about 25,000 items fitting my criteria. I then wrote another program which looped through the 25,000 items and created a local MARC file describing each item, complete with its remote HathiTrust URL. (Both of my scripts, filter-pd.pl and get-marcxml.pl, can be used by just about any library. All you need is a list of OCLC numbers.) It is now possible for us here at Notre Dame to pour these MARC records into our catalog or “discovery system”. Doing so is not always straightforward, so I’ll leave that work to others.
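The three-part filter at the heart of filter-pd.pl can be sketched in a few lines of Python. This is a sketch under stated assumptions, not a copy of my Perl: the JSON field names ("items", "rightsCode", "itemURL") and the "pd" rights value are my reading of the Bib API's output, and the publication year is passed in separately since it comes from the bibliographic record rather than the item list.

```python
def select_single_pd_item(volumes, pub_year):
    """Apply the three filters described above to one Bib API reply:
    the record must be public domain, published prior to 1924, and
    associated with exactly one digitized item. Returns the item's
    URL, or None if the record fails any test. Field names are
    assumptions based on the Bib API's JSON output."""
    items = volumes.get("items", [])
    if len(items) != 1:
        # More than one digitized item: "To which should I link?"
        return None
    item = items[0]
    if item.get("rightsCode") != "pd":
        # Not denoted as public domain; skip copyrighted material.
        return None
    if pub_year >= 1924:
        return None
    return item.get("itemURL")
```

A record that passes yields the ‘Trust URL written to the tab-delimited file; everything else is silently dropped.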

What I learned

This process has been interesting. I learned that a lot of our library’s content exists in digital form, and that copyright is getting in the way of making it as useful as it could be. I learned about the feasibility of improving our library collections and services by linking between our catalog and remote repositories. The feasibility is high, but the process of implementation is not straightforward. I learned how to programmatically query the HathiTrust; it is simple and easy to use. And I learned how the process of mass digitization has been a boon as well as a bit of a bust: the result is sometimes ambiguous.

It is now our job as librarians to figure out how to exploit this environment and fulfill our mission at the same time. Hopefully, this posting will help somebody else take the next step.

3 Responses to “Exploiting the content of the HathiTrust, continued”

  1. Chris says:

    Why pre-1924? If you were trying to omit govdocs, it would be easier to cull those by 008 value.

  2. @Chris, exactly — to eliminate government documents. The 008 field is not in the metadata I am getting from the ‘Trust’s API. Good question.