Yet more about HathiTrust items
Posted on September 14, 2012 in Uncategorized by Eric Lease Morgan
This directory includes the files necessary to determine what downloadable public domain items in the HathiTrust are also in the Notre Dame collection.
In previous postings I described some investigations regarding HathiTrust and Notre Dame collections. [1, 2, 3] Just yesterday I got back from a HathiTrust meeting and learned that even the Google-digitized items in the public domain are not really downloadable without signing some sort of contract.
Consequently, I downloaded a very large list of 100% downloadable public domain items from the HathiTrust (pd.xml). I then extracted the identifiers from the list using a stylesheet (pd.xsl). The result is pd.txt. Starting with my local MARC records created from the blog postings (nd.marc), I wrote a Perl script (nd.pl) to extract all the identifiers (nd.txt). Lastly, I computed the intersection of the two lists using a second Perl script (compare.pl), resulting in a third text file (both.txt). The result is a list of items that are in the public domain in the HathiTrust, are also in the collection here at Notre Dame, and require no disambiguation because each item has not been digitized more than once. (“Confused yet?”)
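For readers who want to try the comparison step themselves, here is a minimal sketch of the intersection in Python. It is not the compare.pl script mentioned above. It assumes that pd.txt and nd.txt each hold one identifier per line, which the post does not state, and the function names are my own invention for illustration.

```python
def load_ids(lines):
    """Return the set of non-empty, whitespace-stripped identifiers.

    Assumes one identifier per line (an assumption, not stated in the post).
    """
    return {line.strip() for line in lines if line.strip()}


def intersect_ids(pd_lines, nd_lines):
    """Return a sorted list of the identifiers found in both inputs."""
    return sorted(load_ids(pd_lines) & load_ids(nd_lines))


def write_intersection(pd_path, nd_path, out_path):
    """Read two identifier files and write their intersection to out_path."""
    with open(pd_path) as pd_file, open(nd_path) as nd_file:
        both = intersect_ids(pd_file, nd_file)
    with open(out_path, "w") as out_file:
        for identifier in both:
            out_file.write(identifier + "\n")
    return both
```

Calling write_intersection("pd.txt", "nd.txt", "both.txt") would then produce something like both.txt, provided the two input files really are laid out as assumed. Using sets keeps the comparison fast even for a very large list.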
It is now possible to download an entire digitized book through the HathiTrust Data API via a Web form. [4] Or you can use something like the following URL:
http://babel.hathitrust.org/cgi/htd/aggregate/<ID>
where <ID> is a HathiTrust identifier. For example:
http://babel.hathitrust.org/cgi/htd/aggregate/mdp.39015003700393
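The URL pattern above can be filled in programmatically for a whole list of identifiers. The following is a small Python sketch under a few assumptions: the function names are hypothetical, the identifiers are appended as-is with no URL encoding, and nothing here actually contacts the HathiTrust. Whether identifiers containing unusual characters need encoding is not covered in this post.

```python
# Base of the aggregate URL, taken from the pattern shown in the post.
BASE_URL = "http://babel.hathitrust.org/cgi/htd/aggregate/"


def aggregate_url(identifier):
    """Build the aggregate URL for one HathiTrust identifier.

    The identifier is appended unchanged (no URL encoding), which is an
    assumption made for this sketch.
    """
    return BASE_URL + identifier.strip()


def aggregate_urls(identifiers):
    """Build URLs for every non-empty identifier, preserving input order."""
    return [aggregate_url(i) for i in identifiers if i.strip()]
```

For example, aggregate_url("mdp.39015003700393") returns the example URL given above, and aggregate_urls could be fed the lines of a file such as both.txt.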
Of the approximately 20,000 items previously “freely” available, it seems that there are now just over 2,000. In other words, about 18,000 of the items I previously thought were freely available for our catalog are not really “free”; permissions still need to be garnered in order to get them.
I swear we are presently creating a Digital Dark Age!