HathiTrust Research Center Perl Library

Posted on September 12, 2013 in Uncategorized by Eric Lease Morgan

Irises

This is the README file for a tiny library of Perl subroutines to be used against the HathiTrust Research Center (HTRC) application programmer interfaces (APIs). The GitHub distribution ought to contain a number of files, each briefly described below:

  • README.md – this file
  • LICENSE – a copy of the GNU Public License
  • htrc-lib.pl – our raison d’être; more below
  • search.pl – given a Solr query, return a list of no more than 100 HTRC identifiers
  • authorize.pl – given a client identifier and secret, return an authorization token
  • retrieve.pl – given a list of HTRC identifiers, return a zip stream of no more than 100 text and METS files
  • search-retrieve.pl – given a Solr query, return a zip stream of no more than 100 texts and METS files

The file doing the heavy lifting is htrc-lib.pl. It contains only three subroutines:

  1. search – given a Solr query, returns a list of no more than 100 HTRC identifiers
  2. obtainOAuth2Token – given a client ID and secret (supplied by the HTRC), return an authorization token; this token is expected to be included in the HTTP header of any HTRC Data API request
  3. retrieve – given a client ID, secret, and list of HTRC identifiers, return a zip stream of no more than 100 HTRC text and METS files

The library is configured at the beginning of the file with three constants:

  1. SOLR – a stub URL pointing to the location of the HTRC Solr index; this is also where you can change the number of search results that will be returned
  2. AUTHORIZE – the URL pointing to the authorization engine
  3. DATAAPI – the URL pointing to the HTRC Data API, specifically the API to get volumes
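
In Perl, such a configuration might look something like the following sketch; the URLs below are illustrative placeholders only, not the actual HTRC endpoints:

# configure; these URLs are placeholders -- substitute the real HTRC endpoints
use constant SOLR      => 'http://solr.example.org/select?q=';        # stub URL of the HTRC Solr index; edit to change the number of results
use constant AUTHORIZE => 'https://auth.example.org/oauth2/token';    # the authorization engine
use constant DATAAPI   => 'https://dataapi.example.org/volumes';      # the Data API's get-volumes service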

The other .pl files in this distribution are the simplest of scripts demonstrating how to use the library.
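
For example, a search-and-retrieve script built on top of the library (in the spirit of search-retrieve.pl, though not necessarily identical to it) might look something like this minimal sketch. The client identifier, secret, and query are placeholders, and the exact calling conventions are assumptions on my part; consult htrc-lib.pl itself for details:

#!/usr/bin/perl
# a minimal sketch of using the library; the key, secret, and query are placeholders
use strict;
use warnings;

require './htrc-lib.pl';

my $key    = 'MYCLIENTID';        # supplied by the HTRC
my $secret = 'MYCLIENTSECRET';    # supplied by the HTRC
my $query  = 'ocr:dog';           # any Solr query will do

# search, and then retrieve the corresponding zip stream
my @identifiers = &search( $query );
my $stream      = &retrieve( $key, $secret, @identifiers );

# save the stream for later use
open( my $out, '>', 'results.zip' ) or die "Can't create results.zip: $!";
binmode( $out );
print $out $stream;
close( $out );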

Be forewarned. The library does very little error checking, and there is no documentation beyond what you are reading here.

Before you will be able to use the obtainOAuth2Token and retrieve subroutines, you will need to acquire a client identifier and secret from the HTRC. These are required in order for the Center to track who is using their services.

The home page for the HTRC is http://www.hathitrust.org/htrc. From there you ought to be able to read more information about the Center and their supported APIs.

This software is distributed under the GNU Public License.

Finally, here is a suggestion of how to use this library:

  1. Use your Web browser to search the HTRC for content — https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/ or https://sandbox.htrc.illinois.edu:8443/blacklight — ultimately generating a list of HTRC identifiers.
  2. Programmatically feed the list of identifiers to the retrieve subroutine.
  3. “Inflate” the zip stream into its constituent text and METS files; see the sketch after this list.
  4. Do analysis against the result.
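
A sketch of step #3, assuming the zip stream has already been saved to a file named results.zip (as in the sketch above) and that the Archive::Zip module is available, might look like this:

#!/usr/bin/perl
# a minimal sketch of "inflating" a saved zip stream into its constituent text and METS files
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# open the previously saved zip stream
my $zip = Archive::Zip->new;
die "Can't read results.zip\n" unless ( $zip->read( 'results.zip' ) == AZ_OK );

# extract each member of the archive into the current directory
foreach my $member ( $zip->memberNames ) {
    print "Inflating $member...\n";
    $zip->extractMember( $member );
}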

I’m tired. That is enough for now. Enjoy.

Drive By Shared Data: A Travelogue

Posted on June 8, 2013 in Uncategorized by Eric Lease Morgan

Last Friday (May 31, 2013) I attended an interesting symposium at Northwestern University called Driven By Shared Data. This blog posting describes my experiences.

Driven By Shared Data was an OCLC-sponsored event with the purpose of bringing together librarians to discuss “opportunities and operational challenges of turning data into powerful analysis and purposeful action”. At first I thought the symposium was going to be about the curation of “research data”, but I was pleasantly surprised otherwise. The symposium was organized into a number of sections / presentations, each enumerated below:

  • Larry Birnbaum (Northwestern University) – Birnbaum’s opening remarks bordered on the topic of artificial intelligence. For a long time he has been interested in the problem of “find more like this one”. To address this problem, he often took initial queries sent to things like Google, syntactically altered the queries, and resubmitted them. Other times he looked at search results, did entity-extraction against them, looked for entities occurring less frequently, supplemented queries with these newly found entities, and repeated the search process. The result was usually a set of “interesting” search results — results that were not identical to the original but rather slightly askew. He also described and demonstrated a recommender service listing books of possible interest based on Twitter tweets. More recently he has been spending his time creating computer-generated narrative texts from sets of numeric data. For example, given the essential statistics from a baseball game, he and his colleagues have been able to generate newspaper stories describing the action of the game. “The game was tied until the bottom of the seventh inning when Bass Ball came to bat. Ball hit a double. Jim Hitter was up next, and blew one out of the park. The final score was three to one. Go home team!” What is the problem he is trying to solve? Rows and columns of data often do not make sense to the general reader. Illustrating the data graphically goes a long way to describing trends, but not everybody knows how to read graphs. Narrative texts supplement both the original data and graphical illustrations. His technique has been applied to all sorts of domains from business to medicine. This is interesting because many times people don’t want words but images instead. (“A picture is worth a thousand words.”) Birnbaum is generating a thousand words from both pictures and data sets. In the words of Birnbaum, “Stories make data meaningful.” Some of his work has been commercialized at a site called Narrative Science.
  • Deborah Blecic (University of Illinois at Chicago) – Blecic described how some of her collection development processes have changed with the availability of COUNTER statistics. She began by enumerating some of her older data sets: circulation counts, reshelving counts, etc. She then gave an overview of some of the data sets available from COUNTER: number of hits, number of reads, etc. Based on this new information she has determined how she is going to alter her subscription to “the Big Deal” when the time comes for changing it. She commented on the integrity of COUNTER statistics because they seem ambiguous. “What is a ‘read’? Is a read when a patron looks at the HTML abstract, or is a read when the patron downloads a PDF version of an article? How do the patrons identify items to ‘read’?” She is looking forward to COUNTER 4.
  • Dave Green (Northeastern Illinois University) – Green shared with the audience some of the challenges he has had when it came to dealing with data generated from an ethnography project. More specifically, Green is the Project Director for ERIAL. Through this project a lot of field work was done, and the data created was not necessarily bibliographic in nature. Examples included transcripts of interviews, cognitive maps, photographs, movies & videos, the results of questionnaires, etc. Being an anthropological study, the data was more qualitative than quantitative. After analyzing their data, they learned how students used their libraries’ spaces, and instructors learned how to take better advantage of library services.
  • Kim Armstrong (CIC) – On the other hand, Armstrong’s data was wholly bibliographic in nature. She is deeply involved in a project to centrally store older and lesser used books and journals owned by CIC libraries. It is hard enough to coordinate all the libraries in the project, but trying to figure out who owns what is even more challenging because of evolving and local cataloging practices. While everybody used the MARC record as a data structure, there is little consistency between libraries on how data gets put into each of the fields and subfields. “The folks at Google have much of our bibliographic data as a part of their Google Books Project, and even they are not able to write a ‘regular expression’ to parse serial holdings… The result is a ‘Frankenrun’ of journals.”
  • small group discussion – We then broke up into groups of five or six people. We were tasked with enumerating sets of data we have or we would like to have. We were then expected to ask ourselves what we would do with the data once we got it, what are some of the challenges we have with the data, and what are some of the solutions to the challenges. I articulated data sets including information about readers (“patrons” or “users”), information about what is used frequently or infrequently, tabulations of words and phrases from the full text of our collections, contact information of local grant awardees, and finally, the names and contact information of local editors of scholarly publications. As we discussed these data sets and others, challenges ranged from technical to political. Every solution seemed to be rooted in a desire for more resources (time and money).
  • Dorothea Salo (University of Wisconsin – Madison) – The event was brought to a close by Salo, who began by articulating the Three V’s of Big Data: volume, velocity, and variety. Volume alludes to the amount of data. Velocity refers to the frequency with which the data changes. Variety is an account of the data’s consistency. Good data, she says, is clean, consistent, easy to understand, and computable. She then asked, “Do libraries have ‘big data’?” And her answer was, “Yes and no.” Yes, we have volumes of bibliographic information, but it is neither clean nor easy to understand. The challenges described by Armstrong are perfect examples. She says that our ‘non-computable’ datasets are costing the profession mind share, and we have only a limited amount of time to rectify the problem before somebody else comes up with a solution and bypasses libraries altogether. She also mentioned the power of data aggregation. Examples included OAIster, WorldCat, various union catalogs, and the National Science Foundation Digital Library. It did not sound to me as if she thought these efforts were successes. She alluded to the Digital Public Library of America, and because of their explicit policy for metadata use and re-use, she thinks it has potential, but only time will tell. She has a lot of faith in the idea of “linked data”, and frankly, that sounds like a great idea to me as well. What is the way forward? She advocated the creation of “library scaffolding” to increase internal library skills, and she did not advocate the hiring of specific people to do specific tasks and expect them to solve all the problems.

After the meeting I visited the Northwestern main library and experienced the round rooms where books are shelved. It was interesting to see the ranges radiating from each room’s center. Along the way I autographed my book and visited the university museum, which had on display quite a number of architectural drawings.

Even though the symposium was not about “e-science research data”, I’m very glad I attended. Discussion was lively. The venue was intimate. I met a number of people, and my cognitive side was stimulated. Thank you for the opportunity.

Catholic pamphlets workflow

Posted on April 12, 2013 in Uncategorized by Eric Lease Morgan

Gratuitous eye candy by Matisse

This is an outline of how we here at Notre Dame have been making digitized versions of our Catholic pamphlets available on the Web — a workflow:

  1. Save PDF files to a common file system – This can be as simple as a shared hard disk or removable media.
  2. Ingest PDF files into Fedora to generate URLs – The PDF files are saved in Fedora for the long haul.
  3. Create persistent URLs and return a list of system numbers and… URLs – Each PDF file is given a PURL for the long haul. Output a delimited file containing system numbers in one column and PURLs in another. (Steps #2 and #3 are implemented with a number of Ruby scripts: batch_ingester.rb, book.rb, mint_purl.rb, purl_config.rb, purl.rb, repo_object.rb.)
  4. Update Filemaker database with URLs for quality assurance purposes – Use the PURLs from the previous step and update the local database so we can check the digitization process.
  5. Start quality assurance process and cook until done – Look at each PDF file making sure it has been digitized correctly and thoroughly. Return poorly digitized items back to the digitization process.
  6. Use system numbers to extract MARC records from Aleph – The file names of each original PDF document should be an Aleph system number. Use the list of numbers to get the associated bibliographic data from the integrated library system.
  7. Edit MARC records to include copyright information and URLs to PDF files – Update the bibliographic records using scripts called list-copyright.pl and update-marc.pl. The first script outputs a list of copyright information that is used as input for the second script, which adds the copyright information as well as pointers to the PDF documents.
  8. Duplicate MARC records and edit them to create electronic resource records – Much of this work is done using MARCEdit
  9. Put newly edited records into Aleph test – Ingest the newly created records into a staging area.
  10. Check records for correctness – Given enough eyes, all bugs are shallow.
  11. Put newly edited records into Aleph production – Make the newly created records available to the public.
  12. Extract newly created MARC records with new system numbers – These numbers are needed for the concordance program — a way to link back from the concordance to the full bibliographic record.
  13. Update concordance database and texts – Use something like pdftotext to extract the OCR from the scanned PDF documents. Save the text files in a place where the concordance program can find them. Update the concordance’s database linking keys to bibliographic information as well as locations of the text files. All of this is done with a script called extract.pl. (A sketch of the text extraction step follows this list.)
  14. Create Aleph Sequential File to add concordance links – This script (marc2aleph.pl) will output something that can be used to update the bibliographic records with concordance URLs — an Aleph Sequential File.
  15. Run Sequential File to update MARC records with concordance link – This updates the bibliographic information accordingly.
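
For what it is worth, the text extraction in step #13 can be as simple as looping over the PDF files and calling pdftotext. The following is a minimal sketch, not the extract.pl mentioned above, and the directory names are hypothetical:

#!/usr/bin/perl
# a minimal sketch of pulling OCR'd text out of a directory of PDF files with pdftotext;
# the directory names are placeholders, and both directories are assumed to exist
use strict;
use warnings;

my $pdfs  = './pdfs';     # hypothetical location of the scanned PDF files
my $texts = './texts';    # hypothetical location where the concordance looks for plain text

opendir( my $dh, $pdfs ) or die "Can't open $pdfs: $!";
foreach my $file ( sort grep { /\.pdf$/i } readdir( $dh ) ) {

    # the file name (sans extension) is expected to be an Aleph system number
    ( my $key = $file ) =~ s/\.pdf$//i;

    # let pdftotext do the heavy lifting
    system( 'pdftotext', "$pdfs/$file", "$texts/$key.txt" ) == 0
        or warn "pdftotext failed for $file: $?";

}
closedir( $dh );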

Done, but I’m sure your mileage will vary.

Digital Scholarship Grilled Cheese Lunch

Posted on April 5, 2013 in Uncategorized by Eric Lease Morgan

Grilled Cheese Lunch Attendees

In the Fall the Libraries will be opening a thing tentatively called The Hesburgh Center for Digital Scholarship. The purpose of the Center will be to facilitate learning, teaching, and research across campus through the use of digital technology.

For the past few months I have been visiting other centers across campus in order to learn what they do, and how we can work collaboratively with them. These centers included the Center for Social Research, the Center for Creative Computing, the Center for Research Computing, the Kaneb Center, Academic Technologies, as well as a number of computer labs/classrooms. Since we all have more things in common than differences, I recently tried to build a bit of community through a grilled cheese lunch. The event was an unqualified success, and pictured are some of the attendees.

Fun with conversation and food.

Editors across campus: A reverse travelogue

Posted on March 8, 2013 in Uncategorized by Eric Lease Morgan

Some attending editors

On Friday, February 8 an ad hoc library group called The Willing sponsored a lunch for editors of serial titles from across campus, and this is but the tiniest of “reverse travelogues” documenting this experience surrounding the scholarly communications process.

Professionally, I began to experience changes in the scholarly communications process almost twenty years ago when I learned how the cost of academic journals was increasing by as much as 5%-7% every year. With the advent of globally networked computers, the scholarly communications process is now affecting academics more directly.

In an effort to raise the awareness of the issues and provide a forum for discussing them, The Willing first compiled a list of academic journals whose editors were employed by the University. There are/were about sixty journals. Being good librarians, we subdivided these journals into smaller piles based on various characteristics. We then invited subsets of the journal editors to a lunch to discuss common problems and solutions.

The lunch was attended by sixteen people, and they were from all over the campus wearing the widest variety of hats. Humanists, scientists, and social scientists. Undergraduate students, junior faculty, staff, senior faculty. Each of us, including myself, had a lot to say about our individual experiences. We barely got around the room with our introductions in the allotted hour. Despite this fact, a number of common themes — listed below in more or less priority order — became readily apparent:

  • facilitating the peer-review process
  • going digital
  • understanding open access publishing models
  • garnering University support
  • balancing copyrights (often called “ownership” by attendees)
  • being financially sustainable
  • combatting plagiarism
  • facilitating community building around and commenting on journal content
  • soliciting submissions

With such a wide variety of topics it was difficult to have a focused discussion on any one of them in the given time and allow everybody to express their most important concerns. Consequently the group decided to select individual themes and sponsor additional get-togethers whose purpose will be to discuss the selected theme and only the selected theme. We will see what we can do.

Appreciation goes to The Willing (Kenneth Kinslow, Parker Ladwig, Collette Mak, Cheryl Smith, Lisa Welty, Marsha Stevenson, and myself) as well as all the attending editors. “Thanks! It could not have happened without you.”

Editors Across The Campus

Posted on January 18, 2013 in Uncategorized by Eric Lease Morgan

gratuitous “eye candy” by Matisse


In an effort to make life easier for people who edit serial literature here at Notre Dame, we are organizing an informal lunch called Editors Across The Campus. We hope you can join us:

  • Who: Anybody and everybody who edits a journal here at Notre Dame
  • What: An informal lunch and opportunity for discussion
  • When: 11:45 to no later than 1 o’clock, Friday, February 8
  • Where: Room 248 of the Hesburgh Libraries
  • Why: Because we all have something to learn from each other

Here at the University quite a number of journals, magazines, and various other types of serial literature are edited by local faculty, students, and staff; based on our investigations there are more than one hundred editors who have their hands in more than sixty serial titles.

Bringing editors together from across campus will build community. It will foster the creation of a support network. It will also make it easier for people interested in scholarly communication to hear, learn, and prioritize issues and challenges facing editors. Once these issues are identified and possibly prioritized, then plans can be made to address the issues effectively. Thus, the purpose of the lunch/discussion combination is to begin to share “war stories” in the hopes of at least finding some common ground. Issues and challenges might include but are certainly not limited to:

  • balancing the costs of publication
  • dealing with copyright issues
  • deciding between electronic and paper-based distribution
  • determining the feasibility of open access publishing
  • finding and identifying qualified authors
  • finding and identifying qualified publishers
  • finding and identifying qualified reviewers
  • implementing a searchable/browsable archive of previous content
  • increasing impact factors
  • increasing readership
  • learning how to use computer technology to manage workflows
  • moving from one publisher to another

We sincerely believe we all have more things in common than differences. If you are an editor or someone who is keenly interested in the scholarly communications process, then drop us a line (Eric Lease Morgan <emorgan@nd.edu>, 631-8604), come to the lunch, and participate in the discussion. We hope to see you there.

A couple of Open Access Week events

Posted on November 17, 2012 in Uncategorized by Eric Lease Morgan

A couple of Open Access Week events were sponsored here at Notre Dame on October 31, and this posting summarizes my experiences.

Many of The Willing plus Nick Shockey and José E. Limón

Morning session

In the morning there was a presentation to library faculty by Nick Shockey (SPARC), specifically on the process of increasing open access publishing, and he outlined five different tactics:

  1. Simple advocacy – Describing what open access publishing is and its philosophical advantages. Unfortunately this approach does not always resonate with the practicalities of everyday promotion and tenure processes.
  2. Education – This goes hand-in-hand with advocacy but may also include how open access has more things in common with traditional publishing than differences. For example, Shockey pointed out the increasing number of mandates from funders to have the results of research funded by them become available via open access. Another success factor in education is helping faculty gain a deep understanding of the issues; once this is done, resistance is much lower.
  3. Engage scholarly societies – For example, ask the society to open up their back log of published materials as open access materials.
  4. Educate friends and colleagues – We have to understand that not everybody sees the whole problem. There are the perspectives of the author, the publisher, and the librarian. Each is needed in the scholarly communications process, yet not everybody completely understands the issues of the others. Build relationships between all three of these communities. He also advocated educating students because they can be a catalyst for change.
  5. Make your work open access – This means know your rights, keep your rights, and use your rights. The process is increasingly negotiable.

Finally, Shockey insisted on engaging authors on very real world problems instead of the philosophical issues such as expanding the sphere of knowledge. “Look for and point out tangible benefits of open access including higher citation counts, wider distribution, and the possibility of massive textual analysis.”

Afternoon session

The afternoon session was co-presented by Nick Shockey and José E. Limón. The topic was authors’ rights.

Shockey began by outlining the origination of scholarly journals and how they were originally non-profit enterprises. But as time went on and publishing increasingly became profit-based, a question needed to be asked, “How well does this new model really serve the people for whom it is needed?” When the prices of some chemistry journals approach $4,200/year, there has got to be a better way.

Knowing authors’ rights can help. For example, by knowing, understanding, and acting upon the self-archiving rights associated with many journals nowadays, it is possible to make versions of published materials available in a much wider fashion than ever before, but it does require some extra work — systematic extra work that could be done by libraries.

Shockey also advocated contractual amendments like the one called the Scholar’s Copyright Addendum Engine [1]. Complete the form with your name, title, and journal. Click the button. Print the form. Sign it and send it away to the publisher while retaining many of your rights automatically.

Finally, Shockey advocated university-wide institutional policies for retaining authors’ rights. “These policies create broader and wider audiences and offer greater visibility.”

José E. Limón (American Studies at the University of Notre Dame) began by confessing that the idea of authors’ rights has been rather foreign to him, and at the same time the ante is going up in terms of tenure and promotion. No longer is it about publishing a single book. Consequently he believes his knowledge regarding authors’ rights needs to be increased.

Limón went on to relate a personal story about authors’ rights. It began when he discovered an unpublished manuscript at Texas A&M University. It was a novel coauthored by Jovita González and Margaret Eimer which he edited and eventually published under the title of Caballero. Written in the 1930s, this historical novel is set during the Mexican American War and is sometimes called Texas’s Gone with the Wind. After the book was published Limón was approached by Steven Spielberg’s company about movie rights, but after a bit of investigation he discovered he had no rights to the book; rather, the rights remained with Texas A&M. To many in the audience, the story was a bit alarming.

In the end, he had one thing to say, “Academics just do not know.”

Kudos

Kudos to Nick Shockey and José E. Limón for sharing some of their experiences. “Thank you!” Thanks also go to the ad hoc group in the Hesburgh Libraries who call themselves “The Willing” (Kenneth Kinslow, Parker Ladwig, Collette Mak, Cheryl Smith, Marsha Stevenson, Lisa Welty, and Eric Lease Morgan). Without their help none of this would have happened.

New Media From the Middle Ages To The Digital Age

Posted on November 7, 2012 in Uncategorized by Eric Lease Morgan

new and old teaching tools

I attended an interesting lecture yesterday from a series called New Media From the Middle Ages to the Digital Age, and here are a few of my take-aways.

Peter Holland (Film, Television, and Theatre) began by giving an overview of his academic career. He noted how the technology of his time was a portable typewriter. He then went on to compare and contrast scholarship then and now. From what I could tell, he did not think there was a significant difference, with the exception of one thing — the role and definition of community. In the past community meant going to conferences and writing letters every once in a while. Now-a-days, conferences are still important, letters have been replaced by email, but things like mailing lists play a much larger role in community. This sort of technology has made it possible to communicate with a much wider audience much faster than in previous times. The SHAKSPER mailing list was his best example.

The next presentation was by Elliott Visconsi (English). While the foundation of his presentation surrounded his The Tempest for iPad project, he was really focused on how technology can be used to enhance learning, teaching, and research. He believed portable Web apps represent a convergence of new and old technologies. I believe he called them “magic books”. One of his best examples is how the application can support dynamic and multiple commentaries on particular passages as well as dynamic and different ways speeches can be vocalized. This, combined with social media, gives Web applications some distinct advantages over traditional pedagogical approaches.

From my point of view, both approaches have their distinct advantages and disadvantages. Traditional teaching and learning tools are less fragile — less mutable. But at the same time they rely very much on the work of a single individual. On the other hand, the use of new technology is expensive to create and keep up-to-date while offering a richer learning experience that is easier to use in groups. “Two heads are better than one.”

So many editors!

Posted on September 22, 2012 in Uncategorized by Eric Lease Morgan

There are so many editors of serial content here at the University of Notre Dame!

In a previous posting I listed the titles of serial content with editors here at Notre Dame. I identified about fifty-nine titles. I then read more about each serial title and created a sub-list of editors, which resulted in about 113 names. The original idea was to gather as many of the editors together as possible and facilitate a discussion on scholarly communication, but alas, 113 people is far too many for a chat.

Being a good librarian, I commenced to classify my list of serials hoping to create smaller, more cohesive groups of people. I used facets such as student-run, peer-reviewed, open access, journal (as opposed to blog), and subjects. This being done I was able to create subsets of the titles with much more manageable numbers of editors. For example:

  • 15 science publications (19 editors)
  • 10 student-run publications (24 editors)
  • 12 open access publications (26 editors)
  • 17 humanities publications (41 editors)
  • 31 peer-reviewed publications (43 editors)
  • 26 social science publications (50 editors)
  • 28 published here at Notre Dame (56 editors)

One of our goals here in the Libraries is to play a role in the local scholarly communication process. Exactly what that role entails is yet to be determined. Bringing together editors from across campus could build community. It could also make it easier for us to hear, learn, and prioritize issues facing editors. Once we know what those issues are, we might be able to figure out a role for ourselves. Maybe there isn’t a role. On the other hand, maybe there is something significant we can do.

The next step is to figure out whether or not to bring subsets of these editors together, and if so, then how. We’ll see what happens.

Yet more about HathiTrust items

Posted on September 14, 2012 in Uncategorized by Eric Lease Morgan

This directory includes the files necessary to determine what downloadable public domain items in the HathiTrust are also in the Notre Dame collection.

In previous postings I described some investigations regarding HathiTrust and Notre Dame collections. [1, 2, 3] Just yesterday I got back from a HathiTrust meeting and learned that even the Google digitized items in the public domain are not really downloadable without signing some sort of contract.

Consequently, I downloaded a very large list of 100% downloadable public domain items from the HathiTrust (pd.xml). I then extracted the identifiers from the list using a stylesheet (pd.xsl). The result is pd.txt. Starting with my local MARC records created from the blog postings (nd.marc), I wrote a Perl script (nd.pl) to extract all the identifiers (nd.txt). Lastly, I computed the intersection of the two lists using a second Perl script (compare.pl) resulting in a third text file (both.txt). The result is a list of items that are in the public domain in the HathiTrust, are also in the collection here at Notre Dame, and require no disambiguation because they have not been digitized more than once. (“Confused yet?”)
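
For the curious, computing such an intersection is straightforward. The following is a minimal sketch along the lines of compare.pl, though not necessarily identical to it; it assumes one identifier per line in pd.txt and nd.txt and prints the intersection to standard output:

#!/usr/bin/perl
# a minimal sketch of intersecting two lists of HathiTrust identifiers;
# redirect the output to something like both.txt
use strict;
use warnings;

# read the identifiers denoting 100% downloadable public domain items
open( my $pd, '<', 'pd.txt' ) or die "Can't open pd.txt: $!";
my %public_domain = map { chomp; $_ => 1 } <$pd>;
close( $pd );

# loop through the local identifiers and print the ones found in both lists
open( my $nd, '<', 'nd.txt' ) or die "Can't open nd.txt: $!";
while ( my $id = <$nd> ) {
    chomp( $id );
    print "$id\n" if ( $public_domain{ $id } );
}
close( $nd );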

It is now possible to download the entire digitized book through the HathiTrust Data API via a Web form. [4] Or you can use something like the following URL:

http://babel.hathitrust.org/cgi/htd/aggregate/<ID>

where <ID> is a HathiTrust identifier. For example:

http://babel.hathitrust.org/cgi/htd/aggregate/mdp.39015003700393
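
Given such a URL, a tiny Perl script using LWP::UserAgent can save the resulting zip stream to disk. This is only a sketch, and it assumes you have whatever permissions or credentials the Data API requires:

#!/usr/bin/perl
# a minimal sketch of fetching an aggregate (zipped text plus METS) file;
# permissions/credentials required by the Data API are assumed to be in place
use strict;
use warnings;
use LWP::UserAgent;

my $id  = 'mdp.39015003700393';    # the example identifier from above
my $url = "http://babel.hathitrust.org/cgi/htd/aggregate/$id";

my $ua       = LWP::UserAgent->new;
my $response = $ua->get( $url );
die 'Download failed: ' . $response->status_line unless ( $response->is_success );

# save the zip stream for later "inflation"
open( my $zip, '>', "$id.zip" ) or die "Can't create $id.zip: $!";
binmode( $zip );
print $zip $response->content;
close( $zip );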

Of the approximately 20,000 items previously “freely” available, it seems that there are now just more than 2,000. In other words, about 18,000 of the items I previously thought were freely available for our catalog are not really “free”; instead, permissions still need to be garnered in order to get them.

I swear we are presently creating a Digital Dark Age!

Links

  1. http://sites.nd.edu/emorgan/2012/08/hathitrust/
  2. http://sites.nd.edu/emorgan/2012/08/hathitrust-continued/
  3. http://sites.nd.edu/emorgan/2012/08/hathi-epilogue/
  4. https://babel.hathitrust.org/shcgi/htdc