Archive for the ‘Uncategorized’ Category

Simple text analysis with Voyant Tools

Posted on January 18, 2014 in Uncategorized

Voyant Tools is a Web-based application for doing a number of straightforward text analysis functions, including but not limited to: word counts, tag cloud creation, concordancing, and word trending. Using Voyant Tools, a person is able to read a document “from a distance”. It enables the reader to extract characteristics of a corpus quickly and accurately. Voyant Tools can be used to discover underlying themes in texts or verify propositions against them. This one-hour, hands-on workshop familiarizes the student with Voyant Tools and provides a means for understanding the concepts of text mining. (This document is also available as a PDF document suitable for printing.)

Getting started

Voyant Tools is located at http://voyant-tools.org, and the easiest way to get started is by pasting into its input box a URL or a blob of text. For learning purposes, enter one of the URLs found at the end of this document, select from Thoreau’s Walden, Melville’s Moby Dick, or Twain’s Eve’s Diary, or enter a URL of your own choosing. Voyant Tools can read the more popular file formats, so URLs pointing to PDF, Word, RTF, HTML, and XML files will work well. Once given a URL, Voyant Tools will retrieve the associated text and do some analysis. Below is what is displayed when Walden is used as an example.

Voyant Tools

In the upper left-hand corner is a word cloud. In the lower left-hand corner are some statistics. The balance of the screen is made up of the text. The word cloud probably does not provide you with very much useful information because stop words have not been removed from the analysis. By clicking on the word cloud customization link, you can choose from a number of stop word sets, and the result will make much more sense. Figure #2 illustrates the appearance of the word cloud once the English stop words are employed.

When you select a word from the word cloud, a word trends graph appears, illustrating the relative frequency of the selection as it changes across the text. You can use this tool to determine how consistently a theme appears throughout the text. You can compare the frequencies of additional words by entering them into the word trends search box. Figure #3 illustrates the frequency of the words pond and ice.

Figure 2 – word cloud
Figure 3 – word trends
Figure 4 – concordance

Once you select a word from the word cloud, a concordance appears in the lower right-hand corner of the screen. You can use this tool to: 1) see what words surround your selected word, and 2) see how the word is used in the context of the entire work. Figure #4 is an illustration of the concordance. The set of horizontal blue lines in the center of the screen denotes where the selected word is located in the text. The darker the blue line, the more times the selected word appears in that area of the text.
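For readers who like to see what is happening under the hood, below is a rudimentary keyword-in-context (KWIC) sketch, written in Perl, approximating the sort of concordance display described above; the file name and keyword are given on the command line, and the thirty-character window is arbitrary:

#!/usr/bin/perl

# concordance.pl - a rudimentary keyword-in-context (KWIC) lister
# usage: ./concordance.pl walden.txt pond

# require
use strict;
use warnings;

# get input
my ( $file, $keyword ) = @ARGV;
die "usage: $0 <file> <word>\n" unless $file and $keyword;

# slurp the text and normalize whitespace
open my $fh, '<', $file or die "Can't open $file: $!";
my $text = do { local $/; <$fh> };
close $fh;
$text =~ s/\s+/ /g;

# print each occurrence of the keyword with thirty characters of context on either side
my $width = 30;
while ( $text =~ /\b\Q$keyword\E\b/gi ) {

	my $start = pos( $text ) - length( $keyword ) - $width;
	$start = 0 if $start < 0;
	print substr( $text, $start, $width + length( $keyword ) + $width ), "\n";

}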

What good is this?

On the surface of things you might ask yourself, “What good is this?” The answer lies in your ability to ask different types of questions against a text — questions you may not have been able to ask previously but can ask now because things like Voyant Tools count and tabulate words. Questions like the following (a small word-counting sketch appears after the list):

  • What are the most frequently used words in a text?
  • What words do not appear at all or appear infrequently?
  • Do any of these words represent any sort of theme?
  • Where do these words appear in the text, and how do they compare to their synonyms or antonyms?
  • Where should a person go looking in the text for the use of particular words or their representative themes?
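None of these questions requires anything more sophisticated than counting and tabulating words. For the curious, below is a minimal word-counting sketch, written in Perl, approximating the sort of tabulation Voyant Tools automates; the (tiny) stop word list is purely illustrative:

#!/usr/bin/perl

# word-count.pl - tabulate the most frequently used words in a plain text file
# usage: ./word-count.pl walden.txt

# require
use strict;
use warnings;

# configure; a (very) short, illustrative stop word list
my %stopwords = map { $_ => 1 } qw( the a an and or of to in is was it i );

# slurp the given file
my $text = do { local $/; <> };

# count the words, skipping stop words
my %count;
for my $word ( split /\W+/, lc $text ) {

	next if $word eq '' or $stopwords{ $word };
	$count{ $word }++;

}

# report the (no more than) twenty-five most frequent words
my @words = sort { $count{ $b } <=> $count{ $a } } keys %count;
my $limit = @words < 25 ? $#words : 24;
printf "%s (%d)\n", $_, $count{ $_ } for @words[ 0 .. $limit ];

Run the script against a plain text version of Walden (./word-count.pl walden.txt) and the most frequent words should roughly echo the cloud in Figure #2.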

More features

Voyant Tools includes a number of other features. For example, multiple URLs can be entered into the home page’s input box. This enables the reader to examine many documents all at one time. (Try adding all the URLs at the end of this document.) After doing so, many of the features of Voyant Tools work in a similar manner, but others become more interesting. For example, the summary pane in the lower left corner allows you to compare words across documents. (Consider applying the stop words feature to the pane in order to make things more meaningful.) Each of Voyant Tools’ panes can be exported to HTML files or linked from other documents. This is facilitated by clicking on the small icons in the upper right-hand corner of each pane. Use this feature to embed Voyant illustrations into Web pages or printed documents. By exploring the content of a site called Hermeneuti.ca (http://hermeneuti.ca) you can discover other features of Voyant Tools as well as other text mining applications.

The use of Voyant Tools represents an additional way of analyzing text(s). By counting and tabulating words, it provides a quick and easy quantitative method for learning what is in a text and what it might have to offer. The use of Voyant Tools does not offer “truth” per se, only new ways of observing.

Sample links

[1] Walden – http://infomotions.com/etexts/philosophy/1800-1899/thoreau-walden-186.txt
[2] Civil Disobedience – http://infomotions.com/etexts/philosophy/1800-1899/thoreau-life-183.txt
[3] Merrimack River – http://infomotions.com/etexts/gutenberg/dirs/etext03/7cncd10.txt

Semantic Web in Libraries 2013

Posted on December 30, 2013 in Uncategorized

I attended the Semantic Web in Libraries 2013 conference in Hamburg (November 25-27), and this posting documents some of my experiences. In short, I definitely believe the linked data community in libraries is maturing, but I still wonder whether or not the barrier to participation is really low enough for the vision of the Semantic Web to become a reality.

venue

Preconference on provenance

On the first day I attended a preconference about linked data and provenance led by Kai Eckert (University of Mannheim) and Magnus Pfeffer (Stuttgart Media University). One of the fundamental ideas behind the Semantic Web and linked data is the collecting of triples denoting facts. These triples are expected to be amassed and then inferenced across in order to bring new knowledge to light. But in the scholarly world it is important to cite and attribute scholarly output. Triples are atomistic pieces of information: subjects, predicates, objects. But there is no room in these simple assertions to denote where the information originated. This issue was the topic of the preconference discussion. Various options were outlined but none of them seemed optimal. I’m not sure of the conclusion, but one “solution” may be the use of PROV, “a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web”.

castle

Day #1

Both Day #1 and Day #2 were peppered with applications that harvested linked data (and other content) to create new and different views of information. AgriVIVO, presented by John Fereira (Cornell University), was a good example:

AgriVIVO is a search portal built to facilitate connections between all actors in the agricultural field, bridging across separately hosted directories and online communities… AgriVIVO is based on the VIVO open source semantic web application initially developed at Cornell University and now adopted by several cross-institutional research discovery projects.

Richard Wallis (OCLC) advocated the creation of library knowledge maps similar to the increasingly visible “knowledge graphs” created by Google and displayed at the top of search results. These “graphs” are aggregations of images, summaries, maps, and other bits of information providing the reader with answers / summaries describing what may be the topic of the search. They are the same sort of thing one sees when searches are done in Facebook as well. And in the true spirit of linked data principles, Wallis advocated the use of other people’s Semantic Web ontologies, such as the ontology used by Schema.org. If you want to participate and help extend the bibliographic entities of Schema.org, then consider participating in a W3C community called the Schema Bib Extend Community Group.

BIBFRAME was described by Julia Hauser (Reinhold Heuvelmann German National Library). Touted as a linked data replacement for MARC, its data model consists of works, instances, authorities, and annotations (everything else). According to Hauser, “The big unknown is how can RDA or FRBR be expressed using BIBFRAME.” Personally, I noticed how BIBFRAME contains no holdings information, but such an issue may be resolvable through the use of schema.org.

“Language effects hierarchies and culture comes before language” were the concluding remarks in a presentation by the National Library of Finland. Leaders in the linked data world, the presenters described how they were trying to create a Finnish ontology, and they demonstrated how language does not fit into neat and orderly hierarchies and relationships. Things always get lost in translation. For example, one culture may have a single word for a particular concept, but another culture may have multiple words because the concept has more nuances in its experience. Somewhere along the line the presenters alluded to onki-light, “a REST-style API for machine and Linked Data access to the underlying vocabulary data.” I believe the presenters were using this tool to support access to their newly formed ontology.

Yet another ontology was described by Carsten Klee (Berlin State Library) and Jakob Voß (GBV Common Library Network). This was a holdings ontology, which seemed unnecessarily complex to me, but then I’m no real expert. See the holding-ontology repository on Github.

memorial

Day #2

I found the presentation — “Decentralization, distribution, disintegration: Towards linked data as a first class citizen in Library Land” — by Martin Malmsten (National Library of Sweden) to be the most inspiring. In the presentation he described why he thinks linked data is the way to describe the content of library catalogs. He also made insightful distinctions between file formats and the essential characteristics of data, information, knowledge (and maybe wisdom). Like many at the conference, he advocated interfaces to linked data, not MARC:

Working with RDF has enabled me to see beyond simple formats and observe the bigger picture — “Linked data or die”. Linked data is the way to do it now. I advocate the abstraction of MARC to RDF because RDF is more essential and fundamental… Mixing data is a new problem with the advent of linked data. This represents a huge shift in our thinking in Library Land. It is transformative… Keep the formats (monsters and zombies) outside your house. Formats are for exchange. True and real RDF is not a format.

Some of the work demonstrating the expressed ideas of the presentation is available on Github in a package called librisxl.

Another common theme / application demonstrated at the conference was variations on the venerable library catalog. OpenCat, presented by Agnes Simon (Bibliothèque Nationale de France), was an additional example of this trend. Combining authority data (available as RDF) provided by the National Library of France with works of a second library (Fresnes Public Library), the OpenCat prototype provides quite an interesting interface to library holdings.

Peter Király (Europeana Foundation) described how he is collecting content over many protocols and amalgamating it into the data store of Europeana. I appreciated the efforts he has made to normalize and enrich the data — not an easy task. The presentation also made me think about provenance. While provenance is important, maybe trust of provenance can come from the aggregator. I thought, “If these aggregators believe — trust — the remote sources, then maybe I can too.” Finally, the presentation got me imagining how one URI can lead to others, and my goal would be to distill all of the interesting information I found along the way back down into a single URI, as in the following image I doodled during the presentation.

uri

Enhancing the access and functionality of manuscripts was the topic of the presentation by Kai Eckert (Universität Mannheim). Specifically, manuscripts are digitized and an interface is placed on top allowing scholars to annotate the content beneath. I think the application supporting this functionality is called Pundit. Along the way he takes heterogeneous (linked) data and homogenizes it with a tool called DM2E.

OAI-PMH was frequently alluded to during the conference, and I have some ideas about that. In “Application of LOD to enrich the collection of digitized medieval manuscripts at the University of Valencia”, Jose Manuel Barrueco Cruz (University of Valencia) described how the age of his content inhibited his use of the currently available linked data. I got the feeling there was little linked data closely associated with the subject matter of his manuscripts. Still, an important thing to note is how he started his investigations with the use of Datahub:

a data management platform from the Open Knowledge Foundation, based on the CKAN data management system… [providing] free access to many of CKAN’s core features, letting you search for data, register published datasets, create and manage groups of datasets, and get updates from datasets and groups you’re interested in. You can use the web interface or, if you are a programmer needing to connect the Datahub with another app, the CKAN API.

Simeon Warner (Cornell University) described how archives or dumps of RDF triple stores are synchronized across the Internet via HTTP GET, gzip, and a REST-ful interface on top of Google sitemaps. I was impressed because the end result did not necessarily invent something new but rather implemented an elegant solution to a real-world problem using existing technology. See the resync repository on Github.

In “From strings to things: A linked data API for library hackers and Web developers”, Fabian Steeg and Pascal Christoph (HBZ) described an interface allowing librarians to determine the URIs of people, places, and things for library catalog records. “How can we benefit from linked data without being linked data experts? We want to put Web developers into focus using JSON for HTTP.” There are a few hacks illustrating some of their work on Github in the lobid repository.

Finally, I hung around for a single lightning talk — Carsten Klee’s (Berlin State Library) presentation of easyM2R, a PHP script converting MARC to any number of RDF serializations.

church

Observations, summary, and conclusions

I am currently in the process of writing a short book on the topic of linked data and archives for an organization called LiAM — “a planning grant project whose deliverables will facilitate the application of linked data approaches to archival description.” One of my goals for attending this conference was to determine my level of understanding when it comes to linked data. At the risk of sounding arrogant, I think I’m on target, but at the same time, I learned a lot at this conference.

For example, I learned that the process of publishing linked data is not “rocket surgery” and what I have done to date is more than functional, but I also learned that creating serialized RDF from MARC or EAD is probably not the best way to create RDF. I learned that publishing linked data is only one half of the problem to be solved. The other half is figuring out ways to collect, organize, and make useful the published content. Fortunately this second half of the problem was much of what the conference was about. Many people are using linked data to either create or enhance “next-generation library catalogs”. In this vein they are not really doing anything new and different; they are being evolutionary. Moreover, many of the developers are aggregating content using quite a variety of techniques, OAI-PMH being one of the more frequent.

When it comes to OAI-PMH and linked data, I see very much the same vision. Expose metadata in an agreed upon format and in an agreed upon method. Allow others to systematically harvest the metadata. Provide information services against the result. OAI-PMH was described as a protocol with a low barrier to entry. The publishing of linked data is also seen as low barrier technology. The challenges of both lie first in the vocabularies used to describe the things of the metadata. OAI-PMH required Dublin Core but advocated additional “ontologies”. Few people implemented them. Linked data is not much different. The problem with the language of the things is just as prevalent, if not more so. Linked data is not just the purview of Library Land and a few computer scientists. Linked data has caught the attention of a much wider group of people, albeit the subject is still a bit esoteric. I know the technology supporting linked data works; after all, it is the technology of the Web. I just wonder whether: 1) there will ever be a critical mass of linked data available to fulfill its promise, and 2) we — the information community — will be able to overcome the “Tower of Babel” we are creating with all the various ontologies we are sporting. A single ontology won’t work. Just look at Dublin Core. Many ontologies won’t work either. There is too much variation and too many idiosyncrasies in real-world human language. I don’t know what the answer is. I just don’t.

Despite some of my misgivings, I think the following quote by Martin Malmsten pretty much sums up much of the conference — Linked data or die!

Fun with bibliographic indexes, bibliographic data management software, and Z39.50

Posted on November 15, 2013 in Uncategorized

It is not supposed to be this hard.

The problem to solve

A student came into the Center For Digital Scholarship here at Notre Dame. He wanted to do some text analysis against a mass of bibliographic citations from the New York Times dating from 1934 to the present. His corpus consists of more than 1.6 million records. The student said, “I believe the use of words like ‘trade’ and ‘tariff’ has changed over time, and these changes reflect shifts in our economic development policies.” Sounds interesting to me, really.

Solution #1

To do this analysis I needed to download the 1.6 million records in question. No, I wasn’t going to download them in one whole batch but rather break them up into years. Still, this results in individual data sets totaling thousands and thousands of records. Selecting these records through the Web interface of the bibliographic vendor was tedious. No, it was cruel and unusual punishment. There had to be a better way. Moreover, the vendor said, “Four thousand (4,000) records is the most a person can download at any particular time.”

Solution #2

After a bit of back & forth, a commercial Z39.50 client seemed to be the answer. At the very least there wouldn’t be a whole lot of clicking going on. I obtained a username/password combination. I figured out the correct host name of the remote Z39.50 server. I got the correct database name. I configured my client. Searches worked perfectly. But upon closer inspection, no date information was being parsed from the records. No pagination. The bibliographic citation management software could not create… bibliographic citations. “Is date information being sent? What about pagination and URLs?” More back & forth and I learned that the bibliographic vendor’s Z39.50 server outputs MARC, and the required data is encoded in the MARC. I went back to tweaking my client’s configuration. Everything was now working, but downloading the citations was very slow — too slow.

Solution #3

So I got to thinking some more. “I have all the information I need to use a low-level Z39.50 client.” Yaz-client might have been an option, but in the end I wrote my own Perl script. In about twenty-five lines of code I wrote what I needed, and downloads were a factor of ten faster than with the desktop client. (See the Appendix.) The only drawback was the raw MARC that I was saving. I would need to parse it for my student.
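For what it is worth, parsing the raw MARC is not a difficult job. Below is a minimal sketch using MARC::Batch from the MARC::Record distribution on CPAN; the choice of fields (the 245 via the title method, the 260 subfield c for a date) is illustrative only, and the vendor’s records may well carry dates and pagination elsewhere:

#!/usr/bin/perl

# marc2tab.pl - convert raw MARC records to tab-delimited title/date pairs
# usage: ./marc2tab.pl nytimes.marc > nytimes.tsv

# require
use strict;
use warnings;
use MARC::Batch;

# get input; open the file of raw MARC records
my $file = shift @ARGV or die "usage: $0 <marc file>\n";
my $batch = MARC::Batch->new( 'USMARC', $file );

# process each record
while ( my $record = $batch->next ) {

	# extract a title and (maybe) a date; other fields are left as an exercise
	my $title = $record->title || '';
	my $date  = $record->field( '260' ) ? ( $record->field( '260' )->subfield( 'c' ) || '' ) : '';

	# output
	print join( "\t", $title, $date ), "\n";

}

# done
exit;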

Back to the drawing board

Everything was going well, but then I hit the original limit — the record limit. When the bibliographic database vendor said there was a 4,000-record limit, I thought that meant no more than 4,000 records could be downloaded at one time. No, it means that from any given search I can only download the first 4,000 records. Trying to retrieve record 4,001 or greater results in an error. This is true. When I request record 4,001 from my commercial client or my Perl-based client, I get an error. Bummer!

The only thing I can do now is ask the bibliographic vendor for a data dump.

Take-aways

On one hand I can’t blame the bibliographic vendor too much. For decades the library profession has been trying to teach people to do the most specific, highly accurate, precision/recall searches possible. “Why would anybody want more than a few dozen citations anyway? Four thousand ought to be plenty.” On the other hand, text mining is a legitimate and additional method for dealing with information overload. Four thousand records is just the tip of the iceberg.

I learned a few things:

  • many students have very interesting senior projects
  • the commercial Z39.50 client works quite well and is well-supported
  • many commercial Z39.50 implementations are based on the good work of Indexdata
  • my bibliographic database vendor does IP-based Z39.50 authentication

I also got an idea — provide my clientele with a “smart” database search interface. Here’s how:

  1. authenticate a person
  2. allow the person to select one or more bibliographic databases to search
  3. allow the person to enter a rudimentary, free text query
  4. search the selected databases
  5. harvest the results (of potentially thousands of records)
  6. do text mining against the results to create timelines, word clouds, author recommendations, etc.
  7. present the results to the person for analysis

Wish me luck!?

Appendix

#!/usr/bin/perl

# nytimes-search.pl - rudimentary z39.50 client to query the NY Times

# Eric Lease Morgan <emorgan@nd.edu>
# November 13, 2013 - first cut; "Happy Birthday, Steve!"

# usage: ./nytimes-search.pl > nytimes.marc


# configure
use constant DB     => 'hnpnewyorktimes';
use constant HOST   => 'fedsearch.proquest.com';
use constant PORT   => 210;
use constant QUERY  => '@attr 1=1016 "trade or tariff"';
use constant SYNTAX => 'usmarc';

# require
use strict;
use ZOOM;

# do the work
eval {

	# connect; configure; search
	my $conn = new ZOOM::Connection( HOST, PORT, databaseName => DB );
	$conn->option( preferredRecordSyntax => SYNTAX );
	my $rs = $conn->search_pqf( QUERY );

	# requests > 4000 return errors
	# print $rs->record( 4001 )->raw;
			
	# retrieve; will break at record 4,000 because of vendor limitations
	for my $i ( 0 .. $rs->size - 1 ) {
	
		print STDERR "\tRetrieving record #$i\r";
		print $rs->record( $i )->raw;
		
	}
		
};

# report errors
if ( $@ ) { print STDERR "Error ", $@->code, ": ", $@->message, "\n" }

# done
exit;

Network Detroit and Great Lakes THATCamp

Posted on October 5, 2013 in Uncategorized

This time last week I was in Detroit (Michigan) where I attended Network Detroit and the Great Lakes THATCamp. This is the briefest of postings describing my experiences.

Network Detroit brought together experienced and fledgling digital humanists from around the region. There were presentations by local libraries, archives, and museums. There were also presentations by scholars and researchers. People were creating websites, doing bits of text mining, and trying to figure out how to improve upon the scholarly communications process. A few useful quotes included:

  • Design is a communication of knowledge. —Rebecca Tegtmeyer
  • Stop calling it DH… Show how DH supports the liberal arts… Build a support model… Integrate DH into the curriculum. —William Pannapacker
  • Analytic brilliance is no longer the only game in town. —Lior Shamir
  • Provenance verification, knowledge representation, and staffing are the particular challenges when it comes to making archival material accessible. —Arjun Sabharwal
  • Commenting should be a part of any museum’s website. —Adrienne Aluzzo

Day #2 consisted of participation in the Great Lakes THATCamp. I spent the time doing three things. First, I spent time thinking about a program I’m writing called PDF2TXT or maybe “Distant Reader”. The original purpose of the program is/was to simply extract the text from a PDF document. Since then it has succumbed to creeping featuritis to include the reporting of things like: readability scores, rudimentary word clouds of uni- and bi-grams, an extraction of the most frequent verb lemmas and a listing of the sentences where they are found, a concordance, and the beginnings of a network diagram illustrating what words are used “in the same breath” as other words. The purpose of the program is two-fold: 1) to allow the reader to get more out of their text(s), and 2) to advertise some of the services of the Libraries’ fledgling Center For Digital Scholarship. I presented a “geek short” on the application.

The second and third ways I spent my time were in group sessions. One was on the intersection of digital humanities and the scholarly communications process. The second was on getting digital humanities projects off the ground. In both cases folks discussed ways to promote library services, and it felt as if we were all looking for new ways to be relevant compared to fifty years ago when the great libraries were defined by the sizes of their collections.

I’m glad I attended the meetings. The venue — Lawrence Technical University — is a small but growing institution. Detroit is a city of big roads and big cars. The Detroit Art Institute was well worth the $8 admission fee, even if you do get a $30 parking ticket.

Data Information Literacy @ Purdue

Posted on October 4, 2013 in Uncategorized

By this time last week I had come and gone to the Data Information Literacy (DIL) Symposium at Purdue University. It was a very well-organized event, and I learned a number of things.

First of all, I believe the twelve DIL Competencies were well-founded and articulated:

  • conversion & interoperability
  • cultures of practice
  • curation & re-use
  • databases & formats
  • discovery & acquisition
  • ethics & attribution
  • management & organization
  • metadata & description
  • preservation
  • processing & analytics
  • quality & documentation
  • visualization & representation

For more detail of what these competencies mean and how they were originally articulated, see: Carlson, Jake R.; Fosmire, Michael; Miller, Chris; and Sapp Nelson, Megan, “Determining Data Information Literacy Needs: A Study of Students and Research Faculty” (2011). Libraries Faculty and Staff Scholarship and Research. Paper 23. http://docs.lib.purdue.edu/lib_fsdocs/23

I also learned about Bloom’s Taxonomy, a classification of learning objectives. At the bottom of this hierarchy/classification is remembering. The next level up is understanding. The third level is application. At the top of the hierarchy/classification are analysis, evaluation, and creation. According to the model, a person needs to move from remembering through to analysis, evaluation, and creation in order to really say they have learned something.

Some of my additional take-aways included: spend time teaching graduate students about data information literacy, and it is almost necessary to be embedded or directly involved in the data collection process in order to have a real effect — get into the lab.

About 100 people attended the event. It was two days long. Time was not wasted. There were plenty of opportunities for discussion & interaction. Hats off to Purdue. From my point of view, y’all did a good job. “Thank you.”

3-D printing in the Center For Digital Scholarship

Posted on October 3, 2013 in Uncategorized

"my" library

“my” library

This is the tiniest of blog postings outlining my experiences with 3-D printing.

The Libraries purchased a 3-D printer — a MakerBot Replicator 2X — and it arrived here in the Center For Digital Scholarship late last week. It can print things to sizes just smaller than a bread box — not very big. To make it go one feeds it a special file which moves — drives — a horizontal platform as well as a movable nozzle dispensing melted plastic. The “special file” is something only MakerBot understands, I think. But the process is more generalized than that. Ideally one would:

  1. use a CAD program to model a 3-D object
  2. convert the resulting CAD file to a MakerBot file
  3. print

Alternatively, a person can:

  1. visit Thingiverse
  2. download one of their thousands of files
  3. convert the file to a MakerBot file
  4. print

Another choice is to:

  1. visit TinkerCAD
  2. use their online software to design a model
  3. download the resulting file
  4. convert the file to a MakerBot file
  5. print

Yet another choice is to:

  1. obtain 123D Catch for your iPhone
  2. use it to take many photographs of an object
  3. edit and clean-up the resulting 3-D image with 123D Catch online
  4. download the resulting file
  5. convert the file to a MakerBot file
  6. print

The other day I downloaded a modeling program — 3-D Sculpt — for my iPad. Import a generic model. Use the tools to modify it. Save. Convert. Print.

To date I’ve only printed a bust of Michelangelo’s David and a model of a “library”. I’ve tried to print other sculptures but with little success.

How can this be used in a library, or more specifically, in our Center For Digital Scholarship? Frankly, I don’t know, yet, but I will think of something. For example, maybe I could print 3-D statistics. Or I could create a 3-D model representing the use of words in a book. Hmmm… Do you have any ideas?

HathiTrust Research Center Perl Library

Posted on September 12, 2013 in Uncategorized

This is the README file for a tiny library of Perl subroutines to be used against the HathiTrust Research Center (HTRC) application programmer interfaces (APIs). The Github distribution ought to contain a number of files, each briefly described below:

  • README.md – this file
  • LICENSE – a copy of the GNU Public License
  • htrc-lib.pl – our raison d’être; more below
  • search.pl – given a Solr query, return a list of no more than 100 HTRC identifiers
  • authorize.pl – given a client identifier and secret, return an authorization token
  • retrieve.pl – given a list of HTRC identifiers, return a zip stream of no more than 100 text and METS files
  • search-retrieve.pl – given a Solr query, return a zip stream of no more than 100 texts and METS files

The file doing the heavy lifting is htrc-lib.pl. It contains only three subroutines:

  1. search – given a Solr query, returns a list of no more than 100 HTRC identifiers
  2. obtainOAuth2Token – given a client ID and secret (supplied by the HTRC), return an authorization token; this token is expected to be included in the HTTP header of any HTRC Data API request.
  3. retrieve – given a client ID, secret, and list of HTRC identifiers, return a zip stream of no more than 100 HTRC text and METS files

The library is configured at the beginning of the file with three constants:

  1. SOLR – a stub URL pointing to the location of the HTRC Solr index, and in this configuration you can change the number of search results that will be returned
  2. AUTHORIZE – the URL pointing to the authorization engine
  3. DATAAPI – the URL pointing to the HTRC Data API, specifically the API to get volumes

The other .pl files in this distribution are the simplest of scripts demonstrating how to use the library.
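By way of illustration, here is a minimal sketch of how the library might be called. The subroutine names (search and retrieve) come from htrc-lib.pl as described above, but the query, the calling conventions, and the client ID/secret are assumptions and placeholders on my part:

#!/usr/bin/perl

# example.pl - an illustrative (and assumed) way of calling htrc-lib.pl;
# the query, argument order, and client id/secret below are placeholders

# require
use strict;
require 'htrc-lib.pl';

# configure with values supplied by the HTRC
my $client_id = 'YOUR-CLIENT-ID';
my $secret    = 'YOUR-SECRET';

# search the HTRC Solr index; returns a list of no more than 100 HTRC identifiers
my @identifiers = search( 'title:walden' );

# given a client id, secret, and list of identifiers, retrieve a zip stream
# of text and METS files
my $zip = retrieve( $client_id, $secret, @identifiers );

# save the stream for later "inflation" and analysis
open my $fh, '>', 'results.zip' or die $!;
binmode $fh;
print $fh $zip;
close $fh;

# done
exit;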

Be forewarned. The library does very little error checking, nor is there any more documentation beyond what you are reading here.

Before you will be able to use the obtainOAuth2Token and retrieve subroutines, you will need to acquire a client identifier and secret from the HTRC. These are required in order for the Center to track who is using their services.

The home page for the HTRC is http://www.hathitrust.org/htrc. From there you ought to be able to read more information about the Center and their supported APIs.

This software is distributed under the GNU Public License.

Finally, here is a suggestion of how to use this library:

  1. Use your Web browser to search the HTRC for content — https://htrc2.pti.indiana.edu/HTRC-UI-Portal2/ or https://sandbox.htrc.illinois.edu:8443/blacklight — ultimately generating a list of HTRC identifiers.
  2. Programmatically feed the list of identifiers to the retrieve subroutine.
  3. “Inflate” the zip stream into its constituent text and METS files.
  4. Do analysis against the result.

I’m tired. That is enough for now. Enjoy.

Drive By Shared Data: A Travelogue

Posted on June 8, 2013 in Uncategorized

Last Friday (May 31, 2013) I attended an interesting symposium at Northwestern University called Driven By Shared Data. This blog posting describes my experiences.

Driven By Shared Data was an OCLC-sponsored event with the purpose of bringing together librarians to discuss “opportunities and operational challenges of turning data into powerful analysis and purposeful action”. At first I thought the symposium was going to be about the curation of “research data”, but I was pleasantly surprised otherwise. The symposium was organized into a number of sections / presentations, each enumerated below:

  • Larry Birnbaum (Northwestern University) – Birnbaum’s opening remarks bordered on the topic of artificial intelligence. For a long time he has been interested in the problem of “find more like this one”. To address this problem, he often took initial queries sent to things like Google, syntactically altered the queries, and resubmitted them. Other times he looked at search results, did entity-extraction against them, looked for entities occurring less frequently, supplemented queries with these newly found entities, and repeated the search process. The result was usually a set of “interesting” search results — results that were not identical to the original but rather slightly askew. He also described and demonstrated a recommender service listing books of possible interest based on Twitter tweets. More recently he has been spending his time creating computer-generated narrative texts from sets of numeric data. For example, given the essential statistics from a baseball game, he and his colleagues have been able to generate newspaper stories describing the action of the game. “The game was tied until the bottom of the seventh inning when Bass Ball came to bat. Ball hit a double. Jim Hitter was up next, and blew one out of the park. The final score was three to one. Go home team!” What is the problem he is trying to solve? Rows and columns of data often do not make sense to the general reader. Illustrating the data graphically goes a long way to describing trends, but not everybody knows how to read graphs. Narrative texts supplement both the original data and graphical illustrations. His technique has been applied to all sorts of domains, from business to medicine. This is interesting because many times people don’t want words but images instead. (“A picture is worth a thousand words.”) Birnbaum is generating a thousand words from pictures as well as data sets. In the words of Birnbaum, “Stories make data meaningful.” Some of his work has been commercialized at a site called Narrative Science.
  • Deborah Blecic (University of Illinois at Chicago) – Blecic described how some of her collection development processes have changed with the availability of COUNTER statistics. She began by enumerating some of her older data sets: circulation counts, reshelving counts, etc. She then gave an overview of some of the data sets available from COUNTER: number of hits, number of reads, etc. Based on this new information she has determined how she is going to alter her subscription to “the Big Deal” when the time comes for changing it. She commented on the integrity of COUNTER statistics because they seem ambiguous. “What is a ‘read’? Is a read when a patron looks at the HTML abstract, or is a read when the patron downloads a PDF version of an article? How do the patrons identify items to ‘read’?” She is looking forward to COUNTER 4.
  • Dave Green (Northeastern Illinois University) – Green shared with the audience some of the challenges he has had when it came to dealing with data generated from an ethnography project. More specifically, Green is the Project Director for ERIAL. Through this project a lot of field work was done, and the data created was not necessarily bibliographic in nature. Examples included transcripts of interviews, cognitive maps, photographs, movies & videos, the results of questionnaires, etc. Being an anthropologic study, the data was more qualitative than quantitative. After analyzing their data, they learned how students of their libraries used the spaces, and instructors learned how to take better advantage of library services.
  • Kim Armstrong (CIC) – On the other hand, Armstrong’s data was wholly bibliographic in nature. She is deeply involved in a project to centrally store older and lesser used books and journals owned by CIC libraries. It is hard enough to coordinate all the libraries in the project, but trying to figure out who owns what is even more challenging because of evolving and local cataloging practices. While everybody used the MARC record as a data structure, there is little consistency between libraries on how data gets put into each of the field/subfields. “The folks at Google have much of our bibliographic data as a part of their Google Books Project, and even they are not able to write a ‘regular expression’ to parse serial holdings… The result is a ‘Frankenrun’ of journals.”
  • small group discussion – We then broke up into groups of five or six people. We were tasked with enumerating sets of data we have or we would like to have. We were then expected to ask ourselves what we would do with the data once we got it, what are some of the challenges we have with the data, and what are some of the solutions to the challenges. I articulated data sets including information about readers (“patrons” or “users”), information about what is used frequently or infrequently, tabulations of words and phrases from the full text of our collections, contact information of local grant awardees, and finally, the names and contact information of local editors of scholarly publications. As we discussed these data sets and others, challenges ranged from technical to political. Every solution seemed to be rooted in a desire for more resources (time and money).
  • Dorothea Salo (University of Wisconsin – Madison) – The event was brought to a close by Salo, who began by articulating the Three V’s of Big Data: volume, velocity, and variety. Volume alludes to the amount of data. Velocity refers to the frequency with which the data changes. Variety is an account of the data’s consistency. Good data, she says, is clean, consistent, easy to understand, and computable. She then asked, “Do libraries have ‘big data’?” And her answer was, “Yes and no.” Yes, we have volumes of bibliographic information, but it is not clean nor easy to understand. The challenges described by Armstrong are perfect examples. She says that our ‘non-computable’ datasets are costing the profession mind share, and we have only a limited amount of time to rectify the problem before somebody else comes up with a solution and bypasses libraries altogether. She also mentioned the power of data aggregation. Examples included OAIster, WorldCat, various union catalogs, and the National Science Foundation Digital Library. It did not sound to me as if she thought these efforts were successes. She alluded to the Digital Public Library of America, and because of its explicit policy for metadata use and re-use, she thinks it has potential, but only time will tell. She has a lot of faith in the idea of “linked data”, and frankly, that sounds like a great idea to me as well. What is the way forward? She advocated the creation of “library scaffolding” to increase internal library skills, and she did not advocate the hiring of specific people to do specific tasks in the expectation that they would solve all the problems.

After the meeting I visited the Northwestern main library and experienced the round rooms where books are shelved. It was interesting to see the ranges radiating from each room’s center. Along the way I autographed my book and visited the university museum, which had on display quite a number of architectural drawings.

Even though the symposium was not about “e-science research data”, I’m very glad I attended. Discussion was lively. The venue was intimate. I met a number of people, and my cognitive side was stimulated. Thank you for the opportunity.

Catholic pamphlets workflow

Posted on April 12, 2013 in Uncategorized

Gratuitous eye candy by Matisse

This is an outline of how we here at Notre Dame have been making digitized versions of our Catholic pamphlets available on the Web — a workflow:

  1. Save PDF files to a common file system – This can be as simple as a shared hard disk or removable media.
  2. Ingest PDF files into Fedora to generate URLs – The PDF files are saved in Fedora for the long haul.
  3. Create persistent URLs and return a list of system numbers and… URLs – Each PDF file is given a PURL for the long haul. Output a delimited file containing system numbers in one column and PURLs in another. (Steps #2 and #3 are implemented with a number of Ruby scripts: batch_ingester.rb, book.rb, mint_purl.rb, purl_config.rb, purl.rb, repo_object.rb.)
  4. Update Filemaker database with URLs for quality assurance purposes – Use the PURLs from the previous step and update the local database so we can check the digitization process.
  5. Start quality assurance process and cook until done – Look at each PDF file making sure it has been digitized correctly and thoroughly. Return poorly digitized items back to the digitization process.
  6. Use system numbers to extract MARC records from Aleph – The file names of each original PDF document should be an Aleph system number. Use the list of numbers to get the associated bibliographic data from the integrated library system.
  7. Edit MARC records to include copyright information and URLs to the PDF files – Update the bibliographic records using scripts called list-copyright.pl and update-marc.pl. The first script outputs a list of copyright information that is used as input for the second script, which adds the copyright information as well as pointers to the PDF documents.
  8. Duplicate MARC records and edit them to create electronic resource records – Much of this work is done using MARCEdit
  9. Put newly edited records into Aleph test – Ingest the newly created records into a staging area.
  10. Check records for correctness – Given enough eyes, all bugs are shallow.
  11. Put newly edited records into Aleph production – Make the newly created records available to the public.
  12. Extract newly created MARC records with new system numbers – These numbers are needed for the concordance program — a way to link back from the concordance to the full bibliographic record.
  13. Update concordance database and texts – Use something like pdftotext to extract the OCR from the scanned PDF documents (a sketch of this step appears after this list). Save the text files in a place where the concordance program can find them. Update the concordance’s database, linking keys to bibliographic information as well as locations of the text files. All of this is done with a script called extract.pl.
  14. Create Aleph Sequential File to add concordance links – This script (marc2aleph.pl) will output something that can be used to update the bibliographic records with concordance URLs — an Aleph Sequential File.
  15. Run Sequential File to update MARC records with concordance link – This updates the bibliographic information accordingly.
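As an aside, the text extraction half of Step #13 can be as simple as looping over a directory of PDF files and shelling out to pdftotext. Below is a minimal sketch under that assumption; the directory names are placeholders, and the real extract.pl does more, updating the concordance’s database as well:

#!/usr/bin/perl

# pdf2text-batch.pl - extract the OCRed text from a directory of PDF files;
# assumes the pdftotext command-line tool is installed and on your path;
# the directory names are placeholders

# require
use strict;
use warnings;

# configure
my $pdfdir = './pdfs';
my $txtdir = './texts';

# make sure the output directory exists
mkdir $txtdir unless -d $txtdir;

# process each PDF file
opendir my $dh, $pdfdir or die "Can't open $pdfdir: $!";
for my $pdf ( grep { /\.pdf$/i } readdir $dh ) {

	# the file name (sans extension) is an Aleph system number
	( my $key = $pdf ) =~ s/\.pdf$//i;

	# extract the text into a plain text file named after the key
	system( 'pdftotext', "$pdfdir/$pdf", "$txtdir/$key.txt" ) == 0
		or warn "pdftotext failed for $pdf\n";

}
closedir $dh;

# done
exit;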

Done, but I’m sure your mileage will vary.

Digital Scholarship Grilled Cheese Lunch

Posted on April 5, 2013 in Uncategorized

Grilled Cheese Lunch Attendees

In the Fall the Libraries will be opening a thing tentatively called The Hesburgh Center for Digital Scholarship. The purpose of the Center will be to facilitate learning, teaching, and research across campus through the use of digital technology.

For the past few months I have been visiting other centers across campus in order to learn what they do and how we can work collaboratively with them. These centers included the Center for Social Research, the Center for Creative Computing, the Center for Research Computing, the Kaneb Center, Academic Technologies, as well as a number of computer labs/classrooms. Since we all have more things in common than differences, I recently tried to build a bit of community through a grilled cheese lunch. The event was an unqualified success, and pictured are some of the attendees.

Fun with conversation and food.