Archive for the ‘Uncategorized’ Category

Learning from the Power Plant

Posted on November 7, 2023 in Uncategorized

Today (September 21, 2023) I got a tour of the University’s power plant, and I believe we could learn from them.

Today, I had the opportunity to get a tour of the University's power plant. The facility was large, loud, clean, efficient, and seemingly operated by dedicated professionals with wide and deep experience. Much of the power is generated through the burning of natural gas, but once water is turned into steam it is used over and over, again and again. The power plant supplements its services with geo-thermal energy, solar power, and water-spun turbines. The plant works in cooperation with the other power utilities in the area, but it also has the ability to operate completely independently. In a great many ways, the power plant is self-sufficient; moreover, it continues to evolve and improve both its services and operations.

I then asked myself, "What is the 'business' of the University?", and the answer alludes to teaching, learning, research, and Catholicism. The power plant really has nothing to do with those things, and yet, nowadays, it seems fashionable to outsource such non-core aspects of a workplace. Companies will lease a building and hire other people to do the maintenance and cleaning. Restaurants will not launder their own napkins nor table cloths. Services are contracted to mow our grass or plow our snow. In our own workplace, we increasingly outsource digital collection, preservation, and metadata creation operations. For example, to what degree are we really & truly curating the totality of theses and dissertations created here at Notre Dame? Similarly, to what degree are we curating the scholarly record?

I asked the leader of the tour, "Running a power plant is not the core business of the University, so why is it not outsourced?" And the answer was, "Once, such was considered, and there was a study; it was deemed more cost effective to run our own plant." I then asked myself, "To what degree have we — the Libraries and the wider library profession — done similar studies?" Personally, I have not seen nor heard of any such things, and if they do exist, then to what degree have they been rooted in anything more than anecdotal evidence?

Our University has a reputation for being self-sufficient and independent. Think football. Think the power plant. Think the police, fire, religious, postal, food, grounds, housing, and banking services. Why not the Libraries? How are we any different?

I assert that if we — the Libraries — were to divert some of our contracted services and licensing fees to the development of our people, then we too would become more independent, more knowledgeable, and more able to evolve with the ever-changing environment, just like the power plant. After all, we too are large, clean, efficient, and operated by dedicated professionals with wide and deep experience. (We're not loud.)

Given a short, limited period of time, I suggest we more systematically digitize a greater part of our collections, pro-actively collect born-digital materials freely available on the 'Net, catalog things at scale and support the Semantic Web, etc. Along the way our skills will increase, new ways of thinking will emerge, and we will feel empowered as opposed to powerless. Only after we actually give this a try — do a study — will we be able to measure the cost-effectiveness of outsourcing. Is outsourcing really worth the cost?

Again, the University has a reputation for being independent. Let’s try to put some of that philosophy into practice here in the Libraries.

OJS Toolbox

Posted on October 26, 2019 in Uncategorized

Given an Open Journal Systems (OJS) root URL and an authorization token, cache all JSON files associated with the given OJS title, and optionally output rudimentary bibliographics in the form of a tab-separated value (TSV) stream. [0]


OJS is a journal publishing system. [1] It supports a RESTful API allowing the developer to read & write to the System's underlying database. [2] This hack — the OJS Toolbox — merely caches & reads the metadata associated with the published issues of a given journal title.

The Toolbox is written in Bash. To cache the metadata, you will need to have additional software as part of your file system: curl and jq. [3, 4] Curl is used to interact with the API. Jq is used to read & parse the resulting JSON streams. When & if you want to transform the cached JSON files into rudimentary bibliographics, then you will also need to install GNU Parallel, a tool which makes parallel processing trivial. [5]

Besides the software, you will need three pieces of information. The first is the root URL of the OJS system/title you wish to use. This value will probably look something like this: https://example.com/index.php/foo. Ask the OJS systems administrator regarding the details. The second piece of information is an authorization token. If an "api secret" has been created by the local OJS systems administrator, then each person with an OJS account ought to have been granted a token. Again, ask the OJS systems administrator for details. The third piece of information is the name of a directory where your metadata will be cached. For the sake of an example, assume the necessary values are:

  1. root URL – https://example.com/index.php/foo
  2. token – xyzzy
  3. directory – bar

Once you have gotten this far, you can cache the totality of the issue metadata:

$ ./bin/harvest.sh https://example.com/index.php/foo xyzzy bar

More specifically, `harvest.sh` will create a directory called bar. It will then determine how many issues exist in the title foo. It will then harvest sets of issue data, parse each set into individual issue files, and save the result as JSON files in the bar directory. You now have a "database" containing all the bibliographic information of a given title.
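
For the curious, the gist of the harvesting step might be sketched in Python along the following lines. It is merely an illustration: the endpoint (/api/v1/issues) and the apiToken query parameter reflect my understanding of the OJS 3.x API, and the Toolbox itself does this work with Bash, curl, and jq.

  # harvest.py - an illustrative sketch of the harvesting step; the endpoint
  # and apiToken parameter are assumptions about the OJS 3.x REST API
  import json, os, sys
  import urllib.request

  def harvest(root, token, directory, page_size=20):
      os.makedirs(directory, exist_ok=True)
      offset = 0
      while True:
          url = f"{root}/api/v1/issues?count={page_size}&offset={offset}&apiToken={token}"
          with urllib.request.urlopen(url) as response:
              data = json.load(response)
          items = data.get("items", [])
          if not items:
              break
          # save each issue as its own JSON file, named by its identifier
          for issue in items:
              with open(os.path.join(directory, f"{issue['id']}.json"), "w") as handle:
                  json.dump(issue, handle, indent=2)
          offset += len(items)

  if __name__ == "__main__":
      harvest(sys.argv[1], sys.argv[2], sys.argv[3])

Invoked as "python3 harvest.py https://example.com/index.php/foo xyzzy bar", it mirrors the behavior of harvest.sh described above.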

For my purposes, I need a TSV file with four columns: 1) author, 2) title, 3) date, and 4) url. Such is the purpose of `issues2tsv.sh` and `issue2tsv.sh`. The first script, `issues2tsv.sh`, takes a directory as input. It then outputs a simple header, finds all the JSON files in the given directory, and passes them along (in parallel) to `issue2tsv.sh` which does the actual work. Thus, to create my TSV file, I submit a command like this:

$ ./bin/issues2tsv.sh bar > ./bar.tsv

The resulting file (bar.tsv) looks something like this:

author title date url
Kilgour The Catalog 1972-09-01 https://example.com/index.php/foo/article/download/5738/5119
McGee Two Designs 1972-09-01 https://example.com/index.php/foo/article/download/5739/5120
Saracevic Book Reviews 1972-09-01 https://example.com/index.php/foo/article/download/5740/5121
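
Similarly, the reduction of a single cached issue file into TSV rows might look something like the sketch below. The JSON field names (articles, authorsString, and so on) are illustrative guesses on my part, not necessarily the names used by the OJS API or by issue2tsv.sh.

  # issue2tsv.py - a hedged sketch of turning one cached issue file into TSV
  # rows; the field names used here are assumptions for illustration only
  import json, sys

  def issue_to_rows(filename):
      with open(filename) as handle:
          issue = json.load(handle)
      for article in issue.get("articles", []):
          yield [article.get("authorsString", ""),
                 article.get("fullTitle", ""),
                 article.get("datePublished", ""),
                 article.get("urlPublished", "")]

  if __name__ == "__main__":
      print("\t".join(["author", "title", "date", "url"]))
      for filename in sys.argv[1:]:
          for row in issue_to_rows(filename):
              print("\t".join(str(value) for value in row))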

Given such a file, I can easily download the content of a given article, extract any of its plain text, perform various natural language processing tasks against it, text mine the whole, full-text index the whole, apply various bits of machine learning against the whole, and in general, "read" the totality of the journal. See The Distant Reader for details. [6]

Links

[0] OJS Toolbox – https://github.com/ericleasemorgan/ojs-toolbox
[1] OJS – https://pkp.sfu.ca/ojs/
[2] OJS API – https://docs.pkp.sfu.ca/dev/api/ojs/3.1
[3] curl – https://curl.haxx.se
[4] jq – https://stedolan.github.io/jq/
[5] GNU Parallel – https://www.gnu.org/software/parallel/
[6] Distant Reader – https://distantreader.org

Fantastic Futures: My take-aways

Posted on December 11, 2018 in Uncategorized

This is the briefest of take-aways from my attendance at Fantastic Futures, a conference on artificial intelligence (AI) in libraries. [1] From the introduction to the conference announcement:

The Fantastic futures-conferences, which takes place in Oslo december 5th 2018, is a collaboration between the National Library of Norway and Stanford University Libraries, and was initiated by the National Librarian at the National Library of Norway, Aslak Sira Myhre and University Librarian at Stanford University Libraries, Michael Keller.

First of all, I had the opportunity to attend and participate in a pre-conference workshop. Facilitated by Nicole Coleman (Stanford University) and Svein Brygfjeld (National Library of Norway), the workshop's primary purpose was to ask questions about AI in libraries, and to build community. To those ends the two dozen or so of us were divided into groups where we discussed what a few AI systems might look like. I was in a group discussing the possibilities of reading massive amounts of text and/or refining information retrieval based on reader profiles. In the end our group thought such things were feasible, and we outlined how they might be accomplished. Other groups discussed things such as metadata creation and collection development. Towards the end of the day we brainstormed next steps, deciding at the very least to use the ai4lib mailing list to a greater degree. [2]

The next day, the first real day of the conference, was attended by more than a couple hundred people. Most were from Europe, obviously, but from my perspective about as many were librarians as non-librarians. There was an appearance by Nancy Pearl, who, as you may or may not know, is a Seattle Public Library librarian embodied as an action figure. [3] She was brought to the conference because the National Library of Norway's AI system is named Nancy. A few notable quotes from some of the speakers, at least from my perspective, included:

  • George Zarkadakis – “Robots ought not to pretend to not be robots.”
  • Meredith Broussard – “AI uses quantitative data but qualitative data is necessary also.”
  • Barbara McGillivray – “Practice the typical research process but annotate it with modeling; humanize the algorithms.”
  • Nicole Coleman – “Put the human in the loop … The way we model data influences the way we make interpretations.”

The presenters generated lively discussion, and I believe the conference was deemed a success by the vast majority of attendees. It is quite likely the conference will be repeated next year and be held at Stanford.

What are some of my take-aways? Hmmm:

  1. Machine learning is simply the latest incarnation of AI, and machine learning algorithms are only as unbiased as the data used to create them. Be forewarned.
  2. We can do this. We have the technology.
  3. There is too much content to process, and AI in libraries can be used to do some of the more mechanical tasks. The creation and maintenance of metadata is a good example. But again, be forewarned. We were told this same thing with the advent of word processors, and in the end, we didn't go home early because we got our work done. Instead we output more letters.
  4. Metadata is not necessary. Well, that was sort of a debate, and (more or less) deemed untrue.

It was an honor and a privilege to attend the pre-conference workshop and conference. I sincerely believe AI can be used in libraries, and the use can be effective. Putting AI into practice will take time, energy, & prioritization. How to do this and simultaneously "keep the trains running" will be a challenge. On the other hand, AI in libraries can be seen as an opportunity to demonstrate the inherent worth of cultural heritage institutions. ai4lib++

P.S. Along the way I got to see some pretty cool stuff: Viking ships, a fort, “The Scream”, and a “winterfest”. I also got to experience sunset at 3:30 in the afternoon.


Links

[1] Fantastic Futures – https://www.nb.no/artikler/fantastic-futures/
[2] ai4lib – https://groups.google.com/forum/#!forum/ai4lib
[3] action figure – https://www.amazon.com/Nancy-Pearl-Librarian-Action-Figure/dp/B0006FU9EG

marc2catalog

Posted on July 2, 2018 in Uncategorized

Given a set of MARC records, output a set of library catalogs

This set of scripts will take a set of MARC data, parse it into a simple, rudimentary SQLite database, and then generate a report against the database in the form of plain text files — a set of "library catalogs & indexes". These catalogs & indexes are intended to be printed, but they can also be used to support rudimentary search via one's text editor. For extra credit, the programmer could read the underlying database, feed the result to an indexer, and create an OPAC (online public access catalog).

The system requires a bit of infrastructure: 1) Bash, 2) Perl, 3) a Perl module named MARC::Batch, 4) the DBI driver for SQLite.

The whole MARC-to-catalog process can be run with a single command:

./bin/make-all.sh <marc> <name>

Where <marc> is the name of the MARC file, and <name> is a one-word moniker for the collection. The distribution comes with sample data, and therefore an example execution includes:

./bin/make-all.sh ./etc/morse.mrc morse

The result ought to be the creation of a .db file in the ./etc directory, a collections directory with a sub-directory for the given collection, and a set of plain text files in the latter. The plain text files are intended to be printed or given away like candy to interested learners or scholars.
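
For readers who would rather not read Perl, the parsing step might be sketched in Python along these lines. The sketch assumes the pymarc library and is purely illustrative; marc2catalog itself relies on MARC::Batch and DBI.

  # marc2db.py - an illustrative sketch of the MARC-to-database step, assuming
  # the pymarc library; it is not the Perl code used by marc2catalog
  import sqlite3, sys
  from pymarc import MARCReader

  def first_subfield(record, tag, code):
      # return the first occurrence of a given subfield, or an empty string
      for field in record.get_fields(tag):
          values = field.get_subfields(code)
          if values:
              return values[0]
      return ""

  def marc_to_sqlite(marc_file, db_file):
      db = sqlite3.connect(db_file)
      db.execute("CREATE TABLE IF NOT EXISTS catalog (author TEXT, title TEXT)")
      with open(marc_file, "rb") as handle:
          for record in MARCReader(handle):
              author = first_subfield(record, "100", "a")  # main entry (author)
              title  = first_subfield(record, "245", "a")  # title proper
              db.execute("INSERT INTO catalog VALUES (?, ?)", (author, title))
      db.commit()
      db.close()

  if __name__ == "__main__":
      marc_to_sqlite(sys.argv[1], sys.argv[2])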

The code for marc2catalog ought to be available on GitHub.

Project English: An Index to English/American literature spanning six centuries

Posted on April 24, 2018 in Uncategorized

I have commenced upon a project to build an index and set of accompanying services rooted in English/American literature spanning the 15th to 20th centuries. For lack of something better, I call it Project English. This blog posting describes Project English in greater detail.

Goals & scope

The goals of the Project include but are not limited to:

  • provide enhanced collections & services to the University of Notre Dame community
  • push the boundaries of librarianship

To accomplish these goals I have acquired a subset of three distinct and authoritative collections of English/American literature:

  1. EEBO – Early English Books Online, which has its roots in the venerable Short-Title Catalogue of English Books
  2. ECCO – Eighteenth Century Collection Online, which is an extension of the Catalogue
  3. Sabin – Bibliotheca Americana: A Dictionary of Books Relating to America from Its Discovery to the Present Time originated by Joseph Sabin

More specifically, the retired and emeritus English Studies Librarian, Laura Fuderer, purchased hard drives containing the full text of the aforementioned collections. Each item in the collection is manifested as an XML file and a set of JPEG images (digital scans of the original materials). The author identified the hard drives, copied some of the XML files, and began the Project. To date, the collection includes:

  • 56 thousand titles
  • 7.6 million pages
  • 2.3 billion words

At the present time, the whole thing consumes 184 GB of disk space where approximately 1/3 of it is XML files, 1/3 of it is HTML files transformed from the XML, and 1/3 is plain text files transformed from the XML. At the present time, there are no image nor PDF files in the collection.

On average, each item in the collection is approximately 135 pages (or 46,000 words) long. As of right now, each sub-collection is equally represented. The vast majority of the collection is in English, but other languages are included. Most of the content was published in London. The distribution of centuries is beginning to appear balanced, but determining the century of publication is complicated by the fact that the metadata's date values are not expressed as integers. The following charts & graphs illustrate all of these facts.

(Charts & graphs: years, sub-collections, languages, and cities.)

Access & services

By default, the collection is accessible via freetext/fielded/faceted searching. Given an EEBO, ECCO, or Sabin identifier, the collection is also accessible via known-item browse. (Think "call number".) Search results can optionally be viewed and sorted using a tabled interface. (Think "spreadsheet".) The reader has full access to:

  1. the original XML data – hard to read but rich in metadata
  2. rudimentary HTML – transformed from the original XML and a bit easier to read
  3. plain text – intended for computational analysis

Search results and their associated metadata can also be downloaded en masse. This enables the reader to do offline analysis such as text mining, concordancing, parts-of-speech extraction, or topic modeling. Some of these things are currently implemented inline, including:

  • listing the frequency of unigrams, bigrams, and trigrams (a small sketch appears after this list)
  • listing the frequency of noun phrases, the subjects & objects of sentences
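
As a flavor of the former, the unigram/bigram frequencies might be computed with something like the following sketch. It is merely illustrative and is not the code behind Project English; it assumes one of the plain text files described above as input.

  # ngrams.py - a minimal sketch of listing unigram and bigram frequencies
  # against a plain text file; illustrative only
  import re, sys
  from collections import Counter

  def ngrams(words, n):
      # generate overlapping n-grams from a list of words
      return zip(*(words[i:] for i in range(n)))

  def frequencies(filename, n=1, top=25):
      with open(filename, encoding="utf-8", errors="ignore") as handle:
          words = re.findall(r"[a-z']+", handle.read().lower())
      return Counter(" ".join(gram) for gram in ngrams(words, n)).most_common(top)

  if __name__ == "__main__":
      for gram, count in frequencies(sys.argv[1], n=2):
          print(f"{count}\t{gram}")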

For example, the reader can first create a set of one or more items of interest. They can then do some “distant” or “scalable” reading against the result. In its current state, Project English enables the reader to answer questions like:

  • What is this particular item about?
  • To what degree does this item mention the words God, man, truth, or beauty?

As Project English matures, it will enable the reader to answer additional questions, such as:

  • What actions take place in a given corpus?
  • How are things described?
  • If one were to divide the collection into T themes, then what might those themes be?
  • How has a theme changed over time?
  • Who, what places, and what organizations appear in the corpus?
  • What ideas appear concurrently in a corpus?

Librarianship

Remember, one of the goals of the Project is to push the boundaries of librarianship. With the advent of ubiquitous networked computers, the traditional roles of librarianship are not as important as they previously were. (I did not say the roles were unimportant, just not as important as they used to be.) Put another way, there is less of a need for the collection, organization, preservation, and dissemination of data, information, and knowledge. Much of this work is being facilitated through the Internet. This then begs the question, “Given the current environment, what are or can be the roles of (academic) libraries?” In the author’s opinion, the roles are rooted in two activities:

  1. the curation of rare & infrequently held materials
  2. the provision of value-added services against those materials

In the case of Project English, the rare & infrequently held materials are full text items dating from the 15th to 20th centuries. When it is all said & done, the collection may come close to 2.5 million titles in size, a modest library by most people's standards. These collections are being curated with scope, with metadata, with preservation, and with quick & easy access. The value-added services are fledgling, but they will include a set of text mining & natural language processing interfaces enabling the learner, teacher, and scholar to do "distant" and "scalable" reading. In other words, instead of providing access to materials and calling the work of librarianship done, Project English will enable & empower the reader to use & understand the materials they have acquired.

Librarianship needs to go beyond the automation of traditional tasks; it behooves librarianship to exploit computers to a greater degree and use them to augment & supplement the profession’s reason and experience. Project English is one librarian’s attempt to manifest this idea into a reality.

LexisNexis hacks

Posted on December 18, 2017 in Uncategorized

This blog posting briefly describes and makes available two Python scripts I call my LexisNexis hacks.

The purpose of the scripts is to enable the researcher to reformulate LexisNexis full text downloads into tabular form. To accomplish this goal, the researcher is expected to first search LexisNexis for items of interest. They are then expected to do a full text download of the results as a plain text file. Attached ought to be an example that includes about five records. The first of my scripts — results2files.py — parses the search results into individual records. The second script — files2table.py — reads the output of the first script and parses each file into individual but selected fields. The output of the second script is a tab-delimited file suitable for further analysis in any number of applications.
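
For the sake of illustration, the splitting step might look something like the following sketch. The record-delimiter pattern is an assumption about the shape of a LexisNexis full text download, and the actual results2files.py may well work differently.

  # results2files-sketch.py - a hedged sketch of splitting a LexisNexis download
  # into individual records; the delimiter pattern below is an assumption
  import os, re, sys, uuid

  DELIMITER = re.compile(r"\n\s*\d+ of \d+ DOCUMENTS\s*\n")  # assumed delimiter

  def results_to_files(results_file, directory="./records"):
      os.makedirs(directory, exist_ok=True)
      with open(results_file, encoding="utf-8", errors="ignore") as handle:
          records = [r.strip() for r in DELIMITER.split(handle.read()) if r.strip()]
      for record in records:
          # random (UUID-based) file names, in the spirit of the original
          path = os.path.join(directory, uuid.uuid4().hex + ".txt")
          with open(path, "w", encoding="utf-8") as handle:
              handle.write(record)

  if __name__ == "__main__":
      results_to_files(sys.argv[1])

Using UUID-based names also sidesteps the (unlikely) name-collision caveat mentioned below.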

These two scripts can work for a number of people, but there are a few caveats. First, results2files.py saves its results as a set of files with randomly generated file names. It is possible, albeit unlikely, that files will get overwritten because the same randomly generated file name was… generated. Second, the output of files2table.py only includes fields required for a specific research question. It is left up to the reader to edit files2table.py for additional fields.

In short, your mileage may vary.

Stories: Interesting projects I worked on this past year

Posted on August 9, 2017 in Uncategorized

This is a short list of "stories" outlining some of the more interesting projects I worked on this past year:

  • Ask Putin – A faculty member from the College of Arts & Letters acquired the 950-page Cyrillic transcript of a television show called "Ask Putin". The faculty member had marked up the transcription by hand in order to analyze the themes conveyed therein. They then visited the Center for Digital Scholarship, and we implemented a database version of the corpus. By counting & tabulating the roots of each of the words for each of the sixteen years of the show, we were able to quickly & easily confirm many of the observations the faculty member had generated by hand. Moreover, the faculty member was able to explore additional themes which they had not previously coded.
  • Who's related to whom – A visiting scholar from the Kroc Center asked the Center for Digital Scholarship to extract all of the "named entities" (names, places, & things) from a set of Spanish language newspaper articles. Based on the strength of the relationships between the entities, the scholar wanted a visualization to be created illustrating who was related to whom in the corpus. When we asked more about the articles and their content, we learned we had been asked to map the Colombian drug cartel. While incomplete, the framework of this effort will possibly be used by a South American government.
  • Counting 250,000,000 words – Working with Northwestern University and Washington University in St. Louis, the Center for Digital Scholarship is improving access & services against the set of literature called "Early English Books". This corpus spans the years 1460 through 1699 and is very representative of English literature of that time. We have been creating more accurate transcriptions of the texts, digitizing original items, and implementing ways to do "scalable reading" against the whole. After all, it is difficult to read 60,000 books. Through this process each & every word from the transcriptions has been saved in a database for future analysis. To date the database includes a quarter of a billion (250,000,000) rows. See: http://cds.crc.nd.edu
  • Convocate – In conjunction with the Center for Civil and Human Rights, the Hesburgh Libraries created an online tool for comparing & contrasting human rights policy written by the Vatican and various non-governmental agencies. As a part of this project, the Center for Digital Scholarship wrote an application that read each & every paragraph from the thousands of pages of text. The application then classified each & every paragraph with one or more keyword terms for the purposes of more accurate & thorough discovery across the corpus. The results of this application enable the researcher to find items of similar interest even if they employ sets of dispersed terminology. For more detail, see: https://convocate.nd.edu

tei2json: Summarizing the structure of Early English poetry and prose

Posted on January 17, 2017 in Uncategorized

This posting describes a hack of mine, tei2json.pl – a Perl program to summarize the structure of Early English poetry and prose. [0]

In collaboration with Northwestern University and Washington University, the University of Notre Dame is working on a project whose primary purpose is to correct ("annotate") the Early English corpus created by the Text Creation Partnership (TCP). My role in the project is to do interesting things with the corpus once it has been corrected. One of those things is the creation of metadata files denoting the structure of each item in the corpus.

Some of my work is really an effort to reverse engineer good work done by the late Sebastian Rahtz. For example, Mr. Rahtz cached a version of the TCP corpus, transformed each item into a number of different formats, and put the whole thing on GitHub. [1] As a part of this project, he created metadata files enumerating what TEI elements were in each file and what attributes were associated with each element. The result was an HTML display allowing the reader to quickly see how many bibliographies an item may have, what languages may be present, how long the document was measured in page breaks, etc. One of my goals is/was to do something very similar.

The workings of the script are really very simple: 1) configure and denote what elements to count & tabulate, 2) loop through each configuration, 3) keep a running total of the result, 4) convert the result to JSON (a specific data format), and 5) save the result to a file. Here are (temporary) links to a few examples:

JSON files are not really very useful in & of themselves; JSON files are designed to be transport mechanisms allowing other applications to read and process them. This is exactly what I did. In fact, I created two different applications: 1) json2table.pl and 2) json2tsv.pl. [2, 3] The former script takes a JSON file and creates an HTML file whose appearance is very similar to Rahtz's. Using the JSON files (above), the following HTML files have been created through the use of json2table.pl:

The second script (json2tsv.pl) allows the reader to compare & contrast structural elements between items. Json2tsv.pl reads many JSON files and outputs a matrix of values. This matrix is a delimited file suitable for analysis in spreadsheets, database applications, statistical analysis tools (such as R or SPSS), or programming language libraries (such as Python's numpy or Perl's PDL). In its present configuration, json2tsv.pl outputs a matrix looking like this:

id      bibl  figure  l     lg   note  p    q
A00002  3     4       4118  490  8     18   3
A00011  3     0       2     0    47    68   6
A00089  0     0       0     0    0     65   0
A00214  0     0       0     0    151   131  0
A00289  0     0       0     0    41    286  0
A00293  0     1       189   38   0     2    0
A00395  2     0       0     0    0     160  2
A00749  0     4       120   18   0     0    2
A00926  0     0       124   12   0     31   7
A00959  0     0       2633  9    0     4    0
A00966  0     0       2656  0    0     17   0
A00967  0     0       2450  0    0     3    0

Given such a file, the reader could then ask & answer questions such as:

  • Which item has the greatest number of figures?
  • What is the average number of lines per line group?
  • Is there a statistical correlation between paragraphs and quotes?

Additional examples of input & output files are temporarily available online. [4]
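
To give a flavor of the approach, the count-and-tabulate step might be sketched as follows. This is an illustration written in Python with the standard library; it is not the Perl original, and the list of elements simply mirrors the columns of the matrix above.

  # tei2json-sketch.py - count selected TEI elements and output the tallies as
  # JSON; an illustration only, not the Perl original
  import json, sys
  import xml.etree.ElementTree as ET

  ELEMENTS = ["bibl", "figure", "l", "lg", "note", "p", "q"]  # what to count

  def summarize(tei_file):
      counts = {element: 0 for element in ELEMENTS}
      # iterate over every node, ignoring namespaces by using the local tag name
      for node in ET.parse(tei_file).iter():
          tag = node.tag.split("}")[-1]
          if tag in counts:
              counts[tag] += 1
      return counts

  if __name__ == "__main__":
      print(json.dumps(summarize(sys.argv[1]), indent=2))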

My next steps include at least a couple of things. One, I need/want to evaluate whether or not to save my counts & tabulations in a database before (or after) creating the JSON files. The data may prove to be useful there. Two, as a librarian, I want to go beyond qualitative description of narrative texts, and the counting & tabulating of structural elements moves in that direction, but it does not really address the "aboutness", "meaning", nor "allusions" found in a corpus. Sure, librarians have applied controlled vocabularies and bits of genre to metadata descriptions, but such things are not quantitative and consequently elude statistical analysis. For example, using sentiment analysis one could measure and calculate the "lovingness", "war mongering", "artisticness", or "philosophic nature" of the texts. One could count & tabulate the number of times family-related terms are used, assign the result a score, and record the score. One could then amass all documents and sort them by how much they discussed family, love, philosophy, etc. Such is on my mind, and more than half-way baked. Wish me luck.

Synonymizer: Using Wordnet to create a synonym file for Solr

Posted on January 16, 2017 in Uncategorized

This posting describes a little hack of mine, Synonymizer — a Python-based CGI script to create synonym files suitable for use with Solr and other applications. [0]

Human language is ambiguous, and computers are rather stupid. Consequently computers often need to be explicitly told what to do (and how to do it). Solr is a good example. I might tell Solr to find all documents about dogs, and it will dutifully go off and look for things containing d-o-g-s. Solr might think it is smart by looking for d-o-g as well, but such is a heuristic, not necessarily a real understanding of the problem at hand. I might say, “Find all documents about dogs”, but I might really mean, “What is a dog, and can you give me some examples?” In which case, it might be better for Solr to search for documents containing d-o-g, h-o-u-n-d, w-o-l-f, c-a-n-i-n-e, etc.

This is where Solr synonym files come in handy. There are one or two flavors of Solr synonym files, and the one created by my Synonymizer is a simple line-delimited list of concepts, where each line is a comma-separated list of words or phrases. For example, the following is a simple Solr synonym file denoting four concepts (beauty, honor, love, and truth):

  beauty, appearance, attractiveness, beaut
  honor, abide by, accept, celebrate, celebrity
  love, adoration, adore, agape, agape love, amorousness
  truth, accuracy, actuality, exactitude

Creating a Solr synonym file is not really difficult, but it can be tedious, and the human brain is not always very good at multiplying ideas. This is where computers come in. Computers do tedium very well. And with the help of a thesaurus (like WordNet), multiplying ideas is easier.

Here is how Synonymizer works. First, it reads a configured database of previously generated synonyms.† In the beginning, this file is empty but must be readable and writable by the HTTP server. Second, Synonymizer reads the database and offers the reader three options: 1) create a new set of synonyms, 2) edit an existing synonym, or 3) generate a synonym file. If Option #1 is chosen, then input is garnered and looked up in WordNet. The script will then enable the reader to disambiguate the input through the selection of apropos definitions. Upon selection, both WordNet hyponyms and hypernyms will be returned. The reader then has the opportunity to select desired words/phrases as well as enter any of their own design. The result is saved to the database. The process is similar if the reader chooses Option #2. If Option #3 is chosen, then the database is read, reformatted, and output to the screen as a stream of text to be used in Solr or something else that may require similar functionality. Because Option #3 is generated with a single URL, it is possible to programmatically incorporate the synonyms into your Solr indexing pipeline.
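
For those curious, the WordNet lookup at the heart of the process might look something like the sketch below. I use NLTK's WordNet interface here purely for illustration (the WordNet data must have been downloaded beforehand); Synonymizer itself adds the interactive disambiguation and database steps described above.

  # wordnet-sketch.py - gather hyponyms & hypernyms for a word and emit one
  # line of a Solr synonym file; an illustration, not Synonymizer itself
  from nltk.corpus import wordnet   # requires: nltk.download('wordnet')

  def related_words(word):
      related = {word}
      for synset in wordnet.synsets(word):
          related.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())
          for neighbor in synset.hyponyms() + synset.hypernyms():
              related.update(lemma.name().replace("_", " ") for lemma in neighbor.lemmas())
      return related

  def synonym_line(word):
      # one concept per line; a comma-separated list of words & phrases
      return ", ".join(sorted(related_words(word), key=str.lower))

  if __name__ == "__main__":
      print(synonym_line("love"))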

The Synonymizer is not perfect.‡ For example, it only creates one of the two different types of Solr synonym files. Second, while Solr can use the generated synonym file, search results implement phrase searches poorly, and this is a well-known issue. [1] Third, editing existing synonyms does not really take advantage of previously selected items; data-entry is tedious but not as tedious as writing the synonym file by hand. Fourth, the script is not fast, and I blame this on Python and WordNet.

Below are a couple of screenshots from the application. Use and enjoy.

Synonymizer home

Synonymizer output

[0] synonymizer.py – http://dh.crc.nd.edu/sandbox/synonymizer/synonymizer.py

[1] “Why is Multi-term synonym mapping so hard in Solr?” – http://bit.ly/2iyYZw6

† The "database" is really a simple delimited text file. No database management system required.

‡ Software is never done. If it were, then it would be called “hardware”.

Tiny road trip: An Americana travelogue

Posted on October 13, 2016 in Uncategorized

This travelogue documents my experiences and what I learned on a tiny road trip including visits to Indiana University, Purdue University, the University of Illinois / Urbana-Champaign, and Washington University In St. Louis between Monday, October 26 and Friday, October 30, 2015. In short, I learned four things: 1) of the places I visited, digital scholarship centers support a predictable set of services, 2) the University Of Notre Dame's digital scholarship center is perfectly situated in the middle of the road when it comes to the services provided, 3) the Early English Print Project is staffed by a set of enthusiastic & animated scholars, and 4) Illinois is very flat.


Four months ago I returned from a pseudo-sabbatical of two academic semesters, and exactly one year ago I was in Tuscany (Italy) painting cornfields & rolling hills. Upon my return I felt a bit out of touch with some of my colleagues in other libraries. At the same time I had been given an opportunity to participate in a grant-sponsored activity (the Early English Print Project) between Northwestern University, Washington University In St. Louis, and the University Of Notre Dame. Since I was encouraged to visit the good folks at Washington University, I decided to stretch a two-day visit into a week-long road trip taking in stops at digital scholarship centers. Consequently, I spent bits of time in Bloomington (Indiana), West Lafayette (Indiana), Urbana (Illinois), as well as St. Louis (Missouri). The whole process afforded me the opportunity to learn more and get re-acquainted.

Indiana University / Bloomington

My first stop was in Bloomington where I visited Indiana University, and the first thing that struck me was how much Bloomington exemplified the typical college town. Coffee shops. Boutique clothing stores. Ethnic restaurants. And teeming with students ranging from fraternity & sorority types, hippie wannabes, nerds, wide-eyed freshmen, young lovers, and yes, fledgling scholars. The energy was positively invigorating.

My first professional visit was with Angela Courtney (Head of Arts & Humanities, Head of Reference Services, Librarian for English & American Literature, and Director of the Scholars' Commons). Ms. Courtney gave me a tour of the library's newly renovated digital scholarship center. [1] It was about the same size as the Hesburgh Libraries' Center, and it was equipped with much of the same apparatus. There was a scanning lab, plenty of larger & smaller meeting spaces, a video wall, and lots of open seating. One major difference between Indiana and Notre Dame was the "reference desk". For all intents & purposes, the Indiana University reference desk is situated in the digital scholarship center. Ms. Courtney & I chatted for a long hour, and I learned how Indiana University & the University Of Notre Dame were similar & different. Numbers of students. Types of library collections & services. Digital initiatives. For the most part, both universities have more things in common than differences, but their digital initiatives were far more mature than the ones here at Notre Dame.

Later in the afternoon I visited with Yu (Marie) Ma who works for the HathiTrust Research Center. [2] She was relatively new to the HathiTrust, and if I understand her position correctly, she spends a lot of her time setting up technical workflows and designing the infrastructure for large-scale text analysis. The hour with Marie was informative on both of our parts. For example, I outlined some of the usability issues with the Center's interface(s), and she outlined how the "data capsules" work. More specifically, "data capsules" are virtual machines operating in two different modes. In one mode a researcher is enabled to fill up a file system with HathiTrust content. In the other mode, one is enabled to compute against the content and return results. In one or the other of the modes (I'm not sure which), Internet connectivity is turned off to prevent the leaking of HathiTrust content. In this way, a HathiTrust data capsule operates much like a traditional special collections room. A person can go into the space, see the book, take notes with a paper & pencil, and then leave sans any of the original materials. "What is old is new again." Along the way Marie showed me a website — Lapps Grid — which looks as if it functions similarly to Voyant Tools and my fledgling EEBO-TCP Workset Browser. [3, 4, 5] Amass a collection. Use the collection as input against many natural language processing tools/applications. Use the output as a means for understanding. I will take a closer look at Lapps Grid.

Purdue University

The next morning I left the rolling hills of southern Indiana for the flatlands of central Indiana and Purdue University. There I facilitated a brown-bag lunch discussion on the topic of scalable reading, but the audience seemed more interested in the concept of digital scholarship centers. I described the Center here at Notre Dame, and did my best to compare & contrast it with others as well as draw into the discussion the definition of digital humanities. Afterwards I went to lunch with Michael Witt and Amanda Visconti. Mr. Witt spends much of his time on institutional repository efforts, specifically in regards to scientific data. Ms. Visconti works in the realm of the digital humanities and has recently made available her very interesting interactive dissertation — Infinite Ulysses. [6] After lunch Mr. Witt showed me a new library space scheduled to open before the Fall Semester of 2017. The space will be library-esque during the day, and study-esque during the evening. Through the process of construction, some of their collection needed to be weeded, and I found the weeding process to be very interesting.

University of Illinois / Urbana-Champaign

Up again in the morning and a drive to Urbana-Champaign. During this jaunt I became both a ninny and a slave to my computer's (telephone's) navigation and functionality. First it directed me to my destination, but there were no parking places. After identifying a parking place on my map (computer), I was not able to get directions on how to get there. Once I finally found parking, I required my telephone to pay. Connect to a remote site while located in a concrete building. Create account. Supply credit card number. Etc. We are increasingly reliant (dependent) on these gizmos.

My first meeting was with Karen Hogenboom (Associate Professor of Library Administration, Scholarly Commons Librarian and Head, Scholarly Commons). We too discussed digital scholarship centers, and again, there were more things in common with our centers than differences. Her space was a bit smaller than Notre Dame’s, and their space was less about specific services and more about referrals to other services across the library and across the campus. For example, geographic information systems services and digitization services were offered elsewhere.

I then had a date with an old book, but first some back story. Here at Notre Dame Julia Schneider brought to my attention a work written by Erasmus and commenting on Cato, which may become a part of a project called The Digital Schoolbook. She told me how there were only three copies of this particular book, and one of them was located in Urbana. Consequently, a long month ago, I found a reference to the book in the library catalog, and I made an appointment to see it in person. The book's title is Erasmi Roterodami libellus de co[n]structio[n]e octo partiu[m]oratio[n]is ex Britannia nup[er] huc plat[us] : et ex eo pureri bonis in l[ite]ris optio, and it was written/published in 1514. [7, 8] The book represented at least a few things: 1) the continued and on-going commentary on Cato, 2) an example of early book printing, and 3) forms of scholarship. Regarding Cato, I was only able to read a single word in the entire volume — the word "Cato" — because the whole thing was written in Latin. Because it was an early printed book, I had to page through the entire volume to find the book I wanted; it was the last one. Third, the book was riddled with annotations, made by a number of hands, and with very fine-pointed pens. Again, I could not read a single word, but a number of the annotations were literally drawings of hands pointing to sections of interest. Whoever said writing in books was a bad thing? In this case, the annotations were a definite part of the scholarship.


Washington University In St. Louis

Yet again, I woke up the next morning and continued on my way. Along the road there were billboards touting "foot-high pies" and attractions like Indian burial grounds. There were corn fields being harvested, and many advertisements pointing to Abraham Lincoln stomping grounds.

Late that afternoon I was invited to participate in a discussion with Doug Knox, Steve Pentecost, Steven Miles, and Dr. Miles's graduate students. (Mr. Knox & Mr. Pentecost work in a university space called Arts & Sciences Computing.) They outlined and reported upon a digital project designed to help researchers & scholars learn about stelae found along the West River Basin in China. I listened. ("Stelae" are markers, usually made of stone, commemorating the construction or re-construction of religious temples.) To implement the project, TEI/XML files were being written and "en masse" used akin to a database application. Reports were to be written against the XML to create digital maps as well as browsable lists of names of people, names of temples, dates, etc. I got to thinking how timelines might also be apropos.

The bulk of the following day (Friday) was spent getting to know a balance of colleagues and discussing the Early English Print Project. There were many people in the room: Doug Knox & Steve Pentecost from the previous day, Joseph Loewenstein (Professor, Department of English, Director of the Humanities Digital Workshop and the Interdisciplinary Project in the Humanities), Kate Needham, Andrew Rouner (Digital Library Director), Anupam Basu (Assistant Professor, Department of English), Shannon Davis (Digital Library Services Manager), Keegan Hughes, and myself.

More specifically, we talked about how sets of EEBO/TCP [9] TEI/XML files can be: 1) corrected, enhanced, & annotated through both automation as well as crowd-sourcing, 2) supplemented & combined with newly minted & copyright-free facsimiles of the original printed documents, 3) analyzed & reported upon through text mining & general natural language processing techniques, and 4) packaged up & redistributed back to the scholarly community. While the discussion did not follow a strict logical path, it did surround a number of unspoken questions, such as but not limited to:

  • Is METS a desirable re-distribution method? [10] What about some sort of database system instead?
  • To what degree is governance necessary in order for us to make decisions?
  • To what degree is it necessary to pour the entire corpus (more than 60,000 XML files with millions of nodes) into a single application for processing, and is the selected application up to the task?
  • What form or flavor of TEI would be used as the schema for the XML file output?
  • What role will an emerging standard called IIIF play in the process of re-distribution? [11]
  • When is a corrected text “good enough” for re-distribution?

To my mind, none of these questions were answered definitively, but then again, it was an academic discussion. On the other hand, we did walk away with a tangible deliverable — a whiteboard drawing illustrating a possible workflow going something like this:

  1. cache data from University of Michigan
  2. correct/annotate the data
  3. when data is “good enough”, put the data back into the cache
  4. feed the data back to the University of Michigan
  5. when data is "good enough", text mine the data and put the result back into the cache
  6. feed the data back to the University of Michigan
  7. create new facsimiles from the printed works
  8. combine the facsimiles with the data, and put the result back into the cache
  9. feed the data back to the University of Michigan
  10. repeat


After driving through the countryside, and after two weeks of reflection, I advocate a slightly different workflow:

  1. cache TEI data from GitHub repository, which was originally derived from the University of Michigan [12]
  2. make cache accessible to the scholarly community through a simple HTTP server and sans any intermediary application
  3. correct/annotate the data
  4. as corrected data becomes available, replace files in cache with corrected versions
  5. create copyright-free facsimiles of the originals, combine them with corrected TEI in the form of METS files, and cache the result
  6. use the METS files to generate IIIF manifests, and make the facsimiles viewable via the IIIF protocol
  7. as corrected files become available, use text mining & natural language processing to do analysis, combine the results with the original TEI (and/or facsimiles) in the form of METS files, and cache the result
  8. use the TEI and METS files to create simple & rudimentary catalogs of the collection (author lists, title lists, subject/keyword lists, date lists, etc.), making it easier for scholars to find and download items of interest
  9. repeat

The primary point I'd like to make in regard to this workflow is, "The re-distribution of our efforts ought to take place over simple HTTP and in the form of standardized XML, and I do not advocate the use of any sort of middle-ware application for these purposes." Yes, of course, middle-ware will be used to correct the TEI, create "digital combos" of TEI and images, and do textual analysis, but the output of these processes ought to be files accessible via plain ol' ordinary HTTP. Applications (database systems, operating systems, content-management systems, etc.) require maintenance, and maintenance is done by a small & specialized number of people. Applications are oftentimes "black boxes" understood and operated by a minority. Such things are very fragile, especially compared to stand-alone files. Standardized (XML) files served over HTTP are easily harvestable by anybody. They are easily duplicated. They can be saved on platform-independent media such as CD's/DVD's, magnetic tape, or even (gasp) paper. Once the results of our efforts are output as files, then supplementary distribution mechanisms can be put into place, such as IIIF or middleware database applications. XML files (TEI and/or METS) served over simple HTTP ought to be the primary distribution mechanism. Such is transparent, sustainable, and system-independent.
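
To make the point concrete, harvesting an item from such a cache requires nothing more than an HTTP GET: no accounts, no API keys, no middleware. The URL in the sketch below is purely hypothetical.

  # fetch a single cached TEI file over plain HTTP; the URL is hypothetical
  import urllib.request

  url = "http://example.org/eebo-tcp/A00002.xml"
  with urllib.request.urlopen(url) as response:
      tei = response.read().decode("utf-8")
  with open("A00002.xml", "w", encoding="utf-8") as handle:
      handle.write(tei)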

Over lunch we discussed Spenser’s Faerie Queene, the Washington University’s Humanities Digital Workshop, and the salient characteristics of digital humanities work. [13] In the afternoon I visited the St. Louis Art Museum, whose collection was rich. [14] The next day, on my way home through Illinois, I stopped at the tomb of Abraham Lincoln in order to pay my respects.


In conclusion

In conclusion, I learned a lot, and I believe my Americana road trip was a success. My conception and definition of digital scholarship centers was reinforced. My professional network was strengthened. I worked collaboratively with colleagues striving towards a shared goal. And my personal self was enriched. I advocate such road trips for anybody and everybody.

Links

[1] digital scholarship at Indiana University – https://libraries.indiana.edu/services/digital-scholarship
[2] HathiTrust Research Center – https://analytics.hathitrust.org
[3] Lapps Grid – http://www.lappsgrid.org
[4] Voyant Tools – http://voyant-tools.org
[5] EEBO-TCP Workset Browser – http://blogs.nd.edu/emorgan/2015/06/eebo-browser/
[6] Infinite Ulysses – http://www.infiniteulysses.com
[7] old book from the UIUC catalog – https://vufind.carli.illinois.edu/vf-uiu/Record/uiu_5502849
[8] old book from the Universal Short Title Catalogue – http://ustc.ac.uk/index.php/record/403362
[9] EEBO/TCP – http://www.textcreationpartnership.org/tcp-eebo/
[10] METS – http://www.loc.gov/standards/mets/
[11] IIIF – http://iiif.io
[12] GitHub repository of texts – https://github.com/textcreationpartnership/Texts
[13] Humanities Digital Workshop – https://hdw.artsci.wustl.edu
[14] St. Louis Art Museum – http://www.slam.org