Project Gutenberg and the Distant Reader

Posted on November 6, 2019 in Distant Reader by Eric Lease Morgan

The venerable Project Gutenberg is perfect fodder for the Distant Reader, and this essay outlines how & why. (tl;dnr: Search my mirror of Project Gutenberg, save the result as a list of URLs, and feed them to the Distant Reader.)

Project Gutenberg

Wall Paper by Eric

A long time ago, in a galaxy far far away, there was a man named Michael Hart. Story has it he went to college at the University of Illinois, Urbana-Champaign. He was there during a summer, and the weather was seasonably warm. On the other hand, the computer lab was cool. After all, computers run hot, and air conditioning is a must. To cool off, Michael went into the computer lab to be in a cool space.† While he was there he decided to transcribe the United States Declaration of Independence, ultimately in the hopes of enabling people to use computers to “read” this and additional transcriptions. That was in 1971. One thing led to another, and Project Gutenberg was born. I learned this story while attending a presentation by the now late Mr. Hart on Saturday, February 27, 2010 in Roanoke (Indiana). As it happened, it was also Mr. Hart’s birthday. [1]

To date, Project Gutenberg is a corpus of more than 60,000 freely available transcribed ebooks. The texts are predominantly in English, but many languages are represented. Many academics look down on Project Gutenberg, probably because it is not as scholarly as they desire, or maybe because the provenance of the materials is in dispute. Despite these things, Project Gutenberg is a wonderful resource, especially for high school students, college students, or life-long learners. Moreover, its transcribed nature eliminates any problems of optical character recognition, such as one encounters with the HathiTrust. The content of Project Gutenberg is all but perfectly formatted for distant reading.

Unfortunately, the interface to Project Gutenberg is less than desirable; the index to Project Gutenberg is limited to author, title, and “category” values. The interface does not support free text searching, and there is limited support for fielded searching and Boolean logic. Similarly, the search results are neither very interactive nor faceted. Nor is there any application programming interface (API) to the index. With so much “clean” data, so much more could be implemented. In order to demonstrate the power of distant reading, I endeavored to create a mirror of Project Gutenberg while enhancing the user interface.

To create a mirror of Project Gutenberg, I first downloaded a set of RDF files describing the collection. [2] I then wrote a suite of software which parses the RDF, updates a database of desired content, loops through the database, caches the content locally, indexes it, and provides a search interface to the index. [3, 4] The resulting interface is ill-documented but 100% functional. It supports free text searching, phrase searching, fielded searching (author, title, subject, classification code, language), and Boolean logic (using AND, OR, or NOT). Search results are faceted, enabling the reader to refine their query sans a complicated query syntax. Because the cached content includes only English language materials, the index is only 33,000 items in size.
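
For the curious, the heart of such a pipeline is quite small. Below is a minimal Python sketch, not the code in the GitHub repository [3], which parses a directory of Project Gutenberg RDF/XML files with the rdflib library and records Dublin Core titles in a SQLite database; the file locations are hypothetical, and the real suite does much more (caching, indexing, language filtering):

  # sketch: parse Project Gutenberg RDF files and record titles in SQLite
  # assumes rdflib (pip install rdflib) and a directory of *.rdf files

  import glob, sqlite3
  from rdflib import Graph, Namespace

  DCTERMS = Namespace('http://purl.org/dc/terms/')

  db = sqlite3.connect('gutenberg.db')
  db.execute('CREATE TABLE IF NOT EXISTS items (id TEXT, title TEXT)')

  for rdf in glob.glob('./rdf/*.rdf'):               # hypothetical location of the RDF files
      graph = Graph()
      graph.parse(rdf, format='xml')
      for subject, _, title in graph.triples((None, DCTERMS.title, None)):
          db.execute('INSERT INTO items VALUES (?, ?)', (str(subject), str(title)))

  db.commit()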

Project Gutenberg & the Distant Reader

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process. Project Gutenberg and the Distant Reader can be used hand-in-hand.

As described in a previous posting, the Distant Reader can take five different types of input. [5] One of those inputs is a file where each line in the file is a URL. My locally implemented mirror of Project Gutenberg enables the reader to search & browse in a manner similar to the canonical version of Project Gutenberg, but with two exceptions. First & foremost, once a search has been run against my mirror, one of the resulting links is “only local URLs”. For example, below is an illustration of the query “love AND honor AND truth AND justice AND beauty”, and the “only local URLs” link is highlighted:

Search result

By selecting the “only local URLs” link, a list of… URLs is returned, like this:

URLs

This list of URLs can then be saved as a file, and any number of things can be done with it. For example, there are Google Chrome extensions for the purposes of mass downloading. The file of URLs can also be fed to command-line utilities (e.g., curl or wget) for the purposes of mass downloading. In fact, assuming the file of URLs is named love.txt, the following command will download the files in parallel and really fast:

cat love.txt | parallel wget

This same file of URLs can be used as input against the Distant Reader, and the result will be a “study carrel” where the whole corpus could be analyzed — read. For example, the Reader will extract all the nouns, verbs, and adjectives from the corpus. Thus you will be able to answer what and how questions. It will pull out named entities and enable you to answer who and where questions. The Reader will extract keywords and themes from the corpus, thus outlining the aboutness of your corpus. From the results of the Reader you will be set up for concordancing and machine learning (such as topic modeling or classification), thus enabling you to search for narrower topics or “find more like this one”. The search for love, etc. returned more than 8,000 items. Just less than 500 of them were returned in the search result, and the Reader empowers you to read all 500 of them at one go.
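
The Reader’s internals are not reproduced here, but the sort of extraction described above can be illustrated with a few lines of Python and the spaCy library; the file name is hypothetical, and this is an illustration, not the Reader’s actual code:

  # illustration: extract parts-of-speech and named entities from a plain text file
  # assumes spaCy and its small English model (python -m spacy download en_core_web_sm)

  import spacy

  nlp  = spacy.load('en_core_web_sm')
  text = open('love-0001.txt', encoding='utf-8', errors='ignore').read()   # hypothetical cached item
  doc  = nlp(text)   # very long texts may require raising nlp.max_length beforehand, or chunking

  nouns    = [token.lemma_ for token in doc if token.pos_ == 'NOUN']
  verbs    = [token.lemma_ for token in doc if token.pos_ == 'VERB']
  entities = [(ent.text, ent.label_) for ent in doc.ents]

  print(len(nouns), len(verbs), len(entities))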

Summary

Project Gutenberg is a very useful resource because the content is: 1) free, and 2) transcribed. Mirroring Project Gutenberg is not difficult, and by doing so an interface to it can be enhanced. Project Gutenberg items are perfect items for reading & analysis by the Distant Reader. Search Project Gutenberg, save the results as a file, feed the file to the Reader and… read the results at scale.

Notes and links

† All puns are intended.

[1] Michael Hart in Roanoke (Indiana) – video: https://youtu.be/eeoBbSN9Esg; blog posting: http://infomotions.com/blog/2010/03/michael-hart-in-roanoke-indiana/

[2] The various Project Gutenberg feeds, including the RDF, are located at https://www.gutenberg.org/wiki/Gutenberg:Feeds

[3] The suite of software to cache and index Project Gutenberg is available on GitHub at https://github.com/ericleasemorgan/gutenberg-index

[4] My full text index to the English language texts in Project Gutenberg is available at http://dh.crc.nd.edu/sandbox/gutenberg/cgi-bin/search.cgi

[5] The Distant Reader and its five different types of input – http://sites.nd.edu/emorgan/2019/10/dr-inputs/

OJS Toolbox

Posted on October 26, 2019 in Uncategorized by Eric Lease Morgan

Given an Open Journal Systems (OJS) root URL and an authorization token, cache all JSON files associated with the given OJS title, and optionally output rudimentary bibliographics in the form of a tab-separated value (TSV) stream. [0]

Wall Paper by Eric

OJS is a journal publishing system. [1] It supports a RESTful API allowing the developer to read from & write to the system’s underlying database. [2] This hack — the OJS Toolbox — merely caches & reads the metadata associated with the published issues of a given journal title.

The Toolbox is written in Bash. To cache the metadata, you will need two additional pieces of software installed on your computer: curl and jq. [3, 4] Curl is used to interact with the API. Jq is used to read & parse the resulting JSON streams. When & if you want to transform the cached JSON files into rudimentary bibliographics, then you will also need to install GNU Parallel, a tool which makes parallel processing trivial. [5]

Besides the software, you will need three pieces of information. The first is the root URL of the OJS system/title you wish to use. This value will probably look something like https://example.com/index.php/foo; ask the OJS systems administrator regarding the details. The second piece of information is an authorization token. If an “api secret” has been created by the local OJS systems administrator, then each person with an OJS account ought to have been granted a token. Again, ask the OJS systems administrator for details. The third piece of information is the name of a directory where your metadata will be cached. For the sake of an example, assume the necessary values are:

  1. root URL – https://example.com/index.php/foo
  2. token – xyzzy
  3. directory – bar

Once you have gotten this far, you can cache the totality of the issue metadata:

$ ./bin/harvest.sh https://example.com/index.php/foo xyzzy bar

More specifically, `harvest.sh` will create a directory called bar. It will then determine how many issues exist in the title foo, harvest sets of issue data, parse each set into individual issue files, and save the results as JSON files in the bar directory. You now have a “database” containing all the bibliographic information of a given title.
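
Under the hood, the harvesting amounts to little more than calls against the OJS REST API. The following Python sketch is an approximation, not the Toolbox itself; it assumes OJS 3.x’s /api/v1/issues endpoint, the apiToken query parameter, and the items/id field names described in the API documentation [2], all of which should be verified against your own installation:

  # sketch: request a title's issues and cache each one as its own JSON file
  # uses only the Python standard library

  import json, os, urllib.request

  ROOT  = 'https://example.com/index.php/foo'   # root URL, as above
  TOKEN = 'xyzzy'                               # authorization token, as above
  DIR   = 'bar'                                 # cache directory, as above

  os.makedirs(DIR, exist_ok=True)

  url = ROOT + '/api/v1/issues?apiToken=' + TOKEN + '&count=100'
  with urllib.request.urlopen(url) as response:
      issues = json.load(response)

  for issue in issues.get('items', []):
      with open(os.path.join(DIR, str(issue['id']) + '.json'), 'w') as handle:
          json.dump(issue, handle, indent=2)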

For my purposes, I need a TSV file with four columns: 1) author, 2) title, 3) date, and 4) url. Such is the purpose of `issues2tsv.sh` and `issue2tsv.sh`. The first script, `issues2tsv.sh`, takes a directory as input. It then outputs a simple header, finds all the JSON files in the given directory, and passes them along (in parallel) to `issue2tsv.sh` which does the actual work. Thus, to create my TSV file, I submit a command like this:

$ ./bin/issues2tsv.sh bar > ./bar.tsv

The resulting file (bar.tsv) looks something like this:

author     title         date        url
Kilgour    The Catalog   1972-09-01  https://example.com/index.php/foo/article/download/5738/5119
McGee      Two Designs   1972-09-01  https://example.com/index.php/foo/article/download/5739/5120
Saracevic  Book Reviews  1972-09-01  https://example.com/index.php/foo/article/download/5740/5121

Given such a file, I can easily download the content of a given article, extract any of its plain text, perform various natural language processing tasks against it, text mine the whole, full text index the whole, apply various bits of machine learning against the whole, and in general, “read” the totality of the journal. See The Distant Reader for details. [6]

Links

[0] OJS Toolbox – https://github.com/ericleasemorgan/ojs-toolbox
[1] OJS – https://pkp.sfu.ca/ojs/
[2] OJS API – https://docs.pkp.sfu.ca/dev/api/ojs/3.1
[3] curl – https://curl.haxx.se
[4] jq – https://stedolan.github.io/jq/
[5] GNU Parallel – https://www.gnu.org/software/parallel/
[6] Distant Reader – https://distantreader.org

The Distant Reader and its five different types of input

Posted on October 19, 2019 in Distant Reader by Eric Lease Morgan

The Distant Reader can take five different types of input, and this blog posting describes what they are.

Wall Paper by Eric

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process.

At the present time, the Reader can accept five different types of input, and they include:

  1. a file
  2. a URL
  3. a list of URLs
  4. a zip file
  5. a zip file with a companion CSV file

Each of these different types of input is elaborated upon below.

A file

The simplest form of input is a single file from your computer. This can be just about any file available to you, but to make sense, the file needs to contain textual data. Thus, the file can be a Word document, a PDF file, an Excel spreadsheet, an HTML file, a plain text file, etc. A file in the form of an image will not work because it contains zero text. Also, not all PDF files are created equal. Some PDF files are only facsimiles of their originals. Such PDF files are merely sets of images concatenated together. In order for PDF files to be used as input, the PDF files need to have been “born digitally” or they need to have had optical character recognition previously applied against them. Most PDF files are born digitally and do not suffer from being facsimiles.

A good set of use-cases for single file input is the whole of a book, a long report, or maybe a journal article. Submitting a single file to the Distant Reader is quick & easy, but the Reader is designed for analyzing larger rather than smaller corpora. Thus, supplying a single journal article to the Reader doesn’t make much sense; the use of the traditional reading process probably makes more sense for a single journal article.

A URL

The Distant Reader can take a single URL as input. Given a URL, the Reader will turn into a rudimentary Internet spider and build a corpus. More specifically, given a URL, the Reader will:

  • retrieve & cache the content found at the other end of the URL
  • extract any URLs it finds in the content
  • retrieve & cache the content from these additional URLs
  • stop building the corpus but continue with its analysis

In short, given a URL, the Reader will cache the URL’s content, crawl the URL one level deep, cache the result, and stop caching.
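
A minimal sketch of that one-level-deep harvest, written with nothing but the Python standard library, might look like the following; it is an illustration of the process, not the Reader’s actual spider:

  # sketch: cache a page plus the pages it links to, one level deep, then stop

  import os, re, urllib.request

  START = 'https://en.wikipedia.org/wiki/Distant_reading'   # any URL of interest
  CACHE = './cache'

  os.makedirs(CACHE, exist_ok=True)

  def fetch(url, name):
      try:
          with urllib.request.urlopen(url, timeout=30) as response:
              content = response.read()
          with open(os.path.join(CACHE, name), 'wb') as handle:
              handle.write(content)
          return content.decode('utf-8', errors='ignore')
      except Exception:
          return ''

  # cache the starting page, naively extract its links, cache those, and stop
  html  = fetch(START, 'start.html')
  links = set(re.findall(r'href="(https?://[^"]+)"', html))
  for index, link in enumerate(links):
      fetch(link, '%04d.html' % index)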

Like the single file approach, submitting a URL to the Distant Reader is quick & easy, but there are a number of caveats. First of all, the Reader does not come with very many permissions, and just because you are authorized to read the content at the other end of a URL does not mean the Reader has the same authorization. A lot of content on the Web resides behind paywalls and firewalls. The Reader can only cache 100% freely accessible content.

“Landing pages” and “splash pages” represent additional caveats. Many of the URLs passed around the ‘Net do not point to the content itself, but instead they point to ill-structured pages describing the content — metadata pages. Such pages may include things like authors, titles, and dates, but these things are not presented in a consistent or computer-readable fashion; they are laid out with aesthetics or graphic design in mind. These pages do contain pointers to the content you want to read, but the content may be two or three more clicks away. Be wary of URLs pointing to landing pages or splash pages.

Another caveat to this approach is the existence of extraneous input due to navigation. Many Web pages include links for navigating around the site. They also include links to things like “contact us” and “about this site”. Again, the Reader is sort of stupid. If found, the Reader will crawl such links and include their content in the resulting corpus.

Despite these drawbacks there are a number of excellent use-cases for single URL input. One of the best is Wikipedia articles. Feed the Reader a URL pointing to a Wikipedia article. The Reader will cache the article itself, and then extract all the URLs the article uses as citations. The Reader will then cache the content of the citations, and then stop caching.

Similarly, a URL pointing to an open access journal article will function just like the Wikipedia article, and this will be even more fruitful if the citations are in the form of freely accessible URLs. Better yet, consider pointing the Reader to the root of an open access journal issue. If the site is not overly full of navigation links, and if the URLs to the content itself are not buried, then the whole of the issue will be harvested and analyzed.

Another good use-case is the home page of some sort of institution or organization. Want to know about Apple Computer, the White House, a conference, or a particular department of a university? Feed the root URL of any of these things to the Reader, and you will learn something. At the very least, you will learn how the organization prioritizes its public face. If things are more transparent than not, then you might be able to glean the names and addresses of the people in the organization, the public policies of the organization, or the breadth & depth of the organization.

Yet another excellent use-case includes blogs. Blogs often contain content at their root. Navigation links abound, but more often than not the navigation links point to more content. If the blog is well-designed, then the Reader may be able to create a corpus from the whole thing, and you can “read” it in one go.

A list of URLs

The third type of input is a list of URLs. The list is expected to be manifested as a plain text file, and each line in the file is a URL. Use whatever application you desire to build the list, but save the result as a .txt file, and you will probably have a plain text file.‡

Caveats? Like the single URL approach, the list of URLs must point to freely available content, and pointing to landing pages or splash pages is probably to be avoided. Unlike the single URL approach, the URLs in the list will not be used as starting points for Web crawling. Thus, if the list contains ten items, then ten items will be cached for analysis.

Another caveat is the actual process of creating the list; I have learned that it is actually quite difficult to create lists of URLs. Copying & pasting gets old quickly. Navigating a site and right-clicking on URLs is tedious. While search engines & indexes often provide some sort of output in list format, the lists are poorly structured and not readily amenable to URL extraction. On the other hand, there are more than a few URL extraction tools. I use a Google Chrome extension called Link Grabber. [1] Install Link Grabber. Use Chrome to visit a site. Click the Link Grabber button, and all the links in the document will be revealed. Copy the links and paste them into a document. Repeat until you get tired. Sort and peruse the list of links. Remove the ones you don’t want. Save the result as a plain text file.‡ Feed the result to the Reader.

Despite these caveats, the list of URLs approach is enormously scalable; the list of URLs approach is the most scalable input option. Given a list of five or six items, the Reader will do quite well, but the Reader will operate just as well if the list contains dozens, hundreds, or even thousands of URLs. Imagine reading the complete works of your favorite author or the complete run of an electronic journal. Such is more than possible with the Distant Reader.‡

A zip file

The Distant Reader can take a zip file as input. Create a folder/directory on your computer. Copy just about any files into the folder/directory. Compress the folder/directory into a .zip file. Submit the result to the Reader.

Like the other approaches, there are a few caveats. First of all, the Reader is not able to accept .zip files whose size is greater than 64 megabytes. While we do it all the time, the World Wide Web was not really designed to push around files of any great size, and 64 megabytes is/was considered plenty. Besides, you will be surprised how many files can fit in a 64 megabyte file.

Second, the computer gods never intended file names to contain things other than simple Romanesque letters and a few rudimentary characters. Nowadays our file names contain spaces, quote marks, apostrophes, question marks, back slashes, forward slashes, colons, commas, etc. Moreover, file names might be 64 characters long or longer! While every effort has been made to accommodate file names with such characters, your mileage may vary. Instead, consider using file names which are shorter, simpler, and have some sort of structure. An example might be the first word of the author’s last name, the first meaningful word of the title, the year (optional), and an extension. Herman Melville’s Moby Dick might thus be named melville-moby.txt. In the end the Reader will be less confused, and you will be more able to find things on your computer.

There are a few advantages to the zip file approach. First, you can circumvent authorization restrictions; you can put licensed content into your zip files and it will be analyzed just like any other content. Second, the zip file approach affords you the opportunity to pre-process your data. For example, suppose you have downloaded a set of PDF files, and each page includes some sort of header or footer. You could transform each of these PDF files into plain text, use some sort of find/replace function to remove the headers & footers, save the result, zip it up, and submit it to the Reader. The resulting analysis will be more accurate.
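
As a sketch of that sort of pre-processing, the few lines of Python below strip a repeated header line from a folder of plain text files. The folder name and the header pattern are entirely hypothetical; substitute whatever actually appears in your documents:

  # sketch: remove a repeated header/footer line from a folder of plain text files

  import glob, re

  HEADER = re.compile(r'^Journal of Hypothetical Studies.*\n?', re.MULTILINE)   # hypothetical pattern

  for name in glob.glob('./texts/*.txt'):
      with open(name, encoding='utf-8', errors='ignore') as handle:
          text = handle.read()
      with open(name, 'w', encoding='utf-8') as handle:
          handle.write(HEADER.sub('', text))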

There are many use-cases for the zip file approach. Master’s and Ph.D. students are expected to read large amounts of material. Save all those things into a folder, zip them up, and feed them to the Reader. You have been given a set of slide decks from a conference. Zip them up and feed them to the Reader. A student is expected to read many different things for History 101. Download them all, put them in a folder, zip them up, and submit them to the Distant Reader. You have written many things but they are not on the Web. Copy them to a folder, zip them up, and “read” them with the… Reader.

A zip file with a companion CSV file

The final form of input is a zip file with a companion comma-separated value (CSV) file — a metadata file.

As the size of your corpus increases, so does the need for context. This context can often be manifested as metadata (authors, titles, dates, subject, genre, formats, etc.). For example, you might want to compare & contrast who wrote what. You will probably want to observe themes over space & time. You might want to see how things differ between different types of documents. To do this sort of analysis you will need to know metadata regarding your corpus.

As outlined above, the Distant Reader first creates a cache of content — a corpus. This is the raw data. In order to do any analysis against the corpus, the corpus must be transformed into plain text. A program called Tika is used to do this work. [2] Not only does Tika transform just about any file into plain text, but it also does its best to extract metadata. Depending on many factors, this metadata may include names of authors, titles of documents, dates of creation, number of pages, MIME-type, language, etc. Unfortunately, more often than not, this metadata extraction process fails and the metadata is inaccurate, incomplete, or simply non-existent.
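
To get a feel for what Tika can and cannot extract from your own files, the tika Python bindings make a handy test bed; they require a Java runtime, and the file name below is hypothetical:

  # sketch: transform a file into plain text and peek at the extracted metadata
  # assumes the tika package (pip install tika) plus Java

  from tika import parser

  parsed = parser.from_file('melville-moby.pdf')      # hypothetical file
  print(parsed['metadata'])                           # author, dates, MIME-type, etc., when found
  print((parsed['content'] or '')[:500])              # the first bit of the plain text, if any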

This is where the CSV file comes in; by including a CSV file named “metadata.csv” in the .zip file, the Distant Reader will be able to provide meaningful context. In turn, you will be able to make more informed observations, and thus your analysis will be more thorough. Here’s how:

  • assemble a set of files for analysis
  • use your favorite spreadsheet or database application to create a list of the file names
  • assign a header to the list (column) and call it “file”
  • create one or more columns whose headers are “author” and/or “title” and/or “date”
  • to the best of your ability, update the list with author, title, or date values for each file
  • save the result as a CSV file named “metadata.csv” and put it in the folder/directory to be zipped
  • compress the folder/directory to create the zip file
  • submit the result to the Distant Reader for analysis
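
The steps above lend themselves to scripting. The following Python sketch writes a skeleton metadata.csv for a folder of files, leaving the author, title, and date columns blank for you to fill in; the folder name is hypothetical:

  # sketch: create a skeleton metadata.csv for the files in a folder to be zipped

  import csv, os

  FOLDER = './corpus'   # hypothetical folder destined to become the zip file

  with open(os.path.join(FOLDER, 'metadata.csv'), 'w', newline='') as handle:
      writer = csv.writer(handle)
      writer.writerow(['file', 'author', 'title', 'date'])
      for name in sorted(os.listdir(FOLDER)):
          if name != 'metadata.csv':
              writer.writerow([name, '', '', ''])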

The zip file with a companion CSV file has all the strengths & weaknesses of the plain ol’ zip file, but it adds some more. On the weakness side, creating a CSV file can be both tedious and daunting. On the other hand, many search engines & indexes export lists with author, title, and date metadata. One can use these lists as the starting point for the CSV file.♱ On the strength side, the addition of the CSV metadata file makes the Distant Reader’s output immeasurably more useful, and it leads the way to additional compare & contrast opportunities.

Summary

To date, the Distant Reader takes five different types of input. Each type has its own set of strengths & weaknesses:

  • a file – good for a single large file; quick & easy; not scalable
  • a URL – good for getting an overview of a single Web page and its immediate children; can include a lot of noise; has authorization limitations
  • a list of URLs – can accommodate thousands of items; has authorization limitations; somewhat difficult to create the list
  • a zip file – easy to create; file names may get in the way; no authorization necessary; limited to 64 megabytes in size
  • a zip file with CSV file – same as above; difficult to create metadata; results in much more meaningful reports & opportunities

Happy reading!

Notes & links

‡ Distant Reader Bounty #1: To date, I have only tested plain text files using line-feed characters as delimiters, which is the format of plain text files in the Linux and Macintosh worlds. I will pay $10 to the first person who creates a plain text file of URLs delimited by carriage-return/line-feed characters (the format of Windows-based text files) and who demonstrates that such files break the Reader. “On your mark. Get set. Go!”

‡ Distant Reader Bounty #2: I will pay $20 to the first person who creates a list of 2,000 URLs and feeds it to the Reader.

♱ Distant Reader Bounty #3: I will pay $30 to the first person who writes a cross-platform application/script which successfully transforms a Zotero bibliography into a Distant Reader CSV metadata file.

[1] Link Grabber – http://bit.ly/2mgTKsp

[2] Tika – http://tika.apache.org

Invitation to hack the Distant Reader

Posted on June 13, 2019 in Distant Reader by Eric Lease Morgan

We invite you to write a cool hack enabling students & scholars to “read” an arbitrarily large corpus of textual materials.

Introduction

A website called The Distant Reader takes an arbitrary number of files or links to files as input. [1] The Reader then amasses the files locally, transforms them into plain text files, and performs quite a bit of natural language processing against them. [2] The result — in the form of a file system — is a set of operating system independent indexes which point to individual files from the input. [3] Put another way, each input file is indexed in a number of ways, and therefore accessible by any one or combination of the following attributes:

  • any named entity (name of person, place, date, time, money amount, etc)
  • any part of speech (noun, verb, adjective, etc.)
  • email address
  • free text word
  • readability score
  • size of file
  • statistically significant keyword
  • textual summary
  • URL

All of the things listed above are saved as plain text files, but they have also been reduced to an SQLite database (./etc/reader.db), which is also distributed with the file system.
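
A quick way to get oriented is simply to open the database and look around. The Python sketch below lists the tables and peeks at the named entities (ent) table; any column names beyond those mentioned in this posting are yours to discover:

  # sketch: list the tables in a study carrel's database and peek at one of them
  # uses Python's built-in sqlite3 module

  import sqlite3

  db = sqlite3.connect('./etc/reader.db')

  tables = [row[0] for row in db.execute("SELECT name FROM sqlite_master WHERE type='table'")]
  print(tables)

  for row in db.execute('SELECT * FROM ent LIMIT 10'):
      print(row)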

The Challenge

Your mission, if you choose to accept it, is to write a cool hack against the Distant Reader’s output. By doing so, you will be enabling people to increase their comprehension of the given files. Here is a list of possible hacks:

  • create a timeline – The database includes a named entities table (ent). Each entity is denoted by a type, and one of those types is “PERSON”. Find all named entities of type PERSON, programmatically look them up in Wikidata, extract the entity’s birth & death dates, and plot the result on a timeline. As an added bonus, update the database with the dates. Alternatively, and possibly more simply, find all entities of type DATE (or TIME), and plot those values on a timeline.
  • create a map – Like the timeline hack, find all entities denoting places (GPE or LOC), look up their geographic coordinates in Wikidata, and plot them on a map. As an added bonus, update the database with the coordinates.
  • order documents based on similarity – “Find more like this one” is an age-old information retrieval use case. Given a reference document — a document denoted as particularly relevant — create a list of documents from the input which are similar to the reference document. For example, create a vector denoting the characteristics of the reference document. [4] Then create vectors for each document in the collection. Finally, use something like the cosine similarity algorithm to determine which documents are most similar (or different). [5] The reference document may be from either inside or outside the Reader’s file system, for example, the Bible or Shakespeare’s Hamlet. (A minimal sketch appears after this list.)
  • write a Javascript interface to the database – The Distant Reader’s database (./etc/reader.db) is manifested as a single SQLite file. There exists a Javascript library enabling one to read & write to SQLite databases. [6] Sans a Web server, write sets of HTML pages enabling a person to query the database. Example queries might include: find all documents where Plato is a keyword, find all sentences where Plato is a named entity, find all questions, etc. The output of such queries can be HTML pages, but almost as importantly, they can be CSV files so people can do further analysis. As an added bonus, enable a person to update the database so things like authors, titles, dates, genres, or notes can be assigned to items in the bib table.
  • list what is being bought or sold – Use the entities table (ent) to identify all the money amounts (type equals “MONEY”) and the sentences in which they appear. Extract all of those sentences, analyze the sentences, and output the things being tendered. You will probably have to join the id and sentence id in the ent table with the id and sentence id in the pos table to implement this hack. As an added bonus, calculate how much things would cost in today’s dollars or any other currency.
  • normalize metadata – The values in the named entities table (ent) are often repeated in various forms. For example, a value may be Plato, plato, or PLATO. Use something like the Levenshtein distance algorithm to normalize each value into something more consistent. [7]
  • prioritize metadata – Just because a word is frequent does not mean it is significant. A given document may mention Plato many times, but if Plato is mentioned in each and every document, then the word is akin to noise. Prioritize given named entities, specifically names, through the use of something like TF-IDF. Calculate a TF-IDF score for a given word, and if the word is above a given threshold, then update the database accordingly. [8]
  • extract sentences matching a given grammar – Each & every word, punctuation mark, and part of speech of each & every document is enumerated and stored in the pos table of the database. Consequently it is rather easy to find all questions in the database and extract them. (Find all sentence ids where punctuation equals “?”. Find all words (tokens) with the same id and sentence id. Output all tokens sorted by token id.) Similarly, it is possible to find all sentences where a noun precedes a verb which precedes another noun. Or, find all sentences where a noun precedes a verb which is followed by the word “no” or “not” which precedes another noun. Such queries find sentences in the form of “cat goes home” or “dog is not cat”. Such are assertive sentences. A cool hack would be to identify sentences of any given grammar such as adjective-noun or entity-verb where the verb is some form of the lemma to be (is, was, are, were, etc.), as in “Plato is” or “Plato was”. The adjective-noun pattern is of particular interest, especially given a particular noun. Find all sentences matching the pattern adjective-king to learn how the king was described.
  • create a Mad Lib – This one is off the wall. Identify (random) items of interest from the database. Write a template in the form of a story. Fill in the template with the items of interest. Done. The “better” story would be one that is complete with “significant” words from the database; the better story would be one that relates truths from the underlying content. For example, identify the two most significant nouns. Identify a small handful of the most significant verbs. Output simple sentences in the form of noun-verb-noun.
  • implement one of two search engines – The Distant Reader’s output includes a schema file (./etc/schema.xml) defining the structure of a possible Solr index. The output also includes an indexer (./bin/db2solr.pl) as well as a command-line interface (./bin/search-solr.pl) to search the index. Install Solr. Create an index with the given schema. Index the content. Write a graphical front-end to the index complete with faceted search functionality. Allow search results to be displayed and saved in a tabular form for further analysis. The Reader’s output also includes a semantic index (./etc/reader.vec) a la word2vec, as well as a command-line interface (./bin/search-vec.py) for querying the semantic index. Write a graphical interface for querying the semantic index.
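
As promised in the similarity hack above, here is a minimal sketch using Scikit Learn’s TF-IDF vectorizer and cosine similarity function. [4, 5] The location of the plain text files is hypothetical, and the reference document is simply the first file found:

  # sketch: rank a carrel's plain text files by their similarity to a reference document
  # assumes scikit-learn (pip install scikit-learn)

  import glob
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  files     = sorted(glob.glob('./txt/*.txt'))   # hypothetical location of the plain text
  documents = [open(f, encoding='utf-8', errors='ignore').read() for f in files]

  vectors = TfidfVectorizer(stop_words='english').fit_transform(documents)
  scores  = cosine_similarity(vectors[0], vectors).flatten()   # row 0 is the reference document

  for score, name in sorted(zip(scores, files), reverse=True):
      print(round(float(score), 3), name)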

Sample data

In order for you to do your good work, you will need some Distant Reader output. Here are pointers to some such stuff:

Helpful hint

With the exception of only a few files (./etc/reader.db, ./etc/reader.vec, and ./cache/*), all of the files in the Distant Reader’s output are plain text files. More specifically, they are either unstructured data files or delimited files. Despite any file’s extension, the vast majority of the files can be read with your favorite text editor, spreadsheet, or database application. To read the database file (./etc/reader.db), you will need an SQLite application. The files in the adr, bib, ent, pos, urls, or wrd directories are all tab delimited files. A program called OpenRefine is a WONDERFUL tool for reading and analyzing tab delimited files. [9] In fact, a whole lot can be learned through the skillful use of OpenRefine against the tab delimited files.
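
If you prefer a programming language over a spreadsheet, the same tab-delimited files open just as readily with Python’s pandas library; the file name below is hypothetical:

  # sketch: read one of the carrel's tab-delimited files into a data frame
  # assumes pandas (pip install pandas)

  import pandas as pd

  frame = pd.read_csv('./ent/homer-iliad.ent', sep='\t')   # hypothetical file name
  print(frame.columns.tolist())
  print(frame.head())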

Notes

[1] The home page of the Distant Reader is https://distantreader.org

[2] All of the code doing this processing is available on GitHub. See https://github.com/ericleasemorgan/reader

[3] This file system is affectionately known as a “study carrel”.

[4] An easy-to-use library for creating such vectors is a part of the Scikit Learn suite of software. See http://bit.ly/2F5EoxA

[5] The algorithm is described at https://en.wikipedia.org/wiki/Cosine_similarity, and a SciKit Learn module is available at http://bit.ly/2IaYcS3

[6] The library is called sql.js, and it is available at https://github.com/kripken/sql.js/

[7] The Levenshtein distance is described here — https://en.wikipedia.org/wiki/Levenshtein_distance, and various libraries doing the good work are outlined at http://bit.ly/2F30roM

[8] Yet another SciKit Learn module may be of use here — http://bit.ly/2F5o2oS

[9] OpenRefine eats delimited files for lunch. See http://openrefine.org

Fantastic Futures: My take-aways

Posted on December 11, 2018 in Uncategorized by Eric Lease Morgan

This is the briefest of take-aways from my attendance at Fantastic Futures, a conference on artificial intelligence (AI) in libraries. [1] From the introduction to the conference announcement:

The Fantastic futures-conferences, which takes place in Oslo december 5th 2018, is a collaboration between the National Library of Norway and Stanford University Libraries, and was initiated by the National Librarian at the National Library of Norway, Aslak Sira Myhre and University Librarian at Stanford University Libraries, Michael Keller.

First of all, I had the opportunity to attend and participate in a pre-conference workshop. The workshop, facilitated by Nicole Coleman (Stanford University) and Svein Brygfjeld (National Library of Norway), had two primary purposes: to ask questions about AI in libraries, and to build community. To those ends the two dozen or so of us were divided into groups where we discussed what a few AI systems might look like. I was in a group discussing the possibilities of reading massive amounts of text and/or refining information retrieval based on reader profiles. In the end our group thought such things were feasible, and we outlined how they might be accomplished. Other groups discussed things such as metadata creation and collection development. Towards the end of the day we brainstormed next steps, and decided, at the very least, to try to use the ai4lib mailing list to a greater degree. [2]

The next day, the first real day of the conference, was attended by more than a couple hundred people. Most were from Europe, obviously, but from my perspective about as many were librarians as non-librarians. There was an appearance by Nancy Pearl, who, as you may or may not know, is a Seattle Public Library librarian embodied as an action figure. [3] She was brought to the conference because the National Library of Norway’s AI system is named Nancy. A few notable quotes from some of the speakers, at least from my perspective, included:

  • George Zarkadakis – “Robots ought not to pretend to not be robots.”
  • Meredith Broussard – “AI uses quantitative data but qualitative data is necessary also.”
  • Barbara McGillivray – “Practice the typical research process but annotate it with modeling; humanize the algorithms.”
  • Nicole Coleman – “Put the human in the loop … The way we model data influences the way we make interpretations.”

The presenters generated lively discussion, and I believe the conference was deemed a success by the vast majority of attendees. It is quite likely the conference will be repeated next year and be held at Stanford.

What are some of my take-aways? Hmmm:

  1. Machine learning is simply the latest incarnation of AI, and machine learning algorithms are only as unbiased as the data used to create them. Be forewarned.
  2. We can do this. We have the technology.
  3. There is too much content to process, and AI in libraries can be used to do some of the more mechanical tasks. The creation and maintenance of metadata is a good example. But again, be forewarned. We were told this same thing with the advent of word processors, and in the end, we didn’t go home early because we got our work done. Instead we output more letters.
  4. Metadata is not necessary. Well, that was sort of a debate, and (more or less) deemed untrue.

It was an honor and a privilege to attend the pre-conference workshop and conference. I sincerely believe AI can be used in libraries, and the use can be effective. Putting AI into practice will take time, energy, & prioritization. How to do this and simultaneously “keep the trains running” will be a challenge. On the other hand, AI in libraries can be seen as an opportunity to demonstrate the inherent worth of cultural heritage institutions. ai4lib++

P.S. Along the way I got to see some pretty cool stuff: Viking ships, a fort, “The Scream”, and a “winterfest”. I also got to experience sunset at 3:30 in the afternoon.

Links

[1] Fantastic Futures – https://www.nb.no/artikler/fantastic-futures/

[2] ai4lib – https://groups.google.com/forum/#!forum/ai4lib

[3] action figure – https://www.amazon.com/Nancy-Pearl-Librarian-Action-Figure/dp/B0006FU9EG

marc2catalog

Posted on July 2, 2018 in Uncategorized by Eric Lease Morgan

Given a set of MARC records, output a set of library catalogs

This set of scripts will take a set of MARC data, parse it into a simple, rudimentary SQLite database, and then generate a report against the database in the form of plain text files — a set of “library catalogs & indexes”. These catalogs & indexes are intended to be printed, but they can also be used to support rudimentary search via one’s text editor. For extra credit, the programmer could read the underlying database, feed the result to an indexer, and create an OPAC (online public access catalog).

The system requires a bit of infrastructure: 1) Bash, 2) Perl, 3) a Perl module named MARC::Batch, 4) the DBI driver for SQLite.

The whole MARC-to-catalog process can be run with a single command:

./bin/make-all.sh <marc> <name>

Where <marc> is the name of the MARC file, and <name> is a one-word moniker for the collection. The distribution comes with sample data, and therefore an example execution includes:

./bin/make-all.sh ./etc/morse.mrc morse

The result ought to be the creation of a .db file in the ./etc directory, a collections directory with a sub-directory for the given collection, and a set of plain text files in the latter. The plain text files are intended to be printed or given away like candy to interested learners or scholars.

The code for marc2catalog ought to be available on GitHub.
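
The distribution itself is built from Bash and Perl (MARC::Batch plus DBI), but the core MARC-to-database step can be approximated in Python with the pymarc library. The sketch below is an illustration of the idea, not the distribution’s code; the database name is hypothetical:

  # sketch: parse a file of MARC records into a rudimentary SQLite database
  # assumes pymarc (pip install pymarc)

  import sqlite3
  from pymarc import MARCReader

  db = sqlite3.connect('./etc/morse.db')   # hypothetical database name
  db.execute('CREATE TABLE IF NOT EXISTS catalog (author TEXT, title TEXT)')

  with open('./etc/morse.mrc', 'rb') as handle:
      for record in MARCReader(handle):
          author = record['100']['a'] if record['100'] else ''
          title  = record['245']['a'] if record['245'] else ''
          db.execute('INSERT INTO catalog VALUES (?, ?)', (author, title))

  db.commit()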

Project English: An Index to English/American literature spanning six centuries

Posted on April 24, 2018 in Uncategorized by Eric Lease Morgan

I have commenced upon a project to build an index and set of accompanying services rooted in English/American literature spanning the 15th to 20th centuries. For lack of something better, I call it Project English. This blog posting describes Project English in greater detail.

Goals & scope

The goals of the Project include but are not limited to:

  • provide enhanced collections & services to the University of Notre Dame community
  • push the boundaries of librarianship

To accomplish these goals I have acquired a subset of three distinct and authoritative collections of English/American literature:

  1. EEBO – Early English Books Online, which has its roots in the venerable Short-Title Catalogue of English Books
  2. ECCO – Eighteenth Century Collection Online, which is an extension of the Catalogue
  3. Sabin – Bibliotheca Americana: A Dictionary of Books Relating to America from Its Discovery to the Present Time originated by Joseph Sabin

More specifically, the retired and emeritus English Studies Librarian, Laura Fuderer, purchased hard drives containing the full text of the aforementioned collections. Each item in the collection is manifested as an XML file and a set of JPEG images (digital scans of the original materials). The author identified the hard drives, copied some of the XML files, and began the Project. To date, the collection includes:

  • 56 thousand titles
  • 7.6 million pages
  • 2.3 billion words

At the present time, the whole thing consumes 184 GB of disk space where approximately 1/3 of it is XML files, 1/3 of it is HTML files transformed from the XML, and 1/3 is plain text files transformed from the XML. At the present time, there are no image nor PDF files in the collection.

On average, each item in the collection is approximately 135 pages (or 46,000 words) long. As of right now, each sub-collection is equally represented. The vast majority of the collection is in English, but other languages are included. Most of the content was published in London. The distribution of centuries is beginning to appear balanced, but determining the century of publication is complicated by the fact the metadata’s date values are not expressed as integers. The following charts & graphs illustrate all of these facts.

Charts & graphs: publication years, sub-collections, languages, and cities of publication

Access & services

By default, the collection is accessible via freetext/fielded/faceted searching. Given an EEBO, ECCO, or Sabin identifier, the collection is also accessible via known-item browse. (Think “call number”.) Search results can optionally be viewed and sorted using a tabled interface. (Think “spreadsheet”.) The reader has full access to:

  1. the original XML data – hard to read but rich in metadata
  2. rudimentary HTML – transformed from the original XML and a bit easier to read
  3. plain text – intended for computational analysis

Search results and their associated metadata can also be downloaded en masse. This enables the reader to do offline analysis such as text mining, concordancing, parts-of-speech extraction, or topic modeling. Some of these things are currently implemented inline, including:

  • listing the frequency of unigrams, bigrams, and trigrams (a small sketch follows this list)
  • listing the frequency of noun phrases, the subjects & objects of sentences
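
As a small illustration of the first of these services, counting unigrams & bigrams requires only a few lines of Python; the file name is hypothetical:

  # sketch: list the most frequent unigrams and bigrams in a plain text file

  import re
  from collections import Counter

  text    = open('paradise-lost.txt', encoding='utf-8', errors='ignore').read().lower()
  words   = re.findall(r'[a-z]+', text)
  bigrams = zip(words, words[1:])

  print(Counter(words).most_common(10))
  print(Counter(bigrams).most_common(10))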

For example, the reader can first create a set of one or more items of interest. They can then do some “distant” or “scalable” reading against the result. In its current state, Project English enables the reader to answer questions like:

  • What is this particular item about?
  • To what degree does this item mention the words God, man, truth, or beauty?

As Project English matures, it will enable the reader to answer additional questions, such as:

  • What actions take place in a given corpus?
  • How are things described?
  • If one were to divide the collection into T themes, then what might those themes be?
  • How has a theme changed over time?
  • Who, what places, and what organizations appear in the corpus?
  • What ideas appear concurrently in a corpus?

Librarianship

Remember, one of the goals of the Project is to push the boundaries of librarianship. With the advent of ubiquitous networked computers, the traditional roles of librarianship are not as important as they previously were. (I did not say the roles were unimportant, just not as important as they used to be.) Put another way, there is less of a need for the collection, organization, preservation, and dissemination of data, information, and knowledge. Much of this work is being facilitated through the Internet. This then begs the question, “Given the current environment, what are or can be the roles of (academic) libraries?” In the author’s opinion, the roles are rooted in two activities:

  1. the curation of rare & infrequently held materials
  2. the provision of value-added services against those materials

In the case of Project English, the rare & infrequently held materials are full text items dating from the 15th to 20th centuries. When it is all said & done, the collection may come close to 2.5 million titles in size, a modest library by most people’s standards. These collections are being curated with scope, with metadata, with preservation, and with quick & easy access. The value-added services are fledgling, but they will include a set of text mining & natural language processing interfaces enabling the learner, teacher, and scholar to do “distant” and “scalable” reading. In other words, instead of providing access to materials and calling the work of librarianship done, Project English will enable & empower the reader to use & understand the materials they have acquired.

Librarianship needs to go beyond the automation of traditional tasks; it behooves librarianship to exploit computers to a greater degree and use them to augment & supplement the profession’s reason and experience. Project English is one librarian’s attempt to manifest this idea into a reality.

LexisNexis hacks

Posted on December 18, 2017 in Uncategorized by Eric Lease Morgan

This blog posting briefly describes and makes available two Python scripts I call my LexisNexis hacks.

The purpose of the scripts is to enable the researcher to reformulate LexisNexis full text downloads into tabular form. To accomplish this goal, the researcher is expected to first search LexisNexis for items of interest. They are then expected to do a full text download of the results as a plain text file. Attached ought to be an example that includes about five records. The first of my scripts — results2files.py — parses the search results into individual records. The second script — files2table.py — reads the output of the first script and parses each file into individual but selected fields. The output of the second script is a tab-delimited file suitable for further analysis in any number of applications.

These two scripts can work for a number of people, but there are a few caveats. First, results2files.py saves its results as a set of files with randomly generated file names. It is possible, albeit unlikely, that files will get overwritten because the same randomly generated file name was… generated. Second, the output of files2table.py only includes fields required for a specific research question. It is left up to the reader to edit files2table.py for additional fields.
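
The heart of the first script is nothing more than splitting the download on a record delimiter and writing each record to its own randomly named file. A sketch follows; the delimiter pattern is entirely hypothetical and must be adjusted to whatever your LexisNexis download actually contains:

  # sketch: split a full text download into individual, randomly named record files
  # the delimiter below is a placeholder; inspect your download and adjust accordingly

  import re, uuid

  with open('results.txt', encoding='utf-8', errors='ignore') as handle:
      text = handle.read()

  records = re.split(r'\n\d+ of \d+ DOCUMENTS\n', text)   # hypothetical delimiter

  for record in records:
      if record.strip():
          with open(uuid.uuid4().hex + '.txt', 'w', encoding='utf-8') as handle:
              handle.write(record.strip())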

In short, your mileage may vary.

Freebo@ND and library catalogues

Posted on August 19, 2017 in Freebo@ND by Eric Lease Morgan

Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. In order to use & understand items in the collection, some sort of finding tool — such as a catalogue — is all but required. Freebo@ND supports the more modern full text index which has become the current best practice finding tool, but Freebo@ND also offers a set of much more traditional library tools. This blog posting describes how & why the use of these more traditional tools can be beneficial to the reader/researcher. In the end, we will learn that “What is old is new again.”

An abbreviated history

A long time ago, in a galaxy far far away, library catalogues were simply accession lists. As new items were brought into the collection, new entries were appended to the list. Each item would then be given an identifier, and the item would be put into storage. It could then be very easily located. Search/browse the list, identify item(s) of interest, note identifier(s), retrieve item(s), and done.

As collections grew, the simple accession list proved to be not scalable because it was increasingly difficult to browse the growing list. Thus indexes were periodically created. These indexes were essentially lists of authors, titles, or topics/subjects, and each item on the list was associated with a title and/or a location code. The use of the index was then very similar to the use of the accession list. Search/browse the index, identify item(s) of interest, note location code(s), retrieve item(s), and done. While these indexes were extremely functional, they were difficult to maintain. As new items became a part of the collection it was impossible to insert them into the middle of the printed index(es). Consequently, the printed indexes were rarely up-to-date.

To overcome the limitations of the printed index(es), someone decided not to manifest them as books, but rather as cabinets (drawers) of cards — the venerable card catalogue. Using this technology, it was trivial to add new items to the index. Type up cards describing items, and insert cards into the appropriate drawer(s). Readers could then search/browse the drawers, identify item(s) of interest, note location code(s), retrieve item(s), and done.

It should be noted that these cards were formally… formatted. Specifically, they included “cross-references” enabling the reader to literally “hyperlink” around the card catalogue to find & identify additional items of interest. On the downside, these cross-references (and therefore the hyperlinks) were limited by design to three to five in number. If more than three to five cross-references were included, then the massive numbers of cards generated would quickly outpace the space allotted to the cabinets. After all, these cabinets came to dominate (and stereotype) libraries and librarianship. They occupied hundreds, if not thousands, of square feet, and whole departments of people (cataloguers) were employed to keep them maintained.

With the advent of computers, the catalogue cards became digitally manifested. Initially the digital manifestations were used to transmit bibliographic data from the Library of Congress to libraries who would print cards from the data. Eventually, the digital manifestations were used to create digital indexes, which eventually became the online catalogues of today. Thus, the discovery process continues. Search/browse the online catalogue. Identify items of interest. Note location code(s). Retrieve item(s). Done. But, for the most part, these catalogues do not meet reader expectations because the content of the indexes is merely bibliographic metadata (authors, titles, subjects, etc.) when advances in full text indexing have proven to be more effective. Alas, libraries simply do not have the full text of the books in their collections, and consequently libraries are not able to provide full text indexing services. †

What is old is new again

The catalogues representing the content of Freebo@ND are perfect examples of the history of catalogues as outlined above.

For all intents & purposes, Freebo@ND is YAVotSTC (“Yet Another Version of the Short-Title Catalogue”). In 1926 Pollard & Redgrave compiled an index of early English books entitled A Short-title Catalogue of books printed in England, Scotland, & Ireland and of English books printed abroad, 1475-1640. This catalogue became known as the “English short-title catalogue” or ESTC. [1] The catalog’s purpose is succinctly stated on page xi:

The aim of this catalogue is to give abridged entries of all ‘English’ books, printed before the close of the year 1640, copies of which exist at the British Museum, the Bodleian, the Cambridge University Library, and the Henry E. Huntington Library, California, supplemented by additions from nearly one hundred and fifty other collections.

The 600-page book is essentially an author index beginning with the likes of George Abbot and ending with Ulrich Zwingli. Shakespeare begins on page 517, goes on for four pages, and includes STC (accession) numbers 22273 through 22366. And the catalogue functions very much like the catalogues of old. Articulate an author of interest. Look up the author in the index. Browse the listings found there. Note the abbreviation of libraries holding an item of interest. Visit the library, and ultimately, look at the book.

The STC has a history and relatives, some of which are documented in a book entitled The English Short-Title Catalogue: Past, present, future, dating from 1998. [2] I was interested in two of the newer relatives of the Catalogue:

  1. English short title catalogue on CD-ROM 1473-1800 – This is an IBM DOS-based package supposedly enabling the researcher/scholar to search & browse the Catalogue’s bibliographic data, but I was unable to give the package a test drive since I did not have ready access to a DOS-based computer. [3] From the bibliographic description’s notes: “This catalogue on CD-ROM contains more than 25,000 of the total 36,000 records of titles in English in the British Library for the period 1473-1640. It also includes 105,000 records for the period 1641-1700, together with the most recent version of the ESTC file, approximately 312,000 records.”
  2. English Short Title Catalogue [as a website] – After collecting & indexing the “digital manifestations” describing items in the Catalogue, a Web-accessible version of the catalogue is available from the British Library. [4] From the about page: “The ‘English Short Title Catalogue’ (ESTC) began as the ‘Eighteenth Century Short Title Catalogue’, with a conference jointly sponsored by the American Society for Eighteenth-Century Studies and the British Library, held in London in June 1976. The aim of the original project was to create a machine-readable union catalogue of books, pamphlets and other ephemeral material printed in English-speaking countries from 1701 to 1800.” [5]

As outlined above, Freebo@ND is a collection of early English book-like materials as well as a set of services provided against them. The source data originates from the Text Creation Partnership, and it is manifested as a set of TEI/XML files with full/rich metadata as well as the markup of every single word in every single document. To date, there are only 15,000 items in Freebo@ND, but when the project is complete, Freebo@ND ought to contain close to 60,000 items dating from 1460 to 1699. Given this data, Freebo@ND sports an online, full text index of the works collected to date. This online interface is field searchable and free text searchable, and it provides a faceted browse interface. [6]

But wait! There’s more!! (And this is the point.) Because the complete bibliographic data is available from the original data, it has been possible to create printed catalogs/indexes akin to the catalogs/indexes of old. These catalogs/indexes are available for downloading, and they include:

  • main catalog – a list of everything ordered by “accession” number; use this file in conjunction with your software’s find function to search & browse the collection [7]
  • author index – a list of all the authors in the collection & pointers to their locations in the repository; use this to learn who wrote what & how often [8]
  • title index – a list of all the works in the collection ordered by title; this file is good for “known item searches” [9]
  • date index – like the author index, this file lists all the years of item publication and pointers to where those items can be found; use this to see what was published when [10]
  • subject index – a list of all the Library Of Congress subject headings used in the collection, and their associated items; use this file to learn the “aboutness” of the collection as a whole [11]

These catalogs/indexes are very useful. It is really easy to load them into your favorite text editor and to peruse them for items of interest. They are even more useful if they are printed! Using these catalogues/indexes it is very quick & easy to see how prolific any author was, how many items were published in a given year, and what the published items were about. The library profession’s current tools do not really support such functions. Moreover, and unlike the cool (“kewl”) online interfaces alluded to above, these printed catalogs are easily updated, duplicated, shareable, and if bound can stand the test of time. Let’s see any online catalog last more than a decade and be so inexpensively produced.

“What is old is new again.”

Notes/links

† Actually, even if libraries were to have the full text of their collection readily available, the venerable library catalogues would probably not be able to use the extra content. This is because the digital manifestations of the bibliographic data can not be more than 100,000 characters long, and the existing online systems are not designed for full text indexing. To say the least, the inclusion of full text indexing in library catalogues would be revolutionary in scope, and it would also be the beginning of the end of traditional library cataloguing as we know it.

[1] Short-title catalogue or ESTC – http://www.worldcat.org/oclc/846560579
[2] Past, present, future – http://www.worldcat.org/oclc/988727012
[3] STC on CD-ROM – http://www.worldcat.org/oclc/605215275
[4] ESTC as website – http://estc.bl.uk/
[5] ESTC about page – http://www.bl.uk/reshelp/findhelprestype/catblhold/estchistory/estchistory.html
[6] Freebo@ND search interface – http://cds.crc.nd.edu/cgi-bin/search.cgi
[7] main catalog – http://cds.crc.nd.edu/downloads/catalog-main.txt
[8] author index – http://cds.crc.nd.edu/downloads/catalog-author.txt
[9] title index – http://cds.crc.nd.edu/downloads/catalog-title.txt
[10] date index – http://cds.crc.nd.edu/downloads/catalog-date.txt
[11] subject index – http://cds.crc.nd.edu/downloads/catalog-subject.txt

Stories: Interesting projects I worked on this past year

Posted on August 9, 2017 in Uncategorized by Eric Lease Morgan

This is a short list of “stories” outlining some of the more interesting projects I worked on this past year:

  • Ask Putin – A faculty member from the College of Arts & Letters acquired the 950-page Cyrillic transcript of a television show called “Ask Putin”. The faculty member had marked up the transcription by hand in order to analyze the themes conveyed therein. They then visited the Center for Digital Scholarship, and we implemented a database version of the corpus. By counting & tabulating the roots of each of the words for each of the sixteen years of the show, we were able to quickly & easily confirm many of the observations they had generated by hand. Moreover, the faculty member was able to explore additional themes which they had not previously coded.
  • Who’s related to whom – A visiting scholar from the Kroc Center asked the Center for Digital Scholarship to extract all of the “named entities” (names, places, & things) from a set of Spanish language newspaper articles. Based on the strength of the relationships between the entities, the scholar wanted a visualization to be created illustrating who was related to whom in the corpus. When we asked more about the articles and their content, we learned we had been asked to map the Colombian drug cartel. While incomplete, the framework of this effort will possibly be used by a South American government.
  • Counting 250,000,000 words – Working with Northwestern University and Washington University in St. Louis, the Center for Digital Scholarship is improving access & services against the set of literature called “Early English Books”. This corpus spans 1460 to 1699 and is very representative of English literature of that time. We have been creating more accurate transcriptions of the texts, digitizing original items, and implementing ways to do “scalable reading” against the whole. After all, it is difficult to read 60,000 books. Through this process each & every word from the transcriptions has been saved in a database for future analysis. To date the database includes a quarter of a billion (250,000,000) rows. See: http://cds.crc.nd.edu
  • Convocate – In conjunction with the Center for Civil and Human Rights, the Hesburgh Libraries created an online tool for comparing & contrasting human rights policy written by the Vatican and various non-governmental agencies. As a part of this project, the Center for Digital Scholarship wrote an application that read each & every paragraph from the thousands of pages of text. The application then classified each & every paragraph with one or more keyword terms for the purposes of more accurate & thorough discovery across the corpus. The results of this application enable the researcher to find items of similar interest even if those items employ disparate sets of terminology. For more detail, see: https://convocate.nd.edu