
Invitation to hack the Distant Reader

Posted on June 13, 2019 in Distant Reader

We invite you to write a cool hack enabling students & scholars to “read” an arbitrarily large corpus of textual materials.

Introduction

A website called The Distant Reader takes an arbitrary number of files or links to files as input. [1] The Reader then amasses the files locally, transforms them into plain text files, and performs quite a bit of natural language processing against them. [2] The result — in the form of a file system — is a set of operating-system-independent indexes which point to the individual files from the input. [3] Put another way, each input file is indexed in a number of ways, and is therefore accessible by any one or combination of the following attributes:

  • any named entity (name of a person, place, date, time, money amount, etc.)
  • any part of speech (noun, verb, adjective, etc.)
  • email address
  • free text word
  • readability score
  • size of file
  • statistically significant keyword
  • textual summary
  • URL

All of the things listed above are saved as plain text files, but they have also been reduced to an SQLite database (./etc/reader.db), which is distributed with the file system as well.
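For orientation, here is a minimal sketch of opening that database with stock Python. Only the path (./etc/reader.db) and the table names come from this posting; the rest is generic SQLite, and querying sqlite_master (and PRAGMA table_info) is a handy way to discover the exact table & column names your copy of the database uses:

  import sqlite3

  # connect to the Reader's database and enumerate its tables
  connection = sqlite3.connect('./etc/reader.db')
  cursor = connection.cursor()
  cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
  for (name,) in cursor.fetchall():
      print(name)

  # PRAGMA table_info lists the columns of a given table, ent for example
  for column in cursor.execute('PRAGMA table_info(ent)'):
      print(column)

  connection.close()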

The Challenge

Your mission, if you choose to accept it, is to write a cool hack against the Distant Reader’s output. By doing so, you will be enabling people to increase their comprehension of the given files. Here is a list of possible hacks:

  • create a timeline – The database includes a named entities table (ent). Each entity is denoted by a type, and one of those types is “PERSON”. Find all named entities of type PERSON, programmatically look them up in Wikidata, extract each entity’s birth & death dates, and plot the results on a timeline. (A sketch of the Wikidata lookup appears after this list.) As an added bonus, update the database with the dates. Alternatively, and possibly more simply, find all entities of type DATE (or TIME), and plot those values on a timeline.
  • create a map – Like the timeline hack, find all entities denoting places (GPE or LOC), look up their geographic coordinates in Wikidata, and plot them on a map; the same Wikidata sketch below applies. As an added bonus, update the database with the coordinates.
  • order documents based on similarity – “Find more like this one” is an age-old information retrieval use case. Given a reference document – a document denoted as particularly relevant – create a list of documents from the input which are similar to the reference document. For example, create a vector denoting the characteristics of the reference document. [4] Then create vectors for each document in the collection. Finally, use something like the Cosine Similarity algorithm to determine which documents are most similar (or different). [5] (A sketch of this hack appears after this list.) The reference document may be from either inside or outside the Reader’s file system, for example, the Bible or Shakespeare’s Hamlet.
  • write a Javascript interface to the database – The Distant Reader’s database (./etc/reader.db) is manifested as a single SQLite file. There exists a Javascript library enabling one to read & write to SQLite databases. [6] Sans a Web server, write sets of HTML pages enabling a person to query the database. Example queries might include: find all documents where Plato is a keyword, find all sentences where Plato is a named entity, find all questions, etc. The output of such queries can be HTML pages, but almost as importantly, they can be CSV files so people can do further analysis. As an added bonus, enable a person to update the database so things like authors, titles, dates, genres, or notes can be assigned to items in the bib table.
  • list what is being bought or sold – Use the entities table (ent) to identify all the money amounts (type equals “MONEY”) and the sentences in which they appear. Extract all of those sentences, analyze each sentence, and output the things being tendered. You will probably have to join the id and sentence id in the ent table with the id and sentence id in the pos table to implement this hack; a sketch of such a join appears after this list. As an added bonus, calculate how much things would cost in today’s dollars or any other currency.
  • normalize metadata – The values in the named entities table (ent) are often repeated in various forms. For example, a value may be Plato, plato, or PLATO. Use something like the Levenshtein distance algorithm to normalize each value into something more consistent; see the normalization sketch after this list. [7]
  • prioritize metadata – Just because a word is frequent does not mean it is significant. A given document may mention Plato many times, but if Plato is mentioned in each and every document, then the word is akin to noise. Prioritize given named entities, specifically names, through the use of something like TFIDF. Calculate a TFIDF score for a given word, and if the word is above a given threshold, then update the database accordingly. [8]
  • extract sentences matching a given grammar – Each & every word, punctuation mark, and part of speech of each & every document is enumerated and stored in the pos table of the database. Consequently it is rather easy to find all questions in the database and extract them. (Find all sentence ids where the punctuation equals “?”. Find all words (tokens) with the same id and sentence id. Output all tokens sorted by token id.) Similarly, it is possible to find all sentences where a noun precedes a verb which precedes another noun. Or, find all sentences where a noun precedes a verb which is followed by the word “no” or “not” which precedes another noun. Such queries find sentences in the form of “cat goes home” or “dog is not cat”. Such are assertive sentences. A cool hack would be to identify sentences of any given grammar such as adjective-noun or entity-verb where the verb is some form of the lemma to be (is, was, are, were, etc.), as in “Plato is” or “Plato was”. The adjective-noun pattern is of particular interest, especially given a particular noun. Find all sentences matching the pattern adjective-king to learn how the king was described; a sketch of this pattern matching appears after this list.
  • create a Mad Lib – This one is off the wall. Identify (random) items of interest from the database. Write a template in the form of a story. Fill in the template with the items of interest. Done. The “better” story would be one that is complete with “significant” words from the database; the better story would be one that relates truths from the underlying content. For example, identify the two most significant nouns. Identify a small handful of the most significant verbs. Output simple sentences in the form of noun-verb-noun.
  • implement one of two search engines – The Distant Reader’s output includes a schema file (./etc/schema.xml) defining the structure of a possible Solr index. The output also includes an indexer (./bin/db2solr.pl) as well as a command-line interface (./bin/search-solr.pl) to search the index. Install Solr. Create an index with the given schema. Index the content. Write a graphical front-end to the index complete with faceted search functionality. Allow search results to be displayed and saved in a tabular form for further analysis. The Reader’s output also includes a semantic index (./etc/reader.vec) a la word2vec, as well as a command-line interface (./bin/search-vec.py) for querying the semantic index. Write a graphical interface for querying the semantic index.
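To make the timeline and map hacks more concrete, here is a sketch of the Wikidata lookup. Caveats: the ent column names (type and entity) are my guesses, so verify them with PRAGMA table_info(ent); the script requires the third-party requests library; and while wbsearchentities and Special:EntityData are real Wikidata endpoints, naively taking the first search hit will sometimes match the wrong person:

  import sqlite3
  import requests

  API  = 'https://www.wikidata.org/w/api.php'
  DATA = 'https://www.wikidata.org/wiki/Special:EntityData/%s.json'

  connection = sqlite3.connect('./etc/reader.db')
  cursor = connection.cursor()

  # each unique entity of type PERSON; for the map hack, select GPE or
  # LOC here and read claim P625 (coordinate location) below instead
  cursor.execute("SELECT DISTINCT entity FROM ent WHERE type = 'PERSON'")
  for (person,) in cursor.fetchall():

      # find the most likely Wikidata item for the name
      found = requests.get(API, params={'action': 'wbsearchentities',
                                        'search': person,
                                        'language': 'en',
                                        'format': 'json'}).json()
      if not found['search']:
          continue
      qid = found['search'][0]['id']

      # fetch the item's claims; P569 is date of birth, P570 date of death
      claims = requests.get(DATA % qid).json()['entities'][qid]['claims']
      birth = claims.get('P569', [{}])[0].get('mainsnak', {}) \
                    .get('datavalue', {}).get('value', {}).get('time')
      death = claims.get('P570', [{}])[0].get('mainsnak', {}) \
                    .get('datavalue', {}).get('value', {}).get('time')
      print(person, birth, death)

  connection.close()

Note that Wikidata returns its dates in its own format (a signed year followed by an ISO-style timestamp), so a real timeline will want to parse those values before plotting.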
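For the similarity hack, the following sketch uses the Scikit Learn pieces pointed to in notes [4] and [5]. The locations of the plain-text files (./txt/*.txt) and of the reference document (hamlet.txt) are assumptions; point them at wherever your study carrel keeps its text:

  import glob

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  # read the reference document and the collection to compare it against
  reference = open('hamlet.txt').read()
  files     = sorted(glob.glob('./txt/*.txt'))
  documents = [open(file).read() for file in files]

  # model everything in a single vocabulary; row 0 is the reference
  vectors = TfidfVectorizer(stop_words='english').fit_transform([reference] + documents)

  # score each document against the reference; list the most similar first
  scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()
  for score, file in sorted(zip(scores, files), reverse=True):
      print(round(score, 3), file)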
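The bought & sold hack begins with exactly the join described above. As before, every column name in this sketch (id, sid, tid, token, type) is an assumption to be checked against PRAGMA table_info(ent) and PRAGMA table_info(pos):

  import sqlite3

  connection = sqlite3.connect('./etc/reader.db')

  # find the document & sentence ids of every MONEY entity
  pairs = connection.execute(
      "SELECT DISTINCT id, sid FROM ent WHERE type = 'MONEY'").fetchall()

  # rebuild each such sentence from the pos table, token by token
  for document, sentence in pairs:
      tokens = connection.execute(
          'SELECT token FROM pos WHERE id = ? AND sid = ? ORDER BY tid',
          (document, sentence)).fetchall()
      print(' '.join(token for (token,) in tokens))

  connection.close()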
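For the normalization hack, the sketch below plainly swaps a true Levenshtein measure for the similarity ratio in Python's built-in difflib, simply to stay dependency-free; note [7] points to proper Levenshtein libraries. The entity column name is, again, an assumption:

  import sqlite3
  from difflib import SequenceMatcher

  connection = sqlite3.connect('./etc/reader.db')
  values = [value for (value,) in
            connection.execute('SELECT DISTINCT entity FROM ent')]

  # group values which are nearly identical save for case or small typos
  groups = {}
  for value in values:
      for canonical in groups:
          if SequenceMatcher(None, value.lower(), canonical.lower()).ratio() > 0.9:
              groups[canonical].append(value)
              break
      else:
          groups[value] = [value]

  # report the groups with more than one variant; updating the database
  # with a chosen canonical form is left as the added bonus
  for canonical, variants in groups.items():
      if len(variants) > 1:
          print(canonical, '<-', variants)

  connection.close()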
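And for the grammar hack, sentence patterns can be matched with a self-join on the pos table: two rows from the same document & sentence whose token ids are adjacent. The sketch below looks for adjective-king; it assumes the table carries Penn Treebank-style tags (JJ, JJR, JJS for adjectives) in a column named pos, which, like the other column names, ought to be verified first:

  import sqlite3

  connection = sqlite3.connect('./etc/reader.db')

  # an adjective (tag JJ*) immediately followed by the noun of interest
  rows = connection.execute('''
      SELECT a.token, b.token
      FROM pos AS a JOIN pos AS b
        ON a.id = b.id AND a.sid = b.sid AND b.tid = a.tid + 1
      WHERE a.pos LIKE 'JJ%' AND LOWER(b.token) = 'king'
  ''').fetchall()

  for adjective, noun in rows:
      print(adjective, noun)

  connection.close()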

Sample data

In order for you to do your good work, you will need some Distant Reader output. Here are pointers to some such stuff:

Helpful hint

With the exception of only a few files (./etc/reader.db, ./etc/reader.vec, and ./cache/*), all of the files in the Distant Reader’s output are plain text files. More specifically, they are either unstructured data files or delimited files. Regardless of a file’s extension, the vast majority of the files can be read with your favorite text editor, spreadsheet, or database application. To read the database file (./etc/reader.db), you will need an SQLite application. The files in the adr, bib, ent, pos, urls, or wrd directories are all tab delimited files. A program called OpenRefine is a WONDERFUL tool for reading and analyzing tab delimited files. [9] In fact, a whole lot can be learned through the skillful use of OpenRefine against the tab delimited files.
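If a programmatic approach is preferred over OpenRefine, Python's built-in csv module reads these files just as happily; the file name below is purely hypothetical, so substitute one from your own study carrel:

  import csv

  # read one of the tab delimited files, row by row
  with open('./ent/report.tsv') as handle:
      for row in csv.reader(handle, delimiter='\t'):
          print(row)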

Notes

[1] The home page of the Distant Reader is https://distantreader.org

[2] All of the code doing this processing is available on GitHub. See https://github.com/ericleasemorgan/reader

[3] This file system is affectionately known as a “study carrel”.

[4] An easy-to-use library for creating such vectors is a part of the Scikit Learn suite of software. See http://bit.ly/2F5EoxA

[5] The algorithm is described at https://en.wikipedia.org/wiki/Cosine_similarity, and a Scikit Learn module is available at http://bit.ly/2IaYcS3

[6] The library is called sql.js, and it is available at https://github.com/kripken/sql.js/

[7] The Levenshtein distance is described here — https://en.wikipedia.org/wiki/Levenshtein_distance, and various libraries doing the good work are outlined at http://bit.ly/2F30roM

[8] Yet another Scikit Learn module may be of use here — http://bit.ly/2F5o2oS

[9] OpenRefine eats delimited files for lunch. See http://openrefine.org