Archive for the ‘Distant Reader’ Category

What is the Distant Reader and why should I care?

Posted on November 9, 2019 in Distant Reader

The Distant Reader is a tool for reading. [1]

wall paper by eric

Wall Paper by Eric

The Distant Reader takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process.

The Distant Reader empowers one to use & understand large amounts of textual information both quickly & easily. For example, the Distant Reader can consume the entire issue of a scholarly journal, the complete works of a given author, or the content found at the other end of an arbitrarily long list of URLs. Thus, the Distant Reader is akin to a book’s table-of-contents or back-of-the-book index but at scale. It simplifies the process of identifying trends & anomalies in a corpus, and then it enables a person to further investigate those trends & anomalies.

The Distant Reader is designed to “read” everything from a single item to a corpus of thousand’s of items. It is intended for the undergraduate student who wants to read the whole of their course work in a given class, the graduate student who needs to read hundreds (thousands) of items for their thesis or dissertation, the scientist who wants to review the literature, or the humanist who wants to characterize a genre.

How it works

The Distant Reader takes five different forms of input:

  1. a URL – good for blogs, single journal articles, or long reports
  2. a list of URLs – the most scalable, but creating the list can be problematic
  3. a file – good for that long PDF document on your computer
  4. a zip file – the zip file can contain just about any number of files from your computer
  5. a zip file plus a metadata file – with the metadata file, the reader’s analysis is more complete

Once the input is provided, the Distant Reader creates a cache — a collection of all the desired content. This is done via the input or by crawling the ‘Net. Once the cache is collected, each & every document is transformed into plain text, and along the way basic bibliographic information is extracted. The next step is analysis against the plain text. This includes rudimentary counts & tabulations of ngrams, the computation of readability scores & keywords, basic topic modeling, parts-of-speech & named entity extraction, summarization, and the creation of a semantic index. All of these analyses are manifested as tab-delimited files and distilled into a single relational database file. After the analysis is complete, two reports are generated: 1) a simple plain text file which is very tabular, and 2) a set of HTML files which are more narrative and graphical. Finally, everything that has been accumulated & generated is compressed into a single zip file for downloading. This zip file is affectionately called a “study carrel“. It is completely self-contained and includes all of the data necessary for more in-depth analysis.

What it does

The Distant Reader supplements the traditional reading process. It does this in the way of traditional reading apparatus (tables of content, back-of-book indexes, page numbers, etc), but it does it more specifically and at scale.

Put another way, the Distant Reader can answer a myriad of questions about individual items or the corpus as a whole. Such questions are not readily apparent through traditional reading. Examples include but are not limited to:

  • How big is the corpus, and how does its size compare to other corpora?
  • How difficult (scholarly) is the corpus?
  • What words or phrases are used frequently and infrequently?
  • What statistically significant words characterize the corpus?
  • Are there latent themes in the corpus, and if so, then what are they and how do they change over both time and place?
  • How do any latent themes compare to basic characteristics of each item in the corpus (author, genre, date, type, location, etc.)?
  • What is discussed in the corpus (nouns)?
  • What actions take place in the corpus (verbs)?
  • How are those things and actions described (adjectives and adverbs)?
  • What is the tone or “sentiment” of the corpus?
  • How are the things represented by nouns, verbs, and adjective related?
  • Who is mentioned in the corpus, how frequently, and where?
  • What places are mentioned in the corpus, how frequently, and where?

People who use the Distant Reader look at the reports it generates, and they often say, “That’s interesting!” This is because it highlights characteristics of the corpus which are not readily apparent. If you were asked what a particular corpus was about or what are the names of people mentioned in the corpus, then you might answer with a couple of sentences or a few names, but with the Distant Reader you would be able to be more thorough with your answer.

The questions outlined above are not necessarily apropos to every student, researcher, or scholar, but the answers to many of these questions will lead to other, more specific questions. Many of those questions can be answered directly or indirectly through further analysis of the structured data provided in the study carrel. For example, each & every feature of each & every sentence of each & every item in the corpus has been saved in a relational database file. By querying the database, the student can extract every sentence with a given word or matching a given grammer to answer a question such as “How was the king described before & after the civil war?” or “How did this paper’s influence change over time?”

A lot of natural language processing requires pre-processing, and the Distant Reader does this work automatically. For example, collections need to be created, and they need to be transformed into plain text. The text will then be evaluated in terms of parts-of-speech and named-entities. Analysis is then done on the results. This analysis may be as simple as the use of concordance or as complex as the application of machine learning. The Distant Reader “primes the pump” for this sort of work because all the raw data is already in the study carrel. The Distant Reader is not intended to be used alone. It is intended to be used in conjunction with other tools, everything from a plain text editor, to a spreadsheet, to database, to topic modelers, to classifiers, to visualization tools.

Conclusion

I don’t know about you, but now-a-days I can find plenty of scholarly & authoritative content. My problem is not one of discovery but instead one of comprehension. How do I make sense of all the content I find? The Distant Reader is intended to address this question by making observations against a corpus and providing tools for interpreting the results.

Links

[1] Distant Reader – https://distantreader.org

Project Gutenberg and the Distant Reader

Posted on November 6, 2019 in Distant Reader

The venerable Project Gutenberg is perfect fodder for the Distant Reader, and this essay outlines how & why. (tl;dnr: Search my mirror of Project Gutenberg, save the result as a list of URLs, and feed them to the Distant Reader.)

Project Gutenberg

wall paper by Eric

Wall Paper by Eric

A long time ago, in a galaxy far far away, there was a man named Micheal Hart. Story has it he went to college at the University of Illinois, Urbana-Champagne. He was there during a summer, and the weather was seasonably warm. On the other hand, the computer lab was cool. After all, computers run hot, and air conditioning is a must. To cool off, Micheal went into the computer lab to be in a cool space.† While he was there he decided to transcribe the United States Declaration of Independence, ultimately, in the hopes of enabling people to use a computers to “read” this and additional transcriptions. That was in 1971. One thing led to another, and Project Gutenberg was born. I learned this story while attending a presentation by the now late Mr. Hart on Saturday, February 27, 2010 in Roanoke (Indiana). As it happened it was also Mr. Hart’s birthday. [1]

To date, Project Gutenberg is a corpus of more than 60,000 freely available transcribed ebooks. The texts are predominantly in English, but many languages are represented. Many academics look down on Project Gutenberg, probably because it is not as scholarly as they desire, or maybe because the provenance of the materials is in dispute. Despite these things, Project Gutenberg is a wonderful resource, especially for high school students, college students, or life-long learners. Moreover, its transcribed nature eliminates any problems of optical character recognition, such as one encounters with the HathiTrust. The content of Project Gutenberg is all but perfectly formatted for distant reading.

Unfortunately, the interface to Project Gutenberg is less than desirable; the index to Project Gutenberg is limited to author, title, and “category” values. The interface does not support free text searching, and there is limited support for fielded searching and Boolean logic. Similarly, the search results are not very interactive nor faceted. Nor is there any application programmer interface to the index. With so much “clean” data, so much more could be implemented. In order to demonstrate the power of distant reading, I endeavored to create a mirror of Project Gutenberg while enhancing the user interface.

To create a mirror of Project Gutenberg, I first downloaded a set of RDF files describing the collection. [2] I then wrote a suite of software which parses the RDF, updates a database of desired content, loops through the database, caches the content locally, indexes it, and provides a search interface to the index. [3, 4] The resulting interface is ill-documented but 100% functional. It supports free text searching, phrase searching, fielded searching (author, title, subject, classification code, language) and Boolean logic (using AND, OR, or NOT). Search results are faceted enabling the reader to refine their query sans a complicated query syntax. Because the cached content includes only English language materials, the index is only 33,000 items in size.

Project Gutenberg & the Distant Reader

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process. Project Gutenberg and the Distant Reader can be used hand-in-hand.

As described in a previous posting, the Distant Reader can take five different types of input. [5] One of those inputs is a file where each line in the file is a URL. My locally implemented mirror of Project Gutenberg enables the reader to search & browse in a manner similar to the canonical version of Project Gutenberg, but with two exceptions. First & foremost, once a search has been gone against my mirror, one of the resulting links is “only local URLs”. For example, below is an illustration of the query “love AND honor AND truth AND justice AND beauty”, and the “only local URLs” link is highlighted:

search result

Search result

By selecting the “only local URLs”, a list of… URLs is returned, like this:

urls

URLs

This list of URLs can then be saved as file, and any number of things can be done with the file. For example, there are Google Chrome extensions for the purposes of mass downloading. The file of URLs can be fed to command-line utilities (ie. curl or wget) also for the purposes of mass downloading. In fact, assuming the file of URLs is named love.txt, the following command will download the files in parallel and really fast:

cat love.txt | parallel wget

This same file of URLs can be used as input against the Distant Reader, and the result will be a “study carrel” where the whole corpus could be analyzed — read. For example, the Reader will extract all the nouns, verbs, and adjectives from the corpus. Thus you will be able to answer what and how questions. It will pull out named entities and enable you to answer who and where questions. The Reader will extract keywords and themes from the corpus, thus outlining the aboutness of your corpus. From the results of the Reader you will be set up for concordancing and machine learning (such as topic modeling or classification) thus enabling you to search for more narrow topics or “find more like this one”. The search for love, etc returned more than 8000 items. Just less than 500 of them were returned in the search result, and the Reader empowers you to read all 500 of them at one go.

Summary

Project Gutenberg is very useful resource because the content is: 1) free, and 2) transcribed. Mirroring Project Gutenberg is not difficult, and by doing so an interface to it can be enhanced. Project Gutenberg items are perfect items for reading & analysis by the Distant Reader. Search Project Gutenberg, save the results as a file, feed the file to the Reader and… read the results at scale.

Notes and links

† All puns are intended.

[1] Michael Hart in Roanoke (Indiana) – video: https://youtu.be/eeoBbSN9Esg; blog posting: http://infomotions.com/blog/2010/03/michael-hart-in-roanoke-indiana/

[2] The various Project Gutenberg feeds, including the RDF is located at https://www.gutenberg.org/wiki/Gutenberg:Feeds

[3] The suite of software to cache and index Project Gutenberg is available on GitHub at https://github.com/ericleasemorgan/gutenberg-index

[4] My full text index to the English language texts in Project Gutenberg is available at http://dh.crc.nd.edu/sandbox/gutenberg/cgi-bin/search.cgi

[5] The Distant Reader and its five different types of input – http://sites.nd.edu/emorgan/2019/10/dr-inputs/

The Distant Reader and its five different types of input

Posted on October 19, 2019 in Distant Reader

The Distant Reader can take five different types of input, and this blog posting describes what they are.

wall paper by Eric

Wall Paper by Eric

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to designed the traditional reading process.

At the present time, the Reader can accept five different types of input, and they include:

  1. a file
  2. a URL
  3. a list of URLs
  4. a zip file
  5. a zip file with a companion CSV file

Each of these different types of input are elaborated upon below.

A file

The simplest form of input is a single file from your computer. This can be just about file available to you, but to make sense, the file needs to contain textual data. Thus, the file can be a Word document, a PDF file, an Excel spreadsheet, an HTML file, a plain text file, etc. A file in the form of an image will not work because it contains zero text. Also, not all PDF files are created equal. Some PDF files are only facsimiles of their originals. Such PDF files are merely sets of images concatenated together. In order for PDF files to be used as input, the PDF files need to have been “born digitally” or they need to have had optical character recognition previously applied against them. Most PDF files are born digitally nor do they suffer from being facsimiles.

A good set of use-cases for single file input is the whole of a book, a long report, or maybe a journal article. Submitting a single file to the Distant Reader is quick & easy, but the Reader is designed for analyzing larger rather than small corpora. Thus, supplying a single journal article to the Reader doesn’t make much sense; the use of the traditional reading process probably makes more sense for a single journal article.

A URL

The Distant Reader can take a single URL as input. Given a URL, the Reader will turn into a rudimentary Internet spider and build a corpus. More specifically, given a URL, the Reader will:

  • retrieve & cache the content found at the other end of the URL
  • extract any URLs it finds in the content
  • retrieve & cache the content from these additional URLs
  • stop building the corpus but continue with its analysis

In short, given a URL, the Reader will cache the URL’s content, crawl the URL one level deep, cache the result, and stop caching.

Like the single file approach, submitting a URL to the Distant Reader is quick & easy, but there are a number of caveats. First of all, the Reader does not come with very many permissions, and just because you are authorized to read the content at the other end of a URL does not mean the Reader has the same authorization. A lot of content on the Web resides behind paywalls and firewalls. The Reader can only cache 100% freely accessible content.

“Landing pages” and “splash pages” represent additional caveats. Many of the URLs passed around the ‘Net do not point to the content itself, but instead they point to ill-structured pages describing the content — metadata pages. Such pages may include things like authors, titles, and dates, but these things are not presented in a consistent nor computer-readable fashion; they are laid out with aesthetics or graphic design in mind. These pages do contain pointers to the content you want to read, but the content may be two or three more clicks away. Be wary of URLs pointing to landing pages or splash pages.

Another caveat to this approach is the existence of extraneous input due to navigation. Many Web pages include links for navigating around the site. They also include links to things like “contact us” and “about this site”. Again, the Reader is sort of stupid. If found, the Reader will crawl such links and include their content in the resulting corpus.

Despite these drawbacks there are number of excellent use-cases for single URL input. One of the best is Wikipedia articles. Feed the Reader a URL pointing to a Wikipedia article. The Reader will cache the article itself, and then extract all the URLs the article uses as citations. The Reader will then cache the content of the citations, and then stop caching.

Similarly, a URL pointing to an open access journal article will function just like the Wikipedia article, and this will be even more fruitful if the citations are in the form of freely accessible URLs. Better yet, consider pointing the Reader to the root of an open access journal issue. If the site is not overly full of navigation links, and if the URLs to the content itself are not buried, then the whole of the issue will be harvested and analyzed.

Another good use-case is the home page of some sort of institution or organization. Want to know about Apple Computer, the White House, a conference, or a particular department of a university? Feed the root URL of any of these things to the Reader, and you will learn something. At the very least, you will learn how the organization prioritizes its public face. If things are more transparent than not, then you might be able to glean the names and addresses of the people in the organization, the public policies of the organization, or the breadth & depth of the organization.

Yet another excellent use-case includes blogs. Blogs often contain content at their root. Navigations links abound, but more often than not the navigation links point to more content. If the blog is well-designed, then the Reader may be able to create a corpus from the whole thing, and you can “read” it in one go.

A list of URLs

The third type of input is a list of URLs. The list is expected to be manifested as a plain text file, and each line in the file is a URL. Use whatever application you desire to build the list, but save the result as a .txt file, and you will probably have a plain text file.‡

Caveats? Like the single URL approach, the list of URLs must point to freely available content, and pointing to landing pages or splash pages is probably to be avoided. Unlike the single URL approach, the URLs in the list will not be used as starting points for Web crawling. Thus, if the list contains ten items, then ten items will be cached for analysis.

Another caveat is the actual process of creating the list; I have learned that is actually quite difficult to create lists of URLs. Copying & pasting gets old quickly. Navigating a site and right-clicking on URLs is tedious. While search engines & indexes often provide some sort of output in list format, the lists are poorly structured and not readily amenable to URL extraction. On the other hand, there are more than a few URL extraction tools. I use a Google Chrome extension called Link Grabber. [1] Install Link Grabber. Use Chrome to visit a site. Click the Link Grabber button, and all the links in the document will be revealed. Copy the links and paste them into a document. Repeat until you get tired. Sort and peruse the list of links. Remove the ones you don’t want. Save the result as a plain text file.‡ Feed the result to the Reader.

Despite these caveats, the list of URLs approach is enormously scalable; the list of URLs approach is the most scalable input option. Given a list of five or six items, the Reader will do quite well, but the Reader will operate just as well if the list contains dozens, hundreds, or even thousands of URLs. Imagine reading the complete works of your favorite author or the complete run of an electronic journal. Such is more than possible with the Distant Reader.‡

A zip file

The Distant Reader can take a zip file as input. Create a folder/directory on your computer. Copy just about any file into the folder/directory. Compress the file into a .zip file. Submit the result to the Reader.

Like the other approaches, there are a few caveats. First of all, the Reader is not able to accept .zip files whose size is greater than 64 megabytes. While we do it all the time, the World Wide Web was not really designed to push around files of any great size, and 64 megabytes is/was considered plenty. Besides, you will be surprised how many files can fit in a 64 megabyte file.

Second, the computer gods never intended file names to contain things other than simple Romanesque letters and a few rudimentary characters. Now-a-days our file names contain spaces, quote marks, apostrophes, question marks, back slashes, forward slashes, colons, commas, etc. Moreover, file names might be 64 characters long or longer! While every effort as been made to accomodate file names with such characters, your milage may vary. Instead, consider using file names which are shorter, simpler, and have some sort of structure. An example might be first word of author’s last name, first meaningful word of title, year (optional), and extension. Herman Melville’s Moby Dick might thus be named melville-moby.txt. In the end the Reader will be less confused, and you will be more able to find things on your computer.

There are a few advantages to the zip file approach. First, you can circumvent authorization restrictions; you can put licensed content into your zip files and it will be analyzed just like any other content. Second, the zip file approach affords you the opportunity to pre-process your data. For example, suppose you have downloaded a set of PDF files, and each page includes some sort of header or footer. You could transform each of these PDF files into plain text, use some sort of find/replace function to remove the headers & footers. Save the result, zip it up, and submit it to the Reader. The resulting analysis will be more accurate.

There are many use-cases for the zip file approach. Masters and Ph.D students are expected to read large amounts of material. Save all those things into a folder, zip them up, and feed them to the Reader. You have been given a set of slide decks from a conference. Zip them up and feed them to the Reader. A student is expected to read many different things for History 101. Download them all, put them in a folder, zip them up, and submit them to the Distant Reader. You have written many things but they are not on the Web. Copy them to a folder, zip them up, and “read” them with the… Reader.

A zip file with a companion CSV file

The final form of input is a zip file with a companion comma-separated value (CSV) file — a metadata file.

As the size of your corpus increases, so does the need for context. This context can often be manifested as metadata (authors, titles, dates, subject, genre, formats, etc.). For example, you might want to compare & contrast who wrote what. You will probably want to observe themes over space & time. You might want to see how things differ between different types of documents. To do this sort of analysis you will need to know metadata regarding your corpus.

As outlined above, the Distant Reader first creates a cache of content — a corpus. This is the raw data. In order to do any analysis against the corpus, the corpus must be transformed into plain text. A program called Tika is used to do this work. [2] Not only does Tika transform just about any file into plain text, but it also does its best to extract metadata. Depending on many factors, this metadata may include names of authors, titles of documents, dates of creation, number of pages, MIME-type, language, etc. Unfortunately, more often than not, this metadata extraction process fails and the metadata is inaccurate, incomplete, or simply non-existent.

This is where the CSV file comes in; by including a CSV file named “metadata.csv” in the .zip file, the Distant Reader will be able to provide meaningful context. In turn, you will be able to make more informed observations, and thus your analysis will be more thorough. Here’s how:

  • assemble a set of files for analysis
  • use your favorite spreadsheet or database application to create a list of the file names
  • assign a header to the list (column) and call it “file”
  • create one or more columns whose headers are “author” and/or “title” and/or “date”
  • to the best of your ability, update the list with author, title, or date values for each file
  • save the result as a CSV file named “metadata.csv” and put it in the folder/directory to be zipped
  • compress the folder/directory to create the zip file
  • submit the result to the Distant Reader for analysis

The zip file with a companion CSV file has all the strengths & weakness of the plain o’ zip file, but it adds some more. On the weakness side, creating a CSV file can be both tedious and daunting. On the other hand, many search engines & index export lists with author, title, and data metadata. One can use these lists as the starting point for the CSV file.♱ On the strength side, the addition of the CSV metadata file makes the Distant Reader’s output immeasurably more useful, and it leads the way to additional compare & contrast opportunities.

Summary

To date, the Distant Reader takes five different types of input. Each type has its own set of strengths & weaknesses:

  • a file – good for a single large file; quick & easy; not scalable
  • a URL – good for getting an overview of a single Web page and its immediate children; can include a lot of noise; has authorization limitations
  • a list of URLs – can accomodate thousands of items; has authorization limitations; somewhat difficult to create list
  • a zip file – easy to create; file names may get in the way; no authorization necessary; limited to 64 megabytes in size
  • a zip file with CSV file – same as above; difficult to create metadata; results in much more meaningful reports & opportunities

Happy reading!

Notes & links

‡ Distant Reader Bounty #1: To date, I have only tested plain text files using line-feed characters as delimiters, such are the format of plain text files in the Linux and Macintosh worlds. I will pay $10 to the first person who creates a plain text file of URLs delimited by carriage-return/line-feed characters (the format of Windows-based text files) and who demonstrates that such files break the Reader. “On you mark. Get set. Go!”

‡ Distant Reader Bounty #2: I will pay $20 to the first person who creates a list of 2,000 URLs and feeds it to the Reader.

♱ Distant Reader Bounty #3: I will pay $30 to the first person who writes a cross-platform application/script which successfully transforms a Zotero bibliography into a Distant Reader CSV metadata file.

[1] Link Grabber – http://bit.ly/2mgTKsp

[2] Tika – http://tika.apache.org

Invitation to hack the Distant Reader

Posted on June 13, 2019 in Distant Reader

We invite you to write a cool hack enabling students & scholars to “read” an arbitrarily large corpus of textual materials.

Introduction

A website called The Distant Reader takes an arbitrary number of files or links to files as input. [1] The Reader then amasses the files locally, transforms them into plain text files, and performs quite a bit of natural language processing against them. [2] The result — the the form of a file system — is a set of operating system independent indexes which point to individual files from the input. [3] Put another way, each input file is indexed in a number of ways, and therefore accessible by any one or combination of the following attributes:

  • any named entity (name of person, place, date, time, money amount, etc)
  • any part of speech (noun, verb, adjective, etc.)
  • email address
  • free text word
  • readability score
  • size of file
  • statistically significant keyword
  • textual summary
  • URL

All of things listed above are saved as plain text files, but they have also been reduced to an SQLite database (./etc/reader.db), which is also distributed with the file system.

The Challenge

Your mission, if you choose to accept it, is to write a cool hack against the Distant Reader’s output. By doing so, you will be enabling people to increase their comprehension of the given files. Here is a list of possible hacks:

  • create a timeline – The database includes a named entities table (ent). Each entity is denoted by a type, and one of those types is “PERSON”. Find all named entities of type PERSON, programmatically look them up in Wikidata, extract the entity’s birth & death dates, and plot the result on a timeline. As an added bonus, update the database with the dates. Alternatively, and possibly more simply, find all entities of type DATE (or TIME), and plot those values on a timeline.
  • create a map – Like the timeline hack, find all entities denoting places (GRE or LOC), look up their geographic coordinates in Wikidata, and plot them on a map. As an added bonus, update the database with the coordinates.
  • order documents based on similarity – “Find more like this one” is a age-old information retrieval use case. Given a reference document – a document denoted as particularly relevant — create a list of documents from the input which are similar to the reference document. For example, create a vector denoting the characteristics of the reference document. [4] Then create vectors for each document in the collection. Finally, use something like the Cosine Similarly algorithm to determine which documents are most similar (or different). [5] The reference document may be from either inside or outside the Reader’s file system, for example, the Bible or Shakespeare’s Hamlet.
  • write a Javascript interface to the database – The Distant Reader’s database (./etc/reader.db) is manifested as a single SQLite file. There exists a Javascript library enabling one to read & write to SQLite databases. [6] Sans a Web server, write sets of HTML pages enabling a person to query the database. Example queries might include: find all documents where Plato is a keyword, find all sentences where Plato is a named entity, find all questions, etc. The output of such queries can be HTML pages, but almost as importantly, they can be CSV files so people can do further analysis. As an added bonus, enable a person to update the database so things like authors, titles, dates, genres, or notes can be assigned to items in the bib table.
  • list what is being bought or sold – Use the entities table (ent) to identify all the money amounts (type equals “MONEY”) and the sentences from which they appear. Extract all of those sentences, analyze the sentence, and output the things being tendered. You will probably have to join the id and sentenced id in the ent table with the id and sentence id in the pos table to implement this hack. As an added bonus, calculate how much things would cost in today’s dollars or any other currency.
  • normalize metadata – The values in the named entities table (ent) are often repeated in various forms. For example, a value may be Plato, plato, or PLATO. Use something like the Levenshtein distance algorithm to normalize each value into something more consistent. [7]
  • prioritize metadata – Just because a word is frequent does not mean it is significant. A given document may mention Plato many times, but if Plato is mentioned in each and every document, then the word is akin to noise. Prioritize given named entities, specifically names, through the use of a something like TFIDF. Calculate a TFIDF score for a given word, and if the word is above a given threshold, then update the database accordingly. [8]
  • extract sentences matching a given grammer – Each & every word, punctuation mark, and part of speech of each & every document is enumerated and stored in the pos table of the database. Consequently it is rather easy to find all questions in the database and extract them. (Find all sentences ids where punctuation equals “?”. Find all words (tokens) with the same id and sentence id. Output all tokens sorted by token id.) Similarly, it is possible to find all sentences where a noun precedes a verb which precedes another noun. Or, find all sentences where a noun precedes a verb which is followed by the word “no” or “not” which precedes another noun. Such queries find sentence in the form of “cat goes home” or “dog is not cat”. Such are assertive sentences. A cool hack would be to identify sentences of any given grammer such as adjective-noun or entity-verb where the verb is some form of the lemma to be (is, was, are, were, etc.), as in “Plato is” or “Plato was”. The adjective-noun patterns is of particular interest, especially given a particular noun. Find all sentences matching the pattern adjective-king to learn how the king was described.
  • create a Mad Lib – This one is off the wall. Identify (random) items of interest from the database. Write a template in the form of a story. Fill in the template with the items of interest. Done. The “better” story would be one that is complete with “significant” words from the database; the better story would be one that relates truths from the underlying content. For example, identify the two most significant nouns. Identify a small handful of the most significant verbs. Output simple sentences in the form of noun-verb-noun.
  • implement one of two search engines – The Distant Reader’s output includes a schema file (./etc/schema.xml) defining the structure of a possible Solr index. The output also includes an indexer (./bin/db2solr.pl) as well as a command-line interface (./bin/search-solr.pl) to search the index. Install Solr. Create an index with the given schema. Index the content. Write a graphical front-end to the index complete with faceted search functionality. Allow search results to be displayed and saved in a tabular form for further analysis. The Reader’s output also includes a semantic index (./etc/reader.vec) a la word2vec, as well as a command-line interface (./bin/search-vec.py) for querying the semantic index. Write a graphical interface for querying the semantic index.

Sample data

In order for you to do your good work, you will need some Distant Reader output. Here are pointers to some such stuff:

Helpful hint

With the exception of only a few files (./etc/reader.db, ./etc/reader.vec, and ./cache/*), all of the files in the Distant Reader’s output are plain text files. More specifically, they are either unstructured data files or delimited files. Despite any file’s extension, the vast majority of the files can be read with your favorite text editor, spreadsheet, or database application. To read the database file (./etc/reader.db), you will need an SQLite application. The files in the adr, bib, ent, pos, urls, or wrd directories are all tab delimited files. A program called OpenRefine is a WONDERFUL tool for reading and analyzing tab delimited files. [9] In fact, a whole lot can be learned through the skillful use of OpenRefine against the tab delimited files.

Notes

[1] The home page of the Distant Reader is https://distantreader.org

[2] All of the code doing this processing is available on GitHub. See https://github.com/ericleasemorgan/reader

[3] This file system is affectionately known as a “study carrel”.

[4] A easy-to-use library for creating such vectors is a part of the Scikit Learn suite of software. See http://bit.ly/2F5EoxA

[5] The algorithm is described at https://en.wikipedia.org/wiki/Cosine_similarity, and a SciKit Learn module is available at http://bit.ly/2IaYcS3

[6] The name of the library is called sql.js and it is available at https://github.com/kripken/sql.js/

[7] The Levenshtein distance is described here — https://en.wikipedia.org/wiki/Levenshtein_distance, and various libraries doing the good work are outlined at http://bit.ly/2F30roM

[8] Yet another SciKit Learn module may be of use here — http://bit.ly/2F5o2oS

[9] OpenRefine eats delimited files for lunch. See http://openrefine.org