EEBO-TCP Workset Browser

Posted on June 11, 2015 in Uncategorized by Eric Lease Morgan

I have begun creating a “browser” against content from EEBO-TCP in the same way I have created a browser against worksets from the HathiTrust. The goal is to provide “distant reading” services against subsets of Early English poetry and prose. You can see these fledgling efforts against a complete set of Richard Baxter’s works. Baxter was an English Puritan church leader, poet, and hymn-writer. [1, 2, 3]

EEBO is an acronym for Early English Books Online. It is intended to be a complete collection of English literature from 1475 through 1700. TCP is an acronym for Text Creation Partnership, a consortium of libraries dedicated to making EEBO freely available as TEI (Text Encoding Initiative) XML. [4, 5]

The EEBO-TCP initiative is releasing its content in stages. The content of Stage I is available from a number of (rather hidden) venues. I found the content on a University of Michigan Box site to be the easiest to use, albeit not necessarily the most current. [6] Once the content is cached — in the fullest of TEI glory — it is possible to search and browse the collection. I created a local, terminal-only interface to the cache and was able to exploit authority lists, controlled vocabularies, and free-text searching of metadata to create subsets of the cache. [7] The subsets are akin to HathiTrust “worksets” — items of particular interest to me.

Once a subset was identified, I was able to mirror (against myself) the necessary XML files and begin to do deeper analysis. For example, I am able to create a dictionary of all the words in the “workset” and tabulate their frequencies. Baxter used the word “god” more than any other, specifically, 65,230 times. [8] I am able to pull out sets of unique words, and I am able to count how many times Baxter used words from three sets of locally defined “lexicons” of colors, “big” names, and “great” ideas. Furthermore, I am able to chart and graph trends of the works, such as when they were written and how they cluster together in terms of word usage or lexicons. [9, 10]
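Creating such a dictionary is not complicated. Below is a minimal Python sketch that tabulates word frequencies across a directory of TEI/XML files; it is not the Browser’s actual code, and the ./xml directory name and the crude tokenization are illustrative assumptions only.

#!/usr/bin/env python

# frequencies.py - a hypothetical sketch: tabulate word frequencies across a
# directory of TEI/XML files; not the Browser's actual code, and the ./xml
# directory and the crude tokenization are assumptions

import glob
import re
import xml.etree.ElementTree as ET
from collections import Counter

frequencies = Counter()

# loop through each TEI file, extract its text, and count the words
for file in glob.glob( './xml/*.xml' ):
  text  = ' '.join( ET.parse( file ).getroot().itertext() )
  words = re.findall( r'[a-z]+', text.lower() )
  frequencies.update( words )

# output the dictionary, most frequent words first
for word, count in frequencies.most_common():
  print( word, count )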

I was then able to repeat the process for other subsets, items about: lutes, astronomy, Unitarians, and of course, Shakespeare. [11, 12, 13, 14]

The EEBO-TCP Workset Browser is not as mature as my HathiTrust Workset Browser, but it is coming along. [15] Next steps include: calculating an integer denoting the number of pages in an item, implementing a Web-based search interface to a subset’s full text as well as metadata, putting the source code (written in Python and Bash) on GitHub. After that I need to: identify more robust ways to create subsets from the whole of EEBO, provide links to the raw TEI/XML as well as HTML versions of items, implement quite a number of cosmetic enhancements, and most importantly, support the means to compare & contrast items of interest in each subset. Wish me luck?

More fun with well-structured data, open access content, and the definition of librarianship.

  1. Richard Baxter (the person) – http://en.wikipedia.org/wiki/Richard_Baxter
  2. Richard Baxter (works) – http://bit.ly/ebbo-browser-baxter-works
  3. Richard Baxter (analysis of works) – http://bit.ly/eebo-browser-baxter-analysis
  4. EEBO-TCP – http://www.textcreationpartnership.org/tcp-eebo/
  5. TEI – http://www.tei-c.org/
  6. University of Michigan Box site – http://bit.ly/1QcvxLP
  7. local cache of EEBO-TCP – http://bit.ly/eebo-cache
  8. dictionary of all Baxter words – http://bit.ly/eebo-browser-baxter-dictionary
  9. histogram of dates – http://bit.ly/eebo-browser-baxter-dates
  10. clusters of “great” ideas – http://bit.ly/eebo-browser-baxter-cluster
  11. lute – http://bit.ly/eebo-browser-lute
  12. astronomy – http://bit.ly/eebo-browser-astronomy
  13. Unitarians – http://bit.ly/eebo-browser-unitarian
  14. Shakespeare – http://bit.ly/eebo-browser-shakespeare
  15. HathiTrust Workset Browser – https://github.com/ericleasemorgan/HTRC-Workset-Browser

Developments with EEBO

Posted on June 8, 2015 in Uncategorized by Eric Lease Morgan

Here are some developments from my playing with EEBO (Early English Books Online) data.

I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). Along the way I calculated the number of words in each document and saved that as a field of each “record”. Being a tab-delimited file, the catalog is trivial to import into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection.

I then used grep to search my catalog and save the results to a file. I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.)

I then transformed the search results into a browsable HTML table. The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
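For the curious, the cataloging step can be approximated with a few lines of Python and lxml, as below. This is not the Perl script (xml2tab.pl) linked in the list of links, and the XPath expressions and output columns are assumptions about where the metadata lives.

#!/usr/bin/env python

# catalog.py - a hypothetical sketch: loop through TEI files, extract rudimentary
# metadata with XPath, and output a tab-delimited "catalog"; this is not the
# author's xml2tab.pl, and the XPath expressions below are assumptions

import glob
from lxml import etree

# namespace-agnostic XPath; real TEI headers may require more care
TITLE  = 'string(//*[local-name()="titleStmt"]/*[local-name()="title"][1])'
AUTHOR = 'string(//*[local-name()="titleStmt"]/*[local-name()="author"][1])'
DATE   = 'string(//*[local-name()="sourceDesc"]//*[local-name()="date"][1])'

print( '\t'.join( [ 'file', 'author', 'title', 'date', 'words' ] ) )
for file in sorted( glob.glob( './xml/*.xml' ) ):
  tree   = etree.parse( file )
  author = ' '.join( tree.xpath( AUTHOR ).split() )
  title  = ' '.join( tree.xpath( TITLE ).split() )
  date   = ' '.join( tree.xpath( DATE ).split() )
  words  = len( ' '.join( tree.xpath( '//text()' ) ).split() )
  print( '\t'.join( [ file, author, title, date, str( words ) ] ) )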

For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours’ worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.

My next steps are multi-faceted and presented in the following incomplete unordered list:

  • create browsable lists – The TEI metadata is clean and consistent. The authors and subjects lend themselves very well to the creation of browsable lists.
  • CGI interface – The ability to search via Web interface is imperative, and indexing is a prerequisite.
  • transform into HTML – TEI/XML is cool, but…
  • create sets – The collection as a whole is very interesting, but many scholars will want sub-sets of the collection. I will do this sort of work, akin to my work with the HathiTrust. [16]
  • do text analysis – This is really the whole point. Given the full text combined with the inherent functionality of a computer, additional analysis and interpretation can be done against the corpus or its subsets. This analysis can be based on the counting of words, the association of themes, parts-of-speech, etc. For example, I plan to give each item in the collection a colors, “big” names, and “great” ideas coefficient. These are scores denoting the use of researcher-defined “themes”; one possible calculation is sketched after this list. [17, 18, 19] You can see how these themes play out against the complete writings of “Dead White Men With Three Names”. [20, 21, 22]
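As promised above, here is a minimal sketch of how such a coefficient might be computed. It assumes a theme is a plain text file listing words, one per line, and that the score is simply theme-word occurrences per 1,000 words; this is an illustration only, not necessarily how the coefficients will actually be calculated.

#!/usr/bin/env python

# coefficient.py - a hypothetical sketch: score a plain text file against a
# "theme" (a plain text list of words); the per-1,000-words normalization is an
# assumption, not necessarily the eventual calculation
#
# usage: ./coefficient.py <theme> <file>

import re
import sys

# read the theme (lexicon) and the text to be scored
theme = set( open( sys.argv[ 1 ] ).read().lower().split() )
words = re.findall( r'[a-z]+', open( sys.argv[ 2 ] ).read().lower() )

# count theme words and normalize by the length of the text
hits        = sum( 1 for word in words if word in theme )
coefficient = 1000.0 * hits / len( words )

print( round( coefficient, 2 ) )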

Fun with TEI/XML, text mining, and the definition of librarianship.

  1. Box – http://bit.ly/1QcvxLP
  2. mirror – http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
  3. xpath script – http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
  4. catalog (index) – http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
  5. search results – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
  6. Baxter at VIAF – http://viaf.org/viaf/54178741
  7. Baxter at WorldCat – http://www.worldcat.org/wcidentities/lccn-n50-5510
  8. Baxter at Wikipedia – http://en.wikipedia.org/wiki/Richard_Baxter
  9. box plot of dates – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
  10. box plot of words – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
  11. histogram of dates – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
  12. histogram of words – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
  13. HTML – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
  14. Shakespeare – http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
  15. astronomy – http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
  16. HathiTrust work – http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
  17. colors – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
  18. “big” names – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
  19. “great” ideas – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
  20. Thoreau – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
  21. Emerson – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
  22. Channing – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html

Boxplots, histograms, and scatter plots. Oh, my!

Posted on June 5, 2015 in Uncategorized by Eric Lease Morgan

I have started adding visualizations literally illustrating the characteristics of the various “catalogs” generated by the HathiTrust Workset Browser. These graphics (box plots, histograms, and scatter plots) make it easier to see what is in the catalog and the features of the items it contains.

[Images: a box plot, a histogram, and a scatter plot]

For example, read the “about page” reporting on the complete works of Henry David Thoreau. For more detail, see the “home page” on GitHub.
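For readers who want to reproduce something similar, a graphic like the histogram of dates can be generated in just a few lines. The sketch below uses Python and matplotlib, assumes a tab-delimited catalog whose fourth column holds a year, and is illustrative only — not necessarily how the Browser draws its graphics.

#!/usr/bin/env python

# histogram.py - a hypothetical sketch: plot a histogram of dates found in a
# tab-delimited catalog; the file name, column position, and use of matplotlib
# are assumptions, not necessarily what the Browser does

import csv
import matplotlib.pyplot as plt

# read the catalog; assume the fourth column (index 3) holds a year
with open( './catalog.txt' ) as handle:
  rows = list( csv.reader( handle, delimiter = '\t' ) )
dates = [ int( row[ 3 ] ) for row in rows if len( row ) > 3 and row[ 3 ].isdigit() ]

# draw and save the histogram
plt.hist( dates )
plt.title( 'histogram of dates' )
plt.savefig( './histogram-dates.png' )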

HathiTrust Workset Browser on GitHub

Posted on June 3, 2015 in Uncategorized by Eric Lease Morgan

I have put my (fledgling) HathiTrust Workset Browser on GitHub. Try:

https://github.com/ericleasemorgan/HTRC-Workset-Browser

The Browser is a tool for doing “distant reading” against HathiTrust “worksets”. Given a workset rsync file, it will cache the workset’s content locally, index it, create some reports against the content, and provide the means to search/browse the collection. It should run out of the box on Linux and Macintosh computers. It requires the bash shell and Python, which come for free on these operating systems. Some sample content is available at:

http://bit.ly/browser-thoreau-about

Developing code with and through GitHub is interesting. I’m learning.

HathiTrust Research Center Workset Browser

Posted on May 26, 2015 in Uncategorized by Eric Lease Morgan

In my copious spare time I have hacked together a thing I’m calling the HathiTrust Research Center Workset Browser, a (fledgling) tool for doing “distant reading” against corpora from the HathiTrust. [1]

The idea is to: 1) create, refine, or identify a HathiTrust Research Center workset of interest — your corpus, 2) feed the workset’s rsync file to the Browser, 3) have the Browser download, index, and analyze the corpus, and 4) enable the reader to search, browse, and interact with the result of the analysis. With varying success, I have done this with a number of worksets on topics ranging from literature and philosophy to Rome and cookery. The best working examples are the ones from Thoreau and Austen. [2, 3] The others are still buggy.

As a further example, the Browser can/will create reports describing the corpus as a whole. This analysis includes the size of a corpus measured in pages as well as words, date ranges, word frequencies, and selected items of interest based on pre-set “themes” — usage of color words, names of “great” authors, and a set of timeless ideas. [4] This report is based on more fundamental reports such as frequency tables, a “catalog”, and lists of unique words. [5, 6, 7, 8]
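To illustrate how a summary report can be layered on top of the more fundamental ones, here is a minimal Python sketch that computes a few “about”-style figures from a frequency table; the file names and tab-delimited format are assumptions, not the Browser’s actual reports.

#!/usr/bin/env python

# about.py - a hypothetical sketch: summarize a corpus from a frequency table
# (word <tab> count, one pair per line) and a list of color words; the file
# names and formats are assumptions, not the Browser's actual reports

# read the frequency table
frequencies = {}
for line in open( './dictionary.txt' ):
  word, count = line.rstrip().split( '\t' )
  frequencies[ word ] = int( count )

# read a "theme" of color words, one word per line
colors = set( open( './theme-colors.txt' ).read().split() )

# report
print( 'size of corpus (words):', sum( frequencies.values() ) )
print( 'unique words:          ', len( frequencies ) )
print( 'usage of color words:  ', sum( frequencies.get( color, 0 ) for color in colors ) )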

[Image: a sample “catalog”]

The whole thing is written in a combination of shell and Python scripts. It should run on just about any out-of-the-box Linux or Macintosh computer. Take a look at the code. [9] No special libraries needed. (“Famous last words.”) In its current state, it is very Unix-y. Everything is done from the command line. Lots of plain text files and the exploitation of STDIN and STDOUT. Like a Renaissance cartoon, the Browser, in its current state, is only a sketch. Only later will a more full-bodied, Web-based interface be created.

The next steps are numerous and listed in no priority order: putting the whole thing on GitHub, outputting the reports in generic formats so other things can easily read them, improving the terminal-based search interface, implementing a Web-based search interface, writing advanced programs in R that chart and graph the analysis, providing a means for comparing & contrasting two or more items from a corpus, indexing the corpus with a (real) indexer such as Solr, writing a “cookbook” describing how to use the browser to do “kewl” things, making the metadata of corpora available as Linked Data, etc.

‘Want to give it a try? For a limited period of time, go to the HathiTrust Research Center Portal, create (refine or identify) a collection of personal interest, use the Algorithms tool to export the collection’s rsync file, and send the file to me. I will feed the rsync file to the Browser, and then send you the URL pointing to the results. [10] Let’s see what happens.

Fun with public domain content, text mining, and the definition of librarianship.

Links

  1. HTRC Workset Browser – http://bit.ly/workset-browser
  2. Thoreau – http://bit.ly/browser-thoreau
  3. Austen – http://bit.ly/browser-austen
  4. Thoreau report – http://bit.ly/browser-thoreau-about
  5. Thoreau dictionary (frequency list) – http://bit.ly/thoreau-dictionary
  6. usage of color words in Thoreau – http://bit.ly/thoreau-colors
  7. unique words in the corpus – http://bit.ly/thoreau-unique
  8. Thoreau “catalog” – http://bit.ly/thoreau-catalog
  9. source code – http://ntrda.me/1Q8pPoI
  10. HathiTrust Research Center Portal – https://sharc.hathitrust.org

Text files

Posted on March 11, 2015 in Uncategorized by Eric Lease Morgan

While a rose is a rose is a rose, a text file is not a text file is not a text file.

For better or for worse, we here in our text analysis workshop are dealing with three different computer operating systems: Windows, Macintosh, and Linux. Text mining requires the subject of its analysis to be in the form of plain text files. [1] But there is a subtle difference between the ways each of our operating systems expect to deal with “lines” in that text. Let me explain.

Imagine a classic typewriter. A cylinder (called a “platen”) fit into a “carriage” designed to move back & forth across a box while “keys” were slapped against a piece of inked ribbon, ultimately imprinting a character on a piece of paper rolled around the platen. As each key was pressed the carriage moved a tiny bit from right to left. When the carriage got to the left-most position, the operator was expected to manually move it back to the right-most position and continue typing. This movement was really two movements in one. First, the carriage was “returned” to the right-most position, and second, the platen was rolled one line up. (The paper was “fed” around the platen by one line.) If one or the other of these two movements were not performed, then the typing would either run off the right-hand side of the paper, or the letters would be imprinted on top of the previously typed characters. These two movements are called “carriage returns” and “line feeds”, respectively.

Enter computers. Digital representations of characters were saved to files. These files were then sent to printers, but there was no person there to manually move the carriage from left to right nor to roll the paper further into the printer. Instead, invisible characters were created. There are many invisible characters, and the two of most interest to us are the carriage return (ASCII character 13) and the line feed (sometimes called “new line”, ASCII character 10). [2] When a printer received these characters, it moved its carriage and paper accordingly.

Enter our operating systems. For better or for worse, traditionally each of our operating systems treat the definition of lines differently:

  • in a traditional Macintosh file lines are delimited by a single carriage return (ASCII 13)
  • on Unix/Linux lines are delimited by line feeds (ASCII 10)
  • Windows computers expect lines to be delimited by a combination of both (ASCII 13 and ASCII 10)

Go figure?

Macintosh is much more like Unix nowadays, so most Macintosh text files use the Unix convention.

Windows folks, remember how your text files looked funny when initially displayed? This is because the original text files contained only ASCII 10 and not ASCII 13. Notepad, your default text editor, expects both characters before it breaks a line, so the bare line feeds were not honored and consequently everything looked funny. Years ago, if a Macintosh computer read a Unix/Linux text file, then all the letters would be displayed on top of each other, even messier.

If you create a text file on your Windows or (older) Macintosh computer, and then use that file as input to another program (e.g., wget -i ./urls.txt), the operation may fail because the program may not know how a line is denoted in the input.
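If an errant file does trip up a program, the line endings can also be normalized programmatically. Here is a minimal Python sketch converting any mixture of endings to Unix-style line feeds; the input and output file names are placeholders.

#!/usr/bin/env python

# unix-endings.py - a hypothetical sketch: normalize line endings to Unix-style
# line feeds (ASCII 10); the input and output file names are placeholders

# read the file as raw bytes so no conversion happens behind our backs
with open( 'input.txt', 'rb' ) as handle:
  data = handle.read()

# replace Windows endings (CR+LF) first, then any remaining Macintosh endings (CR)
data = data.replace( b'\r\n', b'\n' ).replace( b'\r', b'\n' )

# write the result
with open( 'output.txt', 'wb' ) as handle:
  handle.write( data )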

Confused yet? In any event, text files are not text files are not text files. And the solution to this problem is to use a full-featured text editor — the subject of another essay.

[1] plain text files explained – http://en.wikipedia.org/wiki/Plain_text
[2] intro’ to ASCII – http://www.theasciicode.com.ar

Hands-on text analysis workshop

Posted on January 9, 2015 in Uncategorized by Eric Lease Morgan

I have all but finished writing a hands-on text analysis workshop. From the syllabus:

The purpose of this 5-week workshop is to increase the knowledge of text mining principles among participants. By the end of the workshop, students will be able to describe the range of basic text mining techniques (everything from the creation of a corpus, to the counting/tabulating of words, to classification & clustering, and visualizing the results of text analysis) and have garnered hands-on experience with all of them. All the materials for this workshop are available online. There are no prerequisites except for two things: 1) a sincere willingness to learn, and 2) a willingness to work at a computer’s command line interface. Students are really encouraged to bring their own computers to class.

The workshop is divided into the following five, 90-minute sessions, one per week:

  1. Overview of text mining and working from the command line
  2. Building a corpus
  3. Word and phrase frequencies
  4. Extracting meaning with dictionaries, parts-of-speech analysis, and named entity recognition
  5. Classification and topic modeling

For better or for worse, the workshop’s computing environment will be the Linux command line. Besides the usual command-line suspects, participants will get their hands dirty with wget, tika, a bit of Perl, a lot of Python, WordNet, TreeTagger, Stanford’s Named Entity Recognizer, and Mallet.

For more detail, see the syllabus, sample code, and corpus.

distance.cgi – My first Python-based CGI script

Posted on January 9, 2015 in Uncategorized by Eric Lease Morgan

Yesterday I finished writing my first Python-based CGI script — distance.cgi. Given two words, it allows the reader to first disambiguate between various definitions of the words, and second, uses WordNet’s network to display various relationships (distances) between the resulting “synsets”. (Source code is here.)

[Screenshots: reader input, disambiguation, and the displayed result]

The script relies on Python’s Natural Language Toolkit (NLTK) which provides an enormous amount of functionality when it comes to natural language processing. I’m impressed. On the other hand, the script is not zippy, and I am not sure how performance can be improved. Any hints?
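For readers who want to experiment without the CGI wrapper, the underlying NLTK calls look something like the sketch below; it lists the synsets for two example words and measures a path-based similarity between the first sense of each. It is an illustration, not the source of distance.cgi.

#!/usr/bin/env python

# similarity.py - a hypothetical sketch of the NLTK (WordNet) calls behind a
# script like distance.cgi; the example words are arbitrary, and the WordNet
# data must be installed first (python -m nltk.downloader wordnet)

from nltk.corpus import wordnet

# enumerate the possible meanings (synsets) of each word, for disambiguation
for word in [ 'book', 'library' ]:
  for synset in wordnet.synsets( word ):
    print( synset.name(), '-', synset.definition() )

# measure how close the first senses are in WordNet's network (1.0 means identical)
book    = wordnet.synsets( 'book' )[ 0 ]
library = wordnet.synsets( 'library' )[ 0 ]
print( book.path_similarity( library ) )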

My second Python script, dispersion.py

Posted on November 19, 2014 in Uncategorized by Eric Lease Morgan

This is my second Python script, dispersion.py, and it illustrates where common words appear in a text.

#!/usr/bin/env python2

# dispersion.py - illustrate where common words appear in a text
#
# usage: ./dispersion.py <file>

# Eric Lease Morgan <emorgan@nd.edu>
# November 19, 2014 - my second real python script; "Thanks for the idioms, Don!"


# configure
MAXIMUM = 25
POS     = 'NN'

# require
import nltk
import operator
import sys

# sanity check
if len( sys.argv ) != 2 :
  print "Usage:", sys.argv[ 0 ], "<file>"
  quit()
  
# get input
file = sys.argv[ 1 ]

# initialize
with open( file, 'r' ) as handle : text = handle.read()
sentences = nltk.sent_tokenize( text )
pos       = {}

# process each sentence
for sentence in sentences : 
  
  # POS the sentence and then process each of the resulting words
  for word in nltk.pos_tag( nltk.word_tokenize( sentence ) ) :
    
    # check for configured POS, and increment the dictionary accordingly
    if word[ 1 ] == POS : pos[ word[ 0 ] ] = pos.get( word[ 0 ], 0 ) + 1

# sort the dictionary
pos = sorted( pos.items(), key = operator.itemgetter( 1 ), reverse = True )

# do the work; create a dispersion chart of the MAXIMUM most frequent pos words
text = nltk.Text( nltk.word_tokenize( text ) )
text.dispersion_plot( [ p[ 0 ] for p in pos[ : MAXIMUM ] ] )

# done
quit()

I used the program to analyze two works: 1) Thoreau’s Walden, and 2) Emerson’s Representative Men. From the dispersion plots displayed below, we can conclude a few things:

  • The words “man”, “life”, “day”, and “world” are common between both works.
  • Thoreau discusses water, ponds, shores, and surfaces together.
  • Emerson seemingly discussed man and nature in the same breath, but none of his core concepts are discussed as densely as Thoreau’s.
[Dispersion plot: Thoreau’s Walden]

[Dispersion plot: Emerson’s Representative Men]

Python’s Natural Language Toolkit (NLTK) is a good library for digital humanists to get started with. I have to learn more though. The jury is still out regarding which is better, Perl or Python. So far, they have more things in common than differences.

My first R script, wordcloud.r

Posted on November 10, 2014 in Uncategorized by Eric Lease Morgan

This is my first R script, wordcloud.r:

#!/usr/bin/env Rscript

# wordcloud.r - output a wordcloud from a set of files in a given directory

# Eric Lease Morgan <eric_morgan@infomotions.com>
# November 8, 2014 - my first R script!


# configure
MAXWORDS    = 100
RANDOMORDER = FALSE
ROTPER      = 0

# require
library( NLP )
library( tm )
library( methods )
library( RColorBrewer )
library( wordcloud )

# get input; needs error checking!
input <- commandArgs( trailingOnly = TRUE )
  
# create and normalize corpus
corpus <- VCorpus( DirSource( input[ 1 ] ) )
corpus <- tm_map( corpus, content_transformer( tolower ) )
corpus <- tm_map( corpus, removePunctuation )
corpus <- tm_map( corpus, removeNumbers )
corpus <- tm_map( corpus, removeWords, stopwords( "english" ) )
corpus <- tm_map( corpus, stripWhitespace )

# do the work
wordcloud( corpus, max.words = MAXWORDS, random.order = RANDOMORDER, rot.per = ROTPER )

# done
quit()

Given the path to a directory containing a set of plain text files, the script will generate a wordcloud.

Like Python, R has a library well-suited for text mining — tm. Its approach to text mining (or natural language processing) is both similar and dissimilar to Python’s. They are similar in that they both hope to provide a means for analyzing large volumes of texts. They are dissimilar in that they use different underlying data structures to get there. R might be more for the analytic person. Think statistics. Python may be more for the “literal” person, all puns intended. I will see if I can exploit the advantages of both.