Developments with EEBO

Here are some developments from playing with my EEBO (Early English Books Online) data.

I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). [3, 4] Along the way I calculated the number of words in each document and saved that as a field of each “record”. Because the catalog is a tab-delimited file, it is trivial to import it into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection.

I then used grep to search my catalog and saved the results to a file. [5] I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.)

I then transformed the search results into a browsable HTML table. [13] The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
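To make the catalog-and-search steps concrete, here is a minimal sketch in Python. It is not the Perl script linked in the notes, just an illustration of the same idea. The two sample records, their identifiers, the choice of fields, and the function names are all assumptions made for the example. Real EEBO-TCP files are much larger and may declare an XML namespace, which the XPath expressions would then need to account for.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI-like records (assumed for illustration).
SAMPLES = {
    "A00001": (
        "<TEI><teiHeader><fileDesc>"
        "<titleStmt><title>A call to the unconverted</title>"
        "<author>Baxter, Richard, 1615-1691.</author></titleStmt>"
        "<publicationStmt><date>1658</date></publicationStmt>"
        "</fileDesc></teiHeader>"
        "<text><body><p>Turn ye, turn ye, why will ye die?</p></body></text></TEI>"
    ),
    "A00002": (
        "<TEI><teiHeader><fileDesc>"
        "<titleStmt><title>Hamlet</title>"
        "<author>Shakespeare, William, 1564-1616.</author></titleStmt>"
        "<publicationStmt><date>1603</date></publicationStmt>"
        "</fileDesc></teiHeader>"
        "<text><body><p>To be, or not to be</p></body></text></TEI>"
    ),
}


def xml_to_record(identifier, xml):
    """Extract rudimentary metadata with XPath and count the words in the body."""
    root = ET.fromstring(xml)
    author = root.findtext(".//titleStmt/author", default="").strip()
    title = root.findtext(".//titleStmt/title", default="").strip()
    date = root.findtext(".//publicationStmt/date", default="").strip()
    body = root.find(".//text/body")
    text = " ".join(body.itertext()) if body is not None else ""
    words = len(re.findall(r"\w+", text))
    return [identifier, author, title, date, str(words)]


def build_catalog(samples):
    """Return the whole catalog as one tab-delimited string with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "author", "title", "date", "words"])
    for identifier, xml in sorted(samples.items()):
        writer.writerow(xml_to_record(identifier, xml))
    return out.getvalue()


def search(catalog, pattern):
    """grep-like search: keep every catalog line matching the pattern, ignoring case."""
    return [line for line in catalog.splitlines()
            if re.search(pattern, line, re.IGNORECASE)]


catalog = build_catalog(SAMPLES)
hits = search(catalog, "baxter")
for line in hits:
    print(line)
```

Because each record is a single tab-delimited line, the search step can stay as dumb as grep, and the result is itself a small catalog that can be sorted or graphed by its date and word-count columns.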

For a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours’ worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.

My next steps are multi-faceted and presented in the following incomplete unordered list:

  • create browsable lists – The TEI metadata is clean and consistent. The authors and subjects lend themselves very well to the creation of browsable lists.
  • CGI interface – The ability to search via Web interface is imperative, and indexing is a prerequisite.
  • transform into HTML – TEI/XML is cool, but…
  • create sets – The collection as a whole is very interesting, but many scholars will want sub-sets of the collection. I will do this sort of work, akin to my work with the HathiTrust. [16]
  • do text analysis – This is really the whole point. Given the full text combined with the inherent functionality of a computer, additional analysis and interpretation can be done against the corpus or its subsets. This analysis can be based on the counting of words, the association of themes, parts-of-speech, etc. For example, I plan to give each item in the collection a colors, “big” names, and “great” ideas coefficient. These are scores denoting the use of researcher-defined “themes”. [17, 18, 19] You can see how these themes play out against the complete writings of “Dead White Men With Three Names”. [20, 21, 22]
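
As a rough illustration of the theme coefficients mentioned in the last item, here is a minimal sketch in Python. The formula is an assumption: it scores a text as the share of its words that appear in a theme's word list, which may differ from how the scores are actually computed. The short color list and the function name are hypothetical stand-ins; the real lists are the theme files linked in the notes.

```python
import re

# Hypothetical, abbreviated theme lexicon (assumed for illustration).
COLORS = {"red", "green", "blue", "white", "black", "yellow"}


def theme_coefficient(text, lexicon):
    """Share of the words in text that belong to the lexicon (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in lexicon)
    return hits / len(words)


sample = "The white rose and the red rose grew by the green hedge"
score = theme_coefficient(sample, COLORS)
print(score)
```

Dividing by the total number of words keeps long and short documents comparable, so the score could sit in the catalog as one more sortable numeric field next to the word count.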

Fun with TEI/XML, text mining, and the definition of librarianship.

  1. Box – http://bit.ly/1QcvxLP
  2. mirror – http://dh.crc.nd.edu/sandbox/eebo-tcp/xml/
  3. xpath script – http://dh.crc.nd.edu/sandbox/eebo-tcp/bin/xml2tab.pl
  4. catalog (index) – http://dh.crc.nd.edu/sandbox/eebo-tcp/catalog.txt
  5. search results – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.txt
  6. Baxter at VIAF – http://viaf.org/viaf/54178741
  7. Baxter at WorldCat – http://www.worldcat.org/wcidentities/lccn-n50-5510
  8. Baxter at Wikipedia – http://en.wikipedia.org/wiki/Richard_Baxter
  9. box plot of dates – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-dates.png
  10. box plot of words – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/boxplot-words.png
  11. histogram of dates – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-dates.png
  12. histogram of words – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/histogram-words.png
  13. HTML – http://dh.crc.nd.edu/sandbox/eebo-tcp/baxter/baxter.html
  14. Shakespeare – http://dh.crc.nd.edu/sandbox/eebo-tcp/shakespeare/
  15. astronomy – http://dh.crc.nd.edu/sandbox/eebo-tcp/astronomy/
  16. HathiTrust work – http://blogs.nd.edu/emorgan/2015/06/browser-on-github/
  17. colors – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-colors.txt
  18. “big” names – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-names.txt
  19. “great” ideas – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/etc/theme-ideas.txt
  20. Thoreau – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/thoreau/about.html
  21. Emerson – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/emerson/about.html
  22. Channing – http://dh.crc.nd.edu/sandbox/htrc-workset-browser/channing/about.html
