Freebo@ND
Posted on July 24, 2017 in Freebo@ND by Eric Lease Morgan
This is the initial blog posting introducing a fledgling website called Freebo@ND — a collection of early English print materials and services provided against them. [1]
For the past year a number of us here in the Hesburgh Libraries at the University of Notre Dame have been working on a grant-sponsored project with others from Northwestern University and Washington University in St. Louis. Collectively, we have been calling our efforts the Early English Print Project, and our goal is to improve on the good work done by the Text Creation Partnership (TCP). [2]
“What is the TCP?” Briefly stated, the TCP is/was an organization that set out to make freely available the content of Early English Books Online (EEBO). The desire is/was to create & distribute thoroughly & accurately marked-up (TEI) transcriptions of early English books printed between 1460 and 1699. Over time the scope of the TCP project seemed to wax & wane, and I’m still not really sure how many texts are in scope or where they can all be found. But I do know the texts are being distributed in two phases. Phase I texts are freely available to anybody. [3] Phase II texts are only available to institutions that sponsored the Partnership, but they too will be freely available to everybody in a few years.
Our goals — the goals of the Early English Print Project — are to:
- improve the accuracy of the TCP transcriptions (reduce the number of “dot” words)
- associate page images (scans/facsimiles) with the TCP transcriptions
- provide useful services against the transcriptions for the purposes of distant reading
While I have had my hand in the first two tasks, much of my time has been spent on the third. To this end I have been engineering ways to collect, organize, archive, disseminate, and evaluate our Project’s output. To date, the local collection includes approximately 15,000 transcriptions and 60,000,000 words. When the whole thing is said & done, they tell me I will have close to 60,000 transcriptions and 2,000,000,000 words. Consequently, this is by far the biggest collection I’ve ever curated.
My desire is to make sure Freebo@ND goes beyond “find & get” and towards “use & understanding”. [4] My goal is to provide services against the texts, not just the texts themselves. Locally collecting & archiving the original transcriptions has been relatively trivial. [5] After extracting the bibliographic data from each transcription, and after transforming the transcriptions into plain text, implementing full text searching has been easy. [6] Search even comes with faceted browse.
To support “use & understanding” I’m beginning to provide services against the texts. For example, it is possible to download — in a computer-readable format — all the words from a given text, where each word is characterized by its part-of-speech, lemma, given form, normalized form, and position in the text. Using this output, it is more than possible for students or researchers to compare & contrast the use of words & types of words across texts. Because the texts are described in both bibliographic as well as numeric terms, it is possible to sort search results by date, page length, or word count. [7] Additional numeric characteristics are being implemented.
The use of “log-likelihood ratios” is a simple and effective way to compare the use of words in a given text with the use of words in an entire corpus. Such has been implemented in Freebo@ND using a set of words called the “great ideas”. [8] There is also a way to create one’s own sub-collection for analysis, but the functionality is still meager. [9]
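For the curious, a log-likelihood ratio (Dunning’s G2) compares the frequency of a word in a single text against its frequency in the corpus as a whole. The sketch below is not the Freebo@ND code itself; it merely illustrates the idea in Python, and it assumes the word-level download is a tab-delimited file with a “lemma” column (among the others named above) as well as a pre-computed table of corpus-wide lemma counts. The file names and column names are assumptions made for the sake of illustration.

```
#!/usr/bin/env python

# a minimal sketch, not the Freebo@ND implementation; it reads an assumed
# tab-delimited word list for one text, tallies lemmas, and scores each
# lemma against assumed corpus-wide counts using Dunning's log-likelihood

import csv
import math
from collections import Counter

def read_words(filename):
    """Read a tab-delimited word list; one row per token."""
    with open(filename, newline='', encoding='utf-8') as handle:
        return list(csv.DictReader(handle, delimiter='\t'))

def log_likelihood(a, b, c, d):
    """Dunning's G2: a and b are word counts, c and d are corpus sizes."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# count the lemmas in one text; the column name "lemma" is an assumption
text = read_words('A00001.tsv')
text_counts = Counter(record['lemma'] for record in text)
text_size = sum(text_counts.values())

# corpus-wide lemma counts, assumed to be pre-computed as lemma<TAB>count
corpus_counts = {}
with open('corpus-lemmas.tsv', encoding='utf-8') as handle:
    for line in handle:
        lemma, count = line.rstrip('\n').split('\t')
        corpus_counts[lemma] = int(count)
corpus_size = sum(corpus_counts.values())

# score each lemma in the text against the corpus and list the top ten
scores = {}
for lemma, a in text_counts.items():
    scores[lemma] = log_likelihood(a, corpus_counts.get(lemma, 0), text_size, corpus_size)

for lemma, score in sorted(scores.items(), key=lambda item: item[1], reverse=True)[:10]:
    print(f'{lemma}\t{score:.2f}')
```

Run against a single transcription’s word list, something like this would print the ten lemmas whose use differs most from the corpus at large, which is the same sort of comparison the “great ideas” service makes.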
I have had to learn a lot to get this far, and I have had to use a myriad of technologies. Some of these things include: getting along sans a fully normalized database, parallel processing & cluster computing, “map & reduce”, responsive Web page design, etc. This being the initial blog posting documenting the whys & wherefores of Freebo@ND, more postings ought to be forthcoming; I hope to document here more thoroughly my part in our Project. Thank you for listening.
Links
[1] Freebo@ND – http://cds.crc.nd.edu/
[2] Text Creation Partnership (TCP) – http://www.textcreationpartnership.org
[3] The Phase I TCP texts are “best” gotten from GitHub – https://github.com/textcreationpartnership
[4] use & understanding – http://infomotions.com/blog/2011/09/dpla/
[5] local collection & archive – http://cds.crc.nd.edu/freebo/
[6] search – http://cds.crc.nd.edu/cgi-bin/search.cgi
[7] tabled search results – http://cds.crc.nd.edu/cgi-bin/did2catalog.cgi
[8] log-likelihood ratios – http://cds.crc.nd.edu/cgi-bin/likelihood.cgi
[9] sub-collections – http://cds.crc.nd.edu/cgi-bin/request-collection.cgi