{"id":678,"date":"2015-06-08T05:55:48","date_gmt":"2015-06-08T00:55:48","guid":{"rendered":"http:\/\/blogs.nd.edu\/emorgan\/?p=678"},"modified":"2015-06-08T05:55:48","modified_gmt":"2015-06-08T00:55:48","slug":"eebo","status":"publish","type":"post","link":"https:\/\/sites.nd.edu\/emorgan\/2015\/06\/eebo\/","title":{"rendered":"Developments with EEBO"},"content":{"rendered":"<p>\nHere are some developments from playing with my EEBO (Early English Books Online) data.\n<\/p>\n<p>\n<img height='300' align='right' src='http:\/\/blogs.nd.edu\/emorgan\/files\/2015\/06\/gentle-reader.jpg' \/>I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a \u201ccatalog\u201d (index). [3, 4] Along the way I calculated the number of words in each document and saved that as a field of each &#8220;record&#8221;. Because the catalog is a tab-delimited file, it is trivial to import into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection. I then used grep to search my catalog, and saved the results to a file. [5] I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.) I then transformed the search results into a browsable HTML table. [13] The table has hidden features. (Can you say, \u201cUsability?\u201d) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn\u2019t really tell me anything about length.) There is also a hidden link to the left of each record. 
Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.\n<\/p>\n<p>\nFor a good time, I then repeated the process for things Shakespeare and things astronomy. [14, 15] Baxter took me about twelve hours\u2019 worth of work, not counting the caching of the data. Combined, Shakespeare and astronomy took me less than five minutes. I then got tired.\n<\/p>\n<p>\nMy next steps are multi-faceted and presented in the following incomplete unordered list:\n<\/p>\n<ul>\n<li><strong>create browsable lists<\/strong> &#8211; The TEI metadata is clean and consistent. The authors and subjects lend themselves very well to the creation of browsable lists.<\/li>\n<li><strong>CGI interface<\/strong> &#8211; The ability to search via a Web interface is imperative, and indexing is a prerequisite.<\/li>\n<li><strong>transform into HTML<\/strong> &#8211; TEI\/XML is cool, but\u2026<\/li>\n<li><strong>create sets<\/strong> &#8211; The collection as a whole is very interesting, but many scholars will want subsets of the collection. I will do this sort of work, akin to my work with the HathiTrust. [16]<\/li>\n<li><strong>do text analysis<\/strong> &#8211; This is really the whole point. Given the full text combined with the inherent functionality of a computer, additional analysis and interpretation can be done against the corpus or its subsets. This analysis can be based on the counting of words, the association of themes, parts-of-speech, etc. For example, I plan to give each item in the collection a colors, \u201cbig\u201d names, and \u201cgreat\u201d ideas coefficient. These are scores denoting the use of researcher-defined \u201cthemes\u201d. [17, 18, 19] You can see how these themes play out against the complete writings of \u201cDead White Men With Three Names\u201d. 
[20, 21, 22]<\/li>\n<\/ul>\n<p>\nFun with TEI\/XML, text mining, and the definition of librarianship.\n<\/p>\n<ol>\n<li>Box &#8211; <a href=\"http:\/\/bit.ly\/1QcvxLP\">http:\/\/bit.ly\/1QcvxLP<\/a><\/li>\n<li>mirror &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/xml\/\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/xml\/<\/a><\/li>\n<li>xpath script &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/bin\/xml2tab.pl\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/bin\/xml2tab.pl<\/a><\/li>\n<li>catalog (index) &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/catalog.txt\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/catalog.txt<\/a><\/li>\n<li>search results &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/baxter.txt\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/baxter.txt<\/a><\/li>\n<li>Baxter at VIAF &#8211; <a href=\"http:\/\/viaf.org\/viaf\/54178741\">http:\/\/viaf.org\/viaf\/54178741<\/a><\/li>\n<li>Baxter at WorldCat &#8211; <a href=\"http:\/\/www.worldcat.org\/wcidentities\/lccn-n50-5510\">http:\/\/www.worldcat.org\/wcidentities\/lccn-n50-5510<\/a><\/li>\n<li>Baxter at Wikipedia &#8211; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Richard_Baxter\">http:\/\/en.wikipedia.org\/wiki\/Richard_Baxter<\/a><\/li>\n<li>box plot of dates &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/boxplot-dates.png\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/boxplot-dates.png<\/a><\/li>\n<li>box plot of words &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/boxplot-words.png\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/boxplot-words.png<\/a><\/li>\n<li>histogram of dates &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/histogram-dates.png\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/histogram-dates.png<\/a><\/li>\n<li>histogram of words &#8211; <a 
href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/histogram-words.png\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/histogram-words.png<\/a><\/li>\n<li>HTML &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/baxter.html\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/baxter\/baxter.html<\/a><\/li>\n<li>Shakespeare &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/shakespeare\/\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/shakespeare\/<\/a><\/li>\n<li>astronomy &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/astronomy\/\">http:\/\/dh.crc.nd.edu\/sandbox\/eebo-tcp\/astronomy\/<\/a><\/li>\n<li>HathiTrust work &#8211; <a href=\"http:\/\/blogs.nd.edu\/emorgan\/2015\/06\/browser-on-github\/\">http:\/\/blogs.nd.edu\/emorgan\/2015\/06\/browser-on-github\/<\/a><\/li>\n<li>colors &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-colors.txt\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-colors.txt<\/a><\/li>\n<li>\u201cbig\u201d names &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-names.txt\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-names.txt<\/a><\/li>\n<li>\u201cgreat\u201d ideas &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-ideas.txt\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/etc\/theme-ideas.txt<\/a><\/li>\n<li>Thoreau &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/thoreau\/about.html\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/thoreau\/about.html<\/a><\/li>\n<li>Emerson &#8211; <a href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/emerson\/about.html\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/emerson\/about.html<\/a><\/li>\n<li>Channing &#8211; <a 
href=\"http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/channing\/about.html\">http:\/\/dh.crc.nd.edu\/sandbox\/htrc-workset-browser\/channing\/about.html<\/a><\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>Here are some developments from playing with my EEBO (Early English Books Online) data. I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a \u201ccatalog\u201d (index). Along the way I calculated the [&hellip;]<\/p>\n","protected":false},"author":92,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-678","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/678","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/users\/92"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/comments?post=678"}],"version-history":[{"count":5,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/678\/revisions"}],"predecessor-version":[{"id":685,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/678\/revisions\/685"}],"wp:attachment":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/media?parent=678"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/categories?post=678"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/tags?post=678"}],"curies":[{"name":"wp","href":"https:\/\/
api.w.org\/{rel}","templated":true}]}}