Blueprint for a system surrounding Catholic social thought & human rights
Posted on August 30, 2016 in Uncategorized

This posting elaborates upon one possible blueprint for comparing & contrasting various positions in the realm of Catholic social thought and human rights.
We here in the Center For Digital Scholarship have been presented with a corpus of documents which can be broadly described as position papers on Catholic social thought and human rights. Some of these documents come from the Vatican, and some of these documents come from various governmental agencies. There is a desire by researchers & scholars to compare & contrast these documents at the paragraph level. The blueprint presented below illustrates one way (a system/flowchart) in which this desire may be addressed:
The following list enumerates the flow of the system:
- Corpus creation – The system begins on the right with sets of documents from the Vatican as well as the various governmental agencies. The system also begins with a hierarchical “controlled vocabulary” outlined by researchers & scholars in the field and designed to denote the “aboutness” of individual paragraphs in the corpus. (A sketch of one such vocabulary appears after this list.)
- Manual classification – Reading from left to right, the blueprint next illustrates how subsets of document paragraphs will be manually assigned to one or more controlled vocabulary terms. This work will be done by people familiar with the subject area as well as the documents themselves. Success in this regard is directly proportional to the volume & accuracy of the classified documents. At the very least, a few hundred paragraphs need to be consistently classified for each of the controlled vocabulary terms in order for the next step to be successful.
- Computer “training” – Because the number of paragraphs from the corpus is too large for manual classification, a process known as “machine learning” will be employed to “train” a computer program to do the work automatically. If it is assumed the paragraphs from Step #2 have been classified consistently, then it can also be assumed that each set of similarly classified documents will have identifiable characteristics. For example, documents classified with the term “business” may often include the word “money”. Documents classified as “government” may often include “law”, and documents classified as “family” may often include the words “mother”, “father”, or “children”. By counting & tabulating the existence & frequency of individual words (or phrases) in each of the sets of manually classified documents, it is possible to create computer “models” representing each set. The models will statistically describe the probabilities of the existence & frequency of words in a given classification. Thus, the output of this step will be two representations, one for the Vatican documents and another for the governmental documents. (A sketch of this step, together with the automated classification of Step #4, appears after this list.)
- Automated classification – Using the full text of the given corpus as well as the output of Step #3, a computer program will then be used to assign one or more controlled vocabulary terms to each paragraph in the corpus. In other words, the corpus will be divided into individual paragraphs, each paragraph will be compared to a model and assigned one or more classification terms, and the paragraph/term combinations will be passed on to a database for storage and ultimately to an indexer to support search.
- Indexing – A database will store each paragraph from the corpus alongside metadata describing the paragraph. This metadata will include titles, authors, dates, and publishers, as well as the controlled vocabulary terms. An indexer (a sort of database specifically designed for the purposes of search) will make the content of the database searchable, but the index will also be supplemented with a thesaurus. Because human language is ambiguous, words often have many and subtle differences in meaning. For example, when talking about “dogs”, a person may also be alluding to “hounds”, “canines”, or even “beagles”. Given the set of controlled vocabulary terms, a thesaurus will be created so that when researchers & scholars search for “children”, the indexer may also return documents containing the phrase “sons & daughters of parents”; similarly, when a search is done for “war”, documents (paragraphs) containing the words “battle” or “insurgent” may also be found. (A sketch of such thesaurus-based query expansion appears after this list.)
- Searching & browsing – Finally, a Web-based interface will be created enabling readers to find items of interest, compare & contrast these items, identify patterns & anomalies between these items, and ultimately make judgments of understanding. For example, the reader will be presented with a graphical representation of the controlled vocabulary. By selecting terms from the vocabulary, the index will be queried, and the reader will be presented with sortable and groupable lists of paragraphs classified with the given term. (This process is called “browsing”.) Alternatively, researchers & scholars will be able to enter simple (or complex) queries into an online form, the queries will be applied to the indexer, and again, paragraphs matching the queries will be returned. (This process is called “searching”.) Either way, the researchers & scholars will be empowered to explore the corpus in many and varied ways, and none of these ways will be limited to any individual’s specific topic of interest.
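To make the list above more concrete, a few sketches follow. First, the hierarchical controlled vocabulary from Step #1 might be modeled as a simple nested data structure. The terms below are hypothetical placeholders, not the vocabulary actually outlined by the researchers & scholars:

```python
# A minimal sketch of a hierarchical controlled vocabulary (Step #1).
# The terms are hypothetical placeholders, not the real vocabulary.
controlled_vocabulary = {
    "human dignity": {
        "family": {},
        "labor": {},
    },
    "common good": {
        "government": {},
        "peace": {},
    },
}

def list_terms(vocabulary, path=()):
    """Recursively flatten the hierarchy into slash-delimited term paths."""
    for term, children in vocabulary.items():
        yield " / ".join(path + (term,))
        yield from list_terms(children, path + (term,))

for term_path in list_terms(controlled_vocabulary):
    print(term_path)
```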
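Second, here is a minimal sketch of Steps #3 and #4: “training” a model from manually classified paragraphs and then using it to classify previously unseen paragraphs. It assumes the scikit-learn library and a naive Bayes classifier over simple word counts; the paragraphs and labels are invented, and for simplicity it assigns a single best term per paragraph, whereas the real system would allow one or more terms:

```python
# A sketch of Steps #3 & #4: train a model from manually classified
# paragraphs, then classify previously unseen paragraphs.
# Assumes scikit-learn; the paragraphs and labels are hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# output of Step #2: paragraphs manually assigned to vocabulary terms
training_paragraphs = [
    "The law of the state must serve justice for all.",
    "A mother and father raise their children in love.",
    "Money earned through honest labor supports the family business.",
]
training_labels = ["government", "family", "business"]

# Step #3: count & tabulate words, then build a statistical model
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(training_paragraphs, training_labels)

# Step #4: automatically assign terms to the rest of the corpus
unclassified_paragraphs = [
    "Children deserve the care of their parents.",
    "Taxes are collected under the authority of law.",
]
for paragraph, term in zip(unclassified_paragraphs,
                           model.predict(unclassified_paragraphs)):
    print(term, "->", paragraph)
```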
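Third, the thesaurus from Step #5 might amount to a synonym table used to expand queries before they are applied to the index. The sketch below does the expansion in plain Python against an in-memory list of paragraphs; a production system would more likely lean on the synonym features of a full-fledged indexer:

```python
# A sketch of thesaurus-based query expansion (Step #5).
# The synonym table and paragraphs are hypothetical examples.
thesaurus = {
    "children": ["sons", "daughters", "sons & daughters of parents"],
    "war": ["battle", "insurgent"],
}

paragraphs = [
    "The sons & daughters of parents deserve an education.",
    "Every battle leaves wounds upon the innocent.",
    "Work gives dignity to the human person.",
]

def expand(query):
    """Return the query term plus any synonyms found in the thesaurus."""
    return [query] + thesaurus.get(query, [])

def search(query):
    """Return paragraphs containing the query or any of its synonyms."""
    terms = [term.lower() for term in expand(query)]
    return [p for p in paragraphs if any(term in p.lower() for term in terms)]

print(search("war"))       # finds the "battle" paragraph
print(search("children"))  # finds the "sons & daughters" paragraph
```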
The text above only outlines one possible “blueprint” for comparing & contrasting a corpus of Catholic social thought and human rights. Moreover, there are at least two other ways of addressing the issue. For example, it is entirely possible to “simply” read each & every document. After all, that is the way things have been done for millennia. Another possible solution is to apply natural language processing techniques to the corpus as a whole. For example, one could automatically count & tabulate the most frequently used words & phrases to identify themes. One could compare the rise & fall of these themes over time, geographic location, author, or publisher. The same thing can be done in a more refined way using parts-of-speech analysis. Along these same lines, there are well-understood relevancy ranking algorithms (such as term frequency/inverse document frequency, or TF-IDF) allowing a computer to output the more statistically significant themes. Finally, documents could be compared & contrasted automatically through a sort of geometric analysis in an abstract and multi-dimensional “space”. These additional techniques are considerations for a phase two of the project, if it ever comes to pass.
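As a taste of what such a “phase two” might involve, the following sketch weighs words with TF-IDF and then compares documents geometrically via cosine similarity in a multi-dimensional space. It assumes the scikit-learn library, and the documents are invented placeholders:

```python
# A sketch of TF-IDF weighting and geometric (cosine similarity)
# comparison of documents. Assumes scikit-learn; documents are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Parents owe their children an education in virtue.",
    "Children flourish when parents model honest work.",
    "The state must order the economy toward the common good.",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)  # one vector per document

# words with the highest TF-IDF weights hint at each document's themes
terms = vectorizer.get_feature_names_out()
for row in matrix.toarray():
    top = sorted(zip(row, terms), reverse=True)[:3]
    print([term for weight, term in top])

# cosine similarity: documents sharing vocabulary sit "closer" together
print(cosine_similarity(matrix).round(2))
```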







How rare is rare? – In an effort to determine the “rarity” of items in the Catholic Portal, I programmatically searched WorldCat for specific items, counted the number of times each was held by libraries in the United States, and recorded the list of the holding libraries. Through this process I learned that most of the items in the Catholic Portal are “rare”, but I also learned that “rarity” can be defined as the triangulation of scarcity, demand, and value. Thus the “rare” things may not be rare at all.
EEBO is an acronym for Early English Books Online. It is intended to be a complete collection of English literature published between 1475 and 1700. TCP is an acronym for Text Creation Partnership, a consortium of libraries dedicated to making EEBO freely available in the form of XML called TEI (Text Encoding Initiative). [4, 5]
I used the repository on Box to get my content, and I mirrored it locally. [1, 2] I then looped through the content using XPath to extract rudimentary metadata, thus creating a “catalog” (index). Along the way I calculated the number of words in each document and saved that as a field of each “record”. Because the catalog is a tab-delimited file, it is trivial to import it into my favorite spreadsheet, database, editor, or statistics program. This allowed me to browse the collection. I then used grep to search my catalog and saved the results to a file. I searched for Richard Baxter. [6, 7, 8] I then used an R script to graph the numeric data of my search results. Currently, there are only two types: 1) dates, and 2) number of words. [9, 10, 11, 12] From these graphs I can tell that Baxter wrote a lot of relatively short things, and I can easily see when he published many of his works. (He published a lot around 1680 but little in 1665.) I then transformed the search results into a browsable HTML table. The table has hidden features. (Can you say, “Usability?”) For example, you can click on table headers to sort. This is cool because I want to sort things by number of words. (Number of pages doesn’t really tell me anything about length.) There is also a hidden link to the left of each record. Upon clicking on the blank space you can see subjects, publisher, language, and a link to the raw XML.
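For the curious, the XPath-based cataloging described above might look something like the sketch below. It is only an approximation, not the scripts actually used; it assumes the lxml library, TEI P5 headers, and a placeholder directory of XML files:

```python
# A sketch of building a tab-delimited "catalog" from a local mirror of
# EEBO-TCP TEI files. Assumes lxml and TEI P5 headers; the directory
# path and XPath expressions are assumptions, not the original tooling.
import glob
from lxml import etree

TEI = {"t": "http://www.tei-c.org/ns/1.0"}

def first(tree, xpath):
    """Return the text of the first node matching the XPath, or ''."""
    nodes = tree.xpath(xpath, namespaces=TEI)
    return nodes[0].strip() if nodes else ""

print("\t".join(["file", "title", "author", "date", "words"]))
for filename in glob.glob("./eebo-tcp/*.xml"):
    tree = etree.parse(filename)
    title  = first(tree, "//t:fileDesc/t:titleStmt/t:title/text()")
    author = first(tree, "//t:fileDesc/t:titleStmt/t:author/text()")
    date   = first(tree, "//t:sourceDesc//t:date/text()")
    # crude word count over the transcribed text
    words  = len(" ".join(tree.xpath("//t:text//text()", namespaces=TEI)).split())
    print("\t".join([filename, title, author, date, str(words)]))
```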



I have put my (fledgling) HathiTrust Workset Browser on GitHub. Try: