Some automated analysis of Henry David Thoreau’s works

Posted on June 12, 2015 in Uncategorized by Eric Lease Morgan


This page describes a corpus named thoreau, which was created programmatically with the HathiTrust Research Center Workset Browser.

General statistics

An analysis of the corpus’s metadata gives an overview of how many items it contains, when they were published, and how large they are:

Number of items – 32
Publication date range – 1866 to 1953 (histogram : boxplot)
Sizes in pages – 38 to 556 (histogram : boxplot)
Total number of pages – 7918
Average number of pages per item – 247

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between the number of pages and the number of words. Do other correlations exist? For more detail, browse the catalog.
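To put a number on a relationship such as the one between pages and words, one option is Pearson's correlation coefficient. The following is a minimal sketch, not the Workset Browser's own code: the function name `pearson` and the sample values are assumptions made up for illustration, and the real figures would come from the catalog.

```python
from math import sqrt


def pearson(xs, ys):
    """Return Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical (pages, words) pairs, NOT taken from the thoreau catalog.
    pages = [40, 120, 250, 400, 550]
    words = [1500, 11000, 26000, 45000, 60000]
    print(f"pages vs. words: r = {pearson(pages, words):.3f}")
```

A value near 1 indicates that longer items tend to contain more words, a value near 0 indicates no linear relationship, and a value near -1 indicates an inverse one. The same function can be applied to any other pair of numeric columns in the catalog, such as publication date and size.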

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Sizes of items in words – 1201 to 64988 (histogram : boxplot)
Total number of words – 864751
Average number of words per item – 27023
Total number of unique words – 23456
Most common words – one (7020) like (4212) see (3928) would (3603) man (3520) may (3071) two (2826) us (2670) time (2626) still (2442) though (2389) day (2369) much (2365) many (2345) men (2329) water (2292) life (2256) little (2211) long (2162) could (2134) yet (2080) river (2033) first (1981) new (1974) even (1943)
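The counting and tabulating described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the Workset Browser's actual method: the tokenizer (lowercased runs of ASCII letters), the tiny `STOPWORDS` set, and the function names are all hypothetical. The list of most common words above contains no function words such as "the", which suggests some stop word filtering is applied, but the exact list used is not given here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "i", "it", "is", "was"})


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens.

    This is a simplification: contractions such as "don't" are split in two.
    """
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(text, stopwords=STOPWORDS):
    """Count every token in the text that is not a stop word."""
    return Counter(word for word in tokenize(text) if word not in stopwords)


if __name__ == "__main__":
    sample = "The river and the water; the river still."
    freqs = word_frequencies(sample)
    # Total words, unique words, and the most common words, as in the list above.
    print("total words:", sum(freqs.values()))
    print("unique words:", len(freqs))
    print("most common:", freqs.most_common(3))
```

Running `word_frequencies` over the plain text of each item and summing the resulting counters would yield corpus-wide totals of the kind reported above.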

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

Most common “big” names – mill (112) swift (102) james (100) smith (99) homer (92) shakespeare (75) chaucer (40) milton (36) goethe (33) plato (29) virgil (28) bacon (23) dante (20) aristotle (19) tolstoy (17) plutarch (16) darwin (16) herodotus (13) newton (11) augustine (8) gilbert (8) copernicus (7) berkeley (7) plotinus (7) lucretius (6) For more detail, see the list of “big” name frequencies.
Most common “great” ideas – one (7021) man (3521) time (2627) many (2346) life (2257) nature (1741) good (1542) world (1199) love (921) state (671) mind (530) god (508) sense (479) history (446) form (424) truth (419) beauty (418) experience (366) government (337) family (316) particular (285) poetry (285) law (279) knowledge (276) art (273) For more detail, see the list of “great” idea frequencies.
Colors – white (1822) black (837) green (789) red (757) brown (593) blue (518) yellow (511) gray (313) purple (204) orange (31) For more detail, see the list of color word frequencies.
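Counting against a lexicon amounts to keeping only those tokens that appear in a predefined set. The sketch below shows the idea using the ten color words listed above; the function names and the simple tokenizer are assumptions for illustration, and the Workset Browser's own dictionaries and matching rules may differ.

```python
import re
from collections import Counter

# The ten color words reported in the list above.
COLORS = frozenset({
    "white", "black", "green", "red", "brown",
    "blue", "yellow", "gray", "purple", "orange",
})


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexicon_counts(text, lexicon):
    """Count only the tokens of the text that belong to the given lexicon."""
    return Counter(token for token in tokenize(text) if token in lexicon)


if __name__ == "__main__":
    sample = "The white pine and the green pine stood by the white pond."
    for word, count in lexicon_counts(sample, COLORS).most_common():
        print(f"{word} ({count})")
```

Because matching is done on whole tokens, ambiguous entries are counted regardless of sense. This may explain why words such as "mill", "swift", and "smith" lead the list of “big” names: in these texts they are probably common nouns and adjectives more often than references to people.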

The distribution of words (histograms and boxplots), the frequency of words (wordclouds), and the way these frequencies “cluster” together (dendrograms) can all be illustrated:

Histograms – “big” names; “great” ideas; colors
Boxplots – “big” names; “great” ideas; colors
Wordclouds – most common words; “big” names; “great” ideas; colors
Cluster dendrograms – most common words; “big” names; “great” ideas; colors

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

Shortest item (38 p.) – A bit of unpublished correspondence between Henry D. Thoreau and Isaac T. Hecker. By E. Harlow Russell. (HathiTrust : WorldCat : plain text)
Longest item (556 p.) – Excursions / by Henry D. Thoreau. (HathiTrust : WorldCat : plain text)
Oldest item (1866) – A Yankee in Canada with Anti-slavery and reform papers / by Henry D. Thoreau. (HathiTrust : WorldCat : plain text)
Most recent (1953) – Selected writings on nature and liberty; edited with an introd., by Oscar Cargill. (HathiTrust : WorldCat : plain text)
Most thoughtful item – On the Duty of Civil Disobedience. (HathiTrust : WorldCat : plain text)
Least thoughtful item – Journal / edited by Bradford Torrey. (HathiTrust : WorldCat : plain text)
Biggest name dropper – The service / by Henry David Thoreau; ed. by F. B. Sanborn. (HathiTrust : WorldCat : plain text)
Fewest quotations – A bit of unpublished correspondence between Henry D. Thoreau and Isaac T. Hecker. By E. Harlow Russell. (HathiTrust : WorldCat : plain text)
Most colorful – Notes on New England birds, by Henry D. Thoreau; arranged and ed. by Francis H. Allen; with illustrations from photographs of birds in nature. (HathiTrust : WorldCat : plain text)
Ugliest – The service / by Henry David Thoreau; ed. by F. B. Sanborn. (HathiTrust : WorldCat : plain text)
