Some automated analysis of Henry David Thoreau’s works

thoreau

This page describes a corpus named thoreau, and it was programmatically created with a program called the HathiTrust Research Center Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

  • Number of items – 32
  • Publication date range – 1866 to 1953 (histogram : boxplot)
  • Sizes in pages – 38 to 556 (histogram : boxplot)
  • Total number of pages – 7918
  • Average number of pages per item – 247

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between pages and number of words. Are others exist? For more detail, browse the catalog.

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

The distribution of words (histograms and boxplots) and the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

Comments are closed.