A couple of weeks ago Kevin Phaup took the lead in facilitating a 3D printing workshop here in the Libraries’ Center For Digital Scholarship. More than a dozen students from across the University participated. Kevin presented them with an overview of 3D printing, pointed them towards an online 3D image editing application (Shapeshifter), and everybody created various objects which Matt Sisk has been diligently printing. The event was deemed a success, and there will probably be more specialized workshops scheduled for the Fall.

Since the last blog posting there has also been another Working Group meeting. A short dozen of us got together in Stinson-Remick where we discussed the future possibilities for the Group. The consensus was to create a more formal mailing list, maybe create a directory of people with 3D printing interests, and see about doing something more substantial, with a purpose, for the University.

To those ends, a mailing list has been created. Its name is 3D Printing Working Group. The list is open to anybody, and its purpose is to facilitate discussion of all things 3D printing around Notre Dame and the region. To subscribe, address an email message to, and in the body of the message include the following command:

subscribe nd-3d-printing Your Name

where Your Name is… your name.

Finally, the next meeting of the Working Group has been scheduled for Wednesday, May 14. It will be sponsored by Bob Sutton of Springboard Technologies, held in Innovation Park across from the University, and run from 11:30 to 1 o’clock. I’m pretty sure lunch will be provided. The purpose of the meeting will be to continue outlining the future directions of the Group as well as to see a demonstration of a printer called the Isis3D.

This posting outlines a current trend in some academic libraries, specifically, the inclusion of digital humanities into their service offerings. It provides the briefest of introductions to the digital humanities, and then describes how one branch of the digital humanities — text mining — is being put into practice here in the Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame.

Digital humanities

For all intents and purposes, the digital humanities is a newer rather than older scholarly endeavor. A priest named Father Busa is considered the “Father of the Digital Humanities” because, in 1965, he worked with IBM to evaluate the writings of Thomas Aquinas. With the advent of the Internet, ubiquitous desktop computing, an increased volume of digitized content, and sophisticated markup languages like TEI (the Text Encoding Initiative), digital humanities work has moved from being a fad to being a trend. While digital humanities work is sometimes called a discipline, this author sees it as more akin to a method. It is a process of doing “distant reading” to evaluate human expression. (The phrase “distant reading” is attributed to Franco Moretti who coined it in a book entitled Graphs, Maps, Trees: Abstract Models for a Literary History. Distant reading is complementary to “close reading”, and is used to denote the idea of observing many documents simultaneously.) The digital humanities community has grown significantly in the past ten or fifteen years, complete with international academic conferences, graduate school programs, and scholarly publications.

Digital humanities work is a practice where the digitized content of the humanist is quantitatively analyzed as if it were the content studied by a scientist. This sort of analysis can be done against any sort of human expression: written and spoken words, music, images, dance, sculpture, etc. Invariably, the process begins with counting and tabulating. This leads to measurement, which in turn provides opportunities for comparison. From here patterns can be observed and anomalies perceived. Finally, predictions, theses, and judgements can be articulated. Digital humanities work does not replace the more traditional ways of experiencing expressions of the human condition. Instead it supplements the experience.
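
The progression from counting to measurement to comparison can be illustrated with a few lines of code. What follows is only a minimal sketch in Python, not anything used by the Center; the function names (tabulate_words, relative_frequency) and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

def tabulate_words(text):
    """Count how often each word occurs in a text."""
    # Lowercase the text and treat runs of letters as crude "words".
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def relative_frequency(counts, word):
    """Express a count as a share of all words, so texts of different sizes can be compared."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Two made-up sample texts standing in for real documents.
counts_a = tabulate_words("the cat sat on the mat")
counts_b = tabulate_words("a dog and a cat and a bird")

# Counting and tabulating: the most frequent word in each text.
print(counts_a.most_common(1))
print(counts_b.most_common(1))

# Measurement and comparison: how prominent is "cat" in each text?
print(relative_frequency(counts_a, "cat"))
print(relative_frequency(counts_b, "cat"))
```

The raw counts are the tabulation, the relative frequencies are the measurement, and setting the two numbers side by side is the comparison from which patterns and anomalies start to emerge.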

This author often compares the methods of the digital humanist to the reading of a thermometer. Suppose you observe an outdoor thermometer and it reads 32° (Fahrenheit). This reading, in and of itself, carries little meaning. It is only a measurement. In order to make sense of the reading it is important to put it into context. What is the weather outside? What time of year is it? What time of day is it? How does the reading compare to other readings? If you live in the Northern Hemisphere and the month is July, then the reading is probably an anomaly. On the other hand, if the month is January, then the reading is perfectly normal. The processes of the digital humanist make it possible to take many measurements from a very large body of materials in order to evaluate things like texts, sounds, images, etc. They make it possible to evaluate the totality of Victorian literature, the use of color in paintings over time, or the rhythmic similarities and differences between various forms of music.

Digital humanities centers in libraries

As the more traditional services of academic libraries become more accessible via the Internet, libraries have found the need to evolve. One manifestation of this evolution is the establishment of digital humanities centers. Probably one of the oldest of these centers is located at the University of Virginia, but they now exist in many libraries across the country. These centers provide a myriad of services including combinations of digitization, markup, website creation, textual analysis, speaker series, etc. Sometimes these centers are akin to computing labs. Sometimes they are more like small but campus-wide departments staffed with scholars, researchers, and graduate students.

The Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame was recently established in this vein. The Center supports services around geographic information systems (GIS), data management, statistical analysis of data, and text mining. It is located in a 5,000 square foot space on the Libraries’ first floor and includes a myriad of computers, scanners, printers, a 3D printer, and collaborative work spaces. Below is an annotated list of text mining projects the author has worked on in the Center. It is intended to give the reader a flavor of the types of work done in the Hesburgh Libraries:

  • Great Books – This was almost a tongue-in-cheek investigation to calculate which book was the “greatest” from a set of books called the Great Books of the Western World. The editors of the set defined a great book as one which discussed any one of a number of great ideas both deeply and broadly. These ideas were tabulated and compared across the corpus and then sorted by the resulting calculation. Aristotle’s Politics was determined to be the greatest book and Shakespeare was determined to have written nine of the top ten greatest books when it comes to the idea of love.
  • HathiTrust Research Center – The HathiTrust Research Center is a branch of the HathiTrust. The Research Center supports a number of algorithms used to do analysis against reader-defined worksets. The Center For Digital Scholarship facilitates workshops on the use of the HathiTrust Research Center and provides a small set of tools for programmatically searching and retrieving items from the HathiTrust.
  • JSTOR Tool – Data For Research (DFR) is a freely available and alternative interface to the bibliographic index called JSTOR. DFR enables the reader to search the entirety of JSTOR through faceted querying. Search results are tabulated, enabling the reader to create charts and graphs illustrating the results. Search results can be downloaded for more detailed investigations. JSTOR Tool is a Web-based application allowing the reader to summarize and do distant reading against these downloaded results.
  • PDF To Text – Text mining almost always requires the content of its investigation to be in the form of plain text, but much of the content used by people is in PDF. PDF To Text is a Web-based tool which extracts the plain text from PDF files and provides a number of services against the result (readability scores, ngram extraction, concordancing, and rudimentary parts-of-speech analysis).
  • Perceptions of China – This project is in the earliest stages. Prior to visiting China students have identified photographs and written short paragraphs describing, in their minds, what they think of China. After visiting China the process is repeated. The faculty member leading the students on their trips to China wants to look for patterns of perception in the paragraphs.
  • Poverty Tourism – A university senior believes they have identified a trend: the desire to tour poverty-stricken places. They identified as many as forty websites advertising “Come visit our slum”. Working with the Center they programmatically mirrored the content of the remote websites and removed all the HTML tags from the mirrors. They then used Voyant Tools as well as various ngram tabulation tools to do distant reading against the corpus. Their investigations demonstrated the preponderant use of the word “you”, and they posit this is because the authors of the websites are trying to get readers to imagine being in a slum.
  • State Trials – In collaboration with a number of other people, transcripts of the State Trials dating between 1650 and 1700 were analyzed. Digital versions of the Trials were obtained, and a number of descriptive analyses were done. The content was indexed and a timeline was created from search results. Ngram extraction was done as well as parts-of-speech analysis. Various types of similarity measures were done based on named entities and the overall frequency of words (vectors). A stop word list was created based on additional frequency tabulations. Much of this analysis was visualized using word clouds, line charts, and histograms. This project is an excellent example of how much of digital humanities work is collaborative and requires the skills of many different types of people.
  • Tiny Text Mining Tools – Text mining is rooted in the counting and tabulation of words. Computers are very good at counting and tabulating. To that end a set of tiny text mining tools has been created enabling the Center to perform quick & dirty analysis against one or more items in a corpus. Written in Perl, the tools implement a well-respected relevancy ranking algorithm (term-frequency inverse document frequency or TFIDF) to support searching and classification, a cosine similarity measure for clustering and “finding more items like this one”, a concordancing (keyword in context) application, and an ngram (phrase) extractor.
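
Two of the techniques named in the list above, ngram extraction and concordancing (keyword in context), are small enough to sketch here. The following is an illustration written in Python rather than the Perl of the tools themselves, and it is not the Center's implementation; the function names and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, width=2):
    """Return each occurrence of keyword with up to `width` tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines

# A made-up sample sentence standing in for a real document.
tokens = tokenize("The quick brown fox jumps over the lazy dog and the quick cat.")

# Tabulate two-word phrases and list the most frequent one.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))

# Show every occurrence of "the" with one word of context on each side.
for line in concordance(tokens, "the", width=1):
    print(line)
```

Tabulating the ngrams with a Counter is the same counting step that underlies the rest of the list; the concordance simply returns to the text to show where the counted words actually occur.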


Text mining, and digital humanities work in general, is simply the application of computing techniques to the content of human expression. Their use is similar to the use of the magnifying glass by Galileo. Instead of turning it down to count the number of fibers in a cloth (or to write an email message), it is being turned up to gaze at the stars (or to analyze the human condition). What he found there was not so much truth as new ways to observe. The same is true of text mining and the digital humanities. They are additional ways to “see”.


I have posted to GitHub the very beginnings of a Perl library used to support simple and introductory text mining analysis: tiny text mining tools.

Presently the library is implemented in a set of subroutines stored in a single file supporting:

  • simple in-memory indexing and single-term searching
  • relevancy ranking through term-frequency inverse document frequency (TFIDF) for searching and classification
  • cosine similarity for clustering and “finding more items like this one”
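
For readers unfamiliar with these three techniques, here is a rough sketch of each. It is written in Python, not Perl, and it is not the code posted to GitHub; the function names, the three-document corpus, and the particular TFIDF variant (term frequency multiplied by the natural log of the inverse document frequency) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(corpus):
    """In-memory index: map each document name to its word frequencies."""
    return {name: Counter(tokenize(text)) for name, text in corpus.items()}

def search(index, term):
    """Single-term search: names of the documents containing the term."""
    return sorted(name for name, counts in index.items() if counts[term] > 0)

def tfidf(index, term, name):
    """Score a term in one document; this is one common variant of TFIDF."""
    counts = index[name]
    total = sum(counts.values())
    hits = sum(1 for c in index.values() if c[term] > 0)
    if total == 0 or hits == 0:
        return 0.0
    tf = counts[term] / total            # how prominent the term is in this document
    idf = math.log(len(index) / hits)    # how rare the term is across the corpus
    return tf * idf

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# A tiny made-up corpus standing in for articles, books, or websites.
corpus = {
    "a": "love love war",
    "b": "war war peace",
    "c": "peace garden",
}
index = build_index(corpus)

# Search for a term, then rank the hits by relevancy.
hits = search(index, "war")
ranked = sorted(hits, key=lambda name: tfidf(index, "war", name), reverse=True)
print(ranked)

# "Find more items like this one": compare document a to the others.
print(cosine(index["a"], index["b"]))
print(cosine(index["a"], index["c"]))
```

Ranking sorts search hits by their TFIDF score, and the cosine measure scores two documents as more alike the more of their vocabulary they share, which is what makes clustering possible.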

I use these subroutines and the associated Perl scripts to do quick & dirty analysis against corpora of journal articles, books, and websites.

I know, I know. It would be better to implement these things as a set of Perl modules, but I’m practicing what I preach. “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, and enable researchers to have them at their fingertips as opposed to going through a Web-based application.