tei2json: Summarizing the structure of Early English poetry and prose
Posted on January 17, 2017 in Uncategorized by Eric Lease Morgan
This posting describes a hack of mine, tei2json.pl – a Perl program to summarize the structure of Early English poetry and prose. [0]
In collaboration with Northwestern University and Washington University, the University of Notre Dame is working on a project whose primary purpose is to correct (“annotate”) the Early English corpus created by the Text Creation Partnership (TCP). My role in the project is to do interesting things with the corpus once it has been corrected. One of those things is the creation of metadata files denoting the structure of each item in the corpus.
Some of my work is really an effort to reverse engineer good work done by the late Sebastian Rahtz. For example, Mr. Rahtz cached a version of the TCP corpus, transformed each item into a number of different formats, and put the whole thing on GitHub. [1] As a part of this project, he created metadata files enumerating what TEI elements were in each file and what attributes were associated with each element. The result was an HTML display allowing the reader to quickly see how many bibliographies an item may have, what languages may be present, how long the document was measured in page breaks, etc. One of my goals is/was to do something very similar.
The workings of the script are really very simple: 1) configure and denote what elements to count & tabulate, 2) loop through each configuration, 3) keep a running total of the result, 4) convert the result to JSON (a specific data format), and 5) save the result to a file. Here are (temporary) links to a few examples:
- http://dh.crc.nd.edu/tmp/early-print/A00002.json
- http://dh.crc.nd.edu/tmp/early-print/A00395.json
- http://dh.crc.nd.edu/tmp/early-print/A00959.json
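The first four steps can be sketched in a few lines. What follows is a minimal illustration in Python, not the Perl of tei2json.pl itself. The list of elements, the function names, and the "id" and "counts" keys in the output are all assumptions made for the sketch, and writing the result to a file (step 5) is left out.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter

# Step 1: configure what to count. This list is an assumption; the real
# script's configuration may differ.
ELEMENTS = ["bibl", "figure", "l", "lg", "note", "p", "q"]


def local_name(tag):
    """Reduce a namespaced tag such as '{http://www.tei-c.org/ns/1.0}p' to 'p'."""
    return tag.rsplit("}", 1)[-1]


def summarize(tei_xml, elements=ELEMENTS):
    """Steps 2 and 3: tally every element, then report the configured ones."""
    root = ET.fromstring(tei_xml)
    found = Counter(local_name(node.tag) for node in root.iter())
    # Include configured elements that never occur, with a count of zero.
    return {name: found.get(name, 0) for name in elements}


def tei2json(tei_xml, identifier):
    """Step 4: serialize the tallies. The key names are invented for this sketch."""
    return json.dumps({"id": identifier, "counts": summarize(tei_xml)}, indent=2)


if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<p>A paragraph with a <q>quotation</q>.</p>"
        "<lg><l>First line of verse,</l><l>second line of verse.</l></lg>"
        "</body></text></TEI>"
    )
    print(tei2json(sample, "sample"))
```

Matching on the local name rather than the fully qualified tag keeps the sketch indifferent to whether a file declares the TEI namespace.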
JSON files are not really very useful in & of themselves; they are designed to be transport mechanisms allowing other applications to read and process them. This is exactly what I did. In fact, I created two different applications: 1) json2table.pl and 2) json2tsv.pl. [2, 3] The former script takes a JSON file and creates an HTML file whose appearance is very similar to Rahtz’s. Using the JSON files (above) the following HTML files have been created through the use of json2table.pl:
- http://dh.crc.nd.edu/tmp/early-print/A00002.htm
- http://dh.crc.nd.edu/tmp/early-print/A00395.htm
- http://dh.crc.nd.edu/tmp/early-print/A00959.htm
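As a rough idea of what such a conversion involves, here is a small Python sketch (again, not json2table.pl itself). It assumes the JSON is a flat object mapping element names to counts, which may not match the layout of the files linked above, and the function name is made up for the example.

```python
import json
from html import escape


def json2table(json_text):
    """Render a flat {"element": count} JSON object as a small HTML table."""
    counts = json.loads(json_text)
    rows = "\n".join(
        f"    <tr><td>{escape(str(name))}</td><td>{escape(str(count))}</td></tr>"
        for name, count in sorted(counts.items())
    )
    return (
        "<table>\n"
        "    <tr><th>element</th><th>count</th></tr>\n"
        f"{rows}\n"
        "</table>"
    )


if __name__ == "__main__":
    # Three of the counts reported for A00002 in the matrix further on.
    print(json2table('{"p": 18, "l": 4118, "lg": 490}'))
```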
The second script (json2tsv.pl) allows the reader to compare & contrast structural elements between items. Json2tsv.pl reads many JSON files and outputs a matrix of values. This matrix is a delimited file suitable for analysis in spreadsheets, database applications, statistical analysis tools (such as R or SPSS), or programming language libraries (such as Python’s numpy or Perl’s PDL). In its present configuration, json2tsv.pl outputs a matrix looking like this:
id      bibl  figure  l     lg   note  p    q
A00002  3     4       4118  490  8     18   3
A00011  3     0       2     0    47    68   6
A00089  0     0       0     0    0     65   0
A00214  0     0       0     0    151   131  0
A00289  0     0       0     0    41    286  0
A00293  0     1       189   38   0     2    0
A00395  2     0       0     0    0     160  2
A00749  0     4       120   18   0     0    2
A00926  0     0       124   12   0     31   7
A00959  0     0       2633  9    0     4    0
A00966  0     0       2656  0    0     17   0
A00967  0     0       2450  0    0     3    0
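Building a matrix like this amounts to reading each JSON file and writing one row per item. The Python sketch below shows the idea under two assumptions: each JSON document is a flat object of element counts, and the documents are passed in as strings keyed by identifier instead of being read from disk. It is an illustration only, not the logic of json2tsv.pl.

```python
import csv
import io
import json

# Column order copied from the matrix above.
COLUMNS = ["bibl", "figure", "l", "lg", "note", "p", "q"]


def json2tsv(documents, columns=COLUMNS):
    """Turn {id: json_text} into tab-delimited text, one row per item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id"] + list(columns))
    for identifier in sorted(documents):
        counts = json.loads(documents[identifier])
        # An element missing from a document is recorded as zero.
        writer.writerow([identifier] + [counts.get(name, 0) for name in columns])
    return buffer.getvalue()


if __name__ == "__main__":
    documents = {
        "A00395": '{"bibl": 2, "p": 160, "q": 2}',
        "A00002": '{"bibl": 3, "figure": 4, "l": 4118, "lg": 490, "note": 8, "p": 18, "q": 3}',
    }
    print(json2tsv(documents), end="")
```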
Given such a file, the reader could then ask & answer questions such as:
- Which item has the greatest number of figures?
- What is the average number of lines per line group?
- Is there a statistical correlation between paragraphs and quotes?
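For the sake of illustration, the three questions can be put to the twelve-item matrix with nothing more than the Python standard library. The function names below are invented for the sketch. Two caveats apply: the lines-per-line-group figure is a pooled ratio over items that have at least one line group, so it is only an approximation wherever lines occur outside of line groups, and a correlation computed from twelve items says very little about the corpus as a whole.

```python
import math

TABLE = """\
id      bibl  figure  l     lg   note  p    q
A00002  3     4       4118  490  8     18   3
A00011  3     0       2     0    47    68   6
A00089  0     0       0     0    0     65   0
A00214  0     0       0     0    151   131  0
A00289  0     0       0     0    41    286  0
A00293  0     1       189   38   0     2    0
A00395  2     0       0     0    0     160  2
A00749  0     4       120   18   0     0    2
A00926  0     0       124   12   0     31   7
A00959  0     0       2633  9    0     4    0
A00966  0     0       2656  0    0     17   0
A00967  0     0       2450  0    0     3    0
"""


def read_table(text):
    """Parse whitespace-delimited text into a list of dicts with integer counts."""
    header, *lines = [line.split() for line in text.strip().splitlines()]
    return [
        {"id": fields[0], **{k: int(v) for k, v in zip(header[1:], fields[1:])}}
        for fields in lines
    ]


def most(rows, column):
    """Return the largest value in a column and the ids of every item that has it."""
    top = max(row[column] for row in rows)
    return top, [row["id"] for row in rows if row[column] == top]


def lines_per_group(rows):
    """Pooled ratio of l to lg, ignoring items that have no line groups."""
    grouped = [row for row in rows if row["lg"] > 0]
    return sum(r["l"] for r in grouped) / sum(r["lg"] for r in grouped)


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either sequence is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


if __name__ == "__main__":
    rows = read_table(TABLE)
    print("most figures:", most(rows, "figure"))
    print("lines per line group:", round(lines_per_group(rows), 2))
    print("p vs. q:", round(pearson([r["p"] for r in rows], [r["q"] for r in rows]), 2))
```

Returning every identifier that shares the maximum matters here, because two of the items tie for the greatest number of figures.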
Additional examples of input & output files are temporarily available online. [4]
My next steps include at least a couple of things. One, I need/want to evaluate whether or not to save my counts & tabulations in a database before (or after) creating the JSON files. The data may prove to be useful there. Two, as a librarian, I want to go beyond qualitative description of narrative texts, and the counting & tabulating of structural elements moves in that direction, but it does not really address the “aboutness”, “meaning”, or “allusions” found in a corpus. Sure, librarians have applied controlled vocabularies and bits of genre to metadata descriptions, but such things are not quantitative and consequently elude statistical analysis. For example, using sentiment analysis one could measure and calculate the “lovingness”, “war mongering”, “artisticness”, or “philosophic nature” of the texts. One could count & tabulate the number of times family-related terms are used, assign the result a score, and record the score. One could then amass all documents and sort them by how much they discussed family, love, philosophy, etc. Such is on my mind, and more than half-way baked. Wish me luck.
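To make the count-score-sort idea concrete, here is a deliberately naive Python sketch. It is plain term counting, not sentiment analysis. The term list is hypothetical, and so is the choice of "occurrences per 1,000 words" as the score. A real list would also have to cope with Early English spelling variation.

```python
import re
from collections import Counter

# Hypothetical term list, for illustration only.
FAMILY_TERMS = {"father", "mother", "brother", "sister", "son", "daughter"}


def score(text, terms=FAMILY_TERMS):
    """Return occurrences of the terms per 1,000 words (0.0 for empty text)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(count for word, count in Counter(words).items() if word in terms)
    return 1000 * hits / len(words)


def rank(documents, terms=FAMILY_TERMS):
    """Sort {id: text} by descending score."""
    scored = {identifier: score(text, terms) for identifier, text in documents.items()}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    documents = {
        "first": "The king spoke to his son and to his daughter.",
        "second": "The ship sailed at dawn.",
    }
    for identifier, value in rank(documents):
        print(identifier, round(value, 1))
```

Normalizing by document length keeps long texts from outranking short ones merely by being long.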
Links
- [0] tei2json.pl – http://dh.crc.nd.edu/tmp/early-print/tei2json.pl
- [1] An example of this good work is found at https://github.com/textcreationpartnership/A00002
- [2] json2table.pl – http://dh.crc.nd.edu/tmp/early-print/json2table.pl
- [3] json2tsv.pl – http://dh.crc.nd.edu/tmp/early-print/json2tsv.pl
- [4] more examples – http://dh.crc.nd.edu/tmp/early-print/