Optical Character Recognition with Tesseract: a Tutorial for Medievalists

This tutorial is for those of you who want to learn some basic programming in Python for the digital humanities, but also for those who have never programmed or may become filled with terror at the sight of a single line of code (trust me, I know the feeling!). By the end of this guide, you should know how to perform optical character recognition (OCR) to make a PDF searchable. For this, we are going to use “ocrmypdf”, a Tesseract-based Python package with wonderful capabilities. Everything will be done online from your web browser, so don’t worry, you will not have to install anything on your computer!

I have tried to keep the tutorial very simple and straight to the point, at the cost of occasionally sacrificing some useful and important explanations about the code. For that, I profusely apologize to my colleagues in the department of computer science. Please, do not send the Spanish Inquisition.

The name “Python” is a tribute to Monty Python, hence my poor joke.

Why Tesseract and ocrmypdf?

Some of you may be familiar with, or even regular users of, the OCR function provided by Adobe Acrobat Pro DC. The Ctrl+F search function is one of the simplest and most efficient tools to support the analysis of text and research. Acrobat DC is powerful and very easy to use, but this comes at a price. First, Acrobat works only with a handful of common modern languages such as English, French, German or Japanese. Tesseract, on the other hand, can recognize characters from a broad variety of modern and classical languages, including Armenian, Classical Arabic, Classical Greek, Syriac, and Old Georgian, to name only a few. Second, Tesseract is free and open-source software, which makes it a more cost-efficient option than expensive commercial products. I am sure many of you would rather settle for the free but equally powerful alternative.

First, go to https://colab.research.google.com/

Colab is a free, hosted Jupyter notebook service that will save your work on your personal Google Drive. Just remember to hit the save button before closing the page.

Open a new notebook and sign in with a Google account. You will want to use your @nd.edu address, but any personal Google account will work too.

As a first step, we need to download ocrmypdf and all its dependencies. To do so, simply type the following lines. The first will download the Python package ocrmypdf, while the other lines will deal with the dependencies. For those of you new to coding, this is where you learn the first rule of coding: any error in spelling, indentation and so forth can break your code. Be careful!
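As a minimal sketch of such an installation cell, the code below runs pip for the Python package and apt-get for the system programs. It assumes that the dependencies are the Debian packages tesseract-ocr and ghostscript, which is an assumption on my part and may not be the complete list; the helper names install_commands and install_all are hypothetical.

```python
import subprocess
import sys

# Assumption: these Debian package names cover the system dependencies.
APT_PACKAGES = ["tesseract-ocr", "ghostscript"]


def install_commands():
    """Return the commands to run, one list of words per command."""
    return [
        # Install the Python package itself with pip.
        [sys.executable, "-m", "pip", "install", "ocrmypdf"],
        # Install the programs that ocrmypdf calls behind the scenes.
        ["apt-get", "install", "-y", *APT_PACKAGES],
    ]


def install_all(run=subprocess.run):
    """Run each command in turn and stop at the first failure."""
    for command in install_commands():
        run(command, check=True)


# In a Colab cell, uncomment the next line to start the download.
# install_all()
```

In Colab you can get the same effect by typing each command directly in a cell with a leading exclamation mark, for example !pip install ocrmypdf.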

When the above lines have been written, run the cell by pressing Ctrl+Enter or by clicking the play button in the upper-left corner of the cell. The download process should take around a minute. Once the download is complete, you should see a little green tick next to the play button.

Once the package and its dependencies have been downloaded, we will need to import “ocrmypdf” so that we can put it to work. Add a new cell to your Colab notebook by clicking on + Code and write the following code:
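In its simplest form, this cell is the single line import ocrmypdf. The sketch below wraps that line in a small check, so that a missing installation produces a friendly message instead of an error; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find the named package."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("ocrmypdf"):
    import ocrmypdf  # the import that the rest of the tutorial relies on
    print("ocrmypdf is ready to use")
else:
    print("ocrmypdf was not found; run the installation cell first")
```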

Next, add a copy of your scanned PDF to the “files/content” folder on the left side of your screen (or any other folder of your choice, you will just have to note its path somewhere). In our case, we are going to work with the first page of an article on the Mevlevi Sufi order published by the Byzantinist Speros Vryonis Jr.

In a new cell, enter the following lines of code (beware, the underscores are double underscores!).
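A minimal sketch of such a cell, assuming the file sits in the current folder and the text is in English, might look like this. The function ocrmypdf.ocr and its language option belong to the package's Python interface, while the helper name run_ocr and the language code "eng" are my own choices for illustration.

```python
from pathlib import Path


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Add a searchable text layer to input_pdf and save it as output_pdf."""
    import ocrmypdf  # imported here so the cell fails only when OCR is actually run

    # ocrmypdf expects a list of language codes, so we wrap the single code.
    ocrmypdf.ocr(input_pdf, output_pdf, language=[language])


# Note the double underscores on both sides of "name" and "main".
if __name__ == "__main__":
    source = Path("Vryonis_Article.pdf")
    if source.exists():
        run_ocr(str(source), "Vryonis_Article_OCR.pdf")
        print("Done: Vryonis_Article_OCR.pdf")
    else:
        print("Upload Vryonis_Article.pdf before running this cell")
```

The double-underscore guard keeps the OCR from starting when the code is merely imported rather than run.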

Here, the name on the left should be that of the file you want to OCR (or its path if you put it in another folder), and the one on the right should be that of the new, post-processed file. The code uses the exact name of the file, ‘Vryonis_Article.pdf’. You can keep the exact same name if you want the new file to overwrite the original one. In my case, my code is directing ocrmypdf to create a new file: ‘Vryonis_Article_OCR.pdf’. Once generated, the post-processed article should appear in the same folder.

If you ever get an error, simply restart the runtime before running the cell again.

Et voilà! You can now search your PDF with Ctrl+F or copy and paste any sentence you want.

But this is not the most exciting part of this tutorial, and we may want to spice things up a little bit. Tesseract is very good at OCRing (yes, this is a verb, at least according to the WordSense dictionary) non-Latin scripts, but the process is a bit more involved. As an example, let’s take a page from the Masālik al-abṣār fī mamālik al-amṣār, written in the 14th century by the Syrian polymath al-ʿUmarī.

For this, you first need to download the Arabic trained data file from https://github.com/tesseract-ocr/tessdata/tree/main/script

Then move the downloaded file to the following folder: /usr/share/tesseract-ocr/4.00/tessdata
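If you prefer to move the file with code rather than by dragging it, a sketch along these lines should work. The upload location /content/ara.traineddata and the helper name install_traineddata are hypothetical. Keep in mind that Tesseract finds a language by the name of its file, so the code "ara" corresponds to a file called ara.traineddata.

```python
import shutil
from pathlib import Path

# Folder named in the tutorial; it may differ with other Tesseract versions.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def install_traineddata(downloaded_file, tessdata_dir=TESSDATA_DIR):
    """Move a .traineddata file into the tessdata folder and return its new path."""
    source = Path(downloaded_file)
    if source.suffix != ".traineddata":
        raise ValueError(f"{source.name} is not a .traineddata file")
    destination = Path(tessdata_dir) / source.name
    shutil.move(str(source), str(destination))
    return destination


if __name__ == "__main__":
    # Hypothetical upload location; adjust it to where your file actually is.
    uploaded = Path("/content/ara.traineddata")
    if uploaded.exists():
        print("Installed:", install_traineddata(uploaded))
    else:
        print("Upload the .traineddata file first")
```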

The process is the same as before: simply change the language code to that of the language you just added, in our case “ara”. The various language codes can be found here: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
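As a sketch of this step, the code below first lists the language codes found in the tessdata folder and only then runs the OCR with “ara”. The file names Umari_Page.pdf and Umari_Page_OCR.pdf and the helper names are hypothetical, and the folder path is the one given above.

```python
from pathlib import Path

# Folder named in the tutorial; it may differ with other Tesseract versions.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def available_languages(tessdata_dir=TESSDATA_DIR):
    """Return the language codes for which a .traineddata file is present."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    return sorted(path.stem for path in folder.glob("*.traineddata"))


def ocr_arabic(input_pdf, output_pdf):
    """Run the same OCR call as before, with the Arabic language code."""
    import ocrmypdf

    ocrmypdf.ocr(input_pdf, output_pdf, language=["ara"])


if __name__ == "__main__":
    source = Path("Umari_Page.pdf")  # hypothetical file name
    if "ara" not in available_languages():
        print("ara.traineddata is missing from the tessdata folder")
    elif not source.exists():
        print("Upload the scanned page before running this cell")
    else:
        ocr_arabic(str(source), "Umari_Page_OCR.pdf")
```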

وهو أثرى الممالك بلا احتشام خلا أنه بكثرة

And here we are. Those who can read Arabic will notice that the result is extremely impressive. This, however, is a rather neat scan and Arabic is often difficult to properly OCR because of the cursive nature of the script. If the quality of your scan is poor, you may also be able to clean it with Python beforehand for a better result. I may develop this point further in another post.

Beyond Arabic, Tesseract works very well with non-cursive Semitic scripts, and you may get excellent results with Hebrew, for example. Here is a last example from a Syriac Bible.

If you have any questions or comments about this guide, feel free to contact me.

The Phoenix Returns

Although it does not often get the same attention as other wondrous and fiery creatures, such as dragons, the marvelous phoenix has an equally deep and ancient history. One of the oldest known accounts of the phoenix myth comes from Horapollo’s Hieroglyphica, translated into ancient Greek around the 5th century B.C.E. The phoenix, called benu by the Egyptian author, becomes increasingly popular, appearing in works by Greek authors, such as Herodotus’s Histories and Antiphanes of Athens’ Homopatrioi, and in works by Latin authors, such as Tacitus’s Annals, Ovid’s Metamorphoses, Pliny the Elder’s Natural History, and of course Lactantius’ De ave phoenice, which is adapted, expanded and allegorized in the Old English Phoenix poem found in the medieval codex known as the Exeter Book (Exeter Cathedral Library MS 3501).

Phoenix rising in Aberdeen Bestiary, Aberdeen University Library, Univ Lib. MS 24, f.55v.

As I mentioned in my previous blog post centered on translating the Exeter Book Phoenix, the phoenix bird also appears in the Abrahamic tradition, from the bird of paradise (chol) in commentaries on Jewish scripture (especially the Midrash and Talmud) to the phoenix’s allegorization and comparisons with Christ himself by early Christian authors. Sometimes, these early Christian authors would use the phoenix as evidence for the possibility of Christ’s resurrection, as can be observed in Clement of Rome’s Epistula ad Corinthios, Tertullian’s De resurrectione carnis, St. Epiphanius’ Physiologus and St. Ambrose’s De excessu Satyri. This moralizing interpretation of the phoenix extends into the modern era and continues into our own contemporary age.

Dumbledore’s phoenix, Fawkes, comes to Harry Potter’s aid in “The Chamber of Secrets” (2002).

Within the realm of fantasy literature and popular fiction, Harry Potter and the Order of the Phoenix highlights the longstanding association between the phoenix and moral goodness. In this book, the day-saving gang of noble, good and trustworthy witches and wizards led by Dumbledore is known as the Order of the Phoenix. It is this group which twice stands up to Voldemort and his Death Eaters, and each time they succeed.

Indeed, the ultimate white wizard in J.K. Rowling’s fantasy world, Albus Dumbledore, has his own pet phoenix named Fawkes, who swiftly delivers the sword of Godric Gryffindor to Harry Potter in his moment of need and bravely pecks the monstrous basilisk’s eyes out in The Chamber of Secrets. Later, Fawkes saves his master from unpleasant arrest and an uncomfortable stay in the magical prison Azkaban in The Order of the Phoenix. This extremely positive association is likely a result of the medieval Christological allegory often linked to the phoenix, which parallels Christ in its death and rebirth.

Fawkes helps Dumbledore escape from the Ministry of Magic in “The Order of the Phoenix” (2007).

In the Exeter Book Phoenix, this allegory is emphasized and dramatized as the phoenix is both aligned with paradise in heaven and compared to the westward journey of the sun. Moreover, the mythical bird, like the sun, is repeatedly connected to images of glistening treasure and beautiful jewels. In my translation of the Old English Phoenix, lines 85-119, I do my best to preserve as much of the original poem’s language and semantics as possible, and even at times imitate the cadence, but as with my earlier translation of lines 1-49, I take certain creative liberties and exercise poetic license when I feel it enhances my English translation.

Stay tuned for additional forthcoming translations from the Exeter Book Phoenix, reborn as modern English poems!

Reading the Hildeburh Episode: Feuding, Vengeance & the Problem of Motherhood in Beowulf

Beowulf is historically known for its “digressions” into extratextual storytelling, and scholars have regarded these intrusions as everything from evidence of Beowulf’s oral origin to a demonstration of the problematic structure of the poem. My interpretation of this narrative interlace understands the various stories as directly engaged with the main subject of the plot by providing parallel circumstances that highlight important aspects of the main narrative centered on Beowulf and monster-slaying.

Much ink has been spilled on the Sigemund and Heremod episodes. Some read these stories as foils of each other with Sigemund representing a positive model for Beowulf to follow and Heremod representing a negative model that serves as a warning for the young hero. However, Mark Griffith has demonstrated how even the Sigemund episode is coded with misdeeds, and he has suggested that many of the details included in the story portray the hero rather pejoratively.

There are numerous other “digressions” within Beowulf, though these two have traditionally gained the lion’s share of attention in the scholarship. Today, I want to look closely at the form and possible narrative function of the Hildeburh episode (1076-1159), frequently called the Finn episode, which follows directly after the two previously referenced stories. All three serve as entertainment during the celebration following Grendel’s defeat and Beowulf’s triumph.

John Howe’s illustration of the funeral of king Finn (2005).

While the first two “digressions” seem to parallel aspects of Beowulf’s own character, the episode centered on Hildeburh conveys a very different message, and I would argue, perhaps to a specific audience. While the first two stories focus on heroes who possess great strength, the third story centers on something only hinted at thus far in the poem: maternal loss.

Just prior to the celebratory storytelling in Heorot, we learn that Wealhðeow, queen of the Danes, advises her husband, King Hroðgar, to place his trust in his nephew and kinsman Hroðulf rather than investing in a foreign hero, like Beowulf. Thomas Shippey has noted the irony in this as earlier in the poem there is reference to the burning of Heorot, which is perpetrated by Hroðulf and results in the murder of both of Hroðgar’s sons and Hroðulf’s usurpation. These enigmatic references to a future Danish power struggle might easily be missed, but they nevertheless frame Wealhðeow as a mother who will lose her sons to violence and kin-slaying, possibly within the broader context of a feud between rival brothers for the throne. After all, Hroðgar is not the first in line, and he even remarks of his late (and elder) brother Heorogar—deep in his cups—that se wæs betera ðonne ic “he was better than I” (469) presumably referring to his prior kingship.

J. R. Skelton’s image of Wealhðeow as a cup-bearer in Stories of Beowulf by Henrietta Elizabeth Marshall (1908).

Indeed, the need for Hroðgar to build Heorot at all suggests that the former Danish mead hall is no longer around, which invites further questions such as whether its destruction was a result of inter-family violence and Hroðgar’s overthrow of his older brother to claim the Danish crown. Alas, the poem does not tell.

Although the Hildeburh episode concludes the celebration of Beowulf’s victory over Grendel, its mood is far from jovial. The tale relates a feud between the Danes and the Frisians in which Hildeburh is caught in the middle. Hildeburh’s song relates how her bearn ond broðor “sons and brothers” (1074) find themselves on opposite sides of a feud in which everyone loses: all of them die in the ensuing violence. Indeed, Hildeburh’s role as a Danish princess made Frisian queen, herself a failed freoðuwebbe “peace-weaver” (1942), is highlighted by the mutual deaths of her family members. The feud takes both Finnes eaferan “the heirs of Finn” (1068) and hæleð Healfdena “heroes of the half-Danes” (1069), as the parallel descriptions of how wig ealle fornam “war took all” (1080) and lig ealle forswealg “fire swallowed all” (1122) connect warfare with their shared cremation next to one another on the funeral pyre.

Hildeburh metodsceaft bemearn “bemoaned her fate” (1077) because she has no way to avenge her kinsmen. She is on both sides and therefore on neither. No matter what happens in the ongoing feud between her peoples, Hildeburh will suffer loss. And again, a mother loses her sons. Moreover, her tale parallels the foreshadowed fate of Wealhðeow’s sons, who will be betrayed by her treacherous nephew Hroðulf (1180-7). 

As I discuss in much greater depth in my dissertation subchapter “The Ethical Paradox of Grendel’s Mother’s Revenge” (358-370), it is this contextual framework within which Grendel’s mother appears in the narrative (out of nowhere) as a wrecend “avenger” to wreak vengeance upon those who murdered her son. In a sense, Grendel’s mother does, and is able to do, what Hildeburh cannot. And, as Leslie Lockett and others have observed, Grendel’s mother’s actions represent a legally and ethically “fair” exchange: a life for a life. This engenders further sympathy for her character’s suffering and retaliation, especially following directly after the context established by the Hildeburh episode.

Image of monstrous hybrid-woman from The Wonders of the East in British Library, Cotton Vitellius A.xv, f.105v.

Even after Grendel’s mother is slain, the pattern repeats. Not long after we meet Queen Hygd in Geatland, her son is killed in a feud with the Swedish king Onela, leaving Beowulf to inherit the throne. Yet another mother loses her son to a feud, underscoring the narrator’s comments on the violence between the Danes and the Grendelkin: ne wæs þæt gewrixle til,/ þæt hie on ba healfa bicgan scoldon/ freonda feorum “that was not a good exchange, that they on both sides should pay with the lives of kinsmen” (1304-06).

We do not know who wrote Beowulf, and probably never will. Nevertheless, at this point in the poem, I am reminded of Virginia Woolf’s argument in A Room of One’s Own: “I would venture to guess that Anon, who wrote so many poems without signing them, was often a woman.” While I am not arguing for a female author of the poem (though why not), I would contend that there seem to be strong rhetorical appeals directed at women, especially mothers, within Beowulf, which suggest that they were likely part of the poem’s anticipated audience.

