Optical Character Recognition with Tesseract: a Tutorial for Medievalists

This tutorial is for those of you who want to learn some basic programming in Python for the digital humanities, but also for those who have never programmed or who are filled with terror at the sight of a single line of code (trust me, I know the feeling!). By the end of this guide, you should know how to perform optical character recognition (OCR) to make a PDF searchable. For this, we are going to use “ocrmypdf”, a Tesseract-based Python package with wonderful capabilities. Everything will be done online from your web browser, so don’t worry, you will not have to install anything on your computer!

I have tried to keep the tutorial very simple and straight to the point, at the cost of occasionally sacrificing some useful and important explanations about the code. For that, I profusely apologize to my colleagues in the department of computer science. Please, do not send the Spanish Inquisition.

The name “Python” is a tribute to Monty Python (please excuse the poor joke).

Why Tesseract and ocrmypdf?

Some of you may be familiar with, or even regular users of, the OCR function provided by Adobe Acrobat Pro DC. Being able to search a text with ctrl+f is one of the simplest and most efficient ways to support textual analysis and research. Acrobat is powerful and very easy to use, but this comes at a price. First, Acrobat works only with a handful of common modern languages such as English, French, German or Japanese. Tesseract, on the other hand, can recognize characters from a broad variety of modern and classical languages, including, but not limited to, Armenian, Classical Arabic, Classical Greek, Syriac, and Old Georgian. Second, Tesseract is free and open-source software, which makes it a more cost-efficient option than expensive commercial products, and I am sure many of you would rather settle for the free but equally powerful alternative.

First, go to https://colab.research.google.com/

Google Colab is a free, hosted Jupyter notebook service that saves your notebooks to your personal Google Drive. Just remember to hit the save button before closing the page.

Sign in with a Google account and open a new notebook. You will want to use your @nd.edu address, but any personal Google account will work too.

As a first step, we need to download ocrmypdf and all its dependencies. To do so, simply type the following lines. The first will download the Python package ocrmypdf, while the other lines will deal with the dependencies. For those of you new to coding, this is where you learn the first rule of coding: any error in spelling, indentation and so forth can break your code. Be careful!
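The install lines themselves are not reproduced in this text. As a minimal sketch, assuming the pip package name ocrmypdf and the Debian/Ubuntu package names tesseract-ocr and ghostscript (two programs ocrmypdf relies on), the commands could be assembled and run from a Python cell as shown here. The helper names build_install_commands and run_install are hypothetical and are not part of ocrmypdf.

```python
import subprocess
import sys


def build_install_commands():
    """Return the install commands as argument lists (nothing is run yet)."""
    return [
        # The Python package itself, installed into the notebook's interpreter.
        [sys.executable, "-m", "pip", "install", "ocrmypdf"],
        # System programs ocrmypdf relies on (assumed Debian/Ubuntu package names).
        ["apt-get", "install", "-y", "tesseract-ocr", "ghostscript"],
    ]


def run_install(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run by default: shows what would be executed without changing anything.
run_install(build_install_commands())
```

To actually install, call run_install(build_install_commands(), dry_run=False). In Colab you can also type each command on its own line with an exclamation mark in front of it.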

When the above lines have been written, run the cell by pressing Ctrl+Enter or by clicking the arrow button in the upper left corner of the cell. The download process should take around a minute. Once the download is complete, you should see a little green tick next to that arrow.

Once the package and its dependencies have been downloaded, we need to import “ocrmypdf” so that we can put it to work. Add a new code cell to your Colab notebook by clicking on + Code and write the following code:
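In its simplest form the import is the single line "import ocrmypdf". The sketch below wraps that line in a check, so that a missing installation produces a readable message instead of an error. The helper is_installed is a hypothetical name introduced for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can find the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ocrmypdf"):
    import ocrmypdf  # the import this step asks for
    print("ocrmypdf is ready to use")
else:
    print("ocrmypdf was not found; run the install cell first")
```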

Next, add a copy of your scanned PDF to the “files/content” folder on the left side of your screen (or to any other folder of your choice; you will just have to note its path somewhere). In our case, we are going to work with the first page of an article on the Mevlevi Sufi order published by the Byzantinist Speros Vryonis Jr.

In a new cell, enter the following lines of code (beware, the underscores are double underscores!).

Here, the name on the left should be that of the file you want to OCR (or its path if you put it in another folder), and the one on the right should be that of the new, post-processed file. The code uses the exact name of the file, ‘Vryonis_Article.pdf’. You can keep the exact same name if you want the new file to overwrite the original one. In my case, my code is directing ocrmypdf to create a new file: ‘Vryonis_Article_OCR.pdf’. Once generated, the post-processed article should appear in the same folder.
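The call described above can be sketched as follows. This is a minimal version, assuming the ocrmypdf.ocr function from the package's Python API and the two file names used in this tutorial. The helpers ocr_output_name and run_ocr are hypothetical names added for illustration, and the line with the double underscores is the one the warning refers to.

```python
from pathlib import Path

# The scanned file uploaded to the Colab folder (name taken from this tutorial).
INPUT_PDF = Path("Vryonis_Article.pdf")


def ocr_output_name(input_pdf):
    """Build the name of the new file by adding _OCR before the extension."""
    input_pdf = Path(input_pdf)
    return input_pdf.with_name(f"{input_pdf.stem}_OCR{input_pdf.suffix}")


def run_ocr(input_pdf, output_pdf):
    """Ask ocrmypdf to read input_pdf and write a searchable copy to output_pdf."""
    import ocrmypdf  # imported here so the helpers above work without it

    # Left: the file to OCR. Right: the new, post-processed file.
    ocrmypdf.ocr(input_pdf, output_pdf)


if __name__ == "__main__":  # note the double underscores on both sides
    if INPUT_PDF.exists():
        run_ocr(INPUT_PDF, ocr_output_name(INPUT_PDF))
    else:
        print(f"{INPUT_PDF} not found; upload it or adjust the path")
```

With the file name above, ocr_output_name returns Vryonis_Article_OCR.pdf, so the original scan is left untouched.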

If you ever get an error, simply restart the runtime before running the cell again.

Et voilà! You can now search your PDF with Ctrl+F or copy and paste any sentence you want.

But this is not the most exciting part of this tutorial, and we may want to spice things up a little bit. Tesseract is very good at OCRing (yes, this is a verb, at least according to the WordSense dictionary) non-Latin scripts, but the process is a bit more involved. As an example, let’s take a page from the Masālik al-abṣār fī mamālik al-amṣār, written in the 14th century by the Syrian polymath al-ʿUmarī.

For this, you first need to download the Arabic trained data from https://github.com/tesseract-ocr/tessdata/tree/main/script

Then move the downloaded file to the following folder: /usr/share/tesseract-ocr/4.00/tessdata
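If you prefer to move the file with code rather than by dragging it in the file browser, the step can be sketched as follows. This assumes the tessdata folder named above and a downloaded file ending in .traineddata. The function install_traineddata is a hypothetical helper, not part of Tesseract or ocrmypdf.

```python
import shutil
from pathlib import Path

# Folder given in this tutorial; it may differ with other Tesseract versions.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def install_traineddata(downloaded_file, tessdata_dir=TESSDATA_DIR):
    """Move a downloaded .traineddata file into the tessdata folder."""
    source = Path(downloaded_file)
    target_dir = Path(tessdata_dir)
    if source.suffix != ".traineddata":
        raise ValueError(f"{source.name} is not a .traineddata file")
    if not target_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {target_dir}")
    target = target_dir / source.name
    # shutil.move also works when the two folders are on different drives.
    shutil.move(str(source), str(target))
    return target
```

A call such as install_traineddata("/content/ara.traineddata") would then place the file where Tesseract looks for it; the file name in this call is an assumption and should match the file you downloaded.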

The process is the same as before; simply change the language code to that of the language you just added, in our case “ara”. The various language codes can be found here: https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html
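As a rough sketch of this step, the code below first checks that the trained data for the chosen language code is present in the tessdata folder and then passes the code to ocrmypdf through its language parameter. The helper names and the two file names in the closing comment are hypothetical. The check also assumes that the file in the tessdata folder is named after the language code, for example ara.traineddata.

```python
from pathlib import Path

# Folder given in this tutorial; it may differ with other Tesseract versions.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def available_languages(tessdata_dir=TESSDATA_DIR):
    """List the language codes that have a .traineddata file in the folder."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def run_ocr(input_pdf, output_pdf, language, tessdata_dir=TESSDATA_DIR):
    """OCR input_pdf with the given language code, e.g. "ara" for Arabic."""
    if language not in available_languages(tessdata_dir):
        raise ValueError(f"no trained data found for language code {language!r}")
    import ocrmypdf  # imported here so the check above works without it

    # The language is passed as a list, so more codes can be added later.
    ocrmypdf.ocr(input_pdf, output_pdf, language=[language])


# Example call (file names are placeholders for your own scan):
# run_ocr("Masalik_page.pdf", "Masalik_page_OCR.pdf", "ara")
```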

وهو أثرى الممالك بلا احتشام خلا أنه بكثرة

And here we are. Those who can read Arabic will notice that the result is extremely impressive. This, however, is a rather neat scan and Arabic is often difficult to properly OCR because of the cursive nature of the script. If the quality of your scan is poor, you may also be able to clean it with Python beforehand for a better result. I may develop this point further in another post.

Beyond Arabic, Tesseract works very well with other non-cursive Semitic scripts, and you may get excellent results with Hebrew for example. Here is a last example from a Syriac Bible.

If you have any questions or comments about this guide, feel free to contact me.

Romain Thurin
PhD Candidate
Medieval Institute

Working in the Archives – The Vatican Secret Archives

This post continues an ongoing special series of the Notre Dame Medieval Studies Research Blog called “Working in the Archives.” This series focuses on practical knowledge for accessing archives across Europe and North Africa, for making each archival visit a productive one, and for enhancing the quality of life of the researcher during the visit.

This entry in the series will discuss how to navigate a trip to one of the most famous archives in the world: the Archivio Segreto Vaticano (ASV), or the Vatican Secret Archives.

Below, I will discuss what is needed to make an archival visit to the ASV productive. I explain how to get to the archive using the various modes of transit in Rome (bus, metro, walking), what is needed to access the archive, how to search for material, how to request that material, and other essential information needed for a successful research trip.

How to Get There (ASV)

Cortile del Belvedere – 00120 Città del Vaticano

Unless you are staying near the archive, public transit is the most affordable way to get around Rome and to the Vatican. A bus will get you closest to the ASV: buses 32, 81, and 590 drop off at the Piazza del Risorgimento, the stop nearest the Porta Sant’Anna, the entry to Vatican City on its eastern side. If you would like to take the metro, the nearest stop is Ottaviano, on line A. There are three metro lines in Rome, with lines A and B intersecting at Roma Termini, Rome’s main train station, and lines A and C connecting at the San Giovanni stop. A weekly public transit ticket (7 calendar days) costs 24 euros. I found this method the most convenient, as the ticket allows access to both buses and the metro.

The ASV website does not say by which gate a researcher visiting the archive is supposed to enter. As mentioned just above, the gate is the Porta Sant’Anna, which is the gate by which cars enter the Vatican. Once at the gate, you must pass through multiple lines of security, beginning with the Swiss Guard watching the gate. Prepare yourself for an awkward first exchange, as you will not have your research card the first time you enter the archive. You must collect it at the archive itself. Do not expect the guard to know English, and be ready with a few prepared sentences or a piece of paper explaining the situation. After the first visit, it is a much less stressful experience.

After you pass through security, head up the Via Sant’Anna into the Belvedere Courtyard, then take a right. The ASV overlooks the adjacent courtyard, the Cortile della Biblioteca, sitting next to the Sistine Salon.

What You Need to Access the Archive

Of all the archives I have personally visited, the Vatican Secret Archives is certainly the most complicated to access. Before visiting the archive, one must first fill out an application online: http://www.archiviosegretovaticano.va/content/archiviosegretovaticano/en/consultazione/admission-request.html. Before filling out the application, the researcher must have a detailed research plan: the holdings one plans to consult and the length and dates of the planned visit must be known before approval is granted. The application itself contains a Collection Index by which you can identify the desired collection. For those not confident in their Italian, however, navigating it may be difficult. I would recommend consulting Francis X. Blouin’s Vatican Archives: An Inventory and Guide to Historical Documents of the Holy See as a supplement to the application process.[1] Finally, an affiliation with a university and a letter of introduction are both required.

The approval process for an ASV card takes less than a week; in my experience, the application was handled and approved on the same day.

Some Important Details of the ASV

After your research plan and topic have been approved, the ASV will prepare your card for pickup from the archival reception counter. The ASV does not send you your research card in the mail! You must first go to the archive to get the card, and subsequent visits pass much more smoothly. Additionally, while it is always nice to dress professionally while conducting archival research, there is an actual dress code for researchers in the Secret Archives and its subsidiaries. Dress clothes are required, and I personally wore a blazer, although it is not specifically mandated.

Be prepared for several barriers to effective archival research when working at the Vatican Secret Archives. First, you cannot take photos in the Secret Archive, which is unsurprising considering the nature of the material. Second, the ASV does not allow consultation of more than 5 archival items per day (3 in the morning and 2 more in the afternoon). Third, copies of archival material, digital or print, are extremely expensive. The archive charges a flat fee of 8 euros to scan any archival item. On top of this flat fee, the archive charges 2 euros per page for the first hundred pages scanned; each page after the first hundred costs 0.80 euros. So, were I to request a single scanned page, it would cost me 10 euros. Two pages would cost me 12 euros, and so on. Scanning one page each from two different archival units would cost 20 euros.

If you are applying for grants to research at the ASV, I strongly encourage you to factor this cost into your grant applications.
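For budgeting purposes, the fee structure described above can be turned into a small calculator. This is only a sketch: it assumes that the price after the first hundred pages is 0.80 euros per page and that the flat fee applies once per archival item, and the function names are hypothetical. Amounts are kept in cents to avoid rounding surprises.

```python
FLAT_FEE_CENTS = 800      # 8 euros per archival item scanned
FIRST_TIER_CENTS = 200    # 2 euros per page for the first hundred pages
LATER_TIER_CENTS = 80     # assumed: 0.80 euros per page after the first hundred
FIRST_TIER_PAGES = 100


def scan_cost_cents(pages):
    """Cost in cents of scanning the given number of pages from one item."""
    if pages <= 0:
        return 0
    first_tier = min(pages, FIRST_TIER_PAGES)
    later_tier = pages - first_tier
    return (FLAT_FEE_CENTS
            + first_tier * FIRST_TIER_CENTS
            + later_tier * LATER_TIER_CENTS)


def total_cost_euros(pages_per_item):
    """Total cost in euros for a list of page counts, one entry per item."""
    return sum(scan_cost_cents(pages) for pages in pages_per_item) / 100


print(total_cost_euros([1]))     # one page from one item: 10.0
print(total_cost_euros([2]))     # two pages from one item: 12.0
print(total_cost_euros([1, 1]))  # one page each from two items: 20.0
```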

As a final note, the ASV closes at the end of June and reopens in September, leaving no room for scholars or researchers planning to visit in the later summer months. This information is readily available, but it is still an important thing to consider in planning your trip.

Quality of Life

One of the nicest parts of conducting research in Rome is the abundance of good food and good coffee to be found almost anywhere in the center of the city. There are many little coffee shops and restaurants right next to the Porta Sant’Anna, although they are expensive and crowded. If you don’t mind a little walk, there are cheaper (but still good!) restaurants and coffee shops south of the Vatican, along the Via Aurelia and the Via di Porta Cavalleggeri.

Regarding places to stay, Airbnb and the like can be quite expensive in the center of Rome and near the Vatican, especially if you are traveling alone. A financially sensible alternative is to stay in one of the many monasteries located near the Vatican. Many of these are populated with practicing monks and nuns, providing a much different experience than a normal hotel or B&B. I stayed in the Santa Emilia De Vialar, about a 20-minute walk from the Vatican gates.

Sean Sapp
University of Notre Dame

[1] Francis X. Blouin, Vatican Archives: An Inventory and Guide to Historical Documents of the Holy See (Oxford: Oxford University Press, 1998).

Working in the Archives – Manuscript Research at the Khizana al-Hasaniyya, Rabat

One of the major manuscript collections in Morocco is currently the property of His Majesty King Mohammed VI. The Khizana al-Hasaniyya, or the Bibliothèque Royale as it is also known, is housed in the royal palace in Rabat and directly attached to the royal residences. The Researcher Annex, where most guests of the library work, is detached from the palace yet located within the palatial environs.

Due to the personal nature and physical location of this library, it is necessary for the aspiring researcher to observe security protocol and to put their best professional foot forward. Like many things in Morocco, the rules will not be explained in detail but everyone will act as though you know them. When in doubt, ask questions.

Basic Details

The Khizana al-Hasaniyya is part of the Qasr al-Malik or Palais Royal in Rabat and consists of the main manuscript library attached to the royal residences and the Researcher Annex. The Researcher Annex is open from 9am to 4pm, Monday through Friday. Most work is done in the Researcher Annex, a new building completed sometime after 2014 and equipped with computers for manuscript consultation. Be advised that electronic devices such as phones, tablets, and laptops are forbidden in the Researcher Annex, and all bags must be stored in the small cupboard in the corner of the Researcher Annex.

At the time of my visit, the Khizana al-Hasaniyya manuscript library was open only to researchers with specific codicological projects, though it is possible for one of the librarians to give you a tour of the Khizana. The manuscripts they have on display are stunning, from early Qur’ans to musicology texts to a copy of Ibn Khaldun’s al-Muqaddima copied by one of his students and annotated in the margins by Ibn Khaldun himself.

Researcher cards can be obtained by contacting the director of the Khizana al-Hasaniyya, Dr. Ahmed Chouqui Binebine (a.binebine@gmail.com), and requesting a meeting with him to discuss your research. Dr. Binebine speaks Darija (the Moroccan dialect of Arabic), Fusha (Modern Standard Arabic or MSA) and French. If you do not speak these languages, it is best to arrange your meeting with the help of another scholar with current researcher privileges; that way, they can advocate on your behalf while translating as needed.

If you arrive in Morocco and you don’t have someone in-country who can pull strings for you at the Hasaniyya, contact Dr. James Miller, the director of MACECE in Rabat. He is used to helping Americans make connections with Moroccans and may know someone who can help.

Bring the following to your meeting: your passport; a copy of the passport face page and of the page with your date of entry to Morocco, or your Carte de Séjour; your research clearance and/or Lettre D’Attestation; and two passport-sized photographs. Unlike at the BNRM, there is no fee associated with this card. It is unclear whether researcher cards are valid for a specific amount of time or whether this time can be negotiated.

For the record, I was able to obtain a researcher card valid for three months. This card is an index-card-sized paper card written in Arabic and stamped with the official seal; you will need it for subsequent visits to the library.

During your meeting, you can also request a copy of the General Index for the Khizana al-Hasaniyya along with other catalogs relevant to your research.  This is invaluable as copies of the General Index are quite hard to come by in the United States (Emory and the Metropolitan Museum of Art have copies of the General Index).

The General Index of Manuscripts for the Khizana al-Hasaniyya. All entries are in Arabic.

While the General Index just lists the manuscripts alphabetically and with little information about the manuscript, the more in-depth catalogs, such as the catalogs on Ash’arite manuscripts and those concerned with Islamic law, are much more detailed.  You can request copies of these subject-specific indexes from Dr. Binebine.

Additional manuscript catalogs; text in Arabic.

Dr. Binebine and other librarians at the Khizana may also give you additional texts, such as their publications on codicology. Dr. Binebine’s 2015 book, Histoire des bibliothèques au Maroc, is worth having for any medievalist or manuscript specialist.

The dress code is business wear, with many Moroccan researchers wearing traditional Moroccan clothes such as djellabas. Looking like you have a valid reason to go to the royal palace will help convince the guards and employees that you are not some random tourist hoping to see the king.

For those spending the day or at least the lunch hour at the Khizana al-Hasaniyya, there is a small arcade opposite the soccer field near the Researcher Annex where you can get a pizza (15 MAD) and fresh orange juice (10 MAD) as well as a sandwich on occasion or snacks from the nearby hanout (kiosk). It is best to be discreet about drinking water in the Researcher Annex just to avoid any problems.

Getting There

The Qasr al-Mālik is a massive compound located at the end of the Avenue Mohammed V in downtown Rabat and is guarded around the clock. For your first visit, you will need to present your passport and tell the guard that you have an appointment with the director of the Khizana; the guard will then phone to confirm your visit. For subsequent visits, saying that you are a researcher (chercheur/chercheuse) at the library and presenting your researcher card is enough to get in, though it never hurts to have your passport and your Lettre D’Attestation in case someone asks for it.

Unlike European palaces, the Qasr al-Mālik is more akin to a city within a city, making it difficult for the first-time visitor to get to where they are going. For the person going alone, it is best to take a petit taxi to the main gate, Bab Soufara, and then have the cab driver continue through the gate and take you directly to the Khizana.

NB: Make sure the driver takes you to the right spot; simply asking for the Khizana al-Hasaniyya might bring you to the Researcher Annex or it might bring you to the Khizana itself. The same is true when it comes to asking for directions inside the compound. If the cab driver wants to leave you at the gate, it is a 10-15 minute walk to the Researcher Annex from Bab Soufara.

Conducting Research

As previously mentioned, the physical manuscripts are largely off-limits to most researchers, meaning that the majority of manuscript work is now done digitally. To view a manuscript, you will need to fill out a small request form at the desk in the Researcher Annex and give it to the librarian sitting there; they will then call up the digital copies of the manuscript and load them on one of the computers lining the walls. When the files are ready, the librarian will call you over to the computer.

To request digital copies of the manuscripts to take with you, you will need to write directly to the director of the Khizana, Dr. Binebine, and state what it is you want and why you need it. Do not email Dr. Binebine; instead, present a printed and signed copy of your letter to the librarian at the Researcher Annex and ask them to give it to Dr. Binebine. The librarian will then convey your request and, if it is approved, a digital copy of the manuscript will be given to you within 24-48 hours.

Prior to 2013, digital copies were presented to researchers on CD but as of March 2017, a colleague was able to load the files directly onto a USB stick.

NB: You may be limited in the number of folia you are allowed to request per manuscript. Some have reported that they were only able to request 10 folia of a manuscript, while others said they were able to get 40-50 folia. As such, plan your requests and research accordingly.

Language

The Khizana al-Hasaniyya runs on Arabic, especially Darija. The various manuscript indexes, from the General Index to the more thematic indexes of manuscripts, are in Arabic, along with the manuscript request forms. Researchers should have a solid command of the Arabic script and decent penmanship in order to correctly write out their requests.

Spoken French can get one by in a pinch, especially if one’s vocabulary related to manuscripts is not as strong in Darija or Fusha (MSA) as it is in French. For those researchers who are Caucasian or black, the staff may speak to you in somewhat broken French, assuming that you either come from France (if white) or one of the francophone African countries (if black).

The computers in the Researcher Annex all run Microsoft Windows and are set to French, not English. The screens are touch screens, meaning that you can pinch and zoom in on the images, as well as swipe back and forth. However, most of the other researchers in the Annex use the mouse, so it’s probably best to follow their lead.

When it comes to writing a manuscript request letter, the letter must be in Arabic (MSA) or French. If you are unsure about the protocol or language within the letter, ask the librarian in the Researcher Annex if they have a copy of a request on file for you to use.

Misc.

The Royal Palace is a trip for a medievalist, not least because it is a functioning palace on the scale of medieval administrative cities. Those who live on the palace grounds and who work there inherited the position from their family members, many of whom may have been part of the royal slave retinues just under two hundred years ago. To see the Royal Palace in Rabat gives one a good appreciation for the scale of medieval administrative cities like Baghdad, Samarra, Qayrawan, Cairo, Fez, and Marrakech and for the way in which such palaces were cities in their own right.

For additional resources on the Hasaniyya as well as other manuscript libraries in Morocco, see J. Hendrickson and S. Adil, “A Guide to Arabic Manuscript Libraries in Morocco: Further Developments,” (2013)