Searching CORD-19 at the Distant Reader

Posted on July 26, 2021 in Distant Reader by Eric Lease Morgan

This blog posting documents the query syntax for an index of scientific journal articles called CORD-19.

CORD-19 is a data set of scientific journal articles on the topic of COVID-19. As of this writing, it includes more than 750,000 items. This data set has been harvested, pre-processed, indexed, and made available as a part of the Distant Reader. Access to the index is freely available to anybody and everybody.

The index is rooted in a technology called Solr, a very popular indexing tool. The index supports simple searching, phrase searching, wildcard searches, fielded searching, Boolean logic, and nested queries. Each of these techniques is described below:

  • simple searches – Enter any words you desire, and you will most likely get results. In this regard, it is difficult to break the search engine.
  • phrase searches – Enclose query terms in double-quote marks to search the query as a phrase. Examples include: "waste water", "circulating disease", and "acute respiratory syndrome".
  • wildcard searches – Append an asterisk (*) to any non-phrase query to match every word beginning with the given characters, which works much like a stemming operation. For example, the query virus* will return results including the words virus and viruses.
  • fielded searches – The index has many different fields. The most important include: authors, title, year, journal, abstract, and keywords. To limit a query to a specific field, prefix the query with the name of the field and a colon (:). Examples include: title:disease, abstract:"cardiovascular disease", or year:2020. Of special note is the keywords field. Keywords are sets of statistically significant and computer-selected terms akin to traditional library subject headings. The use of the keywords field is a very efficient way to create a small set of very relevant articles. Examples include: keywords:mrna, keywords:ribosome, or keywords:China.
  • Boolean logic – Queries can be combined with three Boolean operators: 1) AND, 2) OR, or 3) NOT. The use of AND creates the intersection of two queries. The use of OR creates the union of two queries. The use of NOT creates the negation of the second query. The Boolean operators are case-sensitive. Examples include: covid AND title:SARS, abstract:cat* OR abstract:dog*, and abstract:cat* NOT abstract:dog*
  • nested queries – Boolean logic queries can be nested to return more sophisticated sets of articles; nesting allows you to override the way rudimentary Boolean operations get combined. Use matching parentheses (()) to create nested queries. An example includes ((covid AND title:SARS) OR abstract:cat* OR abstract:dog*) NOT year:2020. Of all the different types of queries, nested queries will probably give you the most grief.
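Nested queries are easier to get right when they are assembled one layer at a time. The following is a minimal Python sketch of that idea, not a part of the Reader itself. The helper names (fielded and combine) are made up for illustration, and "q" and "rows" are standard Solr request parameters. The address of the index is not given here, so the sketch only builds and encodes the query.

```python
from urllib.parse import urlencode


def fielded(field, term):
    """Build a fielded clause; multi-word terms are quoted so they search as a phrase."""
    if " " in term:
        term = f'"{term}"'
    return f"{field}:{term}"


def combine(operator, *clauses):
    """Join clauses with an upper-case Boolean operator and nest them in parentheses."""
    return "(" + f" {operator} ".join(clauses) + ")"


# Rebuild the nested example from the list above, one layer at a time.
inner = combine("AND", "covid", fielded("title", "SARS"))
outer = combine("OR", inner, fielded("abstract", "cat*"), fielded("abstract", "dog*"))
query = f"{outer} NOT {fielded('year', '2020')}"
print(query)

# A real request would append this encoded string to the index's search URL.
print(urlencode({"q": query, "rows": 10}))
```

Because each layer is wrapped in its own parentheses, the parentheses always match, and matching parentheses are the usual source of grief.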

Distant Reader Workshop Hands-On Activities

Posted on July 9, 2021 in Distant Reader by Eric Lease Morgan

This is a small set of hands-on activities presented for the Keystone Digital Humanities 2021 annual meeting. The intent of the activities is to familiarize participants with the use and creation of Distant Reader study carrels. This page is also available as a PDF file designed for printing.

Introduction

The Distant Reader is a tool for reading. Given an almost arbitrary amount of unstructured data (text), the Reader creates a corpus, applies text mining against the corpus, and returns a structured data set amenable to analysis (“reading”) by students, researchers, scholars, and computers.

The data sets created by the Reader are called “study carrels”. They contain a cache of the original input, plain text versions of the same, many different tab-delimited files enumerating textual features, a relational database file, and a number of narrative reports summarizing the whole. Given this set of information, it is easy to answer all sorts of questions that would have previously been very time consuming to address. Many of these questions are akin to newspaper reporter questions: who, what, when, where, how, and how many.

Using more sophisticated techniques, the Reader can help you elucidate a corpus’s aboutness, plot themes over authors and time, create maps, create timelines, or even answer sublime questions such as, “What are some definitions of love, and how did the writings of St. Augustine and Jean-Jacques Rousseau compare to those definitions?”

The Distant Reader and its library of study carrels are located at:

Activity #1: Compare & contrast two study carrels

These tasks introduce you to the nature of study carrels:

  1. From the library, identify two study carrels of interest, and call them Carrel A and Carrel B. Don’t think too hard about your selections.
  2. Read Carrel A, and answer the following three questions: 1) how many items are in the carrel, 2) if you were to describe the content of the carrel in one sentence, then what might that sentence be, and 3) what are some of the carrel’s bigrams that you find interesting and why.
  3. Read Carrel B, and answer the same three questions.
  4. Answer the question, “How are Carrels A and B similar and different?”

Activity #2: Become familiar with the content of a study carrel

These tasks stress the structured and consistent nature of study carrels:

  1. Download and uncompress both Carrel A and Carrel B.
  2. Count the number of items (files and directories) at the root of Carrel A. Count the number of items (files and directories) at the root of Carrel B. Answer the question, “What is the difference between the two counts?”. What can you infer from the answer?
  3. Open any of the items in the directory/folder named “cache”, and all of the files there ought to be exact duplicates of the original inputs, even if they are HTML documents. In this way, the Reader implements aspects of preservation. A la LOCKSS, “Lots of copies keep stuff safe.”
  4. From the cache directory, identify an item of interest; pick any document-like file, and don’t think too hard about your selection.
  5. Given the name of the file from the previous step, open the file with the similar name but located in the folder/directory named “txt”, and you ought to see a plain text version of the original file. The Reader uses these plain text files as input for its text mining processes.
  6. Given the name of the file from the previous step, use your favorite spreadsheet program to open the similarly named file but located in the folder/directory named “pos”. All files in the pos directory are tab-delimited files, and they can be opened in your spreadsheet program. I promise. Once opened, you ought to see a list of each and every token (“word”) found in the original document as well as the tokens’ lemma and part-of-speech values. Given this type of information, what sorts of questions do you think you can answer?
  7. Open the file named “MANIFEST.htm” found at the root of the study carrel, and once opened you will see an enumeration and description of all the folders/files in any given carrel. What types of files exist in a carrel, and what sorts of questions can you address if given such files?
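The counting and matching in the steps above can also be done with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical, the ".pos" file extension is a guess, and a tiny stand-in carrel is built in a temporary directory so the code runs without a real download.

```python
import tempfile
from pathlib import Path


def count_root_items(carrel):
    """Count the files and directories sitting at the root of a carrel."""
    return sum(1 for _ in Path(carrel).iterdir())


def siblings(carrel, cached_name, folders=("txt", "pos")):
    """Find files in other folders that share a cached file's base name."""
    stem = Path(cached_name).stem
    found = {}
    for folder in folders:
        matches = sorted((Path(carrel) / folder).glob(stem + ".*"))
        found[folder] = matches[0].name if matches else None
    return found


# Build a tiny stand-in carrel; the file names and extensions are made up.
with tempfile.TemporaryDirectory() as tmp:
    carrel = Path(tmp) / "carrel-a"
    for folder, name in [("cache", "homer.pdf"), ("txt", "homer.txt"), ("pos", "homer.pos")]:
        (carrel / folder).mkdir(parents=True, exist_ok=True)
        (carrel / folder / name).write_text("placeholder")
    (carrel / "MANIFEST.htm").write_text("placeholder")

    root_count = count_root_items(carrel)
    related = siblings(carrel, "homer.pdf")
    print(root_count, related)
```

Point the two functions at your own uncompressed Carrel A and Carrel B to compare the counts.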

Activity #3: Create study carrels

Anybody can create study carrels, there are many ways to do so, and here are two:

  1. Go to https://distantreader.org/create/url2carrel, and you may need to go through ORCID authentication along the way.
  2. Give your carrel a one-word name.
  3. Enter a URL of your choosing. Your home page, your institutional home page, or the home page of a Wikipedia article are good candidates.
  4. Click the Create button, and the Reader will begin to do its work.
  5. Create an empty folder/directory on your computer.
  6. Identify three or four PDF files on your computer, and copy them to the newly created directory. Compress (zip) the directory.
  7. Go to https://distantreader.org/create/zip2carrel, and you may need to go through ORCID authentication along the way.
  8. Give your carrel a different one-word name.
  9. Select the .zip file you just created.
  10. Click the Create button, and the Reader will begin to do its work.
  11. Wait patiently, and along the way the Reader will inform you of its progress. Depending on many factors, your carrels will be completed in as little as two minutes or as long as an hour.
  12. Finally, repeat Activities #1 and #2 with your newly created study carrels.

Extra credit activities

The following activities outline how to use a number of cross-platform desktop/GUI applications to read study carrels:

  • Print any document found in the cache directory and use the traditional reading process to… read it. Consider using an active reading process by annotating passages with your pen or pencil.
  • Download Wordle from the Wayback Machine, a fine visualization tool. Open any document found in the txt directory, and copy all of its content to the clipboard. Open Wordle, paste in the text, and create a tag cloud.
  • Download AntConc, a cross-platform concordance application. Use AntConc to open one or more files found in the txt directory, and then use AntConc to find snippets of text containing the bigrams identified in Activity #1. To increase precision, configure AntConc to use the stopword list found in any carrel at etc/stopwords.txt.
  • Download OpenRefine, a robust data cleaning and analysis program. Use OpenRefine to open one or more of the files in the folder/directory named “ent”. (These files enumerate named-entities found in your carrel.) Use OpenRefine to first clean the entities, and then use it to count & tabulate things like the people, places, and organizations identified in the carrel. Repeat this process for any of the files found in the directories named “adr”, “pos”, “wrd”, or “urls”.

Extra extra credit activities

As sets of structured data, the content of study carrels can be computed against. In other words, programs can be written in Python, R, Java, Bash, etc. which open up study carrel files, manipulate the content in ways of your own design, and output knowledge. For example, you could open up the named entity files, select the entities of type PERSON, look up those people in Wikidata, extract their birthdates and death dates, and finally create a timeline illustrating who was mentioned in a carrel and when they lived. The same thing could be done for entities of type GPE (place), and a map could be output. A fledgling set of Jupyter Notebooks and command-line tools has been created just for these sorts of purposes, and you can find them on GitHub:

Every study carrel includes an SQLite relational database file (etc/reader.db). The database file includes all the information from all tab-delimited files (named-entities, parts-of-speech, keywords, bibliographics, etc.). Given this database, a person can either query the database from the command-line, write a program to do so, or use GUI tools like DB Browser for SQLite or Datasette. The result of such queries can be elaborate if-then statements such as “Find all keywords from documents dated less than Y” or “Find all documents, and output them in a given citation style.” Take a gander at the SQL file named “etc/queries.sql” to learn how the database is structured. It will give you a head start.
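To give a flavor of what “write a program to do so” might look like, here is a hedged sketch using Python's built-in sqlite3 module. It answers the “keywords from documents dated less than Y” question against an in-memory stand-in database. The table names (bib, wrd) and column names are assumptions made for illustration, so consult etc/queries.sql for the real structure before pointing anything like this at etc/reader.db.

```python
import sqlite3

# An in-memory stand-in for etc/reader.db; table and column names are assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bib (id TEXT PRIMARY KEY, date INTEGER);
    CREATE TABLE wrd (id TEXT, keyword TEXT);
    INSERT INTO bib VALUES ('doc-1', 2019), ('doc-2', 2021);
    INSERT INTO wrd VALUES ('doc-1', 'virus'), ('doc-1', 'bats'), ('doc-2', 'vaccine');
""")


def keywords_before(connection, year):
    """Return the distinct keywords of documents dated earlier than the given year."""
    sql = """
        SELECT DISTINCT w.keyword
        FROM wrd AS w JOIN bib AS b ON b.id = w.id
        WHERE b.date < ?
        ORDER BY w.keyword
    """
    return [row[0] for row in connection.execute(sql, (year,))]


print(keywords_before(db, 2020))
```

The question mark placeholder lets the year change from one run to the next without rewriting the SQL.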

Summary

Given an almost arbitrary set of unstructured data (text), the Distant Reader outputs sets of structured data known as “study carrels”. The content of study carrels can be consumed using the traditional reading process, through the use of any number of desktop/GUI applications, or programmatically. This document outlined each of these techniques.

Embrace information overload. Use the Distant Reader.

Distant Reader

Posted on July 2, 2021 in Distant Reader by Eric Lease Morgan

Embrace information overload. Use the Distant Reader.

PTPBio and the Reader

Posted on May 6, 2021 in Distant Reader by Eric Lease Morgan

[The following missive was written via an email message to a former colleague, and it is a gentle introduction to Distant Reader “study carrels”. –ELM]

On another note, I see you help edit a journal (PTPBio), and I used it as a case-study for a thing I call the Distant Reader.

The Distant Reader takes an arbitrary amount of text as input, does text mining and natural language processing against it, saves the result as a set of structured data, writes a few reports, and packages the whole thing into a zip file. The zip file is really a data set, and Distant Reader data sets are affectionately called “study carrels”. I took the liberty of applying the Reader process to PTPBio, and the result has manifested itself in a number of ways. Let me enumerate them. First, there is the cache of the original content:

Next, there are plain text versions of the cached items. These files are used for text mining, etc.:

The Reader does many different things against the plain text. For example, the Reader enumerates and describes each and every token (“word”) in each and every document. The descriptions include the word, its lemma, its part-of-speech, and its location in the corpus. Each plain text file is really a tab-delimited file easily importable into your favorite spreadsheet or database program:

Similar sets of files are created for named entities, URLs, email addresses, and statistically significant keywords:

All of this data is distilled into a (SQLite) database file, and various reports are run against the database. For example, a very simple and rudimentary report as well as a more verbose HTML report:

All of this data is stored in a single directory:

Finally, the whole thing is zipped up and available for downloading. What is cool about the download is that it is as functional on your desktop as it is on the ‘Net. Study carrels do not require the ‘Net to be operational; they are manifested as plain text files, are stand-alone items, and will endure the test of time:

“But wait. There’s more!”

It is not possible for me to create a Web-based interface empowering students, researchers, or scholars to answer any given research question. There are too many questions. On the other hand, since the study carrels are “structured”, one can write more sophisticated applications against the data. That is what the Reader Toolbox and Reader (Jupyter) Notebooks are for. Using the Toolbox and/or the Notebooks the student, researcher, or scholar can do all sorts of things:

  • download carrels from the Reader’s library
  • extract ngrams
  • do concordancing
  • do topic modeling
  • create a full text index
  • output all sentences containing a given word
  • find all people, use the ‘Net to get birth date and death dates, and create a timeline
  • find all places, use the ‘Net to get locations, and plot a map
  • articulate an “interesting” idea, and illustrate how that idea ebbed & flowed over time
  • play hangman, do a crossword puzzle, or play a hidden word search game

Finally, the Reader is by no means perfect. “Software is never done. If it were, then it would be called ‘hardware’.” Ironically though, the hard part about the Reader is not interpreting the result. The hard part is two other things. First, in order to use the Reader effectively, a person needs to have a (research) question in mind. The question can be as simple as “What are the characteristics of the given corpus?” Or, it can be as sublime as “How does St. Augustine define love, and how does his definition differ from Rousseau’s?”

Just as difficult is the creation of the corpus to begin with. For example, I needed to get just the PDF versions of your journal, but the website (understandably) is covered with about pages, navigation pages, etc. Listing the URLs of the PDF files was not difficult, but it was a bit tedious. Again, that is not your fault. In fact, your site was (relatively) easy. Some places seem to make it impossible to get to the content. (Sometimes I think the Internet is really one huge advertisement.)

Okay. That was plenty!

Your journal was a good use-case. Thank you for the fodder.

Oh, by the way, the Reader is located at https://distantreader.org, and it is available for use by anybody in the world.

OpenRefine and the Distant Reader

Posted on February 10, 2020 in Distant Reader by Eric Lease Morgan

The student, researcher, or scholar can use OpenRefine to open one or more different types of delimited files. OpenRefine will then parse the file(s) into fields. It makes many things easy, such as finding/replacing, faceting (think “grouping”), filtering (think “searching”), sorting, clustering (think “normalizing/cleaning”), counting & tabulating, and finally, exporting data. OpenRefine is an excellent go-between when spreadsheets fail and full-blown databases are too hard to use. OpenRefine eats delimited files for lunch.

Many (actually, most) of the files in a study carrel are tab-delimited files, and they will import into OpenRefine with ease. For example, after all of a carrel’s part-of-speech (pos) files are imported into OpenRefine, the student, researcher, or scholar can very easily count, tabulate, search (filter), and facet on nouns, verbs, adjectives, etc. If the named entities files (ent) are imported, then it is easy to see what types of entities exist and who might be the people mentioned in the carrel:

[Figures: facets (counts & tabulations) of parts-of-speech; most frequent nouns; types of named-entities; who is mentioned in a file and how often]

OpenRefine recipes

Like everything else, using OpenRefine requires practice. The problem to solve is not so much learning how to use OpenRefine. Instead, the problem to solve is to ask and answer interesting questions. That said, the student, researcher, or scholar will want to sort the data, search/filter the data, and compare pieces of the data to other pieces to articulate possible relationships. The following recipes endeavor to demonstrate some such tasks. The first is to simply facet (count & tabulate) on parts-of-speech files:

  1. Download, install, and run OpenRefine
  2. Create a new project and as input, randomly choose any file from a study carrel’s part-of-speech (pos) directory
  3. Continue to accept the defaults, and continue with “Create Project »”; the result ought to be a spreadsheet-like interface
  4. Click the arrow next to the POS column and select Facet/Text facet from the resulting menu; the result ought to be a new window containing a column of words and a column of frequencies — counts & tabulations of each type of part-of-speech in the file
  5. Go to Step #4, until you get tired, but this time facet by other values

Faceting is a whole lot like “grouping” in the world of relational databases. Faceting alphabetically sorts a list and then counts the number of times each item appears in the list. Different types of works have different parts-of-speech ratios. For example, it is not uncommon for there to be a preponderance of past-tense verbs in stories. Counts & tabulations of personal pronouns as well as proper nouns give senses of genders. A more in-depth faceting against adjectives alludes to sentiment.
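For the programmatically inclined, faceting boils down to counting the values in a column. The following is a minimal Python sketch of the idea, not a description of how OpenRefine works internally. The sample data is made up, and the column names (token, lemma, pos) are assumptions about the shape of a pos file.

```python
import csv
import io
from collections import Counter

# A made-up sample shaped like a pos file; the column names are assumptions.
SAMPLE = (
    "token\tlemma\tpos\n"
    "The\tthe\tDT\n"
    "ships\tship\tNNS\n"
    "sailed\tsail\tVBD\n"
    "the\tthe\tDT\n"
    "sea\tsea\tNN\n"
)


def facet(rows, column):
    """Count how often each value appears in a column, most frequent first."""
    return Counter(row[column] for row in rows).most_common()


# QUOTE_NONE keeps stray quotation-mark tokens from confusing the parser.
reader = csv.DictReader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(facet(rows, "pos"))
print(facet(rows, "lemma"))
```

Swap the sample for an opened pos file, and the same function will facet on whichever column you name.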

This recipe outlines how to filter (“search”):

  1. Click the “Remove All” button, if it exists; this ought to reset your view of the data
  2. Click the arrow next to the “token” column and select “Text filter” from the resulting menu
  3. In your mind, think of a word of interest, and enter it into the resulting search box
  4. Take notice of how the content in the spreadsheet view changes
  5. Go to Step #3 until you get tired
  6. Click the “Remove All” button to reset the view
  7. Text filter on the “pos” column but search for “^N” (which is code for any noun) and make sure the “regular expression” check box is… checked
  8. Text facet on the “lemma” column; the result ought to be a count & tabulation of all the nouns
  9. Go to Step #6, but this time search for “^V” or “^J”, which are the codes for any verb or any adjective, respectively

By combining the functionalities of faceting and filtering, the student, researcher, or scholar can investigate the original content more deeply or at least in different ways. The use of OpenRefine in this way is akin to leafing through a book or a back-of-the-book index. As patterns & anomalies present themselves, they can be followed up more thoroughly through the use of a concordance, where one can literally see the patterns & anomalies in context.
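Filtering followed by faceting can be sketched in Python as well. In this hedged example a regular expression is applied to part-of-speech tags (tags beginning with N for nouns, V for verbs, J for adjectives), and the lemmas of the surviving rows are counted. The rows are made up, and the function name is hypothetical.

```python
import re
from collections import Counter

# Made-up (token, lemma, pos) rows standing in for a carrel's pos file.
ROWS = [
    ("ships", "ship", "NNS"), ("sailed", "sail", "VBD"), ("ship", "ship", "NN"),
    ("swift", "swift", "JJ"), ("Achilles", "Achilles", "NNP"), ("sea", "sea", "NN"),
]


def facet_lemmas(rows, tag_pattern):
    """Keep rows whose part-of-speech tag matches the pattern, then count their lemmas."""
    tag = re.compile(tag_pattern)
    return Counter(lemma for _, lemma, pos in rows if tag.search(pos))


nouns = facet_lemmas(ROWS, r"^N")
verbs = facet_lemmas(ROWS, r"^V")
print(nouns.most_common())
print(verbs.most_common())
```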

This recipe answers the question, “Who is mentioned in a corpus, and how often?”:

  1. Download, install, and run OpenRefine
  2. Create a new project and as input, select all of the files in the named-entity (ent) directory
  3. Continue to accept the defaults, but remember, almost all of the files in a study carrel are tab-delimited files, so import them as “CSV / TSV / separator-based files”, not Excel files
  4. Continue to accept the defaults, and continue with “Create Project »”; the result ought to be a spreadsheet-like interface
  5. Click the arrow next to the “type” column and select Facet/Text facet from the resulting menu; the result ought to be a new window containing a column of words and a column of frequencies — counts & tabulations of each type of named-entity in the whole of the study carrel
  6. Select “PERSON” from the list of named entities; the result ought to be a count & tabulation of the names of the people mentioned in the whole of the study carrel
  7. Go to Step #5 until tired, but each time select a different named-entity value
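The same who-and-how-often question can be asked across a whole directory of files at once. The sketch below assumes each file has a “type” column, as in the recipe, plus an “entity” column and an “.ent” file extension, both of which are guesses. Two tiny made-up files are written to a temporary directory so the code runs on its own.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def count_entities(ent_dir, entity_type="PERSON"):
    """Tally entities of one type across every file in a named-entity directory."""
    tally = Counter()
    for path in sorted(Path(ent_dir).glob("*.ent")):
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
                if row["type"] == entity_type:
                    tally[row["entity"]] += 1
    return tally


# Two tiny made-up files so the sketch runs without a real carrel.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "iliad.ent").write_text(
        "entity\ttype\nAchilles\tPERSON\nTroy\tGPE\n", encoding="utf-8")
    Path(tmp, "odyssey.ent").write_text(
        "entity\ttype\nUlysses\tPERSON\nAchilles\tPERSON\n", encoding="utf-8")
    people = count_entities(tmp)
    places = count_entities(tmp, "GPE")

print(people.most_common())
print(places.most_common())
```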

This final recipe is a visualization:

  1. Create a new parts-of-speech or named-entity project
  2. Create any sort of meaningful set of faceted results
  3. Select the “choices” link; the result ought to be a text area containing the counts & tabulation
  4. Copy the whole of the resulting text area
  5. Paste the result into your text editor, find all tab characters and change them to colons (:), and copy the whole of the resulting text
  6. Open Wordle and create a word cloud with the contents of your clipboard; word counts may only illustrate frequencies, but sometimes the frequencies are preponderance.
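If find/replace in a text editor feels like too much clicking, the tab-to-colon step can be scripted. This is a small sketch with a hypothetical function name and made-up counts. It assumes the copied text is one value and one count per line, separated by a tab.

```python
def to_wordle(choices_text):
    """Turn tab-delimited "value<TAB>count" lines into "value:count" lines."""
    lines = []
    for line in choices_text.splitlines():
        if "\t" not in line:
            continue  # skip blank lines and anything without a count
        value, _, count = line.rpartition("\t")
        lines.append(f"{value.strip()}:{count.strip()}")
    return "\n".join(lines)


print(to_wordle("ship\t120\nsea\t85\n"))
```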

A study carrel’s parts-of-speech (pos) and named-entities (ent) files enumerate each and every word or named-entity in each and every sentence of each and every item in the study carrel. Given a question relatively quantitative in nature and pertaining to parts-of-speech or named-entities, the pos and ent files are likely to be able to address the question. The pos and ent files are tab-delimited files, and OpenRefine is a very good tool for reading and analyzing such files. It does much more than was outlined here, but enumerating its other features is beyond scope. Such is left up to the… reader.

Topic Modeling Tool – Enumerating and visualizing latent themes

Posted on February 6, 2020 in Distant Reader by Eric Lease Morgan

Technically speaking, topic modeling is an unsupervised machine learning process used to extract latent themes from a text. Given a text and an integer, a topic modeler will count & tabulate the frequency of words and compare those frequencies with the distances between the words. The words form “clusters” when they are both frequent and near each other, and these clusters can sometimes represent themes, topics, or subjects. Topic modeling is often used to denote the “aboutness” of a text or compare themes between authors, dates, genres, demographics, other topics, or other metadata items.

Topic Modeling Tool is a GUI/desktop topic modeler based on the venerable MALLET suite of software. It can be used in a number of ways, and it is relatively easy to use it to: list five distinct themes from the Iliad and the Odyssey, compare those themes between books, and, assuming each chapter occurs chronologically, compare the themes over time.

[Figures: simple list of topics; topics distributed across a corpus; comparing the two books of Homer; topics compared over time]

Topic Modeling Tool Recipes

These few recipes are intended to get you up and running when it comes to Topic Modeling Tool. They are not intended to be a full-blown tutorial. This first recipe merely divides a corpus into the default number of topics and dimensions:

  1. Download and install Topic Modeling Tool
  2. Copy (not move) the whole of the txt directory to your computer’s desktop
  3. Create a folder/directory named “model” on your computer’s desktop
  4. Open Topic Modeling Tool
  5. Specify the “Input Dir…” to be the txt folder/directory on your desktop
  6. Specify the “Output Dir…” to be the folder/directory named “model” on your desktop
  7. Click “Learn Topics”; the result ought to be a list of ten topics (numbered 0 to 9), and each topic is denoted with a set of scores and twenty words (“dimensions”), and while functional, such a result is often confusing

This recipe will make things less confusing:

  1. Change the number of topics from the default (10) to five (5)
  2. Click the “Optional Settings…” button
  3. Change the “The number of topic words to print” to something smaller, say five (5)
  4. Click the “Ok” button
  5. Click “Learn Topics”; the result will include fewer topics and fewer dimensions, and the result will probably be more meaningful, if not less confusing

There is no correct number of topics to extract with the process of topic modeling. “When considering the whole of Shakespeare’s writings, what is the number of topics it is about?” This being the case, repeat and re-repeat the previous recipe until you: 1) get tired, or 2) feel like the results are at least somewhat meaningful.

This recipe will help you make the results even cleaner by removing nonsense from the output:

  1. Copy the file named “stopwords.txt” from the etc directory to your desktop
  2. Click “Optional Settings…”; specify “Stopword File…” to be stopwords.txt; click “Ok”
  3. Click “Learn Topics”
  4. If the results contain nonsense words of any kind (or words that you just don’t care about), edit stopwords.txt to specify additional words to remove from the analysis
  5. Go to Step #3 until you get tired; the result ought to be topics with more meaningful words

Adding individual words to the stopword list can be tedious, and consequently, here is a power-user’s recipe to accomplish the same goal:

  1. Identify words or regular expressions to be excluded from analysis, and good examples include all numbers (\d+), all single-letter words (\b\w\b), or all two-letter words (\b\w\w\b)
  2. Use your text editor’s find/replace function to remove all occurrences of the identified words/patterns from the files in the txt folder/directory; remember, you were asked to copy (not move) the whole of the txt directory, so editing the files in the txt directory will not affect your study carrel
  3. Run the topic modeling process
  4. Go to Step #1 until you: 1) get tired, or 2) are satisfied with the results
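The find/replace step can also be done with a short script instead of a text editor. Here is a minimal sketch using Python's re module and the three patterns named in the recipe. The function name is made up, and the sketch works on a string, so reading and writing the copied txt files is left to you.

```python
import re

# The three patterns named in the recipe: numbers, one-letter words, two-letter words.
PATTERNS = [r"\d+", r"\b\w\b", r"\b\w\w\b"]


def scrub(text, patterns=PATTERNS):
    """Replace every match of each pattern with a space, then tidy the whitespace."""
    for pattern in patterns:
        text = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(scrub("In 1850 a ship of 12 men set sail"))  # prints "ship men set sail"
```

Matches are replaced with a space rather than nothing so the neighboring words do not run together.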

Now that you have somewhat meaningful topics, you will probably want to visualize the results, and one way to do that is to illustrate how the topics are dispersed over the whole of the corpus. Luckily, the list of topics displayed in the Tool’s console is tab-delimited, making it easy to visualize. Here’s how:

  1. Topic model until you get a set of topics which you think is meaningful
  2. Copy the resulting topics, and this will include the labels (numbers 0 through n), the scores, and the topic words
  3. Open your spreadsheet application, and paste the topics into a new sheet; the result ought to be three columns of information (labels, scores, and words)
  4. Sort the whole sheet by the second column (scores) in descending numeric order
  5. Optionally replace the generic labels (numbers 0 through n) with a single meaningful word, thus denoting a topic
  6. Create a pie chart based on the contents of the first two columns (labels and scores); the result will appear similar to an illustration above and it will give you an idea of how large each topic is in relation to the others
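The sorting and sizing done in the spreadsheet can be sketched in Python too. The example below parses made-up topics in the three-column shape described above (labels, scores, and words), sorts them by score, and expresses each score as a percentage of the whole, which is what the pie chart illustrates. The function names and the sample values are hypothetical.

```python
# Made-up console output: label, score, and topic words, separated by tabs.
CONSOLE = (
    "0\t0.21\tship sea men\n"
    "1\t0.54\tgods zeus son\n"
    "2\t0.25\thouse wife suitors\n"
)


def rank_topics(text):
    """Parse tab-delimited topics and sort them by score, largest first."""
    topics = []
    for line in text.strip().splitlines():
        label, score, words = line.split("\t", 2)
        topics.append((label, float(score), words))
    return sorted(topics, key=lambda topic: topic[1], reverse=True)


def shares(topics):
    """Express each score as a percentage of the total, as a pie chart would."""
    total = sum(score for _, score, _ in topics)
    return {label: round(100 * score / total, 1) for label, score, _ in topics}


ranked = rank_topics(CONSOLE)
print(shares(ranked))
```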

Because of a great feature in Topic Modeling Tool it is relatively easy to compare topics against metadata values such as authors, dates, formats, genres, etc. To accomplish this goal the raw numeric information output by the Tool (the actual model) needs to be supplemented with metadata, the data then needs to be pivoted, and subsequently visualized. This is a power-user’s recipe because it requires: 1) a specifically shaped comma-separated values (CSV) file, 2) Python and a few accompanying modules, and 3) the ability to work from the command line. That said, here’s a recipe to compare & contrast the two books of Homer:

  1. Copy the file named homer-books.csv to your computer’s desktop
  2. Click “Optional Settings…”; specify “Metadata File…” to be homer-books.csv; click “Ok”
  3. Click “Learn Topics”; the result ought to be pretty much like your previous results, but the underlying model has been enhanced
  4. Copy the file named pivot.py to your computer’s desktop
  5. When the modeling is complete, open up a terminal application and navigate to your computer’s desktop
  6. Run the pivot program (python pivot.py); the result ought to be an error message outlining the input pivot.py expects
  7. Run pivot.py again, but this time give it input; more specifically, specify “./model/output_csv/topics-metadata.csv” as the first argument (Windows users will specify .\model\output_csv\topics-metadata.csv), specify “barh” for the second argument, and “title” as the third argument; the result ought to be a horizontal bar chart illustrating the differences in topics across the Iliad and the Odyssey, and ask yourself, “To what degree are the books similar?”
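For those curious about what “pivoting” means, here is a rough illustration in plain Python. To be clear, this is not pivot.py, and the shape of the rows below is an assumption made for the sake of the example, not the format of topics-metadata.csv. The idea is simply to group documents by a metadata value and average each topic's weight within each group.

```python
from collections import defaultdict

# Made-up rows: one per document, with a metadata value ("title") and a
# weight for each topic. This shape is an assumption, not the Tool's format.
ROWS = [
    {"title": "Iliad", "war": 0.7, "home": 0.3},
    {"title": "Iliad", "war": 0.5, "home": 0.5},
    {"title": "Odyssey", "war": 0.2, "home": 0.8},
]


def pivot(rows, index, topics):
    """Average each topic's weight across the rows sharing an index value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[index]].append(row)
    return {
        key: {topic: sum(r[topic] for r in members) / len(members) for topic in topics}
        for key, members in groups.items()
    }


table = pivot(ROWS, "title", ["war", "home"])
print(table)
```

A table like this, with one row per metadata value and one column per topic, is the sort of thing a bar chart or line chart can then be drawn from.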

The following recipe is very similar to the previous recipe, but it illustrates the ebb & flow of topics throughout the whole of the two books:

  1. Copy the file named homer-chapters.csv to your computer’s desktop
  2. Click “Optional Settings…”; specify “Metadata File…” to be homer-chapters.csv; click “Ok”
  3. Click “Learn Topics”
  4. When the modeling is complete, open up a terminal application and navigate to your computer’s desktop
  5. Run pivot.py and specify “./model/output_csv/topics-metadata.csv” as the first argument (Windows users will specify .\model\output_csv\topics-metadata.csv), specify “line” for the second argument, and “title” as the third argument; the result ought to be a line chart illustrating the increase & decrease of topics from the beginning of the saga to the end, and ask yourself “What topics are discussed concurrently, and what topics are discussed when others are not?”

Topic modeling is an effective process for “reading” a corpus “from a distance”. Topic Modeling Tool makes the process easier, but the process requires practice. Next steps are for the student to play with the additional options behind the “Optional Settings…” dialog box, read the Tool’s documentation, take a look at the structure of the CSV/metadata file, and take a look under the hood at pivot.py.

The Distant Reader and concordancing with AntConc

Posted on January 31, 2020 in Distant Reader by Eric Lease Morgan

Concordancing is really a process about finding, and AntConc is a very useful program for this purpose. Given one or more plain text files, AntConc will enable the student, researcher, or scholar to: find all the occurrences of a word, illustrate where the word is located, navigate through document(s) where the word occurs, list word collocations, and calculate quite a number of useful statistics regarding a word. Concordancing, dating from the 13th Century, is the oldest form of text mining. Think of it as control-F (^f) on steroids. AntConc does all this and more. For example, one can load all of the Iliad and the Odyssey into AntConc, find all the occurrences of the word ship, visualize where ship appears in each chapter, and list the most significant words associated with the word ship.

[Figures: occurrences of a word; dispersion charts; “interesting” words]

AntConc recipes

This recipe simply implements search:

  1. Download and install AntConc
  2. Use the “Open File(s)…” menu option to open all files in the txt directory
  3. Select the Concordance tab
  4. Enter a word of interest into the search box
  5. Click the Start button

The result ought to be a list of phrases where the word of interest is displayed in the middle of the screen. In modern-day terms, such a list is called a “key word in context” (KWIC) index.
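
For readers who want to see what a concordance does under the hood, the search recipe above can be approximated in a few lines of Python. This is only a minimal sketch, not how AntConc itself is implemented; the function name kwic and the width parameter are hypothetical names chosen for illustration.

```python
import re

def kwic(text, word, width=30):
    """Return key-word-in-context lines for every occurrence of word in text."""
    # Collapse runs of whitespace so each context fits on a single line.
    text = re.sub(r"\s+", " ", text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

if __name__ == "__main__":
    sample = ("The ship sailed at dawn. Every ship in the harbor "
              "followed the first ship out to sea.")
    for line in kwic(sample, "ship", width=20):
        print(line)
```

To try it against a study carrel, read the contents of any file in the txt directory into a string and pass it to kwic in place of the sample sentence.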

This recipe combines search with “control-F”:

  1. Select the Concordance tab
  2. Enter a word of interest into the search box
  3. Click the Start button
  4. Peruse the resulting phrases and click on one of interest; the result ought to be a display of the text with the search term(s) highlighted in the larger context
  5. Go to Step #1 until tired

This recipe produces a dispersion plot, an illustration of where a search term appears in a document:

  1. Select the Concordance tab
  2. Enter a word of interest into the search box
  3. Select the “Concordance Plot” tab

The result will be a list of illustrations. Each illustration will include zero or more vertical lines denoting the location of your search term in a given file. The more lines in an illustration, the more times the search term appears in the document.
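
The idea behind a dispersion plot can be sketched in Python as well. The example below is a rough, text-only approximation under simple assumptions: words are runs of letters and apostrophes, and the document is divided into a fixed number of slots. The function name dispersion and the slots parameter are hypothetical, and AntConc's own plot is graphical and more refined.

```python
import re

def dispersion(text, word, slots=50):
    """Return a one-line text plot marking where word occurs in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    plot = ["-"] * slots
    if not tokens:
        return "".join(plot)
    target = word.lower()
    for position, token in enumerate(tokens):
        if token == target:
            # Map the token's position onto one of the fixed-width slots.
            plot[position * slots // len(tokens)] = "|"
    return "".join(plot)

if __name__ == "__main__":
    sample = "ship " + "sea " * 20 + "ship " + "sky " * 20 + "ship"
    print(dispersion(sample, "ship", slots=40))
```

Each vertical bar marks a slot where the term occurs at least once, so a term clustered at the beginning of a file produces bars bunched to the left.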

This recipe counts & tabulates the frequency of words:

  1. Select the “Word List” tab
  2. Click the Start button; the result will be a list of all the words and their frequencies
  3. Scroll up and down the list to get a feel for what is common
  4. Select a word of interest; the result will be the same as if you entered the word in Recipe #1

It is quite probable the most frequent words will be “stop words” like the, a, an, etc. AntConc supports the elimination of stop words, and the Reader supplies a stop word list. Describing how to implement this functionality is too difficult to put into words. (No puns intended.) But here is an outline:

  1. Select the “Tool Preferences” menu option
  2. Select the “Word List” category
  3. Use the resulting dialog box to select a stop word list; the Reader's list is called stopwords.txt and is found in the etc directory
  4. Click the Apply button
  5. Go to Step #1; and the result will be a frequency list sans any stop words, and the result will be much more meaningful
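
The effect of a stop word list on a frequency count can be illustrated with a short Python sketch. The function name word_frequencies is hypothetical, and the stop words are given inline so the example runs on its own; in practice one might read them from stopwords.txt, assuming (and it is only an assumption) that the file lists one word per line.

```python
import re
from collections import Counter

def word_frequencies(text, stopwords=()):
    """Count words in text, ignoring case and skipping any stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stop = {word.lower() for word in stopwords}
    return Counter(token for token in tokens if token not in stop)

if __name__ == "__main__":
    sample = "The ship and the sea and the ship and the sky"
    print(word_frequencies(sample).most_common(3))
    print(word_frequencies(sample, stopwords=["the", "and"]).most_common(3))
```

The first list is dominated by the and and, while the second surfaces the words that actually carry meaning.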

Ideas are rarely articulated through the use of individual words; ideas are usually articulated through the use of sets of words (ngrams, sentences, paragraphs, etc.). Thus, as John Rupert Firth once said, “You shall know a word by the company it keeps.” This recipe outlines how to list word co-occurrences and collocations:

  1. Select the “Cluster/N-grams” tab
  2. Enter a word of interest in the search box
  3. Click the Start button; the result ought to be a list of two-word phrases (bigrams) sorted in frequency order
  4. Select a phrase of interest, and the result will be just as if you had searched for the phrase in Recipe #1
  5. Go to Step #1 until tired
  6. Select the Collocates tab
  7. Enter a word of interest in the search box
  8. Click the Start button; the result ought to be a list of words and associated scores, and the scores compare the frequencies of the search word and the given word; words with higher scores can be considered “more interesting”
  9. Select “Sort by Freq” from the “Sort by” pop-up menu
  10. Click the Sort button; the result will be the same list of words and associated scores, but this time the list will be sorted by the frequency of the search term/given word combination

Again, a word is known by the company it keeps. Use the co-occurrences and collocations features to learn how a given word (or phrase) is associated with other words.
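
The difference between bigrams and collocates can be made concrete with a small Python sketch. It counts raw co-occurrences only; the statistical scores AntConc reports for collocates are not reproduced here, and the function names and the window parameter are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase words made of letters only."""
    return re.findall(r"[a-z]+", text.lower())

def bigrams_with(tokens, word):
    """Count the two-word phrases (bigrams) that contain word."""
    pairs = zip(tokens, tokens[1:])
    return Counter(" ".join(pair) for pair in pairs if word in pair)

def collocates(tokens, word, window=3):
    """Count the words appearing within window tokens of word."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        start = max(0, i - window)
        # Neighbors to the left and right, excluding the word itself.
        counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

if __name__ == "__main__":
    tokens = tokenize("The black ship and the swift ship left the shore.")
    print(bigrams_with(tokens, "ship").most_common())
    print(collocates(tokens, "ship", window=2).most_common())
```

Bigrams capture only immediate neighbors, while collocates cast a wider net; widening the window trades precision for coverage.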

There is much more to AntConc than what is covered in the recipes above. Learning more is left up to you, the student, researcher, or scholar.

The Distant Reader Workbook

Posted on January 31, 2020 in Distant Reader by Eric Lease Morgan

I am in the process of writing a/the Distant Reader workbook, which will make its debut at a Code4Lib preconference workshop in March. Below is both the “finished” introduction and table-of-contents.

Hands-on with the Distant Reader: A Workbook

This workbook outlines sets of hands-on exercises surrounding a computer system called the Distant Reader — https://distantreader.org.

By going through the workbook, you will become familiar with the problems the Distant Reader is designed to address, how to submit content to the Reader, how to download the results (affectionately called “study carrels”), and how to interpret them. The bulk of the workbook is about the latter. Interpretation can be as simple as reading a narrative report in your Web browser, as complex as doing machine learning, and everything else in-between.

You will need to bring very little to the workbook in order to get very much out. At the very least, you will need a computer with a Web browser and an Internet connection. A text editor such as Notepad++ for Windows or BBEdit for Macintosh will come in very handy, but a word processor of any type will do in a pinch. You will want some sort of spreadsheet application for reading tabular data, and Microsoft Excel or Macintosh Numbers will both work quite well. All the other applications used in the workbook are freely available for downloading and cross-platform in nature. You may need to install a Java virtual machine in order to use some of them, but Java is probably already installed on your computer.

I hope you enjoy using the Distant Reader. It helps me use and understand large volumes of text quickly and easily.

Table of contents

    I. What is the Distant Reader, and why should I care?
       A. The Distant Reader is a tool for reading
       B. How it works
       C. What it does
   II. Five different types of input
       A. Introduction
       B. A file
       C. A URL
       D. A list of URLs
       E. A zip file
       F. A zip file with a companion CSV file
       G. Summary
  III. Submitting "experiments" and downloading "study carrels"
   IV. An introduction to study carrels
    V. The structured data of study carrels; taking inventory through the manifest
   VI. Using combinations of desktop tools to analyze the data
       A. Introduction - The three essential types of desktop tools
       B. Text editors
       C. Spreadsheet/database applications
       D. Analysis applications
           i. Wordle and Wordle recipes
          ii. AntConc and AntConc recipes
         iii. Excel and Excel recipes
          iv. OpenRefine and OpenRefine recipes
           v. Topic Modeling Tool and Tool recipes
  VII. Using command-line tools to dig even deeper
 VIII. Summary/conclusion
   IX. About the author

As per usual these days, the “code” is available on GitHub.

Wordle and the Distant Reader

Posted on January 29, 2020 in Distant Reader by Eric Lease Morgan

Visualized word frequencies, while often considered sophomoric, can be quite useful when it comes to understanding a text, especially when the frequencies are focused on things like parts-of-speech, named entities, or co-occurrences. Wordle visualizes such frequencies very well. Examples include the 100 most frequent words in the Iliad and the Odyssey, the 100 most frequent nouns in the Iliad and the Odyssey, and the statistically significant words associated with the word ship from the Iliad and the Odyssey.

Simple word frequencies

Frequency of nouns

Significant words related to ship

Wordle recipes

Here is a generic Wordle recipe where Wordle will calculate the frequencies for you:

  1. Download and install Wordle. It is a Java application, so you may need to download and install Java along the way, but Java is probably already installed on your computer.
  2. Use your text editor to open reader.txt which is located in the etc directory/folder. Once opened, copy all of the text.
  3. Open Wordle, select the “Your Text” tab, and paste the whole of the text file into the window.
  4. Click the “Wordle” tab and your word cloud will be generated. Use the Wordle’s menu options to customize the output.

Congratulations, you have just visualized the whole of your study carrel.

Here is another recipe, a recipe where you supply the frequencies (or any other score):

  1. Download and install AntConc.
  2. Use the “Open File(s)…” menu option to open any file in the txt directory.
  3. Click the “Word list” tab, and then click the “Start” button. The result will be a list of words and their frequencies.
  4. Use the “Save Output to Text File…” menu option, and save the frequencies accordingly.
  5. Open the resulting file in your spreadsheet.
  6. Remove any blank rows, and remove the columns that are not the words and their frequencies.
  7. Invert the order of the remaining two columns; make the words the first column and the frequencies the second column.
  8. Copy the whole of the spreadsheet and paste it into your text editor.
  9. Use the text editor’s find/replace function to find all occurrences of the tab character and replace them with the colon (:) character. Copy the whole of the text editor’s contents.
  10. Open Wordle, click the “Your text” tab, paste the frequencies into the resulting window.
  11. Finally, click the “Wordle” tab to generate the word cloud.
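
The spreadsheet and find/replace steps above (Steps #6 through #9) can also be approximated with a few lines of Python. This is a sketch under an assumption: the saved output is tab-delimited with the frequency in the first column and the word in the second, which may not match your file. The function name wordle_pairs and its column parameters are hypothetical; adjust the column numbers to suit.

```python
def wordle_pairs(table, word_column=1, frequency_column=0):
    """Turn tab-delimited rows into the word:value lines Wordle accepts."""
    lines = []
    for row in table.splitlines():
        if not row.strip():
            continue  # skip blank rows
        cells = row.split("\t")
        word = cells[word_column].strip()
        frequency = cells[frequency_column].strip()
        lines.append(f"{word}:{frequency}")
    return "\n".join(lines)

if __name__ == "__main__":
    sample = "120\tship\n\n95\tsea\n80\tson\n"
    print(wordle_pairs(sample))
```

The printed lines can be pasted directly into Wordle's text window.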

Notice how you used a variety of generic applications to achieve the desired result. The word/value pairs given to Wordle do not have to be frequencies. Instead they can be any number of different scores or weights. Keep your eyes open for word/value combinations. They are everywhere. Word clouds have been given a bad rap. Wordle is a very useful tool.

The Distant Reader and a Web-based demonstration

Posted on January 18, 2020 in Distant Reader by Eric Lease Morgan

The following is an announcement of a Web-based demonstration of the Distant Reader:

Please join us for a web-based demo and Q&A on The Distant Reader, a web-based text analysis toolset for reading and analyzing texts that removes the hurdle of acquiring computational expertise. The Distant Reader offers a ready way to onboard scholars to text analysis and its possibilities. Eric Lease Morgan (Notre Dame) will demo his tool and answer your questions. This session is suitable for digital textual scholars at any level, from beginning to expert.

The Distant Reader: Reading at scale

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of just about any size (hundreds of books or thousands of journal articles), the Distant Reader analyzes the corpus, and outputs a myriad of reports enabling the researcher to use and understand the corpus. Designed with college students, graduate students, scientists, or humanists in mind, the Distant Reader is intended to supplement the traditional reading process.

This presentation outlines the problems the Reader is intended to address as well as the way it is implemented on the Jetstream platform with the help of both software and personnel resources from XSEDE. The Distant Reader is freely available for anybody to use at https://distantreader.org.

Other Distant Reader links of possible interest include:

‘Hope to see you there?