Introducing the Distant Reader Toolbox

The Distant Reader Toolbox is a command-line tool for interacting with data sets created by the Distant Reader, data sets affectionately called “study carrels”. See:

https://reader-toolbox.readthedocs.io

Background

The Distant Reader takes an almost arbitrary amount of unstructured data (text) as input, creates a corpus, performs a number of text mining and natural language processing functions against the corpus, saves the results in the form of delimited files as well as an SQLite database, summarizes the results, and compresses the whole into a zip file. The resulting zip file is a data set intended to be used by people as well as computers. These data sets are called “study carrels”. There exists a collection of more than 3,000 pre-created study carrels, and anybody is authorized to create their own.

Study carrels

The contents of study carrels are overwhelmingly plain text in nature. Moreover, the files making up study carrels are consistently named and consistently located. This makes study carrels easy to compute against.
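
To illustrate what "easy to compute against" can mean in practice, here is a minimal Python sketch that counts the words in every plain text file found beneath a directory. It is not part of the Toolbox, and the directory layout, the file names, and the function name corpus_size are all assumptions made up for the demonstration; a real carrel may be organized differently.

```python
from pathlib import Path
import tempfile


def corpus_size(carrel, pattern="*.txt"):
    """Map each matching file name beneath `carrel` to its word count."""
    return {
        path.name: len(path.read_text(encoding="utf-8").split())
        for path in sorted(Path(carrel).rglob(pattern))
    }


# Build a throwaway stand-in for a carrel; the "txt" directory and the
# file names are hypothetical, chosen only for this demonstration.
with tempfile.TemporaryDirectory() as carrel:
    txt = Path(carrel) / "txt"
    txt.mkdir()
    (txt / "iliad.txt").write_text(
        "Sing, O goddess, the anger of Achilles", encoding="utf-8")
    (txt / "odyssey.txt").write_text(
        "Tell me, O muse, of that ingenious hero", encoding="utf-8")
    sizes = corpus_size(carrel)

# Report the size of each item and of the whole.
for name, count in sizes.items():
    print(name, count)
print("total", sum(sizes.values()))
```

Because the files are plain text and predictably located, a few lines of code suffice to measure each item as well as the whole.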

The narrative nature of study carrel content lends itself to quite a number of different text mining and natural language processing functions, including but not limited to:

  • bibliometrics
  • full-text indexing and search
  • grammar analysis
  • keyword-in-context searching (concordancing)
  • named-entity extraction
  • ngram extraction
  • parts-of-speech analysis
  • semantic indexing (also known as “word embedding”)
  • topic modeling

Given something like a set of scholarly articles, or all the chapters of all the Jane Austen novels, study carrels lend themselves to a supplemental type of reading, where reading is defined as the use and understanding of narrative text.
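
To make one of the functions listed above concrete, keyword-in-context searching can be sketched in a few lines of Python. This is only an illustration of the general technique, not the Toolbox's implementation; the function name concordance, the width parameter, and the sample text are hypothetical.

```python
import re


def concordance(text, query, width=30):
    """Return each whole-word occurrence of `query` in `text`,
    padded with up to `width` characters of context on either side."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Flatten newlines so each hit prints on a single line.
        lines.append(text[start:end].replace("\n", " "))
    return lines


# A made-up sample, used only to exercise the function.
sample = ("Sing, O goddess, the anger of Achilles. "
          "The love of glory drove some, and the love of home drew others.")

for line in concordance(sample, "love", width=20):
    print(line)
```

Each line of output shows the query surrounded by a little context, which is often enough to tell how a word is being used without reading the whole text.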

Toolbox

The Toolbox exploits the structured nature of study carrels, and makes it easy to address questions from the mundane to the sublime. Examples include but are not limited to:

  • How big is this corpus, and how big is this corpus compared to others?
  • Sans stop words, what are the most frequent one-word, two-word, etc-word phrases in this corpus?
  • To what degree does a given word appear in a corpus? Zero times? Many times, and if many, then in what context?
  • What words can be deemed as keywords for a given text, and what other texts have been classified similarly?
  • What things are mentioned in a corpus? (Think nouns.)
  • What do the things do? (Think verbs.)
  • How are those things described? (Think adjectives.)
  • What types of entities are mentioned in a corpus? The full names of people? Organizations? Places? Locations? Money amounts? Dates? Times? Works of art? Diseases? Chemicals? Organisms? And given these entities, how are they related to each other?
  • What are all the noun phrases in a text, and how often do they occur?
  • What did people say?
  • What are all the sentence fragments matching the grammar subject-verb-object, and which ones of those fragments match a given regular expression?
  • Assuming that a word is known by the company it keeps, what words are in the same semantic space (word embedding), or what latent themes may exist in a corpus beyond keywords (topic modeling)?
  • How did a given idea ebb and flow over time? Who articulated the idea, and how? Where did a given idea manifest itself in the world?
  • If a given book is denoted as “great”, then what are its salient characteristics, and what other books can be characterized similarly?
  • What is justice, and if murder is morally wrong, then how can war be justified?
  • What is love, and how do Augustine’s and Rousseau’s definitions of love compare and contrast?

Given a study carrel with relevant content, the Toolbox can be used to address all of the questions outlined above.
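
As a rough illustration of the second question above (the most frequent one-word and two-word phrases, sans stop words), the following Python sketch counts ngrams in a string. It is a sketch under stated assumptions, not the Toolbox's method: the tiny stop word list is made up, the tokenizer is deliberately naive, and the rule of dropping any ngram that contains a stop word is one possible choice among several.

```python
from collections import Counter
import re

# A tiny, illustrative stop word list; real lists are far longer.
STOP_WORDS = {"the", "of", "and", "a", "to", "in"}


def ngrams(text, size=1, stop_words=STOP_WORDS):
    """Count `size`-word phrases in `text`, skipping any phrase
    that contains a stop word (an assumption of this sketch)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # Zip the token list against shifted copies of itself to form ngrams.
    grams = zip(*(tokens[i:] for i in range(size)))
    kept = (g for g in grams if not any(t in stop_words for t in g))
    return Counter(" ".join(g) for g in kept)


# A made-up sample, used only to exercise the function.
sample = "True love conquers war and true love endures"

print(ngrams(sample, size=1).most_common(3))
print(ngrams(sample, size=2).most_common(3))
```

Changing the size argument moves from single words to longer phrases, which is the same idea behind asking for one-word, two-word, and longer phrases in a corpus.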

Quickstart

The Toolbox requires Python 3, and it can be installed from the terminal with the following command:

pip install reader-toolbox

Once installed, you can invoke it from the terminal like this:

rdr

The result ought to be a help text looking much like this:

  Usage: rdr [OPTIONS] COMMAND [ARGS]...

  Options:
    --help  Show this message and exit.

  Commands:
    browse       Peruse <carrel> as a file system Study carrels are sets of...
    catalog      List study carrels Use this command to enumerate the study...
    cluster      Apply dimension reduction to <carrel> and visualize the...
    concordance  A poor man's search engine Given a query, this subcommand...
    download     Cache <carrel> from the public library of study carrels A...
    edit         Modify <carrel>'s stop word list When using subcommands such...
    get          Echo the values denoted by the set subcommand This is useful...
    grammars     Extract sentence fragments from <carrel> where fragments are...
    ngrams       Output and list words or phrases found in <carrel> This is...
    play         Play the word game called hangman.
    read         Open <carrel> in your Web browser Use this subcommand to...
    search       Perform a full text query against <carrel> Given words,...
    semantics    Apply semantic indexing queries against <carrel> Sometimes...
    set          Configure the location of your study carrels and a subsystem...
    sql          Use SQL queries against the database of <carrel> Study...
    tm           Apply topic modeling against <carrel> Topic modeling is the...

Once you get this far, you can run quite a number of different commands:

  # browse the remote library of study carrels
  rdr catalog -l remote -h
  
  # read a carrel from the remote library
  rdr read -l remote homer
  
  # browse a carrel from the remote library
  rdr browse -l remote homer
  
  # list all the words in a remote carrel
  rdr ngrams -l remote homer
  
  # initialize a local library; accept the default
  rdr set
  
  # cache a carrel from the remote library
  rdr download homer
  
  # list all two-word phrases containing the word love
  rdr ngrams -s 2 -q love homer
  
  # see how the word love is used in context
  rdr concordance -q love homer
  
  # list all the subject-verb-object sentence fragments containing love; please be patient
  rdr grammars -q love homer
  
  # much the same, but for the word war; will return much faster
  rdr grammars -q '\bwar\b' -s homer | more

Summary

The Distant Reader creates data sets called “study carrels”, and study carrels lend themselves to analysis by people as well as computers. The Toolbox is a companion command-line application written in Python. It simplifies the process of answering questions — from the mundane to the sublime — against study carrels.
