Searching Project Gutenberg at the Distant Reader

The venerable Project Gutenberg is a collection of about 60,000 transcribed editions of classic literature in the public domain, mostly from the Western canon. A subset of about 30,000 Project Gutenberg items has been cached locally, indexed, and made available through a website called the Distant Reader. The index is freely available for anybody, anywhere, to use. This blog posting describes how to query the index.

The index is built on a technology called Solr, a very popular indexing tool. The index supports simple searching, phrase searching, wildcard searching, fielded searching, Boolean logic, and nested queries. Each of these techniques is described below:

  • simple searches – Enter any words you desire, and you will most likely get results. In this regard, it is difficult to break the search engine.
  • phrase searches – Enclose query terms in double-quote marks to search the query as a phrase. Examples include: "tom sawyer", "little country schoolhouse", and "medieval europe".
  • wildcard searches – Append an asterisk (*) to any non-phrase query term to match every word beginning with the given characters, an operation akin to stemming. For example, the query potato* will return results including the words potato and potatoes.
  • fielded searches – The index has many different fields. The most important include: author, title, subject, and classification. To limit a query to a specific field, prefix the query with the name of the field and a colon (:). Examples include: title:mississippi, author:plato, or subject:knowledge.
  • Boolean logic – Queries can be combined with three Boolean operators: 1) AND, 2) OR, or 3) NOT. The use of AND creates the intersection of two queries. The use of OR creates the union of two queries. The use of NOT removes the results of the second query from the results of the first. The Boolean operators are case-sensitive and must be entered in upper case. Examples include: love AND author:plato, love OR affection, and love NOT war.
  • nested queries – Boolean logic queries can be nested to return more sophisticated sets of items; nesting allows you to override the default way Boolean operations get combined. Use matching parentheses (()) to create nested queries. For example: (love NOT war) AND (justice AND honor) AND (classification:BX OR subject:"spiritual life"). Of all the different types of queries, nested queries will probably give you the most grief.
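
Because nested queries are the easiest to get wrong, it can help to assemble them programmatically. The following is only a minimal sketch in Python. The helper names (phrase, field, wildcard, combine) are hypothetical and are not part of Solr or the Distant Reader. They merely build query strings following the rules listed above, so that quotation marks and parentheses always match:

```python
# Hypothetical helpers for composing query strings; they only build text
# and do not contact the Distant Reader or any Solr server.

def phrase(words):
    """Enclose words in double quotes so they are searched as a phrase."""
    return '"' + words.replace('"', '') + '"'

def field(name, value):
    """Limit a query to a single field, e.g. title:mississippi."""
    return f"{name}:{value}"

def wildcard(prefix):
    """Append an asterisk so any word starting with the prefix matches."""
    return prefix + "*"

def combine(operator, *queries):
    """Join queries with an upper-case Boolean operator, wrapped in parentheses."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unknown Boolean operator: {operator}")
    return "(" + f" {operator} ".join(queries) + ")"

# Rebuild the nested example from the list above.
query = combine(
    "AND",
    combine("NOT", "love", "war"),
    combine("AND", "justice", "honor"),
    combine(
        "OR",
        field("classification", "BX"),
        field("subject", phrase("spiritual life")),
    ),
)
print(query)
```

The resulting string carries one extra pair of outer parentheses, which should be harmless, and it can be pasted into the search box like any hand-written query.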

Because this is a full-text index covering a wide variety of topics, you will probably need to exploit the query language to get truly meaningful results.
