Searching CORD-19 at the Distant Reader
Posted on July 26, 2021 in Distant Reader by Eric Lease Morgan
This blog posting documents the query syntax for an index of scientific journal articles called CORD-19.
CORD-19 is a data set of scientific journal articles on the topic of COVID-19. As of this writing, it includes more than 750,000 items. This data set has been harvested, pre-processed, indexed, and made available as a part of the Distant Reader. Access to the index is freely available to anybody and everybody.
The index is rooted in a technology called Solr, a very popular indexing tool. The index supports simple searching, phrase searching, wildcard searches, fielded searching, Boolean logic, and nested queries. Each of these techniques is described below:
- simple searches – Enter any words you desire, and you will most likely get results. In this regard, it is difficult to break the search engine.
- phrase searches – Enclose query terms in double-quote marks to search the query as a phrase. Examples include: "waste water", "circulating disease", and "acute respiratory syndrome".
- wildcard searches – Append an asterisk (*) to any non-phrase query to match every word that begins with the given characters, much like a stemming operation. For example, the query virus* will return results including the words virus and viruses.
- fielded searches – The index has many different fields. The most important include: authors, title, year, journal, abstract, and keywords. To limit a query to a specific field, prefix the query with the name of the field and a colon (:). Examples include: title:disease, abstract:"cardiovascular disease", or year:2020. Of special note is the keywords field. Keywords are sets of statistically significant and computer-selected terms akin to traditional library subject headings. The use of the keywords field is a very efficient way to create a small set of very relevant articles. Examples include: keywords:mrna, keywords:ribosome, or keywords:China.
- Boolean logic – Queries can be combined with three Boolean operators: 1) AND, 2) OR, and 3) NOT. The use of AND creates the intersection of two queries. The use of OR creates the union of two queries. The use of NOT removes the results of the second query from the results of the first. The Boolean operators are case-sensitive; enter them in uppercase, as in the examples. Examples include: covid AND title:SARS, abstract:cat* OR abstract:dog*, and abstract:cat* NOT abstract:dog*.
- nested queries – Boolean logic queries can be nested to return more sophisticated sets of articles; nesting allows you to override the way rudimentary Boolean operations get combined. Use matching parentheses (()) to create nested queries. For example: ((covid AND title:SARS) OR abstract:cat* OR abstract:dog*) NOT year:2020. Of all the different types of queries, nested queries will probably give you the most grief.
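Since nested queries tend to cause the most trouble, it may help to assemble them programmatically instead of typing the parentheses by hand. The following is only a minimal sketch in Python: the helper names (phrase, wildcard, fielded, combine, nest) are hypothetical and not part of the Distant Reader or Solr. The sketch builds query strings only; q is the parameter Solr conventionally uses to carry a query, and no address for the index is assumed here.

```python
from urllib.parse import urlencode

# Hypothetical helpers for illustration; none of these names come from
# the Distant Reader or from Solr itself.

def phrase(text):
    """Enclose text in double-quote marks so it is searched as a phrase."""
    return f'"{text}"'

def wildcard(term):
    """Append an asterisk to a single, non-phrase term."""
    if " " in term or '"' in term:
        raise ValueError("wildcards apply to single, non-phrase terms")
    return term + "*"

def fielded(field, query):
    """Limit a query to one field, e.g. title:disease."""
    return f"{field}:{query}"

def combine(operator, *queries):
    """Join queries with a Boolean operator, which must be uppercase."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT (uppercase)")
    return f" {operator} ".join(queries)

def nest(query):
    """Wrap a query in matching parentheses."""
    return f"({query})"

# Rebuild the nested example from the list above, one layer at a time.
inner = nest(combine("AND", "covid", fielded("title", "SARS")))
middle = nest(combine("OR", inner,
                      fielded("abstract", wildcard("cat")),
                      fielded("abstract", wildcard("dog"))))
query = combine("NOT", middle, fielded("year", "2020"))
print(query)
# ((covid AND title:SARS) OR abstract:cat* OR abstract:dog*) NOT year:2020

# URL-encode the query as the value of a q parameter (assumption: the
# index accepts Solr's conventional q parameter).
print(urlencode({"q": query}))
```

Building the query one layer at a time keeps every opening parenthesis paired with its closing one, which is where hand-typed nested queries usually go wrong.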