Archive for the ‘Uncategorized’ Category

ORCID Outreach Meeting (May 21 & 22, 2014)

Posted on June 13, 2014 in Uncategorized

This posting documents some of my experiences at the ORCID Outreach Meeting in Chicago (May 21 & 22, 2014).

As you may or may not know, ORCID is an acronym for “Open Researcher and Contributor ID”.* It is also the name of a non-profit organization whose purpose is to facilitate the creation and maintenance of identifiers for scholars, researchers, and academics. From ORCID’s mission statement:

ORCID aims to solve the name ambiguity problem in research and scholarly communications by creating a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current researcher ID schemes. These identifiers, and the relationships among them, can be linked to the researcher’s output to enhance the scientific discovery process and to improve the efficiency of research funding and collaboration within the research community.

A few weeks ago the ORCID folks facilitated a user’s group meeting. It was attended by approximately 125 people (mostly librarians or people who work in/around libraries), and some of the attendees came from as far away as Japan. The purpose of the meeting was to build community and provide an opportunity to share experiences.

The meeting itself was divided into a number of panel discussions and a “codefest”. The panel discussions described successes (and failures) for creating, maintaining, enhancing, and integrating ORCID identifiers into workflows, institutional repositories, grant application processes, and information systems. Presenters described poster sessions, marketing materials, information sessions, computerized systems, policies, and politics all surrounding the implementation of ORCID identifiers. Quite frankly, nobody seemed to have a hugely successful story to tell because too few researchers seem to think there is a need for identifiers. I, as a librarian and information professional, understand the problem (as well as the solution), but outside the profession there may not seem to be much of a problem to be solved.

That said, the primary purpose of my attendance was to participate in the codefest. There were fewer than a dozen of us coders, and we all wanted to use the various ORCID APIs to create new and useful applications. I was most interested in the possibilities of exploiting the RDF output obtainable through content negotiation against an ORCID identifier, a la the command line application called curl:

curl -L -H "Accept: application/rdf+xml" http://orcid.org/0000-0002-9952-7800

Unfortunately, the RDF output only included the merest of FOAF-based information, and I was interested in bibliographic citations.

Consequently, I shifted gears, took advantage of the ORCID-specific API, and decided to do some text mining. Specifically, I wrote a Perl program — orcid.pl — that takes an ORCID identifier as input (i.e., 0000-0002-9952-7800) and then:

  1. queries ORCID for all the works associated with the identifier**
  2. extracts the DOIs from the resulting XML
  3. feeds the DOIs to a program called Tika for the purposes of extracting the full text from documents
  4. concatenates the result into a single stream of text, and sends the whole thing to standard output

For example, the following command will create a “bag of words” containing the content of all the writings associated with my ORCID identifier that have DOIs:

$ ./orcid.pl 0000-0002-9952-7800 > morgan.txt
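For the curious, a minimal sketch of the pipeline might look something like the following. It is not the orcid.pl described above; it assumes the public works endpoint mentioned in the footnote below, uses a quick & dirty regular expression (the element name is my assumption based on the ORCID message schema) instead of a real XML parser, and expects a local copy of the Tika application jar (tika-app.jar is an assumed filename):

#!/usr/bin/perl
# orcid-sketch.pl - hedged sketch: given an ORCID identifier, output a "bag of words"
# usage: ./orcid-sketch.pl 0000-0002-9952-7800 > morgan.txt
use strict;
use warnings;
use LWP::UserAgent;

my $orcid = shift or die "Usage: $0 <orcid-id>\n";
my $ua    = LWP::UserAgent->new;

# step #1: query ORCID for all the works associated with the identifier
my $works = $ua->get( "http://pub.orcid.org/$orcid/orcid-works" )->decoded_content;

# step #2: extract the DOIs from the resulting XML; the element name is an assumption
my @dois = $works =~ m{<work-external-identifier-id>\s*(10\.[^<\s]+)\s*</work-external-identifier-id>}g;

# steps #3 and #4: resolve each DOI, feed the result to Tika, and send the text to STDOUT
foreach my $doi ( @dois ) {
    my $response = $ua->get( "http://dx.doi.org/$doi" );
    next unless $response->is_success;
    open my $tmp, '>', "/tmp/$$.data" or die $!;
    binmode $tmp;
    print $tmp $response->content;
    close $tmp;
    print `java -jar tika-app.jar --text /tmp/$$.data`;
}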

Using this program I proceeded to create a corpus of files based on the ORCID identifiers of eleven Outreach Meeting attendees. I then used my “tiny text mining tools” to do analysis against the corpus. The results were somewhat surprising:

  • The most significant key words shared across the corpus of eleven people included: information, system, site, and orcid.
  • The authors Haak and Paglione wrote the most similar articles. (They both wrote about ORCID.) Morgan and Havert were a very close second. (We both wrote about “information” and “sites”.)
  • The DOIs often point to splash pages, and consequently my “bags of words” included lots of content about cookies and publishers as opposed to meaty journal article content. ***

Ideally, the hack I wrote would allow a person to feed one or more identifiers to a system and get back a report summarizing and analyzing the associated journal article content at a glance — a quick & easy “distant reading” tool.

I finished my “hack” in one sitting, which gave me time to attend the presentations on the second day.

All of the hacks were added to a pile and judged by a vendor on their utility. I’m proud to say that the hack written by Jeremy Friesen — a colleague here at Notre Dame — won a prize. His application followed the links to people’s publications, created a screen dump of the publications’ root pages, and made a montage of the result. It was a visual version of orcid.pl. Congratulations, Jeremy!

I’m very glad I attended the Meeting. I reconnected with a number of professional colleagues, and my awareness of researcher identifiers increased. More specifically, there seems to be a growing number of these identifiers. Examples for myself include:

And for a really geeky good time, I learned to create the following set of RDF triples with the use of these identifiers:

@prefix dc: <http://purl.org/dc/elements/1.1/> .
  <http://dx.doi.org/10.1108/07378831211213201> dc:creator
  "http://isni.org/isni/0000000035290715" ,
  "http://id.loc.gov/authorities/names/n94036700" ,
  "http://orcid.org/0000-0002-9952-7800" ,
  "http://viaf.org/viaf/26290254" ,
  "http://www.researcherid.com/rid/F-2062-2014" ,
  "http://www.scopus.com/authid/detail.url?authorId=25944695600" .

I learned about the (subtle) difference between an identifier and an authority control record. I learned of the advantages and disadvantages of the various identifiers. And through a number of serendipitous email exchanges, I learned about ISNIs, which are an ISO-standard identifier seemingly popular in Europe but relatively unknown here in the United States. For more detail, see the short discussion of these things in the Code4Lib mailing list archives.

Now might be a good time for some of my own grassroots efforts to promote the use of ORCID identifiers.

* Thanks, Pam Masamitsu!

** For a good time, try http://pub.orcid.org/0000-0002-9952-7800/orcid-works, or substitute your identifier to see a list of your publications.

*** The problem with splash screens is exactly what the very recent CrossRef Text And Data Mining API is designed to address.

CrossRef’s Text and Data Mining (TDM) API

Posted on June 11, 2014 in Uncategorized

A few weeks ago I learned that CrossRef’s Text And Data Mining (TDM) API had reached version 1.0, and this blog posting describes my tertiary experience with it.

A number of months ago I learned about Prospect, a fledgling API being developed by CrossRef. Its purpose was to facilitate direct access to full text journal content without going through the hassle of screen scraping journal article splash pages. Since then the API has been upgraded to version 1.0 and renamed the Text And Data Mining API. This is how the API is expected to be used:

  1. Given a (CrossRef) DOI, resolve the DOI using HTTP content negotiation. Specifically, request text/turtle output.
  2. From the response, capture the HTTP header called “links”.
  3. Parse the links header to extract URIs denoting full text, licenses, and people.
  4. Make choices based on the values of the URIs.

What sorts of choices is one expected to make? Good question. First and foremost, a person is supposed to evaluate the license URI. If the URI points to a palatable license, then you may want to download the full text, which seems to come in PDF and/or XML flavors. With version 1.0 of the API, I have discovered ORCID identifiers are included in the header. I believe these denote authors/contributors of the articles.

Again, all of this is based on the content of the HTTP links header. Here is an example header, with carriage returns added for readability:

<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.pdf>;
rel="http://id.crossref.org/schema/fulltext"; type="application/pdf"; version="vor",
<http://downloads.hindawi.com/journals/isrn.neurology/2013/908317.xml>;
rel="http://id.crossref.org/schema/fulltext"; type="application/xml"; version="vor",
<http://creativecommons.org/licenses/by/3.0/>; rel="http://id.crossref.org/schema/license";
version="vor", <http://orcid.org/0000-0002-8443-5196>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0002-0987-9651>; rel="http://id.crossref.org/schema/person",
<http://orcid.org/0000-0003-4669-8769>; rel="http://id.crossref.org/schema/person"
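Just to illustrate the moving parts, here is a minimal sketch of steps #1 through #3. It is not the extractor.pl described below; it assumes LWP::UserAgent and treats the links header as a comma-separated list of “<uri>; key=value” clauses:

#!/usr/bin/perl
# tdm-sketch.pl - hedged sketch: resolve a DOI and list the values in the links header
# usage: ./tdm-sketch.pl <doi>
use strict;
use warnings;
use LWP::UserAgent;

my $doi = shift or die "Usage: $0 <doi>\n";
my $ua  = LWP::UserAgent->new;

# step #1: resolve the DOI using content negotiation, requesting text/turtle
my $response = $ua->get( "http://dx.doi.org/$doi", 'Accept' => 'text/turtle' );

# step #2: capture the links header from the response
my $links = $response->header( 'Link' ) || die "No links header found\n";

# step #3: parse the header into its full text, license, and person URIs
foreach my $clause ( split /,\s*(?=<)/, $links ) {
    my ( $uri )  = $clause =~ m{<([^>]+)>};
    my ( $rel )  = $clause =~ m{rel="([^"]+)"};
    my ( $type ) = $clause =~ m{type="([^"]+)"};
    print join( "\t", $rel || '', $type || '', $uri || '' ), "\n";
}

From there, step #4 (making choices) amounts to examining the license URI before fetching anything tagged as full text.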

I wrote a tiny Perl library — extractor.pl — used to do steps #1 through #3, above. It returns a reference to a hash containing the values in the links header. I then wrote three Perl scripts which exploit the library:

  1. resolver.cgi – a Web-based application taking a DOI as input and returning the URIs in the links header, if they exist. Your mileage with the script will vary because most DOIs are not associated with full text URIs.
  2. search.cgi – given a simple query, use CrossRef’s Metadata API to find no more than five articles associated with full text content, and then resolve the links to the full text.
  3. search.pl – a command-line version of search.cgi

Here are a few comments. As a person who increasingly wants direct access to full text articles, I think the Text And Data Mining API is a step in the right direction. Now all that needs to happen is for publishers to get on board and feed CrossRef the URIs of full text content along with the associated licensing terms. I found the links header to be a bit convoluted, but this is what programming libraries are for. I could not find a comprehensive description of what name/value combinations can exist in the links header. For example, the documentation alludes to beginning and ending dates. CrossRef seems to have a growing number of interesting applications and APIs which are probably going unnoticed, and there is an opportunity of some sort lurking in there. Specifically, somebody ought to do something with the text/turtle (RDF) output of the DOI resolutions.

‘More fun with HTTP and bibliographics.

Code4Lib jobs topic

Posted on May 15, 2014 in Uncategorized

This posting describes how to turn off and on a thing called the jobs topic in the Code4Lib mailing list.

Code4Lib is a mailing list whose primary focus is computers and libraries. Since its inception in 2004, it has grown to include about 2,800 members from all around the world but mostly from the United States. The Code4Lib community has also spawned an annual conference, a refereed online journal, its own domain, and a growing number of regional “franchises”.

The Code4Lib community has also spawned job postings. Sometimes these job postings flood the mailing list, and while it is entirely possible to use mail filters to exclude such postings, there is also “more than one way to skin a cat”. Since the mailing list uses the LISTSERV software, the mailing list has been configured to support the idea of “topics”, and through this feature a person can configure their subscription preferences to exclude job postings. Here’s how. By default every subscriber to the mailing list will get all postings. If you want to turn off getting the jobs postings, then email the following command to listserv@listserv.nd.edu:

SET code4lib TOPICS: -JOBS

If you want to turn on the jobs topic and receive the notices, then email the following command to listserv@listserv.nd.edu:

SET code4lib TOPICS: +JOBS

Sorry, but if you subscribe to the mailing list in digest mode, then the topics command has no effect; you will get the job postings no matter what.

HTH.

Special thanks go to Jodi Schneider and Joe Hourcle who pointed me in the direction of this LISTSERV functionality. Thank you!

The 3D Printing Working Group is maturing, complete with a shiny new mailing list

Posted on April 9, 2014 in Uncategorized

A couple of weeks ago Kevin Phaup took the lead of facilitating a 3D printing workshop here in the Libraries’ Center For Digital Scholarship. More than a dozen students from across the University participated. Kevin presented them with an overview of 3D printing, pointed them towards an online 3D image editing application (Shapeshifter), and everybody created various objects which Matt Sisk has been diligently printing. The event was deemed a success, and there will probably be more specialized workshops scheduled for the Fall.

Since the last blog posting there has also been another Working Group meeting. A short dozen of us got together in Stinson-Remick where we discussed the future possibilities for the Group. The consensus was to create a more formal mailing list, maybe create a directory of people with 3D printing interests, and see about doing something more substantial — with a purpose — for the University.

To those ends, a mailing list has been created. Its name is the 3D Printing Working Group. The list is open to anybody, and its purpose is to facilitate discussion of all things 3D printing around Notre Dame and the region. To subscribe, address an email message to listserv@listserv.nd.edu, and in the body of the message include the following command:

subscribe nd-3d-printing Your Name

where Your Name is… your name.

Finally, the next meeting of the Working Group has been scheduled for Wednesday, May 14. It will be sponsored by Bob Sutton of Springboard Technologies, located in Innovation Park across from the University, and it will take place from 11:30 to 1 o’clock. I’m pretty sure lunch will be provided. The purpose of the meeting will be to continue outlining the future directions of the Group as well as to see a demonstration of a printer called the Isis3D.

Digital humanities and libraries

Posted on April 3, 2014 in Uncategorized

This posting outlines a current trend in some academic libraries, specifically, the inclusion of digital humanities into their service offerings. It provides the briefest of introductions to the digital humanities, and then describes how one branch of the digital humanities — text mining — is being put into practice here in the Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame.

(This posting and its companion one-page handout were written for the Information Organization Research Group, School of Information Studies at the University of Wisconsin Milwaukee, in preparation for a presentation dated April 10, 2014.)

Digital humanities

For all intents and purposes, the digital humanities is a newer rather than older scholarly endeavor. A priest named Father Busa is considered the “Father of the Digital Humanities”; in 1965 he worked with IBM to evaluate the writings of Thomas Aquinas. With the advent of the Internet, ubiquitous desktop computing, an increased volume of digitized content, and sophisticated markup languages like TEI (the Text Encoding Initiative), the processes of digital humanities work have moved away from a fad towards a trend. While digital humanities work is sometimes called a discipline, this author sees it as more akin to a method. It is a process of doing “distant reading” to evaluate human expression. (The phrase “distant reading” is attributed to Franco Moretti, who coined it in a book entitled Graphs, Maps, Trees: Abstract Models for a Literary History. Distant reading is complementary to “close reading”, and is used to denote the idea of observing many documents simultaneously.) The digital humanities community has grown significantly in the past ten or fifteen years, complete with international academic conferences, graduate school programs, and scholarly publications.

Digital humanities work is a practice where digitized content of the humanist is quantitatively analyzed as if it were the content studied by a scientist. This sort of analysis can be done against any sort of human expression: written and spoken words, music, images, dance, sculpture, etc. Invariably, the process begins with counting and tabulating. This leads to measurement, which in turn provides opportunities for comparison. From here patterns can be observed and anomalies perceived. Finally, predictions, theses, and judgements can be articulated. Digital humanities work does not replace the more traditional ways of experiencing expressions of the human condition. Instead it supplements the experience.

This author often compares the methods of the digital humanist to the reading of a thermometer. Suppose you observe an outdoor thermometer and it reads 32° (Fahrenheit). This reading, in and of itself, carries little meaning. It is only a measurement. In order to make sense of the reading it is important to put it into context. What is the weather outside? What time of year is it? What time of day is it? How does the reading compare to other readings? If you live in the Northern Hemisphere and the month is July, then the reading is probably an anomaly. On the other hand, if the month is January, then the reading is perfectly normal and not out of the ordinary. The processes of the digital humanist make it possible to make many measurements from a very large body of materials in order to evaluate things like texts, sounds, images, etc. It makes it possible to evaluate the totality of Victorian literature, the use of color in paintings over time, or the rhythmic similarities & differences between various forms of music.

Digital humanities centers in libraries

As the more traditional services of academic libraries become more accessible via the Internet, libraries have found the need to necessarily evolve. One manifestation of this evolution is the establishment of digital humanities centers. Probably one of the oldest of these centers is located at the University of Virginia, but they now exist in many libraries across the country. These centers provide a myriad of services including combinations of digitization, markup, website creation, textual analysis, speaker series, etc. Sometimes these centers are akin to computing labs. Sometimes they are more like small but campus-wide departments staffed with scholars, researchers, and graduate students.

The Hesburgh Libraries’ Center For Digital Scholarship at the University of Notre Dame was recently established in this vein. The Center supports services around geographic information systems (GIS), data management, statistical analysis of data, and text mining. It is located in a 5,000 square foot space on the Libraries’ first floor and includes a myriad of computers, scanners, printers, a 3D printer, and collaborative work spaces. Below is an annotated list of text mining projects the author has worked on in the Center. It is intended to give the reader a flavor of the types of work done in the Hesburgh Libraries:

  • Great Books – This was almost a tongue-in-cheek investigation to calculate which book was the “greatest” from a set of books called the Great Books of the Western World. The editors of the set defined a great book as one which discussed any one of a number of great ideas both deeply and broadly. These ideas were tabulated and compared across the corpus and then sorted by the resulting calculation. Aristotle’s Politics was determined to be the greatest book and Shakespeare was determined to have written nine of the top ten greatest books when it comes to the idea of love.
  • HathiTrust Research Center – The HathiTrust Research Center is a branch of the HathiTrust. The Center supports a number of algorithms used to do analysis against reader-defined worksets. The Center For Digital Scholarship facilitates workshops on the use of the HathiTrust Research Center as well as a small set of tools for programmatically searching and retrieving items from the HathiTrust.
  • JSTOR Tool – Data For Research (DFR) is a freely available alternative interface to the bibliographic index called JSTOR. DFR enables the reader to search the entirety of JSTOR through a faceted query interface. Search results are tabulated, enabling the reader to create charts and graphs illustrating the results. Search results can be downloaded for more detailed investigations. JSTOR Tool is a Web-based application allowing the reader to summarize and do distant reading against these downloaded results.
  • PDF To Text – Text mining almost always requires the content of its investigation to be in the form of plain text, but much of the content used by people is in PDF. PDF To Text is a Web-based tool which extracts the plain text from PDF files and provides a number of services against the result (readability scores, ngram extraction, concordancing, and rudimentary parts-of-speech analysis.)
  • Perceptions of China – This project is in the earliest stages. Prior to visiting China students have identified photographs and written short paragraphs describing, in their minds, what they think of China. After visiting China the process is repeated. The faculty member leading the students on their trips to China wants to look for patterns of perception in the paragraphs.
  • Poverty Tourism – A university senior believes they have identified a trend — the desire to tour poverty-stricken places. They identified as many as forty websites advertising “Come visit our slum”. Working with the Center they programmatically mirrored the content of the remote websites. They programmatically removed all the HTML tags from the mirrors. They then used Voyant Tools as well as various ngram tabulation tools to do distant reading against the corpus. Their investigations demonstrated the preponderant use of the word “you”, and they posit this is because the authors of the websites are trying to get readers to imagine being in a slum.
  • State Trials – In collaboration with a number of other people, transcripts of the State Trials dating between 1650 and 1700 were analyzed. Digital versions of the Trials were obtained, and a number of descriptive analyses were done. The content was indexed and a timeline was created from search results. Ngram extraction was done as well as parts-of-speech analysis. Various types of similarity measures were done based on named entities and the overall frequency of words (vectors). A stop word list was created based on additional frequency tabulations. Much of this analysis was visualized using word clouds, line charts, and histograms. This project is an excellent example of how much of digital humanities work is collaborative and requires the skills of many different types of people.
  • Tiny Text Mining Tools – Text mining is rooted in the counting and tabulation of words. Computers are very good at counting and tabulating. To that end a set of tiny text mining tools has been created enabling the Center to perform quick & dirty analysis against one or more items in a corpus. Written in Perl, the tools implement a well-respected relevancy ranking algorithm (term-frequency inverse document frequency or TFIDF) to support searching and classification, a cosine similarity measure for clustering and “finding more items like this one”, a concordancing (keyword in context) application, and an ngram (phrase) extractor. (A sketch of the concordance idea follows this list.)
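As promised in the last item, a concordance is conceptually tiny. The following is a minimal keyword-in-context sketch; it is not the Center's tool, only an illustration assuming a plain text file and a thirty-character window:

#!/usr/bin/perl
# kwic-sketch.pl - hedged sketch of a keyword-in-context (concordance) display
# usage: ./kwic-sketch.pl walden.txt pond   (the file name and keyword are only examples)
use strict;
use warnings;

my ( $file, $keyword ) = @ARGV;
die "Usage: $0 <plain-text-file> <keyword>\n" unless $file and $keyword;
my $width = 30;

# slurp the text and normalize the whitespace
open my $fh, '<', $file or die "Can't open $file: $!";
my $text = do { local $/; <$fh> };
$text =~ s/\s+/ /g;

# print every occurrence of the keyword with a bit of context on either side
while ( $text =~ /\b\Q$keyword\E\b/gi ) {
    my $end   = pos( $text );
    my $start = $end - length( $keyword );
    my $left  = substr( $text, $start < $width ? 0 : $start - $width, $start < $width ? $start : $width );
    my $right = substr( $text, $end, $width );
    printf "%${width}s  %s  %s\n", $left, $keyword, $right;
}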

Summary

Text mining, and digital humanities work in general, is simply the application of computing techniques to the content of human expression. Their use is similar to Galileo’s use of the magnifying glass. Instead of turning it down to count the number of fibers in a cloth (or to write an email message), it is being turned up to gaze at the stars (or to analyze the human condition). What he found there was not so much truth as new ways to observe. The same is true of text mining and the digital humanities. They are additional ways to “see”.

Links

Here is a short list of links for further reading:

  • ACRL Digital Humanities Interest Group – This is a mailing list whose content includes mostly announcements of interest to librarians doing digital humanities work.
  • asking for it – Written by Bethany Nowviskie, this is a thorough response to the OCLC report, below.
  • dh+lib – A website amalgamating things of interest to digital humanities librarianship (job postings, conference announcements, blog postings, newly established projects, etc.)
  • Digital Humanities and the Library: A Bibliography – Written by Miriam Posner, this is a nice list of print and digital readings on the topic of digital humanities work in libraries.
  • Does Every Research Library Need a Digital Humanities Center? – A recently published, OCLC-sponsored report intended for library directors who are considering the creation of a digital humanities center.
  • THATCamp – While not necessarily library-related, THATCamp is an organization and process for facilitating informal digital humanities workshops, usually in academic settings.

Tiny Text Mining Tools

Posted on April 2, 2014 in Uncategorized

I have posted to Github the very beginnings of a Perl library used to support simple and introductory text mining analysis — tiny text mining tools.

Presently the library is implemented in a set of subroutines stored in a single file supporting:

  • simple in-memory indexing and single-term searching
  • relevancy ranking through term-frequency inverse document frequency (TFIDF) for searching and classification
  • cosine similarity for clustering and “finding more items like this one”

I use these subroutines and the associated Perl scripts to do quick & dirty analysis against corpuses of journal articles, books, and websites.
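To give a flavor of what the subroutines compute, here is a self-contained sketch of TFIDF and cosine similarity over a toy corpus. It is not the code posted to GitHub, only a minimal illustration of the underlying arithmetic:

#!/usr/bin/perl
# tfidf-sketch.pl - hedged sketch of TFIDF and cosine similarity over a toy corpus
use strict;
use warnings;

# a toy corpus; in practice each document would be the full text of an article or book
my %corpus = (
    'a' => 'the cat sat on the mat',
    'b' => 'the cat sat on the hat',
    'c' => 'perl mines texts for librarians',
);

# tabulate term frequencies for each document
my %tf;
foreach my $id ( keys %corpus ) {
    $tf{ $id }{ $_ }++ foreach split ' ', lc $corpus{ $id };
}

# tfidf = ( count of term in document / length of document ) * log( size of corpus / documents containing term )
sub tfidf {
    my ( $id, $term ) = @_;
    return 0 unless exists $tf{ $id }{ $term };
    my $n      = scalar keys %corpus;
    my $df     = grep { exists $tf{ $_ }{ $term } } keys %corpus;
    my $length = 0;
    $length += $_ foreach values %{ $tf{ $id } };
    return ( $tf{ $id }{ $term } / $length ) * log( $n / $df );
}

# cosine similarity between the tfidf vectors of two documents
sub cosine {
    my ( $one, $two ) = @_;
    my ( $dot, $norm_one, $norm_two ) = ( 0, 0, 0 );
    my %terms = map { $_ => 1 } ( keys %{ $tf{ $one } }, keys %{ $tf{ $two } } );
    foreach my $term ( keys %terms ) {
        my ( $x, $y ) = ( tfidf( $one, $term ), tfidf( $two, $term ) );
        $dot      += $x * $y;
        $norm_one += $x ** 2;
        $norm_two += $y ** 2;
    }
    return ( $norm_one && $norm_two ) ? $dot / ( sqrt( $norm_one ) * sqrt( $norm_two ) ) : 0;
}

printf "similarity of a and b: %.3f\n", cosine( 'a', 'b' );
printf "similarity of a and c: %.3f\n", cosine( 'a', 'c' );

Running the sketch shows documents a and b scoring higher than a and c, which is exactly the intuition behind “finding more items like this one”.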

I know, I know. It would be better to implement these things as a set of Perl modules, but I’m practicing what I preach. “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, and enable researchers to have them at their fingertips as opposed to a Web-based application.

University of Notre Dame 3-D Printing Working Group

Posted on March 19, 2014 in Uncategorized

This is the tiniest of blog postings describing a fledgling community here on campus called the University of Notre Dame 3-D Printing Working Group.

Working group

A few months ago Paul Turner said to me, “There are an increasing number of people across campus who are interested in 3-D printing. With your new space there in the Center, maybe you could host a few of us and we can share experiences.” Since then a few of us have gotten together a few times to discuss problems and solutions when it comes to these relatively new devices. We have discussed things like:

  • what can these things be used for?
  • what are the advantages and disadvantages of different printers?
  • how and when does one charge for services?
  • what might the future of 3-D printing look like?
  • how can this technology be applied to the humanities?
nose

Mike Elwell from Industrial Design hosted one of our meetings. We learned about “fab labs” and “maker spaces”. 3-D printing seems to be the latest of gizmos for prototyping. He gave us a tour of his space, and I was most impressed with the laser cutter. At our most recent meeting Matt Leevy of Biology showed us how he is making models of people’s nasal cavities so doctors can practice before surgery. There I learned about the use of multiple plastics to do printing and how these multiple plastics can be used to make a more diverse set of objects. Because of the growing interest in 3-D printing, the Center will be hosting a beginner’s 3-D printing workshop on March 28 from 1 to 5 o’clock, facilitated by graduate student Kevin Phaup.

With every get-together more and more people have been attending, including faculty and staff from Biology, Industrial Design, the Libraries, Engineering, OIT, and Innovation Park. Our next meeting — organized just as loosely as the previous meetings — is scheduled for Friday, April 4 from noon to 1 o’clock in room 213 Stinson-Remick Hall. (That’s the new engineering building, and I believe it is the Notre Dame Design Deck space.) Everybody is welcome. The more people who attend, the more we can each get accomplished.

‘Hope to see you there!

CrossRef’s Prospect API

Posted on February 17, 2014 in Uncategorized

This is the tiniest of blog postings outlining my experiences with a fledgling API called Prospect.

Prospect is an API being developed by CrossRef. I learned about it through both word-of-mouth as well as a blog posting by Eileen Clancy called “Easy access to data for text mining“. In a nutshell, given a CrossRef DOI via content negotiation, the API will return both the DOI’s bibliographic information as well as URL(s) pointing to the location of full text instances of the article. The purpose of the API is to provide a straight-forward method for acquiring full text content without the need for screen scraping.
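For the curious, the negotiation can be seen from the command line with curl. This is only a sketch, and <doi> is a placeholder for any CrossRef DOI whose publisher has deposited full text links:

$ curl -s -L -D - -o /dev/null -H "Accept: text/turtle" http://dx.doi.org/<doi>

The -D - option dumps the response headers (including the links header, when present) to standard output while the body is discarded.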

I wrote a simple, almost brain-dead Perl subroutine implementing the API. For a good time, I put the subroutine into action in a CGI script. Enter a simple query, and the script will search CrossRef for full text articles, and return a list of no more than five (5) titles and their associated URLs where you can get them in a number of formats.

screen shot of CrossRef Prospect API in action

The API is pretty straight-forward, but the URLs pointing to the full text are stuffed into a “Links” HTTP header, and the value of the header is not as easily parseable as one might desire. Still, this can be put to good use in my slowly growing stock of text mining tools. Get DOI. Feed to one of my tools. Get data. Do analysis.

Fun with HTTP.

Analyzing search results using JSTOR’s Data For Research

Posted on February 17, 2014 in Uncategorized

Introduction

Data For Research (DFR) is an alternative interface to JSTOR enabling the reader to download statistical information describing JSTOR search results. For example, using DFR a person can create a graph illustrating when sets of citations were written, create a word cloud illustrating the most frequently used words in a journal article, or classify sets of JSTOR articles according to a set of broad subject headings. More advanced features enable the reader to extract frequently used phrases in a text as well as list statistically significant keywords. JSTOR’s DFR is a powerful tool enabling the reader to look for trends in large sets of articles as well as drill down into the specifics of individual articles. This hands-on workshop leads the student through a set of exercises demonstrating these techniques.

Faceted searching

DFR supports an easy-to-use search interface. Enter one or two words into the search box and submit your query. Alternatively you can do some field searching using the advanced search options. The search results are then displayed and sortable by date, relevance, or a citation rank. More importantly, facets are displayed alongside the search results, and searches can be limited by selecting one or more of the facet terms. Limiting by years, language, subjects, and disciplines proves to be the most useful.

search results screen

Publication trends over time

By downloading the number of citations from multiple search results, it is possible to illustrate publication trends over time.

In the upper right-hand corner of every search result is a “charts view” link. Once selected it will display a line graph illustrating the number of citations fitting your query over time. It also displays a bar chart illustrating the broad subject areas of your search results. Just as importantly, there is a link at the bottom of the page — “Download data for year chart” — allowing you to download a comma-separated (CSV) file of publication counts and years. This file is easily importable into your favorite spreadsheet program and chartable. If you do multiple searches and download multiple CSV files, then you can compare publication trends. For example, the following chart compares the number of times the phrases “Henry Wadsworth Longfellow”, “Henry David Thoreau”, and “Ralph Waldo Emerson” have appeared in the JSTOR literature between 1950 and 2000. From the chart we can see that Emerson was consistently mentioned more often than both Longfellow and Thoreau. It would be interesting to compare the JSTOR results with the results from Google Books Ngram Viewer, which offers a similar service against their collection of digitized books.
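For what it is worth, merging the downloaded files can be scripted too. Here is a minimal sketch assuming each file is a simple two-column list of year and count (the actual column layout of the DFR export may differ), and the file names (emerson.csv, thoreau.csv, longfellow.csv) are only examples:

#!/usr/bin/perl
# merge-trends.pl - hedged sketch: combine year/count CSV files into one table for charting
# usage: ./merge-trends.pl emerson.csv thoreau.csv longfellow.csv > trends.tsv
use strict;
use warnings;

my %table;    # $table{ year }{ file } = count
foreach my $file ( @ARGV ) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    while ( <$fh> ) {
        chomp;
        my ( $year, $count ) = split /,/;
        next unless defined $count and $year =~ /^\d{4}$/;
        $table{ $year }{ $file } = $count;
    }
}

# output a tab-delimited table: one row per year, one column per file
print join( "\t", 'year', @ARGV ), "\n";
foreach my $year ( sort keys %table ) {
    print join( "\t", $year, map { $table{ $year }{ $_ } || 0 } @ARGV ), "\n";
}

The resulting table can then be opened in a spreadsheet and charted just like the single-search download.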

chart view screen shot

publication trends for Emerson, Thoreau, and Longfellow

Key word analysis

DFR counts and tabulates frequently used words and statistically significant key words. These tabulations can be used to illustrate characteristics of search results.

Each search result item comes complete with title, author, citation, subject, and key terms information. The subjects and key terms are computed values — words and phrases determined by frequency and statistical analysis. Each search result item comes with a “More Info” link which returns lists of the item’s most frequently used words, phrases, and keyword terms. Unfortunately, these lists often include stop words like “the”, “of”, “that”, etc. making the results not as meaningful as they could be. Still, these lists are somewhat informative. They allude to the “aboutness” of the selected article.

Key terms are also facets. You can expand the Key terms facets to get a small word cloud illustrating the frequency of each term across the entire search result. Clicking on one of the key terms limits the search results accordingly. You can also click on the Export button to download a CSV file of key terms and their frequency. This information can then be fed to any number of applications for creating word clouds. For example, download the CSV file. Use your text editor to open the CSV file, and find/replace the commas with colons. Copy the entire result, and paste it into Wordle’s advanced interface. This process can be done multiple times for different searches, and the results can be compared & contrasted. Word clouds for Longfellow, Thoreau, and Emerson are depicted below, and from the results you can quickly see both similarities and differences between each writer.
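If you would rather not do the find/replace by hand, a one-line Perl command does the same thing, assuming the export is a two-column file of term and frequency (keyterms.csv and wordle.txt are hypothetical file names):

$ perl -pe 's/,/:/' keyterms.csv > wordle.txt

The resulting wordle.txt can then be pasted into Wordle’s advanced interface as described above.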

Ralph Waldo Emerson key terms

Henry David Thoreau key terms

Henry Wadsworth Longfellow key terms

Downloading complete data sets

If you create a DFR account, and if you limit your search results to 1,000 items or less, then you can download a data set describing your search results.

In the upper right-hand corner of the search results screen is a pull-down menu option for submitting data set requests. The resulting screen presents you with options for downloading a number of different types of data (citations, word counts, phrases, and key terms) in two different formats (CSV and XML). The CSV format is inherently easier to use, but the XML format seems to be more complete, especially when it comes to citation information. After submitting your data set request you will have to wait for an email message from DFR because it takes a while (anywhere from a few minutes to a couple of hours) for it to be compiled.

data set request page

After downloading a data set you can do additional analysis against it. For example, it is possible to create a timeline illustrating when individual articles were written. It would not be too difficult to create word clouds from titles or author names. If you have programming experience, then you might be able to track ideas over time or identify the originator of specific ideas. Concordances — keyword in context search engines — can be implemented. Some of this functionality, but certainly not all, is being slowly implemented in a Web-based application called JSTOR Tool.

Summary

As the written word is increasingly manifested in digital form, so grows the ability to evaluate it quantitatively. JSTOR’s DFR is one example of how this can be exploited for the purposes of academic research.

Note

A .zip file containing some sample data as well as the briefest of instructions on how to use it is linked from this document.

Paper Machines

Posted on January 22, 2014 in Uncategorized

Today I learned about Paper Machines, a very useful plug-in for Zotero allowing the reader to do visualizations against their collection of citations.

Today Jeffrey Bain-Conkin pointed me towards a website called Lincoln Logarithms where sermons about Abraham Lincoln and slavery were analyzed. To do some of the analysis a Zotero plug-in called Paper Machines was used, and it works pretty well:

  1. make sure Zotero’s full text indexing feature is turned on
  2. download and install Paper Machines
  3. select one of your Zotero collections to be analyzed
  4. wait
  5. select any one of a number of visualizations to create

I am in the process of writing a book on linked data for archivists. I am using Zotero to keep track of the book’s citations, etc. I used Paper Machines to generate the following images:

a word cloud where the words are weighted by a TF-IDF score

a “phrase net” where the words are joined by the word “is”

a topic modeling map — illustration of automatically classified documents

From these visualizations I learned:

  • not much from the word cloud except what I already knew
  • the word “data” is connected to many different actions
  • I have few, if any, citations in my collection from the mid-2000’s

I have often thought collections of metadata (citations, MARC records, the output from JSTOR’s Data For Research service) could easily be evaluated and visualized. Paper Machines does just that. I wish I had done it. (To some extent, I have, but the work is fledgling and called JSTOR Tool.)

In any event, if you use Zotero, then I suggest you give Paper Machines a try.