November // 2014 // Days in the Life of a Librarian // Blog Network // University of Notre Dame

Archive for November, 2014

My second Python script, dispersion.py

Posted on November 19, 2014 in Uncategorized

This is my second Python script, dispersion.py, and it illustrates where common words appear in a text.

#!/usr/bin/env python2

# dispersion.py - illustrate where common words appear in a text
#
# usage: ./dispersion.py <file>

# Eric Lease Morgan <emorgan@nd.edu>
# November 19, 2014 - my second real python script; "Thanks for the idioms, Don!"


# configure
MAXIMUM = 25
POS     = 'NN'

# require
import nltk
import operator
import sys

# sanity check
if len( sys.argv ) != 2 :
  print "Usage:", sys.argv[ 0 ], "<file>"
  quit()
  
# get input
file = sys.argv[ 1 ]

# initialize
with open( file, 'r' ) as handle : text = handle.read()
sentences = nltk.sent_tokenize( text )
pos       = {}

# process each sentence
for sentence in sentences : 
  
  # POS the sentence and then process each of the resulting words
  for word in nltk.pos_tag( nltk.word_tokenize( sentence ) ) :
    
    # check for configured POS, and increment the dictionary accordingly
    if word[ 1 ] == POS : pos[ word[ 0 ] ] = pos.get( word[ 0 ], 0 ) + 1

# sort the dictionary
pos = sorted( pos.items(), key = operator.itemgetter( 1 ), reverse = True )

# do the work; create a dispersion chart of the MAXIMUM most frequent pos words
text = nltk.Text( nltk.word_tokenize( text ) )
text.dispersion_plot( [ p[ 0 ] for p in pos[ : MAXIMUM ] ] )

# done
quit()

I used the program to analyze two works: 1) Thoreau’s Walden, and 2) Emerson’s Representative Men. From the dispersion plots displayed below, we can conclude a few things:

The words “man”, “life”, “day”, and “world” are common between both works.
Thoreau discusses water, ponds, shores, and surfaces together.
While Emerson seemingly discussed man and nature in the same breath, but none of his core concepts are discussed as densely as Thoreau’s.

Thoreau’s Walden

Emerson’s Representative Men

Python’s Natural Langauge Toolkit (NLTK) is a good library to get start with for digital humanists. I have to learn more though. My jury is still out regarding which is better, Perl or Python. So far, they have more things in common than differences.

My first R script, wordcloud.r

Posted on November 10, 2014 in Uncategorized

This is my first R script, wordcloud.r:


#!/usr/bin/env Rscript

# wordcloud.r - output a wordcloud from a set of files in a given directory

# Eric Lease Morgan <eric_morgan@infomotions.com>
# November 8, 2014 - my first R script!


# configure
MAXWORDS    = 100
RANDOMORDER = FALSE
ROTPER      = 0

# require
library( NLP )
library( tm )
library( methods )
library( RColorBrewer )
library( wordcloud )

# get input; needs error checking!
input <- commandArgs( trailingOnly = TRUE )
  
# create and normalize corpus
corpus <- VCorpus( DirSource( input[ 1 ] ) )
corpus <- tm_map( corpus, content_transformer( tolower ) )
corpus <- tm_map( corpus, removePunctuation )
corpus <- tm_map( corpus, removeNumbers )
corpus <- tm_map( corpus, removeWords, stopwords( "english" ) )
corpus <- tm_map( corpus, stripWhitespace )

# do the work
wordcloud( corpus, max.words = MAXWORDS, random.order = RANDOMORDER, rot.per = ROTPER )

# done
quit()

Given the path to a directory containing a set of plain text files, the script will generate a wordcloud.

Like Python, R has a library well-suited for text mining — tm. Its approach to text mining (or natural language processing) is both similar and dissimilar to Python’s. They are similar in that they both hope to provide a means for analyzing large volumes of texts. It is similar in that they use different underlying data structures to get there. R might be more for analytic person. Think statistics. Python may be more for the “literal” person, all puns intended. I will see if I can exploit the advantages of both.

My first Python script, concordance.py

Posted on November 10, 2014 in Uncategorized

Below is my first Python script, concordance.py:


#!/usr/bin/env python2

# concordance.py - do KWIK search against a text
#
# usage: ./concordance.py <file> <word>ph

# Eric Lease Morgan <emorgan@nd.edu>
# November 5, 2014 - my first real python script!


# require
import sys
import nltk

# get input; needs sanity checking
file = sys.argv[ 1 ]
word = sys.argv[ 2 ]

# do the work
text = nltk.Text( nltk.word_tokenize( open( file ).read( ) ) )
text.concordance( word )

# done
quit()

Given the path to a plain text file as well as a word, the script will output no more than twenty-five lines containing the given word. It is a keyword-in-context (KWIC) search engine, one of the oldest text mining tools in existence.

The script is my first foray into Python scripting. While Perl is cool (and “kewl”), it behooves me to learn the language of others if I expect good communication to happen. This includes others using my code and me using the code of others. Moreover, Python comes with a library (module) call the Natural Langauge Toolkit (NLTK) which makes it relatively easy to get my feet wet with text mining in this environment.

Days in the Life of a Librarian

Recent Posts

Subscribe

Archive for November, 2014

My second Python script, dispersion.py

My first R script, wordcloud.r

My first Python script, concordance.py