My first R script, wordcloud.r
Posted on November 10, 2014 in Uncategorized by Eric Lease Morgan
This is my first R script, wordcloud.r:
#!/usr/bin/env Rscript

# wordcloud.r - output a wordcloud from a set of files in a given directory

# Eric Lease Morgan <eric_morgan@infomotions.com>
# November 8, 2014 - my first R script!

# configure
MAXWORDS = 100
RANDOMORDER = FALSE
ROTPER = 0

# require
library( NLP )
library( tm )
library( methods )
library( RColorBrewer )
library( wordcloud )

# get input; needs error checking!
input <- commandArgs( trailingOnly = TRUE )

# create and normalize corpus
corpus <- VCorpus( DirSource( input[ 1 ] ) )
corpus <- tm_map( corpus, content_transformer( tolower ) )
corpus <- tm_map( corpus, removePunctuation )
corpus <- tm_map( corpus, removeNumbers )
corpus <- tm_map( corpus, removeWords, stopwords( "english" ) )
corpus <- tm_map( corpus, stripWhitespace )

# do the work
wordcloud( corpus, max.words = MAXWORDS, random.order = RANDOMORDER, rot.per = ROTPER )

# done
quit()
Given the path to a directory containing a set of plain text files, the script will generate a wordcloud.
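The script's own comment admits that the input "needs error checking!". One way to add it, sketched in base R only, is below. The helper name check_input and its error messages are my own assumptions for illustration, not part of the original script.

```r
# Hypothetical helper (not in the original script): validate the
# command-line arguments before the corpus is built.
check_input <- function( args ) {

  # exactly one argument, the directory, is expected
  if ( length( args ) != 1 ) {
    stop( "Usage: wordcloud.r <directory>", call. = FALSE )
  }

  # the argument must name an existing directory
  directory <- args[ 1 ]
  if ( ! dir.exists( directory ) ) {
    stop( paste( "Not a directory:", directory ), call. = FALSE )
  }

  # the directory must contain at least one entry to read
  if ( length( list.files( directory ) ) == 0 ) {
    stop( paste( "No files found in:", directory ), call. = FALSE )
  }

  # hand the validated path back to the caller
  directory
}
```

In the script, the result of check_input( commandArgs( trailingOnly = TRUE ) ) could then be passed to DirSource() in place of input[ 1 ]. Note that dir.exists() requires R 3.2.0 or later.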
Like Python, R has a library well suited to text mining: tm. Its approach to text mining (or natural language processing) is both similar and dissimilar to Python’s. They are similar in that both aim to provide a means of analyzing large volumes of text. They are dissimilar in that they use different underlying data structures to get there. R may suit the more analytic person. Think statistics. Python may suit the more “literal” person, all puns intended. I will see if I can exploit the advantages of both.
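To make the "think statistics" remark a little more concrete, here is a minimal sketch, in base R rather than tm, of the kind of counting that sits behind a wordcloud: normalize the text, split it into words, and tabulate. The function name word_frequencies and its tiny default stopword list are assumptions for illustration only. tm's own transformations and its English stopword list differ in detail.

```r
# Hypothetical helper: count word frequencies in a character vector
# using base R only; an approximation of, not a substitute for, tm.
word_frequencies <- function( texts, stopwords = c( "the", "a", "of", "and" ) ) {

  # normalize: lowercase, then replace punctuation and digits with spaces
  texts <- tolower( texts )
  texts <- gsub( "[[:punct:][:digit:]]+", " ", texts )

  # tokenize on whitespace, then drop empty strings and stopwords
  words <- unlist( strsplit( texts, "[[:space:]]+" ) )
  words <- words[ nzchar( words ) & ! ( words %in% stopwords ) ]

  # tabulate and sort, most frequent first
  sort( table( words ), decreasing = TRUE )
}
```

The result is an ordinary R table, so the usual statistical tools (head(), barplot(), and so on) apply to it directly. The wordcloud() function also accepts a vector of words and a matching vector of frequencies, so a table like this could in principle feed it as well.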