Tiny Text Mining Tools
Posted on April 2, 2014 in Uncategorized by Eric Lease Morgan
I have posted to Github the very beginnings of Perl library used to support simple and introductory text mining analysis — tiny text mining tools.
Presently the library is implemented in a set of subroutines stored in a single file supporting:
- simple in-memory indexing and single-term searching
- relevancy ranking through term-frequency inverse document frequency (TFIDF) for searching and classification
- cosine similarity for clustering and “finding more items like this one”
I use these subroutines and the associated Perl scripts to do quick & dirty analysis against corpuses of journal articles, books, and websites.
I know, I know. It would be better to implement these thing as a set of Perl modules, but I’m practicing what I preach. “Give it away even if it is not ready.” The ultimate idea is to package these things into a single distribution, and enable researchers to have them at their finger tips as opposed to a Web-based application.