How to use text mining to address research questions

This tiny blog posting outlines a sort of recipe for using text mining to address research questions. Through the use of this process, the student, researcher, or scholar can easily supplement the traditional reading process.

  1. Articulate a research question – This is one of the more difficult parts of the process, and the questions can range from the mundane to the sublime. Examples might include: 1) how big is this corpus, 2) what words are used in this corpus, 3) how have given ideas ebbed & flowed over time, or 4) what is St. Augustine’s definition of love and how does it compare with Rousseau’s?
  2. Identify one or more textual items that might contain the necessary answers – These items may range from sets of social media posts to sets of journal articles, sets of reports, sets of books, etc. Point to the collection of documents.
  3. Get items – Even in the age of the Internet, when we are all suffering from information overload, you would be surprised how difficult it is to accomplish this step. One might search a bibliographic index and download articles. One might exploit some sort of application programming interface to download tweets. One might do a whole lot of copying & pasting. Whatever the process, I suggest one save each and every file in a single directory with some sort of meaningful name.
  4. Add metadata – Put another way, this means creating a list, where each item on the list is described with attributes which are directly related to the research question. Dates are an obvious attribute. If your research question compares and contrasts authorship, then you will need author names. You might need to denote language. If your research question revolves around types of authors, then you will need to associate each item with a type. If you want to compare & contrast ideas between different types of documents, then you will need to associate each document with a type. To make your corpus more meaningful, you will probably want to associate each item with a title value. Adding metadata is tedious. Be forewarned.
  5. Convert items to plain text – Text mining is not possible without plain text; you MUST have plain text to do the work. This means PDF files, Word documents, spreadsheets, etc. need to have their underlying texts extracted. Tika is a very good tool for doing and automating this process. Save each item in your corpus as a corresponding plain text file.
  6. Extract features – In this case, the word “features” is text mining parlance for enumerating characteristics of a text, and the list of such things is quite long. It includes: the size of documents measured in number of words, counts & tabulations (frequencies) of ngrams, readability scores, frequencies of parts-of-speech, frequencies of named entities, frequencies of given grammars such as noun phrases, etc. There are many different tools for doing this work; a minimal sketch of this step appears after this list.
  7. Analyze – Given a set of features, once all the prep work is done, one can actually begin to address the research question, and there are a number of tools and subprocesses that can be applied here. Concordancing is one of the quickest and easiest. From the features, identify a word of interest. Load the plain text into a concordance. Search for the word, and examine the surrounding words to see how the word was used. This is like ^F on steroids. (A tiny concordance sketch also appears after this list.) Topic modeling is a useful process for denoting themes. Load the texts into a topic modeler, and denote the number of desired topics. Run the modeler. Evaluate the results. Repeat. Associate each document with a metadata value, such as date. Run the modeler. Pivot the results on the date value. Plot the results as a line chart to see how topics ebbed & flowed over time. If the corpus is big enough (at least a million words long), then word embedding is a useful way to learn what words are used in conjunction with other words. Those words can then be fed back into a concordance. Full text indexing is also a useful analysis tool. Index the corpus, complete with metadata. Identify words or phrases of interest. Search the index to learn what documents are most relevant. Use a concordance to read just those documents. Listing grammars is also useful. Identify a thing (noun) of interest. Identify an action (verb) of interest. Apply a language model to a given text, and output a list of all sentences containing the given thing and action to learn how they are used together. An example is “Ahab has”, and the result will be lists of matching sentences including “Ahab has…”, “Ahab had…”, or “Ahab will have…”
  8. Evaluate – Ask yourself, “To what degree did I address the research question?” If the degree is high, or if you are tired, then stop. Otherwise, go to Step #1.
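
As a minimal sketch of Step #6, and assuming the corpus has already been converted to plain text files saved in a hypothetical directory named ./corpus, the following Python enumerates a few simple features: document sizes, word frequencies, and bigram frequencies.

```python
# feature extraction: tabulate document sizes, word frequencies, and
# bigram frequencies across a corpus of plain text files; the directory
# name ./corpus is a hypothetical example
import re
from collections import Counter
from pathlib import Path

CORPUS = Path("./corpus")

words = Counter()
bigrams = Counter()
sizes = {}

for file in CORPUS.glob("*.txt"):
    text = file.read_text(encoding="utf-8", errors="ignore").lower()
    tokens = re.findall(r"[a-z']+", text)
    sizes[file.name] = len(tokens)           # size of each document in words
    words.update(tokens)                      # unigram frequencies
    bigrams.update(zip(tokens, tokens[1:]))   # bigram frequencies

print("total words in corpus:", sum(sizes.values()))
print("most frequent words:", words.most_common(10))
print("most frequent bigrams:", bigrams.most_common(10))
```

Output like this begins to address the first two example questions above: how big is the corpus, and what words are used in it.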

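To illustrate the concordancing described in Step #7, here is an equally minimal keyword-in-context sketch; the file name and the word of interest are hypothetical examples, and a real concordance tool offers much more.

```python
# a tiny concordance (keyword-in-context) search over one plain text file;
# the file name and the word of interest are hypothetical examples
import re

def concordance(text, word, width=5):
    """Print each occurrence of word with `width` words of context on either side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            print(f"{left:>40} | {token} | {right}")

with open("./corpus/moby-dick.txt", encoding="utf-8", errors="ignore") as handle:
    concordance(handle.read(), "ahab")
```
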
Such is an outline of using text mining to address research questions, and there are a few things one ought to take away from the process. First, this is an iterative process. In reality, it is never done. Similarly, do not attempt to completely finish one step before you go on to the next. If you do, then you will never get past step #3. Moreover, computers do not mind if processes are done over and over again. Thus, you can repeat many of the subprocesses many times.

Second, this process is best done by a team of at least two. One person plays the role of the domain expert armed with the research question. The other person knows how to manipulate different types of data structures (different types of lists) with a computer.

Third, a thing called the Distant Reader automates Steps #5 and #6 with ease. To some degree, the Reader can do Steps #3, #4, and #7. The balance of the steps is up to people.

Finally, the use of text mining to address research questions is only a supplement to the traditional reading process. It is not a replacement. Text mining scales very well, but it does poorly when it comes to nuance. Similarly, text mining is better at addressing quantitative-esque questions; text mining will not be able to answer “why” questions. Moreover, text mining is a repeatable process, enabling others to verify results.

Remember, use the proper process for the proper problem.
