{"id":948,"date":"2020-02-06T01:41:09","date_gmt":"2020-02-05T20:41:09","guid":{"rendered":"http:\/\/sites.nd.edu\/emorgan\/?p=948"},"modified":"2020-02-06T01:41:09","modified_gmt":"2020-02-05T20:41:09","slug":"topic-modeling","status":"publish","type":"post","link":"https:\/\/sites.nd.edu\/emorgan\/2020\/02\/topic-modeling\/","title":{"rendered":"Topic Modeling Tool &#8211; Enumerating and visualizing latent themes"},"content":{"rendered":"<p>\nTechnically speaking, topic modeling is an unsupervised machine learning process used to extract latent themes from a text. Given a text and an integer, a topic modeler will count &amp; tabulate the frequency of words and compare those frequencies with the distances between the words. The words form &#8220;clusters&#8221; when they are both frequent and near each other, and these clusters can sometimes represent themes, topics, or subjects. Topic modeling is often used to denote the &#8220;aboutness&#8221; of a text or compare themes between authors, dates, genres, demographics, other topics, or other metadata items.\n<\/p>\n<p>\n<a href=\"https:\/\/github.com\/senderle\/topic-modeling-tool\">Topic Modeling Tool<\/a> is a GUI\/desktop topic modeler based on the venerable <a href=\"http:\/\/mallet.cs.umass.edu\">MALLET suite of software<\/a>. 
It can be used in a number of ways, and it is relatively easy to use it to: list five distinct themes from the <cite>Iliad<\/cite> and the <cite>Odyssey<\/cite>, compare those themes between books, and, assuming each chapter occurs chronologically, compare the themes over time.\n<\/p>\n<div id=\"attachment_957\" style=\"width: 294px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-01.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-957\" src=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-01-284x300.png\" alt=\"topics\" width=\"284\" height=\"300\" class=\"size-medium wp-image-957\" srcset=\"https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-01-284x300.png 284w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-01.png 671w\" sizes=\"auto, (max-width: 284px) 100vw, 284px\" \/><\/a><p id=\"caption-attachment-957\" class=\"wp-caption-text\">Simple list of topics<\/p><\/div>\n<div id=\"attachment_956\" style=\"width: 260px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-02.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-956\" src=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-02-250x300.png\" alt=\"topics\" width=\"250\" height=\"300\" class=\"size-medium wp-image-956\" srcset=\"https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-02-250x300.png 250w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-topics-02.png 685w\" sizes=\"auto, (max-width: 250px) 100vw, 250px\" \/><\/a><p id=\"caption-attachment-956\" class=\"wp-caption-text\">Topics distributed across a corpus<\/p><\/div>\n<div id=\"attachment_959\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books.png\"><img loading=\"lazy\" decoding=\"async\" 
aria-describedby=\"caption-attachment-959\" src=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books-300x204.png\" alt=\"topics\" width=\"300\" height=\"204\" class=\"size-medium wp-image-959\" srcset=\"https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books-300x204.png 300w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books-1024x697.png 1024w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books-768x522.png 768w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-books.png 1454w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-959\" class=\"wp-caption-text\">Comparing the two books of Homer<\/p><\/div>\n<div id=\"attachment_958\" style=\"width: 310px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-958\" src=\"http:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters-300x204.png\" alt=\"topics\" width=\"300\" height=\"204\" class=\"size-medium wp-image-958\" srcset=\"https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters-300x204.png 300w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters-1024x697.png 1024w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters-768x522.png 768w, https:\/\/sites.nd.edu\/emorgan\/files\/2020\/02\/model-chapters.png 1454w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-958\" class=\"wp-caption-text\">Topics compared over time<\/p><\/div>\n<h2>Topic Modeling Tool Recipes<\/h2>\n<p>\nThese few recipes are intended to get you up and running when it comes to Topic Modeling Tool. They are not intended to be a full-blown tutorial. 
This first recipe merely <strong>divides a corpus into the default number of topics<\/strong> and dimensions:\n<\/p>\n<ol>\n<li>Download and install Topic Modeling Tool<\/li>\n<li>Copy (<em>not move<\/em>) the whole of the txt directory to your computer&#8217;s desktop<\/li>\n<li>Create a folder\/directory named &#8220;model&#8221; on your computer&#8217;s desktop<\/li>\n<li>Open Topic Modeling Tool<\/li>\n<li>Specify the &#8220;Input Dir&#8230;&#8221; to be the txt folder\/directory on your desktop<\/li>\n<li>Specify the &#8220;Output Dir&#8230;&#8221; to be the folder\/directory named &#8220;model&#8221; on your desktop<\/li>\n<li>Click &#8220;Learn Topics&#8221;; the result ought to be a list of ten topics (numbered 0 to 9), each denoted with a set of scores and twenty words (&#8220;dimensions&#8221;); while functional, such a result is often confusing<\/li>\n<\/ol>\n<p>\nThis recipe will <strong>make things less confusing<\/strong>:\n<\/p>\n<ol>\n<li>Change the number of topics from the default (10) to five (5)<\/li>\n<li>Click the &#8220;Optional Settings&#8230;&#8221; button<\/li>\n<li>Change &#8220;The number of topic words to print&#8221; to something smaller, say five (5)<\/li>\n<li>Click the &#8220;Ok&#8221; button<\/li>\n<li>Click &#8220;Learn Topics&#8221;; the result will include fewer topics and fewer dimensions, and the result will probably be more meaningful, if not less confusing<\/li>\n<\/ol>\n<p>\nThere is no correct number of topics to extract with the process of topic modeling. 
&#8220;When considering the whole of Shakespeare&#8217;s writings, what is the number of topics it is about?&#8221; This being the case, repeat and re-repeat the previous recipe until you: 1) get tired, or 2) feel like the results are at least somewhat meaningful.\n<\/p>\n<p>\nThis recipe will help you make the results even cleaner by <strong>removing nonsense<\/strong> from the output:\n<\/p>\n<ol>\n<li>Copy the file named &#8220;stopwords.txt&#8221; from the etc directory to your desktop<\/li>\n<li>Click &#8220;Optional Settings&#8230;&#8221;; specify &#8220;Stopword File&#8230;&#8221; to be stopwords.txt; click &#8220;Ok&#8221;<\/li>\n<li>Click &#8220;Learn Topics&#8221;<\/li>\n<li>If the results contain nonsense words of any kind (or words that you just don&#8217;t care about), edit stopwords.txt to specify additional words to remove from the analysis<\/li>\n<li>Go to Step #3 until you get tired; the result ought to be topics with more meaningful words<\/li>\n<\/ol>\n<p>\nAdding individual words to the stopword list can be tedious, and consequently, here is a power-user&#8217;s recipe to <strong>accomplish the same goal<\/strong>:\n<\/p>\n<ol>\n<li>Identify words or regular expressions to be excluded from analysis; good examples include all numbers (\\d+), all single-letter words (\\b\\w\\b), and all two-letter words (\\b\\w\\w\\b)<\/li>\n<li>Use your text editor&#8217;s find\/replace function to remove all occurrences of the identified words\/patterns from the files in the txt folder\/directory; remember, you were asked to copy (not move) the whole of the txt directory, so editing the files in the txt directory will not affect your study carrel<\/li>\n<li>Run the topic modeling process<\/li>\n<li>Go to Step #1 until you: 1) get tired, or 2) are satisfied with the results<\/li>\n<\/ol>\n<p>\nNow that you have somewhat meaningful topics, you will probably want to visualize the results, and one way to do that is to illustrate how the topics are dispersed over 
the whole of the corpus. Luckily, the list of topics displayed in the Tool&#8217;s console is tab-delimited, making it easy to <strong>visualize<\/strong>. Here&#8217;s how:\n<\/p>\n<ol>\n<li>Topic model until you get a set of topics which you think is meaningful<\/li>\n<li>Copy the resulting topics, including the labels (numbers 0 through n), the scores, and the topic words<\/li>\n<li>Open your spreadsheet application, and paste the topics into a new sheet; the result ought to be three columns of information (labels, scores, and words)<\/li>\n<li>Sort the whole sheet by the second column (scores) in descending numeric order<\/li>\n<li>Optionally replace each generic label (numbers 0 through n) with a single meaningful word, thus denoting a topic<\/li>\n<li>Create a pie chart based on the contents of the first two columns (labels and scores); the result will appear similar to an illustration above, and it will give you an idea of how large each topic is in relation to the others<\/li>\n<\/ol>\n<p>\nThanks to a feature in Topic Modeling Tool, it is relatively easy to compare topics against metadata values such as authors, dates, formats, genres, etc. To accomplish this goal, the raw numeric information output by the Tool (the actual model) needs to be supplemented with metadata, pivoted, and then visualized. This is a power-user&#8217;s recipe because it requires: 1) a specifically shaped comma-separated values (CSV) file, 2) Python and a few accompanying modules, and 3) the ability to work from the command line. 
That said, here&#8217;s a recipe to <strong>compare &amp; contrast the two books of Homer<\/strong>:\n<\/p>\n<ol>\n<li>Copy the file named homer-books.csv to your computer&#8217;s desktop<\/li>\n<li>Click &#8220;Optional Settings&#8230;&#8221;; specify &#8220;Metadata File&#8230;&#8221; to be homer-books.csv; click &#8220;Ok&#8221;<\/li>\n<li>Click &#8220;Learn Topics&#8221;; the result ought to look pretty much like your previous results, but the underlying model has been enhanced<\/li>\n<li>Copy the file named pivot.py to your computer&#8217;s desktop<\/li>\n<li>When the modeling is complete, open up a terminal application and navigate to your computer&#8217;s desktop<\/li>\n<li>Run the pivot program (<code>python pivot.py<\/code>); the result ought to be an error message outlining the input pivot.py expects<\/li>\n<li>Run <code>pivot.py<\/code> again, but this time give it input; more specifically, specify &#8220;.\/model\/output_csv\/topics-metadata.csv&#8221; as the first argument (Windows users will specify .\\model\\output_csv\\topics-metadata.csv), specify &#8220;barh&#8221; for the second argument, and &#8220;title&#8221; as the third argument; the result ought to be a horizontal bar chart illustrating the differences in topics across the <cite>Iliad<\/cite> and the <cite>Odyssey<\/cite>; ask yourself, &#8220;To what degree are the books similar?&#8221;<\/li>\n<\/ol>\n<p>\nThe following recipe is very similar to the previous recipe, but it <strong>illustrates the ebb &amp; flow of topics<\/strong> throughout the whole of the two books:\n<\/p>\n<ol>\n<li>Copy the file named homer-chapters.csv to your computer&#8217;s desktop<\/li>\n<li>Click &#8220;Optional Settings&#8230;&#8221;; specify &#8220;Metadata File&#8230;&#8221; to be homer-chapters.csv; click &#8220;Ok&#8221;<\/li>\n<li>Click &#8220;Learn Topics&#8221;<\/li>\n<li>When the modeling is complete, open up a terminal application and navigate to your computer&#8217;s desktop<\/li>\n<li>Run 
<code>pivot.py<\/code> and specify &#8220;.\/model\/output_csv\/topics-metadata.csv&#8221; as the first argument (Windows users will specify .\\model\\output_csv\\topics-metadata.csv), specify &#8220;line&#8221; for the second argument, and &#8220;title&#8221; as the third argument; the result ought to be a line chart illustrating the increase &amp; decrease of topics from the beginning of the saga to the end, and ask yourself &#8220;What topics are discussed concurrently, and what topics are discussed when others are not?&#8221;<\/li>\n<\/ol>\n<p>\nTopic modeling is an effective process for &#8220;reading&#8221; a corpus &#8220;from a distance&#8221;. Topic Modeling Tool makes the process easier, but the process requires practice. Next steps are for the student to play with the additional options behind the &#8220;Optional Settings&#8230;&#8221; dialog box, read the Tool&#8217;s documentation, take a look at the structure of the CSV\/metadata file, and take a look under the hood at <code>pivot.py<\/code>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Technically speaking, topic modeling is an unsupervised machine learning process used to extract latent themes from a text. Given a text and an integer, a topic modeler will count &amp; tabulate the frequency of words and compare those frequencies with the distances between the words. 
The words form &#8220;clusters&#8221; when they are both frequent and [&hellip;]<\/p>\n","protected":false},"author":92,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10716],"tags":[],"class_list":["post-948","post","type-post","status-publish","format-standard","hentry","category-distant-reader"],"_links":{"self":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/948","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/users\/92"}],"replies":[{"embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/comments?post=948"}],"version-history":[{"count":13,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/948\/revisions"}],"predecessor-version":[{"id":965,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/posts\/948\/revisions\/965"}],"wp:attachment":[{"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/media?parent=948"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/categories?post=948"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sites.nd.edu\/emorgan\/wp-json\/wp\/v2\/tags?post=948"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}