Lexicons and sentiment analysis – Notes to self

This is mostly a set of notes to myself on lexicons and sentiment analysis.

A couple of weeks ago I asked Jeffrey Bain-Conkin to read at least one article about sentiment analysis (sometimes called “opinion mining”), and specifically I asked him to help me learn about the use of lexicons in such a process. He came back with a few more articles and a list of pointers to additional information. Thank you, Jeffrey! I am echoing the list here for future reference, for the possible benefit of others, and to remove some of the clutter from my to-do list. While I haven’t read and examined each of the items in great detail, just re-creating the list increases my knowledge. The list is divided into three sections: lexicons, software, and “more”.

Lexicons

  • Arguing Lexicon – “The lexicon includes patterns that represent arguing.”
  • BOOTStrep Bio-Lexicon – “Biological terminology is a frequent cause of analysis errors when processing literature written in the biology domain. For example, ‘retro-regulate’ is a terminological verb often used in molecular biology but it is not included in conventional dictionaries. The BioLexicon is a linguistic resource tailored for the biology domain to cope with these problems. It contains the following types of entries: a set of terminological verbs, a set of derived forms of the terminological verbs, general English words frequently used in the biology domain, [and] domain terms.”
  • English Phrases for Information Retrieval – “Goal of the ‘English Phrases for IR’ (EP4IR) project at the Radboud University Nijmegen (The Netherlands) is the development of a grammar and lexicon of English suitable for applications in Information Retrieval and available in the public domain.”
  • General Inquirer – “The General Inquirer is basically a mapping tool. It maps each text file with counts on dictionary-supplied categories. The currently distributed version combines the ‘Harvard IV-4’ dictionary content-analysis categories, the ‘Lasswell’ dictionary content-analysis categories, and five categories based on the social cognition work of Semin and Fiedler, making for 182 categories in all. Each category is a list of words and word senses. A category such as ‘self references’ may contain only a dozen entries, mostly pronouns. Currently, the category ‘negative’ is our largest with 2291 entries. Users can also add additional categories of any size.”
  • NRC word-emotion association lexicon – “The lexicon has human annotations of emotion associations for more than 24,200 word senses (about 14,200 word types). The annotations include whether the target is positive or negative, and whether the target has associations with eight basic emotions (joy, sadness, anger, fear, surprise, anticipation, trust, disgust).” The URL also points to a large number of articles on sentiment analysis in general.
  • Subjectivity Lexicon – “The Subjectivity Lexicon (list of subjectivity clues) that is part of OpinionFinder…”
  • WordNet – “WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser. WordNet is also freely and publicly available for download. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing.”
  • WordNet Domains – “WordNet Domains is a lexical resource created in a semi-automatic way by augmenting WordNet with domain labels. WordNet Synsets have been annotated with at least one semantic domain label, selected from a set of about two hundred labels structured according the WordNet Domain Hierarchy. Information brought by domains is complementary to what is already in Wordnet. A domain may include synsets of different syntactic categories and from different WordNet sub-hierarchies. Domains may group senses of the same word into homogeneous clusters, with the side effect of reducing word polysemy in WordNet.”
  • WordNet-Affect – “WordNet-Affect is an extension of WordNet Domains, including a subset of synsets suitable to represent affective concepts correlated with affective words. Similarly to our method for domain labels, we assigned to a number of WordNet synsets one or more affective labels (a-labels). In particular, the affective concepts representing emotional state are individuated by synsets marked with the a-label emotion. There are also other a-labels for those concepts representing moods, situations eliciting emotions, or emotional responses. The resource was extended with a set of additional a-labels (called emotional categories), hierarchically organized, in order to specialize synsets with a-label emotion. The hierarchical structure of new a-labels was modeled on the WordNet hyperonym relation. In a second stage, we introduced some modifications, in order to distinguish synsets according to emotional valence. We defined four addictional a-labels: positive, negative, ambiguous, and neutral.”

Software / applications

  • Linguistic Inquiry and Word Count – “Linguistic Inquiry and Word Count (LIWC) is a text analysis software program designed by James W. Pennebaker, Roger J. Booth, and Martha E. Francis. LIWC calculates the degree to which people use different categories of words across a wide array of texts, including emails, speeches, poems, or transcribed daily speech. With a click of a button, you can determine the degree any text uses positive or negative emotions, self-references, causal words, and 70 other language dimensions.”
  • OpinionFinder – “OpinionFinder is a system that processes documents and automatically identifies subjective sentences as well as various aspects of subjectivity within sentences, including agents who are sources of opinion, direct subjective expressions and speech events, and sentiment expressions.”
  • SenticNet – “SenticNet is a publicly available semantic resource for concept-level sentiment analysis. The affective common-sense knowledge base is built by means of sentic computing, a paradigm that exploits both AI and Semantic Web techniques to better recognize, interpret, and process natural language opinions over the Web. In particular, SenticNet exploits an ensemble of graph-mining and dimensionality-reduction techniques to bridge the conceptual and affective gap between word-level natural language data and the concept-level opinions and sentiments conveyed by them. SenticNet is a knowledge base that can be employed for the development of applications in fields such as big social data analysis, human-computer interaction, and e-health.”
  • SPECIALIST NLP Tools – “The SPECIALIST Natural Language Processing (NLP) Tools have been developed by the The Lexical Systems Group of The Lister Hill National Center for Biomedical Communications to investigate the contributions that natural language processing techniques can make to the task of mediating between the language of users and the language of online biomedical information resources. The SPECIALIST NLP Tools facilitate natural language processing by helping application developers with lexical variation and text analysis tasks in the biomedical domain. The NLP Tools are open source resources distributed subject to these [specific] terms and conditions.”
  • Visual Sentiment Ontology – “The analysis of emotion, affect and sentiment from visual content has become an exciting area in the multimedia community allowing to build new applications for brand monitoring, advertising, and opinion mining. There exists no corpora for sentiment analysis on visual content, and therefore limits the progress in this critical area. To stimulate innovative research on this challenging issue, we constructed a new benchmark and database. This database contains a Visual Sentiment Ontology (VSO) consisting of 3244 adjective noun pairs (ANP), SentiBank a set of 1200 trained visual concept detectors providing a mid-level representation of sentiment, associated training images acquired from Flickr, and a benchmark containing 603 photo tweets covering a diverse set of 21 topics. This website provides the above mentioned material for download…”

Lists of additional information

  • Lexical databases and corpora – “This is a list of links to lexical databases and corpora, organized by language or language group. The resources on this page were initially compiled from announcements on the LINGUIST list and web-search results. This is not intended to be an exhaustive list, but rather a place to organize and store potentially useful links as I [Jen Smith] encounter them.”
  • Opinion Mining, Sentiment Analysis, and Opinion Spam Detection – a long list of links pointing to articles, etc. about opinion mining.
  • Sentiment Symposium Tutorial – “This tutorial covers all aspects of building effective sentiment analysis systems for textual data, with and without sentiment-relevant metadata like star ratings. We proceed from pre-processing techniques to advanced uses cases, assessing common approaches and identifying best practices.”

Summary

What did I learn? I learned that to do sentiment analysis, lexicons are often employed. I learned that to evaluate a corpus for a particular sentiment, a researcher first needs to create a lexicon embodying that sentiment. Each element in the lexicon then needs to be assigned a quantitative value. The lexicon is then compared to the corpus tabulating the occurrences. Once tabulated, scores can then be summed, measurements taken, observations made and graphed, and conclusions/judgments made. Correct? Again, thank you, Jeffrey!

“Librarians love lists.”

Comments are closed.