---
title: "Data Analytics for Social Science - Lab 11"
author: "Johan A. Elkink"
date: "23 April 2020"
output:
  html_document:
    number_sections: yes
    toc: yes
    toc_float: true
---
```{r echo = FALSE, cache = FALSE}
source("http://www.joselkink.net/files/dass_utils.R", echo = FALSE)
# allow knitr to run with errors
knitr::opts_chunk$set(error = TRUE)
```
```{r}
suppressWarnings(
suppressMessages({
library(rio)
library(tm)
library(knitr)
library(ggplot2)
library(reshape)
library(LDAvis)
library(topicmodels)
library(quanteda.corpora)
library(pander)
library(dplyr)
})
)
```
# Opening data
We continue with the text data we used in Lab 8, based on the [State of the Union addresses by U.S. presidents](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union) since 1790. These data have been prepared for text analysis in R by the team behind [Quanteda](https://quanteda.io), and are included in the "quanteda.corpora" package under the name "data_corpus_sotu".
The following code repeats all the data preparations from Lab 8.
```{r}
info <- attr(data_corpus_sotu, "docvars")
info$year <- as.integer(substr(info$Date, 1, 4))
corpus <- VCorpus(VectorSource(data_corpus_sotu))
names(corpus) <- info$docname_
# put all text in lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# remove English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# remove some additional very common words
corpus <- tm_map(corpus, removeWords, c("must", "will", "can", "make"))
# stem the text
corpus <- tm_map(corpus, stemDocument)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
dtm <- DocumentTermMatrix(corpus)
# remove rare words
dtm <- removeSparseTerms(dtm, sparse = 0.4)
# convert the document-term matrix to a regular matrix
dtmMatrix <- as.matrix(dtm)
rownames(dtmMatrix) <- info$docname_
dtmMatrixRelative <- dtmMatrix / rowSums(dtmMatrix)
# create a term-document matrix
tdmMatrix <- t(dtmMatrix)
tdmMatrixRelative <- tdmMatrix / rowSums(tdmMatrix)
# save the dimensions of the matrix
ndocs <- dim(dtmMatrix)[1]
nterms <- dim(dtmMatrix)[2]
```
# Topic models
In this lab we will run topic models on the same State of the Union speeches we have been working on in previous labs. Topic models are entirely unsupervised, so there are very few parameters to provide, but we do have to indicate the number of topics we want to identify.
After we estimate the topic model, there are different ways of looking at the output. A first step is to see which terms are associated with which topics, to try to identify what the topics might mean and how clearly they are identifiable. We use the `terms()` function to do this, here extracting the 10 most probable terms per topic.
```{r}
ldaModel <- topicmodels::LDA(dtm, k = 5)
terms(ldaModel, 10)
```
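Note that LDA is estimated from a random starting point, so the topics (and their numbering) can differ between runs. If you want reproducible output, you can fix the random seed via the `control` argument (the seed value below is an arbitrary choice):

```r
# fixing the seed makes the estimation reproducible across runs;
# the value 2020 is arbitrary
ldaModel <- topicmodels::LDA(dtm, k = 5, control = list(seed = 2020))
terms(ldaModel, 10)
```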
> **`r qnr()` *If you had to come up with a label for each topic, what would you say?***
> **`r qnr()` *Compare these topics with the plots after the cluster analysis on terms in Week 8 - do you find similar groups of terms?***
> **`r qnr()` *Compare these topics with the plots after the principal components analysis from Week 9 - do you find the same terms defining the extracted dimensions?***
> **`r qnr()` *The above model extracts 5 topics. In topic modeling it is common to have much larger numbers of topics. Try the model extracting 25 topics and see if there is a difference in the clarity of the topics. (WARNING: might be a bit slow.)***
***Continue below using a small number of topics, e.g. 5, to keep things readable.***
Topic modeling is a decomposition of a matrix into two matrices: one giving the probability of each topic being associated with each document, and one giving the probability of each term being associated with each topic. The summaries above show only the highest probabilities, but we might want to work with the full matrices.
The topic model we use here is a Bayesian model, and in Bayesian statistics these probabilities are called posterior probabilities: the probabilities we obtain after estimating the model (their counterpart, the prior, is the probability assumed before estimation, but you can ignore that for now).
Here we create a data frame with the probability that each document relates to each topic, and add some information from our metadata (president, year, party, etc.).
```{r}
postTopics <- cbind(info, data.frame(posterior(ldaModel)$topics))
kable(head(postTopics))
```
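The `posterior()` function returns both matrices at once. As a quick sketch of their structure (using the model estimated above): each row of either matrix is a probability distribution, so the rows sum to one.

```r
post <- posterior(ldaModel)
dim(post$topics)   # documents x topics
dim(post$terms)    # topics x terms
# each row is a probability distribution and therefore sums to 1
rowSums(post$topics)[1:3]
rowSums(post$terms)
```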
We then reshape this data into long format, so that we can more easily create plots by different groups. This can be done with the `melt()` function in the reshape package, although it works more smoothly in some cases than in others (see [Reshaping Data](http://www.statmethods.net/management/reshape.html)). This time we're lucky.
```{r}
postTopicsLong <- melt(postTopics, id = c("docname_", "docid_", "segid_", "FirstName", "President", "Date", "delivery", "type", "party", "year"))
colnames(postTopicsLong)[11:12] <- c("topic", "probability")
kable(head(postTopicsLong))
```
We can use this data frame to plot, for example, the use of a particular topic over time.
```{r}
ggplot(postTopicsLong, aes(x = year, y = probability, col = topic)) +
geom_smooth(se = FALSE) +
labs(col = "Topic") +
labs(x = "Year", y = "Topic probability") +
theme_minimal()
```
> **`r qnr()` *Look back at the labels you just made up for each of the topics. What does the plot say about the contents of State of the Union speeches over time?***
We can also look at the topics by party or by individual president.
```{r}
aggData <- aggregate(probability ~ topic + party, postTopicsLong, mean)
ggplot(aggData, aes(x = topic, fill = topic, y = probability)) +
geom_bar(stat = "identity") +
labs(fill = "Topic") +
facet_grid(. ~ party) +
theme_minimal() +
theme(axis.text.x = element_blank())
```
> **`r qnr()` *Given your interpretation of the meaning of each of the topics, interpret the above graph.***
```{r}
aggData <- aggregate(probability ~ topic + President, postTopicsLong, mean)
ggplot(aggData, aes(x = topic, fill = topic, y = probability)) +
geom_bar(stat = "identity") +
facet_wrap(~ President) +
labs(fill = "Topic") +
theme_minimal() +
theme(axis.text.x = element_blank())
```
> **`r qnr()` *Given your interpretation of the meaning of each of the topics, interpret the above graph.***
Earlier we used principal components analysis to identify underlying dimensions in the document-term matrix. We can also use this approach to find underlying dimensions in the topic-term matrix, to better understand how the topics relate to each other. More precisely, this analysis calculates a distance matrix based on the similarity of the topics' term usage, and then performs the dimensional analysis on that distance matrix. So the further apart two topics are in the plot, the more their term usage differs.
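The `jsPCA()` function from the LDAvis package, used below, implements this idea: it computes Jensen-Shannon divergences between the topics' term distributions and then maps them into two dimensions with classical multidimensional scaling (`cmdscale()`). Here is a minimal base-R sketch of the same idea, using a made-up matrix of three "topics" over four "terms" (the numbers are purely illustrative, and we take the square root of the divergence so that it behaves as a proper distance; `jsPCA()` itself uses the raw divergence):

```r
# three hypothetical topic distributions over four terms
phi <- rbind(c(0.70, 0.10, 0.10, 0.10),
             c(0.10, 0.70, 0.10, 0.10),
             c(0.25, 0.25, 0.25, 0.25))

# Jensen-Shannon divergence between two probability vectors
js <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a == 0, 0, a * log(a / b)))
  (kl(p, m) + kl(q, m)) / 2
}

# pairwise distance matrix between the topics
# (square root of the JS divergence is a proper distance metric)
d <- matrix(0, 3, 3)
for (i in 1:3) for (j in 1:3) d[i, j] <- sqrt(js(phi[i, ], phi[j, ]))

# map the topics into two dimensions
coords <- cmdscale(as.dist(d), k = 2)
coords
```

Topics 1 and 2 concentrate their probability on different terms and so end up far apart, while the near-uniform topic 3 sits between them.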
```{r}
pca <- as.data.frame(jsPCA(posterior(ldaModel)$terms))
pca$topic <- names(terms(ldaModel))
ggplot(pca, aes(x = x, y = y, label = topic)) +
labs(x = "Dimension 1", y = "Dimension 2") +
geom_point(alpha = .3, color = "blue", size = 5) +
geom_text() +
theme_minimal()
```
With a bit of R trickery, we can also plot the most frequent terms for each topic instead of the topic numbers.
```{r}
pca$label <- apply(terms(ldaModel, 4), 2, paste, collapse = "\n")
ggplot(pca, aes(x = x, y = y, label = label)) +
labs(x = "Dimension 1", y = "Dimension 2") +
geom_point(alpha = .3, color = "blue", size = 5) +
geom_text() +
lims(x = range(pca$x) * 1.25, y = range(pca$y) * 1.25) +
theme_minimal()
```
A similar analysis can be performed on the document-topic matrix.
```{r}
pcaDocs <- cbind(info,
as.data.frame(jsPCA(posterior(ldaModel)$topics)))
ggplot(pcaDocs, aes(x = x, y = y, label = docname_, color = party)) +
labs(x = "Dimension 1", y = "Dimension 2") +
geom_text(size = 3) +
lims(x = range(pcaDocs$x) * 1.25, y = range(pcaDocs$y) * 1.25) +
theme_minimal() +
theme(legend.position = "bottom")
```
> **`r qnr()` *Compare the above plot with the output from earlier PCA analyses (in Lab 9).***
> **`r qnr()` *Try the PCA plot with the output for 25 topics. Can you find interesting dimensions?***
> **`r qnr()` *If there is time left, experiment with the following: (1) different numbers of topics (k) in the models; (2) the text cleaning code at the top of this lab, trying different cleaning options (sparsity, stop words).***
# Extra
> **`r qnr()` *Instead of the LDA (Latent Dirichlet Allocation) method of topic modeling there is also the CTM (Correlated Topic Model) method, which allows topics to be correlated (i.e. some topics might more often co-occur in documents than other topics). Repeat some of the above analysis using CTM instead of LDA. (WARNING: CTM is a lot slower than LDA - try only one model, with few topics.)***
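As a starting point, the `CTM()` function in the topicmodels package has the same interface as `LDA()`; the sketch below re-uses the document-term matrix from above (expect a considerably longer running time):

```r
# same interface as LDA(); allow plenty of time for this to run
ctmModel <- topicmodels::CTM(dtm, k = 5)
terms(ctmModel, 10)
# e.g., an informal look at how the topics co-occur across documents
cor(posterior(ctmModel)$topics)
```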
# Acknowledgements
This lab made use of [Quantitative Text Analysis, Exercise 9: Topic Models](http://www.kenbenoit.net/courses/essex2014qta/exercise9.pdf) by Ken Benoit.