---
title: "Data Analytics for Social Science - Lab 12"
author: "Johan A. Elkink"
date: "26 April 2017"
output: html_document
---
# Opening data
```{r}
suppressWarnings(
suppressMessages({
library(rio)
library(tm)
library(knitr)
library(ggplot2)
library(reshape)
library(LDAvis)
library(topicmodels)
})
)
load(url("http://www.joselkink.net/files/data/sou_corpus_postcleaning.Rdata"))
data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")
```
# Topic models
In this lab we will run topic models on the same State of the Union speeches we have been working on in previous labs. Topic models are entirely unsupervised, so there are very few parameters to provide, but we have to indicate the number of topics we want to identify.
After we estimate the topic model, there are different ways of looking at the output. A first step is to see which terms are associated with which topics, to try to identify what the topics might mean and how clearly they are identifiable. We use the terms() function for this, e.g. here extracting the 10 most probable terms per topic.
```{r}
ldaModel <- topicmodels::LDA(dtm, k = 5)
terms(ldaModel, 10)
```
> ***If you had to come up with a label for each topic, what would you say?***
> ***Compare these topics with the plots after the cluster analysis on terms in Week 9 - do you find similar groups of terms?***
> ***Compare these topics with the plots after the principal components analysis from Week 10 - do you find the same terms defining the extracted dimensions?***
We can also extract the most important topics for each speech, for example to extract the two most important topics:
```{r}
topics(ldaModel, 2)
```
> ***The above model extracts 5 topics. In topic modeling it is common to have much larger numbers of topics. Try the model extracting 25 topics and see if there is a difference in the clarity of the topics. (WARNING: might be a bit slow.)***
> ***Continue below using a small number of topics, e.g. 5, to keep things readable.***
Topic modeling is a decomposition of a matrix into two matrices: one giving the probability of each topic being associated with each document, and one giving the probability of each term being associated with each topic. The above summaries show only the highest probabilities, but we might want to work with the full matrices.
The topic model we use here is a Bayesian model, and in Bayesian statistics these probabilities are called posterior probabilities. You can read "posterior" as the probability we obtain after estimating the model (the opposite is the prior, the probability before we estimate the model, but you can ignore this for now).
Here we create a data frame containing, for each document, the probability that it relates to each topic, and add some information from our metadata (president, year, party, etc.).
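As a minimal sketch of this decomposition (using made-up numbers for 3 documents, 2 topics and 4 terms, not the State of the Union data): multiplying the document-topic matrix by the topic-term matrix gives the model's implied document-term probabilities, and each row remains a probability distribution.
```{r}
# Made-up document-topic probabilities: 3 documents x 2 topics (rows sum to 1)
docTopics <- matrix(c(0.9, 0.1,
                      0.5, 0.5,
                      0.2, 0.8), nrow = 3, byrow = TRUE)
# Made-up topic-term probabilities: 2 topics x 4 terms (rows sum to 1)
topicTerms <- matrix(c(0.4, 0.4, 0.1, 0.1,
                       0.1, 0.1, 0.4, 0.4), nrow = 2, byrow = TRUE)
# The implied document-term probabilities are the matrix product
docTerms <- docTopics %*% topicTerms
rowSums(docTerms)  # each row is again a probability distribution over terms
```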
```{r}
postTopics <- cbind(data, data.frame(posterior(ldaModel)$topics))
kable(head(postTopics))
```
We then reshape these data so that we can more easily create plots by different groups. This can be done using the melt() function in the reshape package, though it works more smoothly in some cases than in others (see [Reshaping Data](http://www.statmethods.net/management/reshape.html)). This time we're lucky.
```{r}
postTopicsLong <- melt(postTopics, id = c("textfile", "year", "president", "label", "party"))
colnames(postTopicsLong)[6:7] <- c("topic", "probability")
kable(head(postTopicsLong))
```
We can use this data frame to plot, for example, the use of a particular topic over time.
```{r}
ggplot(postTopicsLong, aes(x = year, y = probability, col = topic)) +
geom_smooth() +
guides(col = guide_legend(title = "Topic")) +
labs(x = "Year", y = "Topic probability")
```
> ***Look back at the labels you just made up for each of the topics. What does the plot say about the contents of State of the Union speeches over time?***
We can also look at the topics by party or by individual president.
```{r}
aggData <- aggregate(probability ~ topic + party, postTopicsLong, mean)
ggplot(aggData, aes(x = topic, y = probability)) +
geom_bar(stat = "identity") +
facet_grid(. ~ party)
```
> ***Given your interpretation of the meaning of each of the topics, interpret the above graph.***
```{r}
aggData <- aggregate(probability ~ topic + president, postTopicsLong, mean)
ggplot(aggData, aes(x = topic, y = probability)) +
geom_bar(stat = "identity") +
facet_wrap(~ president)
```
> ***Given your interpretation of the meaning of each of the topics, interpret the above graph.***
We earlier used principal components analysis to identify underlying dimensions in the document-term matrix. We can use a similar method to find underlying dimensions in the topic-term matrix, to better understand how the topics relate. More precisely, this analysis calculates a distance matrix based on the similarity between topics in terms of their term usage, and then performs the dimensional analysis on this distance matrix. So the further apart topics are in the plot, the more their term usage differs.
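The idea behind the jsPCA() function used below can be sketched as follows (an illustration with made-up distributions, not the package's exact code): measure the distance between each pair of topic-term distributions with the Jensen-Shannon divergence, then project the resulting distance matrix into two dimensions, here with classical multidimensional scaling (cmdscale).
```{r}
# Two made-up topic-term distributions over four terms
p <- c(0.4, 0.4, 0.1, 0.1)
q <- c(0.1, 0.1, 0.4, 0.4)
# Jensen-Shannon divergence: a symmetric measure of distance
# between two probability distributions
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}
jsd(p, q)  # positive: the distributions differ
jsd(p, p)  # zero: identical distributions
# Three topics: build the pairwise distance matrix and project to 2D
topics <- rbind(p, q, c(0.25, 0.25, 0.25, 0.25))
d <- matrix(0, 3, 3)
for (i in 1:3) for (j in 1:3) d[i, j] <- sqrt(jsd(topics[i, ], topics[j, ]))
coords <- cmdscale(as.dist(d), k = 2)  # one (x, y) point per topic
```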
```{r}
pca <- as.data.frame(jsPCA(posterior(ldaModel)$terms))
pca$topic <- names(terms(ldaModel))
ggplot(pca, aes(x = x, y = y, label = topic)) + geom_text()
```
Instead of the topic numbers, with a bit of R trickery, we can also plot the terms themselves.
```{r}
pca$label <- apply(terms(ldaModel, 4), 2, paste, collapse = "\n")
ggplot(pca, aes(x = x, y = y, label = label)) + geom_point() + geom_text() +
lims(x = range(pca$x) * 1.25, y = range(pca$y) * 1.25)
```
A similar analysis can be performed on the document-topic matrix.
```{r}
pcaDocs <- cbind(data, as.data.frame(jsPCA(posterior(ldaModel)$topics)))
ggplot(pcaDocs, aes(x = x, y = y, label = label, color = party)) +
geom_text(size = 3) +
  lims(x = range(pcaDocs$x) * 1.25, y = range(pcaDocs$y) * 1.25)
```
> ***Compare the above plot with the output from earlier PCA analyses (in Lab 10).***
> ***Try the PCA plot with the output for 25 topics. Can you find interesting dimensions?***
> ***If there is time left, experiment with the following: (1) different numbers of topics (k) in the models; (2) copy the text cleaning code from the lab of Week 11 and try different cleaning options (sparsity, stop words).***
# Extra
> ***Instead of the LDA (Latent Dirichlet Allocation) method of topic modeling there is also the CTM (Correlated Topic Model) method, which allows topics to be correlated (i.e. some topics might more often co-occur in documents than other topics). Repeat some of the above analysis using CTM instead of LDA. (WARNING: CTM is a lot slower than LDA - try only one model, with few topics.)***
# Acknowledgements
This lab made use of [Quantitative Text Analysis, Exercise 9: Topic Models](http://www.kenbenoit.net/courses/essex2014qta/exercise9.pdf) by Ken Benoit.