---
title: "Data Analytics for Social Science - Lab 9"
author: "Johan A. Elkink"
date: "5 April 2017"
output: html_document
---
# Opening data
The data we will use for this lab is a statistical data set summarising textual data. The idea is that text can be analysed statistically by looking at word frequencies: each row of the data set is a text document, each column a word, and each cell the number of times that word appears in that document. Producing a usable data set requires some cleaning of the raw text, but we will turn to that in Lab 11. For now we will use a pre-cleaned data set.
The data is based on the [State of the Union addresses by U.S. presidents](https://archive.org/details/State-of-the-Union-Addresses-1945-2006) since World War II. Speeches by Obama and Trump were acquired from [State of the Union Addresses and Messages: research notes by Gerhard Peters](http://www.presidency.ucsb.edu/sou.php).
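As a toy illustration of this idea (entirely made-up one-line "documents", not the lab data), a small document-term matrix can be built by hand:

```{r}
# Toy example (not the lab data): three tiny "documents" and their word counts
toyDocs <- c(a = "peace and prosperity",
             b = "war and peace",
             c = "economic prosperity")
toyTerms <- c("peace", "prosperity", "war", "economic", "and")
toyDtm <- t(sapply(strsplit(toyDocs, " "),
                   function(words) table(factor(words, levels = toyTerms))))
toyDtm
```

Each row is a document, each column a term, and each cell a count - the same structure as the real matrix we load below, just without any cleaning.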
```{r}
suppressWarnings(
suppressMessages({
library(rio)
library(ggplot2)
library(knitr)
library(tm)
})
)
load(url("http://www.joselkink.net/files/data/sou_corpus_postcleaning.Rdata"))
data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")
```
The load() command above creates an object called "dtm", which contains a document-term matrix: a matrix with a row for each document (in this case, each address) and a column for each term, after some cleaning of the data. The cells of the matrix contain the number of times the term appears in the speech. In this case we have `r dim(dtm)[1]` documents (rows) and `r dim(dtm)[2]` terms (columns). The tm (text mining) library is needed to open and handle such document-term matrices.
The data object is a regular data set, with information on each speech, such as the year of the speech, the president concerned, and the party of the president.
# Word clouds
We can use the document-term matrix to produce word clouds for specific documents. For example, to compare Clinton's first State of the Union with G.W. Bush's first one:
```{r}
library(wordcloud)
dtmMatrix <- as.matrix(dtm)
wordcloud(colnames(dtmMatrix), dtmMatrix["1993-Clinton.txt", ], scale=c(2.5,.2), max.words = 70)
wordcloud(colnames(dtmMatrix), dtmMatrix["2001-GWBush-1.txt", ], scale=c(2.5,.2), max.words = 70)
```
> **Create a word cloud for the first speech, Truman in 1945.**
> **Create a word cloud for the last speech, Trump in 2017.**
# Use of terms over time
We can use the same matrix to plot the use of particular terms over time. For example, reference to peace:
```{r}
ggplot(mapping = aes(x = data$year, y = dtmMatrix[, "peac"])) + geom_line() + labs(x = "Year", y = "Frequency of peace") + geom_smooth(se = FALSE)
```
This image is somewhat skewed by the fact that the lengths of the speeches differ, so we should be looking at relative word frequencies, not absolute ones.
```{r}
dtmMatrixRelative <- dtmMatrix / rowSums(dtmMatrix)
ggplot(mapping = aes(x = data$year, y = dtmMatrixRelative[, "peac"])) + geom_line() + labs(x = "Year", y = "Frequency of peace") + geom_smooth(se = FALSE)
```
We can also add multiple terms together this way.
```{r}
ggplot(mapping = aes(x = data$year, y = dtmMatrixRelative[, "econom"] + dtmMatrixRelative[, "economi"])) + geom_line() + labs(x = "Year", y = "Frequency of economics") + geom_smooth(se = FALSE)
```
# Cluster analysis of documents
To perform a cluster analysis, we need a distance matrix between documents. This distance needs to be calculated based on word frequency, but that calculation gets heavily influenced by the length of the document in the first place. We therefore need to look at proportions, the relative usage of a word in a document, instead of raw counts. We calculate the proportions as follows:
```{r}
dtmMatrixRelative <- dtmMatrix / rowSums(dtmMatrix)
```
We start with a basic K-means cluster analysis, creating just three groups. We use the "MacQueen" algorithm, which is the simple algorithm explained in the slides. The default in R is "Hartigan-Wong", which tends to work a little better and which puts restrictions on what kind of moves points can make.
```{r}
k <- kmeans(dtmMatrixRelative, 3, algorithm = "MacQueen")
```
We can look at a table of cluster by president to see how speeches were roughly classified.
```{r}
kable(table(k$cluster, data$president), row.names = TRUE)
```
We can use the slightly better default algorithm and, instead of just one set of random starting points, use 20 different random starts and keep the best solution (the one with the lowest total within-cluster sum of squares).
```{r}
k <- kmeans(dtmMatrixRelative, 3, algorithm = "Hartigan-Wong", nstart = 20)
kable(table(k$cluster, data$president), row.names = TRUE)
```
> **Repeat the above analysis, generating 8 groups. Do you notice any patterns in the cluster assignment?**
Much of the clustering appears to be related to changes in themes, political issues, and use of language over time, which can be seen when producing a cross-table with year instead of president.
> **For the 8 group clustering result, produce a table using year instead of president. What do you notice?**
# Hierarchical cluster analysis
For hierarchical cluster analysis we first need a distance matrix between the documents, which can be calculated using the dist() command.
```{r}
souDistances <- dist(dtmMatrixRelative)
```
We now perform and visualise the hierarchical cluster analysis, using three different methods for aggregation.
## Method 1: complete linkage
```{r}
souCluster <- hclust(souDistances)
souCluster$labels <- paste(data$president, data$year)
plot(souCluster, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
> **Interpret the cluster dendrogram. What clusters do you see? Can you guess why those might be clustered together?**
We can cut this dendrogram such that we have a fixed number of groups - this is comparable to the pruning of trees we did before. The result is a variable that indicates the group each observation belongs to, so in this case we get the group classification for each document. For example, we can classify the speeches into 7 groups and then see which speech belongs to which group.
```{r}
group <- cutree(souCluster, k = 7)
table(rownames(dtm), group)
```
It seems some clusters are related to periods of war (1945 end of World War II, 1965 Vietnam, 1991 Iraq, 2001 Twin Towers), and some clusters reflect certain time periods.
There seems to be some correlation with party, but not a strong one:
```{r}
table(group, data$party)
```
We could also make a word cloud for a specific cluster, for example the 3rd (Truman/Eisenhower/Kennedy). Here we need to add documents together first - we simply take the total frequency across all documents in the group.
```{r}
freq <- colSums(dtmMatrix[group == 3, ])
wordcloud(colnames(dtmMatrix), freq, scale=c(2.5,.2), max.words = 70)
```
> **Repeat the above analysis, but reducing to just two groups.**
## Method 2: Ward's minimum variance
```{r}
souCluster <- hclust(souDistances, method = "ward.D2")
souCluster$labels <- paste(data$president, data$year)
plot(souCluster, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
> **Check whether there are striking differences between this and the previous dendrogram.**
## Method 3: single linkage
```{r}
souCluster <- hclust(souDistances, method = "single")
souCluster$labels <- paste(data$president, data$year)
plot(souCluster, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
## Using absolute instead of squared distances
```{r}
souCluster <- hclust(sqrt(souDistances), method = "complete")
souCluster$labels <- paste(data$president, data$year)
plot(souCluster, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
## Using correlation instead of distance
Alternatively to using Euclidean distances between the documents in the space defined by the relative word frequencies, we can also use the correlation in word usage between documents as a measure of similarity. (See also [Correlation "Distances" and Hierarchical Clustering](http://research.stowers.org/mcm/efg/R/Visualization/cor-cluster/index.htm).)
```{r}
souCorrelation <- 1 - abs(cor(t(dtmMatrixRelative)))
souCluster <- hclust(as.dist(souCorrelation))
souCluster$labels <- paste(data$president, data$year)
plot(souCluster, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
> **Check whether results are different from above.**
# Cluster analysis of terms
The above analysis clusters documents (State of the Union addresses) based on their (dis)similarity in word usage. This gives us an idea which speeches are similar to which other speeches and what kind of groups or clusters of speeches we can observe.
For the above cluster analysis we use a document-term-matrix, but we could also have done the same analysis using a term-document-matrix. So instead of clustering documents by word usage, we cluster words by how often they appear in the same documents. In the above, documents are very different if they use different words; here, words are very different when they appear in different documents.
Because there are too many words to produce a readable dendrogram, we select only the 25% of words that vary the most across documents.
```{r}
# calculate variation in term usage
v <- apply(dtmMatrix, 2, var)
tdmMatrix <- t(dtmMatrix)
# get only 25% of terms with highest variation in usage
tdmMatrix <- tdmMatrix[v > quantile(v, .75), ]
tdmMatrixRelative <- tdmMatrix / rowSums(tdmMatrix)
```
```{r}
souDistancesT <- dist(tdmMatrixRelative)
souClusterT <- hclust(souDistancesT)
plot(souClusterT, cex = .5, axes = FALSE, xlab = "", ylab = "")
```
> **Interpret the output. Can you see a clustering by topic?**
While a dendrogram of all terms is not useful for visualisation, because there are too many of them, we can still cluster on all terms, cut the tree down to a limited number of groups, and study the attention paid to each group over time.
```{r}
tdmMatrix <- t(dtmMatrix)
tdmMatrixRelative <- tdmMatrix / rowSums(tdmMatrix)
souDistancesTfull <- dist(tdmMatrixRelative)
souClusterTfull <- hclust(souDistancesTfull)
group <- cutree(souClusterTfull, k = 3)
ggplot(mapping = aes(x = data$year, y = rowSums(dtmMatrixRelative[, group == 1]))) + geom_line() + labs(x = "Year", y = "Frequency of group 1") + geom_smooth(se = FALSE)
colnames(dtmMatrix)[group == 1]
ggplot(mapping = aes(x = data$year, y = rowSums(dtmMatrixRelative[, group == 2]))) + geom_line() + labs(x = "Year", y = "Frequency of group 2") + geom_smooth(se = FALSE)
colnames(dtmMatrix)[group == 2]
ggplot(mapping = aes(x = data$year, y = rowSums(dtmMatrixRelative[, group == 3]))) + geom_line() + labs(x = "Year", y = "Frequency of group 3") + geom_smooth(se = FALSE)
colnames(dtmMatrix)[group == 3]
```
> **Use the above but cut in nine instead of three groups. List the keywords associated with cluster 5 - what would this be about? Plot the trend over time for this cluster.**
(See [Partitioning Cluster Analysis](http://www.sthda.com/english/wiki/partitioning-cluster-analysis-quick-start-guide-unsupervised-machine-learning) for much more on clustering and related visualisation in R.)
# Extra: Tree analysis
Although the topic of this lab is unsupervised cluster analysis, we could also do some supervised tree analysis on this data, for example to predict the party of the president or the individual president. This kind of analysis requires a regular data set, not a document-term matrix, so we transform the matrix to a data frame. We also add a small prefix before each term, as some terms are not valid variable names in R.
```{r}
colnames(dtm) <- paste(".", colnames(dtm), sep = "")
dtmDF <- as.data.frame(as.matrix(dtm))
```
```{r}
library(tree)
library(randomForest)
t <- tree(as.factor(data$party) ~ ., dtmDF)
plot(t, main = "Tree classification of party of president")
text(t)
t <- tree(as.factor(data$president) ~ ., dtmDF)
plot(t, main = "Tree classification of president")
text(t)
f <- randomForest(as.factor(data$party) ~ ., dtmDF)
varImpPlot(f, main = "Variable importance in random forest\nexplaining party of president", pch = 19, n.var = 15)
```