---
title: "Data Analytics for Social Science - Lab 10"
author: "Johan A. Elkink"
date: "14 November 2017"
output: html_document
---
# Opening data
We continue with the text data used in Lab 9, based on the [State of the Union addresses by U.S. presidents](https://archive.org/details/State-of-the-Union-Addresses-1945-2006) since World War II. The speeches by Obama and Trump were acquired from [State of the Union Addresses and Messages: research notes by Gerhard Peters](http://www.presidency.ucsb.edu/sou.php); it would be relatively straightforward to extend the data to earlier speeches, before Truman.
```{r}
suppressWarnings(suppressMessages({
  library(rio)
  library(ggplot2)
  library(knitr)
  library(tm)
  library(psych)
}))
load(url("http://www.joselkink.net/files/data/sou_corpus_postcleaning.Rdata"))
data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")
# For the remainder of this analysis we remove Truman 1946, which is a very
# unusual speech: three times longer than normal, and a clear outlier in
# word usage.
data <- data[-2, ]
dtm <- dtm[-2, ]
ndocs <- dim(dtm)[1]
nterms <- dim(dtm)[2]
dtmMatrix <- as.matrix(dtm)
dtmMatrixRelative <- dtmMatrix / rowSums(dtmMatrix)
tdm <- t(dtm)
tdmMatrix <- t(dtmMatrix)
tdmMatrixRelative <- tdmMatrix / rowSums(tdmMatrix)
```
So we have three versions of the document-term matrix: _dtm_ is the original matrix, which is in a format specifically designed for text analysis; _dtmMatrix_ contains the same information in a standard R matrix; _dtmMatrixRelative_ is a standard R matrix, but instead of the number of times a term appears in a document, it records the proportion of words in a document that are that particular term (e.g. 5% of the words in document D are term T). We also have the same for the term-document matrix: _tdm_ the original matrix; _tdmMatrix_ a standard R matrix; and _tdmMatrixRelative_ the proportion of total use of a term for each document (e.g. 5% of the time term T is used, it is in document D).
The analyses below are primarily based on _tdmMatrixRelative_.
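To make the distinction concrete, here is a minimal sketch with a hypothetical 3-document, 4-term count matrix (terms and counts are made up for illustration), showing how the relative versions are obtained by dividing each row by its sum:

```{r}
# Hypothetical counts: 3 documents (rows) by 4 terms (columns)
counts <- matrix(c(5, 0, 3, 2,
                   1, 4, 0, 5,
                   2, 2, 2, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("nation", "america", "year", "peace")))
counts / rowSums(counts)        # document-term proportions: each row sums to 1
t(counts) / rowSums(t(counts))  # term-document proportions: each row sums to 1
```

The same division by `rowSums()` is exactly what produced _dtmMatrixRelative_ and _tdmMatrixRelative_ above.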
Just so you get a feel for what these matrices look like, here are five random rows and six random columns of each:
```{r}
kable(dtmMatrix[sample(1:ndocs, 5), sample(1:nterms, 6)])
kable(dtmMatrixRelative[sample(1:ndocs, 5), sample(1:nterms, 6)], digits = 3)
kable(tdmMatrix[sample(1:nterms, 5), sample(1:ndocs, 6)])
kable(tdmMatrixRelative[sample(1:nterms, 5), sample(1:ndocs, 6)], digits = 3)
```
# Principal Components Analysis
## PCA on raw word counts
We first perform a PCA on the baseline matrix, in which each document is treated as a variable and for each term we measure the number of times it appears in that document.
```{r}
pca <- princomp(tdmMatrix)
ggplot(as.data.frame(pca$scores), aes(x = Comp.1, y = Comp.2, label = colnames(dtm))) +
  geom_text() +
  labs(x = "Principal Component 1", y = "Principal Component 2", title = "PCA on counts, scores")
```
The problem here is that the variance of some terms is very high, such that they "drive" the PCA results and everything else gets pressed together. The variation in the data simply reflects the use of words like "nation", "america", and "year". Indeed, "nation" is used 46 times in one speech (Eisenhower in 1956), but only twice in another (Nixon in 1973). (Note that we removed Truman in 1946, who used it no fewer than 127 times!)
```{r}
table(dtmMatrix[, "nation"])
which.max(dtmMatrix[, "nation"])
which.min(dtmMatrix[, "nation"])
```
## PCA on relative word counts
Instead, we use the matrix that records relative rather than absolute word usage, so that we correct for the overall frequency with which a particular word occurs. Very common words then receive relatively less weight in our analysis and very rare words more. The variances in word usage also become much more similar, so that results are no longer driven by just a few words.
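A minimal sketch of why this matters, with made-up counts for one frequent and one rare term across four hypothetical speeches: the raw variances differ by orders of magnitude, but after dividing each term's counts by its total the two are on a comparable scale.

```{r}
freq <- c(46, 30, 2, 25)  # hypothetical counts of a very common term
rare <- c(1, 0, 2, 1)     # hypothetical counts of a rare term
c(var(freq), var(rare))                          # raw variances: wildly different
c(var(freq / sum(freq)), var(rare / sum(rare)))  # relative usage: comparable scale
```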
```{r}
pca <- princomp(tdmMatrixRelative)
ggplot(as.data.frame(pca$scores), aes(x = Comp.1, y = Comp.2, label = colnames(dtm))) +
  geom_text() +
  labs(x = "Principal Component 1", y = "Principal Component 2", title = "PCA on proportions, scores")
```
We see a much wider spread of words. Here we have reduced a `r ncol(tdm)`-dimensional space, in which each axis is a document and each point a term, to a two-dimensional space that we can easily plot on the screen. The axes are no longer particular documents, but the dimensions that explain most of the variation in word usage.
> **If you read the terms on the left in the graph and on the right, what seem to be the topics that most separate different speeches?**
> **Note that the vertical dimension explains the second-most variation in the speech contents. How do you interpret that distinction?**
How frequent are the words that drive this interpretation? We might want to change the font size depending on how common the word is.
```{r}
ggplot(as.data.frame(pca$scores), aes(x = Comp.1, y = Comp.2, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
  geom_text() +
  labs(x = "Principal Component 1", y = "Principal Component 2", title = "PCA on proportions, scores") +
  guides(size = "none")
```
There are more than two principal components, however, and we might want to plot another combination, for example components 1 and 3.
> **Produce a plot of scores from the PCA output, plotting the first and third dimension.**
> **Interpret the third principal component.**
We can look at the same PCA in two different ways. PCA is a rotation of the axes, whereby the original axes are, in this case, documents, and the rotated axes point in the directions of most variation in word usage. We can now project each word, which was originally plotted against each document (based on the relative usage of the word), onto the new axes that point in the directions of most variation. This is what we plotted above. We can also, however, see how each of those new axes relates to the original axes, the documents. If a document loads high on an axis, it relates closely to that axis; if it has a low loading, it does not.
We have to play a slight trick to get R to understand the loadings as a matrix, as opposed to a special type of object.
```{r}
L <- pca$loadings
class(L) <- "matrix"
ggplot(as.data.frame(L), aes(x = Comp.1, y = Comp.2, label = data$label, color = data$party)) +
  geom_text(size = 3) +
  labs(x = "Principal Component 1", y = "Principal Component 2", title = "PCA on proportions, loadings") +
  guides(size = "none", color = guide_legend(""))
```
> **Produce also a plot to visualise the third dimension.**
> **What do you think of the style of speech by Trump? Given the hype around his presidency, does he stand out here?**
> **Produce a loadings plot for the second and third component.**
> **Look back at the output from last week's lab. Do you see some of the clustering results reflected in the PCA output?**
Each dimension of the PCA points in the direction of most variation, orthogonal to the earlier dimensions. This means that each new dimension explains a slightly smaller part of the variation. We might want to investigate how quickly this declines, and therefore how many dimensions are likely to capture the variation well.
```{r}
plot(pca)
```
We see that the first dimension is dominant in this data, relative to subsequent dimensions.
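The scree plot can also be quantified: the proportion of variance explained by each component is its squared standard deviation over the total. A sketch on R's built-in `iris` measurements (the same expression works on the `pca` object above):

```{r}
toy_pca <- princomp(iris[, 1:4])
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(cumsum(prop_var), 3)  # cumulative share of variance per component
```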
# Factor analysis
```{r}
fa <- fa(tdmMatrixRelative, nfactors = 3, rotate = "oblimin", scores = "regression", fm = "minres")
fa
ggplot(as.data.frame(fa$scores), aes(x = MR1, y = MR2, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
  geom_text() +
  labs(x = "Factor 1", y = "Factor 2", title = "FA on proportions, scores") +
  guides(size = "none")
ggplot(as.data.frame(fa$scores), aes(x = MR1, y = MR3, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
  geom_text() +
  labs(x = "Factor 1", y = "Factor 3", title = "FA on proportions, scores") +
  guides(size = "none")
ggplot(as.data.frame(fa$scores), aes(x = MR2, y = MR3, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
  geom_text() +
  labs(x = "Factor 2", y = "Factor 3", title = "FA on proportions, scores") +
  guides(size = "none")
L <- fa$loadings
class(L) <- "matrix"
ggplot(as.data.frame(L), aes(x = MR1, y = MR2, label = data$label, color = data$party)) +
  geom_text(size = 3) +
  labs(x = "Factor 1", y = "Factor 2", title = "FA on proportions, loadings") +
  guides(size = "none", color = guide_legend(""))
ggplot(as.data.frame(L), aes(x = MR1, y = MR3, label = data$label, color = data$party)) +
  geom_text(size = 3) +
  labs(x = "Factor 1", y = "Factor 3", title = "FA on proportions, loadings") +
  guides(size = "none", color = guide_legend(""))
ggplot(as.data.frame(L), aes(x = MR2, y = MR3, label = data$label, color = data$party)) +
  geom_text(size = 3) +
  labs(x = "Factor 2", y = "Factor 3", title = "FA on proportions, loadings") +
  guides(size = "none", color = guide_legend(""))
```
> **How do the factor analysis results compare to the PCA above?**
> **Repeat the above analysis using varimax rotation.**
> **Repeat the above analysis using oblimin rotation.**
> **How much does the rotation matter for the interpretation of the factors?**
> **Repeat the above analysis, using your preferred rotation, extracting only two dimensions.**
> **Repeat the above analysis, using your preferred rotation, without plots, extracting fifteen dimensions.**
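As a hint for the rotation exercises: only the `rotate` argument to `fa()` needs to change. The same idea can be sketched with base R's `factanal` on the built-in `mtcars` data (note that `factanal` offers `"varimax"` and the oblique `"promax"`, but not `"oblimin"`):

```{r}
fa_none <- factanal(mtcars[, 1:7], factors = 2, rotation = "none")
fa_vari <- factanal(mtcars[, 1:7], factors = 2, rotation = "varimax")
fa_prom <- factanal(mtcars[, 1:7], factors = 2, rotation = "promax")
```

The fit of the model is the same under each rotation; only the loadings, and hence the interpretation of the factors, change.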
# Multidimensional scaling
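Classical multidimensional scaling takes a matrix of pairwise distances and finds coordinates that reproduce those distances as closely as possible. A minimal sketch with four hand-made points: when the distances are exactly Euclidean, `cmdscale` recovers them perfectly (up to rotation and reflection).

```{r}
pts <- matrix(c(0, 0,  3, 0,  0, 4,  3, 4), ncol = 2, byrow = TRUE)
d <- dist(pts)             # Euclidean distances between the four points
rec <- cmdscale(d, k = 2)  # recovered two-dimensional coordinates
max(abs(dist(rec) - d))    # essentially zero: distances are reproduced
```

Below we apply the same function to distances between terms and between speeches.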
## Distances between terms
```{r}
mds <- cmdscale(dist(tdmMatrixRelative), k = 3)
ggplot(as.data.frame(mds), aes(x = V1, y = V2, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
  geom_text() +
  labs(x = "Coordinate 1", y = "Coordinate 2", title = "MDS on proportions, scores") +
  guides(size = "none")
```
> **Add a plot to visualise the third dimension.**
> **Repeat the analysis extracting only two axes.**
## Distances between speeches
```{r}
mds <- cmdscale(dist(dtmMatrixRelative), k = 3)
ggplot(as.data.frame(mds), aes(x = V1, y = V2, label = data$label, color = data$party)) +
  geom_text(size = 3) +
  labs(x = "Coordinate 1", y = "Coordinate 2", title = "MDS on proportions, scores") +
  guides(size = "none", color = guide_legend(""))
```
> **Add a plot to visualise the third dimension.**
> **How do the multidimensional scaling results compare to the PCA above?**