---
title: "Data Analytics for Social Science - Lab 9"
author: "Johan A. Elkink"
date: "9 April 2020"
output:
  html_document:
    number_sections: yes
    toc: yes
    toc_float: true
---
```{r echo = FALSE, cache = FALSE}
source("http://www.joselkink.net/files/dass_utils.R", echo = FALSE)
# allow knitr to run with errors
knitr::opts_chunk$set(error = TRUE)
suppressWarnings(
  suppressMessages({
    library(rio)
    library(ggplot2)
    library(knitr)
    library(quanteda.corpora)
    library(tm)
    library(pander)
    library(psych)
    library(dplyr)
  })
)
```
# Opening data
We continue with the text data we used in Lab 8, based on the [State of the Union addresses by U.S. presidents](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/annual-messages-congress-the-state-the-union) since 1790. The data has been prepared for text analysis in R by the [Quanteda](https://quanteda.io) team, and is available in the "quanteda.corpora" package under the name "data_corpus_sotu".
The following code repeats all the data preparations from Lab 8.
```{r}
info <- attr(data_corpus_sotu, "docvars")
info$year <- as.integer(substr(info$Date, 1, 4))
corpus <- VCorpus(VectorSource(data_corpus_sotu))
names(corpus) <- info$docname_
# put all text in lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# remove English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("must", "will", "can", "make"))
# stem the text
corpus <- tm_map(corpus, stemDocument)
# remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# remove numbers
corpus <- tm_map(corpus, removeNumbers)
dtm <- DocumentTermMatrix(corpus)
# remove rare terms (those absent from more than 40% of documents)
dtm <- removeSparseTerms(dtm, sparse = 0.4)
# convert the document-term matrix to a standard R matrix
dtmMatrix <- as.matrix(dtm)
rownames(dtmMatrix) <- info$docname_
dtmMatrixRelative <- dtmMatrix / rowSums(dtmMatrix)
# create a term-document matrix by transposing
tdmMatrix <- t(dtmMatrix)
tdmMatrixRelative <- tdmMatrix / rowSums(tdmMatrix)
# save the dimensions of the matrix
ndocs <- dim(dtmMatrix)[1]
nterms <- dim(dtmMatrix)[2]
```
So we have three versions of the document-term matrix: _dtm_ is the original matrix, in a format specifically designed for text analysis; _dtmMatrix_ contains the same information in a standard R matrix; and _dtmMatrixRelative_ is a standard R matrix that records, instead of the number of times a term appears in a document, the proportion of words in a document that are that particular term (e.g. 5% of the words in document D are term T). We also have two versions of the term-document matrix: _tdmMatrix_, a standard R matrix that is simply the transpose of _dtmMatrix_; and _tdmMatrixRelative_, which records the proportion of the total use of a term that falls in each document (e.g. 5% of the time term T is used, it is in document D).
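To make the row-normalisation concrete, here is a tiny sketch on a made-up count matrix (the document names, terms, and numbers are purely illustrative):

```{r}
# Toy 2-document, 3-term count matrix (illustrative values only)
counts <- matrix(c(4, 1, 5,
                   2, 2, 6),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("docA", "docB"),
                                 c("nation", "peace", "war")))
counts / rowSums(counts)      # document-term proportions: each row sums to 1
tcounts <- t(counts)
tcounts / rowSums(tcounts)    # term-document proportions: each row sums to 1
```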
The analyses below are primarily based on _tdmMatrixRelative_.
Just so you get a feel for what the matrices look like, here are five random rows and six random columns of each:
```{r}
kable(dtmMatrix[sample(1:ndocs, 5), sample(1:nterms, 6)])
kable(dtmMatrixRelative[sample(1:ndocs, 5), sample(1:nterms, 6)], digits = 3)
kable(tdmMatrix[sample(1:nterms, 5), sample(1:ndocs, 6)])
kable(tdmMatrixRelative[sample(1:nterms, 5), sample(1:ndocs, 6)], digits = 3)
```
# Principal Components Analysis
## PCA on raw word counts
We first perform a PCA on the baseline matrix, in which each document is treated as a variable and each term as an observation: for each term we record the number of times it appears in each document.
```{r}
pca <- princomp(tdmMatrix)
ggplot(as.data.frame(pca$scores),
aes(x = Comp.1, y = Comp.2, label = colnames(dtm))) +
geom_point(color = "blue", alpha = .15) +
geom_text() +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on counts, scores") +
theme_minimal()
```
The problem here is that the variance of some terms is very high, such that they "drive" the PCA results and everything else gets pressed together. The variation in the data is dominated by the use of words like "nation", "govern", and "state". Indeed, "nation" is used 203 times in one speech (Carter in 1981), but only three times in another (Jefferson in 1802):
```{r}
lst <- sort(dtmMatrix[, "nation"])
pander(head(lst))
pander(tail(lst))
```
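The dominance of a few high-variance terms is a general property of PCA on raw counts. A minimal sketch with simulated data (the variable names are made up) shows how a single high-variance column takes over the first component:

```{r}
# One variable with much larger variance dominates the first principal component
set.seed(1)
X <- cbind(big    = rnorm(100, sd = 50),  # high-variance column
           small1 = rnorm(100),
           small2 = rnorm(100))
p <- princomp(X)
round(unclass(p$loadings)[, 1], 2)  # "big" carries nearly all of component 1
```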
## PCA on relative word counts
We use instead the matrix that records relative rather than absolute word usage, so that we correct for the overall frequency with which a particular word occurs. Very common words thus receive relatively less weight in the analysis and very rare words more, and the variances in word usage become more comparable, so that the results are no longer driven by a handful of words.
```{r}
pca <- princomp(tdmMatrixRelative)
ggplot(as.data.frame(pca$scores),
aes(x = Comp.1, y = Comp.2, label = colnames(dtm))) +
geom_point(color = "blue", alpha = .15) +
geom_text() +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on proportions, scores") +
theme_minimal()
```
We see a much more spread-out use of words. Here we have reduced a `r ncol(tdmMatrix)`-dimensional space, in which each axis is a document and each point a term, to a two-dimensional space that we can easily plot on screen. The axes are no longer particular documents, but the dimensions that explain most of the variation in word usage.
> **`r qnr()` If you read the terms on the left in the graph and on the right, what seem to be the topics that most separate different speeches?**
> **`r qnr()` Note that the vertical dimension explains the second-most variation in the speech contents. How do you interpret that distinction?**
How frequent are the words that drive this interpretation? We might want to change the font size and transparency (alpha) value depending on how common the word is.
```{r}
ggplot(as.data.frame(pca$scores),
aes(x = Comp.1, y = Comp.2, label = colnames(dtm),
size = rowSums(tdmMatrix)/1000)) +
geom_point(color = "blue", alpha = .15) +
geom_text(aes(alpha = rowSums(tdmMatrix) / max(rowSums(tdmMatrix)))) +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on proportions, scores") +
guides(size = "none", alpha = "none") +
theme_minimal()
```
There are more than two principal components, however, and we might want to plot another combination, for example components 1 and 3.
> **`r qnr()` Produce a plot of scores from the PCA output, plotting the first and third dimension.**
> **`r qnr()` Interpret the third principal component.**
We can look at the same PCA in two different ways. PCA is a rotation of the axes, where the original axes in this case are documents, and the rotated axes point in the directions of greatest variation in word usage. We can project each word, originally located relative to the document axes (based on the relative usage of the word), onto the new axes; this is what we plotted above. We can also, however, see how each of those new axes relates to the original axes, the documents. If a document loads highly on an axis, it relates closely to that axis; if it has a low loading, it does not.
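This projection can be verified on toy data, using nothing beyond base R: the scores returned by `princomp()` are exactly the centred data multiplied by the loadings matrix.

```{r}
# Scores are the centred data projected onto the rotated axes (the loadings)
set.seed(2)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
p <- princomp(X)
manual <- scale(X, center = p$center, scale = FALSE) %*% unclass(p$loadings)
all.equal(manual, unclass(p$scores), check.attributes = FALSE)  # TRUE
```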
We have to play a slight trick to get R to treat the loadings as a matrix, as opposed to a special type of object.
```{r}
L <- pca$loadings
class(L) <- "matrix"
ggplot(as.data.frame(L), aes(x = Comp.1, y = Comp.2,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = 3) +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal() +
theme(legend.position = "bottom")
```
Here we scale the size and transparency of the text by year, so that more recent presidents are more visible.
```{r}
ggplot(as.data.frame(L), aes(x = Comp.1, y = Comp.2,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = (info$year - 1700) / 320 * 3,
alpha = (info$year - 1700) / 320) +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal() +
theme(legend.position = "bottom")
```
> **`r qnr()` Produce also a plot to visualise the third dimension.**
Let's produce one where Trump's speeches are highlighted.
```{r}
ggplot(as.data.frame(L), aes(x = Comp.1, y = Comp.2,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = (info$year - 1700) / 320 * 3,
alpha = (info$year - 1700) / 320) +
geom_point(color = "black", size = ifelse(info$President == "Trump", 2, 0), alpha = .5) +
labs(x = "Principal Component 1", y = "Principal Component 2",
title = "PCA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal() +
theme(legend.position = "bottom")
```
> **`r qnr()` What do you think of the style of speech by Trump? Given the hype around his presidency, does he stand out here?**
> **`r qnr()` Produce a loadings plot for the second and third component.**
> **`r qnr()` Look back at the output from last week's lab. Do you see some of the clustering results reflected in the PCA output?**
Each dimension of the PCA is in the direction of greatest remaining variation, orthogonal to the earlier dimensions, so each new dimension explains a somewhat smaller share of the variation. We can investigate how quickly this share declines, and therefore how many dimensions are likely to summarise the variation well.
```{r}
plot(pca)
```
We see that the first dimension is dominant in this data, relative to subsequent dimensions.
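The scree plot can also be read off numerically from the standard deviations stored in the PCA output. A small sketch on simulated data (the same two lines apply equally to our `pca` object):

```{r}
# Proportion of variance explained per component, from the sdev slot
set.seed(3)
p <- princomp(matrix(rnorm(200), ncol = 4))
prop <- p$sdev^2 / sum(p$sdev^2)
round(cumsum(prop), 2)  # cumulative proportion of variance explained
```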
# Factor analysis
```{r}
suppressWarnings(
fa <- fa(tdmMatrixRelative, nfactors = 3, rotate = "oblimin",
scores = "regression", fm = "minres")
)
```
## Plots of scores
The output of a factor analysis can be seen from two perspectives: the scores indicate the relationship between each case (word) and the factors (dimensions), while the loadings indicate the relationship between each variable (speech) and the factors. This is unlike principal component analysis, where we could look at a dimensional analysis of the space defined by the words, and a separate one of the space defined by the speeches, but they did not come out of the same analysis.
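The scores/loadings distinction can be seen with base R's `factanal()`, used here instead of `psych::fa()` purely to keep the sketch self-contained: the loadings have one row per variable, the scores one row per case.

```{r}
# Toy data: three variables share one latent factor, one is pure noise
set.seed(4)
f <- rnorm(200)
X <- cbind(a = f + rnorm(200, sd = 0.5),
           b = f + rnorm(200, sd = 0.5),
           c = f + rnorm(200, sd = 0.5),
           d = rnorm(200))
fit <- factanal(X, factors = 1, scores = "regression")
dim(unclass(fit$loadings))  # 4 variables x 1 factor
dim(fit$scores)             # 200 cases x 1 factor
```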
Here we first turn to visualisations of the words in relation to the factors, separately for factors 1 and 2, for factors 1 and 3, and for factors 2 and 3.
```{r}
ggplot(as.data.frame(fa$scores), aes(x = MR1, y = MR2, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
geom_text() +
labs(x = "Factor 1", y = "Factor 2", title = "FA on proportions, scores") +
guides(size = "none") +
theme_minimal()
ggplot(as.data.frame(fa$scores), aes(x = MR1, y = MR3, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
geom_text() +
labs(x = "Factor 1", y = "Factor 3", title = "FA on proportions, scores") +
guides(size = "none") +
theme_minimal()
ggplot(as.data.frame(fa$scores), aes(x = MR2, y = MR3, label = colnames(dtm), size = rowSums(tdmMatrix)/1000)) +
geom_text() +
labs(x = "Factor 2", y = "Factor 3", title = "FA on proportions, scores") +
guides(size = "none") +
theme_minimal()
```
## Plots of loadings
The alternative, based on the same output, is the loadings, which show us how the speeches relate to the underlying factors.
```{r}
L <- fa$loadings
class(L) <- "matrix"
ggplot(as.data.frame(L), aes(x = MR1, y = MR2,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = 3) +
labs(x = "Factor 1", y = "Factor 2", title = "FA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal()
ggplot(as.data.frame(L), aes(x = MR1, y = MR3,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = 3) +
labs(x = "Factor 1", y = "Factor 3", title = "FA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal()
ggplot(as.data.frame(L), aes(x = MR2, y = MR3,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = 3) +
labs(x = "Factor 2", y = "Factor 3", title = "FA on proportions, loadings") +
guides(size = "none", color = guide_legend("")) +
theme_minimal()
```
> **`r qnr()` How do the factor analysis results compare to the PCA above?**
> **`r qnr()` Try to highlight Trump's speeches as before.**
> **`r qnr()` Repeat the above analysis using varimax rotation.**
> **`r qnr()` How much does the rotation matter for the interpretation of the factors?**
> **`r qnr()` Repeat the above analysis, using your preferred rotation, extracting only two dimensions.**
> **`r qnr()` Repeat the above analysis, using your preferred rotation, without plots, extracting fifteen dimensions.**
# Multidimensional scaling
As with principal component analysis, we can perform multidimensional scaling either on the space defined by the words or on the space defined by the speeches.
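Classical MDS takes only a matrix of pairwise distances and recovers a configuration of points that reproduces those distances. A quick sketch with three points in the plane:

```{r}
# cmdscale() recovers a configuration from pairwise distances alone
pts <- matrix(c(0, 0,
                3, 0,
                0, 4), ncol = 2, byrow = TRUE)
rec <- cmdscale(dist(pts), k = 2)
all.equal(as.vector(dist(rec)), as.vector(dist(pts)))  # distances preserved
```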
## Distances between terms
```{r}
mds <- cmdscale(dist(tdmMatrixRelative), k = 3)
ggplot(as.data.frame(mds), aes(x = V1, y = V2, label = colnames(dtm),
size = rowSums(tdmMatrix) / 1000)) +
geom_point(color = "blue", alpha = .15) +
geom_text() +
labs(x = "Coordinate 1", y = "Coordinate 2", title = "MDS on proportions, scores") +
guides(size = "none") +
theme_minimal()
```
> **`r qnr()` Add a plot to visualise the third dimension.**
> **`r qnr()` Repeat the analysis extracting only two axes.**
## Distances between speeches
```{r}
mds <- cmdscale(dist(dtmMatrixRelative), k = 3)
ggplot(as.data.frame(mds), aes(x = V1, y = V2,
label = info$docname_,
color = info$party)) +
geom_point(alpha = .15) +
geom_text(size = 3) +
labs(x = "Coordinate 1", y = "Coordinate 2", title = "MDS on proportions, scores") +
guides(size = "none", color = guide_legend("")) +
theme_minimal() +
theme(legend.position = "bottom")
```
> **`r qnr()` Try to highlight Trump's speeches as before.**
> **`r qnr()` Add a plot to visualise the third dimension.**
> **`r qnr()` How do the multidimensional scaling results compare to the PCA above?**