Opening data

The data we will use for this lab is a data set that summarizes textual data numerically. The idea is that text can be used for statistical analysis by counting word frequencies. We get a data set where each row is a text document and each column is a word, and each cell holds the number of times that word appears in that document. Producing a reasonable data set of this kind requires some cleaning of the data, which we will turn to in Lab 11. For now we will use a pre-cleaned data set.
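To make the row-per-document, column-per-word layout concrete, here is a minimal sketch in base R. The two toy documents and the object names (docs, tokens, vocab, toyMatrix) are invented for illustration and are not part of the lab data; the lab's matrix was built with more careful cleaning than a simple split on spaces.

```r
# Toy documents (invented for illustration, not from the lab data)
docs <- c(doc1 = "peace and jobs and peace",
          doc2 = "jobs and taxes")

# Split each document into words on single spaces
tokens <- strsplit(docs, " ")

# Vocabulary: all distinct words, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word appears in each document;
# sapply() returns one column per document, so transpose to get one row per document
toyMatrix <- t(sapply(tokens, function(words) {
  as.integer(table(factor(words, levels = vocab)))
}))
colnames(toyMatrix) <- vocab

toyMatrix
```

Each row of toyMatrix is a document and each column a word, so toyMatrix["doc1", "peace"] gives the number of times "peace" appears in the first toy document.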

The data is based on the State of the Union addresses by U.S. presidents since World War II. Speeches by Obama and Trump were acquired from State of the Union Addresses and Messages: research notes by Gerhard Peters.

# Load the required packages quietly; tm is needed to handle the document-term matrix
suppressWarnings(
  suppressMessages({
    library(rio)
    library(ggplot2)
    library(knitr)
    library(tm)
  })
)

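# Load the pre-cleaned document-term matrix (creates the object "dtm")
load(url("http://www.joselkink.net/files/data/sou_corpus_postcleaning.Rdata"))

# Import the information on each speech as a regular data set
data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")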

The load() command above creates an object called “dtm”, which contains a document-term matrix: a matrix with a row for each document (in this case, each address) and a column for each term that remains after some cleaning of the data. Each cell of the matrix contains the number of times the term appears in the speech. In this case we have 76 documents (rows) and 213 terms (columns). The tm (text mining) library is needed to open and handle such document-term matrices.
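The dimensions and contents of a document-term matrix can be checked with a few tm functions. The sketch below builds a tiny stand-in object (toyDtm, an invented name, from two invented sentences) so that it runs without the download; the same calls should work on the lab's dtm object, where dim(dtm) is expected to report the 76 rows and 213 columns mentioned above.

```r
library(tm)

# Tiny stand-in corpus (invented for illustration, not the lab data)
toyCorpus <- VCorpus(VectorSource(c("peace jobs peace", "jobs taxes")))
toyDtm <- DocumentTermMatrix(toyCorpus)

# Number of documents (rows) and terms (columns)
dim(toyDtm)

# The terms that make up the columns
Terms(toyDtm)

# Print a summary and a sample of the cell counts
inspect(toyDtm)
```

With the lab data, replacing toyDtm by dtm in the last three calls gives a quick check that the file loaded as described.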

The data object is a regular data set, with information on each speech, such as the year of the speech, the president concerned, and the party of the president.

Word clouds

We can use the document-term matrix to produce word clouds for specific documents. For example, to compare Clinton’s first State of the Union with G.W. Bush’s first one:

library(wordcloud)
## Loading required package: RColorBrewer
# Convert to a regular matrix so that rows can be selected by document name
dtmMatrix <- as.matrix(dtm)

wordcloud(colnames(dtmMatrix), dtmMatrix["1993-Clinton.txt", ], scale=c(2.5,.2), max.words = 70)

wordcloud(colnames(dtmMatrix), dtmMatrix["2001-GWBush-1.txt", ], scale=c(2.5,.2), max.words = 70)
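Word placement in a word cloud is partly random, so it can help to list the most frequent terms of a document as a numeric check alongside the plot. The following is a minimal sketch: topTerms is a hypothetical helper written for this illustration, and toyMatrix is an invented stand-in for dtmMatrix so that the example runs on its own.

```r
# Toy stand-in for dtmMatrix (values invented for illustration)
toyMatrix <- matrix(c(5, 0, 2, 1,
                      1, 4, 0, 3),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("docA.txt", "docB.txt"),
                                    c("economy", "security", "health", "tax")))

# Hypothetical helper: the n most frequent terms in one document,
# leaving out terms that do not appear in it at all
topTerms <- function(m, doc, n = 10) {
  counts <- sort(m[doc, ], decreasing = TRUE)
  head(counts[counts > 0], n)
}

topTerms(toyMatrix, "docA.txt", n = 3)
```

With the lab data, a call such as topTerms(dtmMatrix, "1993-Clinton.txt") should list the terms that appear largest in the corresponding word cloud.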