Note that for today’s lab we need a package that is not available from the standard R repository (CRAN), so we have to be explicit about where it should be installed from. Run the following in your console before compiling the rest:

install.packages("austin", repos = "http://R-Forge.R-project.org", 
                 dependencies = "Depends", type = "source")

Once the other packages are also installed, we can load them all and proceed.

suppressWarnings(
  suppressMessages({
    library(tm)
    library(NLP)
    library(rio)
    library(austin)
    library(ggplot2)
  })
)

Opening data

In the previous two labs we made use of pre-cleaned data. In this lab we will learn how to do the data preparation ourselves, so that you can apply it to new data sets.

When we download the data set of State of the Union addresses by U.S. presidents since World War II, we obtain a ZIP file which, when unzipped, contains a set of plain text files. These text files can be imported using the tm (text mining) library in R as follows (this code will not be executed in the lab):

library(tm)

ds <- DirSource(
  directory = "~/dass/data/project 3/speeches/",   
      # This has to be set to the folder where 
      # the .txt files are located.
  pattern = ".*txt",
  mode = "text"
)

corpus <- VCorpus(ds)

We pick up from here in the lab, loading the imported data directly into R:

load(url("http://www.joselkink.net/files/data/sou_corpus_precleaning.Rdata"))

Note that this corpus contains two speeches in addition to the State of the Union addresses: two party convention speeches from 2016, by Ted Cruz and Bernie Sanders. These will be used as reference points to place the speeches on a left-right ideological dimension.
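
As a quick check that these two documents are indeed in the corpus, we can look at the document names. A minimal sketch, assuming (as the output further below suggests) that all the State of the Union files are named starting with the year:

# The two convention speeches are the only documents whose names
# do not start with a year
names(corpus)[!grepl("^[0-9]{4}", names(corpus))]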

We also load some metadata about the documents.

data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")
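
If you want to see what this metadata contains (we will use its year variable later for plotting), take a quick look at its structure and first few rows:

# Inspect the metadata: variable types and the first few rows
str(data)
head(data)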

Cleaning text data

Text data requires a reasonable amount of cleaning before it can be used for statistical analysis:

- converting all text to lower case;
- removing stop words (very common words such as "the" or "and" that carry little substantive meaning);
- stemming, so that for example "nation", "nations" and "national" are all counted as the same term;
- removing punctuation;
- removing numbers.

The tm package in R helps with all of these steps:

# put all text in lower case
corpus <- tm_map(corpus, content_transformer(tolower))

# remove English stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, c("must", "will", "can", "make"))

# stem the text
corpus <- tm_map(corpus, stemDocument)

# remove punctuation
corpus <- tm_map(corpus, removePunctuation)

# remove numbers
corpus <- tm_map(corpus, removeNumbers)

Use the output from the previous two labs to see if there are any additional stopwords you would like to remove.
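
For example, if you conclude that a few more very frequent but uninformative words should be dropped, you can add another removeWords step. A sketch with hypothetical extra stopwords; note that because the corpus has already been stemmed at this point, they should be supplied in stemmed form, and rerunning the steps below will then give slightly different numbers:

# Hypothetical additional stopwords (in stemmed form); replace these
# with words you identified in the previous labs
extra.stopwords <- c("countri", "everi")
corpus <- tm_map(corpus, removeWords, extra.stopwords)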

We normally do our statistical analysis using a document-term matrix, where each row is a document and each column a word. Once we have this matrix, we also want to remove sparse terms, i.e. words that occur in only a small proportion of the documents.

dtm <- DocumentTermMatrix(corpus)

# remove rare words
dtm <- removeSparseTerms(dtm, sparse = 0.4)

After finishing the rest of this lab, try the analysis below with different levels of sparseness.
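
To get a sense of how sensitive the matrix is to this choice, you can compare how many terms survive at different sparseness levels. A quick sketch, rebuilding the full matrix first and using nTerms() from tm to count the remaining columns:

# Number of terms retained at several sparseness thresholds
full.dtm <- DocumentTermMatrix(corpus)
sapply(c(0.2, 0.4, 0.6, 0.8),
       function(s) nTerms(removeSparseTerms(full.dtm, sparse = s)))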

The inspect() function shows a small segment of the matrix, to give you an idea of what it looks like.

inspect(dtm)
<<DocumentTermMatrix (documents: 78, terms: 428)>>
Non-/sparse entries: 26225/7159
Sparsity           : 21%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
                     Terms
Docs                  america american congress nation new now peopl work
  1946-Truman.txt           2       19       75    128  45  67    39   44
  1955-Eisenhower.txt       6        8       25     44  18  11    18   12
  1956-Eisenhower.txt       9       12       29     46  26  19    22   10
  1993-Clinton.txt         14       40       12     22  29  18    39   27
  1994-Clinton.txt         12       32       25     17  17  16    59   47
  1995-Clinton.txt         22       30       21     22  41  18    63   67
  1998-Clinton.txt         44       33       25     30  46  22    18   23
  1999-Clinton.txt         37       40       27     28  29  40    14   42
  2000-Clinton.txt         32       37       18     29  47  32    31   44
  2013-Obama.txt           29       24       17      9  24  39    17   33
                     Terms
Docs                  world year
  1946-Truman.txt        55  199
  1955-Eisenhower.txt    17   36
  1956-Eisenhower.txt    17   46
  1993-Clinton.txt       18   33
  1994-Clinton.txt       20   41
  1995-Clinton.txt       16   49
  1998-Clinton.txt       20   46
  1999-Clinton.txt       17   46
  2000-Clinton.txt       16   45
  2013-Obama.txt         10   31

We can see that the matrix has 78 documents (rows) and 428 terms (columns).
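
You can also read these dimensions off directly, without printing a sample of the matrix:

# Number of documents (rows) and terms (columns)
dim(dtm)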

We can use the findFreqTerms() function to find frequently used words, for example all words that occur at least 400 times:

findFreqTerms(dtm, 400)
 [1] "act"      "also"     "america"  "american" "ask"      "billion" 
 [7] "budget"   "care"     "children" "come"     "congress" "continu" 
[13] "countri"  "econom"   "effort"   "everi"    "feder"    "first"   
[19] "free"     "freedom"  "give"     "good"     "govern"   "great"   
[25] "health"   "help"     "home"     "hope"     "increas"  "job"     
[31] "just"     "know"     "last"     "let"      "live"     "mani"    
[37] "million"  "nation"   "need"     "new"      "now"      "one"     
[43] "peac"     "peopl"    "people"   "power"    "program"  "propos"  
[49] "provid"   "right"    "secur"    "state"    "support"  "system"  
[55] "take"     "tax"      "time"     "today"    "tonight"  "unit"    
[61] "use"      "want"     "war"      "way"      "work"     "world"   
[67] "year"     "years"   

Wordscores

The wordscores algorithm scales a set of texts on a one-dimensional space based on a set of reference texts. Here we estimate the word scores based on the convention speeches by Cruz and Sanders, as the most right- and most left-wing candidates in the recent elections. We scale all State of the Union speeches on this Sanders-Cruz dimension, to get an ideological spectrum.

While there is a risk that word usage correlates more strongly with time than with ideology (see the previous two labs), using two reference texts from the same time period should mitigate this problem.

The austin library provides functions to estimate wordscores and wordfish models. These work on the basis of a word-frequency matrix (wfm), i.e. what we have been calling a document-term matrix, so we need to convert our matrix to this format.
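
The conversion itself is a single call. In the code below we do it inline when calling classic.wordscores() and predict(), but as a standalone step it would look like this (a sketch; sou.wfm is just an illustrative name):

# Convert the tm document-term matrix to austin's word-frequency matrix format
sou.wfm <- as.wfm(dtm)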

The wordscores procedure consists of two separate steps: (1) estimate a score for each term from the reference documents, based on their a priori classification, and (2) score the virgin documents (the remaining, unclassified documents) using these word scores.

We print 40 of the scored terms, to get an idea of what these scores look like, and plot a histogram of the scores for all terms.

reference.texts <- c(
  which(names(corpus) == "cruz.txt"),
  which(names(corpus) == "sanders.txt")
)

reference.scores <- c(100, -100)

ws <- classic.wordscores(as.wfm(dtm[reference.texts, ]), reference.scores)

head(ws$pi[, "Score"], 40)
      abil        abl     achiev        act    address     afford 
-100.00000   16.66667  100.00000  -36.36364 -100.00000 -100.00000 
       age        ago  agreement      allow     almost       also 
 100.00000   47.36842 -100.00000   61.53846 -100.00000  -17.64706 
   america   american  americans      among     around        ask 
  47.36842  -26.31579 -100.00000 -100.00000 -100.00000  100.00000 
      away       back       base      becom      begin      begun 
 100.00000  100.00000 -100.00000 -100.00000   16.66667 -100.00000 
    behind     believ    benefit       best     better       bill 
 100.00000  -28.20513  100.00000  100.00000  100.00000  100.00000 
   billion      bless      bring      build       busi       call 
-100.00000  100.00000   16.66667  100.00000 -100.00000  -36.36364 
      care      chang      child   children 
 -66.66667   16.66667  100.00000  -17.64706 

hist(ws$pi, col = "blue", border = 0)

This histogram has a particular structure: very high frequencies at the two extremes of the scale and much lower frequencies in the middle. What does this say about the two reference texts?
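
To see what classic.wordscores() is doing under the hood, we can reproduce one of these scores by hand. A minimal sketch, assuming the standard Laver, Benoit and Garry formula: each word's score is an average of the two reference scores, weighted by the word's relative frequency in the two reference texts. The result should match the corresponding entry in ws$pi.

# Word counts for the two reference texts
ref <- as.matrix(dtm[reference.texts, ])

# Relative frequency of each word within each reference text
rel <- ref / rowSums(ref)

# Probability that, given we observe this word, we are reading the Cruz speech
p.cruz <- rel["cruz.txt", ] / (rel["cruz.txt", ] + rel["sanders.txt", ])

# Weighted average of the reference scores (Cruz = 100, Sanders = -100)
manual.score <- p.cruz * 100 + (1 - p.cruz) * (-100)
manual.score["america"]

Words that occur in neither reference text cannot be scored (the division above gives NaN), while a word occurring in only one of the two gets exactly +100 or -100 by this formula, which is worth keeping in mind when answering the question above.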

Once we have the estimated word scores, we can use these to classify all remaining speeches.

scores <- predict(ws, newdata = as.wfm(dtm[-reference.texts, ]))
281 of 428 words (65.65%) are scorable

                     Score Std. Err. Rescaled   Lower    Upper
1945-Truman.txt       3.80      4.13    334.6  246.39  422.769
1946-Truman.txt     -26.32      1.14   -342.9 -367.20 -318.654
1947-Truman.txt     -15.48      2.45    -99.1 -151.52  -46.697
1948-Truman.txt     -18.70      2.38   -171.4 -222.25 -120.642
1949-Truman.txt     -11.60      3.06    -11.7  -77.03   53.614
1950-Truman.txt      -5.71      2.51    120.7   67.00  174.386
1951-Truman.txt      -3.08      2.75    179.8  121.04  238.514
1953-Eisenhower.txt -12.33      2.40    -28.2  -79.41   23.075
1954-Eisenhower.txt -14.84      2.42    -84.7 -136.47  -32.893
1955-Eisenhower.txt -14.31      2.15    -72.7 -118.60  -26.801
1956-Eisenhower.txt -19.16      2.03   -181.7 -225.04 -138.358
1957-Eisenhower.txt  -8.76      3.00     52.0  -12.20  116.204
1958-Eisenhower.txt -13.49      2.67    -54.4 -111.39    2.684
1959-Eisenhower.txt -10.63      2.62     10.1  -45.85   66.034
1960-Eisenhower.txt  -6.81      2.60     96.0   40.40  151.629
1961-Kennedy.txt     -7.42      2.64     82.2   25.93  138.555
1962-Kennedy.txt    -13.06      2.34    -44.5  -94.47    5.458
1963-Johnson.txt     -1.24      4.44    221.1  126.26  316.034
1963-Kennedy.txt    -15.44      2.47    -98.2 -150.99  -45.461
1964-Johnson.txt    -13.41      3.22    -52.6 -121.36   16.247
1965-Johnson-1.txt   -1.23      2.64    221.4  165.08  277.816
1965-Johnson-2.txt    2.03      2.79    294.8  235.22  354.365
1966-Johnson.txt     -9.17      2.35     42.8   -7.46   93.149
1967-Johnson.txt    -12.97      2.07    -42.5  -86.69    1.594
1968-Johnson.txt     -9.28      2.44     40.4  -11.87   92.576
1969-Johnson.txt    -12.87      2.64    -40.3  -96.76   16.112
1970-Nixon.txt       -6.86      2.63     94.8   38.69  150.892
1971-Nixon.txt       -9.70      2.56     30.9  -23.86   85.655
1972-Nixon.txt      -15.47      2.66    -98.9 -155.65  -42.068
1973-Nixon.txt       -5.17      4.17    132.7   43.66  221.787
1974-Nixon.txt      -12.35      2.17    -28.7  -75.19   17.742
1975-Ford.txt       -27.94      2.53   -379.4 -433.47 -325.293
1976-Ford.txt       -18.61      2.30   -169.4 -218.61 -120.144
1977-Ford.txt       -16.76      2.53   -127.8 -181.92  -73.733
1978-Carter.txt     -19.18      2.38   -182.3 -233.17 -131.481
1979-Carter.txt     -11.64      2.94    -12.7  -75.47   50.006
1980-Carter.txt      -6.52      3.07    102.5   36.94  167.983
1981-Reagan.txt     -26.41      2.54   -344.9 -399.20 -290.559
1982-Reagan.txt     -17.85      2.35   -152.3 -202.61 -101.975
1983-Reagan.txt     -18.65      2.22   -170.2 -217.59 -122.867
1984-Reagan.txt      -6.50      2.43    102.8   50.95  154.665
1985-Reagan.txt     -14.59      2.45    -79.1 -131.44  -26.820
1986-Reagan.txt      -6.14      2.73    111.0   52.56  169.416
1987-Reagan.txt      -8.63      2.67     55.0   -2.19  112.109
1988-Reagan.txt     -10.26      2.36     18.4  -31.95   68.782
1989-Bush.txt        -9.47      2.49     36.0  -17.26   89.322
1990-Bush.txt        -6.56      2.58    101.6   46.59  156.688
1991-Bush-1.txt      -7.16      2.66     88.0   31.26  144.822
1991-Bush-2.txt      -4.10      3.05    156.9   91.67  222.116
1992-Bush.txt       -13.65      2.26    -58.0 -106.31   -9.597
1993-Clinton.txt    -23.76      1.78   -285.3 -323.39 -247.191
1994-Clinton.txt    -13.91      1.79    -63.7 -101.93  -25.473
1995-Clinton.txt    -12.79      1.61    -38.5  -72.87   -4.155
1996-Clinton.txt    -11.59      1.90    -11.5  -52.07   29.136
1997-Clinton.txt     -8.63      1.94     54.9   13.42   96.408
1998-Clinton.txt     -7.64      1.83     77.4   38.24  116.490
1999-Clinton.txt    -12.80      1.80    -38.8  -77.30   -0.345
2000-Clinton.txt    -13.65      1.64    -57.9  -92.91  -22.955
2001-GWBush-1.txt   -19.12      2.39   -181.0 -232.13 -129.884
2001-GWBush-2.txt     6.56      3.01    396.7  332.29  461.094
2002-GWBush.txt      -4.10      2.68    156.9   99.68  214.174
2003-GWBush.txt      -9.87      2.28     27.1  -21.66   75.768
2004-GWBush.txt     -12.82      2.25    -39.2  -87.32    8.936
2005-GWBush.txt      -9.92      2.31     26.0  -23.35   75.279
2006-GWBush.txt     -12.58      2.19    -33.8  -80.56   12.982
2007-GWBush.txt      -9.89      2.29     26.6  -22.28   75.444
2008-GWBush.txt      -9.45      2.17     36.6   -9.79   82.957
2009-Obama.txt      -18.18      2.10   -159.8 -204.58 -115.044
2010-Obama.txt      -16.39      1.85   -119.6 -159.11  -80.010
2011-Obama.txt      -11.65      1.94    -12.9  -54.29   28.569
2012-Obama.txt      -15.03      1.87    -88.9 -128.72  -49.022
2013-Obama.txt      -16.98      1.83   -132.8 -171.90  -93.609
2014-Obama.txt      -14.15      1.86    -69.2 -108.96  -29.454
2015-Obama.txt      -12.90      1.90    -41.0  -81.59   -0.489
2016-Obama.txt       -8.59      2.02     55.9   12.83   98.987
2017-Trump.txt       -9.43      2.23     36.9  -10.68   84.525

We can use this output to see which post-war State of the Union speeches are the most Cruz-like and which are the most Sanders-like.

Most like Sanders:

scores[order(scores$Score, decreasing = FALSE), ][1:5, ]
                     Score Std. Err.  Rescaled     Lower     Upper
1975-Ford.txt    -27.94499  2.531543 -379.3805 -433.4682 -325.2927
1981-Reagan.txt  -26.41084  2.542356 -344.8774 -399.1962 -290.5586
1946-Truman.txt  -26.32413  1.136075 -342.9272 -367.2001 -318.6544
1993-Clinton.txt -23.76145  1.783325 -285.2925 -323.3941 -247.1908
1978-Carter.txt  -19.18303  2.379651 -182.3237 -233.1662 -131.4812

Most like Cruz:

scores[order(scores$Score, decreasing = TRUE), ][1:5, ]
                       Score Std. Err. Rescaled    Lower    Upper
2001-GWBush-2.txt   6.562430  3.014253 396.6924 332.2913 461.0935
1945-Truman.txt     3.800649  4.127630 334.5799 246.3909 422.7689
1965-Johnson-2.txt  2.031419  2.788358 294.7899 235.2151 354.3646
1965-Johnson-1.txt -1.229747  2.638351 221.4462 165.0764 277.8159
1963-Johnson.txt   -1.243082  4.441154 221.1462 126.2587 316.0338

We might want to plot the scores over time, to get a better sense of how the speeches have evolved, and to see whether Cruz and Sanders provide reasonable reference texts in this case.

ggplot(mapping = aes(y = scores$Rescaled, x = data$year)) + geom_point() + geom_smooth() +
  labs(x = "Year", y = "Wordscores")
## `geom_smooth()` using method = 'loess'

The wordscores output also provides standard errors and a lower and upper bound for each estimated text score. We can visualise these with vertical bars, using geom_segment() in ggplot.

ggplot(mapping = aes(y = scores$Rescaled, x = data$year)) + geom_point() + geom_smooth() +
  geom_segment(aes(x = data$year, y = scores$Lower, xend = data$year, yend = scores$Upper)) +
  labs(x = "Year", y = "Wordscores")
## `geom_smooth()` using method = 'loess'