Opening data

suppressWarnings(
  suppressMessages({
    library(rio)
    library(tm)
    library(knitr)
    library(ggplot2)
    library(reshape)
    library(LDAvis)
    library(topicmodels)
  })
)

load(url("http://www.joselkink.net/files/data/sou_corpus_postcleaning.Rdata"))

data <- import("http://www.joselkink.net/files/data/sou_meta_data.dta")

Topic models

In this lab we will run topic models on the same State of the Union speeches we have been working on in previous labs. Topic models are entirely unsupervised, so there are very few parameters to provide, but we have to indicate the number of topics we want to identify.

After we estimate the topic model, there are different ways of looking at the output. A first step is to see which terms are associated with which topics, to try to identify what the topics might mean and how clearly they are identifiable. We use the terms() function to do this, e.g. here extracting the 10 most probable terms per topic.

ldaModel <- topicmodels::LDA(dtm, k = 5)
terms(ldaModel, 10)
##       Topic 1    Topic 2 Topic 3    Topic 4    Topic 5   
##  [1,] "nation"   "year"  "right"    "program"  "america" 
##  [2,] "world"    "new"   "work"     "year"     "american"
##  [3,] "freedom"  "now"   "now"      "nation"   "work"    
##  [4,] "free"     "need"  "know"     "govern"   "year"    
##  [5,] "peac"     "job"   "american" "feder"    "peopl"   
##  [6,] "great"    "tax"   "world"    "congress" "children"
##  [7,] "america"  "well"  "peopl"    "increas"  "help"    
##  [8,] "american" "today" "countri"  "war"      "congress"
##  [9,] "new"      "time"  "everi"    "million"  "ask"     
## [10,] "congress" "want"  "one"      "unit"     "let"
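
Note that LDA starts from a random initialisation, so the topic numbers and the exact term lists can differ between runs. If you want your results to be reproducible, you can fix a seed through the control argument (a minimal sketch; the seed value 1234 is arbitrary):

ldaModel <- topicmodels::LDA(dtm, k = 5, control = list(seed = 1234))  # fixed seed so results can be reproduced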

If you had to come up with a label for each topic, what would you say?

Compare these topics with the plots after the cluster analysis on terms in Week 9 - do you find similar groups of terms?

Compare these topics with the plots after the principal components analysis from Week 10 - do you find the same terms defining the extracted dimensions?

We can also extract the most important topics for each speech; here, for example, the two most important topics per speech:

topics(ldaModel, 2)
##      1945-Truman.txt 1946-Truman.txt 1947-Truman.txt 1948-Truman.txt
## [1,]               1               4               4               4
## [2,]               3               2               1               1
##      1949-Truman.txt 1950-Truman.txt 1951-Truman.txt 1953-Eisenhower.txt
## [1,]               4               4               1                   4
## [2,]               1               1               3                   1
##      1954-Eisenhower.txt 1955-Eisenhower.txt 1956-Eisenhower.txt
## [1,]                   4                   4                   4
## [2,]                   1                   1                   1
##      1957-Eisenhower.txt 1958-Eisenhower.txt 1959-Eisenhower.txt
## [1,]                   1                   1                   1
## [2,]                   4                   4                   4
##      1960-Eisenhower.txt 1961-Kennedy.txt 1962-Kennedy.txt
## [1,]                   1                1                1
## [2,]                   4                4                4
##      1963-Johnson.txt 1963-Kennedy.txt 1964-Johnson.txt 1965-Johnson-1.txt
## [1,]                3                1                4                  1
## [2,]                1                4                2                  5
##      1965-Johnson-2.txt 1966-Johnson.txt 1967-Johnson.txt 1968-Johnson.txt
## [1,]                  3                1                1                4
## [2,]                  1                4                4                5
##      1969-Johnson.txt 1970-Nixon.txt 1971-Nixon.txt 1972-Nixon.txt
## [1,]                1              1              5              1
## [2,]                4              4              4              2
##      1973-Nixon.txt 1974-Nixon.txt 1975-Ford.txt 1976-Ford.txt
## [1,]              1              1             4             4
## [2,]              4              2             2             5
##      1977-Ford.txt 1978-Carter.txt 1979-Carter.txt 1980-Carter.txt
## [1,]             1               4               1               1
## [2,]             4               1               4               3
##      1981-Reagan.txt 1982-Reagan.txt 1983-Reagan.txt 1984-Reagan.txt
## [1,]               4               4               5               1
## [2,]               2               5               1               5
##      1985-Reagan.txt 1986-Reagan.txt 1987-Reagan.txt 1988-Reagan.txt
## [1,]               2               1               5               5
## [2,]               1               5               1               1
##      1989-Bush.txt 1990-Bush.txt 1991-Bush-1.txt 1991-Bush-2.txt
## [1,]             5             1               3               1
## [2,]             1             5               1               3
##      1992-Bush.txt 1993-Clinton.txt 1994-Clinton.txt 1995-Clinton.txt
## [1,]             3                5                5                5
## [2,]             5                2                3                3
##      1996-Clinton.txt 1997-Clinton.txt 1998-Clinton.txt 1999-Clinton.txt
## [1,]                5                5                5                5
## [2,]                3                3                2                2
##      2000-Clinton.txt 2001-GWBush-1.txt 2001-GWBush-2.txt 2002-GWBush.txt
## [1,]                5                 5                 1               1
## [2,]                2                 4                 5               5
##      2003-GWBush.txt 2004-GWBush.txt 2005-GWBush.txt 2006-GWBush.txt
## [1,]               5               5               5               1
## [2,]               1               1               1               5
##      2007-GWBush.txt 2008-GWBush.txt 2009-Obama.txt 2010-Obama.txt
## [1,]               5               5              5              5
## [2,]               1               1              3              3
##      2011-Obama.txt 2012-Obama.txt 2013-Obama.txt 2014-Obama.txt
## [1,]              5              3              2              3
## [2,]              2              2              3              2
##      2015-Obama.txt 2016-Obama.txt 2017-Trump.txt
## [1,]              3              3              5
## [2,]              2              2              3

The above model extracts 5 topics. In topic modeling it is common to have much larger numbers of topics. Try the model extracting 25 topics and see if there is a difference in the clarity of the topics. (WARNING: might be a bit slow.)
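
A sketch of what that might look like (the object name ldaModel25 is just a suggestion, and estimation can take a few minutes):

ldaModel25 <- topicmodels::LDA(dtm, k = 25)  # same document-term matrix, more topics
terms(ldaModel25, 10)                        # inspect the most probable terms per topic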

Continue below using a small number of topics, e.g. 5, to keep things readable.

Topic modeling decomposes the document-term matrix into two matrices: one with the probability of each topic being associated with each document, and one with the probability of each term being associated with each topic. The above summaries show only the highest probabilities, but we might want to work with the full matrices.

The topic model we use here is a Bayesian model, and in Bayesian statistics these probabilities are called posterior probabilities. You can think of the posterior as the probability we obtain after estimating the model (its counterpart is the prior, the probability before we estimate the model, but you can ignore this for now).
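
To see the two matrices from this decomposition directly, we can inspect the object returned by posterior() (a quick check on the model estimated above; each row of either matrix sums to 1):

post <- posterior(ldaModel)
dim(post$topics)  # documents x topics: probability of each topic per document
dim(post$terms)   # topics x terms: probability of each term per topic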

We create here a data frame for all probabilities a particular document relates to a particular topic, and add some information from our meta data (president, year, party, etc.).

postTopics <- cbind(data, data.frame(posterior(ldaModel)$topics))
kable(head(postTopics))
|textfile        | year|president |label     |party    |        X1|        X2|        X3|        X4|        X5|
|:---------------|----:|:---------|:---------|:--------|---------:|---------:|---------:|---------:|---------:|
|1945-Truman.txt | 1945|Truman    |Truman 45 |Democrat | 0.7237031| 0.0049659| 0.2540553| 0.0098870| 0.0073887|
|1946-Truman.txt | 1946|Truman    |Truman 46 |Democrat | 0.0049032| 0.0688446| 0.0084389| 0.9155448| 0.0022685|
|1947-Truman.txt | 1947|Truman    |Truman 47 |Democrat | 0.2262027| 0.0130825| 0.0910026| 0.6659162| 0.0037959|
|1948-Truman.txt | 1948|Truman    |Truman 48 |Democrat | 0.3539567| 0.1303378| 0.0129898| 0.4946306| 0.0080852|
|1949-Truman.txt | 1949|Truman    |Truman 49 |Democrat | 0.2168296| 0.0492362| 0.1559544| 0.5319897| 0.0459901|
|1950-Truman.txt | 1950|Truman    |Truman 50 |Democrat | 0.3686784| 0.1410833| 0.0452719| 0.4355133| 0.0094531|

We then reshape this data so that we can more easily create plots by different groups. This can be done with the melt() function in the reshape package, although sometimes this goes more smoothly than other times (see Reshaping Data). This time we're lucky.

postTopicsLong <- melt(postTopics, id = c("textfile", "year", "president", "label", "party"))
colnames(postTopicsLong)[6:7] <- c("topic", "probability")
kable(head(postTopicsLong))
|textfile        | year|president |label     |party    |topic | probability|
|:---------------|----:|:---------|:---------|:--------|:-----|-----------:|
|1945-Truman.txt | 1945|Truman    |Truman 45 |Democrat |X1    |   0.7237031|
|1946-Truman.txt | 1946|Truman    |Truman 46 |Democrat |X1    |   0.0049032|
|1947-Truman.txt | 1947|Truman    |Truman 47 |Democrat |X1    |   0.2262027|
|1948-Truman.txt | 1948|Truman    |Truman 48 |Democrat |X1    |   0.3539567|
|1949-Truman.txt | 1949|Truman    |Truman 49 |Democrat |X1    |   0.2168296|
|1950-Truman.txt | 1950|Truman    |Truman 50 |Democrat |X1    |   0.3686784|

We can use this data frame to plot, for example, the use of a particular topic over time.

ggplot(postTopicsLong, aes(x = year, y = probability, col = topic)) +
  geom_smooth() +
  guides(col = guide_legend(title = "Topic")) +
  labs(x = "Year", y = "Topic probability")
## `geom_smooth()` using method = 'loess'

Look back at the labels you just made up for each of the topics. What does the plot say about the contents of State of the Union speeches over time?
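
If you want to zoom in on a single topic while answering this, you could subset the long data frame first (a small variation on the plot above; topic "X1" is just an example):

ggplot(subset(postTopicsLong, topic == "X1"), aes(x = year, y = probability)) +
  geom_point() +
  geom_smooth() +
  labs(x = "Year", y = "Topic probability")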

We can also look at the topics by party or by individual president.

aggData <- aggregate(probability ~ topic + party, postTopicsLong, mean)

ggplot(aggData, aes(x = topic, y = probability)) +
  geom_bar(stat = "identity") +
  facet_grid(. ~ party)

Given your interpretation of the meaning of each of the topics, interpret the above graph.

aggData <- aggregate(probability ~ topic + president, postTopicsLong, mean)

ggplot(aggData, aes(x = topic, y = probability)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ president)
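
Finally, the LDAvis package loaded at the top is not used above, but it gives an interactive, browser-based view of the same model. Something along these lines should work (a sketch, assuming dtm is the document-term matrix used above and that the slam package installed alongside tm is available):

phi <- posterior(ldaModel)$terms      # topic-term probabilities
theta <- posterior(ldaModel)$topics   # document-topic probabilities
json <- createJSON(phi = phi,
                   theta = theta,
                   doc.length = slam::row_sums(dtm),     # tokens per document
                   vocab = colnames(phi),                # vocabulary, in dtm column order
                   term.frequency = slam::col_sums(dtm)) # corpus-wide term counts
serVis(json)  # opens the interactive visualisation in a browser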