---
title: "Data Analytics for Social Science - Lab 6"
author: "Johan A. Elkink"
date: "5 March 2020"
output:
html_document:
number_sections: yes
toc: yes
toc_float: true
---
```{r echo = FALSE, cache = FALSE}
source("http://www.joselkink.net/files/dass_utils.R", echo = FALSE)
# allow knitr to run with errors
knitr::opts_chunk$set(error = TRUE)
```
We start by opening all the relevant libraries for this lab:
```{r}
suppressWarnings(
suppressMessages({
library(rio)
library(ggplot2)
library(knitr)
library(car)
library(stargazer)
library(MASS)
library(rpart)
library(rpart.plot)
library(randomForest)
library(pROC)
library(dplyr)
}))
```
# Opening data
We continue with the data set for Project 2. The primary data set is replication data from Erasmus Kersting and Christopher Kily (2014), "[Aid and democracy redux](http://www.sciencedirect.com/science/article/pii/S0014292114000245)", *European Economic Review*. We also incorporate the code from two weeks ago to generate new variables.
```{r}
aid <- import("http://www.joselkink.net/files/data/KerstingKilby2014-Tab-2-5.dta")
```
## Merging additional data
```{r}
# Merge data file that translates country names in Kerstin and
# Kily to those in the Polity IV project
matchNames <- import("http://www.joselkink.net/files/data/KerstingKilby-Polity-name-matching.xlsx")
# Remove Ivory Coast due to problems with spelling on Windows computers
matchNames <- matchNames[-35, ]
aid <- aid[-35, ]
aid <- merge(aid, matchNames, by.x = "name", by.y = "kk", all = TRUE)
# Merge the Polity IV democracy data by Systemic Peace
p4 <- subset(import("http://www.systemicpeace.org/inscr/p4v2016.sav"), year == 2012)
names(p4) <- paste0("P4_", names(p4))
aid <- merge(aid, p4, by.x = "p4", by.y = "P4_country", all = TRUE)
# Merge the Major Episodes of Political Violence data by Systemic Peace
mepv <- subset(import("http://www.systemicpeace.org/inscr/MEPVv2016.sav"), year == 2012)
names(mepv) <- paste0("MEPV_", names(mepv))
aid <- merge(aid, mepv, by.x = "p4", by.y = "MEPV_country", all = TRUE)
# Merge the State Fragility Index data by Systemic Peace
sfi <- subset(import("http://www.systemicpeace.org/inscr/SFIv2016.sav"), year == 2012)
names(sfi) <- paste0("SFI_", names(sfi))
aid <- merge(aid, sfi, by.x = "p4", by.y = "SFI_country", all = TRUE)
```
## Computed variables
```{r}
# The data set only contains original Freedom House scores and their
# changes, so we compute end results.
aid <- aid %>% mutate(
CL_score = Initial_CL_score + .5 * (DCL_score_LL + DCL_score_UL),
FH_score = Initial_FH_score + .5 * (DFH_score_LL + DFH_score_UL),
PR_score = Initial_PR_score + .5 * (DPR_score_LL + DPR_score_UL),
# Create dummy variables for democracy based on the two
# measures of democracy (Polity IV and Freedom House)
democracy = as.factor(ifelse(P4_polity2 > 5, "Democracy", "Non-Democracy")),
FHdemocracy = as.factor(ifelse(CL_score > 30 & PR_score > 20,
"Democracy", "Non-Democracy")),
# This one is explicitly coded as 0 and 1
democracyDummy = na_if(recode(democracy, 'Democracy' = 1, 'Non-Democracy' = 0, .default = -1), -1),
# Create dummy variables for the initial level of democracy
initial_democracy = car::recode(Initial_Polity, "-10:5=0; 6:10=1; else=NA",
as.factor = FALSE),
initial_FHdemocracy = ifelse(Initial_CL_score > 30 & Initial_PR_score > 20, 1, 0),
# Compute logged versions of the two aid variables,
# Aid per Capita and Aid over GDP.
logAPC = log(Aid_per_capita),
logDAC = log(DAC_aid)
)
# Create a nominal level variable for colonial empire a country
# belonged to.
aid$colonialPower <- "Independent"
aid$colonialPower[aid$British == 1] <- "British"
aid$colonialPower[aid$French == 1] <- "French"
aid$colonialPower[aid$Spanish == 1] <- "Spanish"
aid$colonialPower[aid$Portugese == 1] <- "Portugese"
aid$colonialPower[aid$Belgian == 1] <- "Dutch"
aid$colonialPower[aid$Belgian == 1] <- "Belgian"
aid$colonialPower[aid$Italian == 1] <- "Italian"
aid$colonialPower[aid$German == 1] <- "German"
aid$colonialPower[aid$American == 1] <- "American"
aid$colonialPower <- as.factor(aid$colonialPower)
# Create second colonial empire variable, reducing the number of
# categories
aid$mainColonialPower <- car::recode(aid$colonialPower,
"c('American', 'Belgian', 'German', 'Italian', 'Portugese')='Other'")
# Create a nominal level variable for the region the country is in
aid$region <- ""
aid$region[aid$LAC == 1] <- "Latin America / Caribbean"
aid$region[aid$MENA == 1] <- "Middle East / North Africa"
aid$region[aid$SSA == 1] <- "Sub-Saharan Africa"
aid$region[aid$ECA == 1] <- "Europe / Central Asia"
aid$region[aid$SA == 1] <- "South Asia"
aid$region[aid$region == ""] <- NA
aid$region <- as.factor(aid$region)
```
We run the linear regression from last class just to get the design matrix:
```{r}
designMatrix <- model.frame(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy + colonialPower + MEPV_intwar + MEPV_muslim + GDP_growth + region, data = aid)
```
# Tree classification
We continue with the same model we used last class for classification, explaining whether or not countries can be classified as democratic, in part based on development aid. Note, however, that the focus here is really on the classification, not the assessment of the impact of any particular variable (see "importance" below, though, when discussing random forests).
```{r}
t <- rpart(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy + colonialPower + MEPV_intwar + MEPV_muslim + GDP_growth + region, data = aid)
rpart.plot(t)
```
> **`r qnr()` Interpret the tree graph.**
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), main = "Tree")
```
> **`r qnr()` Compare this to the ROC curves from the previous lab. How does it compare?**
This tree has a lot of branches, so it is difficult to interpret the output. It would probably be worthwhile to prune the tree to, say, 3 branches:
```{r}
t3 <- prune(t, cp = .05)
rpart.plot(t3)
```
> **`r qnr()` Interpret the tree plot.**
The cp parameter in the pruning function determines how far to prune the tree. There is no simple way to determine the right cp value, but you can make a table to see how many branches you will end up with and how much your error will increase due to the pruning as follows:
```{r}
printcp(t)
```
Here you can see the threshold values of the cp parameter (bottom-left).
> **`r qnr()` Prune and plot separately for a cp value of 0.027 and of 0.029 - just below and just above the third threshold value. What do you see? Think about what the nsplit column might mean, based on this.**
While a pruned tree is easier to interpret and more robust and generalizable, due to the avoidance of overfitting, prediction quality of the data set at hand tends to go down:
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), lty = "dashed")
lines(roc(aid$democracy, predict(t3, aid)[,2]), col = "green")
```
Here the dashed line is the full tree prediction quality, the solid line the pruned tree.
> **`r qnr()` Repeat to see how prediction quality (ROC curves) change if you reduce it to fewer branches.**
Let's also try a simpler model, with fewer variables, to focus on interpretation once more:
```{r}
t.small <- rpart(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
rpart.plot(t.small)
```
Here the focus is more on aid, ethnic fractionalizatoin, and initial values of population, economy, and democracy.
> **`r qnr()` (Extra) Using pen and paper, make a sketched copy of the scatter plot of GDP per capita by population (see below) and roughly identify where the democracies and where the non-democracies are located, based on this simplified tree model.**
```{r}
ggplot(aid, aes(x = Initial_population, y = Initial_GDP_per_capita,
col = democracy)) +
geom_point() +
theme_minimal() +
labs(x = "Initial population", y = "Initial GDP per capita")
```
# Random forests
Instead of using just one tree, we can use a forest of trees. This increases predictive quality, but at the cost of interpretativeness.
```{r}
f <- randomForest(democracy ~ ., data = designMatrix)
importance(f)
```
Unlike with trees, we cannot simply plot the tree, as for a random forest we have many. We can calculate the importance of each variable overall, however, which is done above. This is more easy to follow graphically, but this requires a bit of R programming:
```{r}
i <- importance(f)
vnames <- rownames(i)
i <- as.vector(i)
names(i) <- vnames
i <- data.frame(importance = i[order(i, decreasing = TRUE)])
i$variable <- factor(rownames(i), levels = rownames(i)[order(i$importance)])
ggplot(i, aes(x = variable, y = importance)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Random forest regression explaining democracy classification") +
theme_minimal()
```
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), lty = "dashed")
lines(roc(aid$democracy, predict(t3, aid)[,2]), col = "green")
lines(roc(designMatrix$democracy, predict(f, type = "prob")[, 2]), lty = "dotted", col = "red")
```
# Bagging
An older method than random forests is called bagging, which is basically the same as random forests, but setting the number of variables to randomly select each iteration to the total number of variables.
```{r}
b <- randomForest(democracy ~ ., data = designMatrix, mtry = 5)
```
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), lty = "dashed")
lines(roc(aid$democracy, predict(t3, aid)[,2]), col = "green")
lines(roc(designMatrix$democracy, predict(f, type = "prob")[, 2]), lty = "dotted", col = "red")
lines(roc(designMatrix$democracy, predict(b, type = "prob")[, 2]), lty = "dotdash", col = "blue")
```
> **`r qnr()` Try some different variations of models - e.g. changing the set of independent variables, or using a democracy variable that is based on Freedom House scores instead of Polity IV.**
# Tree regression
Trees can not only be used for classification tasks, but also for regression tasks, i.e. with scale dependent variables. If we do not transform the **P4_polity2** variable to a dummy variable, this is what we are doing.
```{r}
t <- rpart(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy + colonialPower + MEPV_intwar + MEPV_muslim + GDP_growth + region, data = aid)
rpart.plot(t)
t4 <- prune(t, cp = .05)
rpart.plot(t4)
```
Note that the values at the end of the branches are now not class labels (democracy or non-democracy) but average scores. Remember how linear regression predicts means - here we get an alternative prediction of means. So if GDP per capita is greater than 8.2 and the Muslim variable equals 2 (majority Muslim population), we predict an mean of 1.5 on our democracy scale from -10 to 10.
> **`r qnr()` What would you predicted value for the democracy scale be if GDP per capita is 5 and there are few Muslims in the country?**
Since the dependent variable is now not a classification, but a scale variable, we cannot simply produce a confusion table. We can graphically evaluate the prediction quality, however:
```{r}
ggplot(aid, aes(x = P4_polity2, y = predict(t4, aid))) +
geom_jitter(alpha = .3, col = "blue") +
labs(x = "Observed democracy score", y = "Predicted democracy score") +
ggtitle("Evaluation of quality of pruned tree regression") +
theme_minimal()
```
> **`r qnr()` Evaluate the prediction quality from this plot.**
If we also want to perform a random forest analysis, we need to create a new design matrix as well.
```{r}
designMatrix <- model.frame(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy + colonialPower + MEPV_intwar + MEPV_muslim + GDP_growth + region, data = aid)
f <- randomForest(P4_polity2 ~ ., data = designMatrix)
importance(f)
i <- importance(f)
vnames <- rownames(i)
i <- as.vector(i)
names(i) <- vnames
i <- data.frame(importance = i[order(i, decreasing = TRUE)])
i$variable <- factor(rownames(i), levels = rownames(i)[order(i$importance)])
ggplot(i, aes(x = variable, y = importance)) +
geom_bar(stat = "identity") +
coord_flip() +
ggtitle("Random forest regression explaining Polity IV") +
theme_minimal()
```
> **`r qnr()` Would you conclude aid matters for predicting democracy scores?**