---
title: "Data Analytics for Social Science - Lab 7"
author: "Johan A. Elkink"
date: "24 October 2017"
output:
html_document:
number_sections: yes
toc: yes
---
We start by opening all the relevant libraries for this lab:
```{r}
suppressMessages({
library(rio)
library(ggplot2)
library(knitr)
library(car)
library(stargazer)
library(MASS)
library(tree)
library(randomForest)
library(pROC)
})
```
# Opening data
We continue with the data set for Project 2. The primary data set is replication data from Erasmus Kersting and Christopher Kily (2014), "[Aid and democracy redux](http://www.sciencedirect.com/science/article/pii/S0014292114000245)", *European Economic Review*. We also incorporate the code from two weeks ago to generate new variables.
```{r}
aid <- import("http://www.joselkink.net/files/data/KerstingKilby2014-Tab-2-5.dta")
```
## Merging additional data
```{r}
# Merge data file that translates country names in Kerstin and
# Kily to those in the Polity IV project
matchNames <- import("http://www.joselkink.net/files/data/KerstingKilby-Polity-name-matching.xlsx")
# Remove Ivory Coast due to problems with spelling on Windows computers
matchNames <- matchNames[-35, ]
aid <- aid[-35, ]
aid <- merge(aid, matchNames, by.x = "name", by.y = "kk", all = TRUE)
# Merge the Polity IV democracy data by Systemic Peace
p4 <- subset(import("http://www.systemicpeace.org/inscr/p4v2016.sav"), year == 2012)
names(p4) <- paste0("P4_", names(p4))
aid <- merge(aid, p4, by.x = "p4", by.y = "P4_country", all = TRUE)
# Merge the Major Episodes of Political Violence data by Systemic Peace
mepv <- subset(import("http://www.systemicpeace.org/inscr/MEPVv2016.sav"), year == 2012)
names(mepv) <- paste0("MEPV_", names(mepv))
aid <- merge(aid, mepv, by.x = "p4", by.y = "MEPV_country", all = TRUE)
# Merge the State Fragility Index data by Systemic Peace
sfi <- subset(import("http://www.systemicpeace.org/inscr/SFIv2016.sav"), year == 2012)
names(sfi) <- paste0("SFI_", names(sfi))
aid <- merge(aid, sfi, by.x = "p4", by.y = "SFI_country", all = TRUE)
```
## Computed variables
```{r}
# The data set only contains original Freedom House scores and their
# changes, so we compute end results.
aid$CL_score <- aid$Initial_CL_score + .5 * (aid$DCL_score_LL + aid$DCL_score_UL)
aid$FH_score <- aid$Initial_FH_score + .5 * (aid$DFH_score_LL + aid$DFH_score_UL)
aid$PR_score <- aid$Initial_PR_score + .5 * (aid$DPR_score_LL + aid$DPR_score_UL)
# Create dummy variables for democracy based on the two
# measures of democracy (Polity IV and Freedom House)
aid$democracy <- as.factor(ifelse(aid$P4_polity2 > 5, "Democracy", "Non-Democracy"))
aid$FHdemocracy <- as.factor(ifelse(aid$CL_score > 30 & aid$PR_score > 20, "Democracy", "Non-Democracy"))
# Create dummy variables for the initial level of democracy
aid$initial_democracy <- recode(aid$Initial_Polity, "-10:5=0; 6:10=1; else=NA", as.factor.result = FALSE)
aid$initial_FHdemocracy <- ifelse(aid$Initial_CL_score > 30 & aid$Initial_PR_score > 20, 1, 0)
# Compute logged versions of the two aid variables,
# Aid per Capita and Aid over GDP.
aid$logAPC <- log(aid$Aid_per_capita)
aid$logDAC <- log(aid$DAC_aid)
# Create a nominal level variable for colonial empire a country
# belonged to.
aid$colonialPower <- "Independent"
aid$colonialPower[aid$British == 1] <- "British"
aid$colonialPower[aid$French == 1] <- "French"
aid$colonialPower[aid$Spanish == 1] <- "Spanish"
aid$colonialPower[aid$Portugese == 1] <- "Portugese"
aid$colonialPower[aid$Belgian == 1] <- "Dutch"
aid$colonialPower[aid$Belgian == 1] <- "Belgian"
aid$colonialPower[aid$Italian == 1] <- "Italian"
aid$colonialPower[aid$German == 1] <- "German"
aid$colonialPower[aid$American == 1] <- "American"
aid$colonialPower <- as.factor(aid$colonialPower)
# Create second colonial empire variable, reducing the number of
# categories
aid$mainColonialPower <- recode(aid$colonialPower,
"c('American', 'Belgian', 'German', 'Italian', 'Portugese')='Other'")
# Create a nominal level variable for the region the country is in
aid$region <- ""
aid$region[aid$LAC == 1] <- "Latin America / Caribbean"
aid$region[aid$MENA == 1] <- "Middle East / North Africa"
aid$region[aid$SSA == 1] <- "Sub-Saharan Africa"
aid$region[aid$ECA == 1] <- "Europe / Central Asia"
aid$region[aid$SA == 1] <- "South Asia"
aid$region[aid$region == ""] <- NA
aid$region <- as.factor(aid$region)
aid$democracyDummy <- recode(aid$democracy, "'Democracy'=1; 'Non-Democracy'=0; else=NA", as.factor.result = FALSE)
```
We run the linear regression from last class just to get the design matrix:
```{r}
designMatrix <- model.frame(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
```
# Tree classification
We continue with the same model we used last class for classification, explaining whether or not countries can be classified as democratic, in part based on development aid. Note, however, that the focus here is really on the classification, not the assessment of the impact of any particular variable (see "importance" below, though, when discussing random forests).
```{r}
t <- tree(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
plot(t)
text(t, cex = .6) # note: cex makes the font smaller
```
> **Interpret the tree graph.**
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), main = "Tree")
```
> **Compare this to the ROC curves from the previous lab. How does it compare?**
This tree has a lot of branches, so it is difficult to interpret the output. It would probably be worthwhile to prune the tree to, say, 6 branches:
```{r}
t6 <- prune.tree(t, best = 6)
plot(t6)
text(t6)
```
> **Interpret the tree plot.**
> **(Extra) Using pen and paper, make a sketched copy of the scatter plot of GDP per capita by population (see below) and roughly identify where the democracies and where the non-democracies are located.**
```{r}
ggplot(aid, aes(x = Initial_population, y = Initial_GDP_per_capita)) + geom_point()
```
While a pruned tree is easier to interpret and more robust and generalizable, due to the avoidance of overfitting, prediction quality of the data set at hand tends to go down:
```{r}
plot(roc(aid$democracy, predict(t, aid)[,2]), lty = "dashed")
lines(roc(aid$democracy, predict(t6, aid)[,2]))
```
Here the dashed line is the full tree prediction quality, the solid line the pruned tree.
> **Repeat to see how prediction quality (ROC curves) change if you reduce it to 5 or 4 branches.**
# Random forests
Instead of using just one tree, we can use a forest of trees. This increases predictive quality, but at the cost of interpretativeness.
```{r}
f <- randomForest(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = designMatrix)
importance(f)
```
Unlike with trees, we cannot simply plot the tree, as for a random forest we have many. We can calculate the importance of each variable overall, however, which is done above. This is more easy to follow graphically, but this requires a bit of R programming:
```{r}
i <- importance(f)
vnames <- rownames(i)
i <- as.vector(i)
names(i) <- vnames
i <- data.frame(importance = i[order(i, decreasing = TRUE)])
i$variable <- factor(rownames(i), levels = rownames(i)[order(i$importance)])
ggplot(i, aes(x = variable, y = importance)) + geom_bar(stat = "identity") + coord_flip() + ggtitle("Random forest regression explaining democracy classification")
```
```{r}
plot(roc(designMatrix$democracy, predict(f, type = "prob")[, 2]), main = "Random Forest")
```
# Bagging
An older method than random forests is called bagging, which is basically the same as random forests, but setting the number of variables to randomly select each iteration to the total number of variables.
```{r}
b <- randomForest(democracy ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = designMatrix, mtry = 5)
plot(roc(designMatrix$democracy, predict(b, type = "prob")[, 2]), main = "Bagging")
```
> **Try some different variations of models - e.g. changing the set of independent variables, or using a democracy variable that is based on Freedom House scores instead of Polity IV.**
# Tree regression
Trees can not only be used for classification tasks, but also for regression tasks, i.e. with scale dependent variables. If we do not transform the **P4_polity2** variable to a dummy variable, this is what we are doing.
```{r}
t <- tree(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
plot(t)
text(t)
t6 <- prune.tree(t, best = 6)
plot(t6)
text(t6)
```
Note that the values at the end of the branches are now not class labels (democracy or non-democracy) but average scores. Remember how linear regression predicts means - here we get an alternative prediction of means. So if GDP per capita is 8.2 and ethnic fractionalization is 0.2, we predict an mean of 8.9 on our democracy scale from -10 to 10.
> **What would you predicted value for the democracy scale be if GDP per capita is 5 and aid per capita 30?**
Since the dependent variable is now not a classification, but a scale variable, we cannot simply produce a confusion table. We can graphically evaluate the prediction quality, however:
```{r}
ggplot(aid, aes(x = P4_polity2, y = predict(t6, aid))) + geom_jitter(alpha = .3) +
labs(x = "Observed democracy score", y = "Predicted democracy score") +
ggtitle("Evaluation of quality of pruned tree regression")
```
> **Evaluate the prediction quality from this plot.**
If we also want to perform a random forest analysis, we need to create a new design matrix as well.
```{r}
designMatrix <- model.frame(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
f <- randomForest(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Initial_population + Ethnic_Fractionalization + initial_FHdemocracy, data = designMatrix)
importance(f)
i <- importance(f)
vnames <- rownames(i)
i <- as.vector(i)
names(i) <- vnames
i <- data.frame(importance = i[order(i, decreasing = TRUE)])
i$variable <- factor(rownames(i), levels = rownames(i)[order(i$importance)])
ggplot(i, aes(x = variable, y = importance)) + geom_bar(stat = "identity") + coord_flip() + ggtitle("Random forest regression explaining Polity IV")
```
> **Would you conclude aid matters for predicting democracy scores?**