---
title: "Data Analytics for Social Science - Lab 5"
author: "Johan A. Elkink"
date: "20 February 2020"
output:
html_document:
number_sections: yes
toc: yes
toc_float: true
---
```{r echo = FALSE, cache = FALSE}
source("http://www.joselkink.net/files/dass_utils.R", echo = FALSE)
# allow knitr to run with errors
knitr::opts_chunk$set(error = TRUE)
```
We start by opening all the relevant libraries for this lab:
```{r}
suppressWarnings(
suppressMessages({
library(rio)
library(stargazer)
library(ggplot2)
library(MASS)
library(knitr)
library(pROC)
library(car)
library(dplyr)
}))
```
# Opening data
We continue with the data set for Project 2. The primary data set is replication data from Erasmus Kersting and Christopher Kily (2014), "[Aid and democracy redux](http://www.sciencedirect.com/science/article/pii/S0014292114000245)", *European Economic Review*. We also incorporate the code from last week to generate new variables.
```{r}
aid <- import("http://www.joselkink.net/files/data/KerstingKilby2014-Tab-2-5.dta")
```
## Merging additional data
```{r}
# Merge data file that translates country names in Kerstin and
# Kily to those in the Polity IV project
matchNames <- import("http://www.joselkink.net/files/data/KerstingKilby-Polity-name-matching.xlsx")
# Remove Ivory Coast due to problems with spelling on Windows computers
matchNames <- matchNames[-35, ]
aid <- aid[-35, ]
aid <- merge(aid, matchNames, by.x = "name", by.y = "kk", all = TRUE)
# Merge the Polity IV democracy data by Systemic Peace
p4 <- subset(import("http://www.systemicpeace.org/inscr/p4v2016.sav"), year == 2012)
names(p4) <- paste0("P4_", names(p4))
aid <- merge(aid, p4, by.x = "p4", by.y = "P4_country", all = TRUE)
# Merge the Major Episodes of Political Violence data by Systemic Peace
mepv <- subset(import("http://www.systemicpeace.org/inscr/MEPVv2016.sav"), year == 2012)
names(mepv) <- paste0("MEPV_", names(mepv))
aid <- merge(aid, mepv, by.x = "p4", by.y = "MEPV_country", all = TRUE)
# Merge the State Fragility Index data by Systemic Peace
sfi <- subset(import("http://www.systemicpeace.org/inscr/SFIv2016.sav"), year == 2012)
names(sfi) <- paste0("SFI_", names(sfi))
aid <- merge(aid, sfi, by.x = "p4", by.y = "SFI_country", all = TRUE)
```
## Computed variables
```{r}
# The data set only contains original Freedom House scores and their
# changes, so we compute end results.
aid <- aid %>% mutate(
CL_score = Initial_CL_score + .5 * (DCL_score_LL + DCL_score_UL),
FH_score = Initial_FH_score + .5 * (DFH_score_LL + DFH_score_UL),
PR_score = Initial_PR_score + .5 * (DPR_score_LL + DPR_score_UL),
# Create dummy variables for democracy based on the two
# measures of democracy (Polity IV and Freedom House)
democracy = as.factor(ifelse(P4_polity2 > 5, "Democracy", "Non-Democracy")),
FHdemocracy = as.factor(ifelse(CL_score > 30 & PR_score > 20,
"Democracy", "Non-Democracy")),
# Create dummy variables for the initial level of democracy
initial_democracy = car::recode(Initial_Polity, "-10:5=0; 6:10=1; else=NA",
as.factor = FALSE),
initial_FHdemocracy = ifelse(Initial_CL_score > 30 & Initial_PR_score > 20, 1, 0),
# Compute logged versions of the two aid variables,
# Aid per Capita and Aid over GDP.
logAPC = log(Aid_per_capita),
logDAC = log(DAC_aid)
)
# Create a nominal level variable for colonial empire a country
# belonged to.
aid$colonialPower <- "Independent"
aid$colonialPower[aid$British == 1] <- "British"
aid$colonialPower[aid$French == 1] <- "French"
aid$colonialPower[aid$Spanish == 1] <- "Spanish"
aid$colonialPower[aid$Portugese == 1] <- "Portugese"
aid$colonialPower[aid$Belgian == 1] <- "Dutch"
aid$colonialPower[aid$Belgian == 1] <- "Belgian"
aid$colonialPower[aid$Italian == 1] <- "Italian"
aid$colonialPower[aid$German == 1] <- "German"
aid$colonialPower[aid$American == 1] <- "American"
aid$colonialPower <- as.factor(aid$colonialPower)
# Create second colonial empire variable, reducing the number of
# categories
aid$mainColonialPower <- car::recode(aid$colonialPower,
"c('American', 'Belgian', 'German', 'Italian', 'Portugese')='Other'")
# Create a nominal level variable for the region the country is in
aid$region <- ""
aid$region[aid$LAC == 1] <- "Latin America / Caribbean"
aid$region[aid$MENA == 1] <- "Middle East / North Africa"
aid$region[aid$SSA == 1] <- "Sub-Saharan Africa"
aid$region[aid$ECA == 1] <- "Europe / Central Asia"
aid$region[aid$SA == 1] <- "South Asia"
aid$region[aid$region == ""] <- NA
aid$region <- as.factor(aid$region)
```
Some of the above code makes use of the recode() command from the car library instead of from the dplyr library as in previous labs.
# Linear probability model
Lets first try linear regression with a binary dependent variable. We will take dependent variable **democracyDummy**, which needs to be a binary variable where 1 is democracy and 0 an autocracy, based on the **FHdemocracy** variable.
```{r}
aid <- aid %>% mutate(
democracyDummy = na_if(recode(FHdemocracy,
'Democracy' = 1, 'Non-Democracy' = 0,
.default = -1), -1)
)
```
Our key independent variable is **Aid_per_capita**, so we could say that we are trying to evaluate whether development aids supports democratization - but most of the analysis is focused on *prediction* rather than *causal inference*.
```{r results = "asis"}
stargazer(m1 <- lm(democracyDummy ~ Aid_per_capita, data = aid),
m2 <- lm(democracyDummy ~ Aid_per_capita +
Initial_GDP_per_capita + Initial_population +
Ethnic_Fractionalization + initial_FHdemocracy, data = aid),
type = "html")
designMatrix <- m2$model
olsPredicted <- ifelse(predict(m2) > 0.5, "Democracy", "Non democracy")
```
Note that this *designMatrix* extracts the data set after the regression, so that all the missing cases are removed. If you change a model specification below, be careful that you might also have to change the model from which the design matrix is extracted, before you can, e.g., create a cross-table of predicted vs actual or a plot with predicted values.
Note that the entire data set has `r dim(aid)[1]` observations, while after removing missing data, the design matrix - the full data set used - has only `r dim(designMatrix)[1]` observations.
> **`r qnr()` Use the View() command in the console (not in the Markdown file) to look at the design matrix.**
> **`r qnr()` Create a cross-table of *FHdemocracy* (in the design matrix it's called *democracyDummy*) by *olsPredicted*. Due to missing data, you will need to use the *designMatrix* instead of *aid* as the data set. How well do you think the model does in predicting the outcome?**
```{r}
ggplot(designMatrix, aes(x = Aid_per_capita, y = democracyDummy)) +
geom_jitter(aes(shape = olsPredicted, color = olsPredicted), height = .1) +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Aid per capita", y = "Democracy", title = "OLS: Democracy and aid") +
theme_minimal() +
scale_color_manual(values = c("darkgray", "blue"))
```
Note that the plotted curve is for the *simple* regression, not the *multiple* regression, but the predicted classification is based on the multiple regression. Note how this part of the code adds such regression line to a scatter plot:
```{r eval = FALSE}
geom_smooth(method = "lm")
```
If you want to add the decision boundary, it requires a slight bit more computation based on the regression coefficients:
```{r}
b1 <- coef(m1)
ggplot(designMatrix, aes(x = Aid_per_capita, y = democracyDummy)) +
geom_jitter(aes(shape = olsPredicted, color = olsPredicted), height = .1) +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Aid per capita", y = "Democracy", title = "OLS: Democracy and aid") +
theme_minimal() +
scale_color_manual(values = c("darkgray", "blue")) +
geom_vline(xintercept = (0.5 - b1[1]) / b1[2], col = "green",
lty = "dashed", size = 1.5)
```
Here the green decision boundary line is based on the simple regression (m1). The boundary line for the multiple regression would depend on the values of the other explanatory variables, so would differ for each observation.
# Logistic regression
Lets next try a logistic regression for the same model.
```{r results = "asis"}
stargazer(m3 <- glm(democracyDummy ~ Aid_per_capita, data = aid, family = binomial(link = "logit")),
m4 <- glm(democracyDummy ~ Aid_per_capita +
Initial_GDP_per_capita + Initial_population +
Ethnic_Fractionalization + initial_FHdemocracy, data = aid,
family = binomial(link = "logit")),
type = "html")
logitPredicted <- ifelse(predict(m4, type = "response") > 0.5,
"Democracy", "Non democracy")
```
> **`r qnr()` Create a cross-table of *FHdemocracy* by *logitPredicted*. Due to missing data, you will need to use the *designMatrix* instead of *aid* as the data set. How well do you think the model does in predicting the outcome?**
```{r}
b3 <- coef(m3)
ggplot(designMatrix, aes(x = Aid_per_capita, y = democracyDummy)) +
geom_jitter(aes(shape = logitPredicted, col = logitPredicted), height = .1) +
geom_smooth(method = "glm", se = TRUE, method.args = list(family = "binomial")) +
labs(x = "Aid per capita", y = "Democracy", title = "Logistic: Democracy and aid") +
theme_minimal() +
scale_color_manual(values = c("darkgray", "blue")) +
geom_vline(xintercept = (0 - b3[1]) / b3[2], col = "green",
lty = "dashed", size = 1.5) +
geom_vline(xintercept = (0.5 - b1[1]) / b1[2], col = "gray",
lty = "dashed", size = 1.5)
```
Note that the plotted curve is for the *simple* regression, not the *multiple* regression, but the predicted classification is based on the multiple regression. Note how this part of the code adds such regression line for a logistic regression to a scatter plot:
```{r eval = FALSE}
geom_smooth(method = "glm", method.args = list(family = "binomial"))
```
The dashed lines represent the decision boundaries, in green for the logistic regression and in gray for the previous linear regression.
# Discriminant analysis
```{r}
m5 <- lda(democracyDummy ~ Aid_per_capita, data = aid)
m6 <- lda(democracyDummy ~ Aid_per_capita +
Initial_GDP_per_capita + Initial_population +
Ethnic_Fractionalization + initial_FHdemocracy, data = aid)
ldaPredicted <- ifelse(predict(m6)$class == 1, "Democracy", "Non democracy")
ggplot(designMatrix, aes(x = Aid_per_capita, y = democracyDummy)) +
geom_jitter(aes(shape = ldaPredicted, col = ldaPredicted), size = 2, height = .1) +
labs(x = "Aid per capita", y = "Democracy", title = "LDA: Democracy and aid") +
theme_minimal() +
scale_color_manual(values = c("darkgray", "blue"))
```
> **`r qnr()` Create a cross-table of *FHdemocracy* by *ldaPredicted*. Due to missing data, you will need to use the *designMatrix* instead of *aid* as the data set. How well do you think the model does in predicting the outcome?**
```{r}
ggplot(aid, aes(x = Aid_per_capita)) +
geom_histogram(bins = 6, color = "blue", fill = "blue") +
labs(x = "Aid per Capita", y = "Frequency",
title = "Histogram of aid")
```
> **`r qnr()` Note that LDA assumes independent variables are normally distributed (i.e. a bell curve shape). Do you think the *Aid_per_capita* variable is?**
# Comparing prediction quality
An ROC curve is relatively straightforward to produce using the pROC library.
```{r}
plot(roc(designMatrix$democracyDummy, predict(m2)), main = "Linear regression")
plot(roc(designMatrix$democracyDummy, predict(m4)), main = "Logistic regression")
plot(roc(designMatrix$democracyDummy, predict(m6)$posterior[,2]),
main = "Discriminant analysis")
```
We can calculate the Area Under Curve (AUC) score easily as well using similar code:
```{r}
auc(designMatrix$democracyDummy, predict(m2))
auc(designMatrix$democracyDummy, predict(m4))
auc(designMatrix$democracyDummy, predict(m6)$posterior[,2])
```
> **`r qnr()` Repeat the above three regressions using *democracy* instead as the dependent variable. Are your conclusions different?**
> **`r qnr()` Try some different variations of models (all of the above, not just discriminant analysis) - i.e. by changing the set of independent variables.**
# Extra
## Curve based on multiple regression
As mentioned, the above plots of predicted probabilities ignore the control variables, which is not appropriate. Below is an example based on model *m4*, the logistic regression, plotting the appropriate prediction line.
```{r}
apc <- seq(min(aid$Aid_per_capita, na.rm = TRUE),
max(aid$Aid_per_capita, na.rm = TRUE),
length.out = 50)
X <- matrix(rep(apply(designMatrix, 2, median), length(apc)),
nrow = length(apc), byrow = TRUE)
X[, 2] <- apc
colnames(X) <- names(designMatrix)
X <- as.data.frame(X)
X$phat <- predict(m4, newdata = X, type = "response")
ggplot(designMatrix, aes(x = Aid_per_capita, y = democracyDummy)) +
geom_jitter(aes(shape = logitPredicted, color = logitPredicted), height = .1) +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial")) +
geom_point(aes(x = Aid_per_capita, y = phat), X) +
labs(x = "Aid per Capita", y = "Democracy", title = "GLM: Democracy and aid") +
theme_minimal() +
scale_color_manual(values = c("darkgray", "blue"))
```
The above plot keeps all independent variables other than **Aid_per_capita** at the median value and then calculates predicted values across the range of **Aid_per_capita**. The blue line is the predicted probability using *m3*, which does not include control variables. The black dotted line is the predicted probability using *m4*, which includes control variables, for those cases where the control variables are all at their median value.