---
title: "Data Analytics for Social Science - Lab 4"
author: "Johan A. Elkink"
date: "13 February 2020"
output:
  html_document:
    number_sections: yes
    toc: yes
    toc_float: true
---
```{r echo = FALSE, cache = FALSE}
source("http://www.joselkink.net/files/dass_utils.R", echo = FALSE)
# allow knitr to run with errors
knitr::opts_chunk$set(error = TRUE)
```
We start by loading all the relevant libraries for this lab:
```{r}
library(rio) # contains import()
library(dplyr)
library(pander)
library(ggplot2)
library(stargazer)
library(leaps) # contains regsubsets()
library(MASS) # contains lm.ridge()
```
# Opening data
For Project 2 we start with an entirely different data set. This time, we will have country-level data and we will look at the relationship between development aid and democracy. The primary data set is replication data from Erasmus Kersting and Christopher Kilby (2014), "[Aid and democracy redux](http://www.sciencedirect.com/science/article/pii/S0014292114000245)", *European Economic Review*. While we use some of their data and main argument, their analysis is significantly more complicated statistically and we will simply do some related analysis, exploring a range of different methods using their data.
Their data can be directly downloaded from the web:
```{r}
aid <- import("http://www.joselkink.net/files/data/KerstingKilby2014-Tab-2-5.dta")
```
# Data management
While the data provided by Kersting and Kilby (2014) contains key information on democracy indicators, aid flows, economic variables, etcetera, we might want to add some variables related to democracy, political violence, and state fragility. We obtain these data from the [Centre for Systemic Peace](http://www.systemicpeace.org/inscrdata.html). We will then merge these data sets with the aid data based on country names.
## Merging
A problem is that the names are not always identical between the two data sets. For example, one uses the "Congo, Dem. Rep.", while the other uses "Congo Kinshasa". To solve this problem, I created a spreadsheet listing the matching names. We can then first merge the original data with this data set, and then merge the other data sets using the matched name. In the spreadsheet, the names used by Kersting and Kilby (2014) are saved as the **kk** variable and those used by the Polity IV data set of the Centre for Systemic Peace as the **p4** variable.
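As an aside, a quick way to find which names fail to match exactly between two data sets is the setdiff() function; a minimal sketch with made-up name vectors (not the actual data):

```r
# Toy example: names in the first vector without an exact match in the second
kk_names <- c("Ghana", "Congo, Dem. Rep.", "Ivory Coast")
p4_names <- c("Ghana", "Congo Kinshasa", "Cote d'Ivoire")
setdiff(kk_names, p4_names)
```

Running this on the real country name variables would list exactly the cases that need an entry in the matching spreadsheet.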
```{r}
matchNames <- import("http://www.joselkink.net/files/data/KerstingKilby-Polity-name-matching.xlsx")
# Remove Ivory Coast due to problems with spelling on Windows computers
matchNames <- matchNames[-35, ]
aid <- aid[-35, ]
aid <- merge(aid, matchNames, by.x = "name", by.y = "kk", all = TRUE)
```
This last line therefore merges the *aid* data set with the *matchNames* data set, using the **name** variable of the first data set to match against the **kk** variable of the second. Finally, `all = TRUE` indicates that all observations should be included, regardless of whether they have a match in the other data set - i.e. we keep non-matching cases.
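To see what `all = TRUE` does, consider a minimal sketch with two made-up data frames (the names and values here are illustrative, not the actual data):

```r
left  <- data.frame(name = c("Ghana", "Congo, Dem. Rep."), aid = c(10, 20))
right <- data.frame(kk = c("Ghana", "Congo, Dem. Rep.", "Fiji"),
                    p4 = c("Ghana", "Congo Kinshasa", "Fiji"))

# all = TRUE keeps rows from both sides; non-matches are filled with NA
merge(left, right, by.x = "name", by.y = "kk", all = TRUE)
```

With the default `all = FALSE`, the Fiji row - which has no match in *left* - would be dropped instead of kept with NA values.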
The first data set we will add is the original Polity IV democracy indicators. We first download the data and use the filter() command to select only observations from year 2012 (all aid projects in the Kersting and Kilby (2014) data set end in 2011 at the latest).
```{r}
p4 <- import("http://www.systemicpeace.org/inscr/p4v2016.sav") %>% filter(year == 2012)
```
Before we merge, we rename all variables in the *p4* data set by adding the prefix *P4_* to each of them. This way, after the merger, it is easy to recognize any variables originating from this data set. It also helps when we try to locate the variables in the [codebook](http://www.systemicpeace.org/inscr/p4manualv2016.pdf). The function to combine two strings of text is the paste() command, or, if you want to avoid a space between the parts, the paste0() command.
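The only difference between the two commands is the separator; a quick sketch with toy names:

```r
vars <- c("country", "polity2", "democ")
paste("P4", vars)             # space separator: "P4 country" etc.
paste0("P4_", vars)           # no separator: "P4_country" etc.
paste("P4", vars, sep = "_")  # equivalent to the paste0() line above
```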
```{r}
names(p4) <- paste0("P4_", names(p4))
aid <- merge(aid, p4, by.x = "p4", by.y = "P4_country", all = TRUE)
```
So we merge the *aid* data set with the downloaded *p4* data set, by using the **p4** variable in the first (which was merged from the matching names data file above and contains the country names in the spelling used by the Polity IV project) and the **P4_country** variable in the second. Again, we do not want to lose any observations from either data set due to the merger.
We proceed in the same manner to add the Major Episodes of Political Violence data set (see also the [codebook](http://www.systemicpeace.org/inscr/MEPVcodebook2016.pdf)), again for 2012:
```{r}
mepv <- import("http://www.systemicpeace.org/inscr/MEPVv2016.sav") %>% filter(year == 2012)
names(mepv) <- paste0("MEPV_", names(mepv))
aid <- merge(aid, mepv, by.x = "p4", by.y = "MEPV_country", all = TRUE)
```
> **`r qnr()` A challenging exercise, but try to do the same using the file for the State Fragility Index, which can be found at http://www.systemicpeace.org/inscr/SFIv2016.sav and where you should use prefix FRAG_**
Note also the [codebook](http://www.systemicpeace.org/inscr/SFImatrix2016c.pdf).
## Computing variables
One variable that we will be using later is a binary indicator of democracy, using the merged Polity IV data. The democracy measure in that data set is the **polity2** variable, which ranges from -10 (pure autocracies) to 10 (pure democracies) and typically 6 or higher is considered a democracy.
```{r}
aid <- aid %>% mutate(democracy = factor(ifelse(P4_polity2 > 5, "Democracy", "Non-Democracy")))
```
Whenever you compute or recode variables, it is a good idea to check that the coding is correct. For example, in this case a table will help:
```{r}
pander(table(aid$P4_polity2, aid$democracy))
```
We can also use a plot to check whether the recoding looks correct:
> **`r qnr()` Create a bar chart of the P4_polity2 variable, colored by the new democracy variable. What do you conclude? What does the distribution of regimes according to the Polity IV score look like?**
While Kersting and Kilby (2014) use the Polity IV data set as well, their main indicator is based on Freedom House data, which separately scores civil liberties (CL) and political rights (PR).
For the following exercise you need to check the slides to understand how to combine different clauses (like "A and B"):
> **`r qnr()` Create a separate democracy indicator variable, naming it FHdemocracy, which takes as a democracy only those countries where Initial_CL_score is greater than 30 and Initial_PR_score is greater than 20.**
> **`r qnr()` Create a table of the Polity IV based and the Freedom House based dummy variables, i.e. democracy and FHdemocracy.**
But note that this compares 2012 Polity IV measures with much earlier (see **startyear**) measures on the Freedom House variables.
We can also use computations to transform a variable, for example by taking a logarithm. Again, we then evaluate whether the transformation worked; in this case we could use a scatter plot to see whether it indeed has the shape of a log transform, and histograms to understand how the distribution changed. Here we take the log of the **Aid_per_capita** variable.
```{r}
aid <- aid %>%
mutate(logAPC = log(Aid_per_capita))
ggplot(aid, aes(x = Aid_per_capita, y = logAPC)) + geom_point()
ggplot(aid, aes(Aid_per_capita)) + geom_histogram()
ggplot(aid, aes(logAPC)) + geom_histogram()
```
> **`r qnr()` Compute a variable that contains the logarithmic transform of the DAC_aid variable.**
> **`r qnr()` Produce histograms of the original DAC_aid variable and the variable after the logarithmic transformation. What do you see?**
Note that in this data set, **Initial_population** and **Initial_GDP_per_capita** are already log transformed.
## Recoding variables
The authors decided to create dummy variables (0 or 1 as the only possible values) for former colonial powers. If we want to use this information, it is probably easier to have a single categorical, nominal variable that contains it all. Note that this looks like a normal recode, but because colonial power is coded in a set of separate (dummy) variables, we need to use slightly different code.
```{r}
aid$colonialPower <- "Independent"
aid$colonialPower[aid$British == 1] <- "British"
aid$colonialPower[aid$French == 1] <- "French"
aid$colonialPower[aid$Spanish == 1] <- "Spanish"
aid$colonialPower[aid$Portugese == 1] <- "Portugese"
aid$colonialPower[aid$Dutch == 1] <- "Dutch"
aid$colonialPower[aid$Belgian == 1] <- "Belgian"
aid$colonialPower[aid$Italian == 1] <- "Italian"
aid$colonialPower[aid$German == 1] <- "German"
aid$colonialPower[aid$American == 1] <- "American"
aid$colonialPower <- as.factor(aid$colonialPower)
pander(table(aid$British, aid$colonialPower))
```
Several colonial powers had only very few colonies, so for visualisations etc. it is probably better to distinguish only the main colonial powers. We therefore group the small ones together as "Other" in a new variable called **mainColonialPower**.
```{r}
aid$mainColonialPower <- as.character(aid$colonialPower)
aid$mainColonialPower[aid$colonialPower %in% c("American", "Belgian", "German", "Italian", "Portugese")] <- "Other"
aid$mainColonialPower <- as.factor(aid$mainColonialPower)
pander(table(aid$colonialPower, aid$mainColonialPower))
```
We similarly create a region variable (this would be a good exercise, but takes too much lab time).
```{r}
aid$region <- ""
aid$region[aid$LAC == 1] <- "Latin America / Caribbean"
aid$region[aid$MENA == 1] <- "Middle East / North Africa"
aid$region[aid$SSA == 1] <- "Sub-Saharan Africa"
aid$region[aid$ECA == 1] <- "Europe / Central Asia"
aid$region[aid$SA == 1] <- "South Asia"
aid$region[aid$region == ""] <- "Asia / Pacific"
aid$region <- as.factor(aid$region)
# Check the resulting categorisation:
pander(table(aid$region))
```
# Linear regression
## Simple regression: aid and democracy
Let's first run a regression to see whether more aid per capita correlates with higher levels of democracy in 2012, after the aid period. Note that this does not distinguish a scenario where more aid goes to democracies from one where more aid leads to more democratization.
```{r results = "asis"}
stargazer(lm(P4_polity2 ~ Aid_per_capita, data = aid), type = "html")
```
We use the stargazer package to help produce more readable regression tables - you will now also start to understand the name of this package.
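As an aside, when working interactively you can also save the fitted model and inspect it without stargazer; a sketch using R's built-in mtcars data rather than the aid data:

```r
# toy model on built-in data, to show the console-based alternative
m <- lm(mpg ~ wt, data = mtcars)
summary(m)   # coefficients, standard errors, t values, R^2
coef(m)      # just the point estimates
confint(m)   # 95% confidence intervals
```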
While the above provides a numerical estimate, we can also produce a graphical depiction using the "lm" method in geom_smooth():
```{r}
ggplot(aid, aes(x = Aid_per_capita, y = P4_polity2)) + geom_point(alpha = .3) +
geom_smooth(method = "lm")
```
Of course, we should make this figure more presentable, based on what we learned last week about ggplot functionality:
```{r}
ggplot(aid, aes(x = Aid_per_capita, y = P4_polity2)) + geom_point(alpha = .3) +
geom_smooth(method = "lm") +
labs(x = "Aid per capita", y = "Polity IV democracy score") +
theme_minimal()
```
> **`r qnr()` Repeat the above (table and plot) using the logAPC variable instead of Aid_per_capita.**
> **`r qnr()` Repeat the above (table and plot) using the DAC_aid variable instead of Aid_per_capita.**
## Multiple regression: adding region fixed effects
In the plot you will have noticed that there is an upward trend, but it is very slight, and the wide gray band indicates a high level of variation and uncertainty around this trend. This is partly because in some parts of the world, democracy levels are simply much lower than in others, entirely unrelated to development aid. If we look at the plot for different regions, we see slightly clearer trends:
```{r}
ggplot(aid, aes(x = Aid_per_capita, y = P4_polity2)) + geom_point(alpha = .3) +
geom_smooth(method = "lm") +
facet_wrap(~ region) +
labs(x = "Aid per capita", y = "Polity IV democracy score") +
theme_minimal()
```
In regression analysis we can do this by adding region as a factor to the regression - note that we now obtain stars on our key independent variable, aid:
```{r results = "asis"}
stargazer(lm(P4_polity2 ~ Aid_per_capita + region, data = aid), type = "html")
```
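Behind the scenes, lm() expands a factor into a set of 0/1 dummy variables, one per level except the reference category; model.matrix() makes this visible (shown here with a toy factor, not the actual region variable):

```r
d <- data.frame(region = factor(c("Africa", "Asia", "Africa", "Europe")))
model.matrix(~ region, data = d)
# an intercept column plus dummies for Asia and Europe;
# Africa, the first level alphabetically, is the reference category
```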
> **`r qnr()` Repeat the above (table and plot) using the logAPC variable instead of Aid_per_capita.**
## Multiple regression: adding control variables
While this shows an interesting relationship, we should really be adding a number of potential confounding factors. For example, poorer countries get more aid and are less likely to be democratic, so we should add GDP per capita as a control. More ethnically diverse countries will have more difficulty democratizing, as well as developing economically, so we might add that as a control variable.
```{r results = "asis"}
stargazer(lm(P4_polity2 ~ Aid_per_capita + Initial_GDP_per_capita + Ethnic_Fractionalization + region, data = aid), type = "html")
```
> **`r qnr()` Also add population size (Initial_population) as a control variable.**
> **`r qnr()` Repeat the above using logAPC instead of Aid_per_capita.**
> **`r qnr()` Throughout, what do you conclude about the relationship between aid and democracy?**
## Change in democracy levels
Note that Kersting and Kilby (2014) look at the change in democracy scores (**DPolity**) from the start to the end of the assessed time period, and then add the initial level of democracy as an additional variable **Initial_Polity**.
```{r results = "asis"}
stargazer(lm(DPolity ~ Aid_per_capita + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + region, data = aid), type = "html")
```
> **`r qnr()` Repeat using logAPC, adding initial population. What do you conclude?**
> **`r qnr()` Repeat but using DFH_score and Initial_FH_score instead of DPolity and Initial_Polity. What do you conclude?**
# Extra: Variable selection
## Stepwise regression
With stepwise regression, we can keep adding (step-forward) or removing (step-backward) variables to see how much each helps in *predicting* the dependent variable. Note that this is very different from causal analysis, where we are concerned with the theoretically confounding nature of variables, not the predictive quality!
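step() compares models by the AIC, which rewards fit but penalizes the number of parameters; a quick sketch on R's built-in mtcars data:

```r
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp + qsec + drat + gear, data = mtcars)
AIC(small)
AIC(big)  # the model with the lower AIC is preferred
# step() internally uses extractAIC(), which differs from AIC() by a
# constant but ranks models the same way
extractAIC(small)
```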
We will use a slightly larger set of variables to start with. We start by estimating a model with a large number of variables, and save the model as "full". To make sure the stepwise regression is always based on the same observations, regardless of which variables are included, we first need to create a data set with no missing values on any of the variables of interest. When you run a regression, this already happens implicitly, but only for the variables included in that regression. If you want to do it explicitly, for a set of variables, you can use the model.frame() command. We save this data set as **designMatrix**.
```{r}
designMatrix <- model.frame(DPolity ~ Aid_per_capita + logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + region + mainColonialPower + Openness + Christian_share, data = aid)
full <- lm(DPolity ~ Aid_per_capita + logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + region + mainColonialPower + Openness + Christian_share, data = designMatrix)
```
We then estimate a "null" model, which has no variables included, only an intercept.
```{r}
null <- lm(DPolity ~ 1, data = designMatrix)
```
We can then use stepwise regression, for example starting with the full model and removing variables one by one based on the AIC score:
```{r}
step(full, direction = "backward")
```
Or start with a null model and add variables one by one:
```{r}
step(null, scope = list(upper = full), direction = "forward")
```
This tells a slightly different story from the backward process. Here GDP per capita and the share of Christians are also left out, just as with the backward process, but openness and colonial power are not included this time, while aid per capita is kept both linearly and through the logarithmic transformation.
```{r}
step(null, scope = list(lower = null, upper = full), direction = "both")
```
The final method allows for both adding and removing variables from the model. Based on the null model, the initial democracy value is added first, which of course strongly affects how much the democracy level can change. Then the regional dummies are added, as regions vary significantly in levels of democracy. Aid is then added, but removed later as the logarithmically transformed variable - once other variables are taken into account - is a better predictor. We finally obtain a relatively simple model, with initial democracy level, region, initial population, ethnic fractionalization, and our key independent variable the log transformed aid per capita.
## Subset regressions
An alternative approach is a bit cruder, but easily feasible for small models: run a whole range of possible subsets and simply see which combination leads to the best fit (e.g. by AIC or adjusted $R^2$). The leaps package provides the regsubsets() function, which does just this and has a convenient plot method.
```{r}
plot(regsubsets(DPolity ~ logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + Openness + region, data = aid, nbest = 20, method = "exhaustive"), scale = "adjr2")
```
In the plot, we see the models ranked by the chosen fit metric - in this case adjusted $R^2$, but we could also have used another one. Note that a few variables have been removed to keep the model a bit smaller, but even then this is an unreadable plot. It also tests whether the model works with or without the initial democracy level - but theoretically it makes no sense to leave that out - and it includes and excludes individual regions, whereas the regional dummies should be either all in or all out. So we can force a number of variables to be in, resulting in slightly more readable output:
```{r}
plot(regsubsets(DPolity ~ logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + Openness + region, data = aid, nbest = 20, method = "exhaustive", force.in = c("Initial_Polity", "regionLatin America / Caribbean", "regionMiddle East / North Africa", "regionSouth Asia", "regionSub-Saharan Africa")), scale = "adjr2")
```
We can thus conclude that a model with logged aid per capita, and including ethnic fractionalization and initial population, does best, while the openness and plain aid per capita variables can be left out.
## Ridge regression
A third method is to penalize large coefficients, by using a regression model that puts less weight on those variables that contribute less.
```{r}
plot(m <- lm.ridge(DPolity ~ Aid_per_capita + logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + region + mainColonialPower + Openness + Christian_share, data = aid,
lambda = 10^(-2:2)))
```
In the plot we see how the coefficients change as $\lambda$ changes, for a selection of chosen lambdas. As $\lambda$ increases, the coefficients are further constrained and thus become smaller in size. A very rough rule of thumb is to select the $\lambda$ where the impact starts to decline, but this is very arbitrary.
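The shrinkage can be seen directly from the ridge closed-form solution $\hat{\beta} = (X'X + \lambda I)^{-1}X'y$; a self-contained sketch on simulated data (ignoring lm.ridge()'s exact internal scaling):

```r
set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X  <- scale(cbind(x1, x2))  # standardize the predictors
yc <- y - mean(y)           # centre the outcome

ridge_coef <- function(lambda)
  solve(t(X) %*% X + lambda * diag(2), t(X) %*% yc)

ridge_coef(0)    # ordinary least squares on the standardized data
ridge_coef(100)  # heavy penalty: coefficients shrink towards zero
```

Increasing $\lambda$ always reduces the overall size (squared norm) of the coefficient vector, which is exactly the pattern the plot above traces out.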
Some statistics have been developed to assess the appropriate value of $\lambda$. Here the optimal value depends a lot on the method of selection used:
```{r}
MASS::select(m) # be explicit: both MASS and dplyr provide a select() function
```
So let's repeat the ridge regression using the value of $\lambda$ suggested by the HKB estimator:
```{r}
lm.ridge(DPolity ~ Aid_per_capita + logAPC + Initial_Polity + Initial_GDP_per_capita + Ethnic_Fractionalization + Initial_population + region + mainColonialPower + Openness + Christian_share, data = aid,
lambda = 3.7)
```
Running a ridge regression with the associated $\lambda$ value gives us a new set of regression coefficients.
> **`r qnr()` Try the various variable selection procedures using the Freedom House instead of the Polity data.**
> **`r qnr()` Try to identify additional potential control variables in the data and use variable selection to see how useful they are (for prediction).**