We start by opening all the relevant libraries for this lab:

library(rio)       # contains import()
library(car)       # contains recode()
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(leaps)     # contains regsubsets()
library(MASS)      # contains lm.ridge()

1 Opening data

For Project 2 we start with an entirely different data set. This time, we will have country-level data and we will look at the relationship between development aid and democracy. The primary data set is replication data from Erasmus Kersting and Christopher Kily (2014), “Aid and democracy redux”, European Economic Review. While we use some of their data and main argument, their analysis is significantly more complicated statistically and we will simply do some related analysis, testing a range of different methods using their data.

Their data can be directly downloaded from the web:

aid <- import("http://www.joselkink.net/files/data/KerstingKilby2014-Tab-2-5.dta")

2 Data management

While the data provided by Kersting and Kilby contains key information on democracy indicators, aid flows, economic variables, etcetera, we might want to add some variables related to democracy, political violence, and state fragility. We obtain these data from the Centre for Systemic Peace. We will then merge these data sets with the aid data based on country names.

2.1 Merging

A problem is that the names are not always identical between the two data sets. For example, one uses the “Congo, Dem. Rep.”, while the other uses “Congo Kinshasa”. To solve this problem, I created a spreadsheet listing the matching names. We can then first merge the original data with this data set, and then merge the other data sets using the matched name. In the spreadsheet, the names used by Kersting and Kilby are saved as the kk variable and those used by the Polity IV data set of the Centre for Systemic Peace as the p4 variable.

matchNames <- import("http://www.joselkink.net/files/data/KerstingKilby-Polity-name-matching.xlsx")

# Remove Ivory Coast due to problems with spelling on Windows computers
matchNames <- matchNames[-35, ]
aid <- aid[-35, ]

aid <- merge(aid, matchNames, by.x = "name", by.y = "kk", all = TRUE)

This last line therefore merges the aid data set with the matchNames data set, using the name variable of the first data set to match against the kk variable of the second data set. Finally, we indicate that all observations should be included, regardless of whether they are missing in either data set - i.e. we keep non-matching cases.

The first data set we will add is the original Polity IV democracy indicators. We first download the data and use the subset() command to select only observations from year 2012 (all aid projects in the Kersting and Kilby data set end in 2011 at the latest).

p4 <- subset(import("http://www.systemicpeace.org/inscr/p4v2016.sav"), year == 2012)

Before we merge, we rename all variable names in the p4 data set by adding the prefix P4_ to each of them. This way, after the merger, it is easy to recognize any variables originating from this data set. This also helps when we try to locate the variables in the codebook. The function to merge to sets of strings of text is the paste() command, or if you want to avoid spaces in between the parts, the paste0() command.

names(p4) <- paste0("P4_", names(p4))
aid <- merge(aid, p4, by.x = "p4", by.y = "P4_country", all = TRUE)

So we merge the aid data set with the downloaded p4 data set, by using the p4 variable in the first (which was merged from the matching names data file above and contains the country names in the spelling used by the Polity IV project) and the P4_country variable in the second. Again, we do not want to lose any observations from either data set due to the merger.

We proceed in the same manner to add the Major Episodes of Political Violence data set (see also the codebook), again for 2012:

mepv <- subset(import("http://www.systemicpeace.org/inscr/MEPVv2016.sav"), year == 2012)
names(mepv) <- paste0("MEPV_", names(mepv))
aid <- merge(aid, mepv, by.x = "p4", by.y = "MEPV_country", all = TRUE)

A challenging exercise, but try to do the same using the file for the State Fragility Index, which can be found at http://www.systemicpeace.org/inscr/SFIv2016.sav and where you should use prefix FRAG_

Note also the codebook.

2.2 Computing variables

One variable that we will be using later is a binary indicator of democracy, using the merged Polity IV data. The democracy measure in that data set is the polity2 variable, which ranges from -10 (pure autocracies) to 10 (pure democracies) and typically 6 or higher is considered a democracy.

aid$democracy <- as.factor(ifelse(aid$P4_polity2 > 5, "Democracy", "Non-Democracy"))

Always when computing or recoding variables, it is a good idea to check whether the coding seems correct. For example, in this case a table will help:

pander(table(aid$P4_polity2, aid$democracy))
  Democracy Non-Democracy
-10 0 4
-9 0 4
-8 0 2
-7 0 9
-6 0 2
-5 0 1
-4 0 6
-3 0 6
-2 0 5
-1 0 3
0 0 5
1 0 3
2 0 3
3 0 6
4 0 3
5 0 9
6 13 0
7 9 0
8 20 0
9 18 0
10 34 0

While Kersting and Kilby use the Polity IV data set as well, their main indicator is based on Freedom House data, which separately scores civil liberties (CL) and political rights (PR).

For the following exercise you need to check the slides to understand how to combine different clauses (like “A and B”):

Create a separate democracy indicator variable, naming it FHdemocracy, which takes as a democracy only those countries where Initial_CL_score is greater than 30 and Initial_PR_score is greater than 20.

Create a table of the Polity IV based and the Freedom House based dummy variables, i.e. democracy and FHdemocracy.

But note that this compares 2012 Polity IV measures with much earlier (see startyear) measures on the Freedom House variables.

We can also use computations to transform a variable, for example by taking a logarithm. Again, we then evaluate whether the transform worked, in this case we could use a scatter plot to see whether it is indeed the shape of a log transform and histograms to understand how the distributions changed. Here we take the log of the Aid_per_capita variable.

aid$logAPC <- log(aid$Aid_per_capita)

ggplot(aid, aes(x = Aid_per_capita, y = logAPC)) + geom_point()
## Warning: Removed 74 rows containing missing values (geom_point).

ggplot(aid, aes(Aid_per_capita)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 74 rows containing non-finite values (stat_bin).

ggplot(aid, aes(logAPC)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 74 rows containing non-finite values (stat_bin).

Compute a variable that contains the logarithmic transform of the DAC_aid variable.

Produce histograms of the original DAC_aid variable and the variable after the logarithmic transformation. What do you see?

Note that in this data set, Initial_population and Initial_GDP_per_capita are already log transformed.

2.3 Recoding variables

The authors decided to create dummy variables (0 or 1 as only possible values) for former colonial powers. If we want to use this, it is probably going to be easier when we have a categorical, nominal variable that has all information.

aid$colonialPower <- "Independent"
aid$colonialPower[aid$British == 1] <- "British"
aid$colonialPower[aid$French == 1] <- "French"
aid$colonialPower[aid$Spanish == 1] <- "Spanish"
aid$colonialPower[aid$Portugese == 1] <- "Portugese"
aid$colonialPower[aid$Belgian == 1] <- "Dutch"
aid$colonialPower[aid$Belgian == 1] <- "Belgian"
aid$colonialPower[aid$Italian == 1] <- "Italian"
aid$colonialPower[aid$German == 1] <- "German"
aid$colonialPower[aid$American == 1] <- "American"
aid$colonialPower <- as.factor(aid$colonialPower)

pander(table(aid$British, aid$colonialPower))
Table continues below
  American Belgian British French German Independent Italian
0 3 1 0 19 4 13 1
1 0 0 54 1 0 0 1
  Portugese Spanish
0 7 17
1 0 0

Many countries have had only very few colonies, so for visualisations etc. it is probably better to only include the main colonial powers. We therefore group the small ones together as “other” in a new variable called mainColonialPower.

aid$mainColonialPower <- recode(aid$colonialPower,
  "c('American', 'Belgian', 'German', 'Italian', 'Portugese')='Other'")

pander(table(aid$colonialPower, aid$mainColonialPower))
  British French Independent Other Spanish
American 0 0 0 3 0
Belgian 0 0 0 1 0
British 54 0 0 0 0
French 0 20 0 0 0
German 0 0 0 4 0
Independent 0 0 87 0 0
Italian 0 0 0 2 0
Portugese 0 0 0 7 0
Spanish 0 0 0 0 17

We similarly create a region variable (this would be an easy exercise, but takes too much lab time).

aid$region <- ""
aid$region[aid$LAC == 1] <- "Latin America / Caribbean"
aid$region[aid$MENA == 1] <- "Middle East / North Africa"
aid$region[aid$SSA == 1] <- "Sub-Saharan Africa"
aid$region[aid$ECA == 1] <- "Europe / Central Asia"
aid$region[aid$SA == 1] <- "South Asia"
aid$region[aid$region == ""] <- NA
aid$region <- as.factor(aid$region)

Table continues below
Europe / Central Asia Latin America / Caribbean Middle East / North Africa
7 32 16
South Asia Sub-Saharan Africa
8 41

3 Linear regression

3.1 Simple regression: aid and democracy

Lets first run a regression where we see if more aid per capita correlates with higher levels of democracy in 2012, after the aid period. Note that this does not distinguish a scenario where more aid goes to democracies from a case where more aid leads to more democratization.

stargazer(lm(P4_polity2 ~ Aid_per_capita, data = aid), type = "html")
Dependent variable:
Aid_per_capita 0.011
Constant 2.955***
Observations 97
R2 0.008
Adjusted R2 -0.002
Residual Std. Error 5.993 (df = 95)
F Statistic 0.767 (df = 1; 95)
Note: p<0.1; p<0.05; p<0.01

We use the stargazer package to help produce more readable regression tables - you will now also start to understand the name of this package.

While the above provides a numerical estimate, we can also produce a graphical depiction using the “lm” method in geom_smooth():

ggplot(aid, aes(x = Aid_per_capita, y = P4_polity2)) + geom_point(alpha = .3) +
  geom_smooth(method = "lm")
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).

Repeat the above (table and plot) using the logAPC variable instead of Aid_per_capita.

Repeat the above (table and plot) using the DAC_aid variable instead of Aid_per_capita.

3.2 Multiple regression: adding region fixed effects

In the plot you will have noticed that there is an upward trend, but it is very slight, and the wide gray bars indicate that there is a high level of variation and uncertainty around this trend. This is partly the case because in some parts of the world, democracy levels are simply much lower than in other parts, entirely unrelated to development aid. If we look at the plot for different regions, we see slightly clearer trends:

ggplot(aid, aes(x = Aid_per_capita, y = P4_polity2)) + geom_point(alpha = .3) +
  geom_smooth(method = "lm") + facet_wrap(~ region)
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).