library(rio)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer

1 Introduction

In this class we will look at multiple regression, using replication data from Ross (2004). While Ross uses panel data (multiple countries over multiple years), we will select only one year to avoid complications with time-series data. This data set has already been prepared:

ross <- import("http://www.joselkink.net/wp-content/uploads/2013/01/ross_1997.dta")

Check out the codebook for a description of the relevant variables.
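
To illustrate what selecting a single year from panel data involves, here is a minimal sketch on a toy data frame. The column names (country, year, democracy), the values, and the choice of 1997 are assumptions made for this example only; the prepared file above needs no such step.

```r
# Hypothetical toy panel: two countries observed in three years.
# Column names and values are illustrative assumptions, not the Ross data.
panel <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(1995:1997, times = 2),
  democracy = c(2, 3, 3, 8, 8, 9)
)

# Keep a single year so that each country contributes exactly one row.
cross_section <- subset(panel, year == 1997)

nrow(cross_section)                   # number of rows left after filtering
anyDuplicated(cross_section$country)  # 0 means no country appears twice
```

With one row per country, the observations are no longer repeated measurements of the same unit, which is what lets us set the time-series complications aside.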

2 Multiple regression

Estimating a multiple regression model is a straightforward extension once you know how to estimate a simple regression: you simply add the extra variables to the regression equation. As a starting point, a simple regression of corruption on democracy would be as follows:

lm(corruption ~ democracy, ross)
## 
## Call:
## lm(formula = corruption ~ democracy, data = ross)
## 
## Coefficients:
## (Intercept)    democracy  
##      2.0259       0.2004

If we wanted to add the level of economic performance as a control variable, we might include GDP per capita:

lm(corruption ~ democracy + gdppc, ross)
## 
## Call:
## lm(formula = corruption ~ democracy + gdppc, data = ross)
## 
## Coefficients:
## (Intercept)    democracy        gdppc  
##   1.990e+00    1.145e-01    7.747e-05

As a side note, variables that measure money, like GDP per capita, or size, like population, typically have a very skewed distribution. Their relationship with the dependent variable is then likely to be non-linear, and a linear regression often fits better when the variable is log-transformed. For example:

lm(corruption ~ democracy + log(gdppc), ross)
## 
## Call:
## lm(formula = corruption ~ democracy + log(gdppc), data = ross)
## 
## Coefficients:
## (Intercept)    democracy   log(gdppc)  
##     -1.3510       0.1189       0.4674

(As it happens, the data set already contains a variable called loggdppc, but I wanted to include an example that can be used when such a variable is not already available.)
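
The point about skewed variables can be checked on simulated data. The sketch below assumes a right-skewed, income-like variable and an outcome that depends on its logarithm; all names and parameter values are invented for illustration and have nothing to do with the Ross data.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated, purely illustrative data: a right-skewed income-like variable
# and an outcome that depends on its logarithm (an assumption of this sketch).
n      <- 200
income <- exp(rnorm(n, mean = 8, sd = 1))
y      <- 0.5 * log(income) + rnorm(n, sd = 0.5)

fit_raw <- lm(y ~ income)       # income entered on its original scale
fit_log <- lm(y ~ log(income))  # income entered on the log scale

# Compare how much variation each specification explains.
r2 <- c(raw = summary(fit_raw)$r.squared,
        log = summary(fit_log)$r.squared)
r2
```

Because the outcome was generated from log(income), the log specification should explain more of the variation here. With real data the true functional form is unknown, so it is worth comparing both versions rather than assuming the log is always better.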

Typically, rather than looking at this output directly we would save the output as an R object and then use a package that presents the results better:

regOutput <- lm(corruption ~ democracy + log(gdppc), ross)

stargazer(regOutput, type = "html", style = "ajps")
                        corruption
democracy               0.119***
                        (0.037)
log(gdppc)              0.467***
                        (0.102)
Constant                -1.351*
                        (0.754)
N                       100
R-squared               0.416
Adj. R-squared          0.404
Residual Std. Error     0.968 (df = 97)
F Statistic             34.610*** (df = 2; 97)
***p < .01; **p < .05; *p < .1

We see that once we control for the log of GDP per capita, the estimated coefficient of democracy on corruption drops from 0.200 to 0.119, a reduction of roughly 40 percent.
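
The arithmetic behind this comparison, and behind reading a coefficient on a logged variable, can be sketched as follows. The numbers are copied from the regression output above; the variable names are introduced here only for illustration.

```r
# Coefficients copied from the regression output shown earlier in this lab.
b_simple     <- 0.2004  # democracy, no controls
b_controlled <- 0.1189  # democracy, controlling for log(gdppc)
b_loggdp     <- 0.4674  # log(gdppc)

# Proportional reduction in the democracy coefficient after adding the control.
reduction <- (b_simple - b_controlled) / b_simple
round(reduction, 2)

# Predicted change in corruption when GDP per capita doubles, holding
# democracy fixed: the coefficient times log(2), because
# log(2 * x) - log(x) = log(2).
effect_doubling <- b_loggdp * log(2)
round(effect_doubling, 2)
```

Expressing the log coefficient as the effect of doubling GDP per capita tends to be easier to communicate than the effect of a one-unit change in log(gdppc).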

3 Exercises

  1. Create a new RMarkdown file for this lab and fill out the details in the header. Use it for the remainder of the questions.
  2. Produce a scatter plot with regression line for the relation between democracy and corruption.
  3. Repeat the regression analysis adding at least three more relevant control variables (not including region).
  4. Consider for each of the control variables (including GDP per capita) why you think this is a good variable to include.
  5. Fully interpret the regression results. What does it tell you about the relation between democracy and corruption?
  6. Perform a regression analysis with democracy as the dependent variable and the level of taxation (taxes) as the independent variable.
  7. Produce a scatter plot with regression line for these two variables.
  8. Repeat the regression analysis adding at least three relevant control variables.
  9. Consider for each of the control variables why you think this is a good variable to include.
  10. Fully interpret the regression results. What does it tell you about the relation between taxation and democracy?
  11. Continue running relevant regressions and try to fully understand the output.
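
As a starting point for the scatter plot exercises, here is a minimal sketch using base R graphics on simulated data. The variables x and y are hypothetical placeholders; substitute the relevant columns from the lab data set.

```r
set.seed(1)  # reproducible toy data

# Hypothetical variables x and y; replace them with columns from your data.
x <- runif(50, min = 0, max = 10)
y <- 1 + 0.3 * x + rnorm(50)

fit <- lm(y ~ x)

plot(x, y, xlab = "x", ylab = "y", main = "Scatter plot with fitted line")
abline(fit, col = "red", lwd = 2)  # abline() accepts a simple lm fit directly
```

Passing the fitted model to abline() only works for a regression with a single predictor, so fit the simple regression for the plot even if your final model includes control variables.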