library(rio)
library(dplyr)
library(stargazer)
library(pander)
library(ggplot2)

1 Introduction

In this class we will look at regression diagnostics. We will focus on detecting issues rather than addressing them. The code in this lab is significantly more difficult than in previous classes and should not be taken as core course material. If you’re struggling with this lab, don’t worry! It is primarily for your information, to give you some idea of how one might investigate the types of problems discussed in the lecture.

We will make use of a data set on democracy and development. For an overview of the variables and a brief description of the data set, see the description page.

demdev <- import("http://www.joselkink.net/wp-content/uploads/2013/01/demdev.dta")

This is a panel, or longitudinal, data set, which means that we observe the same units (countries) over multiple time periods (years). It is helpful to have a table of country by year to get an idea of what data are available. First we need to perform some tricks to get a character vector of country names, because the country variable in the data set is stored as numeric codes with value labels attached:

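# The "labels" attribute is a named vector: country names as names, numeric codes as values
countryNames <- names(attr(demdev$country, "labels"))
names(countryNames) <- unname(attr(demdev$country, "labels"))
# Look each code up by name (as a character string) rather than by position
demdev$countryName <- unname(countryNames[as.character(demdev$country)])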
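
The lookup can be tried out on a small toy vector before applying it to the real data. This is only a sketch: the codes and country names below are made up for illustration, and indexing with as.character() is a choice made here so that the lookup goes by code rather than by position, which matters whenever the codes are not simply 1, 2, 3, and so on.

```r
# Toy stand-in for a Stata-style labelled variable (hypothetical codes and names).
country <- c(20, 10, 30, 10)
attr(country, "labels") <- c("Ireland" = 10, "France" = 20, "Japan" = 30)

# Build a lookup vector: the values are country names, the names are the codes.
lab <- attr(country, "labels")
lookup <- setNames(names(lab), lab)

# Index by the code as a character string, so the match is by name, not position.
countryName <- unname(lookup[as.character(country)])
print(countryName)
```

Indexing with the bare numeric codes instead would pick the 20th, 10th and 30th elements of a three-element vector and return NA for each.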

And then we can make our table:

pander(demdev %>% 
         group_by(countryName) %>% 
         summarise(from = min(year), to = max(year)) %>% 
         arrange(from, to, countryName)
)
countryName from to
Germany East 1951 1990
Germany West 1951 1990
Yemen North 1951 1990
Czechoslovakia 1951 1992
Afghanistan 1951 1998
Albania 1951 1998
Argentina 1951 1998
Australia 1951 1998
Austria 1951 1998
Belgium 1951 1998
Bhutan 1951 1998
Bolivia 1951 1998
Brazil 1951 1998
Bulgaria 1951 1998
Canada 1951 1998
Chile 1951 1998
China 1951 1998
Colombia 1951 1998
Costa Rica 1951 1998
Cuba 1951 1998
Denmark 1951 1998
Dominican Rep 1951 1998
Ecuador 1951 1998
Egypt 1951 1998
El Salvador 1951 1998
Ethiopia 1951 1998
Finland 1951 1998
France 1951 1998
Greece 1951 1998
Guatemala 1951 1998
Haiti 1951 1998
Honduras 1951 1998
Hungary 1951 1998
India 1951 1998
Indonesia 1951 1998
Iran 1951 1998
Iraq 1951 1998
Ireland 1951 1998
Israel 1951 1998
Italy 1951 1998
Japan 1951 1998
Jordan 1951 1998
Korea North 1951 1998
Korea South 1951 1998
Lebanon 1951 1998
Liberia 1951 1998
Libya 1951 1998
Mexico 1951 1998
Mongolia 1951 1998
Myanmar (Burma) 1951 1998
Nepal 1951 1998
Netherlands 1951 1998
New Zealand 1951 1998
Nicaragua 1951 1998
Norway 1951 1998
Oman 1951 1998
Panama 1951 1998
Paraguay 1951 1998
Peru 1951 1998
Philippines 1951 1998
Poland 1951 1998
Portugal 1951 1998
Romania 1951 1998
Saudi Arabia 1951 1998
South Africa 1951 1998
Spain 1951 1998
Sri Lanka 1951 1998
Sweden 1951 1998
Switzerland 1951 1998
Syria 1951 1998
Taiwan 1951 1998
Thailand 1951 1998
Turkey 1951 1998
United Kingdom 1951 1998
United States 1951 1998
Uruguay 1951 1998
Venezuela 1951 1998
Yugoslavia 1951 1998
Cambodia 1953 1998
Vietnam North 1954 1976
Laos 1954 1998
Sudan 1954 1998
Vietnam South 1955 1975
Morocco 1956 1998
Malaysia 1957 1998
Guinea 1958 1998
Jamaica 1959 1998
Singapore 1959 1998
Tunisia 1959 1998
Benin 1960 1998
Burkina Faso 1960 1998
Cameroon 1960 1998
Central African Republic 1960 1998
Chad 1960 1998
Congo Brazzaville 1960 1998
Congo Kinshasa 1960 1998
Cyprus 1960 1998
Gabon 1960 1998
Ghana 1960 1998
Ivory Coast 1960 1998
Madagascar 1960 1998
Mali 1960 1998
Mauritania 1960 1998
Niger 1960 1998
Nigeria 1960 1998
Senegal 1960 1998
Somalia 1960 1998
Togo 1960 1998
Rwanda 1961 1998
Sierra Leone 1961 1998
Tanzania 1961 1998
Algeria 1962 1998
Burundi 1962 1998
Trinidad 1962 1998
Uganda 1962 1998
Kenya 1963 1998
Kuwait 1963 1998
Malawi 1964 1998
Zambia 1964 1998
Gambia 1965 1998
Botswana 1966 1998
Guyana 1966 1998
Lesotho 1966 1998
Yemen South 1967 1990
Equatorial Guinea 1968 1998
Mauritius 1968 1998
Swaziland 1968 1998
Fiji 1970 1998
Zimbabwe 1970 1998
Bahrain 1971 1998
Qatar 1971 1998
UAE 1971 1998
Bangladesh 1972 1998
Pakistan 1972 1998
Guinea-Bissau 1974 1998
Angola 1975 1998
Comoros 1975 1998
Mozambique 1975 1998
Papua New Guinea 1975 1998
Djibouti 1977 1998
Germany 1990 1998
Namibia 1990 1998
Yemen 1990 1998
Armenia 1991 1998
Azerbaijan 1991 1998
Belarus 1991 1998
Croatia 1991 1998
Estonia 1991 1998
Georgia 1991 1998
Kazakhstan 1991 1998
Kyrgyzstan 1991 1998
Latvia 1991 1998
Lithuania 1991 1998
Macedonia 1991 1998
Moldova 1991 1998
Slovenia 1991 1998
Tajikistan 1991 1998
Turkmenistan 1991 1998
Ukraine 1991 1998
Uzbekistan 1991 1998
Bosnia 1992 1998
Russia 1992 1998
Czech Republic 1993 1998
Eritrea 1993 1998
Slovakia 1993 1998

In total we have 6071 observations.
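
The table of first and last years does not show whether a country is observed in every year in between. A minimal sketch of one way to check this is given below, using base R and a made-up toy panel; the data frame panel and its values are hypothetical, and only the column names countryName and year mirror the lab data.

```r
# Hypothetical toy panel: "A" is observed 1951-1953, "B" only 1952-1953.
panel <- data.frame(
  countryName = c("A", "A", "A", "B", "B"),
  year        = c(1951, 1952, 1953, 1952, 1953)
)

# Number of years actually observed for each country.
nObs <- tapply(panel$year, panel$countryName, length)

# Number of years each country would cover without gaps, first to last year.
span <- tapply(panel$year, panel$countryName, function(y) max(y) - min(y) + 1)

# A country has gaps if it is observed in fewer years than its span.
hasGaps <- nObs < span

nrow(panel)    # total number of country-year observations
any(hasGaps)   # TRUE if at least one country has missing years in between
```

The same logic could be applied to demdev by replacing panel with the real data frame, assuming each country-year appears at most once.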

2 Main regression model

We will focus on one regression as an example, to investigate the various problems we might encounter. We will save the simple regression as “mdl1” and the multiple regression, which includes some control variables, as “mdl2”, so we can access them throughout. We also have “mdl0”, where we use energy consumption on its original scale instead of the more reasonable logged variable.

# Fit and store the three models first, then tabulate them
mdl0 <- lm(polity2 ~ energy2, data = demdev)
mdl1 <- lm(polity2 ~ log(energy2), data = demdev)
mdl2 <- lm(polity2 ~ log(energy2) + propdem + log(laggdppc), data = demdev)

stargazer(mdl0, mdl1, mdl2, type = "html")
Dependent variable: polity2

                     (1)                          (2)                          (3)
energy2              4.463***
                     (0.131)
log(energy2)                                      2.143***                     0.260**
                                                  (0.062)                      (0.107)
propdem                                                                        15.573***
                                                                               (1.289)
log(laggdppc)                                                                  3.196***
                                                                               (0.158)
Constant             -3.779***                    1.603***                     -31.367***
                     (0.128)                      (0.111)                      (1.431)
Observations         5,965                        5,811                        5,759
R2                   0.164                        0.170                        0.248
Adjusted R2          0.164                        0.170                        0.247
Residual Std. Error  6.915 (df = 5963)            6.932 (df = 5809)            6.606 (df = 5755)
F Statistic          1,169.108*** (df = 1; 5963)  1,188.344*** (df = 1; 5809)  631.090*** (df = 3; 5755)

Note: *p<0.1; **p<0.05; ***p<0.01
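
The number of observations differs across the three columns because lm() by default drops every row that has a missing value in any variable used by that particular model. The sketch below illustrates this on a made-up data frame; toy, y, x1 and x2 are hypothetical names and values, not part of the lab data.

```r
# Hypothetical toy data with missing values in different columns.
toy <- data.frame(
  y  = c(1, 2, 3, 4, 5, 6, 7, 8),
  x1 = c(2, 1, 4, 3, 6, 5, NA, 7),
  x2 = c(1, NA, 2, 2, 3, NA, 4, 5)
)

small <- lm(y ~ x1, data = toy)        # drops rows with NA in y or x1
large <- lm(y ~ x1 + x2, data = toy)   # additionally drops rows with NA in x2

nobs(small)   # observations used by the smaller model
nobs(large)   # observations used by the larger model

# For a like-for-like comparison, refit the small model on the rows
# that are complete for all variables of the larger model.
common <- toy[complete.cases(toy), ]
smallCommon <- lm(y ~ x1, data = common)
nobs(smallCommon)
```

Whether such a refit is worthwhile depends on the purpose; it matters mostly when fit statistics are compared across columns.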
ggplot(demdev, aes(x = log(energy2), y = polity2)) +
  geom_jitter(alpha = .3) +
  theme_minimal() +
  geom_smooth(method = "lm")
## Warning in log(energy2): NaNs produced

## Warning in log(energy2): NaNs produced

## Warning in log(energy2): NaNs produced
## Warning: Removed 260 rows containing non-finite values (stat_smooth).
## Warning: Removed 260 rows containing missing values (geom_point).
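
The "NaNs produced" warnings suggest that energy2 contains some negative values, since taking the log of a negative number returns NaN. The following sketch shows how log() treats negative, zero and missing inputs, and how one might count the affected values; the vector energy is made up for illustration and does not come from the data set.

```r
# Hypothetical values standing in for an energy-consumption variable.
energy <- c(-0.5, 0, 0.2, 1, 3.4, NA)

# log() of a negative number is NaN (with a warning); log(0) is -Inf; log(NA) is NA.
logged <- suppressWarnings(log(energy))
print(logged)

# Values that are not finite after logging cannot be plotted or fitted.
sum(!is.finite(logged))

# Non-positive raw values, counted before logging (missing values ignored).
sum(energy <= 0, na.rm = TRUE)
```

Running the last two lines on demdev$energy2 would be one way to see how many rows are affected by the transformation itself and how many are missing to begin with.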