---
title: "Introduction to Statistics - Lab 10"
author: "Johan A. Elkink"
date: "18 November 2019"
output:
  html_document:
    number_sections: yes
    toc: yes
    theme: cerulean
---
```{r}
library(rio)
library(dplyr)
library(stargazer)
library(pander)
library(ggplot2)
library(hmeasure)
```
# Introduction
In this lab we will look at estimating and evaluating logistic regression models. While the underlying estimation methodology is very different, the implementation in R is quite similar to that of the linear model.
We will make use of the Asiabarometer data available on the [teaching data page](http://www.joselkink.net/data.php). There you will also find a [brief description](http://www.joselkink.net/wp-content/uploads/2013/01/asiabaro.txt) of all variables.
```{r}
asiabaro <- import("http://www.joselkink.net/wp-content/uploads/2013/01/asiabaro.dta")
```
When R opens this file, some categorical variables are not properly encoded as factors. This code fixes that:
```{r}
fixFactor <- function(x) {
  labels <- names(attr(x, "labels"))
  names(labels) <- unname(attr(x, "labels"))
  factor(labels[x])
}
asiabaro$country <- fixFactor(asiabaro$country)
```
# Performing logistic regression
Here we evaluate which factors affect whether someone is at least somewhat interested in politics:
```{r results = "asis"}
asiabaro <- asiabaro %>% mutate(somewhatInterested = as.integer(polinterest >= 2))

stargazer(
  m0 <- glm(somewhatInterested ~ age,
            data = asiabaro,
            family = binomial(link = "logit")),
  m1 <- glm(somewhatInterested ~ age + female + urban + education,
            data = asiabaro,
            family = binomial(link = "logit")),
  type = "html")
```
# Prediction
To obtain predicted probabilities, for example for a plot or a table, we can use the predict() function. Setting the type argument to "response" returns probabilities rather than values of the linear predictor.
```{r}
yhat <- predict(m1, type = "response")
```
We could for example use this to plot predicted probabilities against one of the independent variables:
```{r}
ggplot(m1$model, aes(x = education, y = yhat)) +
  geom_point(alpha = .3) +
  theme_minimal() +
  labs(x = "Years of education",
       y = "Predicted probability of interest in politics") +
  lims(y = c(0, 1))
```
Note that we use m1$model as the data set. This is the data set used in the estimation of model m1, so that observations with missing data are excluded. Otherwise the number of predicted probabilities would not match the number of observations on education.
Furthermore, we fix the scale of the y-axis to 0-1, to avoid suggesting large effects when the actual predicted probabilities vary little.
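We can verify the difference in size between the full data set and the estimation sample directly. This is a quick check, not part of the analysis itself:

```r
# glm() drops observations with missing values on any model variable,
# so the estimation sample can be smaller than the full data set
nrow(asiabaro)
nrow(m1$model)
```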
To predict for a specific kind of individual, we need to create a temporary data frame:
```{r}
df <- data.frame(
  age = 40,
  female = 1,
  education = 12,
  urban = 1
)
predict(m1, newdata = df, type = "response")
```
We thus find that for a 40-year-old female with 12 years of education, living in an urban environment, the predicted probability of at least some interest in politics is 0.44.
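As a sanity check, the same number can be computed by hand: the predicted probability is the inverse logit of the linear predictor, and plogis() in base R is exactly this inverse logit (logistic) function. The following sketch should reproduce the predict() result above:

```r
# Manual computation: inverse logit of the linear predictor X'beta
b <- coef(m1)
plogis(b["(Intercept)"] + b["age"] * 40 + b["female"] * 1 +
       b["urban"] * 1 + b["education"] * 12)
```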
We can also do this for a larger temporary data set, for example to produce a plot of predicted probabilities:
```{r}
df <- expand.grid(
  age = 18:90,
  education = 12,
  female = 0:1,
  urban = 0
)
df$predY <- predict(m1, newdata = df, type = "response")
```
Let's have a quick look at this data (showing only the first few observations):
```{r}
pander(head(df))
```
We can then use this to plot:
```{r}
df <- df %>%
  mutate(gender = recode(female, `0` = "Male", `1` = "Female"))

ggplot(df, aes(x = age, y = predY, col = gender)) +
  geom_line(lwd = 2) +
  theme_minimal() +
  lims(y = c(0, 1)) +
  labs(x = "Age",
       y = "Predicted probability of interest in politics",
       col = "Gender")
```
# Evaluating the model fit
## Confusion table
One way of investigating the output of a logistic regression is by looking at the confusion table, a table of the dependent variable by the predicted dependent variable. We have to arbitrarily fix a threshold for predicting a one; typically we take a probability of 0.5.
```{r}
pander(table(predict(m1, type = "response") > .5, m1$model$somewhatInterested))
pander(prop.table(table(predict(m1, type = "response") > .5, m1$model$somewhatInterested)))
```
Remember that with the table() command, it is always rows first, columns second. So in the rows we have the predicted values and in the columns the original variable.
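A simple summary of the confusion table is the proportion of observations correctly classified. As a quick sketch, this can be computed directly from the predictions at the 0.5 threshold:

```r
# Share of observations where the 0.5-threshold prediction
# matches the observed dependent variable
mean((predict(m1, type = "response") > .5) == m1$model$somewhatInterested)
```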
By setting type = "response" in the predict() command, we obtain predicted probabilities. The alternative is the linear predictor, which is positive exactly when the predicted probability exceeds 0.5, so this gives the same result:
```{r}
pander(table(predict(m1) > 0, m1$model$somewhatInterested))
```
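We can confirm this equivalence directly; since plogis(0) is 0.5, a positive linear predictor always corresponds to a predicted probability above one half:

```r
# The two classification rules agree for every observation
all((predict(m1) > 0) == (predict(m1, type = "response") > .5))
```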
## ROC curve
For the ROC curve we make use of the hmeasure package, which also provides us with the AUC statistic. Alternative packages that provide similar fit statistics exist.
```{r}
plotROC(HMeasure(m1$model$somewhatInterested, predict(m1)))
```
## Prediction metrics
We can obtain the AUC (area under curve) measure as well as many other metrics on the quality of the model prediction as follows:
```{r}
pander(HMeasure(m1$model$somewhatInterested, predict(m1))$metrics)
```
# Exercises
```
===============================================================
                                 Dependent variable:
                      -----------------------------------------
                                      abstain
                           (1)           (2)           (3)
---------------------------------------------------------------
polinterest             -0.378***     -0.379***     -0.416***
                         (0.115)       (0.119)       (0.128)
urban                                  0.576**        0.348
                                       (0.240)       (0.275)
female                                 -0.134        -0.115
                                       (0.206)       (0.216)
countryJapan                                          0.230
                                                     (0.370)
countryKorea                                        -1.527***
                                                     (0.458)
countryMainland China                                -0.825
                                                     (0.521)
countryMongolia                                     -0.870**
                                                     (0.435)
countryPhilippines                                    0.214
                                                     (0.361)
countryTaiwan                                       -1.202***
                                                     (0.421)
countryThailand                                     -1.613***
                                                     (0.568)
Constant                -1.024***     -1.379***      -0.618
                         (0.188)       (0.296)       (0.424)
---------------------------------------------------------------
Observations               701           701           701
Log Likelihood          -315.455      -312.137      -288.934
Akaike Inf. Crit.        634.910       632.274       599.867
===============================================================
Note:                               *p<0.1; **p<0.05; ***p<0.01
```
1. The above contains a series of three logistic regressions explaining turnout in elections (abstain). Replicate these models.
2. What do you conclude about the relation between political interest and turnout in elections? Tip: use the "divide by four" rule of thumb and consider the scale on which the independent variable is measured.
3. Based on these regressions, would you conclude that urbanisation is an important predictor?
4. Based on Model 2, compute the predicted probability for an urban female to vote. Repeat for an urban male.
5. Using Model 3, produce a confusion table.
6. Using Model 3, produce an ROC curve.