library(rio)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(pander)
library(ggplot2)
library(hmeasure)
In this lab we will look at estimating and evaluating logistic regression models. While the underlying estimation methodology is very different, the implementation in R is quite similar to that of the linear model.
We will make use of the Asiabarometer data available on the teaching data page. There you will also find a brief description of all variables.
asiabaro <- import("http://www.joselkink.net/wp-content/uploads/2013/01/asiabaro.dta")
When R opens this file, some categorical variables are not properly encoded as factors. This code fixes that:
fixFactor <- function(x) {
labels <- names(attr(x, "labels"))
names(labels) <- unname(attr(x, "labels"))
factor(labels[x])
}
asiabaro$country <- fixFactor(asiabaro$country)
Here we evaluate which factors affect whether someone is at least somewhat interest in politics:
asiabaro <- asiabaro %>% mutate(somewhatInterested = as.integer(polinterest >= 2))
stargazer(
m0 <- glm(somewhatInterested ~ age,
data = asiabaro,
family = binomial(link = "logit")),
m1 <- glm(somewhatInterested ~ age + female + urban + education,
data = asiabaro,
family = binomial(link = "logit")),
type = "html")
Dependent variable: | ||
somewhatInterested | ||
(1) | (2) | |
age | 0.003 | 0.018^{***} |
(0.005) | (0.006) | |
female | -0.549^{***} | |
(0.164) | ||
urban | -0.923^{***} | |
(0.203) | ||
education | 0.094^{***} | |
(0.023) | ||
Constant | -0.027 | -0.629 |
(0.235) | (0.406) | |
Observations | 686 | 665 |
Log Likelihood | -473.843 | -434.656 |
Akaike Inf. Crit. | 951.685 | 879.312 |
Note: | ^{}p<0.1; ^{}p<0.05; ^{}p<0.01 |
To obtain predicted probabilities, for example for a plot or for a table, we can use the predict() function. When we set the type parameter to “response”, we get probabilities.
yhat <- predict(m1, type = "response")
We could for example use this to plot predicted probabilities against one of the independent variables:
ggplot(m1$model, aes(x = education, y = yhat)) +
geom_point(alpha = .3) +
theme_minimal() +
labs(x = "Years of education",
y = "Predicted probability of interest in politics") +
lims(y = c(0,1))
Note that we use m1$model as the data set, which is the data set used in the estimation of model m1, such that all missing data is removed. Otherwise the number of predicted probabilities would not match the number of observations on education.
Furthermore, we fix the scale of the y-axis to 0-1, to avoid suggesting large effects when the actual predicted probabilities vary little.
To predict for a specific kind of individual, we need to create a temporary data frame:
df <- data.frame(
age = 40,
female = 1,
education = 12,
urban = 1
)
predict(m1, newdata = df, type = "response")
## 1
## 0.4375449
We thus find that for a 40 year old female with 12 years of education, living in an urban environment, the predicted probability of at least some interest in politics is 0.44.
We can also do this for a larger temporary data set, for example to produce a plot of predicted probabilities:
df <- expand.grid(
age = 18:90,
education = 12,
female = 0:1,
urban = 0
)
df$predY <- predict(m1, newdata = df, type = "response")
Let’s have a quick look at this data (showing only the first few observations):
pander(head(df))
age | education | female | urban | predY |
---|---|---|---|---|
18 | 12 | 0 | 0 | 0.6951 |
19 | 12 | 0 | 0 | 0.6989 |
20 | 12 | 0 | 0 | 0.7027 |
21 | 12 | 0 | 0 | 0.7065 |
22 | 12 | 0 | 0 | 0.7102 |
23 | 12 | 0 | 0 | 0.7139 |
We can then use this to plot:
df <- df %>%
mutate(gender = recode(female, `0` = "Male", `1` = "Female"))
ggplot(df, aes(x = age, y = predY, col = gender)) +
geom_line(lwd = 2) +
theme_minimal() +
lims(y = c(0,1)) +
labs(x = "Age",
y = "Predicted probability of interest in politics",
col = "Gender")