---
title: "Introduction to Statistics - Lab 9"
author: "Johan A. Elkink"
date: "11 November 2019"
output:
  html_document:
    number_sections: yes
    toc: yes
    theme: cerulean
---
```{r}
library(rio)
library(dplyr)
library(stargazer)
library(pander)
library(ggplot2)
```
# Introduction
In this class we will look at statistical testing, performing t-tests both to compare means and in a regression context.
We will continue to make use of data from the [Irish National Election Study](https://www.ucd.ie/issda/data/irishnationalelectionstudy/). We only look at the most recent election in the data set, which is already a bit dated: 2007. The data file is a [ZIP archive](https://www.lifewire.com/zip-file-2622675) containing a Stata file, which can be opened in R using a temporary file as follows:
```{r}
tempFile <- tempfile(fileext = ".zip")
download.file("https://www.ucd.ie/issda/t4media/INESLong_Beta.zip", tempFile)
ines <- import(tempFile, haven = FALSE) %>%
  filter(ines == 2007)
unlink(tempFile)
```
This uses the long file in Stata format on the INES archive website. If the above is too slow when using "Knit", you can also go to the INES website, download the file, unzip by clicking on the file, and then use a more typical method to open the file:
```{r eval = FALSE}
ines <- import("INESLong_Beta.dta", haven = FALSE) %>%
  filter(ines == 2007)
```
Check out the [codebook](https://www.ucd.ie/issda/t4media/INES%20Codebook.pdf) for a description of the relevant variables.
# t-tests in R
There are three types of t-tests for the mean:
* Testing the mean against some reference value (one-sample t-test).
* Testing the means of two variables on the same units (paired-sample t-test).
* Testing the means of the same variable on two different groups (two-sample t-test).
This used to be core material of the course, but now the course focuses on t- and F-tests in regression only. It is still useful to have a basic understanding of tests for comparing means, on which the t-test in regression is based.
First, we will prepare some variables to have proper variable names, which makes interpreting the output easier:
```{r}
ines <- ines %>%
  mutate(union = recode(v0936, "Yes", "No"),
         labour = v0190,
         progDems = v0191,
         turnout = recode(v0072, "Voted", "Did not vote", "Did not vote", "Did not vote"))
pander(table(ines$v0936, ines$union))
pander(table(ines$v0190, ines$labour))
pander(table(ines$v0191, ines$progDems))
pander(table(ines$v0072, ines$turnout))
```
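To see what the recode and the cross-table check are doing, here is a minimal sketch on made-up data. It assumes the variable holds numeric codes (1 and 2); the vector `answer` is hypothetical and not part of the INES data. With a numeric input, unnamed replacements in `recode()` are matched by position, while named replacements spell out the mapping explicitly.

```{r}
library(dplyr)

# Hypothetical numeric codes, standing in for a survey item coded 1/2
answer <- c(1, 2, 2, 1, 2)

# Unnamed replacements are matched by position: 1 -> "Yes", 2 -> "No"
byPosition <- recode(answer, "Yes", "No")

# Named replacements state the mapping explicitly
byName <- recode(answer, `1` = "Yes", `2` = "No")

# Cross-tabulate old against new values to check the recode
check <- table(answer, byName)
print(check)
```

Each row of the check table should have all of its cases in a single column; anything else points to a mistake in the recode.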
## One-sample t-test
A one-sample t-test compares the mean of one variable against a fixed value. For example, to test whether the mean of the variable *labour* differs from 5:
```{r}
pander(t.test(ines$labour, mu = 5))
```
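It can help to reproduce the test by hand once. The sketch below uses simulated scores (the vector `score` is an invented stand-in, not the INES data) and computes $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and its two-sided $p$-value, then compares both to the output of `t.test()`.

```{r}
# Simulated scores on a 0-10 scale; a stand-in, not the INES data
set.seed(42)
score <- round(runif(200, min = 0, max = 10))

mu0 <- 5                                   # reference value under the null
n <- length(score)
se <- sd(score) / sqrt(n)                  # standard error of the mean
tStat <- (mean(score) - mu0) / se          # t-statistic
pValue <- 2 * pt(-abs(tStat), df = n - 1)  # two-sided p-value

result <- t.test(score, mu = mu0)
print(c(byHand = tStat, tTest = unname(result$statistic)))
print(c(byHand = pValue, tTest = result$p.value))
```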
## Paired-sample t-test
A paired-sample t-test compares the means of two variables for the same individuals (e.g. a test score before and after a class). For example, to test whether the means of two variables, *labour* and *progDems*, differ for the same individuals:
```{r}
pander(t.test(ines$labour, ines$progDems, paired = TRUE))
```
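A paired-sample t-test is the same as a one-sample t-test on the differences between the two variables, tested against zero. The following sketch illustrates this with simulated data; `before` and `after` are hypothetical variables, not taken from the INES data.

```{r}
# Simulated paired measurements; not the INES data
set.seed(42)
before <- rnorm(50, mean = 5, sd = 2)
after <- before + rnorm(50, mean = 0.5, sd = 1)

# Paired test on the two variables
paired <- t.test(after, before, paired = TRUE)

# One-sample test on the within-person differences
oneSample <- t.test(after - before, mu = 0)

print(c(paired = unname(paired$statistic), oneSample = unname(oneSample$statistic)))
print(c(paired = paired$p.value, oneSample = oneSample$p.value))
```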
## Two-sample t-test
The two-sample or independent-samples t-test compares the mean of the same variable across two different groups. For example, does support for the Labour Party depend on whether a respondent is a member of a trade union?
```{r}
pander(t.test(ines$labour ~ ines$union))
```
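Two details are worth seeing on a small example. By default `t.test()` does not assume equal variances in the two groups (the Welch test); setting `var.equal = TRUE` gives the pooled-variance version. The pooled version matches the t-test on the coefficient when regressing the outcome on a group dummy, apart from the sign, which depends on which group is subtracted from which. The sketch below uses simulated data; `group` and `outcome` are hypothetical names.

```{r}
# Simulated outcome for two groups of unequal size; not the INES data
set.seed(42)
group <- factor(rep(c("No", "Yes"), times = c(120, 80)))
outcome <- rnorm(200, mean = ifelse(group == "Yes", 5.5, 5), sd = 2)

welch <- t.test(outcome ~ group)                     # default: unequal variances
pooled <- t.test(outcome ~ group, var.equal = TRUE)  # pooled variance

# Regression on the group dummy; "groupYes" is the difference Yes - No
model <- lm(outcome ~ group)
slope <- summary(model)$coefficients["groupYes", ]

print(c(welch = unname(welch$statistic),
        pooled = unname(pooled$statistic),
        regression = unname(slope["t value"])))
```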
# Chi-squared test
When producing a cross-table of two categorical variables, we can use the $\chi^2$-test to test whether the two variables are independent of each other or not. For example, to see if union members are more likely to participate in elections:
```{r}
pander(chisq.test(table(ines$union, ines$turnout)))
```
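The test statistic compares the observed counts $O$ with the counts $E$ expected under independence, $\chi^2 = \sum (O - E)^2 / E$. The sketch below computes it by hand for a made-up 2x2 table (the counts are hypothetical, not INES results). Note that `chisq.test()` applies a continuity correction to 2x2 tables by default, so `correct = FALSE` is needed to match the hand calculation.

```{r}
# Hypothetical 2x2 table of counts; not taken from the INES data
counts <- matrix(c(30, 20, 50, 100), nrow = 2,
                 dimnames = list(union = c("Yes", "No"),
                                 turnout = c("Did not vote", "Voted")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

chiSq <- sum((counts - expected)^2 / expected)
dof <- (nrow(counts) - 1) * (ncol(counts) - 1)
pValue <- pchisq(chiSq, df = dof, lower.tail = FALSE)

test <- chisq.test(counts, correct = FALSE)
print(c(byHand = chiSq, chisqTest = unname(test$statistic)))
print(c(byHand = pValue, chisqTest = test$p.value))
```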
# Exercises
Lab 5 has example code on recoding and testing whether the recode worked. Use that here as well for questions where necessary. Often it might be helpful to generate properly named and labelled variables first, then run the regression or analysis.
Create a new RMarkdown file for this lab and fill out the details in the header. Use it for the remainder of the questions.
First we look at attitudes towards abortion among younger voters.
+ Construct a new variable *age* which is calculated on the basis of *v0906*. Remember that the year of the survey is 2007.
+ Produce a frequency table of attitude towards abortion (*v0266*) to check whether missing values and numerical values are correctly coded.
+ Produce a scatter plot of attitude towards abortion by *age*. Add a regression line.
+ Regress attitude towards abortion on age.
+ Recode age into a variable *young*, which is 1 if *age* is less than 30 and 0 otherwise.
+ Regress attitude towards abortion on the young dummy variable. (Note comments in Lab 6.)
+ Perform a t-test of whether the mean attitude towards abortion differs between young and old respondents. Compare the result to the regression result.
For the following statements, formulate the null hypothesis and the alternative hypothesis, perform the appropriate t-test, and formulate the conclusion from the test:
+ Voters are more likely to vote Labour (*v0190*) than they are to vote Sinn Fein (*v0192*).
+ Union members are more likely to vote Labour (*v0190*) than are non-union members (*v0936*).
+ Voters in Ireland tend to the political right (i.e. *v0239* is greater than 5 on average).
+ Fianna Fail voters are more right-wing (*v0239*) than Fine Gael voters (*v0195*), on average.
+ Voters like Enda Kenny (*v0522*) more than they like Pat Rabbitte (*v0524*).^[R users might have to check whether it is properly coded as a numerical variable.]
+ Voters who agree that there should be very strict limits on immigration (from "slightly agree" to "strongly agree") (*v0247*) are more likely to vote for Sinn Fein (*v0192*) than the others.
To see whether individuals with higher political efficacy are more likely to participate in elections:
+ Recode *v0291b* into a new variable *efficacy*, whereby you combine "strongly disagree" and "disagree" into "disagree", and "strongly agree" and "agree" into "agree" (while "neither agree nor disagree" just remains the same). Use the codebook and the str() command to know the meaning of the variables and categories.
+ Produce a cross-table of *turnout* by *efficacy*, with the appropriate percentages.
+ Perform a $\chi^2$-test to test for independence between the two variables.
+ The null hypothesis of a $\chi^2$-test is that there is no dependence between the two variables. A high $\chi^2$ value means the two variables are dependent on each other. Try to interpret the $p$-value with this in mind. What does the $p$-value represent?
+ Based on the table and the test, what do you substantively conclude about political efficacy and turnout?
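As a reminder of how percentages in a cross-table can be produced, here is a minimal sketch using `prop.table()` on made-up counts (the numbers are hypothetical). It assumes the explanatory variable is in the columns, so that percentages are taken within each column; which margin is appropriate for your own table is for you to decide.

```{r}
# Hypothetical counts; rows are the outcome, columns the explanatory variable
counts <- matrix(c(10, 40, 15, 85, 20, 180), nrow = 2,
                 dimnames = list(turnout = c("Did not vote", "Voted"),
                                 efficacy = c("Disagree", "Neither", "Agree")))

# margin = 2 computes proportions within each column, so columns sum to 100%
colPct <- round(100 * prop.table(counts, margin = 2), 1)
print(colPct)
```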