library(rio)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(stargazer)
##
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(pander)
library(ggplot2)

# 1 Introduction

In this class we will look at statistical testing: performing t-tests to compare means, both on their own and in a regression context.

We will continue to make use of data from the Irish National Election Study. We only look at the most recent election in the data set, which is already a bit dated: 2007. The data file is a ZIP archive containing a Stata file, which can be opened in R using a temporary file as follows:

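tempFile <- tempfile(fileext = ".zip")
# The step that downloads the INES ZIP archive into tempFile is missing from this listing;
# it would be a download.file() call with the archive URL as its first argument.
ines <- import(tempFile, haven = FALSE) %>%
  filter(ines == 2007)
unlink(tempFile)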

This uses the long file in Stata format on the INES archive website. If the above is too slow when using “Knit”, you can also go to the INES website, download the file, unzip by clicking on the file, and then use a more typical method to open the file:

ines <- import("INESLong_Beta.dta", haven = FALSE) %>%
  filter(ines == 2007)

Check out the codebook for a description of the relevant variables.

# 2 t-tests in R

There are three types of t-tests for the mean:

• Testing the mean against some reference value (one-sample t-test).
• Testing the means of two variables on the same units (paired-sample t-test).
• Testing the means of the same variable on two different groups (two-sample t-test).

This used to be core material of the course, but now the course focuses on t- and F-tests in regression only. It is still useful to have a basic understanding of tests for comparing means, on which the t-test in regression is based.

First, we will prepare some variables to have proper variable names, which makes interpreting the output easier:

ines <- ines %>%
  mutate(union = recode(v0936, "Yes", "No"),
         labour = v0190,
         progDems = v0191,
         turnout = recode(v0072, "Voted", "Did not vote", "Did not vote", "Did not vote"))

pander(table(ines$v0936, ines$union))
No Yes
0 431
718 0
pander(table(ines$v0190, ines$labour))
1 2 3 4 5 6 7 8 9 10
243 0 0 0 0 0 0 0 0 0
0 100 0 0 0 0 0 0 0 0
0 0 125 0 0 0 0 0 0 0
0 0 0 111 0 0 0 0 0 0
0 0 0 0 213 0 0 0 0 0
0 0 0 0 0 143 0 0 0 0
0 0 0 0 0 0 141 0 0 0
0 0 0 0 0 0 0 105 0 0
0 0 0 0 0 0 0 0 71 0
0 0 0 0 0 0 0 0 0 130
pander(table(ines$v0191, ines$progDems))
1 2 3 4 5 6 7 8 9 10
441 0 0 0 0 0 0 0 0 0
0 165 0 0 0 0 0 0 0 0
0 0 153 0 0 0 0 0 0 0
0 0 0 89 0 0 0 0 0 0
0 0 0 0 176 0 0 0 0 0
0 0 0 0 0 93 0 0 0 0
0 0 0 0 0 0 87 0 0 0
0 0 0 0 0 0 0 67 0 0
0 0 0 0 0 0 0 0 50 0
0 0 0 0 0 0 0 0 0 59
pander(table(ines$v0072, ines$turnout))
Did not vote Voted
0 1262
112 0
11 0
41 0
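
The recodes above pass unnamed replacements, which are matched to the numeric codes by position: code 1 receives the first label, code 2 the second, and so on. The sketch below mimics that positional matching in base R (plain vector indexing, not dplyr's `recode()`) and cross-tabulates the result against the original codes, as the tables above do. The codes in `v` are made up for illustration; the real coding is described in the INES codebook.

```r
# Hypothetical numeric codes, NOT taken from the INES data.
v <- c(1, 2, 2, 1, NA, 2, 3, 4)

# One label per code, in code order (position 1 = code 1, position 2 = code 2, ...).
labels <- c("Voted", "Did not vote", "Did not vote", "Did not vote")

# Positional lookup: code k is replaced by labels[k]; NA stays NA.
recoded <- labels[v]

# Cross-check: every original code should fall into exactly one new category.
check <- table(original = v, recoded = recoded, useNA = "ifany")
check
```

Each row of the check table should have a single non-zero cell; a row spread over several columns would indicate a recode that did not work as intended.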

## 2.1 One-sample t-test

In a one-sample t-test, you compare the mean of one variable against a fixed value. For example, to test whether the mean of the variable labour differs from 5:

pander(t.test(ines$labour, mu = 5))
One Sample t-test: ines$labour
Test statistic df P value Alternative hypothesis mean of x
0.3833 1381 0.7016 two.sided 5.03
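
To see where these numbers come from, the test statistic can be computed by hand: it is the difference between the sample mean and the reference value, divided by the standard error of the mean, with n - 1 degrees of freedom (hence df = 1381 for the 1382 non-missing responses above). The following is a minimal sketch on simulated data, not the INES data; the names `x`, `mu0`, and `dof` are chosen for illustration only.

```r
# Simulated scores on a 1-10 scale; NOT the INES data.
set.seed(42)
x <- sample(1:10, size = 200, replace = TRUE)
mu0 <- 5  # reference value under the null hypothesis

n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean
t_manual <- (mean(x) - mu0) / se   # t statistic
dof <- n - 1                       # degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = dof, lower.tail = FALSE)  # two-sided p-value

# Compare with the built-in test.
res <- t.test(x, mu = mu0)
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = p_manual, t.test = res$p.value)
```

Both lines of output should show identical values in the two columns.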

## 2.2 Paired-sample t-test

A paired-sample t-test compares the means of two variables measured on the same individuals (e.g. a test score before and after a class). For example, to test whether the means of two variables, labour and progDems, differ for the same individuals:

pander(t.test(ines$labour, ines$progDems, paired = TRUE))
Paired t-test: ines$labour and ines$progDems (continued below)
Test statistic df P value Alternative hypothesis
13.14 1371 3.221e-37 * * * two.sided
mean of the differences
1.253
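
A paired-sample t-test is equivalent to a one-sample t-test on the within-individual differences, tested against zero. Only individuals with both values observed enter the test, which is why the degrees of freedom above (1371) are lower than in the one-sample test. The sketch below illustrates this on simulated data; `before` and `after` are hypothetical variables, not part of the INES data.

```r
# Simulated before/after scores for the same individuals; NOT the INES data.
set.seed(1)
n <- 150
before <- rnorm(n, mean = 5, sd = 2)
after <- before + rnorm(n, mean = 0.5, sd = 1)
after[c(3, 10)] <- NA  # two individuals with a missing second measurement

# Paired test on the two variables.
paired <- t.test(after, before, paired = TRUE)

# One-sample test on the differences against zero.
diffs <- t.test(after - before, mu = 0)

c(paired = unname(paired$statistic), differences = unname(diffs$statistic))
c(paired = unname(paired$parameter), differences = unname(diffs$parameter))
```

The two tests should report the same statistic and the same degrees of freedom (the number of complete pairs minus one).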

## 2.3 Two-sample t-test

The two-sample or independent-samples t-test compares the mean of the same variable across two different groups. For example, does support for the Labour Party depend on whether a respondent is a member of a trade union?

pander(t.test(ines$labour ~ ines$union))
Welch Two Sample t-test: ines$labour by ines$union (continued below)
Test statistic df P value Alternative hypothesis
-3.21 865.5 0.001377 * * two.sided
mean in group No mean in group Yes
4.886 5.449
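
Note that `t.test()` runs the Welch version by default, which does not assume equal variances in the two groups and therefore reports fractional degrees of freedom. The classic equal-variance version (`var.equal = TRUE`) is the one that matches the t-test on a dummy variable in a linear regression. The sketch below shows this on simulated data; the data frame `dat` and its columns are hypothetical.

```r
# Simulated scores for two groups; NOT the INES data.
set.seed(7)
dat <- data.frame(
  group = rep(c("No", "Yes"), times = c(120, 80)),
  score = c(rnorm(120, mean = 4.9, sd = 2.8), rnorm(80, mean = 5.4, sd = 2.5))
)

welch  <- t.test(score ~ group, data = dat)                    # default: unequal variances
pooled <- t.test(score ~ group, data = dat, var.equal = TRUE)  # equal variances assumed
fit    <- summary(lm(score ~ group, data = dat))               # regression on the group dummy

t_lm <- fit$coefficients["groupYes", "t value"]
p_lm <- fit$coefficients["groupYes", "Pr(>|t|)"]

# t.test() reports "No" minus "Yes"; the regression coefficient is "Yes" minus "No",
# so the two t statistics have opposite signs but the same size and p-value.
c(pooled = unname(pooled$statistic), regression = t_lm)
c(pooled = pooled$p.value, regression = p_lm)
c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter))
```

The Welch result will usually be close to, but not identical with, the regression result.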

# 3 Chi-squared test

When producing a cross-table of two categorical variables, we can use the $$\chi^2$$-test to test whether the two variables are independent of each other or not. For example, to see if union members are more likely to participate in elections:

pander(chisq.test(table(ines$union, ines$turnout)))
Pearson’s Chi-squared test with Yates’ continuity correction: table(ines$union, ines$turnout)
Test statistic df P value
0.8159 1 0.3664
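
The test compares the observed cell counts with the counts expected if the two variables were independent (row total times column total, divided by the grand total). The sketch below computes the statistic and its p-value by hand for a made-up 2x2 table and compares them with `chisq.test()`. For 2x2 tables R applies Yates' continuity correction by default, as in the output above, so the manual calculation matches only when `correct = FALSE` is set. The counts in `tab` are hypothetical.

```r
# Hypothetical 2x2 table of counts; NOT the INES data.
tab <- matrix(c(40, 390,
                75, 640),
              nrow = 2, byrow = TRUE,
              dimnames = list(union = c("Yes", "No"),
                              turnout = c("Did not vote", "Voted")))

# Expected counts under independence.
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-squared statistic, degrees of freedom, and p-value.
chi2 <- sum((tab - expected)^2 / expected)
dof <- (nrow(tab) - 1) * (ncol(tab) - 1)
p <- pchisq(chi2, df = dof, lower.tail = FALSE)

uncorrected <- chisq.test(tab, correct = FALSE)  # matches the manual calculation
corrected <- chisq.test(tab)                     # Yates' correction (default for 2x2)

c(manual = chi2, uncorrected = unname(uncorrected$statistic),
  corrected = unname(corrected$statistic))
c(manual = p, uncorrected = uncorrected$p.value, corrected = corrected$p.value)
```

The p-value is the probability of a statistic at least this large if the two variables were in fact independent; the continuity correction makes the statistic somewhat smaller and the p-value somewhat larger.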

# 4 Exercises

Lab 5 has example code on recoding and testing whether the recode worked. Use that here as well for questions where necessary. Often it might be helpful to generate properly named and labelled variables first, then run the regression or analysis.

Create a new RMarkdown file for this lab and fill out the details in the header. Use it for the remainder of the questions.

First we look at attitudes towards abortion among younger voters.

• Construct a new variable age which is calculated on the basis of v0906. Remember that the year of the survey is 2007.
• Produce a frequency table of attitude towards abortion (v0266) to check whether missing values and numerical values are correctly coded.
• Produce a scatter plot of attitude towards abortion by age. Add a regression line.
• Regress attitude towards abortion on age.
• Recode age into a variable young, which is 1 if age is less than 30 and 0 otherwise.
• Regress attitude towards abortion on the young dummy variable. (Note comments in Lab 6.)
• Perform a t-test of whether the mean attitude towards abortion differs between young and old respondents. Compare the result to the regression result.

For the following statements, formulate the null hypothesis and the alternative hypothesis, perform the appropriate t-test, and formulate the conclusion from the test:

• Voters are more likely to vote Labour (v0190) than they are to vote Sinn Fein (v0192).
• Union members are more likely to vote Labour (v0190) than are non-union members (v0936).
• Voters in Ireland tend to the political right (i.e. v0239 is greater than 5 on average).
• Fianna Fail voters are more right-wing (v0239) than Fine Gael voters (v0195), on average.
• Voters like Enda Kenny (v0522) more than they like Pat Rabbitte (v0524).
• Voters who agree that there should be very strict limits on immigration (from “slightly agree” to “strongly agree”) (v0247) are more likely to vote for Sinn Fein (v0192) than the others.

To see whether individuals with higher political efficacy are more likely to participate in elections:

• Recode v0291b into a new variable efficacy, whereby you combine “strongly disagree” and “disagree” into “disagree”, and “strongly agree” and “agree” into “agree” (while “neither agree nor disagree” just remains the same). Use the codebook and the str() command to know the meaning of the variables and categories.
• Produce a cross-table of turnout by efficacy, with the appropriate percentages.
• Perform a $$\chi^2$$-test to test for independence between the two variables.
• The null hypothesis of a $$\chi^2$$-test is that the two variables are independent. A high $$\chi^2$$ value is evidence against this, i.e. it suggests that the two variables are dependent on each other. Try to interpret the $$p$$-value with this in mind. What does the $$p$$-value represent?
• Based on the table and the test, what do you substantively conclude about political efficacy and turnout?