---
title: "Introduction to Statistics - Lab 2"
author: "Johan A. Elkink"
date: "16 September 2019"
output:
  html_document:
    number_sections: yes
    toc: yes
---
# Introduction
We continue with the same data set we used in Lab 1, which we can download directly from the web server using the "rio" package:
```{r}
library(rio)
library(pander)
library(stargazer)
library(dplyr)
library(ggplot2)
brexit <- import("http://www.joselkink.net/files/data/brexit_subset.Rdata")
```
Note that the above commands also open a few other libraries we will use in this lab session.
We will use tables to get a first insight into the data and summary statistics to suggest patterns in the data, but we start with some further visualisations of single variables. Throughout, the focus is more on what individual variables look like, and less on relationships between variables, even though those relationships are the focus for most of this course.
# Graphs of distributions
We already discussed using graphs for single variables, namely the bar plot. For continuous instead of categorical variables, if we want to show a distribution, the main plots are the histogram, density plot, and the box plot. We will focus on attitude towards European integration as the primary example for the graphs.
## Histogram
The histogram is similar to a bar plot, but for continuous variables. For the histogram, we group the observations that fall within the same range of values, and then show the distribution of this grouped variable in a bar plot. For example, if age is measured in years, we could make a bar plot after grouping people into the categories 0 to 19, 20 to 39, 40 to 59, etc. Such a bar plot is called a histogram. It is crucial that the width of each group (in this case 20 years) is the same for all groups. In ggplot2, the histogram is relatively straightforward:
```{r}
ggplot(brexit, aes(x = proIntegration)) + geom_histogram()
```
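To make the grouping step concrete, the following is a minimal sketch of what a histogram does behind the scenes. The ages are simulated and the names `age_example` and `age_groups` are made up for this illustration; they are not part of the Brexit data.

```{r}
# Hypothetical ages, simulated purely for illustration
set.seed(1)
age_example <- sample(18:90, size = 200, replace = TRUE)

# Group the ages into equal-width intervals of 20 years: [0,20), [20,40), ...
age_groups <- cut(age_example, breaks = seq(0, 100, by = 20), right = FALSE)

# The number of observations per interval is the height of each histogram bar
table(age_groups)
```

Each count in this table corresponds to one bar; geom_histogram() performs the same grouping and counting for us.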
We can fix certain parameters to get a better looking graph, for example by fixing the number of groups.
```{r}
ggplot(brexit, aes(x = proIntegration)) + geom_histogram(bins = 10)
```
Finally, we can embellish this by setting some additional parameters.
```{r}
ggplot(brexit, aes(x = proIntegration)) +
  geom_histogram(bins = 10, color = "blue", fill = "blue") +
  labs(x = "Pro-integration attitude", y = "Frequency",
       title = "Histogram of pro-integration attitudes") +
  theme_minimal()
```
## Density plot
A density plot is similar to a histogram, but the pattern is shown as a smooth line, instead of the bars. This hides some of the detail of the variable, but can be easier to read.
```{r}
ggplot(brexit, aes(x = proIntegration)) + geom_density()
```
Generally, these density plots look a bit better with a color and alpha (transparency) value.
```{r}
ggplot(brexit, aes(x = proIntegration)) + geom_density(color = "blue", fill = "blue", alpha = .3)
```
Compare this plot with the histogram above. What are the similarities and the differences?
Experiment with the different parameters to get a feel for how it works.
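One parameter worth trying is `adjust`, which controls how strongly the line is smoothed. The sketch below uses simulated scores rather than the Brexit data; the names `example_data`, `wiggly` and `smooth` are made up for this illustration.

```{r}
library(ggplot2)

# Hypothetical attitude scores on a 0-10 scale, simulated for illustration only
set.seed(42)
example_data <- data.frame(score = sample(0:10, size = 500, replace = TRUE))

# adjust below 1 follows the data more closely; adjust above 1 smooths more
wiggly <- ggplot(example_data, aes(x = score)) + geom_density(adjust = 0.5)
smooth <- ggplot(example_data, aes(x = score)) + geom_density(adjust = 2)

wiggly
smooth
```

The same `adjust` argument can be added to the geom_density() calls on the Brexit data above.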
The density plot is particularly more helpful than the histogram, partly thanks to the transparency feature, when comparing different groups in one plot. So instead of a fixed color, we can use a color by group, e.g. for different voters in the referendum.
```{r}
ggplot(brexit, aes(x = proIntegration, color = vote, fill = vote)) + geom_density(alpha = .3)
```
Or with some more embellishments:
```{r}
ggplot(brexit, aes(x = proIntegration, color = vote, fill = vote)) +
  geom_density(alpha = .3) +
  labs(x = "Pro-integration attitude", y = "",
       title = "Density of pro-integration attitude by vote choice") +
  theme_minimal()
```
Spend some time looking at this plot and check whether you understand what you are looking at. Based on this graph, what role do you think turnout played in determining the overall outcome of the referendum?
## Box plots
Box plots are another way to show the distribution of a variable, and can easily be extended to show the relationship between two variables, one categorical and one continuous - much like with the density plot, where vote choice is categorical and pro-integration attitude is continuous.
The box plot shows the median, the middle 50% of observations, and the range. For example, for attitude towards European integration:
```{r}
ggplot(brexit, aes(y = proIntegration)) +
  geom_boxplot()
```
Here we see that the middle 50% of observations lie between the values 0 and 6; the median is 3, so 50% are below 3 and 50% are above 3; and 25% have a value between 6 and 10. To make sure I read the graph correctly, I actually used the following R command, but it is not important for this course:
```{r}
pander(quantile(brexit$proIntegration))
```
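For readers who are curious how these numbers relate to the box, here is a small sketch on a made-up vector (the values and the name `x` are purely illustrative and not taken from the Brexit data).

```{r}
# Hypothetical values, for illustration only
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Minimum, first quartile, median, third quartile, and maximum
quantile(x)

# The median is the line inside the box
median(x)

# The interquartile range is the height of the box: third minus first quartile
IQR(x)
```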
To show box plots for different groups, we use:
```{r}
ggplot(brexit, aes(y = proIntegration, x = vote)) +
  geom_boxplot()
```
And with some embellishments:
```{r}
ggplot(brexit, aes(y = proIntegration, x = vote)) +
  geom_boxplot(fill = "lightblue") +
  theme_minimal() +
  labs(x = "Vote choice", y = "Pro-integration attitude",
       title = "Pro-integration attitude by voter group")
```
Compare these plots to the density plots above. Can you see this provides similar information? Which is easier to interpret?
# Basic tables
Tables are often used in publications, including non-quantitative work, to give a quick overview of a variable. For example, how many voters are there for each party in an election? How many countries in the world are democratic? How many months was each British Prime Minister in power? Etcetera. These can typically be visualised in bar plots (see Lab 1), but this information can also be presented through tables.
Each of the above examples concerns a single variable and the type of table we use in this context is the frequency table. This table can either list the number of observations in each category, or the proportion or percentage in each category.
## Frequency tables
A frequency table is a table of just one (categorical) variable, where we can use either counts or proportions, to show how many cases are in each category. Note that if a variable is continuous, you can always recode it to a categorical variable, which we'll discuss in a later lab.
The variable "vote" in the Brexit data set contains, for each of the `r nrow(brexit)` respondents, how they voted - they either did not vote, or voted for the UK to leave the EU, or voted for the UK to remain in the EU.
```{r}
table(brexit$vote)
```
Or if you want to get proportions, you can use:
```{r}
prop.table(table(brexit$vote))
```
Or for percentages we simply multiply the proportions by 100:
```{r}
prop.table(table(brexit$vote)) * 100
```
In RMarkdown, tables look a bit better when using the pander() command.
```{r}
pander(prop.table(table(brexit$vote)) * 100)
```
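The steps from counts to proportions to percentages can be followed on a small made-up example. The vector `votes_example` and its category labels are hypothetical and need not match the labels used in the Brexit data.

```{r}
# Hypothetical vote choices, for illustration only
votes_example <- c("Leave", "Remain", "Remain", "No vote",
                   "Leave", "Remain", "Remain", "Leave")

# Counts per category
counts <- table(votes_example)
counts

# Proportions: each count divided by the total number of observations
proportions <- prop.table(counts)

# Percentages, rounded to one decimal for presentation
percentages <- round(proportions * 100, 1)
percentages
```

Rounding with round() is optional, but it often makes percentage tables easier to read.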
# Summary statistics
Summary, or descriptive, statistics are numerical or graphical descriptions of the data. This can be either related to a single variable or to the relationship between two variables. For the interpretation of each of these statistics, you will need to consult the textbook(s) referenced in the course syllabus and the lecture slides.
There is one technical detail to note here, which is unrelated to the statistical analysis itself. We use the "dplyr" library, which provides many commands for data manipulation and also makes the "%>%" operator available, which is a "pipe". To pipe means that the output of the command before the pipe is used as the input for the command after it. So in the first summarise() example in the next section, the "brexit" data set is used as input to the "summarise" command (so that variables such as "attention" and "age" can be found), and the output of the summarise() command is piped into "pander", which is the function that makes it more presentable in the Markdown output, just like with the tables. You can find more examples of this style of coding in the textbook chapter on [Data Transformation](https://r4ds.had.co.nz/transform.html).
## Central tendency
The most common measure of central tendency is the mean (or average), which is the sum of all values on a variable, divided by the number of observations. For example, to get the mean of the "attention", "age" and "proIntegration" variables, we can use:
```{r}
brexit %>%
  summarise(meanAttention = mean(attention),
            meanAge = mean(age),
            meanProIntegration = mean(proIntegration)) %>%
  pander()
```
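To see what the pipe does in isolation, here is a minimal sketch on a made-up vector (the values and the name `x` are purely illustrative). Both versions perform the same calculation; only the way the code reads is different.

```{r}
library(dplyr)

# Hypothetical values, for illustration only
x <- c(2, 4, 6, 8)

# Nested calls: read from the inside out
round(sqrt(mean(x)), 2)

# The same calculation with pipes: read from left to right
x %>% mean() %>% sqrt() %>% round(2)
```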
Note that we are working with a cleaned data set, which does not contain missing data. If there were missings, you would need to tell R explicitly to ignore those:
```{r}
brexit %>%
  summarise(meanAttention = mean(attention, na.rm = TRUE),
            meanAge = mean(age, na.rm = TRUE),
            meanProIntegration = mean(proIntegration, na.rm = TRUE)) %>%
  pander()
```
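Since the Brexit data has no missing values, the effect of `na.rm` is easier to see on a small made-up vector (the name `x_missing` and its values are hypothetical):

```{r}
# Hypothetical values with one missing observation, for illustration only
x_missing <- c(1, NA, 3)

# Without na.rm, a single missing value makes the result missing as well
mean(x_missing)

# With na.rm = TRUE, the missing value is dropped before calculating the mean
mean(x_missing, na.rm = TRUE)
```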
The median can be calculated with very similar summarise() code:
```{r}
brexit %>%
  summarise(meanProIntegration = mean(proIntegration, na.rm = TRUE),
            medianProIntegration = median(proIntegration, na.rm = TRUE)) %>%
  pander()
```
The mean and the median are slightly different, with the median being somewhat lower than the mean. What do you think this implies?
If you want to avoid the "dplyr" approach, a simpler version to get the mean of a variable is:
```{r}
mean(brexit$proIntegration, na.rm = TRUE)
```
Note that here we need to make the name of the data set explicit using the "$" notation, because R cannot find the "proIntegration" variable on its own. Apart from that, the command looks very similar.
One advantage of using the more complicated, "dplyr" approach, however, is that it will be easier to calculate statistics for different subgroups. For example, if we want to have the mean and median of pro-integration for different groups of voters, we can do this:
```{r}
brexit %>%
  group_by(vote) %>%
  summarise(meanProIntegration = mean(proIntegration, na.rm = TRUE),
            medianProIntegration = median(proIntegration, na.rm = TRUE)) %>%
  pander()
```
Have a look at the density and box plots above and see if you can find some of these values in the plots as well, approximately.
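For comparison, a grouped summary can also be produced without "dplyr". The sketch below uses base R's tapply() on a made-up mini data set; the data frame `example_data` and its columns `group` and `score` are hypothetical and only serve to illustrate the idea.

```{r}
# Hypothetical mini data set, for illustration only (not the brexit data)
example_data <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(1, 2, 6, 5, 7, 9)
)

# Mean and median of score within each group, using base R
group_means <- tapply(example_data$score, example_data$group, mean)
group_medians <- tapply(example_data$score, example_data$group, median)

group_means
group_medians
```

This works, but each statistic needs its own call, which is one reason the group_by() and summarise() combination is often more convenient.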
## Dispersion
The mean and the median tell us something about the centre of the distribution. The second question is typically how much the observations vary around this central point. For that we can use a number of different statistics, such as the standard deviation, the variance, and the interquartile range:
```{r}
brexit %>%
  summarise("standard deviation" = sd(proIntegration, na.rm = TRUE),
            variance = var(proIntegration, na.rm = TRUE),
            "interquartile range" = IQR(proIntegration, na.rm = TRUE)) %>%
  pander()
```
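To see how the variance and the standard deviation relate to each other, the following sketch calculates both by hand on a made-up vector (the values and the names `x`, `deviations`, `variance_by_hand` and `sd_by_hand` are purely illustrative) and compares the results with R's built-in functions.

```{r}
# Hypothetical values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Deviations of each observation from the mean
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1
variance_by_hand <- sum(deviations^2) / (length(x) - 1)

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(variance_by_hand)

# Both should agree with the built-in functions
all.equal(variance_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```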
# Exercises
1. Create a new RMarkdown file for this lab and fill out the details in the header. Use it for the remainder of the questions.
1. Produce a histogram and a box plot for the age variable.
1. Calculate the mean, median, and standard deviation for the age variable. How do these relate to the plots you just produced?
1. Calculate the same summary statistics for the age variable, for different parties. What do you conclude?
1. Create a series of box plots to see whether age distributions differ by party.
1. Use a frequency table to look at the higherEducation variable.
1. Use a density plot to investigate the relation between having higher education and pro-integration attitudes.
1. For all plots, add embellishments such as labels, colors, and titles.
1. If you have time left, investigate the distribution of other variables in the data set, using tables, summary statistics, and visualisations.