# 1 RMarkdown

RMarkdown is a tool for writing code and text in the same document, which makes it easy to track which commands produced which output. This is an excellent way to manage a research project. It is built into RStudio, but the first time you use it, RStudio will install a set of libraries (packages) that RMarkdown needs.

Browse down this file and look at the use of hash symbols for titles, and the use of R code chunks. Look above this window and check out “Knit HTML” which compiles this file to generate an HTML file and “Run” to see how you can run separate chunks of the code. You will need to run the individual chunks to test the code in the “Console” tab and to be able to access the data objects in the RStudio environment. We will work through this together.

We will make use of RMarkdown for all labs, so that I can pre-populate files which you can then expand upon. This will then gradually also build up a set of example code you can use in future. Some decide to use RMarkdown also to produce the homework submissions, but you can also easily just prepare a separate Word document for that (and save as PDF).

RMarkdown can also be used to produce nice-looking PDF or Word files. The latter is helpful if you want to copy output directly into your Word documents. PDF requires an installation of LaTeX, though, which is a bit more involved. The HTML output is the simplest to generate and will be used in class.

The code chunks (see below) can have options set as well, see Chunk options and package options.

If I want you to insert code yourself, I add a comment (i.e. a sentence starting with a hash symbol) that you can replace, like this:

### REPLACE THIS LINE ###

If I want you to try something for yourself, I will use this format:

Try creating a new file by using “File -> New File -> R Markdown…” and try to compile this to HTML using “Knit HTML”.

Before continuing to read further, try the above.

# 2 Packages in R

Throughout the course, we will make use of many libraries that are not a standard part of R or RStudio, but that can easily be added. You can use the menu “Tools -> Install Packages” to manage installed libraries, or you can use the command line, for example to install the “rio” package:

install.packages("rio", repos = "https://ftp.heanet.ie/mirrors/cran.r-project.org/")

Note also how the command is given in R:

• first the name of the function;
• then all parameters between parentheses;
• some parameters can be given without naming them, because there is a default order; e.g. here the first argument is always the name of the library;
• some parameters can be given by specifying the parameter name, like here the “repos” parameter.

The “repos” parameter passes on the URL to the site where all libraries for R are available. This is often not necessary with the install.packages() function, as there is a default location, but if that does not work, you can try this explicit option.
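
The calling conventions listed above can be tried out with any function that ships with R. The sketch below uses seq() purely as an illustration (it is not part of the lab material); its first three parameters are named from, to and by, and the object names a, b and c_mixed are made up for this example.

```r
# Positional arguments: matched by their default order (from, to, by)
a <- seq(1, 10, 3)

# Named arguments: once every argument is named, the order no longer matters
b <- seq(by = 3, to = 10, from = 1)

# Mixed: the first two by position, the third by name
c_mixed <- seq(1, 10, by = 3)

print(a)
print(identical(a, b) && identical(a, c_mixed))
```

All three calls describe the same sequence, which is why install.packages() accepts the package name without a label but needs “repos” spelled out.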

After installing a package, you need to open it:

library(rio)
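
One possible pattern, offered here as a suggestion rather than a course requirement, is to install a package only when it is missing and then open it. The sketch uses the “stats” package, which ships with R, so nothing is downloaded; the variable name pkg is made up for this illustration and can be set to "rio" or any other package name.

```r
# Name of the package to load; "stats" is used so the sketch runs offline
pkg <- "stats"

# requireNamespace() returns FALSE when the package is not installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as text
library(pkg, character.only = TRUE)
```

This keeps a script from re-downloading a package every time it is run.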

A set of relevant packages for data science can be found at Great R packages for data import, wrangling & visualization.

Try installing and opening the “stringr” package, which is useful for manipulating text in R.

### REPLACE THIS LINE ### to install and open (two separate steps!) the stringr package

Try installing and opening the “tidyverse” package, which is the main package used in the Hadley Wickham book. (Best at home, not in class.)

### REPLACE THIS LINE ### to install and open (two separate steps!) the tidyverse package

# 3 Opening data

Data can be in many different formats. R has its own formats, which are of course the fastest to read and write, but most data sets are prepared in other ways:

• data prepared for competitor statistical packages such as Stata and SPSS;
• data published in tables on the web, such as in Wikipedia;
• data published in raw text tabular format, especially for example large surveys;
• data published in Excel or other spreadsheets;
• data stored in relational (SQL) or non-relational (NoSQL) databases, such as Redis;
• or as we will see in the last part of the course, just plain text files.

The “readr” package in R provides tools for reading flat text files; the “rvest” package for reading tables on the web; the “readxl” package for opening Excel files; and the “haven” package provides tools for opening Stata and SPSS files. The “foreign” package, which is a standard part of R, also opens Stata and SPSS files, but the “haven” package is written by our textbook author, Hadley Wickham, which means it reads the data directly into the formats we use.

The “rio” package also opens a wide range of file formats. Its great advantage is that it does so with a single, consistent syntax.

For example opening a data file from the European Election Survey:

ees <- import("~/Dropbox/academic/data/EES/EES 2014/ZA5160_v4-0-0.dta")

Note the use of “eval = FALSE” so that the code is printed in the HTML file, but not executed. I use this in this case because you do not have the same folders I have, and so the above code would not work on your computer.

Note that since we use a command language to do every step, including opening files, you will not get a nice dialogue window in which you can find the file you are looking for. Instead, you will need to include the full file name and path to the file in the command itself. One tool you might want to use is the file.choose() command, which opens that dialogue screen (it might be hidden behind your RStudio window!) and which, after you select the file, outputs the full name and path. You can then copy/paste this into your code.
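
To see what a full file name and path looks like in practice, here is a minimal sketch that builds one and checks it before reading the file. The file name “example.csv” and its contents are invented for this illustration, and the file is written to a temporary folder so the code does not depend on anyone's own directories.

```r
# Create a small example file in a temporary folder
folder <- tempdir()
path <- file.path(folder, "example.csv")  # joins folder and file name
write.csv(data.frame(id = 1:3, score = c(80, 72, 34)), path, row.names = FALSE)

# Check that the path points at an existing file before trying to open it
print(file.exists(path))

# Read the file using the full path
example <- read.csv(path)
print(example)

# In an interactive session you could pick the file in a dialogue instead:
# path <- file.choose()
```

Checking file.exists() first gives a clear TRUE or FALSE, which is often easier to interpret than the error message from a failed import.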

We can download the survey data for the Brexit referendum directly from my web server using the “rio” package:

brexit <- import("http://www.joselkink.net/files/data/brexit_subset.Rdata")

To speed up the process, we will be working with a subset of 1,000 cases from the original survey of 64,689 cases, saved in the R format on my website. At home you can go to the British Election Study website, download the original data, and alter the code above to open the original data set, if you so wish.

The data will have the name “brexit” in R (note the arrow in the code to assign the loaded file to a particular object name). Check out the “Environment” tab on the top-right to see the list of objects. You will see that the “brexit” object is a data object with 1000 observations and 21 variables. To get a list of variable names:

names(brexit)
##  [1] "region"            "age"               "income"
##  [4] "gender"            "higherEducation"   "attention"
##  [7] "party"             "proIntegration"    "impactSelf"
## [10] "impactUK"          "certaintyLeave"    "certaintyRemain"
## [13] "vote"              "turnout"           "likeCameron"
## [16] "likeCorbyn"        "likeFarage"        "immigrationEcon"
## [19] "immigrationCult"   "econRetroPersonal" "econRetroGeneral"
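
Besides names(), a few other base R functions help to check what was loaded. The sketch below runs them on a small made-up data frame called toy (its variables and values are invented), so it works without downloading anything; the same calls can be applied to “brexit”.

```r
# A tiny stand-in data set: 4 observations, 3 variables
toy <- data.frame(
  age = c(34, 52, 67, 45),
  party = c("Labour", "Conservative", "UKIP", "Labour"),
  turnout = c(1, 1, 0, 1)
)

print(dim(toy))      # number of rows (observations), then columns (variables)
print(nrow(toy))     # observations only
print(ncol(toy))     # variables only
str(toy)             # type and first values of every variable
print(head(toy, 2))  # the first two rows
```

Note that dim() reports observations first and variables second, which matches the order shown in the “Environment” tab.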

You will need to download the questionnaire from the British Election Study website to be able to understand what these variables represent.

Note that we use some variables from Wave 8, some from Wave 9 and some from Wave 10 for our analysis.

The full mapping of variables in our cleaned data set is as follows:

variable            BES variable
age                 Age
income              profile_gross_household
higherEducation     anyUniW9
party               generalElectionVoteW9
proIntegration      EUIntegrationSelfW8
impactUK            leaveImpactBritainW8
impactSelf          leaveImpactSelfW8
certaintyLeave      certaintyUKLeaveW8
certaintyRemain     certaintyUKRemainW8
vote                euRefVoteW9
turnout             euRefTurnoutRetroW9
attention           polAttention
region              gor
likeCameron         likeCameronW8
likeCorbyn          likeCorbynW8
likeFarage          likeFarageW8
econRetroPersonal   econPersonalRetroW8
econRetroGeneral    econGenRetroW8
immigrationEcon     immigEconW8
immigrationCult     immigCulturalW8

Note that you can use the dollar sign to access individual variables in a data set.

table(brexit$party)
##
##     Conservative           Labour Liberal Democrat            Other
##              310              266              105              167
##             UKIP   Would not vote
##              134               18

Or if you want to get percentages, you can use:

prop.table(table(brexit$party)) * 100
##
##     Conservative           Labour Liberal Democrat            Other
##             31.0             26.6             10.5             16.7
##             UKIP   Would not vote
##             13.4              1.8
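
A few variations on table() are often handy. The sketch below uses a made-up data frame called toy with invented values (not the survey data) to show how to access a variable with the dollar sign, count missing values, round percentages, and cross-tabulate two variables.

```r
# Made-up example data, including one missing party value
toy <- data.frame(
  party = c("Labour", "Conservative", "Labour", "UKIP", NA, "Labour"),
  vote = c("Remain", "Leave", "Remain", "Leave", "Leave", "Leave")
)

# Counts, with missing values shown as their own category
counts <- table(toy$party, useNA = "ifany")
print(counts)

# Percentages of the non-missing cases, rounded to one decimal
pct <- round(prop.table(table(toy$party)) * 100, 1)
print(pct)

# Cross-tabulation: party in the rows, vote in the columns
cross <- table(toy$party, toy$vote)
print(cross)
```

By default table() silently drops missing values, so comparing the total with the number of rows is a quick check on how much data is missing.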

Try using the table() command to look at some other variables in the data set.

### REPLACE THIS LINE ### by adding a few table() commands to look at variables.

A quick way to get full summary statistics is this:

library(stargazer)
##
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
stargazer(brexit, type = "html")
Statistic           N       Mean     St. Dev.   Min   Max
age                 1,000   52.279   14.984     15    83
income              1,000   6.970    3.476      1     15
attention           1,000   7.715    1.810      0     10
proIntegration      1,000   3.534    3.248      0     10
impactSelf          1,000   2.950    1.133      1     5
impactUK            1,000   3.691    1.049      1     5
certaintyLeave      1,000   2.714    0.717      1     4
certaintyRemain     1,000   2.963    0.694      1     4
turnout             1,000   0.969    0.173      0     1
likeCameron         1,000   3.491    3.101      0     10
likeCorbyn          1,000   3.987    3.344      0     10
likeFarage          1,000   3.404    3.459      0     10
immigrationEcon     1,000   4.055    1.932      1     7
immigrationCult     1,000   3.800    2.065      1     7
econRetroPersonal   1,000   2.869    0.897      1     5
econRetroGeneral    1,000   2.684    0.886      1     5

Note that if you want to look at it in the console window, instead of the compiled Markdown document, you might prefer:

stargazer(brexit, type = "text")
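
If the stargazer package is not available, base R can produce similar figures. The following is only a sketch of one alternative, run on a made-up data frame called toy with invented values; the object name stats is also chosen just for this example.

```r
# Made-up numeric data
toy <- data.frame(age = c(20, 30, 40, 50), income = c(1, 5, 9, 13))

# Built-in overview: minimum, quartiles, mean and maximum per variable
print(summary(toy))

# The same statistics that stargazer reports, computed per variable
stats <- sapply(toy, function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x), min = min(x), max = max(x))
})

# t() flips the result so that each variable gets its own row
print(t(stats))
```

Both calls assume every column is numeric; text variables would need to be left out first.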

# 4 Importing Excel files

One of the easiest tools to enter your own data is of course Microsoft Excel. Excel files can be opened directly in R; this works most easily when the first row of your sheet contains the variable names and all rows below it, without blank rows in between, contain the data for each observation. For example:

name       grade    age
Jacob       80       18
Erin        72       19
Brenda      34       17
Sam         68       18
Michelle    84       18
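
To illustrate why this layout works well, the sketch below feeds the same table to R as comma-separated text, so no file is needed. The object names sheet and grades are made up for this example; read.csv() treats the first row as the variable names.

```r
# The example table written as comma-separated text
sheet <- "name,grade,age
Jacob,80,18
Erin,72,19
Brenda,34,17
Sam,68,18
Michelle,84,18"

# header = TRUE: use the first row as variable names
grades <- read.csv(text = sheet, header = TRUE)

str(grades)                 # name is text, grade and age are numbers
print(mean(grades$grade))   # numeric columns can be used straight away
```

A blank row or a stray note in the middle of the sheet would typically break this clean mapping from rows to observations.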


An example of a file already laid out in exactly this style is the one containing the Brexit referendum results published by the Electoral Commission. A CSV (comma-separated values) file is a typical format for exchanging data between spreadsheets and statistical packages, and their election data can be downloaded here.

Download the data from the Electoral Commission website and open it in Excel to have a look.

The file can be opened in R using this code:

results <- read.csv("https://www.electoralcommission.org.uk/__data/assets/file/0014/212135/EU-referendum-result-data.csv")

Note that the “import” command could also have been used, but unfortunately creates some technical problems we will return to later in the course.

You can view the data by using the View() command, but use this in the console, not inside a Markdown script (hence we use “eval=FALSE” again).

View(results)
View(brexit)

Using tables and stargazer, explore this data.

### REPLACE THIS LINE ### by adding a few table() commands to look at variables.

Using a bit more advanced R code, we can also aggregate data by group. For example, to know the average level of support for Brexit by region in the UK, we could do the following:

aggregate(Pct_Leave ~ Region, results, mean)
##                      Region Pct_Leave
## 1                      East  56.96213
## 2             East Midlands  59.57450
## 3                    London  39.09152
## 4                North East  59.47917
## 5                North West  55.91513
## 6          Northern Ireland  44.22000
## 7                  Scotland  39.13687
## 8                South East  52.17000
## 9                South West  52.37921
## 10                    Wales  53.34773
## 11            West Midlands  60.31467
## 12 Yorkshire and The Humber  58.65000
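
The formula inside aggregate() reads as “summarise the variable on the left of the tilde, separately for each group on the right”. Here is a small sketch on a made-up data frame called toy (regions and percentages are invented), which also sorts the result.

```r
# Made-up data: five areas in two regions
toy <- data.frame(
  Region = c("North", "North", "South", "South", "South"),
  Pct_Leave = c(60, 50, 40, 30, 35)
)

# Mean of Pct_Leave within each Region
by_region <- aggregate(Pct_Leave ~ Region, data = toy, FUN = mean)

# Sort from highest to lowest average
by_region <- by_region[order(by_region$Pct_Leave, decreasing = TRUE), ]
print(by_region)
```

Replacing mean with another function, such as median or max, changes the summary without changing the rest of the call.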

Try to calculate the average level of support for remain by region.

### REPLACE THIS LINE ### by adding a new aggregate() command.

# 5 Web scraping

This section is extra, significantly more advanced, and might not work in the lab. Only continue if you’re interested and the above all works fine.

Downloading data from a website can be a bit more cumbersome, since you often need to delve a bit more into the structure of the underlying HTML code of the website to identify the correct table. Here’s an example of downloading election results for the 2014 local elections in Ireland, by county, directly from Wikipedia, based on rvest: easy web scraping with R.

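library(rvest)

# Assumes elections_page already holds the parsed Wikipedia page,
# e.g. the result of calling read_html() on the page's URL.
tbl <- elections_page %>% html_nodes("table") %>% .[[7]] %>% html_table()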

In English, that last line says:

• take the page saved as “elections_page”,
• find all HTML “table” tags, i.e. all HTML tables on the page,
• take the 7th table (this was found by trial-and-error),
• transform the HTML table into an R data frame,
• save this as “tbl”.

Note the use of %>% to string together different functions - this is a rather modern innovation in R so many users will not be familiar with this, but it is a very powerful tool.
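
To make the pipe more concrete, here is a minimal sketch comparing nested function calls with the piped version. It loads the magrittr package, which provides %>% (rvest makes the same operator available), and all object names and values are made up for this illustration.

```r
library(magrittr)  # provides %>%; loaded directly to keep the sketch self-contained

x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The pipe passes each result on as the first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

print(identical(nested, piped))

# The dot stands for "whatever came through the pipe", as in .[[7]] above
tables <- list("first", "second", "third")
second <- tables %>% .[[2]]
print(second)
```

The piped version lists the steps in the order they are carried out, which is why the scraping line can be read almost like a sentence.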

There are some problems with this example, whereby some numerical columns are interpreted as text, typically due to the occasional text symbol used inside the table, but it gives you a good first impression of how to scrape tables from web pages.
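
One way to repair such columns, sketched here on invented values rather than on the actual scraped table, is to strip everything that is not a digit or a decimal point and then convert to numbers. The object names raw, cleaned and votes are made up for this example.

```r
# Made-up examples of numbers that arrive as text
raw <- c("1,234", "56%", "7.5*", "n/a")

# Remove every character that is not a digit or a decimal point
cleaned <- gsub("[^0-9.]", "", raw)

# Entries with nothing left (such as "n/a") become missing values
cleaned[cleaned == ""] <- NA

# Convert the remaining text to numbers
votes <- as.numeric(cleaned)
print(votes)
```

This simple pattern also removes minus signs, so it would need adjusting for columns that can hold negative values.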