RMarkdown

RMarkdown is a tool for writing code and text in the same document, so that you can easily track which commands produced which output. This is an excellent way to manage a research project. It is built into RStudio, but the first time you use it, RStudio will install a set of libraries (packages) that RMarkdown needs.

Browse down this file and look at the use of hashtags for titles, and the use of R code chunks. Look above this window and check out “Knit HTML”, which compiles this file to generate an HTML file, and “Run”, which lets you run separate chunks of the code. You will need to run the individual chunks to test the code in the “Console” tab and to be able to access the data objects in the RStudio environment. We will work through this together.

We will make use of RMarkdown for all labs, so that I can pre-populate files which you can then expand upon. This will then gradually also build up a set of example code you can use in future.

RMarkdown can also be used to produce nice-looking PDF or Word files. The latter is helpful if you want to copy output directly into your Word documents. PDF output requires an installation of LaTeX, though, which is a bit more involved. The HTML output is the simplest to generate and will be used in class.

The code chunks (see below) can also have options set; see Chunk options and package options.
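
As a rough sketch of what setting options can look like, the lines below set defaults for all chunks in a document using the knitr package, which RStudio uses when you knit. The particular option values are illustrative assumptions, not settings required for this course, and the `requireNamespace()` guard is only there so the code does nothing if knitr is not installed.

```r
# Sketch of what could go in a first "setup" chunk.
# The option values below are illustrative assumptions.
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo    = TRUE,   # show the R code in the knitted document
    message = FALSE,  # hide package start-up messages
    warning = FALSE   # hide warnings
  )
}
```

Options set this way apply to every chunk that follows, and an individual chunk can still override them in its own header.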

Try creating a new file by using “File -> New File -> R Markdown…” and try to compile this to HTML using “Knit HTML”.

Packages in R

Throughout the course, we will be making use of a lot of libraries that are not a standard part of R or RStudio, but that can easily be added. You can use the menu “Tools -> Install Packages” to manage installed libraries, or you can use the command line, for example to install the “haven” package:

install.packages("haven", repos = "https://ftp.heanet.ie/mirrors/cran.r-project.org/")
## 
## The downloaded binary packages are in
##  /var/folders/y0/xp2qpq993gq5t4d77dy1_vkw0000gn/T//RtmpH0fJmQ/downloaded_packages

Note also how the command is given in R: the name of the package is the first argument, in quotation marks, and further arguments follow after a comma.

The “repos” parameter passes on the URL of the site from which R libraries can be downloaded. This is often not necessary with the install.packages() function, as there is a default location, but if that does not work, you can try this explicit option.

After installing a package, you need to open it:

library(haven)
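
Installing and opening are two separate steps, and a package only needs to be installed once. A minimal sketch of combining the two follows, assuming a helper called `ensure_package()`, which is a made-up name for illustration and not part of R. It installs a package only when it is missing and then opens it.

```r
# ensure_package() is a hypothetical helper, not part of R or of any package.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE when the package is not installed
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call opens it without downloading anything.
ensure_package("stats")

# For a package that has to be downloaded, the call would look like this:
# ensure_package("haven")
```

If the download step fails because no default location is set, the explicit “repos” option shown earlier can be added to the install.packages() call inside the helper.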

A set of relevant packages for data science can be found at Great R packages for data import, wrangling & visualization.

Try installing and opening the “stringr” package, which is useful for manipulating text in R.
At home, not in class: Try installing and opening the “tidyverse” package, which is the main package used in the Hadley Wickham book.

Opening data

Data can be in many different formats. R has its own formats, which are of course the fastest to read and write, but most data sets are prepared in other ways:

The “readr” package in R provides tools for reading flat text files; the “rvest” package for reading tables on the web; the “readxl” package for opening Excel files; and the “haven” package provides tools for opening Stata and SPSS files. The “foreign” package, which is a standard part of R, also opens Stata and SPSS files, but the “haven” package is written by our textbook author, Hadley Wickham, which means it immediately opens files in the formats we use. The “rio” package also opens a wide range of file formats.

Files can be read locally from your hard disk, or remotely, directly from the web. To open a Stata file you downloaded from the GESIS website, for example, you can use:

# Using the haven package:
# ees <- read_dta("~/Dropbox/academic/data/EES/EES 2014/ZA5160_v4-0-0.dta")

An alternative method is to use the “rio” package. This package has the great advantage that, while using a consistent syntax, it can open many different file formats.

# install.packages("rio")
# library(rio)
# ees <- import("~/Dropbox/academic/data/EES/EES 2014/ZA5160_v4-0-0.dta")

Note the use of hashtags to “comment out” R code, so that it will not run. I use this because you do not have the same folders I have, and so the above code would not work on your computer. Similarly, I use it in the next code chunk to show how I selected the cases, but this only works after you have opened the full, original data set.

Note that since we use a command language to do every step, including opening files, you will not get a nice dialogue window in which you can find the file you are looking for. Instead, you will need to include the full file name and the path to the file in the command itself. One tool you might want to use is the file.choose() command, which opens that dialogue window (it might be hidden behind your RStudio window!) and which, after you select the file, outputs the full name and path. This you can then copy/paste into your code.
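
A small sketch of working with file paths follows. The folder and file names are hypothetical placeholders, so `file.exists()` will most likely report `FALSE` on your computer, and the `file.choose()` call is wrapped in `interactive()` so that it only runs when you are sitting at the console and not when the document is knitted.

```r
# The folder and file names here are hypothetical placeholders.
data_path <- file.path("~", "data", "example.dta")

# path.expand() replaces the leading "~" with your home directory
full_path <- path.expand(data_path)

# file.exists() returns TRUE or FALSE, so you can check a path
# before trying to open the file
found <- file.exists(full_path)

# file.choose() needs someone to click on a file, so only call it
# in an interactive session
if (interactive()) {
  chosen <- file.choose()
  print(chosen)
}
```

Checking a path with `file.exists()` first tends to give a clearer answer than the error message you get from trying to open a file that is not there.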

Go to my teaching data page and download the AsiaBarometer sample data. Use the “rio” and the “haven” packages to open the data.

To speed up the process, we will be working with a 1,000-case subset of the original survey of 30,000 cases, saved in the R format on my website. At home you can log in to the GESIS website, download the original data, and alter the code above using read_dta() to open the original data set, if you so wish.

# Draw 1,000 random rows; the comma after the row index keeps all columns
# ees <- ees[sample(nrow(ees), 1000, replace = FALSE), ]
# save(ees, file = "ees2014_subset.Rdata")

load(url("http://www.joselkink.net/files/data/ees2014_subset.Rdata"))
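
Drawing a random subset of cases can be tried out on a toy data frame without downloading anything. The sketch below uses made-up data and made-up column names; only the indexing pattern carries over to the real survey.

```r
# Toy data standing in for the full survey; the column names are made up.
set.seed(1)  # makes the random draw reproducible
toy <- data.frame(id = 1:100, score = rnorm(100))

# Draw 10 row numbers without replacement, then keep those rows
# and all columns. The comma after the row index matters:
# toy[rows, ] selects rows, whereas toy[rows] would try to select columns.
rows <- sample(nrow(toy), 10, replace = FALSE)
toy_subset <- toy[rows, ]

dim(toy_subset)  # number of rows, then number of columns
```

Because the draw is random, a different seed gives a different set of rows, but the subset always has 10 rows and the same columns as the original.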

The data will have the name “ees” in R. Check out the “Environment” tab on the top-right to see the list of objects. You will see that the “ees” object is a data object with 1,000 observations and 422 variables. To get a list of variable names:

names(ees)
##   [1] "za_nr"            "version"          "doi"             
##   [4] "vd"               "b"                "countrycode"     
##   [7] "respid"           "regioncc"         "p7_region_nuts1" 
##  [10] "p7r_region_nuts2" "p13_intlang"      "p6_sizeloc"      
##  [13] "p1"               "p2"               "p3"              
##  [16] "p3_r"             "p4"               "p5"              
##  [19] "q1_1"             "q1_2"             "q1_3"            
##  [22] "q1_4"             "q1_5"             "q1_6"            
##  [25] "q1_7"             "q1_8"             "q1_9"            
##  [28] "q1_10"            "q1_11"            "q1_12"           
##  [31] "q1_13"            "q1_14"            "q1_15"           
##  [34] "q1_16"            "q1_17"            "q1_18"           
##  [37] "q1_19"            "q1_20"            "q1_21"           
##  [40] "q1_22"            "q1_23"            "q1_24"           
##  [43] "q1_25"            "q1_26"            "q1_27"           
##  [46] "q1_28"            "q1_29"            "q1"              
##  [49] "qp1"              "qp2"              "qp2_ees"         
##  [52] "qp2_emcs"         "qp3a"             "qp4a_1"          
##  [55] "qp4a_2"           "qp4a_3"           "qp4a_4"          
##  [58] "qp4a_5"           "qp4a_6"           "qp4a_7"          
##  [61] "qp4a_8"           "qp4a_9"           "qp4a_10"         
##  [64] "qp4a_11"          "qp4a_12"          "qp4a_13"         
##  [67] "qp4a_14"          "qp4a_15"          "qp4a_16"         
##  [70] "qp4a_17"          "qp5a"             "qp5b_1"          
##  [73] "qp5b_2"           "qp5b_3"           "qp5b_4"          
##  [76] "qp5b_5"           "qp5b_6"           "qp5b_7"          
##  [79] "qp5b_8"           "qp5b_9"           "qp5b_10"         
##  [82] "qp5b_11"          "qp5b_12"          "qp5b_13"         
##  [85] "qp5b_14"          "qp5b_15"          "qp5b_16"         
##  [88] "qp5b_17"          "qp5b_18"          "qp5t_1"          
##  [91] "qp5t_2"           "qp5t_3"           "qp5t_4"          
##  [94] "qp5t_5"           "qp5t_6"           "qp5t_7"          
##  [97] "qp5t_8"           "qp5t_9"           "qp5t_10"         
## [100] "qp5t_11"          "qp5t_12"          "qp5t_13"         
## [103] "qp5t_14"          "qp5t_15"          "qp5t_16"         
## [106] "qp5t_17"          "qp5t_18"          "qp3b"            
## [109] "qp4b_1"           "qp4b_2"           "qp4b_3"          
## [112] "qp4b_4"           "qp4b_5"           "qp4b_6"          
## [115] "qp4b_7"           "qp4b_8"           "qp4b_9"          
## [118] "qp4b_10"          "qp4b_11"          "qp4b_12"         
## [121] "qp4b_13"          "qp4b_14"          "qp4b_15"         
## [124] "qp4b_16"          "qp4b_17"          "qp6_1"           
## [127] "qp6_2"            "qp6_3"            "qp6_4"           
## [130] "qp6_5"            "qp6_6"            "qp6_7"           
## [133] "qp6_8"            "qp6_9"            "qp7"             
## [136] "qp8"              "qp9_1"            "qp9_2"           
## [139] "qp9_3"            "qp10_1"           "qp10_2"          
## [142] "qp10_3"           "qp11_1"           "qp11_2"          
## [145] "qp11_3"           "qp11_4"           "qp11_5"          
## [148] "qp12"             "qpp1a"            "qpp1aO"          
## [151] "qpp1aO_EES"       "qpp1aO_EMCS"      "qpp2"            
## [154] "qpp3"             "qpp1b"            "qpp1bO"          
## [157] "qpp1bO_EES"       "qpp1bO_EMCS"      "qpp4"            
## [160] "qp1pp4"           "qpp5"             "qpp5_ees"        
## [163] "qpp6"             "qpp6_ees"         "qpp7_1"          
## [166] "qpp7_2"           "qpp7_3"           "qpp7_4"          
## [169] "qpp8_1"           "qpp8_2"           "qpp8_3"          
## [172] "qpp8_4"           "qpp8_5"           "qpp8_6"          
## [175] "qpp8_7"           "qpp8_8"           "qpp9_1"          
## [178] "qpp9_2"           "qpp9_3"           "qpp10"           
## [181] "qpp11_1"          "qpp11_2"          "qpp12"           
## [184] "qpp13"            "qpp14_1"          "qpp14_2"         
## [187] "qpp14_3"          "qpp14_4"          "qpp14_5"         
## [190] "qpp14_6"          "qpp14_7"          "qpp14_8"         
## [193] "qpp14_mean_1"     "qpp14_mean_2"     "qpp14_mean_3"    
## [196] "qpp14_mean_4"     "qpp14_mean_5"     "qpp14_mean_6"    
## [199] "qpp14_mean_7"     "qpp14_mean_8"     "qpp15"           
## [202] "qpp16"            "qpp17_1"          "qpp17_2"         
## [205] "qpp17_3"          "qpp17_4"          "qpp17_5"         
## [208] "qpp17_6"          "qpp17_7"          "qpp17_8"         
## [211] "qpp18"            "qpp19_1"          "qpp19_2"         
## [214] "qpp19_3"          "qpp19_4"          "qpp19_5"         
## [217] "qpp19_6"          "qpp19_7"          "qpp19_8"         
## [220] "qpp19_mean_1"     "qpp19_mean_2"     "qpp19_mean_3"    
## [223] "qpp19_mean_4"     "qpp19_mean_5"     "qpp19_mean_6"    
## [226] "qpp19_mean_7"     "qpp19_mean_8"     "qpp20_1"         
## [229] "qpp20_2"          "qpp21"            "qpp21_ees"       
## [232] "qpp22"            "qpp23_1"          "qpp23_2"         
## [235] "qpp23_3"          "qpp23_4"          "qpp24_1"         
## [238] "qpp24_2"          "qpp24_3"          "d7"              
## [241] "d7b"              "d7c"              "d8"              
## [244] "d10"              "vd11"             "d11r1"           
## [247] "d11r2"            "d15a"             "d15ar"           
## [250] "c14"              "d15b"             "d15br"           
## [253] "d25"              "d40a"             "d40b"            
## [256] "d40c"             "d40abc"           "d46_1"           
## [259] "d46_2"            "d46_3"            "d46_4"           
## [262] "d46_5"            "d46_6"            "d46_7"           
## [265] "d46_8"            "d46_9"            "d46_10"          
## [268] "d46_11"           "d46_12"           "d46_13"          
## [271] "d60"              "d61"              "d61r"            
## [274] "d62_1"            "d62_2"            "d62_3"           
## [277] "d63"              "d71_1"            "d71_2"           
## [280] "d71_3"            "d72_1"            "d72_2"           
## [283] "d73_1"            "d73_2"            "d74"             
## [286] "d75"              "d76"              "p6be"            
## [289] "p7be"             "p7rbe"            "p13be"           
## [292] "p6at"             "p7at"             "p7rat"           
## [295] "p6bg"             "p7bg"             "p6cy"            
## [298] "p7cy"             "p6cz"             "p7cz"            
## [301] "p6dk"             "p7dk"             "p6ee"            
## [304] "p7ee"             "p13ee"            "p6de"            
## [307] "p7de"             "p6el"             "p7el"            
## [310] "p7rel"            "p6es"             "p7es"            
## [313] "p7res"            "p13es"            "p6fi"            
## [316] "p7fi"             "p13fi"            "p6fr"            
## [319] "p7fr"             "p7rfr"            "p6uk"            
## [322] "p7uk"             "p6hu"             "p7hu"            
## [325] "p6ie"             "p7ie"             "p6it"            
## [328] "p7it"             "p7rit"            "p6lt"            
## [331] "p7lt"             "p6lu"             "p7lu"            
## [334] "p13lu"            "p6lv"             "p7lv"            
## [337] "p13lv"            "p6mt"             "p13mt"           
## [340] "p6nl"             "p7nl"             "p7rnl"           
## [343] "p6pl"             "p7pl"             "p7rpl"           
## [346] "p6pt"             "p7pt"             "p6ro"            
## [349] "p7ro"             "p6se"             "p7se"            
## [352] "p6si"             "p7si"             "p6sk"            
## [355] "p7sk"             "p6hr"             "p7hr"            
## [358] "wex"              "wexpol"           "w1"              
## [361] "w3"               "w4"               "w5"              
## [364] "w6"               "w7"               "w8"              
## [367] "w9"               "w10"              "w11"             
## [370] "w13"              "w14"              "w15"             
## [373] "w16"              "w18"              "w19"             
## [376] "w22"              "w23"              "w24"             
## [379] "w29"              "w30"              "w81"             
## [382] "w82"              "w83"              "w84"             
## [385] "w89"              "w90"              "w93"             
## [388] "w94"              "w98"              "w99"             
## [391] "w1pol"            "w3pol"            "w4pol"           
## [394] "w5pol"            "w6pol"            "w7pol"           
## [397] "w8pol"            "w9pol"            "w10pol"          
## [400] "w11pol"           "w13pol"           "w14pol"          
## [403] "w15pol"           "w16pol"           "w18pol"          
## [406] "w19pol"           "w22pol"           "w23pol"          
## [409] "w24pol"           "w29pol"           "w30pol"          
## [412] "w81pol"           "w82pol"           "w83pol"          
## [415] "w84pol"           "w89pol"           "w90pol"          
## [418] "w93pol"           "w94pol"           "w98pol"          
## [421] "w99pol"           "filter__"

You will need to download the questionnaire from the GESIS website to be able to understand what these variables represent, as the names are rather cryptic.

To find out where those respondents are from, you can look at the Eurostat NUTS codes in the data set.

table(ees$p7_region_nuts1)
## 
## AT1 AT2 AT3 BE1 BE2 BE3 BG3 BG4 CY0 CZ0 DE1 DE2 DE3 DE4 DE5 DE7 DE8 DE9 
##  14   5  12   3  23  16  21  12  20  35   4   8   1   5   1   3   3   5 
## DEA DEB DEC DED DEE DEF DEG DK0 EE0 EL1 EL2 EL3 EL4 ES2 ES3 ES4 ES5 ES6 
##   8   2   1   7   3   2   4  30  40  19  12   9   2   7   4   5  15  11 
## ES7 FI1 FR1 FR2 FR3 FR4 FR5 FR6 FR7 FR8 HR0 HU1 HU2 HU3 IE0 ITC ITF ITG 
##   4  39   7   3   3   3   6   3   3   6  39   9  12  14  45  10  10   4 
## ITH ITI LT0 LU0 LV0 MT0 NL1 NL2 NL3 NL4 PL1 PL2 PL3 PL4 PL5 PL6 PT1 RO1 
##   3   4  31  16  27  16   4  10  20   8  14   8   8   8   3   8  33   7 
## RO2 RO3 RO4 SE1 SE2 SE3 SI0 SK0 UKC UKD UKE UKF UKG UKH UKI UKJ UKK UKL 
##   7  11  12  16  22   3  34  36   1   3   3   4   1   5   2   4   2   2 
## UKM UKN 
##   4   8

Try using the table() command to look at some other variables in the data set.
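
A few variations on table() can be sketched with a short made-up vector of region codes, which stands in for a variable such as ees$p7_region_nuts1. The values are invented for illustration.

```r
# Made-up region codes, standing in for a variable in the data set.
region <- c("IE0", "AT1", "IE0", "BE2", "IE0", "AT1")

counts <- table(region)          # frequency of each value
sort(counts, decreasing = TRUE)  # most common values first
prop.table(counts)               # shares instead of counts
```

Sorting is handy when a variable has many categories, and prop.table() converts the counts into proportions that sum to one.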

To select only the Irish cases:

ees_ie <- subset(ees, countrycode == 1372)

This will create a new object called “ees_ie” which contains only the 45 Irish cases in this data.
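
The same selection can be sketched on a toy data frame. The code 1372 is the Irish country code used above; the other codes and the ages are made up for illustration.

```r
# Toy data; only the code 1372 comes from the text above,
# the other values are made up.
toy <- data.frame(
  countrycode = c(1372, 1040, 1372, 1056),
  age         = c(34, 51, 27, 45)
)

# Keep only the rows where the condition is TRUE
toy_ie <- subset(toy, countrycode == 1372)

# The same selection written with square brackets
toy_ie2 <- toy[toy$countrycode == 1372, ]

nrow(toy_ie)  # number of rows that matched
```

Note the double equals sign: `==` tests whether two values are equal, whereas a single `=` assigns a value.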

Importing Excel files

One of the easiest tools for entering your own data is of course Microsoft Excel. It is possible to open Excel files directly in R. This works most easily when the first row of your sheet contains the variable names and all rows below it, without blank rows in between, contain the data for each observation. For example:

name       grade    age
Jacob       80       18
Erin        72       19
Brenda      34       17
Sam         68       18
Michelle    84       18

Create a new Excel file, entering the above data, and save this as “lab1.xlsx”, but also as “lab1.csv” (comma-separated values format).

The above files can now be opened with the code below, but you’ll need to remove the hashtags first.

# library(readr)   # provides read_csv()
# library(readxl)  # provides read_excel()
# lab1_csv <- read_csv("lab1.csv")
# lab1_xl  <- read_excel("lab1.xlsx", sheet = 1)

Use the above code to open the two files. In the “Environment” tab on the top-right, click on the new data object name to see the data.
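
If you want to see the whole cycle without leaving R, the sketch below writes the same five rows to a temporary CSV file and reads them back. It uses base R's `write.csv()` and `read.csv()` rather than the readr function named above, so that it runs without any extra packages; the temporary file is an assumption made to avoid touching your own folders.

```r
# The five rows from the example sheet above
lab1 <- data.frame(
  name  = c("Jacob", "Erin", "Brenda", "Sam", "Michelle"),
  grade = c(80, 72, 34, 68, 84),
  age   = c(18, 19, 17, 18, 18)
)

# Write to a temporary file so nothing is left behind in your folders
csv_path <- tempfile(fileext = ".csv")
write.csv(lab1, csv_path, row.names = FALSE)

# read.csv() is the base R counterpart of readr's read_csv()
lab1_csv <- read.csv(csv_path)
str(lab1_csv)  # shows the variable names and their types
```

The first row of the file supplies the variable names, which is why the layout described above (names in the first row, no blank rows) matters.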

You can access individual variables in a data set by using the dollar sign, as when I made a table of the “p7_region_nuts1” variable in the EES data set.
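
As a brief sketch of the dollar sign in use, the lines below rebuild two columns of the example sheet as a data frame and pull out the age variable. The grade variable is left out on purpose, so the exercise that follows is still yours to do.

```r
# Two columns from the example sheet above
lab1 <- data.frame(
  name = c("Jacob", "Erin", "Brenda", "Sam", "Michelle"),
  age  = c(18, 19, 17, 18, 18)
)

lab1$age         # the whole column as a vector
table(lab1$age)  # how often each age occurs
range(lab1$age)  # youngest and oldest
```

Anything that works on a plain vector of numbers works on a variable selected this way.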

Use the mean() function in R to calculate the average grade of these five students.