This page shows how you can import different types of data sets in R and how you can process their data.
To import a data set, you first need to load the correct package: foreign (for SPSS files), readr (for csv files) or readxl (for Excel files), depending on the type of data you want to import. You also need to know the location of the data set on your computer. Suppose your data file is called ‘dataset2’ and is stored in “C:/My Documents”. The data set can then be loaded into the global environment with the following commands:
# set working directory to the correct folder and put all files to be imported
# in that folder
install.packages(c("tidyverse", "foreign", "readr", "readxl")) # you do this only once
# library() loads one package per call; you only need to load the packages you
# will actually use, but you need to do that every time after (re)starting R
library(tidyverse) # also loads readr
library(foreign)
library(readxl)
setwd("C:/My Documents") # setting the working directory
dataset2 <- read.spss("dataset2.sav", to.data.frame = TRUE) # load an spss file with the package 'foreign' as data frame.
# OR, if the file is a csv file ...
# load a csv data file, separated by commas, with read.csv() from base R
# (the package 'readr' offers read_csv() as an alternative)
dataset2 <- read.csv("dataset2.csv", sep = ",")
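The text also mentions readxl for Excel files. The following is a minimal sketch, assuming the readxl package is installed. So that it runs without a file of your own, it reads an example workbook that ships with readxl (located with readxl_example()). The file name "dataset2.xlsx" in the last comment is hypothetical.

```r
library(readxl)

# Path to an example workbook bundled with readxl (used here only so the
# sketch runs without a file of your own)
example_path <- readxl_example("datasets.xlsx")

# List the sheets in the workbook, then read the first one as a data frame
excel_sheets(example_path)
dataset_excel <- read_excel(example_path, sheet = 1)

# With your own file in the working directory you would write, for example:
# dataset2 <- read_excel("dataset2.xlsx")   # hypothetical file name
```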
Sometimes imported data sets are ‘labeled’. For example, the variable ‘gender’ is stored as a series of 1’s and 2’s, with a label added for the variable (‘gender as derived from the sampling frame’; this is called the variable label) and for the values (1 means ‘woman’ and 2 means ‘man’; these are the so-called value labels). The labels are stored as attributes of the variables in a data frame.
To inspect the labels in a dataframe:
# for the variable label
data$variable %>%
attr("label")
# for the value labels
data$variable %>%
attr("labels")
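To see what these attributes look like, here is a small sketch in base R with made-up data. It attaches a variable label and value labels by hand, in the same attributes ("label" and "labels") that the commands above read, and then uses the value labels to build a factor. How an import function stores labels can differ per package, so treat this as an illustration only.

```r
# A numeric vector as it might arrive from an SPSS file (values are made up)
gender <- c(1, 2, 2, 1)
attr(gender, "label") <- "gender as derived from the sampling frame"
attr(gender, "labels") <- c(woman = 1, man = 2)

attr(gender, "label")    # the variable label
attr(gender, "labels")   # the value labels (a named vector)

# Use the value labels to turn the codes into a factor
value_labels <- attr(gender, "labels")
gender_factor <- factor(gender, levels = value_labels, labels = names(value_labels))
levels(gender_factor)
```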
Creating your own data set can be done in many ways. We mention two:
#creating a data frame by variable
age <- c(44, 30, 20, 67) %>% as.integer()
gender <- c("male", "female", "male", "female") %>% as.factor()
length <- c(1.67, 1.70, 1.80, 1.81) %>% as.numeric()
membership <- c("T", "F", "T", "T") %>% as.logical()
# merge the variables
dataset3 <- data.frame(age, gender, length, membership)
# creating a data frame with a random variable based on the normal
# distribution, assumed mean of the normal distribution is 50, sd is 20, with
# 100 units.
n <- 100
dataset4 <- rnorm(n, 50, 20) %>%
as.data.frame()
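A variant of the random-variable example, as a sketch: calling data.frame() directly lets you name the column yourself, and set.seed() makes the random draws reproducible. The column name score and the seed value are arbitrary choices for this illustration.

```r
set.seed(123)  # arbitrary seed so the random draws can be reproduced

n <- 100
dataset4 <- data.frame(score = rnorm(n, mean = 50, sd = 20))

names(dataset4)  # the single column is called "score"
nrow(dataset4)   # one row per unit
```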
To illustrate the commands used below, we will use the data set gss_cat (from the tidyverse package forcats). This is a sample of data from the general social survey in the US. It contains the variables: year (year of survey, 2000–2014); age (Maximum age truncated to 89); marital (marital status); race (race); rincome (reported income); partyid (party affiliation); relig (religion); denom (denomination); tvhours (hours per day watching tv).
Suppose we have a data frame and you want to inspect it.
# View the data from the data set gss_cat in a spreadsheet.
gss_cat %>%
View()
# Get a quick overview of the types of data in your data frame and their names
gss_cat %>%
str()
# Only get the column names of a data set
gss_cat %>%
colnames()
# View the data of a particular column, say column 2 (the second variable in
# the data frame)
gss_cat %>%
select(2)
# View the data of a particular set of variables
gss_cat %>%
select(year, marital, race)
# View the data set of a particular set of variables, in which one is left out
gss_cat %>%
select(1:5, -3)
# View part of the data set in a spreadsheet
gss_cat %>%
select(1:5, -3) %>%
View()
Selected variables can also be stored as a separate object in the global environment.
# Make a new data frame with only the items (variables) that you want:
gss_cat_social <- gss_cat %>%
select(1:5, 7, 8)
# Make a new data frame with only observations from 2000:
gss_cat_2000 <- gss_cat %>%
filter(year == 2000)
# Make a new data frame with observations from 2002 and later:
gss_cat_2002plus <- gss_cat %>%
filter(year >= 2002)
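Conditions in filter() can also be combined. The sketch below, which assumes dplyr and forcats are installed, keeps rows that satisfy two conditions at once and rows that match one of several values. The object names are chosen for this illustration.

```r
library(dplyr)
library(forcats)  # provides the gss_cat data set

# Married respondents from the 2000 survey (both conditions must hold)
gss_cat_2000_married <- gss_cat %>%
  filter(year == 2000, marital == "Married")

# Observations from 2000 or 2014, using %in% to match several values
gss_cat_first_last <- gss_cat %>%
  filter(year %in% c(2000, 2014))
```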
R has only one way of marking a specific observation as missing: NA (Not Available). If your data set includes missings that are coded differently, for example with a number (such as 99) or a word, you need to replace these values with NA before analyzing the data.
# To change other values (here: the words 'Don't know') to NA, use na_if()
gss_cat <- gss_cat %>%
mutate(relig = na_if(relig, "Don't know"))
# In the data set gss_cat, this replaces the existing variable relig (which contains cases with the words 'Don't know') with a new variable relig, in which this group is declared missing (and gets an NA).
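The same function works for missings coded with a number. A minimal sketch with made-up data, in which 99 stands for a missing answer (the data frame survey and its values are invented for this example):

```r
library(dplyr)

# Made-up data in which 99 stands for a missing answer
survey <- data.frame(id = 1:5, tvhours = c(2, 99, 4, 0, 99))

survey <- survey %>%
  mutate(tvhours = na_if(tvhours, 99))

sum(is.na(survey$tvhours))            # number of values now declared missing
mean(survey$tvhours, na.rm = TRUE)    # mean over the remaining answers
```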
There are several ways to find out how many missings are included in the data. Also, there are multiple ways to deal with these missings (pairwise deletion versus listwise deletion).
# To see which variables in the data set gss_cat have missings use:
summary(gss_cat)
# Or use 'is.na' to detect the number of missing values in a specific column
# (the 'religion' variable).
gss_cat$relig %>%
is.na() %>%
sum()
# Pairwise deletion of cases (exclude cases that have a missing value on a
# variable, but keep them when working with other variables): 'na.rm'
mean(gss_cat$tvhours, na.rm = TRUE)
# Listwise deletion of cases (drop cases that have a missing value on any of
# the variables used): 'na.omit'
gss_cat_no_missings <- na.omit(gss_cat)
# Make a new data frame only containing units without a missing value on the
# variable 'relig'. Please note that you now drop many cases ONLY because this
# variable is missing.
gss_cat_no_missings_on_relig <- gss_cat %>%
filter(!is.na(relig))
# Please note that in R we can use the '!' sign, to say 'not'.
# Change old name 'relig' to the new variable name 'religion' (new = old)
gss_cat <- gss_cat %>%
rename(religion = relig)
Adding a variable to the data set can be done with mutate().
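# Adding a standardized variable with mutate(); na.rm = TRUE is needed because
# tvhours contains missings (without it, every standardized value would be NA)
gss_cat <- gss_cat %>%
mutate(ztvhours = (tvhours - mean(tvhours, na.rm = TRUE))/sd(tvhours, na.rm = TRUE))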
# Creating an index (a combination of item scores) and adding that index to the
# data frame can also be done with mutate()
dataset5 <- dataset5 %>%
mutate(index = item3 + item4 + item5)
# OR
# (assuming that item3, item4 and item5 are in columns 3, 4 and 5) use:
dataset5 <- dataset5 %>%
mutate(index = rowSums(.[3:5]))
# If you want to sum values of a lot of columns and ignore the missings, use:
dataset5 <- dataset5 %>%
mutate(sum = rowSums(.[c(1:4, 7:20, 22)], na.rm = TRUE))
# The 'c' in this command stands for 'combine' and is part of base R.
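Selecting columns by position breaks when the column order changes. As a sketch of an alternative, the items can be selected by name with across() from dplyr. The data below are made up; the item names follow the example above.

```r
library(dplyr)

# Made-up item scores; item3 has one missing value
dataset5 <- data.frame(
  item3 = c(1, 2, NA),
  item4 = c(3, 4, 5),
  item5 = c(5, 6, 7)
)

dataset5 <- dataset5 %>%
  mutate(
    # sum and mean over the named items, ignoring missings
    index_sum  = rowSums(across(c(item3, item4, item5)), na.rm = TRUE),
    index_mean = rowMeans(across(c(item3, item4, item5)), na.rm = TRUE)
  )
```

Note that with na.rm = TRUE a sum is based on fewer items for cases with missings, so a mean is often easier to compare across cases.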
If you have a character variable with the ‘words’ ‘man’ and ‘woman’, you probably want to treat this variable as a factor, not merely as a column of words. And if the factor in your data frame has three ordered ‘values’/‘attributes’ (low, medium and high), you have to make sure this variable is stored as an ‘ordered factor’.
# because gss_cat only has numerical variables and factors, we first change a
# factor to a character variable
gss_cat <- gss_cat %>%
mutate(marital_char = as.character(marital))
# changing text back to factor
gss_cat <- gss_cat %>%
mutate(marital_factor = as.factor(marital_char))
# changing factor to ordered factor
gss_cat <- gss_cat %>%
mutate(marital_ord = factor(marital_factor, ordered = TRUE, levels = c("Never married",
"Married", "Divorced", "Separated", "Widowed", "No answer")))
# In this example, we have put the brackets in a way that avoids omitting a
# bracket (which happens a LOT!)
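The low, medium and high case mentioned in the text can be sketched in base R as follows. The values are made up; the point is that the order of the levels argument defines the order of the factor.

```r
# Made-up scores with a natural order
level_char <- c("low", "high", "medium", "low")

level_ord <- factor(level_char,
                    levels = c("low", "medium", "high"),
                    ordered = TRUE)

is.ordered(level_ord)   # check that the factor is ordered
level_ord < "high"      # comparisons follow the order of the levels
table(level_ord)        # counts are shown in level order
```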
Suppose you have a data file with a large number of countries (stored as a word) over a large number of years (every ‘country x year’ is one observation). You want to add the continents of these countries to the data set. You have another data set with just the countries (stored as the same words as in the other data set) and the continents.
# gapm1945to2020 is the original data set; gapmcountries is the set with
# countries and continents. Both data sets are loaded in the global environment.
# The country variable is called 'geo' in both data sets and has the same
# values (the same words) in both.
gapm1945joined <- gapm1945to2020 %>%
left_join(gapmcountries, by = "geo")
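A self-contained sketch of the same join with made-up data shows what happens to countries that have no match in the second data set. All object names and values here are invented for the illustration.

```r
library(dplyr)

# Made-up country-by-year data (one row per country and year)
country_years <- data.frame(
  geo  = c("nld", "nld", "bra", "xxx"),
  year = c(2019, 2020, 2019, 2019)
)

# Made-up lookup table with one row per country
continents <- data.frame(
  geo       = c("nld", "bra"),
  continent = c("Europe", "Americas")
)

joined <- country_years %>%
  left_join(continents, by = "geo")

# Rows without a match keep their place and get NA for the new column
joined %>%
  filter(is.na(continent))
```

Checking for NA after a join is a quick way to spot countries that are spelled differently in the two data sets.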
Sometimes, you want to add data to an existing data file. For this, you can use the function add_row(). With .before and .after, you can specify where new cases should be added.
# to use the add_row() function, the tidyverse packages should be installed and loaded
# include every variable in the command; variables that are left out get NA for the new case
# data will be added after the last case
dataset1 <- dataset1 %>%
add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1)
# Note the confusing ".,": the dot stands for the data on the left-hand side of the pipe. In this case it means something like "we are really using dataset1". It can also be left out, because %>% passes the data as the first argument anyway.
# You can also specify where to add the new case (for example: before case 51)
dataset1 <- dataset1 %>%
add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1, .before = 51)
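A runnable sketch of add_row() with a made-up data set. It assumes the tibble package (part of the tidyverse) is installed. The variable names follow the example above, and the values are invented.

```r
library(tibble)  # add_row() lives in tibble, which the tidyverse loads

# Made-up data set with three cases
dataset1 <- data.frame(Variable1 = c(200, 201, 203), Variable2 = c(1, 2, 3))

# Add a case at the end; Variable2 is not given, so it becomes NA
dataset1 <- add_row(dataset1, Variable1 = 204)

# Add a case before the current third row
dataset1 <- add_row(dataset1, Variable1 = 202, Variable2 = 2, .before = 3)
```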
Sometimes you want to change the values of a variable. Usually it is best to make a new variable using mutate() (see above). Say you have items x10 and x11 of type integer (whole numbers such as -1, 0, 1, 2, etc.) that are scored from 1 to 3, and you would like to change 3 into 1 and 1 into 3 (reverse coding; 2 stays 2). We then make new integer variables x10_R and x11_R in the following way. Note that if you add the L after a number, the new variable will be of type integer (which saves memory). Leave the L out if you want the data type to be numeric.
# keep in mind that recode() uses the 'OLD = NEW' order of values, whereas
# rename() uses 'NEW = OLD', which is very confusing.
df_psychology <- df_psychology %>%
mutate(x10_R = recode(x10, `1` = 3L, `2` = 2L, `3` = 1L), x11_R = recode(x11,
`1` = 3L, `2` = 2L, `3` = 1L))
# it is often simpler to temporarily ignore the fact a variable is an integer,
# and to simply add the 'as.integer()' command later.
# It is also possible to recode character/factor variables.
data <- data %>%
mutate(var1b = recode(var1a, word = "newword"), var2b = recode(var1b, word2 = "anothernewword"))
The code above does not seem to work for all data files. If, for example, you imported an SPSS data file with labels, another method should be used.
# for variable x3: 1 -> 0, 2 -> 1
# use: value - 1
data_new <- data_new %>%
mutate(x3_R = x3 - 1)
# for variable x1 and x2: 1 -> 3, 2 -> 2, 3 -> 1
# use: value * (-1) + 4
data_new <- data_new %>%
mutate(
x1_R = x1 * (-1) + 4,
x2_R = x2 * (-1) + 4
)
# again, make sure the data are now stored as integers.
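The arithmetic approach can be checked on a small made-up vector. This base R sketch reverse codes an item scored 1 to 3 with (minimum + maximum) - value, which gives the same result as value * (-1) + 4. Using the L suffix keeps the result an integer.

```r
# Made-up item scored 1 to 3, stored as integer
x1 <- c(1L, 3L, 2L, NA, 1L)

# Reverse code with (minimum + maximum) - value; the L keeps the result integer
x1_R <- (1L + 3L) - x1

typeof(x1_R)                       # check the storage type
table(x1, x1_R, useNA = "ifany")   # cross-tabulate old against new as a check
```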
When you would like to recode an entire range of values of a variable into a single value in a new variable, case_when() can be used.
# recode values of the variable "old_var":
# lower than 20 into 1, 20 - 39 into 2, 40 or higher into 3
dataset1 <- dataset1 %>%
mutate(new_var = case_when(
old_var < 20 ~ 1,
old_var >= 20 & old_var < 40 ~ 2,
old_var >= 40 ~ 3
)
)
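A runnable sketch of the same recoding with made-up values, including the boundary values and a missing value. Because case_when() checks the conditions in order, the second condition only needs the upper limit. A missing value matches none of the conditions and stays NA.

```r
library(dplyr)

# Made-up values, including the boundaries and a missing value
dataset1 <- data.frame(old_var = c(5, 19, 20, 39, 40, 75, NA))

dataset1 <- dataset1 %>%
  mutate(new_var = case_when(
    old_var < 20 ~ 1,
    old_var < 40 ~ 2,     # only reached when old_var is 20 or higher
    old_var >= 40 ~ 3
  ))

# Cross-tabulate to check that every value landed in the intended group
table(dataset1$old_var, dataset1$new_var, useNA = "ifany")
```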
Sometimes data are stored in a relatively wide format (a lot of variables). For example, all the countries are rows and there is a variable ‘unemployment_2000’, a variable ‘unemployment_2001’, etc. To change wide format data into long format data, in which there are only three variables (country_name, year, level of unemployment), use pivot_longer().
# To create 1 new variable containing the names of 4 variables (Sepal.Length,
# Sepal.Width, Petal.Length, Petal.Width) and 1 variable with the scores:
iris %>%
pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
names_to = "variable", values_to = "score", values_drop_na = TRUE)
# If you want to restructure variables beginning with the same name (for
# example variables for each week, starting with 'wk'), you can use
# starts_with(). The wide data set billboard comes with the tidyr package.
billboardlong <- billboard %>%
pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank",
values_drop_na = TRUE)
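The unemployment example from the text can be sketched with made-up data as follows. The names_prefix argument of pivot_longer() strips the shared prefix, so that only the year remains. The country names and values are invented.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per country, one column per year
unemployment_wide <- data.frame(
  country_name      = c("A", "B"),
  unemployment_2000 = c(4.1, 7.3),
  unemployment_2001 = c(3.9, 7.8)
)

unemployment_long <- unemployment_wide %>%
  pivot_longer(
    cols = starts_with("unemployment_"),
    names_to = "year",
    names_prefix = "unemployment_",   # strip the prefix so only the year remains
    values_to = "unemployment"
  ) %>%
  mutate(year = as.integer(year))     # the year arrives as text; make it a number
```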
And sometimes you want to go from long format to wide format. Use pivot_wider().
# using the same example: turn the long billboard data back into wide format
billboardwide <- billboardlong %>%
pivot_wider(names_from = week,
values_from = rank)
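A self-contained sketch with made-up long data: each value of year becomes its own column, and names_prefix puts a readable prefix in front of the new column names. The data and object names are invented for the illustration.

```r
library(dplyr)
library(tidyr)

# Made-up long data: one row per country and year
unemployment_long <- data.frame(
  country_name = c("A", "A", "B", "B"),
  year         = c(2000, 2001, 2000, 2001),
  unemployment = c(4.1, 3.9, 7.3, 7.8)
)

unemployment_wide <- unemployment_long %>%
  pivot_wider(
    names_from = year,
    values_from = unemployment,
    names_prefix = "unemployment_"   # gives columns unemployment_2000, ...
  )
```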
A short overview of the functions used on this page, which package they are in, and what they are used for, with a link to their official documentation:
add_row()
as.character()
as.factor()
as.integer()
as.logical()
as.numeric()
attr()
case_when()
colnames()
data.frame()
filter()
getwd()
is.na()
NA
read.csv()
read.spss()
rnorm()
sd()
select()
setwd()
str()
sum()
summary()
left_join()
mean()
mutate()
na_if() (sets a value to NA if it is equal to a specific value)
pivot_longer()
pivot_wider()
rename()
rowSums()
View()
When trying to open a dataset I get the message Error : no such file or directory
When loading a file, R will only search for that file in one directory: the working directory. You can find out what the working directory is by using the command getwd(). If that is not the folder that contains your file, you need to change it. You can do that using setwd(). When typing the path of this directory, you can’t use the single backslash (\) that Windows normally uses in paths, because R treats the backslash as an escape character. Instead, change it to either a forward slash (/) or a double backslash (\\). So if your file is in the folder “C:\Users\username\Downloads”, you type setwd("C:\\Users\\username\\Downloads"). If you do not know the full path of the folder, you can type rstudioapi::selectDirectory() in your console. This will allow you to pick a folder by hand, and will give you the full path you need to use in your command with the slashes already changed.
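Before importing, you can check from within R whether a file really is where you think it is. The base R sketch below uses tempdir() as a stand-in for your own data folder and writes a tiny made-up file there, so that it runs anywhere. With your own data you would use your own folder and file name instead.

```r
# Where is R looking for files right now?
getwd()

# Build a path with file.path() instead of typing slashes by hand;
# tempdir() stands in for your own data folder in this sketch
data_folder <- tempdir()
data_path <- file.path(data_folder, "dataset2.csv")

# Write a tiny made-up file so there is something to find
write.csv(data.frame(id = 1:3), data_path, row.names = FALSE)

file.exists(data_path)                         # TRUE when the path is correct
list.files(data_folder, pattern = "\\.csv$")   # which csv files are in the folder?

dataset2 <- read.csv(data_path)
```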
If the file still cannot be found, check for any typos in the folder path or filename. When typing the filename, make sure to include the file extension, for example .csv or .sav. File extensions may be hidden in your file explorer, which makes them hard to find out. To see file extensions on Windows, open File Explorer, go to the View tab, and under Show/Hide check the box File name extensions. On Mac, select a file, then choose File > Get Info. Click the arrow next to Name & Extension and deselect Hide extension.
My computer can’t open .sav files. I don’t have the right program installed.
You don’t need to open .sav files directly with your computer. You only have to import the .sav files into R. You can do that using the foreign package and the read.spss() command.