This page will show you how you can import different types of datasets in R and how you can process their data.

Explanation

Importing data: from .csv, .sav and other formats to R data frames

To import a data set, you first need to load the package that matches the type of file you want to import: foreign (for SPSS files), readr (for delimited text files such as csv) or readxl (for Excel files). You also need to know the location of the data set on your computer. Suppose your data file is called ‘dataset2’ and is stored in the folder “C:/My Documents”. The data set can then be loaded into the global environment with the following commands:

# set working directory to the correct folder and put all files to be imported
# in that folder
install.packages(c("tidyverse", "foreign", "readr", "readxl"))  # you do this only once

# library() loads one package per call. You only need to load the packages you
# will actually use, but you need to do that every time after (re)starting R
library(tidyverse)
library(foreign)
library(readr)
library(readxl)

setwd("C:/My Documents")  # setting the working directory

dataset2 <- read.spss("dataset2.sav", to.data.frame = TRUE)  # load an spss file with the package 'foreign' as data frame.

# OR, if the file is a csv file ...

# load a csv data file, separated by commas, with read.csv() from base R
# (the package 'readr' offers read_csv() as an alternative)
dataset2 <- read.csv("dataset2.csv", sep = ",")
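To try the csv import without having a data file at hand, the following minimal sketch first writes a small, made-up data frame to a temporary file and then reads it back. The object names (csv_path, example_data, imported) and the values are assumptions chosen for illustration only.

```r
# Hypothetical example data, written to a temporary file so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:3, score = c(7.5, 6.0, 8.2))
write.csv(example_data, csv_path, row.names = FALSE)

# Read the file back in with read.csv() from base R
imported <- read.csv(csv_path, sep = ",")

str(imported)   # check the column names and data types
head(imported)  # look at the first rows
```

Checking the result with str() directly after importing is a quick way to see whether the separator and the column types were picked up as expected.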

Working with labelled data (mainly SPSS .sav files)

Sometimes imported datasets are ‘labeled’. For example, the variable ‘gender’ is stored as a series of 1’s and 2’s and a label is added for the variable (‘gender as derived from the sampling frame’, this is called the variable label) and for the values (1 means ‘woman’ and 2 means ‘man’, the so-called ‘value labels’). The labels are stored as attributes of the variables in the data frame.

To inspect the labels in a dataframe:

# for the variable label
data$variable %>%
    attr("label")

# for the value labels
data$variable %>%
    attr("labels")
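The sketch below builds a labelled variable by hand, to show what these attributes look like and how value labels can be turned into factor levels. It assumes the attribute names "label" and "labels" used above; which attributes an imported file actually has depends on the import function, so treat this as an illustration rather than a description of every SPSS import.

```r
# Hypothetical labelled variable, built by hand to mimic an imported SPSS column
gender <- c(1, 2, 2, 1)
attr(gender, "label") <- "gender as derived from the sampling frame"
attr(gender, "labels") <- c(woman = 1, man = 2)

attr(gender, "label")   # the variable label
attr(gender, "labels")  # the value labels (a named vector)

# Use the value labels as the level names of a factor
value_labels <- attr(gender, "labels")
gender_factor <- factor(gender, levels = value_labels, labels = names(value_labels))
table(gender_factor)
```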

Creating data frames from scratch

Creating your own data set can be done in many ways. We mention two:

typing in data:

#creating a data frame by variable

age <- c(44, 30, 20, 67) %>% as.integer()
gender <- c("male", "female", "male", "female") %>% as.factor()
length <- c(1.67, 1.70, 1.80, 1.81) %>% as.numeric()
membership <- c("T", "F", "T", "T") %>% as.logical()

# merge the variables
dataset3 <- data.frame(age, gender, length, membership)

generating random data: If you do not have actual data, but want to show something in R, you can randomly select observations from a distribution.

# creating a data frame with a random variable based on the normal
# distribution, assumed mean of the normal distribution is 50, sd is 20, with
# 100 units.

n <- 100
dataset4 <- rnorm(n, 50, 20) %>%
    as.data.frame()
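Random data differ every time the code is run. If others need to reproduce exactly the same numbers, a seed can be set first. The sketch below is a variation on the example above: the seed value 123 and the names dataset4b and score are arbitrary choices made for this illustration.

```r
set.seed(123)  # any fixed number works; 123 is an arbitrary choice

n <- 100
dataset4b <- data.frame(score = rnorm(n, mean = 50, sd = 20))

mean(dataset4b$score)  # close to 50, but not exactly 50
sd(dataset4b$score)    # close to 20, but not exactly 20
```

Building the data frame with data.frame(score = ...) also gives the column a readable name right away.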

View data frames and parts of data frames: View, colnames and select

To illustrate the commands used below, we will use the data set gss_cat (from the tidyverse package forcats). This is a sample of data from the general social survey in the US. It contains the variables: year (year of survey, 2000–2014); age (Maximum age truncated to 89); marital (marital status); race (race); rincome (reported income); partyid (party affiliation); relig (religion); denom (denomination); tvhours (hours per day watching tv).

Suppose you have a data frame and you want to inspect it.

# View the data from the data set gss_cat in a spreadsheet.
gss_cat %>%
    View()

# Get a quick overview of the types of data in your data frame and their names
gss_cat %>%
    str()

# Only get the column names of a data set
gss_cat %>%
    colnames()

# View the data of a particular column, say column 2 (the second variable in
# the data frame)
gss_cat %>%
    select(2)

# View the data of a particular set of variables
gss_cat %>%
    select(year, marital, race)

# View the data set of a particular set of variables, in which one is left out
gss_cat %>%
    select(1:5, -3)

# View part of the data set in a spreadsheet
gss_cat %>%
    select(1:5, -3) %>%
    View()

Creating subsets of a data frame: select, filter and the use of the assignment operator <-

Selected variables can also be stored as a separate object in the global environment.

# Make a new data frame with only the items (variables) that you want:
gss_cat_social <- gss_cat %>%
    select(1:5, 7, 8)

# Make a new data frame with only observations from 2000:
gss_cat_2000 <- gss_cat %>%
    filter(year == 2000)

# Make a new data frame with observations from 2002 and later:
gss_cat_2002plus <- gss_cat %>%
    filter(year >= 2002)
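Conditions in filter() can be combined, and filter() and select() can be chained in one pipeline. The following minimal sketch uses a small made-up data frame (survey, with invented values) so that the result is easy to check by eye.

```r
library(dplyr)

# Hypothetical survey data, used only for illustration
survey <- data.frame(
  year = c(2000, 2002, 2004, 2004, 2006),
  age = c(25, 41, 67, 33, 58),
  marital = c("Married", "Divorced", "Married", "Never married", "Married")
)

# Both conditions must hold: a comma (or &) means AND
married_2004plus <- survey %>%
  filter(year >= 2004, marital == "Married")

# At least one condition must hold: | means OR
young_or_old <- survey %>%
  filter(age < 30 | age > 60)

# First keep the rows you need, then the columns you need
ages_2004 <- survey %>%
  filter(year == 2004) %>%
  select(year, age)
```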

Dealing with missing data: na_if, is.na, na.rm, na.omit

R uses only one way of declaring a specific observation as missing: NA (Not Available). If your data set does include missings, but the missings are coded with a number (for example: 99), you need to replace these values before analyzing the data.

# To change other values (here: the words 'Don't know') to NA, use na_if()

gss_cat <- gss_cat %>%
  mutate(relig = na_if(relig, "Don't know"))

# In the data set gss_cat, replace the existing variable relig (which contains cases with the words 'Don't know') with a new variable relig, in which this group is declared missing (and gets an NA).

There are several ways to find out how many missings are included in the data. Also, there are multiple ways to deal with these missings (pairwise deletion versus listwise deletion).

# To see which variables in the data set gss_cat have missings use:
summary(gss_cat)

# Or use 'is.na' to detect the number of missing values in a specific column
# (the 'religion' variable).
gss_cat$relig %>%
    is.na() %>%
    sum()

# Pairwise deletion of cases (exclude cases that have a missing value on a
# variable, but keep them when working with other variables): 'na.rm'
mean(gss_cat$tvhours, na.rm = TRUE)

# Listwise deletion of cases (drop cases that have a missing value on any of
# the variables used): 'na.omit'
gss_cat_no_missings <- na.omit(gss_cat)

# Make a new data frame only containing units without a missing value on the
# variable 'relig'. Please note that you now drop many cases ONLY because this
# variable is missing.
gss_cat_no_missings_on_relig <- gss_cat %>%
    filter(!is.na(relig))

# Please note that in R we can use the '!' sign, to say 'not'.
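The same approach works when missings are coded with a number, such as the 99 mentioned at the start of this section. A minimal sketch, using a made-up data frame (answers) in which 99 stands for 'missing':

```r
library(dplyr)

# Hypothetical data in which the number 99 means 'missing'
answers <- data.frame(id = 1:5, tvhours = c(2, 99, 4, 0, 99))

# Replace every 99 in tvhours with NA
answers <- answers %>%
  mutate(tvhours = na_if(tvhours, 99))

sum(is.na(answers$tvhours))          # number of missing values
mean(answers$tvhours, na.rm = TRUE)  # mean of the remaining values
```

Without this step, the 99's would be treated as real scores and would pull the mean upwards.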

Renaming a variable in a data frame: rename

# Change old name 'relig' to the new variable name 'religion' (new = old)

gss_cat <- gss_cat %>% 
  rename(religion = relig)

Adding a new variable to an existing data frame: mutate

Adding a variable to the data set can be done with mutate().

# Adding a standardized variable with mutate(); na.rm = TRUE is needed because
# tvhours contains missing values
gss_cat <- gss_cat %>%
    mutate(ztvhours = (tvhours - mean(tvhours, na.rm = TRUE))/sd(tvhours, na.rm = TRUE))

# Creating an index (a combination of item scores) and adding that index to the
# data frame can also be done with mutate()
dataset5 <- dataset5 %>%
    mutate(index = item3 + item4 + item5)

# OR

# (assuming that item3, item4 and item5 are in columns 3, 4 and 5) use:
dataset5 <- dataset5 %>%
    mutate(index = rowSums(.[3:5]))

# If you want to sum values of a lot of columns and ignore the missings, use:
dataset5 <- dataset5 %>%
    mutate(sum = rowSums(.[c(1:4, 7:20, 22)], na.rm = TRUE))
# The 'c' in this command stands for 'combine' and is part of base R.
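Selecting columns by position breaks when the column order changes. As an alternative sketch, the columns can be selected by name with across() from dplyr. The data frame items and its values are made up for this illustration, and the example also shows how a missing value affects both ways of summing.

```r
library(dplyr)

# Hypothetical item scores; the second case has a missing value on item4
items <- data.frame(
  item3 = c(1, 2, 3),
  item4 = c(2, NA, 3),
  item5 = c(3, 2, 1)
)

items <- items %>%
  mutate(
    # plain addition: the index is NA as soon as one item is missing
    index = item3 + item4 + item5,
    # rowSums() with na.rm = TRUE: missing items are ignored
    index_na_rm = rowSums(across(c(item3, item4, item5)), na.rm = TRUE)
  )

items
```

Which of the two is appropriate depends on the analysis: ignoring missing items gives every case a score, but cases with missing items then get a lower sum.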

Making sure variables have the correct type: from character to factor and from factor to ordered factor

If you have a character variable with the ‘words’ ‘man’ and ‘woman’ you probably want to treat this variable as a factor, not merely as a column of words. And if the factor in your data frame has three ‘values’/‘attributes’ (low, medium and high), you have to make sure this variable is stored as an ‘ordered factor’.

# because gss_cat only has numerical variables and factors, we first change a
# factor to a character variable
gss_cat <- gss_cat %>%
    mutate(marital_char = as.character(marital))

# changing text back to factor
gss_cat <- gss_cat %>%
    mutate(marital_factor = as.factor(marital_char))

# changing factor to ordered factor
gss_cat <- gss_cat %>%
    mutate(marital_ord = factor(marital_factor, ordered = TRUE, levels = c("Never married",
        "Married", "Divorced", "Separated", "Widowed", "No answer")))
# In this example, we have put the brackets in a way that avoids omitting a
# bracket (which happens a LOT!)
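The low, medium and high example mentioned above can be sketched in base R as follows. The vector level and its values are made up; the point is that the order of the levels is set explicitly and that comparisons then respect that order.

```r
# Hypothetical character variable
level <- c("low", "high", "medium", "low")

# Store it as an ordered factor; the order is given by the 'levels' argument
level_ord <- factor(level, levels = c("low", "medium", "high"), ordered = TRUE)

levels(level_ord)    # "low" "medium" "high"
level_ord < "high"   # comparisons follow the order of the levels
max(level_ord)       # the highest category that occurs
table(level_ord)     # frequencies, shown in the order of the levels
```

Without the levels argument, R would sort the levels alphabetically (high, low, medium), which is rarely the intended order.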

Adding a new variable from another datafile to an existing data frame: left_join

Suppose you have a data file with a large number of countries (stored as a word) over a large number of years (every ‘country x year’ is one observation). You want to add the continents of these countries to the data set. You have another data set with just the countries (stored as the same words as in the other data set) and the continents.

# gapm1945to2020 is the original data set; gapmcountries is the set with
# countries and continents. Both data sets are loaded in the global environment.
# The country variable is called 'geo' in both data sets and has the same
# values (the same words) in both.

gapm1945joined <- gapm1945to2020 %>%
    left_join(gapmcountries, by = "geo")
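To see what left_join() does with matching and non-matching rows, here is a minimal sketch with two small made-up data frames (country_years and continents; the country codes and values are invented for illustration).

```r
library(dplyr)

# Hypothetical 'country x year' data; 'xxx' does not occur in the second data set
country_years <- data.frame(
  geo = c("nld", "nld", "bra", "xxx"),
  year = c(2000, 2001, 2000, 2000)
)

# Hypothetical lookup table with one row per country
continents <- data.frame(
  geo = c("nld", "bra"),
  continent = c("Europe", "Americas")
)

# All rows of country_years are kept; continent is NA where no match is found
joined <- country_years %>%
  left_join(continents, by = "geo")

# Rows of country_years without a match in continents
unmatched <- country_years %>%
  anti_join(continents, by = "geo")
```

Checking the unmatched rows after a join is a simple way to spot spelling differences between the two data sets.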

Adding new data (observations) to an existing data file: add_row

Sometimes, you want to add data to an existing data file. For this, you can use the function add_row(). With .before and .after, you can specify where new cases should be added.

# to use the add_row() function, the tidyverse packages should be installed and loaded
# columns that are not named in the command get the value NA

# data will be added after the last case
dataset1 <- dataset1 %>% 
   add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1)

# Note the confusing ".,": the dot is a placeholder for the data frame that comes in through the pipe. In this case it means something like "we are really using dataset1". It is optional here, because the pipe already passes dataset1 on as the first argument.

# You can also specify where to add the new case (for example: before case 51)
dataset1 <- dataset1 %>% 
   add_row(., Variable1 = 202, Variable2 = 3, Variable3 = 1, .before = 51)
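A small runnable sketch of add_row(), using a made-up tibble (scores) so that the position of the new row is easy to see. The object and column names are assumptions for illustration.

```r
library(tibble)

# Hypothetical data set with three cases
scores <- tibble(id = 1:3, score = c(10, 20, 30))

# Add a case after the last row (the default)
scores_end <- add_row(scores, id = 4L, score = 40)

# Add a case before the first row
scores_top <- add_row(scores, id = 0L, score = 5, .before = 1)

scores_end
scores_top
```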

Recoding and changing variables in an existing data frame: recode and case_when

Recoding single values of a variable into different values in a new variable

Sometimes you want to change the values of a variable. Usually it is best to simply make a new variable using mutate (see above). Let us say you have items x10 and x11 of type integer (meaning whole numbers only: -1, 0, 1, 2, etc.) that are scored from 1 to 3, and you would like to change 3 into 1 and 1 into 3 (reverse coding; 2 stays 2). We then make new integer variables x10_R and x11_R in the following way. Note that if you add the L after a number, the new data type will be integer (which saves memory). Leave the L out if you want the data type to be numeric.

# keep in mind R uses in most cases (but NOT with rename, which is very
# confusing) the 'OLD is now NEW' order of values.

df_psychology <- df_psychology %>%
    mutate(x10_R = recode(x10, `1` = 3L, `2` = 2L, `3` = 1L), x11_R = recode(x11,
        `1` = 3L, `2` = 2L, `3` = 1L))

# it is often simpler to temporarily ignore the fact a variable is an integer,
# and to simply add the 'as.integer()' command later.

# It is also possible to recode character/factor variables.

data <- data %>%
    mutate(var1b = recode(var1a, word = "newword"), var2b = recode(var2a, word2 = "anothernewword"))

The code above does not seem to work for all data files. For example, when you have imported an SPSS data file with labels, another method should be used.

# for variable x3: 1 -> 0, 2 -> 1
# use: value - 1

data_new <- data_new %>% 
  mutate(x3_R = x3 - 1)

# for variable x1 and x2: 1 -> 3, 2 -> 2, 3 -> 1
# use: value * (-1) + 4
data_new <- data_new %>% 
  mutate(
    x1_R = x1 * (-1) + 4,
    x2_R = x2 * (-1) + 4
  )

# again, make sure the data are now stored as integers.
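The arithmetic used above generalizes: a reversed score equals (lowest possible score + highest possible score) minus the original score. For a scale from 1 to 3 this gives 4 minus the score, which is the same as value * (-1) + 4. The helper function reverse_code() below is a hypothetical name introduced for this sketch; it is not part of any package.

```r
# Hypothetical helper: reverse-code a score given the scale's minimum and maximum
reverse_code <- function(x, min_score, max_score) {
  (min_score + max_score) - x
}

# Item scored from 1 to 3: 1 -> 3, 2 -> 2, 3 -> 1
x1 <- c(1L, 2L, 3L, 3L, 1L)
x1_R <- reverse_code(x1, min_score = 1L, max_score = 3L)

# Item scored from 1 to 5: 1 -> 5, 5 -> 1, 4 -> 2
x5 <- c(1L, 5L, 4L)
x5_R <- reverse_code(x5, min_score = 1L, max_score = 5L)

class(x1_R)  # stays integer because all inputs are integers
```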

Recoding a range of values of a variable into different values in a new variable: case_when

When you would like to recode an entire range of values of a variable into the same different value in a different variable, case_when() can be used.

# recode values of the variable "old_var":
# lower than 20 into 1, 20 - 39 into 2, 40 or higher into 3

dataset1 <- dataset1 %>%
    mutate(new_var = case_when(
        old_var < 20 ~ 1,
        old_var >= 20 & old_var < 40 ~ 2,
        old_var >= 40 ~ 3
    ))
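To check how case_when() treats the boundaries, it helps to run it on a few hand-picked values. The sketch below uses a made-up data frame (ages) with values just below and on each boundary, plus one missing value. Because case_when() uses the first condition that is TRUE, the second condition can be shortened.

```r
library(dplyr)

# Hypothetical values around the boundaries 20 and 40, plus a missing value
ages <- data.frame(old_var = c(5, 19, 20, 39, 40, 85, NA))

ages <- ages %>%
  mutate(new_var = case_when(
    old_var < 20 ~ 1,
    old_var < 40 ~ 2,   # only reached when old_var is 20 or higher
    old_var >= 40 ~ 3
  ))

ages
```

A case for which no condition is TRUE, such as the missing value here, gets NA in the new variable.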

Restructuring data files: creating long and wide format data frames using pivot_longer and pivot_wider

Sometimes data are stored in a relatively wide format (a lot of variables). For example, all the countries are rows and there is a variable ‘unemployment_2000’ and a variable ‘unemployment_2001’ etc… In order to change wide format data into long format data, in which there are three variables only: country_name, year, level of unemployment, use pivot_longer().

# To create 1 new variable containing the names of 4 variables (Sepal.Length,
# Sepal.Width, Petal.Length, Petal.Width) and 1 variable with the scores:

iris %>%
    pivot_longer(cols = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
        names_to = "variable", values_to = "score", values_drop_na = TRUE)

# If you want to restructure variables beginning with the same name (for
# example variables for each week, starting with 'wk'), you can use
# starts_with(). This example uses the wide data set billboard from tidyr.

billboardlong <- billboard %>%
    pivot_longer(cols = starts_with("wk"), names_to = "week", values_to = "rank",
        values_drop_na = TRUE)

And sometimes you want to go from long format to wide format. Use pivot_wider().

# using the same example, now going back from long to wide
billboardwide <- billboardlong %>%
  pivot_wider(names_from = week, 
              values_from = rank)
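The unemployment example described at the start of this section can be sketched as a round trip from wide to long and back. The data frame wide and its values are made up for illustration; names_prefix is used to strip the text 'unemployment_' from the new year variable and to put it back again.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B"),
  unemployment_2000 = c(5.1, 7.3),
  unemployment_2001 = c(4.8, 7.9)
)

# Wide to long: one row per country-year combination
long <- wide %>%
  pivot_longer(cols = starts_with("unemployment_"), names_to = "year",
               names_prefix = "unemployment_", values_to = "unemployment")

# Long to wide again: one column per year
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = unemployment,
              names_prefix = "unemployment_")
```

Note that year is stored as text in the long data frame, because it was created from column names.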

Examples

No examples available yet

Functions

A short overview of the functions used in this page, which package they are in, and what they are used for, with a link to their official documentation

add_row()
package: tibble (tidyverse)
Add a row to an existing dataset
as.character()
package: base
Converts the data type of a variable or object into a character (text)
as.factor()
package: base
Converts the data type of a variable or object into a factor (categories)
as.integer()
package: base
Converts the data type of a variable or object into an integer (whole numbers)
as.logical()
package: base
Converts the data type of a variable or object into a logical variable (TRUE or FALSE)
as.numeric()
package: base
Converts the data type of a variable or object into a number (real number)
attr()
package: base
Get or set specific attributes of an object.
case_when()
package: dplyr (tidyverse)
Return different values depending on which condition is met
colnames()
package: base
Only get the column names of a data set
data.frame()
package: base
Creates a dataframe from multiple vectors (one per variable)
filter()
package: dplyr (tidyverse)
Get specific rows from the dataset based on certain conditions
getwd()
package: base
Tells you what the current working directory is
is.na()
package: base
Check if a value is equal to NA
read.csv()
package: utils (base)
Lets you import a .csv file into R as a dataset
read_spss()
package: haven
Lets you import a .sav file made for SPSS into R as a dataset
rnorm()
package: stats (base)
Creates random numbers with a normal distribution
sd()
package: stats
Find the standard deviation of a list of numbers.
select()
package: dplyr (tidyverse)
Get a specific column or columns from a dataset
setwd()
package: base
Sets the working directory (the folder in which R will look for your files)
str()
package: utils (base)
Get a quick overview of the types of data in your data frame and their names
sum()
package: base
Find the sum of a list of numbers.
summary()
package: base
Give a summary of all variables in a dataset, giving information like minimum, maximum, mean, and frequencies
left_join()
package: dplyr (tidyverse)
Merge two datasets together using a variable they have in common
mean()
package: base
Find the mean of a list of numbers.
mutate()
package: dplyr (tidyverse)
Add a new column to a dataset, or change the values of an existing column
na_if()
package: dplyr (tidyverse)
Change a value to NA if it is equal to a specific value
pivot_longer()
package: tidyr (tidyverse)
Change a dataset from wide format to long format
pivot_wider()
package: tidyr (tidyverse)
Change a dataset from long format to wide format
rename()
package: dplyr (tidyverse)
Changes the name of a variable in a dataset
rowSums()
package: base
Get the sum of the values of the rows in a dataset
View()
package: utils (base)
Shows a dataframe as a table in RStudio

FAQ

When trying to open a dataset I get the message Error : no such file or directory
When loading a file, R will only search for that file in one directory: the working directory. You can find out what the working directory is by using the command getwd(). If that is not the folder your file is in, you need to change it. You can do that using setwd(). When typing the path of this directory, you can’t use the single backslash (\) that is normally used in Windows paths. Instead, you need to change it to either a forward slash (/) or a double backslash (\\). So if your file is in the folder “C:\Users\username\Downloads”, you type setwd("C:\\Users\\username\\Downloads"). If you do not know the full path of the folder, you can type rstudioapi::selectDirectory() in your console. This will allow you to pick a folder by hand, and will give you the full path you need to use in your command with the slashes already changed.

If the file still cannot be found, check for any typos in the folder path or filename. When typing the filename, make sure to include the file extension, for example .csv or .sav. File extensions may be hidden in your file explorer, which makes them hard to find out. To see file extensions on Windows, open file explorer, go to the View tab, and under Show/Hide check the box File name extensions. On Mac, select a file, then choose File > Get Info. Click the arrow next to Name & Extension and deselect Hide extension.
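You can also let R itself check whether a file exists before you try to import it. The sketch below uses a temporary folder as a stand-in for your own folder and writes a small made-up file there first, so that it runs on any computer. In practice you would fill in your own folder and filename.

```r
getwd()  # shows the current working directory

# Stand-in for your own data folder (an assumption made for this sketch)
data_dir <- tempdir()

# file.path() builds a path with the correct slashes
file_path <- file.path(data_dir, "dataset2.csv")

# Write a small hypothetical file so that there is something to find
write.csv(data.frame(id = 1:2), file_path, row.names = FALSE)

file.exists(file_path)                     # TRUE if the path and filename are correct
list.files(data_dir, pattern = "\\.csv$")  # all csv files in the folder
```

If file.exists() returns FALSE for your own file, comparing the filename with the output of list.files() usually reveals the typo or the missing extension.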

My computer can’t open .sav files. I don’t have the right program installed.
You don’t need to open .sav files directly with your computer. You only have to import the .sav files into R. You can do that using the foreign package and the read.spss() command.

Resources

Data transformation cheatsheet
An overview of all the functions you can use to find and manipulate data in your dataframe.

Handling data files