Most data scientists first learn to import data from flat files, such as comma- or tab-delimited files. After all, we're familiar with seeing our data organized with observations across rows and variables in columns.
The best starting place for most flat files is the readr package by RStudio. There are several read_* functions for reading in tabular files with different column specifications:
- read_csv for comma-separated files (and read_csv2 for semicolon-separated files, which use the comma as a decimal mark)
- read_tsv for tab-separated files
- read_delim for other delimiters
- read_fwf for fixed-width files
- read_table for files with columns separated by whitespace
- read_rds for reading in RDS files
- read_log for parsing log files
- read_lines_raw for reading raw lines from a file
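For instance, here is a quick sketch of a couple of these in action (the file paths, delimiter, and column widths are hypothetical, for illustration only):

library(readr)

# a hypothetical pipe-delimited file
mydata <- read_delim("path/to/file.txt", delim = "|")

# a hypothetical fixed-width file; widths and names are made up
mydata <- read_fwf(
  "path/to/file.txt",
  col_positions = fwf_widths(c(5, 10, 4), col_names = c("uid", "name", "year"))
)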
Readr is more internally consistent than base R, meaning you're less likely to be surprised by its behavior. Plus the read_* functions are much faster.
But perhaps the best features are that the read_* functions don't convert strings to factors and parse most date/time formats automatically (e.g. ISO 8601 format). No more typing read.csv(..., stringsAsFactors = FALSE).
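To see both behaviors at once, here is a minimal sketch using inline data (wrapping a literal string in I() requires readr 2.0 or later):

library(readr)

# a tiny inline csv: one string column, one ISO 8601 date column
df <- read_csv(I("name,joined\nalice,2021-03-15\nbob,2021-04-01"))

str(df$name)    # chr: left as character, not coerced to factor
str(df$joined)  # Date: parsed automatically from the ISO 8601 format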
In most cases you can use one of the read_* functions as a drop-in replacement for the base R read.* functions (e.g. read_csv instead of read.csv), and the read_* function will return a tibble.
Plus readr will spit out the column specification it used to parse your file.
myfile <- read_csv("path/to/file.csv") ## Parsed with column specification: ## cols( ## uid = col_integer(), ## date = col_date("%y/%m/%d") ## measurement = col_double(), ## : ## description = col_character() ## )
This magic is courtesy of the very handy type_convert function, which is also quite useful when reading in data from other types of sources. As a best practice, take the column specification output and pass it to the col_types argument of any of the read_* functions. This ensures that your data import script is consistent and reproducible.
myfile <- read_csv(
  file = "path/to/file.csv",
  col_types = cols(
    uid = col_integer(),
    date = col_date("%y/%m/%d"),
    measurement = col_double(),
    :
    description = col_character()
  )
)
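Because type_convert works on any data frame of character columns, you can apply the same guessing logic to data that never touched a file. A small sketch (the tibble here is made up for illustration):

library(readr)
library(tibble)

# columns that arrived as character, e.g. from a scrape or manual entry
raw <- tibble(
  uid  = c("1", "2", "3"),
  date = c("20/01/01", "20/02/01", "20/03/01")
)

# guesses column types the same way read_csv does;
# an explicit col_types spec keeps the result reproducible
clean <- type_convert(raw, col_types = cols(date = col_date("%y/%m/%d")))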
Sometimes you'll need to read and write data that's in a format used by some other statistical package (e.g. SAS, SPSS, or Stata). Strictly speaking, these aren't flat files but proprietary binary formats. This is where haven comes in.
Haven is a wrapper around ReadStat, a C library for reading and writing SAS, SPSS, and Stata files. Like readr, haven is part of the tidyverse ecosystem.
The main functions are read_sas, read_sav, and read_dta. The outputs are tibbles and, like readr, haven parses date/times and does not convert strings to factors.
For the most part, haven works pretty much the same as the readr functions (it is part of the tidyverse, after all).
bolts <- read_sas(data_file = "bolts.sas7bdat")
# # A tibble: 41 × 8
#      run speed1 total speed2 number2  sens  time t20bolt
#    <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl>   <dbl>
# 1     NA     NA    NA     NA      NA    NA    NA      NA
# 2     25      2    10    1.5       0     6  5.70   11.40
# 3     24      2    10    1.5       0    10 17.56   35.12
# 4     30      2    10    1.5       2     6 11.28   22.56
# 5      2      2    10    1.5       2    10  8.39   16.78
# 6     40      2    10    2.5       0     6 16.67   33.34
# 7     37      2    10    2.5       0    10 12.04   24.08
# 8     16      2    10    2.5       2     6  9.22   18.44
# 9     22      2    10    2.5       2    10  3.94    7.88
# 10    33      2    30    1.5       0     6 27.02   18.01
# # ... with 31 more rows
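Reading SPSS and Stata files follows the same pattern, and each format has a matching write_* function (the file names here are hypothetical):

library(haven)

survey <- read_sav("survey.sav")  # SPSS
wages  <- read_dta("wages.dta")   # Stata

# writing back out is just as simple
write_sav(survey, "survey-copy.sav")
write_dta(wages, "wages-copy.dta")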
But perhaps the most overlooked feature of haven is "tagged" missing values. Tagged missing values are special NA values. They behave just like regular missing values but store an additional byte, a tag, which is used to represent the multiple missing value types in SAS (.A through .Z, plus ._) and Stata (.a through .z). There is a similar function, labelled_spss, for handling user-defined missing values from SPSS. Essentially, haven extends the NA type so you can explicitly code for different missing value types.
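Here is a small sketch of how tagged missing values behave, using haven's tagged_na helpers:

library(haven)

# a vector with two distinct kinds of missingness plus a regular NA
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.na(x)              # TRUE for all three NAs: they behave like regular NA
na_tag(x)             # NA NA "a" "z" NA: recover the tags
is_tagged_na(x, "a")  # pick out one specific missing value type
print_tagged_na(x)    # prints the vector with the tags shown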
Readr makes it easy to work with flat files in R. The functions are fast, consistent, and return tibbles. They don't convert strings to factors and automatically parse common date/time formats.
If you're working with SAS, SPSS, or Stata files, haven has you covered. It makes transferring data between R and those packages simple, and it brings support for tagged missing values.