Getting Data into R Part 1 - Flat(ish) Files

Feb 21, 2017 · 723 words · 4 minutes read data processing • R

Most data scientists first learn to import data from flat files, such as comma- or tab-delimited files. After all, we’re familiar with seeing our data organized with observations across rows and variables in columns.

readr

The best starting place for most flat files is the readr package by RStudio.

There are several read_* functions for reading in tabular files with different column specifications.

read_csv for comma separated files and read_csv2 for semicolon separated files (common where the comma is the decimal mark)
read_tsv for tab separated files
read_delim for other delimiters
read_fwf for fixed width files
read_table for files with columns separated by whitespace
read_rds for reading in RDS files
read_log for parsing log files
read_lines and read_lines_raw for reading lines from a file
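As a rough illustration of how the delimiter-specific readers relate to each other, the sketch below writes the same tiny table to temporary files with four different separators and reads each one back. The data and file names are made up for the example.

```r
library(readr)

# Hypothetical example data, written to temporary files in four layouts.
csv_path   <- tempfile(fileext = ".csv")
csv2_path  <- tempfile(fileext = ".csv")
tsv_path   <- tempfile(fileext = ".tsv")
pipe_path  <- tempfile(fileext = ".txt")
writeLines(c("uid,score", "1,2.5", "2,3.5"), csv_path)
writeLines(c("uid;score", "1;2,5", "2;3,5"), csv2_path)   # comma as decimal mark
writeLines(c("uid\tscore", "1\t2.5", "2\t3.5"), tsv_path)
writeLines(c("uid|score", "1|2.5", "2|3.5"), pipe_path)

from_csv  <- read_csv(csv_path)                  # comma separated
from_csv2 <- read_csv2(csv2_path)                # semicolon separated
from_tsv  <- read_tsv(tsv_path)                  # tab separated
from_pipe <- read_delim(pipe_path, delim = "|")  # any other single delimiter
```

All four calls should produce a two-row tibble with the same uid and score columns.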

Readr is more internally consistent than base R, so you’re less likely to be surprised by its behavior. The read_* functions are also typically much faster than their base R counterparts.

But perhaps the best features are that the read_* functions don’t convert strings to factors and parse most common date/time formats automatically (e.g. ISO8601 format). No more read.csv(..., stringsAsFactors = FALSE).
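A minimal sketch of that difference, using a made-up two-column file: the same file is read once with read_csv and once with base read.csv with factor conversion switched on, and the column classes are compared.

```r
library(readr)

# Hypothetical file with a text column and an ISO8601 date column.
path <- tempfile(fileext = ".csv")
writeLines(c("name,visited", "alice,2017-02-21", "bob,2017-02-22"), path)

with_readr <- read_csv(path)
with_base  <- read.csv(path, stringsAsFactors = TRUE)

class(with_readr$name)     # "character"
class(with_readr$visited)  # "Date"
class(with_base$name)      # "factor"
class(with_base$visited)   # "factor"
```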

In most cases you can use one of the read_* functions as a drop-in replacement for the base R read.* functions (e.g. read_csv instead of read.csv) and the read_* function will return a tibble.

Plus readr will spit out the column specifications used to parse your file.

myfile <- read_csv("path/to/file.csv")
## Parsed with column specification:
## cols(
##   uid = col_integer(),
##   date = col_date("%y/%m/%d"),
##   measurement = col_double(),
##   :
##   description = col_character()
## )

Readr prints this specification after guessing each column’s type from the data. The very handy type_convert function applies the same kind of guessing to the character columns of an existing data frame, which is quite useful when reading in data from other types of sources. As a best practice, take the column specification output and pass it to the col_types argument of any of the read_* functions. This ensures that your data import script is consistent and reproducible.

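library(readr)

# Reuse the printed column specification so every run parses the file the same way.
myfile <- read_csv(file = "path/to/file.csv",
                   col_types = cols(
                     uid = col_integer(),
                     date = col_date("%y/%m/%d"),
                     measurement = col_double(),
                     # ... remaining columns elided ...
                     description = col_character()
                   ))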
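To make the two ideas above a little more concrete, here is a small sketch under stated assumptions: the data frame and file are invented, spec() is used to pull the guessed specification off a tibble that readr has already read, and type_convert is applied to a data frame whose columns all start out as character.

```r
library(readr)

# 1. Capture the guessed specification and reuse it on a second read.
path <- tempfile(fileext = ".csv")
writeLines(c("uid,date,measurement",
             "1,2017-02-21,1.5",
             "2,2017-02-22,2.25"), path)
first_read  <- read_csv(path)
second_read <- read_csv(path, col_types = spec(first_read))

# 2. Let readr re-parse character columns that came from some other source.
raw <- data.frame(uid = c("1", "2"),
                  date = c("2017-02-21", "2017-02-22"),
                  measurement = c("1.5", "2.25"),
                  stringsAsFactors = FALSE)
converted <- type_convert(raw)
```

In a real script you would normally paste the printed cols(...) call into the source file rather than call spec() at run time, so the expected types are visible in the code.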

Sometimes you’ll need to read and write data that’s in a format used by some other statistical package (e.g. SAS, SPSS, or Stata). Strictly speaking, these aren’t flat files. This is where haven comes in.

haven

Haven is a wrapper around ReadStat, a C library (with an accompanying command-line tool) for reading and writing SAS, SPSS, and Stata files. Like readr, haven is a part of the tidyverse ecosystem.

The main functions are read_sas, read_sav, and read_dta. They return tibbles and, like the readr functions, parse dates and times and do not convert strings to factors.

For the most part, haven works pretty much the same as the readr functions (it is part of the tidyverse after all).

bolts <- read_sas(data_file = "bolts.sas7bdat")
# # A tibble: 41 × 8
#      run speed1 total speed2 number2  sens  time t20bolt
#    <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl>   <dbl>
# 1     NA     NA    NA     NA      NA    NA    NA      NA
# 2     25      2    10    1.5       0     6  5.70   11.40
# 3     24      2    10    1.5       0    10 17.56   35.12
# 4     30      2    10    1.5       2     6 11.28   22.56
# 5      2      2    10    1.5       2    10  8.39   16.78
# 6     40      2    10    2.5       0     6 16.67   33.34
# 7     37      2    10    2.5       0    10 12.04   24.08
# 8     16      2    10    2.5       2     6  9.22   18.44
# 9     22      2    10    2.5       2    10  3.94    7.88
# 10    33      2    30    1.5       0     6 27.02   18.01
# # ... with 31 more rows
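Since the bolts file above may not be at hand, the following sketch round-trips a small invented data frame through Stata and SPSS files using write_dta and write_sav, then reads each back with the matching read_* function. The column names only loosely echo the example above and the values are illustrative.

```r
library(haven)

# Hypothetical stand-in for a data set exported from another package.
trial <- data.frame(run = c(25, 24), sens = c(6, 10), time = c(5.70, 17.56))

dta_path <- tempfile(fileext = ".dta")
sav_path <- tempfile(fileext = ".sav")
write_dta(trial, dta_path)   # Stata format
write_sav(trial, sav_path)   # SPSS format

from_stata <- read_dta(dta_path)
from_spss  <- read_sav(sav_path)
```

Both results come back as tibbles holding the same numbers that were written out.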

But perhaps the most overlooked feature of haven is “tagged” missing values.

Tagged missing values are special NA values. They behave just like regular missing values but store an additional byte, a tag. This is used to handle the multiple missing value types in SAS (.A-.Z and ._) and Stata (.a-.z), and haven creates them with the tagged_na function. For SPSS, which instead lets you declare particular values as missing, there is a separate function, labelled_spss. Essentially, haven extends the NA type so you can explicitly code for different missing types.
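A short sketch of how this looks in practice, with invented values: tagged_na builds the tagged values, na_tag recovers the tags, and is_tagged_na tests for a specific tag. The labelled_spss call at the end declares 99 as a user-defined missing value, which is only an illustrative choice.

```r
library(haven)

# Two tagged NAs ("a" and "z") alongside ordinary values and a plain NA.
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.na(x)              # tagged NAs still count as missing
na_tag(x)             # recovers the tag, NA where there is none
is_tagged_na(x, "a")  # TRUE only for the value tagged "a"

# SPSS-style user-defined missing value (99 is a hypothetical code).
y <- labelled_spss(c(1, 2, 99), labels = c(Refused = 99), na_values = 99)
is.na(y)
```

Because the tag travels with the value, a summary can treat all of these as missing while still distinguishing, say, "refused" from "not applicable" when needed.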

Readr makes it easy to work with flat files in R. The functions are fast, consistent, and return tibbles. They don’t convert strings to factors and automatically parse common date/time formats.

If you’re working with SAS, SPSS, or Stata files, haven has you covered. Haven makes transferring data between R and SAS, SPSS, and Stata simple. Plus haven brings support for tagged missing values.