Getting Data into R Part 1 - Flat(ish) Files
Feb 21, 2017 · 723 words · 4 minutes read
Most data scientists first learn to import data from flat files, such as comma- or tab-delimited files. After all, we’re familiar with seeing our data organized with observations across rows and variables in columns.
readr
The best starting place for most flat files is the readr package by RStudio.
There are several read_* functions for reading in tabular files with different column specifications.
read_csv and read_csv2 for comma separated files (read_csv2 handles the semicolon separated variant used where the comma is the decimal mark)
read_tsv for tab separated files
read_delim for other delimiters
read_fwf for fixed width files
read_table for files with columns separated by whitespace
read_rds for reading in RDS files
read_log for parsing log files
read_lines and read_lines_raw for reading lines from a file
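As a minimal sketch of how the functions above relate, the following reads a small pipe-delimited file with read_delim. The file contents and the names path and scores are made up for illustration; the file is written to a temporary location so the example is self-contained.

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary location (made-up data).
path <- tempfile(fileext = ".txt")
writeLines(c("id|name|score", "1|ada|9.5", "2|grace|8.25"), path)

# read_delim() takes the delimiter explicitly; read_csv() and read_tsv()
# follow the same pattern with the delimiter fixed to "," and "\t".
scores <- read_delim(path, delim = "|")
scores
```

The same call with read_csv or read_tsv would work for a comma or tab separated version of the file, without the delim argument.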
Readr is more internally consistent than base R, meaning you’re less likely to be surprised by its behavior. Plus the read_* functions are much faster.
But perhaps the best features are that the read_* functions don’t convert strings to factors and parse most date/time formats automatically (e.g. ISO8601 format). No more read.csv(..., stringsAsFactors = FALSE).
In most cases you can use one of the read_* functions as a drop-in replacement for the base R read.* functions (e.g. read_csv instead of read.csv) and the read_* function will return a tibble.
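To make the comparison concrete, here is a small sketch that reads the same file with read.csv and read_csv. The file contents and the names base_df and tidy_df are invented for this example.

```r
library(readr)

# A small CSV with a text column and an ISO8601 date column (made-up data).
path <- tempfile(fileext = ".csv")
writeLines(c("name,when,value",
             "a,2017-02-21,1.5",
             "b,2017-02-22,2.5"), path)

base_df <- read.csv(path)   # base R: a plain data.frame
tidy_df <- read_csv(path)   # readr: a tibble

class(base_df$when)  # character or factor, depending on R version and options
class(tidy_df$name)  # "character"
class(tidy_df$when)  # "Date"
```

Note that the base R result depends on the stringsAsFactors default of the R version in use, while the readr result does not.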
Plus readr will spit out the column specifications used to parse your file.
library(readr)
myfile <- read_csv("path/to/file.csv")
## Parsed with column specification:
## cols(
##   uid = col_integer(),
##   date = col_date("%y/%m/%d"),
##   measurement = col_double(),
##   :
##   description = col_character()
## )
The same type guessing is available through the very handy type_convert function, which re-parses the character columns of a data frame you already have. That makes it quite useful when reading in data from other types of sources as well. As a best practice, take the column specification output and pass it to the col_types argument of any of the read_* functions. This ensures that your data import script is consistent and reproducible.
myfile <- read_csv(file = "path/to/file.csv",
                   col_types = cols(
                     uid = col_integer(),
                     date = col_date("%y/%m/%d"),
                     measurement = col_double(),
                     # : (other columns go here)
                     description = col_character()
                   ))
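The following is a runnable sketch of both ideas, assuming a made-up four-column file that mirrors the column names used above. The names path, spec, dat, raw, and converted are hypothetical.

```r
library(readr)

# Made-up file whose columns mirror the specification shown above.
path <- tempfile(fileext = ".csv")
writeLines(c("uid,date,measurement,description",
             "1,17/02/21,3.14,first",
             "2,17/02/22,2.72,second"), path)

# An explicit specification: the import no longer depends on guessing.
spec <- cols(
  uid = col_integer(),
  date = col_date("%y/%m/%d"),
  measurement = col_double(),
  description = col_character()
)
dat <- read_csv(path, col_types = spec)

# type_convert() applies readr's type guessing to character columns of an
# existing data frame, e.g. one that came from a database or an API.
raw <- data.frame(n = c("1", "2"),
                  d = c("2017-02-21", "2017-02-22"),
                  stringsAsFactors = FALSE)
converted <- type_convert(raw)
str(converted)
```

Passing the specification explicitly also means a changed or malformed input file produces parsing problems rather than silently different column types.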
Sometimes you’ll need to read and write data that’s in a format used by some other statistical package (e.g. SAS, SPSS, or Stata). Strictly speaking, these aren’t flat files. This is where haven comes in.
haven
Haven is a wrapper around ReadStat, a C library and command-line tool for reading and writing SAS, SPSS, and Stata files. Like readr, haven is a part of the tidyverse ecosystem.
The main functions are read_sas, read_sav, and read_dta. The outputs are tibbles and, as with readr, date/times are parsed and strings are not converted to factors.
For the most part, haven works pretty much the same as the readr functions (it is part of the tidyverse after all).
bolts <- read_sas(data_file = "bolts.sas7bdat")
# # A tibble: 41 × 8
#      run speed1 total speed2 number2  sens  time t20bolt
#    <dbl>  <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl>   <dbl>
# 1     NA     NA    NA     NA      NA    NA    NA      NA
# 2     25      2    10    1.5       0     6  5.70   11.40
# 3     24      2    10    1.5       0    10 17.56   35.12
# 4     30      2    10    1.5       2     6 11.28   22.56
# 5      2      2    10    1.5       2    10  8.39   16.78
# 6     40      2    10    2.5       0     6 16.67   33.34
# 7     37      2    10    2.5       0    10 12.04   24.08
# 8     16      2    10    2.5       2     6  9.22   18.44
# 9     22      2    10    2.5       2    10  3.94    7.88
# 10    33      2    30    1.5       0     6 27.02   18.01
# # ... with 31 more rows
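If you don’t have a SAS file handy, a round trip through a Stata file is one way to try haven out. This is only a sketch: the data frame df is made up, and write_dta and read_dta are used here in place of read_sas so the example needs no external file.

```r
library(haven)

# A small made-up data frame to write out and read back.
df <- data.frame(run = c(25, 24, 30), speed1 = c(2, 2, 2))

# Write a Stata .dta file to a temporary location, then read it back.
path <- tempfile(fileext = ".dta")
write_dta(df, path)
back <- read_dta(path)

back          # a tibble, just like the read_sas() result above
class(back)
```

The reading functions for SPSS (read_sav) and SAS (read_sas) follow the same pattern of taking a file path and returning a tibble.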
But perhaps the most overlooked feature of haven is “tagged” missing values.
Tagged missing values are special NA values. They behave just like regular missing values but store one additional byte, a tag. Haven uses them to represent the multiple missing value types in SAS (.A-.Z and ._) and Stata (.A-.Z), and you can create them yourself with tagged_na. There is a similar function, labelled_spss, for handling user-defined missing values from SPSS. Essentially, haven extends the NA type so you can explicitly code for different missing types.
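A short sketch of tagged missing values in practice, using haven’s tagged_na, na_tag, and is_tagged_na on a made-up numeric vector x (tagged values only work in double vectors):

```r
library(haven)

# Two ordinary values, two tagged missing values, and one regular NA.
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

is.na(x)              # tagged values count as missing, like any other NA
na_tag(x)             # the tag for tagged values, NA for everything else
is_tagged_na(x, "a")  # TRUE only where the tag is "a"
```

Because tagged values pass is.na, existing code that handles missing data keeps working; the tag only matters when you ask for it.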
Readr makes it easy to work with flat files in R. The functions are fast, consistent, and return tibbles. They don’t convert strings to factors and automatically parse common datetime formats.
If you’re working with SAS, SPSS, or Stata files, haven has you covered. Haven makes transferring data between R and SAS, SPSS, and Stata simple. Plus haven brings support for tagged missing values.