Vroom Benchmarks

vroom is a new approach to reading delimited and fixed width data into R.

It stems from the observation that, when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

Therefore, you can obtain very rapid input by first performing a fast indexing step and then using the Altrep framework, available in R 3.5 and later, to access the values in a lazy (delayed) fashion.

How it works

The initial read of the file simply records the location of each individual record; the actual values are not read into R. Altrep vectors are created for each column in the data, each holding a pointer to the index and the memory-mapped file. When these vectors are indexed, the value is read from the memory mapping.
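
As a rough illustration of the idea (not vroom's actual implementation, which is compiled code working on a memory-mapped file), the following base R sketch builds an index of record locations and then reads a single record on demand. The file contents and the names starts, ends, and read_record are made up for this example.

```r
# Toy illustration only. All names and data here are hypothetical.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")  # binary mode so every line ends in exactly "\n"
writeLines(c("id,name", "1,apple", "2,banana", "3,cherry"), con, sep = "\n")
close(con)

# Indexing step: scan the bytes once and record where each record starts and
# ends. No field is converted to an R value here.
bytes <- readBin(path, what = "raw", n = file.size(path))
newlines <- which(bytes == as.raw(10L))            # positions of "\n"
starts <- c(1L, newlines[-length(newlines)] + 1L)  # first byte of each record
ends <- newlines - 1L                              # last byte before "\n"

# Lazy access: seek to one record and parse only that record when asked.
read_record <- function(i) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = starts[i] - 1L)                # seek() offsets are 0-based
  line <- rawToChar(readBin(con, what = "raw", n = ends[i] - starts[i] + 1L))
  strsplit(line, ",", fixed = TRUE)[[1]]
}

read_record(3)  # the header is record 1, so this is the second data row
```

The index costs a few integers per record, while the parsing cost is paid only for the records that are actually requested.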

This means the initial read is extremely fast: in the real-world dataset below it takes about 1/4 the time of the multi-threaded data.table::fread(). Sampling operations are likewise extremely fast, as only the data actually included in the sample is read. This means things like the tibble print method and calls to head(), tail(), x[sample(), ], etc. have very low overhead. Filtering can also be fast: only the columns included in the filter condition have to be fully read, and only the data in the filtered rows needs to be read from the remaining columns. Grouped aggregations likewise only need to read the grouping variables and the variables being aggregated.
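
A minimal sketch of this access pattern, assuming a recent vroom release in which the argument controlling Altrep is called altrep (older releases may name it differently). The file and its columns are invented for illustration.

```r
library(vroom)

# Hypothetical two-column file with 1000 data rows.
path <- tempfile(fileext = ".csv")
writeLines(c("id,fruit", paste0(1:1000, ",fruit_", 1:1000)), path)

# Reading builds the index; with Altrep enabled, values are parsed on access.
x <- vroom(path, delim = ",", col_types = "ic", altrep = TRUE)

head(x, 3)                  # needs only the first three rows
x[sample(nrow(x), 5), ]     # needs only the five sampled rows
x[x$id > 998L, ]            # reads `id` in full, `fruit` only for matching rows
```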

Once a particular vector is fully materialized, the speed of all subsequent operations should be identical to that of a normal R vector.

This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.

Reading delimited files

The following benchmarks all measure reading delimited files of various sizes and data types. Because vroom delays reading, the benchmarks also do some manipulation of the data afterwards to provide a more realistic performance comparison.

Because the read.delim results are so much slower than the others, they are excluded from the plots but retained in the tables.

Taxi Trip Dataset

This real-world dataset is the Freedom of Information Law (FOIL) Taxi Trip Data from the NYC Taxi and Limousine Commission for 2013, originally posted at https://chriswhong.com/open-data/foil_nyc_taxi/. It is also hosted on archive.org.

The first table, trip_fare_1.csv, is 1.55G in size.

#> Observations: 14,776,615
#> Variables: 11
#> $ medallion       <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license    <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id       <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type    <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount     <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge       <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax         <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount    <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount    <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...

Taxi Benchmarks

code: bench/taxi

All benchmarks were run on an Amazon EC2 m5.4xlarge instance with 16 vCPUs and an EBS volume.

The benchmarks labeled vroom_base use vroom with base functions for manipulation. vroom_dplyr uses vroom to read the file and dplyr functions to manipulate it. data.table uses fread() to read the file and data.table functions to manipulate it, and readr uses readr to read the file and dplyr to manipulate it. By default vroom only uses Altrep for character vectors; these benchmarks are labeled vroom(altrep: normal). The benchmarks labeled vroom(altrep: full) instead use Altrep vectors for all supported types, and those labeled vroom(altrep: none) disable Altrep entirely.
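
The three Altrep settings can be sketched as follows. This assumes a recent vroom release where the setting is passed through the altrep argument (the benchmark scripts themselves may have used a different argument name), and the tiny file is invented for illustration.

```r
library(vroom)

# Hypothetical stand-in for the taxi fare file.
path <- tempfile(fileext = ".csv")
writeLines(c("payment_type,fare_amount", "CSH,6.5", "CRD,6.0", "UNK,5.5"), path)

# Altrep for character columns only (the "normal" setting described above).
normal <- vroom(path, delim = ",", col_types = "cd", altrep = "chr")
# Altrep for every supported column type ("full").
full <- vroom(path, delim = ",", col_types = "cd", altrep = TRUE)
# Altrep disabled, so every column is parsed up front ("none").
none <- vroom(path, delim = ",", col_types = "cd", altrep = FALSE)

# The setting changes when values are parsed, not what they are.
all.equal(full$fare_amount, none$fare_amount)
```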

The following operations are performed.

  • The data is read
  • print() - N.B. read.delim uses print(head(x, 10)) because printing the whole dataset takes > 10 minutes
  • head()
  • tail()
  • Sampling 100 random rows
  • Filtering for “UNK” payment, which matches 6,434 rows (0.0435% of the total).
  • Aggregation of mean fare amount per payment type.
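
For orientation, the operations listed above look roughly like this in base R. This is only a sketch on a small invented data frame, not the benchmark code itself, which lives in bench/taxi.

```r
# Small stand-in for the taxi fare table; the values are invented.
fares <- data.frame(
  payment_type = c("CSH", "CRD", "UNK", "CSH", "CRD", "UNK"),
  fare_amount  = c(6.5, 6.0, 5.5, 5.0, 9.5, 34.0),
  stringsAsFactors = FALSE
)

print(fares)
head(fares)
tail(fares)

# Sampling random rows (100 in the benchmark, 3 here).
fares[sample(nrow(fares), 3), ]

# Filtering for "UNK" payments.
unk <- fares[fares$payment_type == "UNK", ]

# Mean fare amount per payment type.
mean_fare <- tapply(fares$fare_amount, fares$payment_type, mean)
mean_fare
```
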

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          7.04GB   1m 14.4s  6ms    1ms   1ms   1ms     1.1s    888ms      1m 16.4s
readr            dplyr                         6.88GB   32.6s     81ms   1ms   1ms   9ms     232ms   511ms      33.4s
vroom            dplyr                 FALSE   6.53GB   16.3s     81ms   1ms   1ms   9ms     931ms   1.2s       18.6s
data.table       data.table                    6.31GB   13.1s     13ms   1ms   1ms   1ms     109ms   211ms      13.4s
vroom            base                  TRUE    8.11GB   1.1s      77ms   1ms   1ms   1ms     1.1s    8.5s       10.8s
vroom            dplyr                 TRUE    8.18GB   1.1s      87ms   1ms   1ms   8ms     1.4s    4.1s       6.7s

(N.B. Rcpp, used in the dplyr implementation, fully materializes all the Altrep numeric vectors when using filter() or sample_n(), which is why the first of these cases has additional overhead when using full Altrep.)

All numeric data

All numeric data is really a worst-case scenario for vroom. The index takes about as much memory as the parsed data. Also, because parsing doubles can be done quickly in parallel and text representations of doubles are only ~25 characters at most, there isn’t a great deal of savings from delayed parsing.

For these reasons (and because the data.table implementation is very fast), vroom is a bit slower than fread for pure numeric data.

However, because vroom is multi-threaded, it is a bit quicker than readr and read.delim for this type of data.

Long

code: bench/all_numeric-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          4.78GB   1m 58.7s  1.5s   1ms   1ms   1ms     4.7s    40ms       2m 4.8s
readr            dplyr                         2.86GB   14s       97ms   1ms   1ms   11ms    16ms    99ms       14.3s
vroom            base                  FALSE   2.73GB   1.2s      88ms   1ms   1ms   3ms     6ms     93ms       1.4s
vroom            dplyr                 FALSE   2.74GB   1s        95ms   1ms   1ms   10ms    16ms    47ms       1.2s
vroom            base                  TRUE    3.27GB   345ms     99ms   1ms   1ms   3ms     26ms    236ms      709ms
vroom            dplyr                 TRUE    3.25GB   280ms     99ms   1ms   1ms   50ms    40ms    223ms      691ms
data.table       data.table                    2.64GB   263ms     14ms   1ms   1ms   3ms     6ms     25ms       309ms

Wide

code: bench/all_numeric-wide

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          14.42GB  9m 27.8s  142ms  7ms   7ms   9ms     77ms    5ms        9m 28.1s
readr            dplyr                         5.45GB   59.1s     130ms  3ms   3ms   40ms    23ms    53ms       59.3s
vroom            dplyr                 FALSE   5.33GB   5.1s      137ms  2ms   2ms   27ms    81ms    59ms       5.4s
vroom            base                  FALSE   5.32GB   5.2s      137ms  2ms   2ms   4ms     5ms     5ms        5.3s
data.table       data.table                    5.46GB   1.3s      110ms  1ms   1ms   3ms     4ms     4ms        1.4s
vroom            base                  TRUE    7.24GB   1.1s      143ms  5ms   5ms   6ms     11ms    41ms       1.3s
vroom            dplyr                 TRUE    7.24GB   933ms     147ms  4ms   5ms   28ms    36ms    96ms       1.2s

All character data

All character data is a best-case scenario for vroom when using Altrep, as it takes full advantage of the lazy reading.

Long

code: bench/all_character-long

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          4.44GB   1m 43.1s  8ms    1ms   1ms   2ms     28ms    414ms      1m 43.6s
readr            dplyr                         4.34GB   1m 2.6s   92ms   1ms   1ms   12ms    18ms    346ms      1m 3.1s
vroom            dplyr                 FALSE   4.29GB   51.7s     92ms   1ms   1ms   11ms    18ms    270ms      52.1s
data.table       data.table                    4.76GB   38.3s     16ms   1ms   1ms   4ms     16ms    266ms      38.6s
vroom            base                  TRUE    3.2GB    339ms     89ms   1ms   1ms   3ms     141ms   2s         2.6s
vroom            dplyr                 TRUE    3.17GB   249ms     98ms   1ms   1ms   10ms    155ms   1.2s       1.7s

Wide

code: bench/all_character-wide

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          13.06GB  9m 13.8s  173ms  7ms   7ms   24ms    211ms   66ms       9m 14.3s
readr            dplyr                         12.22GB  6m 0.2s   135ms  3ms   3ms   34ms    45ms    79ms       6m 0.5s
vroom            dplyr                 FALSE   12.13GB  4m 1.1s   135ms  2ms   3ms   33ms    45ms    63ms       4m 1.4s
data.table       data.table                    12.61GB  2m 48.8s  159ms  1ms   1ms   29ms    159ms   26ms       2m 49.2s
vroom            base                  TRUE    6.55GB   1.1s      132ms  5ms   5ms   6ms     49ms    256ms      1.5s
vroom            dplyr                 TRUE    6.55GB   939ms     136ms  4ms   4ms   28ms    75ms    158ms      1.3s

Reading multiple delimited files

code: bench/taxi_multiple

The benchmark reads all 12 files in the taxi trip fare data, totaling 173,179,759 rows and 11 columns for a total file size of 18.4G.
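
Reading several files with the same columns can be sketched by passing a character vector of paths to vroom(). The three tiny files below are invented stand-ins; their names only mimic the trip_fare_N.csv naming.

```r
library(vroom)

# Create three hypothetical files that share the same header.
dir <- tempfile("fares_")
dir.create(dir)
files <- file.path(dir, sprintf("trip_fare_%d.csv", 1:3))
for (i in seq_along(files)) {
  writeLines(c("payment_type,fare_amount", sprintf("CSH,%d", i)), files[i])
}

# A vector of paths is read into one combined data frame.
all_fares <- vroom(files, delim = ",", col_types = "cd")
nrow(all_fares)
```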

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
readr            dplyr                         57.8GB   7m 54.2s  80ms   1ms   1ms   9ms     4.1s    12.6s      8m 11s
data.table       data.table                    59.6GB   4m 14.6s  7ms    1ms   1ms   1ms     1.1s    12s        4m 27.6s
vroom            dplyr                 FALSE   57.4GB   3m 31s    1.8s   1ms   1ms   12ms    11.2s   7s         3m 51s
vroom            base                  TRUE    76.9GB   13.2s     2.6s   1ms   1ms   1ms     17.7s   1m 56.8s   2m 30.3s
vroom            dplyr                 TRUE    77.3GB   14.5s     2.6s   1ms   1ms   8ms     19.6s   54.5s      1m 31.2s

Reading fixed width files

United States Census 5-Percent Public Use Microdata Sample files

This fixed width dataset contains individual records of the characteristics of a 5 percent sample of people and housing units from the year 2000 and is freely available at https://web.archive.org/web/20150908055439/https://www2.census.gov/census_2000/datasets/PUMS/FivePercent/California/all_California.zip. The data is split into files by state, and the state of California was used in this benchmark.

The data totals 2,342,339 rows and 37 columns with a total file size of 677M.
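
Fixed width files are read with vroom_fwf() and a description of the column positions. The layout below (a 2-character state, a 3-character serial, and a 3-character age) is a made-up example and not the actual PUMS record layout.

```r
library(vroom)

# Hypothetical fixed width records: state (2), serial (3), age (3).
path <- tempfile(fileext = ".txt")
writeLines(c("CA001 34", "CA002 51", "CA003  7"), path)

census <- vroom_fwf(
  path,
  col_positions = fwf_widths(c(2, 3, 3), c("state", "serial", "age")),
  col_types = "cci"
)
census
```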

Census data benchmarks

code: bench/fwf

reading package  manipulating package  altrep  memory   read      print  head  tail  sample  filter  aggregate  total
read.delim       base                          6.16GB   16m 0.7s  16ms   1ms   2ms   3ms     334ms   336ms      16m 1.4s
readr            dplyr                         5.66GB   29.6s     104ms  1ms   1ms   17ms    96ms    96ms       29.9s
vroom            dplyr                 FALSE   5.38GB   14.2s     103ms  1ms   1ms   17ms    459ms   93ms       14.9s
vroom            base                  TRUE    4.06GB   148ms     103ms  1ms   1ms   6ms     249ms   1.7s       2.2s
vroom            dplyr                 TRUE    4.07GB   151ms     105ms  1ms   1ms   55ms    280ms   1.1s       1.7s

Writing delimited files

code: bench/taxi_writing

The benchmarks write out the taxi trip dataset in a few different ways.
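
As a small sketch of what writing with vroom looks like, the example below writes an invented data frame as a tab-delimited file and as a gzip-compressed comma-delimited file. These two variants are chosen for illustration only; the exact set of variants measured is defined in bench/taxi_writing.

```r
library(vroom)

# Hypothetical data standing in for the taxi fare table.
fares <- data.frame(payment_type = c("CSH", "CRD"), fare_amount = c(6.5, 6.0))

# Tab-delimited output.
tsv <- tempfile(fileext = ".tsv")
vroom_write(fares, tsv, delim = "\t")

# Comma-delimited output, compressed because of the ".gz" extension.
csv_gz <- tempfile(fileext = ".csv.gz")
vroom_write(fares, csv_gz, delim = ",")

readLines(tsv)
```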