
CSV vs Parquet formats for data storage

TL;DR

On a single 2 GB file of mixed data types, the Parquet support built into the Pandas package is faster than reading and writing CSV and produces smaller files by significant margins. The only downside is that the files cannot be read natively by Excel.

  • Writing files is over twenty times faster
  • Reading files is nearly four times faster
  • File sizes are thirteen times smaller

What is Parquet?

Parquet is an open-source standard for storing columnar data. Originally developed by Twitter and Cloudera, it is now maintained by the Apache Software Foundation. It is supported in both the Python and R programming languages and by many SQL engines and distributed processing frameworks such as Hadoop and Spark. It employs a variety of strategies to store data compactly and is at its most effective on numeric data. See Parquet Methodology Presentation for more details.
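
A practical consequence of the columnar layout is that a subset of columns can be read without scanning the rest of the file. A minimal sketch using Pandas is below; the file and column names are hypothetical.

import pandas as pd

# Because Parquet stores each column contiguously, Pandas can load
# just the requested columns instead of parsing every row
# (file and column names here are hypothetical)
df = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])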

Testing Methodology

I have used a single large DataFrame held in memory in Python - it contains a mix of numeric, string and datetime columns, which should be fairly representative of a typical dataset for analysis.

I save this DataFrame as both a CSV and a Parquet file, with and without compression, and repeat each test five times.

For each format I measure the time to write the file, the size it occupies on disk, and how long it takes to read the file back into memory.

All tests are run on Windows 10 on an SSD.
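
A minimal sketch of this kind of timing harness is below. The column mix, row count and file names are illustrative assumptions rather than the exact code used; see the Testing Notebook for the real setup.

import os
import time

import numpy as np
import pandas as pd

# Illustrative DataFrame with the same mix of types as the test data;
# scale `rows` up to approximate a ~2 GB file
rows = 1_000_000
df = pd.DataFrame({
    "value": np.random.rand(rows),                                 # numeric
    "label": np.random.choice(["a", "b", "c"], rows),              # string
    "when": pd.date_range("2020-01-01", periods=rows, freq="s"),   # datetime
})

def timed(func):
    """Return how many seconds func takes to run."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

csv_write = timed(lambda: df.to_csv("test.csv", index=False))
csv_read = timed(lambda: pd.read_csv("test.csv"))
pq_write = timed(lambda: df.to_parquet("test.parquet"))
pq_read = timed(lambda: pd.read_parquet("test.parquet"))

print(f"csv:     write {csv_write:.1f}s, read {csv_read:.1f}s, "
      f"{os.path.getsize('test.csv') / 1e6:.0f} MB")
print(f"parquet: write {pq_write:.1f}s, read {pq_read:.1f}s, "
      f"{os.path.getsize('test.parquet') / 1e6:.0f} MB")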

See the Testing Notebook for more details.

Results

The Parquet files are faster and smaller than both CSV methodologies on every measure. The only measure on which the CSV file was able to keep up was compressed file size, and this came at the cost of additional I/O time over and above the uncompressed CSV times.

test                      write_time   read_time   file_size
csv_compression               443.7s       44.0s      193 MB
csv_no_compression            345.9s       34.0s    2,435 MB
parquet_compression            16.0s        9.0s      182 MB
parquet_no_compression         15.8s        9.2s      182 MB

Drawbacks

The numbers above are heavily in the Parquet format's favour, but there are some downsides to consider. The Parquet format is not natively supported by Excel or most text editors, so sharing data with non-programming teams would require an extra conversion step. Given that the benefits of the format are felt most on very large files, which are difficult to work with in tools like Excel anyway, it may be best to pick formats on a case-by-case basis. The ease of use within Python Pandas supports such a strategy.

Python Implementation

Parquet is very easy to adopt within Python through the Pandas package, which provides native read_parquet and to_parquet functions. The pyarrow package does need to be installed in addition to Pandas, as it is an optional dependency rather than a core requirement of Pandas itself.

Code examples:

import pandas as pd

# Read a parquet file into a DataFrame
df = pd.read_parquet("filename.parquet")

# Write a DataFrame out as a parquet file
df.to_parquet("filename.parquet")
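
The compressed variants in the benchmark above are driven by keyword arguments on these same functions. A brief sketch (the file names are placeholders):

# to_parquet compresses with snappy by default; gzip trades write speed
# for smaller files, and compression=None disables compression entirely
df.to_parquet("filename.parquet", compression="gzip")
df.to_parquet("filename.parquet", compression=None)

# to_csv infers gzip compression from the .gz file extension
df.to_csv("filename.csv.gz", index=False)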

Pandas Documentation

R Implementation

The arrow library is available in R to read and write Parquet files. See Ursa Labs - Columnar Performance for more details.
