
CSV vs Parquet formats for data storage

TL;DR

On a single 2 GB file of mixed data types, the Parquet support built into the Pandas package is faster than reading and writing CSV and produces smaller files by significant margins. The only downside is that the files cannot be read natively by Excel.

  • Writing files is over twenty times faster
  • Reading files is nearly four times faster
  • File sizes are thirteen times smaller

What is Parquet?

Parquet is an open-source standard for storing columnar data. Originally developed by Twitter and Cloudera, it is now maintained by the Apache Software Foundation. It is supported in both the Python and R programming languages and by many SQL engines and distributed processing frameworks such as Hadoop and Spark. It employs a variety of strategies to store data compactly and is at its most effective on numeric data. See Parquet Methodology Presentation for more details.
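
A practical consequence of the columnar layout is that a subset of columns can be read without scanning the rest of the file. A minimal sketch using Pandas is below; the file and column names are hypothetical.

import pandas as pd

# Because Parquet stores each column contiguously, Pandas can load
# just the requested columns instead of parsing every row
# (file and column names here are hypothetical)
df = pd.read_parquet("events.parquet", columns=["user_id", "timestamp"])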

Testing Methodology

I have used a single large DataFrame held in memory in Python - it contains a mix of numeric, string and datetime columns, which should be fairly representative of a typical dataset for analysis.

I save this DataFrame as both a CSV and a Parquet file, with and without compression, and repeat each test five times.

For each format I measure the time to write the file, the size it occupies on disk, and how long it takes to read the file back into memory.

All tests are run on Windows 10 on an SSD.
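
A minimal sketch of this kind of timing harness is below. The column mix, row count and file names are illustrative assumptions rather than the exact code used; see the Testing Notebook for the real setup.

import os
import time

import numpy as np
import pandas as pd

# Illustrative DataFrame with the same mix of types as the test data;
# scale `rows` up to approximate a ~2 GB file
rows = 1_000_000
df = pd.DataFrame({
    "value": np.random.rand(rows),                                 # numeric
    "label": np.random.choice(["a", "b", "c"], rows),              # string
    "when": pd.date_range("2020-01-01", periods=rows, freq="s"),   # datetime
})

def timed(func):
    """Return how many seconds func takes to run."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

csv_write = timed(lambda: df.to_csv("test.csv", index=False))
csv_read = timed(lambda: pd.read_csv("test.csv"))
pq_write = timed(lambda: df.to_parquet("test.parquet"))
pq_read = timed(lambda: pd.read_parquet("test.parquet"))

print(f"csv:     write {csv_write:.1f}s, read {csv_read:.1f}s, "
      f"{os.path.getsize('test.csv') / 1e6:.0f} MB")
print(f"parquet: write {pq_write:.1f}s, read {pq_read:.1f}s, "
      f"{os.path.getsize('test.parquet') / 1e6:.0f} MB")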

See the Testing Notebook for more details.

Results

The Parquet files are faster and smaller than both CSV methodologies on every measure. The only measure on which the CSV file was able to keep up was compressed file size, and this came at the cost of additional I/O time over and above the uncompressed CSV times.

test                      write_time   read_time   file_size
csv_compression               443.7s       44.0s      193 MB
csv_no_compression            345.9s       34.0s    2,435 MB
parquet_compression            16.0s        9.0s      182 MB
parquet_no_compression         15.8s        9.2s      182 MB

Drawbacks

The numbers above are heavily in the Parquet format's favour, but there are some downsides to consider. The Parquet format is not natively supported by Excel or most text editors, so sharing data with non-programming teams would require an extra conversion step. Given that the benefits of the format are felt most on very large files, which are difficult to work with in tools like Excel anyway, it may be best to pick formats on a case-by-case basis. The ease of use within Python Pandas supports such a strategy.

Python Implementation

Parquet is very easy to adopt within Python through the Pandas package, which provides native read_parquet and to_parquet functions. The pyarrow package does need to be installed in addition to Pandas, as it is an optional dependency rather than a core requirement of Pandas itself.

Code examples:

import pandas as pd

# Read a parquet file into a DataFrame
df = pd.read_parquet("filename.parquet")

# Write a DataFrame out as a parquet file
df.to_parquet("filename.parquet")
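
The compressed variants in the benchmark above are driven by keyword arguments on these same functions. A brief sketch (the file names are placeholders):

# to_parquet compresses with snappy by default; gzip trades write speed
# for smaller files, and compression=None disables compression entirely
df.to_parquet("filename.parquet", compression="gzip")
df.to_parquet("filename.parquet", compression=None)

# to_csv infers gzip compression from the .gz file extension
df.to_csv("filename.csv.gz", index=False)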

Pandas Documentation

R Implementation

The arrow library is available in R to read and write Parquet files. See Ursa Labs - Columnar Performance for more details.
