Using the Parquet format built into the Pandas package on a single 2 GB file of mixed data types is faster than reading and writing a CSV, and it produces smaller files by significant margins. The only downside is that the files cannot be read natively by Excel.
- Writing files is over twenty times faster
- Reading files is nearly four times faster
- File sizes are thirteen times smaller
Parquet is an open-source standard for storing columnar data. Originally developed by Twitter and Cloudera, it is now maintained by the Apache Software Foundation. It is supported in both the Python and R programming languages and by many SQL engines and distributed processing frameworks such as Hadoop and Spark. It employs a variety of strategies to store data efficiently, and works best on numeric data. See the Parquet Methodology Presentation for more details.
I have used a single large file held in memory in Python. It contains a mix of numeric, string and datetime formatted data, which should be fairly representative of an average dataset for analysis.
I am saving this file as both a CSV and a Parquet file, with and without compression, and repeating each test five times.
I am measuring the time to write the file, the size it takes up on disk, and how long it takes to read that file back into memory. A minimal sketch of the approach is shown below.
This is being done on Windows 10 with an SSD drive.
See the Testing Notebook for more details.
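The Testing Notebook has the full code; the following is only a minimal sketch of the timing approach, with a small stand-in DataFrame, illustrative file names, and gzip assumed as the CSV compression codec:

```python
import os
import time

import pandas as pd

# A small frame of mixed types to stand in for the real 2 GB file
df = pd.DataFrame({
    "num": range(1_000_000),
    "text": "example",
    "when": pd.Timestamp("2020-01-01"),
})

def benchmark(df, path, writer, reader):
    """Time a write, measure the file on disk, then time a read-back."""
    start = time.perf_counter()
    writer(df, path)
    write_time = time.perf_counter() - start

    file_size = os.path.getsize(path) / 1024 ** 2  # size on disk in MB

    start = time.perf_counter()
    reader(path)
    read_time = time.perf_counter() - start
    return write_time, read_time, file_size

results = {
    "csv_compression": benchmark(
        df, "test.csv.gz",
        lambda d, p: d.to_csv(p, compression="gzip"),
        lambda p: pd.read_csv(p, compression="gzip"),
    ),
    "parquet_compression": benchmark(
        df, "test.parquet",
        lambda d, p: d.to_parquet(p, compression="snappy"),
        lambda p: pd.read_parquet(p),
    ),
}
print(results)
```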
The Parquet files are faster and smaller than both CSV methodologies on every measure. The only measure on which the CSV file was able to keep up was file size when compressed, and even that came at the cost of additional I/O time over and above the uncompressed CSV times.
| test | write_time | read_time | file_size |
|---|---|---|---|
| csv_compression | 443.7 s | 44.0 s | 193 MB |
| csv_no_compression | 345.9 s | 34.0 s | 2,435 MB |
| parquet_compression | 16.0 s | 9.0 s | 182 MB |
| parquet_no_compression | 15.8 s | 9.2 s | 182 MB |
The numbers above are heavily in the Parquet format's favour, but there are some downsides to consider. The Parquet format is not natively supported by Excel or most text editors, so sharing data with non-programming teams would require an additional translation step (a one-liner, as shown below). Given that the benefits of this file format are felt most on very large files, which are difficult to work with in tools like Excel anyway, it may be best to pick formats on a case-by-case basis. The ease of use within Python Pandas supports such a strategy.
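That translation step is straightforward in Pandas (the file names here are illustrative):

```python
import pandas as pd

# Convert a Parquet file to CSV for colleagues working in Excel
pd.read_parquet("analysis.parquet").to_csv("analysis.csv", index=False)
```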
Parquet is very easy to adopt within Python through the Pandas package, which provides native read_parquet and to_parquet functions. The pyarrow package does need to be installed in addition to Pandas, as it is not a core requirement of Pandas itself but an optional add-on.
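It can be installed with pip:

```bash
pip install pyarrow
```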
Code examples:

```python
import pandas as pd

# Read a Parquet file into a DataFrame
df = pd.read_parquet("filename.parquet")

# Write the DataFrame back out as a Parquet file
df.to_parquet("filename.parquet")
```
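Both to_parquet and to_csv also accept a compression argument, which is how the compressed and uncompressed variants can be produced. A brief sketch (snappy is the Pandas default codec for Parquet; the gzip choice for CSV is an assumption):

```python
# Parquet: snappy compression is the Pandas default; pass None to disable
df.to_parquet("filename.parquet", compression="snappy")
df.to_parquet("filename.parquet", compression=None)

# CSV: compression is inferred from the extension unless set explicitly
df.to_csv("filename.csv.gz", compression="gzip")
```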
The arrow library is available in R to read and write Parquet files. See Ursa Labs - Columnar Performance for more details.