This repo contains two tangentially related experiments.
- I needed an excuse to play around with asynchronous programming, so I built a tool to download subsets of MERRA-2 reanalysis data. I added rate and concurrency limiting to avoid hogging server resources (a sketch of that approach follows below).
  - I found httpx easier to work with than aiohttp. In fact, I never got aiohttp working at all, due to an arcane problem with NASA's OAuth: their session cookie uses an outdated timestamp format that the standard library's CookieJar class no longer supports. httpx worked right out of the box.
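A minimal sketch of the rate and concurrency limiting, assuming httpx plus plain asyncio primitives; the URL, credentials, and names like `download_all` are illustrative placeholders, not the repo's actual code:

```python
import asyncio

import httpx

MAX_CONCURRENT = 4  # simultaneous downloads allowed
MIN_INTERVAL = 0.5  # minimum seconds between request starts

async def download_all(urls: list[str]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)  # concurrency limit
    gate = asyncio.Lock()                    # serializes the rate limit

    async def fetch(client: httpx.AsyncClient, url: str) -> None:
        async with sem:
            # Space out request starts instead of firing them all at once.
            async with gate:
                await asyncio.sleep(MIN_INTERVAL)
            resp = await client.get(url)
            resp.raise_for_status()
            with open(url.rsplit("/", 1)[-1], "wb") as f:
                f.write(resp.content)

    # follow_redirects matters because Earthdata bounces requests through
    # its OAuth endpoint; the auth tuple is a placeholder for real
    # Earthdata credentials.
    async with httpx.AsyncClient(
        follow_redirects=True, auth=("user", "password"), timeout=60.0
    ) as client:
        await asyncio.gather(*(fetch(client, u) for u in urls))

# asyncio.run(download_all(["https://example.com/MERRA2_file.nc4"]))
```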
- I also needed an excuse to try the ClickHouse columnar database, so I used the MERRA-2 data from above to test various data compression schemes in ClickHouse.
  - For floating-point data, the most impactful choice was simply how much false precision to keep. Discarding false precision can make sense (watch out for derivatives!), but the risk of being wrong, or of not anticipating a future use case, generally outweighs the reward: disk space is cheap, and re-collecting data may not even be possible.
  - The choice of compression codec in ClickHouse only mattered for floats when the data was oversampled or contained unused bits (trailing zeros) that could be squeezed out. In those cases, the Gorilla XOR codec was the most effective; the first sketch after this list compares codecs on a small table.
  - For lossy compression of multi-dimensional data, a better option would probably be zfp ("jpeg for spatial data"); the second sketch below shows its Python bindings.
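A minimal sketch of the codec comparison, assuming a local ClickHouse server reachable through the clickhouse-connect driver; the table, its columns, and `source_table` are hypothetical names:

```python
import clickhouse_connect  # assumes: pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# The same float column stored three ways: default LZ4, the Gorilla XOR
# codec, and Gorilla after rounding away false precision.
client.command("""
    CREATE TABLE IF NOT EXISTS merra2_codec_demo (
        ts          DateTime,
        t2m_lz4     Float32 CODEC(LZ4),
        t2m_gorilla Float32 CODEC(Gorilla),
        t2m_rounded Float32 CODEC(Gorilla)
    ) ENGINE = MergeTree ORDER BY ts
""")

# round() strips false precision; neighboring values then repeat more
# often, and Gorilla encodes a repeated value in a single bit.
# source_table stands in for a table holding the raw observations.
client.command("""
    INSERT INTO merra2_codec_demo
    SELECT ts, t2m, t2m, round(t2m, 2) FROM source_table
""")

# Compare on-disk size per column.
for name, size in client.query("""
    SELECT name, formatReadableSize(data_compressed_bytes)
    FROM system.columns
    WHERE database = currentDatabase() AND table = 'merra2_codec_demo'
""").result_rows:
    print(name, size)
```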
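And a minimal sketch of zfp's error-bounded mode via its zfpy Python bindings; the array is random stand-in data shaped like one day of an hourly MERRA-2 field, and the 0.01 tolerance is an arbitrary choice:

```python
import numpy as np
import zfpy  # assumes: pip install zfpy

# Stand-in for a MERRA-2 field: 24 hourly steps on a 361 x 576 grid.
field = np.random.rand(24, 361, 576).astype(np.float32)

# Lossy compression with an absolute error bound.
compressed = zfpy.compress_numpy(field, tolerance=0.01)
restored = zfpy.decompress_numpy(compressed)

print(f"compression ratio: {field.nbytes / len(compressed):.1f}x")
print(f"max absolute error: {np.abs(field - restored).max():.4f}")
```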