Database design and sample data - proof-of-concept #1

jdduh · 2023-12-12T01:17:42Z

SWE Data Apps Google site page;
https://sites.google.com/site/nwccspatialservices/ebagis/swe-data-apps

Sample data and scripts for loading and querying the database are available at:
ftp://basins.geog.pdx.edu/BAGIS/SNODAS_Tools/proof-of-concept_data/

After updating the file path information in the load_snodas.sql file, run the load_snodas.sql in pgAdmin's PSQL tool to create the tables and load the data into these tables.

Copy and paste the SQL statements in the snodas_queries.sql file to the pgAdmin's database Query Tool to retrieve the SNODAS stats from the database.

Data coverage: (green: existing AOI polygons, pink: SNODAS cells included in the database

jdduh · 2024-01-08T21:40:21Z

The problem with postgres

I have been working on the new database schema and I have run into a bit of a snag. Basically, it's a problem of scaling. I feel like I should have seen this coming, but I love the idea of using SQL to query rasters (it has been an idea of mine I've wanted to explore for a few years now) and my bias probably made me overly optimistic.

In any case, here's the basic problem: storing a cell value as a queryable entity in postgres requires we assign it some kind of identifier. I had been planning to use the pixel coordinate encoded as a quadkey along with the date of the SNODAS dataset to uniquely identify each cell value. I started into the math on that though, and I pretty quickly realized that we'll be duplicating the date value across all cell value rows from the same date, and the quadkey across all cell value rows for the same cell, which is not really viable. What follows are some back-of-the-envelope estimates to put this in perspective.

First off, let's calculate the total storage we need just for the SONDAS SWE values. For the sake of looking ahead, I'm rounding the number of SNODAS raster dates up to 10,000. Assuming we store the full grid (6935 x 3351), we end up needing to store 6935 * 3351 * 10,000 = 232,391,850,000 cells. Just these values alone, assuming a 16-bit int datatype, would be over 460 GB uncompressed.

Now, we can get some clear savings here, such as limiting the cells we are storing. We can drop any with NODATA values--let's assume, perhaps optimistically, ocean and other water cells are about 10% of our cells (I'm sure I could easily get a better number for this, I just haven't), so we can throw that into our equation. We can also easily clip the cells down to just the Western US, as we discussed. Very roughly, I used the S edge of New Mexico and the E edge of Colorado to get the lower and right-hand bounds, respectively, and found we could cut the grid down to the cells 2727 x 2587 from the origin.

Repeating the above calculation with the smaller grid and scaling to 90% data cells gets us to 63,492,741,000, or about 127 GB. Just for the cell values.

The real problem is this: adding a date to each cell value takes the tuple size from 2 bytes for a lone 16-bit int to 6 bytes, as the postgres date type is 4 bytes. This addition alone inflates the data to close to 400 GB. The quadkeys simply compound the problem. If we use a string-encoded quadkey this is a massive hit, because we need to store 14 bytes to address a single cell in the original grid. We could use a 4-byte binary type instead, but even with that improvement we're still nearing 650 GB. And this is excluding the size requirements of any indexes or any postgres metadata.

Now, I mentioned this is uncompressed. Postgres does support compression, but only when dealing with column values greater than 2 KB (this is how the current database is only 200 GB despite storing 8 times the data, because the rasters within the rows are stored on disk in separate compressed blobs). We really don't have any way to effectively compress the SWE data as long as we are trying to store each value as an independent entity.

Other ideas in the postgres box

I think you were suggesting using a new column for each date because it deduplicates the date and quadkey identifiers, is that right? Realizing this might be a reasonable motivation, I considered it, but it quickly breaks down: postgres is not performant with a large number of columns (it has to retrieve the whole record even if wanting only a subset--returned columns are filtered in memory), has a hard limit of 1600 columns per tuple, and has a practical limit potentially less than that (https://www.postgresql.org/docs/current/limits.html.

In our case with 16-bit ints it does look like we could fit 1600 columns into a single page, but using a column per date also makes adding a new cell value an update operation rather than an insert. Updates are bad for performance/storage because the old row is copied to the new row, modified, and the old row is marked as dead but is retained until the table is VACUUMed (https://www.postgresql.org/docs/16/sql-vacuum.html). Thus, the entire table would have to be duplicated every day (not permanently if VACUUM can run, but still, the performance implications of copying so much data is not negated by however short the lifetime of the duplicates ends up being).

A key takeaway: avoiding updates is critical for performance.

I considered an idea of using an array type to store each cell's "time series", which would be a large enough amount of data to be TOAST-able (https://www.postgresql.org/docs/16/storage-toast.html) and therefore compressible, but this suffers from the same update concerns. It also would necessitate using array-specific functions in queries, effectively negating the advantages of SQL.

What about other databases?

I also started thinking about non-postgres databases. That is, recognizing that a cell is effectively a time series, we could consider something like clickhouse to store the data (https://clickhouse.com/). I think this is an interesting idea that might be worth exploring, but I have not used clickhouse before so in the interest of solving this problem in the time I have available I have kinda ruled it out.

I also think this SWE data is not exactly the type of data most time series databases are performant at storing/querying. That is, we have extremely high cardinality (many cells) with a (relatively) low number of observations: going back to the 10,000 number, that's only 20 KB of data per cell time series, but at minimum 7,054,749 cell time series. I'm not super familiar with the way clickhouse specifically stores data, but each cell would be a column in the table and each date would be a row (as it is column-oriented, using a column per cell means we can grab all values for that cell, and dates as rows mean each new date is an insert). With what I understand about column-oriented databases generally, that means we could only compress each column, or 20 KB, somewhat limiting the total effectiveness of the compression. And it also likely means we have to again rewrite the entire table every time we process a new raster. That said, if we only want to read/write one cell at a time then this is a good solution because we'd only have to read and decompress the equivalent of 20 KB.

But we don't really want to read just one cell at a time, e.g., AOIs group a collection of cells together.

Where to go from here

This points me towards a solution that similarly groups more cells together with the hopes that we get better compression and less disparate reads from disk. I'm also motivated to find a solution that minimizes the size of the postgres database given the pain associated with managing the current database.

Regarding the latter of these goals, one idea I've had for the current database is to use out-db rasters, where the raster data is stored as regular files on disk and the db contains only the file path. Thinking about this further with the current use-case in mind, I started wondering about using something like the Cloud-Optimized GeoTIFF (COG) format to store the SNODAS data. In case you are unfamiliar, a key advantage of the COG format over a regular GeoTIFF is the support for random access (https://www.cogeo.org/). This feature is enabled by 1) ensuring all metadata is located at predictable offsets in the header of the file, especially the Image File Directory (IFD) that describes the byte offsets of the image "pages", and 2) using tiles for image pages.

The short version is that storing the SWE data in COG format would provide the ability to read an individual (compressed) tile independently of the others. All we need is the tile offset from the IFD to get the tile byte range; we don't have to read the whole file (or a stripe) to access a small area.

Reframing the problem

Now, where does postgres come into play when I'm talking about COGs? Well, it really doesn't, or at least doesn't have to. Taking a step back and thinking at a higher level about the problem at hand: we really just want a way to aggregate SWE values per date grouped by an elevation range. SQL provides a great interface for expressing such a query, it's just that postgres doesn't really give us the right storage to execute such a query efficiently, at least not out of the box. I think, if we had more time, a native C or Rust extension for postgres to "SQL-ize" raster data is entirely possible. It just doesn't exist, to my knowledge.

I did present this idea more generally to a group of geospatial software engineers back in June to ask why not, and they all said it sounds like a compelling idea that should work. The suggestion was to start with a PoC based around a custom SQL execution engine in Python, compiling SQL statements into raster operations--really, just making a SQL-to-raster-op translation. But we don't have to have such a tool to perform such a translation ourselves, given our operations are relatively static.

Perhaps at this point I've come full circle on this idea, all the way back to its beginning, and the next bit is obvious and what we were trying not to do. But here it goes.

The query we essentially want to run, in pseudo-SQL, is something like

select
swe.date,
dem.elevation_band,
average(swe.value)
from
aoi
join aoi_cells using (station_triplet)
join dem using (cell_coord)
join swe using (cell_coord)
where
aoi.name =
and swe.date in <date_range>
group by
swe.date,
-- hand wave here around the elevation band grouping
dem.elevation_band

A new idea

Looking at this, I think the raster operations are rather straightforward, but to leverage the benefits of the COG format we might have to do some special things. So here's how I see this working:
We need an index into the AOIs, which is as simple as loading all the names/IDs from the files provided by https://github.com/PSU-CSAR/BAGIS-pourpoints.
We need an index from AOI to the SNODAS cell grid: this is simply a rasterized version of the AOI geometry. We can clip it down to the SWE COG tile bounds (more on this later).
We need to be able to join the AOI to the DEM layer. Because the AOI polygon is just a boolean raster, we can collapse the two together, and simply use resampled DEM values for all cells within the AOI polygon in the rasterized version.
We need to be able to join the AOI to SWE data. This gets a little tricker because we don't want to have to open the entire SWE raster when we only need a subset. If we have a map of AOI to the intersected SWE COG tiles we can load only those tiles into an array in memory, then use the AOI elevation array to aggregate (average) the SWE values based on specified elevation bands (doing this in Python with numpy is rather trivial and should be rather performant; we might also look at a tool like dask to build xarrays, which might give us both parallelization and sparse reads right out of the box, but it's a relatively new tool to me, I'm not sure how it works on windows, and I get nervous about pursuing something like that when I know I have a path forward making a simpler, purpose-build solution).
We need to have an index into the SWE data by date, which is trivially handled by using a predictable date-based naming convention and relying on the filesystem to give us the COG we need for a given date. We can walk that directory to build a set of all available SNODAS dates at server startup, and simply restart the server process each day when we add a new file.
Exploring the viability

To test the efficiency of COG tile reads per AOI, and to explore how to index intersected tiles for each AOI, I generated two grids based on the SNODAS grid: one for 256x256 tiles and another for 512x512 tiles. I then intersected all the AOIs with polygons with those tile grids and counted the number of tiles intersected by each, and the sum total of tile intersections.

Of the 653 pourpoints with geometries, we have 702 intersections with 512x512 tiles and 794 intersections with 256x256 tiles. This shows that AOIs overwhelmingly fall inside the bounds of a single tile. In both cases the maximum number of tiles intersected by an AOI is four, likely indicating at least one AOI intersects four tiles due to a common corner between the 256x256 and 512x512 tile grids.

The evidence here points to the advantage of using the 256x256 tile size. 512x512 tiles would not provide a meaningful reduction in tile access. In the common case larger tiles would only serve to increase the cost of tile reads. More accurately, the read cost multiplier for 512x512 vs 256x256 tiles is given by the equation 4 * (reads512 / reads256). In our case this is 4 * (702 / 794) = 3.5365239295, suggesting a 512x512 tile grid would result in over 3.5 times as much data having to be read compared to using 256x256 tiles (barring compression).

In fact, the evidence here points to possible benefits from further reducing tile size, say using a 128x128 tile grid. However, smaller tiles will reduce compression efficiency and increase COG index lookups; for these reasons I would suggest 256x256 tiles appear as the optimal balance of read efficiency to disk usage (without real access patterns to back up a smaller tile size).

Worst case scenario for 256x256 tile reads is to have to read all available dates for a given tile. Looking towards the future, let's use that 10,000 number again. A single tile is, with zero compression, 256 x 256 x 2 bytes = 131,072 bytes. Total data to read from disk then, excluding COG index lookups, is roughly 1.31GB. In the case of the four tile AOIs, this becomes a worst-case read of about 5.24GB. This is a lot to read, but in reality the data on disk should compress heavily and will not be nearly so much on disk, and we are not likely to have many queries spanning the entire archive history (right?).

Do we need a database at all?

When storing everything in COG format it ends up that we don't need postgres at all. It's kinda handy for some of the calculations we need to set everything up, like finding the list of tile quadkeys intersected by each AOI, but that can be done with GDAL/OGR in python too. Once we have that list for each AOI, however we've derived it, we can simply embed it into each AOI's geojson file in the BAGIS-pourpoints repo, or in a dedicated SNODAS json metadata file.

If we needed even slightly more advanced AOI search keeping a SQL database in scope would make sense. But even then something like sqlite or duckdb would be a better choice than postgres for a simple relational data model, especially one that doesn't have to support writes from the API (because there's no server application to maintain or complex upgrade process). But until such requirements are in scope I think it makes sense to keep the system as simple as possible via static text files indexed with the filesystem.

Closing thoughts

So this is where I'm at. Again, I apologize for such a long email, especially because I expect you are busy with the start of the term. I just wanted to make sure you had this information time to digest it before our meeting next week.

I'm going to proceed with the plan I've outlined unless I hear back from you to do otherwise. I have a few conflicts this week, but I am optimistically hoping to pull together a rough PoC before our meeting next week.

jdduh · 2024-01-09T02:03:31Z

Here are some stats for DB design considerations.

SNODAS cell count in the selected area: 4 million (4,023,290)
Daily SWE data table with cell ID and SWE value: ASCII ~ 44 MB, Zipped ASCII ~ 8 MB
Average AOI size (not including Alaska AOIs): 7,217,072,554 square meters or 8,500 cells
There are around 560 active forecast stations (i.e., AOIs) in the lower 48 states

jdduh added the documentation Improvements or additions to documentation label Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database design and sample data - proof-of-concept #1

Database design and sample data - proof-of-concept #1

jdduh commented Dec 12, 2023

jdduh commented Jan 8, 2024

jdduh commented Jan 9, 2024

Database design and sample data - proof-of-concept #1

Database design and sample data - proof-of-concept #1

Comments

jdduh commented Dec 12, 2023

jdduh commented Jan 8, 2024

jdduh commented Jan 9, 2024