Open Kerchunk refs as Virtual Dataset #119

norlandrhagen · 2024-05-16T23:29:53Z

Closes Open on-disk kerchunk references as a virtual dataset #118
Tests added
User visible changes (including notable bug fixes) are documented in changelog.md
New functions/methods are listed in api.rst

Start of PR to address #118.

Lots of open questions!

How should we read .parquet files into KerchunkStoreRefs to pass into dataset_from_kerchunk_refs?
Round-tripping JSON seems to lose _ARRAY_DIMENSIONS.

Would love some feedback @jsignell!

norlandrhagen · 2024-05-17T18:23:44Z

Update: @TomNicholas and I dug a bit deeper into the failing JSON round trip. On visual inspection, the underlying structure and data seem identical, but xarray_testing.assert_equal disagrees.

vds.lat.data.manifest.dict() == rt_vds.lat.data.manifest.dict() evaluates to True.
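
One way to narrow down a round-trip discrepancy like this is to compare the dimension names recorded in the references themselves, before and after serialization. The sketch below assumes the usual Kerchunk layout, where each array's Zarr attributes are stored as a JSON string under a `<name>/.zattrs` key and xarray's dimension names live in `_ARRAY_DIMENSIONS`. The helper `array_dimensions` and the sample references are made up for illustration and are not part of VirtualiZarr.

```python
import json


def array_dimensions(refs: dict) -> dict:
    """Map each array name to its _ARRAY_DIMENSIONS list (None if absent)."""
    dims = {}
    for key, value in refs["refs"].items():
        if not key.endswith("/.zattrs"):
            continue
        name = key[: -len("/.zattrs")]
        # Metadata entries are usually JSON strings, but tolerate plain dicts.
        attrs = json.loads(value) if isinstance(value, str) else value
        dims[name] = attrs.get("_ARRAY_DIMENSIONS")
    return dims


# Hypothetical, hand-written references for a single 1-D variable.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "lat/.zattrs": json.dumps(
            {"_ARRAY_DIMENSIONS": ["lat"], "units": "degrees_north"}
        ),
        "lat/0": ["air.nc", 5614, 100],
    },
}

# Plain JSON serialization and parsing, standing in for write + read.
round_tripped = json.loads(json.dumps(refs))

before = array_dimensions(refs)
after = array_dimensions(round_tripped)
# Names of arrays whose dimension metadata changed across the round trip.
missing = {name for name in before if before[name] != after.get(name)}
```

If `missing` were non-empty for real references, that would suggest the dimension names are dropped while writing or reading the references, rather than in the dataset comparison itself.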

...

norlandrhagen · 2024-05-17T18:30:11Z

Also, some weird behavior where virtualize.to_kerchunk seems to be adding ARRAY_DIMENSIONS?

jsignell · 2024-05-17T18:08:24Z

virtualizarr/xarray.py

+
+            vds = dataset_from_kerchunk_refs(refs_dict)
+            return vds
+        elif kerchunk_storage_ftype == ".parquet":


Parquet files are not required to have this suffix; for instance, ".parq" is also very common. I'm not sure there is a better way to tell the type of file, though.
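
A possible alternative to matching a single suffix is sketched below: accept several common suffixes and, for local files, fall back on the `PAR1` magic bytes that open every parquet file. The function name `guess_refs_format` is hypothetical, and treating a directory containing `.zmetadata` as a parquet reference store is an assumption about how Kerchunk lays out its parquet output. The sketch only handles local paths; remote stores would need the equivalent fsspec calls.

```python
from pathlib import Path

PARQUET_SUFFIXES = {".parquet", ".parq"}
PARQUET_MAGIC = b"PAR1"  # first four bytes of any parquet file


def guess_refs_format(path) -> str:
    """Return "json", "parquet", or "unknown" for a local reference path."""
    p = Path(path)
    if p.is_dir():
        # Assumption: a parquet reference store is a directory with .zmetadata.
        if (p / ".zmetadata").exists() or p.suffix in PARQUET_SUFFIXES:
            return "parquet"
        return "unknown"
    if p.is_file():
        # Content check first, so a misnamed parquet file is still detected.
        with p.open("rb") as f:
            if f.read(4) == PARQUET_MAGIC:
                return "parquet"
    # Fall back on the suffix (also covers paths that do not exist yet).
    if p.suffix == ".json":
        return "json"
    if p.suffix in PARQUET_SUFFIXES:
        return "parquet"
    return "unknown"
```

Checking content before the suffix means a file named `refs.bin` that starts with `PAR1` is still recognized, while the suffix remains a fallback for everything else.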

jsignell · 2024-05-17T18:57:12Z

virtualizarr/xarray.py

+
+            # Question: How should we read the parquet files
+            # into a dict to pass into dataset_from_kerchunk_refs?
+            # pandas, pyarrow table, duckdb?


I feel like pandas would be a fine way to get things working and then you can always switch it out.
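
As a rough illustration of the pandas route, the sketch below turns tabular reference rows into the dictionary shape used by Kerchunk's JSON references (`{"version": 1, "refs": {...}}`). The column names (`key`, `path`, `offset`, `size`, `raw`) are assumptions for illustration and may not match what Kerchunk actually writes to parquet, and `rows_to_refs` is a hypothetical helper. The rows could come from something like `pandas.read_parquet(path).to_dict("records")`; plain dicts are used here so the example runs without extra dependencies.

```python
def rows_to_refs(rows) -> dict:
    """Build a Kerchunk-style references dict from row records."""
    refs = {}
    for row in rows:
        raw = row.get("raw")
        if raw is not None:
            # Inline data (e.g. .zarray / .zattrs metadata) is kept as a string.
            refs[row["key"]] = raw.decode() if isinstance(raw, bytes) else raw
        else:
            # Chunk reference in the form [url, offset, length].
            refs[row["key"]] = [row["path"], int(row["offset"]), int(row["size"])]
    return {"version": 1, "refs": refs}


# Hypothetical rows: one inline metadata entry and one chunk reference.
rows = [
    {"key": ".zgroup", "path": None, "offset": 0, "size": 0,
     "raw": b'{"zarr_format": 2}'},
    {"key": "lat/0", "path": "air.nc", "offset": 5614, "size": 100, "raw": None},
]
refs_dict = rows_to_refs(rows)
```

With real pandas output, missing values may arrive as NaN rather than `None`, so the `raw is not None` check would likely need adjusting (for example with `pandas.isna`).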

jsignell · 2024-05-17T18:59:21Z

virtualizarr/utils.py

-        fpath = fsspec.filesystem(protocol, **storage_options).open(filepath)
+        fpath = fsspec.filesystem(protocol, **storage_options)
+        if universal_filepath.is_file():
+            fpath = fpath.open(filepath)


I'm having trouble figuring out what motivated these changes.

jsignell · 2024-05-17T19:02:40Z

virtualizarr/tests/test_kerchunk.py

+def test_kerchunk_to_virtual_dataset(netcdf4_file, tmpdir, format):
+    vds = open_virtual_dataset(netcdf4_file, indexes={})
+
+    # QUESTION: should these live in a fixture? ex. kerchunk_ref_fpath_json, kerchunk_ref_fpath_parquet


eh you kind of want the original vds as well as the kerchunk refs so I think it is fine as is.

TomNicholas · 2024-07-13T17:27:10Z

I just tried to merge main into this because @kthyng is interested in picking it up.

> Also, some weird behavior where virtualize.to_kerchunk seems to be adding ARRAY_DIMENSIONS?

I'm pretty sure I fixed this in #153

EDIT: Looks like I broke something in the defaults for the fsspec reader, though. Oops.

norlandrhagen · 2024-10-10T04:04:29Z

#251

norlandrhagen added 2 commits May 16, 2024 17:25

first stab at issue #118

39d7735

cleanup pdb

ffa8562

TomNicholas added the references generation Reading byte ranges from archival files label May 17, 2024

jsignell reviewed May 17, 2024

TomNicholas mentioned this pull request Jun 8, 2024

Aspirational use case: [C]Worthy mCDR OAE Atlas dataset #132

Open

19 tasks

Merge branch 'main' into kerchunk_to_virtual

f290592

TomNicholas temporarily deployed to test-release July 13, 2024 17:22 — with GitHub Actions Inactive

kthyng mentioned this pull request Jul 13, 2024

Open kerchunk ref as virtual dataset, only json (from PR 119) #186

Closed

6 tasks

TomNicholas mentioned this pull request Jul 21, 2024

Extend refspec support to [path] entries (without offset/length) #187

Merged

7 tasks

keewis mentioned this pull request Oct 8, 2024

Allow open_virtual_dataset to read existing Kerchunk references #251

Merged

10 tasks

norlandrhagen closed this Oct 10, 2024
