
A new API and a new internal structure #231

Open
3 tasks
bnlawrence opened this issue Dec 18, 2024 · 8 comments
@bnlawrence
Collaborator

PyActiveStorage is currently a "research activity" and we need to transition it to a library with a clean API and clear internal functionality - in partnership, for the moment, with the Reductionist Library.

There are three things we need to do:

  • develop a clean new API
  • clean up the code and documentation, prune branches and issues
  • release and publicise

This issue is mainly about the first of these objectives. We'll spin off issues for the other two when that is done.

@bnlawrence
Collaborator Author

bnlawrence commented Dec 18, 2024

David and I had a lengthy conversation and came up with the following potential new API. It depends on some new components which we will pretend exist in the following snippets (to be clear, none of the imports from PyActive exist; these are proposals).

First we start with the vanilla file opening.

from h5netcdf.legacyapi import Dataset
filename = 'fred.nc'
f = Dataset(filename)
v = f['temp']

v would then be a normal h5netcdf Variable instance. As such, we have access to the underlying h5py Dataset instance (or, if we use a pyfive-based backend, the pyfive equivalent, which is currently a DatasetDataObject).

We want to provide access to the storage chunks of v for use in Dask computational chunks and/or with active storage. In both cases we want this to be efficient and thread safe. With the packages that existed pre-ExcaliData there was no way to do this, so we invested in upgrading pyfive, and we now have the capability to do that (albeit in a bunch of branches which need tidying up). However, we have discovered that, to be efficient, we don't want every dask computational task having to read the b-tree - this is of course not a real discovery, and was likely the motivation behind the invention of kerchunk. We now have methods for extracting the b-tree and caching it for re-use by active storage. The API question is then how to make the use of that cached b-tree natural, and how to invoke the "active" part of active storage in a pythonic way. It will be seen that in solving this we can, and should, also solve "the efficient and natural use of thread-safe netcdf reading".
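
As a rough illustration of the caching idea (everything below is hypothetical; in reality the index would be built by pyfive walking the HDF5 b-tree, for which walk_btree is a stand-in), the chunk index can be computed once per variable and then shared cheaply by every task:

```python
from functools import lru_cache

def walk_btree(filename, varname):
    # Placeholder for the expensive b-tree traversal; returns a mapping
    # from chunk coordinates to (file offset, size in bytes).
    # The values here are purely illustrative.
    return {(0,): (4096, 1024), (1,): (5120, 1024)}

@lru_cache(maxsize=None)
def chunk_index(filename, varname):
    # The b-tree is only walked on the first call per (file, variable);
    # subsequent calls hit the cache.
    return walk_btree(filename, varname)

def chunk_location(filename, varname, coords):
    # Cheap lookup suitable for use inside individual dask tasks.
    return chunk_index(filename, varname)[coords]
```

In a real implementation the cache key and invalidation would need more care (files can change), but the shape of the API question is the same: where does this index live, and who owns it?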

We first show our thinking for how we might handle the active API. It starts with getting a modified version of v, one that can be used for active operations.

from PyActive import active
av = active(v)

av is an "ActiveDataset" instance (or would be if we had built it, but this is a thought experiment). Now we can use it in our examples.

Example 1: Normal access to slices of v (i.e. using av in a non-active way).

result1 = v[56:60]
result2 = av[56:60]
assert (result1 == result2).all()

av and v behave the same for normal operations.

Example 2: Doing an active operation on the entire array

import numpy as np
result1 = np.mean(v)
result2 = av.mean[:]
assert result1 == result2

In this case result2 is calculated from chunk means computed in the storage server, so it involves far less network traffic than result1, for which the chunk data must be transferred and the means calculated in this Python code itself.

(All the methods supported by the active storage need to be supported as methods on the ActiveDataset instance, which has implications for how we add new reductions into the client as well as the server, but we expect this to be a rare thing.)
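
A minimal sketch of how such an ActiveDataset wrapper might look (all names here are illustrative, not the real PyActive API; the reductions fall back to local numpy calls where a real client would push them to the storage server):

```python
import numpy as np

class _Reducer:
    """Hypothetical helper so that av.mean[...] applies a reduction to a slice."""
    def __init__(self, dataset, func):
        self._dataset = dataset
        self._func = func

    def __getitem__(self, index):
        # A real active-storage client would send the reduction request to
        # the server here; this sketch just computes it locally.
        return self._func(self._dataset[index])

class ActiveDataset:
    """Sketch of the proposed wrapper: behaves like the underlying variable
    for normal indexing, and exposes reductions as attributes."""
    _REDUCTIONS = {"mean": np.mean, "sum": np.sum, "max": np.max, "min": np.min}

    def __init__(self, variable):
        self._v = variable

    def __getitem__(self, index):
        return self._v[index]  # normal (non-active) access

    def __getattr__(self, name):
        if name in self._REDUCTIONS:
            return _Reducer(self, self._REDUCTIONS[name])
        raise AttributeError(name)
```

The `__getattr__` dispatch table is one way to keep the set of supported reductions in a single place, which is the "implications for how we add new reductions" point above.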

Example 3: Doing an active operation on a subset of the array

result1 = np.mean(v[56:60])
result2 = av.mean[56:60]
assert result1 == result2

If the slice [56:60] intersected two storage chunks, then the active mean would be calculated on both chunks storage side, and the two partial results returned and combined into a single mean. Note that the active version would honour missing-data masks by default.
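
The combination step is worth spelling out, because a mean of per-chunk means is only correct when weighted by chunk size. A sketch, assuming the server returns a (partial sum, unmasked count) pair per chunk (that protocol detail is an assumption, not something decided here):

```python
import numpy as np

def combine_chunk_means(partials):
    """Combine per-chunk (partial_sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# The slice [56:60], pretending it spans two storage chunks:
data = np.arange(56, 60, dtype=float)
chunk_a, chunk_b = data[:2], data[2:]
partials = [(chunk_a.sum(), chunk_a.size), (chunk_b.sum(), chunk_b.size)]
assert combine_chunk_means(partials) == data.mean()
```

Returning (sum, count) rather than a bare mean also handles masked values naturally, since the count reflects only unmasked data.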

Example 4: Using Dask to do normal operations

import dask.array as da
y = da.from_array(av)

av quacks like v, so Dask could work on that in all normal ways.

Example 5: Using dask with active

It was tempting to think that the extension from that would be to do things like this:

y = da.from_array(av.mean)

but we rapidly realised all sorts of odd things would happen, as av.mean would not behave the way v or av would within a dask graph, and dask would get into all sorts of trouble. Instead, we think a pattern like this would work better:

import PyActive as pa
y = pa.mean.from_array(av)

In this case, again, all the reduction methods need to be supported in the active library, but they all have a common pattern which involves some manipulation of the dask instances (David to add details).
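
A dask-free sketch of that common pattern (names hypothetical; in the real proposal the per-chunk partials would be nodes in a dask graph, and the partial/combine pair would come from the active library):

```python
import numpy as np

class ReductionFactory:
    """Sketch of the pa.mean.from_array pattern: each storage chunk
    contributes a partial result which is combined client side."""
    def __init__(self, partial, combine):
        self._partial = partial   # runs per chunk (server side, ideally)
        self._combine = combine   # runs once, client side

    def from_array(self, arr, chunk=4):
        chunks = [arr[i:i + chunk] for i in range(0, len(arr), chunk)]
        partials = [self._partial(c) for c in chunks]
        return self._combine(partials)

# mean expressed as (sum, count) partials combined by weighted division
mean = ReductionFactory(
    partial=lambda c: (float(np.sum(c)), c.size),
    combine=lambda ps: sum(s for s, _ in ps) / sum(n for _, n in ps),
)
```

Every reduction then needs only its own (partial, combine) pair, which is the common pattern referred to above.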

@bnlawrence
Collaborator Author

(It is important for everyone to realise that if the storage is not active, this client will perform the operations client side anyway, so the code will work both with and without active storage.)
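
A toy sketch of that fallback dispatch (names are illustrative only; "server" here is just a stand-in callable for a real reduction request to active storage):

```python
import numpy as np

def reduce_chunk(chunk, func, server=None):
    """Apply a reduction to one chunk, server side if available,
    otherwise client side on the already-read chunk data."""
    if server is not None:
        return server(chunk, func)  # pretend this runs in the storage server
    return func(chunk)              # client-side fallback

chunk = np.arange(4.0)
# The "active" path and the client-side fallback give identical results:
assert reduce_chunk(chunk, np.mean, server=lambda c, f: f(c)) == reduce_chunk(chunk, np.mean)
```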

@davidhassell
Collaborator

Hi Bryan - Great write up, thanks. It still seems to make sense.

One API thing we touched on briefly was renaming the class (from Active), since we'd be using it in non-active mode as a proper use case (normal reads in cf-python), rather than just for testing (is that fair?). I.e. the object is a "general" data getter that also has an active mode of functionality. No conclusions were reached.

I'll think a bit more on the "Using dask with active" case, as you say.

@bnlawrence
Collaborator Author

@davidhassell I am assuming the dask issue will also bite us here as well as in cf-python?

@davidhassell
Collaborator

Hi @bnlawrence - the good news is that I don't think there'll be any problem here. The dask issue is only about what goes on in cf-python before it gets to the stage of instantiating any Active instances. I.e. the problem is independent of the nature of the Active class.

@valeriupredoi
Collaborator

This is cool! I finally had a chance to go through it! The main questions I'd ask here are:

  • how much exposure do we want for the Active class? The way I see it, it should be a fairly low-level class that gets used by a bunch of higher-level classes/tools (cf-python, iris, etc.), or by something lower level (but still fairly high level) like Dask
  • depending on the answer to the above, I'd completely drop the "non-Active" case and let it default to, and be dealt with by, whatever tool is using ActiveStorage; or not, if we want ActiveStorage to be higher level

Let me start with a set of vanilla refactoring ops for now, then we'll decide how best to juggle what we have 🍺

@bnlawrence
Collaborator Author

I think we would want it to be usable by civilians as well as higher level libraries, so in that sense I do think we need Active ...

@valeriupredoi
Collaborator

I think we would want it to be usable by civilians as well as higher level libraries, so in that sense I do think we need Active ...

sounds about like what I was thinking too
