
A new API and a new internal structure #231

Open
3 tasks
bnlawrence opened this issue Dec 18, 2024 · 8 comments
@bnlawrence
Collaborator

PyActiveStorage is currently a "research activity" and we need to transition it to a library with a clean API and clear internal functionality - in partnership, for the moment, with the Reductionist Library.

There are three things we need to do:

  • develop a clean new API
  • clean up the code and documentation, prune branches and issues
  • release and publicise

This issue is mainly about the first of these objectives. We'll spin off issues for the other two when that is done.

@bnlawrence
Collaborator Author

bnlawrence commented Dec 18, 2024

David and I had a lengthy conversation and came up with the following potential new API. It depends on some new components which we will pretend exist in the following snippets (to be clear, none of the imports from PyActive exist; these are proposals).

First we start with the vanilla file opening.

from h5netcdf.legacyapi import Dataset
filename = 'fred.nc'
f = Dataset(filename)
v = f['temp']

v would then be a normal h5netcdf Variable instance. As such, we have access to the underlying h5py Dataset instance (or, if we use a pyfive-based backend, the pyfive equivalent, which is currently a DatasetDataObject).

We want to provide access to the storage chunks of v for use in Dask computational chunks and/or with active storage. In both cases we want this to be efficient and thread safe. With the packages that existed pre-ExcaliData there was no way to do this, so we invested in upgrading pyfive, and we now have the capability to do that (albeit in a bunch of branches which need tidying up). However, we have discovered that, to be efficient, we don't want every dask computational task having to read the b-tree - this is of course not a real discovery, and was likely the motivation behind the invention of kerchunk. We now have methods for extracting the b-tree and caching it for re-use by active storage. The API question is then how to make the use of that cached b-tree natural, and how to invoke the "active" part of active storage in a pythonic way. It will be seen that in solving this we can, and should, also solve "the efficient and natural use of thread-safe netcdf reading".
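
As a rough illustration of the caching idea (everything below is hypothetical; in reality the index would be built by pyfive walking the HDF5 b-tree, for which walk_btree is a stand-in), the chunk index can be computed once per variable and then shared cheaply by every task:

```python
from functools import lru_cache

def walk_btree(filename, varname):
    # Placeholder for the expensive b-tree traversal; returns a mapping
    # from chunk coordinates to (file offset, size in bytes).
    # The values here are purely illustrative.
    return {(0,): (4096, 1024), (1,): (5120, 1024)}

@lru_cache(maxsize=None)
def chunk_index(filename, varname):
    # The b-tree is only walked on the first call per (file, variable);
    # subsequent calls hit the cache.
    return walk_btree(filename, varname)

def chunk_location(filename, varname, coords):
    # Cheap lookup suitable for use inside individual dask tasks.
    return chunk_index(filename, varname)[coords]
```

In a real implementation the cache key and invalidation would need more care (files can change), but the shape of the API question is the same: where does this index live, and who owns it?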

We first show our thinking for how we might handle the active API. It starts with getting a modified version of v, one that can be used for active operations.

from PyActive import active
av = active(v)

av is an "ActiveDataset" instance (or would be if we had built it, but this is a thought experiment). Now we can use it in our examples.

Example 1: Normal access to slices of v (i.e. using av in a non-active way).

result1 = v[56:60]
result2 = av[56:60]
assert (result1 == result2).all()

av and v behave the same for normal operations.

Example 2: Doing an active operation on the entire array

import numpy as np
result1 = np.mean(v)
result2 = av.mean[:]
assert result1 == result2

In this case result2 is calculated from chunk means computed in the storage server, so it involves far less network traffic than result1, for which the chunk data must be transferred and the means calculated in this Python code itself.

(All the methods supported by the active storage need to be supported as methods on the ActiveDataset instance, which has implications for how we add new reductions into the client as well as the server, but we expect this to be a rare thing.)
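
A minimal sketch of how such an ActiveDataset wrapper might look (all names here are illustrative, not the real PyActive API; the reductions fall back to local numpy calls where a real client would push them to the storage server):

```python
import numpy as np

class _Reducer:
    """Hypothetical helper so that av.mean[...] applies a reduction to a slice."""
    def __init__(self, dataset, func):
        self._dataset = dataset
        self._func = func

    def __getitem__(self, index):
        # A real active-storage client would send the reduction request to
        # the server here; this sketch just computes it locally.
        return self._func(self._dataset[index])

class ActiveDataset:
    """Sketch of the proposed wrapper: behaves like the underlying variable
    for normal indexing, and exposes reductions as attributes."""
    _REDUCTIONS = {"mean": np.mean, "sum": np.sum, "max": np.max, "min": np.min}

    def __init__(self, variable):
        self._v = variable

    def __getitem__(self, index):
        return self._v[index]  # normal (non-active) access

    def __getattr__(self, name):
        if name in self._REDUCTIONS:
            return _Reducer(self, self._REDUCTIONS[name])
        raise AttributeError(name)
```

The `__getattr__` dispatch table is one way to keep the set of supported reductions in a single place, which is the "implications for how we add new reductions" point above.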

Example 3: Doing an active operation on a subset of the array

result1 = np.mean(v[56:60])
result2 = av.mean[56:60]
assert result1 == result2

If the slice [56:60] intersected two storage chunks, then the active mean would be calculated on both chunks storage side, and the two partial results returned and combined into a single mean. Note that the active version would honour missing-data masks by default.
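
The combination step is worth spelling out, because a mean of per-chunk means is only correct when weighted by chunk size. A sketch, assuming the server returns a (partial sum, unmasked count) pair per chunk (that protocol detail is an assumption, not something decided here):

```python
import numpy as np

def combine_chunk_means(partials):
    """Combine per-chunk (partial_sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# The slice [56:60], pretending it spans two storage chunks:
data = np.arange(56, 60, dtype=float)
chunk_a, chunk_b = data[:2], data[2:]
partials = [(chunk_a.sum(), chunk_a.size), (chunk_b.sum(), chunk_b.size)]
assert combine_chunk_means(partials) == data.mean()
```

Returning (sum, count) rather than a bare mean also handles masked values naturally, since the count reflects only unmasked data.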

Example 4: Using Dask to do normal operations

import dask.array as da
y = da.from_array(av)

av quacks like v, so Dask could work on that in all normal ways.

Example 5: Using dask with active

It was tempting to think that the extension from that would be to do things like this:

y = da.from_array(av.mean)

but we rapidly realised all sorts of odd things would happen, as av.mean would not behave the way v or av would within a dask graph, and dask would get into all sorts of trouble. Instead, we think a pattern like this would work better:

import PyActive as pa
y = pa.mean.from_array(av)

In this case, again, all the reduction methods need to be supported in the active library, but they all have a common pattern which involves some manipulation of the dask instances (David to add details).
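
A dask-free sketch of that common pattern (names hypothetical; in the real proposal the per-chunk partials would be nodes in a dask graph, and the partial/combine pair would come from the active library):

```python
import numpy as np

class ReductionFactory:
    """Sketch of the pa.mean.from_array pattern: each storage chunk
    contributes a partial result which is combined client side."""
    def __init__(self, partial, combine):
        self._partial = partial   # runs per chunk (server side, ideally)
        self._combine = combine   # runs once, client side

    def from_array(self, arr, chunk=4):
        chunks = [arr[i:i + chunk] for i in range(0, len(arr), chunk)]
        partials = [self._partial(c) for c in chunks]
        return self._combine(partials)

# mean expressed as (sum, count) partials combined by weighted division
mean = ReductionFactory(
    partial=lambda c: (float(np.sum(c)), c.size),
    combine=lambda ps: sum(s for s, _ in ps) / sum(n for _, n in ps),
)
```

Every reduction then needs only its own (partial, combine) pair, which is the common pattern referred to above.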

@bnlawrence
Collaborator Author

(It is important for everyone to realise that if the storage is not active, this client will perform the operations client side anyway, so the code will work both with and without active storage.)
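
A toy sketch of that fallback dispatch (names are illustrative only; "server" here is just a stand-in callable for a real reduction request to active storage):

```python
import numpy as np

def reduce_chunk(chunk, func, server=None):
    """Apply a reduction to one chunk, server side if available,
    otherwise client side on the already-read chunk data."""
    if server is not None:
        return server(chunk, func)  # pretend this runs in the storage server
    return func(chunk)              # client-side fallback

chunk = np.arange(4.0)
# The "active" path and the client-side fallback give identical results:
assert reduce_chunk(chunk, np.mean, server=lambda c, f: f(c)) == reduce_chunk(chunk, np.mean)
```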

@davidhassell
Collaborator

Hi Bryan - Great write up, thanks. It still seems to make sense.

One API thing we touched on briefly was renaming the class (from Active), since we'd be using it in non-active mode as a proper use case (normal reads in cf-python), rather than just for testing (is that fair?). I.e. the object is a "general" data getter that also has an active mode of functionality. No conclusions were reached.

I'll think a bit more on the "Using dask with active" case, as you say.

@bnlawrence
Collaborator Author

@davidhassell I am assuming the dask issue will also bite us here as well as in cf-python?

@davidhassell
Collaborator

Hi @bnlawrence - the good news is that I don't think there'll be any problem here. The dask issue is only about what goes on in cf-python before it gets to the stage of instantiating any Active instances. I.e. the problem is independent of the nature of the Active class.

@valeriupredoi
Collaborator

This is cool! I finally had a chance to go through it! The main questions I'd ask here are:

  • how much exposure do we want for the Active class? The way I see it, it should be a fairly low-level class that gets used by a bunch of higher-level classes/tools (cf-python, iris, etc.), or by something lower level (but still fairly high level) like Dask
  • depending on the answer to the above, I'd completely drop the "non-Active" case and let it default to, and be dealt with by, whatever tool is using ActiveStorage; or not, if we want ActiveStorage to be higher level

Let me start with a set of vanilla refactoring ops for now, then we'll decide how best to juggle what we have 🍺

@bnlawrence
Collaborator Author

I think we would want it to be usable by civilians as well as higher level libraries, so in that sense I do think we need Active ...

@valeriupredoi
Collaborator

I think we would want it to be usable by civilians as well as higher level libraries, so in that sense I do think we need Active ...

sounds about like what I was thinking too
