CN/IBL TODO: discuss raw vs. processed separation #74

Closed
CodyCBakerPhD opened this issue Sep 13, 2024 · 5 comments · Fixed by #92

@CodyCBakerPhD
Member

And should 'extra' (potentially correctable in future revisions) metadata be associated with raw data, or removed so that we can officially publish a persistent version of the 30+ TB that will not need to be changed or duplicated going forward?

@CodyCBakerPhD
Member Author

@grg2rsr This is one of the more important things to consider here early on

The short summation:

Assets on DANDI, once officially 'published' (to mint a DOI), become persistent and frozen. For NWB Dandisets, an 'asset' is a single NWB file. The current approach is to bundle ALL metadata and processed data alongside the bulk of the raw data (the electrical series from the multiple Neuropixels probes). This means that any time the metadata must be updated in a file containing raw data (perhaps corresponding to a recent 'revision' of the processed data), the entire file must be reuploaded and republished. This needlessly multiplies the amount of storage space taken on the S3 bucket.

This is why we've been waiting for clearance from your team to publish the current Dandiset, which, though highly used, is still in 'draft' because of the known issues with it.

What I propose is to simply write the bulk raw data (which will never change) once, to separate stand-alone files that have minimal associated metadata.

Then, any time a new data revision for the processed / histology / atlas / etc. is reconverted and reuploaded, you can simply republish those new files, which wastes much less storage (and is even kind of useful as a way to observe changes over versioned releases).
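
To put rough numbers on the difference, here is a minimal back-of-the-envelope sketch. Only the 30 TB raw figure comes from this thread; the processed size and the revision count are hypothetical placeholders, and the model assumes every revision touches metadata in every bundled file, so each bundled file has to be reuploaded in full.

```python
def total_storage_tb(raw_tb: float, processed_tb: float, n_revisions: int, split: bool) -> float:
    """Estimate total storage (in TB) after publishing `n_revisions` versions.

    Assumption: each revision changes the metadata / processed data, so any
    file that contains them must be uploaded again as a new frozen asset.
    """
    if split:
        # Raw files are written once; only the processed files are republished.
        return raw_tb + processed_tb * n_revisions
    # Bundled: every revision re-uploads raw and processed data together.
    return (raw_tb + processed_tb) * n_revisions


RAW_TB = 30.0        # from the "30+ TB" mentioned above
PROCESSED_TB = 0.5   # hypothetical size of the processed + metadata files
N_REVISIONS = 3      # hypothetical number of published revisions

bundled = total_storage_tb(RAW_TB, PROCESSED_TB, N_REVISIONS, split=False)
separate = total_storage_tb(RAW_TB, PROCESSED_TB, N_REVISIONS, split=True)
print(f"bundled: {bundled} TB, split: {separate} TB")
```

Under these assumed numbers the bundled layout grows linearly with the raw data size per revision, while the split layout only grows by the size of the processed files.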

Please let me know what approach you prefer here ASAP so I can make adjustments in the next week or so

cc: @oliche @mayofaulkner @GaelleChapuis

@grg2rsr
Collaborator

grg2rsr commented Sep 24, 2024

> What I propose is to simply write the bulk raw data (which will never change) once, to separate stand-alone files, that have minimal associated metadata

The discussion on our side has been in agreement with this. If streaming is a performant and viable option, those might even be lumped together in a single file, so that there is a {eid}-processed-only.nwb and a {eid}-raw-only.nwb.
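
As a small illustration of that naming scheme, a sketch of a helper that builds both file names for a session. The function name and the example `eid` value are hypothetical; only the `{eid}-processed-only.nwb` and `{eid}-raw-only.nwb` patterns come from this comment.

```python
def split_file_names(eid: str) -> dict[str, str]:
    """Return the two per-session file names for a given experiment ID (eid)."""
    return {
        "raw": f"{eid}-raw-only.nwb",              # bulk electrical series
        "processed": f"{eid}-processed-only.nwb",  # everything else
    }


# Hypothetical eid, for illustration only.
names = split_file_names("example-eid")
print(names["raw"])
print(names["processed"])
```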

The processed-only file will then contain all the fields that might change in future revisions and is comparatively lightweight, so that hosting multiple revisions might be an option.

The only problem with this was, if I understood from your side correctly, that the location of the probe insertion must be present in the nwb file that contains the ElectrodeGroup (which is the -raw-only.nwb), or is this now fixed / covered by #73?

@CodyCBakerPhD
Member Author

> The only problem with this was, if I understood from your side correctly, that the location of the probe insertion must be present

We can leave it as an empty string ("") if the only purpose of a raw-only file is to store the electrical series.

@grg2rsr
Collaborator

grg2rsr commented Sep 27, 2024

seems to me like the best way forward

@oliche

oliche commented Sep 27, 2024

One of the reasons for splitting the files here is that, by nature, "raw" acquisition files are often bulkier and also much less likely to change than others. So here we would split the files by "datasets that don't change" versus "datasets that may change". The motivation is to save space and allow revisions of pre-processed inputs without incurring the full cost of a re-upload. This is what the discussion above seems to converge on, and I agree.

Yet there are also user-centered reasons that make splits desirable: a user doesn't want to have to get the full raw data package if she only wants, say, the LFP band. In theory this can be addressed by a streaming strategy. Here we'd like to try out the user experience on one of our newly uploaded sessions, simulating different scenarios to decide if further splits are desirable (video / LFP / AP come to mind) depending on how it goes!

@CodyCBakerPhD here your expertise would be helpful to find the most appropriate way to access the data given a scenario!

CodyCBakerPhD linked a pull request on Sep 29, 2024 that will close this issue