CN/IBL TODO: discuss raw vs. processed separation #74
@grg2rsr This is one of the more important things to consider early on.

The short summary: once assets on DANDI are officially 'published' (to mint a DOI), they become persistent and frozen. For NWB Dandisets, an 'asset' is a single NWB file. The current approach bundles ALL metadata and processed data alongside the bulk of the raw data (the electrical series from the multiple Neuropixels probes). This means that any time the metadata in a file containing raw data must be updated (for example, to match a recent 'revision' of the processed data), the entire file must be re-uploaded and republished, which needlessly multiplies the storage taken on the S3 bucket. This is also why we have been waiting for clearance from your team to publish the current Dandiset, which, though heavily used, is still in 'draft' because of its known issues.

What I propose is to write the bulk raw data (which will never change) once, to separate stand-alone files with minimal associated metadata. Then, whenever a new revision of the processed / histology / atlas / etc. data is reconverted and re-uploaded, only those new files need to be republished. That wastes far less data (and is even somewhat useful as a way to observe changes across versioned releases).

Please let me know which approach you prefer here as soon as possible so I can make the adjustments in the next week or so.
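To make the proposed layout concrete, here is a minimal sketch (the file names, series names, and data shapes are placeholders, not the actual IBL-to-NWB conversion code) of writing the immutable raw acquisition and the revisable processed/metadata content as two separate NWB files for the same session:

```python
# Sketch only: placeholder names and shapes, not the real conversion pipeline.
from datetime import datetime, timezone
from uuid import uuid4

import numpy as np
from pynwb import NWBFile, NWBHDF5IO, TimeSeries

session_start = datetime(2021, 6, 1, tzinfo=timezone.utc)

# File 1: bulk raw acquisition -- written once, published once, never touched again.
raw_nwb = NWBFile(
    session_description="IBL session - raw acquisition only",
    identifier=str(uuid4()),
    session_start_time=session_start,
)
raw_nwb.add_acquisition(
    TimeSeries(
        name="raw_ephys_probe00",                    # placeholder for the real ElectricalSeries
        data=np.zeros((1000, 384), dtype=np.int16),  # placeholder data
        unit="a.u.",
        rate=30000.0,
    )
)
with NWBHDF5IO("sub-XX_ses-YY_desc-raw.nwb", "w") as io:
    io.write(raw_nwb)

# File 2: processed data + rich metadata -- reconverted and republished on each revision.
processed_nwb = NWBFile(
    session_description="IBL session - processed data and session metadata",
    identifier=str(uuid4()),
    session_start_time=session_start,
)
with NWBHDF5IO("sub-XX_ses-YY_desc-processed.nwb", "w") as io:
    io.write(processed_nwb)
```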
The discussion on our side has been in agreement with this. If streaming is a performant and viable option, those might even be lumped together in a single file. The only problem with this was, if I understood your side correctly, that the location of the probe insertion must be present in the NWB file that contains the raw data.
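For reference, that coupling comes from the NWB schema itself: the electrode table (with insertion coordinates and brain-region labels) lives in the same NWBFile as the ElectricalSeries that references it. A minimal pynwb sketch with made-up device names, coordinates, and region labels:

```python
# Sketch only: made-up device, coordinates, and region labels.
from datetime import datetime, timezone
from uuid import uuid4

from pynwb import NWBFile

nwbfile = NWBFile(
    session_description="raw ephys file",
    identifier=str(uuid4()),
    session_start_time=datetime(2021, 6, 1, tzinfo=timezone.utc),
)

device = nwbfile.create_device(name="NeuropixelsProbe00")
group = nwbfile.create_electrode_group(
    name="probe00",
    description="Neuropixels shank",
    location="planned insertion target",  # this is what a histology revision would update
    device=device,
)
# Each electrode row carries coordinates and a brain-region label, inside the same
# file that will hold the raw ElectricalSeries referencing these rows.
nwbfile.add_electrode(
    x=1.0, y=2.0, z=3.0, imp=float("nan"),
    location="CA1", filtering="none", group=group,
)
```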
We can leave it
Seems to me like the best way forward.
One of the reasons for splitting the files here is that, by nature, "raw" acquisition files are often bulkier and also much less likely to change than others. So here we would split the files into "datasets that don't change" versus "datasets that may change". The motivation is to save space and allow revisions of the pre-processed inputs without incurring the full cost of a re-upload. This is what the discussion above seems to converge to, and I agree.

Yet there are also user-centered reasons that make splits desirable: a user doesn't want to have to get the full raw data package if she only wants, say, the LFP band. In theory this can be addressed by a streaming strategy. Here we'd like to try out the user experience on one of our newly uploaded sessions, simulating different scenarios, and decide whether further splits are desirable (video / LFP / AP come to mind) depending on how it goes!

@CodyCBakerPhD here your expertise would be helpful to find the most appropriate way to access the data given a scenario!
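As a starting point for those streaming experiments, here is a hedged sketch of reading only a slice of one series from a DANDI-hosted NWB asset without downloading the whole file; the dandiset ID, asset path, and series name below are placeholders, and the pattern follows the remfile-based streaming approach from the pynwb tutorials:

```python
# Sketch only: the dandiset ID, asset path, and series name are placeholders.
from dandi.dandiapi import DandiAPIClient
import h5py
import remfile
from pynwb import NWBHDF5IO

dandiset_id = "000409"                   # placeholder dandiset
asset_path = "sub-XX/sub-XX_ses-YY.nwb"  # placeholder asset path

with DandiAPIClient() as client:
    asset = client.get_dandiset(dandiset_id, "draft").get_asset_by_path(asset_path)
    s3_url = asset.get_content_url(follow_redirects=1, strip_query=True)

# remfile fetches byte ranges over HTTP on demand, so only the chunks actually
# read below are downloaded from S3.
with h5py.File(remfile.File(s3_url), "r") as h5f:
    with NWBHDF5IO(file=h5f, load_namespaces=True) as io:
        nwbfile = io.read()
        lfp = nwbfile.acquisition["probe00_lfp"]  # placeholder series name
        snippet = lfp.data[:1000, :]              # pulls only these chunks
```

Only the HDF5 chunks touched by the reads above are transferred, so the access cost scales with what the user asks for rather than with the total asset size; the same pattern could be timed for the video / LFP / AP scenarios mentioned above.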
And should 'extra' (potentially correctable in future revisions) metadata be associated with raw data, or removed so that we can officially publish a persistent version of the 30+ TB that will not need to be changed or duplicated going forward?