Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata correction #2177

Open
wants to merge 6 commits into
base: master
Choose a base branch
from
Open

Conversation

candleindark
Copy link
Member

This PR provides a solution to correct the corruptions in metadata documented in dandi/dandi-schema#276.

The solution is implemented in two parts.

  1. A general command, correct_metadata, to correct dandiset metadata.
  2. A specific helper function which the general command calls to do the actual correction.

Extensive tests are provided for the helper function and its supporting function. However, because I am not familiar with this repo and Django in general, I am not able to provide tests for the command which interacts with the database. Advice and additional tests are very much appreciated.

The command can be run on a targeted dandiset at a particular version and run on all versions of all dandisets. Running on all dandiset versions will only correct the corrupted dandisets. If running only on targeted dandiset version is preferable, please let me know, and I will provide the list of corrupted dandiset versions. (Additionally, would changing the interface of the command to a file consisting of corrupted dandiset versions be better?)

@yarikoptic yarikoptic requested a review from asmacdo February 10, 2025 16:52
Copy link
Member

@asmacdo asmacdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of questions to start off with-- I'll have another look tomorrow.

ver.save() # TODO: should I use `save()` in this case? It has some custom logic.

for ver, _ in changes:
write_manifest_files.delay(ver.id) # TODO: Please check if this is needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if any of the file writes fail or time out?

write_dandiset_yaml(version)
write_assets_yaml(version)
write_dandiset_jsonld(version)
write_assets_jsonld(version)
write_collection_jsonld(version)

Maybe we should at least keep the AsyncResults and monitor that they go as planned?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually don't know if I need to call write_manifest_files.delay(ver.id) at all. I was trying to mimic

, but https://github.com/dandi/dandi-archive/blob/master/dandiapi/api/management/commands/migrate_version_metadata.py didn't call write_manifest_files() at all.

Maybe we should at least keep the AsyncResults and monitor that they go as planned?

I think the answer is related to how important that statement gets executed successfully. In the example above, its execution status is not checked. If a success execution is important, I guess I can't interact with the DB the current way.

@dandi/dandiarchive

Note: This function corrects the corruptions described in
https://github.com/dandi/dandi-schema/issues/276
"""
unwanted_fields = ['contactPoint', 'includeInCitation', 'roleName']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC the original intent of the change that led to this problem was to remove all fields but the allowed ones-- would it be beneficial to also have a list of allowed_fields to confirm that they are the only ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In concept, you are right. I considered do that as well. However, I want to be careful for now. I want to remove only the ones that I know are causing problems. I don't want to remove anything that we haven't analyzed previously.

@yarikoptic yarikoptic added internal Changes only affect the internal API maintenance Action to maintain the system (neither a bugfix nor an enhancement) metadata Issues of dandiset/asset metadata handling labels Feb 12, 2025
Copy link
Member

@waxlamp waxlamp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @candleindark! One thing to change listed below. And I've asked @jjnesbitt to provide a more comprehensive review on this R as well.

@waxlamp waxlamp requested a review from jjnesbitt February 27, 2025 19:11
This make the default behavior to require
user to specify a particular dandiset version
to apply the correct to. Only when the `--all`
flag is provided, should the command apply the
correction to all dandiset versions
@candleindark
Copy link
Member Author

@jjnesbitt I have put in changes requested by @waxlamp regarding the command line interface. With the latest push, two unrelated tests failed. Can they be related to the latest changes in dandi-cli?

- Don't allow correct function configuration
- Remove verbose error handling logic
- Write manifest files synchronously
- Only support correction of the `Affiliation` schema key
- Remove `find_objs` tests against generalized schema key
- Use transactions to couple logic of metadata save and manifest file generation
@jjnesbitt
Copy link
Member

I've pushed a commit which aims to simplify the logic in this management command. Namely:

  • Remove the configuration of the correcting function.
    • We have no real use case for this yet, and no way to change this function is provided.
  • Removal of verbose error handling logic
    • Along the same lines as the above point, since this is a single function, this is largely unnecessary. We have tests, and if the correcting function fails for some reason, the error will simply bubble up.
  • Only support correction of the Affiliation schema key
    • Similarly again, we only have the use case of Affiliation at the moment, so all of the configuration surrounding other schema keys is unnecessary at the moment.
  • Remove find_objs tests against generalized schema key
  • Write manifest files synchronously
    • Since we are not in a request context, it's better to write the manifest files to S3 synchronously, so that we can be sure it succeeds.
  • Use transactions to couple logic of metadata save and manifest file generation

IMO this is ready, maybe @waxlamp should take another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal Changes only affect the internal API maintenance Action to maintain the system (neither a bugfix nor an enhancement) metadata Issues of dandiset/asset metadata handling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants