Detect and report missing data in GWO ingest #626

Open
znatty22 opened this issue Apr 22, 2021 · 4 comments

@znatty22
Member

znatty22 commented Apr 22, 2021

The study creator's GenomicDataLoader currently does not detect any discrepancies between the GWO manifest and S3 or between the GWO manifest and the Dataservice. Detecting these discrepancies is an important part of the analysts' current manual process of loading the harmonized genomic file info into the Dataservice.

Each of the 3 load functions in the GenomicDataLoader should be modified to detect discrepancies and report them through log statements, event firing, or both.

Specifics:

In the load_harmonized_genomic_files method:

  • Detect and report any discrepancy between the files listed in the GWO manifest and the S3 scrape

In the load_specimen_harmonized_gf_links method:

  • Detect and report any discrepancy between the specimens listed in the GWO manifest and the specimens in the Dataservice

In the load_seq_exp_harmonized_genomic_files method:

  • Detect and report any harmonized files that could not be linked to sequencing experiments (e.g. because the corresponding unharmonized genomic file didn't exist)
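
As a rough illustration of the first two checks, the comparison could be a plain set difference in each direction, reported through a log statement. This is only a sketch: `find_discrepancies` and `report_discrepancies` are hypothetical helper names, not existing GenomicDataLoader methods, and the identifiers being compared (file paths, specimen IDs) are assumed to already be extracted into iterables.

```python
import logging

logger = logging.getLogger(__name__)


def find_discrepancies(manifest_ids, other_ids):
    """Return the IDs present on only one side of the comparison.

    Hypothetical helper: `manifest_ids` would come from the GWO manifest,
    `other_ids` from the S3 scrape or from the Dataservice.
    """
    manifest_ids = set(manifest_ids)
    other_ids = set(other_ids)
    return {
        "only_in_manifest": manifest_ids - other_ids,
        "only_in_other": other_ids - manifest_ids,
    }


def report_discrepancies(label, manifest_ids, other_ids):
    """Log a warning for each non-empty side of the difference, then return it."""
    diff = find_discrepancies(manifest_ids, other_ids)
    for side, ids in diff.items():
        if ids:
            logger.warning(
                "%s: %d item(s) %s: %s", label, len(ids), side, sorted(ids)
            )
    return diff
```

Returning the difference as well as logging it leaves room for the caller to fire an event or stop the ingest later, if that turns out to be wanted.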
@znatty22 znatty22 added the feature New functionality label Apr 22, 2021
@gsantia gsantia self-assigned this Apr 23, 2021
@gsantia
Contributor

gsantia commented Apr 28, 2021

Should any of these three changes lead to a stop in the ingestion process? Or do we just want to report these things?

@znatty22
Member Author

I think we should just report these things, but maybe we should ask @allisonheath.

@gsantia
Contributor

gsantia commented Apr 28, 2021

I've been thinking through the 3rd check here, and it seems to me some parts of it should be done elsewhere. For example, checking whether a harmonized genomic file's corresponding unharmonized genomic file exists is something we can do immediately using just the GWO manifest: query the Dataservice for genomic-files that match the source file column entries, and if any are missing, we have a problem.

EDIT: On second thought, it's probably better to do it in the load_seq_exp_harmonized_genomic_files method, because then we don't need to make extraneous queries to the Dataservice.
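
A minimal sketch of what doing the check inside the load step might look like, assuming the loader already holds a lookup from unharmonized source file to sequencing experiment when it builds the links. The function name `partition_links` and the row keys `source_file` and `harmonized_file` are placeholders for illustration, not the actual manifest columns or loader API.

```python
def partition_links(manifest_rows, seq_exp_by_source_file):
    """Split manifest rows into linkable and unlinkable harmonized files.

    `manifest_rows` is an iterable of dicts with the placeholder keys
    "source_file" and "harmonized_file". `seq_exp_by_source_file` maps an
    unharmonized source file to its sequencing experiment ID and is assumed
    to be data the loader has already fetched, so no extra queries are made.
    """
    linked = []
    unlinked = []
    for row in manifest_rows:
        seq_exp_id = seq_exp_by_source_file.get(row["source_file"])
        if seq_exp_id is None:
            # The corresponding unharmonized genomic file is unknown,
            # so this harmonized file cannot be linked.
            unlinked.append(row["harmonized_file"])
        else:
            linked.append((seq_exp_id, row["harmonized_file"]))
    return linked, unlinked
```

The `unlinked` list is what would be reported to the user; the `linked` pairs would go on to be loaded as usual.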

@znatty22
Member Author

znatty22 commented Apr 29, 2021

@gsantia Yeah, the issue I wrote up might not be exactly how it ends up being implemented. You'll probably have a better idea since you're doing the implementation. The important thing is that we're able to record and report any missing data that we feel is important for the user to know about.

@gsantia gsantia linked a pull request Apr 29, 2021 that will close this issue