Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CUMULUS-3368]: [User Contribution] Introducing allFilesPresent metadata field for Cumulus Collections. Fixes Issue #3193 #3492

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

botanical
Copy link
Member

@botanical botanical commented Oct 5, 2023

Summary: Summary of changes

Addresses CUMULUS-3368: Introducing allFilesPresent metadata field for Cumulus Collections.

Changes

Change allows Cumulus Collections to enforce that all files specified in Collections list to be present before ingesting Granules.

PR Checklist

  • Update CHANGELOG
  • Unit tests
  • Ad-hoc testing - Deploy changes and test manually
  • Integration tests

@botanical botanical changed the title CUMULUS-3368: [User Contribution] Introducing allFilesPresent metadata field for Cumulus Collections. Fixes Issue #3193 [CUMULUS-3368]: [User Contribution] Introducing allFilesPresent metadata field for Cumulus Collections. Fixes Issue #3193 Oct 5, 2023
@botanical botanical marked this pull request as ready for review October 5, 2023 23:07
@@ -48,6 +48,11 @@ Users/clients that do not make use of these endpoints will not be impacted.
- Added optional `maxDownloadTime` field to `provider` schema
- Added `max_download_time` column to PostgreSQL `providers` table
- Updated `@cumulus/ingest/lock` to check expired locks based on `provider.maxDownloadTime`
- **CUMULUS-3368**
- Added `allFilesPresent` field for `collection.meta` that defines whether or not all files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since allFilesPresent is used in our task, suggest to add it to https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js#L211

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please add the following to collection and granule mapping:
https://github.com/nasa/cumulus/blob/master/packages/es-client/config/mappings.json
"meta": {
"type": "object",
"dynamic": false
},
There is a ticket CUMULUS-3417 about mapping issue, and it can happen for collection record as well

@@ -48,6 +48,10 @@ A list of buckets with types that will be used to assign bucket targets based on

A Cumulus [collection](https://github.com/nasa/cumulus/blob/master/packages/api/lib/schemas.js) object. Used to define granule file groupings and granule metadata for discovered files. The collection object utilizes the collection type key to generate types in the output object on discovery.

##### Collection Meta

If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules's missing files. Otherwise, the default behavior ignores missing files in granules.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules's missing files. Otherwise, the default behavior ignores missing files in granules.
If the collection is configured with `collection.meta.allFilesPresent` set to `true`, the task will remove granules with all the files. Otherwise, by default, task continue to process granules without all the files.

remove granules's missing files is confusing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wording above is still confusing. Do you mean something like the following?

If the collection is configured with collection.meta.allFilesPresent set to true, the task will include only granules where all files specified in the collection.files list are present, thus excluding granules where at least one file is absent. Otherwise, by default (i.e., collection.meta.allFilesPresent defaults to false), the task includes every granule, whether or not all of its files are present.

* @returns {Boolean} - returns true if all file in collection config are present for granuleId
*/
const checkGranuleHasAllFiles = (config, filesByGranuleId, granuleId) => {
const granuleHasAllFiles = filesByGranuleId[granuleId].length >= config.collection.files.length;
Copy link
Contributor

@jennyhliu jennyhliu Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If collection.files[].regex matchs multiple granule files, how do we make sure each regex has at least one granule file? The assertion is weak.

Copy link
Contributor

@jennyhliu jennyhliu Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A granule can have 10 files but only 5 collection.files configuration.

@@ -314,6 +370,16 @@ const discoverGranules = async ({ config }) => {
concurrency: get(config, 'concurrency', 3),
});

let allFilesPresent = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to move line 373-376 to handleGranulesWithoutAllFilesPresent function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jkovarik Jkovarik mentioned this pull request Jul 8, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants