All versioned datasets from a given pipeline execution should have the same timestamp #4234
ChristopherRabotin started this conversation in Idea
Hi there,
Obligatory thank you: Kedro is great, and I'm excited to share what we've been using it for in a few months' time.
In our particular use case, we build 20+ data products in our pipelines, some of which take several minutes to build. We then tell several teams where they can find the latest data, and for these operations we log the version of each data product we generate. Since we generate a lot of them, we currently have to log one timestamp per data product. A number of our data products are time critical and need to be communicated without error to another team within a few minutes.
Therefore, we've changed the catalog so that each pipeline starts with a node that sets the timestamp for the whole pipeline execution, and every node outputs a `PartitionedDataset`: this mimics a versioned dataset, but we can guarantee that all the datasets share the same version.

We've been told by non-tech-oriented teams that the timestamp isn't super easy for them to parse (it seems quite trivial to me). Eventually, we will probably implement a "tagging" system where the versioning node asks the user for a tag for the run and appends the timestamp in ISO format.
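Roughly, the idea looks like the sketch below (node and argument names are illustrative, not our actual code, and it assumes the catalog entries for the products are `PartitionedDataset`s): one node produces the run timestamp, and every product node returns a dict keyed by that timestamp, so each partitioned dataset writes one file per run with the same "version" across all products.

```python
from datetime import datetime, timezone

import pandas as pd


def generate_run_timestamp() -> str:
    """First node of the pipeline: a single timestamp reused by every product node.

    The format loosely mimics Kedro's versioned-dataset timestamps; any
    sortable, filesystem-safe format works.
    """
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


def build_product(raw: pd.DataFrame, run_timestamp: str) -> dict[str, pd.DataFrame]:
    """Illustrative product node.

    Returning a dict keyed by the shared timestamp makes the
    PartitionedDataset save one partition (file) named after the run,
    so every product from this execution carries the same version string.
    """
    processed = raw  # ... real transformation goes here
    return {run_timestamp: processed}
```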
We've also noticed that switching from a versioned pandas CSV dataset to a partitioned dataset makes saving each file about 5 times faster. We're not sure what's going on there, but it's an unexpected benefit.