All versioned datasets from a given pipeline execution should have the same timestamp #4234
ChristopherRabotin started this conversation in Idea
Hi there,
Obligatory thank you: Kedro is great, and I'm excited to share what we've been using it for in a few months' time.
In our particular use case, we build 20+ data products in our pipelines, some of which take several minutes to build. We then tell several teams where they can find the latest data, and for these operations we log the version of each data product we generate. Since we generate a lot of them, we currently have to log one timestamp per data product. A number of our data products are time critical and need to be communicated without error to another team within a few minutes.
Therefore, we've changed the catalog so that each pipeline starts with a node that sets the timestamp for the whole pipeline execution, and every node outputs a `PartitionedDataset`: this mimics a versioned dataset, but we can guarantee that all the datasets share the same version.

We've been told by non-tech-oriented teams that the timestamp isn't super easy for them to parse (it seems quite trivial to me). Eventually, we will probably implement a "tagging" system where the versioning node asks the user for a tag for the run and appends the timestamp in ISO format.
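Roughly, the idea looks like the sketch below (node and argument names are illustrative, not our actual code, and it assumes the catalog entries for the products are `PartitionedDataset`s): one node produces the run timestamp, and every product node returns a dict keyed by that timestamp, so each partitioned dataset writes one file per run with the same "version" across all products.

```python
from datetime import datetime, timezone

import pandas as pd


def generate_run_timestamp() -> str:
    """First node of the pipeline: a single timestamp reused by every product node.

    The format loosely mimics Kedro's versioned-dataset timestamps; any
    sortable, filesystem-safe format works.
    """
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")


def build_product(raw: pd.DataFrame, run_timestamp: str) -> dict[str, pd.DataFrame]:
    """Illustrative product node.

    Returning a dict keyed by the shared timestamp makes the
    PartitionedDataset save one partition (file) named after the run,
    so every product from this execution carries the same version string.
    """
    processed = raw  # ... real transformation goes here
    return {run_timestamp: processed}
```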
We've also noticed that switching from a versioned pandas CSV dataset to a partitioned dataset makes saving each file about 5 times faster. We're not sure what's going on there, but it's an unexpected benefit.