Add Codec unit tests #2035

Draft
wants to merge 2 commits into
base: main

Conversation

rabernat
Contributor

While working on #2031 I became familiar with the new V3 Codec API and its peculiarities. I also saw that we don't yet have actual unit tests for the codecs. We have some tests in tests/v3/test_codecs/, but I'd call these end-to-end tests, since they create Arrays.

I think it's important for us to unit-test all of the important internal interfaces separately from end-to-end tests. This is particularly important for codecs, so we can guard against data corruption issues.

This PR is a step in that direction.

TODO:

  • Add decode_partial and encode_partial tests.
  • Parametrize over more variations of input data
  • Look for opportunities to make these tests simpler / faster (right now there is a combinatorial explosion of possibilities)
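
To make the distinction between unit tests and end-to-end tests concrete, here is a minimal sketch of a round-trip check that exercises a codec directly, without creating an Array. `ToyZlibCodec`, `roundtrip`, and `PAYLOADS` are hypothetical names invented for illustration; they are not zarr's Codec API, and the real tests in this PR may look quite different.

```python
import asyncio
import zlib


class ToyZlibCodec:
    """Hypothetical stand-in for a bytes-to-bytes codec (NOT zarr's API)."""

    def __init__(self, level: int = 6) -> None:
        self.level = level

    async def encode(self, data: bytes) -> bytes:
        return zlib.compress(data, self.level)

    async def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


def roundtrip(codec: ToyZlibCodec, payload: bytes) -> bytes:
    """Encode then decode a payload, driving the async methods to completion."""

    async def _run() -> bytes:
        return await codec.decode(await codec.encode(payload))

    return asyncio.run(_run())


# A few input variations: empty, highly compressible, and mixed bytes.
PAYLOADS = [b"", b"\x00" * 1024, bytes(range(256)) * 4]


def check_roundtrip_all() -> int:
    """Check every (level, payload) pair and return the number of cases run."""
    checked = 0
    for level in (1, 6, 9):
        codec = ToyZlibCodec(level)
        for payload in PAYLOADS:
            assert roundtrip(codec, payload) == payload
            checked += 1
    return checked
```

Even this toy version shows the combinatorial growth mentioned above: three levels times three payloads is already nine cases for a single codec.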

@rabernat
Contributor Author

rabernat commented Jul 14, 2024

One area of feedback on the Codec API: it makes very little sense to me that the Codec API is async. Almost by definition Codecs are blocking, CPU-intensive code. They are not doing I/O. Why should their core methods be async?

It should be the Pipeline's job to dispatch blocking Codec calls to threads, not the Codec's.
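
A rough sketch of what that division of labor could look like, assuming codecs expose plain synchronous `encode` / `decode` methods and the pipeline alone decides how to schedule them. `SyncZlibCodec` and `ThreadedPipeline` are hypothetical names for illustration only, not existing zarr classes; `asyncio.to_thread` is the standard-library call (Python 3.9+) used to move blocking work off the event loop.

```python
import asyncio
import zlib


class SyncZlibCodec:
    """Hypothetical blocking codec: plain methods, no async (NOT zarr's API)."""

    def encode(self, data: bytes) -> bytes:
        return zlib.compress(data)

    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class ThreadedPipeline:
    """Hypothetical pipeline that owns the decision to run codecs in threads."""

    def __init__(self, codecs) -> None:
        self.codecs = list(codecs)

    async def encode(self, chunk: bytes) -> bytes:
        # Apply codecs in order, each blocking call running in a worker thread.
        for codec in self.codecs:
            chunk = await asyncio.to_thread(codec.encode, chunk)
        return chunk

    async def decode(self, chunk: bytes) -> bytes:
        # Undo the codecs in reverse order.
        for codec in reversed(self.codecs):
            chunk = await asyncio.to_thread(codec.decode, chunk)
        return chunk

    async def encode_many(self, chunks) -> list:
        # Chunks are independent, so they can be encoded concurrently.
        return list(await asyncio.gather(*(self.encode(c) for c in chunks)))
```

With this shape, a codec can be unit-tested as an ordinary function call, and only the pipeline needs async-aware tests.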

@d-v-b
Contributor

d-v-b commented Jul 20, 2024

> One area of feedback on the Codec API: it makes very little sense to me that the Codec API is async. Almost by definition Codecs are blocking, CPU-intensive code. They are not doing I/O. Why should their core methods be async?
>
> It should be the Pipeline's job to dispatch blocking Codec calls to threads, not the Codec's.

The reason all codecs need to be async is that sharding is a codec, and the encode / decode operations of the sharding codec require doing IO.

I would love to see some formal separation between "codecs that read from storage" (i.e., just sharding) and "codecs that transform bytes in memory" (all the other codecs), but I'm not sure what this would look like.
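
One possible shape for such a separation, offered only as a sketch: two distinct base classes, with a dispatcher that awaits storage-backed codecs directly and sends in-memory codecs to a worker thread. Every name here (`InMemoryCodec`, `StorageCodec`, `ZlibCodec`, `DictShardCodec`, `decode_any`) is hypothetical, and the dict-backed "shard" is a toy that ignores how real sharding indexes and reads chunks.

```python
import asyncio
import zlib
from abc import ABC, abstractmethod


class InMemoryCodec(ABC):
    """Hypothetical base: transforms bytes in memory, blocking, no I/O."""

    @abstractmethod
    def decode(self, data: bytes) -> bytes: ...


class StorageCodec(ABC):
    """Hypothetical base: must read from storage, so decode is a coroutine."""

    @abstractmethod
    async def decode(self, key: str) -> bytes: ...


class ZlibCodec(InMemoryCodec):
    def decode(self, data: bytes) -> bytes:
        return zlib.decompress(data)


class DictShardCodec(StorageCodec):
    """Toy storage-backed codec; a dict stands in for a real store."""

    def __init__(self, store: dict) -> None:
        self.store = store

    async def decode(self, key: str) -> bytes:
        await asyncio.sleep(0)  # placeholder for a real asynchronous read
        return self.store[key]


async def decode_any(codec, arg):
    """Await storage codecs directly; run in-memory codecs in a thread."""
    if isinstance(codec, StorageCodec):
        return await codec.decode(arg)
    return await asyncio.to_thread(codec.decode, arg)
```

The open question raised above remains: the two kinds of codec take different inputs (a storage key versus bytes), so a real design would still need to say how they compose in one pipeline.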

@jhamman jhamman added the V3 Affects the v3 branch label Aug 9, 2024
@jhamman jhamman added this to the After 3.0.0 milestone Oct 1, 2024
Contributor

@d-v-b d-v-b left a comment

I love tests! thanks for this @rabernat

@d-v-b
Contributor

d-v-b commented Oct 10, 2024

oops, I approved without noting that this is a draft. sorry for the noise. consider it a draft approval.

@jhamman jhamman changed the base branch from v3 to main October 14, 2024 20:57
Labels
V3 Affects the v3 branch
Projects
Status: Todo
3 participants