crosscluster/logical: fix crash when using UDTs in LDR on multiple nodes #143311

Merged: 1 commit merged into cockroachdb:master on Mar 24, 2025

Conversation

@dt dt (Member) commented Mar 21, 2025

Previously, running LDR into a table that included user-defined types could crash nodes with an error citing missing type hydration.

This occurs while decoding an incoming KV into a logical row: the decoder used in the LDR writer processor is configured with table descriptors received from the source cluster during LDR initialization, but the LDR writer processor's setup was not hydrating those descriptors before using them.

This hydration step is modeled as an in-place mutation of that metadata which does not change its type signature, so even though downstream code requires its inputs to be hydrated (and crashes if they are not), it is up to each specific execution path to ensure the step occurs; nothing enforces it at compile time.
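To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the pattern described above. The names (tableDesc, hydrateUserDefinedTypes, decodeRow, setupWriter) are illustrative stand-ins, not the actual CockroachDB APIs; the point is that hydration is a side-effecting call the setup path has to remember to make, and nothing in the type system forces it:

```go
package ldrsketch

import "context"

// tableDesc stands in for a table descriptor received from the source
// cluster; udtHydrated records whether hydration has run.
type tableDesc struct {
	name        string
	udtHydrated bool
}

// hydrateUserDefinedTypes is a stand-in for the in-place hydration step: it
// mutates the descriptor but does not change its type, so callers that skip
// it still compile.
func hydrateUserDefinedTypes(_ context.Context, d *tableDesc) error {
	d.udtHydrated = true // the real code resolves enum/record metadata here
	return nil
}

// decodeRow stands in for the KV-to-logical-row decoder, which assumes a
// hydrated descriptor and crashes otherwise.
func decodeRow(d *tableDesc, kv []byte) ([]any, error) {
	if !d.udtHydrated {
		panic("use of enum metadata before hydration")
	}
	return []any{string(kv)}, nil
}

// setupWriter models the fix: hydrate every descriptor taken from the spec
// before the processor ever decodes with it.
func setupWriter(ctx context.Context, descs []*tableDesc) error {
	for _, d := range descs {
		if err := hydrateUserDefinedTypes(ctx, d); err != nil {
			return err
		}
	}
	return nil
}
```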

Furthermore, when this metadata is serialized and transmitted to another node as part of a DistSQL flow spec, the table descriptor seen by that node does not contain hydrated types, whereas a DistSQL processor handed the same spec on the local node sees the types in that descriptor already hydrated.

Such cases are not unheard of; they just mean we rely on tests rather than the compiler to ensure proper usage. And indeed, when UDT functionality was added, the change included tests to exercise it. However, this is where a subtle complication emerges, one which meant that these tests, which ran LDR on a table including UDTs from one node to one node, did not catch this bug.

When a TableDescriptor that has been hydrated is placed in a DistSQL spec, the TableDescriptor read back out of that spec by the DistSQL processor it is supplied to may or may not be hydrated, depending on whether the processor is on the same node as the coordinator where the spec was created. The UDT tests were configured to use a single source node and a single destination node, so the processor was always on the coordinator's node; despite the missing explicit hydration in the processor, the table descriptor just so happened to already be hydrated when that processor used it. In a multi-node cluster, however, processors on remote nodes would encounter a non-hydrated descriptor when reading the same spec.
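A rough illustration of that difference, reusing the tableDesc stand-in from the sketch above (again hypothetical, not the real DistSQL flow-setup code): on the coordinator's node the processor can be handed the very descriptor object the coordinator already hydrated, while a remote node reconstructs the descriptor from the serialized spec and therefore starts from a non-hydrated copy.

```go
// descriptorForProcessor sketches the two paths a spec's descriptor can take.
// The local path hands the processor the coordinator's in-memory (already
// hydrated) object; the remote path simulates the proto round trip, which
// preserves only the serialized fields and drops the in-memory hydration.
func descriptorForProcessor(specDesc *tableDesc, sameNodeAsCoordinator bool) *tableDesc {
	if sameNodeAsCoordinator {
		return specDesc // same object the coordinator hydrated
	}
	copied := *specDesc
	copied.udtHydrated = false // what deserializing the wire form yields
	return &copied
}
```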

Our bias towards single-node test configurations for exercising features that do not appear to depend on how work is distributed across a larger cluster is generally sensible: it speeds up tests and lets us add more test cases without making the suite unwieldy. In this case, however, there was a very non-obvious interaction between the routines that decode incoming bytes and the node on which those routines execute.

This diff changes the UDT tests to explicitly run on multiple nodes and scatters many ranges to ensure those nodes are actually used. With that change, these tests reliably failed until the missing hydration steps were added to the processor, fixing the underlying bug.
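For reference, the rough shape of such a test change might look like the sketch below. testcluster.StartTestCluster, sqlutils.MakeSQLRunner, and the ALTER TABLE ... SPLIT AT / SCATTER statements are CockroachDB's standard test utilities and SQL, but the node count, table name, split points, and the omitted LDR wiring are placeholders, not the actual test code:

```go
package ldrsketch_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestLDRWithUDTsMultiNode(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// Start a multi-node cluster so that LDR writer processors are planned on
	// nodes other than the gateway where the flow spec is created.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	sqlDB := sqlutils.MakeSQLRunner(tc.ServerConn(0))

	// ... create the UDT-bearing table and set up LDR here ...

	// Split and scatter the table so its ranges (and therefore the processors
	// decoding into it) land on multiple nodes.
	sqlDB.Exec(t, `ALTER TABLE db.tab SPLIT AT VALUES (100), (200), (300)`)
	sqlDB.Exec(t, `ALTER TABLE db.tab SCATTER`)
}
```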

Release note (bug fix): Fix a crash due to 'use of enum metadata before hydration' when using LDR with user-defined types.
Epic: none.

@dt dt requested review from rafiss and jeffswenson March 21, 2025 22:27
@dt dt requested a review from a team as a code owner March 21, 2025 22:27
@cockroach-teamcity (Member)

This change is Reviewable

@dt dt added the backport-24.3.x and backport-25.1.x labels on Mar 21, 2025
@stevendanna stevendanna (Collaborator) left a comment

Nice work here!

@msbutler msbutler (Collaborator) left a comment

Because multi-node tests do not dance well without the CI system, I think you need to skip the tests under race now.
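For instance, something along these lines near the top of the multi-node test, using the pkg/testutils/skip helper (the message is a placeholder), would keep it out of race builds:

```go
// Skip the multi-node UDT variant under the race detector, where starting
// several nodes is too slow for CI.
skip.UnderRace(t, "multi-node LDR cluster is too slow under race")
```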

@rafiss rafiss (Collaborator) left a comment

nice find!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)

@jeffswenson jeffswenson (Collaborator) left a comment

LGTM

@dt dt (Member, Author) commented Mar 24, 2025

TFTR!

bors r+

@craig craig bot (Contributor) commented Mar 24, 2025

@craig craig bot merged commit 87b5f83 into cockroachdb:master Mar 24, 2025
24 checks passed
@blathers-crl blathers-crl bot commented Mar 24, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8c48ebb to blathers/backport-release-24.3-143311: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels: backport-24.3.x, backport-25.1.x, backport-failed, v25.2.0-prerelease