crosscluster/logical: fix crash when using UDTs in LDR on multiple nodes #143311

Merged: 1 commit merged into cockroachdb:master on Mar 24, 2025

Conversation

@dt dt (Member) commented Mar 21, 2025

Previously, running LDR into a table that included user-defined types could crash nodes with an error citing missing type hydration.

This occurs while decoding an incoming KV into a logical row: the decoder used in the LDR writer processor is configured with table descriptors received from the source cluster during LDR initialization, but the LDR writer processor's setup was not hydrating those descriptors before using them.

This hydration step is modeled as an in-place mutation of that metadata which does not change its type signature, so even though downstream code requires its inputs to be hydrated (and crashes if they are not), it is up to each specific execution path to ensure the step occurs; nothing enforces it at compile time.
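To make the failure mode concrete, here is a minimal, hypothetical Go sketch of the pattern described above. The names (tableDesc, hydrateUserDefinedTypes, decodeRow, setupWriter) are illustrative stand-ins, not the actual CockroachDB APIs; the point is that hydration is a side-effecting call the setup path has to remember to make, and nothing in the type system forces it:

```go
package ldrsketch

import "context"

// tableDesc stands in for a table descriptor received from the source
// cluster; udtHydrated records whether hydration has run.
type tableDesc struct {
	name        string
	udtHydrated bool
}

// hydrateUserDefinedTypes is a stand-in for the in-place hydration step: it
// mutates the descriptor but does not change its type, so callers that skip
// it still compile.
func hydrateUserDefinedTypes(_ context.Context, d *tableDesc) error {
	d.udtHydrated = true // the real code resolves enum/record metadata here
	return nil
}

// decodeRow stands in for the KV-to-logical-row decoder, which assumes a
// hydrated descriptor and crashes otherwise.
func decodeRow(d *tableDesc, kv []byte) ([]any, error) {
	if !d.udtHydrated {
		panic("use of enum metadata before hydration")
	}
	return []any{string(kv)}, nil
}

// setupWriter models the fix: hydrate every descriptor taken from the spec
// before the processor ever decodes with it.
func setupWriter(ctx context.Context, descs []*tableDesc) error {
	for _, d := range descs {
		if err := hydrateUserDefinedTypes(ctx, d); err != nil {
			return err
		}
	}
	return nil
}
```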

Furthermore, when this metadata is serialized and transmitted to another node as part of a DistSQL flow spec, the table descriptor seen by that node does not contain hydrated types, whereas a DistSQL processor handed the same spec on the local node sees the types in that descriptor already hydrated.

Such cases are not unheard of; they just mean we rely on tests rather than the compiler to ensure proper usage. And indeed, when UDT functionality was added, the change included tests to exercise it. However, this is where a subtle complication emerges, one which meant that these tests, which ran LDR on a table including UDTs from one node to one node, did not catch this bug.

When a TableDescriptor that has been hydrated is placed in a DistSQL spec, the TableDescriptor read back out of that spec by the DistSQL processor it is supplied to may or may not be hydrated, depending on whether the processor is on the same node as the coordinator where the spec was created. The UDT tests were configured to use a single source node and a single destination node, so the processor was always on the coordinator's node; despite the missing explicit hydration in the processor, the table descriptor just so happened to already be hydrated when that processor used it. In a multi-node cluster, however, processors on remote nodes would encounter a non-hydrated descriptor when reading the same spec.
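A rough illustration of that difference, reusing the tableDesc stand-in from the sketch above (again hypothetical, not the real DistSQL flow-setup code): on the coordinator's node the processor can be handed the very descriptor object the coordinator already hydrated, while a remote node reconstructs the descriptor from the serialized spec and therefore starts from a non-hydrated copy.

```go
// descriptorForProcessor sketches the two paths a spec's descriptor can take.
// The local path hands the processor the coordinator's in-memory (already
// hydrated) object; the remote path simulates the proto round trip, which
// preserves only the serialized fields and drops the in-memory hydration.
func descriptorForProcessor(specDesc *tableDesc, sameNodeAsCoordinator bool) *tableDesc {
	if sameNodeAsCoordinator {
		return specDesc // same object the coordinator hydrated
	}
	copied := *specDesc
	copied.udtHydrated = false // what deserializing the wire form yields
	return &copied
}
```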

Our bias towards single-node test configurations for exercising features that do not appear to depend on how work is distributed across a larger cluster is generally sensible: it speeds up tests and lets us add more test cases without making the suite unwieldy. In this case, however, there was a very non-obvious interaction between the routines that decode incoming bytes and the node on which those routines execute.

This diff changes the UDT tests to explicitly run on multiple nodes and scatters many ranges to ensure those nodes are actually used. With that change, these tests reliably failed until the missing hydration steps were added to the processor, fixing the underlying bug.
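For reference, the rough shape of such a test change might look like the sketch below. testcluster.StartTestCluster, sqlutils.MakeSQLRunner, and the ALTER TABLE ... SPLIT AT / SCATTER statements are CockroachDB's standard test utilities and SQL, but the node count, table name, split points, and the omitted LDR wiring are placeholders, not the actual test code:

```go
package ldrsketch_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
	"github.com/cockroachdb/cockroach/pkg/util/leaktest"
)

func TestLDRWithUDTsMultiNode(t *testing.T) {
	defer leaktest.AfterTest(t)()
	ctx := context.Background()

	// Start a multi-node cluster so that LDR writer processors are planned on
	// nodes other than the gateway where the flow spec is created.
	tc := testcluster.StartTestCluster(t, 3, base.TestClusterArgs{})
	defer tc.Stopper().Stop(ctx)

	sqlDB := sqlutils.MakeSQLRunner(tc.ServerConn(0))

	// ... create the UDT-bearing table and set up LDR here ...

	// Split and scatter the table so its ranges (and therefore the processors
	// decoding into it) land on multiple nodes.
	sqlDB.Exec(t, `ALTER TABLE db.tab SPLIT AT VALUES (100), (200), (300)`)
	sqlDB.Exec(t, `ALTER TABLE db.tab SCATTER`)
}
```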

Release note (bug fix): Fix a crash due to 'use of enum metadata before hydration' when using LDR with user-defined types.
Epic: none.

@dt dt requested review from rafiss and jeffswenson March 21, 2025 22:27
@dt dt requested a review from a team as a code owner March 21, 2025 22:27
@cockroach-teamcity (Member)

This change is Reviewable

@dt dt added the backport-24.3.x and backport-25.1.x labels on Mar 21, 2025
@stevendanna stevendanna (Collaborator) left a comment

Nice work here!

@msbutler msbutler (Collaborator) left a comment

Because multi-node tests do not dance well without the CI system, I think you need to skip the tests under race now.
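For instance, something along these lines near the top of the multi-node test, using the pkg/testutils/skip helper (the message is a placeholder), would keep it out of race builds:

```go
// Skip the multi-node UDT variant under the race detector, where starting
// several nodes is too slow for CI.
skip.UnderRace(t, "multi-node LDR cluster is too slow under race")
```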

@rafiss rafiss (Collaborator) left a comment

nice find!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)

@jeffswenson jeffswenson (Collaborator) left a comment

LGTM

@dt dt (Member, Author) commented Mar 24, 2025

TFTR!

bors r+

@craig craig bot (Contributor) commented Mar 24, 2025

@craig craig bot merged commit 87b5f83 into cockroachdb:master Mar 24, 2025
24 checks passed
@blathers-crl blathers-crl bot commented Mar 24, 2025

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 8c48ebb to blathers/backport-release-24.3-143311: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 24.3.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Labels: backport-24.3.x, backport-25.1.x, backport-failed, v25.2.0-prerelease