-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
crosscluster/logical: fix crash when using UDTs in LDR on multiple nodes #143311
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
becuase multinode tests do not dance well without CI system, i think you need skip the tests under race now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice find!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @jeffswenson)
Previously running LDR into a table that included user-defined types could crash the nodes due to an error citing missing type hydration. This occurs during decoding of an incoming KV into a logical row, and is due to the fact the decode used in the LDR writer processor is configured with table descriptors received from the source cluster during LDR initialization, but the LDR writer processors setup was _not_ hydrating said descriptors before using them. This hydration step is modeled as an in-place mutation of said metadata that does not change its type signature, so even though downstream code requires its inputs be hydrated (and crashes if they're not); it is up to the specific execution path to ensure this step occurs as this is not checked during compilation. Furthermore, when this metadata is serialized and transmitted to another node as part of a distSQL flow spec, the table descriptor seen by the other node does _not_ contain hydrated types, while if the same distsql processor is passed the same spec on a local node, it sees the types in that descriptor already hydrated. Such cases are not unheard of and just mean we rely on tests rather than the compiler to ensure proper usage. And indeed, when UDT functionality was added, the addition included tests to exercise it. However this is where a subtly complication emerges, which meant that these tests, which ran LDR on a table including UDTs from one node to one node, did not catch this bug. Apparently when a TableDescriptor that has been hydrated is placed in a DistSQL spec, the TableDescriptor read back out of that spec by a DistSQL processor to which that spec is supplied may also be hydrated or may not be be, depending on whether the processor is on the same node as the coordinator where the spec was created or not. In the tests of UDT behavior, since these tests were configured to use a single source and destination node, this was always the case, so despite missing an explicit hydration in the processor, the table descriptor just so happened to already be hydrated when it was used in that processor. However in a multi-node cluster, the processors on remote nodes would encounter a non-hydrated descriptor when reading the same spec. Our bias towards using single-node test configurations for exercising features that do not appear to have anything to do with how work is distributed across many nodes in a larger cluster is generally sensible: it speeds up tests and thus allows us to add more test cases without making the test suite unwieldy. However in this case there was a very non-obvious interaction between the routines for decoding of some bytes and executing those routines on different nodes. This diff changes the UDT tests to now explicitly run on multiple nodes --and scatters many ranges to ensure those nodes are used. This caused these tests to reliably fail until the missing hydration steps were added to the processor to fix the underlying bug. Release note (bug fix): Fix a crash due to 'use of enum metadata before hydration' when using LDR with user-defined types. Epic: none.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TFTR! bors r+ |
Build succeeded: |
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 8c48ebb to blathers/backport-release-24.3-143311: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 24.3.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Previously running LDR into a table that included user-defined types could crash the nodes due to an error citing missing type hydration.
This occurs during decoding of an incoming KV into a logical row, and is due to the fact the decode used in the LDR writer processor is configured with table descriptors received from the source cluster during LDR initialization, but the LDR writer processors setup was not hydrating said descriptors before using them.
This hydration step is modeled as an in-place mutation of said metadata that does not change its type signature, so even though downstream code requires its inputs be hydrated (and crashes if they're not) it is up to the specific execution path to ensure this step occurs as this is not checked during compilation.
Furthermore, when this metadata is serialized and transmitted to another node as part of a distSQL flow spec, the table descriptor seen by the other node does not contain hydrated types, while if the same distsql processor is passed the same spec on a local node, it sees the types in that descriptor already hydrated.
Such cases are not unheard of and just mean we rely on tests rather than the compiler to ensure proper usage. And indeed, when UDT functionality was added, the addition included tests to exercise it. However this is where a subtly complication emerges, which meant that these tests, which ran LDR on a table including UDTs from one node to one node, did not catch this bug.
Apparently when a TableDescriptor that has been hydrated is placed in a DistSQL spec, the TableDescriptor read back out of that spec by a DistSQL processor to which that spec is supplied may also be hydrated or may not be be, depending on whether the processor is on the same node as the coordinator where the spec was created or not. In the tests of UDT behavior, since these tests were configured to use a single source and destination node, this was always the case, so despite missing an explicit hydration in the processor, the table descriptor just so happened to already be hydrated when it was used in that processor. However in a multi-node cluster, the processors on remote nodes would encounter a non-hydrated descriptor when reading the same spec.
Our bias towards using single-node test configurations for exercising features that do not appear to have anything to do with how work is distributed across many nodes in a larger cluster is generally sensible: it speeds up tests and thus allows us to add more test cases without making the test suite unwieldy. However in this case there was a very non-obvious interaction between the routines for decoding of some bytes and executing those routines on different nodes.
This diff changes the UDT tests to now explicitly run on multiple nodes --and scatters many ranges to ensure those nodes are used. This caused these tests to reliably fail until the missing hydration steps were added to the processor to fix the underlying bug.
Release note (bug fix): Fix a crash due to 'use of enum metadata before hydration' when using LDR with user-defined types.
Epic: none.