-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test Isolation using linux namespaces #277
base: main
Are you sure you want to change the base?
Test Isolation using linux namespaces #277
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're part of the way there . I confirmed ROS 2 processes can start in the namespace, and that they're unable to talk to ROS processes in other namespace, or outside of a namespace. However, I'm unable to get two processes in the same namespace to talk
The processes seem isolated
# these can't communicate with each other
bazel run //test_isolation:unshare /opt/ros/humble/lib/examples_rclcpp_minimal_publisher/publisher_member_function
bazel run //test_isolation:unshare /opt/ros/humble/lib/examples_rclcpp_minimal_subscriber/subscriber_member_function
# nor can the communicate with these
/opt/ros/humble/lib/examples_rclcpp_minimal_publisher/publisher_member_function
/opt/ros/humble/lib/examples_rclcpp_minimal_subscriber/subscriber_member_function
But communication in the same process isn't working. I'm looking into why.
# This does not work
bazel run //test_isolation:unshare -- /bin/bash -c 'source /opt/ros/humble/setup.bash && ros2 run examples_rclcpp_minimal_publisher publisher_member_function & ros2 run examples_rclcpp_minimal_subscriber subscriber_member_function'
Reviewable status: 0 of 2 files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 1 unresolved discussion
a discussion (no related file):
Cyclonedds gave me a bit more info to investigate. It's failing to find a suitable interface for UDP traffic.
$ bazel run //test_isolation:unshare -- /bin/bash -c 'source /opt/ros/humble/setup.bash && export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp && ros2 run examples_rclcpp_minimal_publisher publisher_member_function & ros2 run examples_rclcpp_minimal_subscriber subscriber_member_function'
INFO: Analyzed target //test_isolation:unshare (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //test_isolation:unshare up-to-date:
bazel-bin/test_isolation/unshare
INFO: Elapsed time: 0.149s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
INFO: Running command line: bazel-bin/test_isolation/unshare /bin/bash -c 'source /opt/ros/humble/setup.bash && export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp && ros2 run examples_rclcpp_minimal_publisher publisher_member_function & ros2 run exam
INFO: Build completed successfully, 1 total action
1684359903.973761 [0] publisher_: failed to find interfaces for "udp"
[ERROR] [1684359903.973803521] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
>>> [rcutils|error_handling.c:108] rcutils_set_error_state()
This error state is being overwritten:
'error not set, at ./src/rcl/node.c:263'
with this new error message:
'rcl node's rmw handle is invalid, at ./src/rcl/node.c:415'
rcutils_reset_error() should be called after error handling to avoid this.
<<<
[ERROR] [1684359903.973875581] [rcl]: Failed to fini publisher for node: 1
terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:415
[ros2run]: Aborted
^C[INFO] [1684359912.595173344] [rclcpp]: signal_handler(signum=2)
which comes from here: https://github.com/eclipse-cyclonedds/cyclonedds/blob/a10ced3c81cc009e7176912190f710331a4d6caf/src/core/ddsi/src/ddsi_ownip.c#L295
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, all discussions resolved
a discussion (no related file):
Previously, sloretz (Shane Loretz) wrote…
Cyclonedds gave me a bit more info to investigate. It's failing to find a suitable interface for UDP traffic.
$ bazel run //test_isolation:unshare -- /bin/bash -c 'source /opt/ros/humble/setup.bash && export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp && ros2 run examples_rclcpp_minimal_publisher publisher_member_function & ros2 run examples_rclcpp_minimal_subscriber subscriber_member_function' INFO: Analyzed target //test_isolation:unshare (0 packages loaded, 0 targets configured). INFO: Found 1 target... Target //test_isolation:unshare up-to-date: bazel-bin/test_isolation/unshare INFO: Elapsed time: 0.149s, Critical Path: 0.00s INFO: 1 process: 1 internal. INFO: Build completed successfully, 1 total action INFO: Running command line: bazel-bin/test_isolation/unshare /bin/bash -c 'source /opt/ros/humble/setup.bash && export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp && ros2 run examples_rclcpp_minimal_publisher publisher_member_function & ros2 run exam INFO: Build completed successfully, 1 total action 1684359903.973761 [0] publisher_: failed to find interfaces for "udp" [ERROR] [1684359903.973803521] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error >>> [rcutils|error_handling.c:108] rcutils_set_error_state() This error state is being overwritten: 'error not set, at ./src/rcl/node.c:263' with this new error message: 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:415' rcutils_reset_error() should be called after error handling to avoid this. <<< [ERROR] [1684359903.973875581] [rcl]: Failed to fini publisher for node: 1 terminate called after throwing an instance of 'rclcpp::exceptions::RCLError' what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:415 [ros2run]: Aborted ^C[INFO] [1684359912.595173344] [rclcpp]: signal_handler(signum=2)which comes from here: https://github.com/eclipse-cyclonedds/cyclonedds/blob/a10ced3c81cc009e7176912190f710331a4d6caf/src/core/ddsi/src/ddsi_ownip.c#L295
Fixed by d735994 which makes sure the interface is up ( I noticed that was one of the things cyclonedds was checking for).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion
a discussion (no related file):
Thinking about how the rules should be implemented, my current plan is to make the dload shim macros aware of the option to isolate.
The dload_cc_reexec
, dload_cc_ldwrap
, and dload_py_shim
would all get an isolate
boolean option that defaulted to false, and when was true would add code to the shim to call network_isolation::create_linux_namespaces()
. The ros_cc_test
and ros_py_test
would get an isolate
option defaulting to true
. When true
it would add the dependency on @bazel_ros2_rules//network-isolation:network_isolation_cc
to the shim executable and pass on the value to the dload macros.
The //network-isolation:isolate
executable wouldn't be used directly by the rules in this repo, but could be invoked by sh_test
type rules to isolate those types of tests.
I don't see any linked issue number here. Where I can find the user story or requirements document that governs this work? I can imagine several different possible victory conditions, and not all of them would require adding more dload shim complexity. |
This comes from discussion with @IanTheEngineer and @calderpg-tri, and is motivated by test failures from collisions with the current rmw isolation mechanism. They can give you more info. |
This work is a consequence of anzu #10432. The primary victory condition is that network isolation for ROS 2 is solved - no hacks, no DDS-version-specific capabilities (i.e. nothing that only works in Cyclone). A secondary goal is that the mechanism used for network isolation work for both ROS 2 and LCM if possible, so that we do not need multiple isolation mechanisms. Our experience is that the isolation mechanisms provided by the DDS spec are just always going to be brittle when used in our tests and CI, so the answer is to use a more general method. Network namespaces achieves that, with the benefit of isolating LCM as well. |
That makes sense as far as the linux API side (which systems calls to make, etc.). What remains in doubt is how to wrap our tests using an isolated namespace. The simple and obvious answer is to create a test-wrapper program that sets up the namespace and then exec's the actual test program. Any downstream project's skylark rules could then adjust its test macros to insert the wrapper around the actual program. It's not clear to me why anything related to shimming infrastructure is in scope here. A standalone wrapper binary should be sufficient on its own, without all of the extra complexity of binary shimming. |
Also deferring the shimming code to live in Anzu only would be less expensive to develop and test, since the set of supported platforms and build rule flavors is much narrower there. Placing the namespacing code in open-source makes a lot of sense. Other people will reuse it and can contribute back. Since the Bazel hijinks to apply namespacing within the test infra are going to be a lot more project-specific, I don't see the value in developing those inside Drake ROS. We should develop that in Anzu. If we find in due course that the code is more general than Anzu, we can open source it then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The simple and obvious answer is to create a test-wrapper program that sets up the namespace and then exec's the actual test program. Any downstream project's skylark rules could then adjust its test macros to insert the wrapper around the actual program. It's not clear to me why anything related to shimming infrastructure is in scope here. A standalone wrapper binary should be sufficient on its own, without all of the extra complexity of binary shimming.
This PR does take the approach of setting up the namespace and then exec()
ing the actual test program. It just uses the existing shim to do it. I think that's less complex than adding another shim layer.
Also deferring the shimming code to live in Anzu only would be less expensive to develop and test, since the set of supported platforms and build rule flavors is much narrower there.
Placing the namespacing code in open-source makes a lot of sense. Other people will reuse it and can contribute back.
Since the Bazel hijinks to apply namespacing within the test infra are going to be a lot more project-specific, I don't see the value in developing those inside Drake ROS. We should develop that in Anzu. If we find in due course that the code is more general than Anzu, we can open source it then.
This repo already needs isolation for its tests to run in parallel, so there's value in developing it here. Right now they use rmw_isolation
, but I think we could consolidate to one isolation approach after landing this PR.
Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @sloretz)
a discussion (no related file):
I'm unable to figure out why testing ros2_example_installed
bazel test //... @ros2//...
is failing in CI. Clean builds pass locally on my machine.
https://github.com/RobotLocomotion/drake-ros/actions/runs/5138423998/jobs/9248790435
==================== Test output for //ros2_example_apps:custom_message_list_test:
Traceback (most recent call last):
File "/github/home/.cache/bazel/_bazel_root/9ffa39c466eb506336c2960019ab2bc0/sandbox/processwrapper-sandbox/155/execroot/ros2_example_bazel_installed/bazel-out/k8-opt/bin/ros2_example_apps/custom_message_list_test.runfiles/ros2_example_bazel_installed/ros2_example_apps/_custom_message_list_test_shim.py", line 4, in <module>
import network_isolation.network_isolation_py as isolation
ImportError: libexternal_Sbazel_Uros2_Urules_Snetwork_Uisolation_Slibnetwork_Uisolation_Ucc.so: cannot open shared object file: No such file or directory
Could there be something in the bazel cache in CI that's causing an issue?
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
Signed-off-by: Shane Loretz <[email protected]>
7a15ce8
to
2a22e42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions
a discussion (no related file):
Separate thread for CI failure in bazelized_drake_ros.yml
- This seems to be failing to download libstdc++6-10-dbg
- which is one of Drake's dependencies. Maybe we need an apt-get update
before trying to run Drake's install_prereqs.sh
script?
Trying this in 8f015db
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions
a discussion (no related file):
Previously, sloretz (Shane Loretz) wrote…
Separate thread for CI failure in
bazelized_drake_ros.yml
- This seems to be failing to downloadlibstdc++6-10-dbg
- which is one of Drake's dependencies. Maybe we need anapt-get update
before trying to run Drake'sinstall_prereqs.sh
script?Trying this in 8f015db
Did not fix it - same issue different package. It's odd though that the URLs have "azure" in them. Maybe this is an issue with Github actions workers?
Err:32 http://azure.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libstdc++6-10-dbg amd64 10.4.0-4ubuntu1~22.04
Error reading from server. Remote end closed connection [IP: 52.147.219.192 80]
Get:47 http://azure.archive.ubuntu.com/ubuntu jammy/main amd64 valgrind-dbg amd64 1:3.18.1-1ubuntu2 [63.4 MB]
Err:47 http://azure.archive.ubuntu.com/ubuntu jammy/main amd64 valgrind-dbg amd64 1:3.18.1-1ubuntu2
Error reading from server. Remote end closed connection [IP: 52.147.219.192 80]
Fetched 121 MB in 13s (9,114 kB/s)
E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/universe/g/gcc-10/libstdc%2b%2b6-10-dbg_10.4.0-4ubuntu1%7e22.04_amd64.deb Error reading from server. Remote end closed connection [IP: 52.147.219.192 80]
E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/main/v/valgrind/valgrind-dbg_3.18.1-1ubuntu2_amd64.deb Error reading from server. Remote end closed connection [IP: 52.147.219.192 80]
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
The Drake source distribution prerequisite setup script has experienced an error on line 20 while running the command source "${BASH_SOURCE%/*}/source_distribution/install_prereqs.sh" "${source_distribution_args[@]}"
yes: standard output: Broken pipe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 15 files reviewed, 2 unresolved discussions
a discussion (no related file):
Previously, sloretz (Shane Loretz) wrote…
Did not fix it - same issue different package. It's odd though that the URLs have "azure" in them. Maybe this is an issue with Github actions workers?
Err:32 http://azure.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 libstdc++6-10-dbg amd64 10.4.0-4ubuntu1~22.04 Error reading from server. Remote end closed connection [IP: 52.147.219.192 80] Get:47 http://azure.archive.ubuntu.com/ubuntu jammy/main amd64 valgrind-dbg amd64 1:3.18.1-1ubuntu2 [63.4 MB] Err:47 http://azure.archive.ubuntu.com/ubuntu jammy/main amd64 valgrind-dbg amd64 1:3.18.1-1ubuntu2 Error reading from server. Remote end closed connection [IP: 52.147.219.192 80] Fetched 121 MB in 13s (9,114 kB/s) E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/universe/g/gcc-10/libstdc%2b%2b6-10-dbg_10.4.0-4ubuntu1%7e22.04_amd64.deb Error reading from server. Remote end closed connection [IP: 52.147.219.192 80] E: Failed to fetch http://azure.archive.ubuntu.com/ubuntu/pool/main/v/valgrind/valgrind-dbg_3.18.1-1ubuntu2_amd64.deb Error reading from server. Remote end closed connection [IP: 52.147.219.192 80] E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing? The Drake source distribution prerequisite setup script has experienced an error on line 20 while running the command source "${BASH_SOURCE%/*}/source_distribution/install_prereqs.sh" "${source_distribution_args[@]}" yes: standard output: Broken pipe
Possibly actions/runner-images#675 has some workarounds that could work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Posted blocking comment w.r.t. what Jeremy mentioned about when to invoke isolation.
Reviewable status: 0 of 17 files reviewed, 4 unresolved discussions (waiting on @sloretz)
bazel_ros2_rules/ros2/tools/dload_cc.bzl
line 24 at r10 (raw file):
_REEXEC_TEMPLATE = """\ #include "ros2/tools/dload_shim.h" CC_ISOLATE_IMPORT
For both the C++ and Python version, it's unclear to me if this will force network execution in only a test configuration.
i.e., it's unclear to me if our existing ROS binaries will break when not run as a test (e.g. a cc_binary
or py_binary
, or trying to debug a test).
While I think it's great to default to network isolation, I think it may be painful to try and debug in-situ.
Somewhat in line with Jeremy's suggestion, we should consider leaving the calling of network_isolation.create_linux_namespaces()
to the actual test processes, explicitly, and do not try to implicitly enforce test isolation in the wrapper, where some developers (i.e. myself) may want to run a test outside of isolation to inspect things like visualization, topic introspection, etc.
@calderpg-tri @IanTheEngineer Please weigh in if I'm off the mark / y'all discussed something else.
bazel_ros2_rules/network_isolation/network_isolation_py.cc
line 1 at r10 (raw file):
#define PY_SSIZE_T_CLEAN
nit Why define this using CPyhon API and not use pybind11
?
My recommendation is to use pybind11
for consistency and maintenance.
Passing comment since I still get email notifications. Elsewhere we intentionally tried to use Drake's version of Pybind11. I don't remember why, but since |
Signed-off-by: Aditya Pande <[email protected]>
Signed-off-by: Aditya Pande <[email protected]>
Signed-off-by: Aditya Pande <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 20 files reviewed, 4 unresolved discussions (waiting on @calderpg-tri, @EricCousineau-TRI, @IanTheEngineer, and @sloretz)
bazel_ros2_rules/ros2/tools/dload_cc.bzl
line 24 at r10 (raw file):
Previously, EricCousineau-TRI (Eric Cousineau) wrote…
For both the C++ and Python version, it's unclear to me if this will force network execution in only a test configuration.
i.e., it's unclear to me if our existing ROS binaries will break when not run as a test (e.g. acc_binary
orpy_binary
, or trying to debug a test).While I think it's great to default to network isolation, I think it may be painful to try and debug in-situ.
Somewhat in line with Jeremy's suggestion, we should consider leaving the calling ofnetwork_isolation.create_linux_namespaces()
to the actual test processes, explicitly, and do not try to implicitly enforce test isolation in the wrapper, where some developers (i.e. myself) may want to run a test outside of isolation to inspect things like visualization, topic introspection, etc.@calderpg-tri @IanTheEngineer Please weigh in if I'm off the mark / y'all discussed something else.
Removed isolation by default (default flag is False for network_isolation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere we intentionally tried to use Drake's version of Pybind11 [...]
Ah, gotcha. I think it was to try and keep dependencies small and focused. However, for the purpose of this package, we may want to just depend on Drake to pull in pybind11
, or somehow synchronize with Drake's version.
Reviewable status: 0 of 20 files reviewed, 4 unresolved discussions (waiting on @adityapande-1995, @calderpg-tri, @IanTheEngineer, and @sloretz)
bazel_ros2_rules/ros2/tools/dload_cc.bzl
line 24 at r10 (raw file):
Previously, adityapande-1995 (Aditya Pande) wrote…
Removed isolation by default (default flag is False for network_isolation).
I'm not sure if adding a flag is the right approach.
Instead, it should just be a function and a downstream project should explicitly invoke it when so desired.
That being said, I will defer to Ian and Calder how they ultimately want to integrate it into Anzu.
Ian / Calder, can I ask y'all to drive this in the direction y'all want?
…th_linux_namespaces
Signed-off-by: Aditya Pande <[email protected]>
Signed-off-by: Aditya Pande <[email protected]>
Signed-off-by: Aditya Pande <[email protected]>
Signed-off-by: Aditya Pande <[email protected]>
For the record, see this gist for the solution I ended up committing into TRI's Anzu repository. I added that function into a helper library (bound in Python), and then make sure that all of our project-specific test launchers called that function during their |
Signed-off-by: Alejandro Hernandez Cordero <[email protected]>
Hi @jwnimmer-tri and @adityapande-1995 I added the suggestions here sloretz#1 |
…x_namespaces Added suggestions
…th_linux_namespaces
Resolved merged conflicts, open for review / approval ! |
Signed-off-by: Alejandro Hernández Cordero <[email protected]>
Signed-off-by: Alejandro Hernández Cordero <[email protected]>
@adityapande-1995, @jwnimmer-tri and @EricCousineau-TRI I included some suggestions here to use pybind11 sloretz#2 |
Suggestions to network isolation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though I'm one the authors above, this looks good to me. I'd suggest adding a new test for the LCM integration as well, but that can be done in a separate PR/ticket.
FYI This was previously discussed and rejected: I suppose the code that calls CPython API directly needed a chesteron's fence comment explaining why it can't use pybind11.
Eric is unavailable today. Tomorrow would be the earliest he could look, although I suspect he's pretty busy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I included some suggestions here to use pybind11 ...
FYI This was previously discussed and rejected:
If it is possible to have assured synchronization between Drake pybind11 and bazel_ros2_rules
, I am all for it.
I would much rather review pybind11 code instead of raw CPython code just because of dependency carving.
Reviewable status: 0 of 21 files reviewed, 5 unresolved discussions (waiting on @adityapande-1995, @calderpg-tri, @IanTheEngineer, and @sloretz)
a discussion (no related file):
I don't like the notion of automatically injecting isolation at the Bazel wrapper level. We should keep Bazel wrapping as simple / dumb as possible.
Instead, isolation should be very explicit. That is what we do in Anzu, it works, and it is clear / easy to observe.
Also, ideally it is exposed in both a Bazel way and a C++ fashion, so that way code can easily depend on both paths easily, and without additional gymnastics.
Does this make sense?
This is a work in progress towards using linux namespaces to do test isolation with ROS 2 in bazel
So far I've figured out:
unshare()
orclone()
orclone3()
?unshare()
seems easiet to work withclone()
because in 2015 there was a race condition inunshare()
- assuming that's no longer relevant to Drake-ROSioctl
to set the flagslinux-sandboxing
strategy, despite documentation warning thatlinux-sandboxing
could not be used indocker
because network namespaces could not be nestedStill need to finish the proof of concept - fork/exec a talker/listener and see if they communicate.
Then comes the starlark rules. I haven't thought about what they should look like yet.
@calderpg-tri @IanTheEngineer FYI
This change is