Assess parallelism opportunities in Nav2 #2042

Closed
SteveMacenski opened this issue Oct 14, 2020 · 11 comments

@SteveMacenski (Member) commented Oct 14, 2020

Using GPU, OpenMP, TBB, etc.:

  • DWB has an N-critics-over-M-trajectories structure that could be parallelized at 2 levels
  • AMCL / nav2 localization framework is built on a particle filter, whose particles are independently updated
  • Costmap layers can be updated separately and then combined (and sensor processing layers can process multiple measurement readings at once with raycasting/marking)
  • Anything in planning? Unclear, since the planners are search-based; a sampling-based planner might make better use of it
  • BT navigator recoveries: check planning/control validity during recovery execution so the recovery can be preempted
  • Nav2 dynamic tracking / detection / layer processing
  • Any new algorithms
  • TF transformations of things like pointclouds into a new frame (e.g. laser projections too)
  • Voxel grid: To support the above, we have a voxel grid library which could probably gain some internal optimizations from parallel computing to march through the grid
  • SmacPlanner* (soon): We collision check in increments along a motion primitive for the state lattice planner to make sure expansions are valid. E.g. we have long motion primitives that cannot be summarized by collision checking only the start and end; a few poses in between need checking, and collision checking isn't cheap.
  • RPP/Recoveries: In the RPP controller and the backup/spin recoveries, we forward-project in dt increments into the future given a velocity command and collision check each pose. This could be parallelized to collision check them all at once (see the sketch after this list).
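As a concrete illustration of the collision-checking items above, here is a minimal OpenMP sketch (hypothetical, not a Nav2 API): it batch-checks a set of forward-projected or motion-primitive poses, where Pose2D and the in_collision callable are placeholders for whatever footprint/costmap check the controller or recovery actually uses.

```cpp
// Hypothetical sketch only, not Nav2 code: batch collision checking of
// forward-projected poses or motion-primitive increments with OpenMP.
#include <cstddef>
#include <functional>
#include <vector>

struct Pose2D { double x, y, theta; };  // placeholder pose type

// Returns true if any pose in the batch is in collision. The in_collision
// callable stands in for whatever footprint/costmap check is used; it must
// be safe to call from multiple threads.
bool anyPoseInCollision(
  const std::vector<Pose2D> & poses,
  const std::function<bool(const Pose2D &)> & in_collision)
{
  bool collision = false;

  // Each check is independent, so the whole batch can be evaluated at once
  // and combined with a logical-OR reduction.
  #pragma omp parallel for reduction(|| : collision)
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(poses.size()); ++i) {
    collision = collision || in_collision(poses[i]);
  }
  return collision;
}
```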
@simutisernestas (Contributor)

I've taken a look at the costmap layer updates out of curiosity. I'm not quite sure if this is a reasonable assessment, so here is a short description of what I did exactly. As I understand it, the heaviest work in the layers is done in ObstacleLayer::updateBounds, and my idea was to exploit OpenMP's "parallel for" directive.

So I've modified the first loop in the LayeredCostmap::updateMap function to support parallel updates. I've also stacked more layers onto the costmap to make the difference clearer. Changes are available here: simutisernestas@f0630c6.
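For context, a rough sketch of this kind of change (not the actual commit linked above) could look like the following. The Layer type here is a minimal stand-in for the nav2_costmap_2d layer interface, and the OpenMP min/max reductions keep each thread on private copies of the bounds:

```cpp
// Rough sketch of parallelizing the plugin loop, not the actual commit.
#include <memory>
#include <vector>

struct Bounds { double min_x, min_y, max_x, max_y; };

struct Layer  // minimal stand-in for nav2_costmap_2d::Layer
{
  virtual ~Layer() = default;
  virtual void updateBounds(
    double robot_x, double robot_y, double robot_yaw,
    double * min_x, double * min_y, double * max_x, double * max_y) = 0;
};

// Run every layer's updateBounds() in parallel. OpenMP min/max reductions
// give each thread private copies of the bounds and merge them afterwards,
// so the shared min/max values can't be corrupted by concurrent writes.
// A layer whose bounds depend on earlier layers' bounds (e.g. inflation)
// would still need to run after this loop.
Bounds parallelUpdateBounds(
  const std::vector<std::shared_ptr<Layer>> & plugins,
  double robot_x, double robot_y, double robot_yaw)
{
  double minx = 1e30, miny = 1e30, maxx = -1e30, maxy = -1e30;

  #pragma omp parallel for reduction(min : minx, miny) reduction(max : maxx, maxy)
  for (int i = 0; i < static_cast<int>(plugins.size()); ++i) {
    // Most layers only min/max-touch the values passed in, so private
    // per-thread copies accumulate each layer's region correctly.
    plugins[i]->updateBounds(robot_x, robot_y, robot_yaw, &minx, &miny, &maxx, &maxy);
  }
  return Bounds{minx, miny, maxx, maxy};
}
```

This assumes each layer's updateBounds() is safe to run concurrently with the others, which would need to be verified per layer.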

I've observed the following average map update times (TB3 simulation, Intel® Core™ i7-3770 CPU @ 3.40GHz × 8):

  • without "parallel for" ranges from 19ms - 24ms
  • with "parallel for" ranges from 12ms - 18ms

At the 5 Hz update rate, a 6 ms difference adds up to a 30 ms gain per second. I suspect that processing real-world data (for example a longer lidar range or a bigger pointcloud) would increase layer update time and make the effect seen here much more visible.

Would be cool to hear a second opinion. :)

@SteveMacenski (Member Author) commented Oct 26, 2020

I'm not sure that updateBounds is the best place for this, because all of the layers updating will be writing to those max/min i/j pointers. You'd need to make sure you handle those shared resources carefully so they don't get corrupted. I'd think updateCosts would be a good target too (but similarly, you'd need to be careful with the master_grid). I think OpenMP has some options like SHARED or something similar to deal with these cases. I'd try both updates.

I was also thinking of parallelizing the marking/clearing operations within the obstacle/voxel layers, since those are independent measurements and there are many of them (a QVGA sensor means thousands of iterations).
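A hedged sketch of that marking idea (placeholder types, not the actual ObstacleLayer code):

```cpp
// Hypothetical sketch, not ObstacleLayer code: marking each measurement
// point is independent, so the loop over points can be split across
// threads. Two threads may write the same lethal value to the same cell;
// if that overlap matters, per-thread buffers or atomics would be safer.
#include <cstdint>
#include <vector>

struct Point2D { double x, y; };

struct Grid  // placeholder for the layer's costmap members
{
  double origin_x, origin_y, resolution;
  unsigned int size_x, size_y;
  std::vector<uint8_t> cells;
};

constexpr uint8_t kLethal = 254;  // nav2_costmap_2d::LETHAL_OBSTACLE

void markPoints(const std::vector<Point2D> & points, Grid & grid)
{
  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(points.size()); ++i) {
    // Convert the world coordinate to a cell index and mark it lethal.
    const int mx = static_cast<int>((points[i].x - grid.origin_x) / grid.resolution);
    const int my = static_cast<int>((points[i].y - grid.origin_y) / grid.resolution);
    if (mx >= 0 && my >= 0 &&
      mx < static_cast<int>(grid.size_x) && my < static_cast<int>(grid.size_y))
    {
      grid.cells[my * grid.size_x + mx] = kLethal;
    }
  }
}
```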

30 ms isn't anything to sniff at. A lot of significant performance gains can be had by nickel-and-diming the system: 30 ms here, 30 ms there, and all of a sudden you're 2 or 3x faster. 6 ms on a 24 ms process is still a 25% improvement; that's a lot for such a little amount of work!

@abylikhsanov

Regarding our previous chat on this issue:
#2190

You mentioned that you would start from the "outer loop" first; can you please elaborate on what you meant?

@SteveMacenski (Member Author)

You mentioned an outer and an inner loop to try to parallelize in DWB; just start with the outer one only, then benchmark and add a PR. I think you'll find that one level will do most of the heavy lifting you require (e.g. DWB has the N-critics-over-M-trajectories structure that could be parallelized at 2 levels). I forget which is the outermost for loop in DWB, but I think it's the trajectory generator (e.g. M) generating the M trajectories of vx * vy samples; parallelize that first.
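To make "outer loop first" concrete, here is a hedged sketch, not actual DWB code; generate() and score() are placeholders for the trajectory generator and the critic scoring, and are assumed to be thread-safe:

```cpp
// Hedged sketch: parallelize only the outer loop over the M velocity
// samples; the inner loop over the N critics stays serial inside score().
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Twist2D { double vx, vy, vtheta; };
struct Trajectory { std::vector<Twist2D> velocities; };  // placeholder

Twist2D pickBestCommand(
  const std::vector<Twist2D> & samples,                         // the M vx * vy samples
  const std::function<Trajectory(const Twist2D &)> & generate,  // forward simulation
  const std::function<double(const Trajectory &)> & score)      // sum of the N critics
{
  if (samples.empty()) {return Twist2D{0.0, 0.0, 0.0};}

  std::vector<double> scores(samples.size(),
    std::numeric_limits<double>::infinity());

  // Each candidate trajectory is generated and scored independently,
  // so the outer loop is the natural first level to parallelize.
  #pragma omp parallel for
  for (int i = 0; i < static_cast<int>(samples.size()); ++i) {
    scores[i] = score(generate(samples[i]));
  }

  // Picking the lowest-cost sample is cheap, so it stays serial.
  std::size_t best = 0;
  for (std::size_t i = 1; i < scores.size(); ++i) {
    if (scores[i] < scores[best]) {best = i;}
  }
  return samples[best];
}
```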

@SteveMacenski (Member Author)

@simutisernestas is there a reason we couldn't merge that OpenMP solution into costmap_2d?

@simutisernestas (Contributor)

If you're up for it, I would be happy to make a PR.

@SteveMacenski (Member Author)

Sure, it's a starting point! It only does updateBounds (not updateCosts), but you've shown some compelling speed-ups on just that!

@SteveMacenski (Member Author)

@abylikhsanov any progress to share?

@Parv-Maheshwari

  • Anything in planning? Unclear, since the planners are search-based; a sampling-based planner might make better use of it

Hi @SteveMacenski. I have worked on a sampling-based local planner in the Frenet frame for ROS 1, in which I used OpenMP and saw a five-fold increase in frequency while using just 8 threads.

So I wanted to know whether it would be possible to include our local planner as a controller plugin for Nav2. We would obviously add or change functionality according to Nav2's requirements.

I would also love to hear your thoughts on this and what we should or can do.

P.S. My team and I are open to porting our planner to ROS 2.

@SteveMacenski (Member Author)

Hi @Parv-Maheshwari, thanks for reaching out! I think that might be a better discussion to have in ticket #1710 instead. Can you continue the discussion there, explaining specifically what technique you've implemented that you'd be interested in contributing (and potentially a link, if it's already open sourced)?

@SteveMacenski (Member Author)

Closing for now. I've recently done some experiments on an Nvidia Jetson and was surprised how little CPU Nav2 was using with the full system running while processing 2 depth sensors. It looks like Nav2 is good enough as-is for embedded use; we don't need to speed things up a whole lot more to be perfectly suitable. DWB is the big area that can use the most help and is the thing causing problems, and we have another ticket open to handle that: #2045.
