Looking at the existing ros2 support for the za6 (from this thread and the repo it references), I see that it is humble based. Are there plans to port it to jazzy? The repo readme says it has everything needed inside, so I will probably try the port myself if not, but I don’t want to start down that road if official support is coming next week or something.
For anybody watching, Claude and I have made a few forks of the jammy/humble repos with the goal of running the za6 on noble/jazzy.
Currently, the demo.launch.py and bringup.launch launch files start with no errors in simulation (not in a docker container), and I can plan and execute moves in rviz. (!!!)
But bringup.launch does not terminate gracefully, so I want to fix that before doing anything with an expensive robot.
I have made more progress in tracking down why the launch does not shut down properly.
I think the problem with the exit is that the launch process itself becomes a dummy hal component during launching and then gets prematurely terminated by the hal system on exit.
The launch process becomes a hal component in hal_ros_control, in the hal_hw_interface/launch python code.
- hal_hardware.launch.py creates a HalRTNode and HalUserNodes.
- To check whether the hal components are ready, these nodes access the hal system and test whether each component is registered.
- Performing this check (`self.hal_name not in hal.components`) implicitly registers the launch process itself as a dummy component.
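To illustrate the kind of side effect I mean, here is a toy stand-in (not the real hal module) where a plain membership test registers the caller:

```python
class LazyComponents:
    """Toy stand-in for hal.components: any access lazily 'connects'
    the calling process, registering it as a dummy component."""

    def __init__(self):
        self.registered = set()

    def __contains__(self, name):
        # Hypothetical analogue of the hal behavior described above:
        # merely asking the question registers the asking process.
        self.registered.add("launch-process-dummy")
        return name in self.registered - {"launch-process-dummy"}

components = LazyComponents()
print("hal_mgr" in components)   # False: hal_mgr is not registered...
print(components.registered)     # ...but the caller's dummy now is
```

The membership check answers the question correctly, but the caller has quietly become a registered component as a side effect, which is exactly what makes the launch process eligible for hal's shutdown sigterm later.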
Then when you press ctrl+c in the terminal,
- the launch file sends sigint to all processes it has spawned to signal to them to shut down.
- This stops the hal system, and I think the hal system finishes normally.
- However, when the hal system finishes shutting down the realtime layer, it sends sigterm to all hal components.
- Since the launch file is one of the components, it also gets a sigterm.
- It then interprets the sigterm as a signal from the user to stop immediately and passes this sigterm on to all subprocesses.
It’s essentially a race condition. In my case, other nodes have not successfully finished shutting down from the initial sigint and are abruptly killed by the sigterm. This leads to the dds shared memory not being released and to zombie processes.
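The race is easy to reproduce in miniature outside of ros/hal. In this hedged sketch, the child's slow sigint handler stands in for a node releasing DDS shared memory, and the late sigterm cuts the cleanup off exactly as described above:

```python
import signal
import subprocess
import sys
import textwrap
import time

# Child: installs a slow sigint handler (standing in for a ros node
# gracefully releasing DDS resources), then idles.
child_src = textwrap.dedent("""
    import signal, sys, time

    def on_sigint(signum, frame):
        time.sleep(5)   # slow graceful cleanup
        sys.exit(0)     # never reached in the race below

    signal.signal(signal.SIGINT, on_sigint)
    print("ready", flush=True)
    while True:
        time.sleep(0.1)
""")

child = subprocess.Popen([sys.executable, "-c", child_src],
                         stdout=subprocess.PIPE)
child.stdout.readline()            # wait until the handler is installed

child.send_signal(signal.SIGINT)   # launch's normal shutdown request
time.sleep(0.5)                    # graceful cleanup is underway...
child.send_signal(signal.SIGTERM)  # ...then hal's sigterm broadcast lands

child.wait()
print(child.returncode)            # -15: killed mid-cleanup by sigterm
```

The child exits with status -15 (killed by sigterm) instead of 0 (graceful exit), leaving whatever its cleanup was going to release, shared memory in the real case, orphaned.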
I imagine that this bug is also present in the humble branch.
I am willing to work on this since I have already made custom forks, but I need to know which direction to go. I don’t know hal very well and I am still new to ros2. Given my inexperience, and now that I know what is happening and that it does not seem too dangerous, I will probably just clean up the shared memory and kill the zombie nodes for the time being. But here are the potential fixes I thought of:
- change the hal ros nodes/components to instead signal readiness outside hal, perhaps through ros mechanisms or through OS signals. Then the launch file never becomes a hal component and therefore never gets sigtermed. I also wonder whether the built-in ros concept of lifecycle nodes, together with the launch lifecycle manager, overlaps enough to replace the launch aspects of hal_ros_control. If hal_mgr and hw_device_mgr were lifecycle nodes, would that be sufficient? Either of these options is a major redesign, though.
- do the is_ready check in a subprocess instead of in the launch process. IIRC, this is closer to what the ROS 1 version did. Then that subprocess becomes the hal component, and we don’t care if it gets a sigterm because it does not affect the other ros nodes. But would that mean the new subprocess spawns the hal nodes? Would those hal nodes still get the sigint from the launch if they are subprocesses of a subprocess of the launch?
- edit the launch file somehow to make sure that all normal ROS nodes have finished shutting down before the realtime node is stopped, so nothing gets killed mid-cleanup. If launch files can start processes in a defined order, maybe they can stop them in a defined order as well? But I am not that familiar with launch files, and you’d have to do this for every other launch file that uses hal stuff anyway.
- add the ability to de-register/disconnect as a component in machinekit-hal. Then have the launch de-register itself once launching is done, since it doesn’t look like it needs the hal connection after that. This modifies upstream machinekit-hal, but I am already using zultron’s fork anyway. I am not sure this makes sense in hal, though.
- change how machinekit sigterms its components. Give components a way to opt out somehow, and have the launch file opt out. Seems hacky.
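For the subprocess idea, here is a minimal sketch of just the isolation mechanism (generic Python, no hal; the real check, something like `name not in hal.components`, would run inside the child, imports and all):

```python
import multiprocessing as mp

def run_isolated(fn, *args):
    # Run fn(*args) in a short-lived forked child and return its result.
    # Any side effect of fn, such as hal implicitly registering the
    # calling process in hal.components, stays in the child, which exits
    # right away. The launch process itself never becomes a hal
    # component, so it never receives hal's shutdown sigterm broadcast.
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=lambda: q.put(fn(*args)))
    p.start()
    result = q.get()
    p.join()
    return result

# Stand-in for the real readiness check; the set here is hypothetical:
ready = run_isolated(lambda name: name in {"hal_mgr", "io"}, "hal_mgr")
print(ready)  # True
```

Because the child exits as soon as it has answered, the question of who spawns and signals the hal nodes stays with the launch process, which sidesteps the subprocess-of-a-subprocess signal question for this narrow version of the fix.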
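And for the opt-out idea, the smallest possible version would be for the launch process itself to ignore sigterm once launching is done, rather than changing machinekit at all (a sketch, with the obvious downside noted):

```python
import signal

# Once launching is complete and the implicit hal connection is no
# longer needed, ignore sigterm so hal's shutdown broadcast cannot cut
# short the graceful sigint cleanup. Downside: a deliberate
# `kill -TERM` from the user is now ignored too, hence "hacky".
signal.signal(signal.SIGTERM, signal.SIG_IGN)

print(signal.getsignal(signal.SIGTERM) is signal.SIG_IGN)  # True
```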
@John_Morris, since you seem to be the expert in the ROS2 za6 stuff, it’d be great to get your opinion when you get a minute to look at this!
My za6 is now moving on jazzy/noble!!
All the code is pushed to my fork of tormach_za_ros2_drivers.
I didn’t do the demo stuff, but it’s in its own commit if you want to put that back.
Also, I had to build machinekit-hal and linuxcnc-ethercat from source, but all of that is in a separate commit for whenever upstream noble releases of those become available. Hopefully that will be an easy change since it’s all grouped together.