[ros-users] How can I be robust to a crashed roscore?

Tue Jun 28 16:00:50 UTC 2011

On Fri, Jun 24, 2011 at 2:06 PM, Vijay Pradeep <vpradeep07 at gmail.com> wrote:
> Hi All,
>
> This didn't feel crisp enough for answers.ros.org, but if people feel that
> it is the right forum for email, I can transfer my question to that site.
>
> I'm thinking of using ROS for my non-ROS robot's basestation, and I'm trying
> to figure out if I can satisfy my robustness requirements with a ROS based
> system.
>
> The basestation will most likely consist of at least 3 computers: A primary
> communications & control (C&C) computer, a backup C&C computer and at least
> one non-critical visualization computer.  The idea is for the visualization
> machines to receive ROS messages from the primary and backup C&C computers,
> and if the primary C&C computer crashes, the backup C&C computer can take
> over all communications and control.  Once the primary C&C machine
> reboots/recovers from the crash, it can then retake control of the robot.
>
> Now comes the hard question:  Where should I run my roscore, and what
> happens if it crashes?  Assuming that the roscore is running on the primary
> C&C machine and this machine crashes, I believe everything else should still
> run just fine (assuming we're not using the parameter server or negotiating
> service connections at runtime).  And, is there any way that I can restart
> my roscore and C&C nodes on the primary machine after the crash?

Right now, no, but it *should* be possible to write an external Python
script that backs up a master and can replay its state.  Basically, in
the Master API the getSystemState() call should give you most of the
info, and a "rosparam dump /" represents the full Parameter Server.
You'll lose param subscriber info, but that's not a biggie.
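
A minimal sketch of the backup half of such a script, for illustration
only. It uses the Master API calls getSystemState(), getTopicTypes(),
lookupNode() and lookupService(), plus the Parameter Server call
getParam() on "/". The caller id "/master_backup", the snapshot layout,
the "*" fallback type and the function names are all assumptions made
for this example. It also assumes every parameter value is
JSON-serializable.

```python
import json
import xmlrpc.client

CALLER_ID = "/master_backup"  # hypothetical caller id for this script


def _call(result):
    """Unpack a (code, statusMessage, value) triple; code 1 means success."""
    code, message, value = result
    if code != 1:
        raise RuntimeError(message)
    return value


def snapshot_master(master):
    """Return the master's registrations and parameters as plain dicts/lists."""
    publishers, subscribers, services = _call(master.getSystemState(CALLER_ID))
    topic_types = dict(_call(master.getTopicTypes(CALLER_ID)))
    node_apis = {}

    def node_api(node):
        # Cache lookupNode results so each node is queried only once.
        if node not in node_apis:
            node_apis[node] = _call(master.lookupNode(CALLER_ID, node))
        return node_apis[node]

    def topic_entries(state):
        # "*" is an assumed fallback for topics with no recorded type.
        return [
            {"topic": topic, "type": topic_types.get(topic, "*"),
             "node": node, "node_api": node_api(node)}
            for topic, nodes in state
            for node in nodes
        ]

    return {
        "publishers": topic_entries(publishers),
        "subscribers": topic_entries(subscribers),
        "services": [
            {"service": service, "node": node, "node_api": node_api(node),
             "service_api": _call(master.lookupService(CALLER_ID, service))}
            for service, nodes in services
            for node in nodes
        ],
        "params": _call(master.getParam(CALLER_ID, "/")),
    }


def save_snapshot(master_uri, path):
    """Write a snapshot of the master at master_uri to path as JSON."""
    master = xmlrpc.client.ServerProxy(master_uri)
    with open(path, "w") as f:
        json.dump(snapshot_master(master), f, indent=2)
```

Calling save_snapshot("http://localhost:11311", "master_state.json")
periodically would give you something to replay from. As noted above,
param subscriber info is not captured.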

More long-term, I've been looking into something like Redis as a
backing key/value store that handles replication.

As for restarting the C&C nodes, I think if you put them in a roslaunch
file targeted at the C&C machine, then you can just relaunch that file.
Restarting the roscore is trickier as you're trying to do
auto-failover/return.  The nodes are all going to be pointing at the
main C&C machine and aren't going to want to automatically start
talking to the other machine.

An alternative, experimental architecture is to play around with the
multimaster syncing stuff we've been doing.  Each machine gets its own
master; you can then do either full or partial syncs of each master to
each other.  Nodes on each machine only point at their own master.
The multimaster stuff is in the rosproxy package.  Basically, it sets
up a node that subscribes to a set of topics on another master.
Whenever that node hears about changes in registration info, it replays
them into another master.
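
To illustrate the "hear about changes, then replay them" idea, here is
a rough sketch that compares two getSystemState() results and reports
which registrations appeared or disappeared. This is not how rosproxy
is implemented. It is a polling-style stand-in, and the function names
and tuple layout are made up for the example.

```python
def flatten(state):
    """Turn [publishers, subscribers, services] into a set of tuples."""
    kinds = ("publisher", "subscriber", "service")
    return {
        (kind, name, node)
        for kind, entries in zip(kinds, state)
        for name, nodes in entries
        for node in nodes
    }


def diff_states(old_state, new_state):
    """Return (added, removed) registrations between two system states."""
    old, new = flatten(old_state), flatten(new_state)
    return sorted(new - old), sorted(old - new)
```

A sync process could poll one master, diff against the previous poll,
and register or unregister the differences on the other master.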

> Maybe this involves patching the ROS Master to store the state of its
> connections to disk.  If so, any suggestions as to where to start looking in
> the ROS Master code would be appreciated.

The master code is pretty simple; there aren't many files to it.  The
masterdata file represents all of the data.  In the end, you just need
to record the bare registrations and replay/load them back.
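
As a rough sketch of the replay side, assuming the registrations were
recorded as a dict with "publishers", "subscribers" and "services"
lists (a hypothetical layout, as are the function names), you could
push them back through the Master API calls registerPublisher(),
registerSubscriber() and registerService(). Each call is made with the
original node's name as the caller id, since that is how the master
identifies who is registering. Whether already-running nodes behave
correctly after such a replay is untested here.

```python
import xmlrpc.client


def replay_registrations(master, records):
    """Re-register recorded publishers, subscribers and services.

    Returns the number of registrations the master accepted (code 1).
    """
    accepted = 0
    for rec in records.get("publishers", []):
        code, _, _ = master.registerPublisher(
            rec["node"], rec["topic"], rec["type"], rec["node_api"])
        accepted += (code == 1)
    for rec in records.get("subscribers", []):
        code, _, _ = master.registerSubscriber(
            rec["node"], rec["topic"], rec["type"], rec["node_api"])
        accepted += (code == 1)
    for rec in records.get("services", []):
        code, _, _ = master.registerService(
            rec["node"], rec["service"], rec["service_api"], rec["node_api"])
        accepted += (code == 1)
    return accepted


def replay_to(master_uri, records):
    """Replay records into the master listening at master_uri."""
    return replay_registrations(xmlrpc.client.ServerProxy(master_uri), records)
```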

 - Ken

> Thanks,
> Vijay Pradeep