[ros-users] [Discourse.ros.org] [General] Deterministic replay and debugging

Wed Feb 15 01:25:21 UTC 2017

This thread is very interesting to me, as it is a recurrent problem developers of distributed systems always struggle with.

In a software system with multiple processes and async communication, determinism is already out the window, since all the combinations of messages, when received in different order, explode very quickly ( unless you pay attention to an huge amount of tiny details with the actual goal of becoming insane ). That is also the reason why you will never remove all bugs by testing a distributed concurrent system multiple times. 

A more theoritically sound approach is required, but while computer scientists work on that part what can be done with our robots ? Here is a shortlist : 

- linearizability : all messages received by a process (node) should be ordered in a queue before doing anything else. This requires extra effort as it is not supported  ( or at least api is not exposed ? ) in ROS. This also allows replaying messages received by a node (from the receiver view point, not like rosbag, which does it from a sender viewpoint). But that also means that the node relies only on messages received ( not clock, or data in a file somewhere, etc. ) otherwise they are other things to record and replay. Luckily anything ( I think ) can be modeled as a list of messages received at a certain time.

- one thread per node. One needs to make sure the multithreading in a node introduced by ros ( services, topics ) do not have any side effect ( nothing modified outside the callback ). This could maybe be enforced with an extension ? Or drop the mutithread approach and just queue messages to be processed by the only node main loop.

- global synchronous scheduler needs to know which message is received by which node (=thread) and the time, and record this. This ensure a possible deterministice replay later, providing that all inputs in the system is a message. Ideally it should also be the one telling a node to loop another time, so that node state can be saved before and after. This is where you can have a "debug"/"release" distinction, where release would spin as fast as possible ( possibly dropping fairness and determinacy, but these might be addressable by putting a few sleeps here and there, and avoiding accumulators... which is still humanly possible to find and fix )

I think these are the main points...
Actor Model : one thread, one mailbox. And to compose them you need a scheduler that ensure fairness of computation, record all and can replay all messages. A message is the only way to interract between threads.

Note 1 : one can have a scheduler inside a node between threads, and outside between nodes. But this hierarchy can makes things more complex than needed.

Note 2 : The OS scheduler is non deterministic ( in short because it is influenced by real world state that you cannot possibly control, and because it was designed for processes that compete for machine resources ) so you need to build and use your own in your application when you want cooperative process scheduling.

I hope this is helpful. My impression is that the tools listed are just doing a part of this list, and ROS as a framework lacks the necessary constraints to enforce determinism. Other framework might  do that but then it s not ROS anymore :slight_smile: 

I would love to see libraries developed that could insure determinacy in a ROS system.

-- 
[Visit Topic](https://discourse.ros.org/t/deterministic-replay-and-debugging/1316/11) or reply to this email to respond.