[ros-users] Race condition in actionlib

Vijay Pradeep vpradeep at willowgarage.com
Fri Apr 1 18:03:37 UTC 2011


Hi Ryan,

Here's one more thing to try.  Turn on rosconsole debugging (See rosconole
configuration <http://www.ros.org/wiki/rosconsole#Configuration>)

You're going to want the following two lines in your rosonsole.config:
log4j.logger.ros.actionlib=DEBUG
log4j.logger.ros.chores=DEBUG


You'll get an output that looks like this:
[DEBUG] [1301680806.580203868]: Setting the current goal as aborted
[DEBUG] [1301680806.580254603]: Setting status to aborted on goal, id:
/test_client_server-1118-1301680806.579359151, stamp: 1301680806.58
[DEBUG] [1301680806.580301868]: Publishing result for goal with id:
/test_client_server-1118-1301680806.579359151 and stamp: 1301680806.58
[DEBUG] [1301680806.580439068]: Getting status over the wire.
[DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE
[DEBUG] [1301680806.580553872]: Transitioning CommState from
WAITING_FOR_GOAL_ACK to ACTIVE
[DEBUG] [1301680806.580603042]: Transitioning SimpleState from [PENDING] to
[ACTIVE]
Active.
[DEBUG] [1301680806.580934461]: Trying to transition to WAITING_FOR_RESULT
[DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE to
WAITING_FOR_RESULT
[DEBUG] [1301680806.581051376]: Trying to transition to DONE
[DEBUG] [1301680806.581108948]: Transitioning CommState from
WAITING_FOR_RESULT to DONE
[DEBUG] [1301680806.581164618]: Transitioning SimpleState from [ACTIVE] to
[DONE]
Done.
[ INFO] [1301680806.581243326]: Hogging time or something
Current State: ABORTED
[DEBUG] [1301680806.581302953]: IN DELETER
[DEBUG] [1301680806.581339010]: About to erase CommStateMachine
[DEBUG] [1301680806.581377433]: Done erasing CommStateMachine
[DEBUG] [1301680806.581414297]: about to start initGoal()
[DEBUG] [1301680806.581489440]: Done with initGoal()
[DEBUG] [1301680806.581544720]: The action server has received a new goal
request
[DEBUG] [1301680806.581638097]: A new goal has been recieved by the single
goal action server
[DEBUG] [1301680806.581705716]: Accepting a new goal
[DEBUG] [1301680806.581758609]: Accepting goal, id:
/test_client_server-1119-1301680806.581440270, stamp: 1301680806.58
Server called. Aborting goal.

Can send me the debug output for the last few cycles when your app freezes?

Vijay

On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <vpradeep at willowgarage.com>wrote:

> Hi Ryan,
>
> So I actually ran your original chores app overnight (~10 hours), and it
> still didn't freeze.
>
> Also, right now, I tried running your new chores app about 30 times (for
> about 5 seconds on each run), and it still hadn't frozen.  I couldn't get
> the app to build using your CMakeLists and Makefile, so I rewrote them based
> on the standard "roscreate-pkg" versions of these files. I doubt this would
> affect the functionality.
>
> The fact that it freezes close to startup is interesting.  My guess would
> be that it is having some sort of startup interaction with the AsyncSpinner.
>
> Instead of:
> int main(int argc, char** argv) {
>   ros::init(argc, argv, "test_client_server");
>
>   ros::AsyncSpinner spinner(4);
> *  spinner.start();*
>
>   ServerDude sd;
>
>   ClientGuy client;
>   client.run();
>
>   return 0;
> }
>
> Can you try:
> int main(int argc, char** argv) {
>   ros::init(argc, argv, "test_client_server");
>
>   ros::AsyncSpinner spinner(4);
>
>   ServerDude sd;
>
>   ClientGuy client;
> *  spinner.start();*
>   client.run();
>
>   return 0;
> }
>
> I would really like to get to the bottom of this, but it's hard to do much
> on my end if I can't actually get it to halt.
>
> Vijay
>
>
> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com> wrote:
>
>> Hey Vijay, thanks for helping (and sorry for the double post; I
>> accidentally forgot to reply all, and I figure some people on the list
>> may be interested in the email). Yes, I did mean waitForResult, and I
>> am also using common-1.4.3 (diamondback too, if that's relevant).
>> Since you're having trouble duplicating the issue, I went back to
>> create an example closer to what I'm actually doing, and I managed to
>> get it to crash just as often. I've attached the project.
>>
>> If it doesn't crash after about 5 seconds, restart it. It seems if it
>> continues for more than 5 seconds, it is more likely to continue
>> running for much longer, but it usually halts within a second.
>>
>> Thanks a ton.
>>
>> -Ryan
>>
>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
>> <vpradeep at willowgarage.com> wrote:
>> > Hi Ryan,
>> >
>> > I'm sorry that you're having trouble getting actionlib to work.  What
>> > version of the common stack do you have?
>> >
>> > Note that it is always possible for some of the ROS messages to be
>> dropped.
>> > If the result message doesn't make it to the action client, then
>> > waitForResult is going to block forever.  I'd suggest adding a timeout
>> on
>> > waitForServer, and then preempting the goal if you end up waiting too
>> long.
>> >
>> >> I made each callback print when
>> >> it was called, and I found that quite often, the program would block
>> >> forever when I called waitForServer.
>> > I'm guessing you meant "the program would block forever when I called
>> > waitForResult"
>> >
>> > I'm using common-1.4.3, and after running your example for ~10 minutes,
>> it
>> > doesn't seem to freeze for me.  I can rerun this test with the exact
>> version
>> > of common that you're using.  Is there anything you can do to make this
>> > minimal example freeze more often? Are there some strategic sleeps you
>> could
>> > add to make it mimic your original app more closely?
>> >
>> > If you can get the chores app to freeze, you could try attaching gdb to
>> the
>> > process and seeing where all the threads are (using the "info threads"
>> > command inside gdb).  A bunch of them will be stuck on
>> > pthread_cond_timedwait calls, but I'd be curious if there's a thread
>> stuck
>> > on a lock inside of actionlib.  That would be indicative of a race
>> condition
>> > in actionlib.
>> >
>> > Vijay
>> >
>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <rmiller4589 at gmail.com>
>> wrote:
>> >>
>> >> Because of a timing issue I can't quite pin down, actionlib
>> >> occasionally appears to exhibit a race condition. I found the problem
>> >> after adding callbacks to my client. I made each callback print when
>> >> it was called, and I found that quite often, the program would block
>> >> forever when I called waitForServer. The problem was that the client's
>> >> active callback was called but the server's execute method was never
>> >> called.
>> >>
>> >> I have reduced the problem into a simple ROS package with a single
>> >> executable that I have attached. After running the node for a while, I
>> >> finally noticed the same problem. It's last output before blocking
>> >> was:
>> >>
>> >> --- snip ---
>> >> Current State: ABORTED
>> >> Aborting.
>> >> Active.
>> >> Done.
>> >> Current State: ABORTED
>> >> Aborting.
>> >> Active.
>> >> --- snip ---
>> >>
>> >> In my actual code, the condition happens extremely frequently, but I
>> >> found I could mitigate the problem by sleeping for one millisecond
>> >> before returning from the server's execute method. (I should warn you
>> >> that in the attached example, the problem almost never occurs).
>> >>
>> >> Is this likely a bug, or might I doing something wrong? Any
>> >> suggestions would be appreciated. Thanks for the help.
>> >>
>> >> -Ryan
>> >>
>> >> _______________________________________________
>> >> ros-users mailing list
>> >> ros-users at code.ros.org
>> >> https://code.ros.org/mailman/listinfo/ros-users
>> >>
>> >
>> >
>> >
>> > --
>> > Vijay Pradeep
>> > Systems Engineer
>> > Willow Garage, Inc.
>> > vpradeep at willowgarage.com
>> >
>> >
>>
>
>
>
> --
> Vijay Pradeep
> Systems Engineer
> Willow Garage, Inc.
>  <tfoote at willowgarage.com>vpradeep at willowgarage.com
>
>


-- 
Vijay Pradeep
Systems Engineer
Willow Garage, Inc.
 <tfoote at willowgarage.com>vpradeep at willowgarage.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110401/b946b38d/attachment-0002.html>


More information about the ros-users mailing list