[ros-users] Race condition in actionlib

Vijay Pradeep vpradeep at willowgarage.com
Fri Apr 1 17:05:30 UTC 2011


Hi Ryan,

So I actually ran your original chores app overnight (~10 hours), and it
still didn't freeze.

Also, right now, I tried running your new chores app about 30 times (for
about 5 seconds on each run), and it still hadn't frozen.  I couldn't get
the app to build using your CMakeLists and Makefile, so I rewrote them based
on the standard "roscreate-pkg" versions of these files. I doubt this would
affect the functionality.

The fact that it freezes close to startup is interesting.  My guess would be
that it is having some sort of startup interaction with the AsyncSpinner.

Instead of:
int main(int argc, char** argv) {
  ros::init(argc, argv, "test_client_server");

  ros::AsyncSpinner spinner(4);
*  spinner.start();*

  ServerDude sd;

  ClientGuy client;
  client.run();

  return 0;
}

Can you try:
int main(int argc, char** argv) {
  ros::init(argc, argv, "test_client_server");

  ros::AsyncSpinner spinner(4);

  ServerDude sd;

  ClientGuy client;
*  spinner.start();*
  client.run();

  return 0;
}

I would really like to get to the bottom of this, but it's hard to do much
on my end if I can't actually get it to halt.

Vijay

On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com> wrote:

> Hey Vijay, thanks for helping (and sorry for the double post; I
> accidentally forgot to reply all, and I figure some people on the list
> may be interested in the email). Yes, I did mean waitForResult, and I
> am also using common-1.4.3 (diamondback too, if that's relevant).
> Since you're having trouble duplicating the issue, I went back to
> create an example closer to what I'm actually doing, and I managed to
> get it to crash just as often. I've attached the project.
>
> If it doesn't crash after about 5 seconds, restart it. It seems if it
> continues for more than 5 seconds, it is more likely to continue
> running for much longer, but it usually halts within a second.
>
> Thanks a ton.
>
> -Ryan
>
> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
> <vpradeep at willowgarage.com> wrote:
> > Hi Ryan,
> >
> > I'm sorry that you're having trouble getting actionlib to work.  What
> > version of the common stack do you have?
> >
> > Note that it is always possible for some of the ROS messages to be
> dropped.
> > If the result message doesn't make it to the action client, then
> > waitForResult is going to block forever.  I'd suggest adding a timeout on
> > waitForServer, and then preempting the goal if you end up waiting too
> long.
> >
> >> I made each callback print when
> >> it was called, and I found that quite often, the program would block
> >> forever when I called waitForServer.
> > I'm guessing you meant "the program would block forever when I called
> > waitForResult"
> >
> > I'm using common-1.4.3, and after running your example for ~10 minutes,
> it
> > doesn't seem to freeze for me.  I can rerun this test with the exact
> version
> > of common that you're using.  Is there anything you can do to make this
> > minimal example freeze more often? Are there some strategic sleeps you
> could
> > add to make it mimic your original app more closely?
> >
> > If you can get the chores app to freeze, you could try attaching gdb to
> the
> > process and seeing where all the threads are (using the "info threads"
> > command inside gdb).  A bunch of them will be stuck on
> > pthread_cond_timedwait calls, but I'd be curious if there's a thread
> stuck
> > on a lock inside of actionlib.  That would be indicative of a race
> condition
> > in actionlib.
> >
> > Vijay
> >
> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <rmiller4589 at gmail.com>
> wrote:
> >>
> >> Because of a timing issue I can't quite pin down, actionlib
> >> occasionally appears to exhibit a race condition. I found the problem
> >> after adding callbacks to my client. I made each callback print when
> >> it was called, and I found that quite often, the program would block
> >> forever when I called waitForServer. The problem was that the client's
> >> active callback was called but the server's execute method was never
> >> called.
> >>
> >> I have reduced the problem into a simple ROS package with a single
> >> executable that I have attached. After running the node for a while, I
> >> finally noticed the same problem. It's last output before blocking
> >> was:
> >>
> >> --- snip ---
> >> Current State: ABORTED
> >> Aborting.
> >> Active.
> >> Done.
> >> Current State: ABORTED
> >> Aborting.
> >> Active.
> >> --- snip ---
> >>
> >> In my actual code, the condition happens extremely frequently, but I
> >> found I could mitigate the problem by sleeping for one millisecond
> >> before returning from the server's execute method. (I should warn you
> >> that in the attached example, the problem almost never occurs).
> >>
> >> Is this likely a bug, or might I doing something wrong? Any
> >> suggestions would be appreciated. Thanks for the help.
> >>
> >> -Ryan
> >>
> >> _______________________________________________
> >> ros-users mailing list
> >> ros-users at code.ros.org
> >> https://code.ros.org/mailman/listinfo/ros-users
> >>
> >
> >
> >
> > --
> > Vijay Pradeep
> > Systems Engineer
> > Willow Garage, Inc.
> > vpradeep at willowgarage.com
> >
> >
>



-- 
Vijay Pradeep
Systems Engineer
Willow Garage, Inc.
 <tfoote at willowgarage.com>vpradeep at willowgarage.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110401/9b935f95/attachment-0002.html>


More information about the ros-users mailing list