On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <vpradeep@willowgarage.com> wrote:

Hi Ryan,

So I actually ran your original chores app overnight (~10 hours), and it still didn't freeze.

Also, right now, I tried running your new chores app about 30 times (for about 5 seconds on each run), and it still hadn't frozen. I couldn't get the app to build using your CMakeLists and Makefile, so I rewrote them based on the standard "roscreate-pkg" versions of these files. I doubt this would affect the functionality.

The fact that it freezes close to startup is interesting. My guess would be that it is having some sort of startup interaction with the AsyncSpinner.

Instead of:
int main(int argc, char** argv) {
ros::init(argc, argv, "test_client_server");

ros::AsyncSpinner spinner(4);
spinner.start();

ServerDude sd;

ClientGuy client;
client.run();

return 0;
}

Can you try:
int main(int argc, char** argv) {
ros::init(argc, argv, "test_client_server");

ros::AsyncSpinner spinner(4);

ServerDude sd;

ClientGuy client;
spinner.start();
client.run();

return 0;
}

I would really like to get to the bottom of this, but it's hard to do much on my end if I can't actually get it to halt.

Vijay

On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589@gmail.com> wrote:
Hey Vijay, thanks for helping (and sorry for the double post; I
accidentally forgot to reply all, and I figure some people on the list
may be interested in the email). Yes, I did mean waitForResult, and I
am also using common-1.4.3 (diamondback too, if that's relevant).

Since you're having trouble duplicating the issue, I went back to
create an example closer to what I'm actually doing, and I managed to
get it to crash just as often. I've attached the project.

If it doesn't crash after about 5 seconds, restart it. It seems if it
continues for more than 5 seconds, it is more likely to continue
running for much longer, but it usually halts within a second.

Thanks a ton.

-Ryan

On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
<vpradeep@willowgarage.com> wrote:

> Hi Ryan,
>
> I'm sorry that you're having trouble getting actionlib to work. What
> version of the common stack do you have?
>
> Note that it is always possible for some of the ROS messages to be dropped.
> If the result message doesn't make it to the action client, then
> waitForResult is going to block forever. I'd suggest adding a timeout on
> waitForServer, and then preempting the goal if you end up waiting too long.
>
>> I made each callback print when
>> it was called, and I found that quite often, the program would block
>> forever when I called waitForServer.
> I'm guessing you meant "the program would block forever when I called
> waitForResult"
>
> I'm using common-1.4.3, and after running your example for ~10 minutes, it
> doesn't seem to freeze for me. I can rerun this test with the exact version
> of common that you're using. Is there anything you can do to make this
> minimal example freeze more often? Are there some strategic sleeps you could
> add to make it mimic your original app more closely?
>
> If you can get the chores app to freeze, you could try attaching gdb to the
> process and seeing where all the threads are (using the "info threads"
> command inside gdb). A bunch of them will be stuck on
> pthread_cond_timedwait calls, but I'd be curious if there's a thread stuck
> on a lock inside of actionlib. That would be indicative of a race condition
> in actionlib.
>
> Vijay
>
> On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <rmiller4589@gmail.com> wrote:
>>
>> Because of a timing issue I can't quite pin down, actionlib
>> occasionally appears to exhibit a race condition. I found the problem
>> after adding callbacks to my client. I made each callback print when
>> it was called, and I found that quite often, the program would block
>> forever when I called waitForServer. The problem was that the client's
>> active callback was called but the server's execute method was never
>> called.
>>
>> I have reduced the problem into a simple ROS package with a single
>> executable that I have attached. After running the node for a while, I
>> finally noticed the same problem. It's last output before blocking
>> was:
>>
>> --- snip ---
>> Current State: ABORTED
>> Aborting.
>> Active.
>> Done.
>> Current State: ABORTED
>> Aborting.
>> Active.
>> --- snip ---
>>
>> In my actual code, the condition happens extremely frequently, but I
>> found I could mitigate the problem by sleeping for one millisecond
>> before returning from the server's execute method. (I should warn you
>> that in the attached example, the problem almost never occurs).
>>
>> Is this likely a bug, or might I doing something wrong? Any
>> suggestions would be appreciated. Thanks for the help.
>>
>> -Ryan
>>
>> _______________________________________________
>> ros-users mailing list
>> ros-users@code.ros.org
>> https://code.ros.org/mailman/listinfo/ros-users
>>
>
>
>
> --
> Vijay Pradeep
> Systems Engineer
> Willow Garage, Inc.
> vpradeep@willowgarage.com
>
>

--
Vijay Pradeep
Systems Engineer
Willow Garage, Inc.
vpradeep@willowgarage.com