[ros-users] Race condition in actionlib

Ryan Miller rmiller4589 at gmail.com
Fri Apr 1 20:07:22 UTC 2011


Hi Vijay,

When I move the called to spinner.start() to where you suggested, the client
gets stuck on its call to waitForServer as follows:

--- snip ---
$ rosrun chores chores
[DEBUG] [1301684841.767308639]: Spinning up a thread for the
SimpleActionClient
[DEBUG] [1301684841.794376567]: goalConnectCallback: Adding
[/test_client_server] to goalSubscribers
[DEBUG] [1301684841.794470922]: Goal Subscribers (1 total)
  - /test_client_server
[DEBUG] [1301684841.800129524]: cancelConnectCallback: Adding
[/test_client_server] to cancelSubscribers
[DEBUG] [1301684841.800199855]: cancel Subscribers (1 total)
  - /test_client_server
[DEBUG] [1301684841.809336199]: isServerConnected: Didn't receive status
yet, so not connected yet
[DEBUG] [1301684841.809441030]: isServerConnected: Didn't receive status
yet, so not connected yet
[DEBUG] [1301684842.309665431]: isServerConnected: Didn't receive status
yet, so not connected yet
[DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive status
yet, so not connected yet
...
[DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive status
yet, so not connected yet
--- snip ---

Also, it appears I have a heisenbug on my hands because when I turn on
rosconsole debugging, the problem seemingly disappears, but when I keep it
off, it crashes on almost every run. By the way, the single call to ROS_INFO
I placed in the code is critical to making the program halt more often.
Remove that, and you'll have more trouble getting the program to halt.

Is there any other information that might help you replicate this problem?
For now, I'll just throw some facts here about the computer I'm using. I am
running Ubuntu 10.10 64-bit. I have replicated the problem on a relatively
slow laptop and two pretty fast desktops. The specific package information
for the version of ros-diamondback-common I am using is:

--- snip ---
$ dpkg -s ros-diamondback-common
Package: ros-diamondback-common
Status: install ok installed
Priority: optional
Section: unknown
Installed-Size: 34224
Maintainer: Tully Foote tfoote at willowgarage.com
Architecture: amd64
Version: 1.4.3-s1299027665~maverick
Depends: ros-diamondback-common-msgs (= 1.4.0-s1299027564~maverick),
ros-diamondback-ros (= 1.4.6-s1298587732~maverick), ros-diamondback-ros-comm
(= 1.4.5-s1299027222~maverick), libc6, build-essential, cmake, python-yaml,
subversion, uuid-dev, libcppunit-dev
Description: common code for personal robots
This stack contains tools built on top of ROS core which are commonly used
throughout the ROS ecosystem.
Wg-Rosdistro: diamondback
--- snip ---

Here is some information from gdb from when the program halts. I've printed
a list of the running threads as well as the backtraces of a few of the ones
in different functions.

--- snip ---
Current State: ABORTED
Server called. Aborting goal.
Active.
^C
Program received signal SIGINT, Interrupt.
pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such
file or directory.
in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
(gdb) info thread
 12 Thread 0x7fffecea2700 (LWP 9649)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 11 Thread 0x7fffed6a3700 (LWP 9646)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 10 Thread 0x7fffedea4700 (LWP 9640)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 9 Thread 0x7fffee6a5700 (LWP 9639)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 8 Thread 0x7fffeeea6700 (LWP 9638)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 7 Thread 0x7fffef6a7700 (LWP 9637)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 6 Thread 0x7fffefea8700 (LWP 9636)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 5 Thread 0x7ffff06a9700 (LWP 9635)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
 4 Thread 0x7ffff0eaa700 (LWP 9630)  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
 3 Thread 0x7ffff16ab700 (LWP 9629)  0x00007ffff534e2c3 in select () at
../sysdeps/unix/syscall-template.S:82
 2 Thread 0x7ffff1eac700 (LWP 9628)  0x00007ffff5349203 in __poll
(fds=<value optimized out>, nfds=<value optimized out>, timeout=100) at
../sysdeps/unix/sysv/linux/poll.c:87
* 1 Thread 0x7ffff7fcd760 (LWP 9609)  pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
#1  0x000000000044c698 in timed_wait<boost::unique_lock<boost::mutex> >
(this=0x7fffffffe110, m=..., wait_duration=<value optimized out>) at
/usr/include/boost/thread/pthread/condition_variable.hpp:110
#2
 boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex>,
boost::posix_time::milliseconds> (this=0x7fffffffe110, m=...,
wait_duration=<value optimized out>)
   at /usr/include/boost/thread/pthread/condition_variable.hpp:133
#3  0x000000000044d355 in
actionlib::SimpleActionClient<chores::DoDishesAction_<std::allocator<void> >
>::waitForResult (this=0x7fffffffe040, timeout=...)
   at
/opt/ros/diamondback/stacks/common/actionlib/include/actionlib/client/simple_action_client.h:567
#4  0x00000000004583f6 in ClientGuy::run (this=0x7fffffffe040) at
/home/pesckal/prometheus/ros/chores/src/main.cpp:50
#5  0x000000000042d8d5 in main (argc=1, argv=<value optimized out>) at
/home/pesckal/prometheus/ros/chores/src/main.cpp:87
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff1eac700 (LWP 9628))]#0
0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
87    ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
    in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0  0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
#1  0x00007ffff78c169a in ros::PollSet::update (this=0x69c650,
poll_timeout=100)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_set.cpp:202
#2  0x00007ffff7906778 in ros::PollManager::threadFunc (this=0x69c650)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_manager.cpp:95
#3  0x00007ffff6df5230 in thread_proxy () from
/usr/lib/libboost_thread.so.1.42.0
#4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
pthread_create.c:304
#5  0x00007ffff535592d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ffff16ab700 (LWP 9629))]#0
0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82
82    ../sysdeps/unix/syscall-template.S: No such file or directory.
    in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007ffff723313a in XmlRpc::XmlRpcDispatch::work (this=0x69ac18,
timeout=<value optimized out>)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/utilities/xmlrpcpp/src/XmlRpcDispatch.cpp:110
#2  0x00007ffff7903de9 in ros::XMLRPCManager::serverThreadFunc
(this=0x69abc0)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/xmlrpc_manager.cpp:256
#3  0x00007ffff6df5230 in thread_proxy () from
/usr/lib/libboost_thread.so.1.42.0
#4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
pthread_create.c:304
#5  0x00007ffff535592d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()
(gdb) thread 4
[Switching to thread 4 (Thread 0x7ffff0eaa700 (LWP 9630))]#0
pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
162    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such
file or directory.
    in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007ffff79603b3 in wait (this=<value optimized out>) at
/usr/include/boost/thread/pthread/condition_variable.hpp:20
#2  ros::ROSOutAppender::logThread (this=<value optimized out>)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/rosout_appender.cpp:135
#3  0x00007ffff6df5230 in thread_proxy () from
/usr/lib/libboost_thread.so.1.42.0
#4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
pthread_create.c:304
#5  0x00007ffff535592d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#6  0x0000000000000000 in ?? ()
(gdb) thread 5
[Switching to thread 5 (Thread 0x7ffff06a9700 (LWP 9635))]#0
pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
212    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
such file or directory.
    in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
#1  0x00007ffff796885d in bool
boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration,
1000000l> >(boost::unique_lock<boost::mutex>&,
boost::date_time::subsecond_duration<boost::posix_time::time_duration,
1000000l> const&) () from
/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/lib/libros.so
#2  0x00007ffff7964e99 in ros::CallbackQueue::callAvailable (this=0x69c410,
timeout=DWARF-2 expression error: DW_OP_reg operations must be used either
alone or in conjuction with DW_OP_piece or DW_OP_bit_piece.
)
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/callback_queue.cpp:307
#3  0x00007ffff78f22e1 in ros::internalCallbackQueueThreadFunc ()
    at
/tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/init.cpp:265
#4  0x00007ffff6df5230 in thread_proxy () from
/usr/lib/libboost_thread.so.1.42.0
#5  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
pthread_create.c:304
#6  0x00007ffff535592d in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#7  0x0000000000000000 in ?? ()
(gdb)
--- snip ---

Also, I'm sorry to hear you had to create a new package to compile the
program. As a sanity check, I also created a new package and copied over the
cpp and action files, but this did not change the result I was having, as
expected. (I've attached the package, for reference.)

I really appreciate the help, thank you.

-Ryan

On Fri, Apr 1, 2011 at 2:03 PM, Vijay Pradeep <vpradeep at willowgarage.com>
wrote:
> Hi Ryan,
>
> Here's one more thing to try.  Turn on rosconsole debugging (See rosconole
> configuration)
>
> You're going to want the following two lines in your rosonsole.config:
> log4j.logger.ros.actionlib=DEBUG
> log4j.logger.ros.chores=DEBUG
>
>
> You'll get an output that looks like this:
> [DEBUG] [1301680806.580203868]: Setting the current goal as aborted
> [DEBUG] [1301680806.580254603]: Setting status to aborted on goal, id:
> /test_client_server-1118-1301680806.579359151, stamp: 1301680806.58
> [DEBUG] [1301680806.580301868]: Publishing result for goal with id:
> /test_client_server-1118-1301680806.579359151 and stamp: 1301680806.58
> [DEBUG] [1301680806.580439068]: Getting status over the wire.
> [DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE
> [DEBUG] [1301680806.580553872]: Transitioning CommState from
> WAITING_FOR_GOAL_ACK to ACTIVE
> [DEBUG] [1301680806.580603042]: Transitioning SimpleState from [PENDING]
to
> [ACTIVE]
> Active.
> [DEBUG] [1301680806.580934461]: Trying to transition to WAITING_FOR_RESULT
> [DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE to
> WAITING_FOR_RESULT
> [DEBUG] [1301680806.581051376]: Trying to transition to DONE
> [DEBUG] [1301680806.581108948]: Transitioning CommState from
> WAITING_FOR_RESULT to DONE
> [DEBUG] [1301680806.581164618]: Transitioning SimpleState from [ACTIVE] to
> [DONE]
> Done.
> [ INFO] [1301680806.581243326]: Hogging time or something
> Current State: ABORTED
> [DEBUG] [1301680806.581302953]: IN DELETER
> [DEBUG] [1301680806.581339010]: About to erase CommStateMachine
> [DEBUG] [1301680806.581377433]: Done erasing CommStateMachine
> [DEBUG] [1301680806.581414297]: about to start initGoal()
> [DEBUG] [1301680806.581489440]: Done with initGoal()
> [DEBUG] [1301680806.581544720]: The action server has received a new goal
> request
> [DEBUG] [1301680806.581638097]: A new goal has been recieved by the single
> goal action server
> [DEBUG] [1301680806.581705716]: Accepting a new goal
> [DEBUG] [1301680806.581758609]: Accepting goal, id:
> /test_client_server-1119-1301680806.581440270, stamp: 1301680806.58
> Server called. Aborting goal.
>
> Can send me the debug output for the last few cycles when your app
freezes?
>
> Vijay
>
> On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <vpradeep at willowgarage.com>
> wrote:
>>
>> Hi Ryan,
>>
>> So I actually ran your original chores app overnight (~10 hours), and it
>> still didn't freeze.
>>
>> Also, right now, I tried running your new chores app about 30 times (for
>> about 5 seconds on each run), and it still hadn't frozen.  I couldn't get
>> the app to build using your CMakeLists and Makefile, so I rewrote them
based
>> on the standard "roscreate-pkg" versions of these files. I doubt this
would
>> affect the functionality.
>>
>> The fact that it freezes close to startup is interesting.  My guess would
>> be that it is having some sort of startup interaction with the
AsyncSpinner.
>>
>> Instead of:
>> int main(int argc, char** argv) {
>>   ros::init(argc, argv, "test_client_server");
>>
>>   ros::AsyncSpinner spinner(4);
>>   spinner.start();
>>
>>   ServerDude sd;
>>
>>   ClientGuy client;
>>   client.run();
>>
>>   return 0;
>> }
>>
>> Can you try:
>> int main(int argc, char** argv) {
>>   ros::init(argc, argv, "test_client_server");
>>
>>   ros::AsyncSpinner spinner(4);
>>
>>   ServerDude sd;
>>
>>   ClientGuy client;
>>   spinner.start();
>>   client.run();
>>
>>   return 0;
>> }
>>
>> I would really like to get to the bottom of this, but it's hard to do
much
>> on my end if I can't actually get it to halt.
>>
>> Vijay
>>
>> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com>
wrote:
>>>
>>> Hey Vijay, thanks for helping (and sorry for the double post; I
>>> accidentally forgot to reply all, and I figure some people on the list
>>> may be interested in the email). Yes, I did mean waitForResult, and I
>>> am also using common-1.4.3 (diamondback too, if that's relevant).
>>> Since you're having trouble duplicating the issue, I went back to
>>> create an example closer to what I'm actually doing, and I managed to
>>> get it to crash just as often. I've attached the project.
>>>
>>> If it doesn't crash after about 5 seconds, restart it. It seems if it
>>> continues for more than 5 seconds, it is more likely to continue
>>> running for much longer, but it usually halts within a second.
>>>
>>> Thanks a ton.
>>>
>>> -Ryan
>>>
>>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
>>> <vpradeep at willowgarage.com> wrote:
>>> > Hi Ryan,
>>> >
>>> > I'm sorry that you're having trouble getting actionlib to work.  What
>>> > version of the common stack do you have?
>>> >
>>> > Note that it is always possible for some of the ROS messages to be
>>> > dropped.
>>> > If the result message doesn't make it to the action client, then
>>> > waitForResult is going to block forever.  I'd suggest adding a timeout
>>> > on
>>> > waitForServer, and then preempting the goal if you end up waiting too
>>> > long.
>>> >
>>> >> I made each callback print when
>>> >> it was called, and I found that quite often, the program would block
>>> >> forever when I called waitForServer.
>>> > I'm guessing you meant "the program would block forever when I called
>>> > waitForResult"
>>> >
>>> > I'm using common-1.4.3, and after running your example for ~10
minutes,
>>> > it
>>> > doesn't seem to freeze for me.  I can rerun this test with the exact
>>> > version
>>> > of common that you're using.  Is there anything you can do to make
this
>>> > minimal example freeze more often? Are there some strategic sleeps you
>>> > could
>>> > add to make it mimic your original app more closely?
>>> >
>>> > If you can get the chores app to freeze, you could try attaching gdb
to
>>> > the
>>> > process and seeing where all the threads are (using the "info threads"
>>> > command inside gdb).  A bunch of them will be stuck on
>>> > pthread_cond_timedwait calls, but I'd be curious if there's a thread
>>> > stuck
>>> > on a lock inside of actionlib.  That would be indicative of a race
>>> > condition
>>> > in actionlib.
>>> >
>>> > Vijay
>>> >
>>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <rmiller4589 at gmail.com>
>>> > wrote:
>>> >>
>>> >> Because of a timing issue I can't quite pin down, actionlib
>>> >> occasionally appears to exhibit a race condition. I found the problem
>>> >> after adding callbacks to my client. I made each callback print when
>>> >> it was called, and I found that quite often, the program would block
>>> >> forever when I called waitForServer. The problem was that the
client's
>>> >> active callback was called but the server's execute method was never
>>> >> called.
>>> >>
>>> >> I have reduced the problem into a simple ROS package with a single
>>> >> executable that I have attached. After running the node for a while,
I
>>> >> finally noticed the same problem. It's last output before blocking
>>> >> was:
>>> >>
>>> >> --- snip ---
>>> >> Current State: ABORTED
>>> >> Aborting.
>>> >> Active.
>>> >> Done.
>>> >> Current State: ABORTED
>>> >> Aborting.
>>> >> Active.
>>> >> --- snip ---
>>> >>
>>> >> In my actual code, the condition happens extremely frequently, but I
>>> >> found I could mitigate the problem by sleeping for one millisecond
>>> >> before returning from the server's execute method. (I should warn you
>>> >> that in the attached example, the problem almost never occurs).
>>> >>
>>> >> Is this likely a bug, or might I doing something wrong? Any
>>> >> suggestions would be appreciated. Thanks for the help.
>>> >>
>>> >> -Ryan
>>> >>
>>> >> _______________________________________________
>>> >> ros-users mailing list
>>> >> ros-users at code.ros.org
>>> >> https://code.ros.org/mailman/listinfo/ros-users
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Vijay Pradeep
>>> > Systems Engineer
>>> > Willow Garage, Inc.
>>> > vpradeep at willowgarage.com
>>> >
>>> >
>>
>>
>>
>> --
>> Vijay Pradeep
>> Systems Engineer
>> Willow Garage, Inc.
>> vpradeep at willowgarage.com
>>
>
>
>
> --
> Vijay Pradeep
> Systems Engineer
> Willow Garage, Inc.
> vpradeep at willowgarage.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110401/f9608d62/attachment-0002.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: chores.tar.gz
Type: application/x-gzip
Size: 2354 bytes
Desc: not available
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110401/f9608d62/attachment-0004.bin>


More information about the ros-users mailing list