[ros-users] Race condition in actionlib

Sat Apr 2 04:07:51 UTC 2011

Thanks for the design tip. It's working much better now.

My path planner is also a service for navigation goals, so I guess I should
also send those out periodically too. It would be a serious problem if my
navigation goal sender didn't know the robot actually finished navigating to
a goal.

> Although TCP/IP is reliable, there are cases where ROS cannot send
messages fast enough over the wire, so ROS decides to drop some messages.

Wow! I think this should be written on a blinking red marquee somewhere. Is
there any document on the ROS wiki where this, and maybe other interesting
details about ROS, are written? I guess I'll be treating all ROS messages as
unreliable from now on...

Thank you again for all the help! It's been great :D

-Ryan

P.S. I still think it's crazy you guys built an unreliable data transfer
protocol on top of TCP/IP...

On Fri, Apr 1, 2011 at 6:30 PM, Vijay Pradeep <vpradeep at willowgarage.com>wrote:

> > Is it acceptable for messages to be dropped? I thought actionlib used
> reliable data transfer.
> It is acceptable for messages to be dropped.  actionlib uses ROS messages
> which use TCP/IP transport.  Although TCP/IP is reliable, there are cases
> where ROS cannot send messages fast enough over the wire, so ROS decides to
> drop some messages.  This means on a given topic, messages get dropped only
> when there are even newer messages that need to be sent.
>
> In your case, it is totally reasonable for the goal to seem like it is not
> getting canceled.  As far as the server is concerned, it already completed
> the goal, and has already sent the result. Thus, there is nothing to cancel.
> This causes the client to indefinitely wait for the goal to terminate.
>
> I don't think you're current approach is best for your application.  You're
> sending a new goal to the server only after the previous goal completes,
> which means that the robot is probably going to briefly stop moving between
> goals.  To avoid this issue, you could continuously send goals at some rate,
> and never wait for them to finish.  In the server's execute loop, you can
> see if a new goal is available and then accept it and start using the new
> goal.  If you ever need to stop due to a potential collision, you can always
> send a cancel request for the latest goal.
>
> Does this seem to fit your need?  There's also a chance that I'm
> misunderstanding your application.
>
> Vijay
>
>
> On Fri, Apr 1, 2011 at 3:04 PM, Ryan Miller <rmiller4589 at gmail.com> wrote:
>
>> Sorry, forgot to attach the file.
>>
>> -Ryan
>>
>>
>> On Fri, Apr 1, 2011 at 6:01 PM, Ryan Miller <rmiller4589 at gmail.com>wrote:
>>
>>> Hi Vijay,
>>>
>>> A few questions then. Is it acceptable for messages to be dropped? I
>>> thought actionlib used reliable data transfer. In my actual program, I am
>>> depending on every message being sent, and I would rather not use delays and
>>> risk my robot still locking up.
>>>
>>> To give you some perspective, I am using the client to plan a path to a
>>> destination, and I am using the server to execute the motor controls to
>>> drive on a path it is given. Using a timeout would be fine (it's what I'm
>>> doing in my production code, actually). But with this new result, it seems I
>>> should now cancel the goal any time waitForResult times out because I don't
>>> know if the server is just taking a long time or if its message was dropped
>>> and it's actually finished. What I was planning on doing was letting the
>>> motion planner follow the path until it is in imminent danger of colliding
>>> with something, in which case I would cancel the goal. Unfortunately, I have
>>> a new problem now where I cannot even cancel the goal.
>>>
>>> After timing out, the client always thinks it's goal is active, and I am
>>> unable to cancel the goal. The output I am getting from the attached example
>>> is:
>>>
>>> --- snip ---
>>> *[ INFO] [1301694170.491982098]: Sending goal
>>> [ INFO] [1301694170.492052098]: Waiting for result...
>>> [ INFO] [1301694170.492138988]: Server is aborting goal.
>>> [ INFO] [1301694170.492193892]: After waiting for the result, the state
>>> is ABORTED
>>> [ INFO] [1301694170.492232340]: Sending goal
>>> [ INFO] [1301694170.492279138]: Waiting for result...
>>> [ INFO] [1301694170.492341676]: Server is aborting goal.
>>> [ INFO] [1301694170.492388975]: After waiting for the result, the state
>>> is ABORTED*
>>> [ INFO] [1301694170.492443452]: Sending goal
>>> [ INFO] [1301694170.492543467]: Server is aborting goal.
>>> [ INFO] [1301694170.492620163]: Waiting for result...
>>> [ INFO] [1301694171.492789064]: After waiting for the result, the state
>>> is ACTIVE
>>> [ INFO] [1301694171.492909135]: Timeout, cancelling the goal
>>> *[ INFO] [1301694171.492953475]: After cancelling, the state is ACTIVE*
>>> *[ INFO] [1301694171.492977808]: The server is not done.
>>> [ INFO] [1301694172.493114619]: The server is not done.
>>> [ INFO] [1301694173.493301543]: The server is not done.
>>> [ INFO] [1301694174.493476758]: The server is not done.*
>>> ^C
>>> --- snip ---
>>> (Normal behaviour is italicised. Unexpected behaviour is bold)
>>>
>>> Might this be another problem, or am I misusing SimpleActionClient?
>>>
>>> Thanks again,
>>>
>>> -Ryan
>>>
>>>
>>> On Fri, Apr 1, 2011 at 4:54 PM, Vijay Pradeep <vpradeep at willowgarage.com
>>> > wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> Ok, I think we're actually getting somewhere.....
>>>>
>>>>
>>>> > When I move the called to spinner.start() to where you suggested, the
>>>> client gets stuck on its call to waitForServer as follows:
>>>> Totally my bad.  What I meant for you to do was the following:
>>>> 1) Construct client & server
>>>> 2) Spin up threads
>>>> 3) waitForServer
>>>> 4) Start sending goals
>>>>
>>>> To do this, you'll have to move spinner.start() to where I suggested,
>>>> and also move client_.waitForServer() to the top of run().  Currently,
>>>> the app is trying to waitForServer without any spin threads running, which
>>>> is causing it to get stuck.  Sorry about that.  I'd also suggest trying to
>>>> run with 1 spinner instead of 4, and see if that helps.
>>>>
>>>> Thanks for the gdb thread info.  It looks like all the threads we're
>>>> worried about are happily waiting for inputs in pthread conditions.  This
>>>> makes it seem even more like a message was being dropped.
>>>>
>>>> Also, I finally got it to freeze on my side, *with* debug on.  See
>>>> output:
>>>>
>>>> Done.
>>>> [ INFO] [1301690082.884716841]: Hogging time or something
>>>> Current State: ABORTED
>>>> [DEBUG] [1301690082.884784275]: IN DELETER
>>>> [DEBUG] [1301690082.884838215]: About to erase CommStateMachine
>>>> [DEBUG] [1301690082.884891598]: Done erasing CommStateMachine
>>>> [DEBUG] [1301690082.884948750]: about to start initGoal()
>>>> [DEBUG] [1301690082.885037423]: The action server has received a new
>>>> goal request
>>>> [DEBUG] [1301690082.885152395]: A new goal has been recieved by the
>>>> single goal action server
>>>> [DEBUG] [1301690082.885263065]: Accepting a new goal
>>>> [DEBUG] [1301690082.885330292]: Accepting goal, id:
>>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88
>>>> Server called. Aborting goal.
>>>> [DEBUG] [1301690082.885555049]: Setting the current goal as aborted
>>>> [DEBUG] [1301690082.885618287]: Setting status to aborted on goal, id:
>>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88
>>>> [DEBUG] [1301690082.885675967]: Publishing result for goal with id:
>>>> /test_client_server-281-1301690082.884983820 and stamp: 1301690082.88
>>>> [DEBUG] [1301690082.885743824]: Getting status over the wire.
>>>> [DEBUG] [1301690082.885956181]: Done with initGoal()
>>>> [DEBUG] [1301690082.887905041]: Getting status over the wire.
>>>> [DEBUG] [1301690082.888000669]: Trying to transition to ACTIVE
>>>> [DEBUG] [1301690082.888064094]: Transitioning CommState from
>>>> WAITING_FOR_GOAL_ACK to ACTIVE
>>>> [DEBUG] [1301690082.888130878]: Transitioning SimpleState from [PENDING]
>>>> to [ACTIVE]
>>>> Active.
>>>> [DEBUG] [1301690082.888203846]: Trying to transition to
>>>> WAITING_FOR_RESULT
>>>> *[DEBUG] [1301690082.888262411]: Transitioning CommState from ACTIVE to
>>>> WAITING_FOR_RESULT*
>>>> [DEBUG] [1301690083.088191441]: Getting status over the wire.
>>>> [DEBUG] [1301690083.288523198]: Getting status over the wire.
>>>> [DEBUG] [1301690083.488836856]: Getting status over the wire.
>>>> [DEBUG] [1301690083.688115542]: Getting status over the wire.
>>>> [DEBUG] [1301690083.888460218]: Getting status over the wire.
>>>> [DEBUG] [1301690084.088794991]: Getting status over the wire.
>>>> [DEBUG] [1301690084.288137860]: Getting status over the wire.
>>>> [DEBUG] [1301690084.488456451]: Getting status over the wire.
>>>> [DEBUG] [1301690084.688742829]: Getting status over the wire.
>>>> [DEBUG] [1301690084.888085527]: Getting status over the wire.
>>>> [DEBUG] [1301690085.088418107]: Getting status over the wire.
>>>>
>>>> On the client side, the goal transitioned to WAITING_FOR_RESULT, which
>>>> means that the server successfully set the goal to a terminal state (in this
>>>> case, ABORTED).  However, for some reason, the client never gets the result
>>>> message.  With so many messages flying across so quickly, I guess one of the
>>>> result messages happens to get dropped.  This also explains why adding a
>>>> sleep in the server callback stopped the freezing.  The sleep significantly
>>>> slowed down the messaging rate, preventing the result message from being
>>>> dropped.
>>>>
>>>> It is definitely best practice to always add a timeout to
>>>> waitForResult().  I've updated the actionlib quickstart guide<http://www.ros.org/wiki/actionlib#C.2B-.2B-_SimpleActionClient>accordingly.
>>>>
>>>> And, please let me know if any of the spinner tweaks I suggested make a
>>>> difference.
>>>>
>>>> Vijay
>>>>
>>>>
>>>> On Fri, Apr 1, 2011 at 1:07 PM, Ryan Miller <rmiller4589 at gmail.com>wrote:
>>>>
>>>>> Hi Vijay,
>>>>>
>>>>> When I move the called to spinner.start() to where you suggested, the
>>>>> client gets stuck on its call to waitForServer as follows:
>>>>>
>>>>> --- snip ---
>>>>> $ rosrun chores chores
>>>>> [DEBUG] [1301684841.767308639]: Spinning up a thread for the
>>>>> SimpleActionClient
>>>>> [DEBUG] [1301684841.794376567]: goalConnectCallback: Adding
>>>>> [/test_client_server] to goalSubscribers
>>>>> [DEBUG] [1301684841.794470922]: Goal Subscribers (1 total)
>>>>>   - /test_client_server
>>>>> [DEBUG] [1301684841.800129524]: cancelConnectCallback: Adding
>>>>> [/test_client_server] to cancelSubscribers
>>>>> [DEBUG] [1301684841.800199855]: cancel Subscribers (1 total)
>>>>>   - /test_client_server
>>>>> [DEBUG] [1301684841.809336199]: isServerConnected: Didn't receive
>>>>> status yet, so not connected yet
>>>>> [DEBUG] [1301684841.809441030]: isServerConnected: Didn't receive
>>>>> status yet, so not connected yet
>>>>> [DEBUG] [1301684842.309665431]: isServerConnected: Didn't receive
>>>>> status yet, so not connected yet
>>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive
>>>>> status yet, so not connected yet
>>>>> ...
>>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive
>>>>> status yet, so not connected yet
>>>>> --- snip ---
>>>>>
>>>>> Also, it appears I have a heisenbug on my hands because when I turn on
>>>>> rosconsole debugging, the problem seemingly disappears, but when I keep it
>>>>> off, it crashes on almost every run. By the way, the single call to ROS_INFO
>>>>> I placed in the code is critical to making the program halt more often.
>>>>> Remove that, and you'll have more trouble getting the program to halt.
>>>>>
>>>>> Is there any other information that might help you replicate this
>>>>> problem? For now, I'll just throw some facts here about the computer I'm
>>>>> using. I am running Ubuntu 10.10 64-bit. I have replicated the problem on a
>>>>> relatively slow laptop and two pretty fast desktops. The specific package
>>>>> information for the version of ros-diamondback-common I am using is:
>>>>>
>>>>> --- snip ---
>>>>> $ dpkg -s ros-diamondback-common
>>>>> Package: ros-diamondback-common
>>>>> Status: install ok installed
>>>>> Priority: optional
>>>>> Section: unknown
>>>>> Installed-Size: 34224
>>>>> Maintainer: Tully Foote tfoote at willowgarage.com
>>>>> Architecture: amd64
>>>>> Version: 1.4.3-s1299027665~maverick
>>>>> Depends: ros-diamondback-common-msgs (= 1.4.0-s1299027564~maverick),
>>>>> ros-diamondback-ros (= 1.4.6-s1298587732~maverick), ros-diamondback-ros-comm
>>>>> (= 1.4.5-s1299027222~maverick), libc6, build-essential, cmake, python-yaml,
>>>>> subversion, uuid-dev, libcppunit-dev
>>>>> Description: common code for personal robots
>>>>> This stack contains tools built on top of ROS core which are commonly
>>>>> used throughout the ROS ecosystem.
>>>>> Wg-Rosdistro: diamondback
>>>>> --- snip ---
>>>>>
>>>>> Here is some information from gdb from when the program halts. I've
>>>>> printed a list of the running threads as well as the backtraces of a few of
>>>>> the ones in different functions.
>>>>>
>>>>> --- snip ---
>>>>> Current State: ABORTED
>>>>> Server called. Aborting goal.
>>>>> Active.
>>>>> ^C
>>>>> Program received signal SIGINT, Interrupt.
>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>> 212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
>>>>> such file or directory.
>>>>> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
>>>>> (gdb) info thread
>>>>>  12 Thread 0x7fffecea2700 (LWP 9649)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  11 Thread 0x7fffed6a3700 (LWP 9646)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  10 Thread 0x7fffedea4700 (LWP 9640)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  9 Thread 0x7fffee6a5700 (LWP 9639)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  8 Thread 0x7fffeeea6700 (LWP 9638)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  7 Thread 0x7fffef6a7700 (LWP 9637)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  6 Thread 0x7fffefea8700 (LWP 9636)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  5 Thread 0x7ffff06a9700 (LWP 9635)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>  4 Thread 0x7ffff0eaa700 (LWP 9630)  pthread_cond_wait@@GLIBC_2.3.2 ()
>>>>> at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>>  3 Thread 0x7ffff16ab700 (LWP 9629)  0x00007ffff534e2c3 in select () at
>>>>> ../sysdeps/unix/syscall-template.S:82
>>>>>  2 Thread 0x7ffff1eac700 (LWP 9628)  0x00007ffff5349203 in __poll
>>>>> (fds=<value optimized out>, nfds=<value optimized out>, timeout=100) at
>>>>> ../sysdeps/unix/sysv/linux/poll.c:87
>>>>> * 1 Thread 0x7ffff7fcd760 (LWP 9609)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>> (gdb) bt
>>>>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>> #1  0x000000000044c698 in timed_wait<boost::unique_lock<boost::mutex> >
>>>>> (this=0x7fffffffe110, m=..., wait_duration=<value optimized out>) at
>>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:110
>>>>> #2
>>>>>  boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex>,
>>>>> boost::posix_time::milliseconds> (this=0x7fffffffe110, m=...,
>>>>> wait_duration=<value optimized out>)
>>>>>    at /usr/include/boost/thread/pthread/condition_variable.hpp:133
>>>>> #3  0x000000000044d355 in
>>>>> actionlib::SimpleActionClient<chores::DoDishesAction_<std::allocator<void> >
>>>>> >::waitForResult (this=0x7fffffffe040, timeout=...)
>>>>>    at
>>>>> /opt/ros/diamondback/stacks/common/actionlib/include/actionlib/client/simple_action_client.h:567
>>>>> #4  0x00000000004583f6 in ClientGuy::run (this=0x7fffffffe040) at
>>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:50
>>>>> #5  0x000000000042d8d5 in main (argc=1, argv=<value optimized out>) at
>>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:87
>>>>> (gdb) thread 2
>>>>> [Switching to thread 2 (Thread 0x7ffff1eac700 (LWP 9628))]#0
>>>>> 0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
>>>>> optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
>>>>> 87    ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
>>>>>     in ../sysdeps/unix/sysv/linux/poll.c
>>>>> (gdb) bt
>>>>> #0  0x00007ffff5349203 in __poll (fds=<value optimized out>,
>>>>> nfds=<value optimized out>, timeout=100) at
>>>>> ../sysdeps/unix/sysv/linux/poll.c:87
>>>>> #1  0x00007ffff78c169a in ros::PollSet::update (this=0x69c650,
>>>>> poll_timeout=100)
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_set.cpp:202
>>>>> #2  0x00007ffff7906778 in ros::PollManager::threadFunc (this=0x69c650)
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_manager.cpp:95
>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>> pthread_create.c:304
>>>>> #5  0x00007ffff535592d in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>> #6  0x0000000000000000 in ?? ()
>>>>> (gdb) thread 3
>>>>> [Switching to thread 3 (Thread 0x7ffff16ab700 (LWP 9629))]#0
>>>>> 0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82
>>>>> 82    ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>     in ../sysdeps/unix/syscall-template.S
>>>>> (gdb) bt
>>>>> #0  0x00007ffff534e2c3 in select () at
>>>>> ../sysdeps/unix/syscall-template.S:82
>>>>> #1  0x00007ffff723313a in XmlRpc::XmlRpcDispatch::work (this=0x69ac18,
>>>>> timeout=<value optimized out>)
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/utilities/xmlrpcpp/src/XmlRpcDispatch.cpp:110
>>>>> #2  0x00007ffff7903de9 in ros::XMLRPCManager::serverThreadFunc
>>>>> (this=0x69abc0)
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/xmlrpc_manager.cpp:256
>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>> pthread_create.c:304
>>>>> #5  0x00007ffff535592d in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>> #6  0x0000000000000000 in ?? ()
>>>>> (gdb) thread 4
>>>>> [Switching to thread 4 (Thread 0x7ffff0eaa700 (LWP 9630))]#0
>>>>> pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>> 162    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No
>>>>> such file or directory.
>>>>>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
>>>>> (gdb) bt
>>>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>> #1  0x00007ffff79603b3 in wait (this=<value optimized out>) at
>>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:20
>>>>> #2  ros::ROSOutAppender::logThread (this=<value optimized out>)
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/rosout_appender.cpp:135
>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>> pthread_create.c:304
>>>>> #5  0x00007ffff535592d in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>> #6  0x0000000000000000 in ?? ()
>>>>> (gdb) thread 5
>>>>> [Switching to thread 5 (Thread 0x7ffff06a9700 (LWP 9635))]#0
>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>> 212    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:
>>>>> No such file or directory.
>>>>>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
>>>>> (gdb) bt
>>>>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>> #1  0x00007ffff796885d in bool
>>>>> boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration,
>>>>> 1000000l> >(boost::unique_lock<boost::mutex>&,
>>>>> boost::date_time::subsecond_duration<boost::posix_time::time_duration,
>>>>> 1000000l> const&) () from
>>>>> /opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/lib/libros.so
>>>>> #2  0x00007ffff7964e99 in ros::CallbackQueue::callAvailable
>>>>> (this=0x69c410, timeout=DWARF-2 expression error: DW_OP_reg operations must
>>>>> be used either alone or in conjuction with DW_OP_piece or DW_OP_bit_piece.
>>>>> )
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/callback_queue.cpp:307
>>>>> #3  0x00007ffff78f22e1 in ros::internalCallbackQueueThreadFunc ()
>>>>>     at
>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/init.cpp:265
>>>>> #4  0x00007ffff6df5230 in thread_proxy () from
>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>> #5  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>> pthread_create.c:304
>>>>> #6  0x00007ffff535592d in clone () at
>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>> #7  0x0000000000000000 in ?? ()
>>>>> (gdb)
>>>>> --- snip ---
>>>>>
>>>>> Also, I'm sorry to hear you had to create a new package to compile the
>>>>> program. As a sanity check, I also created a new package and copied over the
>>>>> cpp and action files, but this did not change the result I was having, as
>>>>> expected. (I've attached the package, for reference.)
>>>>>
>>>>> I really appreciate the help, thank you.
>>>>>
>>>>> -Ryan
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2011 at 2:03 PM, Vijay Pradeep <
>>>>> vpradeep at willowgarage.com> wrote:
>>>>> > Hi Ryan,
>>>>> >
>>>>> > Here's one more thing to try.  Turn on rosconsole debugging (See
>>>>> rosconole
>>>>> > configuration)
>>>>> >
>>>>> > You're going to want the following two lines in your
>>>>> rosonsole.config:
>>>>> > log4j.logger.ros.actionlib=DEBUG
>>>>> > log4j.logger.ros.chores=DEBUG
>>>>> >
>>>>> >
>>>>> > You'll get an output that looks like this:
>>>>> > [DEBUG] [1301680806.580203868]: Setting the current goal as aborted
>>>>> > [DEBUG] [1301680806.580254603]: Setting status to aborted on goal,
>>>>> id:
>>>>> > /test_client_server-1118-1301680806.579359151, stamp: 1301680806.58
>>>>> > [DEBUG] [1301680806.580301868]: Publishing result for goal with id:
>>>>> > /test_client_server-1118-1301680806.579359151 and stamp:
>>>>> 1301680806.58
>>>>> > [DEBUG] [1301680806.580439068]: Getting status over the wire.
>>>>> > [DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE
>>>>> > [DEBUG] [1301680806.580553872]: Transitioning CommState from
>>>>> > WAITING_FOR_GOAL_ACK to ACTIVE
>>>>> > [DEBUG] [1301680806.580603042]: Transitioning SimpleState from
>>>>> [PENDING] to
>>>>> > [ACTIVE]
>>>>> > Active.
>>>>> > [DEBUG] [1301680806.580934461]: Trying to transition to
>>>>> WAITING_FOR_RESULT
>>>>> > [DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE
>>>>> to
>>>>> > WAITING_FOR_RESULT
>>>>> > [DEBUG] [1301680806.581051376]: Trying to transition to DONE
>>>>> > [DEBUG] [1301680806.581108948]: Transitioning CommState from
>>>>> > WAITING_FOR_RESULT to DONE
>>>>> > [DEBUG] [1301680806.581164618]: Transitioning SimpleState from
>>>>> [ACTIVE] to
>>>>> > [DONE]
>>>>> > Done.
>>>>> > [ INFO] [1301680806.581243326]: Hogging time or something
>>>>> > Current State: ABORTED
>>>>> > [DEBUG] [1301680806.581302953]: IN DELETER
>>>>> > [DEBUG] [1301680806.581339010]: About to erase CommStateMachine
>>>>> > [DEBUG] [1301680806.581377433]: Done erasing CommStateMachine
>>>>> > [DEBUG] [1301680806.581414297]: about to start initGoal()
>>>>> > [DEBUG] [1301680806.581489440]: Done with initGoal()
>>>>> > [DEBUG] [1301680806.581544720]: The action server has received a new
>>>>> goal
>>>>> > request
>>>>> > [DEBUG] [1301680806.581638097]: A new goal has been recieved by the
>>>>> single
>>>>> > goal action server
>>>>> > [DEBUG] [1301680806.581705716]: Accepting a new goal
>>>>> > [DEBUG] [1301680806.581758609]: Accepting goal, id:
>>>>> > /test_client_server-1119-1301680806.581440270, stamp: 1301680806.58
>>>>> > Server called. Aborting goal.
>>>>> >
>>>>> > Can send me the debug output for the last few cycles when your app
>>>>> freezes?
>>>>> >
>>>>> > Vijay
>>>>> >
>>>>> > On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <
>>>>> vpradeep at willowgarage.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> Hi Ryan,
>>>>> >>
>>>>> >> So I actually ran your original chores app overnight (~10 hours),
>>>>> and it
>>>>> >> still didn't freeze.
>>>>> >>
>>>>> >> Also, right now, I tried running your new chores app about 30 times
>>>>> (for
>>>>> >> about 5 seconds on each run), and it still hadn't frozen.  I
>>>>> couldn't get
>>>>> >> the app to build using your CMakeLists and Makefile, so I rewrote
>>>>> them based
>>>>> >> on the standard "roscreate-pkg" versions of these files. I doubt
>>>>> this would
>>>>> >> affect the functionality.
>>>>> >>
>>>>> >> The fact that it freezes close to startup is interesting.  My guess
>>>>> would
>>>>> >> be that it is having some sort of startup interaction with the
>>>>> AsyncSpinner.
>>>>> >>
>>>>> >> Instead of:
>>>>> >> int main(int argc, char** argv) {
>>>>> >>   ros::init(argc, argv, "test_client_server");
>>>>> >>
>>>>> >>   ros::AsyncSpinner spinner(4);
>>>>> >>   spinner.start();
>>>>> >>
>>>>> >>   ServerDude sd;
>>>>> >>
>>>>> >>   ClientGuy client;
>>>>> >>   client.run();
>>>>> >>
>>>>> >>   return 0;
>>>>> >> }
>>>>> >>
>>>>> >> Can you try:
>>>>> >> int main(int argc, char** argv) {
>>>>> >>   ros::init(argc, argv, "test_client_server");
>>>>> >>
>>>>> >>   ros::AsyncSpinner spinner(4);
>>>>> >>
>>>>> >>   ServerDude sd;
>>>>> >>
>>>>> >>   ClientGuy client;
>>>>> >>   spinner.start();
>>>>> >>   client.run();
>>>>> >>
>>>>> >>   return 0;
>>>>> >> }
>>>>> >>
>>>>> >> I would really like to get to the bottom of this, but it's hard to
>>>>> do much
>>>>> >> on my end if I can't actually get it to halt.
>>>>> >>
>>>>> >> Vijay
>>>>> >>
>>>>> >> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com>
>>>>> wrote:
>>>>> >>>
>>>>> >>> Hey Vijay, thanks for helping (and sorry for the double post; I
>>>>> >>> accidentally forgot to reply all, and I figure some people on the
>>>>> list
>>>>> >>> may be interested in the email). Yes, I did mean waitForResult, and
>>>>> I
>>>>> >>> am also using common-1.4.3 (diamondback too, if that's relevant).
>>>>> >>> Since you're having trouble duplicating the issue, I went back to
>>>>> >>> create an example closer to what I'm actually doing, and I managed
>>>>> to
>>>>> >>> get it to crash just as often. I've attached the project.
>>>>> >>>
>>>>> >>> If it doesn't crash after about 5 seconds, restart it. It seems if
>>>>> it
>>>>> >>> continues for more than 5 seconds, it is more likely to continue
>>>>> >>> running for much longer, but it usually halts within a second.
>>>>> >>>
>>>>> >>> Thanks a ton.
>>>>> >>>
>>>>> >>> -Ryan
>>>>> >>>
>>>>> >>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
>>>>> >>> <vpradeep at willowgarage.com> wrote:
>>>>> >>> > Hi Ryan,
>>>>> >>> >
>>>>> >>> > I'm sorry that you're having trouble getting actionlib to work.
>>>>> What
>>>>> >>> > version of the common stack do you have?
>>>>> >>> >
>>>>> >>> > Note that it is always possible for some of the ROS messages to
>>>>> be
>>>>> >>> > dropped.
>>>>> >>> > If the result message doesn't make it to the action client, then
>>>>> >>> > waitForResult is going to block forever.  I'd suggest adding a
>>>>> timeout
>>>>> >>> > on
>>>>> >>> > waitForServer, and then preempting the goal if you end up waiting
>>>>> too
>>>>> >>> > long.
>>>>> >>> >
>>>>> >>> >> I made each callback print when
>>>>> >>> >> it was called, and I found that quite often, the program would
>>>>> block
>>>>> >>> >> forever when I called waitForServer.
>>>>> >>> > I'm guessing you meant "the program would block forever when I
>>>>> called
>>>>> >>> > waitForResult"
>>>>> >>> >
>>>>> >>> > I'm using common-1.4.3, and after running your example for ~10
>>>>> minutes,
>>>>> >>> > it
>>>>> >>> > doesn't seem to freeze for me.  I can rerun this test with the
>>>>> exact
>>>>> >>> > version
>>>>> >>> > of common that you're using.  Is there anything you can do to
>>>>> make this
>>>>> >>> > minimal example freeze more often? Are there some strategic
>>>>> sleeps you
>>>>> >>> > could
>>>>> >>> > add to make it mimic your original app more closely?
>>>>> >>> >
>>>>> >>> > If you can get the chores app to freeze, you could try attaching
>>>>> gdb to
>>>>> >>> > the
>>>>> >>> > process and seeing where all the threads are (using the "info
>>>>> threads"
>>>>> >>> > command inside gdb).  A bunch of them will be stuck on
>>>>> >>> > pthread_cond_timedwait calls, but I'd be curious if there's a
>>>>> thread
>>>>> >>> > stuck
>>>>> >>> > on a lock inside of actionlib.  That would be indicative of a
>>>>> race
>>>>> >>> > condition
>>>>> >>> > in actionlib.
>>>>> >>> >
>>>>> >>> > Vijay
>>>>> >>> >
>>>>> >>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <
>>>>> rmiller4589 at gmail.com>
>>>>> >>> > wrote:
>>>>> >>> >>
>>>>> >>> >> Because of a timing issue I can't quite pin down, actionlib
>>>>> >>> >> occasionally appears to exhibit a race condition. I found the
>>>>> problem
>>>>> >>> >> after adding callbacks to my client. I made each callback print
>>>>> when
>>>>> >>> >> it was called, and I found that quite often, the program would
>>>>> block
>>>>> >>> >> forever when I called waitForServer. The problem was that the
>>>>> client's
>>>>> >>> >> active callback was called but the server's execute method was
>>>>> never
>>>>> >>> >> called.
>>>>> >>> >>
>>>>> >>> >> I have reduced the problem into a simple ROS package with a
>>>>> single
>>>>> >>> >> executable that I have attached. After running the node for a
>>>>> while, I
>>>>> >>> >> finally noticed the same problem. It's last output before
>>>>> blocking
>>>>> >>> >> was:
>>>>> >>> >>
>>>>> >>> >> --- snip ---
>>>>> >>> >> Current State: ABORTED
>>>>> >>> >> Aborting.
>>>>> >>> >> Active.
>>>>> >>> >> Done.
>>>>> >>> >> Current State: ABORTED
>>>>> >>> >> Aborting.
>>>>> >>> >> Active.
>>>>> >>> >> --- snip ---
>>>>> >>> >>
>>>>> >>> >> In my actual code, the condition happens extremely frequently,
>>>>> but I
>>>>> >>> >> found I could mitigate the problem by sleeping for one
>>>>> millisecond
>>>>> >>> >> before returning from the server's execute method. (I should
>>>>> warn you
>>>>> >>> >> that in the attached example, the problem almost never occurs).
>>>>> >>> >>
>>>>> >>> >> Is this likely a bug, or might I doing something wrong? Any
>>>>> >>> >> suggestions would be appreciated. Thanks for the help.
>>>>> >>> >>
>>>>> >>> >> -Ryan
>>>>> >>> >>
>>>>> >>> >> _______________________________________________
>>>>> >>> >> ros-users mailing list
>>>>> >>> >> ros-users at code.ros.org
>>>>> >>> >> https://code.ros.org/mailman/listinfo/ros-users
>>>>> >>> >>
>>>>> >>> >
>>>>> >>> >
>>>>> >>> >
>>>>> >>> > --
>>>>> >>> > Vijay Pradeep
>>>>> >>> > Systems Engineer
>>>>> >>> > Willow Garage, Inc.
>>>>> >>> > vpradeep at willowgarage.com
>>>>> >>> >
>>>>> >>> >
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Vijay Pradeep
>>>>> >> Systems Engineer
>>>>> >> Willow Garage, Inc.
>>>>> >> vpradeep at willowgarage.com
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Vijay Pradeep
>>>>> > Systems Engineer
>>>>> > Willow Garage, Inc.
>>>>> > vpradeep at willowgarage.com
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Vijay Pradeep
>>>> Systems Engineer
>>>> Willow Garage, Inc.
>>>>  <tfoote at willowgarage.com>vpradeep at willowgarage.com
>>>>
>>>>
>>>
>>
>
>
> --
> Vijay Pradeep
> Systems Engineer
> Willow Garage, Inc.
>  <tfoote at willowgarage.com>vpradeep at willowgarage.com
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110402/c5d04c04/attachment-0002.html>