[ros-users] Race condition in actionlib

Ken Conley kwc at willowgarage.com
Mon Apr 4 02:21:15 UTC 2011


On Fri, Apr 1, 2011 at 9:07 PM, Ryan Miller <rmiller4589 at gmail.com> wrote:

> Thanks for the design tip. It's working much better now.
>
> My path planner is also a service for navigation goals, so I guess I should
> also send those out periodically too. It would be a serious problem if my
> navigation goal sender didn't know the robot actually finished navigating to
> a goal.
>
>
> > Although TCP/IP is reliable, there are cases where ROS cannot send
> messages fast enough over the wire, so ROS decides to drop some messages.
>
> Wow! I think this should be written on a blinking red marquee somewhere. Is
> there any document on the ROS wiki where this, and maybe other interesting
> details about ROS, are written? I guess I'll be treating all ROS messages as
> unreliable from now on...
>
>
(NOTE: this pertains to roscpp)

Please read about queues and queue_size here:
http://www.ros.org/wiki/roscpp/Overview/Publishers%20and%20Subscribers

The queue size is something that client software configures.  The underlying
transport is reliable -- every message published by a subscriber over TCP is
delivered to the subscriber, assuming you don't saturate the network
connection.

In between the transport and your client is a queue.  Clients (i.e your
code) set a queue policy so that they can throw away messages -- if you want
to. As the documentation warns, its dangerous to configure clients to not
throw away messages.  If your algorithm is too slow and messages start
piling up in your queue faster than you can process them, bad things happen
(you run out of memory, or you backup your network connection).  It's
generally wise for clients to set a smaller queue as mostly often you want
to deal with the most recent data.

Looking at the actionlib code, it looks like it sets the queue size to 1.
This means that action clients are generally configured to act on the most
recent message as quickly as possible.

 - Ken




> Thank you again for all the help! It's been great :D
>
> -Ryan
>
> P.S. I still think it's crazy you guys built an unreliable data transfer
> protocol on top of TCP/IP...
>
>
> On Fri, Apr 1, 2011 at 6:30 PM, Vijay Pradeep <vpradeep at willowgarage.com>wrote:
>
>> > Is it acceptable for messages to be dropped? I thought actionlib used
>> reliable data transfer.
>> It is acceptable for messages to be dropped.  actionlib uses ROS messages
>> which use TCP/IP transport.  Although TCP/IP is reliable, there are cases
>> where ROS cannot send messages fast enough over the wire, so ROS decides to
>> drop some messages.  This means on a given topic, messages get dropped only
>> when there are even newer messages that need to be sent.
>>
>> In your case, it is totally reasonable for the goal to seem like it is not
>> getting canceled.  As far as the server is concerned, it already completed
>> the goal, and has already sent the result. Thus, there is nothing to cancel.
>> This causes the client to indefinitely wait for the goal to terminate.
>>
>> I don't think you're current approach is best for your application.
>> You're sending a new goal to the server only after the previous goal
>> completes, which means that the robot is probably going to briefly stop
>> moving between goals.  To avoid this issue, you could continuously send
>> goals at some rate, and never wait for them to finish.  In the server's
>> execute loop, you can see if a new goal is available and then accept it and
>> start using the new goal.  If you ever need to stop due to a potential
>> collision, you can always send a cancel request for the latest goal.
>>
>> Does this seem to fit your need?  There's also a chance that I'm
>> misunderstanding your application.
>>
>> Vijay
>>
>>
>> On Fri, Apr 1, 2011 at 3:04 PM, Ryan Miller <rmiller4589 at gmail.com>wrote:
>>
>>> Sorry, forgot to attach the file.
>>>
>>> -Ryan
>>>
>>>
>>> On Fri, Apr 1, 2011 at 6:01 PM, Ryan Miller <rmiller4589 at gmail.com>wrote:
>>>
>>>> Hi Vijay,
>>>>
>>>> A few questions then. Is it acceptable for messages to be dropped? I
>>>> thought actionlib used reliable data transfer. In my actual program, I am
>>>> depending on every message being sent, and I would rather not use delays and
>>>> risk my robot still locking up.
>>>>
>>>> To give you some perspective, I am using the client to plan a path to a
>>>> destination, and I am using the server to execute the motor controls to
>>>> drive on a path it is given. Using a timeout would be fine (it's what I'm
>>>> doing in my production code, actually). But with this new result, it seems I
>>>> should now cancel the goal any time waitForResult times out because I don't
>>>> know if the server is just taking a long time or if its message was dropped
>>>> and it's actually finished. What I was planning on doing was letting the
>>>> motion planner follow the path until it is in imminent danger of colliding
>>>> with something, in which case I would cancel the goal. Unfortunately, I have
>>>> a new problem now where I cannot even cancel the goal.
>>>>
>>>> After timing out, the client always thinks it's goal is active, and I am
>>>> unable to cancel the goal. The output I am getting from the attached example
>>>> is:
>>>>
>>>> --- snip ---
>>>> *[ INFO] [1301694170.491982098]: Sending goal
>>>> [ INFO] [1301694170.492052098]: Waiting for result...
>>>> [ INFO] [1301694170.492138988]: Server is aborting goal.
>>>> [ INFO] [1301694170.492193892]: After waiting for the result, the state
>>>> is ABORTED
>>>> [ INFO] [1301694170.492232340]: Sending goal
>>>> [ INFO] [1301694170.492279138]: Waiting for result...
>>>> [ INFO] [1301694170.492341676]: Server is aborting goal.
>>>> [ INFO] [1301694170.492388975]: After waiting for the result, the state
>>>> is ABORTED*
>>>> [ INFO] [1301694170.492443452]: Sending goal
>>>> [ INFO] [1301694170.492543467]: Server is aborting goal.
>>>> [ INFO] [1301694170.492620163]: Waiting for result...
>>>> [ INFO] [1301694171.492789064]: After waiting for the result, the state
>>>> is ACTIVE
>>>> [ INFO] [1301694171.492909135]: Timeout, cancelling the goal
>>>> *[ INFO] [1301694171.492953475]: After cancelling, the state is ACTIVE*
>>>> *[ INFO] [1301694171.492977808]: The server is not done.
>>>> [ INFO] [1301694172.493114619]: The server is not done.
>>>> [ INFO] [1301694173.493301543]: The server is not done.
>>>> [ INFO] [1301694174.493476758]: The server is not done.*
>>>> ^C
>>>> --- snip ---
>>>> (Normal behaviour is italicised. Unexpected behaviour is bold)
>>>>
>>>> Might this be another problem, or am I misusing SimpleActionClient?
>>>>
>>>> Thanks again,
>>>>
>>>> -Ryan
>>>>
>>>>
>>>> On Fri, Apr 1, 2011 at 4:54 PM, Vijay Pradeep <
>>>> vpradeep at willowgarage.com> wrote:
>>>>
>>>>> Hi Ryan,
>>>>>
>>>>> Ok, I think we're actually getting somewhere.....
>>>>>
>>>>>
>>>>> > When I move the called to spinner.start() to where you suggested, the
>>>>> client gets stuck on its call to waitForServer as follows:
>>>>> Totally my bad.  What I meant for you to do was the following:
>>>>> 1) Construct client & server
>>>>> 2) Spin up threads
>>>>> 3) waitForServer
>>>>> 4) Start sending goals
>>>>>
>>>>> To do this, you'll have to move spinner.start() to where I suggested,
>>>>> and also move client_.waitForServer() to the top of run().  Currently,
>>>>> the app is trying to waitForServer without any spin threads running, which
>>>>> is causing it to get stuck.  Sorry about that.  I'd also suggest trying to
>>>>> run with 1 spinner instead of 4, and see if that helps.
>>>>>
>>>>> Thanks for the gdb thread info.  It looks like all the threads we're
>>>>> worried about are happily waiting for inputs in pthread conditions.  This
>>>>> makes it seem even more like a message was being dropped.
>>>>>
>>>>> Also, I finally got it to freeze on my side, *with* debug on.  See
>>>>> output:
>>>>>
>>>>> Done.
>>>>> [ INFO] [1301690082.884716841]: Hogging time or something
>>>>> Current State: ABORTED
>>>>> [DEBUG] [1301690082.884784275]: IN DELETER
>>>>> [DEBUG] [1301690082.884838215]: About to erase CommStateMachine
>>>>> [DEBUG] [1301690082.884891598]: Done erasing CommStateMachine
>>>>> [DEBUG] [1301690082.884948750]: about to start initGoal()
>>>>> [DEBUG] [1301690082.885037423]: The action server has received a new
>>>>> goal request
>>>>> [DEBUG] [1301690082.885152395]: A new goal has been recieved by the
>>>>> single goal action server
>>>>> [DEBUG] [1301690082.885263065]: Accepting a new goal
>>>>> [DEBUG] [1301690082.885330292]: Accepting goal, id:
>>>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88
>>>>> Server called. Aborting goal.
>>>>> [DEBUG] [1301690082.885555049]: Setting the current goal as aborted
>>>>> [DEBUG] [1301690082.885618287]: Setting status to aborted on goal, id:
>>>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88
>>>>> [DEBUG] [1301690082.885675967]: Publishing result for goal with id:
>>>>> /test_client_server-281-1301690082.884983820 and stamp: 1301690082.88
>>>>> [DEBUG] [1301690082.885743824]: Getting status over the wire.
>>>>> [DEBUG] [1301690082.885956181]: Done with initGoal()
>>>>> [DEBUG] [1301690082.887905041]: Getting status over the wire.
>>>>> [DEBUG] [1301690082.888000669]: Trying to transition to ACTIVE
>>>>> [DEBUG] [1301690082.888064094]: Transitioning CommState from
>>>>> WAITING_FOR_GOAL_ACK to ACTIVE
>>>>> [DEBUG] [1301690082.888130878]: Transitioning SimpleState from
>>>>> [PENDING] to [ACTIVE]
>>>>> Active.
>>>>> [DEBUG] [1301690082.888203846]: Trying to transition to
>>>>> WAITING_FOR_RESULT
>>>>> *[DEBUG] [1301690082.888262411]: Transitioning CommState from ACTIVE
>>>>> to WAITING_FOR_RESULT*
>>>>> [DEBUG] [1301690083.088191441]: Getting status over the wire.
>>>>> [DEBUG] [1301690083.288523198]: Getting status over the wire.
>>>>> [DEBUG] [1301690083.488836856]: Getting status over the wire.
>>>>> [DEBUG] [1301690083.688115542]: Getting status over the wire.
>>>>> [DEBUG] [1301690083.888460218]: Getting status over the wire.
>>>>> [DEBUG] [1301690084.088794991]: Getting status over the wire.
>>>>> [DEBUG] [1301690084.288137860]: Getting status over the wire.
>>>>> [DEBUG] [1301690084.488456451]: Getting status over the wire.
>>>>> [DEBUG] [1301690084.688742829]: Getting status over the wire.
>>>>> [DEBUG] [1301690084.888085527]: Getting status over the wire.
>>>>> [DEBUG] [1301690085.088418107]: Getting status over the wire.
>>>>>
>>>>> On the client side, the goal transitioned to WAITING_FOR_RESULT, which
>>>>> means that the server successfully set the goal to a terminal state (in this
>>>>> case, ABORTED).  However, for some reason, the client never gets the result
>>>>> message.  With so many messages flying across so quickly, I guess one of the
>>>>> result messages happens to get dropped.  This also explains why adding a
>>>>> sleep in the server callback stopped the freezing.  The sleep significantly
>>>>> slowed down the messaging rate, preventing the result message from being
>>>>> dropped.
>>>>>
>>>>> It is definitely best practice to always add a timeout to
>>>>> waitForResult().  I've updated the actionlib quickstart guide<http://www.ros.org/wiki/actionlib#C.2B-.2B-_SimpleActionClient>accordingly.
>>>>>
>>>>> And, please let me know if any of the spinner tweaks I suggested make a
>>>>> difference.
>>>>>
>>>>> Vijay
>>>>>
>>>>>
>>>>> On Fri, Apr 1, 2011 at 1:07 PM, Ryan Miller <rmiller4589 at gmail.com>wrote:
>>>>>
>>>>>> Hi Vijay,
>>>>>>
>>>>>> When I move the called to spinner.start() to where you suggested, the
>>>>>> client gets stuck on its call to waitForServer as follows:
>>>>>>
>>>>>> --- snip ---
>>>>>> $ rosrun chores chores
>>>>>> [DEBUG] [1301684841.767308639]: Spinning up a thread for the
>>>>>> SimpleActionClient
>>>>>> [DEBUG] [1301684841.794376567]: goalConnectCallback: Adding
>>>>>> [/test_client_server] to goalSubscribers
>>>>>> [DEBUG] [1301684841.794470922]: Goal Subscribers (1 total)
>>>>>>   - /test_client_server
>>>>>> [DEBUG] [1301684841.800129524]: cancelConnectCallback: Adding
>>>>>> [/test_client_server] to cancelSubscribers
>>>>>> [DEBUG] [1301684841.800199855]: cancel Subscribers (1 total)
>>>>>>   - /test_client_server
>>>>>> [DEBUG] [1301684841.809336199]: isServerConnected: Didn't receive
>>>>>> status yet, so not connected yet
>>>>>> [DEBUG] [1301684841.809441030]: isServerConnected: Didn't receive
>>>>>> status yet, so not connected yet
>>>>>> [DEBUG] [1301684842.309665431]: isServerConnected: Didn't receive
>>>>>> status yet, so not connected yet
>>>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive
>>>>>> status yet, so not connected yet
>>>>>> ...
>>>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive
>>>>>> status yet, so not connected yet
>>>>>> --- snip ---
>>>>>>
>>>>>> Also, it appears I have a heisenbug on my hands because when I turn on
>>>>>> rosconsole debugging, the problem seemingly disappears, but when I keep it
>>>>>> off, it crashes on almost every run. By the way, the single call to ROS_INFO
>>>>>> I placed in the code is critical to making the program halt more often.
>>>>>> Remove that, and you'll have more trouble getting the program to halt.
>>>>>>
>>>>>> Is there any other information that might help you replicate this
>>>>>> problem? For now, I'll just throw some facts here about the computer I'm
>>>>>> using. I am running Ubuntu 10.10 64-bit. I have replicated the problem on a
>>>>>> relatively slow laptop and two pretty fast desktops. The specific package
>>>>>> information for the version of ros-diamondback-common I am using is:
>>>>>>
>>>>>> --- snip ---
>>>>>> $ dpkg -s ros-diamondback-common
>>>>>> Package: ros-diamondback-common
>>>>>> Status: install ok installed
>>>>>> Priority: optional
>>>>>> Section: unknown
>>>>>> Installed-Size: 34224
>>>>>> Maintainer: Tully Foote tfoote at willowgarage.com
>>>>>> Architecture: amd64
>>>>>> Version: 1.4.3-s1299027665~maverick
>>>>>> Depends: ros-diamondback-common-msgs (= 1.4.0-s1299027564~maverick),
>>>>>> ros-diamondback-ros (= 1.4.6-s1298587732~maverick), ros-diamondback-ros-comm
>>>>>> (= 1.4.5-s1299027222~maverick), libc6, build-essential, cmake, python-yaml,
>>>>>> subversion, uuid-dev, libcppunit-dev
>>>>>> Description: common code for personal robots
>>>>>> This stack contains tools built on top of ROS core which are commonly
>>>>>> used throughout the ROS ecosystem.
>>>>>> Wg-Rosdistro: diamondback
>>>>>> --- snip ---
>>>>>>
>>>>>> Here is some information from gdb from when the program halts. I've
>>>>>> printed a list of the running threads as well as the backtraces of a few of
>>>>>> the ones in different functions.
>>>>>>
>>>>>> --- snip ---
>>>>>> Current State: ABORTED
>>>>>> Server called. Aborting goal.
>>>>>> Active.
>>>>>> ^C
>>>>>> Program received signal SIGINT, Interrupt.
>>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>> 212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:
>>>>>> No such file or directory.
>>>>>> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
>>>>>> (gdb) info thread
>>>>>>  12 Thread 0x7fffecea2700 (LWP 9649)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  11 Thread 0x7fffed6a3700 (LWP 9646)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  10 Thread 0x7fffedea4700 (LWP 9640)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  9 Thread 0x7fffee6a5700 (LWP 9639)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  8 Thread 0x7fffeeea6700 (LWP 9638)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  7 Thread 0x7fffef6a7700 (LWP 9637)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  6 Thread 0x7fffefea8700 (LWP 9636)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  5 Thread 0x7ffff06a9700 (LWP 9635)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>>  4 Thread 0x7ffff0eaa700 (LWP 9630)  pthread_cond_wait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>>>  3 Thread 0x7ffff16ab700 (LWP 9629)  0x00007ffff534e2c3 in select ()
>>>>>> at ../sysdeps/unix/syscall-template.S:82
>>>>>>  2 Thread 0x7ffff1eac700 (LWP 9628)  0x00007ffff5349203 in __poll
>>>>>> (fds=<value optimized out>, nfds=<value optimized out>, timeout=100) at
>>>>>> ../sysdeps/unix/sysv/linux/poll.c:87
>>>>>> * 1 Thread 0x7ffff7fcd760 (LWP 9609)  pthread_cond_timedwait@@GLIBC_2.3.2
>>>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>> (gdb) bt
>>>>>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>> #1  0x000000000044c698 in timed_wait<boost::unique_lock<boost::mutex>
>>>>>> > (this=0x7fffffffe110, m=..., wait_duration=<value optimized out>) at
>>>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:110
>>>>>> #2
>>>>>>  boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex>,
>>>>>> boost::posix_time::milliseconds> (this=0x7fffffffe110, m=...,
>>>>>> wait_duration=<value optimized out>)
>>>>>>    at /usr/include/boost/thread/pthread/condition_variable.hpp:133
>>>>>> #3  0x000000000044d355 in
>>>>>> actionlib::SimpleActionClient<chores::DoDishesAction_<std::allocator<void> >
>>>>>> >::waitForResult (this=0x7fffffffe040, timeout=...)
>>>>>>    at
>>>>>> /opt/ros/diamondback/stacks/common/actionlib/include/actionlib/client/simple_action_client.h:567
>>>>>> #4  0x00000000004583f6 in ClientGuy::run (this=0x7fffffffe040) at
>>>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:50
>>>>>> #5  0x000000000042d8d5 in main (argc=1, argv=<value optimized out>) at
>>>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:87
>>>>>> (gdb) thread 2
>>>>>> [Switching to thread 2 (Thread 0x7ffff1eac700 (LWP 9628))]#0
>>>>>> 0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
>>>>>> optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
>>>>>> 87    ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
>>>>>>     in ../sysdeps/unix/sysv/linux/poll.c
>>>>>> (gdb) bt
>>>>>> #0  0x00007ffff5349203 in __poll (fds=<value optimized out>,
>>>>>> nfds=<value optimized out>, timeout=100) at
>>>>>> ../sysdeps/unix/sysv/linux/poll.c:87
>>>>>> #1  0x00007ffff78c169a in ros::PollSet::update (this=0x69c650,
>>>>>> poll_timeout=100)
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_set.cpp:202
>>>>>> #2  0x00007ffff7906778 in ros::PollManager::threadFunc (this=0x69c650)
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_manager.cpp:95
>>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>>> pthread_create.c:304
>>>>>> #5  0x00007ffff535592d in clone () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>>> #6  0x0000000000000000 in ?? ()
>>>>>> (gdb) thread 3
>>>>>> [Switching to thread 3 (Thread 0x7ffff16ab700 (LWP 9629))]#0
>>>>>> 0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82
>>>>>> 82    ../sysdeps/unix/syscall-template.S: No such file or directory.
>>>>>>     in ../sysdeps/unix/syscall-template.S
>>>>>> (gdb) bt
>>>>>> #0  0x00007ffff534e2c3 in select () at
>>>>>> ../sysdeps/unix/syscall-template.S:82
>>>>>> #1  0x00007ffff723313a in XmlRpc::XmlRpcDispatch::work (this=0x69ac18,
>>>>>> timeout=<value optimized out>)
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/utilities/xmlrpcpp/src/XmlRpcDispatch.cpp:110
>>>>>> #2  0x00007ffff7903de9 in ros::XMLRPCManager::serverThreadFunc
>>>>>> (this=0x69abc0)
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/xmlrpc_manager.cpp:256
>>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>>> pthread_create.c:304
>>>>>> #5  0x00007ffff535592d in clone () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>>> #6  0x0000000000000000 in ?? ()
>>>>>> (gdb) thread 4
>>>>>> [Switching to thread 4 (Thread 0x7ffff0eaa700 (LWP 9630))]#0
>>>>>> pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>>> 162    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No
>>>>>> such file or directory.
>>>>>>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
>>>>>> (gdb) bt
>>>>>> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>>>>>> #1  0x00007ffff79603b3 in wait (this=<value optimized out>) at
>>>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:20
>>>>>> #2  ros::ROSOutAppender::logThread (this=<value optimized out>)
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/rosout_appender.cpp:135
>>>>>> #3  0x00007ffff6df5230 in thread_proxy () from
>>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>>> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>>> pthread_create.c:304
>>>>>> #5  0x00007ffff535592d in clone () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>>> #6  0x0000000000000000 in ?? ()
>>>>>> (gdb) thread 5
>>>>>> [Switching to thread 5 (Thread 0x7ffff06a9700 (LWP 9635))]#0
>>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>> 212
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such
>>>>>> file or directory.
>>>>>>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
>>>>>> (gdb) bt
>>>>>> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
>>>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>>>>>> #1  0x00007ffff796885d in bool
>>>>>> boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration,
>>>>>> 1000000l> >(boost::unique_lock<boost::mutex>&,
>>>>>> boost::date_time::subsecond_duration<boost::posix_time::time_duration,
>>>>>> 1000000l> const&) () from
>>>>>> /opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/lib/libros.so
>>>>>> #2  0x00007ffff7964e99 in ros::CallbackQueue::callAvailable
>>>>>> (this=0x69c410, timeout=DWARF-2 expression error: DW_OP_reg operations must
>>>>>> be used either alone or in conjuction with DW_OP_piece or DW_OP_bit_piece.
>>>>>> )
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/callback_queue.cpp:307
>>>>>> #3  0x00007ffff78f22e1 in ros::internalCallbackQueueThreadFunc ()
>>>>>>     at
>>>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/init.cpp:265
>>>>>> #4  0x00007ffff6df5230 in thread_proxy () from
>>>>>> /usr/lib/libboost_thread.so.1.42.0
>>>>>> #5  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
>>>>>> pthread_create.c:304
>>>>>> #6  0x00007ffff535592d in clone () at
>>>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
>>>>>> #7  0x0000000000000000 in ?? ()
>>>>>> (gdb)
>>>>>> --- snip ---
>>>>>>
>>>>>> Also, I'm sorry to hear you had to create a new package to compile the
>>>>>> program. As a sanity check, I also created a new package and copied over the
>>>>>> cpp and action files, but this did not change the result I was having, as
>>>>>> expected. (I've attached the package, for reference.)
>>>>>>
>>>>>> I really appreciate the help, thank you.
>>>>>>
>>>>>> -Ryan
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 1, 2011 at 2:03 PM, Vijay Pradeep <
>>>>>> vpradeep at willowgarage.com> wrote:
>>>>>> > Hi Ryan,
>>>>>> >
>>>>>> > Here's one more thing to try.  Turn on rosconsole debugging (See
>>>>>> rosconole
>>>>>> > configuration)
>>>>>> >
>>>>>> > You're going to want the following two lines in your
>>>>>> rosonsole.config:
>>>>>> > log4j.logger.ros.actionlib=DEBUG
>>>>>> > log4j.logger.ros.chores=DEBUG
>>>>>> >
>>>>>> >
>>>>>> > You'll get an output that looks like this:
>>>>>> > [DEBUG] [1301680806.580203868]: Setting the current goal as aborted
>>>>>> > [DEBUG] [1301680806.580254603]: Setting status to aborted on goal,
>>>>>> id:
>>>>>> > /test_client_server-1118-1301680806.579359151, stamp: 1301680806.58
>>>>>> > [DEBUG] [1301680806.580301868]: Publishing result for goal with id:
>>>>>> > /test_client_server-1118-1301680806.579359151 and stamp:
>>>>>> 1301680806.58
>>>>>> > [DEBUG] [1301680806.580439068]: Getting status over the wire.
>>>>>> > [DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE
>>>>>> > [DEBUG] [1301680806.580553872]: Transitioning CommState from
>>>>>> > WAITING_FOR_GOAL_ACK to ACTIVE
>>>>>> > [DEBUG] [1301680806.580603042]: Transitioning SimpleState from
>>>>>> [PENDING] to
>>>>>> > [ACTIVE]
>>>>>> > Active.
>>>>>> > [DEBUG] [1301680806.580934461]: Trying to transition to
>>>>>> WAITING_FOR_RESULT
>>>>>> > [DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE
>>>>>> to
>>>>>> > WAITING_FOR_RESULT
>>>>>> > [DEBUG] [1301680806.581051376]: Trying to transition to DONE
>>>>>> > [DEBUG] [1301680806.581108948]: Transitioning CommState from
>>>>>> > WAITING_FOR_RESULT to DONE
>>>>>> > [DEBUG] [1301680806.581164618]: Transitioning SimpleState from
>>>>>> [ACTIVE] to
>>>>>> > [DONE]
>>>>>> > Done.
>>>>>> > [ INFO] [1301680806.581243326]: Hogging time or something
>>>>>> > Current State: ABORTED
>>>>>> > [DEBUG] [1301680806.581302953]: IN DELETER
>>>>>> > [DEBUG] [1301680806.581339010]: About to erase CommStateMachine
>>>>>> > [DEBUG] [1301680806.581377433]: Done erasing CommStateMachine
>>>>>> > [DEBUG] [1301680806.581414297]: about to start initGoal()
>>>>>> > [DEBUG] [1301680806.581489440]: Done with initGoal()
>>>>>> > [DEBUG] [1301680806.581544720]: The action server has received a new
>>>>>> goal
>>>>>> > request
>>>>>> > [DEBUG] [1301680806.581638097]: A new goal has been recieved by the
>>>>>> single
>>>>>> > goal action server
>>>>>> > [DEBUG] [1301680806.581705716]: Accepting a new goal
>>>>>> > [DEBUG] [1301680806.581758609]: Accepting goal, id:
>>>>>> > /test_client_server-1119-1301680806.581440270, stamp: 1301680806.58
>>>>>> > Server called. Aborting goal.
>>>>>> >
>>>>>> > Can send me the debug output for the last few cycles when your app
>>>>>> freezes?
>>>>>> >
>>>>>> > Vijay
>>>>>> >
>>>>>> > On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <
>>>>>> vpradeep at willowgarage.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> Hi Ryan,
>>>>>> >>
>>>>>> >> So I actually ran your original chores app overnight (~10 hours),
>>>>>> and it
>>>>>> >> still didn't freeze.
>>>>>> >>
>>>>>> >> Also, right now, I tried running your new chores app about 30 times
>>>>>> (for
>>>>>> >> about 5 seconds on each run), and it still hadn't frozen.  I
>>>>>> couldn't get
>>>>>> >> the app to build using your CMakeLists and Makefile, so I rewrote
>>>>>> them based
>>>>>> >> on the standard "roscreate-pkg" versions of these files. I doubt
>>>>>> this would
>>>>>> >> affect the functionality.
>>>>>> >>
>>>>>> >> The fact that it freezes close to startup is interesting.  My guess
>>>>>> would
>>>>>> >> be that it is having some sort of startup interaction with the
>>>>>> AsyncSpinner.
>>>>>> >>
>>>>>> >> Instead of:
>>>>>> >> int main(int argc, char** argv) {
>>>>>> >>   ros::init(argc, argv, "test_client_server");
>>>>>> >>
>>>>>> >>   ros::AsyncSpinner spinner(4);
>>>>>> >>   spinner.start();
>>>>>> >>
>>>>>> >>   ServerDude sd;
>>>>>> >>
>>>>>> >>   ClientGuy client;
>>>>>> >>   client.run();
>>>>>> >>
>>>>>> >>   return 0;
>>>>>> >> }
>>>>>> >>
>>>>>> >> Can you try:
>>>>>> >> int main(int argc, char** argv) {
>>>>>> >>   ros::init(argc, argv, "test_client_server");
>>>>>> >>
>>>>>> >>   ros::AsyncSpinner spinner(4);
>>>>>> >>
>>>>>> >>   ServerDude sd;
>>>>>> >>
>>>>>> >>   ClientGuy client;
>>>>>> >>   spinner.start();
>>>>>> >>   client.run();
>>>>>> >>
>>>>>> >>   return 0;
>>>>>> >> }
>>>>>> >>
>>>>>> >> I would really like to get to the bottom of this, but it's hard to
>>>>>> do much
>>>>>> >> on my end if I can't actually get it to halt.
>>>>>> >>
>>>>>> >> Vijay
>>>>>> >>
>>>>>> >> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> Hey Vijay, thanks for helping (and sorry for the double post; I
>>>>>> >>> accidentally forgot to reply all, and I figure some people on the
>>>>>> list
>>>>>> >>> may be interested in the email). Yes, I did mean waitForResult,
>>>>>> and I
>>>>>> >>> am also using common-1.4.3 (diamondback too, if that's relevant).
>>>>>> >>> Since you're having trouble duplicating the issue, I went back to
>>>>>> >>> create an example closer to what I'm actually doing, and I managed
>>>>>> to
>>>>>> >>> get it to crash just as often. I've attached the project.
>>>>>> >>>
>>>>>> >>> If it doesn't crash after about 5 seconds, restart it. It seems if
>>>>>> it
>>>>>> >>> continues for more than 5 seconds, it is more likely to continue
>>>>>> >>> running for much longer, but it usually halts within a second.
>>>>>> >>>
>>>>>> >>> Thanks a ton.
>>>>>> >>>
>>>>>> >>> -Ryan
>>>>>> >>>
>>>>>> >>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
>>>>>> >>> <vpradeep at willowgarage.com> wrote:
>>>>>> >>> > Hi Ryan,
>>>>>> >>> >
>>>>>> >>> > I'm sorry that you're having trouble getting actionlib to work.
>>>>>> What
>>>>>> >>> > version of the common stack do you have?
>>>>>> >>> >
>>>>>> >>> > Note that it is always possible for some of the ROS messages to
>>>>>> be
>>>>>> >>> > dropped.
>>>>>> >>> > If the result message doesn't make it to the action client, then
>>>>>> >>> > waitForResult is going to block forever.  I'd suggest adding a
>>>>>> timeout
>>>>>> >>> > on
>>>>>> >>> > waitForServer, and then preempting the goal if you end up
>>>>>> waiting too
>>>>>> >>> > long.
>>>>>> >>> >
>>>>>> >>> >> I made each callback print when
>>>>>> >>> >> it was called, and I found that quite often, the program would
>>>>>> block
>>>>>> >>> >> forever when I called waitForServer.
>>>>>> >>> > I'm guessing you meant "the program would block forever when I
>>>>>> called
>>>>>> >>> > waitForResult"
>>>>>> >>> >
>>>>>> >>> > I'm using common-1.4.3, and after running your example for ~10
>>>>>> minutes,
>>>>>> >>> > it
>>>>>> >>> > doesn't seem to freeze for me.  I can rerun this test with the
>>>>>> exact
>>>>>> >>> > version
>>>>>> >>> > of common that you're using.  Is there anything you can do to
>>>>>> make this
>>>>>> >>> > minimal example freeze more often? Are there some strategic
>>>>>> sleeps you
>>>>>> >>> > could
>>>>>> >>> > add to make it mimic your original app more closely?
>>>>>> >>> >
>>>>>> >>> > If you can get the chores app to freeze, you could try attaching
>>>>>> gdb to
>>>>>> >>> > the
>>>>>> >>> > process and seeing where all the threads are (using the "info
>>>>>> threads"
>>>>>> >>> > command inside gdb).  A bunch of them will be stuck on
>>>>>> >>> > pthread_cond_timedwait calls, but I'd be curious if there's a
>>>>>> thread
>>>>>> >>> > stuck
>>>>>> >>> > on a lock inside of actionlib.  That would be indicative of a
>>>>>> race
>>>>>> >>> > condition
>>>>>> >>> > in actionlib.
>>>>>> >>> >
>>>>>> >>> > Vijay
>>>>>> >>> >
>>>>>> >>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <
>>>>>> rmiller4589 at gmail.com>
>>>>>> >>> > wrote:
>>>>>> >>> >>
>>>>>> >>> >> Because of a timing issue I can't quite pin down, actionlib
>>>>>> >>> >> occasionally appears to exhibit a race condition. I found the
>>>>>> problem
>>>>>> >>> >> after adding callbacks to my client. I made each callback print
>>>>>> when
>>>>>> >>> >> it was called, and I found that quite often, the program would
>>>>>> block
>>>>>> >>> >> forever when I called waitForServer. The problem was that the
>>>>>> client's
>>>>>> >>> >> active callback was called but the server's execute method was
>>>>>> never
>>>>>> >>> >> called.
>>>>>> >>> >>
>>>>>> >>> >> I have reduced the problem into a simple ROS package with a
>>>>>> single
>>>>>> >>> >> executable that I have attached. After running the node for a
>>>>>> while, I
>>>>>> >>> >> finally noticed the same problem. It's last output before
>>>>>> blocking
>>>>>> >>> >> was:
>>>>>> >>> >>
>>>>>> >>> >> --- snip ---
>>>>>> >>> >> Current State: ABORTED
>>>>>> >>> >> Aborting.
>>>>>> >>> >> Active.
>>>>>> >>> >> Done.
>>>>>> >>> >> Current State: ABORTED
>>>>>> >>> >> Aborting.
>>>>>> >>> >> Active.
>>>>>> >>> >> --- snip ---
>>>>>> >>> >>
>>>>>> >>> >> In my actual code, the condition happens extremely frequently,
>>>>>> but I
>>>>>> >>> >> found I could mitigate the problem by sleeping for one
>>>>>> millisecond
>>>>>> >>> >> before returning from the server's execute method. (I should
>>>>>> warn you
>>>>>> >>> >> that in the attached example, the problem almost never occurs).
>>>>>> >>> >>
>>>>>> >>> >> Is this likely a bug, or might I doing something wrong? Any
>>>>>> >>> >> suggestions would be appreciated. Thanks for the help.
>>>>>> >>> >>
>>>>>> >>> >> -Ryan
>>>>>> >>> >>
>>>>>> >>> >> _______________________________________________
>>>>>> >>> >> ros-users mailing list
>>>>>> >>> >> ros-users at code.ros.org
>>>>>> >>> >> https://code.ros.org/mailman/listinfo/ros-users
>>>>>> >>> >>
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > --
>>>>>> >>> > Vijay Pradeep
>>>>>> >>> > Systems Engineer
>>>>>> >>> > Willow Garage, Inc.
>>>>>> >>> > vpradeep at willowgarage.com
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>
>>>>>> >>
>>>>>> >>
>>>>>> >> --
>>>>>> >> Vijay Pradeep
>>>>>> >> Systems Engineer
>>>>>> >> Willow Garage, Inc.
>>>>>> >> vpradeep at willowgarage.com
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Vijay Pradeep
>>>>>> > Systems Engineer
>>>>>> > Willow Garage, Inc.
>>>>>> > vpradeep at willowgarage.com
>>>>>> >
>>>>>> >
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Vijay Pradeep
>>>>> Systems Engineer
>>>>> Willow Garage, Inc.
>>>>>  <tfoote at willowgarage.com>vpradeep at willowgarage.com
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Vijay Pradeep
>> Systems Engineer
>> Willow Garage, Inc.
>>  <tfoote at willowgarage.com>vpradeep at willowgarage.com
>>
>>
>
> _______________________________________________
> ros-users mailing list
> ros-users at code.ros.org
> https://code.ros.org/mailman/listinfo/ros-users
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110403/6c5e30ce/attachment-0003.html>


More information about the ros-users mailing list