[ros-users] Race condition in actionlib

Vijay Pradeep vpradeep at willowgarage.com
Fri Apr 1 20:54:56 UTC 2011


Hi Ryan,

Ok, I think we're actually getting somewhere.....

> When I move the called to spinner.start() to where you suggested, the
client gets stuck on its call to waitForServer as follows:
Totally my bad.  What I meant for you to do was the following:
1) Construct client & server
2) Spin up threads
3) waitForServer
4) Start sending goals

To do this, you'll have to move spinner.start() to where I suggested, and
also move client_.waitForServer() to the top of run().  Currently, the app
is trying to waitForServer without any spin threads running, which is
causing it to get stuck.  Sorry about that.  I'd also suggest trying to run
with 1 spinner instead of 4, and see if that helps.

Thanks for the gdb thread info.  It looks like all the threads we're worried
about are happily waiting for inputs in pthread conditions.  This makes it
seem even more like a message was being dropped.

Also, I finally got it to freeze on my side, *with* debug on.  See output:

Done.
[ INFO] [1301690082.884716841]: Hogging time or something
Current State: ABORTED
[DEBUG] [1301690082.884784275]: IN DELETER
[DEBUG] [1301690082.884838215]: About to erase CommStateMachine
[DEBUG] [1301690082.884891598]: Done erasing CommStateMachine
[DEBUG] [1301690082.884948750]: about to start initGoal()
[DEBUG] [1301690082.885037423]: The action server has received a new goal
request
[DEBUG] [1301690082.885152395]: A new goal has been recieved by the single
goal action server
[DEBUG] [1301690082.885263065]: Accepting a new goal
[DEBUG] [1301690082.885330292]: Accepting goal, id:
/test_client_server-281-1301690082.884983820, stamp: 1301690082.88
Server called. Aborting goal.
[DEBUG] [1301690082.885555049]: Setting the current goal as aborted
[DEBUG] [1301690082.885618287]: Setting status to aborted on goal, id:
/test_client_server-281-1301690082.884983820, stamp: 1301690082.88
[DEBUG] [1301690082.885675967]: Publishing result for goal with id:
/test_client_server-281-1301690082.884983820 and stamp: 1301690082.88
[DEBUG] [1301690082.885743824]: Getting status over the wire.
[DEBUG] [1301690082.885956181]: Done with initGoal()
[DEBUG] [1301690082.887905041]: Getting status over the wire.
[DEBUG] [1301690082.888000669]: Trying to transition to ACTIVE
[DEBUG] [1301690082.888064094]: Transitioning CommState from
WAITING_FOR_GOAL_ACK to ACTIVE
[DEBUG] [1301690082.888130878]: Transitioning SimpleState from [PENDING] to
[ACTIVE]
Active.
[DEBUG] [1301690082.888203846]: Trying to transition to WAITING_FOR_RESULT
*[DEBUG] [1301690082.888262411]: Transitioning CommState from ACTIVE to
WAITING_FOR_RESULT*
[DEBUG] [1301690083.088191441]: Getting status over the wire.
[DEBUG] [1301690083.288523198]: Getting status over the wire.
[DEBUG] [1301690083.488836856]: Getting status over the wire.
[DEBUG] [1301690083.688115542]: Getting status over the wire.
[DEBUG] [1301690083.888460218]: Getting status over the wire.
[DEBUG] [1301690084.088794991]: Getting status over the wire.
[DEBUG] [1301690084.288137860]: Getting status over the wire.
[DEBUG] [1301690084.488456451]: Getting status over the wire.
[DEBUG] [1301690084.688742829]: Getting status over the wire.
[DEBUG] [1301690084.888085527]: Getting status over the wire.
[DEBUG] [1301690085.088418107]: Getting status over the wire.

On the client side, the goal transitioned to WAITING_FOR_RESULT, which means
that the server successfully set the goal to a terminal state (in this case,
ABORTED).  However, for some reason, the client never gets the result
message.  With so many messages flying across so quickly, I guess one of the
result messages happens to get dropped.  This also explains why adding a
sleep in the server callback stopped the freezing.  The sleep significantly
slowed down the messaging rate, preventing the result message from being
dropped.

It is definitely best practice to always add a timeout to waitForResult().
I've updated the actionlib quickstart
guide<http://www.ros.org/wiki/actionlib#C.2B-.2B-_SimpleActionClient>accordingly.

And, please let me know if any of the spinner tweaks I suggested make a
difference.

Vijay

On Fri, Apr 1, 2011 at 1:07 PM, Ryan Miller <rmiller4589 at gmail.com> wrote:

> Hi Vijay,
>
> When I move the called to spinner.start() to where you suggested, the
> client gets stuck on its call to waitForServer as follows:
>
> --- snip ---
> $ rosrun chores chores
> [DEBUG] [1301684841.767308639]: Spinning up a thread for the
> SimpleActionClient
> [DEBUG] [1301684841.794376567]: goalConnectCallback: Adding
> [/test_client_server] to goalSubscribers
> [DEBUG] [1301684841.794470922]: Goal Subscribers (1 total)
>   - /test_client_server
> [DEBUG] [1301684841.800129524]: cancelConnectCallback: Adding
> [/test_client_server] to cancelSubscribers
> [DEBUG] [1301684841.800199855]: cancel Subscribers (1 total)
>   - /test_client_server
> [DEBUG] [1301684841.809336199]: isServerConnected: Didn't receive status
> yet, so not connected yet
> [DEBUG] [1301684841.809441030]: isServerConnected: Didn't receive status
> yet, so not connected yet
> [DEBUG] [1301684842.309665431]: isServerConnected: Didn't receive status
> yet, so not connected yet
> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive status
> yet, so not connected yet
> ...
> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive status
> yet, so not connected yet
> --- snip ---
>
> Also, it appears I have a heisenbug on my hands because when I turn on
> rosconsole debugging, the problem seemingly disappears, but when I keep it
> off, it crashes on almost every run. By the way, the single call to ROS_INFO
> I placed in the code is critical to making the program halt more often.
> Remove that, and you'll have more trouble getting the program to halt.
>
> Is there any other information that might help you replicate this problem?
> For now, I'll just throw some facts here about the computer I'm using. I am
> running Ubuntu 10.10 64-bit. I have replicated the problem on a relatively
> slow laptop and two pretty fast desktops. The specific package information
> for the version of ros-diamondback-common I am using is:
>
> --- snip ---
> $ dpkg -s ros-diamondback-common
> Package: ros-diamondback-common
> Status: install ok installed
> Priority: optional
> Section: unknown
> Installed-Size: 34224
> Maintainer: Tully Foote tfoote at willowgarage.com
> Architecture: amd64
> Version: 1.4.3-s1299027665~maverick
> Depends: ros-diamondback-common-msgs (= 1.4.0-s1299027564~maverick),
> ros-diamondback-ros (= 1.4.6-s1298587732~maverick), ros-diamondback-ros-comm
> (= 1.4.5-s1299027222~maverick), libc6, build-essential, cmake, python-yaml,
> subversion, uuid-dev, libcppunit-dev
> Description: common code for personal robots
> This stack contains tools built on top of ROS core which are commonly used
> throughout the ROS ecosystem.
> Wg-Rosdistro: diamondback
> --- snip ---
>
> Here is some information from gdb from when the program halts. I've printed
> a list of the running threads as well as the backtraces of a few of the ones
> in different functions.
>
> --- snip ---
> Current State: ABORTED
> Server called. Aborting goal.
> Active.
> ^C
> Program received signal SIGINT, Interrupt.
> pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
> 212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
> such file or directory.
> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
> (gdb) info thread
>  12 Thread 0x7fffecea2700 (LWP 9649)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  11 Thread 0x7fffed6a3700 (LWP 9646)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  10 Thread 0x7fffedea4700 (LWP 9640)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  9 Thread 0x7fffee6a5700 (LWP 9639)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  8 Thread 0x7fffeeea6700 (LWP 9638)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  7 Thread 0x7fffef6a7700 (LWP 9637)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  6 Thread 0x7fffefea8700 (LWP 9636)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  5 Thread 0x7ffff06a9700 (LWP 9635)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
>  4 Thread 0x7ffff0eaa700 (LWP 9630)  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
>  3 Thread 0x7ffff16ab700 (LWP 9629)  0x00007ffff534e2c3 in select () at
> ../sysdeps/unix/syscall-template.S:82
>  2 Thread 0x7ffff1eac700 (LWP 9628)  0x00007ffff5349203 in __poll
> (fds=<value optimized out>, nfds=<value optimized out>, timeout=100) at
> ../sysdeps/unix/sysv/linux/poll.c:87
> * 1 Thread 0x7ffff7fcd760 (LWP 9609)  pthread_cond_timedwait@@GLIBC_2.3.2
> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
> (gdb) bt
> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
> #1  0x000000000044c698 in timed_wait<boost::unique_lock<boost::mutex> >
> (this=0x7fffffffe110, m=..., wait_duration=<value optimized out>) at
> /usr/include/boost/thread/pthread/condition_variable.hpp:110
> #2
>  boost::condition_variable_any::timed_wait<boost::unique_lock<boost::mutex>,
> boost::posix_time::milliseconds> (this=0x7fffffffe110, m=...,
> wait_duration=<value optimized out>)
>    at /usr/include/boost/thread/pthread/condition_variable.hpp:133
> #3  0x000000000044d355 in
> actionlib::SimpleActionClient<chores::DoDishesAction_<std::allocator<void> >
> >::waitForResult (this=0x7fffffffe040, timeout=...)
>    at
> /opt/ros/diamondback/stacks/common/actionlib/include/actionlib/client/simple_action_client.h:567
> #4  0x00000000004583f6 in ClientGuy::run (this=0x7fffffffe040) at
> /home/pesckal/prometheus/ros/chores/src/main.cpp:50
> #5  0x000000000042d8d5 in main (argc=1, argv=<value optimized out>) at
> /home/pesckal/prometheus/ros/chores/src/main.cpp:87
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7ffff1eac700 (LWP 9628))]#0
> 0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
> optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
> 87    ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
>     in ../sysdeps/unix/sysv/linux/poll.c
> (gdb) bt
> #0  0x00007ffff5349203 in __poll (fds=<value optimized out>, nfds=<value
> optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87
> #1  0x00007ffff78c169a in ros::PollSet::update (this=0x69c650,
> poll_timeout=100)
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_set.cpp:202
> #2  0x00007ffff7906778 in ros::PollManager::threadFunc (this=0x69c650)
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_manager.cpp:95
> #3  0x00007ffff6df5230 in thread_proxy () from
> /usr/lib/libboost_thread.so.1.42.0
> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
> pthread_create.c:304
> #5  0x00007ffff535592d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 3
> [Switching to thread 3 (Thread 0x7ffff16ab700 (LWP 9629))]#0
> 0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82
> 82    ../sysdeps/unix/syscall-template.S: No such file or directory.
>     in ../sysdeps/unix/syscall-template.S
> (gdb) bt
> #0  0x00007ffff534e2c3 in select () at
> ../sysdeps/unix/syscall-template.S:82
> #1  0x00007ffff723313a in XmlRpc::XmlRpcDispatch::work (this=0x69ac18,
> timeout=<value optimized out>)
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/utilities/xmlrpcpp/src/XmlRpcDispatch.cpp:110
> #2  0x00007ffff7903de9 in ros::XMLRPCManager::serverThreadFunc
> (this=0x69abc0)
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/xmlrpc_manager.cpp:256
> #3  0x00007ffff6df5230 in thread_proxy () from
> /usr/lib/libboost_thread.so.1.42.0
> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
> pthread_create.c:304
> #5  0x00007ffff535592d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 4
> [Switching to thread 4 (Thread 0x7ffff0eaa700 (LWP 9630))]#0
> pthread_cond_wait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> 162    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such
> file or directory.
>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
> (gdb) bt
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
> #1  0x00007ffff79603b3 in wait (this=<value optimized out>) at
> /usr/include/boost/thread/pthread/condition_variable.hpp:20
> #2  ros::ROSOutAppender::logThread (this=<value optimized out>)
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/rosout_appender.cpp:135
> #3  0x00007ffff6df5230 in thread_proxy () from
> /usr/lib/libboost_thread.so.1.42.0
> #4  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
> pthread_create.c:304
> #5  0x00007ffff535592d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #6  0x0000000000000000 in ?? ()
> (gdb) thread 5
> [Switching to thread 5 (Thread 0x7ffff06a9700 (LWP 9635))]#0
> pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
> 212    ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No
> such file or directory.
>     in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S
> (gdb) bt
> #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at
> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212
> #1  0x00007ffff796885d in bool
> boost::condition_variable::timed_wait<boost::date_time::subsecond_duration<boost::posix_time::time_duration,
> 1000000l> >(boost::unique_lock<boost::mutex>&,
> boost::date_time::subsecond_duration<boost::posix_time::time_duration,
> 1000000l> const&) () from
> /opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/lib/libros.so
> #2  0x00007ffff7964e99 in ros::CallbackQueue::callAvailable (this=0x69c410,
> timeout=DWARF-2 expression error: DW_OP_reg operations must be used either
> alone or in conjuction with DW_OP_piece or DW_OP_bit_piece.
> )
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/callback_queue.cpp:307
> #3  0x00007ffff78f22e1 in ros::internalCallbackQueueThreadFunc ()
>     at
> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/init.cpp:265
> #4  0x00007ffff6df5230 in thread_proxy () from
> /usr/lib/libboost_thread.so.1.42.0
> #5  0x00007ffff55f9971 in start_thread (arg=<value optimized out>) at
> pthread_create.c:304
> #6  0x00007ffff535592d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
> #7  0x0000000000000000 in ?? ()
> (gdb)
> --- snip ---
>
> Also, I'm sorry to hear you had to create a new package to compile the
> program. As a sanity check, I also created a new package and copied over the
> cpp and action files, but this did not change the result I was having, as
> expected. (I've attached the package, for reference.)
>
> I really appreciate the help, thank you.
>
> -Ryan
>
>
> On Fri, Apr 1, 2011 at 2:03 PM, Vijay Pradeep <vpradeep at willowgarage.com>
> wrote:
> > Hi Ryan,
> >
> > Here's one more thing to try.  Turn on rosconsole debugging (See
> rosconole
> > configuration)
> >
> > You're going to want the following two lines in your rosonsole.config:
> > log4j.logger.ros.actionlib=DEBUG
> > log4j.logger.ros.chores=DEBUG
> >
> >
> > You'll get an output that looks like this:
> > [DEBUG] [1301680806.580203868]: Setting the current goal as aborted
> > [DEBUG] [1301680806.580254603]: Setting status to aborted on goal, id:
> > /test_client_server-1118-1301680806.579359151, stamp: 1301680806.58
> > [DEBUG] [1301680806.580301868]: Publishing result for goal with id:
> > /test_client_server-1118-1301680806.579359151 and stamp: 1301680806.58
> > [DEBUG] [1301680806.580439068]: Getting status over the wire.
> > [DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE
> > [DEBUG] [1301680806.580553872]: Transitioning CommState from
> > WAITING_FOR_GOAL_ACK to ACTIVE
> > [DEBUG] [1301680806.580603042]: Transitioning SimpleState from [PENDING]
> to
> > [ACTIVE]
> > Active.
> > [DEBUG] [1301680806.580934461]: Trying to transition to
> WAITING_FOR_RESULT
> > [DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE to
> > WAITING_FOR_RESULT
> > [DEBUG] [1301680806.581051376]: Trying to transition to DONE
> > [DEBUG] [1301680806.581108948]: Transitioning CommState from
> > WAITING_FOR_RESULT to DONE
> > [DEBUG] [1301680806.581164618]: Transitioning SimpleState from [ACTIVE]
> to
> > [DONE]
> > Done.
> > [ INFO] [1301680806.581243326]: Hogging time or something
> > Current State: ABORTED
> > [DEBUG] [1301680806.581302953]: IN DELETER
> > [DEBUG] [1301680806.581339010]: About to erase CommStateMachine
> > [DEBUG] [1301680806.581377433]: Done erasing CommStateMachine
> > [DEBUG] [1301680806.581414297]: about to start initGoal()
> > [DEBUG] [1301680806.581489440]: Done with initGoal()
> > [DEBUG] [1301680806.581544720]: The action server has received a new goal
> > request
> > [DEBUG] [1301680806.581638097]: A new goal has been recieved by the
> single
> > goal action server
> > [DEBUG] [1301680806.581705716]: Accepting a new goal
> > [DEBUG] [1301680806.581758609]: Accepting goal, id:
> > /test_client_server-1119-1301680806.581440270, stamp: 1301680806.58
> > Server called. Aborting goal.
> >
> > Can send me the debug output for the last few cycles when your app
> freezes?
> >
> > Vijay
> >
> > On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep <
> vpradeep at willowgarage.com>
> > wrote:
> >>
> >> Hi Ryan,
> >>
> >> So I actually ran your original chores app overnight (~10 hours), and it
> >> still didn't freeze.
> >>
> >> Also, right now, I tried running your new chores app about 30 times (for
> >> about 5 seconds on each run), and it still hadn't frozen.  I couldn't
> get
> >> the app to build using your CMakeLists and Makefile, so I rewrote them
> based
> >> on the standard "roscreate-pkg" versions of these files. I doubt this
> would
> >> affect the functionality.
> >>
> >> The fact that it freezes close to startup is interesting.  My guess
> would
> >> be that it is having some sort of startup interaction with the
> AsyncSpinner.
> >>
> >> Instead of:
> >> int main(int argc, char** argv) {
> >>   ros::init(argc, argv, "test_client_server");
> >>
> >>   ros::AsyncSpinner spinner(4);
> >>   spinner.start();
> >>
> >>   ServerDude sd;
> >>
> >>   ClientGuy client;
> >>   client.run();
> >>
> >>   return 0;
> >> }
> >>
> >> Can you try:
> >> int main(int argc, char** argv) {
> >>   ros::init(argc, argv, "test_client_server");
> >>
> >>   ros::AsyncSpinner spinner(4);
> >>
> >>   ServerDude sd;
> >>
> >>   ClientGuy client;
> >>   spinner.start();
> >>   client.run();
> >>
> >>   return 0;
> >> }
> >>
> >> I would really like to get to the bottom of this, but it's hard to do
> much
> >> on my end if I can't actually get it to halt.
> >>
> >> Vijay
> >>
> >> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller <rmiller4589 at gmail.com>
> wrote:
> >>>
> >>> Hey Vijay, thanks for helping (and sorry for the double post; I
> >>> accidentally forgot to reply all, and I figure some people on the list
> >>> may be interested in the email). Yes, I did mean waitForResult, and I
> >>> am also using common-1.4.3 (diamondback too, if that's relevant).
> >>> Since you're having trouble duplicating the issue, I went back to
> >>> create an example closer to what I'm actually doing, and I managed to
> >>> get it to crash just as often. I've attached the project.
> >>>
> >>> If it doesn't crash after about 5 seconds, restart it. It seems if it
> >>> continues for more than 5 seconds, it is more likely to continue
> >>> running for much longer, but it usually halts within a second.
> >>>
> >>> Thanks a ton.
> >>>
> >>> -Ryan
> >>>
> >>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep
> >>> <vpradeep at willowgarage.com> wrote:
> >>> > Hi Ryan,
> >>> >
> >>> > I'm sorry that you're having trouble getting actionlib to work.  What
> >>> > version of the common stack do you have?
> >>> >
> >>> > Note that it is always possible for some of the ROS messages to be
> >>> > dropped.
> >>> > If the result message doesn't make it to the action client, then
> >>> > waitForResult is going to block forever.  I'd suggest adding a
> timeout
> >>> > on
> >>> > waitForServer, and then preempting the goal if you end up waiting too
> >>> > long.
> >>> >
> >>> >> I made each callback print when
> >>> >> it was called, and I found that quite often, the program would block
> >>> >> forever when I called waitForServer.
> >>> > I'm guessing you meant "the program would block forever when I called
> >>> > waitForResult"
> >>> >
> >>> > I'm using common-1.4.3, and after running your example for ~10
> minutes,
> >>> > it
> >>> > doesn't seem to freeze for me.  I can rerun this test with the exact
> >>> > version
> >>> > of common that you're using.  Is there anything you can do to make
> this
> >>> > minimal example freeze more often? Are there some strategic sleeps
> you
> >>> > could
> >>> > add to make it mimic your original app more closely?
> >>> >
> >>> > If you can get the chores app to freeze, you could try attaching gdb
> to
> >>> > the
> >>> > process and seeing where all the threads are (using the "info
> threads"
> >>> > command inside gdb).  A bunch of them will be stuck on
> >>> > pthread_cond_timedwait calls, but I'd be curious if there's a thread
> >>> > stuck
> >>> > on a lock inside of actionlib.  That would be indicative of a race
> >>> > condition
> >>> > in actionlib.
> >>> >
> >>> > Vijay
> >>> >
> >>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller <rmiller4589 at gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> Because of a timing issue I can't quite pin down, actionlib
> >>> >> occasionally appears to exhibit a race condition. I found the
> problem
> >>> >> after adding callbacks to my client. I made each callback print when
> >>> >> it was called, and I found that quite often, the program would block
> >>> >> forever when I called waitForServer. The problem was that the
> client's
> >>> >> active callback was called but the server's execute method was never
> >>> >> called.
> >>> >>
> >>> >> I have reduced the problem into a simple ROS package with a single
> >>> >> executable that I have attached. After running the node for a while,
> I
> >>> >> finally noticed the same problem. It's last output before blocking
> >>> >> was:
> >>> >>
> >>> >> --- snip ---
> >>> >> Current State: ABORTED
> >>> >> Aborting.
> >>> >> Active.
> >>> >> Done.
> >>> >> Current State: ABORTED
> >>> >> Aborting.
> >>> >> Active.
> >>> >> --- snip ---
> >>> >>
> >>> >> In my actual code, the condition happens extremely frequently, but I
> >>> >> found I could mitigate the problem by sleeping for one millisecond
> >>> >> before returning from the server's execute method. (I should warn
> you
> >>> >> that in the attached example, the problem almost never occurs).
> >>> >>
> >>> >> Is this likely a bug, or might I doing something wrong? Any
> >>> >> suggestions would be appreciated. Thanks for the help.
> >>> >>
> >>> >> -Ryan
> >>> >>
> >>> >> _______________________________________________
> >>> >> ros-users mailing list
> >>> >> ros-users at code.ros.org
> >>> >> https://code.ros.org/mailman/listinfo/ros-users
> >>> >>
> >>> >
> >>> >
> >>> >
> >>> > --
> >>> > Vijay Pradeep
> >>> > Systems Engineer
> >>> > Willow Garage, Inc.
> >>> > vpradeep at willowgarage.com
> >>> >
> >>> >
> >>
> >>
> >>
> >> --
> >> Vijay Pradeep
> >> Systems Engineer
> >> Willow Garage, Inc.
> >> vpradeep at willowgarage.com
> >>
> >
> >
> >
> > --
> > Vijay Pradeep
> > Systems Engineer
> > Willow Garage, Inc.
> > vpradeep at willowgarage.com
> >
> >
>
>


-- 
Vijay Pradeep
Systems Engineer
Willow Garage, Inc.
 <tfoote at willowgarage.com>vpradeep at willowgarage.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ros.org/pipermail/ros-users/attachments/20110401/877655b3/attachment-0002.html>


More information about the ros-users mailing list