Thanks for the design tip. It's working much better now. My path planner is also a service for navigation goals, so I guess I should also send those out periodically. It would be a serious problem if my navigation goal sender didn't know the robot actually finished navigating to a goal. > Although TCP/IP is reliable, there are cases where ROS cannot send messages fast enough over the wire, so ROS decides to drop some messages. Wow! I think this should be written on a blinking red marquee somewhere. Is there any document on the ROS wiki where this, and maybe other interesting details about ROS, are written? I guess I'll be treating all ROS messages as unreliable from now on... Thank you again for all the help! It's been great :D -Ryan P.S. I still think it's crazy you guys built an unreliable data transfer protocol on top of TCP/IP... On Fri, Apr 1, 2011 at 6:30 PM, Vijay Pradeep wrote: > > Is it acceptable for messages to be dropped? I thought actionlib used > reliable data transfer. > It is acceptable for messages to be dropped. actionlib uses ROS messages > which use TCP/IP transport. Although TCP/IP is reliable, there are cases > where ROS cannot send messages fast enough over the wire, so ROS decides to > drop some messages. This means on a given topic, messages get dropped only > when there are even newer messages that need to be sent. > > In your case, it is totally reasonable for the goal to seem like it is not > getting canceled. As far as the server is concerned, it already completed > the goal, and has already sent the result. Thus, there is nothing to cancel. > This causes the client to indefinitely wait for the goal to terminate. > > I don't think your current approach is best for your application. You're > sending a new goal to the server only after the previous goal completes, > which means that the robot is probably going to briefly stop moving between > goals. To avoid this issue, you could continuously send goals at some rate, > and never wait for them to finish. In the server's execute loop, you can > see if a new goal is available and then accept it and start using the new > goal. If you ever need to stop due to a potential collision, you can always > send a cancel request for the latest goal. > > Does this seem to fit your need? There's also a chance that I'm > misunderstanding your application. > > Vijay > > > On Fri, Apr 1, 2011 at 3:04 PM, Ryan Miller wrote: > >> Sorry, forgot to attach the file. >> >> -Ryan >> >> >> On Fri, Apr 1, 2011 at 6:01 PM, Ryan Miller wrote: >> >>> Hi Vijay, >>> >>> A few questions then. Is it acceptable for messages to be dropped? I >>> thought actionlib used reliable data transfer. In my actual program, I am >>> depending on every message being sent, and I would rather not use delays and >>> risk my robot still locking up. >>> >>> To give you some perspective, I am using the client to plan a path to a >>> destination, and I am using the server to execute the motor controls to >>> drive on a path it is given. Using a timeout would be fine (it's what I'm >>> doing in my production code, actually). But with this new result, it seems I >>> should now cancel the goal any time waitForResult times out because I don't >>> know if the server is just taking a long time or if its message was dropped >>> and it's actually finished. What I was planning on doing was letting the >>> motion planner follow the path until it is in imminent danger of colliding >>> with something, in which case I would cancel the goal.
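A minimal sketch of the two patterns discussed above: Ryan's plan of bounding every waitForResult() with a timeout and cancelling when it expires, and Vijay's suggestion of streaming the latest goal at a fixed rate while the server's execute loop adopts newer goals as they arrive. The action type chores::NavigateAction, its goal fields, and the helpers latestPlannedGoal() and aboutToCollide() are hypothetical stand-ins, not names from the actual chores package in this thread; treat this as an illustration of the approach, not code from the attachments.

#include <ros/ros.h>
#include <actionlib/client/simple_action_client.h>
#include <actionlib/server/simple_action_server.h>
#include <chores/NavigateAction.h>   // hypothetical action definition

typedef actionlib::SimpleActionClient<chores::NavigateAction> Client;
typedef actionlib::SimpleActionServer<chores::NavigateAction> Server;

// Hypothetical hooks into the rest of the application.
chores::NavigateGoal latestPlannedGoal();
bool aboutToCollide();

// Pattern 1: treat the result message as droppable. Bound the wait and
// cancel if no terminal state arrives in time; the true outcome is unknown.
void sendOneGoal(Client& client)
{
  client.sendGoal(latestPlannedGoal());
  if (!client.waitForResult(ros::Duration(5.0)))
  {
    ROS_WARN("No result after 5s; state unknown, cancelling the goal");
    client.cancelGoal();   // harmless even if the server already finished
  }
}

// Pattern 2 (Vijay's suggestion): stream the newest goal at a fixed rate
// and never block on results at all.
void streamGoals(Client& client)
{
  ros::Rate rate(5.0);
  while (ros::ok())
  {
    client.sendGoal(latestPlannedGoal());   // a newer goal preempts the old one
    if (aboutToCollide())
    {
      client.cancelGoal();                  // explicit stop request
      break;
    }
    rate.sleep();
  }
}

// Matching server execute callback: keep driving, but switch to a newer
// goal whenever one shows up instead of finishing the old one first.
void execute(const chores::NavigateGoalConstPtr& initial, Server* server)
{
  chores::NavigateGoalConstPtr goal = initial;
  ros::Rate rate(20.0);
  while (ros::ok())
  {
    if (server->isPreemptRequested())
    {
      if (server->isNewGoalAvailable())
        goal = server->acceptNewGoal();   // a newer goal replaced the old one
      else
      {
        server->setPreempted();           // an explicit cancel arrived
        return;
      }
    }
    // ... command the motors toward the current goal here ...
    rate.sleep();
  }
  server->setAborted();                   // node is shutting down
}

The common thread in both patterns is that nothing ever blocks indefinitely on a single message that might have been dropped. Ryan's quoted email continues below.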
Unfortunately, I have >>> a new problem now where I cannot even cancel the goal. >>> >>> After timing out, the client always thinks its goal is active, and I am >>> unable to cancel the goal. The output I am getting from the attached example >>> is: >>> >>> --- snip --- >>> *[ INFO] [1301694170.491982098]: Sending goal >>> [ INFO] [1301694170.492052098]: Waiting for result... >>> [ INFO] [1301694170.492138988]: Server is aborting goal. >>> [ INFO] [1301694170.492193892]: After waiting for the result, the state >>> is ABORTED >>> [ INFO] [1301694170.492232340]: Sending goal >>> [ INFO] [1301694170.492279138]: Waiting for result... >>> [ INFO] [1301694170.492341676]: Server is aborting goal. >>> [ INFO] [1301694170.492388975]: After waiting for the result, the state >>> is ABORTED* >>> [ INFO] [1301694170.492443452]: Sending goal >>> [ INFO] [1301694170.492543467]: Server is aborting goal. >>> [ INFO] [1301694170.492620163]: Waiting for result... >>> [ INFO] [1301694171.492789064]: After waiting for the result, the state >>> is ACTIVE >>> [ INFO] [1301694171.492909135]: Timeout, cancelling the goal >>> *[ INFO] [1301694171.492953475]: After cancelling, the state is ACTIVE* >>> *[ INFO] [1301694171.492977808]: The server is not done. >>> [ INFO] [1301694172.493114619]: The server is not done. >>> [ INFO] [1301694173.493301543]: The server is not done. >>> [ INFO] [1301694174.493476758]: The server is not done.* >>> ^C >>> --- snip --- >>> (Normal behaviour is italicised. Unexpected behaviour is bold) >>> >>> Might this be another problem, or am I misusing SimpleActionClient? >>> >>> Thanks again, >>> >>> -Ryan >>> >>> >>> On Fri, Apr 1, 2011 at 4:54 PM, Vijay Pradeep >> > wrote: >>> >>>> Hi Ryan, >>>> >>>> Ok, I think we're actually getting somewhere..... >>>> >>>> >>>> > When I move the call to spinner.start() to where you suggested, the >>>> client gets stuck on its call to waitForServer as follows: >>>> Totally my bad. What I meant for you to do was the following: >>>> 1) Construct client & server >>>> 2) Spin up threads >>>> 3) waitForServer >>>> 4) Start sending goals >>>> >>>> To do this, you'll have to move spinner.start() to where I suggested, >>>> and also move client_.waitForServer() to the top of run(). Currently, >>>> the app is trying to waitForServer without any spin threads running, which >>>> is causing it to get stuck. Sorry about that. I'd also suggest trying to >>>> run with 1 spinner instead of 4, and see if that helps. >>>> >>>> Thanks for the gdb thread info. It looks like all the threads we're >>>> worried about are happily waiting for inputs in pthread conditions. This >>>> makes it seem even more like a message was being dropped. >>>> >>>> Also, I finally got it to freeze on my side, *with* debug on. See >>>> output: >>>> >>>> Done. >>>> [ INFO] [1301690082.884716841]: Hogging time or something >>>> Current State: ABORTED >>>> [DEBUG] [1301690082.884784275]: IN DELETER >>>> [DEBUG] [1301690082.884838215]: About to erase CommStateMachine >>>> [DEBUG] [1301690082.884891598]: Done erasing CommStateMachine >>>> [DEBUG] [1301690082.884948750]: about to start initGoal() >>>> [DEBUG] [1301690082.885037423]: The action server has received a new >>>> goal request >>>> [DEBUG] [1301690082.885152395]: A new goal has been recieved by the >>>> single goal action server >>>> [DEBUG] [1301690082.885263065]: Accepting a new goal >>>> [DEBUG] [1301690082.885330292]: Accepting goal, id: >>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88 >>>> Server called.
Aborting goal. >>>> [DEBUG] [1301690082.885555049]: Setting the current goal as aborted >>>> [DEBUG] [1301690082.885618287]: Setting status to aborted on goal, id: >>>> /test_client_server-281-1301690082.884983820, stamp: 1301690082.88 >>>> [DEBUG] [1301690082.885675967]: Publishing result for goal with id: >>>> /test_client_server-281-1301690082.884983820 and stamp: 1301690082.88 >>>> [DEBUG] [1301690082.885743824]: Getting status over the wire. >>>> [DEBUG] [1301690082.885956181]: Done with initGoal() >>>> [DEBUG] [1301690082.887905041]: Getting status over the wire. >>>> [DEBUG] [1301690082.888000669]: Trying to transition to ACTIVE >>>> [DEBUG] [1301690082.888064094]: Transitioning CommState from >>>> WAITING_FOR_GOAL_ACK to ACTIVE >>>> [DEBUG] [1301690082.888130878]: Transitioning SimpleState from [PENDING] >>>> to [ACTIVE] >>>> Active. >>>> [DEBUG] [1301690082.888203846]: Trying to transition to >>>> WAITING_FOR_RESULT >>>> *[DEBUG] [1301690082.888262411]: Transitioning CommState from ACTIVE to >>>> WAITING_FOR_RESULT* >>>> [DEBUG] [1301690083.088191441]: Getting status over the wire. >>>> [DEBUG] [1301690083.288523198]: Getting status over the wire. >>>> [DEBUG] [1301690083.488836856]: Getting status over the wire. >>>> [DEBUG] [1301690083.688115542]: Getting status over the wire. >>>> [DEBUG] [1301690083.888460218]: Getting status over the wire. >>>> [DEBUG] [1301690084.088794991]: Getting status over the wire. >>>> [DEBUG] [1301690084.288137860]: Getting status over the wire. >>>> [DEBUG] [1301690084.488456451]: Getting status over the wire. >>>> [DEBUG] [1301690084.688742829]: Getting status over the wire. >>>> [DEBUG] [1301690084.888085527]: Getting status over the wire. >>>> [DEBUG] [1301690085.088418107]: Getting status over the wire. >>>> >>>> On the client side, the goal transitioned to WAITING_FOR_RESULT, which >>>> means that the server successfully set the goal to a terminal state (in this >>>> case, ABORTED). However, for some reason, the client never gets the result >>>> message. With so many messages flying across so quickly, I guess one of the >>>> result messages happens to get dropped. This also explains why adding a >>>> sleep in the server callback stopped the freezing. The sleep significantly >>>> slowed down the messaging rate, preventing the result message from being >>>> dropped. >>>> >>>> It is definitely best practice to always add a timeout to >>>> waitForResult(). I've updated the actionlib quickstart guide accordingly. >>>> >>>> And, please let me know if any of the spinner tweaks I suggested make a >>>> difference.
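For readers trying to reproduce the fix Vijay describes above, here is a minimal sketch of the recommended startup ordering (construct the client and server, start the spinner threads, then waitForServer) together with a bounded waitForResult(). The chores::NavigateAction type, the topic name, and the trivial execute callback are hypothetical placeholders, not the code from Ryan's attached package.

#include <ros/ros.h>
#include <boost/bind.hpp>
#include <actionlib/client/simple_action_client.h>
#include <actionlib/server/simple_action_server.h>
#include <chores/NavigateAction.h>   // hypothetical action definition

typedef actionlib::SimpleActionClient<chores::NavigateAction> Client;
typedef actionlib::SimpleActionServer<chores::NavigateAction> Server;

void execute(const chores::NavigateGoalConstPtr& goal, Server* server)
{
  server->setAborted();   // placeholder for real work, mirroring the test app
}

int main(int argc, char** argv)
{
  ros::init(argc, argv, "test_client_server");
  ros::NodeHandle nh;

  // 1) Construct client & server.
  Server server(nh, "navigate", boost::bind(&execute, _1, &server), false);
  server.start();
  Client client("navigate", true);   // true: client spins its own thread

  // 2) Spin up threads before blocking on anything.
  ros::AsyncSpinner spinner(1);
  spinner.start();

  // 3) Only now is it safe to block waiting for the server's status.
  client.waitForServer();

  // 4) Send goals, always bounding waitForResult with a timeout.
  while (ros::ok())
  {
    client.sendGoal(chores::NavigateGoal());
    if (!client.waitForResult(ros::Duration(1.0)))
    {
      ROS_WARN("Result never arrived (possibly dropped); cancelling");
      client.cancelGoal();
    }
  }
  return 0;
}

Starting the spinner before waitForServer() is what keeps the client from hanging at startup, and the timeout keeps a dropped result message from hanging it later.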
>>>> >>>> Vijay >>>> >>>> >>>> On Fri, Apr 1, 2011 at 1:07 PM, Ryan Miller wrote: >>>> >>>>> Hi Vijay, >>>>> >>>>> When I move the called to spinner.start() to where you suggested, the >>>>> client gets stuck on its call to waitForServer as follows: >>>>> >>>>> --- snip --- >>>>> $ rosrun chores chores >>>>> [DEBUG] [1301684841.767308639]: Spinning up a thread for the >>>>> SimpleActionClient >>>>> [DEBUG] [1301684841.794376567]: goalConnectCallback: Adding >>>>> [/test_client_server] to goalSubscribers >>>>> [DEBUG] [1301684841.794470922]: Goal Subscribers (1 total) >>>>> - /test_client_server >>>>> [DEBUG] [1301684841.800129524]: cancelConnectCallback: Adding >>>>> [/test_client_server] to cancelSubscribers >>>>> [DEBUG] [1301684841.800199855]: cancel Subscribers (1 total) >>>>> - /test_client_server >>>>> [DEBUG] [1301684841.809336199]: isServerConnected: Didn't receive >>>>> status yet, so not connected yet >>>>> [DEBUG] [1301684841.809441030]: isServerConnected: Didn't receive >>>>> status yet, so not connected yet >>>>> [DEBUG] [1301684842.309665431]: isServerConnected: Didn't receive >>>>> status yet, so not connected yet >>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive >>>>> status yet, so not connected yet >>>>> ... >>>>> [DEBUG] [1301684842.810034890]: isServerConnected: Didn't receive >>>>> status yet, so not connected yet >>>>> --- snip --- >>>>> >>>>> Also, it appears I have a heisenbug on my hands because when I turn on >>>>> rosconsole debugging, the problem seemingly disappears, but when I keep it >>>>> off, it crashes on almost every run. By the way, the single call to ROS_INFO >>>>> I placed in the code is critical to making the program halt more often. >>>>> Remove that, and you'll have more trouble getting the program to halt. >>>>> >>>>> Is there any other information that might help you replicate this >>>>> problem? For now, I'll just throw some facts here about the computer I'm >>>>> using. I am running Ubuntu 10.10 64-bit. I have replicated the problem on a >>>>> relatively slow laptop and two pretty fast desktops. The specific package >>>>> information for the version of ros-diamondback-common I am using is: >>>>> >>>>> --- snip --- >>>>> $ dpkg -s ros-diamondback-common >>>>> Package: ros-diamondback-common >>>>> Status: install ok installed >>>>> Priority: optional >>>>> Section: unknown >>>>> Installed-Size: 34224 >>>>> Maintainer: Tully Foote tfoote@willowgarage.com >>>>> Architecture: amd64 >>>>> Version: 1.4.3-s1299027665~maverick >>>>> Depends: ros-diamondback-common-msgs (= 1.4.0-s1299027564~maverick), >>>>> ros-diamondback-ros (= 1.4.6-s1298587732~maverick), ros-diamondback-ros-comm >>>>> (= 1.4.5-s1299027222~maverick), libc6, build-essential, cmake, python-yaml, >>>>> subversion, uuid-dev, libcppunit-dev >>>>> Description: common code for personal robots >>>>> This stack contains tools built on top of ROS core which are commonly >>>>> used throughout the ROS ecosystem. >>>>> Wg-Rosdistro: diamondback >>>>> --- snip --- >>>>> >>>>> Here is some information from gdb from when the program halts. I've >>>>> printed a list of the running threads as well as the backtraces of a few of >>>>> the ones in different functions. >>>>> >>>>> --- snip --- >>>>> Current State: ABORTED >>>>> Server called. Aborting goal. >>>>> Active. >>>>> ^C >>>>> Program received signal SIGINT, Interrupt. 
>>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No >>>>> such file or directory. >>>>> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S >>>>> (gdb) info thread >>>>> 12 Thread 0x7fffecea2700 (LWP 9649) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 11 Thread 0x7fffed6a3700 (LWP 9646) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 10 Thread 0x7fffedea4700 (LWP 9640) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 9 Thread 0x7fffee6a5700 (LWP 9639) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 8 Thread 0x7fffeeea6700 (LWP 9638) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 7 Thread 0x7fffef6a7700 (LWP 9637) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 6 Thread 0x7fffefea8700 (LWP 9636) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 5 Thread 0x7ffff06a9700 (LWP 9635) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 4 Thread 0x7ffff0eaa700 (LWP 9630) pthread_cond_wait@@GLIBC_2.3.2 () >>>>> at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 >>>>> 3 Thread 0x7ffff16ab700 (LWP 9629) 0x00007ffff534e2c3 in select () at >>>>> ../sysdeps/unix/syscall-template.S:82 >>>>> 2 Thread 0x7ffff1eac700 (LWP 9628) 0x00007ffff5349203 in __poll >>>>> (fds=, nfds=, timeout=100) at >>>>> ../sysdeps/unix/sysv/linux/poll.c:87 >>>>> * 1 Thread 0x7ffff7fcd760 (LWP 9609) pthread_cond_timedwait@@GLIBC_2.3.2 >>>>> () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> (gdb) bt >>>>> #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> #1 0x000000000044c698 in timed_wait > >>>>> (this=0x7fffffffe110, m=..., wait_duration=) at >>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:110 >>>>> #2 >>>>> boost::condition_variable_any::timed_wait, >>>>> boost::posix_time::milliseconds> (this=0x7fffffffe110, m=..., >>>>> wait_duration=) >>>>> at /usr/include/boost/thread/pthread/condition_variable.hpp:133 >>>>> #3 0x000000000044d355 in >>>>> actionlib::SimpleActionClient > >>>>> >::waitForResult (this=0x7fffffffe040, timeout=...) >>>>> at >>>>> /opt/ros/diamondback/stacks/common/actionlib/include/actionlib/client/simple_action_client.h:567 >>>>> #4 0x00000000004583f6 in ClientGuy::run (this=0x7fffffffe040) at >>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:50 >>>>> #5 0x000000000042d8d5 in main (argc=1, argv=) at >>>>> /home/pesckal/prometheus/ros/chores/src/main.cpp:87 >>>>> (gdb) thread 2 >>>>> [Switching to thread 2 (Thread 0x7ffff1eac700 (LWP 9628))]#0 >>>>> 0x00007ffff5349203 in __poll (fds=, nfds=>>>> optimized out>, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:87 >>>>> 87 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory. 
>>>>> in ../sysdeps/unix/sysv/linux/poll.c >>>>> (gdb) bt >>>>> #0 0x00007ffff5349203 in __poll (fds=, >>>>> nfds=, timeout=100) at >>>>> ../sysdeps/unix/sysv/linux/poll.c:87 >>>>> #1 0x00007ffff78c169a in ros::PollSet::update (this=0x69c650, >>>>> poll_timeout=100) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_set.cpp:202 >>>>> #2 0x00007ffff7906778 in ros::PollManager::threadFunc (this=0x69c650) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/poll_manager.cpp:95 >>>>> #3 0x00007ffff6df5230 in thread_proxy () from >>>>> /usr/lib/libboost_thread.so.1.42.0 >>>>> #4 0x00007ffff55f9971 in start_thread (arg=) at >>>>> pthread_create.c:304 >>>>> #5 0x00007ffff535592d in clone () at >>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 >>>>> #6 0x0000000000000000 in ?? () >>>>> (gdb) thread 3 >>>>> [Switching to thread 3 (Thread 0x7ffff16ab700 (LWP 9629))]#0 >>>>> 0x00007ffff534e2c3 in select () at ../sysdeps/unix/syscall-template.S:82 >>>>> 82 ../sysdeps/unix/syscall-template.S: No such file or directory. >>>>> in ../sysdeps/unix/syscall-template.S >>>>> (gdb) bt >>>>> #0 0x00007ffff534e2c3 in select () at >>>>> ../sysdeps/unix/syscall-template.S:82 >>>>> #1 0x00007ffff723313a in XmlRpc::XmlRpcDispatch::work (this=0x69ac18, >>>>> timeout=) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/utilities/xmlrpcpp/src/XmlRpcDispatch.cpp:110 >>>>> #2 0x00007ffff7903de9 in ros::XMLRPCManager::serverThreadFunc >>>>> (this=0x69abc0) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/xmlrpc_manager.cpp:256 >>>>> #3 0x00007ffff6df5230 in thread_proxy () from >>>>> /usr/lib/libboost_thread.so.1.42.0 >>>>> #4 0x00007ffff55f9971 in start_thread (arg=) at >>>>> pthread_create.c:304 >>>>> #5 0x00007ffff535592d in clone () at >>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 >>>>> #6 0x0000000000000000 in ?? () >>>>> (gdb) thread 4 >>>>> [Switching to thread 4 (Thread 0x7ffff0eaa700 (LWP 9630))]#0 >>>>> pthread_cond_wait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 >>>>> 162 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No >>>>> such file or directory. >>>>> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S >>>>> (gdb) bt >>>>> #0 pthread_cond_wait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 >>>>> #1 0x00007ffff79603b3 in wait (this=) at >>>>> /usr/include/boost/thread/pthread/condition_variable.hpp:20 >>>>> #2 ros::ROSOutAppender::logThread (this=) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/rosout_appender.cpp:135 >>>>> #3 0x00007ffff6df5230 in thread_proxy () from >>>>> /usr/lib/libboost_thread.so.1.42.0 >>>>> #4 0x00007ffff55f9971 in start_thread (arg=) at >>>>> pthread_create.c:304 >>>>> #5 0x00007ffff535592d in clone () at >>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 >>>>> #6 0x0000000000000000 in ?? 
() >>>>> (gdb) thread 5 >>>>> [Switching to thread 5 (Thread 0x7ffff06a9700 (LWP 9635))]#0 >>>>> pthread_cond_timedwait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> 212 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: >>>>> No such file or directory. >>>>> in ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S >>>>> (gdb) bt >>>>> #0 pthread_cond_timedwait@@GLIBC_2.3.2 () at >>>>> ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:212 >>>>> #1 0x00007ffff796885d in bool >>>>> boost::condition_variable::timed_wait>>>> 1000000l> >(boost::unique_lock&, >>>>> boost::date_time::subsecond_duration>>>> 1000000l> const&) () from >>>>> /opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/lib/libros.so >>>>> #2 0x00007ffff7964e99 in ros::CallbackQueue::callAvailable >>>>> (this=0x69c410, timeout=DWARF-2 expression error: DW_OP_reg operations must >>>>> be used either alone or in conjuction with DW_OP_piece or DW_OP_bit_piece. >>>>> ) >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/callback_queue.cpp:307 >>>>> #3 0x00007ffff78f22e1 in ros::internalCallbackQueueThreadFunc () >>>>> at >>>>> /tmp/buildd/ros-diamondback-ros-comm-1.4.5/debian/ros-diamondback-ros-comm/opt/ros/diamondback/stacks/ros_comm/clients/cpp/roscpp/src/libros/init.cpp:265 >>>>> #4 0x00007ffff6df5230 in thread_proxy () from >>>>> /usr/lib/libboost_thread.so.1.42.0 >>>>> #5 0x00007ffff55f9971 in start_thread (arg=) at >>>>> pthread_create.c:304 >>>>> #6 0x00007ffff535592d in clone () at >>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 >>>>> #7 0x0000000000000000 in ?? () >>>>> (gdb) >>>>> --- snip --- >>>>> >>>>> Also, I'm sorry to hear you had to create a new package to compile the >>>>> program. As a sanity check, I also created a new package and copied over the >>>>> cpp and action files, but this did not change the result I was having, as >>>>> expected. (I've attached the package, for reference.) >>>>> >>>>> I really appreciate the help, thank you. >>>>> >>>>> -Ryan >>>>> >>>>> >>>>> On Fri, Apr 1, 2011 at 2:03 PM, Vijay Pradeep < >>>>> vpradeep@willowgarage.com> wrote: >>>>> > Hi Ryan, >>>>> > >>>>> > Here's one more thing to try. Turn on rosconsole debugging (See >>>>> rosconole >>>>> > configuration) >>>>> > >>>>> > You're going to want the following two lines in your >>>>> rosonsole.config: >>>>> > log4j.logger.ros.actionlib=DEBUG >>>>> > log4j.logger.ros.chores=DEBUG >>>>> > >>>>> > >>>>> > You'll get an output that looks like this: >>>>> > [DEBUG] [1301680806.580203868]: Setting the current goal as aborted >>>>> > [DEBUG] [1301680806.580254603]: Setting status to aborted on goal, >>>>> id: >>>>> > /test_client_server-1118-1301680806.579359151, stamp: 1301680806.58 >>>>> > [DEBUG] [1301680806.580301868]: Publishing result for goal with id: >>>>> > /test_client_server-1118-1301680806.579359151 and stamp: >>>>> 1301680806.58 >>>>> > [DEBUG] [1301680806.580439068]: Getting status over the wire. >>>>> > [DEBUG] [1301680806.580509625]: Trying to transition to ACTIVE >>>>> > [DEBUG] [1301680806.580553872]: Transitioning CommState from >>>>> > WAITING_FOR_GOAL_ACK to ACTIVE >>>>> > [DEBUG] [1301680806.580603042]: Transitioning SimpleState from >>>>> [PENDING] to >>>>> > [ACTIVE] >>>>> > Active. 
>>>>> > [DEBUG] [1301680806.580934461]: Trying to transition to >>>>> WAITING_FOR_RESULT >>>>> > [DEBUG] [1301680806.580991611]: Transitioning CommState from ACTIVE >>>>> to >>>>> > WAITING_FOR_RESULT >>>>> > [DEBUG] [1301680806.581051376]: Trying to transition to DONE >>>>> > [DEBUG] [1301680806.581108948]: Transitioning CommState from >>>>> > WAITING_FOR_RESULT to DONE >>>>> > [DEBUG] [1301680806.581164618]: Transitioning SimpleState from >>>>> [ACTIVE] to >>>>> > [DONE] >>>>> > Done. >>>>> > [ INFO] [1301680806.581243326]: Hogging time or something >>>>> > Current State: ABORTED >>>>> > [DEBUG] [1301680806.581302953]: IN DELETER >>>>> > [DEBUG] [1301680806.581339010]: About to erase CommStateMachine >>>>> > [DEBUG] [1301680806.581377433]: Done erasing CommStateMachine >>>>> > [DEBUG] [1301680806.581414297]: about to start initGoal() >>>>> > [DEBUG] [1301680806.581489440]: Done with initGoal() >>>>> > [DEBUG] [1301680806.581544720]: The action server has received a new >>>>> goal >>>>> > request >>>>> > [DEBUG] [1301680806.581638097]: A new goal has been recieved by the >>>>> single >>>>> > goal action server >>>>> > [DEBUG] [1301680806.581705716]: Accepting a new goal >>>>> > [DEBUG] [1301680806.581758609]: Accepting goal, id: >>>>> > /test_client_server-1119-1301680806.581440270, stamp: 1301680806.58 >>>>> > Server called. Aborting goal. >>>>> > >>>>> > Can send me the debug output for the last few cycles when your app >>>>> freezes? >>>>> > >>>>> > Vijay >>>>> > >>>>> > On Fri, Apr 1, 2011 at 10:05 AM, Vijay Pradeep < >>>>> vpradeep@willowgarage.com> >>>>> > wrote: >>>>> >> >>>>> >> Hi Ryan, >>>>> >> >>>>> >> So I actually ran your original chores app overnight (~10 hours), >>>>> and it >>>>> >> still didn't freeze. >>>>> >> >>>>> >> Also, right now, I tried running your new chores app about 30 times >>>>> (for >>>>> >> about 5 seconds on each run), and it still hadn't frozen. I >>>>> couldn't get >>>>> >> the app to build using your CMakeLists and Makefile, so I rewrote >>>>> them based >>>>> >> on the standard "roscreate-pkg" versions of these files. I doubt >>>>> this would >>>>> >> affect the functionality. >>>>> >> >>>>> >> The fact that it freezes close to startup is interesting. My guess >>>>> would >>>>> >> be that it is having some sort of startup interaction with the >>>>> AsyncSpinner. >>>>> >> >>>>> >> Instead of: >>>>> >> int main(int argc, char** argv) { >>>>> >> ros::init(argc, argv, "test_client_server"); >>>>> >> >>>>> >> ros::AsyncSpinner spinner(4); >>>>> >> spinner.start(); >>>>> >> >>>>> >> ServerDude sd; >>>>> >> >>>>> >> ClientGuy client; >>>>> >> client.run(); >>>>> >> >>>>> >> return 0; >>>>> >> } >>>>> >> >>>>> >> Can you try: >>>>> >> int main(int argc, char** argv) { >>>>> >> ros::init(argc, argv, "test_client_server"); >>>>> >> >>>>> >> ros::AsyncSpinner spinner(4); >>>>> >> >>>>> >> ServerDude sd; >>>>> >> >>>>> >> ClientGuy client; >>>>> >> spinner.start(); >>>>> >> client.run(); >>>>> >> >>>>> >> return 0; >>>>> >> } >>>>> >> >>>>> >> I would really like to get to the bottom of this, but it's hard to >>>>> do much >>>>> >> on my end if I can't actually get it to halt. >>>>> >> >>>>> >> Vijay >>>>> >> >>>>> >> On Fri, Apr 1, 2011 at 7:51 AM, Ryan Miller >>>>> wrote: >>>>> >>> >>>>> >>> Hey Vijay, thanks for helping (and sorry for the double post; I >>>>> >>> accidentally forgot to reply all, and I figure some people on the >>>>> list >>>>> >>> may be interested in the email). 
Yes, I did mean waitForResult, and >>>>> I >>>>> >>> am also using common-1.4.3 (diamondback too, if that's relevant). >>>>> >>> Since you're having trouble duplicating the issue, I went back to >>>>> >>> create an example closer to what I'm actually doing, and I managed >>>>> to >>>>> >>> get it to crash just as often. I've attached the project. >>>>> >>> >>>>> >>> If it doesn't crash after about 5 seconds, restart it. It seems if >>>>> it >>>>> >>> continues for more than 5 seconds, it is more likely to continue >>>>> >>> running for much longer, but it usually halts within a second. >>>>> >>> >>>>> >>> Thanks a ton. >>>>> >>> >>>>> >>> -Ryan >>>>> >>> >>>>> >>> On Thu, Mar 31, 2011 at 11:00 PM, Vijay Pradeep >>>>> >>> wrote: >>>>> >>> > Hi Ryan, >>>>> >>> > >>>>> >>> > I'm sorry that you're having trouble getting actionlib to work. >>>>> What >>>>> >>> > version of the common stack do you have? >>>>> >>> > >>>>> >>> > Note that it is always possible for some of the ROS messages to >>>>> be >>>>> >>> > dropped. >>>>> >>> > If the result message doesn't make it to the action client, then >>>>> >>> > waitForResult is going to block forever. I'd suggest adding a >>>>> timeout >>>>> >>> > on >>>>> >>> > waitForServer, and then preempting the goal if you end up waiting >>>>> too >>>>> >>> > long. >>>>> >>> > >>>>> >>> >> I made each callback print when >>>>> >>> >> it was called, and I found that quite often, the program would >>>>> block >>>>> >>> >> forever when I called waitForServer. >>>>> >>> > I'm guessing you meant "the program would block forever when I >>>>> called >>>>> >>> > waitForResult" >>>>> >>> > >>>>> >>> > I'm using common-1.4.3, and after running your example for ~10 >>>>> minutes, >>>>> >>> > it >>>>> >>> > doesn't seem to freeze for me. I can rerun this test with the >>>>> exact >>>>> >>> > version >>>>> >>> > of common that you're using. Is there anything you can do to >>>>> make this >>>>> >>> > minimal example freeze more often? Are there some strategic >>>>> sleeps you >>>>> >>> > could >>>>> >>> > add to make it mimic your original app more closely? >>>>> >>> > >>>>> >>> > If you can get the chores app to freeze, you could try attaching >>>>> gdb to >>>>> >>> > the >>>>> >>> > process and seeing where all the threads are (using the "info >>>>> threads" >>>>> >>> > command inside gdb). A bunch of them will be stuck on >>>>> >>> > pthread_cond_timedwait calls, but I'd be curious if there's a >>>>> thread >>>>> >>> > stuck >>>>> >>> > on a lock inside of actionlib. That would be indicative of a >>>>> race >>>>> >>> > condition >>>>> >>> > in actionlib. >>>>> >>> > >>>>> >>> > Vijay >>>>> >>> > >>>>> >>> > On Thu, Mar 31, 2011 at 6:55 PM, Ryan Miller < >>>>> rmiller4589@gmail.com> >>>>> >>> > wrote: >>>>> >>> >> >>>>> >>> >> Because of a timing issue I can't quite pin down, actionlib >>>>> >>> >> occasionally appears to exhibit a race condition. I found the >>>>> problem >>>>> >>> >> after adding callbacks to my client. I made each callback print >>>>> when >>>>> >>> >> it was called, and I found that quite often, the program would >>>>> block >>>>> >>> >> forever when I called waitForServer. The problem was that the >>>>> client's >>>>> >>> >> active callback was called but the server's execute method was >>>>> never >>>>> >>> >> called. >>>>> >>> >> >>>>> >>> >> I have reduced the problem into a simple ROS package with a >>>>> single >>>>> >>> >> executable that I have attached. After running the node for a >>>>> while, I >>>>> >>> >> finally noticed the same problem. 
Its last output before >>>>> blocking >>>>> >>> >> was: >>>>> >>> >> >>>>> >>> >> --- snip --- >>>>> >>> >> Current State: ABORTED >>>>> >>> >> Aborting. >>>>> >>> >> Active. >>>>> >>> >> Done. >>>>> >>> >> Current State: ABORTED >>>>> >>> >> Aborting. >>>>> >>> >> Active. >>>>> >>> >> --- snip --- >>>>> >>> >> >>>>> >>> >> In my actual code, the condition happens extremely frequently, >>>>> but I >>>>> >>> >> found I could mitigate the problem by sleeping for one >>>>> millisecond >>>>> >>> >> before returning from the server's execute method. (I should >>>>> warn you >>>>> >>> >> that in the attached example, the problem almost never occurs). >>>>> >>> >> >>>>> >>> >> Is this likely a bug, or might I be doing something wrong? Any >>>>> >>> >> suggestions would be appreciated. Thanks for the help. >>>>> >>> >> >>>>> >>> >> -Ryan >>>>> >>> >> >>>>> >>> >> _______________________________________________ >>>>> >>> >> ros-users mailing list >>>>> >>> >> ros-users@code.ros.org >>>>> >>> >> https://code.ros.org/mailman/listinfo/ros-users >>>>> >>> >> >>>>> >>> > >>>>> >>> > -- >>>>> >>> > Vijay Pradeep >>>>> >>> > Systems Engineer >>>>> >>> > Willow Garage, Inc. >>>>> >>> > vpradeep@willowgarage.com >>>>> >>> > >>>>> >> >>>>> >> -- >>>>> >> Vijay Pradeep >>>>> >> Systems Engineer >>>>> >> Willow Garage, Inc. >>>>> >> vpradeep@willowgarage.com >>>>> >> >>>>> > >>>>> > -- >>>>> > Vijay Pradeep >>>>> > Systems Engineer >>>>> > Willow Garage, Inc. >>>>> > vpradeep@willowgarage.com >>>>> > >>>> >>>> -- >>>> Vijay Pradeep >>>> Systems Engineer >>>> Willow Garage, Inc. >>>> vpradeep@willowgarage.com >>>> >>> >> > -- > Vijay Pradeep > Systems Engineer > Willow Garage, Inc. > vpradeep@willowgarage.com >