Hello,

I am working on getting a good understanding of the tod_training pipeline, and especially of the use of the pose estimations. I would be glad for any indication of whether my findings below are correct or not:

Step 1: Pose estimation -- For each object and view, the pose is estimated from fiducial markers. The pose describes the position and orientation of the training object with respect to the camera.

Step 2: Masking -- For each object and view, the point cloud belonging to the training object is separated from the background by point cloud segmentation. The pose estimate from step 1 is crucial for this stage, as it determines the pose of a fixed-size box. After outlier removal, all points inside this box are taken to belong to the object; all points outside the box are no longer of interest. The 2d mask is then generated by perspective projection of the retained points, using the camera information.

Step 3: Feature detection and extraction -- For each object and view, features are detected and extracted from the ROI in the 2d grayscale training images. One Features2d is produced per view, containing the keypoints, the corresponding feature descriptors, and finally the estimated pose. This stage does not use the pose itself; the pose is only carried along in the Features2d.

Step 4: Features3d creation -- For each object and view, this stage associates the keypoints in the training images with the corresponding 3d points in the point cloud, creating a one-to-one mapping between 2d keypoints and 3d points. 3d-to-3d matching will soon also be supported.

Altogether, we should be able to replace steps 1 and 2 if we can supply the pose estimation from a source other than fiducial markers. (I have sketched my mental model of steps 1, 2, and 4 in the P.S. below.)

(Using object_detection in SVN revision 50425.)

Best Regards,
Julius
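
P.S. To make my mental model concrete, here is roughly how I picture the individual stages, sketched in Python with numpy/OpenCV. All function and variable names below are my own invention for illustration, not the actual tod_training API. For step 1, I am assuming the fiducial is a plain chessboard, with K and dist being the camera intrinsics and distortion coefficients:

    # Step 1 (sketch): pose of the fiducial (and thus of the training object)
    # with respect to the camera. Assumes a chessboard fiducial; the real
    # pipeline may use a different marker pattern.
    import numpy as np
    import cv2

    def estimate_pose(gray, K, dist, pattern_size=(7, 5), square_size=0.02):
        found, corners = cv2.findChessboardCorners(gray, pattern_size)
        if not found:
            return None
        # 3d coordinates of the chessboard corners in the fiducial frame (z = 0).
        objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
        objp[:, :2] = np.mgrid[0:pattern_size[0],
                               0:pattern_size[1]].T.reshape(-1, 2) * square_size
        ok, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
        return (rvec, tvec) if ok else None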
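
For step 2, this is how I imagine the pose-driven masking. It is only a sketch: I am assuming a hypothetical fixed box size of 0.3 m centered on the fiducial origin, and I have left the outlier removal out:

    # Step 2 (sketch): keep the points inside a fixed-size box placed at the
    # estimated pose, then project them into the image to obtain the mask.
    import numpy as np
    import cv2

    def mask_from_pose(cloud, rvec, tvec, K, dist, image_shape, box_size=0.3):
        """cloud: Nx3 points in the camera frame; rvec/tvec: pose from step 1."""
        R, _ = cv2.Rodrigues(rvec)
        # Express the points in the fiducial/object frame: p_obj = R^T (p_cam - t).
        pts_obj = (cloud - tvec.reshape(1, 3)) @ R
        inside = np.all(np.abs(pts_obj) < box_size / 2.0, axis=1)
        obj_pts = cloud[inside].astype(np.float32)
        mask = np.zeros(image_shape[:2], np.uint8)
        if obj_pts.size == 0:
            return mask
        # Perspective projection of the retained 3d points (already in the
        # camera frame, hence zero rotation and translation).
        px, _ = cv2.projectPoints(obj_pts, np.zeros((3, 1)), np.zeros((3, 1)), K, dist)
        uv = np.round(px.reshape(-1, 2)).astype(int)
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < image_shape[1]) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < image_shape[0])
        mask[uv[ok, 1], uv[ok, 0]] = 255
        return mask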
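
And for step 4, the 2d-3d association as I understand it, assuming the point cloud is organized, i.e. there is one 3d point per image pixel:

    # Step 4 (sketch): associate each 2d keypoint with the 3d point at the same
    # pixel of an organized point cloud (NaN where no depth is available).
    import numpy as np

    def associate_keypoints(keypoints, organized_cloud):
        """keypoints: list of cv2.KeyPoint; organized_cloud: HxWx3 array."""
        kept, points3d = [], []
        for kp in keypoints:
            u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
            p = organized_cloud[v, u]
            if not np.any(np.isnan(p)):   # drop keypoints without valid depth
                kept.append(kp)
                points3d.append(p)
        # One-to-one mapping: kept[i] <-> points3d[i]
        return kept, np.asarray(points3d)

If that is roughly what happens, then replacing steps 1 and 2 should indeed only require supplying the pose (and, where needed, the mask) from another source.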