Hello,

I am working on getting a good understanding of the tod_training pipeline, and especially of the use of the pose estimations. I would be glad for any indication of whether my findings below are correct or not:

Step 1: Pose estimation -- For each object and view, the pose is estimated from fiducial markers. The pose describes the position and orientation of the training object with respect to the camera.

Step 2: Masking -- For each object and view, the point cloud belonging to the training object is separated from the background by point cloud segmentation. The pose estimate from step 1 is crucial for this stage, as it determines the pose of a fixed-size box. After outlier removal, all points inside this box are taken to belong to the object; all points outside the box are no longer of interest. The 2d mask is then generated by perspective projection of the retained points, using the camera information.

Step 3: Feature detection and extraction -- For each object and view, features are detected and extracted from the ROI in the 2d grayscale training images. One Features2d is produced per view, containing the keypoints, the corresponding feature descriptors, and finally the estimated pose. This stage does not use the pose itself; the pose is only carried along in the Features2d.

Step 4: Features3d creation -- For each object and view, this stage associates the keypoints in the training images with the corresponding 3d points in the point cloud, creating a one-to-one mapping between 2d keypoints and 3d points. 3d-to-3d matching will soon also be supported.

Altogether, we should be able to replace steps 1 and 2 if we can supply the pose estimation from a source other than fiducial markers. (I have sketched my mental model of steps 1, 2, and 4 in the P.S. below.)

(Using object_detection in SVN revision 50425.)

Best Regards,
Julius
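
P.S. To make my mental model concrete, here is roughly how I picture the individual stages, sketched in Python with numpy/OpenCV. All function and variable names below are my own invention for illustration, not the actual tod_training API. For step 1, I am assuming the fiducial is a plain chessboard, with K and dist being the camera intrinsics and distortion coefficients:

    # Step 1 (sketch): pose of the fiducial (and thus of the training object)
    # with respect to the camera. Assumes a chessboard fiducial; the real
    # pipeline may use a different marker pattern.
    import numpy as np
    import cv2

    def estimate_pose(gray, K, dist, pattern_size=(7, 5), square_size=0.02):
        found, corners = cv2.findChessboardCorners(gray, pattern_size)
        if not found:
            return None
        # 3d coordinates of the chessboard corners in the fiducial frame (z = 0).
        objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
        objp[:, :2] = np.mgrid[0:pattern_size[0],
                               0:pattern_size[1]].T.reshape(-1, 2) * square_size
        ok, rvec, tvec = cv2.solvePnP(objp, corners, K, dist)
        return (rvec, tvec) if ok else None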
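
For step 2, this is how I imagine the pose-driven masking. It is only a sketch: I am assuming a hypothetical fixed box size of 0.3 m centered on the fiducial origin, and I have left the outlier removal out:

    # Step 2 (sketch): keep the points inside a fixed-size box placed at the
    # estimated pose, then project them into the image to obtain the mask.
    import numpy as np
    import cv2

    def mask_from_pose(cloud, rvec, tvec, K, dist, image_shape, box_size=0.3):
        """cloud: Nx3 points in the camera frame; rvec/tvec: pose from step 1."""
        R, _ = cv2.Rodrigues(rvec)
        # Express the points in the fiducial/object frame: p_obj = R^T (p_cam - t).
        pts_obj = (cloud - tvec.reshape(1, 3)) @ R
        inside = np.all(np.abs(pts_obj) < box_size / 2.0, axis=1)
        obj_pts = cloud[inside].astype(np.float32)
        mask = np.zeros(image_shape[:2], np.uint8)
        if obj_pts.size == 0:
            return mask
        # Perspective projection of the retained 3d points (already in the
        # camera frame, hence zero rotation and translation).
        px, _ = cv2.projectPoints(obj_pts, np.zeros((3, 1)), np.zeros((3, 1)), K, dist)
        uv = np.round(px.reshape(-1, 2)).astype(int)
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < image_shape[1]) & \
             (uv[:, 1] >= 0) & (uv[:, 1] < image_shape[0])
        mask[uv[ok, 1], uv[ok, 0]] = 255
        return mask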
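
And for step 4, the 2d-3d association as I understand it, assuming the point cloud is organized, i.e. there is one 3d point per image pixel:

    # Step 4 (sketch): associate each 2d keypoint with the 3d point at the same
    # pixel of an organized point cloud (NaN where no depth is available).
    import numpy as np

    def associate_keypoints(keypoints, organized_cloud):
        """keypoints: list of cv2.KeyPoint; organized_cloud: HxWx3 array."""
        kept, points3d = [], []
        for kp in keypoints:
            u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
            p = organized_cloud[v, u]
            if not np.any(np.isnan(p)):   # drop keypoints without valid depth
                kept.append(kp)
                points3d.append(p)
        # One-to-one mapping: kept[i] <-> points3d[i]
        return kept, np.asarray(points3d)

If that is roughly what happens, then replacing steps 1 and 2 should indeed only require supplying the pose (and, where needed, the mask) from another source.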