REP: XXX
Title: Depth Images
Version: $Revision: 83 $
Last-Modified: $Date: 2011-01-17 19:02:13 -0800 (Mon, 17 Jan 2011) $
Author: Patrick Mihelich
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 01-Dec-2011
ROS-Version: Fuerte
Post-History: 30-Aug-2002


Abstract
========

This REP defines a representation for depth images in ROS.  Depth
images may be produced by a variety of camera technologies, including
stereo, structured light, and time-of-flight.


Specification
=============

Canonical Representation
------------------------

Depth images are published as `sensor_msgs/Image`, encoded as 32-bit
float.  Each pixel is a depth (along the camera Z axis) in meters.
The non-finite values `NaN`, `+Inf`, and `-Inf` have special meanings
as defined by REP 117.

The ROS API for producers of depth images follows the standard camera
driver API.  Depth images are published on the `image` topic.  The
`camera_info` topic describes how to interpret the depth image
geometrically.  Whereas each pixel in a standard image can only be
projected to a 3D ray, a depth image can (given the camera
calibration) be converted to a 3D point cloud.

OpenNI Raw Representation
-------------------------

Alternatively, a device driver may publish depth images encoded as
16-bit unsigned integer, where each pixel is depth in millimeters.
This differs from the standard units recommended in REP 103.  The
value 0 denotes an invalid depth, equivalent to a `NaN` floating
point distance.

Raw depth images are published on the `image_raw` topic.  The
`image_pipeline` stack will provide a nodelet to convert the
`image_raw` topic to the canonical `image` topic.

Consumers of depth images are only required to support the canonical
floating point representation.


Rationale
=========

Why Not sensor_msgs/DisparityImage
----------------------------------

With the addition of depth images, ROS now has three messages
suitable for representing dense depth data: `sensor_msgs/Image`,
`sensor_msgs/DisparityImage`, and `sensor_msgs/PointCloud2`.
`PointCloud2` is more general than a depth image, but also more
verbose.  The `DisparityImage` representation, however, is very
similar to a depth image.

The `DisparityImage` message exists for historical reasons: stereo
cameras were used with ROS long before any other type of depth
sensor, and disparity images are the natural "raw" output of stereo
correlation algorithms.  For some vision algorithms (e.g. VSLAM),
disparities are a convenient input to error metrics with pixel units.

In practice, the `DisparityImage` message also has drawbacks:

* It is tied to the stereo approach to 3D vision.  Representing the
  output of a time-of-flight depth camera as a `DisparityImage` would
  be awkward.

* Converting a disparity image to a point cloud requires two
  `CameraInfo` messages, for the left and right cameras.  Converting
  a depth image requires only one `CameraInfo` message (see the
  sketch after this list).

* It cannot be used with `image_transport`.  Using
  `sensor_msgs/Image` already permits reasonable compression of
  16-bit depth images with PNG, and easily allows adding compression
  algorithms specialized for depth images.

* A major feature of OpenNI is registering the depth image to align
  with the RGB image, taken with a different camera.  Registering a
  disparity image to a different camera frame is difficult to
  describe precisely, because converting disparity to depth depends
  on parameters (focal length and baseline) of the original camera.

* In most robotics applications, depth is actually the quantity of
  interest.
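
To make the conversion comparison concrete, below is a minimal
sketch of projecting a canonical depth image to a 3D point per
pixel.  It is illustrative only, not code from any released ROS
package; it assumes the depth data is accessible as a `numpy` array
(e.g. via `cv_bridge`), with the intrinsics `fx`, `fy`, `cx`, `cy`
taken from the single `CameraInfo` message (entries `K[0]`, `K[4]`,
`K[2]`, `K[5]` of the flattened 3x3 `K` matrix)::

    import numpy as np

    def depth_to_points(depth, fx, fy, cx, cy):
        # depth: float32 array, depth in meters along the camera Z axis.
        rows, cols = depth.shape
        u, v = np.meshgrid(np.arange(cols), np.arange(rows))
        x = (u - cx) * depth / fx   # pinhole back-projection
        y = (v - cy) * depth / fy
        # NaN and +/-Inf pixels (REP 117) propagate through unchanged.
        return np.dstack((x, y, depth))

A disparity image must first be converted to depth via
`z = fx * T / d`; in the ROS stereo convention the baseline `T` is
recovered from the right camera's projection matrix, hence the
two-message requirement noted above.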

`sensor_msgs/DisparityImage` will continue to exist for backwards
compatibility and for applications where it truly is the better
representation.  The `image_pipeline` stack will provide a nodelet
for converting depth images to disparity images.  Producers of dense
depth data are encouraged to use `sensor_msgs/Image` instead of
`sensor_msgs/DisparityImage`.

Why Not a New Message Type
--------------------------

Disparity images are represented by a distinct
`sensor_msgs/DisparityImage` type, so why not define a
`sensor_msgs/DepthImage`?

Defining a new image-like message incurs significant tooling costs.
The new message would be incompatible with `image_transport`,
standard image viewers, and various utilities such as converters
between bags and images/video.

On the other hand, perhaps there is additional metadata that a depth
image ought to include.  Let's consider the fields added by
`sensor_msgs/DisparityImage`:

* `f`, `T`: focal length and baseline.  These are duplicated from the
  `CameraInfo` messages, and duplicated data is usually a bad sign.
  They are not even sufficient to correctly compute a point cloud, as
  `fx` may differ from `fy`, and the principal point (`cx`, `cy`) is
  not included.

* `valid_window`: the subwindow of potentially valid disparity
  values.  This allows clients to iterate over the disparity image a
  bit more efficiently, but is hardly necessary.  An alternative is
  to publish the depth image cropped down to its valid window and to
  represent that window with the `roi` field of `CameraInfo`.  This
  has the advantage of not wasting bandwidth on necessarily invalid
  data.

* `min_disparity`, `max_disparity`: define the minimum and maximum
  depth the camera can "see."  This actually is useful information,
  but generally not required.

* `delta_d`: allows computation of the achievable depth resolution
  at any given depth.  This is theoretically useful, and an analogous
  value could be calculated for the Kinect; but it may be hard to
  generalize over all 3D camera technologies.

The main information we are unable to capture with an (`Image`,
`CameraInfo`) pair is the min/max range.  That does not seem to
justify breaking from the established camera driver API.  If
necessary, the min/max range and other metadata could be published
as another side channel, similar to the `camera_info` topic.

Why Allow the OpenNI Representation
-----------------------------------

Including the `uint16` OpenNI format is unfortunate in some ways.  It
adds complexity, is tied to a particular family of hardware, and uses
different units from the rest of ROS.  There are, nevertheless, some
compelling reasons:

* Strength in numbers: over 10 million Microsoft Kinects have already
  been sold, and PrimeSense technology may make further inroads on
  the desktop with new products from Asus.  The overwhelming market
  adoption makes OpenNI a de-facto standard for the foreseeable
  future.

* Bandwidth: the `float` format is twice as large as the raw
  `uint16` one.  The raw representation has a large advantage for
  network transmission and archival purposes.

* Compression: the raw format can already be PNG-compressed.

* Efficiency: processing VGA depth data at 30fps stresses the
  capabilities of today's hardware, and many users are attempting to
  do so with relatively light-weight machines such as netbooks.  In
  such resource-constrained environments, avoiding an intermediate
  conversion to the `float` format can be a noticeable win for
  performance (the conversion itself is sketched below).
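
The raw-to-canonical conversion performed by the proposed nodelet
amounts to a units change plus invalid-value handling.  As a minimal
sketch (illustrative only, not actual `image_pipeline` code; it
assumes the pixel data is accessible as a `numpy` array, e.g. via
`cv_bridge`)::

    import numpy as np

    def raw_to_canonical(raw):
        # raw: uint16 depth image, millimeters, 0 marks invalid pixels.
        depth = raw.astype(np.float32) / 1000.0  # mm -> m, per REP 103
        depth[raw == 0] = np.nan                 # invalid depth, per REP 117
        return depth

On the wire, the two representations would typically use the
`sensor_msgs` image encodings `16UC1` and `32FC1`, respectively.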

Backwards Compatibility
=======================

This REP codifies existing behavior in the `openni_kinect` stack, so
backwards compatibility is not expected to be an issue.


References
==========

* REP 103: Standard Units of Measure and Coordinate Conventions
  (http://www.ros.org/reps/rep-0103.html)

* REP 117: Informational Distance Measurements
  (http://www.ros.org/reps/rep-0117.html)


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: