[ros-users] [Discourse.ros.org] [ROS Projects] Slice and dice large ROS bag files on Hadoop and Spark

Wed Jul 26 15:21:59 UTC 2017

Industry produces large amounts of sensor and robotic data at an ever-increasing pace, whether from areas such as mobility, perception and smart factories or from development tools used for planning, modelling or simulation.

Fast-moving research topics in robotics, such as self-driving cars, create pressure to develop new tools and techniques for dealing with larger and more complex data sets. Some projects and industry players have publicly announced the adoption of ROS as part of their process.

On the other hand, the **Hadoop** and **Spark** ecosystems are seeing tremendous adoption for processing and analysing large data sets in parallel. (The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets.)

**Why process large ROS bag files in parallel?**

![08|690x380](/uploads/ros/original/1X/d04ed3baa759c4cef69d41fac1c97122959f42f2.png)

The ROS command **rosbag record** subscribes to topics and writes a bag file with the contents of all messages published on those topics. For performance reasons the messages are written interleaved, in the order they arrive over the wire, and different topics publish at different frequencies.
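
To make the interleaving concrete, here is a minimal sketch in plain Python. It does not read a real bag file: the topic names, timestamps and payloads are invented for illustration, and `split_by_topic` is a hypothetical helper, not part of ROS or of our project. It only shows that pulling out a single topic means scanning every record in the stream.

```python
from collections import defaultdict

# Hypothetical interleaved stream of (topic, timestamp_seconds, payload).
# A fast topic (/imu) and a slow topic (/camera/image) arrive mixed together,
# in the order they would have been received.
interleaved = [
    ("/camera/image", 0.00, b"frame-0"),
    ("/imu", 0.00, b"imu-0"),
    ("/imu", 0.01, b"imu-1"),
    ("/imu", 0.02, b"imu-2"),
    ("/camera/image", 0.03, b"frame-1"),
    ("/imu", 0.03, b"imu-3"),
]


def split_by_topic(records):
    """Group interleaved records by topic, preserving arrival order per topic."""
    by_topic = defaultdict(list)
    for topic, stamp, payload in records:
        by_topic[topic].append((stamp, payload))
    return dict(by_topic)


by_topic = split_by_topic(interleaved)
for topic, messages in by_topic.items():
    print(topic, len(messages))
```

On a multi-gigabyte recording this full scan is the kind of work worth distributing across machines.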

**Associative** operations can be applied in parallel. More precisely, parallelism requires associativity: partial results computed on separate pieces of the data must combine to the same answer regardless of how the data was grouped. (Concurrency is technically not parallelism, but it requires associativity as well.) Spark provides a unified functional API for processing locally, concurrently or on multiple machines.
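
The role of associativity can be sketched with the Python standard library alone. This is not Spark code and not how Spark is implemented; `chunked` and `parallel_reduce` are hypothetical helpers, and the message sizes are made-up numbers. Each chunk is reduced independently and the partial results are then combined, which matches a sequential reduce only when the operation is associative.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_reduce(op, items, chunk_size=4):
    """Reduce each chunk independently, then combine the partial results.

    Assumes `items` is non-empty. The result equals a plain sequential
    reduce only if `op` is associative.
    """
    chunks = chunked(items, chunk_size)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda chunk: reduce(op, chunk), chunks))
    return reduce(op, partials)


# Made-up message sizes in bytes, for illustration only.
message_sizes = [120, 64, 2048, 512, 64, 120, 4096, 256, 64, 1024]

total = parallel_reduce(lambda a, b: a + b, message_sizes)  # addition is associative
largest = parallel_reduce(max, message_sizes)               # so is max

print(total == sum(message_sizes), largest == max(message_sizes))
```

Swapping in a non-associative operation such as subtraction makes the chunked result differ from the sequential one, which is why such an operation cannot safely be distributed this way.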

**Now you do not need to convert ROS bag files to work with them in Spark**

Until now, the assumption was that ROS bag files have to be converted into a more suitable format before they can be processed in parallel with tools like Hadoop or Spark. It turns out that the format is good enough for processing on a distributed file system like HDFS; it just happened that nobody had written a Hadoop InputFormat for it.

So we did it. We took the time and wrote a Hadoop RosbagInputFormat :grinning:, published under the Apache 2.0 License.

[http://github.com/valtech/ros_hadoop](http://github.com/valtech/ros_hadoop)

RosbagInputFormat is an open source splittable Hadoop InputFormat for the rosbag file format. 

![16|653x482](/uploads/ros/original/1X/3e98bb3b9a14f0653b62244e6c03ba9906f9ea93.png)

We also prepared a Dockerfile and a step-by-step tutorial that you can use to try the concepts presented here:

[http://github.com/valtech/ros_hadoop](http://github.com/valtech/ros_hadoop)

We hope that the RosbagInputFormat will be useful to you. It would be great if you could give us some feedback.

Thanks!
Adrian, Jan

---
[Visit Topic](https://discourse.ros.org/t/slice-and-dice-large-ros-bag-files-on-hadoop-and-spark/2314/1) or reply to this email to respond.