Robots searching for workers trapped in a partially collapsed mine shaft must rapidly generate a map of the scene and identify their own location within it, all while navigating hazardous terrain.
Researchers have recently begun building powerful machine-learning models to perform this complex task using only images from robots’ onboard cameras, but even the best models can only process a few images at a time. In a real-world disaster where every second counts, a search-and-rescue robot will need to quickly traverse large areas and process thousands of images to accomplish its mission.
To overcome this problem, MIT researchers drew on ideas from both recent artificial intelligence vision models and classical computer vision to develop a new system that can process an arbitrary number of images. Their system accurately generates 3D maps of complex scenes, such as a crowded office corridor, in a matter of seconds.
The AI-powered system sequentially creates and aligns small submaps of the scene, which are then stitched together to reconstruct a full 3D map while the system estimates the robot’s position in real time.
Unlike many other approaches, their technique does not require calibrated cameras or an expert to tune complex system implementations. The simple nature of their approach, combined with the speed and quality of 3D reconstruction, will make it easy to scale up for real-world applications.
In addition to helping search-and-rescue robots navigate, this method could be used to create extended reality applications for wearable devices such as VR headsets or to enable industrial robots to quickly locate and move goods inside a warehouse.
“For robots to accomplish increasingly complex tasks, they need more complex map representations of the world around them. But at the same time, we don’t want to make it harder to implement these maps in practice. We’ve shown that it’s possible to generate an accurate 3D reconstruction in a matter of seconds with a tool that works out of the box,” says Dominic Maggio, an MIT graduate student and lead author of a paper on this method.
Maggio is joined by postdoc Heungtae Lim and senior author Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator of the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. This research will be presented at the Conference on Neural Information Processing Systems.
Mapping the solution
For years, researchers have been grappling with an essential element of robotic navigation called simultaneous localization and mapping (SLAM). In SLAM, a robot reconstructs a map of its environment while simultaneously estimating its own position within that map.
Traditional optimization methods for this task fail in challenging scenes, or they require the robot’s onboard cameras to be calibrated in advance. To avoid these pitfalls, researchers instead train machine-learning models to learn the task directly from data.
Although they are easy to implement, even the best models can process only about 60 camera images at a time, making them impractical for applications where a robot needs to move rapidly through a diverse environment while processing thousands of images.
To solve this problem, the MIT researchers designed a system that generates small submaps of the scene instead of the entire map. Their method “glues” these submaps together into an overall 3D reconstruction. The model still processes only a few images at a time, but the system can reconstruct much larger scenes far more quickly by combining the submaps.
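At a high level, the approach can be pictured as a short loop over the image stream. The sketch below is a minimal illustration, not the authors’ code; `model`, `align`, and the batch sizes are hypothetical stand-ins for the learned reconstruction network, the submap-alignment step described below, and whatever chunking a real system would use.

```python
import numpy as np

def reconstruct_scene(images, model, align, batch_size=60, overlap=10):
    """Stitch per-batch submaps into one global point cloud.

    Illustrative sketch only: `model(batch)` stands in for a learned
    reconstruction network returning an (N, 3) point cloud for a batch
    of images, and `align(prev, curr)` stands in for the alignment step,
    returning a 4x4 transform mapping `curr` into `prev`'s frame
    (estimated from the frames the two batches share).
    """
    step = batch_size - overlap            # consecutive batches share frames
    world_from_submap = np.eye(4)          # pose of current submap in world
    global_points, prev = [], None

    for start in range(0, len(images), step):
        batch = images[start:start + batch_size]
        submap = model(batch)              # points in the submap's own frame
        if prev is not None:
            # Chain the new submap onto the map built so far.
            world_from_submap = world_from_submap @ align(prev, submap)
        prev = submap
        homog = np.hstack([submap, np.ones((len(submap), 1))])
        global_points.append((homog @ world_from_submap.T)[:, :3])

    return np.vstack(global_points)
```

Because each batch stays within the model’s limit, the loop can in principle run over thousands of images; the hard part, as the researchers discovered, is the `align` step.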
“It seemed like a very simple solution, but when I first tried it, I was surprised that it didn’t work that well,” says Maggio.
In search of an explanation, he scoured computer vision research papers from the 1980s and 1990s. Through this analysis, Maggio realized that errors in the way machine-learning models processed images made aligning submaps a more complex problem.
Traditional methods align submaps by applying rotation and translation until they line up. But these new models can introduce some ambiguity into the submaps, making them harder to align. For example, a 3D submap of one side of a room may contain walls that are slightly bent or extended. Simply rotating and translating these distorted submaps to align them does not work.
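To see why, it helps to look at the classical rigid fit. Given corresponding points from the overlap of two submaps, the Kabsch/Umeyama algorithm recovers the single best rotation and translation, and nothing more. The sketch below is our illustration of that baseline, not the paper’s implementation; a bent or stretched wall leaves residual error that no choice of R and t can remove.

```python
import numpy as np

def rigid_align(p, q):
    """Best-fit rotation R and translation t with R @ q_i + t ≈ p_i.

    p, q: (N, 3) corresponding points from the overlap of two submaps.
    Classic Kabsch/Umeyama least-squares solution: exact when the
    submaps differ only by a rigid motion, but with no parameters
    left over to absorb bending or stretching.
    """
    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
    H = (q - mu_q).T @ (p - mu_p)           # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_p - R @ mu_q
    return R, t
```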
“We need to make sure that all the submaps are distorted in a consistent way so that we can align them well with each other,” explains Carlone.
A more flexible approach
Borrowing ideas from classical computer vision, the researchers developed a more flexible mathematical technique that can represent all the distortions in these submaps. By applying mathematical transformations to each submap, the method can align them in a way that resolves the ambiguity.
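As one concrete illustration of what “more flexible” can mean, the rigid fit above can be relaxed to a transform with more degrees of freedom. The affine fit below, for instance, can absorb shear and non-uniform scaling shared by two submaps; it is a simplified stand-in of ours, and the paper defines the actual transformation family, which may be more general still.

```python
import numpy as np

def affine_align(p, q):
    """Least-squares affine map A, t with A @ q_i + t ≈ p_i.

    Twelve free parameters instead of the rigid fit's six, so the
    transform can also represent shear and non-uniform scaling, i.e.
    a consistent distortion shared by both submaps. Illustrative
    stand-in only; the paper specifies the transformation family used.
    """
    Q = np.hstack([q, np.ones((len(q), 1))])   # (N, 4) homogeneous points
    M, *_ = np.linalg.lstsq(Q, p, rcond=None)  # solve Q @ M ≈ p, M is 4x3
    return M[:3].T, M[3]                       # A = M[:3].T, t = M[3]
```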
Based on the input images, the system outputs a 3D reconstruction of the scene and estimates of camera locations, which the robot will use to localize itself in space.
“Once Dominic had the intuition to bridge these two worlds – learning-based approaches and traditional optimization methods – implementation was fairly straightforward,” says Carlone. “With something this effective and simple, there is potential for many applications.”
Without the need for specialized cameras or additional equipment to process the data, their system runs faster, with lower reconstruction error, than other methods. The researchers created near-real-time 3D reconstructions of complex scenes, such as the inside of the MIT Chapel, using only short videos captured on cell phones; the average error in these 3D reconstructions was less than 5 centimeters.
In the future, the researchers want to make their method more reliable, especially for complex scenes, and to work toward applying it to real robots in challenging settings.
“Knowing about traditional geometry pays off. If you understand deeply what’s going on in the model, you can get better results and make things more scalable,” says Carlone.
This work is supported, in part, by the US National Science Foundation, the US Office of Naval Research, and the National Research Foundation of Korea. Carlone, who is currently on sabbatical as an Amazon Scholar, completed this work before joining Amazon.