Self-driving cars and other autonomous vehicles (AVs) depend on a combination of imaging sensors and object detection algorithms to avoid other vehicles, pedestrians, and obstacles on the road. To perform well in diverse driving conditions, these object detection models must fuse information from various sensor modalities like RGB cameras, LiDAR, and thermal cameras in a technique known as "deep sensor fusion."
However, like other deep learning algorithms, object detection models require a massive amount of data and synchronized datasets from each sensor stream. Additionally, with existing methods, any modification to a specific sensor type requires that the entire multi-sensor model be retrained. These factors complicate data collection, increase computational time and financial cost, and limit the flexibility and adaptability of self-driving vehicles in sensing objects.
Now, Caltech researchers in the lab of Soon-Jo Chung, Bren Professor of Control and Dynamical Systems and Jet Propulsion Laboratory Senior Research Scientist, in collaboration with researchers from the Ford Motor Company, have demonstrated a new approach to AV object detection modeling that is more efficient, requires less training data, and acts with greater precision when facing real-world conditions. This new model leverages existing object detection models that have been pretrained on large single-modality datasets from sensors, like RGB cameras and thermal imaging, in various weather conditions. The findings from this work were published in a paper on the preprint site arXiv on October 20, 2023.
"In our work, we fuse together two pretrained object detection models using scene-adaptive fusion modules. One detector takes in RGB, or color, imaging and the other takes in a thermal image," says Connor Lee, an aerospace engineering graduate student and co-first author of the paper. "We found that if we limit the amount of machine learning that's occurring during the model fusion process to a small number of parameters in the form of these fusion modules, we don't need that much training to swiftly obtain results surpassing state-of-the-art."
Although the team tested the approach with specific modalities (RGB imaging and thermal imaging), it also applies to other multimodal deep sensor fusion problems.
"Currently, a lot of researchers are fusing sensor modalities by training every parameter together. By training the entire thing, they require more training data, and likewise more training time, to avoid the overfitting problem," Lee says. Overfitting occurs when a machine learning model learns its training data too well, including any errors or biased inputs, and then struggles with unexpected objects or conditions.
In contrast, the method that Lee and team propose is simpler: rather than training a deep sensor fusion model from scratch, the new approach takes multiple pretrained single-modality object detection models and restricts fusion training to a specific set of parameters contained within small fusion modules. "Instead of trying to repeat the works of others and retrain everything, let's just take these ready-made, pre-trained models off the shelf and fuse them together," Lee says.
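The idea of freezing pretrained models and training only a small fusion module can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the architecture from the paper: two fixed random linear maps stand in for the pretrained RGB and thermal detectors, the hypothetical `FusionModule` holds one mixing weight per feature channel, and the training target is a synthetic 80/20 blend of the two feature streams.

```python
import copy
import random

random.seed(0)
N_IN, N_FEAT = 4, 3  # toy input size and number of feature channels


def make_weights():
    """Random linear map standing in for a pretrained detector backbone."""
    return [[random.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_FEAT)]


def extract(weights, x):
    """Frozen feature extraction: the weights are read but never updated."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class FusionModule:
    """Hypothetical fusion module: one trainable mixing weight per channel."""

    def __init__(self, n_channels):
        self.alpha = [0.5] * n_channels  # the only trainable parameters

    def fuse(self, f_rgb, f_thermal):
        return [a * r + (1 - a) * t
                for a, r, t in zip(self.alpha, f_rgb, f_thermal)]

    def train_step(self, f_rgb, f_thermal, target, lr=0.01):
        """One gradient step on squared error, touching only alpha."""
        fused = self.fuse(f_rgb, f_thermal)
        loss = 0.0
        for i, (y, r, t, g) in enumerate(zip(fused, f_rgb, f_thermal, target)):
            err = y - g
            loss += err * err
            self.alpha[i] -= lr * 2 * err * (r - t)  # d(loss)/d(alpha_i)
        return loss


rgb_weights, thermal_weights = make_weights(), make_weights()
rgb_snapshot = copy.deepcopy(rgb_weights)
thermal_snapshot = copy.deepcopy(thermal_weights)

fusion = FusionModule(N_FEAT)
for _ in range(500):
    x = [random.uniform(-1, 1) for _ in range(N_IN)]
    f_rgb, f_thermal = extract(rgb_weights, x), extract(thermal_weights, x)
    # Synthetic target (an assumption): the ideal fusion leans 80% on RGB.
    target = [0.8 * r + 0.2 * t for r, t in zip(f_rgb, f_thermal)]
    fusion.train_step(f_rgb, f_thermal, target)

n_frozen = 2 * N_FEAT * N_IN
print(f"trainable parameters: {len(fusion.alpha)} of {len(fusion.alpha) + n_frozen}")
print("learned mixing weights:", [round(a, 3) for a in fusion.alpha])
```

In this toy setup only three numbers are ever updated, while the two stand-in detectors keep the weights they started with, which is the property that lets training proceed with comparatively little data.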
"Another motivation of this work is to develop an adaptive perception system for self-driving cars and other AVs," adds Lu Gan, postdoctoral scholar research associate in aerospace, incoming assistant professor at Georgia Tech, and co-first author of the paper. "The lightweight and easy-to-train fusion modules allow us to train and switch certain parameters based on scene information, giving the model a more adaptive ability for the specific scene."
These "scenes" encapsulate different weather and light conditions, and act as different modes of "seeing" for autonomous vehicles. For instance, during a night scene, the RGB image sensors on a self-driving car lose much of their effectiveness. In this case, the vehicle is better served by relying more on another sensor modality, such as thermal imaging. The approach detailed in the paper uses machine learning to train the object detection model to fuse these different sensor modalities in the way best suited to the specific scene or weather/light condition. Rather than an AV that learns how to "see" based on a global rule, the use of scene information in the training process provides a swappable lens of "seeing" that is most relevant and informative for the current condition.
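A rough sketch of the "swappable lens" idea described above, under stated assumptions: the scene names, the fusion weights in `SCENE_MODULES`, the brightness threshold, and the functions `classify_scene`, `fuse_confidence`, and `detect` are all hypothetical, and a simple threshold stands in for whatever scene recognition the real system uses.

```python
# Hypothetical scene-specific fusion modules: each one stores how much
# weight the fused detection confidence gives to each sensor modality.
SCENE_MODULES = {
    "day": {"rgb": 0.7, "thermal": 0.3},
    "night": {"rgb": 0.1, "thermal": 0.9},
}


def classify_scene(mean_brightness, threshold=0.4):
    """Toy scene classifier: thresholds average image brightness in [0, 1]."""
    return "day" if mean_brightness >= threshold else "night"


def fuse_confidence(rgb_conf, thermal_conf, scene):
    """Blend per-modality confidences using the module chosen for the scene."""
    module = SCENE_MODULES[scene]
    return module["rgb"] * rgb_conf + module["thermal"] * thermal_conf


def detect(mean_brightness, rgb_conf, thermal_conf):
    """Pick the fusion module for the current scene, then fuse."""
    scene = classify_scene(mean_brightness)
    return scene, fuse_confidence(rgb_conf, thermal_conf, scene)


# A dark frame: the RGB detector is unsure, the thermal detector is confident.
scene, confidence = detect(mean_brightness=0.05, rgb_conf=0.2, thermal_conf=0.9)
print(scene, round(confidence, 2))
```

Because each scene maps to its own small set of parameters, switching conditions in this sketch means swapping a dictionary entry rather than retraining or replacing either detector.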
"The adaptiveness of our model lies in its ability to actively choose how to process information based on current conditions," Gan says. As an AV enters a new environment, the object detection module efficiently selects a relevant fusion module, enabling quick adaptation to the most applicable scene.
"In our research, we show that our model is effective with only 25% of the entire RGB/thermal dataset, achieving almost similar results to complex architectures that require end-to-end training," Lee says. By avoiding unnecessary complexity, this new approach results in an autonomous vehicle perception model that is more efficient and adaptable on the road.
The paper detailing the approach is titled "RGB-X Object Detection via Scene-Specific Fusion Modules" and was recently accepted for the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV).
Funding for the research was provided by the Ford Motor Company.
Previously, Connor Lee, Lu Gan, and Soon-Jo Chung collaborated on another paper titled "Online Self-Supervised Thermal Water Segmentation for Aerial Vehicles," which was a Finalist for the Best Paper Award on Agri-Robotics at the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).