Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Citations: 2
Authors
Han, Chunrui [1]
Yang, Jinrong [2]
Sun, Jianjian [1]
Ge, Zheng [1]
Dong, Runpei [3]
Zhou, Hongyu [1]
Mao, Weixin [4]
Peng, Yuang [5]
Zhang, Xiangyu [1]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification
TP24 [Robotics];
Subject classification codes
080202; 1405;
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in a parallel manner. While parallel fusion can benefit from long-term information, its computational and memory overheads increase as the fusion window grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be integrated efficiently, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
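The contrast the abstract draws between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the function names `parallel_fusion` and `recurrent_fusion`, the averaging, and the mixing weight `alpha` are all hypothetical stand-ins for learned fusion layers, and the sketch omits ego-motion alignment and the temporal embedding module. It only shows why the per-step cost of the parallel form grows with the window, while the recurrent form keeps a single carried state.

```python
import numpy as np


def parallel_fusion(frames, window):
    """Toy parallel fusion: at each step, average the last `window` BEV maps.

    Per-step work and memory scale with `window`, because every frame in
    the window is revisited. (Averaging is an assumption for illustration.)
    """
    fused = []
    for t in range(len(frames)):
        start = max(0, t - window + 1)
        fused.append(np.mean(frames[start:t + 1], axis=0))
    return fused


def recurrent_fusion(frames, alpha=0.5):
    """Toy recurrent fusion: blend the current BEV map into one carried state.

    Per-step work is constant regardless of how much history the state
    summarizes. The fixed `alpha` blend is a hypothetical stand-in for a
    learned fusion module.
    """
    state = None
    fused = []
    for frame in frames:
        if state is None:
            state = frame  # first frame initializes the history state
        else:
            state = alpha * frame + (1.0 - alpha) * state
        fused.append(state)
    return fused


if __name__ == "__main__":
    # Four dummy 2x2 "BEV feature maps" filled with 0, 1, 2, 3.
    frames = [np.full((2, 2), float(i)) for i in range(4)]
    print(parallel_fusion(frames, window=2)[-1])
    print(recurrent_fusion(frames)[-1])
```

In the recurrent version, information from every earlier frame still contributes to the final state, with a weight that decays over time, which is one simple way to picture long-term history being retained without a growing window.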
Pages: 6544-6551
Number of pages: 8