Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1 ]
Yang, Jinrong [2 ]
Sun, Jianjian [1 ]
Ge, Zheng [1 ]
Dong, Runpei [3 ]
Zhou, Hongyu [1 ]
Mao, Weixin [4 ]
Peng, Yuang [5 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in a parallel manner. While parallel fusion can benefit from long-term information, its computational and memory overheads increase as the fusion window grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be integrated efficiently, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
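The contrast the abstract draws between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the function names, the feature shape, the averaging in `parallel_fuse`, and the fixed blending weight `alpha` are all assumptions made for illustration, standing in for learned fusion layers, and the sketch omits ego-motion alignment and the temporal embedding module.

```python
import numpy as np

def parallel_fuse(frames):
    """Parallel fusion (toy): every frame in the window is held and fused at once."""
    stacked = np.stack(frames, axis=0)  # memory grows linearly with the window size
    return stacked.mean(axis=0)         # averaging is a stand-in for a learned fusion layer

def recurrent_fuse(state, frame, alpha=0.5):
    """Recurrent fusion (toy): blend the current frame into one running BEV state."""
    if state is None:                   # first frame: nothing to fuse with yet
        return frame.copy()
    return alpha * frame + (1.0 - alpha) * state

rng = np.random.default_rng(0)
# Hypothetical BEV features with shape (channels, height, width).
frames = [rng.standard_normal((4, 8, 8)) for _ in range(6)]

state = None
for frame in frames:
    state = recurrent_fuse(state, frame)  # only one state tensor is kept per step

window = parallel_fuse(frames)            # needs all six frames in memory
```

In this sketch the recurrent path stores a single state tensor regardless of how many frames have been seen, while the parallel path must keep the whole window, which mirrors the overhead argument in the abstract.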
Pages: 6544 - 6551
Page count: 8
Related papers
50 results in total
  • [31] Generation of Multi-View Video Using a Fusion Camera System for 3D Displays
    Lee, Eun-Kyung
    Ho, Yo-Sung
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2010, 56 (04) : 2797 - 2805
  • [32] Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation
    Zhou, Kangkang
    Zhang, Lijun
    Lu, Feng
    Zhou, Xiang-Dong
    Shi, Yu
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7512 - 7520
  • [33] A Multi-View Texture Fusion Approach for High Quality 3D Face Modelling
    Li, Bing-chuan
    Ye, Yu-ping
    Song, Zhan
    Kong, Ling-sheng
    Tang, Su-ming
    2019 INTERNATIONAL CONFERENCE ON ENERGY, POWER, ENVIRONMENT AND COMPUTER APPLICATION (ICEPECA 2019), 2019, 334 : 296 - 300
  • [34] SKETCH-BASED 3D SHAPE RETRIEVAL WITH MULTI-VIEW FUSION TRANSFORMER
    Zhu, Cunjuan
    Cui, Dongdong
    Jia, Qi
    Wang, Weimin
    Liu, Yu
    Lew, Michael S.
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 3005 - 3009
  • [35] 3D Target Detection Method Combined with Multi-View Mutual Projection Fusion
    Zhao Y.
    Wang X.
    Gao L.
    Liu Y.
    Dai Y.
    Beijing Ligong Daxue Xuebao/Transaction of Beijing Institute of Technology, 2022, 42 (12) : 1273 - 1282
  • [36] 3D Facial Expression Recognition Based on Multi-View and Prior Knowledge Fusion
    Quang Nhat Vo
    Khanh Tran
    Zhao, Guoying
    2019 IEEE 21ST INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP 2019), 2019,
  • [37] Triangular Patch Based Texture fusion for Multi-view 3D Face Model
    Yang, Shan-min
    Lin, Yi
    Zhang, Jian-wei
    TENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2018), 2019, 11069
  • [38] MVF-GNN: Multi-View Fusion With GNN for 3D Semantic Segmentation
    Du, Zhenxiang
    Ren, Minglun
    Chu, Wei
    Chen, Nengying
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (04) : 3262 - 3269
  • [39] Multi-view 3D Morphable Face Reconstruction via Canonical Volume Fusion
    Tian, Jingqi
    Wang, Zhibo
    Lu, Ming
    Xu, Feng
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 : 545 - 558
  • [40] Adaptive Multi-View and Temporal Fusing Transformer for 3D Human Pose Estimation
    Shuai, Hui
    Wu, Lele
    Liu, Qingshan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (04) : 4122 - 4135