Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Citations: 2
Authors
Han, Chunrui [1]
Yang, Jinrong [2]
Sun, Jianjian [1]
Ge, Zheng [1]
Dong, Runpei [3]
Zhou, Hongyu [1]
Mao, Weixin [4]
Peng, Yuang [5]
Zhang, Xiangyu [1]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods are mostly in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusing pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
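The trade-off the abstract describes between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the class names, the `fuse` blend, and the flat list standing in for a BEV feature map are all hypothetical, and ego-motion alignment and the temporal embedding module are left out. It only shows that a parallel scheme stores and re-reads a whole window of frames, while a recurrent scheme carries a single fused state.

```python
from collections import deque
from typing import Deque, List, Optional

Bev = List[float]  # toy stand-in for a flattened BEV feature map (assumption)


def fuse(history: Bev, current: Bev, alpha: float = 0.5) -> Bev:
    """Hypothetical fusion operator: a fixed blend of history and current features."""
    return [alpha * h + (1.0 - alpha) * c for h, c in zip(history, current)]


class ParallelFusion:
    """Keeps the last `window` frames and fuses all of them at every step."""

    def __init__(self, window: int) -> None:
        self.frames: Deque[Bev] = deque(maxlen=window)

    def step(self, current: Bev) -> Bev:
        self.frames.append(current)
        n = len(self.frames)
        # Every stored frame is read again, so cost grows with the window size.
        return [sum(vals) / n for vals in zip(*self.frames)]

    def stored_frames(self) -> int:
        return len(self.frames)


class RecurrentFusion:
    """Keeps one fused state and updates it with each new frame."""

    def __init__(self) -> None:
        self.state: Optional[Bev] = None

    def step(self, current: Bev) -> Bev:
        # First frame: no history yet, so the state is the frame itself.
        self.state = current if self.state is None else fuse(self.state, current)
        return self.state

    def stored_frames(self) -> int:
        return 0 if self.state is None else 1


if __name__ == "__main__":
    frames = [[float(t)] * 4 for t in range(8)]
    parallel, recurrent = ParallelFusion(window=8), RecurrentFusion()
    for f in frames:
        parallel.step(f)
        recurrent.step(f)
    print(parallel.stored_frames(), recurrent.stored_frames())  # 8 1
```

In this sketch the recurrent state still carries information from every earlier frame, with older frames weighted less at each update, while its memory footprint stays at one feature map regardless of sequence length.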
Pages: 6544-6551
Page count: 8
Related papers
50 items in total
  • [11] Multi-View Attentive Contextualization for Multi-View 3D Object Detection
    Liu, Xianpeng
    Zheng, Ce
    Qian, Ming
    Xue, Nan
    Chen, Chen
    Zhang, Zhebin
    Li, Chen
    Wu, Tianfu
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024: 16688 - 16698
  • [12] Multi-View 3D Face Reconstruction with Deep Recurrent Neural Networks
    Dou, Pengfei
    Kakadiaris, Ioannis A.
    2017 IEEE INTERNATIONAL JOINT CONFERENCE ON BIOMETRICS (IJCB), 2017: 483 - 492
  • [13] Multi-view 3D face reconstruction with deep recurrent neural networks
    Dou, Pengfei
    Kakadiaris, Ioannis A.
    IMAGE AND VISION COMPUTING, 2018, 80 : 80 - 91
  • [14] Recognition of 3D Object Based on Multi-View Recurrent Neural Networks
    Dong S.
    Li W.-S.
    Zhang W.-Q.
    Zou K.
    Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2020, 49 (02): 269 - 275
  • [15] Sequential Fusion of Multi-view Video Frames for 3D Scene Generation
    Sun, Weilin
    Li, Xiangxian
    Li, Manyi
    Wang, Yuqing
    Zheng, Yuze
    Meng, Xiangxu
    Meng, Lei
    ARTIFICIAL INTELLIGENCE, CICAI 2022, PT I, 2022, 13604 : 597 - 608
  • [16] DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion
    Duzceker, Arda
    Galliani, Silvano
    Vogel, Christoph
    Speciale, Pablo
    Dusmanu, Mihai
    Pollefeys, Marc
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 15319 - 15328
  • [17] Multi-View Hierarchical Fusion Network for 3D Object Retrieval and Classification
    Liu, An-An
    Hu, Nian
    Song, Dan
    Guo, Fu-Bin
    Zhou, He-Yu
    Hao, Tong
    IEEE ACCESS, 2019, 7 : 153021 - 153030
  • [18] 3D model classification based on DRSN and multi-view feature fusion
    Gao, Xueyao
    Zhang, Yunkai
    Zhang, Chunxiang
    Xue, Yongzeng
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 273
  • [19] Fusion of Multi-view Tissue Classification Based on Wound 3D Model
    Wannous, Hazem
    Lucas, Yves
    Treuillet, Sylvie
    Albouy, Benjamin
    ADVANCED CONCEPTS FOR INTELLIGENT VISION SYSTEMS, PROCEEDINGS, 2008, 5259 : 924 - +
  • [20] Multi-View Token Clustering and Fusion for 3D Object Recognition and Retrieval
    Fan, Linlong
    Ge, Yanqi
    Li, Wen
    Duan, Lixin
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 1145 - 1150