A multi-modal spatial-temporal model for accurate motion forecasting with visual fusion

Cited by: 7
Authors
Wang, Xiaoding [1 ,2 ]
Liu, Jianmin [1 ,2 ]
Lin, Hui [1 ,2 ]
Garg, Sahil [3 ]
Alrashoud, Mubarak [4 ]
Affiliations
[1] Fujian Normal Univ, Coll Comp & Cyber Secur, 8 Xuefu South Rd, Fuzhou 350117, Fujian, Peoples R China
[2] Fujian Prov Univ, Engn Res Ctr Cyber Secur & Educ Informatizat, 8 Xuefu South Rd, Fuzhou 350117, Fujian, Peoples R China
[3] Ecole Technol Super, Elect Engn Dept, Montreal, PQ H3C 1K3, Canada
[4] King Saud Univ, Coll Comp & Informat Sci CCIS, Dept Software Engn SWE, Riyadh 11543, Saudi Arabia
Keywords
Motion forecasting; Intelligent transportation; Spatial-temporal cross attention; Multi-source visual fusion
DOI
10.1016/j.inffus.2023.102046
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The multi-source visual information from ring cameras and stereo cameras provides a direct observation of the road, traffic conditions, and vehicle behavior. However, relying solely on visual information may not yield a complete understanding of the environment. It is therefore crucial for intelligent transportation systems to exploit multi-source, multi-modal data effectively in order to predict the future motion trajectories of vehicles accurately. This paper presents a new model for multi-modal trajectory prediction that integrates multi-source visual features. A spatial-temporal cross-attention fusion module is developed to capture the spatiotemporal interactions among vehicles while leveraging the geographic structure of the road to improve prediction accuracy. Experimental results on the real-world Argoverse 2 dataset demonstrate that, compared with other methods, ours improves minADE (Minimum Average Displacement Error), minFDE (Minimum Final Displacement Error), and MR (Miss Rate) by 1.08%, 3.15%, and 2.14%, respectively, for unimodal prediction; for multimodal prediction, the improvements are 5.47%, 4.46%, and 6.50%. Our method effectively captures the temporal and spatial characteristics of vehicle movement trajectories, making it well suited to autonomous driving applications.
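The abstract names two concrete technical ingredients: a spatial-temporal cross-attention fusion module and the standard trajectory-forecasting metrics. The record contains no implementation details, so the following is a minimal sketch, assuming a PyTorch-style design, of how such a fusion module might interleave temporal self-attention, spatial self-attention, and cross-attention onto encoded visual features. Every name here (STCrossAttentionFusion, embed_dim, the token shapes) is an illustrative assumption, not the authors' code.

    # Hypothetical sketch of spatial-temporal cross-attention fusion.
    # Not the authors' implementation: names, shapes, and the exact
    # ordering of the three attention stages are assumptions.
    import torch
    import torch.nn as nn

    class STCrossAttentionFusion(nn.Module):
        def __init__(self, embed_dim=128, num_heads=8):
            super().__init__()
            self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.spatial_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(embed_dim)

        def forward(self, agent_tokens, visual_tokens):
            # agent_tokens:  (B, A, T, D) per-agent motion-history embeddings
            # visual_tokens: (B, V, D) encoded multi-source visual/map features
            B, A, T, D = agent_tokens.shape
            # Temporal attention: each agent attends over its own T timesteps.
            x = agent_tokens.reshape(B * A, T, D)
            t, _ = self.temporal_attn(x, x, x)
            x = self.norm(x + t).reshape(B, A, T, D)
            # Spatial attention: at each timestep, agents attend over each other.
            x = x.transpose(1, 2).reshape(B * T, A, D)
            s, _ = self.spatial_attn(x, x, x)
            x = self.norm(x + s).reshape(B, T, A, D).transpose(1, 2)
            # Cross attention: motion tokens query the fused visual features.
            q = x.reshape(B, A * T, D)
            c, _ = self.cross_attn(q, visual_tokens, visual_tokens)
            return self.norm(q + c).reshape(B, A, T, D)

    # e.g. fuse 8 agents' 50-step histories with 6 camera feature tokens:
    # fused = STCrossAttentionFusion()(torch.randn(2, 8, 50, 128), torch.randn(2, 6, 128))

The reported minADE, minFDE, and MR figures have standard definitions on the Argoverse benchmarks. The sketch below computes them for a single scenario, following the Argoverse convention of selecting the best of K candidate trajectories by endpoint error and using a 2.0 m miss threshold; dataset-level metrics average these per-scenario values, and MR is the fraction of scenarios flagged as missed.

    import numpy as np

    def min_ade_fde_miss(pred, gt, miss_threshold=2.0):
        """minADE / minFDE / miss flag for one scenario.

        pred: (K, T, 2) array of K candidate future trajectories (metres)
        gt:   (T, 2) ground-truth future trajectory
        """
        dists = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T) pointwise L2 errors
        fde = dists[:, -1]            # endpoint error of each candidate
        best = fde.argmin()           # Argoverse picks the best mode by endpoint error
        min_fde = fde[best]
        min_ade = dists[best].mean()  # ADE of that best candidate
        return min_ade, min_fde, min_fde > miss_threshold

K = 1 corresponds to the unimodal setting and a larger candidate set (K = 6 on Argoverse 2) to the multimodal setting referenced in the abstract.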
Pages: 12
Related Papers
50 records in total
  • [31] MMTF: Multi-Modal Temporal Fusion for Commonsense Video Question Answering
    Ahmad, Mobeen
    Park, Geonwoo
    Park, Dongchan
    Park, Sanguk
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4659 - 4664
  • [32] Medical Visual Question-Answering Model Based on Knowledge Enhancement and Multi-Modal Fusion
    Zhang, Dianyuan
    Yu, Chuanming
    An, Lu
PROCEEDINGS OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2024, 61 (01) : 703 - 708
  • [33] Online video visual relation detection with hierarchical multi-modal fusion
    He, Yuxuan
    Gan, Ming-Gang
    Ma, Qianzhao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 65707 - 65727
  • [34] Text-Guided Multi-Modal Fusion for Underwater Visual Tracking
    Michael, Yonathan
    Alansari, Mohamad
    Javed, Sajid
2024 IEEE INTERNATIONAL CONFERENCE ON ADVANCED VIDEO AND SIGNAL BASED SURVEILLANCE, AVSS 2024, 2024
  • [35] Video Visual Relation Detection via Multi-modal Feature Fusion
    Sun, Xu
    Ren, Tongwei
    Zi, Yuan
    Wu, Gangshan
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2657 - 2661
  • [36] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    PEERJ COMPUTER SCIENCE, 2023, 9
  • [37] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Beguem
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [38] Learning Visual Emotion Distributions via Multi-Modal Features Fusion
    Zhao, Sicheng
    Ding, Guiguang
    Gao, Yue
    Han, Jungong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 369 - 377
  • [39] STQS: Interpretable multi-modal Spatial-Temporal-seQuential model for automatic Sleep scoring
    Pathak, Shreyasi
    Lu, Changqing
    Nagaraj, Sunil Belur
    van Putten, Michel
    Seifert, Christin
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2021, 114 (114)
  • [40] Multi-Modal and Multi-Temporal Data Fusion: Outcome of the 2012 GRSS Data Fusion Contest
    Berger, Christian
    Voltersen, Michael
    Eckardt, Robert
    Eberle, Jonas
    Heyer, Thomas
    Salepci, Nesrin
    Hese, Soeren
    Schmullius, Christiane
    Tao, Junyi
    Auer, Stefan
    Bamler, Richard
    Ewald, Ken
    Gartley, Michael
    Jacobson, John
    Buswell, Alan
    Du, Qian
    Pacifici, Fabio
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2013, 6 (03) : 1324 - 1340