A multi-modal spatial-temporal model for accurate motion forecasting with visual fusion

Cited by: 7

Authors:

Wang, Xiaoding ^{[1,2]}

Liu, Jianmin ^{[1,2]}

Lin, Hui ^{[1,2]}

Garg, Sahil ^{[3]}

Alrashoud, Mubarak ^{[4]}

Affiliations:

[1] Fujian Normal Univ, Coll Comp & Cyber Secur, 8 Xuefu South Rd, Fuzhou 350117, Fujian, Peoples R China

[2] Fujian Prov Univ, Engn Res Ctr Cyber Secur & Educ Informatizat, 8 Xuefu South Rd, Fuzhou 350117, Fujian, Peoples R China

[3] Ecole Technol Super, Elect Engn Dept, Montreal, PQ H3C 1K3, Canada

[4] King Saud Univ, Coll Comp & Informat Sci CCIS, Dept Software Engn SWE, Riyadh 11543, Saudi Arabia

Source:

INFORMATION FUSION | 2024 / Vol. 102

Keywords:

Motion forecasting; Intelligent transportation; Spatial-temporal cross attention; Multi-source visual fusion;

DOI:

10.1016/j.inffus.2023.102046

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Multi-source visual information from ring cameras and stereo cameras provides a direct observation of the road, traffic conditions, and vehicle behavior. However, relying solely on visual information may not provide a complete understanding of the environment. It is crucial for intelligent transportation systems to make effective use of multi-source, multi-modal data to accurately predict the future motion trajectories of vehicles. Therefore, this paper presents a new model for predicting multi-modal trajectories by integrating multi-source visual features. A spatial-temporal cross attention fusion module is developed to capture the spatiotemporal interactions among vehicles, while leveraging the road's geographic structure to improve prediction accuracy. Experimental results on the realistic Argoverse 2 dataset demonstrate that, in comparison to other methods, the proposed method improves minADE (Minimum Average Displacement Error), minFDE (Minimum Final Displacement Error), and MR (Miss Rate) by 1.08%, 3.15%, and 2.14%, respectively, in unimodal prediction. In multimodal prediction, the improvements are 5.47%, 4.46%, and 6.50%, respectively. The method effectively captures the temporal and spatial characteristics of vehicle movement trajectories, making it suitable for autonomous driving applications.
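
For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how they are commonly computed. It is not the authors' evaluation code. The function names, the pure-Python list-of-points representation, and the 2.0 m miss threshold are assumptions made for illustration, and the sketch takes minADE as the lowest average error over the candidates (some benchmarks instead report the ADE of the candidate with the lowest final error).

```python
import math

def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def min_ade(candidates, truth):
    """Lowest average pointwise error over the K candidate trajectories."""
    return min(
        sum(displacement(p, q) for p, q in zip(traj, truth)) / len(truth)
        for traj in candidates
    )

def min_fde(candidates, truth):
    """Lowest endpoint error over the K candidate trajectories."""
    return min(displacement(traj[-1], truth[-1]) for traj in candidates)

def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios whose best endpoint error exceeds the threshold.

    The 2.0 m default is an assumed value for illustration.
    """
    misses = sum(
        1 for candidates, truth in scenarios
        if min_fde(candidates, truth) > threshold
    )
    return misses / len(scenarios)

# Hypothetical toy data: one ground-truth path and two forecast sets.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
good = [[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (1.0, 1.0), (2.0, 3.0)]]
bad = [[(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]]

print(min_ade(good, truth), min_fde(good, truth))
print(min_ade(bad, truth), min_fde(bad, truth))
print(miss_rate([(good, truth), (bad, truth)]))
```

In this sketch, unimodal prediction corresponds to passing a single candidate trajectory (K = 1), and multimodal prediction to passing several.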


Pages: 12
