Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Citations: 3
Authors
Bose, Priyankar [1 ]
Rana, Pratip [1 ,2 ]
Ghosh, Preetam [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Bennett Aerosp, Raleigh, NC 27603 USA
Keywords
Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK;
DOI
10.1109/ACCESS.2023.3299877
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data across various domains. Heterogeneous vision-language multimodal data have been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question-answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we review the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performances, and the evaluation metrics used therein. We revisit the attention mechanisms proposed for image-text multimodal data from their inception in 2015 through 2022, covering a total of 75 articles in this survey. Our discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. This is the first attempt to discuss the vast scope of attention-based deep learning mechanisms on image-text multimodal data.
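As a minimal illustration of the kind of mechanism the survey covers (a generic sketch under our own assumptions, not a model from any of the 75 surveyed articles), the following PyTorch snippet shows single-head cross-modal attention in which text-token queries attend over image-region features; the class name CrossModalAttention and all dimensions are hypothetical.

# Hypothetical sketch: single-head image-text cross-attention (illustrative only).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend over image-region keys/values (single head)."""

    def __init__(self, text_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, attn_dim)   # queries from text tokens
        self.k_proj = nn.Linear(image_dim, attn_dim)  # keys from image regions
        self.v_proj = nn.Linear(image_dim, attn_dim)  # values from image regions
        self.scale = attn_dim ** -0.5

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  text_dim)
        # image_feats: (batch, num_regions, image_dim)
        q = self.q_proj(text_feats)
        k = self.k_proj(image_feats)
        v = self.v_proj(image_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Output: (batch, num_tokens, attn_dim) -- text features grounded in image regions.
        return attn @ v

if __name__ == "__main__":
    text = torch.randn(2, 12, 768)    # e.g., 12 word-token embeddings
    image = torch.randn(2, 36, 2048)  # e.g., 36 detected region features
    fused = CrossModalAttention(768, 2048, 512)(text, image)
    print(fused.shape)  # torch.Size([2, 12, 512])

Architectures in this space typically stack such blocks, often multi-head and bidirectional, on top of pretrained visual and textual encoders.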
Pages: 80624 - 80646
Number of pages: 23
Related Papers
50 records in total
  • [1] A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models
    Lee, Jaewook
    Park, Seongsik
    Park, Seong-Heum
    Kim, Hongjin
    Kim, Harksoo
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2789 - 2799
  • [2] DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Ren, Xuancheng
    Fan, Wei
    Sun, Xu
    Zou, Yuexian
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2022, 16 (01)
  • [3] Automated Quality Evaluation of Large-Scale Benchmark Datasets for Vision-Language Tasks
    Zhao, Ruibin
    Xie, Zhiwei
    Zhuang, Yipeng
    Yu, Philip L. H.
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2024, 34 (03)
  • [4] Deep Learning for Language and Vision Tasks in Surveillance Applications
    Pastor Lopez-Monroy, A.
    Arturo Elias-Miranda, Alfredo
    Vallejo-Aldana, Daniel
    Manuel Garcia-Carmona, Juan
    Perez-Espinosa, Humberto
    COMPUTACION Y SISTEMAS, 2021, 25 (02): : 317 - 328
  • [5] VL-Meta: Vision-Language Models for Multimodal Meta-Learning
    Ma, Han
    Fan, Baoyu
    Ng, Benjamin K.
    Lam, Chan-Tong
    MATHEMATICS, 2024, 12 (02)
  • [6] VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
    Yin, Ziyi
    Ye, Muchao
    Zhang, Tianrong
    Du, Tianyu
    Zhu, Jinguo
    Liu, Han
    Chen, Jinghui
    Wang, Ting
    Ma, Fenglong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] Multimodal attention-based deep learning for automatic modulation classification
    Han, Jia
    Yu, Zhiyong
    Yang, Jian
    FRONTIERS IN ENERGY RESEARCH, 2023, 10
  • [8] A Survey on Multimodal Deep Learning for Image Synthesis Applications, methods, datasets, evaluation metrics, and results comparison
    Luo, Sanbi
    2021 5TH INTERNATIONAL CONFERENCE ON INNOVATION IN ARTIFICIAL INTELLIGENCE (ICIAI 2021), 2021, : 108 - 120
  • [9] A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
    Bayoudh, Khaled
    Knani, Raja
    Hamdaoui, Faycal
    Mtibaa, Abdellatif
    VISUAL COMPUTER, 2022, 38 (08): : 2939 - 2970