Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications

Citations: 3
Authors
Bose, Priyankar [1 ]
Rana, Pratip [1 ,2 ]
Ghosh, Preetam [1 ]
Affiliations
[1] Virginia Commonwealth Univ, Dept Comp Sci, Richmond, VA 23284 USA
[2] Bennett Aerosp, Raleigh, NC 27603 USA
Keywords
Task analysis; Data models; Deep learning; Transformers; Visualization; Training; Surveys; Question answering (information retrieval); Image segmentation; Image texture analysis; Attention mechanism; data fusion; multimodal learning; vision-language classification; vision-language question-answering; vision-language segmentation; IMAGE; NETWORK;
DOI
10.1109/ACCESS.2023.3299877
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data across various domains. Heterogeneous vision-language multimodal data have been used to solve a variety of tasks, including classification, image segmentation, image captioning, and question-answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we review the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performances, and the evaluation metrics used therein. We revisit the attention mechanisms proposed for image-text multimodal data from their inception in 2015 through 2022, covering a total of 75 articles in this survey. Our discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. This is the first attempt to discuss the vast scope of attention-based deep learning mechanisms on image-text multimodal data.
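As a minimal illustration of the kind of mechanism the survey covers (a generic sketch under our own assumptions, not a model from any of the 75 surveyed articles), the following PyTorch snippet shows single-head cross-modal attention in which text-token queries attend over image-region features; the class name CrossModalAttention and all dimensions are hypothetical.

# Hypothetical sketch: single-head image-text cross-attention (illustrative only).
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text queries attend over image-region keys/values (single head)."""

    def __init__(self, text_dim: int, image_dim: int, attn_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, attn_dim)   # queries from text tokens
        self.k_proj = nn.Linear(image_dim, attn_dim)  # keys from image regions
        self.v_proj = nn.Linear(image_dim, attn_dim)  # values from image regions
        self.scale = attn_dim ** -0.5

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens,  text_dim)
        # image_feats: (batch, num_regions, image_dim)
        q = self.q_proj(text_feats)
        k = self.k_proj(image_feats)
        v = self.v_proj(image_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # Output: (batch, num_tokens, attn_dim) -- text features grounded in image regions.
        return attn @ v

if __name__ == "__main__":
    text = torch.randn(2, 12, 768)    # e.g., 12 word-token embeddings
    image = torch.randn(2, 36, 2048)  # e.g., 36 detected region features
    fused = CrossModalAttention(768, 2048, 512)(text, image)
    print(fused.shape)  # torch.Size([2, 12, 512])

Architectures in this space typically stack such blocks, often multi-head and bidirectional, on top of pretrained visual and textual encoders.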
Pages: 80624 - 80646
Number of pages: 23
Related Papers
50 records in total
  • [1] A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models
    Lee, Jaewook
    Park, Seongsik
    Park, Seong-Heum
    Kim, Hongjin
    Kim, Harksoo
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2789 - 2799
  • [2] DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
    Liu, Fenglin
    Wu, Xian
    Ge, Shen
    Ren, Xuancheng
    Fan, Wei
    Sun, Xu
    Zou, Yuexian
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2022, 16 (01)
  • [3] Automated Quality Evaluation of Large-Scale Benchmark Datasets for Vision-Language Tasks
    Zhao, Ruibin
    Xie, Zhiwei
    Zhuang, Yipeng
    Yu, Philip L. H.
    INTERNATIONAL JOURNAL OF NEURAL SYSTEMS, 2024, 34 (03)
  • [4] Deep Learning for Language and Vision Tasks in Surveillance Applications
    Pastor Lopez-Monroy, A.
    Arturo Elias-Miranda, Alfredo
    Vallejo-Aldana, Daniel
    Manuel Garcia-Carmona, Juan
    Perez-Espinosa, Humberto
    COMPUTACION Y SISTEMAS, 2021, 25 (02): : 317 - 328
  • [5] VL-Meta: Vision-Language Models for Multimodal Meta-Learning
    Ma, Han
    Fan, Baoyu
    Ng, Benjamin K.
    Lam, Chan-Tong
    MATHEMATICS, 2024, 12 (02)
  • [6] VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
    Yin, Ziyi
    Ye, Muchao
    Zhang, Tianrong
    Du, Tianyu
    Zhu, Jinguo
    Liu, Han
    Chen, Jinghui
    Wang, Ting
    Ma, Fenglong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] Multimodal attention-based deep learning for automatic modulation classification
    Han, Jia
    Yu, Zhiyong
    Yang, Jian
    FRONTIERS IN ENERGY RESEARCH, 2023, 10
  • [8] A Survey on Multimodal Deep Learning for Image Synthesis Applications, methods, datasets, evaluation metrics, and results comparison
    Luo, Sanbi
    2021 5TH INTERNATIONAL CONFERENCE ON INNOVATION IN ARTIFICIAL INTELLIGENCE (ICIAI 2021), 2021, : 108 - 120
  • [9] A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
    Bayoudh, Khaled
    Knani, Raja
    Hamdaoui, Faycal
    Mtibaa, Abdellatif
    VISUAL COMPUTER, 2022, 38 (08): : 2939 - 2970