Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering

Times Cited: 0
Authors
Lin, Ying-Jia [1 ]
Tseng, Ching-Shan [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
visual question answering; explainable VQA; multi-task learning; graph attention networks; vision-language model
DOI
10.6688/JISE.202405_40(3).0014
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recent studies that use object detection as a preliminary step for Visual Question Answering (VQA) ignore the relationships between the objects in an image that are implied by the textual question. In addition, previous VQA models behave like black-box functions, making it difficult to explain why a model gives a particular answer for a given input. To address these issues, we propose a new model structure that strengthens the representations of individual objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then produce relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps the model produce reasonable explanations. Simultaneously, the generated explanations are fused with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method outperforms several baselines on the VQA task, and that incorporating the explanation generator yields reasonable explanations alongside the predicted answers.
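The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of the two components it names: a relation encoder that attends over detected region features with a bias derived from pairwise relative positions (a transformer-style stand-in for the paper's graph-attention-based encoder), and a cross-attention block standing in for the Hybrid-Attention fusion of generated explanations with visual features. Every name, dimension, and wiring choice here (RelationAwareGAT, HybridAttentionFusion, 4-dimensional relative-position features) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn

class RelationAwareGAT(nn.Module):
    # One attention layer over K region features; attention logits are
    # biased by an embedding of pairwise relative positions, so each
    # region attends to the others with awareness of where they sit.
    def __init__(self, dim: int, pos_dim: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pos_bias = nn.Linear(pos_dim, 1)  # scalar bias per region pair

    def forward(self, regions, rel_pos):
        # regions: (B, K, dim); rel_pos: (B, K, K, pos_dim), e.g. normalized
        # (dx, dy, log width-ratio, log height-ratio) for each box pair.
        d = regions.size(-1)
        logits = self.q(regions) @ self.k(regions).transpose(1, 2) / d ** 0.5
        logits = logits + self.pos_bias(rel_pos).squeeze(-1)  # relation-aware bias
        return regions + logits.softmax(dim=-1) @ self.v(regions)  # residual update

class HybridAttentionFusion(nn.Module):
    # Visual features attend to the token features of a generated
    # explanation (one plausible reading of "Hybrid-Attention"; the
    # actual mechanism is not specified in the abstract).
    def __init__(self, dim: int):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, expl_tokens):
        # visual: (B, K, dim); expl_tokens: (B, T, dim)
        fused, _ = self.cross(visual, expl_tokens, expl_tokens)
        return self.norm(visual + fused)

# Toy usage with random tensors: 36 regions, 12 explanation tokens.
B, K, T, dim = 2, 36, 12, 256
vis = RelationAwareGAT(dim)(torch.randn(B, K, dim), torch.randn(B, K, K, 4))
out = HybridAttentionFusion(dim)(vis, torch.randn(B, T, dim))
print(out.shape)  # torch.Size([2, 36, 256])

In a multi-task setup like the one the abstract describes, the fused features would feed an answer classifier while the explanation generator is trained jointly with a captioning-style loss; that joint objective is what ties the explanation to the predicted answer.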
Pages: 649-659
Number of Pages: 11
Related Papers
50 records in total
  • [1] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    [J]. 2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154
  • [2] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [3] Visual question answering with gated relation-aware auxiliary
    Shao, Xiangjun
    Xiang, Zhenglong
    Li, Yuanxiang
    [J]. IET IMAGE PROCESSING, 2022, 16 (05) : 1424 - 1432
  • [4] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao
    Liu, Zihe
    Bai, Ting
    Yan, Chenghao
    Cao, Chenyu
    Wu, Bin
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 164 - 172
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [7] A BERT-based Approach with Relation-aware Attention for Knowledge Base Question Answering
    Luo, Da
    Su, Jindian
    Yu, Shanshan
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [8] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    [J]. PATTERN RECOGNITION, 2023, 136
  • [9] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    [J]. MULTIMEDIA SYSTEMS, 2024, 30 (06)