Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering

Cited: 0
Authors
Lin, Ying-Jia [1 ]
Tseng, Hing-Shan [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
visual question answering; explainable VQA; multi-task learning; graph attention networks; vision-language model
DOI
10.6688/JISE.202405_40(3).0014
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Recent studies that leverage object detection as a preliminary step for Visual Question Answering (VQA) ignore the relationships among the objects in an image that are relevant to the textual question. In addition, previous VQA models behave like black-box functions, making it difficult to explain why a model gives a particular answer for a given input. To address these issues, we propose a new model structure that strengthens the representations of individual objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then produce relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps the model produce reasonable explanations. The generated explanations are simultaneously fused with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method outperforms several baselines on the VQA task, and that incorporating the explanation generator yields reasonable explanations alongside the predicted answers.
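The record does not include the authors' code, so the following is only a minimal PyTorch sketch of the two mechanisms the abstract names: a graph-attention relation encoder whose edge scores are biased by the spatial relation between region pairs, and a hybrid attention that fuses explanation tokens with visual features. All module names, shapes, relation-type counts, and the sigmoid gating design are illustrative assumptions, not the paper's implementation.

# Hedged sketch, not the authors' method: relation-aware graph attention
# over detected regions, plus a simple hybrid-attention fusion of
# explanation-token features with the visual features.
import torch
import torch.nn as nn

class RelationAwareGAT(nn.Module):
    """One graph-attention layer; pairwise attention scores are biased by
    a learned embedding of the relation type between each region pair."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One learned scalar bias per relation type (e.g. left-of, overlaps).
        self.rel_bias = nn.Embedding(num_relations, 1)

    def forward(self, regions, rel_ids):
        # regions: (B, N, dim) region features; rel_ids: (B, N, N) relation ids
        q, k, v = self.q(regions), self.k(regions), self.v(regions)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N)
        scores = scores + self.rel_bias(rel_ids).squeeze(-1)   # relation bias
        return scores.softmax(dim=-1) @ v                      # relation-aware features

class HybridAttention(nn.Module):
    """Cross-attend visual features to explanation tokens, then gate the
    two streams so the answer head sees both modalities."""
    def __init__(self, dim):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, expl):
        # visual: (B, N, dim); expl: (B, T, dim) embedded explanation tokens
        attended, _ = self.cross(visual, expl, expl)
        g = torch.sigmoid(self.gate(torch.cat([visual, attended], dim=-1)))
        return g * visual + (1 - g) * attended

# Toy forward pass: 36 detected regions, a 12-token generated explanation.
B, N, T, dim, num_rel = 2, 36, 12, 256, 11
regions = torch.randn(B, N, dim)
rel_ids = torch.randint(0, num_rel, (B, N, N))
expl = torch.randn(B, T, dim)
fused = HybridAttention(dim)(RelationAwareGAT(dim, num_rel)(regions, rel_ids), expl)
print(fused.shape)  # torch.Size([2, 36, 256])

The gating step is one plausible way to combine the streams; the paper's Hybrid-Attention may weight or stack the modalities differently.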
Pages: 649-659
Page count: 11
Related Papers
50 items in total
  • [21] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [22] Global Relation-Aware Attention Network for Image-Text Retrieval
    Cao, Jie
    Qian, Shengsheng
    Zhang, Huaiwen
    Fang, Quan
    Xu, Changsheng
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 19 - 28
  • [23] Improving Complex Knowledge Base Question Answering with Relation-Aware Subgraph Retrieval and Reasoning Network
    Luo, Dan
    Sheng, Jiawei
    Xu, Hongbo
    Wang, Lihong
    Wang, Bin
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [24] CAAN: Context-Aware attention network for visual question answering
    Chen, Chongqing
    Han, Dezhi
    Chang, Chin-Chen
PATTERN RECOGNITION, 2022, 132
  • [26] Semantic Relation-aware Difference Representation Learning for Change Captioning
    Tu, Yunbin
    Yao, Tingting
    Li, Liang
    Lou, Jiedong
    Gao, Shengxiang
    Yu, Zhengtao
    Yan, Chenggang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 63 - 73
  • [27] An Improved Attention and Hybrid Optimization Technique for Visual Question Answering
    Sharma, Himanshu
    Jalal, Anand Singh
    NEURAL PROCESSING LETTERS, 2022, 54 (01) : 709 - 730
  • [29] Learning visual relationship and context-aware attention for image captioning
    Wang, Junbo
    Wang, Wei
    Wang, Liang
    Wang, Zhiyong
    Feng, David Dagan
    Tan, Tieniu
    PATTERN RECOGNITION, 2020, 98
  • [30] Hierarchical Question-Image Co-Attention for Visual Question Answering
    Lu, Jiasen
    Yang, Jianwei
    Batra, Dhruv
    Parikh, Devi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29