Relation-Aware Image Captioning with Hybrid-Attention for Explainable Visual Question Answering

Times Cited: 0
Authors
Lin, Ying-Jia [1 ]
Tseng, Ching-Shan [1 ]
Kao, Hung-Yu [1 ]
Affiliations
[1] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 701, Taiwan
Keywords
visual question answering; explainable VQA; multi-task learning; graph attention networks; vision-language model
DOI
10.6688/JISE.202405_40(3).0014
CLC Number
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Recent studies that use object detection as a preliminary step for Visual Question Answering (VQA) ignore the relationships between the objects in an image that are implied by the textual question. In addition, previous VQA models behave like black-box functions, making it difficult to explain why a model gives a particular answer for a given input. To address these issues, we propose a new model structure that strengthens the representations of individual objects and provides explainability for the VQA task. We construct a relation graph to capture the relative positions between region pairs and then produce relation-aware visual features with a relation encoder based on graph attention networks. To make the final VQA predictions explainable, we introduce a multi-task learning framework with an additional explanation generator that helps the model produce reasonable explanations. Simultaneously, the generated explanations are fused with the visual features through a novel Hybrid-Attention mechanism to enhance cross-modal understanding. Experiments show that the proposed method outperforms several baselines on the VQA task, and that incorporating the explanation generator yields reasonable explanations alongside the predicted answers.
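The abstract describes the architecture only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of the two components it names: a relation encoder that attends over detected region features with a bias derived from pairwise relative positions (a transformer-style stand-in for the paper's graph-attention-based encoder), and a cross-attention block standing in for the Hybrid-Attention fusion of generated explanations with visual features. Every name, dimension, and wiring choice here (RelationAwareGAT, HybridAttentionFusion, 4-dimensional relative-position features) is an illustrative assumption, not the authors' implementation.

import torch
import torch.nn as nn

class RelationAwareGAT(nn.Module):
    # One attention layer over K region features; attention logits are
    # biased by an embedding of pairwise relative positions, so each
    # region attends to the others with awareness of where they sit.
    def __init__(self, dim: int, pos_dim: int = 4):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pos_bias = nn.Linear(pos_dim, 1)  # scalar bias per region pair

    def forward(self, regions, rel_pos):
        # regions: (B, K, dim); rel_pos: (B, K, K, pos_dim), e.g. normalized
        # (dx, dy, log width-ratio, log height-ratio) for each box pair.
        d = regions.size(-1)
        logits = self.q(regions) @ self.k(regions).transpose(1, 2) / d ** 0.5
        logits = logits + self.pos_bias(rel_pos).squeeze(-1)  # relation-aware bias
        return regions + logits.softmax(dim=-1) @ self.v(regions)  # residual update

class HybridAttentionFusion(nn.Module):
    # Visual features attend to the token features of a generated
    # explanation (one plausible reading of "Hybrid-Attention"; the
    # actual mechanism is not specified in the abstract).
    def __init__(self, dim: int):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, expl_tokens):
        # visual: (B, K, dim); expl_tokens: (B, T, dim)
        fused, _ = self.cross(visual, expl_tokens, expl_tokens)
        return self.norm(visual + fused)

# Toy usage with random tensors: 36 regions, 12 explanation tokens.
B, K, T, dim = 2, 36, 12, 256
vis = RelationAwareGAT(dim)(torch.randn(B, K, dim), torch.randn(B, K, K, 4))
out = HybridAttentionFusion(dim)(vis, torch.randn(B, T, dim))
print(out.shape)  # torch.Size([2, 36, 256])

In a multi-task setup like the one the abstract describes, the fused features would feed an answer classifier while the explanation generator is trained jointly with a captioning-style loss; that joint objective is what ties the explanation to the predicted answer.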
Pages: 649-659
Number of Pages: 11
Related Papers
50 records in total
  • [1] Relation-Aware Image Captioning for Explainable Visual Question Answering
    Tseng, Ching-Shan
    Lin, Ying-Jia
    Kao, Hung-Yu
    [J]. 2022 INTERNATIONAL CONFERENCE ON TECHNOLOGIES AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, TAAI, 2022, : 149 - 154
  • [2] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [3] Visual question answering with gated relation-aware auxiliary
    Shao, Xiangjun
    Xiang, Zhenglong
    Li, Yuanxiang
    [J]. IET IMAGE PROCESSING, 2022, 16 (05) : 1424 - 1432
  • [4] Relation-aware Hierarchical Attention Framework for Video Question Answering
    Li, Fangtao
    Liu, Zihe
    Bai, Ting
    Yan, Chenghao
    Cao, Chenyu
    Wu, Bin
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 164 - 172
  • [5] Visual Relation-Aware Unsupervised Video Captioning
    Ji, Puzhao
    Cao, Meng
    Zou, Yuexian
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 495 - 507
  • [6] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [7] A BERT-based Approach with Relation-aware Attention for Knowledge Base Question Answering
    Luo, Da
    Su, Jindian
    Yu, Shanshan
    [J]. 2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [8] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    [J]. PATTERN RECOGNITION, 2023, 136
  • [9] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    [J]. MULTIMEDIA SYSTEMS, 2024, 30 (06)