K-armed Bandit based Multi-modal Network Architecture Search for Visual Question Answering

Cited by: 10
Authors
Zhou, Yiyi [1,2]
Ji, Rongrong [1,2]
Sun, Xiaoshuai [1,2]
Luo, Gen [1,2]
Hong, Xiaopeng [3]
Su, Jinsong [2]
Ding, Xinghao [2]
Shao, Ling [4,5]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian, Peoples R China
[4] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
[5] Mohamed Bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates
Funding
National Natural Science Foundation of China;
Keywords
Visual Question Answering; Network Architecture Search;
DOI
10.1145/3394171.3413998
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate via numerous samplings in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmark datasets with far fewer parameters and computations.
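The abstract frames per-layer operator selection as a k-armed bandit solved by repeated architecture samplings inside a single-shot (weight-sharing) framework. The sketch below only illustrates that general idea and is not the authors' implementation: the gradient-bandit preference update, the running-average baseline, the toy reward function standing in for one-shot validation accuracy, and the numeric encoding of the dense-graph/co-graph/separate-graph candidates are all assumptions made for illustration.

    # Minimal sketch of a k-armed-bandit layer search in the spirit of KAB-NAS.
    # Assumptions (not from the paper): gradient-bandit preference updates with a
    # running-average baseline, and a placeholder reward in place of evaluating the
    # sampled sub-network inside a shared supernet.
    import numpy as np

    CANDIDATES = ["dense_graph", "co_graph", "separate_graph"]  # per-layer arms
    NUM_LAYERS = 4
    ALPHA = 0.1          # preference step size
    NUM_SAMPLES = 500    # number of architecture samplings


    def softmax(h):
        e = np.exp(h - h.max())
        return e / e.sum()


    def reward(architecture):
        # Placeholder for single-shot evaluation, e.g. validation accuracy of the
        # sampled sub-network; deterministic per architecture within a run.
        rng = np.random.default_rng(hash(tuple(architecture)) % (2**32))
        return rng.uniform(0.5, 0.8)


    # One preference vector (k arms) per layer.
    prefs = np.zeros((NUM_LAYERS, len(CANDIDATES)))
    baseline = 0.0

    for t in range(1, NUM_SAMPLES + 1):
        probs = np.array([softmax(h) for h in prefs])
        # Sample one candidate operation per layer to form an architecture.
        arch = [np.random.choice(len(CANDIDATES), p=p) for p in probs]
        r = reward(arch)
        baseline += (r - baseline) / t  # running-average reward baseline
        # Gradient-bandit update of each layer's candidate preferences.
        for layer, a in enumerate(arch):
            for k in range(len(CANDIDATES)):
                grad = (1.0 if k == a else 0.0) - probs[layer, k]
                prefs[layer, k] += ALPHA * (r - baseline) * grad

    best = [CANDIDATES[int(np.argmax(h))] for h in prefs]
    print("Selected layer operations:", best)

Running the sketch prints one selected graph operation per layer; in the paper's setting the reward would instead come from evaluating the sampled AGAN sub-network on held-out VQA data.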
Pages: 1245 - 1254
Number of pages: 10