K-armed Bandit based Multi-modal Network Architecture Search for Visual Question Answering

Cited by: 10
Authors
Zhou, Yiyi [1,2]
Ji, Rongrong [1,2]
Sun, Xiaoshuai [1,2]
Luo, Gen [1,2]
Hong, Xiaopeng [3]
Su, Jinsong [2]
Ding, Xinghao [2]
Shao, Ling [4,5]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen, Peoples R China
[2] Xiamen Univ, Sch Informat, Xiamen, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Cyber Sci & Engn, Xian, Peoples R China
[4] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
[5] Mohamed Bin Zayed Univ Artificial Intelligence, Abu Dhabi, U Arab Emirates
Funding
National Natural Science Foundation of China;
Keywords
Visual Question Answering; Network Architecture Search;
DOI
10.1145/3394171.3413998
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate via numerous samplings in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers define the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that, with the help of KAB-NAS, AGAN achieves state-of-the-art performance on both benchmark datasets with far fewer parameters and computations.
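The abstract frames per-layer operator selection as a k-armed bandit solved by repeated architecture samplings inside a single-shot (weight-sharing) framework. The sketch below only illustrates that general idea and is not the authors' implementation: the gradient-bandit preference update, the running-average baseline, the toy reward function standing in for one-shot validation accuracy, and the numeric encoding of the dense-graph/co-graph/separate-graph candidates are all assumptions made for illustration.

    # Minimal sketch of a k-armed-bandit layer search in the spirit of KAB-NAS.
    # Assumptions (not from the paper): gradient-bandit preference updates with a
    # running-average baseline, and a placeholder reward in place of evaluating the
    # sampled sub-network inside a shared supernet.
    import numpy as np

    CANDIDATES = ["dense_graph", "co_graph", "separate_graph"]  # per-layer arms
    NUM_LAYERS = 4
    ALPHA = 0.1          # preference step size
    NUM_SAMPLES = 500    # number of architecture samplings


    def softmax(h):
        e = np.exp(h - h.max())
        return e / e.sum()


    def reward(architecture):
        # Placeholder for single-shot evaluation, e.g. validation accuracy of the
        # sampled sub-network; deterministic per architecture within a run.
        rng = np.random.default_rng(hash(tuple(architecture)) % (2**32))
        return rng.uniform(0.5, 0.8)


    # One preference vector (k arms) per layer.
    prefs = np.zeros((NUM_LAYERS, len(CANDIDATES)))
    baseline = 0.0

    for t in range(1, NUM_SAMPLES + 1):
        probs = np.array([softmax(h) for h in prefs])
        # Sample one candidate operation per layer to form an architecture.
        arch = [np.random.choice(len(CANDIDATES), p=p) for p in probs]
        r = reward(arch)
        baseline += (r - baseline) / t  # running-average reward baseline
        # Gradient-bandit update of each layer's candidate preferences.
        for layer, a in enumerate(arch):
            for k in range(len(CANDIDATES)):
                grad = (1.0 if k == a else 0.0) - probs[layer, k]
                prefs[layer, k] += ALPHA * (r - baseline) * grad

    best = [CANDIDATES[int(np.argmax(h))] for h in prefs]
    print("Selected layer operations:", best)

Running the sketch prints one selected graph operation per layer; in the paper's setting the reward would instead come from evaluating the sampled AGAN sub-network on held-out VQA data.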
Pages: 1245 - 1254
Number of pages: 10