Dynamic Fusion with Intra- and Inter-modality Attention Flow for Visual Question Answering

Cited by: 260
Authors
Gao, Peng [1 ]
Jiang, Zhengkai [3 ]
You, Haoxuan [4 ]
Lu, Pan [4 ]
Hoi, Steven [2 ]
Wang, Xiaogang [1 ]
Li, Hongsheng [1 ]
Affiliations
[1] Chinese Univ Hong Kong, CUHK SenseTime Joint Lab, Hong Kong, Peoples R China
[2] Singapore Management Univ, Singapore, Singapore
[3] CASIA, NLPR, Beijing, Peoples R China
[4] Tsinghua Univ, Beijing, Peoples R China
Keywords
DOI
10.1109/CVPR.2019.00680
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method that dynamically fuses multi-modal features with intra- and inter-modality information flow, alternately passing dynamic information between and across the visual and language modalities. It robustly captures the high-level interactions between the language and vision domains, thus significantly improving the performance of visual question answering. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the current modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
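
The abstract outlines an architecture that alternates inter-modality attention (each modality attends to the other) with dynamic intra-modality attention (self-attention gated by a summary of the other modality). Below is a minimal PyTorch sketch of that idea, assuming scaled dot-product attention, mean-pooled gating, and residual connections; all class and variable names are illustrative assumptions, not the authors' published DFAF implementation.

# Minimal sketch of intra-/inter-modality attention flow (PyTorch).
# Names, gating scheme, and block structure are illustrative, not DFAF's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

class InterModalityFlow(nn.Module):
    # Cross attention: each modality queries the other modality's features.
    def __init__(self, dim):
        super().__init__()
        self.q_v, self.k_v, self.v_v = (nn.Linear(dim, dim) for _ in range(3))
        self.q_t, self.k_t, self.v_t = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, vis, txt):
        vis2 = attention(self.q_v(vis), self.k_t(txt), self.v_t(txt))
        txt2 = attention(self.q_t(txt), self.k_v(vis), self.v_v(vis))
        return vis + vis2, txt + txt2  # residual connections

class DynamicIntraModalityFlow(nn.Module):
    # Self attention within one modality, dynamically gated by a summary of
    # the other modality (an assumed, simplified form of the conditioning).
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.gate = nn.Linear(dim, dim)

    def forward(self, x, other):
        # Mean-pool the other modality and use it to gate queries and keys.
        cond = torch.sigmoid(self.gate(other.mean(dim=1, keepdim=True)))
        out = attention(self.q(x) * cond, self.k(x) * cond, self.v(x))
        return x + out

# Usage: one fusion block alternating inter- then intra-modality flow.
if __name__ == "__main__":
    dim = 512
    vis = torch.randn(2, 36, dim)  # e.g. 36 region features per image
    txt = torch.randn(2, 14, dim)  # e.g. 14 word features per question
    inter = InterModalityFlow(dim)
    intra_v, intra_t = DynamicIntraModalityFlow(dim), DynamicIntraModalityFlow(dim)
    vis, txt = inter(vis, txt)
    vis, txt = intra_v(vis, txt), intra_t(txt, vis)
    print(vis.shape, txt.shape)  # (2, 36, 512) and (2, 14, 512)

In the paper, several such blocks are stacked so that inter- and intra-modality flows alternate; the sketch shows a single pass to keep the structure visible.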
Pages: 6632 - 6641
Page count: 10
Related Papers
50 records in total
  • [31] SkillCLIP: Skill Aware Modality Fusion Visual Question Answering (Student Abstract)
    Naik, Atharva
    Butala, Yash Parag
    Vaikunthan, Navaneethan
    Kapoor, Raghav
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23592 - 23593
  • [32] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    INFORMATION FUSION, 2020, 55 : 116 - 126
  • [33] MDAnet: Multiple Fusion Network with Double Attention for Visual Question Answering
    Feng, Junyi
    Gong, Ping
    Qiu, Guanghui
    ICVIP 2019: PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING, 2019, : 143 - 147
  • [34] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 1653 - 1662
  • [35] Differential Attention for Visual Question Answering
    Patro, Badri
    Namboodiri, Vinay P.
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7680 - 7688
  • [36] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [37] Fusing Attention with Visual Question Answering
    Burt, Ryan
    Cudic, Mihael
    Principe, Jose C.
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 949 - 953
  • [38] Question-Led object attention for visual question answering
    Gao, Lianli
    Cao, Liangfu
    Xu, Xing
    Shao, Jie
    Song, Jingkuan
    NEUROCOMPUTING, 2020, 391 : 227 - 233
  • [39] Question-Agnostic Attention for Visual Question Answering
    Farazi, Moshiur
    Khan, Salman
    Barnes, Nick
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021, : 3542 - 3549
  • [40] Question Type Guided Attention in Visual Question Answering
    Shi, Yang
    Furlanello, Tommaso
    Zha, Sheng
    Anandkumar, Animashree
    COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 158 - 175