Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Cited by: 2
Authors
Li, Linjun [1 ,2 ]
Jin, Tao [3 ]
Lin, Wang [1 ]
Jiang, Hao [4 ]
Pan, Wenwen [3 ]
Wang, Jian [4 ]
Xiao, Shuwen [4 ]
Xia, Yan [3 ]
Jiang, Weihao [3 ]
Zhao, Zhou [2 ,3 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, Hangzhou 310063, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310063, Peoples R China
[4] Alibaba Inc, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Question answering (information retrieval); Labeling; Manuals; Electronic commerce; Task analysis; Cognition; Audio-visual question answering; multi-granularity relational attention network; e-commerce dataset; GRAPH CONVOLUTIONAL NETWORKS; VIDEO; REPRESENTATION; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3264524
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Recent methods for video question answering (VideoQA), which aim to generate answers from a given question and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in arriving at correct answers, especially in videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other to answer a question; this setting is defined as audio-visual question answering (AVQA). In this paper, we focus on importing raw audio into AVQA and contribute in three ways. First, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale, multi-modal dataset. E-AVQA consists of 34,033 QA pairs over 33,340 clips from 18,786 e-commerce videos. Second, we propose a multi-granularity relational attention network, named MGN, with contrastive constraints between audio and visual features after their interaction; it captures local sequential representations via a pairwise potential attention mechanism and obtains a global multi-modal representation via a novel ternary potential attention mechanism. Third, MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, an improvement of at least 1.02 on WUPS@0.0 with roughly 10% lower time complexity than the baseline.
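The abstract distinguishes pairwise attention (relating two modalities to form local representations) from a ternary attention over three modalities (forming a global representation), constrained by a contrastive objective. The paper's actual equations are not given in this record, so the following is a minimal NumPy sketch of that general idea; the function names, the sum-of-pairs value construction in the ternary step, and the InfoNCE-style loss are all illustrative assumptions, not the authors' MGN implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_attention(q, k, v):
    # Pairwise (query-key) attention: scores relate two modalities at a
    # time, yielding a "local" representation for each query token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def ternary_attention(a, v, t):
    # Hypothetical ternary scoring: scores[i, j, k] jointly couples
    # audio token a_i, visual token v_j, and text token t_k; each audio
    # token then attends over all (visual, text) pairs to produce one
    # global multi-modal summary.
    scores = np.einsum('id,jd,kd->ijk', a, v, t) / a.shape[-1]
    weights = softmax(scores.reshape(a.shape[0], -1), axis=-1)
    # Value for pair (j, k): sum of the two features (an assumption).
    pairs = (v[:, None, :] + t[None, :, :]).reshape(-1, v.shape[-1])
    return weights @ pairs

def contrastive_loss(x, y, tau=0.1):
    # InfoNCE-style constraint: matching rows of x and y (same sample)
    # should be more similar than mismatched rows.
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    y = y / np.linalg.norm(y, axis=-1, keepdims=True)
    logits = x @ y.T / tau
    m = logits.max(axis=-1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(logp))

# Toy features: 4 audio, 5 visual, 6 text tokens, dimension 8.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
visual = rng.normal(size=(5, 8))
text = rng.normal(size=(6, 8))

local = pairwise_attention(audio, visual, visual)   # (4, 8) local representation
global_ = ternary_attention(audio, visual, text)    # (4, 8) global representation
loss = contrastive_loss(local, global_)
```

The contrastive term here aligns the two granularities per audio token; the paper instead constrains audio and visual features after interaction, but the mechanics of the loss are the same.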
Pages: 7080 - 7094
Page count: 15
Related Papers (50 total)
  • [1] Multi-Granularity Cross-Attention Network for Visual Question Answering
    Wang, Yue
    Gao, Wei
    Cheng, Xinzhou
    Wang, Xin
    Zhao, Huiying
    Xie, Zhipu
    Xu, Lexi
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 2098 - 2103
  • [2] Multi-Granularity Interaction and Integration Network for Video Question Answering
    Wang, Yuanyuan
    Liu, Meng
    Wu, Jianlong
    Nie, Liqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (12) : 7684 - 7695
  • [3] Heterogeneous Interactive Graph Network for Audio-Visual Question Answering
    Zhao, Yihan
    Xi, Wei
    Bai, Gairui
    Liu, Xinhui
    Zhao, Jizhong
    KNOWLEDGE-BASED SYSTEMS, 2024, 300
  • [4] Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering
    Wang, Wei
    Yan, Ming
    Wu, Chen
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 1705 - 1714
  • [5] M2FNet: Multi-granularity Feature Fusion Network for Medical Visual Question Answering
    Wang, He
    Pan, Haiwei
    Zhang, Kejia
    He, Shuning
    Chen, Chunling
    PRICAI 2022: TRENDS IN ARTIFICIAL INTELLIGENCE, PT II, 2022, 13630 : 141 - 154
  • [6] Multi-granularity Hierarchical Attention Siamese Network for Visual Tracking
    Chen, Xing
    Zhang, Xiang
    Tan, Huibin
    Lan, Long
    Luo, Zhigang
    Huang, Xuhui
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
  • [7] Chinese Knowledge Base Question Answering by Attention-Based Multi-Granularity Model
    Shen, Cun
    Huang, Tinglei
    Liang, Xiao
    Li, Feng
    Fu, Kun
    INFORMATION, 2018, 9 (04)
  • [8] AVQA: A Dataset for Audio-Visual Question Answering on Videos
    Yang, Pinci
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Hou, Runze
    Jin, Cong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 3480 - 3491
  • [9] Multi-granularity Temporal Question Answering over Knowledge Graphs
    Chen, Ziyang
    Liao, Jinzhi
    Zhao, Xiang
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 11378 - 11392
  • [10] Multi-Attention Audio-Visual Fusion Network for Audio Spatialization
    Zhang, Wen
    Shao, Jie
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 394 - 401