Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Cited by: 2
Authors
Li, Linjun [1 ,2 ]
Jin, Tao [3 ]
Lin, Wang [1 ]
Jiang, Hao [4 ]
Pan, Wenwen [3 ]
Wang, Jian [4 ]
Xiao, Shuwen [4 ]
Xia, Yan [3 ]
Jiang, Weihao [3 ]
Zhao, Zhou [2 ,3 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, Hangzhou 310063, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310063, Peoples R China
[4] Alibaba Inc, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Visualization; Question answering (information retrieval); Labeling; Manuals; Electronic commerce; Task analysis; Cognition; Audio-visual question answering; multi-granularity relational attention network; e-commerce dataset; GRAPH CONVOLUTIONAL NETWORKS; VIDEO; REPRESENTATION; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3264524
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Recent methods for video question answering (VideoQA), which aim to generate answers from a given question and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in arriving at correct answers, especially in videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other to answer a question; this setting is defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio for AVQA and contribute in three ways. First, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale, multi-modal dataset. E-AVQA consists of 34,033 QA pairs over 33,340 clips from 18,786 e-commerce videos. Second, we propose a multi-granularity relational attention method, named MGN, with contrastive constraints between the post-interaction audio and visual features; it captures local sequential representations with a pairwise potential attention mechanism and obtains a global multi-modal representation with a novel ternary potential attention mechanism. Third, MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, an improvement of at least 1.02 points on WUPS@0.0 with roughly 10% lower time complexity than the baseline.
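The abstract names three components (pairwise potential attention, ternary potential attention, and a contrastive constraint between post-interaction audio and visual features) but gives no equations or code. Below is a minimal PyTorch sketch of what such components could look like. Everything here is an assumption for illustration: the names PairwisePotentialAttention, ternary_potential_attention, and contrastive_loss are hypothetical, and the residual fusion, dot-product potentials, and symmetric InfoNCE form are plausible readings of the abstract, not the authors' actual MGN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwisePotentialAttention(nn.Module):
    """Hypothetical pairwise attention: audio tokens attend over visual tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** -0.5

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (B, Ta, D), visual: (B, Tv, D)
        scores = self.q(audio) @ self.k(visual).transpose(1, 2) * self.scale  # (B, Ta, Tv)
        # Residual fusion is an assumption, not taken from the paper.
        return audio + torch.softmax(scores, dim=-1) @ self.v(visual)


def ternary_potential_attention(question, audio, visual):
    # question: (B, D); audio: (B, Ta, D); visual: (B, Tv, D).
    # Illustrative ternary potential: score every (audio, visual) token pair by
    # its agreement with the question, then pool one global multi-modal vector.
    pair = audio.unsqueeze(2) * visual.unsqueeze(1)            # (B, Ta, Tv, D)
    scores = (pair * question[:, None, None, :]).sum(-1)       # (B, Ta, Tv)
    weights = torch.softmax(scores.flatten(1), dim=-1).view_as(scores)
    return (weights.unsqueeze(-1) * pair).sum(dim=(1, 2))      # (B, D)


def contrastive_loss(a, v, tau: float = 0.07):
    # Symmetric InfoNCE over pooled clip-level audio/visual embeddings (B, D);
    # matched clips are positives, other clips in the batch are negatives.
    a, v = F.normalize(a, dim=-1), F.normalize(v, dim=-1)
    logits = a @ v.t() / tau
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.t(), target))


if __name__ == "__main__":
    B, Ta, Tv, D = 4, 20, 16, 256
    audio, visual, question = torch.randn(B, Ta, D), torch.randn(B, Tv, D), torch.randn(B, D)
    fused_audio = PairwisePotentialAttention(D)(audio, visual)                # (B, Ta, D)
    global_repr = ternary_potential_attention(question, fused_audio, visual)  # (B, D)
    loss = contrastive_loss(fused_audio.mean(1), visual.mean(1))
    print(global_repr.shape, loss.item())
```

The sketch reads "ternary" as a three-way agreement score between a question vector and each (audio token, visual token) pair, generalizing the pairwise dot-product potential; the actual MGN formulation is described in the full paper (DOI above).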
Pages: 7080-7094
Page count: 15
Related Papers
50 records in total (items [41]-[50] shown)
  • [41] Multi-granularity visual explanations for CNN
    Bao, Huanan
    Wang, Guoyin
    Li, Shuai
    Liu, Qun
    KNOWLEDGE-BASED SYSTEMS, 2022, 253
  • [42] Visual Question Generation Under Multi-granularity Cross-Modal Interaction
    Chai, Zi
    Wan, Xiaojun
    Han, Soyeon Caren
    Poon, Josiah
    MULTIMEDIA MODELING, MMM 2023, PT I, 2023, 13833 : 255 - 266
  • [43] Multi-granularity cross attention network for person re-identification
    Han, Chengmei
    Jiang, Bo
    Tang, Jin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (10) : 14755 - 14773
  • [45] Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
    Li, Zhangbin
    Guo, Dan
    Zhou, Jinxing
    Zhang, Jing
    Wang, Meng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3306 - 3314
  • [46] Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention
    Xu, Xinmeng
    Wang, Yang
    Jia, Jie
    Chen, Binbin
    Li, Dejun
    INTERSPEECH 2022, 2022, : 971 - 975
  • [47] Feature pyramid attention network for audio-visual scene classification
    Zhou, Liguang
    Zhou, Yuhongze
    Qi, Xiaonan
    Hu, Junjie
    Lam, Tin Lun
    Xu, Yangsheng
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2024
  • [48] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [49] Latent Attention Network With Position Perception for Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (03) : 5059 - 5069
  • [50] Deep Attention Neural Tensor Network for Visual Question Answering
    Bai, Yalong
    Fu, Jianlong
    Zhao, Tiejun
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XII, 2018, 11216 : 21 - 37