Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Cited by: 2
Authors
Li, Linjun [1 ,2 ]
Jin, Tao [3 ]
Lin, Wang [1 ]
Jiang, Hao [4 ]
Pan, Wenwen [3 ]
Wang, Jian [4 ]
Xiao, Shuwen [4 ]
Xia, Yan [3 ]
Jiang, Weihao [3 ]
Zhao, Zhou [2 ,3 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, Hangzhou 310063, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310063, Peoples R China
[4] Alibaba Inc, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Question answering (information retrieval); Labeling; Manuals; Electronic commerce; Task analysis; Cognition; Audio-visual question answering; multi-granularity relational attention network; e-commerce dataset; GRAPH CONVOLUTIONAL NETWORKS; VIDEO; REPRESENTATION; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3264524
CLC (Chinese Library Classification)
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline codes
0808 ; 0809 ;
Abstract
Recent methods for video question answering (VideoQA), which aim to generate answers from a given question and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in finding correct answers, especially in videos of real-life scenarios. Indeed, in some cases audio and visual content are both required and complement each other to answer a question; this task is defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio for AVQA and contribute in three ways. First, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale, multi-modal dataset. E-AVQA consists of 34,033 QA pairs over 33,340 clips from 18,786 e-commerce videos. Second, we propose MGN, a multi-granularity relational attention method with contrastive constraints between audio and visual features after interaction: it captures local sequential representations with a pairwise potential attention mechanism and obtains global multi-modal representations with a novel ternary potential attention mechanism. Third, MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, an improvement of at least 1.02 points on WUPS@0.0 with roughly 10% lower time complexity than the baseline.
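The abstract does not specify the exact form of MGN's pairwise or ternary potential attention. As a rough, hypothetical illustration of the general idea it names (cross-modal attention between audio and visual sequences, plus a contrastive constraint aligning the two modalities), one might sketch the two ingredients in PyTorch; all function names, shapes, and the temperature value below are assumptions, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def pairwise_attention(audio, visual):
    """Pairwise cross-modal attention sketch: each audio time step
    attends over all visual time steps (scaled dot-product)."""
    scores = torch.matmul(audio, visual.transpose(-2, -1)) / audio.size(-1) ** 0.5
    return torch.matmul(F.softmax(scores, dim=-1), visual)

def contrastive_loss(audio, visual, temperature=0.1):
    """InfoNCE-style constraint: pooled audio/visual features of the
    same clip are pulled together, different clips pushed apart."""
    a = F.normalize(audio.mean(dim=1), dim=-1)   # (batch, dim)
    v = F.normalize(visual.mean(dim=1), dim=-1)  # (batch, dim)
    logits = a @ v.t() / temperature             # (batch, batch)
    labels = torch.arange(a.size(0))             # matching clip on the diagonal
    return F.cross_entropy(logits, labels)

# Toy features: batch of 4 clips, 8 time steps, 16-dim embeddings.
audio = torch.randn(4, 8, 16)
visual = torch.randn(4, 8, 16)
attended = pairwise_attention(audio, visual)     # (4, 8, 16)
loss = contrastive_loss(audio, visual)
```

This sketches only the pairwise (two-modality) case; the paper's ternary potential attention additionally involves the question representation, whose exact formulation is not given in the abstract.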
Pages: 7080 - 7094
Page count: 15