Multi-Granularity Relational Attention Network for Audio-Visual Question Answering

Cited by: 2
Authors
Li, Linjun [1 ,2 ]
Jin, Tao [3 ]
Lin, Wang [1 ]
Jiang, Hao [4 ]
Pan, Wenwen [3 ]
Wang, Jian [4 ]
Xiao, Shuwen [4 ]
Xia, Yan [3 ]
Jiang, Weihao [3 ]
Zhao, Zhou [2 ,3 ]
Affiliations
[1] Zhejiang Univ, Sch Software Technol, Hangzhou 310063, Peoples R China
[2] Xidian Univ, State Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
[3] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou 310063, Peoples R China
[4] Alibaba Inc, Hangzhou 310023, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visualization; Question answering (information retrieval); Labeling; Manuals; Electronic commerce; Task analysis; Cognition; Audio-visual question answering; multi-granularity relational attention network; e-commerce dataset; GRAPH CONVOLUTIONAL NETWORKS; VIDEO; REPRESENTATION; TRANSFORMER;
DOI
10.1109/TCSVT.2023.3264524
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
Recent methods for video question answering (VideoQA), which aim to generate answers based on given questions and video content, have made significant progress in cross-modal interaction. From the perspective of video understanding, these existing frameworks concentrate on various levels of visual content, partially assisted by subtitles. However, audio information is also instrumental in arriving at correct answers, especially in videos of real-life scenarios. Indeed, in some cases, audio and visual content are both required and complement each other to answer a question; this setting is defined as audio-visual question answering (AVQA). In this paper, we focus on incorporating raw audio for AVQA and contribute in three ways. Firstly, since no existing dataset annotates QA pairs for raw audio, we introduce E-AVQA, a manually annotated, large-scale, multi-modal dataset. E-AVQA consists of 34,033 QA pairs over 33,340 clips from 18,786 e-commerce videos. Secondly, we propose a multi-granularity relational attention method, named MGN, with contrastive constraints between audio and visual features after their interaction; it captures local sequential representations by leveraging a pairwise potential attention mechanism and obtains global multi-modal representations via a novel ternary potential attention mechanism. Thirdly, our proposed MGN outperforms the baseline on E-AVQA, achieving 20.73% on WUPS@0.0 and 19.81% on BLEU@1, an improvement of at least 1.02 on WUPS@0.0 and roughly 10% lower time complexity than the baseline.
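The abstract distinguishes a pairwise attention (between two modality sequences) from a ternary attention (jointly scoring audio, visual, and question). The paper's exact formulation is not given in this record, so the following is only a minimal illustrative sketch under assumed shapes: `pairwise_attention` is standard scaled dot-product cross-attention, and `ternary_attention` is a toy three-way scoring in which the question vector modulates every (audio step, visual step) pair before pooling both streams. All function names and dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pairwise_attention(q, k, v):
    """Scaled dot-product cross-attention between two modality sequences."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (Tq, Tk) pairwise affinities
    return softmax(scores, axis=-1) @ v    # (Tq, d) attended features

def ternary_attention(question, audio, visual):
    """Toy ternary attention: the question vector scores each
    (audio step, visual step) pair jointly, then pools both streams."""
    d = question.shape[-1]
    # (Ta, Tv) joint scores: question-modulated audio-visual affinity
    scores = np.einsum('d,ad,vd->av', question, audio, visual) / d
    w = softmax(scores.ravel()).reshape(scores.shape)  # normalize over all pairs
    a_pool = w.sum(axis=1) @ audio         # pooled audio representation (d,)
    v_pool = w.sum(axis=0) @ visual        # pooled visual representation (d,)
    return np.concatenate([a_pool, v_pool])

rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 16))    # 6 audio steps, feature dim 16
visual = rng.normal(size=(8, 16))   # 8 visual steps, feature dim 16
q_vec = rng.normal(size=16)         # question embedding

local_repr = pairwise_attention(audio, visual, visual)   # (6, 16)
global_repr = ternary_attention(q_vec, audio, visual)    # (32,)
```

This only mirrors the granularity split described in the abstract (local pairwise vs. global ternary); the actual MGN architecture, contrastive constraints, and training details are in the full paper.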
Pages: 7080-7094
Page count: 15
Related Papers
50 total
  • [21] Learning Trimodal Relation for Audio-Visual Question Answering with Missing Modality
    Park, Kyu Ri
    Lee, Hong Joo
    Kim, Jung Uk
    COMPUTER VISION - ECCV 2024, PT XV, 2025, 15073 : 42 - 59
  • [22] Progressive Spatio-temporal Perception for Audio-Visual Question Answering
    Li, Guangyao
    Hou, Wenxuan
    Hu, Di
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 7808 - 7816
  • [23] Object-Difference Attention: A Simple Relational Attention for Visual Question Answering
    Wu, Chenfei
    Liu, Jinlai
    Wang, Xiaojie
    Dong, Xuan
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 519 - 527
  • [24] Multi-Channel Co-Attention Network for Visual Question Answering
    Tian, Weidong
    He, Bin
    Wang, Nanxun
    Zhao, Zhongqiu
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020,
  • [25] Efficient Multi-step Reasoning Attention Network for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    Zhang, Meng
    THIRTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2021), 2022, 12083
  • [26] Multi-Modality Global Fusion Attention Network for Visual Question Answering
    Yang, Cheng
    Wu, Weijia
    Wang, Yuxing
    Zhou, Hong
    ELECTRONICS, 2020, 9 (11) : 1 - 12
  • [27] Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
    Yun, Heeseung
    Yu, Youngjae
    Yang, Wonsuk
    Lee, Kangil
    Kim, Gunhee
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2011 - 2021
  • [28] Learning knowledge graph embedding with multi-granularity relational augmentation network
    Xue, Zengcan
    Zhang, Zhaoli
    Liu, Hai
    Yang, Shuoqiu
    Han, Shuyun
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 233
  • [29] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002
  • [30] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189