Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering

Citations: 2
Authors
Chen, Zailong [1 ]
Wang, Lei [1 ]
Wang, Peng [2 ]
Gao, Peng [3 ]
Affiliations
[1] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2522, Australia
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610056, Peoples R China
[3] Beijing Normal Univ Hong Kong Baptist Univ United Int Coll, Inst Comp Sci, Zhuhai 519000, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Question answering (information retrieval); Data mining; Fuses; Focusing; Audio-visual question answering; video understanding; multimodal learning; deep learning; DIALOG
DOI
10.1109/TCSVT.2023.3318220
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject classification codes
0808; 0809
Abstract
As a newly emerging task, audio-visual question answering (AVQA) has attracted increasing research attention. Compared with traditional single-modality (e.g., audio-only or visual-only) QA tasks, it poses new challenges because the multimodal inputs make feature extraction and fusion more complex. First, AVQA requires a more comprehensive understanding of the scene, involving both audio and visual information; second, with more information present, feature extraction must be better conditioned on the given question; third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterises an audio-visual scene at both global and local levels, and within each level, the features from different modalities are thoroughly fused. Furthermore, the given question guides not only the feature extraction at the local level but also the final fusion of global and local features used to predict the answer. Our framework offers a new perspective on audio-visual scene understanding by attending to both general and specific representations and by aggregating modalities with priority given to question-related information. As experimentally demonstrated, our method significantly improves on existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, an ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.
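The record gives no implementation details of the paper's question-guided fusion, so the idea can only be pictured abstractly. The following is a minimal illustrative sketch, not the authors' actual method: all names are hypothetical, and a simple dot-product relevance score stands in for whatever attention mechanism the paper uses to weight global versus local features by the question.

```python
import math

def question_guided_fusion(question, global_feat, local_feat):
    """Hypothetical sketch of question-guided fusion: score the global and
    local feature vectors by dot-product relevance to the question embedding,
    softmax the scores into weights, and return the weighted sum."""
    scores = [sum(q * f for q, f in zip(question, feat))
              for feat in (global_feat, local_feat)]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Fuse: weight each feature vector and sum elementwise.
    return [weights[0] * g + weights[1] * l
            for g, l in zip(global_feat, local_feat)]

# Toy example: the question aligns with the global feature, so the
# fused vector leans toward it.
q = [1.0, 0.0]
g = [2.0, 2.0]   # "global" scene-level feature
l = [0.0, 4.0]   # "local" question-specific feature
print(question_guided_fusion(q, g, l))  # ≈ [1.7616, 2.2384]
```

The softmax over relevance scores is one simple way to let the question prioritize modality-level features; the actual framework likely uses learned attention layers instead.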
Pages: 4109-4119
Page count: 11
Related papers
50 records in total
  • [21] CAAN: Context-Aware attention network for visual question answering
    Chen, Chongqing
    Han, Dezhi
    Chang, Chin-Chen
    Pattern Recognition, 2022, 132
  • [23] Relation-Aware Graph Attention Network for Visual Question Answering
    Li, Linjie
    Gan, Zhe
    Cheng, Yu
    Liu, Jingjing
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 10312 - 10321
  • [24] Affective Visual Question Answering Network
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Dong, Ming
    IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173
  • [25] Language-Guided Visual Aggregation Network for Video Question Answering
    Liang, Xiao
    Wang, Di
    Wang, Quan
    Wan, Bo
    An, Lingling
    He, Lihuo
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5195 - 5203
  • [26] ReasonFuse: Reason Path Driven and Global-Local Fusion Network for Numerical Table-Text Question Answering
    Xia, Yuancheng
    Li, Feng
    Liu, Qing
    Jin, Li
    Zhang, Zequn
    Sun, Xian
    Shao, Lixu
    NEUROCOMPUTING, 2023, 516 : 169 - 181
  • [27] CAPTURING GLOBAL AND LOCAL INFORMATION IN REMOTE SENSING VISUAL QUESTION ANSWERING
    Guo, Yan
    Huang, Yuancheng
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 6340 - 6343
  • [28] Redundancy-aware Transformer for Video Question Answering
    Li, Yicong
    Yang, Xun
    Zhang, An
    Feng, Chun
    Wang, Xiang
    Chua, Tat-Seng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3172 - 3180
  • [29] KVQA: Knowledge-Aware Visual Question Answering
    Shah, Sanket
    Mishra, Anand
    Yadati, Naganand
    Talukdar, Partha Pratim
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8876 - 8884
  • [30] Complementary spatiotemporal network for video question answering
    Li, Xinrui
    Wu, Aming
    Han, Yahong
    Multimedia Systems, 2022, 28 : 161 - 169