Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering

Cited by: 2
Authors
Chen, Zailong [1 ]
Wang, Lei [1 ]
Wang, Peng [2 ]
Gao, Peng [3 ]
Affiliations
[1] Univ Wollongong, Sch Comp & Informat Technol, Wollongong, NSW 2522, Australia
[2] Univ Elect Sci & Technol China, Sch Comp Sci & Engn, Chengdu 610056, Peoples R China
[3] Beijing Normal Univ Hong Kong Baptist Univ United, Inst Comp Sci, Zhuhai 519000, Peoples R China
Keywords
Feature extraction; Visualization; Task analysis; Question answering (information retrieval); Data mining; Fuses; Focusing; Audio-visual question answering; video understanding; multimodal learning; deep learning; DIALOG;
DOI
10.1109/TCSVT.2023.3318220
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
As a newly emerging task, audio-visual question answering (AVQA) has attracted increasing research attention. Compared with traditional single-modality (e.g., audio or visual) QA tasks, it poses new challenges due to the higher complexity of feature extraction and fusion brought by the multimodal inputs. First, AVQA requires a more comprehensive understanding of the scene, which involves both audio and visual information. Second, given the richer information, feature extraction has to be more tightly conditioned on the given question. Third, features from different modalities need to be sufficiently correlated and fused. To address these challenges, this work proposes a novel framework for the multimodal question answering task. It characterizes an audio-visual scene at both global and local levels, and within each level, the features from different modalities are thoroughly fused. Furthermore, the given question is utilized to guide not only the feature extraction at the local level but also the final fusion of global and local features to predict the answer. Our framework provides a new perspective for audio-visual scene understanding by focusing on both general and specific representations and by aggregating multimodal information with priority given to question-related content. As experimentally demonstrated, our method significantly improves existing audio-visual question answering performance, with average absolute gains of 3.3% and 3.1% on the MUSIC-AVQA and AVQA datasets, respectively. Moreover, the ablation study verifies the necessity and effectiveness of our design. Our code will be publicly released.
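To make the abstract's description more concrete, below is a minimal PyTorch sketch of the question-guided fusion idea: global and local audio-visual features are weighted by their relevance to the question before answer prediction. This is not the authors' released code; the module name QuestionGuidedFusion, the feature dimension, and the answer-vocabulary size are illustrative assumptions, and the real model's local-level, question-guided feature extraction is not shown.

# Minimal sketch (not the authors' code): question-guided fusion of global and
# local audio-visual features via attention, assuming pre-extracted embeddings.
import torch
import torch.nn as nn

class QuestionGuidedFusion(nn.Module):
    """Weights the global vs. local audio-visual streams by their relevance to the question."""
    def __init__(self, dim: int = 512, num_answers: int = 42):  # both values are placeholders
        super().__init__()
        self.score = nn.Linear(dim, 1)              # relevance score per feature stream
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, global_feat, local_feat, question_feat):
        # global_feat, local_feat, question_feat: (batch, dim)
        streams = torch.stack([global_feat, local_feat], dim=1)    # (batch, 2, dim)
        # Condition each stream on the question before scoring its relevance.
        scores = self.score(streams * question_feat.unsqueeze(1))  # (batch, 2, 1)
        weights = torch.softmax(scores, dim=1)                     # question-dependent weights
        fused = (weights * streams).sum(dim=1)                     # (batch, dim)
        return self.classifier(fused)                              # answer logits

# Usage with random embeddings standing in for audio-visual and question encoders.
model = QuestionGuidedFusion()
g, l, q = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
logits = model(g, l, q)   # (4, 42)

The sketch only illustrates how a question embedding can arbitrate between a general (global) and a specific (local) scene representation; per-level cross-modal fusion would happen upstream of this module.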
Pages: 4109-4119
Number of pages: 11
Related Papers
50 records in total
  • [1] Question-Aware Tube-Switch Network for Video Question Answering
    Yang, Tianhao; Zha, Zheng-Jun; Xie, Hongtao; Wang, Meng; Zhang, Hanwang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019: 1184-1192
  • [2] Question-aware dynamic scene graph of local semantic representation learning for visual question answering
    Wu, Jinmeng; Ge, Fulin; Hong, Hanyu; Shi, Yu; Hao, Yanbin; Ma, Lei
    PATTERN RECOGNITION LETTERS, 2023, 170: 93-99
  • [3] Question-aware prediction with candidate answer recommendation for visual question answering
    Kim, B.; Kim, J.
    ELECTRONICS LETTERS, 2017, 53 (18): 1244-1245
  • [4] Heterogeneous Interactive Graph Network for Audio-Visual Question Answering
    Zhao, Yihan; Xi, Wei; Bai, Gairui; Liu, Xinhui; Zhao, Jizhong
    KNOWLEDGE-BASED SYSTEMS, 2024, 300
  • [5] AVQA: A Dataset for Audio-Visual Question Answering on Videos
    Yang, Pinci; Wang, Xin; Duan, Xuguang; Chen, Hong; Hou, Runze; Jin, Cong; Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 3480-3491
  • [6] Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering
    Li, Zhangbin; Guo, Dan; Zhou, Jinxing; Zhang, Jing; Wang, Meng
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024: 3306-3314
  • [7] Multi-Granularity Relational Attention Network for Audio-Visual Question Answering
    Li, Linjun; Jin, Tao; Lin, Wang; Jiang, Hao; Pan, Wenwen; Wang, Jian; Xiao, Shuwen; Xia, Yan; Jiang, Weihao; Zhao, Zhou
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08): 7080-7094
  • [8] Enhancing question answering in educational knowledge bases using question-aware graph convolutional network
    He, Ping; Chen, Jingfang
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (06): 12037-12048
  • [9] Question-aware memory network for multi-hop question answering in human–robot interaction
    Li, Xinmeng; Alazab, Mamoun; Li, Qian; Yu, Keping; Yin, Quanjun
    COMPLEX & INTELLIGENT SYSTEMS, 2022, 8: 851-861
  • [10] COCA: COllaborative CAusal Regularization for Audio-Visual Question Answering
    Lao, Mingrui; Pu, Nan; Liu, Yu; He, Kai; Bakker, Erwin M.; Lew, Michael S.
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023: 12995-13003