Masked co-attention model for audio-visual event localization

Authors
Hengwei Liu
Xiaodong Gu
Affiliation
[1] Department of Electronic Engineering, Fudan University
Source
Applied Intelligence | 2024 / Vol. 54
Keywords
Audio-visual event localization; Video representation; Multi-modal learning; Machine learning
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
The objective of Audio-Visual Event Localization (AVEL) is to use audio and video cues jointly to localize the video segments that contain audio-visual events and to classify their categories. The primary focus is on enhancing the semantic consistency between the video and audio segments while mitigating the influence of unrelated segments. However, data from different modalities are encoded in separate spaces, leading to a modality gap. To address this issue, we propose a model based on a masked co-attention (MCA) mechanism to better explore the multi-modal correlations. In this approach, both intra-modal and cross-modal attention are employed to determine the correlation between visual and audio segments. Furthermore, we introduce a two-level masking strategy. At the feature level, a random masking method is proposed to alleviate overfitting during training. At the attention level, the mask is applied to the co-attention map to filter out redundant information, thereby obtaining fine-grained multi-modal embeddings. Our proposed framework, MCA, achieves state-of-the-art results on the publicly available AVE dataset.
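The two masking levels described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how either mask is built, so zeroing whole segments at random, thresholding the co-attention scores, and every function and parameter name below (`random_feature_mask`, `masked_softmax`, `drop_prob`, `threshold`) are assumptions made for illustration only.

```python
import math
import random


def random_feature_mask(features, drop_prob, rng):
    """Feature-level mask (assumed form): zero a whole segment's feature
    vector with probability drop_prob. Intended for training only."""
    return [[0.0] * len(seg) if rng.random() < drop_prob else list(seg)
            for seg in features]


def co_attention_map(audio, visual):
    """Scaled dot-product score between every audio and visual segment."""
    scale = math.sqrt(len(audio[0]))
    return [[sum(a * v for a, v in zip(a_seg, v_seg)) / scale
             for v_seg in visual]
            for a_seg in audio]


def masked_softmax(scores, threshold):
    """Attention-level mask (assumed form): discard scores below threshold,
    then normalize each row with a softmax over the surviving entries."""
    weights = []
    for row in scores:
        kept = [s if s >= threshold else -math.inf for s in row]
        if all(k == -math.inf for k in kept):
            # Keep the best entry so the row still sums to one.
            best = max(range(len(row)), key=row.__getitem__)
            kept[best] = row[best]
        top = max(kept)
        exps = [math.exp(k - top) for k in kept]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attend(weights, values):
    """Weighted sum of value vectors, one output vector per weight row."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(w_row, values)) for d in range(dim)]
            for w_row in weights]


# Toy example: 2 audio segments, 3 visual segments, 2-dimensional features.
rng = random.Random(0)
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

audio_train = random_feature_mask(audio, drop_prob=0.2, rng=rng)
weights = masked_softmax(co_attention_map(audio, visual), threshold=0.5)
audio_attended = attend(weights, visual)
```

In this toy case each audio segment keeps only the one visual segment it matches, so the masked attention weights collapse onto that segment and the unrelated ones contribute nothing to the fused embedding.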
Pages: 1691-1705
Page count: 14
Related papers
50 in total
  • [1] Masked co-attention model for audio-visual event localization
    Liu, Hengwei
    Gu, Xiaodong
    [J]. APPLIED INTELLIGENCE, 2024, 54 (02) : 1691 - 1705
  • [2] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
    Xue, Cheng
    Zhong, Xionghu
    Cai, Minjie
    Chen, Hao
    Wang, Wenwu
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 418 - 429
  • [3] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
    Duan, Bin
    Tang, Hao
    Wang, Wei
    Zong, Ziliang
    Yang, Guowei
    Yan, Yan
    [J]. 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021, : 4012 - 4021
  • [4] Dual Attention Matching for Audio-Visual Event Localization
    Wu, Yu
    Zhu, Linchao
    Yan, Yan
    Yang, Yi
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6301 - 6309
  • [5] Temporal Cross-Modal Attention for Audio-Visual Event Localization
    Nagasaki, Y.
    Hayashi, M.
    Kaneko, N.
    Aoki, Y.
    [J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03) : 263 - 268
  • [6] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    [J]. COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [7] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
    Xuan, Hanyu
    Zhang, Zhenyu
    Chen, Shuo
    Yang, Jian
    Yan, Yan
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 279 - 286
  • [8] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
    Ge, Shiping
    Jiang, Zhiwei
    Yin, Yafeng
    Wang, Cong
    Cheng, Zifeng
    Gu, Qing
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3446 - 3454
  • [9] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
    Cheng, Ying
    Wang, Ruize
    Pan, Zhihao
    Feng, Rui
    Zhang, Yuejie
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
  • [10] Dual Perspective Network for Audio-Visual Event Localization
    Rao, Varshanth
    Khalil, Md Ibrahim
    Li, Haoda
    Dai, Peng
    Lu, Juwei
    [J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 689 - 704