Masked co-attention model for audio-visual event localization

Cited by: 0

Authors:

Hengwei Liu

Xiaodong Gu

Affiliations:

[1] Department of Electronic Engineering, Fudan University

Source:

Applied Intelligence | 2024 / Vol. 54

Keywords:

Audio-visual event localization; Video representation; Multi-modal learning; Machine learning

DOI:

Not available

Abstract:

The objective of Audio-Visual Event Localization (AVEL) is to use audio and video cues jointly to localize the video segments that contain audio-visual events and to classify those events. The primary focus is on enhancing the semantic consistency between the video and audio segments while mitigating the influence of unrelated segments. However, data from different modalities are encoded in separate spaces, leading to a modality gap. To address this issue, we propose a model based on a masked co-attention (MCA) mechanism to better explore the multi-modal correlations. In this approach, both intra-modal and cross-modal attention are employed to determine the correlation between visual and audio segments. Furthermore, we introduce a two-level masking strategy. At the feature level, a random masking method is proposed to alleviate overfitting during training. At the attention level, a mask is applied to the co-attention map to filter out redundant information, thereby obtaining fine-grained multi-modal embeddings. Our proposed framework, MCA, achieves state-of-the-art results on the publicly available AVE dataset.
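As a rough illustration of the two masking levels named in the abstract, the NumPy sketch below zeroes random segments at the feature level and masks a cross-modal attention map before the softmax. It is not the authors' implementation: the function names, the choice to drop whole segments, the top-k rule used to mask the attention map, and the feature sizes are all assumptions made for this example, since the abstract does not specify them.

```python
import numpy as np

def random_feature_mask(x, drop_prob, rng):
    """Feature-level masking (assumed form): zero whole segments (rows)
    of x independently with probability drop_prob."""
    keep = rng.random(x.shape[0]) >= drop_prob
    return x * keep[:, None]

def masked_co_attention(audio, visual, keep_top_k):
    """Attention-level masking (assumed form): each audio segment attends
    only to its keep_top_k highest-scoring visual segments."""
    d = audio.shape[1]
    scores = audio @ visual.T / np.sqrt(d)            # (T_audio, T_visual)
    # k-th largest score in each row; ties may keep more than k entries
    kth = np.sort(scores, axis=1)[:, -keep_top_k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)  # drop the rest
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)                           # exp(-inf) == 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ visual, weights

# Hypothetical shapes: 10 segments per video, 128-dim features per modality.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 128))
visual = rng.standard_normal((10, 128))

audio_train = random_feature_mask(audio, drop_prob=0.2, rng=rng)
attended, weights = masked_co_attention(audio_train, visual, keep_top_k=3)
print(attended.shape, weights.shape)
```

The same function can be called with the arguments swapped to let visual segments attend to audio segments. Feature-level masking would normally be applied only during training and skipped at inference.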

Pages: 1691-1705

Page count: 14

[1] Masked co-attention model for audio-visual event localization
Liu, Hengwei
Gu, Xiaodong
[J]. APPLIED INTELLIGENCE, 2024, 54 (02) : 1691 - 1705
[2] Audio-Visual Event Localization by Learning Spatial and Semantic Co-Attention
Xue, Cheng
Zhong, Xionghu
Cai, Minjie
Chen, Hao
Wang, Wenwu
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 418 - 429
[3] Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention
Duan, Bin
Tang, Hao
Wang, Wei
Zong, Ziliang
Yang, Guowei
Yan, Yan
[J]. 2021 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WACV 2021, 2021 : 4012 - 4021
[4] Dual Attention Matching for Audio-Visual Event Localization
Wu, Yu
Zhu, Linchao
Yan, Yan
Yang, Yi
[J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019 : 6301 - 6309
[5] Temporal Cross-Modal Attention for Audio-Visual Event Localization
Nagasaki Y.
Hayashi M.
Kaneko N.
Aoki Y.
[J]. Seimitsu Kogaku Kaishi/Journal of the Japan Society for Precision Engineering, 2022, 88 (03) : 263 - 268
[6] Audio-Visual Event Localization in Unconstrained Videos
Tian, Yapeng
Shi, Jing
Li, Bochen
Duan, Zhiyao
Xu, Chenliang
[J]. COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
[7] Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization
Xuan, Hanyu
Zhang, Zhenyu
Chen, Shuo
Yang, Jian
Yan, Yan
[J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 279 - 286
[8] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
Ge, Shiping
Jiang, Zhiwei
Yin, Yafeng
Wang, Cong
Cheng, Zifeng
Gu, Qing
[J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 3446 - 3454
[9] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Cheng, Ying
Wang, Ruize
Pan, Zhihao
Feng, Rui
Zhang, Yuejie
[J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020 : 3884 - 3892
[10] Dual Perspective Network for Audio-Visual Event Localization
Rao, Varshanth
Khalil, Md Ibrahim
Li, Haoda
Dai, Peng
Lu, Juwei
[J]. COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 689 - 704