Dual Perspective Network for Audio-Visual Event Localization

Cited by: 5

Authors:

Rao, Varshanth [1]

Khalil, Md Ibrahim [1]

Li, Haoda [1,2]

Dai, Peng [1]

Lu, Juwei [1]

Affiliations:

[1] Huawei Noah's Ark Lab, Montreal, PQ, Canada

[2] Univ Toronto, Toronto, ON, Canada

Source:

COMPUTER VISION, ECCV 2022, PT XXXIV | 2022 / Vol. 13694

Keywords:

DOI:

10.1007/978-3-031-19830-4_39

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The Audio-Visual Event Localization (AVEL) problem involves tackling three core sub-tasks: the creation of efficient audiovisual representations using cross-modal guidance, the formation of short-term temporal feature aggregations, and the accumulation of these aggregations to achieve long-term dependency resolution. These sub-tasks are often performed by tailored modules, where the limited inter-module interaction restricts feature learning to a serialized manner. Past works have traditionally viewed videos as temporally sequenced multi-modal streams. We improve on and extend this view by proposing a novel architecture, the Dual Perspective Network (DPNet), that (1) additionally operates on an intuitive graph perspective of a video to simultaneously facilitate cross-modal guidance and short-term temporal aggregation using a Graph Neural Network (GNN), (2) deploys a Temporal Convolutional Network (TCN) to achieve long-term dependency resolution, and (3) encourages interactive feature learning using a cyclic feature refinement process that alternates between the GNN and TCN. Further, we introduce the Relational Graph Convolutional Transformer, a novel GNN integrated into the DPNet, to express and attend to each segment node's relational representation with its different relational neighborhoods. Lastly, we diversify the input to the DPNet through a new video augmentation technique called Replicate and Link, which outputs semantically identical video blends whose graph representations can be linked to those of the source videos. Experiments reveal that our DPNet framework outperforms prior state-of-the-art methods by large margins for the AVEL task on the public AVE dataset, while extensive ablation studies corroborate the efficacy of each proposed method.


Pages: 689-704

Number of pages: 16
