Dual Perspective Network for Audio-Visual Event Localization

Cited by: 5
Authors
Rao, Varshanth [1 ]
Khalil, Md Ibrahim [1 ]
Li, Haoda [1 ,2 ]
Dai, Peng [1 ]
Lu, Juwei [1 ]
Affiliations
[1] Huawei Noah's Ark Lab, Montreal, PQ, Canada
[2] Univ Toronto, Toronto, ON, Canada
Source
Keywords
DOI
10.1007/978-3-031-19830-4_39
CLC classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Audio-Visual Event Localization (AVEL) problem involves tackling three core sub-tasks: the creation of efficient audiovisual representations using cross-modal guidance, the formation of short-term temporal feature aggregations, and their accumulation to achieve long-term dependency resolution. These sub-tasks are often performed by tailored modules, where the limited inter-module interaction restricts feature learning to a serialized manner. Past works have traditionally viewed videos as temporally sequenced multi-modal streams. We improve and extend this view by proposing a novel architecture, the Dual Perspective Network (DPNet), that (1) additionally operates on an intuitive graph perspective of a video to simultaneously facilitate cross-modal guidance and short-term temporal aggregation using a Graph Neural Network (GNN), (2) deploys a Temporal Convolutional Network (TCN) to achieve long-term dependency resolution, and (3) encourages interactive feature learning using a cyclic feature refinement process that alternates between the GNN and TCN. Further, we introduce the Relational Graph Convolutional Transformer, a novel GNN integrated into the DPNet, to express and attend to each segment node's relational representation with its different relational neighborhoods. Lastly, we diversify the input to the DPNet through a new video augmentation technique called Replicate and Link, which outputs semantically identical video blends whose graph representations can be linked to that of the source videos. Experiments reveal that our DPNet framework outperforms prior state-of-the-art methods by large margins for the AVEL task on the public AVE dataset, while extensive ablation studies corroborate the efficacy of each proposed method.
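The alternation between a graph view and a temporal view described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it has no learned weights, it replaces the Relational Graph Convolutional Transformer with plain neighbor averaging, and it replaces the TCN with a fixed dilated averaging filter. All function names (`graph_step`, `temporal_step`, `cyclic_refine`), the neighborhood definition, and the dilation schedule are assumptions made only for illustration.

```python
from typing import List, Tuple

Vec = List[float]  # one feature vector per one-segment node


def mean(vectors: List[Vec]) -> Vec:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def graph_step(audio: List[Vec], visual: List[Vec]) -> Tuple[List[Vec], List[Vec]]:
    """Graph perspective (assumed layout): each segment node averages over
    itself, the other modality's node at the same time step (cross-modal
    edge), and its same-modality neighbors at t-1 and t+1 (short-term edges)."""
    T = len(audio)

    def update(own: List[Vec], other: List[Vec]) -> List[Vec]:
        out = []
        for t in range(T):
            neigh = [own[t], other[t]]
            if t > 0:
                neigh.append(own[t - 1])
            if t < T - 1:
                neigh.append(own[t + 1])
            out.append(mean(neigh))
        return out

    return update(audio, visual), update(visual, audio)


def temporal_step(seq: List[Vec], dilation: int) -> List[Vec]:
    """Sequence perspective: a causal, dilated 3-tap averaging filter, a
    fixed-weight stand-in for one temporal convolution layer. Indices
    before the start of the video are clamped to 0."""
    out = []
    for t in range(len(seq)):
        taps = [seq[max(t - k * dilation, 0)] for k in range(3)]
        out.append(mean(taps))
    return out


def cyclic_refine(audio: List[Vec], visual: List[Vec],
                  cycles: int = 2) -> Tuple[List[Vec], List[Vec]]:
    """Alternate the graph step and the temporal step for a few cycles,
    doubling the dilation each cycle to widen the temporal context."""
    for c in range(cycles):
        audio, visual = graph_step(audio, visual)
        dilation = 2 ** c
        audio = temporal_step(audio, dilation)
        visual = temporal_step(visual, dilation)
    return audio, visual
```

Calling `cyclic_refine(audio, visual)` on two lists of per-segment feature vectors of equal length returns refined lists of the same shape; in a trained model each averaging step would be a learned layer.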
Pages: 689-704
Number of pages: 16
Related papers
50 in total
  • [1] Dual Attention Matching for Audio-Visual Event Localization
    Wu, Yu
    Zhu, Linchao
    Yan, Yan
    Yang, Yi
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6301 - 6309
  • [2] Dynamic interactive learning network for audio-visual event localization
    Chen, Jincai
    Liang, Han
    Wang, Ruili
    Zeng, Jiangfeng
    Lu, Ping
    [J]. APPLIED INTELLIGENCE, 2023, 53 (24) : 30431 - 30442
  • [3] Dense Modality Interaction Network for Audio-Visual Event Localization
    Liu, Shuo
    Quan, Weize
    Wang, Chaoqun
    Liu, Yuan
    Liu, Bin
    Yan, Dong-Ming
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2734 - 2748
  • [4] Dynamic interactive learning network for audio-visual event localization
    Jincai Chen
    Han Liang
    Ruili Wang
    Jiangfeng Zeng
    Ping Lu
    [J]. Applied Intelligence, 2023, 53 : 30431 - 30442
  • [5] WHAT MAKES THE SOUND?: A DUAL-MODALITY INTERACTING NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION
    Ramaswamy, Janani
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 4372 - 4376
  • [6] DUAL-MODALITY SEQ2SEQ NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION
    Lin, Yan-Bo
    Li, Yu-Jhe
    Wang, Yu-Chiang Frank
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 2002 - 2006
  • [7] Audio-Visual Event Localization in Unconstrained Videos
    Tian, Yapeng
    Shi, Jing
    Li, Bochen
    Duan, Zhiyao
    Xu, Chenliang
    [J]. COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 252 - 268
  • [8] BI-DIRECTIONAL MODALITY FUSION NETWORK FOR AUDIO-VISUAL EVENT LOCALIZATION
    Liu, Shuo
    Quan, Weize
    Liu, Yuan
    Yan, Dong-Ming
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4868 - 4872
  • [9] Learning Event-Specific Localization Preferences for Audio-Visual Event Localization
    Ge, Shiping
    Jiang, Zhiwei
    Yin, Yafeng
    Wang, Cong
    Cheng, Zifeng
    Gu, Qing
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3446 - 3454
  • [10] Semantic and Relation Modulation for Audio-Visual Event Localization
    Wang, Hao
    Zha, Zheng-Jun
    Li, Liang
    Chen, Xuejin
    Luo, Jiebo
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7711 - 7725