Audio-visual collaborative representation learning for Dynamic Saliency Prediction

Times Cited: 0
Authors
Ning, Hailong [1 ,2 ,3 ]
Zhao, Bin [4 ]
Hu, Zhanxuan [1 ,2 ,3 ]
He, Lang [1 ,2 ,3 ]
Pei, Ercheng [1 ,2 ,3 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Sch Comp Sci & Technol, Xian 710121, Peoples R China
[2] Shaanxi Key Lab Network Data Anal & Intelligent Pr, Xian 710121, Peoples R China
[3] Xian Key Lab Big Data & Intelligent Comp, Xian 710121, Peoples R China
[4] Northwestern Polytech Univ, Sch Artificial Intelligence Opt & Elect iOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Dynamic Saliency Prediction; Audio-visual; Multi-modal; Collaborative representation learning; Knowledge representation; VISUAL SALIENCY; NEURAL-NETWORK; ATTENTION; MODEL; FRAMEWORK;
DOI
10.1016/j.knosys.2022.109675
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism for perceiving a dynamic scene, and is significant in many vision tasks. Most existing methods consider only visual cues and neglect the accompanying audio information, which can provide complementary cues for scene understanding. Indeed, there is a strong relation between auditory and visual cues, and humans generally perceive their surroundings by sensing these cues collaboratively. Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task, which explores the implicit knowledge in the audio modality to assist the visual modality in predicting the dynamic saliency map. The proposed method consists of three parts: (1) audio-visual encoding, (2) audio-visual localization, and (3) collaborative integration. First, a refined SoundNet architecture is adopted to encode the audio modality, and a modified 3D ResNet-50 architecture is employed to learn visual features containing both spatial location and temporal motion information. Second, an audio-visual localization part is devised to locate the sounding salient object in the visual scene by learning the correspondence between audio and visual information. Third, a collaborative integration part is devised to adaptively aggregate the audio-visual information and a center-bias prior into the final saliency map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets, namely DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, and the results show significant superiority over state-of-the-art DSP models. (c) 2022 Elsevier B.V. All rights reserved.
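The record gives no implementation details, so the three-part pipeline in the abstract can only be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not the authors' actual architecture: the stand-in encoders merely mimic the roles of the refined SoundNet (1-D convolutions over the waveform) and the modified 3D ResNet-50 (spatio-temporal convolutions), and the cosine-similarity correspondence and learnable fusion are just one plausible reading of "localization" and "adaptive aggregation".

```python
# Illustrative sketch only; all module sizes and fusion details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualSaliencyNet(nn.Module):
    """Toy three-part pipeline mirroring the abstract: (1) audio/visual
    encoding, (2) audio-visual localization via correspondence, and
    (3) collaborative integration with a center-bias prior."""

    def __init__(self, d=256):
        super().__init__()
        # (1) Stand-in audio encoder: 1-D convs over the raw waveform,
        # pooled to a single d-dimensional audio vector.
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(64, d, kernel_size=32, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                    # -> (B, d, 1)
        )
        # (1) Stand-in visual encoder: spatio-temporal 3-D convs keeping
        # both spatial location and temporal motion information.
        self.visual_enc = nn.Sequential(
            nn.Conv3d(3, 64, (3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.Conv3d(64, d, 3, stride=2, padding=1),
            nn.ReLU(),
        )
        # (3) Heads for the visual-only map and for fusing the audio-guided
        # map, the visual map, and the center-bias prior (assumed realization
        # of "adaptive aggregation").
        self.visual_head = nn.Conv2d(d, 1, 1)
        self.fuse = nn.Conv2d(3, 1, 1)

    def forward(self, frames, waveform, center_bias):
        # frames: (B, 3, T, H, W); waveform: (B, 1, L); center_bias: (B, 1, h, w)
        v = self.visual_enc(frames).mean(dim=2)          # pool time -> (B, d, h, w)
        a = self.audio_enc(waveform).squeeze(-1)         # -> (B, d)
        # (2) Localization: cosine correspondence between the audio vector
        # and every visual location highlights the sounding region.
        corr = F.cosine_similarity(v, a[..., None, None], dim=1, eps=1e-6)
        av_map = corr.unsqueeze(1)                       # (B, 1, h, w)
        vis_map = self.visual_head(v)                    # (B, 1, h, w)
        # (3) Collaborative integration of the three cues into one map.
        cb = F.interpolate(center_bias, size=v.shape[-2:],
                           mode="bilinear", align_corners=False)
        sal = self.fuse(torch.cat([av_map, vis_map, cb], dim=1))
        return torch.sigmoid(sal)

# Example under the above assumptions: 16 frames at 112x112, one second of
# 16 kHz audio, and a fixed 2-D Gaussian as the center-bias prior.
model = AudioVisualSaliencyNet()
frames = torch.randn(2, 3, 16, 112, 112)
waveform = torch.randn(2, 1, 16000)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 28),
                        torch.linspace(-1, 1, 28), indexing="ij")
center_bias = torch.exp(-(xs ** 2 + ys ** 2) / 0.5)[None, None].expand(2, 1, 28, 28)
print(model(frames, waveform, center_bias).shape)  # torch.Size([2, 1, 28, 28])
```

The sketch deliberately keeps the fusion as a single 1x1 convolution over the three stacked maps; the paper's actual integration module is not described in this record and may differ substantially.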
Pages: 13
Related Papers
50 records in total
  • [1] Audio-visual saliency prediction with multisensory perception and integration
    Xie, Jiawei
    Liu, Zhi
    Li, Gongyang
    Song, Yingjie
    [J]. IMAGE AND VISION COMPUTING, 2024, 143
  • [2] Does Audio help in deep Audio-Visual Saliency prediction models?
    Agrawal, Ritvik
    Jyoti, Shreyank
    Girmaji, Rohit
    Sivaprasad, Sarath
    Gandhi, Vineet
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 48 - 56
  • [3] Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio
    Chao, Fang-Yi
    Ozcinar, Cagri
    Zhang, Lu
    Hamidouche, Wassim
    Deforges, Olivier
    Smolic, Aljosa
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2020, : 355 - 358
  • [4] Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning
    Zhu, Dandan
    Shao, Xuan
    Zhang, Kaiwei
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    [J]. APPLIED INTELLIGENCE, 2023, 53 (19) : 22615 - 22634
  • [5] ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
    Jain, Samyak
    Yarlagadda, Pradeep
    Jyoti, Shreyank
    Karthik, Shyamgopal
    Subramanian, Ramanathan
    Gandhi, Vineet
    [J]. 2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 3520 - 3527
  • [6] Saliency Prediction in Uncategorized Videos Based on Audio-Visual Correlation
    Qamar, Maryam
    Qamar, Suleman
    Muneeb, Muhammad
    Bae, Sung-Ho
    Rahman, Anis
    [J]. IEEE ACCESS, 2023, 11 : 15460 - 15470
  • [7] From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations
    Zhu, Dandan
    Zhang, Kaiwei
    Zhu, Kun
    Zhang, Nana
    Ding, Weiping
    Zhai, Guangtao
    Yang, Xiaokang
    [J]. IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024,
  • [8] Joint Learning of Audio-Visual Saliency Prediction and Sound Source Localization on Multi-face Videos
    Qiao, Minglang
    Liu, Yufan
    Xu, Mai
    Deng, Xin
    Li, Bing
    Hu, Weiming
    Borji, Ali
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (06) : 2003 - 2025