Audio-visual saliency prediction with multisensory perception and integration

Cited by: 1
|
Authors
Xie, Jiawei [1 ]
Liu, Zhi [1 ,2 ]
Li, Gongyang [1 ,2 ]
Song, Yingjie [1 ]
Affiliations
[1] Shanghai Univ, Shanghai Inst Adv Commun & Data Sci, Sch Commun & Informat Engn, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Wenzhou Inst, Wenzhou 325000, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Audio-visual saliency prediction; Audio-visual fusion; Image saliency prediction; Self-supervised learning; VISUAL-ATTENTION; OBJECT DETECTION; DRIVEN; MODEL;
DOI
10.1016/j.imavis.2024.104955
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Audio-visual saliency prediction (AVSP) is a task that aims to model human attention patterns in the perception of auditory and visual scenes. Given the challenges associated with perceiving and combining multi-modal saliency features from videos, this paper presents a multi-sensory framework for AVSP. This framework is designed to extract audio, motion and image saliency features and integrate them effectively, and it can serve as a general architecture for the AVSP task. To obtain multi-sensory information, we develop a three-stream encoder that extracts audio, motion and image saliency features. In particular, we utilize a pre-trained encoder with knowledge related to image saliency to extract saliency features for each frame. The image saliency features are then incorporated with motion features using a spatial attention module. For motion features, 3D convolutional neural networks (CNNs) like S3D are commonly used in AVSP models. However, these networks are unable to effectively capture the global motion relationships in videos. To tackle this problem, we incorporate Transformer- and MLP-based motion encoders into the AVSP models. To learn joint audio-visual representations, an audio-visual fusion block is exploited to enhance the correlation between audio and visual motion features under the supervision of a cosine similarity loss in a self-supervised manner. Finally, a multi-stage decoder integrates audio, motion and image saliency features to generate the final saliency map. We evaluate our method on six audio-visual eye-tracking datasets. Experimental results demonstrate that our method achieves compelling performance compared to state-of-the-art methods. The source code is available at https://github.com/oraclefina/MSPI.
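The abstract mentions that the audio-visual fusion block is supervised by a cosine similarity loss in a self-supervised manner. The paper's exact formulation is not given here; the following is a minimal NumPy sketch of one common form of such a loss, where paired audio and visual-motion embeddings are pulled toward alignment by minimizing one minus their cosine similarity. The function name and signature are hypothetical, not taken from the released code.

```python
import numpy as np

def cosine_alignment_loss(audio_feats, motion_feats, eps=1e-8):
    """Self-supervised alignment loss: mean of (1 - cosine similarity)
    over paired audio and visual-motion embeddings of shape (batch, dim).
    The loss approaches 0 as each pair becomes perfectly aligned."""
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + eps)
    m = motion_feats / (np.linalg.norm(motion_feats, axis=1, keepdims=True) + eps)
    cos_sim = np.sum(a * m, axis=1)        # per-pair cosine similarity
    return float(np.mean(1.0 - cos_sim))

# Identical embeddings are perfectly aligned, so the loss is ~0.
x = np.random.randn(4, 128)
print(round(cosine_alignment_loss(x, x), 6))  # -> 0.0
```

In training, a loss of this kind would be added to the saliency prediction objective so that the fusion block learns audio and motion representations that agree for corresponding video segments.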
Pages: 14
Related Papers
50 records in total
  • [21] Audio-visual multisensory integration in superior parietal lobule revealed by human intracranial recordings
    Molholm, Sophie
    Sehatpour, Pejman
    Mehta, Ashesh D.
    Shpaner, Marina
    Gomez-Ramirez, Manuel
    Ortigue, Stephanie
    Dyke, Jonathan P.
    Schwartz, Theodore H.
    Foxe, John J.
    [J]. JOURNAL OF NEUROPHYSIOLOGY, 2006, 96 (02) : 721 - 729
  • [22] Effects of aging on audio-visual speech integration
    Huyse, Aurelie
    Leybaert, Jacqueline
    Berthommier, Frederic
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 2014, 136 (04): : 1918 - 1931
  • [23] Multisensory Integration of Audio-Visual Motion Cues during Active Self-Movement
    Gallagher, Maria
    Culling, John F.
    Freeman, Tom C. A.
    [J]. PERCEPTION, 2021, 50 (1_SUPPL) : 158 - 158
  • [24] Multisensory integration of audio-visual motion cues during active self-movement
    Gallagher, Maria
    Culling, John F.
    Freeman, Tom C. A.
    [J]. PERCEPTION, 2022, 51 (05) : 358 - 359
  • [25] Audio-Visual Causality and Stimulus Reliability Affect Audio-Visual Synchrony Perception
    Li, Shao
    Ding, Qi
    Yuan, Yichen
    Yue, Zhenzhu
    [J]. FRONTIERS IN PSYCHOLOGY, 2021, 12
  • [26] Neural processing of audio-visual integration in speech perception: An MEG study
    Hiroe, Nobuo
    Shinozaki, Jun
    Yoshioka, Taku
    Sato, Masa-aki
    Sekiyama, Kaoru
    [J]. NEUROSCIENCE RESEARCH, 2010, 68 : E332 - E332
  • [27] Visual limitations shape audio-visual integration
    Perez-Bellido, Alexis
    Ernst, Marc O.
    Soto-Faraco, Salvador
    Lopez-Moliner, Joan
    [J]. JOURNAL OF VISION, 2015, 15 (14):
  • [28] Audio-visual speech perception is special
    Tuomainen, J
    Andersen, TS
    Tiippana, K
    Sams, M
    [J]. COGNITION, 2005, 96 (01) : B13 - B22
  • [29] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
    Zhu, Dandan
    Zhang, Kaiwei
    Zhang, Nana
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 764 - 775
  • [30] AUDIO-VISUAL TRAINING OF PERCEPTION IN AGEING
    O'Brien, Jessica
    Jason, Chan
    Setti, Annalisa
    [J]. AGE AND AGEING, 2019, 48