Audio-visual saliency prediction with multisensory perception and integration

Cited by: 1
Authors
Xie, Jiawei [1 ]
Liu, Zhi [1 ,2 ]
Li, Gongyang [1 ,2 ]
Song, Yingjie [1 ]
Affiliations
[1] Shanghai Univ, Shanghai Inst Adv Commun & Data Sci, Sch Commun & Informat Engn, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Wenzhou Inst, Wenzhou 325000, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Audio-visual saliency prediction; Audio-visual fusion; Image saliency prediction; Self-supervised learning; VISUAL-ATTENTION; OBJECT DETECTION; DRIVEN; MODEL;
DOI
10.1016/j.imavis.2024.104955
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Audio-visual saliency prediction (AVSP) is a task that aims to model human attention patterns in the perception of auditory and visual scenes. Given the challenges of perceiving and combining multi-modal saliency features from videos, this paper presents a multi-sensory framework for AVSP. The framework is designed to extract audio, motion and image saliency features and integrate them effectively, and can thus serve as a general architecture for the AVSP task. To obtain multi-sensory information, we develop a three-stream encoder that extracts audio, motion and image saliency features. In particular, we utilize a pre-trained encoder with knowledge related to image saliency to extract saliency features for each frame. The image saliency features are then incorporated with motion features using a spatial attention module. For motion features, 3D convolutional neural networks (CNNs) such as S3D are commonly used in AVSP models. However, these networks cannot effectively capture global motion relationships in videos. To tackle this problem, we incorporate Transformer- and MLP-based motion encoders into the AVSP models. To learn joint audio-visual representations, an audio-visual fusion block is exploited to enhance the correlation between audio and visual motion features under the supervision of a cosine similarity loss in a self-supervised manner. Finally, a multi-stage decoder integrates audio, motion and image saliency features to generate the final saliency map. We evaluate our method on six audio-visual eye-tracking datasets. Experimental results demonstrate that our method achieves compelling performance compared to state-of-the-art methods. The source code is available at https://github.com/oraclefina/MSPI.
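The fusion step described in the abstract lends itself to a short sketch. Below is a minimal, illustrative PyTorch version of an audio-visual fusion block that projects audio and motion features into a shared space, fuses them, and supervises their correlation with a cosine similarity loss. This is an assumption-laden sketch, not the authors' MSPI implementation; all names here (AudioVisualFusion, proj_a, proj_m, the gated fusion) are hypothetical.

```python
# Illustrative sketch (not the released MSPI code) of self-supervised
# audio-visual fusion: project both modalities into a shared embedding
# space, fuse them, and encourage correlation via a cosine similarity loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualFusion(nn.Module):
    def __init__(self, audio_dim: int, motion_dim: int, embed_dim: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, embed_dim)   # audio -> shared space
        self.proj_m = nn.Linear(motion_dim, embed_dim)  # motion -> shared space
        # Hypothetical learned gate deciding each modality's contribution.
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, audio_feat: torch.Tensor, motion_feat: torch.Tensor):
        a = self.proj_a(audio_feat)                     # (B, D)
        m = self.proj_m(motion_feat)                    # (B, D)
        g = self.gate(torch.cat([a, m], dim=-1))        # per-channel weights in (0, 1)
        fused = g * a + (1.0 - g) * m                   # gated audio-visual fusion
        # Self-supervised alignment: maximize cosine similarity between the
        # audio and motion embeddings of the same clip (loss = 1 - mean sim).
        align_loss = 1.0 - F.cosine_similarity(a, m, dim=-1).mean()
        return fused, align_loss

# Usage sketch: the fused features would feed a multi-stage decoder, and
# align_loss would be added to the saliency loss with a weighting factor.
fusion = AudioVisualFusion(audio_dim=128, motion_dim=512)
fused, align_loss = fusion(torch.randn(4, 128), torch.randn(4, 512))
```

The cosine term rewards correlated audio and motion embeddings for matching clips, which is one common way to realize the self-supervised correlation objective the abstract describes.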
Pages: 14
Related papers
50 items in total
  • [1] Audio-visual integration in temporal perception
    Wada, Y.
    Kitagawa, N.
    Noguchi, K.
    [J]. INTERNATIONAL JOURNAL OF PSYCHOPHYSIOLOGY, 2003, 50 (1-2) : 117 - 124
  • [2] Audio-visual multisensory integration and haptic perception are altered in adults with developmental coordination disorder
    Mayes, William P.
    Gentle, Judith
    Ivanova, Mirela
    Violante, Ines R.
    [J]. HUMAN MOVEMENT SCIENCE, 2024, 93
  • [3] Does Audio help in deep Audio-Visual Saliency prediction models?
    Agrawal, Ritvik
    Jyoti, Shreyank
    Girmaji, Rohit
    Sivaprasad, Sarath
    Gandhi, Vineet
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 48 - 56
  • [4] Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio
    Chao, Fang-Yi
    Ozcinar, Cagri
    Zhang, Lu
    Hamidouche, Wassim
    Deforges, Olivier
    Smolic, Aljosa
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2020, : 355 - 358
  • [5] Crossmodal interactions and multisensory integration in the perception of audio-visual motion - A free-field study
    Schmiedchen, Kristina
    Freigang, Claudia
    Nitsche, Ines
    Ruebsamen, Rudolf
    [J]. BRAIN RESEARCH, 2012, 1466 : 99 - 111
  • [6] Sensorimotor synchronization with audio-visual stimuli: limited multisensory integration
    Armstrong, Alan
    Issartel, Johann
    [J]. EXPERIMENTAL BRAIN RESEARCH, 2014, 232 (11) : 3453 - 3463
  • [7] Audio-visual integration in the perception of tap dancing
    Arrighi, R.
    Marini, F.
    Burr, D.
    [J]. PERCEPTION, 2007, 36 : 172 - 172
  • [8] ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction
    Jain, Samyak
    Yarlagadda, Pradeep
    Jyoti, Shreyank
    Karthik, Shyamgopal
    Subramanian, Ramanathan
    Gandhi, Vineet
    [J]. 2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2021, : 3520 - 3527
  • [9] Saliency Prediction in Uncategorized Videos Based on Audio-Visual Correlation
    Qamar, Maryam
    Qamar, Suleman
    Muneeb, Muhammad
    Bae, Sung-Ho
    Rahman, Anis
    [J]. IEEE ACCESS, 2023, 11 : 15460 - 15470