Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio

Cited by: 14
Authors
Chao, Fang-Yi [1]
Ozcinar, Cagri [2]
Zhang, Lu [1]
Hamidouche, Wassim [1]
Deforges, Olivier [1]
Smolic, Aljosa [2]
Affiliations
[1] Univ Rennes, CNRS, INSA Rennes, IETR UMR 6164, F-35000 Rennes, France
[2] Trinity Coll Dublin, Sch Comp Sci & Stat, V SENSE, Dublin, Ireland
Funding
Science Foundation Ireland
Keywords
Audio-visual saliency; spatial sound; ambisonics; omnidirectional video (ODV); virtual reality (VR)
DOI
10.1109/vcip49819.2020.9301766
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Omnidirectional videos (ODVs) with spatial audio let viewers perceive audio and visual signals from every direction across the full 360 degrees when watching with head-mounted displays (HMDs). By predicting salient audio-visual regions, ODV systems can be optimized to deliver a high-quality, immersive audio-visual experience. Despite intense recent work on ODV saliency prediction, the current literature still does not consider the impact of auditory information in ODVs. In this work, we propose an audio-visual saliency model (AVS360) that incorporates 360-degree spatio-temporal visual representation and spatial auditory information in ODVs. The proposed AVS360 model is composed of two 3D residual networks (ResNets) that encode visual and audio cues. The first is embedded with a spherical representation technique to extract 360-degree visual features, and the second extracts audio features from the log mel-spectrogram. We emphasize sound source locations by integrating an audio energy map (AEM) generated from the spatial audio description (i.e., ambisonics), and we model viewers' tendency to look near the equator with an equator center bias (ECB). The audio and visual features are combined and fused with the AEM and ECB via an attention mechanism. Our experimental results show that the AVS360 model significantly outperforms five state-of-the-art saliency models. To the best of our knowledge, this is the first work to develop an audio-visual saliency model for ODVs. The code will be made publicly available to foster future research on audio-visual saliency in ODVs.
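The two priors named in the abstract, the audio energy map (AEM) and the equator center bias (ECB), can be illustrated with a small sketch. This is not the authors' implementation: the function names, the Gaussian shape of the ECB, its `sigma_deg` width, the cardioid-style beam used to steer the ambisonic signal, and the pixel-to-angle convention (equirectangular grid, azimuth 0 at the image center and increasing to the left) are all assumptions chosen for illustration. The ambisonic input is assumed to be first order with SN3D-style scaling, so that a plane wave from unit direction (dx, dy, dz) carries channels W = s, X = s·dx, Y = s·dy, Z = s·dz.

```python
import math

def direction(azimuth_deg, elevation_deg):
    """Unit vector (x, y, z) for an azimuth/elevation pair given in degrees."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def pixel_angles(row, col, height, width):
    """Assumed equirectangular mapping: top row is +90 deg elevation,
    center column is azimuth 0, azimuth grows toward the left edge."""
    elevation = 90.0 - (row + 0.5) * 180.0 / height
    azimuth = 180.0 - (col + 0.5) * 360.0 / width
    return azimuth, elevation

def equator_center_bias(height, width, sigma_deg=20.0):
    """Hypothetical ECB: a Gaussian over latitude, identical in every column."""
    bias = []
    for row in range(height):
        _, elevation = pixel_angles(row, 0, height, width)
        weight = math.exp(-0.5 * (elevation / sigma_deg) ** 2)
        bias.append([weight] * width)
    return bias

def audio_energy_map(w, x, y, z, height, width):
    """Hypothetical AEM: mean energy of a cardioid-style beam steered
    toward each pixel direction of the equirectangular grid."""
    n = len(w)
    energy = []
    for row in range(height):
        line = []
        for col in range(width):
            dx, dy, dz = direction(*pixel_angles(row, col, height, width))
            total = 0.0
            for t in range(n):
                beam = 0.5 * (w[t] + dx * x[t] + dy * y[t] + dz * z[t])
                total += beam * beam
            line.append(total / n)
        energy.append(line)
    return energy

def encode_plane_wave(signal, azimuth_deg, elevation_deg):
    """Encode a mono signal as a first-order plane wave (W, X, Y, Z)."""
    dx, dy, dz = direction(azimuth_deg, elevation_deg)
    return (list(signal),
            [s * dx for s in signal],
            [s * dy for s in signal],
            [s * dz for s in signal])

# Synthetic test tone placed exactly at the center of grid cell (row 3, col 3).
signal = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
w, x, y, z = encode_plane_wave(signal, azimuth_deg=101.25, elevation_deg=11.25)
aem = audio_energy_map(w, x, y, z, height=8, width=16)
ecb = equator_center_bias(height=8, width=16)

# Location of the strongest audio energy on the grid.
peak = max(((r, c) for r in range(8) for c in range(16)),
           key=lambda rc: aem[rc[0]][rc[1]])
print(peak)
```

With a single synthetic source, the energy map peaks at the grid cell facing the source, while the ECB weights rows near the equator most heavily. How these maps are combined with the learned audio and visual features is handled by the attention mechanism described in the abstract and is not reproduced here.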
Pages: 355 - 358
Page count: 4