Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning

Cited: 1
Authors
Zhu, Dandan [1 ,2 ]
Shao, Xuan [3 ]
Zhang, Kaiwei [4 ]
Min, Xiongkuo [4 ]
Zhai, Guangtao [4 ]
Yang, Xiaokang [2 ]
Affiliations
[1] East China Normal Univ, Inst AI Educ, Shanghai 200333, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Donghua Univ, Sch Comp Sci & Technol, Shanghai 201620, Peoples R China
[4] Shanghai Jiao Tong Univ, Inst Image Commun & Network Engn, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Temporal alignment; Audio-visual saliency; Spatial sound source localization; Implicit neural representation; Omnidirectional videos; VISUAL-ATTENTION; PREDICTION;
DOI
10.1007/s10489-023-04714-1
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
As audio information has been increasingly explored and leveraged in omnidirectional videos (ODVs), the performance of audio-visual saliency models has improved dramatically. However, these models are still in their infancy, and two significant issues remain in modeling human attention across the visual and auditory modalities: (1) the temporal non-alignment between the auditory and visual streams is rarely considered; and (2) most audio-visual saliency models are agnostic to audio content attributes, and thus fail to learn fine-grained audio features. This paper proposes a novel audio-visual aligned saliency (AVAS) model that tackles both issues simultaneously in an effective end-to-end training manner. To solve the temporal non-alignment problem, a Hanning window is applied to the audio stream to truncate the audio signal per unit time (frame-time interval) so that it matches the visual stream of the corresponding duration; this captures the potential correlation between the two modalities across time steps and facilitates audio-visual feature fusion. To address the audio content attribute-agnostic issue, an effective periodic audio encoding method based on implicit neural representation (INR) is proposed to map audio sampling points to their corresponding audio frequency values, which better discriminates and interprets audio content attributes. Comprehensive experiments and detailed ablation analyses on the benchmark dataset demonstrate the efficacy of the proposed model; the experimental results show that it consistently outperforms other competitors by a large margin.
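The frame-level alignment idea described in the abstract can be illustrated with a short sketch. Below, a mono audio track is cut into one segment per video frame and each segment is tapered with a Hanning window, so the audio and visual streams share the same time axis. This is a minimal sketch of the general technique, not the authors' implementation; the function name align_audio_to_frames and the sample-rate/fps values are illustrative assumptions.

```python
import numpy as np

def align_audio_to_frames(audio, sample_rate, fps):
    """Cut a mono audio track into one Hanning-windowed segment per
    video frame, so each segment covers exactly one frame-time interval."""
    samples_per_frame = int(round(sample_rate / fps))  # audio samples per video frame
    window = np.hanning(samples_per_frame)             # Hanning taper over one interval
    n_frames = len(audio) // samples_per_frame
    segments = [
        audio[i * samples_per_frame:(i + 1) * samples_per_frame] * window
        for i in range(n_frames)
    ]
    return np.stack(segments)  # shape: (n_frames, samples_per_frame)

# Usage: a 10 s, 16 kHz track aligned to a 25 fps video -> 250 segments of 640 samples.
audio = np.random.randn(16000 * 10)
segments = align_audio_to_frames(audio, sample_rate=16000, fps=25)
print(segments.shape)  # (250, 640)
```

Each windowed segment can then be paired with the video frame spanning the same interval, which is what makes frame-synchronous audio-visual feature fusion possible.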
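The periodic audio encoding can likewise be sketched. The snippet below fits a small SIREN-style network (linear layers with sine activations) that maps a scalar time coordinate to the signal value at that point; this is one common way to build an implicit neural representation with periodic structure, offered as an assumption about the general approach rather than the paper's exact INR design. AudioINR, SineLayer, and all hyperparameters (omega_0, layer sizes, learning rate) are illustrative.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation (SIREN-style);
    omega_0 scales inputs so high audio frequencies are representable."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class AudioINR(nn.Module):
    """Implicit neural representation of one audio segment: maps a
    scalar time coordinate t in [-1, 1] to the sample value at t."""
    def __init__(self, hidden=256, n_layers=3):
        super().__init__()
        layers = [SineLayer(1, hidden)]
        layers += [SineLayer(hidden, hidden) for _ in range(n_layers - 1)]
        layers.append(nn.Linear(hidden, 1))  # linear head predicts the amplitude
        self.net = nn.Sequential(*layers)

    def forward(self, t):
        return self.net(t)

# Fit the INR to a toy periodic signal (time coordinates -> sample values).
t = torch.linspace(-1.0, 1.0, 640).unsqueeze(1)   # 640 time coordinates
target = torch.sin(2.0 * torch.pi * 5.0 * t)      # 5 cycles across the segment
model = AudioINR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(500):
    optimizer.zero_grad()
    loss = ((model(t) - target) ** 2).mean()
    loss.backward()
    optimizer.step()
```

The periodic activations make the network naturally suited to representing oscillatory audio content, which is the intuition behind using an INR to discriminate audio content attributes.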
Pages: 22615-22634 (20 pages)
Related Papers
50 records in total
  • [1] Zhu, Dandan; Shao, Xuan; Zhang, Kaiwei; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning. Applied Intelligence, 2023, 53: 22615-22634.
  • [2] Chao, Fang-Yi; Ozcinar, Cagri; Zhang, Lu; Hamidouche, Wassim; Deforges, Olivier; Smolic, Aljosa. Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio. 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), 2020: 355-358.
  • [3] Zhu, Dandan; Zhang, Kaiwei; Zhang, Nana; Zhou, Qiangqiang; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio. IEEE Transactions on Multimedia, 2024, 26: 764-775.
  • [4] Zhu, Dandan; Zhang, Kaiwei; Zhu, Kun; Zhang, Nana; Ding, Weiping; Zhai, Guangtao; Yang, Xiaokang. From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
  • [5] Ning, Hailong; Zhao, Bin; Hu, Zhanxuan; He, Lang; Pei, Ercheng. Audio-visual collaborative representation learning for Dynamic Saliency Prediction. Knowledge-Based Systems, 2022, 256.
  • [6] Rapantzikos, Konstantinos; Evangelopoulos, Georgios; Maragos, Petros; Avrithis, Yannis. An audio-visual saliency model for movie summarization. 2007 IEEE Ninth Workshop on Multimedia Signal Processing, 2007: 320-323.
  • [7] Chao, Fang-Yi; Ozcinar, Cagri; Wang, Chen; Zerman, Emin; Zhang, Lu; Hamidouche, Wassim; Deforges, Olivier; Smolic, Aljosa. Audio-Visual Perception of Omnidirectional Video for Virtual Reality Applications. 2020 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2020.
  • [8] Zhu, Dandan; Shao, Xuan; Zhou, Qiangqiang; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. A Novel Lightweight Audio-visual Saliency Model for Videos. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(4).
  • [9] Tavakoli, Hamed R.; Borji, Ali; Kannala, Juho; Rahtu, Esa. Deep Audio-Visual Saliency: Baseline Model and Data. ETRA 2020 Short Papers: ACM Symposium on Eye Tracking Research & Applications, 2020.
  • [10] Yao, Shunyu; Min, Xiongkuo; Zhai, Guangtao. Deep Audio-Visual Fusion Neural Network for Saliency Estimation. 2021 IEEE International Conference on Image Processing (ICIP), 2021: 1604-1608.