Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning

Cited: 1
Authors
Zhu, Dandan [1 ,2 ]
Shao, Xuan [3 ]
Zhang, Kaiwei [4 ]
Min, Xiongkuo [4 ]
Zhai, Guangtao [4 ]
Yang, Xiaokang [2 ]
Affiliations
[1] East China Normal Univ, Inst AI Educ, Shanghai 200333, Peoples R China
[2] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[3] Donghua Univ, Sch Comp Sci & Technol, Shanghai 201620, Peoples R China
[4] Shanghai Jiao Tong Univ, Inst Image Commun & Network Engn, Shanghai 200240, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Temporal alignment; Audio-visual saliency; Spatial sound source localization; Implicit neural representation; Omnidirectional videos; VISUAL-ATTENTION; PREDICTION;
DOI
10.1007/s10489-023-04714-1
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
As audio information has been increasingly explored and leveraged in omnidirectional videos (ODVs), the performance of audio-visual saliency models has improved dramatically. However, these models are still in their infancy, and two significant issues remain in modeling human attention across the visual and auditory modalities: (1) the temporal non-alignment between the auditory and visual streams is rarely considered; and (2) most audio-visual saliency models are agnostic to audio content attributes, and thus fail to learn fine-grained audio features. This paper proposes a novel audio-visual aligned saliency (AVAS) model that tackles both issues simultaneously in an effective end-to-end training manner. To solve the temporal non-alignment problem, a Hanning window is applied to the audio stream to truncate the audio signal per unit time (frame-time interval) so that it matches the visual stream of the corresponding duration; this captures the potential correlation between the two modalities across time steps and facilitates audio-visual feature fusion. To address the audio content attribute-agnostic issue, an effective periodic audio encoding method based on implicit neural representation (INR) is proposed to map audio sampling points to their corresponding audio frequency values, which better discriminates and interprets audio content attributes. Comprehensive experiments and detailed ablation analyses on the benchmark dataset demonstrate the efficacy of the proposed model; the experimental results show that it consistently outperforms other competitors by a large margin.
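The frame-level alignment idea described in the abstract can be illustrated with a short sketch. Below, a mono audio track is cut into one segment per video frame and each segment is tapered with a Hanning window, so the audio and visual streams share the same time axis. This is a minimal sketch of the general technique, not the authors' implementation; the function name align_audio_to_frames and the sample-rate/fps values are illustrative assumptions.

```python
import numpy as np

def align_audio_to_frames(audio, sample_rate, fps):
    """Cut a mono audio track into one Hanning-windowed segment per
    video frame, so each segment covers exactly one frame-time interval."""
    samples_per_frame = int(round(sample_rate / fps))  # audio samples per video frame
    window = np.hanning(samples_per_frame)             # Hanning taper over one interval
    n_frames = len(audio) // samples_per_frame
    segments = [
        audio[i * samples_per_frame:(i + 1) * samples_per_frame] * window
        for i in range(n_frames)
    ]
    return np.stack(segments)  # shape: (n_frames, samples_per_frame)

# Usage: a 10 s, 16 kHz track aligned to a 25 fps video -> 250 segments of 640 samples.
audio = np.random.randn(16000 * 10)
segments = align_audio_to_frames(audio, sample_rate=16000, fps=25)
print(segments.shape)  # (250, 640)
```

Each windowed segment can then be paired with the video frame spanning the same interval, which is what makes frame-synchronous audio-visual feature fusion possible.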
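The periodic audio encoding can likewise be sketched. The snippet below fits a small SIREN-style network (linear layers with sine activations) that maps a scalar time coordinate to the signal value at that point; this is one common way to build an implicit neural representation with periodic structure, offered as an assumption about the general approach rather than the paper's exact INR design. AudioINR, SineLayer, and all hyperparameters (omega_0, layer sizes, learning rate) are illustrative.

```python
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation (SIREN-style);
    omega_0 scales inputs so high audio frequencies are representable."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class AudioINR(nn.Module):
    """Implicit neural representation of one audio segment: maps a
    scalar time coordinate t in [-1, 1] to the sample value at t."""
    def __init__(self, hidden=256, n_layers=3):
        super().__init__()
        layers = [SineLayer(1, hidden)]
        layers += [SineLayer(hidden, hidden) for _ in range(n_layers - 1)]
        layers.append(nn.Linear(hidden, 1))  # linear head predicts the amplitude
        self.net = nn.Sequential(*layers)

    def forward(self, t):
        return self.net(t)

# Fit the INR to a toy periodic signal (time coordinates -> sample values).
t = torch.linspace(-1.0, 1.0, 640).unsqueeze(1)   # 640 time coordinates
target = torch.sin(2.0 * torch.pi * 5.0 * t)      # 5 cycles across the segment
model = AudioINR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(500):
    optimizer.zero_grad()
    loss = ((model(t) - target) ** 2).mean()
    loss.backward()
    optimizer.step()
```

The periodic activations make the network naturally suited to representing oscillatory audio content, which is the intuition behind using an INR to discriminate audio content attributes.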
Pages: 22615-22634 (20 pages)
Related Papers
50 records in total
  • [1] Zhu, Dandan; Shao, Xuan; Zhang, Kaiwei; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning. Applied Intelligence, 2023, 53: 22615-22634.
  • [2] Chao, Fang-Yi; Ozcinar, Cagri; Zhang, Lu; Hamidouche, Wassim; Deforges, Olivier; Smolic, Aljosa. Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio. 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP), 2020: 355-358.
  • [3] Zhu, Dandan; Zhang, Kaiwei; Zhang, Nana; Zhou, Qiangqiang; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio. IEEE Transactions on Multimedia, 2024, 26: 764-775.
  • [4] Zhu, Dandan; Zhang, Kaiwei; Zhu, Kun; Zhang, Nana; Ding, Weiping; Zhai, Guangtao; Yang, Xiaokang. From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model With Implicit Neural Representations. IEEE Transactions on Emerging Topics in Computational Intelligence, 2024.
  • [5] Ning, Hailong; Zhao, Bin; Hu, Zhanxuan; He, Lang; Pei, Ercheng. Audio-visual collaborative representation learning for Dynamic Saliency Prediction. Knowledge-Based Systems, 2022, 256.
  • [6] Rapantzikos, Konstantinos; Evangelopoulos, Georgios; Maragos, Petros; Avrithis, Yannis. An audio-visual saliency model for movie summarization. 2007 IEEE Ninth Workshop on Multimedia Signal Processing, 2007: 320-323.
  • [7] Chao, Fang-Yi; Ozcinar, Cagri; Wang, Chen; Zerman, Emin; Zhang, Lu; Hamidouche, Wassim; Deforges, Olivier; Smolic, Aljosa. Audio-Visual Perception of Omnidirectional Video for Virtual Reality Applications. 2020 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2020.
  • [8] Zhu, Dandan; Shao, Xuan; Zhou, Qiangqiang; Min, Xiongkuo; Zhai, Guangtao; Yang, Xiaokang. A Novel Lightweight Audio-visual Saliency Model for Videos. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(4).
  • [9] Tavakoli, Hamed R.; Borji, Ali; Kannala, Juho; Rahtu, Esa. Deep Audio-Visual Saliency: Baseline Model and Data. ETRA 2020 Short Papers: ACM Symposium on Eye Tracking Research & Applications, 2020.
  • [10] Yao, Shunyu; Min, Xiongkuo; Zhai, Guangtao. Deep Audio-Visual Fusion Neural Network for Saliency Estimation. 2021 IEEE International Conference on Image Processing (ICIP), 2021: 1604-1608.