A Multimodal Saliency Model for Videos With High Audio-Visual Correspondence

Cited by: 135
Authors
Min, Xiongkuo [1 ]
Zhai, Guangtao [1 ]
Zhou, Jiantao [2 ]
Zhang, Xiao-Ping [3 ]
Yang, Xiaokang [1 ]
Guan, Xinping [4 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Image Commun & Network Engn, Shanghai 200240, Peoples R China
[2] Univ Macau, Fac Sci & Technol, Dept Comp & Informat Sci, Macau 999078, Peoples R China
[3] Ryerson Univ, Dept Elect & Comp Engn, Toronto, ON M5B 2K3, Canada
[4] Shanghai Jiao Tong Univ, Dept Automat, Shanghai 200240, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China
Keywords
Audio-visual attention; visual attention; multimodal; saliency; attention fusion; FREE-ENERGY PRINCIPLE; BLIND QUALITY ASSESSMENT; SEGMENTATION; ATTENTION;
DOI
10.1109/TIP.2020.2966082
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Audio information has been overlooked by most current visual attention prediction studies. However, sound can influence visual attention, and this influence has been widely investigated and confirmed by many psychological studies. In this paper, we propose a novel multimodal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted to the sound sources, and it is also possible to localize the sound sources via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality using a novel free-energy principle. Then we propose to detect the audio saliency map from both the audio and visual modalities by localizing moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audio-visual saliency fusion method to integrate the spatial, temporal, and audio saliency maps into the final audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep-learning-based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we propose to combine deep saliency models with the MMS model via late fusion, and we find that an average performance gain of 5% is obtained. Experimental results on audio-visual attention databases show that the introduced models, which incorporate audio cues, significantly outperform state-of-the-art image and video saliency models that use only the visual modality.
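The cross-modal correlation step in the abstract can be illustrated with a minimal sketch. The code below computes a plain (linear) canonical correlation between an audio feature matrix and a visual feature matrix; the paper uses a kernel variant (KCCA), so the linear form, the regularizer, and the function name here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def canonical_correlation(X, Y, reg=1e-4):
    """First canonical correlation between feature matrices X (n x p) and Y (n x q).

    Illustrative linear CCA sketch; the paper's model uses kernel CCA instead.
    """
    # Center both feature sets
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance matrices
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each modality, then the largest singular value of the
    # whitened cross-covariance is the first canonical correlation
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    M = Wx @ Cxy @ Wy.T
    return np.linalg.svd(M, compute_uv=False)[0]
```

In a sound-source localization setting, X might hold per-frame audio features and Y per-frame features of a candidate image region; a high canonical correlation suggests the region is a moving-sounding object.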
Pages: 3805-3819
Number of pages: 15