ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by: 27
Authors
Jain, Samyak [1 ]
Yarlagadda, Pradeep [1 ]
Jyoti, Shreyank [1 ]
Karthik, Shyamgopal [1 ]
Subramanian, Ramanathan [2 ]
Gandhi, Vineet [1 ]
Affiliations
[1] Int Inst Informat Technol, KCIS, CVIT, Hyderabad, India
[2] Univ Canberra, Canberra, ACT, Australia
Keywords
DOI
10.1109/IROS51168.2021.9635989
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Subject Classification Code
0812
Abstract
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variation of the ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
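The decoder step the abstract describes (trilinear upsampling of deeper features, fusion with finer encoder features, then 3D convolution) can be sketched as below. This is a hypothetical single-channel NumPy/SciPy stand-in, not the authors' PyTorch code: the learned 3×3×3 convolution kernel is replaced by a fixed averaging kernel, and the multiplicative fusion of hierarchies is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import zoom, convolve

def decoder_stage(deep_feat, skip_feat):
    """One hypothetical ViNet-style decoder stage (sketch, not the paper's code):
    upsample the deeper feature map by trilinear interpolation to the skip
    feature's (T, H, W) resolution, fuse the two, and apply a 3D convolution."""
    # order=1 gives linear interpolation along each of T, H, W, i.e. trilinear.
    factors = [s / d for s, d in zip(skip_feat.shape, deep_feat.shape)]
    up = zoom(deep_feat, zoom=factors, order=1)
    merged = up * skip_feat             # multiplicative fusion: an assumed choice
    kernel = np.full((3, 3, 3), 1 / 27) # stand-in for a learned 3x3x3 conv kernel
    return convolve(merged, kernel, mode="nearest")

deep = np.random.rand(4, 8, 8)    # coarse single-channel (T, H, W) features
skip = np.random.rand(8, 16, 16)  # finer encoder features from a shallower stage
out = decoder_stage(deep, skip)
print(out.shape)  # (8, 16, 16) — saliency features at the skip resolution
```

Stacking several such stages, each doubling the spatiotemporal resolution while mixing in shallower encoder features, yields the full-resolution saliency map; in the actual model the kernels are learned end-to-end.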
Pages: 3520-3527
Page count: 8
Related Papers
(50 entries total)
  • [1] Audio-visual saliency prediction with multisensory perception and integration
    Xie, Jiawei
    Liu, Zhi
    Li, Gongyang
    Song, Yingjie
    [J]. IMAGE AND VISION COMPUTING, 2024, 143
  • [2] Does Audio help in deep Audio-Visual Saliency prediction models?
    Agrawal, Ritvik
    Jyoti, Shreyank
    Girmaji, Rohit
    Sivaprasad, Sarath
    Gandhi, Vineet
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022, : 48 - 56
  • [3] Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio
    Chao, Fang-Yi
    Ozcinar, Cagri
    Zhang, Lu
    Hamidouche, Wassim
    Deforges, Olivier
    Smolic, Aljosa
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2020, : 355 - 358
  • [4] Saliency Prediction in Uncategorized Videos Based on Audio-Visual Correlation
    Qamar, Maryam
    Qamar, Suleman
    Muneeb, Muhammad
    Bae, Sung-Ho
    Rahman, Anis
    [J]. IEEE ACCESS, 2023, 11 : 15460 - 15470
  • [5] Audio-visual collaborative representation learning for Dynamic Saliency Prediction
    Ning, Hailong
    Zhao, Bin
    Hu, Zhanxuan
    He, Lang
    Pei, Ercheng
    [J]. KNOWLEDGE-BASED SYSTEMS, 2022, 256
  • [6] An audio-visual saliency model for movie summarization
    Rapantzikos, Konstantinos
    Evangelopoulos, Georgios
    Maragos, Petros
    Avrithis, Yannis
    [J]. 2007 IEEE NINTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, 2007, : 320 - 323
  • [7] Audio-visual saliency prediction for movie viewing in immersive environments: Dataset and benchmarks
    Chen, Zhao
    Zhang, Kao
    Cai, Hao
    Ding, Xiaoying
    Jiang, Chenxi
    Chen, Zhenzhong
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2024, 100
  • [8] A Novel Lightweight Audio-visual Saliency Model for Videos
    Zhu, Dandan
    Shao, Xuan
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (04)
  • [9] Deep Audio-Visual Saliency: Baseline Model and Data
    Tavakoli, Hamed R.
    Borji, Ali
    Kannala, Juho
    Rahtu, Esa
    [J]. ETRA 2020 SHORT PAPERS: ACM SYMPOSIUM ON EYE TRACKING RESEARCH & APPLICATIONS, 2020,
  • [10] Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio
    Zhu, Dandan
    Zhang, Kaiwei
    Zhang, Nana
    Zhou, Qiangqiang
    Min, Xiongkuo
    Zhai, Guangtao
    Yang, Xiaokang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 764 - 775