ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by: 27
Authors
Jain, Samyak [1 ]
Yarlagadda, Pradeep [1 ]
Jyoti, Shreyank [1 ]
Karthik, Shyamgopal [1 ]
Subramanian, Ramanathan [2 ]
Gandhi, Vineet [1 ]
Affiliations
[1] Int Inst Informat Technol, KCIS, CVIT, Hyderabad, India
[2] Univ Canberra, Canberra, ACT, Australia
Keywords
DOI
10.1109/IROS51168.2021.9635989
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variation of the ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction and suggest a clear avenue for future work on incorporating audio more effectively. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
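As a rough illustration of the decoder idea summarized above (hierarchical 3D features fused through 3D convolutions and trilinear upsampling), the following PyTorch sketch uses a toy stand-in encoder in place of the paper's action-recognition backbone. All layer shapes, channel counts, and the final temporal pooling are assumptions for illustration only, not the authors' implementation; see the linked repository for the actual code.

```python
# Minimal, illustrative sketch of a ViNet-style visual-only saliency network.
# The toy encoder and all channel/stride choices below are assumptions, not the
# paper's S3D-based backbone or its exact decoder configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for the action-recognition backbone: returns multi-scale 3D features."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv3d(3, 16, kernel_size=3, stride=(1, 2, 2), padding=1)
        self.stage2 = nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1)
        self.stage3 = nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        f1 = F.relu(self.stage1(x))   # shallow, high-resolution features
        f2 = F.relu(self.stage2(f1))  # mid-level features
        f3 = F.relu(self.stage3(f2))  # deep, low-resolution features
        return [f1, f2, f3]


class SaliencyDecoder(nn.Module):
    """Fuses hierarchical features with 3D convs + trilinear upsampling into a saliency map."""
    def __init__(self, channels=(64, 32, 16)):
        super().__init__()
        self.conv3 = nn.Conv3d(channels[0], channels[1], kernel_size=3, padding=1)
        self.conv2 = nn.Conv3d(channels[1] * 2, channels[2], kernel_size=3, padding=1)
        self.conv1 = nn.Conv3d(channels[2] * 2, 1, kernel_size=3, padding=1)

    def forward(self, feats):
        f1, f2, f3 = feats
        x = F.relu(self.conv3(f3))
        # Upsample deep features to the next-shallower stage's resolution and fuse.
        x = F.interpolate(x, size=f2.shape[2:], mode='trilinear', align_corners=False)
        x = F.relu(self.conv2(torch.cat([x, f2], dim=1)))
        x = F.interpolate(x, size=f1.shape[2:], mode='trilinear', align_corners=False)
        x = self.conv1(torch.cat([x, f1], dim=1))
        # Collapse the temporal dimension and squash to a per-clip saliency map.
        return torch.sigmoid(x.mean(dim=2)).squeeze(1)


if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 112, 112)   # (batch, RGB, frames, H, W)
    feats = ToyEncoder()(clip)
    saliency = SaliencyDecoder()(feats)
    print(saliency.shape)                    # torch.Size([1, 56, 56])
```

Fusing progressively shallower features after each trilinear upsampling step mirrors the multi-hierarchy combination described in the abstract; the real model differs in backbone, depth, and output handling.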
Pages: 3520-3527
Number of pages: 8