ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by: 27
Authors
Jain, Samyak [1 ]
Yarlagadda, Pradeep [1 ]
Jyoti, Shreyank [1 ]
Karthik, Shyamgopal [1 ]
Subramanian, Ramanathan [2 ]
Gandhi, Vineet [1 ]
Affiliations
[1] Int Inst Informat Technol, KCIS, CVIT, Hyderabad, India
[2] Univ Canberra, Canberra, ACT, Australia
DOI
10.1109/IROS51168.2021.9635989
Chinese Library Classification (CLC): TP [Automation technology; computer technology]
Subject classification code: 0812
Abstract
We propose the ViNet architecture for audiovisual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
Pages: 3520-3527
Page count: 8