ViNet: Pushing the limits of Visual Modality for Audio-Visual Saliency Prediction

Cited by: 27

Authors:

Jain, Samyak [1]

Yarlagadda, Pradeep [1]

Jyoti, Shreyank [1]

Karthik, Shyamgopal [1]

Subramanian, Ramanathan [2]

Gandhi, Vineet [1]

Affiliations:

[1] Int Inst Informat Technol, KCIS, CVIT, Hyderabad, India

[2] Univ Canberra, Canberra, ACT, Australia

Source:

2021 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS) | 2021

DOI:

10.1109/IROS51168.2021.9635989

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

We propose the ViNet architecture for audiovisual saliency prediction. ViNet is a fully convolutional encoder-decoder architecture. The encoder uses visual features from a network trained for action recognition, and the decoder infers a saliency map via trilinear interpolation and 3D convolutions, combining features from multiple hierarchies. The overall architecture of ViNet is conceptually simple; it is causal and runs in real-time (60 fps). ViNet does not use audio as input and still outperforms the state-of-the-art audio-visual saliency prediction models on nine different datasets (three visual-only and six audio-visual datasets). ViNet also surpasses human performance on the CC, SIM and AUC metrics for the AVE dataset, and to our knowledge, it is the first model to do so. We also explore a variation of ViNet architecture by augmenting audio features into the decoder. To our surprise, upon sufficient training, the network becomes agnostic to the input audio and provides the same output irrespective of the input. Interestingly, we also observe similar behaviour in the previous state-of-the-art models [1] for audio-visual saliency prediction. Our findings contrast with previous works on deep learning-based audio-visual saliency prediction, suggesting a clear avenue for future explorations incorporating audio in a more effective manner. The code and pre-trained models are available at https://github.com/samyak0210/ViNet.
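
The abstract reports results on the CC and SIM metrics. As a rough illustration of what these two scores measure, the following is a minimal sketch using their commonly cited definitions in saliency evaluation: CC as the Pearson correlation between the predicted and ground-truth maps, and SIM as the histogram intersection of the two maps after each is normalized to sum to 1. It is not the authors' evaluation code; the function names `cc` and `sim`, the helper `_flatten`, and the nested-list map representation are assumptions made for this example, and AUC is omitted because it has several variants.

```python
import math


def _flatten(saliency_map):
    """Flatten a 2D map (list of rows) into a list of floats."""
    return [float(v) for row in saliency_map for v in row]


def cc(pred, gt):
    """Pearson linear correlation coefficient between two saliency maps.

    Returns 0.0 when either map is constant (assumed convention).
    """
    p, g = _flatten(pred), _flatten(gt)
    mean_p = sum(p) / len(p)
    mean_g = sum(g) / len(g)
    cov = sum((a - mean_p) * (b - mean_g) for a, b in zip(p, g))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_g = sum((b - mean_g) ** 2 for b in g)
    denom = math.sqrt(var_p * var_g)
    return cov / denom if denom > 0 else 0.0


def sim(pred, gt):
    """Similarity (histogram intersection) between two non-negative maps.

    Each map is normalized to sum to 1; the score is the sum of the
    element-wise minima. Returns 0.0 if either map sums to zero
    (assumed convention).
    """
    p, g = _flatten(pred), _flatten(gt)
    sum_p, sum_g = sum(p), sum(g)
    if sum_p <= 0 or sum_g <= 0:
        return 0.0
    return sum(min(a / sum_p, b / sum_g) for a, b in zip(p, g))


if __name__ == "__main__":
    # Toy 2x2 maps, purely illustrative (not data from the paper).
    prediction = [[0.0, 1.0], [2.0, 3.0]]
    ground_truth = [[0.0, 1.0], [2.0, 3.0]]
    print("CC :", cc(prediction, ground_truth))
    print("SIM:", sim(prediction, ground_truth))
```

Both scores reach their maximum of 1 when the two maps are identical up to scale; SIM drops to 0 when the maps have no overlapping mass, while CC can go negative for anti-correlated maps.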

Pages: 3520-3527

Page count: 8
