Semantic Audio-Visual Navigation

Cited by: 44

Authors:

Chen, Changan [1,2]

Al-Halah, Ziad [1]

Grauman, Kristen [1,2]

Affiliations:

[1] UT Austin, Austin, TX 78712 USA

[2] Facebook AI Res, Menlo Pk, CA 94025 USA

Source:

2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021

Keywords:

DOI:

10.1109/CVPR46437.2021.01526

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.


Pages: 15511-15520

Page count: 10

