Semantic Audio-Visual Navigation

Cited by: 44
Authors
Chen, Changan [1, 2]
Al-Halah, Ziad [1]
Grauman, Kristen [1, 2]
Affiliations
[1] UT Austin, Austin, TX 78712 USA
[2] Facebook AI Research, Menlo Park, CA 94025 USA
DOI
10.1109/CVPR46437.2021.01526
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
Pages: 15511-15520
Page count: 10