Look, Listen, and Act: Towards Audio-Visual Embodied Navigation

Cited by: 0
Authors
Gan, Chuang [1 ]
Zhang, Yiwei [2 ]
Wu, Jiajun [3 ]
Gong, Boqing [4 ]
Tenenbaum, Joshua B. [3 ]
Affiliations
[1] MIT IBM Watson AI Lab, Cambridge, MA 02142 USA
[2] Tsinghua Univ, Beijing, Peoples R China
[3] MIT, 77 Massachusetts Ave, Cambridge, MA 02139 USA
[4] Google, Mountain View, CA 94043 USA
Keywords
DOI
10.1109/icra40945.2020.9197008
CLC Classification
TP [automation technology; computer technology]
Subject Classification
0812
Abstract
A crucial ability of mobile intelligent agents is to integrate evidence from multiple sensory inputs in an environment and to take a sequence of actions to reach their goals. In this paper, we approach the problem of Audio-Visual Embodied Navigation: the task of planning the shortest path from a random starting location in a scene to a sound source in an indoor environment, given only raw egocentric visual and audio sensory data. To accomplish this task, the agent must learn across modalities, i.e., relate the audio signal to the visual environment. Here we describe an approach to audio-visual embodied navigation that takes advantage of both visual and audio evidence. Our solution is based on three key ideas: a visual perception mapper module that constructs a spatial memory of the environment, a sound perception module that infers the relative location of the sound source with respect to the agent, and a dynamic path planner that plans a sequence of actions toward the goal based on the audio-visual observations and the spatial memory of the environment. Experimental results on a newly collected Visual-Audio-Room dataset, built on a simulated multi-modal environment, demonstrate the effectiveness of our approach over several competitive baselines.
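The three-module pipeline described in the abstract (visual mapper, sound-source locator, dynamic planner) can be illustrated with a toy grid-world sketch. Everything below is an illustrative assumption, not the paper's actual implementation: the grid representation, the function names, and in particular the sound module, which here returns a perfect source estimate instead of inferring one from raw audio.

```python
from collections import deque

FREE, WALL = 0, 1

def update_spatial_memory(memory, grid, pos):
    # Visual-mapper stand-in: reveal the agent's cell and its 4-neighbours.
    rows, cols = len(grid), len(grid[0])
    r, c = pos
    for dr, dc in ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            memory[(nr, nc)] = grid[nr][nc]

def estimate_source(agent_pos, source_pos):
    # Sound-perception stand-in: the paper infers the source's relative
    # location from audio; here we simply return the true source cell.
    return source_pos

def plan_step(memory, shape, start, goal):
    # Dynamic-planner stand-in: BFS over cells known (or optimistically
    # assumed) free; returns the first move on a shortest path to the goal.
    if start == goal:
        return start
    rows, cols = shape
    parent = {start: None}
    frontier = deque([start])
    while frontier:
        pos = frontier.popleft()
        if pos == goal:
            while parent[pos] != start:  # walk back to the first move
                pos = parent[pos]
            return pos
        r, c = pos
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if nxt in parent or not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if memory.get(nxt, FREE) == WALL:  # unseen cells assumed free
                continue
            parent[nxt] = pos
            frontier.append(nxt)
    return start  # no known path: stay put

def navigate(grid, start, source, max_steps=50):
    # Sense -> localize -> replan loop; returns the visited path.
    memory, pos, path = {}, start, [start]
    for _ in range(max_steps):
        update_spatial_memory(memory, grid, pos)
        if pos == source:
            break
        goal = estimate_source(pos, source)
        pos = plan_step(memory, (len(grid), len(grid[0])), pos, goal)
        path.append(pos)
    return path
```

Replanning at every step is what makes the planner "dynamic" in the abstract's sense: the agent first commits to a short path through unseen cells it optimistically assumes free, then reroutes as the visual mapper reveals walls along the way.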
Pages: 9701 - 9707
Page count: 7
Related Papers
50 items in total
  • [1] Listen and Look: Audio-Visual Matching Assisted Speech Source Separation
    Lu, Rui
    Duan, Zhiyao
    Zhang, Changshui
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2018, 25 (09) : 1315 - 1319
  • [2] Sporadic Audio-Visual Embodied Assistive Robot Navigation For Human Tracking
    Singh, Gaurav
    Ghanem, Paul
    Padir, Taskin
    [J]. PROCEEDINGS OF THE 16TH ACM INTERNATIONAL CONFERENCE ON PERVASIVE TECHNOLOGIES RELATED TO ASSISTIVE ENVIRONMENTS, PETRA 2023, 2023, : 99 - 105
  • [3] Semantic Audio-Visual Navigation
    Chen, Changan
    Al-Halah, Ziad
    Grauman, Kristen
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 15511 - 15520
  • [4] CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments
    Liu, Xiulong
    Paul, Sudipta
    Chatterjee, Moitreya
    Cherian, Anoop
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3765 - 3773
  • [5] LOOK, LISTEN, AND LEARN. A Manual on the Use of Audio-Visual Materials in Informal Education
    Gilkinson, Howard
    Howell, William S.
    [J]. QUARTERLY JOURNAL OF SPEECH, 1948, 34 (04) : 529 - 530
  • [6] Filmic geographies: audio-visual, embodied-material
    Ernwein, Marion
    [J]. SOCIAL & CULTURAL GEOGRAPHY, 2022, 23 (06) : 779 - 796
  • [7] Transportation into Audio-visual Narratives: A Closer Look
    Reinhart, Amber Marie
    Zwarun, Lara
    Hall, Alice E.
    Tian, Yan
    [J]. COMMUNICATION QUARTERLY, 2021, 69 (05) : 564 - 585
  • [8] Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
    Cheng, Ying
    Wang, Ruize
    Pan, Zhihao
    Feng, Rui
    Zhang, Yuejie
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 3884 - 3892
  • [9] Audio-Visual Depth and Material Estimation for Robot Navigation
    Wilson, Justin
    Rewkowski, Nicholas
    Lin, Ming C.
    [J]. 2022 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2022, : 9239 - 9246
  • [10] Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio
    Chao, Fang-Yi
    Ozcinar, Cagri
    Zhang, Lu
    Hamidouche, Wassim
    Deforges, Olivier
    Smolic, Aljosa
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2020, : 355 - 358