Semantic Audio-Visual Navigation

Cited by: 44
Authors
Chen, Changan [1, 2]
Al-Halah, Ziad [1]
Grauman, Kristen [1, 2]
Affiliations
[1] UT Austin, Austin, TX 78712 USA
[2] Facebook AI Res, Menlo Pk, CA 94025 USA
DOI
10.1109/CVPR46437.2021.01526
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
Pages: 15511-15520
Page count: 10