Active Audio-Visual Separation of Dynamic Sound Sources

Cited by: 8
Authors
Majumder, Sagnik [1]
Grauman, Kristen [1, 2]
Affiliations
[1] UT Austin, Austin, TX 78712 USA
[2] Facebook AI Res, Austin, TX USA
DOI
10.1007/978-3-031-19842-7_32
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces [13] simulations in real-world scanned Matterport3D [11] environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.
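As a rough illustration of how self-attention over a memory of per-step estimates could refine both the current and earlier estimates, here is a minimal NumPy sketch. It is not the authors' architecture: the function name `refine_estimates`, the feature size, the single attention head, the random weights, and the residual update are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_estimates(memory, w_q, w_k, w_v):
    """Single-head self-attention over a memory of per-step features.

    memory: (T, d) array, one (hypothetical) estimate feature per timestep.
    Returns the refined (T, d) features and the (T, T) attention weights.
    No causal mask is applied, so earlier steps can attend to later ones;
    this is what lets past estimates be revised using newer observations.
    """
    q = memory @ w_q
    k = memory @ w_k
    v = memory @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    refined = memory + attn @ v  # residual update of every stored step
    return refined, attn


# Toy usage with random data (assumed sizes: 5 timesteps, 8 features).
rng = np.random.default_rng(0)
T, d = 5, 8
memory = rng.normal(size=(T, d))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
refined, attn = refine_estimates(memory, w_q, w_k, w_v)
```

In this sketch every row of `attn` sums to one, and the entry `attn[0, T - 1]` is nonzero, meaning the oldest stored estimate draws on the most recent step when it is refined.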
Pages: 551-569
Page count: 19