Active Audio-Visual Separation of Dynamic Sound Sources

Cited by: 8

Authors:

Majumder, Sagnik [1]

Grauman, Kristen [1,2]

Affiliations:

[1] UT Austin, Austin, TX 78712 USA

[2] Facebook AI Res, Austin, TX USA

Source:

COMPUTER VISION, ECCV 2022, PT XXXIX | 2022 / Vol. 13699

Keywords:

DOI:

10.1007/978-3-031-19842-7_32

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces [13] simulations in real-world scanned Matterport3D [11] environments, we show that our model is able to learn efficient behavior to carry out continuous separation of a dynamic audio target. Project: https://vision.cs.utexas.edu/projects/active-av-dynamic-separation/.

Pages: 551-569

Page count: 19
