Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking

Cited by: 25
Authors
Naqvi, S. Mohsen [1]
Wang, W. [2]
Khan, M. Salman [1]
Barnard, M. [2]
Chambers, J. A. [1]
Affiliations
[1] Univ Loughborough, Sch Elect Elect & Syst Engn, Adv Signal Proc Grp, Loughborough LE11 3TU, Leics, England
[2] Univ Surrey, Dept Elect Engn, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
BLIND SOURCE SEPARATION; CONVOLUTIVE MIXTURES; SPEECH SEPARATION; AUDIO;
DOI
10.1049/iet-spr.2011.0124
Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification code
0808 ; 0809 ;
Abstract
A novel multimodal source separation approach is proposed for physically moving and stationary sources which exploits a circular microphone array, multiple video cameras, robust spatial beamforming and time-frequency masking. The challenge of separating moving sources, including higher reverberation time (RT) even for physically stationary sources, is that the mixing filters are time varying; as such, the unmixing filters should also be time varying, but these are difficult to determine from audio measurements alone. Therefore, in the proposed approach, the visual modality is used to facilitate the separation for both stationary and moving sources. The movement of the sources is detected by a three-dimensional (3D) tracker based on a Markov Chain Monte Carlo particle filter. The audio separation is performed by a robust least squares frequency invariant data-independent beamformer. The uncertainties in source localisation and direction of arrival information obtained from the 3D video-based tracker are controlled by using a convex optimisation approach in the beamformer design. In the final stage, the separated audio sources are further enhanced by applying a binary time-frequency masking technique in the cepstral domain. Experimental results show that, using the visual modality, the proposed algorithm can not only achieve performance better than conventional frequency-domain source separation algorithms, but also provide acceptable separation performance for moving sources.
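The final enhancement stage named in the abstract, binary time-frequency masking, can be illustrated with a minimal sketch. It assumes two beamformer outputs are already available as complex STFT arrays and builds the masks by a plain magnitude comparison; the cepstral-domain processing described in the abstract is not reproduced here. The function names and the `ratio` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def binary_tf_masks(Y1, Y2, ratio=1.0):
    """Build binary masks from two complex STFTs of shape (freq, frames).

    A time-frequency bin is assigned to a source when that source's
    beamformer output is stronger than the other's by the factor `ratio`
    (an illustrative threshold, assumed here).
    """
    mag1 = np.abs(Y1)
    mag2 = np.abs(Y2)
    mask1 = (mag1 > ratio * mag2).astype(float)
    mask2 = (mag2 > ratio * mag1).astype(float)
    return mask1, mask2

def apply_masks(Y1, Y2, ratio=1.0):
    """Keep only the bins in each output that its own mask selects."""
    mask1, mask2 = binary_tf_masks(Y1, Y2, ratio)
    return mask1 * Y1, mask2 * Y2

# Toy 2x2 "spectrograms" standing in for two beamformer outputs.
Y1 = np.array([[2.0 + 0j, 0.1 + 0j], [0.5 + 0j, 3.0 + 0j]])
Y2 = np.ones((2, 2), dtype=complex)
S1, S2 = apply_masks(Y1, Y2)
```

In this toy case the first output keeps the bins where it dominates and is zeroed elsewhere. With `ratio` above 1, some bins are assigned to neither source.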
Pages: 466-477
Number of pages: 12