MUSE: MULTI-MODAL TARGET SPEAKER EXTRACTION WITH VISUAL CUES

Cited by: 12
Authors
Pan, Zexu [1 ,2 ]
Tao, Ruijie [3 ]
Xu, Chenglin [3 ]
Li, Haizhou [1 ,3 ,4 ]
Affiliations
[1] Natl Univ Singapore NUS, Inst Data Sci, Singapore, Singapore
[2] NUS, Grad Sch Integrat Sci & Engn, Singapore, Singapore
[3] NUS, Dept Elect & Comp Engn, Singapore, Singapore
[4] Univ Bremen, Machine Listening Lab, Bremen, Germany
Funding
National Research Foundation, Singapore
Keywords
Multi-modal; target speaker extraction; time domain; robustness; speech; separation
DOI
10.1109/ICASSP39728.2021.9414023
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such a reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the target speech directly from the speech mixture at inference time, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.
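For context on the reported metric: SI-SDR (scale-invariant signal-to-distortion ratio) projects the estimated signal onto the clean reference and reports the ratio of target energy to residual energy in decibels. A minimal NumPy sketch follows; the si_sdr helper and the toy sine-wave example are illustrative assumptions, not code from the paper.

import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB (higher is better)."""
    # Remove DC offsets so the measure is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Orthogonally project the estimate onto the reference: the projection
    # is the "target" component, the remainder counts as distortion.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(noise, noise))

# Toy usage: a 440 Hz tone corrupted by additive noise.
t = np.linspace(0.0, 1.0, 16000)
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = clean + 0.1 * np.random.randn(clean.size)
print(f"SI-SDR: {si_sdr(noisy, clean):.1f} dB")

Because of the scale-invariant projection, multiplying the estimate by any nonzero constant leaves the score unchanged, which is why SI-SDR is commonly preferred over plain SNR when evaluating separation and extraction systems.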
Pages: 6678-6682
Page count: 5