MUSE: MULTI-MODAL TARGET SPEAKER EXTRACTION WITH VISUAL CUES

Cited by: 12

Authors:

Pan, Zexu [1,2]

Tao, Ruijie [3]

Xu, Chenglin [3]

Li, Haizhou [1,3,4]

Affiliations:

[1] Natl Univ Singapore NUS, Inst Data Sci, Singapore, Singapore

[2] NUS, Grad Sch Integrat Sci & Engn, Singapore, Singapore

[3] NUS, Dept Elect & Comp Engn, Singapore, Singapore

[4] Univ Bremen, Machine Listening Lab, Bremen, Germany

Source:

2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021

Funding:

National Research Foundation, Singapore

Keywords:

Multi-modal; target speaker extraction; time domain; robustness; SPEECH; SEPARATION;

DOI:

10.1109/ICASSP39728.2021.9414023

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the reference target speech directly from the mixture speech at inference time, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms other competitive baselines in terms of SI-SDR and PESQ, but also shows consistent improvement in cross-dataset evaluations.
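For readers unfamiliar with SI-SDR (scale-invariant signal-to-distortion ratio), one of the two metrics named in the abstract, the following is a minimal sketch of its commonly used definition: the estimate is projected onto the reference, and the ratio of projected energy to residual energy is reported in dB. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions made for illustration; this record does not describe the authors' evaluation code.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Hypothetical helper for illustration; not taken from the MuSE paper.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean of each signal so a DC offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is why the metric is often preferred over plain SDR for time-domain extraction models whose output gain is not constrained.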

Citation

Pages: 6678-6682

Page count: 5
