Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited: 6
Authors
Khan, Faheem Ullah [1 ]
Milner, Ben P. [1 ]
Le Cornu, Thomas [1 ]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; ENHANCEMENT; NOISE; INTELLIGIBILITY; SEGREGATION; PREDICTION; FREQUENCY; TRACKING;
DOI
10.1109/TASLP.2018.2835719
CLC classification number
O42 [Acoustics];
Subject classification codes
070206; 082403;
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, but with highest performance occurring when combining audio and visual information to create the audio-visual masks.
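The binary and ratio masks described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it applies an ideal binary mask (IBM) and an ideal ratio mask (IRM) to toy magnitude spectrograms, where the array names, shapes, and additive magnitude mixing are all simplifying assumptions for the example (in practice the masks would be estimated from audio and/or visual features and applied to an STFT of the mixture).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrograms (freq bins x frames) standing in for the
# target and interfering speakers; shapes are illustrative only.
target = np.abs(rng.normal(size=(257, 100)))
interferer = np.abs(rng.normal(size=(257, 100)))
mixture = target + interferer  # simplistic additive mixing in magnitude

# Ideal ratio mask (IRM): a soft weight in [0, 1] per time-frequency unit,
# here in the common power-ratio form (epsilon avoids division by zero).
irm = target**2 / (target**2 + interferer**2 + 1e-12)

# Ideal binary mask (IBM): a hard 0/1 decision per time-frequency unit,
# keeping only units where the target dominates the interferer.
ibm = (target > interferer).astype(float)

# Applying a mask attenuates interferer-dominated regions of the mixture.
separated_ratio = irm * mixture
separated_binary = ibm * mixture

print(irm.min() >= 0.0 and irm.max() <= 1.0)  # prints True
```

The soft IRM typically yields fewer musical-noise artifacts than the hard IBM, which is consistent with the paper's use of ratio masks for the audio-visual case.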
Pages: 1742-1754
Page count: 13
Related papers
50 records in total
  • [31] Online blind speech separation using multiple acoustic speaker tracking and time-frequency masking
    Pertila, P.
    COMPUTER SPEECH AND LANGUAGE, 2013, 27 (03): : 683 - 702
  • [32] Integration of audio-visual information for multi-speaker multimedia speaker recognition
    Yang, Jichen
    Chen, Fangfan
    Cheng, Yu
    Lin, Pei
    DIGITAL SIGNAL PROCESSING, 2024, 145
  • [33] Separation of audio-visual speech sources: A new approach exploiting the audio-visual coherence of speech stimuli
    Sodoyer, D
    Schwartz, JL
    Girin, L
    Klinkisch, J
    Jutten, C
    EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING, 2002, 2002 (11) : 1165 - 1173
  • [36] Enhancing Audio Speech using Visual Speech Features
    Almajai, Ibrahim
    Milner, Ben
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1915 - 1918
  • [37] The 'Audio-Visual Face Cover Corpus': Investigations into audio-visual speech and speaker recognition when the speaker's face is occluded by facewear
    Fecher, Natalie
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2247 - 2250
  • [38] USING AUDIO AND VISUAL CUES FOR SPEAKER DIARISATION INITIALISATION
    Garau, Giulia
    Bourlard, Herve
    2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4942 - 4945
  • [39] Fusion Methods for Speech Enhancement and Audio Source Separation
    Jaureguiberry, Xabier
    Vincent, Emmanuel
    Richard, Gael
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (07) : 1266 - 1279
  • [40] Multimodal (audio-visual) source separation exploiting multi-speaker tracking, robust beamforming and time-frequency masking
    Naqvi, S. Mohsen
    Wang, W.
    Khan, M. Salman
    Barnard, M.
    Chambers, J. A.
    IET SIGNAL PROCESSING, 2012, 6 (05) : 466 - 477