Using Visual Speech Information in Masking Methods for Audio Speaker Separation

Cited by: 6
Authors
Khan, Faheem Ullah [1]
Milner, Ben P. [1]
Le Cornu, Thomas [1]
Affiliations
[1] Univ East Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England
Keywords
Speaker separation; audio-visual processing; binary masks; ratio mask; ENHANCEMENT; NOISE; INTELLIGIBILITY; SEGREGATION; PREDICTION; FREQUENCY; TRACKING
DOI
10.1109/TASLP.2018.2835719
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper examines whether visual speech information can be effective within audio-masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map the visual speech features to an audio feature space from which both visually derived binary masks and visually derived ratio masks are estimated, before application to the speech mixture. Second, an audio ratio masking method forms a baseline approach for speaker separation, which is extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only, and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, with the highest performance occurring when combining audio and visual information to create the audio-visual masks.
Pages: 1742-1754
Number of pages: 13
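
The abstract describes estimating binary or ratio time-frequency masks and applying them to the speech mixture. The following is a minimal Python sketch of that mask-application step only, assuming a NumPy/SciPy STFT front end; it is not the authors' implementation. The mask_estimator callable is a hypothetical stand-in for any mask-predicting model (such as the paper's DNN-based visual, audio, or audio-visual estimators), and the oracle ratio mask in the usage example is purely illustrative.

import numpy as np
from scipy.signal import stft, istft

def apply_ratio_mask(mixture, mask_estimator, fs=16000, nperseg=512, noverlap=384):
    """Apply a time-frequency ratio mask to a mixed-speaker signal.

    mask_estimator is a placeholder for any model that returns a mask in
    [0, 1] with the same shape as the mixture magnitude spectrogram.
    """
    _, _, Zxx = stft(mixture, fs=fs, nperseg=nperseg, noverlap=noverlap)
    magnitude, phase = np.abs(Zxx), np.angle(Zxx)

    # Ratio mask: per-bin weight in [0, 1]; a binary mask would instead
    # threshold these values (e.g. at 0.5).
    mask = np.clip(mask_estimator(magnitude), 0.0, 1.0)

    # Re-use the mixture phase for reconstruction of the target estimate.
    target_stft = mask * magnitude * np.exp(1j * phase)
    _, target = istft(target_stft, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return target

# Usage example with an "oracle" ratio mask computed from known sources,
# purely to illustrate the masking step (random noise stands in for speech).
if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(0)
    target = rng.standard_normal(fs)
    interferer = rng.standard_normal(fs)
    mixture = target + interferer

    _, _, T = stft(target, fs=fs, nperseg=512, noverlap=384)
    _, _, I = stft(interferer, fs=fs, nperseg=512, noverlap=384)
    oracle = np.abs(T) / (np.abs(T) + np.abs(I) + 1e-8)

    separated = apply_ratio_mask(mixture, lambda mag: oracle)
    print(separated.shape)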