Speech Enhancement for Multimodal Speaker Diarization System

Cited by: 6
Authors
Ahmad, Rehan [1]
Zubair, Syed [2, 3]
Alquhayz, Hani [4]
Affiliations
[1] Int Islamic Univ, Dept Elect Engn, Islamabad 44000, Pakistan
[2] Analyt Camp, Islamabad 44000, Pakistan
[3] Univ Sialkot, Dept Comp Sci, Sialkot 51310, Pakistan
[4] Majmaah Univ, Dept Comp Sci & Informat, Coll Sci Zulfi, Al Majmaah 11952, Saudi Arabia
Keywords
Multimodal speaker diarization; LSTM; audio-visual synchronization; additive white Gaussian noise; Gaussian mixture model; diarization error rate; SEGMENTATION; SEPARATION; MEETINGS;
DOI
10.1109/ACCESS.2020.3007312
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
A speaker diarization system identifies the speaker-homogeneous regions in recordings where multiple speakers are present. It answers the question 'who spoke when?'. The data sets for speaker diarization usually consist of telephone calls, meetings, TV and talk shows, broadcast news, and other multi-speaker recordings. In this paper, we present the performance of our proposed multimodal speaker diarization system under noisy conditions. Two types of noise, additive white Gaussian noise (AWGN) and realistic environmental noise, are used to evaluate the system. To mitigate the effect of noise, we propose adding an LSTM-based speech enhancement block to our diarization pipeline. This block is trained on a synthesized data set with more than 100 noise types to enhance the noisy speech. The enhanced speech is then used in the multimodal speaker diarization system, which utilizes a pre-trained audio-visual synchronization model to find the active speaker. High-confidence active speaker segments are then used to train the speaker-specific clusters on the enhanced speech. A subset of the AMI corpus consisting of 5.4 h of recordings is used in this analysis. For AWGN, the performance improvement of the LSTM model is comparable to that of the Wiener filter, while for realistic environmental noise, the LSTM model improves significantly over the Wiener filter in terms of diarization error rate (DER).
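Two quantities named in the abstract, AWGN at a chosen signal-to-noise ratio and the diarization error rate, can be illustrated with a minimal sketch. This is not the authors' code: the function names (`add_awgn`, `measured_snr_db`, `frame_der`) are hypothetical, and the DER shown is a simplified frame-level version that assumes speaker labels are already mapped between reference and hypothesis, with no overlapping speech and no scoring collar.

```python
import math
import random


def add_awgn(signal, snr_db, seed=0):
    """Return `signal` plus white Gaussian noise scaled to the target SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved for the noise power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


def frame_der(reference, hypothesis):
    """Simplified frame-level DER (assumption: labels are already mapped).

    Each list holds one speaker label per frame; None marks non-speech.
    DER = (missed + false alarm + confusion) / reference speech frames.
    """
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1
            elif hyp != ref:
                confusion += 1
        elif hyp is not None:
            false_alarm += 1
    return (missed + false_alarm + confusion) / speech


# Usage: a 440 Hz tone sampled at 16 kHz, corrupted at 10 dB SNR.
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_awgn(clean, snr_db=10)

reference = ["A", "A", "B", "B", None, None]
hypothesis = ["A", "B", "B", None, None, "A"]
der = frame_der(reference, hypothesis)
```

Standard DER scoring additionally finds the optimal speaker mapping and usually applies a collar around segment boundaries; both are omitted here for brevity.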
Pages: 126671-126680
Page count: 10
Related papers
50 in total
  • [21] Speaker Diarization and Detection System using A Priori Speaker Information
    Kenai, Ouassila
    Asbai, Nassim
    Ouamour, Siham
    Guerti, Mhania
    Djeghiour, Salim
    [J]. 2018 2ND INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE AND SPEECH PROCESSING (ICNLSP), 2018, : 73 - 78
  • [22] Overlapped speech detection for improved speaker diarization in multiparty meetings
    Boakye, Kofi
    Trueba-Hornero, Beatriz
    Vinyals, Oriol
    Friedland, Gerald
    [J]. 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 4353 - 4356
  • [23] Methodologies for the evaluation of Speaker Diarization and Automatic Speech Recognition in the presence of overlapping speech
    Galibert, Olivier
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 1130 - 1133
  • [24] Speech Recognition and Multi-Speaker Diarization of Long Conversations
    Mao, Huanru Henry
    Li, Shuyang
    McAuley, Julian
    Cottrell, Garrison W.
    [J]. INTERSPEECH 2020, 2020, : 691 - 695
  • [25] Joint Speech Recognition and Speaker Diarization via Sequence Transduction
    El Shafey, Laurent
    Soltau, Hagen
    Shafran, Izhak
    [J]. INTERSPEECH 2019, 2019, : 396 - 400
  • [26] Neural speech turn segmentation and affinity propagation for speaker diarization
    Yin, Ruiqing
    Bredin, Herve
    Barras, Claude
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1393 - 1397
  • [27] The LIA RT'07 speaker diarization system
    Fredouille, Corinne
    Evans, Nicholas
    [J]. MULTIMODAL TECHNOLOGIES FOR PERCEPTION OF HUMANS, 2008, 4625 : 520 - 532
  • [28] System output combination for improved speaker diarization
    Bozonnet, Simon
    Evans, Nicholas
    Anguera, Xavier
    Vinyals, Oriol
    Friedland, Gerald
    Fredouille, Corinne
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2650 - +
  • [29] Experiments with Segmentation in an Online Speaker Diarization System
    Kunesova, Marie
    Zajic, Zbynek
    Radova, Vlasta
    [J]. TEXT, SPEECH, AND DIALOGUE, TSD 2017, 2017, 10415 : 429 - 437
  • [30] Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation
    Lyu, Ke-Ming
    Lyu, Ren-yuan
    Chang, Hsien-Tsung
    [J]. PEERJ COMPUTER SCIENCE, 2024, 10