Speaker-adaptive speech recognition using speaker diarization for improved transcription of large spoken archives

Cited by: 14
Authors
Cerva, Petr [1]
Silovsky, Jan [1]
Zdansky, Jindrich [1]
Nouza, Jan [1]
Seps, Ladislav [1]
Affiliations
[1] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic
Keywords
Speaker adaptive; Automatic speech recognition; Speaker adaptation; Speaker diarization; Automatic transcription; Large spoken archives; ADAPTATION; ACCESS;
DOI
10.1016/j.specom.2013.06.017
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
This paper deals with speaker-adaptive speech recognition for large spoken archives. The goal is to improve the recognition accuracy of an automatic speech recognition (ASR) system that is being deployed for transcription of a large archive of Czech radio. This archive represents a significant part of Czech cultural heritage, as it contains recordings covering 90 years of broadcasting. A large portion of these documents (100,000 h) is to be transcribed and made public for browsing. To improve the transcription results, an efficient speaker-adaptive scheme is proposed. The scheme is based on integration of speaker diarization and adaptation methods and is designed to achieve a low Real-Time Factor (RTF) of the entire adaptation process, because the archive's size is enormous. It thus employs just two decoding passes, where the first one is carried out using the lexicon with a reduced number of items. Moreover, the transcripts from the first pass serve not only for adaptation, but also as the input to the speaker diarization module, which employs two-stage clustering. The output of diarization is then utilized for a cluster-based unsupervised Speaker Adaptation (SA) approach that also utilizes information based on the gender of each individual speaker. Presented experimental results on various types of programs show that our adaptation scheme yields a significant Word Error Rate (WER) reduction from 22.24% to 18.85% over the Speaker Independent (SI) system while operating at a reasonable RTF. (c) 2013 Elsevier B.V. All rights reserved.
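The two figures of merit named in the abstract, Word Error Rate (WER) reduction and Real-Time Factor (RTF), can be illustrated with a minimal sketch. The WER values are the ones reported in the abstract; the function names and the RTF inputs in the usage note are hypothetical and chosen only for illustration, since the abstract does not state the measured RTF.

```python
def relative_reduction(baseline: float, adapted: float) -> float:
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - adapted) / baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the processed audio."""
    return processing_seconds / audio_seconds


# WER values taken from the abstract (SI = speaker independent,
# SA = after the proposed speaker adaptation).
wer_si = 22.24
wer_sa = 18.85

absolute = wer_si - wer_sa                      # in percentage points
relative = relative_reduction(wer_si, wer_sa)   # in percent of the SI WER

print(f"absolute WER reduction: {absolute:.2f} points")
print(f"relative WER reduction: {relative:.2f} %")
```

Under these numbers the absolute reduction is 3.39 percentage points, which corresponds to a relative reduction of roughly 15.2% over the SI system. As a purely hypothetical usage example, `real_time_factor(30.0, 60.0)` returns 0.5, meaning one minute of audio would be processed in half a minute.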
Pages: 1033-1046
Number of pages: 14
Related papers
50 in total
  • [41] Adaptive systems for unsupervised speaker tracking and speech recognition
    Herbig, Tobias
    Gerl, Franz
    Minker, Wolfgang
    Haeb-Umbach, Reinhold
    EVOLVING SYSTEMS, 2011, 2 (03) : 199 - 214
  • [42] Improved automatic speech recognition through speaker normalization
    Giuliani, D
    Gerosa, M
    Brugnara, F
    COMPUTER SPEECH AND LANGUAGE, 2006, 20 (01) : 107 - 123
  • [43] Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems
    Padmanabhan, M
    Bahl, LR
    Nahamoo, D
    Picheny, MA
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 701 - 704
  • [44] JOINT SPEAKER DIARIZATION AND RECOGNITION USING CONVOLUTIONAL AND RECURRENT NEURAL NETWORKS
    Zhou, Zhihan
    Zhang, Yichi
    Duan, Zhiyao
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 2496 - 2500
  • [45] Robust Speaker Recognition Using Improved GFCC and Adaptive Feature Selection
    Zhang, Xingyu
    Zou, Xia
    Sun, Meng
    Wu, Penglong
    SECURITY WITH INTELLIGENT COMPUTING AND BIG-DATA SERVICES, 2020, 895 : 159 - 169
  • [46] Visual Speech Segmentation and Speaker Recognition for Transcription of TV News
    Chaloupka, Josef
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1284 - 1287
  • [47] Improved Speaker Recognition System for Stressed Speech using Deep Neural Networks
    Dumpala, Sri Harsha
    Kopparapu, Sunil Kumar
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 1257 - 1264
  • [48] SCALING AND BIAS CODES FOR MODELING SPEAKER-ADAPTIVE DNN-BASED SPEECH SYNTHESIS SYSTEMS
    Hieu-Thi Luong
    Yamagishi, Junichi
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 610 - 617
  • [49] Speech Segmentation and Speaker Diarization using Time-Delay Neural Network
    Toruk, Mesut
    Serbes, Ahmet
    Bilgin, Gokhan
    2019 INNOVATIONS IN INTELLIGENT SYSTEMS AND APPLICATIONS CONFERENCE (ASYU), 2019, : 335 - 339
  • [50] BOOTSTRAPPING NON-PARALLEL VOICE CONVERSION FROM SPEAKER-ADAPTIVE TEXT-TO-SPEECH
    Luong, Hieu-Thi
    Yamagishi, Junichi
    2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 200 - 207