Timestamp-aligning and keyword-biasing end-to-end ASR front-end for a KWS system

Cited by: 5
Authors
Shi, Gui-Xin [1]
Zhang, Wei-Qiang [1]
Wang, Guan-Bo [1]
Zhao, Jing [1]
Chai, Shu-Zhou [1]
Zhao, Ze-Yu [1]
Affiliations
[1] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol, Dept Elect Engn, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
OpenSAT20; End-to-end ASR; End-to-end KWS; Force alignment; Biased loss; SPEECH RECOGNITION; ENERGY SCORER; SEARCH; ATTENTION;
DOI
10.1186/s13636-021-00212-9
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Many end-to-end approaches have been proposed to detect predefined keywords. In multi-keyword scenarios, two bottlenecks remain: (1) the important data, i.e., speech that contains keywords, is sparsely distributed, and (2) the timestamps of the detected keywords are inaccurate. To alleviate the first issue and further improve the performance of the end-to-end ASR front-end, we propose a biased loss function that guides the recognizer to pay more attention to the speech segments containing the predefined keywords. We address the second issue by modifying the forced alignment applied to the end-to-end ASR front-end. To obtain the frame-level alignment, we use a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) based acoustic model (AM) as an auxiliary aligner. The proposed system was evaluated in OpenSAT20, held by the National Institute of Standards and Technology (NIST). The performance of our end-to-end KWS system is comparable to that of the conventional hybrid KWS system, and sometimes slightly better. With the fused results of the end-to-end and conventional KWS systems, we won first prize in the KWS track. On the dev dataset (a part of the SAFE-T corpus), the system outperforms the baseline by a large margin: our system with the GMM-HMM aligner achieves lower segmentation-aware word error rates (a relative decrease of 7.9-19.2%) and higher overall actual term-weighted values (a relative increase of 3.6-11.0%), which demonstrates the effectiveness of the proposed method. For more precise alignments, a DNN-based AM can be used as the aligner at the cost of more computation.
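The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the form of the biased loss nor the details of the modified alignment. The sketch assumes the bias is a simple per-utterance weight (the hypothetical parameter `bias`) and that timestamps are read off a frame-level label sequence with a fixed frame shift (the hypothetical `frame_shift`, with `<sil>` as an assumed silence label). The function names are illustrative only.

```python
from typing import List, Sequence, Tuple


def biased_loss(utt_losses: Sequence[float],
                has_keyword: Sequence[bool],
                bias: float = 2.0) -> float:
    """Weighted mean of per-utterance losses.

    Assumption: utterances that contain a predefined keyword are
    weighted `bias` times as much as the others.
    """
    if len(utt_losses) != len(has_keyword):
        raise ValueError("utt_losses and has_keyword must have the same length")
    if not utt_losses:
        raise ValueError("empty batch")
    weights = [bias if kw else 1.0 for kw in has_keyword]
    return sum(w * l for w, l in zip(weights, utt_losses)) / sum(weights)


def frames_to_segments(frame_labels: Sequence[str],
                       frame_shift: float = 0.01,
                       silence: str = "<sil>") -> List[Tuple[str, float, float]]:
    """Collapse a frame-level alignment into (label, start_sec, end_sec).

    Limitation of this sketch: two adjacent occurrences of the same
    label with no silence between them merge into one segment.
    """
    segments: List[Tuple[str, float, float]] = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            label = frame_labels[start]
            if label != silence:
                segments.append((label, start * frame_shift, i * frame_shift))
            start = i
    return segments


if __name__ == "__main__":
    # The keyword-bearing utterance (loss 3.0) pulls the mean upward.
    print(biased_loss([1.0, 3.0], [False, True], bias=2.0))
    alignment = ["<sil>", "<sil>", "hello", "hello", "hello", "<sil>", "world", "world"]
    for label, start_sec, end_sec in frames_to_segments(alignment):
        print(f"{label}: {start_sec:.2f}s to {end_sec:.2f}s")
```

With `bias=1.0` the first function reduces to the ordinary mean loss, so the weight is the only thing that shifts attention toward keyword-bearing speech in this sketch. The second function shows why a frame-level alignment yields timestamps directly: each run of identical labels maps to a start and end time through the frame shift.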
Pages: 14
Related papers
50 in total
  • [1] Timestamp-aligning and keyword-biasing end-to-end ASR front-end for a KWS system
    Gui-Xin Shi
    Wei-Qiang Zhang
    Guan-Bo Wang
    Jing Zhao
    Shu-Zhou Chai
    Ze-Yu Zhao
    EURASIP Journal on Audio, Speech, and Music Processing, 2021
  • [2] Contextual Biasing for End-to-End Chinese ASR
    Zhang, Kai
    Zhang, Qiuxia
    Wang, Chung-Che
    Jang, Jyh-Shing Roger
    IEEE ACCESS, 2024, 12 : 92960 - 92975
  • [3] ETEH: Unified Attention-Based End-to-End ASR and KWS Architecture
    Cheng, Gaofeng
    Miao, Haoran
    Yang, Runyan
    Deng, Keqi
    Yan, Yonghong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1360 - 1373
  • [4] Class LM and Word Mapping for Contextual Biasing in End-to-End ASR
    Huang, Rongqing
    Abdel-hamid, Ossama
    Li, Xinwei
    Evermann, Gunnar
    INTERSPEECH 2020, 2020 : 4348 - 4351
  • [5] END-TO-END ASR-FREE KEYWORD SEARCH FROM SPEECH
    Audhkhasi, Kartik
    Rosenberg, Andrew
    Sethy, Abhinav
    Ramabhadran, Bhuvana
    Kingsbury, Brian
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017 : 4840 - 4844
  • [6] End-to-End ASR-Free Keyword Search From Speech
    Audhkhasi, Kartik
    Rosenberg, Andrew
    Sethy, Abhinav
    Ramabhadran, Bhuvana
    Kingsbury, Brian
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1351 - 1359
  • [7] NAM+: TOWARDS SCALABLE END-TO-END CONTEXTUAL BIASING FOR ADAPTIVE ASR
    Munkhdalai, Tsendsuren
    Wu, Zelin
    Pundak, Golan
    Sim, Khe Chai
    Li, Jiayang
    Rondon, Pat
    Sainath, Tara N.
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022 : 190 - 196
  • [8] THE FRONT-END SYSTEM
    CHAPPELL, SG
    HENIG, FH
    WATSON, DS
    BELL SYSTEM TECHNICAL JOURNAL, 1982, 61 (06) : 1165 - 1176
  • [9] Evaluation of a wavelet based ASR front-end
    Farooq, Omar
    Datta, Sekharjit
    INTERNATIONAL JOURNAL OF WAVELETS MULTIRESOLUTION AND INFORMATION PROCESSING, 2007, 5 (04) : 641 - 654
  • [10] A phoneme-similarity based ASR front-end
    Applebaum, TH
    Morin, P
    Hanson, BA
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996 : 33 - 36