Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Cited by: 3
Authors
Li, Qiujia [1]
Zhang, Chao [1, 2]
Woodland, Philip C. [1]
Affiliations
[1] Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England
[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Keywords
Speech recognition; System combination; Hybrid DNN-HMM systems; Attention-based encoder-decoder models; Lattice rescore; SPEECH; NETWORKS;
DOI
10.1016/j.specom.2022.12.002
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
The traditional hybrid deep neural network (DNN)-hidden Markov model (HMM) system and the attention-based encoder-decoder (AED) model are both commonly used automatic speech recognition (ASR) approaches, each with distinct characteristics and advantages. Hybrid systems operate on a per-frame basis and are highly modularised so that they can leverage external phonetic and linguistic knowledge, whereas AED models operate on a per-label basis and jointly learn the acoustic and language information with a single model that is trainable end to end. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second pass uses AED models to rescore the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system. We also propose an improved algorithm for lattice rescoring with AED models. Experiments on two standard ASR tasks show that the combined two-pass systems achieve competitive performance without using extra speech or text data. For the 80-hour AMI IHM dataset, the combined system has a 13.7% word error rate (WER) on the evaluation set, which is up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the WERs of the combined system are 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, which is up to a 33% relative reduction in WER over the individual systems.
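The N-best variant of the two-pass idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Hypothesis` class, the `rescore_nbest` function, the linear interpolation with a single `aed_weight`, and the toy scoring table are all assumptions made for illustration. The abstract does not say how the scores are combined, and the improved lattice rescoring algorithm is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    words: List[str]
    # Log-domain score from the first-pass hybrid system (hypothetical field).
    first_pass_score: float


def rescore_nbest(
    nbest: List[Hypothesis],
    aed_log_prob: Callable[[List[str]], float],
    aed_weight: float = 0.5,
) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score.

    Assumption: scores are combined by linear interpolation in the log
    domain; the actual combination used in the paper may differ.
    """
    if not nbest:
        raise ValueError("nbest must contain at least one hypothesis")

    def combined(h: Hypothesis) -> float:
        return (1.0 - aed_weight) * h.first_pass_score + aed_weight * aed_log_prob(h.words)

    return max(nbest, key=combined)


# Toy stand-in for an AED model: a lookup table of made-up log-probabilities.
_TOY_SCORES = {"recognise speech": -2.0, "wreck a nice beach": -8.0}


def toy_aed_log_prob(words: List[str]) -> float:
    return _TOY_SCORES.get(" ".join(words), -20.0)


nbest = [
    Hypothesis("recognise speech".split(), first_pass_score=-10.0),
    Hypothesis("wreck a nice beach".split(), first_pass_score=-9.0),
]

first_pass_best = rescore_nbest(nbest, toy_aed_log_prob, aed_weight=0.0)
two_pass_best = rescore_nbest(nbest, toy_aed_log_prob, aed_weight=0.5)
```

In this toy setup the first pass alone prefers the second hypothesis, while the interpolated score prefers the first, which is the kind of reordering a second-pass rescorer is meant to produce. In practice the weight would be tuned on held-out data.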
Pages: 12-21
Number of pages: 10