Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments

Cited by: 8
Authors
Yang, Runyan [1 ,2 ]
Cheng, Gaofeng [1 ]
Miao, Haoran [1 ,2 ]
Li, Ta [1 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
Task analysis; Hidden Markov models; Transformers; Speech recognition; Reliability; Training; Decoding; End-to-end speech recognition; keyword search; phoneme alignment; keyword confidence scoring; SPEECH RECOGNITION; NEURAL-NETWORKS; TRANSFORMER; DROPOUT; MODEL;
DOI
10.1109/TASLP.2021.3120632
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Attention-based end-to-end (E2E) automatic speech recognition (ASR) architectures are now the state of the art in recognition performance. Despite their effectiveness, however, they have not yet been widely applied to keyword search (KWS) tasks. In this paper, we propose the Att-E2E-KWS architecture, an attention-based E2E ASR framework for KWS that delivers accurate and reliable keyword retrieval results. First, we design a basic framework for KWS built on attention-based E2E ASR. We adopt the joint connectionist temporal classification and attention (CTC/Att) E2E ASR architecture and exploit the spike posterior property of CTC to provide keyword time stamps. Second, we introduce frame-synchronous phoneme modeling and use a dynamic programming (DP) algorithm to align the E2E grapheme outputs with the phoneme outputs. We call this alignment procedure dynamic time alignment (DTA); it provides the proposed Att-E2E-KWS system with more accurate time stamps and more reliable confidence scores. Third, we use the Transformer, a self-attention-based encoder-decoder neural network, in place of conventional recurrent neural networks to yield more parallelizable models and faster training. We conduct comprehensive experiments on English and Mandarin Chinese. To the best of our knowledge, this is the first practical Att-E2E-KWS framework, and experimental results on the Switchboard and HKUST corpora show that our proposed Att-E2E-KWS systems significantly outperform CTC-based E2E ASR KWS baselines.
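The first two steps sketched in the abstract — greedy CTC decoding, whose spike posteriors mark the frames where tokens are emitted, and locating a keyword in the decoded token sequence — can be illustrated as follows. This is a hedged sketch only, not the paper's implementation: the blank symbol, the dict-based posterior format, and the choice of the first frame of a run as the "spike" are assumptions made for this example.

```python
# Illustrative sketch of keyword time-stamping from CTC spike posteriors.
# Not the paper's code; the blank symbol and frame layout are assumptions.

BLANK = "<blk>"

def ctc_spikes(posteriors):
    """Greedy CTC decoding over per-frame posterior dicts (token -> prob).
    Returns (token, frame) pairs, taking the first frame of each run as
    the token's spike (a simplification of true peak picking)."""
    spikes, prev = [], BLANK
    for t, frame in enumerate(posteriors):
        token = max(frame, key=frame.get)      # most probable token this frame
        if token != BLANK and token != prev:   # collapse repeats, drop blanks
            spikes.append((token, t))
        prev = token
    return spikes

def keyword_frames(posteriors, keyword):
    """Frame indices (start, end) of the first occurrence of `keyword`
    (an iterable of graphemes) in the spike sequence, or None."""
    spikes = ctc_spikes(posteriors)
    tokens = [tok for tok, _ in spikes]
    kw = list(keyword)
    for i in range(len(tokens) - len(kw) + 1):
        if tokens[i:i + len(kw)] == kw:
            return spikes[i][1], spikes[i + len(kw) - 1][1]
    return None
```

Multiplying the returned frame indices by the encoder's frame shift (e.g. 40 ms for a typical subsampled encoder) yields approximate start and end times for the keyword; the paper's dynamic time alignment (DTA) then refines these stamps and the confidence scores using the frame-synchronous phoneme outputs.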
Pages: 3202-3215 (14 pages)
Related Papers (50 total)
  • [21] Non-autoregressive Deliberation-Attention based End-to-End ASR
    Gao, Changfeng
    Cheng, Gaofeng
    Zhou, Jun
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [22] CHARACTER-AWARE ATTENTION-BASED END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 949 - 955
  • [23] An End-to-End Location and Regression Tracker with Attention-based Fused Features
    Zhang, Qinyi
    Du, Shishuai
    Yang, Huihua
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [24] AN ANALYSIS OF DECODING FOR ATTENTION-BASED END-TO-END MANDARIN SPEECH RECOGNITION
    Jiang, Dongwei
    Zou, Wei
    Zhao, Shuaijiang
    Yang, Guilin
    Li, Xiangang
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 384 - 388
  • [25] An End-to-End Attention-Based Neural Model for Complementary Clothing Matching
    Liu, Jinhuan
    Song, Xuemeng
    Nie, Liqiang
    Gan, Tian
    Ma, Jun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (04)
  • [26] Real-time emotion recognition using end-to-end attention-based fusion network
    Shit, Sahadeb
    Rana, Aiswarya
    Das, Dibyendu Kumar
    Ray, Dip Narayan
    [J]. JOURNAL OF ELECTRONIC IMAGING, 2023, 32 (01)
  • [27] STREAMING BILINGUAL END-TO-END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX
    Patil, Aditya
    Joshi, Vikas
    Agrawal, Purvi
    Mehta, Rupesh
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 252 - 259
  • [28] STREAMING ATTENTION-BASED MODELS WITH AUGMENTED MEMORY FOR END-TO-END SPEECH RECOGNITION
    Yeh, Ching-Feng
    Wang, Yongqiang
    Shi, Yangyang
    Wu, Chunyang
    Zhang, Frank
    Chan, Julian
    Seltzer, Michael L.
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 8 - 14
  • [29] Attention-Based End-to-End Differentiable Particle Filter for Audio Speaker Tracking
    Zhao, Jinzheng
    Xu, Yong
    Qian, Xinyuan
    Liu, Haohe
    Plumbley, Mark D.
    Wang, Wenwu
    [J]. IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 449 - 458
  • [30] STREAM ATTENTION-BASED MULTI-ARRAY END-TO-END SPEECH RECOGNITION
    Wang, Xiaofei
    Li, Ruizhi
    Mallidi, Sri Harish
    Hori, Takaaki
    Watanabe, Shinji
    Hermansky, Hynek
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7105 - 7109