Keyword Search Using Attention-Based End-to-End ASR and Frame-Synchronous Phoneme Alignments

Cited by: 8
Authors
Yang, Runyan [1 ,2 ]
Cheng, Gaofeng [1 ]
Miao, Haoran [1 ,2 ]
Li, Ta [1 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
Task analysis; Hidden Markov models; Transformers; Speech recognition; Reliability; Training; Decoding; End-to-end speech recognition; keyword search; phoneme alignment; keyword confidence scoring; SPEECH RECOGNITION; NEURAL-NETWORKS; TRANSFORMER; DROPOUT; MODEL;
DOI
10.1109/TASLP.2021.3120632
Chinese Library Classification (CLC): O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
Attention-based end-to-end (E2E) automatic speech recognition (ASR) architectures are now the state of the art in recognition performance. Despite their effectiveness, however, they have not yet been widely applied to keyword search (KWS) tasks. In this paper, we propose the Att-E2E-KWS architecture, an attention-based E2E ASR framework for KWS that delivers accurate and reliable keyword retrieval results. First, we design a basic framework for KWS built on attention-based E2E ASR. We adopt the joint connectionist temporal classification and attention (CTC/Att) E2E ASR architecture and exploit the spike posterior property of CTC to provide keyword time stamps. Second, we introduce frame-synchronous phoneme modeling and use a dynamic programming (DP) algorithm to align the E2E grapheme outputs with the phoneme outputs. We call this alignment procedure dynamic time alignment (DTA); it provides the proposed Att-E2E-KWS system with more accurate time stamps and more reliable confidence scores. Third, we use the Transformer, a self-attention-based encoder-decoder neural network, in place of conventional recurrent neural networks to yield more parallelizable models and faster training. We conduct comprehensive experiments on English and Mandarin Chinese. To the best of our knowledge, this is the first practical Att-E2E-KWS framework, and experimental results on the Switchboard and HKUST corpora show that our proposed Att-E2E-KWS systems significantly outperform CTC-based E2E ASR KWS baselines.
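The first two steps sketched in the abstract — greedy CTC decoding, whose spike posteriors mark the frames where tokens are emitted, and locating a keyword in the decoded token sequence — can be illustrated as follows. This is a hedged sketch only, not the paper's implementation: the blank symbol, the dict-based posterior format, and the choice of the first frame of a run as the "spike" are assumptions made for this example.

```python
# Illustrative sketch of keyword time-stamping from CTC spike posteriors.
# Not the paper's code; the blank symbol and frame layout are assumptions.

BLANK = "<blk>"

def ctc_spikes(posteriors):
    """Greedy CTC decoding over per-frame posterior dicts (token -> prob).
    Returns (token, frame) pairs, taking the first frame of each run as
    the token's spike (a simplification of true peak picking)."""
    spikes, prev = [], BLANK
    for t, frame in enumerate(posteriors):
        token = max(frame, key=frame.get)      # most probable token this frame
        if token != BLANK and token != prev:   # collapse repeats, drop blanks
            spikes.append((token, t))
        prev = token
    return spikes

def keyword_frames(posteriors, keyword):
    """Frame indices (start, end) of the first occurrence of `keyword`
    (an iterable of graphemes) in the spike sequence, or None."""
    spikes = ctc_spikes(posteriors)
    tokens = [tok for tok, _ in spikes]
    kw = list(keyword)
    for i in range(len(tokens) - len(kw) + 1):
        if tokens[i:i + len(kw)] == kw:
            return spikes[i][1], spikes[i + len(kw) - 1][1]
    return None
```

Multiplying the returned frame indices by the encoder's frame shift (e.g. 40 ms for a typical subsampled encoder) yields approximate start and end times for the keyword; the paper's dynamic time alignment (DTA) then refines these stamps and the confidence scores using the frame-synchronous phoneme outputs.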
Pages: 3202-3215 (14 pages)
Related Papers (50 total)
  • [21] Non-autoregressive Deliberation-Attention based End-to-End ASR
    Gao, Changfeng
    Cheng, Gaofeng
    Zhou, Jun
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [22] CHARACTER-AWARE ATTENTION-BASED END-TO-END SPEECH RECOGNITION
    Meng, Zhong
    Gaur, Yashesh
    Li, Jinyu
    Gong, Yifan
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 949 - 955
  • [23] An End-to-End Location and Regression Tracker with Attention-based Fused Features
    Zhang, Qinyi
    Du, Shishuai
    Yang, Huihua
    [J]. 2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019,
  • [24] AN ANALYSIS OF DECODING FOR ATTENTION-BASED END-TO-END MANDARIN SPEECH RECOGNITION
    Jiang, Dongwei
    Zou, Wei
    Zhao, Shuaijiang
    Yang, Guilin
    Li, Xiangang
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 384 - 388
  • [25] An End-to-End Attention-Based Neural Model for Complementary Clothing Matching
    Liu, Jinhuan
    Song, Xuemeng
    Nie, Liqiang
    Gan, Tian
    Ma, Jun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (04)
  • [26] Real-time emotion recognition using end-to-end attention-based fusion network
    Shit, Sahadeb
    Rana, Aiswarya
    Das, Dibyendu Kumar
    Ray, Dip Narayan
    [J]. JOURNAL OF ELECTRONIC IMAGING, 2023, 32 (01)
  • [27] STREAMING BILINGUAL END-TO-END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX
    Patil, Aditya
    Joshi, Vikas
    Agrawal, Purvi
    Mehta, Rupesh
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 252 - 259
  • [28] STREAMING ATTENTION-BASED MODELS WITH AUGMENTED MEMORY FOR END-TO-END SPEECH RECOGNITION
    Yeh, Ching-Feng
    Wang, Yongqiang
    Shi, Yangyang
    Wu, Chunyang
    Zhang, Frank
    Chan, Julian
    Seltzer, Michael L.
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 8 - 14
  • [29] Attention-Based End-to-End Differentiable Particle Filter for Audio Speaker Tracking
    Zhao, Jinzheng
    Xu, Yong
    Qian, Xinyuan
    Liu, Haohe
    Plumbley, Mark D.
    Wang, Wenwu
    [J]. IEEE OPEN JOURNAL OF SIGNAL PROCESSING, 2024, 5 : 449 - 458
  • [30] STREAM ATTENTION-BASED MULTI-ARRAY END-TO-END SPEECH RECOGNITION
    Wang, Xiaofei
    Li, Ruizhi
    Mallidi, Sri Harish
    Hori, Takaaki
    Watanabe, Shinji
    Hermansky, Hynek
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7105 - 7109