End-to-End ASR with Adaptive Span Self-Attention

Cited by: 3

Authors:

Chang, Xuankai [1]

Subramanian, Aswin Shanmugam [1]

Guo, Pengcheng [1,2]

Watanabe, Shinji [1]

Fujita, Yuya [3]

Omachi, Motoi [3]

Affiliations:

[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA

[2] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China

[3] Yahoo Japan Corp, Tokyo, Japan

Source:

INTERSPEECH 2020 | 2020

Keywords:

Self-attention; adaptive; Transformer; end-to-end; speech recognition;

DOI:

10.21437/Interspeech.2020-2816

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Transformers have demonstrated state-of-the-art performance on many tasks in natural language processing and speech processing. One of the key components of the Transformer is self-attention, which attends to the whole input sequence at every layer. However, the computational and memory cost of self-attention grows quadratically with the input sequence length, which is a major concern in automatic speech recognition (ASR), where the input sequence can be very long. In this paper, we propose to use adaptive span self-attention, a technique originally proposed for language modeling, for ASR tasks. Our method enables the network to learn an appropriate window size and position for each layer and head, and our newly introduced scheme can further control the window size separately for the future and past contexts. It can thus reduce both the computational complexity and the memory usage from quadratic in the input length to linear in the input length, scaled by the adaptive span. We show the effectiveness of the proposed method on several ASR tasks, where the proposed adaptive span methods consistently outperformed the conventional fixed span methods.
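The windowing idea in the abstract can be illustrated with a small sketch. It uses the soft masking function from the original adaptive span work for language modeling, m_z(x) = min(max((R + z - x) / R, 0), 1), where x is the distance between query and key, z is the span, and R is the width of the linear ramp. Applying it with separate spans for past and future frames is only an assumption about how the abstract's "future and past contexts" scheme might look; the function names, the `ramp` default, and the fixed span values below are hypothetical, and in a trained model the spans would be learned per layer and head rather than set by hand.

```python
def soft_span_mask(distance, span, ramp):
    """Soft mask m_z(x): 1 inside the span, then a linear decay to 0 over `ramp` frames."""
    return min(max((ramp + span - distance) / ramp, 0.0), 1.0)


def two_sided_mask(seq_len, query_pos, past_span, future_span, ramp=2.0):
    """Mask weights over all key positions for one query frame.

    Assumption for illustration: past and future contexts get independent spans.
    """
    weights = []
    for key_pos in range(seq_len):
        offset = key_pos - query_pos
        if offset <= 0:
            # Key is the query frame itself or lies in the past context.
            weights.append(soft_span_mask(-offset, past_span, ramp))
        else:
            # Key lies in the future context.
            weights.append(soft_span_mask(offset, future_span, ramp))
    return weights


# Hand-picked spans; a real model would learn these values.
mask = two_sided_mask(seq_len=10, query_pos=5, past_span=2.0, future_span=1.0)
active_keys = sum(1 for w in mask if w > 0.0)
print(mask)
print(active_keys)
```

Keys whose weight is zero can be skipped entirely, so the per-query cost depends on the span plus the ramp rather than on the full sequence length, which is the source of the linear scaling the abstract describes.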

Pages: 3595-3599

Page count: 5
