Improving Transformer-based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation

Times cited: 21
Authors
Li, Sheng [1]
Dabre, Raj [1]
Lu, Xugang [1]
Shen, Peng [1]
Kawahara, Tatsuya [1,2]
Kawai, Hisashi [1]
Affiliations
[1] Natl Inst Informat & Commun Technol, Kyoto, Japan
[2] Kyoto Univ, Kyoto, Japan
Keywords
Speech recognition; acoustic model; end-to-end model; transformer
DOI
10.21437/Interspeech.2019-2112
CLC classification numbers
R36 [Pathology]; R76 [Otorhinolaryngology]
Subject classification codes
100104 ; 100213 ;
Abstract
The end-to-end (E2E) model allows automatic speech recognition (ASR) systems to be trained without the separate acoustic model, lexicon, language model, and complicated decoding algorithms that are integral to conventional ASR systems. Recently, the transformer-based E2E ASR model (ASR-Transformer) has shown promising results on many speech recognition tasks. The most common practice is to stack a number of feed-forward layers in the encoder and decoder; adding new layers significantly improves speech recognition performance, but it also greatly increases the number of parameters and causes severe decoding latency. In this paper, we propose to reduce the model complexity by simply reusing parameters across all stacked layers instead of introducing new parameters per layer. To address the slight reduction in recognition quality, we propose to augment the speech inputs with bags-of-attributes. As a result, we obtain a highly compressed, efficient, and high-quality ASR model.
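The core compression idea in the abstract, reusing one set of layer parameters at every depth instead of allocating new parameters per layer, can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: each transformer block is stood in for by a single projection matrix with a residual connection, and all names (`make_layer`, `apply_layer`, `n_layers`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8    # toy feature dimension
n_layers = 6   # depth of the stacked encoder

def make_layer(rng, d_model):
    # One toy "layer": a single projection matrix standing in for the
    # attention/feed-forward parameters of a transformer block.
    return rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

def apply_layer(W, x):
    # Residual connection around a nonlinear projection, as in a block.
    return x + np.tanh(x @ W)

x = rng.standard_normal((4, d_model))  # 4 frames of d_model-dim features

# Conventional stack: n_layers distinct parameter sets.
stacked = [make_layer(rng, d_model) for _ in range(n_layers)]
params_stacked = sum(W.size for W in stacked)

# Compressed model: one parameter set reused at every depth.
shared = make_layer(rng, d_model)
params_shared = shared.size

h = x
for _ in range(n_layers):
    h = apply_layer(shared, h)  # same weights applied at every layer

print(params_stacked, params_shared)  # 384 vs 64: 6x fewer parameters
```

The depth of the compressed model (and hence its compute) is unchanged; only the parameter count shrinks by a factor of `n_layers`, which is why the paper pairs this with input augmentation to recover the small accuracy loss.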
Pages: 4400-4404
Page count: 5
Related papers
50 records in total
  • [1] A transformer-based network for speech recognition
    Tang L.
    [J]. International Journal of Speech Technology, 2023, 26 (02) : 531 - 539
  • [2] Improving transformer-based speech recognition performance using data augmentation by local frame rate changes
    Lim, Seong Su
    Kang, Byung Ok
    Kwon, Oh-Wook
    [J]. JOURNAL OF THE ACOUSTICAL SOCIETY OF KOREA, 2022, 41 (02): : 122 - 129
  • [3] Transformer-Based Turkish Automatic Speech Recognition
    Tasar, Davut Emre
    Koruyan, Kutan
    Cilgin, Cihan
    [J]. ACTA INFOLOGICA, 2024, 8 (01): : 1 - 10
  • [4] Transformer-Based Multilingual Speech Emotion Recognition Using Data Augmentation and Feature Fusion
    Al-onazi, Badriyya B.
    Nauman, Muhammad Asif
    Jahangir, Rashid
    Malik, Muhmmad Mohsin
    Alkhammash, Eman H.
    Elshewey, Ahmed M.
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (18):
  • [5] TRANSFORMER-BASED ACOUSTIC MODELING FOR HYBRID SPEECH RECOGNITION
    Wang, Yongqiang
    Mohamed, Abdelrahman
    Le, Duc
    Liu, Chunxi
    Xiao, Alex
    Mahadeokar, Jay
    Huang, Hongzhao
    Tjandra, Andros
    Zhang, Xiaohui
    Zhang, Frank
    Fuegen, Christian
    Zweig, Geoffrey
    Seltzer, Michael L.
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6874 - 6878
  • [6] RM-Transformer: A Transformer-based Model for Mandarin Speech Recognition
    Lu, Xingyu
    Hu, Jianguo
    Li, Shenhao
    Ding, Yanyu
    [J]. 2022 IEEE 2ND INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND ARTIFICIAL INTELLIGENCE (CCAI 2022), 2022, : 194 - 198
  • [7] UNTIED POSITIONAL ENCODINGS FOR EFFICIENT TRANSFORMER-BASED SPEECH RECOGNITION
    Samarakoon, Lahiru
    Fung, Ivan
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 108 - 114
  • [8] TRANSFORMER-BASED DIRECT SPEECH-TO-SPEECH TRANSLATION WITH TRANSCODER
    Kano, Takatomo
    Sakti, Sakriani
    Nakamura, Satoshi
    [J]. 2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 958 - 965
  • [9] Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
    Oneata, Dan
    Cucu, Horia
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4578 - 4587
  • [10] Simulating reading mistakes for child speech Transformer-based phone recognition
    Gelin, Lucile
    Pellegrini, Thomas
    Pinquier, Julien
    Daniel, Morgane
    [J]. INTERSPEECH 2021, 2021, : 3860 - 3864