DEVELOPING REAL-TIME STREAMING TRANSFORMER TRANSDUCER FOR SPEECH RECOGNITION ON LARGE-SCALE DATASET

Citations: 70
Authors
Chen, Xie [1 ]
Wu, Yu [2 ]
Wang, Zhenghao [1 ]
Liu, Shujie [2 ]
Li, Jinyu [1 ]
Affiliations
[1] Microsoft Speech & Language Grp, Hangzhou, Peoples R China
[2] Microsoft Res Asia, Hangzhou, Peoples R China
Keywords
Transformer; Transducer; Real-time decoding; Speech Recognition
DOI
10.1109/ICASSP39728.2021.9413535
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key obstacle to its application. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and a streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
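The abstract names chunk-wise streaming processing with a bounded history and a small look-ahead but does not give the masking scheme. As a rough illustration only, the following sketch builds one plausible chunk-wise self-attention mask. The function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical and not taken from the paper; the assumption is that a frame may attend to every frame in its own chunk (so look-ahead is at most `chunk_size - 1` frames) and to a fixed number of preceding chunks.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumed scheme (not confirmed by the source): frame i sees its whole
    chunk plus `left_chunks` chunks of history, and nothing beyond that.
    """
    if chunk_size <= 0 or left_chunks < 0:
        raise ValueError("chunk_size must be > 0 and left_chunks >= 0")
    mask = []
    for i in range(num_frames):
        chunk = i // chunk_size
        # First visible frame: start of the oldest allowed history chunk.
        start = max(0, (chunk - left_chunks) * chunk_size)
        # One past the last visible frame: end of the current chunk.
        end = min(num_frames, (chunk + 1) * chunk_size)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


if __name__ == "__main__":
    # Print the mask for 6 frames, chunks of 2 frames, 1 chunk of history.
    for row in chunk_attention_mask(6, 2, 1):
        print("".join("x" if visible else "." for visible in row))
```

Under these assumptions, a larger `chunk_size` gives each frame more future context at the cost of higher latency, while `left_chunks` bounds how much history must be kept per layer.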
Pages: 5904-5908
Page count: 5