Deformable TDNN with adaptive receptive fields for speech recognition

Cited by: 1
Authors
An, Keyu [1]
Zhang, Yi [1]
Ou, Zhijian [1]
Affiliations
[1] Tsinghua University, Speech Processing and Machine Intelligence (SPMI) Lab, Beijing, People's Republic of China
Source
Keywords
speech recognition; deformable convolution; TDNN; adaptive receptive fields; neural architecture; END; ARCHITECTURE
DOI
10.21437/Interspeech.2021-387
Chinese Library Classification number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification code
100104; 100213
Abstract
Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based hybrid speech recognition systems and recent end-to-end systems. Nevertheless, the receptive fields of TDNNs are limited and fixed, which is not desirable for tasks like speech recognition, where the temporal dynamics of speech are varied and affected by many factors. In this paper, we propose to use deformable TDNNs for adaptive temporal dynamics modeling in end-to-end speech recognition. Inspired by deformable ConvNets, deformable TDNNs augment the temporal sampling locations with additional offsets and learn the offsets automatically based on the ASR criterion, without additional supervision. Experiments show that deformable TDNNs obtain state-of-the-art results on WSJ benchmarks (1.42%/3.45% WER on WSJ eval92/dev93, respectively), significantly outperforming standard TDNNs. Furthermore, we propose a latency control mechanism for deformable TDNNs, which enables them to perform streaming ASR without accuracy degradation.
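The idea of augmenting the temporal sampling locations with offsets can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `sample_linear` and `deformable_tdnn_1d`, the scalar per-frame features, the zero padding outside the signal, and the use of linear interpolation for fractional positions are all assumptions made for illustration. In the paper the offsets are learned from the ASR criterion; here they are simply passed in as an argument.

```python
import math
from typing import List, Sequence


def sample_linear(x: Sequence[float], pos: float) -> float:
    """Read x at a fractional frame index by linear interpolation.

    Assumption: frames outside the signal are treated as zero.
    """
    lo = math.floor(pos)
    frac = pos - lo

    def at(i: int) -> float:
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_tdnn_1d(
    x: Sequence[float],
    weights: Sequence[float],
    context: Sequence[int],
    offsets: Sequence[Sequence[float]],
) -> List[float]:
    """One illustrative deformable TDNN layer on scalar features.

    context holds the fixed frame offsets of a standard TDNN (e.g. -1, 0, 1).
    offsets[t][k] is the extra, possibly fractional, shift for tap k at
    output frame t (hypothetical input; learned in a real model).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, (w, c) in enumerate(zip(weights, context)):
            # Fixed TDNN position t + c, shifted by the per-frame offset.
            acc += w * sample_linear(x, t + c + offsets[t][k])
        out.append(acc)
    return out


x = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 1.0]
context = [-1, 0, 1]

# With all offsets at zero the layer reduces to a standard TDNN layer.
zero_offsets = [[0.0, 0.0, 0.0] for _ in x]
standard = deformable_tdnn_1d(x, weights, context, zero_offsets)

# Shift the right-hand tap of frame 1 by half a frame.
shifted_offsets = [row[:] for row in zero_offsets]
shifted_offsets[1][2] = 0.5
deformed = deformable_tdnn_1d(x, weights, context, shifted_offsets)
```

With zero offsets the sketch samples exactly the fixed context frames, so the receptive field is that of an ordinary TDNN layer; non-zero offsets move individual taps, which is what makes the receptive field adaptive per frame.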
Pages: 2067-2071
Page count: 5
Related papers
50 in total
  • [1] A Framework for Speech Activity Detection Using Adaptive Auditory Receptive Fields
    Carlin, Michael A.
    Elhilali, Mounya
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (12) : 2422 - 2433
  • [2] Detection of speech tokens in noise using adaptive spectrotemporal receptive fields
    Bellur, Ashwin
    Elhilali, Mounya
    [J]. 2015 49TH ANNUAL CONFERENCE ON INFORMATION SCIENCES AND SYSTEMS (CISS), 2015,
  • [3] Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition
    Kumawat, Pooja
    Routray, Aurobinda
    [J]. INTERSPEECH 2021, 2021, : 3410 - 3414
  • [4] Distributed TDNN-Fuzzy Vector Quantization For HMM Speech Recognition
    Debyeche, Mohamed
    Amrouche, Aderrahmane
    Haton, Jean Paul
    [J]. 2009 INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS (ICMCS 2009), 2009, : 72 - +
  • [5] Research on transfer learning for Khalkha Mongolian speech recognition based on TDNN
    Shi, Linyan
    Bao, Feilong
    Wang, Yonghe
    Gao, Guanglai
    [J]. 2018 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2018, : 85 - 89
  • [6] Experimenting with Hybrid TDNN/HMM Acoustic Models for Russian Speech Recognition
    Kipyatkova, Irina
    [J]. SPEECH AND COMPUTER, SPECOM 2017, 2017, 10458 : 362 - 369
  • [7] AGCNN: Adaptive Gabor Convolutional Neural Networks with Receptive Fields for Vein Biometric Recognition
    Zhang, Yakun
    Li, Weijun
    Zhang, Liping
    Ning, Xin
    Sun, Linjun
    Lu, Yaxuan
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (12):
  • [8] Spectral-Temporal Receptive Fields and MFCC Balanced Feature Extraction for Noisy Speech Recognition
    Wang, Jia-Ching
    Lin, Chang-Hong
    Chen, En-Ting
    Chang, Pao-Chi
    [J]. 2014 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2014,
  • [9] Resource-efficient TDNN Architectures for Audio-visual Speech Recognition
    Koumparoulis, Alexandros
    Potamianos, Gerasimos
    Thomas, Samuel
    Morais, Edmilson da Silva
    [J]. 29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 506 - 510
  • [10] A new parameter smoothing method in the hybrid TDNN/HMM architecture for speech recognition
    Jang, CS
    Un, CK
    [J]. SPEECH COMMUNICATION, 1996, 19 (04) : 317 - 324