Lower Frame Rate Neural Network Acoustic Models

Cited by: 69
Authors
Pundak, Golan [1]
Sainath, Tara N. [1]
Affiliations
[1] Google Inc, New York, NY 10011 USA
Keywords
speech recognition; recurrent neural networks; connectionist temporal classification
DOI
10.21437/Interspeech.2016-275
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Recently, neural network acoustic models trained with Connectionist Temporal Classification (CTC) were proposed as an alternative to conventional cross-entropy trained neural network acoustic models, which output frame-level decisions every 10 ms [1]. Unlike conventional models, CTC learns an alignment jointly with the acoustic model and outputs a blank symbol in addition to the regular acoustic state units. This allows the CTC model to run at a lower frame rate, outputting decisions every 30 ms rather than every 10 ms as in conventional models, thus improving overall system speed. In this work, we explore how conventional models behave at lower frame rates. On a large-vocabulary Voice Search task, we show that with conventional models we can slow the frame rate to one decision every 40 ms while improving WER by 3% relative over a CTC-based model.
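As a rough illustration of what running at a lower frame rate means for the input side, the sketch below stacks consecutive 10 ms feature frames and keeps only every fourth stacked frame, so a model sees one input (and emits one decision) per 40 ms. The abstract does not say how the frame rate is lowered, so the stacking-and-subsampling scheme, the function name `lower_frame_rate`, and the `stack` and `stride` parameters are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def lower_frame_rate(features, stack=4, stride=4):
    """Stack `stack` consecutive frames, then keep every `stride`-th stacked frame.

    Hypothetical helper: with 10 ms input frames and stride=4, the output
    has one frame per 40 ms. This is an assumed scheme, not taken from the paper.
    """
    num_frames, dim = features.shape
    # Repeat the last frame so that every kept position has a full window.
    pad = np.repeat(features[-1:], stack - 1, axis=0)
    padded = np.concatenate([features, pad], axis=0)
    # Start indices of the windows that survive subsampling.
    starts = np.arange(0, num_frames, stride)
    # Each window of `stack` frames is flattened into one long feature vector.
    return np.stack([padded[s:s + stack].reshape(-1) for s in starts])

# Ten 10 ms frames of 3-dimensional toy features (100 ms of audio).
feats = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
low = lower_frame_rate(feats, stack=4, stride=4)
print(low.shape)
```

With these toy inputs, ten frames become three stacked frames of dimension 12, so the model would be evaluated roughly four times less often over the same audio.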
Pages: 22-26
Page count: 5